Policy-Guided Heuristic Search

Policy-Guided Heuristic Search (PHS) is a best-first search policy tree search method that augments the Levin tree search cost function with an explicit heuristic factor.

Definition

The cumulative expansion cost along the path from the root to , including the expansion of itself, is given by

where is the expansion cost of node and are all ancestors of (including itself).

Definition

Policy-Guided Heuristic Search is obtained from best-first search by using

where plays the same role as in the Levin cost with generalizing to non-unit expansion losses, and uses heuristic information to rescale this quantity toward the search effort required to reach a descendant solution.

Analysis

While the LTS cost function is monotone, need not be, due to the heuristic factor . To help provide a bound on the search, we need to define a few things.

Define the monotone non-decreasing envelope

To write this in the same multiplicative form, define

with when , such that

We also define as the set of nodes of value at most , and as the set of leaf nodes in .

Theorem

The number of node expansions PHS requires before reaching a solution is bounded by

Practical Heuristics

In policy tree search, search effort depends not only on how much path cost remains, but also how much probability mass must be preserved along the remaining trajectory. The LTS cost function captures the search effort required to reach the given node, but does not account for the remaining effort required to reach a descending goal node.

Many problems in single-agent deterministic search already have a readily available heuristic function. For example, if is a heuristic for A*, then the heuristic factor can be defined as

which results in the evaluation function

The heuristic factor extends the LTS cost to by incorporating an estimate of the remaining path cost. However, this construction inherits only the additive part of the future solution cost. If is least-cost descendant solution of , and was an exact heuristic in the usual A* sense, then , and therefore . Thus, even with an exact heuristic for the remaining cost, still misses the ratio

which accounts for the additional decay in probability along the suffix from to .

Accurately Predicting the Solution Cost (PHS*)

The motivation for is precisely to compensate for this missing probability term by extrapolating the future policy decay from the prefix observed so far. The difficulty is to estimate . One can view

as an average conditional probability per unit loss along the prefix from the root to . If this average is assumed to remain roughly stable along the remaining suffix, then the probability of a descendant solution with total loss can be approximated by

Definition

PHS* is obtained from best-first search by using

where the corresponding heuristic factor is

This produces a more aggressive PHS* evaluation function.

Implementation

Numerical Stability

The path probabilities may decrease exponentially with depth, which can lead to numerical underflow. Since applying a monotone transformation preserves the expansion order induced by the evaluation function, it is often preferable to compute PHS* costs in log-space:

Moreover, if is a child of , then the log-path probability can be updated incrementally:

Avoiding Zero Probabilities

To avoid zero probabilities being assigned to various paths, one can mix the policy with a uniform policy:

Code

template <typename Env>
struct LTSNode {    
    auto compute_log_policy(const Node ) -> std::vector<double>;
    
    auto generate_child(int action) -> BFSNode {
        auto child = *this;
        child.parent = this;
        auto step_cost = child.state.apply_action(action);
        child.g += step_cost;
        child.log_p += child.parent.log_policy[action];
        child.log_policy = compute_log_policy(child);
        return child;
    }
 
    Env state;
    BFSNode *parent = nullptr;
    double g = 0;
    double log_p = 0;
    std::vector<double> log_policy;
    double cost = 0;
    int generated_step = 0;
    int expanded_step = 0;
};
 
template <typename Env>
auto phs_eval_func(const AStarNode<Env> &node) -> double {
    h = (h < 0) ? 0 : h;
    return g == 0 ? 0 : std::log(h + g + EPS) - (log_p * (1.0 + (h / g)));
}
 
template <typename Env>
auto lts(Env initial_state) {
    LTSNode root_node = { .state=initial_state }
    return best_first_search(root_node, lts_eval_func);
}

Sources