What to expect from AlphaZero's value predictions [D]
An AlphaZero agent learns to predict the value of a game state by training on data generated by self-play of the model and a series of its predecessor models. By construction, this value should reflect the probability of winning against a copy of itself starting from the given state. More precisely, the value measures the state's average strength against opponents drawn from among all the predecessors of the current model. This average depends on how the training data is sampled from the pool of self-play data (using a rolling window over self-play by the latest x models, putting more emphasis on recent models via geometric weighting, etc.).
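To make the sampling schemes concrete, here is a minimal sketch of per-generation sampling weights for the self-play pool. The `window` and `decay` parameters are purely illustrative assumptions, not values from any actual AlphaZero implementation:

```python
import numpy as np

def sampling_weights(num_generations, window=20, decay=0.9):
    """Hypothetical weights for sampling self-play games by model generation:
    keep only the latest `window` generations, then weight them geometrically
    so that more recent models contribute more training data."""
    gens = np.arange(num_generations)
    in_window = gens >= num_generations - window
    # decay^0 for the newest generation, decay^1 for the one before, etc.
    raw = np.where(in_window, decay ** (num_generations - 1 - gens), 0.0)
    return raw / raw.sum()
```

Under such a scheme, the value head's implicit "average opponent" is dominated by the most recent generations, which matters for the argument below.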
In each round of self-play, we can think of the agents (one copy per player) as making moves according to a strategy, albeit a stochastic one (unless the temperature parameter is zero), defined by the PUCT function applied to the predicted values and policies, but slightly perturbed by mixing in some proportion of Dirichlet noise. The purpose of this perturbation is to give the model an opportunity to find successful actions by chance rather than getting trapped in some rigid, possibly narrow, pattern of play.
Because of the role of noise in deciding which move to make, the formulation above, that the value reflects the chances of winning against the model itself, is an over-simplification. The data on which the value prediction is based does include "outlier" moves, and, as far as I've understood, this is a heuristic argument for the claim that the model makes its predictions based on experience of playing against a variety of different players.
However, because the moves that differ most from the "predicted" ones are outliers, such moves also have a correspondingly small impact on the value predictions: it is the agent's own playing style, and the historical development of that style, that governs the value predictions.
So, if the agent meets a strong opponent, either a human being or an algorithm with a strong track record, why should AlphaZero's value prediction be a reliable measure of the agent's chances of winning against this opponent from the given position?
Experience has shown AlphaZero to indeed outperform both human players and other algorithms in a variety of games. I wonder whether this success was also to be expected a priori, or whether it is conceivable that AlphaZero could fail miserably in some game against a specific algorithm whose moves, though present in AlphaZero's training data pool, occur so infrequently that they make no significant impact on the predictions.