> AlphaZero
used the set of legal actions obtained from the simulator to mask the
policy network at interior nodes. MuZero does not perform any masking
within the search tree, but only masks legal actions at the root of the
search tree where the set of available actions is directly observed. The
policy network rapidly learns to exclude actions that are unavailable,
simply because they are never selected.
MuZero still masks illegal moves, but only at the root.
All of its parts are eventually trained on the search output at that root, and so they learn the legal moves.
They justify this root-level masking by the fact that the Atari environment only lets you perform legal moves, while a weak enough player may consider illegal moves while planning in their head.
The main thing that's slightly swept under the rug is that for "masking" to make sense in the first place, MuZero needs to know the set of all moves that may be legal at some point in the game.
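As a minimal sketch of what that root-level masking amounts to, assuming a fixed global action space indexed 0..NUM_ACTIONS-1 (the constant and function names here are illustrative, not MuZero's actual code):

```python
import numpy as np

NUM_ACTIONS = 4672  # e.g. AlphaZero's fixed action encoding for chess (8x8x73)

def masked_root_priors(policy_logits, legal_actions):
    """Keep only the legal actions at the root and renormalize the prior.

    legal_actions is observed directly from the environment, which is why
    this is only possible at the root: interior nodes of MuZero's search
    live in latent space, where no legal-move set is available.
    """
    legal = np.asarray(legal_actions)
    masked = np.full(NUM_ACTIONS, -np.inf)
    masked[legal] = policy_logits[legal]
    exp = np.exp(masked - masked[legal].max())  # exp(-inf) == 0 for illegal actions
    return exp / exp.sum()
```

Note that the mask only works because NUM_ACTIONS enumerates in advance every action that could ever be legal, which is exactly the point above.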
> while a weak enough player may consider illegal moves while planning in your head
This isn't just weak players. E.g. strong chess players often consider moves as if blocking pawns weren't there: they might judge a bishop to be on a strong diagonal despite a blocking pawn, because they can imagine the moves that would follow if that pawn disappeared.
No, it doesn't. MuZero does its planning entirely in its own latent space (it may not even think of the game in terms of 'moves', but in whatever steps it considers relevant), and only the output is filtered for legal moves.
It's no different from a monkey operating a chess computer that makes sure the monkey only performs legal moves. Your suggestion would be akin to claiming that the chess computer affects the monkey's mind so that it can only think in terms of legal chess moves.
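For concreteness, here is a rough sketch of that latent-space rollout. The three learned functions (representation h, dynamics g, prediction f) are from the MuZero paper; the wrapper function and its names are illustrative:

```python
def latent_rollout(observation, actions, h, g, f):
    """MuZero-style planning step: the simulator is never consulted.

    h: representation network, observation -> latent state
    g: dynamics network, (latent state, action) -> (latent state, reward)
    f: prediction network, latent state -> (policy prior, value)
    """
    s = h(observation)        # encode the real observation once, at the root
    for a in actions:         # then unroll entirely in latent space
        s, _reward = g(s, a)  # no legal-move set exists here to mask with
    return f(s)               # policy and value estimated from the latent state
```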
Seems you could equivalently treat rule breaking as a loss, and any algorithm sophisticated enough to learn how to win will also learn to avoid breaking the rules.
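One way to make that concrete is a hypothetical environment wrapper (the env API here, legal_actions(), observation(), step(), is assumed for illustration and not from any particular library):

```python
class IllegalMoveLosesWrapper:
    """Hypothetical wrapper: an illegal action immediately ends the game
    as a loss, instead of being masked out. A reward-maximizing agent
    should then learn to avoid illegal actions for the same reason it
    learns to avoid any other losing move.
    """
    def __init__(self, env, loss_reward=-1.0):
        self.env = env
        self.loss_reward = loss_reward

    def step(self, action):
        if action not in self.env.legal_actions():  # assumed env method
            return self.env.observation(), self.loss_reward, True  # done
        return self.env.step(action)
```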