
The point is this

> AlphaZero used the set of legal actions obtained from the simulator to mask the policy network at interior nodes. MuZero does not perform any masking within the search tree, but only masks legal actions at the root of the search tree where the set of available actions is directly observed. The policy network rapidly learns to exclude actions that are unavailable, simply because they are never selected.

MuZero still masks legal moves, but only at the root. All its parts are eventually trained on the output of its root, and so learn the legal moves.
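To make the root-only masking concrete, here is a minimal sketch. It is not DeepMind's code; `masked_softmax` and `node_prior` are hypothetical names, and the policy output is assumed to be a plain list of logits with one entry per action in the fixed action set.

```python
import math

def masked_softmax(logits, legal_actions):
    """Softmax restricted to legal_actions; every other action gets probability 0."""
    legal = set(legal_actions)
    if not legal:
        raise ValueError("need at least one legal action")
    # Subtract the largest legal logit for numerical stability.
    m = max(logits[a] for a in legal)
    exps = [math.exp(l - m) if a in legal else 0.0 for a, l in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]

def node_prior(logits, is_root, legal_actions=None):
    """Policy prior for a search node (hypothetical helper).

    At the root the legal actions are observed from the environment, so the
    prior is masked. At interior nodes nothing is masked: the prior is a
    softmax over the whole action set.
    """
    if is_root:
        return masked_softmax(logits, legal_actions)
    return masked_softmax(logits, range(len(logits)))
```

With logits `[1.0, 2.0, 3.0]` and legal actions `[0, 2]`, the root prior puts zero probability on action 1, while an interior node with the same logits still gives every action nonzero probability.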

They justify this root-level masking by noting that Atari will only allow you to perform legal moves, while a weak enough player may consider illegal moves while planning in your head.

The main thing that's slightly "swept under the rug" is that for "masking" to make sense in the first place, MuZero needs to know the set of all moves that may be legal at some point in the game.



Oh okay. So it's a technique that essentially allows the tree search to be less "pedantic" about the rules in future states. Very interesting.

I would love to see how this might go for more complicated games such as NES adventure games and RPGs.


> while a weak enough player may consider illegal moves while planning in your head

This isn't just weak players. E.g. strong chess players often consider moves as if blocking pawns weren't there. They might consider a bishop to be on a strong diagonal despite there being a blocking pawn, because they can imagine the moves that would become possible if that pawn disappeared.


I suppose you are right. But MuZero won't be able to do this, since its training forces it to consider legal moves in its planning.


No, it doesn't. MuZero does its planning entirely in its own latent space (it may not even think of the game in terms of 'moves', but in whatever steps it considers relevant instead); only the output is filtered for legal moves.

It's no different than a monkey operating a chess computer that makes sure the monkey only performs legal moves. Your suggestion would be akin to suggesting that the chess computer would be affecting the monkey's mind so that it can only think in terms of legal chess moves.


Seems you could equivalently treat rule breaking as a loss, and any algorithm sophisticated enough to learn how to win will also learn to avoid breaking the rules.
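A minimal sketch of that idea, assuming an environment wrapper that ends the episode with a losing reward whenever an illegal action is played. Everything here is hypothetical, including `step_with_illegal_as_loss`, the toy counting game, and the choice of -1.0 as the loss reward.

```python
def step_with_illegal_as_loss(state, action, legal_actions, apply_move, loss_reward=-1.0):
    """Advance the game, but end it with a loss if the action is illegal.

    legal_actions(state) -> iterable of legal actions (assumed interface)
    apply_move(state, action) -> (next_state, reward, done) (assumed interface)
    """
    if action not in legal_actions(state):
        # Rule breaking is treated as an immediate loss; the state is unchanged.
        return state, loss_reward, True
    return apply_move(state, action)

# Toy game: count from 0 up to exactly 3 in steps of 1 or 2.
# Overshooting 3 is illegal.
def toy_legal_actions(state):
    return [a for a in (1, 2) if state + a <= 3]

def toy_apply_move(state, action):
    next_state = state + action
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done
```

A learner trained against this wrapper sees illegal moves as just another way to lose, so no explicit mask is needed, at the cost of spending training episodes discovering which moves are illegal.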



