
This is explained in Appendix A of the paper ("Comparison to AlphaZero"): https://arxiv.org/pdf/1911.08265.pdf

Basically, AlphaZero was provided with a simulator that was able to distinguish legal and illegal moves and determine which future game states would be wins or losses. This was used to generate the search tree of possible states and actions.

MuZero doesn't have access to a simulator; it only has access to the environment it is actually playing in. It excludes actions that are immediately illegal in the current position, which solves the problem you mention in your penultimate paragraph, but it has to learn the game's dynamics in order to determine which future moves and states are possible.
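
To make the difference concrete, here is a rough Python sketch. Everything in it is made up for illustration: the toy counting game, `ToySimulator`, `ToyLearnedModel`, and both `expand_*` helpers are hypothetical names, not anything from the paper's code. In particular, the "learned" model here is hand-written, whereas MuZero's is a trained network operating on hidden states.

```python
from typing import Dict, List, Optional, Tuple

ALL_ACTIONS = [1, 2, 3]  # toy action space: add 1, 2 or 3 to a counter
TARGET = 4               # toy rule: landing exactly on TARGET wins; overshooting is illegal


class ToySimulator:
    """Stands in for the perfect rules AlphaZero is given (hypothetical toy game)."""

    def legal_actions(self, state: int) -> List[int]:
        return [a for a in ALL_ACTIONS if state + a <= TARGET]

    def step(self, state: int, action: int) -> int:
        return state + action

    def is_win(self, state: int) -> bool:
        return state == TARGET


class ToyLearnedModel:
    """Stands in for learned dynamics. Hand-written here; a trained network in MuZero."""

    def dynamics(self, hidden: float, action: int) -> Tuple[float, float]:
        next_hidden = hidden + action
        predicted_reward = 1.0 if next_hidden == TARGET else 0.0
        return next_hidden, predicted_reward


def expand_with_simulator(sim: ToySimulator, state: int) -> Dict[int, int]:
    # AlphaZero-style: the simulator says which moves are legal at every node.
    return {a: sim.step(state, a) for a in sim.legal_actions(state)}


def expand_with_model(
    model: ToyLearnedModel,
    hidden: float,
    legal_mask: Optional[List[int]] = None,
) -> Dict[int, Tuple[float, float]]:
    # MuZero-style: a legal-move mask is only available at the root, where the
    # real environment can be queried. Deeper nodes pass legal_mask=None and
    # consider every action, relying on the model's predictions.
    actions = legal_mask if legal_mask is not None else ALL_ACTIONS
    return {a: model.dynamics(hidden, a) for a in actions}


sim = ToySimulator()
model = ToyLearnedModel()

root_children = expand_with_model(model, 2.0, legal_mask=sim.legal_actions(2))
deeper_children = expand_with_model(model, 3.0)  # no mask inside the tree
```

At the root only the legal moves are expanded, but one step deeper all three actions are expanded, including ones the real game would reject. The search has to rely on what the model has learned to steer away from them.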


