Thank you! We tried to keep things as simple as possible on the policy side, but there is definitely a lot more room for innovation to get from 81% to 99%, for example by using temporal information and the global structure of the task.
Another limitation here is that the model is trained by regression, so if, for example, the task were to pick up one bottle out of two and the demos were split 50/50 between the two bottles, the model would output the mean of the two actions even though that mean is not a meaningful action.
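To make the averaging problem concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration and not the Dobb-E code: the 1-D bottle positions, the 50/50 demo split, and the categorical alternative used for comparison.

```python
import random
from collections import Counter

# Hypothetical 1-D setup: bottle A sits at x = -1.0, bottle B at x = +1.0,
# and for the same observation the demos split 50/50 between the two.
demos = [-1.0] * 50 + [1.0] * 50


def mse(pred):
    """Mean squared error of a single constant prediction against all demos."""
    return sum((d - pred) ** 2 for d in demos) / len(demos)


# For a fixed observation, the prediction that minimizes squared error
# is the sample mean of the demonstrated actions.
regressed = sum(demos) / len(demos)

# Illustrative alternative (an assumption, not the authors' method):
# treat the action as a categorical variable and sample from the
# empirical distribution instead of averaging.
counts = Counter(demos)
actions = sorted(counts)
probs = [counts[a] / len(demos) for a in actions]
rng = random.Random(0)
sampled = rng.choices(actions, weights=probs, k=1000)

print("regressed action:", regressed)
print("MSE at the mean:", mse(regressed), "MSE at bottle A:", mse(-1.0))
print("distinct sampled actions:", sorted(set(sampled)))
```

The mean has the lowest squared error of any constant prediction, yet it points at the empty space between the bottles, while sampling from a distribution over actions always commits to one of the two.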
Indeed! In fact, I have a project [0] from last year that uses a GPT-style transformer to address that exact issue :) However, it’s hard to get far beyond simulation in real home robotics without a good platform, and Dobb-E came out of the effort to build one.
I've also seen the one that uses a diffusion process for planning. I imagine it's even slower, but maybe something can be done about that with a consistency loss.