It's easy to trick just by holding your hand anywhere in front of your face, even a foot away. Is this something current image-classifier architectures could address with more data? It seems like it'd be hard to tell whether a hand is large and on a face, or small and far away from the face, unless some sort of depth estimation is going on (based on cues like shadows).
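The ambiguity here follows from basic pinhole-camera geometry: apparent size in the image scales with real size divided by distance, so two different (size, distance) pairs can project to exactly the same pixels. A minimal sketch (my own illustration with made-up numbers, not from any particular model):

```python
def projected_size_px(real_size_m: float, distance_m: float,
                      focal_px: float = 1000.0) -> float:
    """Apparent size in pixels of an object under a simple pinhole model."""
    return focal_px * real_size_m / distance_m

# A 0.18 m hand resting on the face, 0.30 m from the camera...
on_face = projected_size_px(0.18, 0.30)

# ...vs. a hypothetical hand twice as large held twice as far away:
# it projects to the exact same pixel size.
far_large = projected_size_px(0.36, 0.60)

print(on_face, far_large)  # → 600.0 600.0
```

So from a single 2D image alone the two cases are indistinguishable; a classifier would have to lean on secondary cues (shadows, occlusion boundaries, relative blur, known object sizes) to resolve the depth.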