It's a combination of two things: the source video takes place in a club with flashes and strobe effects going on, and the algorithm looks at the video frame by frame (more or less). When the strobe flashes, the algorithm sees more and different details than it does in a dark frame, and it interprets those details as a dog.
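
To make the frame-by-frame point concrete, here's a rough toy sketch. It is not the actual algorithm used on the video; `visible_detail`, `process_video`, the thresholds, and the pixel values are all made up for illustration. The only idea it shows is that each frame gets interpreted on its own, so a lit frame and a dark frame of the same scene can come out differently.

```python
# Toy illustration only: this is NOT the actual algorithm from the video.
# Each "frame" is a flat list of pixel brightness values in [0, 255].

def visible_detail(frame, threshold=40):
    """Count pixels bright enough to carry detail (hypothetical measure)."""
    return sum(1 for px in frame if px > threshold)

def process_video(frames, detail_needed=3):
    """Label each frame independently, with no memory of earlier frames."""
    labels = []
    for frame in frames:
        if visible_detail(frame) >= detail_needed:
            labels.append("dog")      # enough detail to latch onto a pattern
        else:
            labels.append("nothing")  # dark frame: too little to interpret
    return labels

dark = [5, 10, 8, 12, 6, 9]          # club between flashes
flash = [5, 180, 220, 12, 200, 90]   # same scene lit by a strobe

labels = process_video([dark, flash, dark, flash])
print(labels)  # ['nothing', 'dog', 'nothing', 'dog']
```

Because nothing carries over from one frame to the next, the result flickers along with the strobe.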