What is the Reformer?
A Transformer where locality-sensitive hashing (LSH) is applied to the attention.
One type of LSH is SimHash: take n-grams of strings, then apply a 32-bit hash.
Vowpal Wabbit has the -n flag for n-grams.
Run vw -interact xxx -n2 -n3 and you get n-grams plus a 32-bit hash, with SGD over a vector.
This vector is equivalent to a 2-layer Reformer.
Non-linear activation is not needed because polynomials are already non-linear.
So vw + interact + n-grams (almost) = Reformer encoder. (If the Reformer uses SimHash, then they are identical.)
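To make the claim concrete, here is a minimal Python sketch of that pipeline, assuming a crc32 hash folded into a 2^20-slot weight table and logistic loss. It is an illustration of the idea, not Vowpal Wabbit's actual implementation; the table size, hash choice, and function names are my own assumptions.

    import math
    import random
    import zlib

    BITS = 20                      # assumed: 2^20-slot weight table
    D = 1 << BITS

    def ngram_slots(text, ns=(1, 2, 3)):
        """Hash word n-grams (n = 1, 2, 3) with a 32-bit hash, folded into D slots."""
        toks = text.lower().split()
        slots = []
        for n in ns:
            for i in range(len(toks) - n + 1):
                gram = " ".join(toks[i:i + n])
                slots.append(zlib.crc32(gram.encode()) % D)   # 32-bit hash -> slot
        return slots

    def predict(w, slots):
        """Linear model over the hashed feature vector, squashed to (0, 1)."""
        return 1.0 / (1.0 + math.exp(-sum(w[j] for j in slots)))

    def sgd_step(w, slots, label, lr=0.5):
        """One logistic-loss SGD update on the hashed weights."""
        g = predict(w, slots) - label
        for j in slots:
            w[j] -= lr * g

    # Tiny demo: separate two phrases.
    w = [0.0] * D
    data = [("the cat sat on the mat", 1), ("stock prices fell sharply", 0)]
    random.seed(0)
    for _ in range(20):
        for text, y in random.sample(data, len(data)):
            sgd_step(w, ngram_slots(text), y)
    for text, y in data:
        print(text, "->", round(predict(w, ngram_slots(text)), 3), "label", y)

The point of the sketch is that after hashing, everything is just SGD over one flat weight vector, which is the "vector" referred to above.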
The Transformer/Reformer does have one advantage: the encoder-decoder can learn from unlabeled data.
However, you can get similar results from unlabeled data with preprocessing: introduce noise into the data and then treat it as a noise/non-noise binary classification problem. (It can even be thought of as reinforcement learning, with the 0/1 labels as the reward, using vw's contextual bandits functionality. That lets you do what GANs do: climb from noise toward perfection.)
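A sketch of that noise/non-noise preprocessing, under illustrative assumptions (the corruption scheme of dropping and shuffling words is my own choice for the demo, not a prescribed recipe): real lines get label 1, corrupted copies get label 0, and any binary classifier, such as the hashed-linear one above, can then be trained on the result.

    import random

    def corrupt(text, drop_p=0.3, rng=random):
        """Make a noised copy of a line by dropping and shuffling words."""
        toks = text.split()
        kept = [t for t in toks if rng.random() > drop_p] or toks[:1]
        rng.shuffle(kept)
        return " ".join(kept)

    def make_noise_dataset(corpus, rng=random):
        """Turn unlabeled text into a binary noise/non-noise training set."""
        examples = []
        for line in corpus:
            examples.append((line, 1))                     # genuine text
            examples.append((corrupt(line, rng=rng), 0))   # corrupted text
        rng.shuffle(examples)
        return examples

    corpus = ["the quick brown fox jumps over the lazy dog",
              "a stitch in time saves nine"]
    random.seed(1)
    for text, label in make_noise_dataset(corpus):
        print(label, "|", text)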
> This vector is equivalent to a 2-layer Reformer.
There are no feed-forward layers, no skip connections, and no layer normalization in VW. In the Reformer, hashing is followed by dot products (attention within each hash bucket). In VW, hashing just collides some tokens into the same feature slot, followed by a linear layer.
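A toy sketch of that difference, with made-up dimensions (neither library's actual code): Reformer-style LSH hashes the token vectors with random hyperplanes and only computes dot products inside a bucket, while VW-style feature hashing maps token ids into a small weight table so colliding tokens simply share a weight.

    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, n_bits = 16, 8, 3
    x = rng.normal(size=(n, d))                 # token vectors

    # Reformer-style: random-hyperplane LSH, then dot products only within a bucket.
    planes = rng.normal(size=(d, n_bits))
    bits = (x @ planes > 0).astype(int)
    buckets = bits @ (1 << np.arange(n_bits))   # bucket id per token
    for b in np.unique(buckets):
        members = np.where(buckets == b)[0]
        scores = x[members] @ x[members].T      # attention-like dot products per bucket
        print("bucket", b, "tokens", members.tolist(), "score matrix", scores.shape)

    # VW-style: hash token ids into a small weight table; collisions share one weight.
    table_bits = 4
    weights = np.zeros(1 << table_bits)
    tokens = ["cat", "dog", "mat", "sat", "rat"]
    slots = [zlib.crc32(t.encode()) % (1 << table_bits) for t in tokens]
    print("token -> slot:", dict(zip(tokens, slots)))   # colliding tokens share weights[slot]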
Also, 2 layers of Transformer is a little shallow; in practice it's 12-14 layers or more.
For the two to be equivalent, VW would need to give equally good results on translation, but I've never seen it used for translation. I'm wondering why?
- You said that in the Transformer, hashing is followed by dot products.
- You do dot products at each layer to introduce non-linearity in the Transformer (and in neural nets in general). Polynomials are already non-linear, so you don't need that. The Transformer and vw -interact are both polynomials. Maybe the feed-forward layers and skip connections are not actually needed either.
- 12 layers? vw -interact xxxxxxxxxxxxx is 12 layers. You need a lot of memory for that, but in principle vw -interact can do any number of them.
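Rough arithmetic on that memory point (the vocabulary and table sizes below are made-up numbers, purely for scale): the count of distinct k-way interaction terms grows as vocab^k, which is why explicit storage is hopeless and everything has to be hashed into a fixed-size table.

    vocab_size = 50_000            # assumed vocabulary size
    for k in (2, 3, 13):
        print(f"{k}-way interactions: about {float(vocab_size ** k):.1e} possible terms")
    table = 2 ** 28                # an assumed fixed-size hashed weight table
    print(f"hashed weight table stays at {table:,} slots no matter how high k goes")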
These results are coming from Google and their massive compute resources. If they ran vw with -interact x^13, they might get similar results.
We're really talking about polynomial approximation here, for both the Transformer and vw used in this way. And polynomials can, in theory, approximate any continuous function on a compact domain (Weierstrass approximation), just like neural networks can (universal approximation).
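A quick numpy check of the approximation claim (degree 13 just echoes the x^13 remark above; the target function and interval are arbitrary choices): fit a degree-13 polynomial to sin(x) on [-pi, pi] by least squares and look at the worst-case error.

    import numpy as np

    x = np.linspace(-np.pi, np.pi, 400)
    y = np.sin(x)
    # Fit in the Chebyshev basis for numerical stability; it is still a degree-13 polynomial.
    coeffs = np.polynomial.chebyshev.chebfit(x, y, deg=13)
    approx = np.polynomial.chebyshev.chebval(x, coeffs)
    print("max abs error:", np.max(np.abs(approx - y)))   # should be very small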
http://matpalm.com/resemblance/simhash/
https://en.wikipedia.org/wiki/SimHash
SimHash, a type of locality-sensitive hashing: hash functions applied to n-grammed data.
That is exactly what Vowpal Wabbit does.
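For completeness, a compact generic SimHash sketch in the spirit of those links (not code from either page; the trigram size and crc32 choice are my assumptions): hash each n-gram to 32 bits, let each n-gram vote +1/-1 on every bit position, and keep the sign pattern as the fingerprint, so near-duplicate texts end up with fingerprints at small Hamming distance.

    import zlib

    def simhash32(text, n=3):
        """32-bit SimHash over character n-grams."""
        counts = [0] * 32
        s = text.lower()
        grams = [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]
        for g in grams:
            h = zlib.crc32(g.encode())                # 32-bit hash of the n-gram
            for bit in range(32):
                counts[bit] += 1 if (h >> bit) & 1 else -1
        fp = 0
        for bit in range(32):
            if counts[bit] > 0:                       # keep the sign of each bit column
                fp |= 1 << bit
        return fp

    def hamming(a, b):
        return bin(a ^ b).count("1")

    a = simhash32("the quick brown fox jumps over the lazy dog")
    b = simhash32("the quick brown fox jumped over the lazy dog")
    c = simhash32("completely different sentence about stock prices")
    # Near-duplicates typically differ in far fewer bits than unrelated texts.
    print("near-duplicate distance:", hamming(a, b), " unrelated distance:", hamming(a, c))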