
The scale of word embeddings (e.g., the distance from the origin) mainly measures how common the word is in the training corpus. This is a feature of almost all training objectives since word2vec (though some normalize the vectors).

Uncommon words carry more information content than common words, so common words having a larger embedding scale is a problem here.

If you want to measure similarity, you need a scale-free measure. Cosine similarity (the angle between vectors) gives you this without requiring normalization.
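To make the scale-free point concrete, here's a minimal sketch (names are my own, not from the thread): a vector and a scaled copy of it have cosine similarity 1, regardless of how different their magnitudes are.

```python
import numpy as np

def cosine_similarity(u, v):
    # Scale-free: dividing by both norms removes magnitude
    # (and hence any word-frequency signal) from the comparison.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = 10.0 * u  # same direction, much larger scale (a "more frequent word")
sim = cosine_similarity(u, v)  # ~1.0 up to floating-point error
```

Euclidean distance between `u` and `v`, by contrast, is large here, which is exactly why it conflates frequency with meaning in unnormalized embedding spaces.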

If you normalize your vectors, cosine similarity is the same as Euclidean distance. Normalizing your vectors also destroys information, which we'd rather avoid.

To my understanding, there's no real hard theory for why the angle between embeddings is meaningful, beyond this practical knowledge.



> If you normalize your vectors, cosine similarity is the same as Euclidean distance.

If you normalize your vectors, cosine similarity is the same as dot product. Euclidean distance is still different.
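A quick numeric check of the correction (a sketch with my own example vectors): on unit vectors, cosine similarity equals the dot product, but the Euclidean distance is a different number.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x)

u = normalize(np.array([1.0, 0.0]))
v = normalize(np.array([1.0, 1.0]))

cos = float(np.dot(u, v))            # cosine similarity == dot product on unit vectors
dist = float(np.linalg.norm(u - v))  # Euclidean distance: not the same value
```

Here `cos` is about 0.707 while `dist` is about 0.765, so the two measures agree in ordering (see the identity below in the thread) but are not identical.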


Oh, thanks for the correction.

If all the vectors lie on the unit sphere, then cosine = dot product. But then the dot product is one linear transformation away from the Euclidean distance:

https://math.stackexchange.com/questions/1236465/euclidean-d...

If you're using it in a machine learning model, things that are one linear transform away are more or less the same (the model might just need more parameters/layers/etc.).

If you're using it for classical statistics (analytics), right, they're not equivalent, and it's good to keep that distinction in mind.


To be very explicit: if ||x|| = ||y|| = 1, we have ||x - y||^2 = ||x||^2 - 2(x . y) + ||y||^2 = 2 - 2(x . y) = 2 - 2cos(th). So they are not identical, but minimizing the Euclidean distance between two unit vectors is the same as maximizing the cosine similarity.
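The identity above can be verified numerically on random unit vectors (a sketch; the random seed and dimension are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=5)
u /= np.linalg.norm(u)
v = rng.normal(size=5)
v /= np.linalg.norm(v)

lhs = np.linalg.norm(u - v) ** 2   # squared Euclidean distance
rhs = 2 - 2 * np.dot(u, v)         # 2 - 2*cos(theta): dot product = cos on unit vectors
# The two sides agree up to floating-point error.
```

Since `rhs` is a decreasing function of the cosine, minimizing the distance and maximizing the cosine similarity pick out the same vector.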



