Yeah, for "Semantic Hashes" (that's a good name for them!) we'd need some sort of "Canonical LLM" model that isn't necessarily used for inference, nor does it even need to be all that smart; it just needs to be public for the world. It would need to be updated every 2 to 5 years, though, to account for new words or words changing meaning? ...but maybe it could be updated in such a way as to not "invalidate" prior vectors, if that makes sense? For example, "ride a bicycle" would still point in the same direction even after a refresh of the canonical model. It seems like feeding in the same training set should reproduce the same model weights, but training has nonlinear instabilities (and nondeterminism) that could make the result diverge from the original.
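To make that drift problem concrete, here's a quick sketch using two off-the-shelf sentence-transformers models as stand-ins for two independently trained generations of the "canonical" model (the model names are just convenient picks that happen to share an output dimension; this is illustration, not a real versioning scheme):

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: two public models standing in for "canonical model v1"
# and an independently retrained "v2" (both output 384-dim vectors).
m1 = SentenceTransformer("all-MiniLM-L6-v2")
m2 = SentenceTransformer("all-MiniLM-L12-v2")

phrase = "ride a bicycle"
v1 = m1.encode(phrase)
v2 = m2.encode(phrase)

# Within a single model, related phrases land near each other...
print(util.cos_sim(v1, m1.encode("cycling to work")))  # relatively high

# ...but across independently trained models the axes generally don't
# line up, so the "same" phrase can point in an arbitrary direction.
print(util.cos_sim(v1, v2))  # typically much lower / not meaningful
```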
Maybe each embedding could be paired with a set of anchor words that embed close to the original vector? Then the embedding could be migrated to a new model by re-embedding those words. (It would also be more interpretable to a human.)
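A rough sketch of what that migration could look like, assuming the same two stand-in models as above and a toy word list in place of a real lexicon (all names here are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: arbitrary stand-ins for two generations of the canonical model.
old_model = SentenceTransformer("all-MiniLM-L6-v2")
new_model = SentenceTransformer("all-MiniLM-L12-v2")

# Toy vocabulary; a real system would use a full lexicon.
vocab = ["bicycle", "ride", "pedal", "cycling", "wheel",
         "car", "banana", "music", "river", "laptop"]

def nearest_anchors(vec, model, vocab, k=3):
    """Words whose embeddings (under `model`) sit closest to `vec`."""
    word_vecs = model.encode(vocab, normalize_embeddings=True)
    sims = word_vecs @ (vec / np.linalg.norm(vec))
    return [vocab[i] for i in np.argsort(-sims)[:k]]

def migrate(anchors, model):
    """Rebuild the embedding in a new model's space by re-embedding
    the anchor words and taking the renormalized mean."""
    mean = model.encode(anchors, normalize_embeddings=True).mean(axis=0)
    return mean / np.linalg.norm(mean)

orig = old_model.encode("ride a bicycle", normalize_embeddings=True)
anchors = nearest_anchors(orig, old_model, vocab)  # e.g. ["bicycle", "ride", ...]
migrated = migrate(anchors, new_model)             # usable in the new space
```

The anchor words would double as the human-readable part: you could print them right next to the hash.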
I mean, it was just a thought I had. Maybe a "solution in search of a problem". I generate those a lot! haha. But it seems to me that with a canonical training set and a canonical LLM architecture we could generate consistent embeddings; I'm just not sure what the use cases are.