Entity Resolution Challenges (sheshbabu.com)
10 points by rkwz on Aug 6, 2023 | hide | past | favorite | 6 comments


Back in the day we called this the schema matching problem. I wrote a heuristic classifier aided by a learning model for Epinions that tackled this in the space of product catalogs, where you had to determine that a vendor's "palm pilot with pen" was different from a "palm pilot pen." It's a very difficult problem space that could probably have been helped substantially by fine-tuned LLMs. Instead we took our low-confidence matches and shipped them to a company in India called Ugam that specialized in matching product catalogs. I was surprised that entire companies dedicated to the problem already existed.

https://en.m.wikipedia.org/wiki/Schema_matching


Curious, how would LLMs help here?


LLMs tend to be pretty good classifiers with excellent linguistic semantic differentiation. Prior state-of-the-art NLP tends to be relatively weak in comparison, especially in the 20+ years ago timeframe when I was working on the problem. Presenting a product catalog with a product title, product description, SKU, price, and other data and asking GPT-4 to classify each field according to a schema generally works remarkably well. Given a sample of different formats and ways to describe the same and similar items, plus a "canonical" reference item set, it is astoundingly capable at sorting them into the correct bins without any special work on my part.

By contrast, doing something similar back then required me to build a complete Lucene stemmer and indexer, a bunch of NLP machinery, a scoring system, and a lot of other plumbing. It took me over a year to build.

Today I would likely use GPT-4, or more likely a fine-tuned Llama 2 with a vector database of our canonical product catalog, figure out how to pull out entropy scores, and use those as a strong factor in a hierarchical heuristic classifier similar to my original one - but I would rip out 90% of it and just use the rest to verify the LLM didn't hallucinate. I estimate it would take me maybe two months now vs. fourteen or more with the prior tools.
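A rough sketch of the matching pipeline described above, with the embedding model swapped out for a toy bag-of-words stand-in so it runs self-contained (all product names and the threshold are made up for illustration): embed each vendor record, look up the nearest item in a "canonical" catalog, and route low-similarity matches to manual review rather than auto-merging - the same role the outsourced matching played.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Bag-of-words "embedding" - a stand-in for a real LLM embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical canonical reference catalog.
CANONICAL = ["palm pilot pda", "palm pilot stylus pen", "palm pilot leather case"]

def match(vendor_title: str, threshold: float = 0.5):
    scored = [(cosine(embed(vendor_title), embed(c)), c) for c in CANONICAL]
    score, best = max(scored)
    # Below-threshold matches are low confidence: return None so they can
    # be queued for human review instead of merged automatically.
    return best if score >= threshold else None

print(match("palm pilot with pen"))   # nearest canonical item
print(match("usb charging cable"))    # no confident match -> None
```

In a real system the verification layer the comment mentions would sit on top of this: the surviving 10% of the old heuristics checking that the LLM's chosen bin is at least plausible.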


Thanks for sharing the tip, I'll definitely try this out in the near future!

> figure out how to pull out entropy scores

Can you elaborate on this? "entropy scores" == how unique a value is?


Entropy scores would be, essentially, the probability the output string accrued through the inference process. It's a rough proxy for how confident the inference was by the LLM. It doesn't mean correctness by any means, but a worse entropy score means, in a way, more guessing on each token.


Thanks, TIL!



