
I’d love to build a suite of local tooling to play around with different embedding approaches.

I’ve had great results using SentenceTransformers for quick one-off tasks at work for unique data asks.

I’m curious about clustering within the embeddings and seeing what different approaches can yield and what applications they work best for.



If I have 50,000 historical articles and 5,000 new articles, I apply SBERT and then k-means with N=20, and I get great results: articles about Ukraine, sports, chemistry, and nerdcore from Lobsters end up in distinct clusters.
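
For anyone who wants to try this, here is a rough sketch of that pipeline, not the exact code I run. The model name `all-MiniLM-L6-v2`, the helper names `embed` and `cluster_articles`, and the choice to fit on the historical set and then assign the new articles are all assumptions made for illustration; fitting on the combined set would be just as reasonable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts with SentenceTransformers (model name is an assumption)."""
    # Imported lazily so the clustering code below works on any vectors.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, convert_to_numpy=True)


def cluster_articles(historical_vecs, new_vecs, n_clusters=20, seed=0):
    """Fit k-means on historical embeddings, then assign new ones."""
    # L2-normalize so Euclidean k-means behaves like cosine similarity.
    hist = normalize(np.asarray(historical_vecs, dtype=np.float64))
    new = normalize(np.asarray(new_vecs, dtype=np.float64))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(hist)
    # labels_ are the historical assignments; predict() maps each new
    # article to its nearest existing centroid.
    return km.labels_, km.predict(new)
```

Usage would look like `hist_labels, new_labels = cluster_articles(embed(old_texts), embed(new_texts))`, where `old_texts` and `new_texts` are plain lists of article strings.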

I’ve used DBSCAN for finding duplicate content, but this is less successful. With the parameters I am using it is rare for there to be a false positive, but there aren’t that many true positives. I’m sure I could do better if I tuned it, but I’m not sure there is an operating point I’d really like.
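
The duplicate search can be sketched roughly like this. The cosine metric, `eps=0.05`, `min_samples=2`, and the helper name `duplicate_groups` are illustrative assumptions, not the parameters I actually use.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def duplicate_groups(vecs, eps=0.05, min_samples=2):
    """Group embeddings that DBSCAN links as near-duplicates."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(
        np.asarray(vecs, dtype=np.float64)
    )
    groups = {}
    for idx, label in enumerate(labels):
        # Label -1 is DBSCAN noise, i.e. an article with no near-duplicate.
        if label != -1:
            groups.setdefault(int(label), []).append(idx)
    return list(groups.values())
```

The knob that matters is `eps`, the maximum cosine distance at which two articles are linked. A small value gives few false positives but misses rewritten duplicates; a larger one catches more but starts chaining merely related articles into one group, which is the operating-point problem.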



