
Yes :(

Do you have an idea of how to remove company websites in an automated way? I didn't want to manually review all 7k indexed websites.

This is the GPT prompt I used for filtering domains to add, but it gives false positives:

  You are an API. Return a JSON array of booleans indicating whether each provided domain is someone's personal website. Use common sense. Make sure to return false for company websites.


Maybe change the API so that GPT can express uncertainty (make it a ternary value or even a confidence percentage), and then check the “uncertain” cases manually.
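A rough sketch of what that could look like (the label names, model, and exact prompt wording here are my own assumptions, not the site's actual code):

  import OpenAI from "openai";

  const openai = new OpenAI();

  // Three-way label instead of a plain boolean.
  type Verdict = "personal" | "company" | "uncertain";

  async function classifyDomains(domains: string[]): Promise<Verdict[]> {
    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini", // placeholder model
      response_format: { type: "json_object" },
      messages: [
        {
          role: "system",
          content:
            'You are an API. Return JSON like {"verdicts": [...]} with one of ' +
            '"personal", "company" or "uncertain" for each provided domain. ' +
            'Use common sense, and answer "uncertain" when not confident.',
        },
        { role: "user", content: domains.join("\n") },
      ],
    });
    return JSON.parse(completion.choices[0].message.content ?? "{}").verdicts;
  }

  // Auto-handle the confident cases; route "uncertain" ones to manual review.
  const domains = ["example.com", "janedoe.net"];
  const verdicts = await classifyDomains(domains);
  const manualQueue = domains.filter((_, i) => verdicts[i] === "uncertain");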


Yep, most of our systems end up exposing a parameter like that to the customer. Some customers only want the system to act when it's very sure: they hate incorrect actions and would rather have unprocessed items sit in a review queue. Others hate unprocessed items and would rather clean up the occasional incorrect action. It takes tinkering to find the right balance.
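At its simplest that's just a per-customer threshold deciding between acting and queueing (generic sketch, all names made up):

  type Action = "apply" | "queue_for_review";

  // confidence in [0, 1]; the threshold comes from per-customer config.
  function route(confidence: number, customerThreshold: number): Action {
    return confidence >= customerThreshold ? "apply" : "queue_for_review";
  }

  route(0.8, 0.95); // cautious customer => "queue_for_review"
  route(0.8, 0.5);  // cleanup-tolerant customer => "apply"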


Great idea, I will try this. Thank you!!


For the 404s (assuming the status code isn't a 4xx), fetch a URL that you strongly suspect won't exist, then do a comparison (Levenshtein distance, bag of words, etc.) to see whether its content is very similar to one of the /about, /ideas, etc. pages.
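A sketch of that idea (the probe path, the bag-of-words Jaccard similarity, and the 0.9 cutoff are arbitrary illustrative choices):

  // Fetch a page that almost certainly doesn't exist, then flag real pages
  // whose content looks nearly identical to it (a "soft 404").
  function bagOfWords(text: string): Set<string> {
    return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  }

  // Jaccard similarity over the two word sets.
  function similarity(a: string, b: string): number {
    const wa = bagOfWords(a);
    const wb = bagOfWords(b);
    const inter = [...wa].filter((w) => wb.has(w)).length;
    const union = new Set([...wa, ...wb]).size;
    return union === 0 ? 0 : inter / union;
  }

  async function isSoft404(origin: string, pageText: string): Promise<boolean> {
    // A path that should never exist on a real site.
    const probe = await fetch(`${origin}/definitely-not-a-page-${Date.now()}`);
    return similarity(pageText, await probe.text()) > 0.9;
  }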


> For the 404s (assuming the status code isn't a 4xx)

Most are a 4xx code, I checked myself. Some may be 301/302 redirects to a 4xx that aren't being handled properly by their crawler.


Good point. We're using https://crawlee.dev; I think there's a way to handle more status codes as errors...

Right now it only excludes pages based on the text content: https://github.com/lindylearn/aboutideasnow/blob/main/apps/a...
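For illustration, a sketch that checks the final status code inside the handler, assuming crawlee's CheerioCrawler (its requestHandler context exposes the underlying HTTP response, and the client follows redirects, so a 301/302 chain ending on a 404 should surface as a final 4xx here):

  import { CheerioCrawler } from "crawlee";

  const crawler = new CheerioCrawler({
    async requestHandler({ request, response }) {
      // Treat any final 4xx/5xx as a dead page instead of indexing its text.
      if ((response.statusCode ?? 0) >= 400) {
        throw new Error(`Dead page: ${request.url} -> ${response.statusCode}`);
      }
      // ...existing text-content filtering goes here...
    },
  });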


I think the OpenAI embeddings API could be useful here. Perhaps some dimension of the embedding responds to corporate speak.
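One way to test that without hunting for a single corporate-speak dimension: embed a few hand-labeled examples of each style and score new pages by which centroid they're closer to (model choice and example texts are placeholders):

  import OpenAI from "openai";

  const openai = new OpenAI();

  async function embed(texts: string[]): Promise<number[][]> {
    const res = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: texts,
    });
    return res.data.map((d) => d.embedding);
  }

  function cosine(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  function centroid(vs: number[][]): number[] {
    const mean = new Array(vs[0].length).fill(0);
    for (const v of vs) for (let i = 0; i < v.length; i++) mean[i] += v[i];
    return mean.map((x) => x / vs.length);
  }

  const corporate = centroid(await embed([
    "We empower enterprises to unlock synergies at scale.",
  ]));
  const personal = centroid(await embed([
    "Hi, I'm Jane. I write about gardening and compilers.",
  ]));

  async function looksCorporate(pageText: string): Promise<boolean> {
    const [v] = await embed([pageText]);
    return cosine(v, corporate) > cosine(v, personal);
  }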



