Do you have an idea of how to remove company websites in an automated way? I didn't want to manually review all 7k indexed websites.
This is the GPT prompt I used for filtering domains to add, but it gives false positives:
You are an API. Return a JSON array of booleans indicating whether each provided domain is someone's personal website. Use common sense. Make sure to return false for company websites.
Maybe change the API so that GPT can express uncertainty (make it a ternary value or even a confidence percentage), and then check the “uncertain” cases manually.
Yep, most of our systems end up exposing a parameter like that to the customer. Some customers want the system to act only when it is very sure: they hate incorrect actions and prefer to leave unprocessed items in a queue. Others hate unprocessed items and would rather clean up incorrect actions afterwards. It takes some tinkering to find the best setting.
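A minimal sketch of that idea in Python, assuming the model has already been asked for a per-domain confidence (0.0 to 1.0) that the site is personal. The function name `triage`, the two thresholds, and the example domains are hypothetical, picked only for illustration; the thresholds are the tunable parameter mentioned above.

```python
from typing import Dict, List, Tuple


def triage(
    scores: Dict[str, float],
    accept_at: float = 0.9,   # assumed default; tune per use case
    reject_at: float = 0.1,   # assumed default; tune per use case
) -> Tuple[List[str], List[str], List[str]]:
    """Split domains into (accepted, rejected, needs_review) by confidence."""
    accepted: List[str] = []
    rejected: List[str] = []
    needs_review: List[str] = []
    for domain, confidence in scores.items():
        if confidence >= accept_at:
            accepted.append(domain)       # confident it is a personal site
        elif confidence <= reject_at:
            rejected.append(domain)       # confident it is a company site
        else:
            needs_review.append(domain)   # uncertain: check manually
    return accepted, rejected, needs_review


# Hypothetical scores, as if parsed from the model's JSON response.
example_scores = {"alice.example": 0.97, "acme.example": 0.03, "bob.example": 0.55}
accepted, rejected, needs_review = triage(example_scores)
```

Moving `accept_at` and `reject_at` closer together shrinks the manual queue at the cost of more wrong automatic decisions; moving them apart does the opposite.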
For the 404s (assuming the status code isn't a 4xx), request a URL that you strongly suspect won't exist, then compare that response (Levenshtein distance, bag of words, etc.) with the site's about, ideas, etc. pages to see whether any of them is very similar to it.
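A rough sketch of that comparison in Python, assuming the page text and the text of the "surely missing" probe URL have already been fetched elsewhere. It uses Jaccard similarity over word sets, a simple stand-in for the bag-of-words idea rather than Levenshtein distance. The names `looks_like_soft_404` and `threshold`, and the 0.9 default, are assumptions for illustration.

```python
import re
from typing import Set


def word_set(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets, from 0.0 to 1.0."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0  # two empty pages are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def looks_like_soft_404(page_text: str, probe_text: str, threshold: float = 0.9) -> bool:
    """True if a page is nearly identical to the response for a bogus URL."""
    return jaccard(page_text, probe_text) >= threshold


# Hypothetical texts: probe_text is what the site returned for a nonsense path.
probe_text = "Page not found. Go back home."
missing_about = "Page not found. Go back home."
real_about = "Hi, I'm Alice. I write about compilers and gardening."

is_missing = looks_like_soft_404(missing_about, probe_text)
is_real_missing = looks_like_soft_404(real_about, probe_text)
```

The threshold probably needs tuning on real sites, since shared navigation and footer text pushes the similarity of genuinely different pages upward.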