canu7's comments

If they need to query a trained LLM for each page they crawl, I would guess that the training cost would scale up pretty badly...


Of course you wouldn't do it for every single page. If I were designing this crawler, I'd have it sample a percentage of pages, starting at a 100% sample rate for a completely unknown website and decreasing the sample rate over time as more "good" pages are found relative to "bad" pages.

After a "good" page percentage threshold is exceeded, stop sampling entirely and just crawl, assuming that all content is good. After a "bad" page percentage threshold is exceeded just stop wasting your time crawling that domain entirely.

With modern models the sampling cost should be quite cheap, especially since Nepenthes has a really small page size. Now if the page were humongous, that might make it harder and more expensive to put through an LLM.
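A minimal sketch of the sampling policy described above, in Python. The class, thresholds, and method names are all hypothetical, just to illustrate the "sample less as confidence grows, abandon bad domains" idea:

    import random

    class DomainSampler:
        """Per-domain decision: classify a page with the LLM, skip the check, or give up."""

        def __init__(self, good_threshold=0.9, bad_threshold=0.5, min_samples=20):
            self.good = 0
            self.bad = 0
            self.good_threshold = good_threshold  # above this good ratio: stop sampling, just crawl
            self.bad_threshold = bad_threshold    # above this bad ratio: stop crawling the domain
            self.min_samples = min_samples        # don't decide on too little data

        def record(self, is_good: bool):
            """Record the LLM's verdict for a sampled page."""
            if is_good:
                self.good += 1
            else:
                self.bad += 1

        @property
        def total(self):
            return self.good + self.bad

        def should_sample(self) -> bool:
            """Should the next fetched page be sent to the classifier LLM?"""
            if self.total < self.min_samples:
                return True                       # unknown domain: 100% sample rate
            good_ratio = self.good / self.total
            if good_ratio >= self.good_threshold:
                return False                      # trust the domain, crawl without checking
            # Otherwise sample in proportion to the remaining uncertainty.
            return random.random() < (1.0 - good_ratio)

        def should_abandon(self) -> bool:
            """Stop wasting time on the domain once the bad-page ratio is too high."""
            return (self.total >= self.min_samples
                    and self.bad / self.total >= self.bad_threshold)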


> After a "bad" page percentage threshold is exceeded just stop wasting your time crawling that domain entirely.

In the words of Bush jr.: Mission accomplished!


Looking forward to trying it! A shout-out to cvise (https://github.com/marxin/cvise) as a Python alternative that also works well for non-C languages.


Phoronix does a lot of CPU benchmarks, including code compilation, though mostly focused on Linux. More results are available on the OpenBenchmarking page, which is part of the same project. Take a look at the timed compilation test suite: https://openbenchmarking.org/suite/pts/compilation

