For those not using Cloudflare who have access to their web server config files and want to block AI bots, I put together a set of prebuilt configs[0] (for Apache, Nginx, Lighttpd, and Caddy) that will block most AI bots from scraping content. The configs are built on top of public data sources[1] with various adjustments.
I wonder why you're blocking CCBot? The Common Crawl Foundation predates LLMs and most of the 10,000 research papers using our corpus have nothing to do with Machine Learning.
First off, I want to thank you and the other members of the CC Foundation; the CC data set is an incredible resource for everyone.
Much of the UA data, including CCBot, is from an upstream source[0]. I was torn on whether CCBot and other archival bots should be included in the configs, since these services are not AI scraping services. I've now excluded CCBot[1] and the archival services from the recommended configs.
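To make the blocklist-plus-exclusion idea concrete, here is a minimal Python sketch of the same logic as WSGI middleware. It is not how the linked configs work (those are native web server configs); the bot names in `BLOCKED_AGENTS`, the `EXCLUDED_AGENTS` set, and the helper names are all illustrative assumptions rather than the actual upstream data.

```python
import re

# Illustrative assumption: a tiny sample blocklist, not the real upstream data.
BLOCKED_AGENTS = ["GPTBot", "ClaudeBot", "Bytespider", "CCBot"]
# Archival crawlers exempted from blocking (assumed set for this sketch).
EXCLUDED_AGENTS = {"CCBot"}


def build_pattern(blocked, excluded):
    """Compile one case-insensitive regex from the blocklist minus exclusions."""
    names = [re.escape(name) for name in blocked if name not in excluded]
    if not names:
        return re.compile(r"(?!)")  # never matches when nothing is blocked
    return re.compile("|".join(names), re.IGNORECASE)


def is_blocked(user_agent, pattern):
    """Return True when the User-Agent contains a blocked bot name."""
    return bool(user_agent) and pattern.search(user_agent) is not None


class BlockBots:
    """WSGI middleware that answers 403 to blocked user agents."""

    def __init__(self, app, pattern):
        self.app = app
        self.pattern = pattern

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if is_blocked(user_agent, self.pattern):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)
```

Wrapping an app looks like `app = BlockBots(app, build_pattern(BLOCKED_AGENTS, EXCLUDED_AGENTS))`. User-agent matching only stops bots that identify themselves honestly, which is also true of the server-level configs.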
Now that's the challenging part, especially since Mozilla needs to fund browser development. The initial differentiation for a Mozilla search would have been difficult, but at their peak they had 30%+ market share, and if the default search on Firefox had been Mozilla Search, they might have been able to make it all work financially. DuckDuckGo makes over $100 million in revenue per year[1]. If a Mozilla Search made as much money annually, then even though that's below their current $500 million search contract with Google, with some fiscal responsibility they probably would have had enough to support a search engine and browser development concurrently.
>The biggest issue is that an ancient version of Firefox is the only viable option for a web browser.
There was some work done on porting Pale Moon, Otter Browser, and QtWebEngine to ArcaOS. A preview version of a QtWebEngine-based browser[0] was available for download and may work with newer websites (given that QtWebEngine is based on Chromium), but I have not tested it myself.
> But I always find I have nothing to speak about.
Odds are you do have something interesting to speak about. Many people are experts in very niche things and don't even realize it. You may be very proficient with a niche piece of software that is not well documented, or may have created software to solve a very specific problem. Writing blog posts about your niche knowledge can be tremendously helpful; I can't tell you how many times a single blog post about an obscure problem has saved me hours (possibly even days or weeks) of research when I've encountered the same problem.
I feel like I never understand anything well enough to tell other people about it. The bare information I have can easily be acquired by anyone else without reading anything I write, so my writing is a waste of time (when it comes to writing for other people, anyway).
Everyone learns things in a different order and in different ways, so an alternative explanation or source of information can still be valuable. Sometimes I have an important realization simply because something I'd already known for years was presented from a different angle, or maybe just because I rediscovered it in a more fitting moment of my life.
And if someone doesn't find it useful, well, they'll just stop reading and go somewhere else.
Often writing for yourself (“how I did x when I didn’t know how”) is useful enough without being an expert on the subject. You can refer back to it later on if you forget how you did it, or someone else who is in your shoes will appreciate it. It also makes things less daunting (“hey, others are going through the same as me”).
They have an IRC chat room where you can request an invite[0]. They typically ask if you have a personal website, git repo, interesting projects, etc. when inviting people.
Lisp has very minimal syntax, so you can build a Lisp interpreter fairly easily in many languages. Most languages don't have a readily available Python interpreter that you can embed, and building one that has feature parity with CPython (the most popular Python implementation) is not an easy task.
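As a rough illustration of how little it takes, here is a toy Lisp evaluator in Python. It is only a sketch under some assumptions: a Scheme-like subset with `define`, `lambda`, `if`, and a handful of arithmetic and comparison operators, with no strings, quoting, or macros. All function names here are made up for the example.

```python
import operator


def tokenize(source):
    """Split source text into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def atom(token):
    """Convert a token to an int or float, else keep it as a symbol (str)."""
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return token


def parse(tokens):
    """Consume tokens from the front of the list and build nested lists."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens and tokens[0] != ")":
            expr.append(parse(tokens))
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # discard the closing ")"
        return expr
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return atom(token)


GLOBAL_ENV = {
    "+": operator.add, "-": operator.sub,
    "*": operator.mul, "/": operator.truediv,
    "<": operator.lt, ">": operator.gt, "=": operator.eq,
}


def evaluate(expr, env):
    """Evaluate a parsed expression in the given environment (a dict)."""
    if isinstance(expr, str):       # symbol lookup
        return env[expr]
    if not isinstance(expr, list):  # number literal
        return expr
    head = expr[0]
    if head == "if":
        _, test, then, otherwise = expr
        return evaluate(then if evaluate(test, env) else otherwise, env)
    if head == "define":
        _, name, value = expr
        env[name] = evaluate(value, env)
        return None
    if head == "lambda":
        _, params, body = expr
        # Arguments shadow the defining environment at call time.
        return lambda *args: evaluate(body, {**env, **dict(zip(params, args))})
    func = evaluate(head, env)
    args = [evaluate(arg, env) for arg in expr[1:]]
    return func(*args)


def run(source, env):
    """Parse and evaluate a single expression."""
    return evaluate(parse(tokenize(source)), env)
```

For example, after `env = dict(GLOBAL_ENV)` and `run("(define fact (lambda (n) (if (< n 2) 1 (* n (fact (- n 1))))))", env)`, calling `run("(fact 5)", env)` returns 120. A Python interpreter would need a far larger grammar and runtime before it could run even simple real-world scripts.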
Wikipedia has a page on single-user personal wikis that are great for organizing your notes[0].
If you're looking for something simple and unixy, Zim is a pretty good choice[1]. It's an offline GTK-based GUI application for creating personal wikis that saves all the wiki pages as plain-text files with wiki markup and can export your wiki as HTML using various templates. Zim has been available in many Linux package repos for over a decade and is GPL-2.0 licensed.
Yes, it can be quite frustrating. On my search engine[0] I tag sites that use captcha services and browser checks so that the user has more transparency and can choose to avoid sites that use these services if they want.
I work on a search engine in my spare time as a side project. What I've learned from working on a search engine is that even though the recent GPT models are quite good for general purpose search, there are still ample opportunities in search, and generally not enough people looking at search to cover everything.
If you don't mind me asking, what do you use Haskell for at work? Also, was the company using Haskell before, or did you bring Haskell to the job? It's always fascinating hearing stories about people using less common languages like Common Lisp or Haskell on the job!
I've only ever joined Haskell companies (although I did use it for a startup of my own once!)
I would generally categorize what I do as "backend engineering." The domains have been all over the place. Crypto, fintech, adtech, data engineering, SaaS, infra automation. Lots of Postgres and a bunch of other tech (Redis, Kafka, other KV stores).
The places end up finding me (usually through LinkedIn where I make very clear that Haskell is what I'm about).
It's been fun! I've met a lot of nice people, and I got paid to climb way over the Haskell learning curve. Can't complain :)
[0] https://github.com/anthmn/ai-bot-blocker
[1] https://darkvisitors.com/