
Is there any technical information about the search engine? I found some here: https://memex.marginalia.nu/projects/edge/about.gmi

The author said they threw it together on consumer hardware. How big is the index (in TB used or number of entries), and how is it implemented?

I'm quite interested in this since I'm crawling some pages myself for my own "search index".

Oh, and thanks for making and posting it. I added it as a keyword in Firefox.

Edit: Just realized that my question is a bit shallow. What I'm particularly interested in is the storage before the indexing. I'm trying to store the raw HTML so that I can reindex everything with better algorithms, but I'm hitting many limits. It takes a few minutes just to get the size of a site directory (every site has its own dir), I'm at a point where I can't reasonably manage the scrape versioning with git, and I cycled through a few filesystems only to find that the metadata management kind of sucks for most of them. It's rather interesting how we store such files, and I'm thinking about storing a few sites in a simple SQLite format for easy access and search. I'm also thinking about a few low-overhead solutions like Facebook's Haystack project (implemented as open source in SeaweedFS) or something similar... Hopefully this gives some context to the question of storage and the sites that are indexed.
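
To make the SQLite idea concrete, here is a minimal sketch of what I have in mind. The table layout and the names `open_store`, `save_page`, and `latest_page` are assumptions for illustration, not an existing tool. Each fetch becomes one row keyed by URL and fetch time, so older scrapes stay around for reindexing without needing git.

```python
import sqlite3
import time
import zlib


def open_store(path=":memory:"):
    """Open (or create) the page store. The schema is a hypothetical layout."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS page ("
        " url TEXT NOT NULL,"
        " fetched_at INTEGER NOT NULL,"
        " html BLOB NOT NULL,"
        " PRIMARY KEY (url, fetched_at))"
    )
    return db


def save_page(db, url, html, fetched_at=None):
    """Store one scrape of a URL; earlier scrapes of the same URL are kept."""
    ts = int(time.time()) if fetched_at is None else fetched_at
    blob = zlib.compress(html.encode("utf-8"))
    db.execute("INSERT INTO page VALUES (?, ?, ?)", (url, ts, blob))
    db.commit()


def latest_page(db, url):
    """Return the most recently fetched HTML for a URL, or None if unknown."""
    row = db.execute(
        "SELECT html FROM page WHERE url = ? ORDER BY fetched_at DESC LIMIT 1",
        (url,),
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

Getting the size of a site then becomes a single query over one file instead of a walk over a directory tree.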



The index is tiny, not even a terabyte. Right now it's a few hundred gigabytes for ~20 million URLs. But it's stored in an extremely dense binary format.

Honestly, you may just want to roll your own solution for storing a ton of files. If you don't need a general-purpose filesystem, but rather an append-only archive with extra metadata, then you can cut a lot of corners. If your storage is fixed-size and append-only, you can optimize it in ways no off-the-shelf solution can.
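
A rough sketch of the kind of append-only archive I mean, not the actual code behind the search engine. The record layout (two little-endian length fields, JSON metadata, zlib-compressed body) and the function names are assumptions made up for this example.

```python
import json
import struct
import zlib

# Hypothetical record header: metadata length, then body length (both uint32).
HEADER = struct.Struct("<II")


def append_record(f, meta, body):
    """Append one record to the archive and return the offset where it starts."""
    f.seek(0, 2)  # always write at the end: the archive is append-only
    offset = f.tell()
    meta_bytes = json.dumps(meta).encode("utf-8")
    body_bytes = zlib.compress(body)
    f.write(HEADER.pack(len(meta_bytes), len(body_bytes)))
    f.write(meta_bytes)
    f.write(body_bytes)
    return offset


def read_record(f, offset):
    """Read back the (meta, body) pair stored at a known offset."""
    f.seek(offset)
    meta_len, body_len = HEADER.unpack(f.read(HEADER.size))
    meta = json.loads(f.read(meta_len).decode("utf-8"))
    body = zlib.decompress(f.read(body_len))
    return meta, body


def scan(f):
    """Yield (offset, meta) for every record without decompressing any body."""
    f.seek(0, 2)
    end = f.tell()
    offset = 0
    while offset < end:
        f.seek(offset)
        meta_len, body_len = HEADER.unpack(f.read(HEADER.size))
        meta = json.loads(f.read(meta_len).decode("utf-8"))
        yield offset, meta
        offset += HEADER.size + meta_len + body_len
```

Since nothing is ever rewritten, one big file holds millions of pages, and the filesystem only has to track a single inode. Open the file in `r+b` or `w+b` mode (or use an `io.BytesIO` for experiments) and keep the returned offsets in whatever index you like.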

This line of thinking is a large part of why my index is so small and fast. I have a lot of purpose-built data structures that are designed for their exact use case. For example, a fixed-size append-only hash map that uses mapped memory and can in theory be larger than the system memory. Very good for a search engine, absolutely useless almost everywhere else.
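
To illustrate the idea only (this is a simplified Python sketch, not the real implementation): a fixed-size, append-only hash map over a memory-mapped file. The class name `FixedHashMap`, the 16-byte slot layout, the hash multiplier, and the rule that key 0 means "empty" are all assumptions for the example.

```python
import mmap
import os
import struct

# Hypothetical slot layout: uint64 key, uint64 value. Key 0 marks an empty slot.
SLOT = struct.Struct("<QQ")


class FixedHashMap:
    """Open-addressing hash map with linear probing over a memory-mapped file."""

    def __init__(self, path, capacity):
        self.capacity = capacity
        size = capacity * SLOT.size
        fd = os.open(path, os.O_RDWR | os.O_CREAT)
        try:
            os.ftruncate(fd, size)  # fixed size, decided once up front
            self.mem = mmap.mmap(fd, size)
        finally:
            os.close(fd)  # the mapping keeps its own handle to the file

    def _probe(self, key):
        # Multiplicative hash, then walk the slots linearly with wrap-around.
        i = (key * 0x9E3779B97F4A7C15 % 2**64) % self.capacity
        for _ in range(self.capacity):
            k, v = SLOT.unpack_from(self.mem, i * SLOT.size)
            yield i, k, v
            i = (i + 1) % self.capacity

    def put(self, key, value):
        if key == 0:
            raise ValueError("key 0 is reserved for empty slots")
        for i, k, _ in self._probe(key):
            if k == key:
                raise KeyError("append-only: key already present")
            if k == 0:
                SLOT.pack_into(self.mem, i * SLOT.size, key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        for _, k, v in self._probe(key):
            if k == key:
                return v
            if k == 0:
                return default
        return default

    def close(self):
        self.mem.flush()
        self.mem.close()
```

Because entries are never deleted or moved, there are no tombstones and no resizing, and the operating system pages the file in and out on demand, which is what lets the table exceed physical memory.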



