
Is there any real obstacle to building a search engine that primarily shows smaller independent websites? Seems pretty doable.


You can't find what's not there. Way too much of the primary content has moved behind those garden walls.

My local farm has a website where they list the stuff they have available. Meanwhile their actual scheduling and detail updates are on their Instagram because of course it is.


Accurate, up-to-date information for many businesses seems to be about 90% on Instagram and 10% on Facebook. The website, if there is a website, has no information, or old information.

This is frustrating (among other reasons) because Instagram has become much more aggressive about not letting you see content without logging in. Sometimes you can see the gallery but not individual posts; sometimes you can't see anything at all.


Even worse, you cannot use a mobile browser with cookies deactivated; at least that's what Instagram claimed the last time I tried accessing it using DDG's browser. I'm sure their overall numbers improved, but I stopped bothering that very day. That leaves WhatsApp among Meta's apps, which seems to have become Facebook's Excel: the only reason to use anything from that vendor in the first place.


This is particularly acute in countries that skipped the desktop internet and jumped straight to mobile internet. For much of this userbase, their first and primary interaction with the web is through apps. They never bothered creating blogs or web pages and now all their content is behind the walled gardens of Facebook and Instagram.


I'm not having much trouble finding this type of content on the search engine I built that does exactly this. Not everything is available of course, but 20 years ago, you'd probably have to call someone for that information, so not much difference.


At some point it would probably become useful to teach “internet/tech literacy” to educate people on why this is a problem. But we’re a few decades from something like this.


When I was a kid we had "computer class" that taught how to type and how to use Microsoft Office (and OpenOffice) applications for different use cases, and this was mixed in with lessons on evaluating different sources of information taught by the school library and English classes.

As kids are now raised on smartphones instead of the family desktop, I think they need MORE of this, not less, if only for the very important skill of typing. I wonder how many 12-year-olds in America can type using the "standard" method instead of hunt-and-peck.

I don't want computing to be something known only by the children of turbo nerds. I want young adults to be able to solve their own problems with computers, e.g. build some spreadsheets for home finances or just graph the data from one of their science classes.


True technical literacy is at an all-time low, IMO. Whilst more people than ever are "online", the barrier to entry is so low, and very few people ever seem to learn more than they absolutely need to.

As you can't develop software on phones and tablets, very few people are tinkering with software. The Pi and iOS app craze brought a momentary change, but it seems to have gone back to how it was—and worse.

Kids of today are mostly out of their depth when put in front of a computer of any description if it is beyond basic website usage. Complex program? Forget it. Decent typing skills? Forget it. Networking know-how they'd have picked up from doing LAN gaming with consoles or PCs? No chance. Change a drive? LOL.

For the handful of kids that game on PCs, they're generally not very clued up, and they're just copying builds they've seen on YouTube to the letter. It's a sad state of affairs.

And yes, of course there are the kids of us turbo nerds, because of course there are, but they are few and far between.


That is probably because the farmers hired somebody one time to make a website which they don't have the skill to maintain, but they know how to use Instagram.


Well, yeah, I know why they do it and it's a rational decision for them. It's really a condemnation of our own evolution as an industry and the incentives involved that it ended up that way.


On the other hand, 20 years ago such a farm would not have had any internet presence at all, let alone detailed, up-to-date inventory info. Instagram et al made it possible.


20 years ago there was no need for any of this. You went to the farm, did your grocery shopping, and that was it. Or you didn't go there. Either way, farms existed back then just fine.

Who on earth would expect a local farm shop to be on par with Amazon when it comes to inventory and availability data online?


My local farm does already have that information, like "blueberry picking suspended, waiting a week for ripening". I don't have to expect them to add inventory information; they're already doing it, because that's how they communicate with their customers. It's just only available behind the garden walls. It's an indictment of us as an industry that it turned out that way: the convenient way to do it won out over a more open and user-friendly way, like a website.

Twenty years ago you just called them for the information, and it's way better for them to broadcast it than have a hundred 1:1 conversations.



This is amazing and really made with love. I just found a '90s-style website by a guy who asked companies for free stuff by letter.


Nice! I discovered this:

https://tilde.club/~fab1/

(might make your fans spin up)


I searched for “cactus” and found a fun history website: http://www.realhistoryww.com/world_history/ancient/Etruria_t...


The biggest hurdle is that parsing web pages is really hard, and sites do a bad job at providing good meta data.


BeautifulSoup and its clones do parsing pretty well. Just extracting the text out of HTML isn't incredibly hard, and metadata is too unreliable to ever be much use.
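To illustrate the point that bare text extraction isn't incredibly hard, here is a rough sketch using only Python's standard library `html.parser` (BeautifulSoup does the same job far more robustly; the sample HTML is made up):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script/style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

extractor = TextExtractor()
extractor.feed("<html><head><style>p{color:red}</style></head>"
               "<body><p>Hello <b>world</b></p></body></html>")
print(extractor.text())  # Hello world
```

Of course this gives you everything, including the navigation and footer text, which is exactly the problem discussed below.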


The hard part is understanding which parts are the content versus navigation or promotions of other content.

I’ve written a couple search engines. Have you tried making one with beautiful soup?


No I use JSoup for my search engine.

If that is a problem, you can calculate anchor-tag density across the DOM tree and prune branches that exceed a certain threshold, which removes navigational elements with reasonable accuracy.

It's not going to be perfect, but even Google messes this up every once in a while. I wouldn't consider it a major hurdle.


I don't presume the source is available... unbelievably cool project that I'm sure a lot of people have imagined themselves doing.

edit: https://git.marginalia.nu/marginalia/marginalia.nu !!!


The actual feature I described is not in that repo though. It's something I've been working on. Here's the code for that (AGPL):

    // Uses JSoup (org.jsoup.nodes.*, org.jsoup.select.NodeVisitor) and java.util.{Map, HashMap}
    private static final double PRUNE_THRESHOLD = .5;

    public void prune(Document document) {
        PruningVisitor pruningVisitor = new PruningVisitor();
        document.traverse(pruningVisitor);

        pruningVisitor.data.forEach((node, data) -> {
            if (data.depth <= 1) {
                return;
            }
            if (data.signalNodeSize == 0) node.remove();
            else if (data.noiseNodeSize > 0
                    && data.signalRate() < PRUNE_THRESHOLD
                    && data.treeSize > 2) {
                node.remove();
            }
        });
    }



    private static class PruningVisitor implements NodeVisitor {

        private final Map<Node, NodeData> data = new HashMap<>();
        private final NodeData dummy = new NodeData(Integer.MAX_VALUE, 1, 0);

        @Override
        public void head(Node node, int depth) {}

        @Override
        public void tail(Node node, int depth) {
            final NodeData dataForNode;

            if (node instanceof TextNode tn) {
                dataForNode = new NodeData(depth, tn.text().length(), 0);
            }
            else if (isSignal(node)) {
                dataForNode = new NodeData(depth,  0,0);
                for (var childNode : node.childNodes()) {
                    dataForNode.add(data.getOrDefault(childNode, dummy));
                }
            }
            else {
                dataForNode = new NodeData(depth,  0,0);
                for (var childNode : node.childNodes()) {
                    dataForNode.addAsNoise(data.getOrDefault(childNode, dummy));
                }
            }



            data.put(node, dataForNode);
        }

        public boolean isSignal(Node node) {

            if (node instanceof Element e) {
                if ("a".equalsIgnoreCase(e.tagName()))
                    return false;
                if ("nav".equalsIgnoreCase(e.tagName()))
                    return false;
                if ("footer".equalsIgnoreCase(e.tagName()))
                    return false;
                if ("header".equalsIgnoreCase(e.tagName()))
                    return false;
            }

            return true;
        }
    }

    private static class NodeData {
        int signalNodeSize = 0;
        int noiseNodeSize = 0;
        int treeSize = 1;
        int depth = 0;

        public NodeData(int depth) {
            this.depth = depth;
        }

        private NodeData(int depth, int signalNodeSize, int noiseNodeSize) {
            this.depth = depth;
            this.signalNodeSize = signalNodeSize;
            this.noiseNodeSize = noiseNodeSize;
        }

        public void add(NodeData other) {
            signalNodeSize += other.signalNodeSize;
            noiseNodeSize += other.noiseNodeSize;
            treeSize += other.treeSize;
        }

        public void addAsNoise(NodeData other) {
            noiseNodeSize += other.noiseNodeSize + other.signalNodeSize;
            treeSize += other.treeSize;
        }

        public double signalRate() {
            return signalNodeSize / (double)(signalNodeSize + noiseNodeSize);
        }
    }

It renders the text of this link (at present): https://news.ycombinator.com/item?id=32594821

Into this search-engine friendly text:

The hard part is understanding which parts are the content versus navigation or promotions of other content. I’ve written a couple search engines. Have you tried making one with beautiful soup? Why does it matter? You love seafood, so just literally run grep on the entire page and if it contains the word then include it as a correct. In reality, you will miss a lot of real seafood pages because they don't really need to mention "seafood" and context matters, so what? Chances are that that one website where person randomly added "I love seafood" to the top of the page will be the only page that you've ever wanted to see anyway. There's too much data for you to go through in entire life in any case, so why worry about it as long as you can get something that's good enough? You will never get best data, if it was possible, google would be giving you best data already. How do I know? Well, looking up my real name shows where I grew up, what school I went to, graduated, and even which exam I scored 100 on... And even some places I used to work for in the past, and while that part is going to make most people paranoid, I wish ALL results were as detailed as this one, but there's little you can do. No I use JSoup for my search engine. You can calculate anchor tag density across the DOM tree and prune branches that exceed a certain threshold to remove navigational elements with reasonable accuracy if that is a problem. It's not going to be perfect, but even Google messes this up every once in a while. I wouldn't consider it a major hurdle. I don't presume the source is available... unbelievably cool project that I'm sure a lot of people have imagined themselves doing.


Thanks for sharing! I’ll try this after work on some URLs from the search engine I’m working on as my hobby project.


Yeah, almost certainly could do with some tweaking and tuning, but the basic idea works remarkably well in many cases.


Do you have a fully functioning code example? I didn't realize it was just a snippet when I looked at it earlier.



Yeah, it depends on what you want to prioritize and value in your search engine. I’m coming at it from the angle that if you want to make a good, new, and different kind of search engine, you need to do something fundamentally different than Google. No one is going to beat Google at their own game. Leveraging metadata is a very easy way to make something new and different, but it won’t be as comprehensive as Google. I doubt that someone doing what you described over a few months or a year could make a search engine that anyone wanted to use.


> I doubt that someone doing what you described over a few months or a year could make a search engine that anyone wanted to use.

Dunno. Not only are people sending me money to develop my search engine (not enough to live off, but still), I also get emails and tweets from people who say they love it on an almost weekly basis.

I think attempting to be as comprehensive (or more) than Google is a trap. The better move is to fly under them. Be cheaper and better at something. Recipes is a great example of something Google is just miserable at, that is easy to do much better. There's plenty of such niches.


Why does it matter?

You love seafood, so just literally run grep on the entire page, and if it contains the word then include it as a correct result.

In reality, you will miss a lot of real seafood pages because they don't really need to mention "seafood" and context matters, but so what? Chances are that the one website where a person randomly added "I love seafood" to the top of the page will be the only page that you ever wanted to see anyway.

There's too much data for you to go through in an entire lifetime in any case, so why worry about it as long as you can get something that's good enough? You will never get the best data; if that were possible, Google would already be giving it to you.

How do I know? Well, looking up my real name shows where I grew up, what school I went to, graduated, and even which exam I scored 100 on... And even some places I used to work for in the past, and while that part is going to make most people paranoid, I wish ALL results were as detailed as this one, but there's little you can do.
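The grep approach this comment describes amounts to a substring test per page. A minimal sketch (the page texts and URLs are made up for illustration) showing both failure modes raised here, the incidental match and the missed relevant page:

```python
pages = {
    # hypothetical page texts for illustration
    "fish-market.example": "Daily catch: cod, haddock, oysters.",
    "blog.example": "I love seafood, also here are my vacation photos.",
}

def grep_search(term, pages):
    """Naive approach: include any page whose text contains the term."""
    return [url for url, text in pages.items() if term.lower() in text.lower()]

print(grep_search("seafood", pages))
# ['blog.example'] -- misses fish-market.example (a genuine seafood page
# that never uses the word) and matches one incidental mention instead.
```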


That’s how you make a worse search engine than Google. If you are serious about competing in that space I think you need to do something fundamentally different than Google. Treating pages as a bag of words leads to a shitty search engine. Like I said, I’ve built a few search engines, and I have tried this.

Edit: https://en.wikipedia.org/wiki/Bag-of-words_model
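For context, a bag-of-words model represents a page purely as term counts, discarding word order and context entirely. A minimal sketch of why that loses meaning:

```python
from collections import Counter

def bag_of_words(text):
    """Term counts only: word order and context are discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # True: opposite meanings, identical bags
```

Two sentences with opposite meanings are indistinguishable to the model, which is one reason treating pages this way ranks poorly.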


That actually sounds like the solution. If you're getting something standard, you don't want it. If something is too non-standard to be identified, pass it through.


Unless this is a whitelist, it's a metric which will be gamed.

If it is a whitelist, then why have a search engine rather than an old-school curated Yahoo directory?


Search engine spam is actually a fairly solvable problem if you aren't in Google's questionable position of also selling the ads that make the spam economically viable.

They can do everything except the one thing that would actually hurt the search engine spammers right in the coin purse: Penalize websites for having ads.


> The goals of the advertising business model do not always correspond to providing quality search to users.

- Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine


Yeah, that one aged well :P


I don't know about that. If I recall correctly, there was a time when Google was trying to tamp down on pages with excessive advertising, so SEO spammers just switched to making pages that superficially looked like normal informational pages, but the content was all either ad copy or spun text, and all their links went to products they wanted to sell.


I get where you're coming from, but my experience is that even many people making sites for fun stick ads on the site in the hopes of getting a bit of extra cash. Lots of fansites did so in the olden days, and many useful sites, wikis and blogs do so now, especially if they either use hosting that adds them or they get enough traffic they feel they need to.


I'm not saying block ads, I'm saying prioritize sites that don't have ads above those that do, and those that have a few over those that have a lot.


> Penalize websites for having ads.

Oh ... this is such a good idea. I'm like tempted to try it and see what happens.


Is Wikipedia's prompt to donate (which they do quite often) an ad?


The whole premise of Google at the beginning was that the web is a collaboration, and we can measure that.

Then it was we can measure that and make money.

Now it’s just we can make money.

Whether that part is sinister or not, we know that we have a good number of bad actors, and from search engine results we can be sure that they have not developed a workable Byzantine fault tolerance mechanism to filter out the bad actors. Those who scream the loudest get put on a stage.


Companies that acquire other companies and talent will always change. The goal of making money over everything is the one thing all C corporations have in common.


An AI could make the list, and pretty soon the gamers of the system would opt for really terse recipe sites.


It cannot be gamed. You wrote your own search engine, you wrote your own filters, and you know how it works while no one else does.

Whitelists that I wrote by hand also don't introduce new unexpected entries by the way :)

This instead could be more like RSS: as your crawler finds new sites, you get updates on new entries, and you could filter in your crawler or in your RSS client directly; it doesn't matter.



