The SPAs we create can be reliably linked to (the current URL changes as the user moves around, even though the page hasn't reloaded) and they are "stable" because our business would go bankrupt if Google couldn't crawl our content.
If Google can crawl it, then you can too. And while Google doesn't use a headless browser (or at least I assume they don't), they absolutely do execute JavaScript before indexing the content of the page. They execute the click event handlers on every link/button, and when we use "history.pushState()" to change the URL, Google considers that a new page.
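To make the pushState point concrete, here's a minimal sketch of pushState-style client-side routing. The "history" object is a stub so the snippet runs outside a browser; in a real SPA you'd use window.history and listen for "popstate" events, and renderRoute is a hypothetical render function, not part of any real framework.

```javascript
// Stub of the browser History API (assumption: real code uses window.history).
const history = {
  stack: [],
  pushState(state, title, url) {
    this.stack.push({ state, url });
  },
};

const rendered = [];
function renderRoute(url) {
  // Hypothetical client-side render for this sketch.
  rendered.push(url);
}

function navigate(url) {
  // No page reload: we update the URL and re-render client-side.
  // Each pushState'd URL is a distinct, crawlable "page" to Google.
  history.pushState({ path: url }, "", url);
  renderRoute(url);
}

navigate("/products/42");
navigate("/cart");
```

The key property is that every navigation leaves a real, linkable URL behind, even though the document itself never reloads.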
You're just going to get a loading spinner with no content if you do a dumb crawl. (I disagree with that approach and think we should be running a headless browser server-side to execute the JavaScript and generate the initial page content for all our pages... but so far management hasn't prioritised that change. Instead they just keep telling us to make our client-side JavaScript run faster. Imagine if there was no JavaScript to execute at all? At least none before first contentful paint.)
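A crawler can cheaply detect that "loading spinner with no content" case before deciding whether to spin up a browser. This is a sketch, assuming the raw (pre-JavaScript) HTML is already fetched as a string; the tag-stripping regexes and the 40-character threshold are illustrative heuristics, not a robust HTML parser.

```javascript
// Heuristic: if the <body> of the un-executed HTML contains almost no
// visible text, the page is probably an SPA shell and needs JS rendering.
function looksLikeSpaShell(html) {
  const match = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  const body = match ? match[1] : "";
  const text = body
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop inline scripts
    .replace(/<[^>]+>/g, "")                    // drop remaining tags
    .trim();
  return text.length < 40; // hypothetical threshold for "no real content"
}
```

A dumb crawler could index pages where this returns false directly, and either skip the rest or hand them to a headless browser.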
> The SPAs we create can be reliably linked to (the current URL changes as the user moves around, even though the page hasn't reloaded) and they are "stable" because our business would go bankrupt if Google couldn't crawl our content.
This is true for some SPAs, but not all SPAs, and there's not really any way of telling which is which.
I don't personally attempt to crawl SPAs because it's not the sort of content I want to index.
I have a pet theory that there are two forms of the web: the document web and the application web. SPAs have some very attractive properties for the application web but complicate/break the document web.
That being said, with sites like HN, Reddit, LinkedIn, Twitter, news outlets, etc. the lines between “document” and “application” get blurred. In some ways they’ve built a micro-application that hosts documents. Content can be user submitted in-browser. Content can be “engaged with” in browser. Some handle this blurring better than others. HN is an example IMO of getting it right where nearly everything that should be addressable (like comments) can be linked to. Others not so much.
For application websites like the ones you listed, you'd typically end up building a special integration for crawling against their API or data dumps. This is also true for GitHub, Stack Overflow, and even document-y websites like Wikipedia.
It's simply not feasible to treat them like any other website if you want to index their data.
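In practice that "special integration" ends up as a dispatch layer in the crawler. This is a minimal sketch; the handler names and hostname patterns are assumptions for illustration, not real endpoints or products.

```javascript
// Map well-known hosts to specialized integrations (API clients, dump
// importers); everything else falls back to generic HTML crawling.
const integrations = [
  { match: /(^|\.)wikipedia\.org$/, handler: "wikipedia-dumps" },
  { match: /(^|\.)stackoverflow\.com$/, handler: "stackexchange-api" },
  { match: /(^|\.)github\.com$/, handler: "github-api" },
];

function pickCrawlStrategy(hostname) {
  const special = integrations.find((i) => i.match.test(hostname));
  return special ? special.handler : "generic-html-crawl";
}
```

The point is that the per-site handlers consume structured data (API responses, dumps) rather than scraping rendered pages, which sidesteps the SPA problem entirely for those sites.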