The SPAs we create can be reliably linked to (the current URL changes as the user moves around, even though the page hasn't reloaded) and they are "stable" because our business would go bankrupt if Google couldn't crawl our content.
If Google can crawl it, then you can too. And while Google doesn't use a headless browser (or at least I assume they don't), they absolutely do execute JavaScript before indexing the content of the page. They execute the click event handlers on every link/button, and when we use "history.pushState()" to change the URL, Google considers that a new page.
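To make the pushState point concrete, here's a minimal sketch of pushState-style client-side routing. The "history" object is a stub so the snippet runs outside a browser; in a real SPA you'd use window.history and listen for "popstate" events, and renderRoute is a hypothetical render function, not part of any real framework.

```javascript
// Stub of the browser History API (assumption: real code uses window.history).
const history = {
  stack: [],
  pushState(state, title, url) {
    this.stack.push({ state, url });
  },
};

const rendered = [];
function renderRoute(url) {
  // Hypothetical client-side render for this sketch.
  rendered.push(url);
}

function navigate(url) {
  // No page reload: we update the URL and re-render client-side.
  // Each pushState'd URL is a distinct, crawlable "page" to Google.
  history.pushState({ path: url }, "", url);
  renderRoute(url);
}

navigate("/products/42");
navigate("/cart");
```

The key property is that every navigation leaves a real, linkable URL behind, even though the document itself never reloads.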
You're just going to get a loading spinner with no content if you do a dumb crawl. (I disagree with that approach and think we should be running a headless browser server-side to execute the JavaScript and generate the initial page content for all our pages... but so far management hasn't prioritised that change. Instead they just keep telling us to make our client-side JavaScript run faster. Imagine if there was no JavaScript to execute at all? At least none before first contentful paint.)
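A crawler can cheaply detect that "loading spinner with no content" case before deciding whether to spin up a browser. This is a sketch, assuming the raw (pre-JavaScript) HTML is already fetched as a string; the tag-stripping regexes and the 40-character threshold are illustrative heuristics, not a robust HTML parser.

```javascript
// Heuristic: if the <body> of the un-executed HTML contains almost no
// visible text, the page is probably an SPA shell and needs JS rendering.
function looksLikeSpaShell(html) {
  const match = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  const body = match ? match[1] : "";
  const text = body
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop inline scripts
    .replace(/<[^>]+>/g, "")                    // drop remaining tags
    .trim();
  return text.length < 40; // hypothetical threshold for "no real content"
}
```

A dumb crawler could index pages where this returns false directly, and either skip the rest or hand them to a headless browser.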
> The SPAs we create can be reliably linked to (the current URL changes as the user moves around, even though the page hasn't reloaded) and they are "stable" because our business would go bankrupt if Google couldn't crawl our content.
This is true for some SPAs, but not all SPAs, and there's not really any way of telling which is which.
I don't personally attempt to crawl SPAs because it's not the sort of content I want to index.
I have a pet theory that there are two forms of the web: the document web and the application web. SPAs have some very attractive properties for the application web but complicate/break the document web.
That being said, with sites like HN, Reddit, LinkedIn, Twitter, news outlets, etc. the lines between “document” and “application” get blurred. In some ways they’ve built a micro-application that hosts documents. Content can be user submitted in-browser. Content can be “engaged with” in browser. Some handle this blurring better than others. HN is an example IMO of getting it right where nearly everything that should be addressable (like comments) can be linked to. Others not so much.
For application websites like the ones you listed, you'd typically end up building a special integration for crawling against their API or data dumps. This is also true for GitHub, Stack Overflow, and even document-y websites like Wikipedia.
It's simply not feasible to treat them like any other website if you want to index their data.
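In practice that "special integration" ends up as a dispatch layer in the crawler. This is a minimal sketch; the handler names and hostname patterns are assumptions for illustration, not real endpoints or products.

```javascript
// Map well-known hosts to specialized integrations (API clients, dump
// importers); everything else falls back to generic HTML crawling.
const integrations = [
  { match: /(^|\.)wikipedia\.org$/, handler: "wikipedia-dumps" },
  { match: /(^|\.)stackoverflow\.com$/, handler: "stackexchange-api" },
  { match: /(^|\.)github\.com$/, handler: "github-api" },
];

function pickCrawlStrategy(hostname) {
  const special = integrations.find((i) => i.match.test(hostname));
  return special ? special.handler : "generic-html-crawl";
}
```

The point is that the per-site handlers consume structured data (API responses, dumps) rather than scraping rendered pages, which sidesteps the SPA problem entirely for those sites.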