Huh? JS can be cached and preserved just like anything else. Even the (presumabl...

spicybright · on Oct 22, 2021

Browsers evolve and break older pages.

JS that requires requests as you interact with it will need an implementation of the server it uses (the subset and data of it's endpoints)

How do you preserve that?

tinco · on Oct 22, 2021

Browsers don't usually break older pages. The only time this happens when you rely on unstandardized features. At least, I have not noticed any page breaking in the past 15 years except for the ones I built using unstandardized API's.

asddubs · on Oct 22, 2021

-https sites embedding http images

-SameSite:none cookies (with bonus breakage that makes it impossible to use accross older and newer browsers simultaneously without user agent sniffing)

-the planned chrome alert/prompt changes already mentioned below

browsers have been getting a lot more comfortable with the idea of breaking backwards compatibility as of late.

Macha · on Oct 22, 2021

The largest browser maker no longer believes so:

https://www.theregister.com/2021/08/05/google_chrome_iframe/

https://mobile.twitter.com/estark37/status/14226948565440593...

spicybright · on Oct 22, 2021

> except for the ones I built using unstandardized API's.

Think you proved my point.

mappu · on Oct 22, 2021

The changing behaviour of browser autocomplete and the new disregard for autocomplete="off" really harmed multiple large CRM / ERP-style sites i worked on, as passwords would get "helpfully" autofilled into completely wrong fields, causing data loss.

I actually still don't think there's a proper sanctioned solution to this, it seems to be cat-and-moused by web developers and browser developers every year or two.

tenebrisalietum · on Oct 22, 2021

I've had some success in the past using a WARC proxy - it will basically record everything that traverses the browser and can "play it back" on demand. So while it won't automatically download everything on a site the idea is that whatever you visit with and interact with in a session can be "played back" in the future at some point.

genewitch · on Oct 22, 2021

I actually built a machine to do this, and wanted to use l my "old" spindle drives as storage, but I was and am unhappy with the offerings of replicated or raid like storage solutions. I keep waiting for ceph or something to have a better NFS layer, but I might just have to have scripts that balance/replicate the warc however many times based on usage.

For those interested I have 43TB, 210TB, 61-2TB, 212TB. The machine boots and sees all the spindles. ZFS would be ideal but cannot handle differing drive sizes I guess. JBOD misses the replication requirement. Just doing mdraid on all the similar drives and having different folders or JBOD "wastes" too many spindles (at least one per group, so 4-6 wasted spindles by size alone!).

So I'll probably handroll something with whatever that triggered rsync program is, or cronjobs. lsyncd, that's the one.

tshaddox · on Oct 22, 2021

How is that any different than wanting to archive a CGI website from the 90s with a URL structure like http://example.com/?query=foo? Unless there's an index page with links to all possible query values, or you can work out how to manually iterate all possible query values, there's not much you can do. This doesn't seem to have anything to do with JavaScript data visualizations specifically.

titusjohnson · on Oct 22, 2021

That URL structure is trivial for a crawler to walk and index. I'm not sure why you'd assume that there wouldn't be an index page, such a site would have all desired links in the DOM, the crawler just sniffs those out and visits in sequence. There's no need to think that the links would somehow be 'hidden' from the user and have to be randomly enumerated...

Not only that, but a site of that era probably also has a sitemap.xml file which would enumerate all available public endpoints, specifically to make it easier for crawlers to index everything.

tshaddox · on Oct 22, 2021

> I'm not sure why you'd assume that there wouldn't be an index page

I’m not assuming either way. I’m just pointing out that either type of web site could choose to have an index page or choose not to have an index page.

selfhoster11 · on Oct 22, 2021

If it's view-only, request-response replaying could be a viable option. Browser software can be emulated.

programmarchy · on Oct 22, 2021

Maybe a good example is the original Nintendo (NES) emulators. New gaming consoles can't play those old cartridges, but we have a virtual layer that can. The same holds true for browsers, OSs, etc. It does create a pretty long chain of dependencies, though.

spicybright · on Oct 22, 2021

Ignoring the network point I made above, it'll be a monumental effort to get there.

Best we can hope for is virtual machines, and archive files targeting a specific browser version on specific virtual machines.

kbenson · on Oct 22, 2021

If it's dynamically updating based on a database of information that's not shipped to the app in it's entirety, you either have to hope you've somehow seen and preserved all the date from exploring the app, or accept that some data may be lost.

> presumably JSON or CSV

That's presuming a lot. Even if it's accurate for most/all NYTimes infographics today, it doesn't mean it's accurate tomorrow, and it isn't accurate today for a lot of other sites.

tshaddox · on Oct 22, 2021

> If it's dynamically updating based on a database of information that's not shipped to the app in it's entirety, you either have to hope you've somehow seen and preserved all the date from exploring the app, or accept that some data may be lost.

Well, yeah, that's true of all normal websites too. That's precisely what web crawlers are for. If there's no index page that links to all pages, or some way of iterating through all the pages, you wouldn't be able to exhaustively archive any web site.

kbenson · on Oct 22, 2021

> Well, yeah, that's true of all normal websites too.

Not exactly. While you may miss data that isn't requested specifically, you can crawl the site and get most/all that is accessible through links at least. Stuff only available through search results won't show, but if it's discoverable through browsing, you can get it.

The same can't necessarily be said for custom interfaces that are JS heavy, possibly with non-link click actions, custom sliders, a graphical representation of a map that expects a click on a region, etc. An old style page that lists all the regions (like states, or counties in a state), or even that has a dropdown in a form? Those are much easier to crawl and archive.

brundolf · on Oct 22, 2021

Sure, that's fair if they don't have a single call that fetches the whole dataset. Though I'd think an article would often be covering a specific, bounded dataset to make its point, and wouldn't need to query a table of indeterminate length

kbenson · on Oct 22, 2021

We'd hope. Sometimes weird choices are made, or even not-so-weird choices (like if some site in some other country lifts the whole thing and presents it as their own) that cause sites to choose to be a bit harder to scrape than you would assume.

kmeisthax · on Oct 22, 2021

My guess is that it only requests data that it can statically parse (e.g. HTML attributes and tags) and archives that. Anything more complex would require using an actual browser (either via Webdriver, a custom build, or a pile of hacks that implement something identical to one); and would have problems with adversarial content and so on.

I say this because I know that Wayback Machine didn't archive multi-load Flash files. That would require parsing SWFs and executing their embedded Action/ABC tags, which requires writing something equivalent to Flash Player. SPAs aren't much different in terms of archivability as all-Flash websites were.

superkuh · on Oct 22, 2021

It's going to be like java is now. You have to find the exact right date for the right runtime environment to get your JS to execute properly. And that is quite a task. Complex toolkit JS is generally not forwards compatible for more than a couple years.

brundolf · on Oct 22, 2021

That's... not remotely true. A built JS bundle consists of least-common-denominator JS that in theory should continue to run ad infinitum. "Don't break the web" is a mantra among browser devs.

Rebuilding the bundle from scratch might be more complicated. But you don't need to do that to preserve it.

sfink · on Oct 22, 2021

Uh... name one toolkit or platform or library that became incompatible with the JS engines after a couple of years? JS inherits the Web property of extreme backwards-compatibility. Breakages do happen, but they're extraordinarily rare.

Unless you mean something different by compatibility? Sure, you won't be able to mix wildly different versions of libraries because their APIs change. But I wouldn't call that a "runtime environment".