While we're here, how does archive.today bypass paywalls?

r721 · 2025-11-06T16:52:27 1762447947

>What scraper or headless browser are you using? it works so well.

>Before 2019 - PhantomJS, after - ordinary (not headless) Chromium/80 with few small patches.

https://blog.archive.today/post/618635148292964352/what-scra... (2020)

>Archive.today launches real browsers (not even headless) and tries to load lazy images, unroll folded content, login into accounts if prompted with login form, remove “subscribe our maillist” modals

https://blog.archive.today/post/642952252228812800/people-of...

scandox · 2025-11-06T17:40:26 1762450826

I get that it convincingly simulates a human but so do I (because I am a human) and I don't get through the paywall...

r721 · 2025-11-06T17:45:51 1762451151

There are some tricks which work for different websites - for example, for NYT it's enough to manually clear nytimes.com cookies, FT used to work after a click-through from Twitter/X, and so on. So I guess there is some set of per-site heuristics.

kazinator · 2025-11-06T19:19:42 1762456782

It seems that archive.is often has the full article for sites that are completely paywalled to every non-paying visitor: no cookie-driven freebies, nothing.

Publicly revealing everything they are doing would be a strategically bad idea, obviously.

It's not inconceivable that they actually pay for access to some of the sites; it wouldn't be surprising.

postexitus · 2025-11-06T17:38:33 1762450713

They are not actually bypassing paywalls - therefore I think they are on ethically good ground. Those sites show their full text to web crawlers - only not to humans. Basically, archive.is and similar services simulate a crawler through various means: headless browsers, better user-agent injection, etc.

mr_mitm · 2025-11-06T18:06:58 1762452418

I don't think that's true. If it was that simple, there would be browser plugins or other apps that would replicate that behavior. Do you know of any?

postexitus · 2025-11-07T10:15:20 1762510520

I always assumed so - because Google can index them full text. It used to be the case that you could see those full snapshots in Google's cache - this was before the sites strong-armed Google into making those snapshots inaccessible, and then the archive.* folk rose to power. You can test this yourself by searching for a unique quote from one of those sites and still getting hits in Google. But you are right - why could this not be achieved with a plugin then - don't know.
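
For anyone who wants to try the quote test, here is a minimal sketch of building such a search URL. The function name, the sample quote and example.com are placeholders of mine, and it assumes Google's usual exact-phrase quoting and `site:` operator. It only constructs the URL; it does not fetch anything.

```python
from urllib.parse import urlencode


def exact_quote_search_url(quote: str, site: str) -> str:
    """Build a Google search URL for an exact phrase restricted to one site.

    Hypothetical helper for illustration only.
    """
    # Double quotes request an exact-phrase match; site: limits results to one domain.
    query = f'"{quote}" site:{site}'
    return "https://www.google.com/search?" + urlencode({"q": query})


# Placeholder quote and domain; substitute a sentence from below the paywall cutoff.
print(exact_quote_search_url("a distinctive sentence from deep in the article", "example.com"))
```

If the sentence comes from the part of the article hidden behind the paywall and the search still returns that page, that suggests the crawler was served the full text. It says nothing about how archive.today itself gets it.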

aspenmayer · 2025-11-06T19:48:51 1762458531

Perhaps not in the same way as described above, but BPC exists.

https://en.wikipedia.org/wiki/Bypass_Paywalls_Clean

cheraderama · 2025-11-06T16:48:55 1762447735

I thought it just sees a full version for crawlers?

runjake · 2025-11-06T17:37:35 1762450655

Nope, see r721's comment above yours for how it purportedly works.