Hacker News

While we're here how does archive.today bypass paywalls?


>What scraper or headless browser are you using? it works so well.

>Before 2019 - PhantomJS, after - ordinary (not headless) Chromium/80 with few small patches.

https://blog.archive.today/post/618635148292964352/what-scra... (2020)

>Archive.today launches real browsers (not even headless) and tries to load lazy images, unroll folded content, login into accounts if prompted with login form, remove “subscribe our maillist” modals

https://blog.archive.today/post/642952252228812800/people-of...


I get that it convincingly simulates a human but so do I (because I am a human) and I don't get through the paywall...


There are some tricks that work for particular websites. For example, for NYT it's enough to manually clear the nytimes.com cookies, and FT used to work after clicking through from Twitter/X. So I guess there is some set of per-site heuristics.
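To make the cookie and referrer tricks concrete, here is a toy model of a metered paywall in Python. It is only a sketch under assumptions: the quota of 3, the "views" cookie name, and the referrer allowlist are all made up for illustration and do not describe how NYT, FT, or any real site works.

```python
from dataclasses import dataclass, field
from typing import Optional

FREE_ARTICLES = 3              # hypothetical quota, not any real site's number
ALLOWED_REFERRERS = {"t.co"}   # hypothetical allowlist of social referrers


@dataclass
class Browser:
    """Stands in for a client that keeps cookies for one site."""
    cookies: dict = field(default_factory=dict)

    def clear_cookies(self) -> None:
        self.cookies.clear()


def serve_article(browser: Browser, referrer_host: Optional[str] = None) -> str:
    """Toy server: the free-article meter lives entirely in the client's cookie."""
    if referrer_host in ALLOWED_REFERRERS:
        return "full article"  # visitors arriving from an allowed referrer skip the meter
    views = int(browser.cookies.get("views", 0))
    if views >= FREE_ARTICLES:
        return "paywall"
    browser.cookies["views"] = str(views + 1)
    return "full article"
```

Because the counter is stored on the client in this model, calling clear_cookies() resets the meter. A site that tracks reads server-side against an account would not behave this way.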


It seems that archive.is often has the full article for sites that are completely paywalled to every non-paying visitor: no cookie-driven freebies, nothing.

Publicly revealing everything they are doing would be a strategically bad idea, obviously.

It's not inconceivable that they actually pay for access to some of the sites; it wouldn't be surprising.


They are not actually bypassing paywalls, so I think they are on ethically solid ground. Those sites show their full text to web crawlers, just not to humans. Basically, archive.is and similar services simulate a crawler through various means: headless browsers, spoofed user agents, etc.
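For illustration, the behavior described here (full text for crawlers, a teaser for everyone else) could look roughly like this on the publisher's side. This is a hypothetical sketch, not any real site's logic: the token list, the teaser length, and the function names are assumptions.

```python
CRAWLER_TOKENS = ("googlebot", "bingbot")  # illustrative tokens only


def is_crawler(user_agent: str) -> bool:
    """Naive check: trust whatever the User-Agent header claims."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def render(user_agent: str, article: str, teaser_chars: int = 40) -> str:
    """Return the full article to self-declared crawlers, a teaser otherwise."""
    if is_crawler(user_agent):
        return article
    return article[:teaser_chars] + "... [subscribe to continue]"
```

A check this naive trusts a header that any client can set. Sites can instead verify that a request really comes from a search engine (for example by reverse DNS lookup of the client IP), in which case the header alone would not be enough.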


I don't think that's true. If it were that simple, there would be browser plugins or other apps that replicate that behavior. Do you know of any?


I always assumed so, because Google can index their full text. You used to be able to see those full snapshots in Google's cache. That was before the sites strong-armed Google into making the snapshots inaccessible, and then the archive.* folks rose to power. You can test this yourself by searching for a unique quote from one of those sites and still getting hits in Google. But you are right: why couldn't this be achieved with a plugin, then? I don't know.


Perhaps not in the same way as described above, but BPC exists.

https://en.wikipedia.org/wiki/Bypass_Paywalls_Clean


I thought it just sees a full version for crawlers?


Nope, see r721's comment above yours for how it purportedly works.



