>Archive.today launches real browsers (not even headless) and tries to load lazy images, unroll folded content, login into accounts if prompted with login form, remove “subscribe our maillist” modals
There are some tricks that work for different websites - for example, for NYT it's enough to manually clear the nytimes.com cookies, and FT used to work if you clicked through from Twitter/X, and so on. So I guess they have some set of per-site heuristics.
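To illustrate why clearing cookies can reset a metered paywall, here is a minimal sketch of a site-side counter. It assumes a hypothetical site that keeps the article count purely in a cookie; the cookie name `seen`, the `FREE_ARTICLES` limit and the `handle_visit` function are made up for illustration and are not how any particular publisher is known to work.

```python
from http.cookies import SimpleCookie

FREE_ARTICLES = 3  # hypothetical free allowance, not taken from any real site


def handle_visit(cookie_header: str) -> tuple[bool, str]:
    """Return (show_full_article, new_cookie_value) for one page view.

    The only state is the visitor's own cookie, so a visitor who sends
    no cookie looks like a brand-new reader.
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    seen = int(cookie["seen"].value) if "seen" in cookie else 0
    seen += 1
    return seen <= FREE_ARTICLES, f"seen={seen}"


if __name__ == "__main__":
    header = ""
    for visit in range(1, 5):
        allowed, header = handle_visit(header)
        print(visit, allowed)
    # Sending an empty Cookie header again starts the count over.
    print("after clearing cookies:", handle_visit("")[0])
```

A site that tracks the count server-side (by account or IP address, say) would not be affected by clearing cookies, which may be why the trick only works on some sites.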
It seems that archive.is often has the full article for sites that are completely paywalled to every non-paying visitor: no cookie-driven freebies, nothing.
Publicly revealing everything they are doing would be a strategically bad idea, obviously.
It's not inconceivable that they actually pay for access to some of the sites; it wouldn't be surprising.
They are not actually breaking through paywalls - therefore I think they are on ethically good ground. Those sites show their full text to web crawlers, just not to humans. Basically, archive.is and similar services imitate a crawler through various means: headless browsers, user-agent spoofing, etc.
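For what it's worth, the "full text for crawlers, teaser for humans" behaviour described above could look roughly like this on the publisher's side. This is only a sketch under the assumption that the site decides based on the User-Agent header alone; `CRAWLER_TOKENS` and `render_article` are hypothetical names, and a real site may well verify crawlers by IP address or reverse DNS instead.

```python
# Illustrative substrings only; not an authoritative list of crawler names.
CRAWLER_TOKENS = ("Googlebot", "bingbot")


def render_article(user_agent: str, full_text: str, teaser_chars: int = 200) -> str:
    """Return the full text for recognised crawlers and a teaser otherwise."""
    ua = user_agent.lower()
    is_crawler = any(token.lower() in ua for token in CRAWLER_TOKENS)
    if is_crawler:
        return full_text
    # Everyone else gets the first few characters plus a subscription prompt.
    return full_text[:teaser_chars] + " [Subscribe to continue reading]"


if __name__ == "__main__":
    article = "Lorem ipsum dolor sit amet. " * 40
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    human = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    print(len(render_article(bot, article)), len(render_article(human, article)))
```

If a site gates content like this, anything that looks like a crawler to it sees the whole article, which is consistent with search engines being able to index the full text.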
I always assumed so, because Google can index their full text. You used to be able to see those full snapshots in Google's cache - this was before the sites strong-armed Google into making the snapshots inaccessible, and then the archive.* sites rose to prominence. You can test this yourself by searching for a unique quote from one of those sites and still getting hits in Google. But you are right - why couldn't this be achieved with a plugin, then? I don't know.