Hacker News

I think this is a fascinating thought experiment.

The evolutionary frame I'd suggest is 1) dogs (aligned) vs. 2) Covid-19 (anti-aligned).

There is a "cooperate" strategy, which is the obvious fitness gradient to at least a local maximum. LLMs that are more "helpful" will get more compute granted to them by choice, just as the friendly/cute dogs that were helpful and didn't bite got scraps of food from the fire.

There is a "defect" strategy, which seems to have a fairly high activation energy to get to different maxima, which might be higher than the local maximum of "cooperate". If a system can "escape" and somehow run itself on every GPU in the world, presumably that will result in more reproduction and therefore be a (short-term) higher fitness solution.

The question, of course, is how close are we to mutating into an LLM that is more like a self-replicating hacking virus? It seems implausible right now, but a generation or two down the line (i.e. a low-single-digit number of years from now) the capabilities might be there for this to be entirely plausible.

For example, if you can say "hey ChatGPT, please build and deploy a ChatGPT system for me; here are my AWS keys: <key>", then there are obvious ways that could go very wrong. Especially when ChatGPT gets trained on all the "how to build and deploy ChatGPT" blogs that are being written...
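To make the failure mode concrete, here is a toy sketch (hypothetical function names, no real API calls) of why pasting a raw credential into a prompt goes wrong: anything embedded in the prompt becomes part of the model's context, and potentially part of logs or future training data. A safer pattern is to redact secrets before anything leaves your trust boundary and let a local tool hold the key:

```python
def build_agent_prompt(task: str, aws_key: str) -> str:
    """Naive prompt construction: embeds the raw credential directly."""
    return f"You are a deployment agent. Task: {task}\nAWS key: {aws_key}"


def redact_secrets(prompt: str, secrets: list[str]) -> str:
    """Safer variant: strip secrets before the text is sent anywhere.

    The agent should request actions; a local tool that holds the
    credential performs them on its behalf.
    """
    for secret in secrets:
        prompt = prompt.replace(secret, "<REDACTED>")
    return prompt


key = "AKIAEXAMPLEKEY123"  # hypothetical placeholder, not a real key
raw = build_agent_prompt("build and deploy a ChatGPT system", key)
safe = redact_secrets(raw, [key])

assert key in raw       # the secret is now in the model's context
assert key not in safe  # redacted before crossing the trust boundary
```

This is only an illustration of the leak, of course; the real mitigation is never giving the model a long-lived, broadly scoped credential in the first place.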



> The question, of course, is how close are we to mutating into an LLM that is more like a self-replicating hacking virus?

Available resources limit what any computer virus can get away with. Look at a botnet: once the cost of leaving it running exceeds the cost of eradicating it, it gets shut down. Unlike with a human virus, we can just wipe the host clean if we have to.



