
Note that in an adversarial setting this will only be effective against careless opponents.

If you properly encode your secret it will have the entropy of its surroundings.

For example, you can hide a high-entropy string (presumably something encrypted) in text as the biased output of an LLM. To recover it, you would use the same LLM and measure deviations from its next-token probabilities. This will also fool humans examining it, since the sentences will be coherent.
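
Rough sketch of the recovery side (everything here is illustrative; next_token_probs is just a placeholder for whatever shared model you'd both use, not a real API):

    # Toy recovery: with the same LLM, check at each step which of the model's
    # top candidates the cover text actually used and read the bias back as bits.
    # next_token_probs() is a hypothetical stand-in for the shared model.

    def recover_bits(tokens, next_token_probs):
        bits = []
        for i in range(1, len(tokens)):
            probs = next_token_probs(tokens[:i])          # dict: token -> probability
            ranked = sorted(probs, key=probs.get, reverse=True)
            if tokens[i] == ranked[0]:
                bits.append(0)
            elif tokens[i] == ranked[1]:
                bits.append(1)
            # tokens outside the top 2 carry no payload in this toy scheme
        return bits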



I think the opponent in the proposed use case for this tool is the gun you’re pointing at your foot, and this tool prevents you from pulling the trigger.


What you described sounds like a very cool idea - LLM-driven text steganography, basically - but intentional obfuscation is not the problem this tool is trying to solve. To your point about secrets with entropy similar to the surrounding text, however, I wonder if this can pick up BIP39 seed phrases or if whole-word entropy fades into the background.


The LLM adds no value here. Procedural generation in a loop until some fitness function (perhaps a frequency-analysis metric) is satisfied would do the job.


The LLM is the fitness function.
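
i.e. something like this loop, where the model's perplexity score is the fitness check (generate_candidate and perplexity are placeholders, not a real API):

    # Rejection-sampling sketch: keep generating candidate cover texts until one
    # scores as "natural enough" under the fitness metric, here LLM perplexity.
    # generate_candidate() and perplexity() are hypothetical placeholders.

    def search_cover_text(payload, generate_candidate, perplexity, threshold=30.0):
        while True:
            text = generate_candidate(payload)       # any procedural encoder
            if perplexity(text) < threshold:         # the LLM as fitness function
                return text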


I imagine a social media site full of bots chatting about nonsense. Hidden in the nonsense are humans chatting about different nonsense. This way, server costs get paid for by advertisers, but it's really only bots that see the ads anyway.


If the ads aren't effective, people won't buy them.


People have been buying ineffective ads since the invention of ads


Zero clicks is a little different.


Bots do click in real ad fraud, so your moved goalpost isn't all that solid.


Sorry, conversions are really what I meant. If the bots are also buying the stuff, then it would work.


I'm not really concerned with whether it would work out in a way that was beneficial for the ad network or their customers. Either they figure out a way to make it work such that they continue to be a useful pipe, or they don't and then maybe they'll have to shut down and find something more productive to do with their time. We'll have done the public a service either way.


Not really; advertising is about the only field of human endeavour that is both data-driven and results-oriented.

(That still doesn't stop smart people from committing fraud, but that's a different story.)


Unfortunately, I beg to differ. I worked for several companies where management clearly saw that the results were very poor (for Facebook ads, for example) but continued to invest because there was a defined budget for it, and so on. It was like this last year, and it was like this 20 years ago.


Yes, most fraud comes from inside the corporate structure, not from shady "hacker" types in Romania.


These companies should be outcompeted by firms that don't blow a million dollars a month paying out click fraudsters, but alas, the market is not perfectly competitive.

Is it a cargo cult? It works for Coca-Cola, so maybe if we just spend a little more we'll see returns...


Yes, I feel it might be a cargo cult, at least in part. The argument I usually heard was "but other companies are doing that, too".


It's called Twitter.

It's not nonsense, just cat videos and porn.


Hmm yes, sensical things those.

Are you proposing that they're really only posted as a medium for encoding something else that we're not privy to? If so, somebody took my idea.


The weights of the LLM become the private key (so it had better be a pinned version of a model with open weights), and for most practical applications (i.e. unless you're willing to complicate your setup with fancy applied statistics and error correction) you'd have to use a temperature of 0 as a baseline.

Then, having done all that, such steganography may be detectable using this very tool: encode the difference between the LLM's predictions and the ground-truth tokens, but search for substrings with low entropy instead!
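
Something along these lines, maybe (next_token_probs is a placeholder for the pinned model, and the window/cutoff values are arbitrary):

    # Detection sketch: replay the pinned model over a suspect text and record
    # the rank of each observed token in the model's predicted distribution.
    # Text generated by a "top-1 = 0, top-2 = 1" trick yields a rank stream
    # stuck in {0, 1}; windows with unusually low rank entropy get flagged.
    # next_token_probs() is a hypothetical stand-in for the pinned model.

    import math
    from collections import Counter

    def rank_stream(tokens, next_token_probs):
        ranks = []
        for i in range(1, len(tokens)):
            probs = next_token_probs(tokens[:i])
            ranked = sorted(probs, key=probs.get, reverse=True)
            ranks.append(ranked.index(tokens[i]))
        return ranks

    def entropy(xs):
        counts = Counter(xs)
        n = len(xs)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def suspicious_windows(ranks, window=64, cutoff=1.5):
        return [i for i in range(len(ranks) - window + 1)
                if entropy(ranks[i:i + window]) < cutoff]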


You seem to be making some weird assumptions?

Here's how I would do this:

Use some LLM; the weights need to be known to both parties in the communication.

Producing text with the LLM means repeatedly feeding the LLM with the text-so-far to produce a probability distribution for the next token. You then use a random number generator to pick a token from that distribution.

If you want to turn this into steganography, you first take your cleartext and encrypt it with any old encryption system. The resulting bitstream should be random-looking, if your encryption ain't broken. Now you take the LLM mechanism I described above, but instead of sampling via a random number generator, you use your ciphertext as the source of entropy. (You need to use something like arithmetic coding to convert between your uniformly random-looking bitstream and the heavily weighted choices you make to sample your LLM. See https://en.wikipedia.org/wiki/Arithmetic_coding)
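
A toy version of that sampling step, roughly (exact Fractions stand in for a real finite-precision arithmetic coder, and next_token_probs is a placeholder for the shared model, not a real API):

    # Toy "arithmetic decoding as sampling": read the ciphertext bits as a binary
    # fraction x in [0, 1), then at each step pick the token whose slice of the
    # cumulative distribution contains x and rescale x into that slice.
    # next_token_probs() is a hypothetical stand-in for the shared, pinned LLM.

    from fractions import Fraction

    def bits_to_fraction(bits):
        return sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))

    def embed(cipher_bits, next_token_probs, prompt, n_tokens):
        x = bits_to_fraction(cipher_bits)
        out = list(prompt)
        for _ in range(n_tokens):
            probs = next_token_probs(out)                     # dict: token -> probability
            items = [(t, Fraction(p)) for t, p in sorted(probs.items()) if p > 0]
            total = sum(p for _, p in items)                  # fixed ordering both sides agree on
            lo = Fraction(0)
            for token, p in items:
                hi = lo + p / total                           # normalised cumulative slice
                if x < hi:
                    out.append(token)
                    x = (x - lo) / (hi - lo)                  # zoom into the chosen slice
                    break
                lo = hi
        return out

The receiver replays the same model over the received tokens, rebuilds the same intervals, and recovers x (and hence the ciphertext bits) from the interval the text pins down.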

Almost any temperature will work, as long as it is known to both sender and receiver. (The 'temperature' parameter can be used to change the distribution, but it's still effectively a probability distribution at the end. And that's all that's required.)
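
For reference, temperature is just a reshaping of the distribution before sampling; sender and receiver only need to agree on the value of T:

    # Softmax with temperature: T < 1 sharpens the distribution, T > 1 flattens it.
    import math

    def apply_temperature(logits, T):
        z = [l / T for l in logits]
        m = max(z)                                # subtract max for numerical stability
        exps = [math.exp(v - m) for v in z]
        s = sum(exps)
        return [e / s for e in exps]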


I was imagining the message encoded in clear text, not encrypted form, because given the lengths required to coordinate protocol, keys, weights, and so on, I assumed there would be more efficient ways to disguise a message than a novel form of steganography. As such, I approached it as a toy problem, and considered detection by savvy parties to be a feature, not a bug; I imagined something more like a pirate broadcast than a secure line, and intentionally ignored the presumption about the message being encrypted first.

That being said, yes, some of my assumptions were incorrect, mainly regarding temperature. For practical reasons I was envisioning this being implemented with a third-party LLM (e.g. OpenAI's), but I didn't realize those could have their RNG seeded as well. There is the security/convenience tradeoff to consider, however, and simply setting the temperature to 0 is a lot easier to coordinate between sender and receiver than adding two arbitrary numbers for temperature and seed.

I misspoke, or at least left myself open to misinterpretation, when I referred to the LLM's weights as a "secret key". I didn't mean the weights themselves had to be kept under wraps; rather, either the weights have to be possessed by both parties (with the knowledge of which weights to use being the "secret"), or they'd have to use a frozen version of a third-party LLM, in which case the knowledge of which version to use becomes the secret.

As for how I might take a first stab at implementing this myself: I might encode the message in a low base (let's say binary or ternary) and make the most likely token a 0, the second most likely a 1, and so on. To offset the risk of producing pure nonsense, I would perhaps skip steps where there is too large a gulf between the probabilities of the 1st and 2nd most likely tokens.
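
Something like this, roughly (next_token_probs is again just a placeholder for the shared model):

    # First-stab sketch: the most likely token encodes a 0, the second most
    # likely encodes a 1, and steps where the model is too sure of itself
    # (a big gap between the top two) are skipped so the text stays readable.
    # next_token_probs() is a hypothetical stand-in for the shared LLM.

    def encode(bits, next_token_probs, prompt, max_gap=0.5):
        out = list(prompt)
        i = 0
        while i < len(bits):
            probs = next_token_probs(out)
            ranked = sorted(probs, key=probs.get, reverse=True)
            top1, top2 = ranked[0], ranked[1]
            if probs[top1] - probs[top2] > max_gap:
                out.append(top1)                  # skipped step, carries no bit
            else:
                out.append(top1 if bits[i] == 0 else top2)
                i += 1
        return out

Decoding just replays the same model over the cover text and applies the same skip rule in reverse.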


> I was imagining the message encoded in clear text, not encrypted form, [...]

I was considering that, but I came to the conclusion that it would be an exceedingly poor choice.

Steganography is there to hide that a message has been sent at all. If you make it do double duty as a poor-man's encryption, you are going to have a bad time.

> As such, I approached it as a toy problem, and considered detection by savvy parties to be a feature, not a bug; I imagined something more like a pirate broadcast than a secure line, and intentionally ignored the presumption about the message being encrypted first.

That's an interesting toy problem. In that case, I would still suggest compressing the message, to reduce redundancy.
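
Something as simple as zlib would do for the toy version:

    # Compress before hiding so the payload's redundancy doesn't leak structure
    # into the token choices.
    import zlib

    payload = b"meet at the docks at midnight"
    hidden = zlib.compress(payload)               # this is what gets embedded
    assert zlib.decompress(hidden) == payload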


> If you make it do double duty as a poor-man's encryption, you are going to have a bad time.

For the serious use cases you evidently have in mind, yes, it's folly to have it do double duty, but at the end of the day steganography is an obfuscation technique orthogonal to encryption, so the question of whether to use encryption or not is a nuanced one. Anyhow, I don't think it's fair to characterize this elaborate steganography tech as a poor-man's encryption — LLM tokens are expensive!


> Anyhow, I don't think it's fair to characterize this elaborate steganography tech as a poor-man's encryption — LLM tokens are expensive!

I guess it's a "rich fool's encryption".


Haha, sure, you can call it that if you want, but foolish is cousin to fun, so one application of this tech would be as a comically overwrought way of communicating subtext to an adversary who may not be able to read between the lines otherwise. Imagine using all this highly sophisticated and expensive technology just to write "you're an asshole" to some armchair intelligence analyst who spent their afternoon and monthly token quota decoding your secret message.

Seed for the message above is 42 by the way.

(Just kidding!)


In general (for those unaware) this is called stenography. You can hide an image in the lower bits of another image, for example.
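
A minimal sketch of that image trick, assuming 8-bit pixel values (works elementwise on numpy uint8 arrays or plain ints):

    # Classic LSB technique: clear the cover image's lowest bits and stash the
    # secret image's highest bits there; recovery just masks and shifts back.

    def hide(cover, secret, n_bits=2):
        mask = 0xFF ^ ((1 << n_bits) - 1)         # e.g. 0b11111100 for n_bits=2
        return (cover & mask) | (secret >> (8 - n_bits))

    def reveal(stego, n_bits=2):
        return (stego & ((1 << n_bits) - 1)) << (8 - n_bits)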


Steganography; stenography is completely different.


Thanks



