
I have a few "SVG of an X riding a Y" tests that I don't publish online which I run occasionally to see if a model is suspiciously better at drawing a pelican riding a bicycle than some other creature on some other form of transport.

I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!



> I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

Cue intro: "The gang wastes their time cheating on a dumb benchmark"


A shower thought I just had: there must be some AI training company somewhere that has ingested all of It's Always Sunny in Philadelphia, not just the text but all the video from all episodes somehow...


> I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

I don't think it's necessarily "cheating"; it just happens as they're discovering and ingesting large swathes of content. That's the problem with public content: it's bound to be included sooner or later, directly or indirectly.

Nice to hear you're doing some sort of contingency though, and looking forward to the inevitable blog post announcing the change to a different bird and vehicle :)


The thing is, most of the discussion about it consists of embarrassingly bad SVGs, so training on them would actually hurt their performance.


Regrettably, AI is still better at SVG than I am.


Your benchmark may or may not be dumb, but it is definitely widely followed. So much so that this is what Bing AI has to say on the matter.

> Absolutely — the “pelican riding a bicycle” SVG test is a quirky but clever benchmark created by Simon Willison to evaluate how well different large language models (LLMs) can generate SVG (Scalable Vector Graphics) images from a prompt that’s both unusual and unlikely to be in their training data.


But how would you know it's what you would consider cheating, as opposed to pelicans on bicycles simply existing in the latest training data? Obviously your blog gets fed into the training set for GPT-6, along with everyone else talking about your test, so how would the comparison to a secret X riding a Y tell you whether an AI lab is cheating, as opposed to there merely being more examples in the training data?


Mainly because if they train on the pelican on bicycle SVGs from my blog they are going to get some very weird looking pelicans riding some terrible looking bicycles.


It's not that I'm claiming they're training on SVG pelicans on bicycles from your blog; it's that, thanks to your popularity, there are simply more pictures of pelicans on bicycles floating around on the Internet, and thus in ChatGPT's training data. E.g. https://www.reddit.com/r/ColoredPencils/comments/1l9l4fq/pel...

How would you determine that improvements to SVG pelicans on bicycles (and not your secret X-on-Ys) come from an OpenAI employee cheating on your benchmark, as opposed to a general improvement on pelicans on bicycles thanks to that picture from Reddit and everything else in the training data?



Please do let us know through a blog post if you ever find AI labs cheating on your benchmark.

But now I am worried that since you have shared that you do "SVG of an X riding a Y" tests, maybe these models will try to cheat on the whole SVG-of-X-riding-Y thing instead of hyper-focusing on the pelican.

So now I suppose you might need to come up with an entirely new thing though :)


There are so many X and Y combinations that I find it hard to believe they could realistically train for even a small fraction of them. Someone has to generate the graphics output for the training.

A duck-billed platypus riding a unicycle? A man o' war riding a pyrosome? A chicken riding a Quetzalcoatlus? A tardigrade riding a surfboard?


You're assuming that given the collection of simonw's publicly available blog posts, the creativity of those combinations can't be narrowed down. Simply reverse engineer his brain this way and you'll get your Xs and Ys ;)


I feel like that would overfit on various snakes like pythons.


I must say that I loved the idea of a tardigrade riding a surfboard. You're welcome.

Granted not an SVG, but still awesome.

https://imgur.com/a/KsbyVNP


If we accept ChatGPT telling me that there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per combination, that's ~1,270 years, but if we parallelize it on a supercomputer that can do 100,000 per second it would only take about 4.6 days. Given that ChatGPT was trained on all of the Internet and every book ever written, I'm not sure that seems infeasible.
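
A quick sanity check on that arithmetic, assuming the 200k-noun figure and the 100,000-per-second throughput are right:

    # Back-of-envelope check of the combination math above.
    nouns = 200_000
    combos = nouns ** 2                          # 40,000,000,000 pairs

    serial_years = combos / (60 * 60 * 24 * 365)
    print(f"serial: ~{serial_years:,.0f} years")  # ~1,268 years

    rate = 100_000                               # generations per second
    parallel_days = combos / (rate * 60 * 60 * 24)
    print(f"parallel: ~{parallel_days:.1f} days")  # ~4.6 days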


It still can't satisfactorily draw a pelican on a bicycle because that's either not in the training data or the signal is too weak, so why would it be able to satisfactorily draw every random noun-riding-noun combination just because you threw a for loop at it?

The point is that in order to cheat on @simonw's benchmark across any arbitrary combination, they'd have to come up with an absurd number of human-crafted input-output training pairs with human-produced drawings. You can't just ask ChatGPT to generate every combination, because all it'll produce is garbage that gets a lot worse the further you get from a pelican riding a bicycle.

It might work at first for the pelican and a few other animal/transport combinations, but what does it even mean for a man o' war riding a pyrosome? I asked every model I have access to for an SVG of a "man o' war riding a pyrosome", and not a single one managed to draw anything resembling a pyrosome. Most couldn't even produce something resembling a man o' war, except as a generic ellipsoid-shaped jellyfish with a few tentacles.

Expand that to every weird noun-noun combination and it's just not practical to train even a tiny fraction of them.


https://chatgpt.com/share/68def5c5-8ca4-8009-bbca-feabbe0651...

Man o' war on a pyrosome. I don't know what you expected it to look like; maybe it could be more whitish and translucent instead of orange, but it looks fairly reasonable to me. Took a bit over a minute with the ChatGPT app.

Simonw's test is for the text-only output of an LLM writing an SVG, not "can a multimodal AI in 2025 generate a PNG". But because people wanted to see a pelican on a bicycle after reading his blog, there are now raster images from image-generation models that look fairly convincingly like the thing described, and those end up in the training data. Now that PNGs of pelicans on bicycles exist, we would expect GPT-6 to be better at generating SVGs of something it has already "seen".
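
To make the distinction concrete, here's a rough sketch of what "text-only SVG output" means; the crude shapes are my own hand-written illustration, not any model's actual output:

    # Hypothetical, hand-written example of the kind of markup the
    # benchmark expects a model to emit as plain text: a very crude
    # "pelican riding a bicycle". Real model outputs are longer.
    svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
      <circle cx="55" cy="90" r="24" fill="none" stroke="black"/>
      <circle cx="150" cy="90" r="24" fill="none" stroke="black"/>
      <path d="M55 90 L103 62 L150 90 M103 62 L95 90"
            stroke="black" fill="none"/>
      <ellipse cx="103" cy="42" rx="18" ry="12"
               fill="white" stroke="black"/>
      <path d="M118 38 L142 41 L118 48 Z" fill="orange"/>
    </svg>"""

    with open("pelican.svg", "w") as f:  # open the file in a browser to view
        f.write(svg)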

We don't know what simonw's secret X and Y combos are, nor do I want to know, because that would ruin the benchmark (if it isn't ruined already by virtue of him having asked the models about it). 200k nouns is definitely high, though; a bit of thought could cut the list down to exclude abstract concepts and a lot of other things. How much spare GPU capacity OpenAI has, I have no idea. But if I were there, I'd want the GPUs running as hot as the cloud provider would let me, because they're paying per hour, not per watt, and I'd have a low-priority queue of jobs for employees to generate whatever extra training data they can think of in their off hours.
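
Purely as an illustration of that idea, a minimal sketch (the noun lists and submit_low_priority are invented placeholders, not anything any lab actually runs):

    # Hypothetical sketch of the "spare GPU cycles" idea: enumerate
    # (creature, vehicle) prompts and feed them to a low-priority
    # training-data queue. submit_low_priority is a made-up stand-in.
    from itertools import product

    creatures = ["pelican", "platypus", "tardigrade"]  # trimmed noun list
    vehicles = ["bicycle", "unicycle", "surfboard"]

    def submit_low_priority(prompt: str) -> None:
        print("queued:", prompt)  # stand-in for a real job queue

    for creature, vehicle in product(creatures, vehicles):
        submit_low_priority(f"Generate an SVG of a {creature} riding a {vehicle}")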

Oh and here's the pelican PNG so the other platforms can crawl this comment and slurp it up.

https://chatgpt.com/share/68def958-3008-8009-91fa-99127fc053...


I doubt they'd cheat that obviously... But "SVG of X" has become common enough that I suspect most frontier labs train on it, especially since the models are multimodal now anyway.

Not that I mind; I want models to be good at generating SVG! Makes icons much simpler.



