Micro-agent: make an AI write code until it passes a unit test (github.com/builderio)
42 points by BiteCode_dev on July 8, 2024 | 30 comments


Enterprise developer from hell (https://fsharpforfunandprofit.com/posts/property-based-testi...) but as a CLI tool
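For those who haven't read it: the article's point is that an adversarial implementer can game example-based tests by hardcoding answers, and that property-based tests are the fix. The article works in F# with FsCheck; purely to show the flavor, here is a rough hand-rolled equivalent in C (add() is a stand-in implementation under test, not from the article):

  #include <assert.h>
  #include <stdlib.h>

  /* Stand-in implementation under test; the "developer from hell"
     would try to hardcode answers to pass example-based tests. */
  int add(int a, int b) { return a + b; }

  int main(void) {
      srand(42);
      for (int i = 0; i < 1000; i++) {
          int a = rand() % 1000, b = rand() % 1000;
          assert(add(a, b) == add(b, a));          /* commutativity */
          assert(add(a, 0) == a);                  /* identity */
          assert(add(add(a, 1), 1) == add(a, 2));  /* adding 1 twice */
      }
      return 0;
  }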


Neat, thanks for this. I enjoyed this viewpoint; it might help systems like this actually build something reasonable.


Indeed, and it will force you to write good tests :)


I found that the feedback loop between the LLM and the test suite works really well, especially with Sonnet 3.5.

I wrote a similar tool the other day: https://github.com/joseferben/makeitpass

It can make all kinds of commands pass by checking stdout/stderr, and it's language agnostic (you need npx to run makeitpass).


Most LLMs struggle to even do a null check. Could this check for those kinds of glaring security holes?


add a test for it?


Kinda hard to write a test that checks that a value is null-checked, when that value may never actually be returned.

For example, take a C function that reads in a file and returns you a string. You can check the returned string to see that malloc actually succeeded, but how do you check that the file actually opened?


what?


Every time you call fopen, you need to do a null check. Every single time. You also need to make sure a matching fclose is there for every call.

Writing a test for that, when it is generally just a call within the function you want to test, isn't really possible. It's not there in the arguments, or the return value, of the function.

How do you check for the right checks in a function expected to do something like this:

  int foo() {
      FILE *fp = fopen("test.in", "r");
      if (!fp) {
          /* the null check under discussion */
          return -1;
      }

      for (int i = 0; i < NUM; i++) {
          if (matcher(fp, i)) {
              fclose(fp);  /* must be closed on every exit path */
              return i;
          }
      }

      fclose(fp);
      return 0;
  }


You write 2 unit tests for your function... one with a test file that exists, and one with a fake file path. Assert a successful result and the -1 error code respectively.
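A minimal sketch of those two tests, assuming foo() is refactored to take the path as a parameter (the version above hard-codes "test.in", which is part of what makes it awkward to test):

  #include <assert.h>
  #include <stdio.h>

  /* Hypothetical refactor of foo() from above: the path is a parameter. */
  int foo(const char *path) {
      FILE *fp = fopen(path, "r");
      if (!fp) return -1;
      /* ... the matcher loop from above ... */
      fclose(fp);
      return 0;
  }

  int main(void) {
      /* Case 1: the file exists, so expect a non-error result. */
      FILE *fixture = fopen("test.in", "w");
      assert(fixture);
      fputs("fixture\n", fixture);
      fclose(fixture);
      assert(foo("test.in") >= 0);

      /* Case 2: the file is missing, so the fopen null check
         must surface as the -1 error code. */
      assert(foo("no/such/file.in") == -1);
      return 0;
  }

Note this still says nothing about the matching fclose calls, which was the parent's point; that usually takes a leak checker like Valgrind or static analysis rather than a unit test.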


At least in my experience (possibly due to context limitations, or just the architecture), SOTA LLMs aren't particularly good at iterating: they tend to loop back around to similar results with the same bad logic / errors.


This sounds amazing! Are there any metrics on how often different models pass tests? Has anyone used a similar process to fine-tune an LLM?


I'd much sooner accept and commit implementation code written by an AI against unit tests written by a human than the reverse.


A very sad future awaits us if a developer's only job is to write tests


I knew declarative languages would eventually win!


It's worse: it starts by writing the tests, so you have to verify that the tests work :)


That's what I don't get about this. Instead of writing code that may or may not be correct, it's writing tests that may or may not be correct.


I've tried these LLM "code from test" things (and vice versa) dozens of times over the last couple of years... they're not even close to being practical.


Why? It will evolve into a slightly higher level language where the compiler is an ML model. Was it a tragedy when developers mostly didn’t have to write assembly any more?


I think it's different... I like high-level languages, but this is not a programming language; it's a technique for writing tests in an existing language and leaving the implementation to the AI.

I like programming for problem solving; I don't really like writing tests, but that's personal taste. A lot of people like to just use PowerPoint and Jira and tell others what to implement, but those people are not software developers.


> Was it a tragedy when developers mostly didn’t have to write assembly any more?

It wasn't, but for starters compilers have always been generally deterministic.

I'm not saying that this is completely useless (I personally think code completion tools such as GitHub Copilot are fantastic), but it is still too early to compare it to a compiler.


Perhaps a minority opinion, but I LOVE writing tests. I write them before I write the code; it is like playing chess with yourself.


I appreciate that your workflow is so linear. I often write tests, then the implementation, then I realize that the tests need to be corrected, then I change the implementation, then I change the tests, then I add other tests etc... etc...

I don't really like maintaining tests; it's often a lot of code that needs to be understood and changed carefully.


Really it's just validator code instead of feature code. I think this is the only realistic way forward for production-level code written by AI: don't ask it to write code, ask it to pass your validation tests.

Essentially, everyone becomes a red team member trying to think of clever ways to outwit the AI's code, which I, for one, think is going to be a lot of fun in the future - though we're still quite a way from there yet!


Maybe in the future developers will be able to write just specifications.


This already happens: in many companies, mid-level managers write the specifications (ambiguously) and other low-cost people implement them.

And when things don't work, they call people like me to try to understand the performance problems of something poorly defined and even worse written.


That's exactly what we do now.


Arguably we write instructions: instead of writing out the problem and what the solution looks like, we describe a set of steps we go through—and if those steps are incorrect, there's nothing to compare against, because that was what we called the "specification".

Whether there's a difference there is in the eye of the beholder, but specification languages such as TLA+/PlusCal/Squint or Alloy, and theorem-proving languages like Coq (to be renamed Rocq) or Lean, do look a lot different from the likes of C, JavaScript or even Haskell.
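To make the contrast concrete, a rough illustration in C rather than any of those languages: the first function only states what a correct result looks like (a partial spec; a full one would also require the output to be a permutation of the input), while the second prescribes the steps to produce it. A specification language lets you stop after the first part.

  #include <stddef.h>

  /* Specification: *what* a correct result looks like. */
  int is_sorted(const int *a, size_t n) {
      for (size_t i = 1; i < n; i++)
          if (a[i - 1] > a[i]) return 0;
      return 1;
  }

  /* Instructions: *which steps* to go through (insertion sort). */
  void sort(int *a, size_t n) {
      for (size_t i = 1; i < n; i++)
          for (size_t j = i; j > 0 && a[j - 1] > a[j]; j--) {
              int t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
          }
  }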


Does this work with any LLM?


[flagged]


Can you elaborate on why you think it won't? That might be a more valuable use of everyone's time than a binary "X will / won't happen" stance.



