I have seen frameworks in other spaces attempt the configuration-driven approach, and it always falls out of favor versus programmatic approaches. I am reading the docs right now and it feels like so much work to learn the syntax of their YAML specification versus using something like LangFuse. The fact that you can write code as an array element in YAML low-key makes me angry.
Yet it's supposedly used at Shopify, Microsoft, etc. Are there people here who have chosen promptfoo over other alternatives? Why?
Testing AI systems is challenging. We ran into this challenge while collaborating on an AI system that revises sales contracts.
Some of the questions we had to answer were:
- What makes the advice output by an AI assistant "good legal advice"?
- How do we break down the output of our system into discrete steps that we can test?
- How do we map each of those discrete steps into our definition of "good legal advice" to make it measurable?
To do this we had to come up with a process: first breaking the AI legal advisor down into testable components, then transforming open-ended legal and usability questions into measurable quantities, and finally writing unit tests with Vitest and Poyro (https://github.com/poyro/poyro), a Vitest plugin we built, to find where the system did not align with expectations.
The steps we ended up following to come up with the tests are applicable to non-legal AI apps.
The article (https://docs.poyro.dev/essays/unit-testing-a-legal-ai-app) provides runnable code examples for these tests that you can play with. Hope you find this interesting; we found it insightful and fun to walk through a concrete use case end to end.
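As a rough illustration of step two above (turning an open-ended question like "is this good legal advice?" into measurable quantities), here is a framework-free TypeScript sketch. The sample output and the criteria are hypothetical, and the string predicates stand in for what would, in the article's setup, be natural-language criteria judged by an LLM inside Vitest:

```typescript
// Hypothetical output from a contract-revision assistant.
const sampleOutput =
  "Section 4.2 (Limitation of Liability) caps damages at 12 months of fees. " +
  "Recommendation: negotiate a carve-out for breaches of confidentiality.";

// Each open-ended question is broken into discrete, checkable criteria.
// Here they are simple string predicates; in practice each would be an
// LLM judgment against a natural-language rule.
interface Criterion {
  name: string;
  check: (output: string) => boolean;
}

const criteria: Criterion[] = [
  {
    name: "cites the specific contract section",
    check: (o) => /Section \d+(\.\d+)?/.test(o),
  },
  {
    name: "makes an actionable recommendation",
    check: (o) => /Recommendation:/.test(o),
  },
];

// Evaluate all criteria and report which failed, so a test runner can
// surface exactly where the system diverges from expectations.
function evaluateCriteria(output: string, cs: Criterion[]) {
  const failures = cs.filter((c) => !c.check(output)).map((c) => c.name);
  return { passed: failures.length === 0, failures };
}

const result = evaluateCriteria(sampleOutput, criteria);
```

The point of the decomposition is that a failing test names the specific expectation that was violated, rather than reporting a single opaque "bad advice" verdict.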
In chatting with many full-stack engineers, both at startups and larger companies, I've noticed many are just dipping their toes into testing the AI components of their apps. Many don't know the best ways to go about it or what kinds of tests they should write.
This is the start of a series on bridging the world of AI evaluation and full-stack engineering. Future posts will include a framework for breaking tests into different types, code examples of how to write tests for AI apps, etc.
Hey all, my friend and I were building a web app that involves calling an LLM and ran into the question of testing. Although there are many great solutions out there, we couldn't find one that checked all of these boxes:
- Not configuration-driven (e.g., YAML)
- Native to the JS / TS testing ecosystem
- Developer oriented, so not including unnecessary UI components
- Simple, familiar API
- Makes it easy to write use case specific tests
- Free and open source (MIT License)
This is why we created Poyro (https://docs.poyro.dev/). Poyro brings the world of LLM feature evaluation into that of full-stack unit testing. Like any other unit tests, you can run them for free locally on your machine with Poyro. Iteration velocity is critical to any type of software engineering, and LLM features should be no different. Poyro removes the friction of submitting test cases to costly, remote models. It also removes the friction of treating your unit tests for LLM features any differently from your other unit tests; now they can live side by side.
Poyro is a very light and simple extension to Vitest, a modern Jest-like framework for unit testing. It uses llama.cpp under the hood to run a quantized version of Llama-3, so you can define natural language conditions to check your LLM outputs against. We have intentionally kept the API extremely simple (one single method) to make it really easy to use. We'd love to hear your feedback on the best ways to extend the API or otherwise improve the library. If you want to give feedback or have any questions, you can reach us on our Discord (https://discord.gg/gmCjjJ5jSf).
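For anyone wondering what this looks like in practice, here is a sketch of a Poyro test. The function under test is made up, and the matcher name is taken from the project's README at the time of writing, so check the current docs before copying; running it requires Poyro's setup, which downloads a quantized model to run locally via llama.cpp:

```typescript
import { describe, expect, it } from "vitest";

// Hypothetical app code under test; in a real app this would call your LLM.
async function answerSupportQuestion(question: string): Promise<string> {
  return "You can reset your password from Settings > Account.";
}

describe("support bot", () => {
  it("answers password questions helpfully", async () => {
    const output = await answerSupportQuestion("How do I reset my password?");
    // Poyro's single matcher: a natural-language condition, judged by the
    // locally running model rather than a remote API.
    await expect(output).toFulfillCriterion(
      "Explains where in the app to reset a password"
    );
  });
});
```

Because the judge model runs locally, this slots into a normal `vitest` run next to your other unit tests.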
ChatGPT certainly makes it easier to implement a prototype that works quite well for documents meeting certain conditions. For example, it does pretty well with restaurant menus out of the box because the entities extracted tend to have fairly unique text.
However, with documents where you have a lot of repetition or complex tabular structures, even the latest ChatGPT isn't enough. It struggles to capture the structure of the table in its output, and struggles when the same text appears in multiple places in the document.
This is where a hybrid system that merges the zero-shot strengths of ChatGPT with strong priors and conditioning from robust heuristics can yield a much better end product.
Currently the implementation leans to the LLM-heavy side, but our plan is to iterate toward including more of these heuristics to get a more robust tool across different document types.
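To make the hybrid idea concrete, here is a toy TypeScript sketch (the document, the stubbed LLM output, and the line format are all invented for illustration). The LLM proposes entities zero-shot but cannot distinguish two occurrences of the same text; a deterministic heuristic that knows the line structure anchors each occurrence to its own row:

```typescript
// A toy menu where the same dish name appears twice (lunch vs. dinner):
// exactly the repetition case where zero-shot extraction gets confused.
const doc = [
  "Lunch  | Margherita Pizza | 9.50",
  "Dinner | Margherita Pizza | 14.00",
].join("\n");

// Stub standing in for zero-shot LLM extraction: it finds the entity
// text but returns the two identical occurrences undistinguished.
function llmExtract(_doc: string): string[] {
  return ["Margherita Pizza", "Margherita Pizza"];
}

// Heuristic prior: every line is "<section> | <item> | <price>".
// Anchoring on that structure disambiguates repeated item names and
// recovers the tabular relationships the LLM output loses.
function anchorEntities(doc: string, proposedItems: string[]) {
  const rows = doc.split("\n").map((line) => {
    const [section, item, price] = line.split("|").map((s) => s.trim());
    return { section, item, price: Number(price) };
  });
  // Keep only rows whose item the LLM actually proposed.
  return rows.filter((row) => proposedItems.includes(row.item));
}

const extracted = anchorEntities(doc, llmExtract(doc));
```

Real documents need far messier heuristics than a column split, but the division of labor is the same: the LLM supplies recall, the structural prior supplies disambiguation.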
Question for someone with a more theoretical background: the paper shows that the EMX learnability of some class of functions, with respect to some set of probability distributions, is undecidable. Does EMX learnability encompass all notions of learnability (or is it equivalent to other notions)? Conversely, are there, or could there be, notions of learnability different from EMX that are not undecidable? Maybe I missed this in the paper, but clarification would be appreciated.