It's worse than that. The documentation is a confusing mess that completely omits explanations of key default parameters and details. And the abstractions are horrendously brittle. And difficult to fix, because there are too many layers.
The best use of LangChain is probably just looking at the included prompts in the source code for inspiration.
I disagree. Production systems don't need to be full of AbstractSingletonProxyFactoryBeans which is basically what LangChain is. For example, Linux certainly isn't.
I like the idea, but I think a library that focuses on producing requests and parsing responses according to a schema is better. Sending requests to the server is orthogonal to that purpose.
What we've found useful in practice in dealing with similar problems:
- Use json5 instead of json when parsing. It allows trailing commas (quick sketch after this list).
- Don't let it respond with a bare true/false. Instead, ask it for a short sentence explaining whether it is true or false. Afterwards, use a small embedding model such as SBERT to extract true/false from the sentence. We've found that GPT is able to reason better in this case, and it is much more robust.
- For numerical scores, do a similar thing: ask GPT for a description, then with the small embedding model write a few examples matching your score scale, and for each response take the score of the best-matched example (see the embedding sketch after this list). If you let GPT give you scores directly without an explanation, 20% of the time it will give you nonsense.
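To make the json5 point concrete, here's a minimal sketch (assuming the `json5` package from PyPI; the raw string is just an illustration):

import json5

# GPT frequently emits trailing commas, which the standard json module rejects.
raw = '{"verdict": "true", "confidence": 0.9,}'
data = json5.loads(raw)  # parses fine; json.loads() would raise an error here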
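And a rough sketch of the embedding-based extraction for both the boolean and the score case. The model name, anchor sentences, and score examples below are illustrative placeholders, not the ones we actually use:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def closest(text, examples):
    # examples maps a label (True/False, or a numeric score) to an example sentence.
    labels = list(examples.keys())
    emb = encoder.encode([text] + [examples[l] for l in labels], convert_to_tensor=True)
    sims = util.cos_sim(emb[0], emb[1:])[0]
    return labels[int(sims.argmax())]

gpt_sentence = "..."     # the short explanation GPT returned
gpt_description = "..."  # the description GPT returned for scoring

# Boolean: map GPT's short explanation onto true/false anchors.
is_true = closest(gpt_sentence, {
    True: "Yes, the statement is correct.",
    False: "No, the statement is incorrect.",
})

# Score: write a few example descriptions per point on your scale, take the best match.
score = closest(gpt_description, {
    1: "The answer is completely irrelevant.",
    3: "The answer is partially relevant but misses key points.",
    5: "The answer is accurate and covers everything asked.",
})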
> Don't let it respond with a bare true/false. Instead, ask it for a short sentence explaining whether it is true or false. Afterwards, use a small embedding model such as SBERT to extract true/false from the sentence. We've found that GPT is able to reason better in this case, and it is much more robust.
Have you tried just getting it to do both? It reasons far better given some space to think, so I often have it explain things first then give the answer. You're effectively then using GPT for the extraction too.
This hugely improved the class hierarchies it was creating for me, significantly improving class reuse and its choice of classes for fields too.
There's a benefit in having a model that can output only true/false if that's all that's acceptable, but if I was doing this myself I'd want to see how far I could get with just one model (and then the simple dev approach of running it again if it fails to produce a valid answer, or feeding it back with the error message). If it works 99% of the time you can get away with rerunning pretty cheaply.
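For what it's worth, the rerun-on-failure approach can be as small as something like this. `call_model` is a stand-in for whatever client wrapper you use, and the retry budget is arbitrary:

import json

def ask_for_json(prompt, call_model, max_attempts=3):
    """Ask the model for JSON; on a parse failure, feed the error back and retry."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_attempts):
        reply = call_model(messages)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Give the model its own output plus the parser error and ask it to fix it.
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user", "content": f"That was not valid JSON ({err}). Resend only the corrected JSON."},
            ]
    raise ValueError("model never produced valid JSON")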
Thanks for the thoughts! I've deployed a few meta models that act like you're describing for second-stage predictions, but for fuzzy task definitions I've actually seen similar luck with having GPT explicitly explain its rationale and then forcing it to choose a true/false rating. My payloads often end up looking like:
from pydantic import BaseModel, Field

class Payload(BaseModel):
    reasoning: str = Field(description="Why this value might be true or false")
    answer: bool
Since it's autoregressive I imagine the schema helps to define the universe of what it's supposed to do, then the decoder attention when it's filling the `answer` can look back on the reasoning and weigh the sentiment internally. I imagine the accuracy specifics depend a lot on the end deployment here.
Didn't know about json5, so I had to deal with trailing commas in another way. I found that providing an example of an array without trailing commas was enough for GPT to pick up on it.
The tips on booleans and numerics are interesting! Will keep them in mind if I ever need to do that. I've definitely experienced a few quirks like that (e.g. ChatGPT 'helpfully' responding with "Here's your JSON" instead of just giving me JSON).
I’ve also found good results by asking for it to give the answer first, then to explain its answer. Best of both worlds, since I can just ignore everything following and it still seems to do the internal preparatory ‘thinking’.
For us, LangChain actually caused more problems than it solved. We had a system in production which, after working fine for a few weeks, suddenly started experiencing frequent failures (more than 30% of requests). On digging in, it turned out that LangChain sets a default timeout of 60 seconds for every request. And this behaviour isn't documented! Spurious decisions like this are everywhere in LangChain, and they will all eventually come back to bite you. In the end we replaced everything with vanilla request clients. I definitely don't recommend building a system on a library that provides very limited value while hiding a huge number of details and decisions from you.
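For reference, the replacement doesn't have to be anything fancy. Something roughly like this (a sketch against the OpenAI chat completions endpoint; the timeout value and model name are placeholders) keeps the timeout explicit and visible:

import os
import requests

def chat(messages, model="gpt-3.5-turbo", timeout=120):
    # Explicit timeout instead of a default buried inside a wrapper library.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": model, "messages": messages},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]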
This is my experience too. While I'd really love for the open-source models to catch up, currently they struggle even with dead-simple summarization tasks: they hallucinate too much, or omit essential points. ChatGPT doesn't often hallucinate when summarizing, only when answering questions.
This is hugely misleading. If your bot just memorizes Shakespeare and outputs segments from memory, of course nobody can tell the difference. But as soon as you start interacting with the models, the difference couldn't be more pronounced.
>With these two evaluation sets, we conducted a blind pairwise comparison by asking approximately 100 evaluators on Amazon Mechanical Turk platform to compare the quality of model outputs on these held-out sets of prompts. In the ratings interface, we present each rater with an input prompt and the output of two models. They are then asked to judge which output is better (or that they are equally good) using criteria related to response quality and correctness.
No, it's not just memorising Shakespeare; real humans interacted with the models and rated them.
That's not what I meant by interaction. The evaluators would have to ask the models to do tasks that they thought of on their own. Otherwise there are just too many ways the information could have leaked.
OpenAI's model isn't immune from this either, so take any so-called evaluation metrics with a huge grain of salt. This also highlights the difficulties of properly evaluating LLMs: any metrics, once set up, can become a memorization target for LLMs and lose their meaning.
Are you sure? I have yet to see any evidence that anyone at all (including Google) has built a model (or a "platform", as you prefer to call them) that can follow instructions even half as well as ChatGPT, let alone GPT-4. I don't think any amount of work on LangChain and vector databases is enough to fix this: you really need a strong base model that has been trained to align well with human intentions. Of course, if you just want a bot that can answer simple free-form questions, then maybe people can't tell the difference. Give them some real work to do and it becomes glaringly obvious.
Vector databases such as Milvus are only there to help reduce/minimize hallucinations rather than get rid of them completely. Until we have a model architecture that can perform completion from the _prompt only_ rather than from its pre-training data, hallucinations will always be present.
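The retrieval-augmented pattern those databases support is easy enough to sketch without any particular product; `embed` and `complete` below are stand-ins for your encoder and LLM call, and a vector database like Milvus just does the same nearest-neighbour lookup at scale:

import numpy as np

def retrieve(query, docs, embed, k=3):
    """Return the k stored documents closest to the query in embedding space."""
    doc_vecs = np.array([embed(d) for d in docs])
    q = np.array(embed(query))
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-sims)[:k]]

def grounded_answer(query, docs, embed, complete):
    # Stuff retrieved passages into the prompt so the model answers from them,
    # not only from whatever it memorized during pre-training.
    context = "\n\n".join(retrieve(query, docs, embed))
    prompt = ("Answer using only the context below. If the answer is not in the "
              f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {query}")
    return complete(prompt)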
This is a chatbot model you can run on your own computer with a powerful graphics card. It is not as good as GPT-4 but it has not been "locked down" the way GPT-4 is and can be asked to do things GPT-4 would refuse to do.