People have already said that LangChain is useful to get ideas about what can be...

kordlessagain · on July 15, 2023

I've looked at LangChain multiple times and there is some cool stuff in there to enable a quick prototype. That said, you're unlikely to need ALL the cool stuff in any one use case, and figuring out what to do when you have a specific requirement might not be worth the learning curve.

To illustrate the complexity of this, here's a list of things that you might have to do if implementing a document bot:

  1. Handle uploading or storing documents somewhere and keeping track of the location.
  2. Handling different document types. Sticking to PDFs for this list.
  3. Augmenting the documents with tags, keyterms, titles, etc., either manually or using the PDF itself.
  4. At this point you need somewhere to store the metadata. Maybe a DB or using the vector store.
  5. Just dealing with PDFs requires some type of PDF library. Other documents may or may not require an additional library.
  6. Extracting text from the PDF with something like pypdf (pdf2image only renders pages to images, which is what you need for OCR later). Not all PDFs have extracted (selectable) text in them. Also, PDFs have images, which sometimes have text.
  7. Doing some sort of OCR is very likely. Think about OCR'ing the whole thing to deal with no extracted text/images with text.
  8. Assuming that, converting pages to images. Also, consider images have data in them, so extracting an image from the image and running some type of detection on it...
  9. Using some OCR model to extract text, or figuring out how to extract it from the PDF data.
  10. Cleaning up that text, then parsing it cleanly. nltk comes into play here.
  11. Fragmentation/windowing of text. How long to create the fragments? Or should it be variable?
  12. Using the text to get more text via a prompt to a model. Here we can get additional keyterms, or perhaps a summary or question about the text fragment. (we'll use something we write for prompting the LLM here in a second)
  13. Storing the fragment. Most people use a vector database for this now, so we can use Weaviate or Pinecone, or ???. Also, consider a moderate amount of fragments and their vectors can be stored in a pickle format with manual dot products for ranking.
  14. Figuring out where you are going to get a user prompt. Assuming the easiest thing, collect input from the user in a command prompt.
  15. What to do with the user's prompt once you get it. Do you ask an LLM for more info on the prompt? Or do you just jump to...
  16. Embed the user's prompt to get a vector back. (Weaviate does this transparently, but you can easily do it yourself using the ada-002 endpoint from OpenAI)
  17. Taking that vector (from the embedding/inference call to the embed model) and comparing it to the other vectors/text you've stored.
  18. Think about what text is important for a new prompt to the LLM. Should it contain directives? How much reference text from the documents does it need? Is a cosine distance or some approximate nearest neighbor match going to be enough?
  19. Think about augmenting the vector search with keyterms that were extracted earlier (by both the PDF itself + any LLM inference step you implement).
  20. Take the text you pull back from wherever you stored the vector/text and then build a long string to stuff into a prompt.
  21. Consider some type of template structure for the prompts, so you can tweak them without losing your mind. String templates for files in Python are great for $this.
  22. Calling the various LLM endpoints. There are multiple models, in a variety of API endpoints, with tokens usually for auth.
  23. Consider you may just want text back from the LLM, or maybe you want it to complete or write a dict or array (in which case you may want to make this configurable). You may want to eval things that the LLM writes too.
  24. Consider the LLM (ChatGPT for example) may do a completion that contains a block delimited by ```python or similar.
  25. Consider those two things may require different completion endpoints, and some endpoints may be deprecated by the provider later.
  26. Think about whether you need function calling. GPT-X supports this, so you need a function and a way to pass that function's parameters to the LLM.
  27. Build the prompt and submit it. Don't forget to protect your tokens, using env or config.py files.
  28. Take the response and do something with it that makes sense. Maybe give it to the user, or use it to build another prompt.
  29. Loop back to interact with the user. If the interaction is complicated, like with Discord integration, you may have to do this asynchronously.
  30. Think about storing the interaction for use in building future prompts. Hack this into #15.
  31. Always consider optimizing your prompt length.
  32. Consider how many tokens you are chewing through doing all this.
  33. Consider that user questions like "what is on page 2?" need context. Another good one is "how many pages is this document", or "what is the title of the document?". A hard one would be "how many images are in this PDF?", meaning how many illustrations...
  34. If the document discusses code, and the model outputs code, or SQL, do you run it and if you do, how?
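
To make #11, #13 and #17 a bit more concrete, here is a minimal sketch of windowing text, pickling the fragments with their vectors, and ranking by a manual dot product. Everything here is an assumption for illustration: `window`, `toy_embed`, `cosine` and `rank` are hypothetical names, the window sizes are arbitrary, and `toy_embed` is just a word-count stand-in for a real embedding model such as the ada-002 endpoint.

```python
import math
import pickle
from collections import Counter


def window(words, size=50, overlap=10):
    """Split a list of words into overlapping fixed-size fragments (#11)."""
    if not words:
        return []
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def toy_embed(text):
    """Stand-in for a real embedding call: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (manual dot product)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, store, top_k=3):
    """Return the top_k (score, fragment) pairs for a query (#17)."""
    q = toy_embed(query)
    scored = [(cosine(q, vec), frag) for frag, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Store fragments with their vectors, then round-trip through pickle (#13).
fragments = ["the cat sat on the mat", "dogs bark loudly at night"]
blob = pickle.dumps([(frag, toy_embed(frag)) for frag in fragments])
store = pickle.loads(blob)
best = rank("where is the cat", store, top_k=1)
```

This only holds up for a moderate number of fragments, since every query scans the whole store; past that, an approximate nearest neighbor index or a vector database is probably the better trade.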

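For #20 and #21, a rough sketch of stuffing retrieved fragments into a prompt with Python's `string.Template`. The template wording, the `build_prompt` helper and the `max_chars` cutoff are all made up for illustration; in practice the template text could live in a file so you can tweak it without touching code.

```python
from string import Template

# Hypothetical prompt template; $context and $question are filled in later.
PROMPT = Template(
    "Answer using only the reference text.\n"
    "Reference:\n$context\n\n"
    "Question: $question\n"
)


def build_prompt(fragments, question, max_chars=2000):
    """Join fragments into one string, trim it, and fill the template."""
    context = "\n---\n".join(fragments)[:max_chars]
    return PROMPT.substitute(context=context, question=question)


prompt = build_prompt(["alpha is first", "beta is second"], "What is alpha?")
```

The character cutoff is a crude proxy for prompt length; counting tokens with the model's own tokenizer would be more accurate.
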
Example of most of this in action: https://github.com/FeatureBaseDB/DoctorGPT
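
For #23 and #24 in the list above, a small sketch of pulling a fenced block out of a completion and turning it into a dict or list. Note this swaps plain `eval` for `ast.literal_eval`, which only accepts Python literals and so avoids running arbitrary code the model wrote. `extract_block` and `parse_literal` are hypothetical helper names, and the fence is built with string multiplication only to keep the example readable.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, as used to delimit code in completions
BLOCK_RE = re.compile(FENCE + r"(?:\w+)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_block(completion):
    """Return the first fenced block's body, or the whole text if none."""
    match = BLOCK_RE.search(completion)
    return match.group(1).strip() if match else completion.strip()


def parse_literal(completion):
    """Parse a dict/list/number literal from a completion, else None."""
    try:
        return ast.literal_eval(extract_block(completion))
    except (ValueError, SyntaxError):
        return None


reply = "Sure:\n" + FENCE + "python\n{'pages': 2}\n" + FENCE + "\nDone."
parsed = parse_literal(reply)
```

If the model returns prose instead of a literal, `parse_literal` gives back `None`, which is a reasonable signal to retry or fall back to treating the reply as plain text.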