pypdfium2 is a great choice and a solid piece of software!
You might want to look into https://github.com/VikParuchuri/surya as an alternative to tesseract. Yes, it's associated with a commercial company, but as long as you aren't a company with $5M in ARR or $5M in funding, it's free to use.
This still seems to be GPL. Another OCR worth considering is EasyOCR [0] (Apache license). AFAIK there is no layout detection, but they do provide bounding boxes and support many languages, and it also detects text on many kinds of real-world objects in images (signposts, etc.).
This is an excellent use case for LLM fine-tuning, purely because of the ease of generating a massive dataset of input / output pairs from public C code
I would also think that generating a very large amount of C code using coding LLMs (using deepseek, for example, + verifying that the output compiles) as synthetic training data would be quite beneficial in this situation. Generally the quality of synthetic training data is one of the main concerns, but in this case, the ability for the code to compile is the crux.
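For what it's worth, the "verify that the output compiles" filter can be sketched in a few lines of Python. This is only a rough sketch under assumptions: it assumes a `cc`, `gcc`, or `clang` binary on PATH that accepts `-fsyntax-only`, and the helper names (`find_compiler`, `compiles`, `filter_compilable`) are made up for illustration.

```python
import os
import shutil
import subprocess
import tempfile


def find_compiler():
    """Return the path of the first C compiler found on PATH, or None."""
    for name in ("cc", "gcc", "clang"):
        path = shutil.which(name)
        if path:
            return path
    return None


def compiles(source, compiler):
    """Return True if `source` passes the compiler's syntax and type checks."""
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "sample.c")
        with open(src_path, "w") as f:
            f.write(source)
        try:
            # -fsyntax-only (assumed gcc/clang-style flag) checks the code
            # without producing an object file.
            result = subprocess.run(
                [compiler, "-fsyntax-only", src_path],
                capture_output=True,
                timeout=30,
            )
        except (OSError, subprocess.TimeoutExpired):
            return False
        return result.returncode == 0


def filter_compilable(samples, compiler):
    """Keep only the generated samples that compile."""
    return [s for s in samples if compiles(s, compiler)]
```

Passing this check only says the sample is syntactically and type-wise valid C, not that it does anything sensible, so it is a floor on quality rather than a measure of it.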
I would think that the primary benefit of this over existing decompiler tools would be the ability to use sensible names for identifiers, break up a project to be a sensible set of modules, and maybe even add realistic / helpful comments. If you're synthesizing code to do that, you'll probably gain on the front of generating code that compiles, at the cost of these advantages.
We specifically chose AGPL-3 because we wanted it to be permissive, but we didn't want others to fork our project, take it closed source, and charge for it without adding back anything of value.
We also don't expect companies to customize the functionality, just to self-host it or use the cloud version, or use it for personal projects.
Coding aid for unittests. Debugging aid for languages / frameworks I'm not particularly familiar with. Work that requires reformatting. Translating from rough drafts to more polished / professional language. Learning more about domains I don't have much expertise in where I need specific conceptual questions answered.
Whether or not to split is more a measure of whether these two concepts are likely to diverge down the road than of whether they share similarity today.
Yes, the way I phrase this when defending it in review is "these things have different reasons why they would change, despite being the same right now." Thanks Sandi Metz.
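A toy illustration of "same right now, different reasons to change" (hypothetical names and limits, not taken from any real codebase):

```python
def is_valid_username(name: str) -> bool:
    # Login handles: the limits here would change if the (hypothetical)
    # account system changed its rules.
    return 3 <= len(name) <= 20


def is_valid_project_name(name: str) -> bool:
    # Project titles: identical rule today, but it would change for UI or
    # product reasons that have nothing to do with accounts.
    return 3 <= len(name) <= 20
```

Merging these into one `is_valid_name` helper saves a line today, but the first time one rule changes you either add a flag to the helper or split it back apart.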
Some people are so addicted to DRY that they want to write helper functions for every 3 lines of code that appear together. Nobody else can figure out what the fuck their code does, but only 3 of them will tell them to their faces.
Sandi Metz is a great resource for working through some of the dogmas out there. She's fully bought into OO and design patterns so it's not a situation where an outsider is assailing your whole worldview, and she puts into words what my intuition is about these things super well.
Interestingly it sounds like offloading could be made quite efficient in a batch setting if you primarily care about throughput rather than latency. Though I guess for most current LLM applications latency is quite important.
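A back-of-the-envelope way to see the throughput/latency trade-off, as a rough sketch only: it assumes each step pays a fixed cost to stream the offloaded weights in, plus a per-sequence compute cost, with no overlap between the two. The numbers are invented for illustration, not measurements.

```python
def step_time_s(batch_size, weight_transfer_s=2.0, compute_per_seq_s=0.01):
    """Time for one step: fixed weight transfer plus per-sequence compute.

    Both default costs are made-up placeholders.
    """
    return weight_transfer_s + batch_size * compute_per_seq_s


def throughput_seq_per_s(batch_size, **costs):
    """Sequences processed per second at a given batch size."""
    return batch_size / step_time_s(batch_size, **costs)


for batch in (1, 8, 64):
    print(batch, round(step_time_s(batch), 2), round(throughput_seq_per_s(batch), 2))
```

Under these assumptions the transfer cost is amortized across the batch, so throughput climbs steeply with batch size while the time each request waits for a step only gets longer.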