Hacker News | madisonmay's comments

pypdfium2 is a great choice and a solid piece of software!

You might want to look into https://github.com/VikParuchuri/surya as an alternative to tesseract. Yes, it's associated with a commercial company, but as long as you aren't a company with $5M in ARR or $5M in funding it's free to use.


This still seems to be GPL. Another OCR engine worth considering is EasyOCR [0] (Apache license). AFAIK there is no layout detection, but it does provide bounding boxes and supports many languages, and it can also detect text on real-world objects in images (signposts, etc.).

[0] https://github.com/JaidedAI/EasyOCR


Yup, EasyOCR is good.

My reasons for using Tesseract: EasyOCR is larger, and it has a significant cold start.

It benchmarks better for many OCR tasks though, so I'm thinking of adding it as an alternative backend.
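A minimal sketch of what a pluggable backend could look like, assuming both engines are hidden behind one small interface. The names `OCRBackend`, `TesseractBackend`, `EasyOCRBackend`, `BACKENDS` and `get_backend` are hypothetical and not taken from any project in this thread. The EasyOCR reader is built lazily, so its cold start is only paid when that backend is actually used.

```python
from typing import Callable, Dict, Protocol


class OCRBackend(Protocol):
    """Hypothetical interface: any backend turns an image path into text."""

    def read_text(self, image_path: str) -> str: ...


class TesseractBackend:
    """Smaller install and quick startup; needs pytesseract plus the tesseract binary."""

    def read_text(self, image_path: str) -> str:
        import pytesseract  # imported lazily so this module loads without it

        return pytesseract.image_to_string(image_path)


class EasyOCRBackend:
    """Larger models; the Reader is created on first use to defer the cold start."""

    def __init__(self, languages=("en",)):
        self._languages = list(languages)
        self._reader = None

    def read_text(self, image_path: str) -> str:
        if self._reader is None:
            import easyocr  # imported lazily for the same reason

            self._reader = easyocr.Reader(self._languages)
        # detail=0 returns plain strings instead of (bbox, text, confidence) tuples
        return "\n".join(self._reader.readtext(image_path, detail=0))


# Registry of backend factories, keyed by a name a user could pass in a config.
BACKENDS: Dict[str, Callable[[], OCRBackend]] = {
    "tesseract": TesseractBackend,
    "easyocr": EasyOCRBackend,
}


def get_backend(name: str) -> OCRBackend:
    """Instantiate a backend by name, failing loudly on unknown names."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown OCR backend: {name!r}") from None
```

Usage would be something like `get_backend("tesseract").read_text("page.png")`, where `page.png` is a placeholder path.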


Where did you find benchmarks for OCR tools? There have been so many OCR engines coming out lately, I would love to see benchmarks!


I googled this for a while...


Any experience with Paddle OCR? https://github.com/PaddlePaddle/PaddleOCR

Personally I've used Tesseract before but the results were underwhelming, so I'm curious how Paddle OCR performs in comparison.


I haven't; testing it out is on my to-do list for sure.


interesting!


This is an excellent use case for LLM fine-tuning, purely because of the ease of generating a massive dataset of input / output pairs from public C code


I would also think that generating a very large amount of C code with coding LLMs (deepseek, for example) and verifying that the output compiles would be quite beneficial as synthetic training data in this situation. Generally the quality of synthetic training data is one of the main concerns, but in this case, the ability for the code to compile is the crux.
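A rough sketch of that compile check, assuming a gcc- or clang-compatible compiler is on the PATH. The helper names `find_compiler`, `compiles` and `filter_compilable` are made up for illustration, and the LLM generation step is left out entirely. `-fsyntax-only` asks the compiler to parse and type-check without emitting an object file.

```python
import os
import shutil
import subprocess
import tempfile
from typing import Callable, Iterable, List, Optional


def find_compiler() -> Optional[str]:
    """Return the path of the first C compiler found on PATH, if any."""
    for name in ("gcc", "clang", "cc"):
        path = shutil.which(name)
        if path:
            return path
    return None


def compiles(source: str, compiler: str) -> bool:
    """Return True if `source` passes the compiler's syntax and type checks."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.c")
        with open(path, "w") as f:
            f.write(source)
        try:
            result = subprocess.run(
                [compiler, "-fsyntax-only", path],
                capture_output=True,
                timeout=30,
            )
        except subprocess.TimeoutExpired:
            # Treat a hung compile as a rejected sample.
            return False
    return result.returncode == 0


def filter_compilable(
    samples: Iterable[str], check: Callable[[str], bool]
) -> List[str]:
    """Keep only the samples for which `check` returns True."""
    return [s for s in samples if check(s)]
```

With a compiler available, `filter_compilable(samples, lambda s: compiles(s, find_compiler()))` would drop candidates that fail to compile. Compiling is only a floor on quality, though: it says nothing about whether the code is realistic or well named.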


I would think that the primary benefit of this over existing decompiler tools would be the ability to use sensible names for identifiers, break up a project to be a sensible set of modules, and maybe even add realistic / helpful comments. If you're synthesizing code to do that, you'll probably gain on the front of generating code that compiles, at the cost of these advantages.


It's more like saying "I've upgraded to 128GB of RAM, I'll never use my disk again".


See figure-2


Why the decision to license as GPL?


We specifically chose AGPL-3 because we wanted it to be permissive, but we didn't want others to fork our project, take it closed source, and charge for it without adding back anything of value.

We also don't expect companies to customize the functionality, just to self-host it or use the cloud version, or use it for personal projects.


What is your concern with the GPL? You can still commercialize apps that use it as long as you use the normal interfaces it exposes.


Thanks, I hate it.


Coding aid for unit tests. Debugging aid for languages / frameworks I'm not particularly familiar with. Work that requires reformatting. Translating from rough drafts to more polished / professional language. Learning more about domains I don't have much expertise in where I need specific conceptual questions answered.


Whether or not to split is more a measure of whether or not these two concepts are likely to diverge down the road than whether or not they share similarity today.


Yes, the way I phrase this when defending it in review is "these things have different reasons why they would change, despite being the same right now." Thanks Sandi Metz.
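A toy illustration of that point, with made-up names and rates: the two functions below have identical bodies today, but one would change when tax law changes and the other when pricing changes, so merging them into a shared helper couples two unrelated decisions.

```python
def sales_tax_cents(subtotal_cents: int) -> int:
    """Hypothetical: changes when the tax authority changes the rate."""
    return subtotal_cents * 5 // 100


def service_fee_cents(subtotal_cents: int) -> int:
    """Hypothetical: changes when the business revises its pricing."""
    return subtotal_cents * 5 // 100


def order_total_cents(subtotal_cents: int) -> int:
    """Total for an order under today's rules."""
    return (
        subtotal_cents
        + sales_tax_cents(subtotal_cents)
        + service_fee_cents(subtotal_cents)
    )
```

If the service fee later becomes a flat amount, only `service_fee_cents` needs to change, and nothing about the tax calculation has to be untangled first.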


'Idiomatic' is another good word.

Some people are so addicted to DRY that they want to write helper functions for every 3 lines of code that appear together. Nobody else can figure out what the fuck their code does, but only 3 of them will tell them to their faces.


Sandi Metz is a great resource for working through some of the dogmas out there. She's fully bought into OO and design patterns so it's not a situation where an outsider is assailing your whole worldview, and she puts into words what my intuition is about these things super well.


Imperfect systems are still useful, and any sufficiently complex system is imperfect.


Interestingly it sounds like offloading could be made quite efficient in a batch setting if you primarily care about throughput rather than latency. Though I guess for most current LLM applications latency is quite important.
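A back-of-the-envelope model of that trade-off, under loud assumptions: each decoding step pays a fixed cost to stream offloaded weights in, plus a compute cost that grows linearly with batch size. The function names and any numbers you plug in are hypothetical, not measurements.

```python
def step_time_s(batch_size: int, transfer_s: float, compute_per_seq_s: float) -> float:
    """Time for one decoding step: fixed weight transfer plus per-sequence compute."""
    return transfer_s + batch_size * compute_per_seq_s


def throughput_tok_per_s(
    batch_size: int, transfer_s: float, compute_per_seq_s: float
) -> float:
    """Tokens produced per second across the whole batch (one token per sequence per step)."""
    return batch_size / step_time_s(batch_size, transfer_s, compute_per_seq_s)


if __name__ == "__main__":
    # Hypothetical costs: 2 s to stream weights per step, 10 ms compute per sequence.
    for batch in (1, 8, 64):
        print(
            batch,
            round(step_time_s(batch, 2.0, 0.01), 3),
            round(throughput_tok_per_s(batch, 2.0, 0.01), 3),
        )
```

In this model the transfer cost is amortized across the batch, so throughput rises with batch size, while the time each request waits per token never drops below the transfer time. That is why the approach suits batch jobs better than interactive use.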

