
madisonmay · 2025-02-15T13:03:00 1739624580

pypdfium2 is a great choice and a solid piece of software!

You might want to look into https://github.com/VikParuchuri/surya as an alternative to Tesseract. Yes, it's associated with a commercial company, but as long as you aren't a company with $5M in ARR or $5M in funding, it's free to use.

pzo · 2025-02-15T13:55:05 1739627705

This still seems to be GPL. Another OCR engine worth considering is EasyOCR [0] (Apache license). AFAIK there is no layout detection, but it does provide bounding boxes, supports many languages, and can detect text on real-world objects in images (signposts, etc.).

[0] https://github.com/JaidedAI/EasyOCR
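For anyone curious what the bounding-box output looks like, here is a minimal sketch. It assumes the `easyocr` package is installed and that `readtext` returns `(box, text, confidence)` triples with four-corner boxes, which is how I read the project's README. The `to_rect` and `ocr_with_boxes` helpers are names I made up for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Rect = Tuple[float, float, float, float]


def to_rect(box: Sequence[Point]) -> Rect:
    """Collapse a four-corner box into (left, top, right, bottom)."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return (min(xs), min(ys), max(xs), max(ys))


def ocr_with_boxes(image_path: str, languages: List[str]) -> List[Tuple[Rect, str, float]]:
    """Run EasyOCR and return (rect, text, confidence) for each detection."""
    # Imported lazily so to_rect stays usable without easyocr installed.
    import easyocr

    reader = easyocr.Reader(languages)  # loads the models; this is the slow part
    results = reader.readtext(image_path)
    return [(to_rect(box), text, float(conf)) for box, text, conf in results]
```

Something like `ocr_with_boxes("signpost.jpg", ["en"])` would then give one entry per detected text region (the file name is hypothetical).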

nhirschfeld · 2025-02-15T14:02:41 1739628161

Yup, EasyOCR is good.

My reasons for using Tesseract: EasyOCR is larger, and it has a significant cold start.

It benchmarks better for many OCR tasks though, so I'm thinking of adding it as an alternative backend.
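A rough sketch of how an alternative backend could sit next to Tesseract while keeping the cold start out of the import path. This is not how the library does it. `OCRBackend`, `LazyBackend`, `FakeEngine`, and the `BACKENDS` registry are all hypothetical names, and the loaders are stand-ins for real engine setup.

```python
from typing import Callable, Dict, Optional, Protocol


class OCRBackend(Protocol):
    def image_to_text(self, image_path: str) -> str: ...


class LazyBackend:
    """Defers expensive engine setup until the first OCR call."""

    def __init__(self, loader: Callable[[], OCRBackend]) -> None:
        self._loader = loader
        self._engine: Optional[OCRBackend] = None
        self.load_count = 0  # exposed only to make the behaviour observable

    def image_to_text(self, image_path: str) -> str:
        if self._engine is None:
            self._engine = self._loader()  # cold start is paid here, once
            self.load_count += 1
        return self._engine.image_to_text(image_path)


class FakeEngine:
    """Stand-in for a real engine; a real loader would build it here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def image_to_text(self, image_path: str) -> str:
        return f"[{self.name}] text from {image_path}"


# Registering a backend costs nothing until it is actually used.
BACKENDS: Dict[str, LazyBackend] = {
    "tesseract": LazyBackend(lambda: FakeEngine("tesseract")),
    "easyocr": LazyBackend(lambda: FakeEngine("easyocr")),
}


def extract_text(image_path: str, backend: str = "tesseract") -> str:
    return BACKENDS[backend].image_to_text(image_path)
```

With this shape, users who never pick the heavier backend never pay for loading it.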

cdrini · 2025-02-15T16:28:49 1739636929

Where did you find benchmarks for OCR tools? There have been so many OCR engines coming out lately, I would love to see benchmarks!

nhirschfeld · 2025-02-15T18:20:58 1739643658

I googled this for a while...

alex_suzuki · 2025-02-15T14:20:27 1739629227

Any experience with Paddle OCR? https://github.com/PaddlePaddle/PaddleOCR

Personally I've used Tesseract before but the results were underwhelming, so I'm curious how Paddle OCR performs in comparison.

nhirschfeld · 2025-02-15T18:22:20 1739643740

I haven't; testing it out is on my to-do list for sure.

nhirschfeld · 2025-02-15T13:10:28 1739625028

interesting!

madisonmay · on March 17, 2024

This is an excellent use case for LLM fine-tuning, purely because of the ease of generating a massive dataset of input / output pairs from public C code.
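To make the "input / output pairs" idea concrete, here is a minimal sketch. It assumes the task is decompilation (assembly in, C out), which is how I read the replies, and that `gcc` is on the PATH. `compile_to_asm` and `build_pairs` are illustrative names, not anything from the project being discussed.

```python
import subprocess
from typing import Callable, Iterable, List, Optional, Tuple


def compile_to_asm(c_source: str, compiler: str = "gcc") -> Optional[str]:
    """Return assembly for c_source, or None if it does not compile."""
    proc = subprocess.run(
        # -S: stop after generating assembly; "-x c -": read C from stdin;
        # "-o -": write the assembly to stdout.
        [compiler, "-S", "-O2", "-x", "c", "-", "-o", "-"],
        input=c_source,
        capture_output=True,
        text=True,
    )
    return proc.stdout if proc.returncode == 0 else None


def build_pairs(
    sources: Iterable[str],
    to_asm: Callable[[str], Optional[str]] = compile_to_asm,
) -> List[Tuple[str, str]]:
    """Turn C snippets into (assembly, source) training pairs."""
    pairs = []
    for src in sources:
        asm = to_asm(src)
        if asm is not None:  # skip anything the compiler rejects
            pairs.append((asm, src))
    return pairs
```

Passing `to_asm` as a parameter keeps the pairing logic testable without a compiler and makes it easy to vary flags or targets.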

bt1a · on March 17, 2024

I would also think that generating a very large amount of C code using coding LLMs (using deepseek, for example, + verifying that the output compiles) as synthetic training data would be quite beneficial in this situation. Generally the quality of synthetic training data is one of the main concerns, but in this case, the ability for the code to compile is the crux.
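A small sketch of the "verify that the output compiles" step, assuming `gcc` is available. `compiles` and `keep_compilable` are made-up names, and the candidate list would come from whatever model generates the code.

```python
import shutil
import subprocess
from typing import Callable, Iterable, List


def compiles(c_source: str, compiler: str = "gcc") -> bool:
    """True if the compiler accepts c_source (syntax and type check only)."""
    if shutil.which(compiler) is None:
        raise RuntimeError(f"{compiler} not found on PATH")
    proc = subprocess.run(
        [compiler, "-fsyntax-only", "-x", "c", "-"],
        input=c_source,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode == 0


def keep_compilable(
    candidates: Iterable[str],
    check: Callable[[str], bool] = compiles,
) -> List[str]:
    """Drop exact duplicates, then keep only candidates that pass the check."""
    seen = set()
    kept = []
    for src in candidates:
        if src in seen:
            continue
        seen.add(src)
        if check(src):
            kept.append(src)
    return kept
```

Note that `-fsyntax-only` does not link or run anything, so this only filters out code the compiler rejects outright.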

Zambyte · on March 18, 2024

I would think that the primary benefit of this over existing decompiler tools would be the ability to use sensible names for identifiers, break up a project into a sensible set of modules, and maybe even add realistic / helpful comments. If you're synthesizing code to do that, you'll probably gain on the front of generating code that compiles, at the cost of these advantages.

madisonmay · on Feb 15, 2024

It's more like saying "I've upgraded to 128GB of RAM, I'll never use my disk again".

madisonmay · on July 18, 2023

See figure-2

madisonmay · on May 22, 2023

Why the decision to license as GPL?

jasonwcfan · on May 23, 2023

We specifically chose AGPL-3 because we wanted it to be permissive, but we didn't want others to fork our project, take it closed source, and charge for it without adding back anything of value.

We also don't expect companies to customize the functionality, just to self-host it or use the cloud version, or use it for personal projects.

ipv4dhcp · on May 23, 2023

What is your concern with GPL? You can still commercialize apps that use it as long as you use the normal interfaces it exposes.

madisonmay · on May 18, 2023

Thanks, I hate it.

madisonmay · on May 7, 2023

Coding aid for unittests. Debugging aid for languages / frameworks I'm not particularly familiar with. Work that requires reformatting. Translating from rough drafts to more polished / professional language. Learning more about domains I don't have much expertise in where I need specific conceptual questions answered.

madisonmay · on April 14, 2023

Whether or not to split is more a measure of whether or not these two concepts are likely to split down the road than whether or not they share similarity today.

giraffe_lady · on April 14, 2023

Yes, the way I phrase this when defending it in review is "these things have different reasons why they would change, despite being the same right now." Thanks Sandi Metz.
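A toy illustration of that phrasing, with hypothetical names and rules: the two checks below are identical today, but each answers to a different policy, so merging them would couple changes that have nothing to do with each other.

```python
def is_valid_username(name: str) -> bool:
    # Would change if the account or login policy changes.
    return 3 <= len(name) <= 32 and name.isalnum()


def is_valid_project_slug(slug: str) -> bool:
    # Would change if URL or storage naming rules change.
    return 3 <= len(slug) <= 32 and slug.isalnum()
```

If usernames later need to allow dots, only the first function changes and nothing about project slugs has to be re-reviewed.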

hinkley · on April 14, 2023

'Idiomatic' is another good word.

Some people are so addicted to DRY that they want to write helper functions for every 3 lines of code that appear together. Nobody else can figure out what the fuck their code does, but only 3 of them will tell them to their faces.

camgunz · on April 14, 2023

Sandi Metz is a great resource for working through some of the dogmas out there. She's fully bought into OO and design patterns so it's not a situation where an outsider is assailing your whole worldview, and she puts into words what my intuition is about these things super well.

madisonmay · on April 4, 2023

Imperfect systems are still useful, and any sufficiently complex system is imperfect.

madisonmay · on Jan 2, 2023

Interestingly, it sounds like offloading could be made quite efficient in a batch setting if you primarily care about throughput rather than latency. Though I guess for most current LLM applications latency is quite important.
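A back-of-the-envelope sketch of why batching helps here. Everything below is an assumption for illustration: the weights are streamed over the link once per step, transfer and compute do not overlap, compute scales linearly with batch size, and the numbers are invented rather than measured.

```python
def step_time_s(weight_bytes: float, link_bytes_per_s: float,
                compute_s_per_seq: float, batch: int) -> float:
    """Time for one step: stream the weights in once, then compute the batch."""
    transfer_s = weight_bytes / link_bytes_per_s   # paid once per step, not per sequence
    return transfer_s + compute_s_per_seq * batch  # assumes no transfer/compute overlap


def throughput_seq_per_s(weight_bytes: float, link_bytes_per_s: float,
                         compute_s_per_seq: float, batch: int) -> float:
    """Sequences advanced per second at a given batch size."""
    return batch / step_time_s(weight_bytes, link_bytes_per_s, compute_s_per_seq, batch)


# Hypothetical figures: 20 GB of offloaded weights, a 20 GB/s link,
# and 10 ms of compute per sequence per step.
WEIGHTS = 20e9
LINK = 20e9
COMPUTE = 0.01

for batch in (1, 8, 64):
    latency = step_time_s(WEIGHTS, LINK, COMPUTE, batch)
    rate = throughput_seq_per_s(WEIGHTS, LINK, COMPUTE, batch)
    print(f"batch={batch:3d}  step latency={latency:.2f}s  throughput={rate:.2f} seq/s")
```

Under these assumptions the transfer cost is amortized across the batch, so throughput climbs with batch size while per-step latency never drops below the transfer time.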