So can any type of file -- that has no bearing on the supposed design of every file type in existence. Now, later versions of PDF do have explicit support for signatures, but what does that have to do with preventing OCR? OCR reads a file; it doesn't change the original file.
Some OCR solutions do change the original file -- OCRmyPDF, for example. It takes pages that were just images before and adds a text layer so that you can search the document.
That isn't OCR, but an application of OCR's output. Again, a signature on a PDF -- or any type of file -- doesn't prevent you from reading it. (It also doesn't technically prevent you from changing it; it just enables the detection of changes to a particular file.)
There's nothing about PDFs or image formats that prevents anyone from doing OCR. The reason construction documents are difficult to OCR is that OCR models are not well trained on them, and they're very technical documents where small details are significant. It doesn't have anything to do with the file format.
That's not really what I would call reverse engineering. If you read a PDF and type it into Word, is that reverse engineering? Either way, whatever you get is in no way going to convince anybody that it is the original.
PDFs are merely a collection of objects that can be read straight from the file -- some of those are plain text that doesn't even need to be OCR'd; it can simply be extracted. It is also possible to embed image objects in PDFs (this is common for scanned files), which might be what you are thinking of. But that is not a design feature of PDF; it's just the output format of a scanner: an image. Editing a PDF is a matter of editing the file, as you would any other.
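To illustrate the "simply extracted" point, here's a sketch -- the stream fragment and the regex are mine, not from any real file, and real extractors handle much more (TJ arrays, hex strings, escapes, font encodings):

```python
import re

# A fragment of a hypothetical uncompressed PDF content stream.
# In a real file this sits between "stream" and "endstream"; streams
# are often Flate-compressed, but after decompression they look like this.
stream = b"""BT
/F1 12 Tf
72 720 Td
(This text is stored as plain bytes) Tj
ET"""

# Pull out every literal string drawn with the Tj operator -- no OCR
# involved, just reading the bytes.
text = [m.group(1).decode("latin-1")
        for m in re.finditer(rb"\((.*?)\)\s*Tj", stream)]
print(text)  # ['This text is stored as plain bytes']
```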
It is not by design! PDFs made from scanned documents or collections of images would require OCRing, but that is true of any format the scans/images are put into. These days the vast majority of PDFs do not need to be OCRed, as the pages are just made up of text, line drawings and images. And although it can get tricky, you can edit those text, line and image commands as much as you want.
For example: add this to the content stream of a PDF page and it'll put "Hello World" on the page:
BT
/myfont 50 Tf
100 200 Td
(Hello World) Tj
ET
(Note: a bit more is required to define the font resource, etc.)
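For the curious, the "bit more" can be sketched end to end -- this builds a minimal complete one-page PDF around that stream. The object numbering, the /myfont name, and the Helvetica choice are just illustrative, and real-world files add metadata, encodings, and usually compression:

```python
# Wrap the content stream above in the minimum PDF structure:
# a catalog, a page tree, one page, the stream, and a font object,
# followed by a cross-reference table of byte offsets and a trailer.
def minimal_pdf() -> bytes:
    content = b"BT\n/myfont 50 Tf\n100 200 Td\n(Hello World) Tj\nET"
    objs = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /myfont 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = b"%PDF-1.4\n"
    offsets = []
    for i, body in enumerate(objs, start=1):
        offsets.append(len(out))          # record where object i starts
        out += b"%d 0 obj\n%s\nendobj\n" % (i, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objs) + 1)
    for off in offsets:                   # 20-byte xref entries
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objs) + 1, xref_pos))
    return out

pdf = minimal_pdf()
```

Note that the "document" is just bytes you can read and edit directly -- the Hello World string is sitting right there in the file.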
Yes! Zig's cross-compilation story is one of the reasons I chose it
for this project.
Zig can target x86-windows-gnu and lets you set os_version_min to .xp
in the build config. This tells the linker to use only APIs available
on XP SP3.
In practice there were a few things I had to deal with:
- RtlGetSystemTimePrecise doesn't exist on XP, so I wrote a
compatibility shim that redirects to GetSystemTimeAsFileTime
at startup
- The build uses -OReleaseSmall, strips symbols, and is
single-threaded to keep the binary small (~750 KB) and
compatible with XP's threading model
- There's a dedicated build-xp.zig that sets all the right flags,
but you can also just do:
zig build -Dtarget=x86-windows-gnu
- No UCRT or MSVC runtime dependency — this was critical because
XP doesn't ship with the Universal CRT, and you can't install
it reliably on minimal XP systems
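The resolve-at-startup shim mentioned above follows a common pattern: probe once for the better API, fall back if it's absent. Here's that pattern sketched in portable Python (the real shim would presumably do this against kernel32/ntdll with GetProcAddress; the clock functions here are just stand-ins):

```python
import time

# Probe at startup for a higher-resolution clock API; if the platform
# doesn't provide it, bind the always-available fallback instead.
# This mirrors probing for the precise-time API and falling back to
# GetSystemTimeAsFileTime on XP.
if hasattr(time, "clock_gettime"):        # precise path (POSIX)
    def now() -> float:
        return time.clock_gettime(time.CLOCK_REALTIME)
else:                                     # fallback, available everywhere
    now = time.time

timestamp = now()
```

Either way, callers just use `now()` and never need to know which implementation was bound.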
The same codebase cross-compiles to Linux x86/x64/ARM with
"zig build cross" — that's where Zig really shines compared to
writing this in C with manual cross-compilation toolchains.
One thing I should mention: Zig 0.15 (currently in use) dropped
some older target support, but x86-windows with XP compat still
works. I'd recommend testing with the exact version in the repo
if you want to reproduce it.