> The default macOS PDF printer will actually remap the font cmap, making born-digital PDFs where the "text" is something else entirely (say "$" maps to "a").
What? Why!? I've heard of doing that as a form of DRM, but I can't imagine Darwin defaulting to doing that.
I never dug deeper into it, so I don't know why it does that or whether it's limited to a specific version. To reproduce it, take a PDF from which you can extract the text (with pdftotext or pdfbox, for example), open it in the document viewer, and "print" it to PDF. If you extract the text again, it is no longer readable.
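For anyone who wants to check this on their own files, here is a rough sketch of the before/after comparison. It assumes the `pdftotext` command-line tool from Poppler is on your PATH; the file names `original.pdf` and `printed.pdf` and the helper names are made up for illustration, and the word-overlap score is just a crude heuristic I picked, not an established test.

```python
import shutil
import subprocess


def extract_text(pdf_path):
    """Return the text that pdftotext extracts from pdf_path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-" as the output file tells pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def word_overlap(before, after):
    """Fraction of distinct words in `before` that also appear in `after`."""
    before_words = set(before.split())
    if not before_words:
        return 0.0
    return len(before_words & set(after.split())) / len(before_words)


# Hypothetical usage, with both files produced by hand:
#   score = word_overlap(extract_text("original.pdf"),
#                        extract_text("printed.pdf"))
# A score near 1.0 suggests the text survived printing;
# a score near 0.0 suggests the character mapping was lost or remapped.
```

An overlap close to zero on a document that visually looks identical is the symptom described above.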
This wouldn't be an issue if it were a conscious choice, but when I parsed a lot of born-digital PDFs, we ended up with many like that from various sources. Try explaining that...
Could it be “compacting” the fonts? So if U+0000 to U+007F aren’t used at all, remove those glyphs and set U+0000’s glyph to be what was U+0080? Yes, I know NULL doesn’t have a glyph, but I hope that gets the idea across.
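To make the guess concrete, here is a toy model of that kind of compaction. This is only an illustration of the hypothesis, not a description of what macOS actually does; all function names are invented. The point is that once used characters are renumbered from 0, the page still renders fine (each code still points at the right glyph), but an extractor that reads the codes as characters gets nonsense unless a reverse map is kept. In a real PDF that reverse map is the job of the font's ToUnicode CMap.

```python
def compact_font(text):
    """Renumber only the characters actually used to codes 0, 1, 2, ...

    Returns (old_to_new, new_to_old). Hypothetical subsetter, for illustration.
    """
    used = sorted(set(text))
    old_to_new = {ch: code for code, ch in enumerate(used)}
    new_to_old = {code: ch for ch, code in old_to_new.items()}
    return old_to_new, new_to_old


def encode(text, old_to_new):
    """Rewrite the text as a list of compacted codes, as stored in the page."""
    return [old_to_new[ch] for ch in text]


def naive_extract(codes):
    """What you get if the compacted codes are read as Unicode code points."""
    return "".join(chr(code) for code in codes)


def extract_with_map(codes, new_to_old):
    """What you get if the reverse map survived alongside the font."""
    return "".join(new_to_old[code] for code in codes)


original = "hello pdf"
old_to_new, new_to_old = compact_font(original)
codes = encode(original, old_to_new)

garbled = naive_extract(codes)                   # control characters, not text
recovered = extract_with_map(codes, new_to_old)  # round-trips to the original
```

If the printer drops or rewrites the reverse map, you end up in the `naive_extract` situation, which would match what people are seeing.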