Looking closely, this only works if you know the font name, size, and weight use...

if_by_whisky · on Dec 17, 2022

Guessing is actually easy. For the kinds of files that end up as redacted PDFs (legal, government, etc.), there are probably 5-8 font options that account for 98% of documents. Sizes and weights are immediately recognizable to even a slightly trained eye. I'm pretty sure I could guess all 3 attributes at a glance.

happyopossum · on Dec 17, 2022

Or just look at the unredacted text around it and use that. Nobody is changing fonts on text before pixelation.

mmoskal · on Dec 17, 2022

Often only part of the text is pixelated.

taneq · on Dec 18, 2022

It's also a proof of concept. Slap a couple more for() loops in there to iterate through different font options and try a range of alignments and you could have it fully automatic.
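
A minimal sketch of what those extra loops might look like, assuming a toy one-dimensional "font" in place of real rasterized glyphs. `FONTS`, `render`, `pixelate`, `distance`, and `best_guess` are hypothetical names made up for illustration, not part of the tool under discussion; a real attempt would render with an actual font library and compare 2-D pixel blocks.

```python
from itertools import product

# Toy "fonts" (an assumption for illustration): each maps a character to a
# list of column intensities, standing in for real rasterized glyphs.
FONTS = {
    "narrow": {"a": [3, 9], "b": [9, 4], "c": [5, 5]},
    "wide": {"a": [2, 8, 3], "b": [9, 5, 1], "c": [4, 6, 4]},
}


def render(text, font, offset):
    """Render text as one row of intensities, shifted right by `offset` blank columns."""
    row = [0] * offset
    for ch in text:
        row.extend(FONTS[font][ch])
    return row


def pixelate(row, block):
    """Average each run of `block` columns, like a 1-D mosaic filter."""
    padded = row + [0] * (-len(row) % block)
    return [sum(padded[i:i + block]) / block for i in range(0, len(padded), block)]


def distance(a, b):
    """Sum of squared differences; the shorter list is padded with zeros."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_guess(target, candidates, block):
    """Try every (font, alignment, text) combination and keep the closest match."""
    best = None
    for font, offset, text in product(FONTS, range(block), candidates):
        score = distance(pixelate(render(text, font, offset), block), target)
        if best is None or score < best[0]:
            best = (score, font, offset, text)
    return best


# Pretend this is the redacted region: "cab" in the "wide" font, shifted by 1.
target = pixelate(render("cab", "wide", 1), 4)
candidates = ["abc", "acb", "bac", "bca", "cab", "cba"]
print(best_guess(target, candidates, 4))
```

The search is plain brute force: the font and the alignment are just two more dimensions in the same loop as the candidate text, so the cost grows multiplicatively with each one.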

kg · on Dec 18, 2022

There are lots of existing tools that can guess a font accurately if you feed them an image of enough text, so that's not a big obstacle.

lelandfe · on Dec 18, 2022

Or just use the rest of the document to build the corpus?
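
A rough sketch of that idea, assuming the unredacted text is already available as a plain string. `build_corpus` and `candidates_of_length` are hypothetical helpers, and the length filter is only a stand-in for whatever width estimate the pixelated region gives you.

```python
import re
from collections import Counter


def build_corpus(visible_text, min_len=3):
    """Return words from the unredacted text, most frequent first.

    Case is preserved because capitalized names render differently.
    """
    words = re.findall(r"[A-Za-z']+", visible_text)
    counts = Counter(w for w in words if len(w) >= min_len)
    return [w for w, _ in counts.most_common()]


def candidates_of_length(corpus, n_chars, slack=1):
    """Keep words whose length is within `slack` characters of the estimate."""
    return [w for w in corpus if abs(len(w) - n_chars) <= slack]


text = "The witness met Carter. Carter told the witness that Carter left."
corpus = build_corpus(text)
print(candidates_of_length(corpus, 6, slack=0))
```

Names and terms that recur in the visible text are likely candidates for the redacted parts, so trying them first should cut the search down considerably.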