Embedded PDF viewer in Firefox 81 supports filling forms

kebman · on Sept 22, 2020

I looked into coding PDFs once. Then I closed my MacBook (Pro) and went for a long walk into the ocean. I think I almost got to America, but then I turned and swam back again. Turnd out I had just fallen asleep and had a nightmare. I was actually just working with regular text files, and everything was fine.

rvense · on Sept 22, 2020

My favourite PDF fact is that it doesn't have to start at the beginning or end at the end of a file. Any sea of bytes that contains a PDF file is an acceptable PDF file...

TeMPOraL · on Sept 22, 2020

Did anyone try to pluck out PDFs from /dev/urandom? How about from radiotelescope feed? Maybe the first evidence of extraterrestrial life will be some poor alien's tax form?

willis77 · on Sept 22, 2020

The digits of pi contain every pdf that ever could and ever will exist.

vcxy · on Sept 22, 2020

"Find the earliest valid pdf in consecutive digits of pi"

paulmd · on Sept 22, 2020

I mean, the answer is trivially zero, there exists a PDF-like structure somewhere in Pi, and the offset of that doesn't have to be zero, it can start or end anywhere. So the range [0, N] is a valid PDF.

infogulch · on Sept 22, 2020

"Find the last byte of the first valid PDF in the binary digits of Pi"

paulmd · on Sept 23, 2020

Since the PDF also doesn't have to be end-aligned, the answer is trivially [0, infinity].

The first place a valid PDF could be ended, perhaps.

infogulch · on Sept 24, 2020

A pdf at [0, N] sorts before the one at [0, N+1], by "first valid pdf".

paulmd · on Sept 25, 2020

No, both start at 0. Also, [0, infinity] and [0, infinity+1] are the same thing.

infogulch · on Sept 29, 2020

https://www.online-python.com/OuZ0thRs6D

paulmd · on Sept 29, 2020

your example fails to satisfy the invariant. 11 is less than infinity.

you're just pasting random python snippits at me now. It's time to move on.

again, just to summarize: PDF files do not have to be zero aligned, and they do not have to be end aligned. Therefore the answer to the question "what is the first segment of Pi that is a valid PDF file" is trivially (0,infinity). That is a correct statement. The non-greedy (in the regex sense) answer to that question will be different, however.

infogulch · on Oct 6, 2020

Why is this so hard? If the tuple (0,10) represents the range of a valid pdf, then the next tuple (0,11) is also a valid pdf. Or any after it up to and including (0,infinity).

Note the word "next", implying that (0,10) sorts before (0,11); you even say it yourself "11 is less than infinity". Where I'm from "first" and "less" are related (the first element in a unique sorted list is defined to be less than all other elements). So if there is any valid pdf in pi that can be identified by the range tuple (0,N), then the first valid pdf must occur before N -> infinity. Therefore (0,infinity) can never be the first valid pdf, even though it may be a valid pdf.

Maybe a picture would help:

    Potential pdf file ranges in pi: [(0,0),(0,1),(0,2),(0,3),(0,4),...,(0,N-1),(0,N),(0,N+1),(0,N+2),...,(0,infinity)]
    Is it a valid pdf?                 no    no    no    no    no  (no)  no      yes   yes     yes   (yes) yes
    Which one is first?                                                          ^^^

I thought linking to a python script that shows the order comparison of a tuple (0,N) as less than the tuple (0,N+1) would clearly demonstrate this, but it appears to have failed to communicate that to you. We don't need non-greedy regex rules to do a less than comparison.

drevil-v2 · on Sept 22, 2020

Please don't give them any ideas.. the whiteboard interview coding tests are hard enough as it is

gedy · on Sept 22, 2020

How else will we weed out the fakers and people coasting for 10 years? Our CRUD SaaS app needs top people.

ehsankia · on Sept 22, 2020

Doesn't sound like that hard of a question, given you are provided the structure of the PDF header. I guess it really comes down to substring search.

btown · on Sept 22, 2020

Imagine if it was a PDF that simply rendered the number 42.

ddalex · on Sept 22, 2020

If that happens we know for a fact that we are in a simulation

mtzet · on Sept 22, 2020

Well maybe. We don't know if pi is a normal number.

poizan42 · on Sept 23, 2020

Actually it only needs to be a disjunctive (or rich) number which is a weaker condition.

We don't know whether pi is that either for any integer base.

SteveGoob · on Sept 22, 2020

> We don't know if pi is a normal number.

Sure we do. There are plenty of proofs out there that pi is an irrational number.

andreareina · on Sept 22, 2020

Irrational does not imply normal. For example, 1.01001000100001... is irrational but it's certainly not normal.

x3c · on Sept 22, 2020

Technically, 1.01001000100001... can be normal depending on what ... stands for. :)

bscphil · on Sept 22, 2020

Well, obviously. But presumably the ... is meant to imply that this is the summation of 1/(10^(x(x+3)/2)).

coddle-hark · on Sept 22, 2020

Or what 1 or 0 or . stands for.

landryl · on Sept 22, 2020

Actually I'd argue the example you provided is normal, as long as you authorise a particular encoding where every number n you're looking for is encoded as a string of n zeros.

It's then trivial to see that every number you can think of is encoded in there, and therefore any data, piece of music or movie that ever existed.

(I'm not sure we're allowed to fiddle with the encoding, but since we allow ourselves to represent a piece of music into a number, we're already talking about encoding anyway, so it doesn't seem like cheating to me...)

enedil · on Sept 22, 2020

Normality of a number is with respect to number bases, so your trick with encoding is invalid. Otherwise, every computable number could be considered normal - take an algorithm for generating of it, supply a random string (this is the encoding), disregard the random string, and you have a perfectly valid normal representation of your number. So it is cheating.

meithecatte · on Sept 23, 2020

I agree that normality is a specific formalized concept, but you could always require that an encoding function like this is injective.

dheera · on Sept 22, 2020

Encoding doesn't count. Normality is a very specific mathematical concept: https://en.wikipedia.org/wiki/Normal_number

Also, 1.01001000100001... is a good example of a number that is both irrational and transcendental but not normal.

jimktrains2 · on Sept 22, 2020

Normal in this sense means that all the frequency of all digits approaches a uniform distribution as the length of the sample increases towards infinity. Basically if we could see "all of" π and count all the 0s, 1s 2s, 3s, &c to 9 all the counts would be equal.

gugagore · on Sept 22, 2020

That on its own can't be right, because 0.12345678901234.....

According to wikipédia, you gave a definition for "simply normal", and for normal numbers the distribution of any sequence of digits is uniform. So 00, 01, ..., 99 each occur uniformally too.

enedil · on Sept 22, 2020

Moreover you need to consider it with regards to all other bases than 10 too.

gorgoiler · on Sept 23, 2020

Is this correct, mathematically?

I understand the point that PI contains every possible piece of information, theoretically.

However, the chance of finding a given string in PI depends on the string’s length. The longer the string, the more the probability tends to 0.

The paradox therefore is that PI contains every PDF, but you will never find them, so in what sense does it really contain them at all?

lsiebert · on Sept 23, 2020

No, all strings theoretically exist in 𝛑 given enough digits, so longer strings don't reduce probability of existence, they just mean that it will take more digits to find them.

tsbertalan · on Sept 23, 2020

See Borell-Cantelli lemma.

gorgoiler · on Sept 25, 2020

I looked this up but I’m not sure I grasp your point.

Are you saying that:

- given a long string, we might ask “can this string be found in PI?”

- the probability of finding a long string in PI is infinitely small

- the number of possible strings in PI is infinitely large

- it’s not possible to decide if the answer is yes or no?

rezeroed · on Sept 23, 2020

If a tree falls in the forest and no one is around to hear it fall. Or a modern take, if a disease has no symptoms is it really a disease.

nroets · on Sept 22, 2020

Including a PDF that generates the digits of pi

adtac · on Sept 23, 2020

actually, if you find the citation, let me know, you might be in for an award

egocodedinsol · on Sept 23, 2020

I'm not sure that's necessarily true. It is true (at least with a non-constructive proof) that if you pick a 'random' real number then it contains all possible PDFs with probability one ( or that the set of numbers for which this is not true has lebesgue measure zero). But I'm not sure it's known that pi has this property.

dheera · on Sept 22, 2020

Pi is thought to be normal but it hasn't been proven yet, so we can't say that for sure, but it's likely true.

ducktective · on Sept 22, 2020

I don't think that is a proven fact.

rbonvall · on Sept 22, 2020

Since a PDF can begin with non-PDF content, then pi itself is a valid PDF file.

cgb223 · on Sept 22, 2020

My favorite pdf fact is that the security flags for things like copy protection and passwords are on the viewer to implement so you can just turn them off and all the security is gone

BHSPitMonkey · on Sept 22, 2020

Debian actually goes out of their way to patch those checks out in their PDF-related packages as part of their stance against DRM, like this example with "pdftk":

https://sources.debian.org/patches/pdftk/2.02-4/drm_fix/

r00fus · on Sept 22, 2020

This is not entirely true, you can encrypt PDFs [1] since v1.3 of the spec but the cypher is often so weak (RC4 until v1.6) they can be bruteforced in reasonable amounts of time.

[1] https://www.pdflib.com/pdf-knowledge-base/pdf-password-secur...

Thorrez · on Sept 23, 2020

You can encrypt them to completely prevent them from being opened. But cgb223 wasn't talking about that, cgb223 was talking about the ability to open them but not copy text, or not print.

benmmurphy · on Sept 23, 2020

You can make the text uncopyable by using non-standard font indexing. The reader will be able to copy the text but it will be gobbly-gook. It forces the user to OCR the PDF or reverse the font mapping.

hnick · on Sept 24, 2020

And ain't that a treat when clients send us PDFs to sort, print, and mail, but the address extraction fails completely.

toomanybeersies · on Sept 22, 2020

You can also circumvent copy protection on PDFs by taking a screenshot, or taking a photo of the screen with your phone.

Drdrdrq · on Sept 22, 2020

My somewhat less favorite pdf fact is that if you do that, you are still breaking protection, legally speaking.

maxerickson · on Sept 22, 2020

Seems to be a reasonable analogy with trespass, where you are violating the law when you cross an invisible line. The need for marking the line varies considerably.

And even places with strong roaming rights tend place limits on well marked land.

smegger001 · on Sept 22, 2020

So what if you open it in a postscript viewer instead of a PDF viewer? Because they are compatible formats except for some edge cases like security flags.

mkl · on Sept 22, 2020

Postscript and PDF are definitely not compatible formats. The drawing model is similar, but the structure and code are completely different.

sp332 · on Sept 22, 2020

On the other hand, this allows for some incredible polyglot files, like some of the tricks with PoC||GtfO issues where the file is a readable PDF but also a game cartridge and also a zip file with the proof-of-concept code in the issue. And the front cover has the MD5 hash of the whole file printed on it... but that's another trick entirely!

rvense · on Sept 22, 2020

Yeah, next time I need a CV it'll be a single-file Ruby web server and PDF that's also an archive of its own sources.

toomanybeersies · on Sept 22, 2020

I'm currently looking for a job as a Rails Developer. I might just do that.

Probably won't send it to any recruiters, but it will be a funny anecdote for interviews.

HellsMaddy · on Sept 22, 2020

Can a PDF file contain a PDF file, and if so can that PDF file contain a PDF file?

betatim · on Sept 22, 2020

Yes. Because the PDF standard specifies a mechanism that lets you "attach" files to a PDF :)

brailsafe · on Sept 23, 2020

I worked for PDFTron on their WebViewer product earlier this year, and primarily spent time implementing this feature in JS. Understanding the spec on this was tricky, because standard PDF viewers need to be able to uncompress the stuff you jam in there. It kind of blew my mind that you can literally jam any arbitrary file into a PDF.

layoutIfNeeded · on Sept 22, 2020

ZIP files can: https://research.swtch.com/zip

folmar · on Sept 24, 2020

My stupid bank sends encrypted attachments as an encrypted PDF with HTML file attached.

brailsafe · on Sept 23, 2020

chuinard · on Sept 23, 2020

I never understood that Google security blog post on how they could make 2 different PDFs with different content have the same SHA but now that you mention you can stuff bytes in a file unrelated to the PDF, it makes sense...

nostoc · on Sept 22, 2020

It'll depend on the pdf reader you're using, but I'm pretty sure the PDF header needs to start in the first 1K of the file.

mkl · on Sept 22, 2020

Some readers won't need a header at all, I think. Near the end (usually!) of the file there's an index of objects (page data etc.) with byte offsets, which can point to anywhere in the file.

agumonkey · on Sept 22, 2020

I can never find the PDF hack talk where author explains all 100 ways to embed things in pdf or pdf into things

wffurr · on Sept 22, 2020

It's hidden in a PDF in the digits of Pi.

alexott · on Sept 22, 2020

You can imagine the pain when you need to reliably detect PDF mime type on web proxy, or something like...

belval · on Sept 22, 2020

As someone still working with PDF processing, I can confirm that it doesn't get easier.

mettamage · on Sept 22, 2020

What's your favorite PDF feature that causes a brain meltdown?

I've read a few comments on HN how PDF is, well, not developer-friendly. If people are interested in providing some more examples here, I'd be curious to know!

pierrebai · on Sept 22, 2020

In the early 2000 I coded a PDF library for an industrial printer suite. (Print, proof, impositions)

I personally think the structural PDF format is a really great format. It's entirely ASCII-based, a pure text format, yet it can embed arbitrary binary data and compress that data. The actual structure is simple and support just enough functionality, like a tree of object, dictionaries and arrays, unicode strings, date formats, etc.

I think if you limit yourself to pure structural PDF woulde have been a great format to standardize upon, much better than JSON or XML. It';s richer than JSON, simpler and saner than XML. Again, it's top-notch ability to embed binary is great. It has other great characteristics, for example you can update anything just by appending.

The ugly bits are in the "semantic" PDF: the page descriptions, media, etc. Even then, the early version of PDF were nice, mainly just simplified Postscript.

hnick · on Sept 24, 2020

I'm of a similar opinion but I'll say the format is quite good, but the many and varied implementations are often not.

A common case is clients who use utilities that generate single customer documents then merge them into a bigger file for bulk print and mail (bills and statements, not identical copy). Without fail that results in thousands of similar but different subset fonts whereupon most printers I've encountered eventually fail due to memory issues.

Typically this leads to a discussion about "I can open it on my computer fine" and bending over backwards to find a workaround. Merging and consolidating these fonts doesn't seem to be a simple task, although some tools claim to work some of the time.

Something that scopes object resources for disposal could be nice in the PDF spec (maybe it exists), but something like a LRU caching mechanism on the printer would potentially resolve this too.

izacus · on Sept 22, 2020

Being able to do SQL queries to remote servers, upload form contents directly to a server, embedded 3D models and being able to have a fully featured page embedded Tetris game due to support for JS.

Having said that (and worked on a commercial PDF library), despite all the cruft that came with age, it's a well built format that survived the test of time with good reasons.

dwd · on Sept 23, 2020

Nice

I worked on a Pdf with inbuilt tracking solution that updated the form layout using ActionScript based on the workflow status and the role of the user (ie the line manager had a different group of fields in the form to complete like their signature while viewing what the initial requestor had entered) Lots of callbacks to the server saving in progress data and updating the status of who had the form, who was next based on department and emailing it to that person if it passed validation.

An initial fun discovery was that you could force the form to download and replace itself with the latest version even if they had just opened some old file they had on their pc.

modoc · on Sept 23, 2020

Can chat with you about PDFs? devon at digitalsanctuary dot com

jahewson · on Sept 22, 2020

Acrobat can read fantastically corrupt PDF files none of which are covered by the spec. The endless surprises induce a special kind of madness.

Streams just suddenly end? That’s ok. Totally corrupt xref tables? Ok. Incorrect image headers? Ok. Unrecognisably mangled Type1 font formats? Fine!

belval · on Sept 22, 2020

That's great because it creates client expectations regarding what my PDF application should support. Implementing the spec is not good enough, you have to do what PDFium or Adobe do.

fgonzag · on Sept 23, 2020

On the other hand, if they never supported those broken PDFs to begin with, we wouldn't have them in the wild and wouldn't have to deal with them.

folmar · on Sept 24, 2020

20 years ago I were more surprised when I got a PDF with correct xref table than with broken one.

nl · on Sept 23, 2020

I think one of the biggest pain points that developers hit that hasn't been mentioned here is content extraction.

A lot of the time developers want access to the text inside PDFs. Unlike HTML or formats like MS Word (XML or old binary format) getting "text" isn't really possible.

Most "document" formats have the concept of words or strings: a set of characters separated by whitespace. PDF isn't a "document" format in that sense - it's a page description language. Instead of strings of text you have character glyphs positioned at a particular location.

If you want to "read" the text, you have to work out the orientation (which can change throughout the page - think of table header alignment), and use some kind of heuristic to guess the word spacing based on the font and character spacing.

There's also this whole thing with clipping, where some text can be hidden behind other objects (or off page) so you have to try to deal with that.

There's lots of libraries that try to do this for you, but there are lots because none get it right 100% of the time...

belval · on Sept 22, 2020

- Remapping font tables to different characters for to reduce code usage.

- Clipping path logic, you can write text outside of it, which makes it effectively invisible yet it will show up if you try to extract the text.

- Anything regarding the graphicstate stack, it's a pain to debug.

- Extracting content from AcroForm/JS "XFA" forms

PDF is great format for printing, it's just a pain for pretty much everything else.

aidos · on Sept 22, 2020

This is also my list. Except for the forms, that's one I don't have to deal with.

My other one is the use of multiple subset fonts that are actually the same font with a different subset of glyphs that you want to merge back together.

arh68 · on Sept 22, 2020

"Identical" PDFs are not necessarily byte-identical; you can't just check equivalence through checksums. (AFAIK, it's been a while, feel free to correct) I don't remember a good way to normalize/disambiguate, at least I never attacked the problem long enough to have learned.

EDIT: oh yeah, I'm pretty sure it contains Mail, just like Zawinski said

amelius · on Sept 22, 2020

At least it's probably better than MS-Word's internal format ... (?)

liversage · on Sept 22, 2020

Microsoft Word stores XML documents inside a zip archive. There is a detailed specification of the format available: https://docs.microsoft.com/en-us/openspecs/office_standards/...

xaldir · on Sept 22, 2020

I think he was talking about the classic .doc format which was a clusterfuck and not the open XML.

akie · on Sept 22, 2020

If I remember correctly the XML format was just an XML-encoded version of the binary counterpart. Including all or most of the bugs and weird hacks.

amaccuish · on Sept 22, 2020

with the previous format being essentially a memory dump, i'd say that's progress

alexott · on Sept 22, 2020

That’s correct - I worked with MS team that documented old formats, and they said that sometimes they don’t have people left who knew what specific struct was intended for - although that was mostly for people PowerPoint and Visio, excel and word was better documented

tinus_hn · on Sept 24, 2020

Actually Excel was the only one that had official, freely available documentation for the old (now legacy) file format.

patrec · on Sept 22, 2020

You don't remember correctly. Word's docx format is far more intelligent than openoffice ODT, despite propaganda to the contrary. With one exception: word's zip files don't have a convenient magic header. The way it works with ODT, and a bunch of other formats is that you put an uncompressed identifier file (`mimetype`) as the first entry inside your zipfile. At byte 30 (of your zipfile) you then get `mimetype$THE_MIMEMETYPE`. This is a nice trick and works for any zip-based format. Sadly, docx does not do that so you have to go by file extension or look at (more of) the contents of the zipfile.

yoz-y · on Sept 24, 2020

IIRC the original doc (and xls) formats were unwieldy mainly because of performance requirements. In order to save and load fast they were basically a bunch of binary dumped structs.

DavidPeiffer · on Sept 22, 2020

I'm not saying it's beautiful, but isn't Ms-Word's internal format basically a series of XML files that are zipped up?

The old .Doc and .xls files were a bad format, but my understanding is that since Office 2007 the format is generally much better.

alexott · on Sept 22, 2020

Ms office files prior to office 2007 were mostly memory dumps of specific components, wrapped into composite files aka OLE2 storage - their content varied depending on office versions and often locale

Polylactic_acid · on Sept 23, 2020

Its a wonder alternative editors ever supported the format..

jhoechtl · on Sept 22, 2020

To be fair the old doc format was conceived in DOS era and memory efficiency was a primer back in the days.

smegger001 · on Sept 22, 2020

and if they hadn't waited so long to update to a sane file format no one would complain but they waited until 2007 to fix the format long after the dos era memory excuse had long ceased to be an issue. even then they only did it to allow them to shoe horn their format in as the iso standard after one was already selected bribing there way through the process.

divbzero · on Sept 22, 2020

I had to handle action buttons for a PDF once. I swam out from America a long long ways before turning back. Might have spotted you middle of the ocean.

meshaneian · on Sept 22, 2020

Legit username warns of one PDF peril.

moultano · on Sept 22, 2020

It's kinda crazy that this is the format we've standardized on to carry all of the output of academia into the future.

larrybud · on Sept 24, 2020

Actually, it’s not. The archiving standard is PDF/A, which is (as I understand it) much more structured and standardized.

mjcohen · on Sept 22, 2020

A lot of the input is LaTeX, so that's ok.

MayeulC · on Sept 22, 2020

And arxiv asks for the original latex source when submitting.

Well, at least, pdf is probably better than printed paper for that purpose.

phpdave11 · on Sept 22, 2020

It's really not that difficult if you read and understand the PDF specification. As a learning exercise, I created a simple PDF generator library that creates ASCII PDF documents (you can open them in Notepad) and includes comments about what each drawing instruction does.

https://github.com/phpdave11/davepdf

recursive · on Sept 22, 2020

I'm sure generating PDFs is much easier than reading them, such that "it just works" with any kind of PDF.

phpdave11 · on Sept 22, 2020

Reading them is also easy. I wrote a library that reads PDFs and imports page(s) from an existing PDF into a new PDF as a Form XObject.

https://github.com/phpdave11/gofpdi

bjoli · on Sept 23, 2020

I had a look through the XPS standard and had a similar feeling. I complained to a friend that had been involved in one of the bigger pdf libraries. He then made me compare it to his version of the pdf 2.0 (iirc) standard.

That is truly nightmare material. Especially considering a non-trivial percentage of pdfs circulating are non-conformant and people still expect them to render...

emiliosic · on Sept 23, 2020

Years ago, I worked on a project that required generating PDF invoices. Used the FPDF library and I was shocked how small the files are (including a properly sized logo) when generated compared to most other PDFs 'rendered' from word or print processors.

saagarjha · on Sept 24, 2020

I just had to do tech support for someone whose 3 MB PowerPoint of text and some shapes became a 70 MB PDF that they couldn’t send through email anymore :/

ajsnigrutin · on Sept 23, 2020

The last time i wanted to do that, i noped the fuck out, generated latex, and pdflatex-ed the thing, to get the pdf (we're talking generic reports, text, table, graph, email to customer on a schedule).

thehappypm · on Sept 23, 2020

I am actually interested in doing pure JS pdf processing. All of the web interfaces for PDF processing are server side — which means it’s tough to process large files. The dream is a purely JavaScript solution that never leaves the local computer. I’ve got a few client-side success stories that do fairly significant image generation through the canvas. So far PDF seems reasonably manageable through manipulating the text, but it’s not the best format.

anoopelias · on Sept 23, 2020

> The dream is a purely JavaScript solution that never leaves the local computer.

Mozilla's pdfjs[1] project is a pure HTML/JavaScript solution for PDF rendering. This is the same code that ships in Firefox browser as well. This is standalone, AFAIK, it doesn't talk to a mothership.

[1]https://mozilla.github.io/pdf.js/

thehappypm · on Sept 26, 2020

It’s not quite what I’m looking for—more of a viewer.

schoolornot · on Sept 22, 2020

I thought I found a nifty trick by using OpenOffice to create a form with a pre-filled value. I decoded the PDF using pdktk or one of the free tools, and then modified the value. Nope, that caused some kind of cascading/checksum error.

Ended up just making the app generate HTML before calling wkhtmltopdf.

The PDF spec is insane! But like all things what you get out of Word/OpenOffice is 100x more complex than if you wrote it yourself, which is indeed doable.

1vuio0pswjnm7 · on Sept 22, 2020

Did you look at this:

https://github.com/catseye/pdf.lua

gorgoiler · on Sept 22, 2020

Only Forward

...a wondeful novel along these lines. They only get shot and kidnapped though. Nothing so bad as PDFs.

godelmachine · on Sept 23, 2020

I didn’t get it. Are you implying that coding PDF is an onerous task?

maushu · on Sept 23, 2020

Check out the specs file for PDF and you will understand.

I will be extremely surprised if anyone (besides Adobe) has implemented 100% of it.

saagarjha · on Sept 24, 2020

I’d be surprised if Adobe has implemented 100% of it. With a format this complex, there’s bound to be discrepancies between the spec and the code they have.

enriquto · on Sept 23, 2020

Reading PDF files is certainly a nightmare. But you can easily produce a valid and simple pdf file just by printf-ing whatever needs to be printf-ed. There's an ugly header and the rest is essentially your text.

Someone · on Sept 22, 2020

OK, they added various ways of data compression, but PDF is, basically, a text-based format.

As far as I know, any PDF can be losslessly converted to an equivalent PDF that can be edited in any text editor, even Notepad. And yes, you could fill in the forms there, too (if you were stubborn enough)

roflc0ptic · on Sept 22, 2020

It sounds like you either know a lot more than me or a lot less than me. The PDFs I've dealt with don't store text as strings, they store it as individual characters. This left me having to write a heuristic based algorithm to group the characters into words, words into lines, lines into paragraphs, paragraphs into columns.

Again, as far as I know, there are no heuristics good enough to get that right for all values of PDF.

belval · on Sept 22, 2020

He knows a lot less than you probably because there is absolutely no requirements for PDFs to be in text format and most aren't. The "text" he is editing could render to completely different characters depending on how the PDF document was created.

The default MacOS PDF printer will actually remap the font cmap making born-digital PDFs where the "text" is something else entirely (say "$" maps to "a").

yjftsjthsd-h · on Sept 22, 2020

> The default MacOS PDF printer will actually remap the font cmap making born-digital PDFs where the "text" is something else entirely (say "$" maps to "a").

What? Why!? I've heard of doing that as a form of DRM, but I can't imagine Darwin defaulting to doing that.

belval · on Sept 22, 2020

I never dug deeper into it, so I am not aware of why it does that or if it's a specific version or whatnot, but take a PDF from which you can extract the text (with pdftotext/pdfbox for example). Open it in the document viewer and "print" it to PDF. If you extract the text again it is not readable anymore.

This wouldn't be an issue if it was a conscious choice, but when I parsed a lot of born-digital PDFs we ended up with a lot that were like that from various source. Try explaining that...

colejohnson66 · on Sept 22, 2020

Could it be “compacting” the fonts? So if U+0000 to U+0007F aren’t used at all, remove those glyphs and set U+0000’s glyph to be what was U+0080? Yes, I know NULL doesn’t have a glyph, but I hope that gets the idea across.

belval · on Sept 22, 2020

That would be my personal guess yes, as there are other ways to "protect" PDF text like curve-based rendering.

Someone · on Sept 23, 2020

Where did I say PDFs have to be in text format? I said every PDF has an equivalent text-only representation. The format, like postscript and eps started as text-only. Compression, making the files binary was added as an afterthought, to make files smaller (much smaller). If it were binary from the start, its designers would not have made the table of contents at the end waste bytes by writing offsets in ascii.

See http://blog.idrsolutions.com/?s=%22Make+your+own+PDF+file for more info on how to write PDF by hand (also shows why I said you have to be stubborn to do that)

belval · on Sept 23, 2020

"OK, they added various ways of data compression, but PDF is, basically, a text-based format."

You are trying to move the goal post here. The above statement is simply untrue, PDF is not a text based format and it's that simple.

mkl · on Sept 22, 2020

More. Go here and download the PDF spec.: https://www.adobe.com/devnet/pdf/pdf_reference.html

Look at Chapter 3, Syntax. The code is all text based. We are not talking about the visible characters in a PDF viewer, but the code of the PDF file itself.

roflc0ptic · on Sept 23, 2020

Oh, true. I misread, and also didn't know that. Thanks for the info!

hnick · on Sept 24, 2020

Going by your original premise though, just so you know the reference does indeed have examples of text as strings e.g.

BT /F13 12 Tf 288 720 Td (ABC) Tj ET

This can be extended to include spaces so you can essentially mark up entire lines of text at one time. What it can't do is cohesive paragraphs and flow/wrap, you need to use the relative positions to work out what text is in one block (and usually I'd defer to something like pdftotext for simple cases).

Laying out individual characters is common though. It's probably due to kerning concerns.

quickthrowman · on Sept 22, 2020

That’s news to me and Bluebeam’s PDF search feature! It turns out you can make PDFs (This usually happens with architectural drawings) that are comprised purely of images that are not searchable, and therefore you are wrong.

I silently thank every architect that provides searchable PDFs, it makes my job way easier

mkl · on Sept 22, 2020

GP is right. The code that makes up a PDF is text-based. Those images can be encoded in the PDF file using the ASCIIHexDecode filter, i.e. as editable ASCII text code.

yunohn · on Sept 23, 2020

GP uses the phrases "losslessly convert to text" and "fill in the forms". I think you've misunderstood them, they're clearly talking about the display text, and not the code that comprises it.

xfer · on Sept 22, 2020

Yes just use a hex editor and every data is text-based.

mkl · on Sept 23, 2020

Only the "binary" bits like fonts and bitmap images would need to be hex (and then only so Notepad doesn't mangle them). Everything else in an uncompressed PDF is already text. Go here and download the PDF spec.: https://www.adobe.com/devnet/pdf/pdf_reference.html

Look at Chapter 3, Syntax (and the rest of it, really).

I literally have scripts that use bash and sed, or Python, to modify PDFs by editing the text code. Doing it in Notepad is possible but tricky, as there's a table of object byte offsets near the end that it's easy to mess up by inserting a character.

burtonator · on Sept 22, 2020

I'm the author of Polar (https://getpolarized.io/) that uses PDF.js as its PDF backend.

This is a somewhat big update for PDF.js which is kind of cool in that they haven't really been updating it as aggressively as they usually do in the last year or so.

It's a bit frustrating to work with though. The entire concept of rendering a PDF via JS is fascinating but actually using the API has been a huge pain for us.

We've had to fork it internally and work on typescript bindings and other features to get it to work.

They seem to have a silly policy of only allow developers to use a subset of the API not the whole API itself so that it doesn't look like PDF.js (which I don't understand).

A lot of the functionality just isn't available otherwise.

brendandahl · on Sept 22, 2020

PDF.js dev here. I'm a bit confused on which part of internal API you would like to use? The way I think of it, there are really three API's in pdf.js: 1) Main thread API (api.js) which we base the version off 2) The code that runs in the worker 3) The viewer components (web/*)

Quite awhile ago when we decided what parts of the API to version, we thought more people would want to use #1. Now that the project is mature we could probably expose some more base the version off of that.

As for the "so that it doesn't look like PDF.js", we don't limit the API because of this. That suggestion (which I don't totally agree with) came from what we saw people doing, where they'd copy the entire viewer, when it'd probably be better to just let the user's browser choose how to show the PDF.

torresjrjr · on Sept 22, 2020

> PDF.js dev here

I'm so sorry about being forward but why the hell don't the vim keys (hjkl) smooth scroll? Its so frustrating. Is there an option to set it as so? Using the arrow keys is so cumbersome.

tarikozket · on Sept 22, 2020

here, I'll make it easier for you to contribute to the project by providing you with the lines you'd need to update:

https://github.com/mozilla/pdf.js/blob/83e1bbea6e23db8744420...

mavsman · on Sept 22, 2020

Brilliant

colejohnson66 · on Sept 22, 2020

Because not everyone uses Vim?

cpeterso · on Sept 23, 2020

PDF.js already supports h/j (and p/n) keys to page up and down. (I added them years ago. :) I think GP is asking for the keys to scroll the page by smaller steps instead of page up and down.

52-6F-62 · on Sept 22, 2020

I've been working on an internal tool for my company using the same library. It's saved me a ton of work, but my experience has been similar to yours. I've even had to lock in to a much older version for want of putting a lot more work on my plate since the API seems to have changed a fair bit. (well, between it and JSDOM which I am using to some rendering on the server). And like you I've had to write a bunch of the bindings/definitions myself or just reduce them to nil (declare module "yadda/yadday/thing" as any)—which is thankfully permissible since it just needs to be built once and run "forever" with near-zero need for feature additions, etc.

All to extract images in a routine fashion.

Just the same, I'm still immensely thankful they've published the library as OSS.

exikyut · on Sept 23, 2020

I'm benignly curious how pdfimages explodes in your use case.

52-6F-62 · on Sept 23, 2020

Unfortunately, it and Poppler, and Imagemagick, etc were all off the table. I was confined to running everything within a Node instance. Couldn't make calls out to command line tools. I tried probably 20 ways of using Poppler-based libraries to no avail.

It definitely would have been the better performing route, and simpler to implement. As it is, since the app is low traffic for actual processing of the images it doesn't matter too much, thankfully.

This library was helpful: https://github.com/ScientaNL/pdf-extractor

I haven't had time to do it cleanly, but I should contribute back with the types I wrote after cleaning them up...

0xFF0123 · on Sept 22, 2020

Looks cool! On a sidenote, I've always been curious with product sites: what's your metric for including other orgs under "Used and Trusted by Top Organizations"? How do you know they use / trust it?

maest · on Sept 22, 2020

Do you have, by any chance, any good resources on PDF.js? The README on the github is ok, but it doesn't really cover what workers are supposed to do and provide any useful mental model for the architecture of the whole thing.

0x6A75616E · on Sept 22, 2020

Hey! Polarized looks pretty cool. Question, has this feature (forms) been merged into the public pdf.js master yet?

john4532452 · on Sept 25, 2020

If you are a emacs user try the lesser know pdf-tools mode. Its amazing.

_coveredInBees · on Sept 22, 2020

Sheesh, what's with the hate for a generally all-round useful feature in an Open source browser? The last thing I want is to have to install 3rd-party software on my machine and have my browser be held hostage to it just to view PDF documents on the web. Being able to fill them in is a very useful feature and the in-browser PDF readers are still way less bloated than most other plugins.

kibwen · on Sept 22, 2020

Yes, like it or not PDF is a de facto standard of the web, in the same way that Flash was nearly a de facto standard before the industry-wide decade-long effort to kill it. A browser that doesn't support PDFs is as lacking in the eyes of users as a browser that doesn't support PNGs.

jval43 · on Sept 22, 2020

If Flash was rendered natively in the browser, sandboxed and across different browsers, and with high enough performance/low enough battery impact, it would have stayed.

There were efforts similar to PDF.js to run Flash content using JS but they were never able to tick all those boxes.

davidu · on Sept 23, 2020

https://beta.rive.app/ (OSS runtime, xplatform, native rendering)

Eduard · on Sept 22, 2020

I don't agree.

PDF is fine to be some binary blob to download just as most other binary blob formats are.

Would you expect to have .exe files being directly interpreted by a browser?

jiveturkey · on Sept 22, 2020

> Would you expect to have .exe files being directly interpreted by a browser?

no, i wouldn't. and yet here we are: wasm.

yjftsjthsd-h · on Sept 22, 2020

Yes, this is a nice feature added to a basically-reasonable implementation of a PDF viewer. I think the objection is that that PDF viewer should be an actual independent application, not baked into a browser that already is too many things to too many people. It's like Chrome including a basic antivirus function (https://support.google.com/chrome/answer/2765944?co=GENIE.Pl...) - yes it's useful, yes I trust it more than a lot of AV products, but no I don't think it's reasonable to bundle it into the program that's supposed to be here to render web pages for me. (Similar arguments, to varying degrees, are made against WebRTC and Pocket)

_coveredInBees · on Sept 22, 2020

I really don't see why it should be an independent application. I mean it's not like we expect a PNG viewer or HTML5 video viewer to be a separate application in a browser. Being able to view (and in this case fill/interact with) PDFs is pretty much a basic necessity on the web. Beyond the core HN crowd, almost nobody cares to have a 3rd party application that they have to install to view PDFs in their browser. Having a lightweight and secure PDF viewer that is also not made by some 3rd party company that could be collecting any amount of data on you is a good thing in general.

yjftsjthsd-h · on Sept 22, 2020

> I mean it's not like we expect a PNG viewer or HTML5 video viewer to be a separate application in a browser.

PDFs are generally an actual document, separate from the site they're on. If images and videos weren't a part of the web pages being viewed, I would be quite skeptical of including them in the browser. I mean, there are JS viewers for STL files (https://www.viewstl.com/) - should browsers include a 2D modeling environment?

> Beyond the core HN crowd, almost nobody cares to have a 3rd party application that they have to install to view PDFs in their browser.

See, I have the exact opposite experience; I've had less-technical family complain to me they were annoyed at Firefox because it stopped just opening PDFs in Adobe and forced them into a crippled slow viewer inside itself. Unfortunately, I can't tell which of us is in a bubble.

> Having a lightweight and secure PDF viewer that is also not made by some 3rd party company that could be collecting any amount of data on you is a good thing in general.

That many PDF viewers are awful is an argument for making a better PDF viewer, but not for baking it into a browser.

_coveredInBees · on Sept 23, 2020

> PDFs are generally an actual document, separate from the site they're on. If images and videos weren't a part of the web pages being viewed, I would be quite skeptical of including them in the browser.

See, I don't really agree with that because to me, PDFs are a pretty core part of content on the internet that users browse to via their browser. Pretty much every restaurant makes their menu available on their website as a PDF document. Almost all users will interact with PDF documents while browsing the web at some point or the other. Otoh, a tiny fraction will even know what an STL file is, let alone care about opening/viewing one. So that comparison really isn't a fair one.

> That many PDF viewers are awful is an argument for making a better PDF viewer, but not for baking it into a browser.

That's a bit of an odd statement. If anything, it proves exactly why this is a good move from Mozilla. The PDF standard has been around forever, and yet there is a dearth of free, high-quality PDF viewers that aren't bloated or filled with ads or spyware or trying to get you to upgrade to a paid version of their software. So Mozilla has finally taken matters into their own hands and provided a pretty good, light-weight and integrated solution that will do the job for most users. Power users who care can still enable other software via the plugin system as their default PDF viewer. I'm not sure how you can blame Mozilla for addressing a very real deficiency in the state of available software for PDF viewing.

bad_user · on Sept 22, 2020

No, it's like Chrome including a PDF viewer.

yjftsjthsd-h · on Sept 22, 2020

I expect that people who object to bundling a PDF viewer in Firefox object equally to Chrome doing the same.

dgellow · on Sept 22, 2020

A note for Linux and macOS users, from someone who switched to windows one year ago: it’s maybe surprising but it is a VERY REAL pain in the Windows world to find a pdf reader that also allows you to edit forms, that doesn’t also come with malware or adware, and has even just a modest UX!

So for sure you already have access to Evince and Preview.app, they already do everything you want, but Windows users don’t really have that luxury! Being able to say to users to just install Firefox if they want to edit PDF is really good IMHO, way better than the current situation.

fiblye · on Sept 23, 2020

Yeah, I never understood the PDF hate at all when I only used Macs. They were snappy, had smooth scaling, editing them wasn't too hard, and scrolling was smooth. It was a fine way to read documents or even books on a computer.

Then I had to use Windows. Good god, PDFs are horrible here. No matter what I use, every application is horrible in its own unique ways. Nothing can compare to the default software provided for free with Macs. I'd prefer to manage PDFs on my phone than my work computer.

If Mozilla can help people edit PDFs to any extent, they're doing the world a service.

ImaCake · on Sept 22, 2020

Just to provide anecdata against the current comments. I totally agree with you. It's not particularly hard if you are pretty tech savvy, but the for the average user you pretty much are stuck with adobe. Or you can try your luck with the edge/chrome pdf form fill but there's a decent chance it just won't bother saving your input. On adbobe, it is still full of extra crap that is irrelevant to everyday use. I think it still bugs people to update it all the time, but I don't use adobe, so I don't know.

maxerickson · on Sept 22, 2020

What comes with Adobe Reader?

I have it on my work computer and haven't noticed anything I would rate as particularly obnoxious, but I don't use it much.

mey · on Sept 23, 2020

If you aren't careful during the download/install process they will attempt to bundle various McAfee "security" products and an Adobe chrome extension into the install. Additionally they have made it less obvious where to download Reader instead of buying Acrobat.

Edit: Example, which you may not see if you aren't on Windows. https://musteat.org/images/hn/abobe_install.png

inopinatus · on Sept 22, 2020

I read that as suggesting this is potentially a killer app for Firefox adoption in the enterprise.

mickotron · on Sept 22, 2020

Okular can "edit" forms. I have been doing this on Linux and Windows for a while. Not the most usable but it works. What I can't do in Okular, I do in Gimp.

I will use Firefox for editable form pdfs but for those that don't have editable forms, I will continue to use Okular/Gimp.

I actually stumbled across the ability to edit forms in Firefox only recently. I was like... What? This is amazing! For some reason the pdf i clicked on opened in Firefox and yeah, surprised.

neop1x · on Sept 23, 2020

As a PDF "editor" you may also like Libreoffice Draw. I was surprised how well it can work with PDFs!

MayeulC · on Sept 22, 2020

And IIRC it's available on the windows store. It probably has msi as well.

shmerl · on Sept 23, 2020

Okular works for it in KDE on Linux. I wouldn't trade KDE for Windows :)

mey · on Sept 23, 2020

The least problematic options have been Microsoft Edge (already on the system) and Adobe (more problematic).

dgellow · on Sept 23, 2020

Edge can edit PDFs? I’m learning something today!

jiveturkey · on Sept 22, 2020

eh? acroread is very easily found.

zapzupnz · on Sept 22, 2020

If you've heard of it, sure. If you know of it, sure. If you know it's better than the competing options available for Windows, sure.

blackbrokkoli · on Sept 22, 2020

I have a feeling this thread has a strong bias from highly automated valley life. In more provincial regions and even just much of Europe lots of forms have to be filled out and printed.

It is not something you have to everyday or something, but the existing solutions suck massively. You either have to use Adobe, which requires Windows (or Mac, I suppose) and your firstborn or use some massively shady online service. So personally, I love this feature!

(And I also do not think that this will halt all other development at Mozilla like some comments here imply)

Jaxan · on Sept 22, 2020

In Mac you can fill in PDFs with the builtin Preview app. I like it.

chadrs · on Sept 23, 2020

And yet every time I go to my parents' they've somehow installed Adobe Reader for Mac. I've deleted that app like 6 times.

mkskm · on Sept 22, 2020

There's also the paid app PDF Expert which is generally excellent.

boogies · on Sept 22, 2020

Isn’t evince capable of this and the default PDF viewer on GNOME?

krastanov · on Sept 22, 2020

I have had evince fail render forms, but okular (the KDE default) has worked pretty well.

folmar · on Sept 24, 2020

Works fine for me.

jhoechtl · on Sept 22, 2020

Gnome - no. Okular the KDE counterpart works very fine.

drdaeman · on Sept 22, 2020

Okular works on Windows. Or, at least, used to be, some years ago.

gcarvalho · on Sept 23, 2020

It's available on the Store. No need to log in to download it too.

https://www.microsoft.com/en-us/p/okular/9n41msq1wnm8?active...

vbarrielle · on Sept 23, 2020

Tried it a few months ago, it works.

throwaway33339 · on Sept 23, 2020

> In more provincial regions and even just much of Europe lots of forms have to be filled out and printed.

This is changing quickly with the covid pandemic. One of its silver linings is that I can't remember the last time I had to wait in line for one hour only to be told off by some exhausted bureaucrat about my missing grand-parents' birth certificate or whatever the hell they come up with. Take that, bureaucracy!

dehrmann · on Sept 22, 2020

PDF support in Firefox is one of the most important additions in recent years. My gripe with Mozilla was they're pursing all these side projects when they really should be targeting feature parity with Chrome. That's the only way people will ever switch.

kodablah · on Sept 22, 2020

> My gripe with Mozilla was they're pursing all these side projects when they really should be targeting feature parity with Chrome. That's the only way people will ever switch.

A bit off topic from the post at hand, but my gripe was the opposite. The relentless pursuit of parity made them indistinguishable giving users no reason to switch (and taking dev time away from distinguishing features). Granted the pursuit of users instead of principles is its own folly that's hard to overcome when money is needed.

eddiecalzone · on Sept 22, 2020

I'm curious what you notice is missing in terms of feature parity. I'm mostly a back-end developer (not diving into devtools very often) and switched a year ago. I'm much happier and haven't looked back.

banana_maker · on Sept 22, 2020

I just want support for APIs, mainly. I get websites from time to time that just refuse to load. E.g.

* blank pages when trying to load an imgur gallery on v68 (esr).

* image uploading not working right on instagram and various other sites, either producing blank images or ones with weird lines.

* several teleconferencing / video meeting websites just don't work properly, whether it's not detecting hardware properly, etc

I have to keep chromium installed so I can use these sites properly.

sfink · on Sept 23, 2020

Please file bugs for these. Also, to the author of this comment: please file bugs for these. (I work for Mozilla and yet I'm pretty lazy about bothering to file these.)

https://bugzilla.mozilla.org/enter_bug.cgi?product=Web%20Com... if you're on desktop.

chris_wot · on Sept 23, 2020

Can I just say, for the record, thank you for the work you and all the devs put into Firefox.

Tagbert · on Sept 23, 2020

This kind of problem also happens when developers only test on Chrome. Or if they are using Chrome-specific features and don’t properly handle failover.

DHowett · on Sept 22, 2020

> feature parity

Like PDF support?

nacs · on Sept 22, 2020

That’s what OP said yes. That features like PDF fill is essential while things like Pocket are basically non-core side projects.

bad_user · on Sept 22, 2020

Pocket is an acquired company, the integration with FF has been minimal (it does less than the Chrome extension ;)) and I'm pretty sure it pays for itself.