Embedded PDF viewer in Firefox 81 supports filling forms (support.mozilla.org)
1093 points by muxator on Sept 22, 2020 | 375 comments


I looked into coding PDFs once. Then I closed my MacBook (Pro) and went for a long walk into the ocean. I think I almost got to America, but then I turned and swam back again. Turned out I had just fallen asleep and had a nightmare. I was actually just working with regular text files, and everything was fine.


My favourite PDF fact is that it doesn't have to start at the beginning or end at the end of a file. Any sea of bytes that contains a PDF file is an acceptable PDF file...


Did anyone try to pluck out PDFs from /dev/urandom? How about from radiotelescope feed? Maybe the first evidence of extraterrestrial life will be some poor alien's tax form?


The digits of pi contain every pdf that ever could and ever will exist.


"Find the earliest valid pdf in consecutive digits of pi"


I mean, the answer is trivially zero: there exists a PDF-like structure somewhere in Pi, and its offset doesn't have to be zero; it can start or end anywhere. So the range [0, N] is a valid PDF.


"Find the last byte of the first valid PDF in the binary digits of Pi"


Since the PDF also doesn't have to be end-aligned, the answer is trivially [0, infinity].

The first place a valid PDF could be ended, perhaps.


A pdf at [0, N] sorts before the one at [0, N+1], by "first valid pdf".


No, both start at 0. Also, [0, infinity] and [0, infinity+1] are the same thing.



your example fails to satisfy the invariant. 11 is less than infinity.

you're just pasting random python snippets at me now. It's time to move on.

again, just to summarize: PDF files do not have to be zero aligned, and they do not have to be end aligned. Therefore the answer to the question "what is the first segment of Pi that is a valid PDF file" is trivially (0,infinity). That is a correct statement. The non-greedy (in the regex sense) answer to that question will be different, however.


Why is this so hard? If the tuple (0,10) represents the range of a valid pdf, then the next tuple (0,11) is also a valid pdf. Or any after it up to and including (0,infinity).

Note the word "next", implying that (0,10) sorts before (0,11); you even say it yourself "11 is less than infinity". Where I'm from "first" and "less" are related (the first element in a unique sorted list is defined to be less than all other elements). So if there is any valid pdf in pi that can be identified by the range tuple (0,N), then the first valid pdf must occur before N -> infinity. Therefore (0,infinity) can never be the first valid pdf, even though it may be a valid pdf.

Maybe a picture would help:

    Potential pdf file ranges in pi: [(0,0),(0,1),(0,2),(0,3),(0,4),...,(0,N-1),(0,N),(0,N+1),(0,N+2),...,(0,infinity)]
    Is it a valid pdf?                 no    no    no    no    no  (no)  no      yes   yes     yes   (yes) yes
    Which one is first?                                                          ^^^
I thought linking to a python script that shows the order comparison of a tuple (0,N) as less than the tuple (0,N+1) would clearly demonstrate this, but it appears to have failed to communicate that to you. We don't need non-greedy regex rules to do a less than comparison.
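To make the ordering point concrete, here's a minimal sketch (in Python, since that's what was being pasted around): tuples compare lexicographically, so "first valid PDF" just means the minimum candidate range, which can never be (0, infinity) once any finite (0, N) qualifies.

```python
# Candidate ranges (start, end) compare lexicographically, so among ranges
# that all start at 0, the one with the smallest end sorts first. "First
# valid PDF" is then just min() over the valid candidates.
candidates = [(0, 12), (0, 10), (0, 11)]
first = min(candidates)
assert first == (0, 10)

# Any finite (0, N) sorts before (0, infinity), so the infinite range
# can never be "first".
assert (0, 10) < (0, 11) < (0, float("inf"))
```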


Please don't give them any ideas.. the whiteboard interview coding tests are hard enough as it is


How else will we weed out the fakers and people coasting for 10 years? Our CRUD SaaS app needs top people.


Doesn't sound like that hard of a question, given you are provided the structure of the PDF header. I guess it really comes down to substring search.


Imagine if it was a PDF that simply rendered the number 42.


If that happens we know for a fact that we are in a simulation


Well maybe. We don't know if pi is a normal number.


Actually it only needs to be a disjunctive (or rich) number, which is a weaker condition.

We don't know whether pi is that either for any integer base.


> We don't know if pi is a normal number.

Sure we do. There are plenty of proofs out there that pi is an irrational number.


Irrational does not imply normal. For example, 1.01001000100001... is irrational but it's certainly not normal.


Technically, 1.01001000100001... can be normal depending on what ... stands for. :)


Well, obviously. But presumably the ... is meant to imply that this is the summation of 1/(10^(x(x+3)/2)).


Or what 1 or 0 or . stands for.


Actually I'd argue the example you provided is normal, as long as you authorise a particular encoding where every number n you're looking for is encoded as a string of n zeros.

It's then trivial to see that every number you can think of is encoded in there, and therefore any data, piece of music or movie that ever existed.

(I'm not sure we're allowed to fiddle with the encoding, but since we allow ourselves to represent a piece of music into a number, we're already talking about encoding anyway, so it doesn't seem like cheating to me...)


Normality of a number is with respect to number bases, so your trick with encoding is invalid. Otherwise, every computable number could be considered normal: take an algorithm for generating it, supply a random string (this is the encoding), disregard the random string, and you have a perfectly valid normal representation of your number. So it is cheating.


I agree that normality is a specific formalized concept, but you could always require that an encoding function like this is injective.


Encoding doesn't count. Normality is a very specific mathematical concept: https://en.wikipedia.org/wiki/Normal_number

Also, 1.01001000100001... is a good example of a number that is both irrational and transcendental but not normal.


Normal in this sense means that the frequency of every digit approaches a uniform distribution as the length of the sample increases towards infinity. Basically, if we could see "all of" π and count all the 0s, 1s, 2s, 3s, &c. up to 9, all the counts would be equal.


That on its own can't be right, because 0.12345678901234.....

According to Wikipedia, you gave the definition of "simply normal"; for normal numbers the distribution of every sequence of digits is uniform, so 00, 01, ..., 99 each occur uniformly too.
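A rough empirical illustration of the distinction (a sketch, not a proof of anything): for a base-10 normal number, every k-digit block should approach frequency 10^-k, which you can estimate over a finite prefix of digits.

```python
from collections import Counter

def block_freqs(digits: str, k: int = 1):
    """Estimate the frequency of each k-digit block in a digit string."""
    counts = Counter(digits[i:i + k] for i in range(len(digits) - k + 1))
    total = sum(counts.values())
    return {block: c / total for block, c in counts.items()}

# A periodic expansion like 0.012345678901... is simply normal in base 10
# (each single digit has frequency 1/10) but not normal: the block "11",
# for example, never appears at all.
freqs = block_freqs("0123456789" * 100)
```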


Moreover you need to consider it with regards to all other bases than 10 too.


Is this correct, mathematically?

I understand the point that PI contains every possible piece of information, theoretically.

However, the chance of finding a given string in PI depends on the string’s length. The longer the string, the more the probability tends to 0.

The paradox therefore is that PI contains every PDF, but you will never find them, so in what sense does it really contain them at all?


No, all strings theoretically exist in 𝛑 given enough digits, so longer strings don't reduce probability of existence, they just mean that it will take more digits to find them.


See the Borel-Cantelli lemma.


I looked this up but I’m not sure I grasp your point.

Are you saying that:

- given a long string, we might ask “can this string be found in PI?”

- the probability of finding a long string in PI is infinitely small

- the number of possible strings in PI is infinitely large

- it’s not possible to decide if the answer is yes or no?


If a tree falls in the forest and no one is around to hear it fall. Or a modern take, if a disease has no symptoms is it really a disease.


<citation needed>

Including a PDF that generates the digits of pi


actually, if you find the citation, let me know, you might be in for an award


I'm not sure that's necessarily true. It is true (at least with a non-constructive proof) that if you pick a 'random' real number then it contains all possible PDFs with probability one (or that the set of numbers for which this is not true has Lebesgue measure zero). But I'm not sure it's known that pi has this property.


Pi is thought to be normal but it hasn't been proven yet, so we can't say that for sure, but it's likely true.


I don't think that is a proven fact.


Since a PDF can begin with non-PDF content, then pi itself is a valid PDF file.


My favorite pdf fact is that the security flags for things like copy protection and passwords are on the viewer to implement so you can just turn them off and all the security is gone


Debian actually goes out of their way to patch those checks out in their PDF-related packages as part of their stance against DRM, like this example with "pdftk":

https://sources.debian.org/patches/pdftk/2.02-4/drm_fix/


This is not entirely true; you can encrypt PDFs [1] since v1.3 of the spec, but the cipher is often so weak (RC4 until v1.6) that they can be brute-forced in reasonable amounts of time.

[1] https://www.pdflib.com/pdf-knowledge-base/pdf-password-secur...


You can encrypt them to completely prevent them from being opened. But cgb223 wasn't talking about that, cgb223 was talking about the ability to open them but not copy text, or not print.
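For concreteness, a sketch of why those restrictions are purely advisory: with the standard security handler, the document's /P entry is just a signed bitmask that the viewer is trusted to honor. The bit numbering below follows the spec's 1-based convention; treat the exact mapping as an assumption to double-check against the spec.

```python
# Hypothetical decoder for the standard security handler's /P bitmask.
# Per the spec's convention: bit 3 = print, bit 4 = modify,
# bit 5 = copy/extract, bit 6 = annotate/fill forms.
PERMISSION_BITS = {3: "print", 4: "modify", 5: "copy", 6: "annotate"}

def decode_permissions(p: int) -> dict:
    # A compliant viewer checks each bit before enabling the action;
    # a non-compliant viewer simply ignores the mask entirely.
    return {name: bool(p & (1 << (bit - 1)))
            for bit, name in PERMISSION_BITS.items()}
```

Nothing in the file enforces this: the same bytes decrypt either way, which is exactly the point being made above.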


You can make the text uncopyable by using non-standard font indexing. The reader will be able to copy the text but it will be gobbledygook. It forces the user to OCR the PDF or reverse the font mapping.


And ain't that a treat when clients send us PDFs to sort, print, and mail, but the address extraction fails completely.


You can also circumvent copy protection on PDFs by taking a screenshot, or taking a photo of the screen with your phone.


My somewhat less favorite pdf fact is that if you do that, you are still breaking protection, legally speaking.


Seems to be a reasonable analogy with trespass, where you are violating the law when you cross an invisible line. The need for marking the line varies considerably.

And even places with strong roaming rights tend to place limits on well-marked land.


So what if you open it in a postscript viewer instead of a PDF viewer? Because they are compatible formats except for some edge cases like security flags.


Postscript and PDF are definitely not compatible formats. The drawing model is similar, but the structure and code are completely different.


On the other hand, this allows for some incredible polyglot files, like some of the tricks in PoC||GTFO issues where the file is a readable PDF but also a game cartridge and also a zip file with the proof-of-concept code in the issue. And the front cover has the MD5 hash of the whole file printed on it... but that's another trick entirely!


Yeah, next time I need a CV it'll be a single-file Ruby web server and PDF that's also an archive of its own sources.


I'm currently looking for a job as a Rails Developer. I might just do that.

Probably won't send it to any recruiters, but it will be a funny anecdote for interviews.


Can a PDF file contain a PDF file, and if so can that PDF file contain a PDF file?


Yes. Because the PDF standard specifies a mechanism that lets you "attach" files to a PDF :)


I worked for PDFTron on their WebViewer product earlier this year, and primarily spent time implementing this feature in JS. Understanding the spec on this was tricky, because standard PDF viewers need to be able to uncompress the stuff you jam in there. It kind of blew my mind that you can literally jam any arbitrary file into a PDF.



My stupid bank sends encrypted attachments as an encrypted PDF with HTML file attached.


Yes.


I never understood that Google security blog post on how they could make 2 different PDFs with different content have the same SHA but now that you mention you can stuff bytes in a file unrelated to the PDF, it makes sense...


It'll depend on the pdf reader you're using, but I'm pretty sure the PDF header needs to start in the first 1K of the file.


Some readers won't need a header at all, I think. Near the end (usually!) of the file there's an index of objects (page data etc.) with byte offsets, which can point to anywhere in the file.
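A sketch of both lookups (heuristic, mirroring common reader behavior rather than the letter of the spec): scan the first kilobyte for the `%PDF-` marker and the last kilobyte for the `startxref` offset.

```python
def pdf_bounds(data: bytes):
    """Return (header offset, xref offset) using common reader heuristics."""
    # Many readers accept a %PDF- header anywhere in roughly the first 1 KB.
    head = data[:1024].find(b"%PDF-")
    # The byte offset of the cross-reference section follows the 'startxref'
    # keyword, conventionally found near the end of the file.
    tail = data[-1024:]
    i = tail.rfind(b"startxref")
    xref = int(tail[i + 9:].split()[0].decode()) if i != -1 else None
    return head, xref
```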


I can never find the PDF hack talk where author explains all 100 ways to embed things in pdf or pdf into things


It's hidden in a PDF in the digits of Pi.


You can imagine the pain when you need to reliably detect the PDF MIME type on a web proxy, or something like that...


As someone still working with PDF processing, I can confirm that it doesn't get easier.


What's your favorite PDF feature that causes a brain meltdown?

I've read a few comments on HN how PDF is, well, not developer-friendly. If people are interested in providing some more examples here, I'd be curious to know!


In the early 2000s I coded a PDF library for an industrial printer suite. (Print, proof, impositions)

I personally think the structural PDF format is a really great format. It's entirely ASCII-based, a pure text format, yet it can embed arbitrary binary data and compress that data. The actual structure is simple and supports just enough functionality: a tree of objects, dictionaries and arrays, Unicode strings, date formats, etc.

I think if you limit yourself to the pure structural layer, PDF would have been a great format to standardize upon, much better than JSON or XML. It's richer than JSON, simpler and saner than XML. Again, its top-notch ability to embed binary is great. It has other great characteristics; for example, you can update anything just by appending.

The ugly bits are in the "semantic" PDF: the page descriptions, media, etc. Even then, the early version of PDF were nice, mainly just simplified Postscript.


I'm of a similar opinion but I'll say the format is quite good, but the many and varied implementations are often not.

A common case is clients who use utilities that generate single customer documents then merge them into a bigger file for bulk print and mail (bills and statements, not identical copy). Without fail that results in thousands of similar but different subset fonts whereupon most printers I've encountered eventually fail due to memory issues.

Typically this leads to a discussion about "I can open it on my computer fine" and bending over backwards to find a workaround. Merging and consolidating these fonts doesn't seem to be a simple task, although some tools claim to work some of the time.

Something that scopes object resources for disposal could be nice in the PDF spec (maybe it exists), but something like a LRU caching mechanism on the printer would potentially resolve this too.


Being able to do SQL queries to remote servers, upload form contents directly to a server, embed 3D models, and have a fully featured Tetris game embedded in a page thanks to JS support.

Having said that (and worked on a commercial PDF library), despite all the cruft that came with age, it's a well built format that survived the test of time with good reasons.


Nice

I worked on a PDF with an inbuilt tracking solution that updated the form layout using ActionScript based on the workflow status and the role of the user (i.e. the line manager had a different group of fields in the form to complete, like their signature, while viewing what the initial requestor had entered). Lots of callbacks to the server, saving in-progress data and updating the status of who had the form and who was next based on department, and emailing it to that person if it passed validation.

An initial fun discovery was that you could force the form to download and replace itself with the latest version even if they had just opened some old file they had on their pc.


Can chat with you about PDFs? devon at digitalsanctuary dot com


Acrobat can read fantastically corrupt PDF files none of which are covered by the spec. The endless surprises induce a special kind of madness.

Streams just suddenly end? That’s ok. Totally corrupt xref tables? Ok. Incorrect image headers? Ok. Unrecognisably mangled Type1 font formats? Fine!


That's great because it creates client expectations regarding what my PDF application should support. Implementing the spec is not good enough, you have to do what PDFium or Adobe do.


On the other hand, if they never supported those broken PDFs to begin with, we wouldn't have them in the wild and wouldn't have to deal with them.


20 years ago I was more surprised when I got a PDF with a correct xref table than with a broken one.


I think one of the biggest pain points that developers hit that hasn't been mentioned here is content extraction.

A lot of the time developers want access to the text inside PDFs. Unlike HTML or formats like MS Word (XML or old binary format) getting "text" isn't really possible.

Most "document" formats have the concept of words or strings: a set of characters separated by whitespace. PDF isn't a "document" format in that sense - it's a page description language. Instead of strings of text you have character glyphs positioned at a particular location.

If you want to "read" the text, you have to work out the orientation (which can change throughout the page - think of table header alignment), and use some kind of heuristic to guess the word spacing based on the font and character spacing.

There's also this whole thing with clipping, where some text can be hidden behind other objects (or off page) so you have to try to deal with that.

There's lots of libraries that try to do this for you, but there are lots because none get it right 100% of the time...
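A toy version of that heuristic, just to show the shape of the problem (the gap threshold and coordinate model are invented; real extractors also need glyph widths, rotation, and font metrics):

```python
def group_glyphs(glyphs, space_w=10.0):
    """Group (x, y, char) glyphs into words by splitting on wide x-gaps.

    Assumes glyphs on the same line share a y coordinate; a horizontal
    gap wider than space_w is treated as a word boundary.
    """
    words, current, prev_x = [], "", None
    # Sort top-to-bottom (descending y, since PDF y grows upward),
    # then left-to-right within a line.
    for x, y, ch in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if current and prev_x is not None and x - prev_x > space_w:
            words.append(current)
            current = ""
        current += ch
        prev_x = x
    if current:
        words.append(current)
    return words
```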


- Remapping font tables to different characters to reduce code usage.

- Clipping path logic, you can write text outside of it, which makes it effectively invisible yet it will show up if you try to extract the text.

- Anything regarding the graphicstate stack, it's a pain to debug.

- Extracting content from AcroForm/JS "XFA" forms

PDF is a great format for printing; it's just a pain for pretty much everything else.


This is also my list. Except for the forms, that's one I don't have to deal with.

My other one is the use of multiple subset fonts that are actually the same font with a different subset of glyphs that you want to merge back together.


"Identical" PDFs are not necessarily byte-identical; you can't just check equivalence through checksums. (AFAIK, it's been a while, feel free to correct) I don't remember a good way to normalize/disambiguate, at least I never attacked the problem long enough to have learned.

EDIT: oh yeah, I'm pretty sure it contains Mail, just like Zawinski said


At least it's probably better than MS-Word's internal format ... (?)


Microsoft Word stores XML documents inside a zip archive. There is a detailed specification of the format available: https://docs.microsoft.com/en-us/openspecs/office_standards/...


I think he was talking about the classic .doc format which was a clusterfuck and not the open XML.


If I remember correctly the XML format was just an XML-encoded version of the binary counterpart. Including all or most of the bugs and weird hacks.


with the previous format being essentially a memory dump, i'd say that's progress


That’s correct - I worked with the MS team that documented the old formats, and they said that sometimes they didn't have people left who knew what a specific struct was intended for - although that was mostly for PowerPoint and Visio; Excel and Word were better documented.


Actually Excel was the only one that had official, freely available documentation for the old (now legacy) file format.


You don't remember correctly. Word's docx format is far more intelligent than OpenOffice ODT, despite propaganda to the contrary. With one exception: Word's zip files don't have a convenient magic header. The way it works with ODT, and a bunch of other formats, is that you put an uncompressed identifier file (`mimetype`) as the first entry inside your zipfile. At byte 30 (of your zipfile) you then get `mimetype$THE_MIMETYPE`. This is a nice trick and works for any zip-based format. Sadly, docx does not do that, so you have to go by file extension or look at (more of) the contents of the zipfile.
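The trick described above can be sketched with the standard library (the container layout imitates ODF; the fixed offsets come from the ZIP local-file-header format):

```python
import io
import zipfile

# Sketch of the ODF-style trick: store an uncompressed 'mimetype' entry
# first, then sniff it at a fixed byte offset with no zip parsing at all.
def make_container(mimetype: str, payload: bytes) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        # ZIP_STORED keeps the bytes verbatim right after the local header.
        z.writestr(zipfile.ZipInfo("mimetype"), mimetype,
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("content.xml", payload,
                   compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()

def sniff_mimetype(data: bytes) -> str:
    # ZIP local file header is 30 bytes: compressed size at offset 18,
    # name/extra lengths at 26/28, then the name, then the stored data.
    size = int.from_bytes(data[18:22], "little")
    name_len = int.from_bytes(data[26:28], "little")
    extra_len = int.from_bytes(data[28:30], "little")
    if data[30:30 + name_len] != b"mimetype":
        raise ValueError("first entry is not 'mimetype'")
    start = 30 + name_len + extra_len
    return data[start:start + size].decode("ascii")
```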


IIRC the original doc (and xls) formats were unwieldy mainly because of performance requirements. In order to save and load fast they were basically a bunch of binary dumped structs.


I'm not saying it's beautiful, but isn't MS Word's internal format basically a series of XML files that are zipped up?

The old .doc and .xls files were a bad format, but my understanding is that since Office 2007 the format is generally much better.


MS Office files prior to Office 2007 were mostly memory dumps of specific components, wrapped in composite files aka OLE2 storage - their content varied depending on Office version and often locale.


It's a wonder alternative editors ever supported the format.


To be fair, the old .doc format was conceived in the DOS era, when memory efficiency was at a premium.


And if they hadn't waited so long to update to a sane file format, no one would complain. But they waited until 2007 to fix the format, long after the DOS-era memory excuse had ceased to be an issue. Even then they only did it to shoehorn their format in as the ISO standard after one was already selected, bribing their way through the process.


I had to handle action buttons for a PDF once. I swam out from America a long long ways before turning back. Might have spotted you middle of the ocean.


Legit username warns of one PDF peril.


It's kinda crazy that this is the format we've standardized on to carry all of the output of academia into the future.


Actually, it’s not. The archiving standard is PDF/A, which is (as I understand it) much more structured and standardized.


A lot of the input is LaTeX, so that's ok.


And arxiv asks for the original latex source when submitting.

Well, at least, pdf is probably better than printed paper for that purpose.


It's really not that difficult if you read and understand the PDF specification. As a learning exercise, I created a simple PDF generator library that creates ASCII PDF documents (you can open them in Notepad) and includes comments about what each drawing instruction does.

https://github.com/phpdave11/davepdf


I'm sure generating PDFs is much easier than reading them, such that "it just works" with any kind of PDF.


Reading them is also easy. I wrote a library that reads PDFs and imports page(s) from an existing PDF into a new PDF as a Form XObject.

https://github.com/phpdave11/gofpdi


I had a look through the XPS standard and had a similar feeling. I complained to a friend that had been involved in one of the bigger pdf libraries. He then made me compare it to his version of the pdf 2.0 (iirc) standard.

That is truly nightmare material. Especially considering a non-trivial percentage of pdfs circulating are non-conformant and people still expect them to render...


Years ago, I worked on a project that required generating PDF invoices. Used the FPDF library and I was shocked how small the files were (including a properly sized logo) compared to most other PDFs 'rendered' from word processors or print drivers.


I just had to do tech support for someone whose 3 MB PowerPoint of text and some shapes became a 70 MB PDF that they couldn’t send through email anymore :/


The last time i wanted to do that, i noped the fuck out, generated latex, and pdflatex-ed the thing, to get the pdf (we're talking generic reports, text, table, graph, email to customer on a schedule).


I am actually interested in doing pure JS pdf processing. All of the web interfaces for PDF processing are server side — which means it’s tough to process large files. The dream is a purely JavaScript solution that never leaves the local computer. I’ve got a few client-side success stories that do fairly significant image generation through the canvas. So far PDF seems reasonably manageable through manipulating the text, but it’s not the best format.


> The dream is a purely JavaScript solution that never leaves the local computer.

Mozilla's pdfjs[1] project is a pure HTML/JavaScript solution for PDF rendering. This is the same code that ships in Firefox browser as well. This is standalone, AFAIK, it doesn't talk to a mothership.

[1]https://mozilla.github.io/pdf.js/


It’s not quite what I’m looking for—more of a viewer.


I thought I found a nifty trick by using OpenOffice to create a form with a pre-filled value. I decoded the PDF using pdftk or one of the free tools, and then modified the value. Nope, that caused some kind of cascading/checksum error.

Ended up just making the app generate HTML before calling wkhtmltopdf.

The PDF spec is insane! But like all things, what you get out of Word/OpenOffice is 100x more complex than if you wrote it yourself, which is indeed doable.



Only Forward

...a wonderful novel along these lines. They only get shot and kidnapped though. Nothing so bad as PDFs.


I didn’t get it. Are you implying that coding PDF is an onerous task?


Check out the spec for PDF and you will understand.

I will be extremely surprised if anyone (besides Adobe) has implemented 100% of it.


I’d be surprised if Adobe has implemented 100% of it. With a format this complex, there’s bound to be discrepancies between the spec and the code they have.


Reading PDF files is certainly a nightmare. But you can easily produce a valid and simple pdf file just by printf-ing whatever needs to be printf-ed. There's an ugly header and the rest is essentially your text.
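For instance, here's a minimal sketch of that printf-style approach: write the objects, remember each one's byte offset, then emit the xref table and trailer from those offsets (no escaping or compression; the object layout is illustrative only).

```python
def minimal_pdf(text="Hello, PDF"):
    """Build a tiny one-page PDF by hand. No escaping of ( ) in `text`."""
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode()
    objs = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
         b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>"),
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(stream), stream),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for i, body in enumerate(objs, start=1):
        offsets.append(len(out))            # byte offset of object i
        out += b"%d 0 obj\n%s\nendobj\n" % (i, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objs) + 1)
    for off in offsets:
        out += b"%010d 00000 n \n" % off    # offsets written as 10-digit ASCII
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objs) + 1, xref_pos))
    return bytes(out)
```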


OK, they added various ways of data compression, but PDF is, basically, a text-based format.

As far as I know, any PDF can be losslessly converted to an equivalent PDF that can be edited in any text editor, even Notepad. And yes, you could fill in the forms there, too (if you were stubborn enough)


It sounds like you either know a lot more than me or a lot less than me. The PDFs I've dealt with don't store text as strings, they store it as individual characters. This left me having to write a heuristic based algorithm to group the characters into words, words into lines, lines into paragraphs, paragraphs into columns.

Again, as far as I know, there are no heuristics good enough to get that right for all values of PDF.


He probably knows a lot less than you, because there are absolutely no requirements for PDFs to be in text format and most aren't. The "text" he is editing could render to completely different characters depending on how the PDF document was created.

The default MacOS PDF printer will actually remap the font cmap making born-digital PDFs where the "text" is something else entirely (say "$" maps to "a").
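A toy illustration of what that remapping does to extraction (the mapping itself is invented; real files carry it in the font's CMap):

```python
# The content stream stores character *codes*; only the font's code->glyph
# mapping gives them meaning. With a remapped font, a viewer draws the
# right glyphs while a naive text extractor reports the raw codes.
cmap = {"$": "a", "%": "b", "&": "c"}          # hypothetical remapping
stored = "$%&"                                  # codes in the content stream
rendered = "".join(cmap[ch] for ch in stored)   # what the viewer draws
assert rendered == "abc"
# An extractor that ignores the CMap would report "$%&" instead.
```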


> The default MacOS PDF printer will actually remap the font cmap making born-digital PDFs where the "text" is something else entirely (say "$" maps to "a").

What? Why!? I've heard of doing that as a form of DRM, but I can't imagine Darwin defaulting to doing that.


I never dug deeper into it, so I am not aware of why it does that or if it's a specific version or whatnot, but take a PDF from which you can extract the text (with pdftotext/pdfbox for example). Open it in the document viewer and "print" it to PDF. If you extract the text again it is not readable anymore.

This wouldn't be an issue if it was a conscious choice, but when I parsed a lot of born-digital PDFs we ended up with a lot that were like that from various source. Try explaining that...


Could it be “compacting” the fonts? So if U+0000 to U+007F aren’t used at all, remove those glyphs and set U+0000’s glyph to be what was U+0080? Yes, I know NULL doesn’t have a glyph, but I hope that gets the idea across.


That would be my personal guess yes, as there are other ways to "protect" PDF text like curve-based rendering.


Where did I say PDFs have to be in text format? I said every PDF has an equivalent text-only representation. The format, like PostScript and EPS, started as text-only. Compression, making the files binary, was added as an afterthought to make files smaller (much smaller). If it were binary from the start, its designers would not have made the table of contents at the end waste bytes by writing offsets in ASCII.

See http://blog.idrsolutions.com/?s=%22Make+your+own+PDF+file for more info on how to write PDF by hand (also shows why I said you have to be stubborn to do that)


"OK, they added various ways of data compression, but PDF is, basically, a text-based format."

You are trying to move the goal post here. The above statement is simply untrue, PDF is not a text based format and it's that simple.


More: go here and download the PDF spec: https://www.adobe.com/devnet/pdf/pdf_reference.html

Look at Chapter 3, Syntax. The code is all text based. We are not talking about the visible characters in a PDF viewer, but the code of the PDF file itself.


Oh, true. I misread, and also didn't know that. Thanks for the info!


Going by your original premise though, just so you know, the reference does indeed have examples of text as strings, e.g.

BT /F13 12 Tf 288 720 Td (ABC) Tj ET

This can be extended to include spaces so you can essentially mark up entire lines of text at one time. What it can't do is cohesive paragraphs and flow/wrap, you need to use the relative positions to work out what text is in one block (and usually I'd defer to something like pdftotext for simple cases).

Laying out individual characters is common though. It's probably due to kerning concerns.
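Expanding that example slightly (operand values are illustrative): a content stream can show whole strings with `Tj`, or arrays of strings and kerning adjustments with `TJ`, which is where per-character positioning usually comes from.

```
BT                      % begin text object
/F13 12 Tf              % select font F13 at 12 points
288 720 Td              % move the text position
(ABC) Tj                % show a whole string
[ (A) 120 (BC) ] TJ     % show strings with an explicit spacing adjustment
ET                      % end text object
```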


That’s news to me and Bluebeam’s PDF search feature! It turns out you can make PDFs (this usually happens with architectural drawings) that consist purely of images and are not searchable, and therefore you are wrong.

I silently thank every architect that provides searchable PDFs, it makes my job way easier


GP is right. The code that makes up a PDF is text-based. Those images can be encoded in the PDF file using the ASCIIHexDecode filter, i.e. as editable ASCII text code.


GP uses the phrases "losslessly convert to text" and "fill in the forms". I think you've misunderstood them, they're clearly talking about the display text, and not the code that comprises it.


Yes, just use a hex editor and all data is text-based.


Only the "binary" bits like fonts and bitmap images would need to be hex (and then only so Notepad doesn't mangle them). Everything else in an uncompressed PDF is already text. Go here and download the PDF spec.: https://www.adobe.com/devnet/pdf/pdf_reference.html

Look at Chapter 3, Syntax (and the rest of it, really).

I literally have scripts that use bash and sed, or Python, to modify PDFs by editing the text code. Doing it in Notepad is possible but tricky, as there's a table of object byte offsets near the end that it's easy to mess up by inserting a character.


I'm the author of Polar (https://getpolarized.io/) that uses PDF.js as its PDF backend.

This is a somewhat big update for PDF.js, which is kind of cool given that they haven't been updating it as aggressively as they usually do over the last year or so.

It's a bit frustrating to work with though. The entire concept of rendering a PDF via JS is fascinating but actually using the API has been a huge pain for us.

We've had to fork it internally and work on TypeScript bindings and other features to get it to work.

They seem to have a silly policy of only allowing developers to use a subset of the API, not the whole API itself, so that it doesn't look like PDF.js (which I don't understand).

A lot of the functionality just isn't available otherwise.


PDF.js dev here. I'm a bit confused about which part of the internal API you would like to use. The way I think of it, there are really three APIs in pdf.js: 1) the main thread API (api.js), which we base the version off of, 2) the code that runs in the worker, 3) the viewer components (web/*).

Quite a while ago, when we decided what parts of the API to version, we thought more people would want to use #1. Now that the project is mature, we could probably expose some more and base the version off of that.

As for the "so that it doesn't look like PDF.js", we don't limit the API because of this. That suggestion (which I don't totally agree with) came from what we saw people doing, where they'd copy the entire viewer, when it'd probably be better to just let the user's browser choose how to show the PDF.


> PDF.js dev here

I'm so sorry about being forward, but why the hell don't the vim keys (hjkl) smooth scroll? It's so frustrating. Is there an option to set it so? Using the arrow keys is so cumbersome.


here, I'll make it easier for you to contribute to the project by providing you with the lines you'd need to update:

https://github.com/mozilla/pdf.js/blob/83e1bbea6e23db8744420...

https://github.com/mozilla/pdf.js/blob/83e1bbea6e23db8744420...


Brilliant


Because not everyone uses Vim?


PDF.js already supports h/j (and p/n) keys to page up and down. (I added them years ago. :) I think GP is asking for the keys to scroll the page by smaller steps instead of page up and down.


I've been working on an internal tool for my company using the same library. It's saved me a ton of work, but my experience has been similar to yours. I've even had to lock in to a much older version to avoid putting a lot more work on my plate, since the API seems to have changed a fair bit (well, between it and JSDOM, which I am using to do some rendering on the server). And like you, I've had to write a bunch of the bindings/definitions myself or just reduce them to nil (declare module "yadda/yadday/thing" as any), which is thankfully permissible since it just needs to be built once and run "forever" with near-zero need for feature additions, etc.

All to extract images in a routine fashion.

Just the same, I'm still immensely thankful they've published the library as OSS.


I'm benignly curious how pdfimages explodes in your use case.


Unfortunately, it and Poppler, and Imagemagick, etc were all off the table. I was confined to running everything within a Node instance. Couldn't make calls out to command line tools. I tried probably 20 ways of using Poppler-based libraries to no avail.

It definitely would have been the better performing route, and simpler to implement. As it is, since the app is low traffic for actual processing of the images it doesn't matter too much, thankfully.

This library was helpful: https://github.com/ScientaNL/pdf-extractor

I haven't had time to do it cleanly, but I should contribute back with the types I wrote after cleaning them up...


Looks cool! On a sidenote, I've always been curious with product sites: what's your metric for including other orgs under "Used and Trusted by Top Organizations"? How do you know they use / trust it?


Do you have, by any chance, any good resources on PDF.js? The README on GitHub is OK, but it doesn't really cover what the workers are supposed to do or provide a useful mental model for the architecture of the whole thing.


Hey! Polarized looks pretty cool. Question, has this feature (forms) been merged into the public pdf.js master yet?


If you are an Emacs user, try the lesser-known pdf-tools mode. It's amazing.


Sheesh, what's with the hate for a generally all-round useful feature in an Open source browser? The last thing I want is to have to install 3rd-party software on my machine and have my browser be held hostage to it just to view PDF documents on the web. Being able to fill them in is a very useful feature and the in-browser PDF readers are still way less bloated than most other plugins.


Yes, like it or not PDF is a de facto standard of the web, in the same way that Flash was nearly a de facto standard before the industry-wide decade-long effort to kill it. A browser that doesn't support PDFs is as lacking in the eyes of users as a browser that doesn't support PNGs.


If Flash was rendered natively in the browser, sandboxed and across different browsers, and with high enough performance/low enough battery impact, it would have stayed.

There were efforts similar to PDF.js to run Flash content using JS but they were never able to tick all those boxes.


https://beta.rive.app/ (OSS runtime, xplatform, native rendering)


I don't agree.

PDF is fine to be some binary blob to download just as most other binary blob formats are.

Would you expect to have .exe files being directly interpreted by a browser?


> Would you expect to have .exe files being directly interpreted by a browser?

no, i wouldn't. and yet here we are: wasm.


Yes, this is a nice feature added to a basically-reasonable implementation of a PDF viewer. I think the objection is that that PDF viewer should be an actual independent application, not baked into a browser that already is too many things to too many people. It's like Chrome including a basic antivirus function (https://support.google.com/chrome/answer/2765944?co=GENIE.Pl...) - yes it's useful, yes I trust it more than a lot of AV products, but no I don't think it's reasonable to bundle it into the program that's supposed to be here to render web pages for me. (Similar arguments, to varying degrees, are made against WebRTC and Pocket)


I really don't see why it should be an independent application. I mean it's not like we expect a PNG viewer or HTML5 video viewer to be a separate application in a browser. Being able to view (and in this case fill/interact with) PDFs is pretty much a basic necessity on the web. Beyond the core HN crowd, almost nobody cares to have a 3rd party application that they have to install to view PDFs in their browser. Having a lightweight and secure PDF viewer that is also not made by some 3rd party company that could be collecting any amount of data on you is a good thing in general.


> I mean it's not like we expect a PNG viewer or HTML5 video viewer to be a separate application in a browser.

PDFs are generally an actual document, separate from the site they're on. If images and videos weren't a part of the web pages being viewed, I would be quite skeptical of including them in the browser. I mean, there are JS viewers for STL files (https://www.viewstl.com/) - should browsers include a 3D modeling environment?

> Beyond the core HN crowd, almost nobody cares to have a 3rd party application that they have to install to view PDFs in their browser.

See, I have the exact opposite experience; I've had less-technical family complain to me they were annoyed at Firefox because it stopped just opening PDFs in Adobe and forced them into a crippled slow viewer inside itself. Unfortunately, I can't tell which of us is in a bubble.

> Having a lightweight and secure PDF viewer that is also not made by some 3rd party company that could be collecting any amount of data on you is a good thing in general.

That many PDF viewers are awful is an argument for making a better PDF viewer, but not for baking it into a browser.


> PDFs are generally an actual document, separate from the site they're on. If images and videos weren't a part of the web pages being viewed, I would be quite skeptical of including them in the browser.

See, I don't really agree with that because to me, PDFs are a pretty core part of content on the internet that users browse to via their browser. Pretty much every restaurant makes their menu available on their website as a PDF document. Almost all users will interact with PDF documents while browsing the web at some point or the other. Otoh, a tiny fraction will even know what an STL file is, let alone care about opening/viewing one. So that comparison really isn't a fair one.

> That many PDF viewers are awful is an argument for making a better PDF viewer, but not for baking it into a browser.

That's a bit of an odd statement. If anything, it proves exactly why this is a good move from Mozilla. The PDF standard has been around forever, and yet there is a dearth of free, high-quality PDF viewers that aren't bloated or filled with ads or spyware or trying to get you to upgrade to a paid version of their software. So Mozilla has finally taken matters into their own hands and provided a pretty good, light-weight and integrated solution that will do the job for most users. Power users who care can still enable other software via the plugin system as their default PDF viewer. I'm not sure how you can blame Mozilla for addressing a very real deficiency in the state of available software for PDF viewing.


No, it's like Chrome including a PDF viewer.


I expect that people who object to bundling a PDF viewer in Firefox object equally to Chrome doing the same.


A note for Linux and macOS users, from someone who switched to Windows one year ago: it's maybe surprising, but it is a VERY REAL pain in the Windows world to find a PDF reader that allows you to edit forms, doesn't come with malware or adware, and has even just a modest UX!

So for sure you already have access to Evince and Preview.app, they already do everything you want, but Windows users don’t really have that luxury! Being able to say to users to just install Firefox if they want to edit PDF is really good IMHO, way better than the current situation.


Yeah, I never understood the PDF hate at all when I only used Macs. They were snappy, had smooth scaling, editing them wasn't too hard, and scrolling was smooth. It was a fine way to read documents or even books on a computer.

Then I had to use Windows. Good god, PDFs are horrible here. No matter what I use, every application is horrible in its own unique ways. Nothing can compare to the default software provided for free with Macs. I'd prefer to manage PDFs on my phone than my work computer.

If Mozilla can help people edit PDFs to any extent, they're doing the world a service.


Just to provide anecdata against the current comments: I totally agree with you. It's not particularly hard if you are pretty tech savvy, but for the average user you are pretty much stuck with Adobe. Or you can try your luck with the Edge/Chrome PDF form fill, but there's a decent chance it just won't bother saving your input. Adobe, meanwhile, is still full of extra crap that is irrelevant to everyday use. I think it still bugs people to update it all the time, but I don't use Adobe, so I don't know.


What comes with Adobe Reader?

I have it on my work computer and haven't noticed anything I would rate as particularly obnoxious, but I don't use it much.


If you aren't careful during the download/install process they will attempt to bundle various McAfee "security" products and an Adobe chrome extension into the install. Additionally they have made it less obvious where to download Reader instead of buying Acrobat.

Edit: Example, which you may not see if you aren't on Windows. https://musteat.org/images/hn/abobe_install.png


I read that as suggesting this is potentially a killer app for Firefox adoption in the enterprise.


Okular can "edit" forms. I have been doing this on Linux and Windows for a while. Not the most usable but it works. What I can't do in Okular, I do in Gimp.

I will use Firefox for editable form pdfs but for those that don't have editable forms, I will continue to use Okular/Gimp.

I actually stumbled across the ability to edit forms in Firefox only recently. I was like... What? This is amazing! For some reason the pdf i clicked on opened in Firefox and yeah, surprised.


As a PDF "editor" you may also like Libreoffice Draw. I was surprised how well it can work with PDFs!


And IIRC it's available on the windows store. It probably has msi as well.


Okular works for it in KDE on Linux. I wouldn't trade KDE for Windows :)


The least problematic options have been Microsoft Edge (already on the system) and Adobe (more problematic).


Edge can edit PDFs? I’m learning something today!


eh? acroread is very easily found.


If you've heard of it, sure. If you know of it, sure. If you know it's better than the competing options available for Windows, sure.


I have a feeling this thread has a strong bias from highly automated valley life. In more provincial regions and even just much of Europe lots of forms have to be filled out and printed.

It is not something you have to do every day, but the existing solutions suck massively. You either have to use Adobe, which requires Windows (or Mac, I suppose) and your firstborn, or use some massively shady online service. So personally, I love this feature!

(And I also do not think that this will halt all other development at Mozilla like some comments here imply)


On a Mac you can fill in PDFs with the built-in Preview app. I like it.


And yet every time I go to my parents' they've somehow installed Adobe Reader for Mac. I've deleted that app like 6 times.


There's also the paid app PDF Expert which is generally excellent.


Isn’t evince capable of this and the default PDF viewer on GNOME?


I have had evince fail to render forms, but Okular (the KDE default) has worked pretty well.


Works fine for me.


GNOME - no. Okular, the KDE counterpart, works just fine.


Okular works on Windows. Or, at least, it used to, some years ago.


It's available on the Store. No need to log in to download it, either.

https://www.microsoft.com/en-us/p/okular/9n41msq1wnm8?active...


Tried it a few months ago, it works.


> In more provincial regions and even just much of Europe lots of forms have to be filled out and printed.

This is changing quickly with the covid pandemic. One of its silver linings is that I can't remember the last time I had to wait in line for one hour only to be told off by some exhausted bureaucrat about my missing grand-parents' birth certificate or whatever the hell they come up with. Take that, bureaucracy!


PDF support in Firefox is one of the most important additions in recent years. My gripe with Mozilla was that they're pursuing all these side projects when they really should be targeting feature parity with Chrome. That's the only way people will ever switch.


> My gripe with Mozilla was that they're pursuing all these side projects when they really should be targeting feature parity with Chrome. That's the only way people will ever switch.

A bit off topic from the post at hand, but my gripe was the opposite. The relentless pursuit of parity made them indistinguishable giving users no reason to switch (and taking dev time away from distinguishing features). Granted the pursuit of users instead of principles is its own folly that's hard to overcome when money is needed.


I'm curious what you notice is missing in terms of feature parity. I'm mostly a back-end developer (not diving into devtools very often) and switched a year ago. I'm much happier and haven't looked back.


I just want support for APIs, mainly. I get websites from time to time that just refuse to load. E.g.

* blank pages when trying to load an imgur gallery on v68 (esr).

* image uploading not working right on instagram and various other sites, either producing blank images or ones with weird lines.

* several teleconferencing / video meeting websites just don't work properly (not detecting hardware correctly, etc.)

I have to keep chromium installed so I can use these sites properly.


Please file bugs for these. Also, to the author of this comment: please file bugs for these. (I work for Mozilla and yet I'm pretty lazy about bothering to file these.)

https://bugzilla.mozilla.org/enter_bug.cgi?product=Web%20Com... if you're on desktop.


Can I just say, for the record, thank you for the work you and all the devs put into Firefox.


This kind of problem also happens when developers only test on Chrome. Or if they are using Chrome-specific features and don’t properly handle failover.


> feature parity

Like PDF support?


That’s what OP said, yes: features like PDF fill are essential, while things like Pocket are basically non-core side projects.


Pocket is an acquired company, the integration with FF has been minimal (it does less than the Chrome extension ;)) and I'm pretty sure it pays for itself.


On rereading I agree with your interpretation, but it's easy to read "all these side projects" as referring to the PDF reader.


Exactly. He meant this is the sort of thing they should be doing. Not Pocket and WebXR.


Search in PDF has been broken since this update. I have to scroll through the whole document so Firefox loads it all before I can search in it. I couldn't find a similar issue on Bugzilla. Anyone having the same problem?



I've seen the same happen, so I've filed a bug with my reproduction: https://bugzilla.mozilla.org/show_bug.cgi?id=1666575


I just tested PDF search (in an IRS PDF in Firefox 81 on Windows) and it works for me.

Do you see the problem in all PDFs? Maybe there is something unique to the PDF you are searching?


I've built fillable PDFs for a manufacturing business. Links to the PDF files are provided within the company website, and the files typically now open in the browser with varying degrees of reliability. Unfortunately, many people assume this is just another page of the website and that they should be able to interact with it like any other web form. Always fun trying to explain this.


Can we just step away from PDFs to a better standard? Every time I deal with PDFs or I have to on behalf of my parents it is a true waste of time and resources - there has to be a better way.


Sure you can: find an ideologically motivated tech billionaire, buy Adobe, release a new version of PDF and make the spec an inaccessible trade secret, aggressively legislate against anyone who attempts to implement it, start charging for Reader, increase the price by a compounding 2% every year, and put that money towards a foundation with a purpose of openly designing and implementing a better, freely-licensed replacement. I predict this would only take 20 to 30 years. :)


Sounds like you've been thinking about this. You don't happen to be an ideologically motivated billionaire who happens to think the best thing for humanity and return on capital is to rebuild the pdf spec do you? * fingers crossed


Well, no, we can't just do that. But it's nice to dream.


Well, yes, we can, but the outcome will be far worse. The naive imagine "something better." The real world will interpret "better" as 27 half baked alternatives, 2 of which will work on something other than Chrome running on Windows.


Agree with this - minus the naive part. I think you can dream and you can have "something better" but you risk all the good work that has been put in already which might put you in a worse spot. If you ran a hypothetical model on what the future outcomes might be the likelihood of something better is probably pretty small. That said, we can always dream.


OK sure, let's just do it then. Let's start... now. Is it happening yet? Can't we just do it?


One of the reasons PDFs are still so common is they do their job pretty well, i.e. accurately displaying documents.

Any alternative would need some very compelling reason to use it instead. Take Microsoft’s XPS, which I think is its closest rival. It is an open standard based on XML. It’s built into Windows and Office, and many printers support it natively, along with major software vendors, but I can’t think of a single time I’ve come across an XPS file online.


At the very least, you need a replacement that's technically better, works well cross-platform, has a layman-acceptable UI, and supports 99.999%+ of all the use cases PDFs currently supports. It also has to convert old PDFs into the new format.

Then, you have to worry about market share and acceptance.


Just in case you were looking for an actual answer to your question: "no".


Finally. It's a nightmare trying to fill out a PDF form on Linux.


Okular can handle basically everything for me, except for those Adobe-proprietary ones that require JS and all kinds of other dumb features that only Acrobat supports.


I recently made the switch to GNOME, as its multi-monitor support, fractional scaling and general Wayland support are excelled only by Sway. I sorely miss Okular!


Can't you still run KDE apps under Gnome, even with Wayland? I use a few. Some of them look better with the "QT_QPA_PLATFORM=wayland" environment variable.


Most KDE apps work not just under GNOME, but even under *gasp* Windows! I think Okular and some others are even in the MS app store.


Okular should work fine on GNOME, but you might need extra disk space for all the KDE dependencies.


That's the point. Apps which only use Qt, like KeePassX, are OK, but Okular would pull in half of KDE.


I just use Libreoffice Draw to add text into stubborn pdfs on windows and any on Linux. It's a good, free OSS way to get the job done, though not pretty.


I very rarely have any issues using evince. What PDF viewer are you using?


I purchased PDF Studio Pro and it works pretty well for me.


FINALLY! Now I can finally uninstall Chrome.

Of course, I do wish Sumatra supported filling forms. Then I could uninstall Firefox too! ;-)


Okular works quite well for filling forms in my experience :)


Many people need clickable links in PDFs more. https://bugzilla.mozilla.org/show_bug.cgi?id=454059


I would like to see a version that allows forms to be signed.


Microsoft Edge allows you to draw on PDFs and save them easily. I use it for signing all the time.


I was pleasantly surprised by this recently. Just worked using my touchscreen laptop. So rare on Windows.


I think he meant digital signature


In case he meant regular (drawn) signature, it can be done via Preview on Mac.

For a local web use, I built for myself https://formulairemagique.fr for this very reason


Preview.app is just so good. It’s my favourite default Mac app by miles.

It and terminal.app have survived the thing Apple does where they update applications and remove all the application’s power to achieve anything.


good job on the simple UI! I think it will prove useful next time I have a form to fill.


I actually like the pdf.js viewer enough that I use the chrome extension version on chromium. But I see it hasn't been updated in over a year now. Hopefully it will get updated!

https://chrome.google.com/webstore/detail/pdf-viewer/oemmndc...


Poppler (a library behind most of the modern PDF open source viewers) still has many[1] issues with PDF Forms...

[1] https://gitlab.freedesktop.org/poppler/poppler/-/issues?labe...


I just found out that this feature was coming last night, and I hadn't realized that today was release day! I did discover that if you want to enable it on Firefox 80, you can toggle `pdfjs.renderInteractiveForms` in about:config


Seems so absurd that filling a form digitally is breaking tech news in 2020. PDF in a nutshell.

Does anyone see a trend moving away from the PDF standard in recent years? Tried to look for data on it but found nothing.


Any chance Firefox will have built-in support for printing to PDF? There's a browser extension[1], but it was last updated 3 years ago. Seems the Chrome browser has had this feature for ages.

1: https://addons.mozilla.org/en-US/firefox/addon/print-to-pdf-...


It's not ready for release yet, but if you flip the preference `print.tab_modal.enabled` to true you'll get the replacement printing interface which has a "Save as PDF" pseudo-printer.


Does your operating system not support this natively from the print dialog?


On Windows at least, using the built-in PDF printer with Firefox results in text in the PDF file being converted to paths (not text). Huge file and you can't copy/paste. I've tried 3rd party PDF printers (PDFForge) and the result is the same, so I think it might a FF bug (or feature)?

Chrome's save-as PDF produces actual text. It's the main reason I still have chrome installed.


That seems.... odd. I am on Firefox on Windows and I print to PDF all the time using the Windows built-in PDF printer ("Microsoft Print to PDF"), without issue. In fact sometimes that printer is the only one that can get things to format correctly!

Something on your system might be interfering with the printing process.


There must be something strange in your particular setup, or maybe the behavior changes based on the page. Firefox 81, Windows 10 version 2004, multiple computers: printing this page of comments with the "Microsoft Print to PDF" printer results in a PDF of ~470KB with selectable text.


Can it handle Cyrillic and Japanese text fields? For example, Poppler hasn't been able to solve this problem for 12 years[1] already. You can use the files attached to the issue for testing.

[1] https://gitlab.freedesktop.org/poppler/poppler/-/issues/463


It's open source. Don't blame the developers that cannot read Cyrillic or Japanese. Blame those who can read it but don't contribute.


Will it support only the standardised kind of forms, or also the proprietary Adobe-only kind of forms? (Yes, there's two, and the latter are what Swedish administrative agencies use, so I'm forced to choose the “non-fillable PDF” option lest I get a file intentionally made unreadable to non-Adobe software.)


Only the standardized acroforms. XFA forms are deprecated anyway...

http://blog.pdfshareforms.com/pdf-2-0-release-bid-farewell-x...


ISO deprecating it won't actually improve things at all, surely? It's Adobe that has the power and created the problem.


Adobe was part of the standards making process...presumably it wouldn't have been deprecated if they wanted to continue using it indefinitely.

https://blog.adobe.com/en/publish/2017/08/08/taking-document...


In theory. In practice you can still run into XFA forms, especially from some governmental organizations.


Yup! I keep running into Canadian Government forms I have no way to fill in. They're all XFA.


Example? There are ways of converting XFA forms to Acroforms...


Fwiw, I converted this random CA secured XFA form to an unlocked Acroform in a few seconds.

From https://www.canada.ca/en/revenue-agency/services/forms-publi... to

https://slack-files.com/T02FQ9S94-F01BA4CS9QA-a692aafd70

Only difference I can see is the "Clear Data" scripted js button no longer works.


How do you do this exactly? Would be helpful to know


Unlock the form in one of multiple ways (easy to google). Convert to AcroForms using the extract function in Acrobat Pro.

https://imgur.com/a/3mAi3l0

Will remove XFA "capabilities", but otherwise works...


This is really helpful, thank you! It will make my citizenship application a lot easier :D


You're welcome!


This is huge. I've felt like an outsider for years here because the gov uses a lot of online forms in PDF.


Today I took a screenshot of a PDF, uploaded it to an OCR service, and copied the result into a doc.

The PDF was text-based, but every time I copied something it added millions of new lines and hyphens and extra text that wasn't shown on the page.
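The stray newlines and hyphens come from PDF text being stored as positioned fragments with hard line breaks, so a copy picks up the layout rather than the logical text. A hedged cleanup sketch (a heuristic only; it will wrongly join words that are legitimately hyphenated at a line end):

```javascript
// Heuristic cleanup for text copied out of a PDF: re-join words that were
// hyphenated across line breaks, and collapse soft line breaks to spaces
// while preserving real (blank-line) paragraph breaks.
function cleanPdfText(raw) {
  return raw
    .replace(/-\n/g, "")          // "exam-\nple" -> "example"
    .replace(/\n{2,}/g, "\u0000") // protect paragraph breaks with a sentinel
    .replace(/\n/g, " ")          // remaining breaks are just layout
    .replace(/\u0000/g, "\n\n");  // restore paragraph breaks
}

console.log(cleanPdfText("an exam-\nple of copied\ntext.\n\nNew paragraph."));
// prints "an example of copied text." and "New paragraph." as two paragraphs
```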


I think one of the first things I do after a clean Firefox installation is to set PDFs to be opened with the Windows default reader. Reading (and mostly searching text in) a PDF in the browser is terrible.


Does anyone else find the embedded viewer not usable for copy/paste situations where keeping the format is important? I always need to save the file and open it with Adobe Reader.


PDF form support still doesn't work very well -- it cannot export filled fields correctly, nor do they print correctly.


SubmitForm buttons, such as those from iText's PdfAction.CreateSubmitForm, still do not work.


Still waiting for the SVG backend to be fully implemented for high quality printing.


Can we just get support for math text? For years I accidentally print research papers from the browser only to have to open it back up in a non-browser PDF reader and reprint.

With that and form fill I basically don't need another PDF reader, which is nice.


Does this support digital signatures via signing certificate?


One Giant Leap for mankind! not /s.


Which PDF forms standard is this?


This is a nice feature to have!


About time.


[flagged]


Well, it's a PDF reader that doesn't come with a tracking package, so in a way—yes.


What's the problem with the Poppler-based ones? I've been producing (with LaTeX) and consuming (with Poppler/Okular) PDFs for a decade and never once have I had to worry about anything related to the format (including tracking).


Poppler looks great! But, I _just_ learned about it and I don't think that the majority of population, say, outside of HN knows about its existence, so it's good to have a fairly mainstream alternative available.

OK, Firefox is, sadly, far from being a mainstream browser nowadays, but still I suspect it has a larger user base than Poppler.


Which PDF readers contain tracking? Anyway, there are several open-source ones that don't.


Acrobat and Chrome come to mind.


> it's a PDF reader that doesn't come with a tracking package

Uh, what? Firefox supports javascript. PDFs support javascript. Javascript empowers tracking.


Firefox PDF support is actually Javascript.

https://github.com/mozilla/pdf.js


following that logic

- every browser that supports cookies comes with a tracking package

- electron comes with a tracking package

- every language interpreter, runtime or compiler is a tracking package.

- your OS can run tracking software, thus coming with a tracking package.

- Anyone carrying their phone comes with a tracking package

Hang on, did you just post something on the internet? Your HN account comes with a tracking package!


You say in jest, but this simple upgrade very likely improves the lives of more people more significantly than some billion dollar unicorns ever do.


And here I was thinking we were living in the future when I could print out a pdf, fill out the fields with a pencil, take a picture of it, then email it to myself, change the file type back to pdf, and send it to whomever requested it...


I think it was William Gibson who once stated something like, "The future is already here, it is simply unequally distributed...in that, some people just fill out PDF forms, while others have to print it out, fill it out with a pencil...etc...." Ok, maybe i'm remembering that quote inaccurately. ;-)


I just want to smooth scroll with vim keys (hjkl). Too much to ask? :/


here, I'll make it easier for you to contribute to the project by providing you with the lines you'd need to update:

https://github.com/mozilla/pdf.js/blob/83e1bbea6e23db8744420...

https://github.com/mozilla/pdf.js/blob/83e1bbea6e23db8744420...


How is that relevant to this thread?


> After entering data into these fields you can download the file to have the filled out version saved to your computer.

And then what? Fax it? Sounds like a missed opportunity to me. It would be nice if you can add a Submit button to have the data posted to the server, just like any other web-based form.


That would be nice if websites supported it. But in my experience, all the PDF forms I fill in have to be printed and then signed and posted...


I haven't printed a PDF to sign in years. Why don't you just affix a digital image of your signature to the file? Save it and email it back to whomever.


This, and in the rare circumstance where they only accept regular mail or faxes, I use HelloFax.


Many places don’t accept emails. Sure you can sign digitally and then print.


e-mail it, or save it for your records.


And what would the recipient do with the email? Type it in manually? You don't see any room for improvement here?


Chrome has had this forever, right?


That built in PDF viewer is another feature that could have been an addon. It's bloat which increases the browser's attack surface. It's completely unneeded given that just about every OS ships with some kind of PDF reader out of the box.


> It's bloat which increases the browser's attack surface.

AFAICT PDF.js is just another JavaScript application and thus as sandboxed as any other website.


It's a js application and thus less exploitable than your average C application with tons of unsound code, but IIRC it belongs to the class of "privileged js" layer that Firefox has, so has special rights that usual website js doesn't have.


The built-in PDF reader on Windows is literally opening the PDF in Edge... so not very good UX for Firefox, and also a good argument that browsers are expected to have PDF readers.


Thank Google - Chrome was the first browser to ship with a pdf reader, and people loved it. Now it's just expected that any browser should have a workable PDF reader built-in.


The alternative was installing an Adobe plugin with no sandbox, so it made sense at the time.


I would like to see Mozilla modularize Firefox more. Browsers are such huge beasts that contain everything imaginable plus the kitchen sink these days. It would be nice for these kinds of features to be add-ons that can be disabled or deleted if their functionality is not needed or desired, freeing resources for other use.

They can be part of the initial install so that Mozilla can provide the browser as they envision it, but be able to be removed for those who have other ideas of what their browser should consist of.

I don't know how technically feasible that is with their code, but it makes sense to me from a developer standpoint.


Is there something hard about fillable forms on PDF? Why have a PDF viewer at all if it couldn't fill out a form?


> Is there something hard about fillable forms on PDF?

In the sense of a “form” just being lines on paper that you can arbitrarily add some text to — no, that’s easy.

Likewise, in the sense of a “form” being some defined input regions that accept your keystrokes and turn them into new text DOM nodes in the PDF itself — easy enough. Though, unlike HTML, there’s no concept of an <input> tag that just has the semantics of accepting keystrokes and turning them into (persisted) input; instead, this all has to be done through scripting [i.e. writing event-handlers, or having some PDF authoring software generate them]; and there are several incompatible scripting languages for PDF that get used, some of which are proprietary with no open specification.

But, doing form validation? Or, worse yet, making one of those fancy PDF forms that auto-calculates fields like an Excel spreadsheet? Now you’re getting into the hairy stuff, because IIRC none of the open-standard PDF scripting systems provide these sorts of mechanisms, so these are inherently proprietary things.

And when I say “proprietary”, I mean “like old versions of Word or Photoshop, where each version emitted its own in-memory data-structures to disk without formal serialization; and it was the job of authors of future versions to write importers to deserialize whatever format resulted.”


While PDF is an open format on paper, in practice it's as proprietary as any ancient format. Supporting it in full is not trivial.


The real problem here is that, 20+ years on, printing to PDF is still a totally natural and easy-to-understand metaphor for a normal office desktop user; but producing HTML for the browser is still impossible for them.

If we simply had print-to-HTML functionality which resulted in a document identical to what you view onscreen while editing, PDF could die the death it deserves.

But HTML+CSS somehow manages to suck just as much for common usage, so it persists.


I wish epub would catch on for more than books. An epub is just HTML and CSS in a zip file, and a large part of the world population has a device than can load it and present it cleanly.


I don't know about you, but >98% of the PDFs I use are just for reading and don't contain a fillable form.

And implementing a PDF viewer is already a major undertaking; adding the form functionality complicates things even more.


I posted a link (above) to the app I built to solve that problem.

The vast majority of forms are indeed not "ready" for input, requiring users to jump through hoops to fill them in. And that work is done all over again by the next person.


I wasn't referring to any problem. Again, all those PDFs I wrote about are intended for reading only; no filling in, signing or other form of interaction other than reading letters that form words on a page is involved in any way, and that is entirely fine and how it should be, so both the PDF itself and a PDF reader that has no form-filling functionality would be entirely fit for purpose (notice the word "reader").


In that case, all is well indeed.

The problem I was referring to, is when one is expected to fill a PDF that was not built to be filled, such as the scan of a paper form.


Yes! PDF forms are amazingly complex. Text in PDF is very complex and the forms themselves are a kind of templated vector graphics. Multiply this by all the weird and corrupt PDF forms out there which Acrobat support and you have a challenging task.


“After entering data into these fields you can download the file to have the filled out version saved to your computer.”

What’s the use case? Printing out filled-in forms? But otherwise, who would want the PDF in electronic format? It doesn’t seem like a practical way for users to submit data.


Printing is one use case. There are also plenty of places that want filled-out forms uploaded, e-mailed, ... Plus it allows you to keep a copy of what you entered.


That's my point. Other than printing a nice looking form (which includes faxing), the content of the field data is hard to reuse. Searching PDF content on your own hard drive is problematic.

Are there utilities that extract PDF field data and submit it to a database? I'd be grateful to see examples.

What about field validation? The PDF may have some minor validation, but that's no substitute for the validation done in a DBMS.

If you want users to be able to save a nice looking form, you'd still want the data entered online directly into a DBMS. I'd offer a "download PDF of your input" as an option, for example.


Sure, it's a structured format, so you can totally extract the individual fields. AFAIK Adobe sells a server product that does that, but I'm sure there are competitors, and I have seen the underlying parsing as a feature in PDF libraries before.

That said, plenty of users of PDFs have a very paper-based/manual workflow still, and not the motivation and expertise to run and update an online form thing. Or they need to have the ability to handle odd inputs anyways, because paper forms have even worse input validation.

And from a browser/user perspective, the feature here is useful because people expect me to handle PDFs and do not provide nice web forms. They might have terrible reasons for doing so, but I still need to live with that.
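As a toy illustration of how accessible the field data is: in AcroForm PDFs, field names and values live under /T and /V keys in the object graph, so even a naive scan pulls them out of a simple uncompressed file. (Real PDFs need a proper parser, e.g. pypdf or pdfminer, to handle compression, object streams, and encryption; the sample bytes below are fabricated.)

```python
import re

def extract_fields(pdf_bytes: bytes) -> dict:
    """Naive scan for AcroForm text-field names (/T) and values (/V).
    Only works on simple, uncompressed PDF fragments; real files need
    a proper parser such as pypdf or pdfminer."""
    fields = {}
    # Match field dictionaries like: /T (name) ... /V (value),
    # without crossing a dictionary boundary ('>>').
    for m in re.finditer(rb"/T\s*\((.*?)\)[^>]*?/V\s*\((.*?)\)", pdf_bytes):
        fields[m.group(1).decode("latin-1")] = m.group(2).decode("latin-1")
    return fields

# Fabricated field dictionaries for illustration:
sample = b"<< /T (name) /FT /Tx /V (Alice) >> << /T (city) /FT /Tx /V (Oslo) >>"
print(extract_fields(sample))  # {'name': 'Alice', 'city': 'Oslo'}
```

From there, pushing the dict into a database is just an ordinary INSERT.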


> the content of the field data is hard to reuse.

A lot of people don't care, because they come from paper forms - where they have to manually retype everything anyway. For many, their "DBMS" will be an Excel sheet with a dozen rows. The more advanced types likely have some Adobe software that does all the magic.

Fillable PDF forms are really seen as a courtesy to users more than anything particularly useful to the emitter.


Well, if the form needs to be faxed (still a thing!) then having the filled PDF makes it easy to use an e-fax service. But I assume sending the file to another computer for printing is the main use case.


Last year, applying for jobs, most places had a PDF form; if you were lucky it was an actual fillable form too! So, filling the form and emailing it back is useful -- much better than trying to overlay text on a PDF background; far better than printing the form, filling it in with a pen, scanning, then sending.


I find the built-in PDF reader in Firefox to be bloat. It's OK, it works 95% of the time, but really I want to use a native PDF viewer.

Is there a version of Firefox that removes this bloat?

Given that Mozilla is very resource constrained, why are they working on features that aren't necessary?


And here I found it much less bloated than the other free desktop PDF viewers.

I think Mozilla's line of thought here is that PDF documents are widespread in the web, to the point where they are a de facto web document type. So it makes sense for a web browser to support them rather than calling out to a user's desktop program (though I assume you can configure it to do so instead).

There's probably a bit of "our competitors do it, so we have to too" in there as well.


I like that single-page PDFs stay in the browser. I don’t want to keep them; I just want to see them. Like any other web-page. I want to be able to hit back, or close the tab, and continue on with my day.

And I also like that I can preview long-form PDFs in the browser, before choosing whether to save them and read them “for real.”

Imagine if every time you opened a direct-linked JPEG image in your browser, it treated it as an attachment, downloading it and opening it in your external image-previewer app, rather than rendering it as a synthesized HTML DOM wrapper around the image. Wouldn’t you be annoyed by how cluttered your Downloads directory would get with random files you never actually wanted to save?


Given that Chrome/Edge also just added the feature, I would point out: All web browsers are using the same library for PDF handling, a feature in pdf.js ends up benefiting a lot of people.

And the reasons for not requiring an outside PDF reader are major: It's yet another likely-to-have-vulnerabilities program people need to install, then update. In most cases, avoiding Adobe programs on your PC is a good way to avoid a lot of vulnerabilities.


> All web browsers are using the same library for PDF handling

Chrome/Chromium uses PDFium, not PDF.js, so no. Not sure about Edge.

PDFium has been able to fill out forms for a long time. What’s new for Chrome is the ability to save edited PDF (as fillable).


If Chrome doesn't use pdf.js, then neither would Edge, which is a Chrome fork. My original comment may have been mistaken.


A Chromium fork could replace certain components if they so choose. The PDF rendering component would be one of the easier ones to replace.

However, I was able to confirm that Edge uses extension ID mhjfbmdgcfjbbpaeojofohoefgiehjai to render PDF internally[1], same as Chrome, so indeed it's using PDFium.

[1] The rendered PDF element would look like

  <embed id="plugin" type="application/x-google-chrome-pdf" src="..." stream-url="chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/..." headers="..." ...>
And here's the extension's manifest in Chromium source, where you can find the extension ID: https://github.com/chromium/chromium/blob/2baa2b094cdd60e980...


Given the amount of PDF exploits over the years and the habit of browsers to automatically invoke your PDF viewer of choice either as a plug-in or call out, they're an easy target.

Having a sandboxed PDF viewer that works 95% of the time is great. For those 5% circumstances where I am actively trying to view a PDF and it won't work in browser, I'll gladly go through the minimal effort to open it in an external viewer.


I find it extremely convenient. Also I know a lot of security issues in PDF viewers are effectively solved by running it in the browser's JS sandbox.


Firefox's sandboxing is incomplete or nonexistent in places (e.g. the GPU process is not sandboxed on Linux).


You could also disable Firefox's built-in PDF viewer and instead use an external PDF viewer that doesn't even support Javascript.
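(For what it's worth, I believe the switch is the `pdfjs.disabled` pref; the exact name is from memory and may differ between releases:)

```
# about:config
pdfjs.disabled = true
```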


Native PDF clients have had lots of security holes. In this case having the client written in JS means we can repurpose the battle hardened JS sandbox to also contain PDF exploits.


Not all PDF vulnerabilities involve JS though.


You misunderstand the argument the parent comment makes. It's not about JavaScript in PDFs.


I hope no one is listening to you.

I don't want to start explaining to my mother, over the phone, how to install and use the pdf viewer anymore :|


A lot of everyday users likely benefit from being able to fill out PDFs in the browser.


It was once an add-on, and it was once disableable. It may still be disableable, but I'm sure there's some strange procedure you have to go through to do it.


Why is Firefox spending all their money and goodwill on a piece of technology that should be done away with?

PDF is a dork. It's an accessibility nightmare with no obvious advantage over simple ordinary webpages. Somewhere in the comments below, it is mentioned that supporting PDFs is a non-trivial piece of technology. May be! Even steam engines have non-trivial technology under the hood.


> It's an accessibility nightmare with no obvious advantage over simple ordinary webpages.

It is easy to criticize something when you don't look back at the historical context through which it emerged. It has plenty of advantages over HTML but they're easy to dismiss if you don't have a use case for them.


> It has plenty of advantages over HTML but they're easy to dismiss if you don't have a use case for them.

Can you discuss some of the advantages? The only advantage that comes to mind is that Apple has built-in support for writing PDFs and that has a lot to do with Adobe rather than PDF being a better candidate.


I work for the US Federal courts, and I can assure you HTML isn't a sufficient replacement for PDFs for court cases. Evidence is filed as PDFs. Documents (PDFs) need to serve as a historical archive, and the ability to modify them would damage their credibility.


> ability to modify

How are PDFs any less modifiable than HTML other than requiring (widely available) specialized tools instead of a text editor?


Cryptographic signing is a core feature of PDF, but not HTML.
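For the curious, PDF signatures digest the whole file except a reserved hole where the signature bytes themselves get embedded (the /ByteRange mechanism). A rough sketch of that digest step, with made-up document bytes and offsets purely for illustration:

```python
import hashlib

def byterange_digest(pdf: bytes, hole_start: int, hole_end: int) -> str:
    # Hash everything outside the reserved signature hole, which is
    # what a PDF /ByteRange signature actually covers.
    covered = pdf[:hole_start] + pdf[hole_end:]
    return hashlib.sha256(covered).hexdigest()

# Placeholder "document" with a 16-byte hole between '<' and '>'.
doc = b"%PDF-1.7 /Contents <" + b"0" * 16 + b"> trailer"
start = doc.index(b"<") + 1
end = doc.index(b">")
print(byterange_digest(doc, start, end))
```

The digest is then signed with the signer's private key and written into the hole, so any change to the covered bytes invalidates the signature.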


Yeah, but does Firefox need to solve the use-case of a court system? Also, tangentially the solution to guarantee "tamperproof" archiving is in cryptography and that's not a feature of PDF.


No, Firefox doesn’t need to support the use case of a court system. That’s not what GP is saying. All we’re establishing here is that PDF is a useful format, and Firefox is supporting it.

Also, cryptographic signatures do happen to be a feature of PDF.


Now that I read my comment I see the issue with it.

What I meant to say is that Firefox should focus on implementing cryptographic signing over HTML then. And not a PDF viewer on the web--in that, enabling cryptographic signatures isn't tied to the format PDF per se.


PDF prints infinitely better than HTML, and it can be somewhat hardened against modification by average users.

If you think MSOffice users would prefer to output HTML over PDF, you don't live in the same corporate world I inhabit.


PDF's ubiquity is 100% that it printed the same (or close to same) on any postscript compatible printer. It's tech so old many in the industry ignore the reason it existed (and still exists). Every solution beyond PDF has also been either closed source (read Microsoft) or ignored. It's useful, that's why it exists. Yes, it's archaic, yes, it's hard to read for tech people, but for non tech people, it solves an issue that plagues the entire software industry: Standardization.


> Why is Firefox spending all their money and goodwill

I doubt "all" their money goes towards the pdf-reader bit. And tbh, I'd say nobody will really lower their goodwill towards Mozilla because they add features that a lot of people actually need.


There are lots of use cases for PDF where a web page is totally unsuitable.


Yes, maybe generally; but let’s talk about the specific case here — filling of complex PDF forms.

When a PDF has interactive form fields, calculated auto-populated fields, fields that are enabled/disabled according to the inputs of other fields, etc., the organization that created it (usually government or education) usually did so because they want you to fill it out using a PDF viewer; save it (which will persist the form inputs “into” the resulting PDF); and then submit the modified PDF file back to them. They want this, because they can use automated backend processes to extract the data from the PDF. They don’t want you to just print out the thing and fill it out. In fact, many such “fillable” PDFs start off in a state with many of their form-fields disabled and voided, such that printing them out in that state would result in a form you can’t really write on!

So, at that point, why didn’t they just make the PDF a web page? They’ve essentially reinvented a web form, but with extra steps. The only benefit a client gets is the ability to edit and save the form offline (but that can be done in a browser, too, with local storage); and furthermore, the ability to treat the resulting filled form as a file, moving it around before you submit it. But the cases where you need that are very niche, compared to the cases where you can just direct employees to your Intranet portal.


1. A webpage form requires a server to be up and running, which requires an IT person to manage it, separate from the dept making the form. PDF forms can be created by a person given the right tools (I think Word does it)

2. IT person + webserver costs have to be included in the budget somewhere, which can be a big problem.

3. The webpage form can fail, and the support for it has to be provided by the IT dept. If the PDF form fails, dept can handle it on its own, and will often accept a filled+scanned print out of the PDF form.

4. Adding to the point above, PDF forms degrade gracefully. If they don't work, or the internet doesn't work, or someone is on holiday, you can still print, fill, and hand them in in person. Webpages can degrade catastrophically, where your whole dept grinds to a halt while the IT person tries to fix the problem.


Re: all four of your points — see my sibling post. I'm not talking about encapsulated-PostScript "Print and Fill" forms (which do certainly degrade gracefully), or even open-standard PDF "Fill and Print" forms (which degrade gracefully if you don't set them up with a bad default state where there's big "N/A" text over all the disabled fields until you fill in other fields.)

Instead, I'm talking about the PDFs you can basically only load in Acrobat (though, other PDF viewers do try to render them, to varying success) that actually do data-binding to some remote database; do XHRs to submit the form data on success; do "online" onBlur-XHR-esque field validation; generate new output PDFs using scripting, from scratch when you ask them to save/print; etc.

These are applications, not documents. You can't print them. You just use Acrobat as a glorified application host to fill and submit them. (You can press Ctrl+P to get Acrobat to request to the loaded PDF application that it perform some scripted action to generate a print output. This may or may not do anything, depending on how the PDF was created. It usually just pops a "Printing is not implemented for this form" box. It certainly won't work in non-Acrobat PDF viewers.)

When other PDF viewers say they don't support "fillable PDF eForms", these are the things they're talking about. They usually support "Fill and Print" forms just fine, because "Fill and Print" forms are a somewhat-sane format, rather than being a competitor to Lotus Notes.


I understand better what you are saying now. I don't think I have ever seen any PDF forms that require an internet connection. The Canadian visa application forms have inbuilt validation code that checks the form, and once you upload it, I believe the data is extracted into a database.

The benefit of these forms is that the validated form that you submit online is actually printable. Which means that what you see on your screen/paper is pixel by pixel identical to what Canada receives, and therefore _legally_, there is no confusion about what was communicated between Canada and the candidate.

Webforms are not as strongly accepted as such by courts. Because they have to be manipulated further before being printed.

I have read a bunch of your replies, and you are thinking of all the technical reasons why webforms are better than PDF (you are right in that), but PDFs have legal and operational and budgetary advantages, that are more relevant to various organizations.


> In fact, many such “fillable” PDFs start off in a state with many of their form-fields disabled and voided, such that printing them out in that state would result in a form you can’t really write on!

I have never seen this. Do you have an example? Every use of fillable PDFs I have encountered is a use case where submitting a handwritten form is still an option.

> The only benefit a client gets is the ability to edit and save the form offline (but that can be done in a browser, too, with local storage); and furthermore, the ability to treat the resulting filled form as a file, moving it around before you submit it.

I have yet to see a web form that actually saves a readable, properly-formatted, self-contained, easy to access, fully-offline copy.

> But the cases where you need that are very niche, compared to the cases where you can just direct employees to your Intranet portal.

This is not a trivial need; most forms sent as fillable PDFs need to or should be retained for some period after submission. Also, I don't know what "employees" and "Intranet" has to do with anything.

You are also missing the use case where a form legally requires a live signature from one or more parties and need to be printed, even if just to scan and return. I recently had to do this for some insurance paperwork.


> You are also missing the use case where a form legally requires a live signature from one or more parties and need to be printed, even if just to scan and return. I recently had to do this for some insurance paperwork.

My company has to do this for one state government. They required the signature to be written in black ink. It is a PITA to do since we all have digital signatures set up. But nope, this state government required a written signature.


The Canadian visa application form is an example.


> I have never seen this. Do you have an example?

I don't have one on-hand, no. But I've certainly had to fill them out in the past. IIRC an especially-bad one came in the form [heh] of a student-loan application for the college I attended. It was essentially a Hypercard stack in the guise of a PDF.

Here are some early Adobe marketing materials (as a PDF, because of course it is) talking about the advantages of "eForm Solutions": https://planetpdf.com/planetpdf/pdfs/pdf2k/02E/ldefurio_pdff...

It sounds like every PDF form you've ever dealt with is what Adobe, in this brochure, calls a "Type 1: Print and Fill" or "Type 2: Fill and Print" form. But Type 3 and Type 4 forms do exist in the wild! (They're not often created any more; most of the ones that exist now are from around a decade or two ago, when Adobe was really pushing this idea.) Creating such forms was basically the point of Acrobat as a software product.

When PDF viewers (e.g. Apple Preview) say they don't support "PDF forms", they're not talking about Type 2 forms. They usually support those just fine. They're talking about Type 3 and Type 4 forms. And more specifically, the ones that use Adobe's proprietary AcroForms data-embedding system, rather than the open-standard XFA data-embedding system.

(I could swear I saw an HN post about the horrors of AcroForms once, but I can't find it now.)

> I have yet to see a web form that actually saves a readable, properly-formatted, self-contained, easy to access, fully-offline copy.

To be clear, that was what I meant by the second qualifier, "as a file." Browsers support persisting the state of the form. Just, not as a file. They persist the state internally, when the form's author does the client-side Javascript work to enable that.

For the use-case where the user wants to stop filling out the form for now (e.g. because they don't have some required information on-hand), and then come back to it to finish it later, in-browser persistence works perfectly well.

Even cleaner, though, is just building a web-form as a wizard, where fields are submitted one-at-a-time, and you can also freely navigate to previously-filled "steps" to change your answers. That doesn't even require JavaScript; just pure 90s HTML-generated-on-the-backend. Most government sites that thought PDF eForms were a good idea, are now falling back to this approach.

> Also, I don't know what "employees" and "Intranet" has to do with anything.

Secure installations. The main use-case for fillable PDFs (as can be seen in Adobe's marketing brochure, where "government" is the core client) is a case where public or cloud solutions just aren't tenable, i.e. in secure government/military/etc. installations, where the workstations are air-gapped from the public Internet. In such a case, PDF forms can still be sent around via a local non-Internet-routable email server, for the workers there to fill in.

Today, this need can be served just as well by setting up a non-Internet-routable web portal for those same workers to use. But back in the 90s and 00s, "Intranet web portals" were a fancy thing only the most forward of IT bigcorps had on offer. They had Intranets, for sure, but they weren't hosting web-apps on them.

So, what did they do instead? Well, Adobe had two main competitors in the "eForm" market:

• Lotus Notes form documents, connecting to a Lotus Domino database server;

• Microsoft Excel sheets that use VBA to data-bind to an accessible Microsoft Access database file sitting on an SMB network share.

None of these "forms" were hand-submittable. They're all little self-contained interactive applications, that happen to look like forms.

AcroForms did have the fancy property, though, that the AcroForms application-PDF could generate or export a bog-standard output-PDF representing the filled form. But that's not actually a modified copy of the source PDF. That's the PDF using scripting to generate you another PDF, from scratch.

------

To be clear, I agree with all the stuff you're talking about; those are all valid use-cases for "PDFs" (i.e. encapsulated PostScript containers.) But they're not what I mean by "PDF forms." I mean the Type 3/4 forms referred to above. There's no reason, in the modern era, that one would implement one of these Type 3/4 "eForm solutions", instead of just putting up a webpage.

If you need an e-signature at the end, have them fill out the web form, then generate a raw PostScript PDF representing their inputs, and let them sign it by dropping a signature vector image on the dotted line in any standard PDF viewer.


The use case you're describing wasn't feasible until about 20 years after PDFs were introduced. Web Storage isn't that old, has only recently become widely deployed, and in a lot of cases is disabled for security concerns.


As someone working on formats, I disagree with your generalization. But let's get into specifics. List the things about PDF that you believe can't be done with web pages?


It's easy to dismiss things in their entirety and then require someone else to "prove you wrong". Why don't you prove you're right instead?

Why don't you list all of the things that PDFs can do that can also be done with web pages?


Sure, here's my list: everything + more.

There's nothing a PDF can do that a webpage can't. In fact there are hundreds of things that a webpage can do but a PDF can't, including form fields, input fields, and seamless form submissions.

Webpages can also do this: https://bubblin.io/cover/official-handbook-by-marvin-danig#f...

Disclosure: It's my work.


Anyone can create a PDF form to capture data and signatures, email it to someone who can then fill it out offline, and then email it back. That's not something easily done with a webpage, and it's not something my mom can do.

PDFs are easy to make and easy to work with. Web pages aren't.

Your work is impressive, but why would anyone want that? Do you envision lawyers putting all their legal contracts into fancy flippy books?


> Do you envision lawyers putting all their legal contracts into fancy flippy books?

Someone will have to solve it for the lawyers in a not so 'fancy consumerish' way. Point is that it is possible to do that, and Firefox shouldn't be solving this problem using an ancient format and a layer of cruft in between.


But we have PDF today and everyone is already using it. What does this bring to the table that's improved over PDF from a user's perspective?

You can be a developer who enjoys the smell of your own farts all day long but that doesn't mean anyone else wants to smell them.


Well, if you sit so close to someone's opinions or comments on the Internet, wouldn't you have to smell whatever whether you like it or not? ;-)


Wish I could look at your work but my browser doesn't support javascript. I wonder what it is about.


Those books do work without javascript. Go troll someone else.


Distributing a document with functioning kerning and embedded fonts that works offline


Service workers + @font-face + the font-kerning property of CSS3. Done, next.


I think you missed the point of distributing. I’m never going to let you email me your serviceworkers because I can’t forward this document to anyone without relying on you hosting a server / not changing the content.


Oh, I'm all in for email/attachment-based distribution. Just not with Firefox supporting it in the web browser, where you'd almost certainly require someone to host a server and have to trust them that no changes have been made to the content.

That was the entire point of my comment at the top.


Going to a particular page and only having to render that one page. Large HTML documents are unwieldy.


The modern web is slow for a lot of reasons, but none of them are about rendering lots of static html. Anyway just break things up into multiple pages if necessary.


PDF is widely used and supported. And FWIW, edge does support it.



