Show HN: Pdf-Bot, an API/CLI for Generating PDFs Using Headless Chrome (github.com/esbenp)
235 points by esbenp on Aug 22, 2017 | 79 comments


Pretty cool! Check out Google's headless browser API Puppeteer too, they provide a few really useful functions for doing stuff like this.

https://github.com/GoogleChrome/puppeteer/blob/master/exampl...

Really easy to work with. I used it to build a simple CLI that generates device screenshots of a webpage by modifying the user-agent and resolution to match each device.

https://github.com/umpox/generateDeviceScreenshots


Yeah, I think Puppeteer is a very cool project. Unfortunately, it came out literally one or two days after I finished the initial version of pdf-bot. Maybe I will incorporate it soon! :-)


Cool project


Really nice work.

I've had a great experience working with Headless Chrome to convert webpages to PDF for my side project EmailThis (https://www.emailthis.me).

It uses Puppeteer by the Chrome DevTools team - https://github.com/GoogleChrome/puppeteer.


Appreciate it. EmailThis looks very cool! I have considered using Puppeteer for pdf-bot, it came out right after I finished it :-)


Cool project! I think I am going to start regularly making use of it. Does it email you a PDF attachment, or does it just send an email with the contents of the article within it? When I attempted to use it I did not see a pdf attachment. Regardless/either way, really cool project.


Thanks. If it is unable to extract useful content from a page, EmailThis will save it as a PDF and send it as an attachment.

If you try saving any discussion website (HN, Stackoverflow, Reddit), you will get the PDF.


Just a heads up -- your site emailthis.me is down.


Sorry about the brief downtime. It's back up now.


All I need now is CMYK support in Chrome and I can do HTML-based, print-ready PDF rendering. That would be quite an upgrade compared to my current options.


I work on Chrome's PDF generation.

Please file a bug report if you think this is an important feature.


I'll both upvote this and track the bug report, if linked.

Yes, CMYK support is a very tempting feature for the whole print industry.


What's up with the built-in queue? I feel like that belongs in a different script. For one, the built-in Node.js queue is useless in a multi-server environment. You'd still need a distributed queue, since the built-in one is local to one server/thread. So the built-in queue becomes redundant for any kind of solution that needs to scale.


I was first wondering about all the complexity in the API as well (why a built-in queue, webhook, retry policy and storage interface when the actual transaction I'm interested in is just "url -> pdf blob"?)

However, I think this is necessary if you want to fit it into a microservice with a REST interface. For REST, I think the usual expectation is that a) the request returns quickly and b) you can submit any number of requests in parallel. Given that loading a page into headless chrome, rendering it and generating a pdf is both resource intensive and time consuming, I guess you need some way to decouple that process from the interface.


Not the OP, but maybe he wanted to keep the install as simple as possible without requiring something like redis. I also haven't looked at the code, but a queue per process is still useful as long as the results can be accessed from any process. Not sure if this is the case. If not, you would seem to be right, in that subsequent requests after "queuing" the job could go to another process/server not aware of the queued job.
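
To make the trade-off in this sub-thread concrete, here is a minimal sketch in Python (not pdf-bot's actual code, which is Node.js) of a per-process queue: `submit` returns a job id immediately while a worker thread does the slow rendering. `JobQueue`, `render_pdf` and the in-memory `results` dict are hypothetical names chosen for illustration.

```python
import queue
import threading
import uuid


def render_pdf(url):
    """Hypothetical stand-in for the slow step (headless Chrome rendering)."""
    return ("%PDF-placeholder for " + url).encode()


class JobQueue:
    """In-process queue: submit() returns at once, a worker renders later."""

    def __init__(self):
        self._jobs = queue.Queue()
        self.results = {}  # lives in this process only
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, url):
        job_id = uuid.uuid4().hex
        self._jobs.put((job_id, url))
        return job_id  # the caller polls later or waits for a webhook

    def _run(self):
        while True:
            job_id, url = self._jobs.get()
            self.results[job_id] = render_pdf(url)
            self._jobs.task_done()

    def wait(self):
        """Block until every queued job has been processed."""
        self._jobs.join()
```

The `results` dict is the weak spot raised above: a follow-up request routed to another process or server would not see it, which is where a shared store or a distributed queue would come in.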


How well does it work with multiple-page PDFs? One of our banes is generating mixed text/image downloadable reports with sensible page breaks. To save time, we're actually doing those as docx files, with the bonus/risk that clients can edit the content before saving it as a PDF.


The holy grail of PDF generators, sensible page breaks. Never been done and makes peace in the middle east seem like an easy task.


PrinceXML does a solid job of it, and has for years.


Just use WkhtmlToPdf https://wkhtmltopdf.org and wrap a simple service around it.


That's one hell of a "just".


I did it in 2 days. It's not very hard.


DonnyV don't you know we need 15 million of the same things in technology :)


Wkhtmltopdf has basically not been updated in years aside from minor bug fixes. It has major issues the author has no plans to fix. I've spent hundreds of hours applying workarounds to legacy codebases. All that code could be refactored now. PhantomJS and wkhtmltopdf don't even support doing $(.htmlE).width() from JavaScript. Needless to say, this can complicate laying out the page. https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2419


Why are you doing layout code in JavaScript? If it can't be done using CSS then you're doing something wrong. This is being used to generate PDFs of documents with an exact width and height. Hard-code the width and height of your page.


wkhtmltopdf is a solid tool but it's mostly abandoned now. Rendering in headless Chrome is way faster and more accurate. Headless Chrome, though, lacks a lot of features that wkhtmltopdf provides, like headers/footers.


Headers/footers are 2 huge features. Plus all of the other small features wkhtmltopdf has. It may not have all the new bells and whistles but it works reliably.


There's CSS that's supposed to help with doing this manually, right? Or does your docx export handle this auto-magically? (If so, I'm interested in more details!)

https://developer.mozilla.org/en-US/docs/Web/CSS/page-break-...


CSS isn't really that flexible though, say you want a repeated header or footer on each page, you are going to have a real hard time getting it to work perfectly. And good luck if you then need to internationalize it and support different document sizes (A4 vs legal). It's not impossible, but it's a lot more work than it should be.

One of the best ways I've used to generate PDFs is by using a DOCX as a template, and replacing certain placeholders within the document (a DOCX is a ZIP containing a few XML files). It's great if you work in a corporate environment, as it's easy for non-technical staff to make it look exactly how they want and it's easy to update (just replace a file and check it works). You can use headless LibreOffice to convert DOCX to PDF.
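
A minimal sketch of the placeholder approach in Python, using only the standard library. The `{{key}}` placeholder syntax and the function name `fill_docx_template` are assumptions made for illustration, and the sketch only handles placeholders that sit in a single text run of `word/document.xml` (Word sometimes splits typed text across runs).

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_docx_template(template, values):
    """Return a copy of the DOCX bytes with {{key}} placeholders replaced."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == "word/document.xml":
                text = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the XML stays well-formed.
                    text = text.replace("{{" + key + "}}", escape(value))
                data = text.encode("utf-8")
            # Every other part of the archive is copied unchanged.
            dst.writestr(name, data)
    return out.getvalue()
```

The filled file can then be handed to headless LibreOffice for the PDF step, for example with something like `soffice --headless --convert-to pdf filled.docx` (the file name is illustrative).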


I was wondering why OP stopped at DOCX without taking it from there to PDF. The suggestion of a template document is a very practical tip; thanks!


Nice work !

Depending on your use case, I feel like you guys might be interested in http://weasyprint.org/. It is an open source HTML to PDF converter written in Python. It passes the Acid2 test and implements CSS Paged Media.


Always nice to discover another open source HTML rendering engine!

I recently discovered https://github.com/ArthurHub/HTML-Renderer, formerly known as https://htmlrenderer.codeplex.com/

PDF generation [...] 100% managed (C#), High performance HTML Rendering library


WeasyPrint seems awesome. Going to prototype with it later! Thanks


This is interesting :-)

I am currently using athenapdf[1] but I will have a play with pdf-bot.

[1] https://github.com/arachnys/athenapdf


Core developer of `athenapdf` here :)

I had a quick look at `pdf-bot`, and though we both rely on the same underlying technology (we are only just moving to headless Chromium; we were on Electron before), I believe we have slightly different ambitions with our respective project. But, I may be biased.

For example, `pdf-bot` seems to be tied exclusively to a specific converter, and storage backend. With `athenapdf` however, we are moving more, and more towards building a toolkit or rather, framework for other people to construct their own conversion processes (or even microservice)[0].

Consequently, we are working towards general abstractions like fetching, converting, and uploading, that can have different implementations (e.g. wkhtmltopdf, LibreOffice, Weasyprint, etc).

With our microservice assembly as well, we are focused heavily on ensuring we have:

1. Instrumentation, and metrics (which `pdf-bot` appears to currently lack)

2. Support for different retry mechanisms (e.g. retry using the same converter or retry using a different converter)

3. Support for multiple input MIME types

4. Synchronous API calls (`pdf-bot` appears to be mostly asynchronous, with batch processing, and callbacks)

5. Ease of installation (e.g. Docker), and configuration

We also have a CLI assembly[1] that can support custom JavaScript plugins[2] (e.g. Markdown -> PDF, Readability, etc). So you don't need to run a service or make API calls for conversions.

[0] https://github.com/arachnys/athenapdf/tree/v3/pkg

[1] https://github.com/arachnys/athenapdf/blob/v3/cmd/cli/main.g...

[2] https://github.com/arachnys/athenapdf/tree/v3/pkg/runner/plu...


Thank you for athenapdf and for rescuing me from the pains of wkhtmltopdf - I am a happy user. :)

My only small problem with it was the somewhat complex setup for using athenapdf-service with a new project (especially since I use docker-machine) but I have now mostly automated the whole thing.

Just out of interest - do you consider asynchronous an advantage (being a Node developer I generally love async very much)? Not that it matters to me - my needs are trivial for the service to handle.

Edit: actually I can see how async would make my life much more complicated for my simple use case - I would have to write something to track requests and responses rather than just looping through a bunch of URLs that need converting.


That's interesting feedback! Thank you :)

We actually went with Docker for the set up because it simplified dependency management tremendously, and it allowed us to deploy on platforms like Kubernetes, Swarm, and ECS. As a plus, it gave us some confidence that if it works for us, it should work for others (obviously, we have come across cases where Docker behaves differently across platforms).

I consider asynchronous processing (in this context) as advantageous in some cases. Indeed, when we were refactoring `athenapdf`, we considered introducing a message queue for workers to pull work from, and to put back when the work completes. The problem with this however, is that we can't as easily scale horizontally (i.e. introduce node replicas behind a load balancer), as if we tried to get / update a job, we may not get the same node we originally got. I mean, the solution can be as easy as introducing a centralised message queue of sorts (or even a sticky session), but that complicates the set up process, so we decided against it.

Taken together, for our specific use cases, we believe it is a lot simpler to consume a synchronous API. No webhooks / callbacks. No polling. No concerns over acknowledgement. If a HTTP call fails, we will know about it immediately. If a complex retry mechanism is needed, we think this should be accomplished in the client application.
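
As a rough illustration of keeping the retry policy in the client, here is a sketch in Python. It is not athenapdf's API: `convert_with_retry` and the converter callables are hypothetical, and each callable stands in for one synchronous HTTP call that either returns PDF bytes or raises.

```python
import time


def convert_with_retry(url, converters, attempts_per_converter=2, delay_seconds=0.0):
    """Try each converter in order, retrying each a few times before moving on."""
    last_error = None
    for convert in converters:
        for _ in range(attempts_per_converter):
            try:
                # Synchronous call: success or failure is known right here.
                return convert(url)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                time.sleep(delay_seconds)
    raise RuntimeError("all converters failed for " + url) from last_error
```

Passing a single converter gives "retry using the same converter"; passing several gives the fallback to a different one.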

In the long term, I believe we should have a toolkit that can easily be plugged into a wider orchestration engine like Conductor (https://netflix.github.io/conductor/). That way, anyone can develop their own conversion process pipeline with ease.


hi and thanks for athenapdf

I was wondering, does it support custom page headers (or running elements in general)?


We have an issue already filed for that, and unfortunately no. That's something CSS Paged Media is supposed to solve.


Unfortunately, Chrome's kerning when it comes to printing is atrocious. Over the years, I've tested it every once in a while in the hope that it would improve, to no avail.

Currently, the only print ready HTML to PDF processor that I know is Prince [1] and to a lesser extent Firefox.

[1]: https://www.princexml.com/


Prince is awesome. Its support for paged media has impressed me again and again.


does Firefox support css paged media?


Headless chrome is awesome. I'm using it to generate multiple PNG from SVG.


I agree with you but as an 'old' developer, I find it pretty sad that you have to fire up a headless browser + its gigaton of code to convert an SVG to PNG. But I agree with you, headless Chrome can be very useful.


We currently use Apache Batik (JVM/Scala) to generate PNG for server generated SVGs of charts. SVG is wonderful for generating charts, easy even without any framework.


I take your point in general, but I'm not sure this specific task - converting of SVGs - has ever been one that has easily been solved by some tiny amount of code: probably the way I would have done this in the past (say, ten years ago) would have been using Apache Batik's rasteriser, which is far from a lightweight solution itself.


It's not tiny, but in case anybody else is trying to do the same thing, I'd reach for librsvg (https://wiki.gnome.org/Projects/LibRsvg).


You could also do that with inkscape on the commandline, not sure if it's faster, though...


Indeed, I see many cool projects emerging lately. It's a pity that not all projects are using the same GitHub tag #headless-chromium [1]

This other project[2] was recently on HN but you can't find it easily on github search.

[1] https://github.com/search?q=topic%3Aheadless-chromium&type=R...

[2] https://github.com/DevExpress/testcafe


Are there any similar wrappers around headless Firefox, which has been released recently (Firefox 55)?

Mozilla's documentation (https://developer.mozilla.org/en-US/Firefox/Headless_mode) is still incomplete.


I assume they're waiting for Servo


idea: using this as "send to kindle" generating pdfs from urls you stumble upon, and seamlessly sending them to the *@kindle.com email address to consume in the device

I don't know if there's an easier way or service these days


Instapaper did that pretty well, and I think Pocket Premium does that too - in MOBI format, which is much easier to read on the Kindle than PDF.

These days I use Calibre instead because the overall experience is better, but for single articles it's a bit overkill.


I use a script that implements calibre's ebook-convert command line to convert and mail mobi files to myself from html urls. The ebook-convert command is powerful enough that I can join several html pages together, and also add chapter hooks and titles. I used to use this to download and read Wheel of Time rereads on Tor for a while.


I use the Push to Kindle app from FiveFilters.org to do this.


That is a cool idea! It would only require setting up a small server as a webhook endpoint that creates the e-mail.
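
A sketch of the "creates the e-mail" half in Python, using only the standard library. How the PDF bytes arrive from the webhook is left out, since that depends on the payload; the function name and the addresses in the usage note are made up for illustration.

```python
from email.message import EmailMessage


def build_kindle_email(pdf, title, sender, kindle_address):
    """Wrap PDF bytes in an e-mail addressed to a Send-to-Kindle address."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = kindle_address
    msg["Subject"] = title
    msg.set_content("Sent by a PDF webhook.")
    # Attach the rendered PDF; the file name is derived from the title.
    msg.add_attachment(
        pdf, maintype="application", subtype="pdf", filename=title + ".pdf"
    )
    return msg
```

For example, `build_kindle_email(pdf_bytes, "article", "me@example.com", "reader@kindle.com")` returns a message that could be handed to `smtplib.SMTP(...).send_message(msg)`, assuming an SMTP server is available.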


Interesting to compare this to some of the "old school" solutions for converting web pages to PDF such as htmldoc[0] or html2ps[1].

[0] https://github.com/michaelrsweet/htmldoc [1] http://user.it.uu.se/~jan/html2ps.html


The old school solutions lack any sort of javascript support (per the docs, htmldoc doesn't even support css), so they wouldn't work for a lot of real world websites. That's not really the same use case.

A better comparison would be against the likes of wkhtmltopdf[0], which uses webkit, or the pdf generation features of phantomjs.

[0] https://wkhtmltopdf.org/


https://github.com/wkhtmltopdf/wkhtmltopdf/issues

1,047 Open 975 Closed

Yep, that's about how I remember it. It was such a pain to build on Windows (especially to get a single static binary) that people contributing fixes would often attain hero status by attaching a random binary to an issue. Specifically, GIF support was broken on the official Windows build for 4+ years:

https://web.archive.org/web/20140917181225/http://code.googl...


I am guessing this works by splitting screenshots of a web page and gluing them together as pages in the PDF file. Doesn't that mean the PDF would grow large once it passes a few pages? How does it handle content (like tables) that doesn't have line breaks?


It uses Chrome's built-in print-to-PDF functionality via Chrome Debug/DevTools Protocol. In other words it creates PDF files with real vector graphics and text, not just images embedded in PDF.

Page.printToPDF: https://chromedevtools.github.io/devtools-protocol/tot/Page/...
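
For a feel of what goes over the wire, here is a small sketch in Python of the JSON command and of decoding the reply. It only builds and parses messages; opening the DevTools websocket is left out, and the helper names are hypothetical. `landscape` and `printBackground` are two of the parameters listed on the linked protocol page, and the result carries the PDF as base64 in `data`.

```python
import base64
import json


def print_to_pdf_command(message_id, landscape=False, print_background=True):
    """Build the JSON text of a Page.printToPDF command."""
    return json.dumps({
        "id": message_id,
        "method": "Page.printToPDF",
        "params": {
            "landscape": landscape,
            "printBackground": print_background,
        },
    })


def pdf_from_response(raw):
    """Decode the base64 `data` field of a Page.printToPDF result."""
    reply = json.loads(raw)
    return base64.b64decode(reply["result"]["data"])
```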


I didn't know that existed. How good is it with corner cases? HTML->PDF is a notoriously difficult problem; even generating PDF is. There are several software services which charge well for doing that (Docraptor, PrinceXML). If it's smooth and handles everything well, is there any reason someone should pay for them?


PDF generation (especially from a JavaScript-enhanced HTML page) has enough corner cases that it is typically best implemented with commercial support paying someone to polish away the rough edges.

There are many "free as in beer" (closed-source), freemium, and/or free trial options offered as a carrot leading to a commercial product. Most have a watermark and/or page count limitations.

http://selectpdf.com/community-edition/ (5 pages max)


It's available from the normal Chrome print menu, so you can test it yourself easily. But to answer: I haven't used it extensively, but CSS and Javascript tend to make it a bit tricky. When you are viewing a webpage in the browser, you have one viewport, and scrolling can change the appearance or position of elements. Translating this to one long PDF is troublesome on some websites. As to what companies do that provide this as a service? I've got no clue, maybe brand this as their unique service? :D


(To clarify, Docraptor is just a user-friendly way to use PrinceXML without licensing the latter and setting up all the infrastructure.)


Actually, all printing in Chrome goes through the PDF generator as an intermediate step.


Thank you very much. I'm excited to try it out. Great documentation, I wish more people would explain their project's software architecture in the README.


How does this compare to wkhtmltopdf, which, IIRC, uses WebKit to render the pdf?


We actually used wkhtmltopdf before we started using pdf-bot. wkhtmltopdf development has slowed a lot, it is very unstable, and you need to run a 2-year-old alpha version to support flexbox (if I remember correctly) :-) Headless Chrome is a much more stable choice imo.


We had an absolutely ghastly time last year trying to implement wkhtmltopdf in a Rails app - we probably wasted an entire week fighting with both Wicked PDF and PDFKit before we just gave up and wrote something using Prawn instead (which was, of course, extremely time-consuming in a different way, but at least the end result was good).


What problems did you run into with wkhtmltopdf? We have been using it without much trouble. Chrome pdf generation is nice but wkhtmltopdf generates smaller PDFs with table of contents.


All kinds of problems that others have mentioned above, plus in terms of the Rails integration, it felt like we hit almost every one of the open issues on the GitHub repos for both Wicked PDF and PDFKit. I vaguely recall fonts in production being a problem, a general lack of reliability, performance issues, fiddling around with various different binaries of wkhtmltopdf to find one that maybe worked... probably other things besides. It was a bad week and I wish I hadn't reminded myself!

(With no disrespect, of course, to the authors of these libraries - they just didn't work well for us.)


Buggy as hell. Doesn't render things as expected. I'm using headless Chrome for PDF generation internally and couldn't be happier. I call Chrome directly from the command line; it couldn't be simpler. I don't know why people need all these wrappers.
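
For reference, the direct invocation looks roughly like this when scripted from Python. The `--headless`, `--disable-gpu` and `--print-to-pdf` switches are Chrome's own; the binary name `google-chrome` is an assumption that varies by platform, and the helper names are made up.

```python
import subprocess


def chrome_pdf_command(url, output, chrome="google-chrome"):
    """Build the argv for printing a URL to PDF with headless Chrome."""
    return [chrome, "--headless", "--disable-gpu", "--print-to-pdf=" + output, url]


def print_to_pdf(url, output, chrome="google-chrome"):
    """Run Chrome and raise CalledProcessError if it exits with an error."""
    subprocess.run(chrome_pdf_command(url, output, chrome), check=True)
```

What a wrapper adds on top of this is mostly queuing, retries and storage, which is the part debated elsewhere in the thread.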


As far as I know, wkhtmltopdf still excels in some aspects: better page control, margins, headers, footers and tables of contents. Chrome still doesn't support them, and no browser supports CSS Paged Media, which allows for setting up running elements like page numbering.


Can it generate a table of contents with page numbers?


calibre has been able to convert arbitrary HTML files to PDF with a table of contents with page numbers, links, embedded fonts and arbitrary headers/footers for years, all rendered using WebKit, without a running X server.

ebook-convert file.html file.pdf --pdf-add-toc


It can do what HTML can do. You can have a TOC with internal links. Those links should continue to work inside the PDF.


Would this work on AWS Lambda?



Since it's designed to run daemonized as a queue I'd think you'd have an issue with execution time limit



