Really easy to work around, I used it to build a simple CLI to generate device screenshots of a webpage by modifying the user-agent and resolution to match each device.
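For anyone curious what the "one screenshot per device" idea looks like, here is a rough sketch. It is not the CLI mentioned above (that one is built on Puppeteer); it only builds headless Chrome command lines using the `--window-size`, `--user-agent` and `--screenshot` flags. The device list, the user-agent strings and the `google-chrome` binary name are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    width: int
    height: int
    user_agent: str


# Hypothetical device profiles; real values would come from a device database.
DEVICES = [
    Device("phone", 375, 667, "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X)"),
    Device("tablet", 768, 1024, "Mozilla/5.0 (iPad; CPU OS 10_3 like Mac OS X)"),
]


def screenshot_command(url: str, device: Device, chrome: str = "google-chrome") -> list[str]:
    """Build the argv for one headless Chrome screenshot of `url`.

    The binary name is an assumption; it differs between platforms.
    """
    return [
        chrome,
        "--headless",
        f"--window-size={device.width},{device.height}",
        f"--user-agent={device.user_agent}",
        f"--screenshot={device.name}.png",
        url,
    ]


# One command per device; nothing is executed here.
commands = [screenshot_command("https://example.com", d) for d in DEVICES]
```

Each list can be handed to `subprocess.run(cmd, check=True)` to actually take the screenshot.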
Yeah, I think Puppeteer is a very cool project. Unfortunately, it came out literally one or two days after I finished the initial version of pdf-bot. Maybe I will incorporate it soon! :-)
Cool project! I think I am going to start using it regularly. Does it email you a PDF attachment, or does it just send an email with the contents of the article in the body? When I tried it I did not see a PDF attachment. Either way, really cool project.
All I need now is CMYK support in Chrome and I can do HTML-based, print-ready PDF rendering. That would be quite an upgrade compared to my current options.
What's up with the built-in queue? I feel like that belongs in a different script. For one, the built-in Node.js queue is useless in a multi-server environment. You'd still need a distributed queue, since the built-in one is local to a single server/thread. So the built-in queue becomes redundant for any kind of solution that needs to scale.
I was first wondering about all the complexity in the API as well (why a built-in queue, webhook, retry policy and storage interface when the actual transaction I'm interested in is just "url -> pdf blob"?)
However, I think this is necessary if you want to fit it into a microservice with a REST interface. For REST, I think the usual expectation is that a) the request returns quickly and b) you can submit any number of requests in parallel. Given that loading a page into headless chrome, rendering it and generating a pdf is both resource intensive and time consuming, I guess you need some way to decouple that process from the interface.
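A minimal sketch of that decoupling, assuming nothing about how pdf-bot actually implements it: `submit` is what a request handler would call (it returns a job id right away), and a background worker does the slow part. `render_pdf` is a hypothetical placeholder for the headless Chrome step, and the in-memory `jobs` dict stands in for whatever storage a real service would use.

```python
import queue
import threading
import uuid

jobs: dict[str, dict] = {}   # job id -> {"url", "status", "result"}
pending = queue.Queue()      # job ids waiting to be rendered


def render_pdf(url: str) -> bytes:
    """Hypothetical placeholder for the slow step (load page, render, print to PDF)."""
    return f"%PDF placeholder for {url}".encode()


def submit(url: str) -> str:
    """Called by the request handler: record the job and return immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"url": url, "status": "queued", "result": None}
    pending.put(job_id)
    return job_id


def worker() -> None:
    """Process jobs one at a time until a None sentinel arrives."""
    while True:
        job_id = pending.get()
        if job_id is None:
            break
        job = jobs[job_id]
        try:
            job["result"] = render_pdf(job["url"])
            job["status"] = "done"
        except Exception:
            job["status"] = "failed"


thread = threading.Thread(target=worker)
thread.start()
ids = [submit(u) for u in ("https://example.com/a", "https://example.com/b")]
pending.put(None)   # no more work in this demo
thread.join()
```

In a real service the worker would keep running and the client would learn about completion by polling or through a webhook.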
Not the OP, but maybe he wanted to keep the install as simple as possible without requiring something like redis. I also haven't looked at the code, but a queue per process is still useful as long as the results can be accessed from any process. Not sure if this is the case. If not, you would seem to be right, in that subsequent requests after "queuing" the job could go to another process/server not aware of the queued job.
How well does it work with multiple-page PDFs? One of our banes is generating mixed text/image downloadable reports with sensible page breaks. To save time, we're actually doing those as docx files, with the bonus/risk that clients can edit the content before saving it as a PDF.
Wkhtmltopdf has basically not been updated in years aside from minor bug fixes. It has major issues the author has no plans to fix. I've spent hundreds of hours applying workarounds to legacy codebases. All that code could be refactored now. PhantomJS and wkhtmltopdf don't even support doing $(.htmlE).width() from JavaScript. Needless to say, this can complicate laying out the page. https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2419
Why are you doing layout code in JavaScript? If it can't be done using CSS then you're doing something wrong. This is being used to generate PDFs with an exact page width and height. Hard-code the width and height of your page.
wkhtmltopdf is a solid tool but it's mostly abandoned now. rendering in chrome headless is way faster and more accurate.
chrome headless though is lacking a lot of features that wkhtmltopdf provides like headers/footers
Headers/footers are 2 huge features. Plus all of the other small features wkhtmltopdf has. It may not have all the new bells and whistles but it works reliably.
There's CSS that's supposed to help with doing this manually, right? Or does your docx export handle this auto-magically? (If so, I'm interested in more details!)
CSS isn't really that flexible though. Say you want a repeated header or footer on each page: you are going to have a really hard time getting it to work perfectly. And good luck if you then need to internationalize it and support different document sizes (A4 vs legal). It's not impossible, but it's a lot more work than it should be.
One of the best ways I've used to generate PDFs is by using a DOCX as a template, and replacing certain placeholders within the document (a DOCX is a ZIP containing a few XML files). It's great if you work in a corporate environment, as it's easy for non-technical staff to make it look exactly how they want and it's easy to update (just replace a file and check it works). You can use headless LibreOffice to convert DOCX to PDF.
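Roughly what that looks like in code, as a sketch only. The `{{key}}` placeholder syntax is my own assumption, not a DOCX feature, and it only works when Word has stored the whole placeholder in a single text run (Word sometimes splits text across runs, in which case a plain string replace will miss it). The `soffice` binary name may also differ on your system.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_docx(template: bytes, values: dict[str, str]) -> bytes:
    """Return a copy of the DOCX with {{key}} placeholders replaced in word/document.xml.

    Every other file in the ZIP is copied through unchanged.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the result is still valid XML.
                    text = text.replace("{{" + key + "}}", escape(value))
                data = text.encode("utf-8")
            dst.writestr(item, data)
    return out.getvalue()


def libreoffice_command(docx_path: str, out_dir: str) -> list[str]:
    """argv for a headless LibreOffice DOCX -> PDF conversion (not executed here)."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, docx_path]
```

Write the bytes returned by `fill_docx` to a file, then run the `libreoffice_command` list through `subprocess.run` to get the PDF.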
Depending on your use case, I feel like you guys might be interested in http://weasyprint.org/. It is an open source HTML to PDF converter written in Python. It passes the Acid2 test and implements CSS Paged Media.
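To give an idea of what CSS Paged Media buys you, here is a small sketch: an `@page` rule with a page-number footer plus a few break hints, wrapped in a helper that builds the HTML. The report structure, class name and function names are invented for the example. The WeasyPrint call (`HTML(string=...).write_pdf(...)`) is imported lazily so the rest runs without the library installed; how well each rule is honoured depends on the renderer and its version.

```python
import html

# Paged-media rules: page size, margins, a running footer, and break hints.
PRINT_CSS = """
@page {
  size: A4;
  margin: 2cm;
  @bottom-center { content: "Page " counter(page) " of " counter(pages); }
}
h2 { break-after: avoid; }
table, figure { break-inside: avoid; }
section.report + section.report { break-before: page; }
"""


def build_report_html(title: str, sections: list[str]) -> str:
    """Build a small HTML document with one <section> per heading."""
    body = "".join(
        f'<section class="report"><h2>{html.escape(s)}</h2></section>'
        for s in sections
    )
    return (
        "<!doctype html><html><head>"
        f"<title>{html.escape(title)}</title><style>{PRINT_CSS}</style>"
        f"</head><body>{body}</body></html>"
    )


def write_pdf(html_text: str, path: str) -> None:
    """Render with WeasyPrint (third-party; imported only when called)."""
    from weasyprint import HTML
    HTML(string=html_text).write_pdf(path)


doc = build_report_html("Q1 report", ["Sales", "R&D"])
```

Switching to US Legal would be a one-line change (`size: legal;`), which is the main appeal over hand-positioned headers and footers.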
I had a quick look at `pdf-bot`, and though we both rely on the same underlying technology (we are only just moving to headless Chromium; we were on Electron before), I believe we have slightly different ambitions with our respective project. But, I may be biased.
For example, `pdf-bot` seems to be tied exclusively to a specific converter, and storage backend.
With `athenapdf`, however, we are moving more and more towards building a toolkit, or rather a framework, for other people to construct their own conversion processes (or even microservices)[0].
Consequently, we are working towards general abstractions like fetching, converting, and uploading, that can have different implementations (e.g. wkhtmltopdf, LibreOffice, Weasyprint, etc).
With our microservice assembly as well, we are focused heavily on ensuring we have:
1. Instrumentation, and metrics (which `pdf-bot` appears to currently lack)
2. Support for different retry mechanisms (e.g. retry using the same converter or retry using a different converter)
3. Support for multiple input MIME types
4. Synchronous API calls (`pdf-bot` appears to be mostly asynchronous, with batch processing, and callbacks)
5. Ease of installation (e.g. Docker), and configuration
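As an illustration of point 2 (retry with the same converter, then fall back to a different one), here is a sketch under my own assumptions. It is not `athenapdf`'s actual API: `convert_with_fallback`, `ConversionError` and the two demo converters are hypothetical names.

```python
from typing import Callable, Sequence

Converter = Callable[[str], bytes]


class ConversionError(Exception):
    """Raised when every converter has used up its attempts."""


def convert_with_fallback(
    source: str,
    converters: Sequence[tuple[str, Converter]],
    attempts_each: int = 2,
) -> tuple[str, bytes]:
    """Try each converter up to `attempts_each` times, in order.

    Returns the name of the converter that succeeded and its output.
    """
    errors = []
    for name, convert in converters:
        for attempt in range(1, attempts_each + 1):
            try:
                return name, convert(source)
            except Exception as exc:
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise ConversionError("; ".join(errors))


calls = []


def always_fails(source: str) -> bytes:
    calls.append("primary")
    raise RuntimeError("render crashed")


def works(source: str) -> bytes:
    calls.append("backup")
    return b"%PDF-" + source.encode()


used, pdf = convert_with_fallback("page", [("primary", always_fails), ("backup", works)])
```

Here the primary converter is tried twice before the backup takes over, so `used` ends up naming the backup.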
We also have a CLI assembly[1] that can support custom JavaScript plugins[2] (e.g. Markdown -> PDF, Readability, etc). So you don't need to run a service or make API calls for conversions.
Thank you for athenapdf and for rescuing me from the pains of wkhtmltopdf - I am a happy user. :)
My only small problem with it was the somewhat complex setup for using athenapdf-service with a new project (especially since I use docker-machine) but I have now mostly automated the whole thing.
Just out of interest - do you consider asynchronous an advantage (being a Node developer I generally love async very much)? Not that it matters to me - my needs are trivial for the service to handle.
Edit: actually I can see how async would make my life much more complicated for my simple use case - I would have to write something to track requests and responses rather than just looping through a bunch of URLs that need converting.
We actually went with Docker for the set up because it simplified dependency management tremendously, and it allowed us to deploy on platforms like Kubernetes, Swarm, and ECS. As a plus, it gave us some confidence that if it works for us, it should work for others (obviously, we have come across cases where Docker behaves differently across platforms).
I consider asynchronous processing (in this context) as advantageous in some cases. Indeed, when we were refactoring `athenapdf`, we considered introducing a message queue for workers to pull work from, and to put back when the work completes. The problem with this, however, is that we can't as easily scale horizontally (i.e. introduce node replicas behind a load balancer): when we try to get or update a job, the request may not reach the node that originally accepted it. I mean, the solution can be as easy as introducing a centralised message queue of sorts (or even a sticky session), but that complicates the set up process, so we decided against it.
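A toy simulation of that problem, with all names made up for the example: two replicas that each keep their own job table, behind a round-robin balancer. The job is created on one node and the follow-up status request lands on the other.

```python
import itertools


class Node:
    """One service replica with a purely local job table."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.jobs = {}

    def create_job(self, job_id: str, url: str) -> None:
        self.jobs[job_id] = {"url": url, "status": "queued"}

    def get_job(self, job_id: str):
        # Returns None if this replica never saw the job.
        return self.jobs.get(job_id)


class RoundRobinBalancer:
    """Hands out nodes in rotation, with no session affinity."""

    def __init__(self, nodes) -> None:
        self._cycle = itertools.cycle(nodes)

    def route(self) -> Node:
        return next(self._cycle)


balancer = RoundRobinBalancer([Node("a"), Node("b")])

first = balancer.route()                      # request 1: create the job
first.create_job("job-1", "https://example.com")

second = balancer.route()                     # request 2: ask for its status
lookup = second.get_job("job-1")              # a different node, so the job is unknown
```

A shared job store or sticky sessions would make the second request succeed, at the cost of the extra setup described above.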
Taken together, for our specific use cases, we believe it is a lot simpler to consume a synchronous API. No webhooks / callbacks. No polling. No concerns over acknowledgement. If a HTTP call fails, we will know about it immediately. If a complex retry mechanism is needed, we think this should be accomplished in the client application.
In the long term, I believe we should have a toolkit that can easily be plugged into a wider orchestration engine like Conductor (https://netflix.github.io/conductor/). That way, anyone can develop their own conversion process pipeline with ease.
Unfortunately, Chrome's kerning when it comes to printing is atrocious. Over the years, I've tested it every once in a while in the hope that it would improve, to no avail.
Currently, the only print ready HTML to PDF processor that I know is Prince [1] and to a lesser extent Firefox.
I agree with you, but as an 'old' developer, I find it pretty sad that you have to fire up a headless browser + its gigaton of code to convert an SVG to PNG. But I agree with you, headless chrome can be very useful.
We currently use Apache Batik (JVM/Scala) to generate PNG for server generated SVGs of charts. SVG is wonderful for generating charts, easy even without any framework.
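To show what "easy even without any framework" can mean, here is a sketch of a bar chart built with plain string formatting. It is written in Python rather than Scala to stay self-contained, it has nothing to do with Batik itself, and the layout numbers are arbitrary. It assumes a non-empty dict of non-negative values.

```python
from xml.sax.saxutils import escape


def bar_chart_svg(data: dict[str, float], width: int = 400, height: int = 200, gap: int = 10) -> str:
    """Return a standalone SVG bar chart; bars are scaled to the largest value."""
    label_space = 20                       # room under the bars for labels
    peak = max(data.values()) or 1         # avoid dividing by zero when all values are 0
    bar_width = (width - gap * (len(data) + 1)) / len(data)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (label, value) in enumerate(data.items()):
        bar_height = (height - label_space) * value / peak
        x = gap + i * (bar_width + gap)
        y = height - label_space - bar_height
        parts.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{bar_width:.1f}" '
            f'height="{bar_height:.1f}" fill="steelblue"/>'
        )
        parts.append(
            f'<text x="{x + bar_width / 2:.1f}" y="{height - 5}" '
            f'text-anchor="middle" font-size="12">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "".join(parts)


svg = bar_chart_svg({"Q1": 10, "Q2": 25, "Q3": 15})
```

The resulting string is what a rasteriser (Batik, a headless browser, or anything else) would then turn into a PNG.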
I take your point in general, but I'm not sure this specific task - converting SVGs - has ever been one that was easily solved by some tiny amount of code: probably the way I would have done this in the past (say, ten years ago) would have been using Apache Batik's rasteriser, which is far from a lightweight solution itself.
idea: using this as "send to kindle" generating pdfs from urls you stumble upon, and seamlessly sending them to the *@kindle.com email address to consume in the device
I don't know if there's an easier way or service these days
I use a script that implements calibre's ebook-convert command line to convert and mail mobi files to myself from html urls. The ebook-convert command is powerful enough that I can join several html pages together, and also add chapter hooks and titles. I used to use this to download and read Wheel of Time rereads on Tor for a while.
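A sketch of the two halves of such a script, not the script described above: building the `ebook-convert` command line and building the email with the attachment. It assumes the HTML has already been saved to a local file, and the addresses and file names are placeholders. Nothing is executed or sent here.

```python
from email.message import EmailMessage


def ebook_convert_command(html_path: str, mobi_path: str, title: str) -> list[str]:
    """argv for calibre's ebook-convert: input file, output file, then options."""
    return ["ebook-convert", html_path, mobi_path, "--title", title]


def kindle_email(sender: str, kindle_address: str, mobi_name: str, mobi_bytes: bytes) -> EmailMessage:
    """Build an email carrying the converted book as an attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = kindle_address
    msg["Subject"] = mobi_name
    msg.set_content("Document attached.")
    msg.add_attachment(
        mobi_bytes,
        maintype="application",
        subtype="octet-stream",
        filename=mobi_name,
    )
    return msg


command = ebook_convert_command("article.html", "article.mobi", "Some article")
message = kindle_email("me@example.com", "reader@kindle.com", "article.mobi", b"MOBI")
```

Run `command` with `subprocess.run`, read the output file, and send `message` with `smtplib.SMTP(...).send_message(message)` using your own mail server settings.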
The old school solutions lack any sort of javascript support (per the docs, htmldoc doesn't even support css), so they wouldn't work for a lot of real world websites. That's not really the same use case.
A better comparison would be against the likes of wkhtmltopdf[0], which uses webkit, or the pdf generation features of phantomjs.
Yep, that's about how I remember it. It was such a pain to build on Windows (especially to get a single static binary) that people contributing fixes would often attain hero status by attaching a random binary to an issue. Specifically, GIF support was broken on the official Windows build for 4+ years:
I am guessing this works by splitting screenshots of a web page and gluing them together as pages in the PDF file. Doesn't that mean the PDF would grow large once it passes a few pages? How does it handle content (like tables) that doesn't have line breaks?
It uses Chrome's built-in print-to-PDF functionality via Chrome Debug/DevTools Protocol. In other words it creates PDF files with real vector graphics and text, not just images embedded in PDF.
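For the curious, the protocol side is small: you send a `Page.printToPDF` message over the DevTools WebSocket and get the PDF back as base64 in `result.data`. The sketch below only builds the request and decodes a reply; the WebSocket transport is left out, the helper names are mine, and the reply is a hand-made fake used for demonstration.

```python
import base64
import json


def print_to_pdf_request(message_id: int, landscape: bool = False, print_background: bool = True) -> str:
    """Serialise a Page.printToPDF call as a DevTools Protocol message."""
    return json.dumps({
        "id": message_id,
        "method": "Page.printToPDF",
        "params": {"landscape": landscape, "printBackground": print_background},
    })


def pdf_from_response(raw: str) -> bytes:
    """Extract the PDF bytes from the reply, or raise if Chrome reported an error."""
    reply = json.loads(raw)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "Page.printToPDF failed"))
    return base64.b64decode(reply["result"]["data"])


request = print_to_pdf_request(1)

# Fake reply for demonstration; a real one arrives over the WebSocket.
fake_reply = json.dumps({"id": 1, "result": {"data": base64.b64encode(b"%PDF-1.4 demo").decode()}})
pdf_bytes = pdf_from_response(fake_reply)
```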
I didn't know that existed. How good is it with corner cases? HTML->PDF is a notoriously difficult problem; even generating PDF is. There are several software services which charge well for doing that (Docraptor, PrinceXML). If it's smooth and handles everything well, is there any reason someone should pay for them?
PDF generation (especially from a JavaScript-enhanced HTML page) has enough corner cases that it is typically best implemented with commercial support paying someone to polish away the rough edges.
There are many "free as in beer" (closed-source), freemium, and/or free trial options offered as a carrot leading to a commercial product. Most have a watermark and/or page count limitations.
It's available from the normal Chrome print menu, so you can test it yourself easily.
But to answer: I haven't used it extensively, but CSS and JavaScript tend to make it a bit tricky. When you are viewing a webpage in the browser, you have one viewport, and scrolling can change the appearance or position of elements. Translating this to one long PDF is troublesome on some websites. As for what the companies that provide this as a service do differently, I've got no clue; maybe they just brand this as their own unique service? :D
Thank you very much. I'm excited to try it out.
Great documentation, I wish more people would explain their project's software architecture in the README.
We actually used wkhtmltopdf before we started using pdf-bot. wkhtmltopdf development has slowed a lot, it is very unstable, and you need to run a 2-year-old alpha version to support flexbox (if I remember correctly) :-) Headless Chrome is a much more stable choice imo.
We had an absolutely ghastly time last year trying to implement wkhtmltopdf in a Rails app - we probably wasted an entire week fighting with both Wicked PDF and PDFKit before we just gave up and wrote something using Prawn instead (which was, of course, extremely time-consuming in a different way, but at least the end result was good).
What problems did you run into with wkhtmltopdf? We have been using it without much trouble. Chrome pdf generation is nice but wkhtmltopdf generates smaller PDFs with table of contents.
All kinds of problems that others have mentioned above, plus in terms of the Rails integration, it felt like we hit almost every one of the open issues on the GitHub repos for both Wicked PDF and PDFKit. I vaguely recall fonts in production being a problem, a general lack of reliability, performance issues, fiddling around with various different binaries of wkhtmltopdf to find one that maybe worked... probably other things besides. It was a bad week and I wish I hadn't reminded myself!
(With no disrespect, of course, to the authors of these libraries - they just didn't work well for us.)
Buggy as hell. Doesn't render things as expected. I'm using headless chrome for pdf generation internally and couldn't be happier. I call chrome directly from the command line; it couldn't be simpler. I don't know why people need all these wrappers.
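For reference, the direct call amounts to something like the sketch below. The `--headless`, `--disable-gpu` and `--print-to-pdf` flags are Chrome's own; the `google-chrome` binary name and the 60 second timeout are assumptions, so adjust them for your platform.

```python
import shutil
import subprocess


def pdf_command(url: str, output_path: str, chrome: str = "google-chrome") -> list[str]:
    """argv for a one-shot headless Chrome print-to-PDF run."""
    return [chrome, "--headless", "--disable-gpu", f"--print-to-pdf={output_path}", url]


def print_to_pdf(url: str, output_path: str, chrome: str = "google-chrome", timeout: float = 60.0) -> None:
    """Run Chrome and raise if the binary is missing, exits non-zero, or hangs."""
    if shutil.which(chrome) is None:
        raise FileNotFoundError(f"{chrome} not found on PATH")
    subprocess.run(pdf_command(url, output_path, chrome), check=True, timeout=timeout)


argv = pdf_command("https://example.com", "example.pdf")
```

The wrappers mostly earn their keep once you need queuing, retries or storage on top of this single call.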
as far as I know wkhtmltopdf still excels in some aspects: better page control, margins, headers, footers and tables of contents. chrome still doesn't support them, and no browser supports CSS Paged Media, which allows for setting up running elements like page numbering etc
calibre has been able to convert arbitrary HTML files to PDF with Table of Contents with page numbers, links, embedded fonts, arbitrary headers/footers for years, all rendered using WebKit, without a running X server.
https://github.com/GoogleChrome/puppeteer/blob/master/exampl...
https://github.com/umpox/generateDeviceScreenshots