
Oh wow, I just wrote my own version of this rant, reaching a radically different conclusion:

https://lab6.com/0

TL;DR: I’m abandoning HTML and switching to PDF.




So, a megabyte of PDF which both downloads and renders far more slowly than the equivalent HTML, which could be 20KB of markup, no JavaScript, 1KB of CSS and perhaps 50KB of images (which don’t block page rendering, either), all bundled into one HTML file (base64-encode the raster images into data URIs) of under 100KB.
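As a sketch of that bundling trick (the truncated base64 and file name are illustrative only), each raster image just becomes an inline data URI:

  <img src="data:image/png;base64,iVBORw0KGgoAAAANSU..." alt="figure 1">

No extra requests, and the whole document still travels as a single file.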

I can imagine why you might have concluded as you have, but your conclusion is nonetheless baffling. Most of the reasons you’ve chosen PDF over HTML apply just as truly to HTML; page orientation is the only one that doesn’t, and I dispute your claim that page orientation is a good thing for the web at large or for the hardware most people use. (Guess how I read the document? By scrolling, not by pagination; the footer and header at each page transition is just a minor annoyance in the middle of a paragraph of text.)

I disagree with every point of your assessment of PDF’s historical disadvantages, too: PDF as used still includes patent-encumbered things, and implementations are complex enough that major deviations and incompatibilities are common; PDF files are still ginormous (nothing has changed about this, ever), and tooling for shrinking files without butchering the lot is terrible (perhaps even non-existent), with yours as a good example of not being an order of magnitude smaller as claimed; it’s still far harder to make PDFs accessible, because most free tools just can’t do it, and it takes far more effort than HTML, where it’s easy (as before, tooling is terrible, whereas with HTML you can just edit the source in a text editor); most PDF readers can’t reflow text (I don’t think I’ve ever used a tool, reader or writer, that could); most PDFs can’t be edited freely; and PDFs don’t render well on screen.

You’ve thrown out the HTML ecosystem because some people abuse it, and are choosing to use PDF despite the rampant abuse and many problems of the format, because it’s theoretically possible to work around those problems. (You may quibble with my judgement of “rampant abuse”, but most PDFs I encounter on the web are PDFs for no good reason and perform terribly, loading slowly and being worse for task performance. It is perhaps a slightly different class of abuse, since spying on users via JavaScript isn’t part of it, but it’s related inasmuch as it’s not about meeting the needs of the user.) This does not seem internally consistent.


Thank you for your feedback. I’m not going to vigorously defend PDF, because a lot of the distaste you express is legitimate. It’s a statement of my disappointment in current web/browser trends that I would be willing to accept all of PDF’s hardships to distance myself from the churn.

For what it’s worth though...

That 1MB of PDF still loads nearly instantly on my phone - subjectively no slower than the equivalent much-smaller bare-bones HTML version.

PDF patents are all licensed royalty-free for the normative parts of the PDF 2.0 spec.

PDF tooling sucks, but I’d rather see effort put into improving this situation than into yet another expansion of the web browser.


Ok, let's see what we can do to improve your PDF. I ran it through qpdf so I could view it in my text editor:

  qpdf --stream-data=uncompress 0.pdf 0.uncompressed.pdf

The Tj operators (for printing text) are operating on numbers instead of ASCII text, making it difficult to read.
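Roughly what that looks like, as a sketch (not copied from the actual file): with a standard font and literal strings a content stream reads like

  BT /F1 12 Tf 72 712 Td (Hello, world) Tj ET

but with an embedded subset font the string is glyph indices in hex, something like

  BT /F1 12 Tf 72 712 Td <002B0048004F004F0052> Tj ET

so you need the font's /ToUnicode CMap to get readable text back.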

I'll see what I can do to duplicate your PDF using some cli tools and a text editor... (hopefully I have time to do it today)


I’m intrigued and would be very interested if you find any tools that can create or verify tag structure.

FYI the file has already been linearized through qpdf.


It seems that Firefox is finishing downloading the full document before it is able to render the title page. (Some documents can have early pages rendered before the document’s all fetched, some can’t. Not sure what the technical difference is.) For me where I am, it’s taking 8–15 seconds (quite variable) to load the document and render the first page, or 1.5–3 seconds when cached. The equivalent single-file HTML would render completely in easily under two seconds, even including TLS negotiation when I’m 150ms away, and reload in half a second.

Doing things like jumping to the end takes perhaps 200–400ms to render that page in the PDF, where HTML would be instant (meaning “less than 16ms”).

No way would I access this on my not-overly-powerful phone: Firefox would download the PDF and try to open it in a local app (EBookDroid) instead, so all up I’d expect that’d make it something like 20–30 seconds to load, instead of 1–2. And the text would be minuscule (or only tiny in landscape mode) instead of sanely sized, further disincentive.

Good to know about the PDF patent situation. Do you happen to know how relevant that actually is to PDF tooling? Is it a new version of the file format, or a specification of the existing? (My knowledge of PDF is limited; I know the general concepts and how it’s put together, but not much of the intricate detail or PDF versions.) That is, does this help for existing documents, or are existing documents still stuck in a patent tangle?

On tooling, I just don’t believe PDF tooling is capable of being excellent; it’s a publishing format—a compilation target more than anything else—and by design not conducive to manipulability, where HTML is an authoring format, so you can work with it. Much of the stuff you can do with HTML tooling is by design fundamentally impossible with PDF. They’re very different types of formats.


PDF 2.0 is mostly a cleanup of the existing spec, plus a few minor new features. Adobe’s Public Patent License is similar to AOMedia’s AV1 approach to patents - the patent owners have granted royalty-free usage. However, AV1 also suffers from competing claims of patent ownership from a non-member of AOMedia. I’m not aware of any such claim relating to PDF, and as far as I know no PDF documents have any patent issues, and there are no royalties required to create PDF documents or tools.

I don’t think there’s a conflict between PDF being a publishing format and having great tools for producing that format. The editing can be done in an intermediate application file format (e.g. reStructuredText, OpenDocument Text, or Microsoft Visio), with the result rendered to PDF.
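A sketch of that pipeline with common tools (file names are placeholders):

  rst2pdf notes.rst -o notes.pdf
  libreoffice --headless --convert-to pdf notes.odt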

Oh and that’s unfortunate to hear the mobile Firefox experience is poor. I note there’s a project to add pdf.js support - this is the kind of thing I mean by improving PDF support.


Please don't confuse the monstrosities that our poor tools produce with the nice files that our good tools produce. Using a shit PDF from some manufacturer's website to criticize the format is like using the HTML produced by Microsoft Word to criticize how shitty HTML is.

Remember that PDF is mostly a text format that's pretty simple at its heart. You can actually hand-write a PDF, though it's painful: https://brendanzagaeski.appspot.com/0005.html
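Here's a minimal sketch in that spirit (the xref table and startxref offset are omitted for brevity, so strict readers may object, but most viewers will rebuild them):

  %PDF-1.4
  1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
  2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
  3 0 obj << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]
             /Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >> endobj
  4 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> endobj
  5 0 obj << /Length 44 >>
  stream
  BT /F1 18 Tf 72 712 Td (Hello, world!) Tj ET
  endstream
  endobj
  % xref table and startxref omitted; most viewers will rebuild them
  trailer << /Root 1 0 R /Size 6 >>
  %%EOF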

Take a look at the CLI tool qpdf. Once you uncompress the streams within the file, you can load it up in a text editor and see what is possible. Take a shitty large PDF as an example and then one produced by LaTeX. See the difference.
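Something like this (file names are placeholders) puts each file into qpdf's editable QDF form, with streams decompressed and object streams disabled, so you can compare them side by side in an editor:

  qpdf --qdf --object-streams=disable vendor-manual.pdf vendor-manual.qdf.pdf
  qpdf --qdf --object-streams=disable latex-paper.pdf latex-paper.qdf.pdf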

Free tools can and do produce amazing PDFs that are small and quick to render. Honestly, Pandoc is probably the place to start.
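For instance, something like this (assuming a TeX engine is installed; file names are placeholders):

  pandoc article.md -o article.pdf --pdf-engine=xelatex -V geometry:margin=2cm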

Both ecosystems — HTML and PDF — are under rampant abuse, I agree. But with PDF we could end up with something better, not because it's more constraining or more freeing, but because we would be doing the hard work of layout once, on the server, not a billion times on our battery- and CPU-constrained portable devices.


As another person said, your pdf fonts are way too small. At least on my iPhone 8+ I’m having a super hard time with the size.


I tested on that exact model of phone and I found that when zoomed to the width of the main column of text, in portrait orientation, the font size was only a little smaller than the size used on Hacker News. In landscape orientation, zoomed in the same way, it’s a little bigger.


Maybe it is the font and not the size. I don't know. I just know that I'm having a very hard time reading it.


Hah! I came to the same conclusion. PDF isn't actually that bad of a format and it's much easier and faster to parse and render than HTML/CSS.

My conclusion:

- Use the streaming (linearized) format with the index headers at the start (see the qpdf sketch after this list)

- Break the document into pages to aid rendering speed while jumping around but use zero margins on top/bottom to hide the fact that they are separate pages

- Do the layout at publish time for a few screen widths — No need to reflow perfectly for every possible width
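On the first point, linearization can be applied and checked after the fact with qpdf; a sketch (file names are placeholders):

  qpdf --linearize draft.pdf published.pdf
  qpdf --check-linearization published.pdf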

We can get started with this and then maybe later tweak the PDF format — a text format wrapping a binary format wrapping a text format wrapping binary data kinda sucks. Should be completely binary. Maybe use an existing serialization format like protobufs.


Reflow is still important to support accessibility, and it depends on Tagged PDF, though I can't find any open tools that can actually do this.


Wish I could read it, but the PDF's fonts are too small to read on my phone.


Why not PostScript?


PDF is more mainstream with more user-friendly tools. PDF supports transparency. PDF is (mostly) consistently rendered by all kinds of client applications. PDF is not a general purpose programming language (at least in its PDF/A subset) - and I’m considering that a feature rather than a limitation.
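For contrast, even a one-page “hello world” in PostScript is a program the viewer must execute, something like:

  %!PS
  /Helvetica findfont 18 scalefont setfont
  72 712 moveto
  (Hello, world!) show
  showpage

which is part of why consistent, safe rendering is harder to guarantee there.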



