A Clean Start for the Web

roca · on Aug 24, 2020

This comes up so often I wrote an a blog post explaining why it won't work that I can refer to every time it comes up. Here it is: https://robert.ocallahan.org/2020/05/why-forking-html-into-s...

brundolf · on Aug 24, 2020

It's a classic case of engineers thinking they can solve people-problems with technical solutions. Both the thing that makes the web so important (a critical mass of adoption) and the things that are ruining it (Google's monopoly, adtech, product-over-engineering mentality) are societal. You can make pristine new monuments to your personal vision for the web all day long, but no amount of engineering by itself will have the tiniest impact on this status-quo. Changing things requires changing minds - of product managers, of legislators, etc.

eitland · on Aug 24, 2020

> It's a classic case of engineers thinking they can solve people-problems with technical solutions.

Linux, Wikipedia and a number of other proves this is not the entire story.

If the new <x> is good/fast/cheap enough people will sometimes start using it.

Often this is gets great help from the incumbent solution being either really bad and/or slow and/or expensive.

This way of thinking is not just not entirely correct but more importantly it is demotivating. Maybe what we do won't succeed but for me it sure beats watching more TV :-)

luckylion · on Aug 24, 2020

Does Wikipedia count as a technical solution? I don't think the tech does anything without the people editing it. I believe it's the policy ("everyone can edit") that made the advancement, not the specific system they built to allow it.

I agree on the second point, but it's more by accident in my opinion. If the technical solution leads to people finding it easier or cheaper to do something, they'll adopt it. Whether it's new, good, pristine, perfect, free & open etc, doesn't matter. It has to make their life easier, save them time and/or money or otherwise enhance the experience, that's what counts.

fauigerzigerk · on Aug 24, 2020

I don't think that's necessarily true. What is true is that the impact of new technology is not determined by tech visionaries alone and not even mostly by them.

But new technology has had a dramatic impact on societies in the past, disrupting monopolies, creating new ones, confronting society with new choices, opportunities and challenges.

Technology can force a rethink. It cannot force outcomes.

brodo · on Aug 24, 2020

The same is true for operating systems and programming languages. It‘s not the best ones that win. Economics is the most important factor.

jariel · on Aug 24, 2020

"are societal"

Structural, not societal.

And Google's monopoly is not good, but it's not really the root cause of the issues articulated in the article.

tannhaeuser · on Aug 24, 2020

Indeed it does come up often, but your linked blog post doesn't state a single reason why "forking HTML doesn't make sense". First, nobody is "forking" anything. Second, you assume the role of a "web developer" as a given. If I, as a web "user" (reader) want to read some text (say, about gardening), then I sure want that text to be written by an expert rather than a "web developer". If in doubt, I can live with the presentation being plain; but there's no point in reading stuff about gardening by a dilettante gardener who happens to know HTML, CSS, and JS. As a corollary, "the web" fails to deliver for the ___domain expert as a simple means for self-publishing; instead a self-referential man-in-the-middle snake oil industry has been build.

bryanrasmussen · on Aug 24, 2020

The greatest gardener and the worst gardener in the world can self-publish equally well on the web, unfortunately proving to others which is which is outside the area of self-publishing and they cannot do this easily. This problem is also a general one for people publishing books or anything else.

roca · on Aug 25, 2020

It is very very easy for self-publishers to publish plain static HTML content on the Web. A good modern solution is Github Pages, but there have always been good solutions for this.

Yes, people have gravitated to centralized platforms for various reasons, and those platforms tend to make it difficult to publish entirely handwritten raw HTML/CSS, but it's not because the alternatives ceased to exist.

millstone · on Aug 24, 2020

Every time my browser's Reader Mode works, it's a better experience. Some day it will work often enough that I won't look back.

tenebrisalietum · on Aug 25, 2020

At some point the Web will be so slow and bloated that it won't attract revenue or developers. A clean slate is possible then.

hinkley · on Aug 25, 2020

You would need something that turns a SPA into a much smaller standalone app by only implementing 80% of the spec.

apitman · on Aug 24, 2020

> persuading Web developers to use it is where the real problem lies, and that is immensely difficult and no-one has any good ideas for how to do it

Let the different subsets compete. You'd have multiple different simple browser that each support a different subset. Developers would have to choose which one they agree with and make their sites compatible. They would all render properly in full browsers. You can find the sweet spot over time.

adrianN · on Aug 24, 2020

Competition doesn't work in the presence of network effects.

kjeetgill · on Aug 24, 2020

We lived through browser fragmentation before, with IE support vs other web standards. It just meant that developers had to make a version for everything except for big critical things like banks and .gov will be 'ie-only' (or maybe 'chrome-only', now).

It's way better to have one jankier universal platform with a few good implementations.

giantrobot · on Aug 24, 2020

We had pretty much this exactly with XML and XSLT. The XML would be entirely about content/data with only suggested presentation based on linked stylesheets. The browser could use the linked stylesheets or ignore them and use its own. Since the XML would declare its schema a browser could handle pretty much any data so long as it had a handler for that schema.

Since the XML could carry pretty much any sort of data server endpoints could back an "Application Web" just as easily as a "Document Web". A browser could take a music playlist and transform it into a listing of tracks via XSLT while a music player would take that exact same URL and enqueue the tracks for playback.

Unfortunately the mixed presentation and content model had a lot of momentum. By the time a lot of web developers caught on to separating concerns XmlHttpRequest existed and let everyone reimplement all those cool XML/Web 2.0 ideas poorly in JavaScript.

Mandatum · on Aug 24, 2020

As someone who has done a lot of work in XSLT for data transformation in ESB or integration..

Honestly if I actually improved my tools a bit better, I'd almost enjoy using it. But it sucks, so damn much, in the current state.. And now nobody seems to be interested in XSLT ("XML is legacy"), I don't think I'm ever going to be interested in "fixing" it.

ilovetux · on Aug 24, 2020

This is sad because as Json seems to be taking over. Despite its terseness Json lacks a standardized way to do validation (ie xsd), transformations (ie xslt) and service definitions (ie wsdl).

Its regrettable that such a vast ecosystem exists for a markup language to facilitate its use in data transfer, transformation and validation.

It seems like a waste that so many working standards seem to have just been tossed away with nothing to put into their place because Json saves a few bits on the wire.

Also regrettable is the fact that there seems to be resistance in developing similar standards for json.

kungato · on Aug 24, 2020

Json is not popular because it saves bits but because it's easily readable and maps 1:1 with js objects. I've worked with xml and xslt and while I see the advantages I still prefer json for anything. Your validation should come from annotated objects on the backend anyways which should be a single source of truth and then it doesn't matter which markup format you are validating

ilovetux · on Aug 24, 2020

I can kind of see what you're saying, but what does "annotated objects on the backend" mean. If there is not a standard I can only hack a solution together. I cannot ensure anything with 100% certainty, instead I'm left guessing if my logic is correct.

With a standard, however, I can be sure that what I intend is being validated and if my logic is wrong then I can change it to match the spec.

dsego · on Aug 24, 2020

Json schema maybe http://json-schema.org/

giantrobot · on Aug 24, 2020

The tooling did/does leave a lot to be desired in XML workflows. I think, in XML's heyday, it was way easier to template with PHP/ASP/JSP on the server. While those languages could have just as easily spit out XML to be transformed on the client, writing the XSL was non-trivial and lacked good tools to make the process easier.

I find JSON to be a lackluster replacement for XML. The efforts to bolt on XML features (JSON Schema etc.) are half assed and not widely supported. Without a bunch of extra complication you can't really know what a value in JSON means beyond its translation to some primitive.

The web development world doesn't deserve XML! /s

I am a bit of an XML fanboy as it solves complicated problems in a logical way. I also bought into (and continue to believe) in a lot of Web 2.0 ideas beyond pastel colors and pseudo-reflections on logos.

pkphilip · on Aug 24, 2020

There is really no need to reinvent or start with a clean slate. You can just do the following:

1. Use HTML and vanilla JS or jquery without using the huge set of javascript libraries needed to render a simple page.. as an SPA. What was the problem with plain old HTML in the first place? and whats the problem with server side rendering?

2. Avoid Google Adwords and the rest of the nonsense.. including AMP.

3. Support search engines like DuckDuckGo

Now, none of this means your website will be visited often or find its content listed high on Google.

However, if you can live with that, why not?

bruce511 · on Aug 24, 2020

So I'm not sure I understand the point of the article. Or if I do, then the argument seems unlikely.

Split the Web? On the server side? Or the client side?

Splitting the server seems unnecessary - since servers already let you serve as simple a site as you like. If you want to make simple sites with no javascript you are welcome to do that.

So split the client? Have one app to read some sites and another to read some other sites? That sounds contrarian to literally every program evolution in history. Because all client software strives to read all relevant formats (think Word loading WordPerfect files.) why would a user choose 2 separate apps when 1 would do?

It seems to me that the author misses the point that sites are javascript heavy not because they have to be, but because their makers want them to be (largely so they can make money)

bE9a3S5So8igd3 · on Aug 24, 2020

Correction: Sites are javascript heavy because it is easy to accidentally create javascript-heavy sites, and the programmers who do this also do not care that they have done it.

codegladiator · on Aug 24, 2020

> it is easy to accidentally create javascript-heavy sites

It's definitely not easy by any definition. And also not happening accidentally. Almost all frameworks/libraries mention about how small they are in overhead.

bE9a3S5So8igd3 · on Aug 25, 2020

I'm sure the front-end programmers you work with really pay attention to the size of their JS build /sarcasm

cblconfederate · on Aug 24, 2020

it's OK to have one browser engine -- this means the web is mature (We have one linux kernel after all). The problem is that google controls that. To attack that, one has to fork chrome and add features that google would never add , because they directly threaten their business, e.g. instant in-browser payments. Brave browser does that (ironically, i dont use it because my websites have ads and i do want to see if they are showing up). More of these google-unfriendly features (e.g. built in pseudonymous blockchain-based or other decentralized identity) would appeal to web devs and cause enough people to switch.

Otherwise, the problems he s describing have 100% to do with the miserable culture of CV-padding that has taken over the silicon valley and led to SPA monstrosities, rather than with web standards themselves. WWW as a standard is OK, old websites work, and HTML/CSS is still there. Trying to make the DOM work as some kind of window rendering engine is a travesty that doesnt work.

jakelazaroff · on Aug 24, 2020

> Otherwise, the problems he s describing have 100% to do with the miserable culture of CV-padding that has taken over the silicon valley and led to SPA monstrosities, rather than with web standards themselves.

I don’t understand what you’re saying here. SPAs aren’t circumventing web standards, like Flash — web standards are what make SPAs as they currently exist possible in the first place.

Indeed, the complaint is that the standards themselves have gotten too complex, such that only Google —with its obvious conflicts of interest — has the resources to develop a browser that complies with them.

tarsinge · on Aug 24, 2020

> Trying to make the DOM work as some kind of window rendering engine is a travesty that doesnt work.

Regarding the CV padding theory the explanation is more straights for me: it simply works better that what we had previously for enterprise software. It’s more relevant to compare SPA not to the WWW but to 90’s or 00’s corporate software.

Like what would you use for the typical CRUD with some analytics business app? The drawbacks of SPA are nothing compared to the universality of the web interface (“oh and it must work on my iPad too”) in the enterprise world.

irrational · on Aug 24, 2020

> For folks who just want to create a web page, who don’t want to enter an industry, there’s a baffling array of techniques, but all the simplest, probably-best ones are stigmatized. It’s easier to stumble into building your resume in React with GraphQL than it is to type some HTML in Notepad.

I’ve experienced this. I started web development in 1996/1997, so I have a lot of experience building pages by hand. These days I do use Vue.js, but sometimes I’m still known to build pages by hand. I’ve gotten so much grief for it over the years. Not because there is anything technically wrong with my files, but because “it’s not the way things are done today”.

chrismorgan · on Aug 24, 2020

I’m going to focus on one aspect of the article only.

> You might want to start with a lightweight markup language, which will ironically be geared toward generating HTML. Markdown’s strict specified variation, Commonmark, seems like a pretty decent choice.

Markdown of any form would be a catastrophically bad choice for something like this. Remember this: Markdown is all about HTML. Markdown has serious limitations and is very poorly designed in places, but it works well enough despite being so bad specifically because underneath it’s all HTML, so you can fall down to that when necessary. If you go with Markdown, you’re not actually supporting “Markdown” as your language, but HTML. If the user writes HTML in their Markdown, what are you going to do about it? You’re supposed to be able to write the HTML equivalent of any given Markdown construct and have it work, and I find that perfectly normal documents need to write raw HTML regularly, because Markdown is so terribly limited—so your document web is either (a) not actually Markdown and uselessly crippled (seriously, plain Markdown without HTML is extremely limiting for documents), or (a) actually completely back to being a subset of HTML, which is exactly what you said you didn’t want it to be (rule #1).

If you want to get away from HTML, you want something like reStructuredText, where you don’t have access to HTML (unless the platform decides to set up a directive or role for raw HTML), but have a whole lot more semantic functionality, functionality that had to be included because HTML wasn’t there to fall back on. I think AsciiDoc is like this too.

But the whole premise that “HTML must be replaced because it’s a performance or accessibility bottleneck for documents” is just not in the slightest bit true. HTML is fine. Even CSS is well enough (though most pages have terribly done stylesheets that are far heavier than they should be, e.g. because they include four resets and duplicate everything as well as overriding font families five times). The problem lies almost entirely in what JavaScript makes possible—not that JavaScript is inherently a problem, but it makes problems possible. The proposed variety of document web would be unlikely to perform substantially better than the existing web with JavaScript disabled.

All up, Robert O’Callahan’s response is entirely correct. You can’t split the web like this; you will fail. Especially: if your new system is not compatible with the old (which your rule #2 precluded), it will fail, unless it is extremely more compelling for everyone (rule #3), but I don’t believe there’s anywhere near enough scope for improvement to succeed.

jabirali · on Aug 24, 2020

> Remember this: Markdown is all about HTML. Markdown has serious limitations and is very poorly designed in places, but it works well enough despite being so bad specifically because underneath it’s all HTML, so you can fall down to that when necessary.

Originally, this was the case, but I'd argue that modern MarkDown flavors have been separated from HTML. It's common to compile via other pathways than HTML now (e.g. MarkDown → LaTeX → PDF), and many implementations don't support inline HTML anymore (e.g. many MarkDown-based note-taking apps).

> I find that perfectly normal documents need to write raw HTML regularly, because Markdown is so terribly limited

Out of curiosity, what do you need it for? I write a lot of MarkDown, but haven't felt any need for writing raw HTML. Modern versions of MarkDown (e.g. the Pandoc variant) have native support for things like tables, LaTeX equations, syntax-highlighted code, even bibliography management.

chrismorgan · on Aug 24, 2020

These are the three most common cases that I find for needing raw HTML:

1. Adding classes to elements, for styling; admittedly this may be inapplicable to some visions of a document web, if you can’t write stylesheets.

2. Images. If you’re dealing with known images, you should always set the width and height attributes on the <img> tag, so that the page need not reflow as the image loads. Markdown’s image syntax (![description](src)) doesn’t cover that. (Perhaps an app could load the image and fill out the width and height as part of its Markdown-to-HTML conversion, but I haven’t encountered any that do this.)

3. Tables. CommonMark doesn’t include tables, and even dialects that do support tables are consistently insufficient so that I have to write HTML. For example: I often want the first column to be a heading; but I don’t think any Markdown table syntaxes allow you to get <th> instead of <td> for the first cell of each row.

jabirali · on Aug 24, 2020

Fair enough :). For the record, Pandoc MarkDown supports all of these via its extended MarkDown syntax. For the first you can write [desc](src){.test} to get a class=test attribute on the link, for example. For the second, you can write ![desc](src){width=50%} to set the image size. For the last, tables do automatically get <th> on the first cell of each row when converted via Pandoc. This is however not standard MarkDown but Pandoc's extended version of MarkDown.

slmjkdbtl · on Aug 24, 2020

So everyone is dreaming for their own version of the web / browser. As a game developer I was just thinking couple days ago the browsers should only provide the lowest level rendering, audio, user input and networking API, let users provide anything that they need on top like HTML/CSS/JS engines and network protocols. Here the author talks about a Markdown web, I would argue this should be built on top of my low level browser, there's never going to be an answer. Let's just start building, the fun is immense designing and building protocols like this, of course it won't change the way the mainstream web but I really like how projects like gemini turned out, having its own scene and supporters

bgirard · on Aug 24, 2020

That would absolutely wreck performance, particularly startup times. You're not going to beat the performance of having 100 open tabs share the same highly optimized PGO-ed browser runtime by replacing that with 100 different runtimes. You're talking about downloading and keeping a small browser engine in memory for every page.

dmux · on Aug 24, 2020

I've not seen PGO before, are you referring to this? https://en.wikipedia.org/wiki/Profile-guided_optimization

bgirard · on Aug 29, 2020

Yes, Chrome just made recent changes to their PGO: https://blog.chromium.org/2020/08/chrome-just-got-faster-wit...

slmjkdbtl · on Aug 27, 2020

Very true, I'm talking about an unlikely scenario where HTML/CSS/JS is an unpopular need. Yeah there's no solution that'll fit every one's need.

bgirard · on Aug 29, 2020

You can actually get a decent half-way hybrid by shipping WASM but still binding to HTML and using rich Web APIs as appropriate.

jonathanaird · on Aug 24, 2020

This is how Flutter works.

kunfuu · on Aug 24, 2020

My 2cents.

1. A markup language taking multi-lingual considerations to heart is a must for me. Markdown doesn't work well for non-alphabetic languages in terms of typesetting etc. The web is global and multi-lingual, a new web (if there is one) should be, too.

2. I prefer a new content web rather than a new document web. When I use the web, I absorb and sometimes create content beyond documents. I don't want art communities, for example, to be excluded from the new web. Although an art sharing community can be modeled after documents technically, 'content' is the more apropos. Your mileage may vary, though.

bebna · on Aug 24, 2020

For example hugo allows multi language. This is done by file name or directory [0]. I find this much more usable than trying to fit multiple languages in one file.

[0]: https://regisphilibert.com/blog/2018/08/hugo-multilingual-pa...

maxxk · on Aug 24, 2020

I don't have experience with non-alphabetic writing systems, but I was under impression that for simple written communication nowadays plain text is good enough for every language supported by Unicode. I've heard e.g. about Han unification, but is it bad enough even for non-linguistic purposes?

Which features, in your opinion, are required for decent non-alphabetic language compatibility on top of plain Unicode text?

irrational · on Aug 24, 2020

The first thing that comes to mind is things like right-to-left or top-to-bottom languages. Some ancient languages were written right-to-left and then left-to-right and then repeat.

kunfuu · on Aug 24, 2020

Markdown isn't plain Unicode text. It mixes markup marks with plain text, and is intended to provide more layout than plain text. Plus, the question is rather broad, as there are so many non-alphabetic languages around the globe. I'm not knowledgeable enough to offer an answer.

You may want to read the work by W3C[0]. Some requirements mentioned exceed the needs of simple written communication, but not all in my experience. The problems usually arise when you mix different scripts.

[0] https://www.w3.org/TR/typography/

EDIT: clarifications.

Animats · on Aug 24, 2020

"Stop, already" might be a good idea.

As in, no more of the stuff on this list [1] without really broad buy-in. We're seeing Google do "expand, engulf, devour" to the Web.

Already, it's hard to send mail unless you're a big player. Writing a browser is a huge job. Just fetching an HTTP file is now complicated enough that people run "curl" as a subprocess to do it.

[1] https://mozilla.github.io/standards-positions/

66fm472tjy7 · on Aug 24, 2020

I think the document web markup language would have to be more like server-side HTML templating languages[1] with the added ability to talk to REST endpoints than markdown. That means declarative support for

* declaring JSON (or similar format, but why reinvent the wheel) variables

* rendering the values of these variables

* interacting with the variables with controls (inputs, check boxes, selects, etc)

* conditional rendering based on those variables

* iteration over them

* submitting a form as a JSON object

* saving a JSON response from the server into a variable

* re-rendering the effects of a variable change without a full page reload

I think a major reason for why we ended up here[2] is because browsers did not add these features to HTML and thus devs resorted to AJAX and small JS snippets at first and then kept expanding the use of JS.

[1] as a Java dev who used to do front-end before the ascendance of client side SPA frameworks I'm thinking of JSP, JSF, Thymeleaf, etc. Maybe the client side stuff used nowadays is similar, I'm just not familiar with it.

[2] being forced to run a turing-complete programming language just to display a fully static document, leaving one exposed to security risks (yes, the browser guys are doing a much better job sandboxing than Flash, Java Applets, etc) and tracking.

ratww · on Aug 24, 2020

> I think a major reason for why we ended up here[2] is because browsers did not add these features to HTML and thus devs resorted to AJAX and small JS snippets at first and then kept expanding the use of JS.

That's my hunch as well. If HTML had built-in features such as modals (maybe as an evolution of popups), or partial rendering of server side data (maybe as an evolution of frames/iframes), it would have taken more time to reach the current state.

Btw, your list reminds me of Intercooler.js.

baystep · on Aug 24, 2020

I wonder if something like Nunjucks but with CommonMark could work great for the document web.

bE9a3S5So8igd3 · on Aug 24, 2020

https://en.wikipedia.org/wiki/Apache_Tapestry

chrismorgan · on Aug 24, 2020

User stylesheets were removed from Chrome for entirely valid and pragmatic reasons. The feature was poorly implemented and a maintenance burden, and fares much better in an extension (though I gather that the replacement wasn’t quite perfect until two or three years later due to precedence of style application).

I don’t believe Chrome’s user stylesheets even supported anything like Firefox’s @-moz-document for scoping the styles to an origin or similar—it was a completely blunt instrument that had very limited use, even less than Firefox’s userContent.css.

I wouldn’t complain if userContent.css were removed from Firefox for the same reasons (though I would certainly complain if they removed userChrome.css, since no alternative exists in that case). I used to use userContent.css. I’ve used extensions (now Stylus) for the equivalent functionality for years now.

The tweet this article cites about its removal six years ago is terrible: it quotes the most recent comment on that issue as though it were fact stated by the developers, when it is in fact a random person (probably a disgruntled user) jeering; said comment is completely false.

jrochkind1 · on Aug 24, 2020

I don't think a clean division into two categories "document" and "application" will work.

It's more of a spectrum than a binary, a lot of things exist at various points in between "completely document" and "completely app". Even wikipedia has editing interfaces which are arguably "app" right? I guess it could be split into two separate "things", one using the "document web" for display and another using the "application web" for editing. This doesn't seem all that appealing.

I think the ability to be wherever you want on the document/application spectrum, and have a site (or a part of a site) move along the spectrum over times without having to "rewrite in a new technology" is part of what leads to success of the current web as it is.

The OP is probably not wrong that it's also part of what leads to the technology mess. But I don't think it's realistic to think you can bifurcate the architecture for two ideal use cases like that.

frank2 · on Aug 25, 2020

>I don't think a clean division into two categories "document" and "application" will work.

Do you also believe that there can be no clean division into "data" and "code"?

If a program I wrote is interacting with an API, is it not completely unambiguous for me to ask the maintainer of the API to give me some way to directly read all of the persistent data behind the API?

jrochkind1 · on Aug 25, 2020

Sure. I specifically don't think it's realistic to think the architecture of the web can be divided into two separate architectures, one for web sites that are purely "documents" and another for websites that are purely "applications". Because I think the success of the web has been due to the fact that you can move along the spectrum from more "static document" to "interactive application" with the same technologies and architectures.

I don't think it's a general principle about "data" and "code" in general", although of course data and code can also slide into each other in many other venues.

But of course you can write programs that access APIs. I'm sorry if I was unclear and you thought I was saying it was impossible to write programs that access APIs to get persistent data.

apitman · on Aug 24, 2020

In many ways, browsers have become Rube Goldberg machines for rendering text, images, and video.

A major reason people are concerned about the recent Mozilla layoffs is because they're one of the few competing browser vendors. Why does that matter? Because browsers are too complicated for reasonably sized teams to implement competing solutions. If a larger percentage of web content was accessible to simpler software, this would be much less of an issue.

It's not hard to make simple document browsers. My personal content is accessible in the console as plain text:

    curl https://apitman.com/txt/19

    nc apitman.com 2052 <<< /txt/19

I like the idea of splitting the web between documents and apps (and using separate browsers to access them), but don't necessarily agree making it incompatible is the way to go. I think specifying a subset of HTML/CSS and allowing the "new web" to grow over time might see more adoption better in the long run.

jl6 · on Aug 24, 2020

Oh wow, I just wrote my own version of this rant, reaching a radically different conclusion:

https://lab6.com/0

TL;DR: I’m abandoning HTML and switching to PDF.

chrismorgan · on Aug 24, 2020

So, a megabyte of PDF which both downloads and renders far more slowly than the equivalent HTML, which could be 20KB of HTML, no JavaScript, 1KB of CSS and perhaps 50KB of images (which don’t block page rendering, either), all bundled into one HTML file (base64 the raster images) of under 100KB.

I can imagine why you might have concluded as you have, but your conclusion is nonetheless baffling. Most of the reasons you’ve chosen PDF over HTML apply just as truly to HTML; page-orientation is the only one that doesn’t, and I refute your claims that that page orientation is a good thing for the web at large or the hardware that most use. (Guess how I read the document? By scrolling, not by pagination; the footer and header of each page transition is just a minor annoyance in the middle of a paragraph of text.) I disagree with every point of your assessment on PDF’s historical disadvantages, too: PDF as used still includes patent-encumbered things, and implementations are complex enough that major deviations and incompatibilities are common; PDF files are still ginormous (nothing has changed about this ever), and tooling is terrible (perhaps even non-existent?) for trying to shrink files without butchering the lot, with yours as a good example of not being an order of magnitude smaller as claimed; it’s still far harder to make PDFs accessible, because most free tools just can’t do it, and it takes far more effort than HTML where it’s easy (as before, tooling is terrible, where with HTML you can just edit the source in a text editor); most PDF readers can’t reflow text (don’t think I’ve ever used a tool—reader or writer—that could); most PDFs can’t be edited freely; and PDFs don’t render well on screen.

You’ve thrown out the HTML ecosystem because some people abuse it, and are choosing to use PDF despite the rampant abuse and many problems of the format, because it’s theoretically possible to work around those problems. (You may quibble with my judgement of “rampant abuse”, but most PDFs I encounter on the web are PDFs for no good reason and perform terribly, loading slowly and being worse for task performance. It is perhaps a slightly different class of abuse, since spying on users via JavaScript isn’t part of it, but it’s related inasmuch as it’s not about meeting the needs of the user.) This does not seem internally consistent.

jl6 · on Aug 24, 2020

Thank you for your feedback. I’m not going to vigorously defend PDF, because a lot of the distaste you express is legitimate. It’s a statement of my disappointment in current web/browser trends that I would be willing to accept all of PDF’s hardships to distance myself from the churn.

For what it’s worth though...

That 1MB of PDF still loads nearly instantly on my phone - subjectively no slower than the equivalent much-smaller bare-bones HTML version.

PDF patents are all licensed royalty-free for the normative parts of the PDF 2.0 spec.

PDF tooling sucks, but I’d rather see effort put into improving this situation than into yet another expansion of the web browser.

guidoism · on Aug 24, 2020

Ok, let's see what we can do to improve your PDF. I ran it through qpdf so I could view it in my text editor: qpdf --stream-data=uncompress 0.pdf 0.uncompressed.pdf

The Tj operators (for printing text) are operating on numbers instead of ascii text making it difficult to read.

I'll see what I can do to duplicate your PDF using some cli tools and a text editor... (hopefully I have time to do it today)

jl6 · on Aug 24, 2020

I’m intrigued and would be very interested if you find any tools that can create or verify tag structure.

FYI the file has already been linearized through qpdf.

chrismorgan · on Aug 24, 2020

It seems that Firefox is finishing downloading the full document before it is able to render the title page. (Some documents can have early pages rendered before the document’s all fetched, some can’t. Not sure what the technical difference is.) For me where I am, it’s taking 8–15 seconds (quite variable) to load the document and render the first page, or 1.5–3 seconds when cached. The equivalent single-file HTML would render completely in easily under two seconds, even including TLS negotiation when I’m 150ms away, and reload in half a second.

Doing things like jumping to the end takes perhaps 200–400ms to render that page in the PDF, where HTML would be instant (meaning “less than 16ms”).

No way would I access this on my not-overly-powerful phone: Firefox would download the PDF and try to open it in a local app (EBookDroid) instead, so all up I’d expect that’d make it something like 20–30 seconds to load, instead of 1–2. And the text would be minuscule (or only tiny in landscape mode) instead of sanely sized, further disincentive.

Good to know about the PDF patent situation. Do you happen to know how relevant that actually is to PDF tooling? Is it a new version of the file format, or a specification of the existing? (My knowledge of PDF is limited; I know the general concepts and how it’s put together, but not much of the intricate detail or PDF versions.) That is, does this help for existing documents, or are existing documents still stuck in a patent tangle?

On tooling, I just don’t believe PDF tooling is capable of being excellent; it’s a publishing format—a compilation target more than anything else—and by design not conducive to manipulability, where HTML is an authoring format, so you can work with it. Much of the stuff you can do with HTML tooling is by design fundamentally impossible with PDF. They’re very different types of formats.

jl6 · on Aug 24, 2020

PDF 2.0 is mostly a cleanup of the existing spec, plus a few minor new features. Adobe’s Public Patent License is similar to AOMedia’s AV1 approach to patents - the patent owners have granted royalty-free usage. However, AV1 also suffers from competing claims of patent ownership from a non-member of AOMedia. I’m not aware of any such claim relating to PDF, and as far as I know no PDF documents have any patent issues, and there are no royalties required to create PDF documents or tools.

I don’t think there’s a conflict between PDF being a publishing format and having great tools for producing that format. The editing can be done to an intermediate application file format (e.g. reStructuredText, or OpenDocument Text, or Microsoft Visio), with the result rendered to PDF.

Oh and that’s unfortunate to hear the mobile Firefox experience is poor. I note there’s a project to add pdf.js support - this is the kind of thing I mean by improving PDF support.

guidoism · on Aug 24, 2020

Please don't confuse the monstrosities that our poor tools produce with the nice files that our good tools produce. Using a shit PDF from some manufacturers website to criticize the format is like using the HTML produced by Microsoft Word to criticize how shitty HTML is.

Remember that PDF is mostly a text format that's pretty simple at its heart. You can actually hand-write a PDF, though it's painful: https://brendanzagaeski.appspot.com/0005.html

Take a look at the cli tool QPDF. Once you uncompress each of the dictionaries within the file you can load it up into a text editor and see what is possible. Take a shitty large PDF as an example and then one produced by LaTeX. See the difference.

Free tools can and do produce amazing PDFs that are small and quick to render. Honestly, PanDoc is probably the place to start.

Both ecosystems — HTML and PDF — are under rampant abuse, I agree. But with PDF we could end up with something better — Not because it's most constraining or more freeing but because we would be doing the hard work of layout once, on the server, not a billion times on our battery and cpu constrained portable devices.

irrational · on Aug 24, 2020

As another person said, your pdf fonts are way too small. At least on my iPhone 8+ I’m having a super hard time with the size.

jl6 · on Aug 24, 2020

I tested on that exact model of phone and I found that when zoomed to the width of the main column of text, in portrait orientation, the font size was only a little smaller than the size used on Hacker News. In landscape orientation, zoomed in the same way, it’s a little bigger.

irrational · on Aug 24, 2020

Maybe it is the font and not the size. I don't know. I just know that I'm having a very hard time reading it.

guidoism · on Aug 24, 2020

Hah! I came to the same conclusion. PDF isn't actually that bad of a format and it's much easier and faster to parse and render than HTML/CSS.

My conclusion:

- Use the streaming format with the index headers at the start

- Break the document into pages to aid rendering speed while jumping around but use zero margins on top/bottom to hide the fact that they are separate pages

- Do the layout at publish time for a few screen widths — No need to reflow perfectly for every possible width

We can get started with this and then maybe later tweak the PDF format — A text format with a binary format with a text format within a binary format kinda sucks. Should be completely binary. Maybe use an existing serialization format like protobufs.

jl6 · on Aug 24, 2020

Reflow is still important to support accessibility, and it depends on Tagged PDF, not that I can find any open tools that actually can do this.

fmakunbound · on Aug 24, 2020

Wish I could read it, but the pdf fonts make it too small to read on my phone.

twknotes · on Aug 24, 2020

Why not PostScript?

jl6 · on Aug 24, 2020

PDF is more mainstream with more user-friendly tools. PDF supports transparency. PDF is (mostly) consistently rendered by all kinds of client applications. PDF is not a general purpose programming language (at least in its PDF/A subset) - and I’m considering that a feature rather than a limitation.

dgudkov · on Aug 24, 2020

I support the idea of re-starting the web. Although, I believe the new web should be more deeply differentiated from the current one than what's offered by the author. Decentralized payments, notifications, persona-based authentication, privacy, content aggregation, structured data should be deeply ingrained in the new web, instead of being an afterthought. Re-starting the web requires a thorough analysis of what's lacking in the current web, and an open discussion on addressing the shortcomings in the new web.

Also, while not it doesn't have to be compatible with the current web, there should be mechanisms that translate the new web into (maybe limited) form of the current web, to utilize at least some of the tools and standards.

teraku · on Aug 24, 2020

I wonder if it would be impossible to carve out a set of standards of a browser (e.g. WebRender, HTML, CSS, JS Engine, ...) and then have everything outside of that be a modular plugin you can install. And make building those modules easy.

"Netflix needs a DRM module. Download this official release from <Browser Org>, <Netflix>, <Tom who really likes DRM and encoding alrogithms>".

Or am I just too naive and don't know enough about how browsers work?

hliyan · on Aug 24, 2020

Another proposal: a standardized search protocol over HTTP that can be implemented by most web servers and a DNS-like system of search indexes that use websites' built in index query endpoints (rather than crawling). Essentially decoupling search and advertising by building a search function into the internet standards.

Perhaps commercial products can improve over the quality of results, but simple text and keyword search can work this way.

diegolo · on Aug 24, 2020

you can't do that. when you implement search you need a global view over the collection to understand what are the most relevant documents. You are describing the paradise of the spammers :)

aembleton · on Aug 24, 2020

Would make SEO much easier, and I guess would increase the amount of spam on Google. Any website can return the search results that it would like.

I guess search engines could apply a score to each site so that those who spam, get down-graded, but then how would they do the scoring? By crawling? Then you still need a crawler and might as well build your own index.

smlckz · on Aug 24, 2020

If I think about a document web, I don't think everyone will like to write in Markdown. There are many ''lightweight markup language''s to choose from. So maybe there could be a ''low-level'' markup language like Scribe or Texinfo to compile into from your favourite markup language. I have a small example of what that might look like here (just the sketch, not impl): https://news.ycombinator.com/item?id=24249636

What do you think?

chrismorgan · on Aug 24, 2020

You seem to have just taken HTML and provided it with a different syntax, for no clear reason.

HTML is the compilation target, the low-level markup language that Markdown, reStructuredText, &c. can render to.

smlckz · on Aug 24, 2020

Yes, this does look like html in another cloth. I learnt about this from here: https://nofluffjuststuff.com/blog/douglas_crockford/2007/06/...

Compare:

        <strong>important</strong>

with:

        @strong{important}

It looks like some characters saved here. I don't know how much savings would be in the long run.

For big sections, one could use @begin{blah} content here @end{blah}.

It looks a bit plesant in my eyes (than html). And only _one_ special character @, which is repeated twice to escape it.

chrismorgan · on Aug 24, 2020

Yeah, but… why? We already have all our tooling built around HTML syntax, why start something new that is not meaningfully better? All you’re doing is fracturing the ecosystem needlessly.

smlckz · on Aug 24, 2020

hmm. all toolings are around html. it isn't or meant to be going mainstream. just a niche...

>> All you're doing is fracturing the ecosystem needlessly.

I haven't done anything yet ;)

twknotes · on Aug 24, 2020

Document and App web has an underlying unified component -- message. Design a universal message format and exchange protocol, each only in one A4 page, that may be the foundation of New WEB.

makeworld · on Aug 24, 2020

Glad to see Gemini mentioned, I think it provides a good answer to his Document Web 2.0 idea.

Gemtext might not be as full featured as some might like, but you can always serve markdown instead.

limomium · on Aug 24, 2020

Last I checked, Gemini can't have pictures, only text.

maxxk · on Aug 24, 2020

Well, Gemini can have links to pictures and clients may render pictures — inline, in separate "graphics" column etc. No one does it yet, as far as I know, but we will see.

Table and math bring more problems. Probably SVG or other images can solve them, but it is a pain for authors. In my opinion, no plain-text markup language today has tables with features even remotely close to HTML (at least colspan, rowspan and alignment) while remaining convenient to both write and read. AsciiDoc is fine in terms of features, but not in terms of syntax:

https://asciidoc.org/newtables.html

z3t4 · on Aug 24, 2020

The problem is that you can not sell something that is 20 years old, works, easy to use, and is freely available.

trenchgun · on Aug 23, 2020

Yes, please.