HTML was historically an application of SGML, and SGML could do includes. You could define a new "entity", and if you created a "system" entity, you could refer to it later and have it substituted in.
<!DOCTYPE html [
<!ENTITY myheader SYSTEM "myheader.html">
]>
....
&myheader;
SGML is complex, so various efforts were made to simplify HTML, and that's one of the capabilities that was dropped along the way.
The XML subset of SGML still includes most forms of entity usage SGML has, including external general entities as described by the grandparent comment. XInclude can include any fragment, not just a complete document, but apart from that it was redundant, and what remains of XInclude in HTML today (<svg href=...>) doesn't make use of fragments and also does away with the xinclude and other namespaces. For reusing fragments, OTOH, SVG has the more specific <use href=...> construct. XInclude also worked badly in the presence of XML Schema.
It's too bad we didn't go down the XHTML/semantic web route twenty years ago.
Strict documents, reusable types, microformats, etc. would have put search into the hands of the masses rather than kept it in Google's unique domain.
The web would have been more composable and P2P. We'd have been able to slurp first-class article content, comments, contact details, factual information, addresses, etc., and built a wealth of tooling.
Google / WhatWG wanted easy to author pages (~="sloppy markup, nonstandard docs") because nobody else could "organize the web" like them if it was disorganized by default.
By the late 2010s, Google's need for the web started to wane. They began embedding lifted facts directly into the search results, pushed AMP to keep us from going to websites, etc.
Google's decisions and technologies have been designed to keep us in their funnel. Web tech has been nudged and mutated to accomplish that. It's especially easy to see when the tides change.
As a programmer, I really liked XHTML because it meant I could use a regular XML parser/writer to work with it. Such components can be made small and efficient if you don't need the more advanced features of XML (ex: schemas), on the level of JSON. I remember an app I wrote that had a "print" feature that worked by generating an HTML document. We made it XHTML, and used the XML library we already used elsewhere to generate the document. Much more reliable than concatenating strings (hello injections!) and no need for an additional dependency.
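Something like this, as a rough sketch (Python's standard-library ElementTree here, not the library we actually used, and the data is invented):

import xml.etree.ElementTree as ET

# Invented report data; note the markup-like characters in the second name.
rows = [("Alice", 42), ("Bob <b>", 7)]

html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
body = ET.SubElement(html, "body")
ET.SubElement(body, "h1").text = "Report"
table = ET.SubElement(body, "table")
for name, value in rows:
    tr = ET.SubElement(table, "tr")
    ET.SubElement(tr, "td").text = name        # escaped on output, no injection
    ET.SubElement(tr, "td").text = str(value)

# Well-formed XHTML out, with special characters escaped for free.
print(ET.tostring(html, encoding="unicode"))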
In addition, we used XSLT quite a bit too. It was nice being able to open your XML data files in a web browser and have them nicely formatted without any external software. All you needed was a link to the style sheet.
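For example (hypothetical file names; lxml is used here just to run the same kind of transform outside the browser, since in the browser all it takes is an <?xml-stylesheet type="text/xsl" href="report.xsl"?> line at the top of the XML file):

from lxml import etree

doc = etree.parse("report.xml")     # the XML data file
xslt = etree.parse("report.xsl")    # the stylesheet the browser would also fetch
result = etree.XSLT(xslt)(doc)      # apply the transform
print(str(result))                  # serialized output, e.g. XHTML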
The thing I liked the most about XHTML was how it enforced strict notation.
Elements had to be used in their pure form, and CSS was for all visual presentation.
It really helped me understand and be better at web development - getting the tick from the XHTML validator was always an achievement for complicated webpages.
I don't think there was ever a sustainable route to a semantic web that would work for the masses.
People wanted to write and publish. Only a small portion of people/institutions would have had the resources or appetite to tag factual information on their pages. Most people would have ignored the semantic taxonomies (or just wouldn't have published at all). I guess a small and insular semantic web is better than no semantic web, but I doubt there was a scenario where the web ended up as rich as it actually became while also being rigidly organized.
Also even if you do have good practices of semantic tagging, there are huge epistemological problems around taxonomies - who constructs them, what do the terms actually mean, how to organize them and so on.
In my experience trying to work with Wikidata taxonomies, it can be a total mess when it's crowdsourced, and if you go to an "expert"-derived taxonomy there are all kinds of other problems with coverage, meaning, and democracy.
I've had a few flirtations with the semantic web going back to 2007 and long ago came to the personal conclusion that unfortunately AI is the only viable approach.
The "semantic" part was what eventually became W3C's RDF stuff (a pet peeve of TBL's predating even the Web). When people squeeze poetry, threaded discussion, and other emergent text forms into a vocabulary for casual academic publishing and call that "semantic HTML", that still doesn't make it semantic.
The "strict markup" part can be (and always could be) had using SGML which is just a superset of XML that also supports HTML empty elements, tag inference, attribute shortforms, etc. HTML was invented as SGML vocabulary in the first place.
Agree though that Google derailed any meaningful standardization effort for the reasons you stated. Actually, it started already with CSS and the idiocy of piling yet another item-value syntax on top of SGML/HTML, when it already had attributes for formatting. The "semantic HTML" postulate is kind of an after-the-fact justification for the insane CSS complexity that could grow because it wasn't part of HTML proper and so escaped the scrutiny that comes with introducing new elements or attributes.
I kinda bailed on being optimistic/enthusiastic about the Web when xhtml wasn't adopted as the way forward.
It was such a huge improvement. For some reason rather than just tolerating old tag-soup mess while forging the way for a brighter future, we went "nah, let's embrace the mess". WTF.
It was so cool to be able to apply XML tools to the Web and have it actually work. Like getting a big present for Christmas. That was promptly thrown in a dumpster.
I kinda agree with you but I'd argue the "death" of microformats is unrelated to the death of XHTML (tho schema.org is still around).
You could still use e.g. hReview today, but nobody does. In the end the problem of microformats was that "I want my content to be used outside my web property" is something nobody wants, beyond search engines that are supposed to drive traffic to you.
The fediverse is the only chance of reviving that concept because it basically keeps attribution around.
That’s not how the history went at all. When I worked at an internet co in the late 1990s (i.e. pre Google's dominance), SGML was a minority interest. We used to try to sell clients on an intranet based on SGML because of the flexibility etc., and there was little interest; sloppy markup and incorrect HTML were very much the norm on the web back then (pre Chrome etc.).
Me personally, I didn't even care that much about the strict semantic web, but XML has the benefit of the entire ecosystem around it (like XPath and XSLT), composable extensibility in the form of namespaces, etc. It was very frustrating to see all that thrown out with HTML5, and the reasoning never made any sense to me (backwards compatibility with pre-XHTML pages would be best handled by defining a spec according to which they should be converted to XHTML).
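For example, the standard XPath tooling just works on an XHTML page. A rough sketch, using lxml with made-up markup:

from io import BytesIO
from lxml import etree

# A tiny well-formed XHTML document.
page = b'<html xmlns="http://www.w3.org/1999/xhtml"><body><a href="/a">one</a> <a href="/b">two</a></body></html>'

doc = etree.parse(BytesIO(page))
# A plain, namespace-aware XPath query; no HTML-specific parser needed.
hrefs = doc.xpath("//x:a/@href", namespaces={"x": "http://www.w3.org/1999/xhtml"})
print(hrefs)  # ['/a', '/b']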
There was an ill-advised XHTML 2.0 project which was supposed to be incompatible, but it was abandoned. Currently XHTML is defined as an alternative “serialization” of HTML, but the semantics are exactly the same as HTML.
The semantic web is a silly dream of the 90s and 00s. It's not a realizable technology, and Google basically showed exactly why: as soon as you have a fixed algorithm for finding pages on the web, people will start gaming that algorithm to prioritize their content over others'. And I'm not talking about malicious actors trying to publish malware, but about every single publisher that has the money to invest in figuring out how, and doing it.
So any kind of purely algorithmic, metadata based retrieval algorithm would very quickly return almost pure garbage. What makes actual search engines work is the constant human work to change the algorithm in response to the people who are gaming it. Which goes against the idea of the semantic web somewhat, and completely against the idea of a local-first web search engine for the masses.
I would encourage you to go and read more about triples/asserting facts, and the trust/provenance of facts in this context.
You are basically saying "it's impossible to make basic claims" in your comment, which perhaps you don't realize.
I'm as big a critic of Google as anyone, but I'm always surprised at modern day takes around the lost semantic web technologies - they are missing facts or jumping to conclusions in hindsight.
Here's what people should know.
1) The failure of XHTML was very much a multi-vendor, industry-wide affair; the problem was that the syntax of XML was stricter than the syntax of HTML, and the web was already littered with broken HTML that the browser vendors all had to implement layers of quirk handling to parse. There was simply no clear user payoff for moving to the stricter parsing rules of XML and there was basically no vendor who wanted to do the work. To my memory Google does not really stand out here, they largely avoided working on what was frequently referred to as a science project, like all the other vendors.
A few things stand out as interesting. First of all, the old semantic web never had a business case. JSON-LD structured data does: Google will parse your structured data and use it to inform the various snippets, factoids, previews and interactive widgets they show all over their search engine and other web properties. So as a result JSON-LD has taken off massively. Millions of websites have adopted it. The data is there in the document; it is just in a JSON-LD section. If you work in SEO you know all about this. It seems to be quite rare that anyone on Hacker News is aware of it, however.
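For anyone who hasn't seen it, here is roughly what such a block looks like. The property names are standard schema.org terms, the values are invented, and it's generated with a bit of Python purely for illustration:

import json

article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example headline",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2024-01-15",
}

# The structured data lives in its own script block, separate from the visible markup.
print('<script type="application/ld+json">')
print(json.dumps(article, indent=2))
print('</script>')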
Second interesting thing: why did we end up with the semantic data being in JSON, in a separate section of the file? I don't know. I think everyone just found that interleaving it within the HTML was not that useful. For the legacy reasons discussed earlier, HTML is a mess. It's difficult to parse. It's overloaded with a lot of stuff. JSON is the more modern thing. It seems reasonable to me that we ended up with this implementation. Note that Google does have some level of support for other semantic data, like RDFa, which I think goes directly in the HTML; it is not popular.
Which brings us to the third interesting thing: the JSON-LD schemas Google uses are standards, or at least... standard-y. The W3C is involved. Google, Yahoo, Yandex and Microsoft have made the largest contributions, to my knowledge. You can read all about it on schema.org.
TL;DR - XHTML was not a practical technology and no browser or tool vendor wanted to support it. We eventually got the semantic web anyway!
Google does support multiple semantic web standards: RDFa, JSON-LD, and I believe microdata as well.
JSON-LD is much simpler to extract and parse; however, it makes the site's HTML bigger because information gets duplicated, compared to RDFa where values can be inlined.
I remember that just using PHP sessions back then on an XHTML document produced parse errors: PHP added the session ID to the query strings of links and used the raw & character instead of &amp; to separate params, which caused an XML parse error.
There was a push to prevent browsers from being too lenient with the syntax, in order to avoid the problems that sloppy HTML produced (inconsistent rendering across browsers).
The “semantic web” has been successful in a few areas, but not as much as SQL or document databases. Many data formats use it, such as RSS feeds and the XMP metadata used by Adobe tools.
As someone who worked in the field of "semantic XML processing" at the time, I can tell you that while the "XML processing" part was (while full of unnecessary complications) well understood, the "semantic" part was purely aspirational and never well understood. The common theme with the current flurry of LLMs and their noisy proponents is that it is, in both cases, possible to do worthwhile and impressive demos with these technologies, and also real applications that do useful things, but people who have their feet on the ground know that XML doesn't engender "semantics" and LLMs are not "conscious".

Yet the hype peddlers keep the fire burning by suggesting that if you just do "more XML" and build bigger LLMs, then at some point real semantics and actual consciousness will somehow emerge like a chicken hatching from an egg. And, being emergent properties, who is to say semantics and consciousness will not emerge, at some point, somehow? A "heap" of grains is emergent after all, and so is the "wetness" of water. But I have strong doubts about XHTML being more semantic than HTML5.
And anyway, even if Google had nefarious intentions and even if they managed to steer the standardization, one also has to concede that all search engines before Google were encumbered by too much structure and too-rigid approaches. When you were looking for a book in a computerized library at that point, it was standard to be sat in front of a search form with many, many fields: one for the author's name, one for the title, and so forth, and searching was not only a pain, it was also very hard to do for a user without prior training. Google had demonstrated it could deliver far better results with a single short form field filled out by naive users who just plonked down the three or five words that were on their mind, et voilà. They made it plausible that instead of imposing a structure onto the data at creation time, maybe it's more effective to discover associations in the data at search time (well, at indexing time really).
As for the strictness of documents, I'm not sure what it would give us that we don't get with sloppy documents. OK, web browsers could refuse to display a web page if any one image tag is missing the required `alt` attribute. So now what happens: will web authors duly include alt="picture of a cat" for each picture of a cat? Maybe, to a degree, but the other 80% of alt attributes will just contain some useless drivel to appease the browser. I'm actually more for strict documents than I used to be, but on the other hand we (I mean web browsers) have become quite good at reconstructing usable HTML documents from less-than-perfect sources, and the reconstructed source is also a strictly validating source. So I doubt this is the missing piece; I think the semantic web failed because the idea was never strong, clear, compelling, well-defined and rewarding enough to catch on with enough people.
If we're honest, we still don't know, 25 years later, what 'semantic' means after all.
That’s what lots of sites used to do in the late 90s and early aughts in order to have fixed elements.
It was really shit. Browser navigation cues disappear, minor errors will fuck up the entire thing by navigating fixed-element frames instead of contents, design flexibility disappears (even as consistent styling requires more effort), frames don’t content-size so they clip and show scroll bars all over, debugging is absolute ass, …
Yes it did, and there are HTML 5.x DTDs for HTML versions newer than HTML 4.x at [1], including post-HTML 5.2 review drafts until 2023; see notes at [2].
Yes! The attack on SolarWinds Orion was an attack on its build process. A verified reproducible build would have detected the subversion, because the builds would not have matched (unless the attackers managed to detect and break into all the build processes).
Only if you try to reproduce the signature. Usually the signature is stored separately; that way, it applies to the reproduced work as well.
"a language for extensions should... be a real programming
language, designed for writing and maintaining substantial programs...
The first Emacs used a string-processing language, TECO, which was
inadequate. We made it serve, but it kept getting in our way. It
made maintenance harder, and it made extensions harder to write...
Tcl was not designed to be a serious programming language. It was
designed to be a "scripting language", on the assumption that a
"scripting language" need not try to be a real programming language.
So Tcl doesn't have the capabilities of one. It lacks arrays; it
lacks structures from which you can make linked lists. It fakes
having numbers, which works, but has to be slow. Tcl is ok for
writing small programs, but when you push it beyond that, it becomes
insufficient.
Tcl has a peculiar syntax that appeals to hackers because of its
simplicity. But Tcl syntax seems strange to most users."
> It fakes having numbers, which works, but has to be slow.
This hasn't been the case for 25 years, since the 8.0 release.
Tcl stores data internally in whatever representation it was last used as, and only converts if needed. Good coding practice pays attention to this, to avoid "shimmering" a value back and forth between different internal representations unnecessarily.
> It lacks arrays;
It does have associative arrays; and lists, when used appropriately, can fulfil many of the roles that would otherwise be filled by an array in another language.
And tcllib [0], a collection of utilities commonly used with Tcl, provides support for a number of different and complex data structures [1], many of which are written in C, not just more Tcl scripts.
It's worth noting that Stallman's criticism linked above is more than three decades out of date. As with any programming tool, once you go beyond a superficial understanding of the basic syntax, it can serve as a very expressive and sufficient language.
But would that have solved anything here? The main maintainer was overwhelmed. The back door was obfuscated inside a binary blob there for so-called testing purposes. I doubt anyone was reviewing the binary blobs or the autoconf code used to load it in, and for that matter it’s not clear anything was getting reviewed. Fetching and building straight from GitHub doesn’t solve that if the malicious actor simply puts the binary blob into the repo.
Might not be a big chance depending on the project in question, but it's still tons more likely for someone randomly clicking through commits to find a backdoor committed to a git repo than within autogenerated text in a tarball. I click around random commits of random projects I'm interested in every now and then at least. At the very least it changes the attack from "yeah noone's discovering this code change" to "let's hope no random weirdo happens to click into this commit".
A binary blob by itself is harmless, you need something to copy it over from the build env to the final binary. So it's "safe" to ignore binary blobs that you're sure that the build system (which should all be human-written human-readable, and a small portion of the total code in sane projects) never touches.
That said, of course, there are still many ways this can go wrong - some projects commit autogenerated code; bootstrapping can bring in a much larger surface area of things that might copy the binary blob; and more.
> At the very least it changes the attack from "yeah noone's discovering this code change" to "let's hope no random weirdo happens to click into this commit".
There's also value in leaving a trail to make auditing easier in the event that an attack is noticed or even if there is merely suspicion that something might be wrong. More visibility into internal processes and easier UX to sort through the details can easily make the difference between discovery versus overlooking an exploit.
It's widely agreed that formal verification does not boost software productivity, in the sense that formal verification doesn't speed up development of "software that compiles and runs and we don't care if it's correct".
The point of formal verification is to ensure that the software meets certain requirements with near certainty (subject to gamma radiation, tool failure, etc.). If mistakes aren't important, formal verification is a terrible tool. If mistakes matter, then formal verification may be what you need.
What this and other articles show is that doing formal verification by hand is completely impractical. For formal verification to be practical at all, it must be supported by tools that can automate a great deal of it.
The need for automation isn't new in computing. Practically no one writes machine code directly, we write in higher-level languages, or rarely assembly languages, and use automated tools to generate the final code. It's been harder to create practical tools for formal verification, but clearly automation is a minimum requirement.
Automation of Hoare logic is quite good these days. Dafny, from MS Research (https://dafny.org), is probably the most friendly formal language of any kind. It's built around Hoare logic and its extension, separation logic. The barrier of entry is low. A seasoned imperative or functional programmer can get going with Dafny in just a few days. Dafny has been used to verify large systems, including many components of AWS.
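To give a flavor of the requires/ensures style, here is the shape of a Hoare-style contract, written as plain Python asserts rather than actual Dafny syntax (so this is only a sketch of the idea; a real verifier checks such conditions statically rather than at runtime):

def maximum(x: int, y: int) -> int:
    # requires: nothing beyond the argument types
    result = x if x >= y else y
    # ensures: the result is at least as large as both inputs and is one of them
    assert result >= x and result >= y and result in (x, y)
    return result

print(maximum(3, 7))  # 7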
I am hoping that LLMs make more advanced languages, such as Liquid Haskell or Agda, less tedious to use. Ideally, lots of code should be autocompleted once a human provides a type signature. The advantage of formal verification is that we can be sure the generated code is correct.
It is hard to understand what you mean by type signature, but I think what you mean is something like the type signature in Java, Go or C.
That isn't what people are talking about when they talk about formal verification with a type system. They are talking about much more complex type systems that have equivalent power and expressivity to formal logic.
This intro is good at introducing the topic. Unfortunately, the correspondence between advanced type systems and logic is itself fairly abstract and uses lots of concepts in non-intuitive ways. This means that you can easily wind up in a situation where you can do lots of proofs about number theory using a tool like Lean4, but you don't really see the corresponding types (this is what happened to me).
With dependent types, if you really want to. (Or even with conventional generics or typeclasses: the difference between a ring and a monoid will show up, and that's enough for many cases.)
If you ignore syntax and pretend that the following is a snippet of Java code, you can declare that a variable x always holds an int, like so:
var x: int = y + 5
Here x is the variable being defined, it is declared to hold values of type int, and its initial value is given by the term y + 5.
In many mainstream languages, types and terms live in distinct universes. One starts by asking whether types and terms are all that different. The first step in this direction of inquiry is what are called refinement types. With our imaginary syntax, you can write something like:
val x: { int | _ >= 0 } = y + 5
Once again, x is the variable being defined, it is declared to always hold a value of type int at all relevant instants in all executions, and that its initial value is given by the term y + 5. But we additionally promise that x will always hold a non-negative value, _ >= 0. For this to typecheck, the typechecker must somehow also confirm that y + 5 >= 0.
But anyway, we have added terms to the previously boring world of types. This allows you to do many things, like so:
val x: int = ...
val y: int = ...
val z: { int | _ >= x && _ >= y } = if x >= y then x else y
We not only declare that z is an integer, but also that it always holds a value at least as large as both x and y.
You asked for the type of a function that multiplies two numbers. The type would look weird, so let me show you an imaginary example of the type of a function that computes the maximum:
val f : (x : int) -> (y : int) -> { int | _ >= x && _ >= y } = ...
This doesn't really get you to the maximum, because f might be computing max(x, y) + 5, but it does show the idea.
The final step in this direction is what are called full-blown dependent types, where the line between types and terms is completely erased.
So the existence of R is evidence that Lisp's syntax is a serious problem. If a language providing syntactic sugar has a significantly increased adoption rate, that suggests that the bitter pill of your syntax is a problem :-).
Language adoption is a complicated story. Many Lispers won't agree with me anyway. But I do think the poor uptake of Lisp demands an explanation. Its capabilities are strong, so that's not it. Its lack of syntactic sugar is, to me, the obvious primary cause.
R isn't simply syntactic sugar over a Lisp-like runtime to make it acceptable; it's an implementation of an earlier language, S, using Lisp techniques under the hood.
I've written a lot of Lisp in my past. I'm used to reading Lisp. Still hate it. For example, everyone else has figured out that mathematical notation uses infix. Stock Lisp's inability to handle infix notation means most software developers will not consider using it seriously. Obviously not all will agree with me :-).