https://issues.chromium.org/issues/451401343 tracks work needed in the upstream xml-rs repository, so it seems like the team is working on addressing issues that would affect standards compliance.
Disclaimer: I work on Chrome and have occasionally dabbled in libxml2/libxslt in the past, but I'm not directly involved in any of the current work.
I hope they will also work on speeding it up a bit. I needed to go through 25-30 MB SAML metadata dumps, and an xml-rs pull parser took 3x as long as the equivalent in Python (which uses libxml2 internally, I think). I rewrote it all with quick-xml and got a 7-8x speedup over Python, i.e., at least 20x over xml-rs.
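For anyone curious, the quick-xml pull loop is roughly this shape (a minimal sketch; the element name and the counting logic are invented for illustration, not lifted from the real job):

    use quick_xml::events::Event;
    use quick_xml::Reader;

    fn main() {
        // Stand-in for a 25-30 MB metadata dump; real code would stream from a file.
        let xml = r#"<EntityDescriptor entityID="https://idp.example.org"/>"#;
        let mut reader = Reader::from_str(xml);
        let mut entities = 0u64;

        // Pull events one at a time instead of materializing a tree.
        loop {
            match reader.read_event() {
                Ok(Event::Start(e)) | Ok(Event::Empty(e)) => {
                    if e.name().as_ref() == b"EntityDescriptor" {
                        entities += 1;
                    }
                }
                Ok(Event::Eof) => break,
                Err(e) => panic!("parse error at byte {}: {e}", reader.buffer_position()),
                _ => {} // text, comments, PIs, ...
            }
        }
        println!("{entities} EntityDescriptor elements");
    }

Most of the win comes from never allocating a DOM and borrowing event contents straight out of the input buffer.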
Python ElementTree uses Expat; only lxml uses libxml2. Right now, I'm working on SIMD acceleration in my not-yet-released, GPL-licensed fork of libxml2. If you have lots of character data or large attribute values, as in SVG, you will see tremendous speed improvements (gigabytes per second). Unfortunately, this is unlikely to make it into web browsers.
Wait. They are going along with an XML parser that supports DOCTYPEs? I get that XSLT is ancient and full of exploits, but so is DOCTYPE. It's literally the poster boy for the billion laughs attack (among other vectors).
You don't need DOCTYPE for that; you can put an ENTITY declaration straight in your source file (the "internal subset"), and the XML spec says it needs to be processed. (I seem to recall someone saying that Adobe tools are fond of putting those in their exported SVG files.)
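For reference, the classic billion-laughs payload is exactly such an internal subset; ten levels of nesting expand to about a billion copies of "lol" (abridged here):

    <?xml version="1.0"?>
    <!DOCTYPE lolz [
      <!ENTITY lol "lol">
      <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
      <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
      <!-- ... lol4 through lol8 follow the same pattern ... -->
      <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
    ]>
    <lolz>&lol9;</lolz>

No external DTD involved: every declaration sits between the [ and ] of the document's own prolog.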
The billion laughs bug was fixed in libxml2 in 2008. (As far as I understand, in .NET this bug was fixed in 2014 with .NET 4.5.2. In 2019, a bug similar to billion laughs was found in a Go YAML parser, even though the YAML spec explicitly mentions and forbids it. Among other products, it affected Kubernetes.)
Other vectors probably mean a single vector: external entities, where a) you process untrusted XML on a server and b) allow the processor to read external entities. This is not a bug, but early versions of XML processors may lack an option to disallow access to external entities. This has also been fixed.
XSLT has no exploits at all, that is, no features that can be misused.
> Other vectors probably mean a single vector: external entities,
XXE injection (which comes in several flavors), remote DTD retrieval, and quadratic blowup (a sort of twin to the billion laughs attack).
You aren't wrong, though. They all live in the <!DOCTYPE> declaration. Hence my puzzlement.
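For concreteness, the classic XXE payload looks like this (the file path is just the textbook example):

    <?xml version="1.0"?>
    <!DOCTYPE foo [
      <!ENTITY xxe SYSTEM "file:///etc/passwd">
    ]>
    <foo>&xxe;</foo>

A processor configured to resolve external entities splices the file contents into the document; swap the file:// URL for http:// and it doubles as an SSRF gadget.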
Why process it at all? If this is as security focused as Google claims, fill the DOCTYPE with molten tungsten and throw it into the Mariana Trench. The external entities definition makes XSLT look well designed in comparison.
Disclaimer: I work on Chrome/Blink and I've also contributed a (very small) number of patches to libxml/libxslt.
It's not just a matter of replacing libxslt; libxslt integrates quite closely with libxml2. There's a fair amount of glue to bolt libxml2/libxslt onto Blink (and WebKit); I can't speak for Gecko.
Even when there's no work on new XML/XSLT features, there's a passive cost to just having that glue code around since it adds quirks and special cases that otherwise wouldn't exist.
Disclaimer: I work on Chrome and I have contributed a (very) small number of fixes to libxml2/libxslt for some of the recent security bugs.
Speaking from personal experience, working on libxslt... not easy for many reasons beyond the complexity of XSLT itself. For instance:
- libxslt is linked against by all sorts of random apps and changes to libxslt (and libxml2) must not break ABI compatibility. This often constrains the shape of possible patches, and makes it that much harder to write systemic fixes.
- libxslt reaches into libxml and reuses fields in creative ways, e.g. libxml2's `xmlDoc` has a `compression` field that is ostensibly for storing the zlib compression level [1], but libxslt has co-opted it for a completely different purpose [2].
- There's a lot of missing institutional knowledge and no clear place to go for answers, e.g. what does a compile-time flag that guards "refactored parts of libxslt" [3] do exactly?
Sounds like libxslt needs more than just a small number of fixes, and it sounds like Google could be paying someone, like you, to help provide the necessary guidance and feedback to increase the usability and capabilities of the library and evolve it for the better.
Instead, Google and others just use it and expect any issues that come up to be immediately fixed by the one or two open-source maintainers who happen to work on it in their spare time. The power imbalance can't be lost on you here...
If you wanted to dive into what [3] does, you could do so; you could then document it, refactor it so that it's more obvious, or remove the compile-time flag entirely. There is institutional knowledge everywhere...
Or, the downstream users who use it and benefit directly from it could step up, but websites and their users are extremely good at expecting things to just magically keep working, especially if they don't pay for it. It was free, so it should be free forever, and someone set it up many moons ago, so it should keep working for many more, magically!
// of course we know that, as end-users became the product, Big Tech [sic?] started making sure that users remain dumb.
Browser vendors aren't maintaining the web for free; they are for-profit corporations that have chosen to take on that role for the benefits it provides them. It's only fair that we demand they also respect the responsibilities that come with it. And we can point out the hollowness of complaints about the hardship of maintaining the web's legacy when they keep making things harder for independent browser developers by adding tons of new complexity.
Sure, of course, but unless funding is coming from users the economics won't change, because:
The vendors cite an aspect of said responsibility (security!) to get rid of another aspect (costly maintenance of a low-revenue feature).
The web is evolving, there's a ton of things that developers (and website product people, and end-users) want. Of course it comes with a lot of "frivolous" innovation, but that's part of finding the right abstractions/APIs.
(And just to make it clear, I think it's terrible for the web and vendors that ~100% of the funding comes from a shady oligopoly that makes money by selling users - but IMHO this doesn't invalidate the aforementioned resource-allocation trade-off.)
> libxslt is linked against by all sorts of random apps and changes to libxslt (and libxml2) must not break ABI compatibility. This often constrains the shape of possible patches, and makes it that much harder to write systemic fixes.
I’m having trouble expressing this in a way that won’t likely sound harsher than I really want, but, uh, yes? That’s the fundamental difference between maintaining a part of the commons that anybody can benefit from and a subdirectory in a monorepo. The bazaar incurs coordination costs, and not being able to go and fix all the callers is one of them.
(As best as I can see, Chrome’s approach is largely to make everything a part of the monorepo, so maintaining a part of the commons may not be high on the list of priorities.)
This is not to defend any particular ABI choice. Too often ABI is left to luck and essentially just happens instead of being deliberately designed, and too often in those cases we get unlucky. (I'm tempted to recite an old quote[1] about file formats, which are only a bit more sticky than public ABI, because of how well it communicates the amount of seriousness the subject ought to evoke: "Do you, Programmer, take this Object to be part of the persistent state of your application, to have and to hold, through maintenance and iterations, for past and future versions, as long as the application shall live?")
I’m not even deliberately singling out what seems to me like the weakest of the examples in your list. It’s just that ABI, to me, is such a fundamental part of lib-anything that raising it as an objection against fixing libxslt or libxml2 specifically feels utterly bizarre.
It's one thing if the library was proactively written with ABI compatibility in mind. It's another thing entirely if the library happens to expose all its implementation details in the headers, making it that much harder to change things.
When i first encountered the early GNOME 1 software back in the very late 1990s, and DV (the libxml author) was active, i was very surprised when i asked for the public API for a library and was told to look at the header files and the source.
They simply didn’t seem to have a concept of data hiding and encapsulation, or worse, felt it led to evil nasty proprietary hidden code and were better than that.
They were all really nice people, mind you—i met quite a few of them, still know some—and the GNOME project has grown up a lot, but i think that’s where libxml was coming from. Daniel didn’t really expect it to be quite so widely used, though, i’m sure.
I’ve actually considered stepping up to maintain libxslt, but i don’t know enough about building on Windows and don’t have access to non-Linux systems really. Remote access will only go so far on Windows i think, although it’d be OK on Mac.
It might be better to move to one of the Rust XML stacks that are under active development (one more active than the other).
At least some of the implementation complexity is already there under the hood. WebKit/Blink have an optimization to use 8-bit characters for strings that consist only of latin1 characters.
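The shape of that optimization, roughly (a sketch of the idea only, not Blink's actual WTF::String code; the type and method names here are invented): if every code point fits in Latin-1, store one byte per character, otherwise fall back to UTF-16 code units.

    enum CompactString {
        Latin1(Vec<u8>),   // every code point < 256: one byte per character
        Utf16(Vec<u16>),   // anything else: UTF-16 code units
    }

    impl CompactString {
        fn new(s: &str) -> Self {
            if s.chars().all(|c| (c as u32) < 256) {
                CompactString::Latin1(s.chars().map(|c| c as u8).collect())
            } else {
                CompactString::Utf16(s.encode_utf16().collect())
            }
        }
    }

Since most markup and attribute data is ASCII, the 8-bit path halves memory traffic in the common case.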
A large part of the problem is the legacy burden of libxml2 and libxslt. A lot of the implementation details are exposed in headers, and that makes it hard to write improvements/fixes that don't break ABI compatibility.
As someone who had the misfortune of working on clipboard support in Chrome, I thought "wow, there's no way we do that in places other than Linux".
... turns out we do and I helped review that patch. Doh!
For how widely used the clipboard is, the actual implementation (both in the OS and in the browser) is surprisingly unloved and unmaintained.
FWIW, Chrome intentionally doesn't plumb through the original image bytes. I wasn't around when it was initially implemented, but even for many years afterwards, there were no platform conventions for passing around non-bitmap images on the Windows clipboard. And another (probably unintentional) benefit was "the encoded image bytes are from an untrustworthy source and could trigger bugs in buggy image decoders", while bitmaps are (relatively) safe in comparison.
Of course, this is a rather arbitrary line, because it's easy to get the original image bytes out of the sandboxed renderer, e.g. by dragging out the image or by saving the image.
At this point, someone could probably try plumbing through the original bytes or even implementing delayed rendering... but it's quite expensive in terms of time, especially to test all the random things that might break. :(
Oilpan isn't without issues though: finalization causes quite a few headaches, implementation details like concurrent marking make it hard to support things like std::variant, and the interface between Oilpan and non-Oilpan types often introduces opportunities for unsafety.
Indeed. It's tradeoffs, but they've been sufficient for much of the codebase for a very long time. Taking no major action (Oilpan or memory safe language) for nearly a decade was also a tradeoff. I don't think the long list of security issues there was worth it.
Hmm. The blink dirs seem to have a very large number of relevant security issues too, despite oilpan. Which is what we'd expect, honestly; oilpan does not solve all UAF problems, UAFs are not all (or even the majority of) security problems, etc. The combo of "sufficient for much of the codebase" and "the long list of security issues" paints a picture of most of the codebase being secure due to oilpan, while UI code is riddled with holes due to its lack. The reality is dramatically more nuanced (as you know, but readers of your comment might not).
As a views maintainer, I'm familiar with some of the security bugs on the UI side. Clickjacking-type issues are more common there than UAFs. UAFs are an issue, but the problems are not so much due to problematic use of unsafe C++ types and idioms as to problematic API designs -- resulting in, for example, cases where it's not clear whether an object is expected to be able to respond safely to calls at all, whether or not it's technically alive. Oilpan, MiraclePtr, and proper use of smart pointers would all help bandaid the UAFs, but they are in many cases difficult to apply correctly without understanding (and often fixing) the underlying systemic design problems. Which is happening, but slowly.
There are also more dimensions of tradeoffs involved, but this is long-winded enough as it is. The tl;dr is that at this point I would consider a couple of other options better uses of effort for tackling this specific problem than converting browser types to oilpan.
Possibly another way of expressing your point: after writing my above comment, I found myself wondering how much a system that kept these UAF pointers alive longer (to eliminate the UAFs, e.g. a GC) would actually reduce the attack surface -- the UAF-ish bugs are still bugs, and code poking at a GC-preserved object that the rest of the code doesn't really expect to still be alive might itself be pretty fraught.
> the UAF-ish bugs are still bugs, and code poking at a GC-preserved object that the rest of the code doesn't really expect to still be alive might itself be pretty fraught
For the LayoutObject hierarchy - the team doing that conversion added a NOT_DESTROYED() macro for this reason. It's gross, but it was the least worst option.
As an aside - the performance of oilpan is broadly net positive now if you avoid some of the pitfalls (the largest being that a write into a Member<> requires a write barrier). E.g. things become trivially destructible, there's no ref incrementing/decrementing, etc.
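For readers unfamiliar with the term: a write barrier is a small hook that runs on every pointer store so a concurrent marker can't miss a newly created edge. Very roughly (a generic sketch, not Oilpan's actual implementation; all names invented):

    use std::ptr::NonNull;

    struct Member<T> {
        ptr: Option<NonNull<T>>,
    }

    impl<T> Member<T> {
        fn set(&mut self, new_target: Option<NonNull<T>>) {
            self.ptr = new_target;
            // The barrier: report the new edge to the collector so an
            // in-progress concurrent mark doesn't miss it. This per-store
            // cost is the pitfall mentioned above.
            if let Some(p) = new_target {
                write_barrier(p.as_ptr() as *const ());
            }
        }
    }

    fn write_barrier(_target: *const ()) {
        // A real collector would mark `_target` or record it in a
        // remembered set while concurrent marking is active; stub here.
    }

That extra hook on every Member<> assignment is the main thing you pay; what you get back is no refcount churn and trivially destructible objects.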