I love seeing the weirdo CDATA thingy in there! CDATA ftw! E.g., you've got this...

jameshart · on May 9, 2021

The 'weirdo' CDATA thing is the only thing that makes XHTML actually amenable to this approach, because XHTML is tokenizable using a regular expression-based grammar, whereas HTML without CDATA is not. As you're obviously aware, the language inside <style> elements is not suitable for XML parsing. Nor is the language inside <script> elements:

   <script>
     var y = "<!--";
   </script>
   <p>... a naive tokenizer thinks this is in a comment ...</p>
   <script>
     var z = "-->"
   </script>

There's a weird interaction here between the JavaScript and HTML parsers, because <, !, and -- are all valid JavaScript operators, and you can, in theory, stack them up into syntactically valid JavaScript expressions. The behavior of the following script in a browser is... unpredictable:

   <script>
      var a = 1;
      if (0<!--a) { document.write('since --a is 0, and !0 is true, and 0 < true (!), this should print'); }
   </script>
   <p>... and this should not be in a comment</p>
   <script>
     if (a-->-10) { document.write('a-- should still be > -10, so this should print, too'); }
   </script>

The regex in the article will miss the opening <p> tags here, because it assumes that it's being given valid, tokenizable XHTML.
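For illustration, here is a minimal sketch of that failure mode using the first example. The pattern `NAIVE_TOKEN` and the helper `naiveTokens` are hypothetical stand-ins, not the article's regex; they only assume a tokenizer that tries comments first, then tags, then text, with no special handling of <script>:

```javascript
// Hypothetical naive token pattern: comment | tag | text | stray "<".
// It has no notion of <script>, so "<!--" inside a string literal opens a comment.
const NAIVE_TOKEN = /<!--[\s\S]*?-->|<\/?[A-Za-z][^>]*>|[^<]+|</g;

function naiveTokens(source) {
  return source.match(NAIVE_TOKEN) || [];
}

const sample = [
  '<script>',
  '  var y = "<!--";',
  '</script>',
  '<p>... a naive tokenizer thinks this is in a comment ...</p>',
  '<script>',
  '  var z = "-->"',
  '</script>',
].join('\n');

const tokens = naiveTokens(sample);
const comment = tokens.find((t) => t.startsWith('<!--'));

console.log(tokens.includes('<p>'));   // false: no <p> start tag was emitted
console.log(comment.includes('<p>'));  // true: the paragraph sits inside the "comment" token
```

The "comment" token runs from the string literal in the first script to the string literal in the second one, swallowing everything in between.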

goto11 · on May 11, 2021

If I understand the HTML spec correctly, these two examples are not valid HTML.

The script element content model has the constraint that if the substring "<!--" occurs, then the substring "-->" must occur before the </script> end-tag. (Technically, this would not be a comment though, since it is not discarded but parsed as character data.)

If I put the first example through the (X)HTML validator (https://html5.validator.nu/) I get:

> Error: The text content of element script was not in the required format: Content contains the character sequence <!-- without a later occurrence of the character sequence -->.

I suspect this constraint on the script-element is exactly to avoid these parsing/lexing ambiguities.

A regex would still have to special-case script elements (and also style elements, I guess?) because their content can contain unescaped "<". But this can still be done using a regex, as far as I can tell, since there are still no recursive productions in the syntax.
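A rough sketch of what that special-casing could look like. The pattern `RAW_TEXT_AWARE` and the helper `rawTextAwareTokens` are made up for illustration. The sketch assumes attribute values never contain ">", and it ignores the escaped-script states of the real HTML tokenizer, so it is an approximation and not a conforming lexer:

```javascript
// Hypothetical pattern: a whole <script> or <style> element is matched as ONE
// raw-text token (start tag through the first matching end tag) before the
// comment and tag alternatives get a chance to run.
const RAW_TEXT_AWARE =
  /<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>|<!--[\s\S]*?-->|<\/?[A-Za-z][^>]*>|[^<]+|</gi;

function rawTextAwareTokens(source) {
  return source.match(RAW_TEXT_AWARE) || [];
}

const page = [
  '<script>',
  '  var y = "<!--";',
  '</script>',
  '<p>... a naive tokenizer thinks this is in a comment ...</p>',
  '<script>',
  '  var z = "-->"',
  '</script>',
].join('\n');

const pageTokens = rawTextAwareTokens(page);

console.log(pageTokens.includes('<p>'));                      // true
console.log(pageTokens.some((t) => t.startsWith('<!--')));    // false
```

Because the script alternative comes first and consumes up to the first </script>, the "<!--" in the string literal never reaches the comment rule, and no recursion is needed.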

I don't believe there is any two-way interaction between the HTML parser and the JavaScript parser. The HTML parser passes the character data content of the script element to the JavaScript parser, but the JavaScript parse does not have any effect back on the HTML parse. (After all, it is legal for a user agent to not support JavaScript, but this should obviously not affect the parsing of the HTML.)

(Thanks for sending me down this rabbit-hole. HTML is weird!)

jancsika · on May 9, 2021

Part of the problem and solution you describe was due to the battle to define who is encapsulating whom, no? The W3C SVG list archive was full of people essentially asking for the ability to flow text and replicate a lot of HTML as part of native SVG. In that dream, it's certainly important to have well-defined behavior for JavaScript inside SVG since you could have SVG user agents that aren't web browsers. And that means CDATA to hold the non-XML scripting and styling data.

However, at the end of that history HTML was the clear encapsulator and SVG exists either inside it or as a static image in Inkscape, the browser, or some library. So today, scripts inside an SVG are either a curiosity or security nightmare that comes to life when the user clicks "View Image" on an SVG image in their browser.

That leaves only the `<style>` tag content as a potential ambiguity. So I'm curious-- are there examples where content of a `<style>` tag inside an inline SVG causes unpredictable behavior in modern browser HTML parsers? I'm guessing there must be, but I'd like to play with a clear example.

goto11 · on May 10, 2021

Correct me if I'm wrong, but I believe HTML tags can still be lexed with a regular expression. The syntax of the script element is cryptic, but it does not contain any recursive productions, so it should still be possible to lex correctly with a regular expression.

jameshart · on May 10, 2021

Your lexer now needs to understand JavaScript too, though.

goto11 · on May 11, 2021

The content model of the script element is gnarly for historical reasons, but it does not depend on the syntax of the scripting language used:

https://html.spec.whatwg.org/multipage/scripting.html#restri...
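A small sketch of why the lexer does not need to know JavaScript: the element ends at the first </script> end tag, even when that text sits inside what JavaScript would consider a string literal. `SCRIPT_ELEMENT` and `scriptText` are hypothetical names, and the pattern only approximates the real end-tag rule:

```javascript
// Hypothetical pattern: capture everything between <script ...> and the FIRST
// </script> end tag, without looking at the JavaScript syntax in between.
const SCRIPT_ELEMENT = /<script\b[^>]*>([\s\S]*?)<\/script\s*>/i;

function scriptText(html) {
  const match = SCRIPT_ELEMENT.exec(html);
  return match ? match[1] : null;
}

const html = '<script>var s = "</script>"; var t = 1;</script>';

// The character data handed to the script engine stops mid-string.
console.log(JSON.stringify(scriptText(html))); // "var s = \""
```

The script that results is broken JavaScript, but the HTML-level split is the same whatever the scripting language is.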

camehere3saydis · on May 9, 2021

>My holy grail-- how do I use DOM methods to create a CDATA element to shove my style into? If I could know this then I can jump my Dodge Charger back and forth into XHTML without ever getting caught.

Does this help? https://developer.mozilla.org/en-US/docs/Web/API/Document/cr...

jancsika · on May 9, 2021

Ah, thanks!

In hindsight I probably could have guessed at "document.create" and then just read the autocomplete suggestions in devTools. :)