
I love seeing the weirdo CDATA thingy in there! CDATA ftw!

E.g., you've got this enormous spec for SVG which includes CSS, but that CSS has syntax inside a style tag which could break XHTML parsers.

Amateurs out there are probably thinking, "Well, why not just compromise in the spec and tell implementers to do the same thing that HTML does to parse style tags?" Well, professionals know that cannot work for myriad reasons you can read about if you take out a college loan and remain sedentary for the required duration.

The right approach is to throw the CSS stuff inside CDATA tags to tell the parser not to parse it so things don't break. That is the way sensible, educated professionals solve this problem.

I'm only kidding!

For inline SVGs the HTML5 parser simply says, "Parse this gunk as HTML5, and use sane defaults to interpret the parsed junk in the correct svg namespace so that all the child thingies in that namespace just work."

Which it does.

Unless you're going to grab the innerHTML of the inline SVG and shove it into a file to be used later as an SVG image.

In that case you cross the invisible county line into XHTML territory where the sheriff is waiting to throw you in jail for violating the CDATA rule. In that case the XHTML parser hidden in the guts of the browser doles out the justice of an error in place of your image. Because that is the way sensible, educated professionals solve this problem. :)

My holy grail-- how do I use DOM methods to create a CDATA element to shove my style into? If I could know this then I can jump my Dodge Charger back and forth into XHTML without ever getting caught.
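For what it's worth, here is a minimal string-level sketch of the "throw the CSS inside CDATA" step. It does not use DOM methods, so it sidesteps the question above rather than answering it. The function name `wrapInCdata` and the sample CSS are made up for illustration. The one subtlety is that a literal `]]>` in the content would end the section early, so it gets split across two adjacent sections.

```javascript
// Hypothetical helper: wrap arbitrary text in an XML CDATA section.
function wrapInCdata(text) {
  // "]]>" would terminate the section early, so split it across two sections:
  // the first ends with "]]", the next begins with ">".
  return '<![CDATA[' + text.split(']]>').join(']]]]><![CDATA[>') + ']]>';
}

// Sample CSS containing characters an XML parser would otherwise choke on.
const css = 'a > b { color: red; } /* 1 < 2 && 3 */';
const styleMarkup = '<style>' + wrapInCdata(css) + '</style>';
console.log(styleMarkup);
```

The result is only text to paste into an SVG file. Whether it fits a given export pipeline is an assumption to check.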



The 'weirdo' CDATA thing is the only thing that makes XHTML actually amenable to this approach, because XHTML is tokenizable using a regular expression-based grammar, whereas HTML without CDATA is not. As you're obviously aware, the language inside <style> elements is not suitable for XML parsing. Nor is the language inside <script> elements:

   <script>
     var y = "<!--";
   </script>
   <p>... a naive tokenizer thinks this is in a comment ...</p>
   <script>
     var z = "-->"
   </script>
There's a weird interaction here between the JavaScript and HTML parsers, because <, !, and -- are all valid JavaScript operators, and you can, in theory, stack them up into syntactically valid JavaScript expressions. The behavior of the following script in a browser is... unpredictable:

   <script>
      var a = 1;
      if (0<!--a) { document.write('since --a is 0, and !0 is true, and 0 < true (!), this should print'); }
   </script>
   <p>... and this should not be in a comment</p>
   <script>
     if (a-->-10) { document.write('a-- should still be > -10, so this should print, too'); }
   </script>
The regex in the article will miss the opening <p> tags here, because it assumes that it's being given valid, tokenizable XHTML.
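As a sanity check on the operator stacking, here is a small sketch of the same expressions with whitespace between the tokens, so nothing in them looks like an HTML comment delimiter. The variable names `first` and `second` are made up. Note that browsers also treat `<!--` in classic scripts as the start of a single-line comment (ECMAScript Annex B), which is one reason the unspaced form can behave differently from what the operators alone suggest.

```javascript
// Same operators as the example above, separated by whitespace.
let a = 1;
const first = 0 < !--a;    // --a gives 0, !0 is true, and 0 < true compares 0 < 1
const second = a-- > -10;  // compares 0 > -10, then decrements a to -1
console.log(first, second, a);
```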


If I understand the HTML spec correctly, these two examples are not valid HTML.

The script element content model has the constraint that if the substring "<!--" occurs inside the content, a corresponding "-->" must occur before the </script> end-tag. (Technically, this would not be a comment though, since it is not discarded but parsed as character data.)

If I put the first example through the (X)HTML validator (https://html5.validator.nu/) I get:

> Error: The text content of element script was not in the required format: Content contains the character sequence <!-- without a later occurrence of the character sequence -->.

I suspect this constraint on the script-element is exactly to avoid these parsing/lexing ambiguities.
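A rough sketch of the check that validator message describes, assuming the simplified reading "every `<!--` needs a later `-->`". The real content restriction in the spec has more cases than this, and the function name `hasUnclosedCommentOpen` is made up.

```javascript
// Simplified check: is there a "<!--" with no later "-->" in the script text?
// Only the last "<!--" matters, since any "-->" after it is also after
// every earlier "<!--".
function hasUnclosedCommentOpen(scriptText) {
  const lastOpen = scriptText.lastIndexOf('<!--');
  if (lastOpen === -1) return false;
  // Assumption: the closing "-->" may not overlap the opening "<!--".
  return scriptText.indexOf('-->', lastOpen + 4) === -1;
}

console.log(hasUnclosedCommentOpen('\n  var y = "<!--";\n')); // first script above
console.log(hasUnclosedCommentOpen('\n  var z = "-->"\n'));   // second script above
```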

A regex would still have to special-case script elements (and also style-elements I guess?) because content can contain unescaped "<". But this can still be done using a regex, as far as I can tell, since there are still no recursive productions in the syntax.
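To make the special-casing concrete, here is a minimal sketch of a regex lexer that swallows a whole script element as one token before trying comments and ordinary tags. The pattern `TOKEN` and the function `lexHtml` are made up, and this is not the regex from the article. It deliberately ignores the escaped and double-escaped script states, style elements, and attribute values containing ">", so treat it as an illustration only.

```javascript
// Hypothetical token pattern; alternatives are tried left to right, so a
// whole <script>...</script> element wins over comments and ordinary tags.
const TOKEN = /<script\b[^>]*>[\s\S]*?<\/script\s*>|<!--[\s\S]*?-->|<\/?[a-z][^>]*>|[^<]+|</gi;

function lexHtml(source) {
  return source.match(TOKEN) ?? [];
}

// The first example from the parent comment.
const sample = [
  '<script>',
  '  var y = "<!--";',
  '</script>',
  '<p>... a naive tokenizer thinks this is in a comment ...</p>',
  '<script>',
  '  var z = "-->"',
  '</script>',
].join('\n');

const tokens = lexHtml(sample);
// Show only the <p> tags to confirm they were not swallowed by a "comment".
console.log(tokens.filter((t) => /^<\/?p\b/i.test(t)));
```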

I don't believe there is any two-way interaction between the HTML parser and the JavaScript parser. The HTML parser passes the character data content of the script element to the JavaScript parser, but the JavaScript parse does not have any effect back on the HTML parse. (After all, it is legal for a user agent to not support JavaScript, but this should obviously not affect the parsing of the HTML.)

(Thanks for sending me down this rabbit-hole. HTML is weird!)


Part of the problem and solution you describe was due to the battle to define who is encapsulating whom, no? The W3C SVG list archive was full of people essentially asking for the ability to flow text and replicate a lot of HTML as part of native SVG. In that dream, it's certainly important to have well-defined behavior for JavaScript inside SVG since you could have SVG user agents that aren't web browsers. And that means CDATA to hold the non-XML scripting and styling data.

However, at the end of that history HTML was the clear encapsulator and SVG exists either inside it or as a static image in Inkscape, the browser, or some library. So today, scripts inside an SVG are either a curiosity or security nightmare that comes to life when the user clicks "View Image" on an SVG image in their browser.

That leaves only the `<style>` tag content as a potential ambiguity. So I'm curious-- are there examples where content of a `<style>` tag inside an inline SVG causes unpredictable behavior in modern browser HTML parsers? I'm guessing there must be, but I'd like to play with a clear example.


Correct me if I'm wrong, but I believe HTML tags can still be lexed with a regular expression. The syntax of the script element is cryptic, but it does not contain any recursive productions, so it should still be possible to lex correctly with a regular expression.


Your lexer now needs to understand JavaScript too, though.


The content model of the script element is gnarly for historical reasons, but it does not depend on the syntax of the scripting language used:

https://html.spec.whatwg.org/multipage/scripting.html#restri...


>My holy grail-- how do I use DOM methods to create a CDATA element to shove my style into? If I could know this then I can jump my Dodge Charger back and forth into XHTML without ever getting caught.

Does this help? https://developer.mozilla.org/en-US/docs/Web/API/Document/cr...


Ah, thanks!

In hindsight I probably could have guessed at "document.create" and then just read the autocomplete suggestions in devTools. :)



