> The question is about finding opening tags in XHTML using a regular expression
Bzzzt, wrong, sorry! The question is about finding open tags in the presence of XHTML self-closing tags. That difference alone places these interpretations gulfs apart. But there’s more: it does not specify that the input document is even XHTML, only that XHTML-style self-closing elements may be present. In fact the original question was barely minutes old and tagged merely “regex” when that famous answer was written in 2009; the question was not tagged with “xhtml” until 2012, and not by the original author either.
Revealingly, then, if we review the broader context (i.e. question history) of the original question author, it’s clear that yes indeed they were trying to fix a malformed document, and in particular to normalise it into XHTML, with focus on fixing up any so-called “dangling tags”. For this task, the suggestion of “use a parser” is indeed sound advice.
The real moral here is, don’t be a jerk about the precise semantics of a question, look at what the person needs, and help them ask better questions.
Otherwise, you’re just gonna discover that there’s always a bigger jerk, and they’re on Stack Overflow, moderating your stuff.
The problem is the answers saying "this is not possible".
The OP asks a perfectly reasonable question. The answers assume the OP actually meant to ask a different question, and then they ridicule the OP for this imagined question.
The question they imagine the OP asked was: "How do I parse an arbitrary HTML document into an element-tree using only a single regular expression and not using any auxiliary data structures, not even the call stack?"
Yes, this is indeed not possible, given the limitations of vanilla regular expressions.
But that was not the question asked.
Of course "use a parser" is perfectly sound advice (if a parser exists which can solve the OP's problem). But saying what the OP is attempting to do (tokenize xhtml) is impossible is absurd, since then it would also be impossible to write a parser!
They're trying to tokenize HTML into two arrays, one being an array of opening tags and one of closing tags, with the hope being to pairwise compare and reconcile the elements of these arrays.
The necessary clarification appears further down the page.
As for whether it's a reasonable question: as written, I beg to differ; it's the opposite, since it does not convey their problem except through a very careful and nuanced reading. The proper response of the S.O. community should've been to aid the OP in clarifying their intentions and amending the question accordingly, to bring focus onto their actual problem.
That is not what happened. Instead they are indeed on the receiving end of some ridicule, which is shameful, but the advice of "use a parser" would still likely have been a top answer in some fashion had the OP's scattered ancillary questions and clarifications been incorporated.
Sadly, only one correspondent of many seems to have seen fit to ask.
> For this task, the suggestion of "use a parser" is indeed sound advice.
Perhaps technically, but it's also useless advice, because a parser does not exist for their particular flavor of malformed XHTML. XHTML parsers parse XHTML, which you yourself have said the input wasn't:
> they were trying to fix a malformed document
So in the absence of a reference to a particular malformed-XHTML-recovering parser (which may or may not work on the specific input they have, but "try this thing" is at least actionable advice), "use a parser" amounts to "write an entire parser yourself, then use it".
> don't be a jerk about the precise semantics of a question, look at what the person needs
> "use a parser" amounts to "write a entire parser yourself, then use it"
"Use a parser" is a common answer, besides being the accepted one, and with good reason: it'll work. The world is not short of HTML parsers (although, who knows, perhaps PHP may have been short of very good parsers back in 2009). Whether they use regular expressions for tokenizing is an internal detail.
Serializing XML from the resulting memory structure, DOM or otherwise, closes the loop. This remains a conventional and commonplace means of normalizing incoming HTML-like mush into something that can be spliced/interpolated into XML, and that a strict receiver will probably accept.
What if we looked at the XHTML parsers trusted by the people who mindlessly dismiss the utility of regular expressions, and found they were constructed using a lexer that relied on regular expressions?
Well, since I've learned at least four assembly languages, and wrote my first toy lexer at least 25 years ago, probably in Standard ML, by this standard I seem qualified to comment.
It doesn't matter what the parser looks like under the bonnet. What matters is the utility it provides.
One might otherwise similarly offer the advice to mine your own copper and grow your own silicon, these being equally essential activities for anyone seeking the ultimate in mechanical sympathy.
Read the regex. It handles self-closing tags fine.
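To be concrete, here's the OP's pattern under test (a quick sketch in Python; the question is PHP-flavoured, but nothing here is engine-specific):

    import re

    # The pattern quoted in the question: <([a-z]+) *[^/]*?>
    pattern = re.compile(r'<([a-z]+) *[^/]*?>')

    for tag in ['<p>', '<a href="foo">', '<br />', '<hr class="foo" />']:
        print(tag, '->', bool(pattern.fullmatch(tag)))
    # <p> -> True
    # <a href="foo"> -> True
    # <br /> -> False
    # <hr class="foo" /> -> False

One caveat: `[^/]` also rejects attribute values containing a slash, so e.g. `<a href="foo/bar">` will not match. Whether that's still "fine" depends on the OP's data.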
Also you're doing the intensely annoying thing that lots of StackOverflow people do of imagining that the asker really wanted to ask a different question. It happens sometimes. But you shouldn't just jump in and assume that they don't know what they want and you're so much smarter than them so you know what they really want.
Offer additional answers if you want, but answer the question they asked first.
There's no wild assumption going on here. I just bothered to keep reading, very carefully, everything the original author actually wrote.
Then, please, further reflect that Stack Overflow is not Codewars; it is a forum for practical, focused, and relevant problem-solving advice, and at its best the moderation and answer processes help folks to iteratively revise and improve their questions. A crucial step is, therefore, analytically clarifying both the parameters and the intended outcome.
Contextualisation and focusing of requirements is a familiar and essential skill for any programmer being handed a requirements statement, and answering S.O. questions involves the same exercise, just in vignette.
The handling of this question was not, in this sense, the best. But expecting anything less would just be a load of fizzbuzz.
> There's no wild assumption going on here. I just bothered to keep reading, very carefully, everything the original author actually wrote.
That doesn't excuse it. You're still inferring stuff that wasn't asked in the actual question.
What happens when someone else comes along who really does want to use a regex? They now have a question without the correct answer and they can't even ask the question themselves because it will be closed as duplicate! Probably by people like you.
The strangest part of this whole discussion has been the remarkable number of accounts making head-first personal character attacks. And as with that comment further back, the personal invective comes coupled to some strange language, like “StackOverflow people” - what are they, even? It sure ain’t a tribe I’d identify with. Does anyone with a login qualify? Where’s all that anger even coming from?
Setting that aside, I don’t believe that reading the OP’s clarifying remarks and follow up questions is “inference”. Not that there’s anything wrong with inferring things, but it’s the opposite, it is going to the primary source, and I don’t need to excuse it. Frankly, I think people who skimp on their research, and fail to engage with the source to refine the matter, are selling the question short.
This question was undoubtedly mishandled in part because it became memorialised for a famous answer. The failure to provide the OP with adequate feedback, or to edit the question unilaterally to incorporate the OP’s essential clarifications (without which it’s a “wtf”-class question), was, and remains, a dereliction of moderator duty. A question makes much more sense, and is much more likely to be useful to your hypothetical later visitor, once focused.
What the world did not need was yet another page of half-baked tokenisation routines.
Finally, I have never closed a dupe in all my puff, and I’m thoroughly unimpressed by those that do.
StackOverflow has a real problem with attracting strict rule followers who love over-moderating. I expect Wikipedia suffers from a similar issue but it's not such an interactive site so most people aren't exposed to it.
> Where’s all that anger even coming from?
StackOverflow can be an extremely frustrating experience due to people who probably think they are helping casually closing questions.
> Finally, I have never closed a dupe in all my puff, and I’m thoroughly unimpressed by those that do.
Good! I wish there were more people like you! I'm still waiting for the day when somebody starts a friendly competitor to StackOverflow that doesn't support closing questions, gives authors actual control over what they write (can you imagine if other people could edit your comments here?) and does away with mods. One day...
StackOverflow has its share of problems, including overzealous moderation. I do my civic duty by often voting to reopen and moderating the reopen queue.
That said, removing moderation is not the solution. If you look at the unmoderated new questions, lots of questions are literally incomprehensible. A significant portion are homework or exam questions directly copy-pasted, but without enough context to answer - e.g. referring to tables, figures, or code examples which are not included in the question. Such is the sad reality of a widely known site in today's eternal September.
It's called the 'CodingHelp' subreddit. The quality is very low and you will see many similar questions, each with a screenshot and little explanation.
I need to match all of these opening tags:
<p>
<a href="foo">
But not these:
<br />
<hr class="foo" />
I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.
<([a-z]+) *[^/]*?>
I believe it says:
Find a less-than, then
Find (and capture) a-z one or more times, then
Find zero or more spaces, then
Find any character zero or more times, greedy, except /, then
Find a greater-than
Do I have that right? And more importantly, what do you think?
Tagged: html, regex, xhtml
Not sure which revision that's pasted from, but it is not the original text, which can be found at https://stackoverflow.com/revisions/1732348/1 and I strongly advise against neglecting the title; doing so is how some folks blunder into misinterpretation, since the wording of that title telegraphs a quite different underlying need from "please solve this regex puzzle".
Reading both, I see little to no difference. Is there an important change between the two revisions that I am missing? Also, my reading of the title does not, in fact, telegraph anything other than "help me with this regex puzzle."
I recommend consulting my other remarks within this topic, but on this specific point:
The title's inconsistency with the body text, i.e. "open tags" vs "opening tags", is especially and immediately notable because they are (in context) grammatically interchangeable but have dissimilar meanings. This is suggestive of (but not diagnostic of) a writer revealing context and then switching to detail. As a longtime reader of requirements documents and S.O. questions, a mental flag to check the intended meaning of both is already raised at this point.
The reference to XHTML is ambiguous, since it speaks to working around XHTML that is already present, rather than defining an input or output document, which suggests someone thinking at the character level rather than in terms of a DOM. This impression is verified by the comparison pairs in the question body, leaving the essential context question of "what are the desired inputs and outputs?" glaringly unanswered.
The early construct, "I need to ..." is a secondary flag that immediately reinforces the likelihood of a critical gap in context, since it does not explain why the need has arisen.
To anyone familiar with the structure & semantics of the HTML family, the omission of any mention of tag closing, having discussed "opening tags", boosts the sense that something relevant is missing from the question. The obviously flawed regular expression then attracts a "probable novice" qualifier, which is amplified by both the closing "what do you think?" and the original's emoticon.
At this point, we're about ten seconds into the read-through and there's already a ton of labels pointing to a beginner who may be conflating dissimilar concepts and (through inexperience) choosing the wrong tool for the task, whatever that may be. Another few seconds to review comments, at which point it appears that the author does understand the distinction between an "open tag" and an "opening tag", and this reweights the XHTML reference in the title toward being a need to output strict XHTML.
Given their apparent beginner standing it's likely this is either their first S.O. post, or one of slightly too many, and a check of their profile for proximate questions immediately reveals the latter case. References to balancing of "dangling tags" abound, collapsing all preceding problem variants and likelihoods into the only focused understanding that fits: it's a PHP novice, bright but inexperienced, who wants to normalise a nonconforming document into XHTML by balancing the tags, and the most likely use case is injecting an existing HTML fragment into an RSS XML feed.
Elapsed time: ca. 90 seconds.
Another minute to scroll through surrounding Q&A materials, to allow alternative options to appear (they don't).
A few seconds for a chuckle at that old friend of an amusing answer, and to recognise that the advice within is sound: what they really need is, indeed, a parser. Then, a pause to consider a contrasting notion: that what they're trying to do is write a parser. Consider evidence that the author has taken at least one step towards inventing stacks on their own from first principles; the prospect loses to the existing top answer because the experience level required to write parsers cannot, ultimately, be acquired from a single S.O. question.
Last but not least; the final boss: convincing the chorus of Hacker News to accept this interpretation. That takes much longer.
Consider alternatively that this is a tiny piece in the much broader puzzle of what they are trying to accomplish, that they are aware both of their own beginner status but also that, in this case, good enough will be good enough, or that they don't have the time or inclination to switch to a real parser and that's why they didn't ask about it. Or maybe even that the goal here is to specifically learn to use regex to solve a real software problem that they have. It's all a matter of interpretation of context that we don't have. As an asker of questions on SO where people provide answers to questions I specifically intended to not ask, this pattern of reading deeply into context we don't have is actually kind of annoying. Unless you provide both the answer to the actual question and then some guidance on how you really should be doing something. Those answers are very, very helpful.
These objections may seem relevant to your personal experience, but they don't pertain to the case at hand, for which ample context is available.
Dealing with responses from people who've rushed to write something without bothering to properly consider the problem is, of course, internet 101, but such answers generally take the short road to content oblivion.
For those answers of any quality, however, note this: if people are answering, for free, on their own time, questions that you don't think you asked, but they evidently do, then consider taking some personal responsibility for that outcome. Complaining about it would seem wholly entitled and ungrateful, and unconstructive besides. If they are well-meaning beginners, coach them: it's how the stone soup gets made.
I also both ask and respond to questions. Mostly respond. For free. On my own time. People who answer on SO are not a unique special breed in the sense that they are willing to help others with their problems. I have read your reading of the thread, gone through it myself with that in mind, and disagree wholeheartedly that you have meaningfully extrapolated the context you claim.
It's not explicitly stated, but I believe the author's point is that the original question didn't require a recursive solution (because it's only asking about individual tags, not matching opening tags with their closing partners).
Edit: yes, looking at the answers, someone pointed this out in a comment response to the "Chomsky" answer:
> The OP is asking to parse a very limited subset of XHTML: start tags. What makes (X)HTML a CFG is its potential to have elements between the start and end tags of other elements (as in a grammar rule A -> s A e). (X)HTML does not have this property within a start tag: a start tag cannot contain other start tags. The subset that the OP is trying to parse is not a CFG. – LarsH Mar 2 '12 at 8:43
Look again. The context is definitely one of identifying tags that should be closed, but aren’t, and identifying them thus in order to normalise a malformed document into well-formed XHTML. It is not prominently stated, but it is nevertheless the case. Consequently, balance matters. The clue is in the title: “match open tags”, for which the body text has only one part of the author’s train of thought (i.e. identifying the opening tag). Further discussion of matching the closing tag is omitted from the body of the question, and many people forget (or disregard) the nuanced difference in the title at this point in their read-through, not stopping to ponder “why the heck would they just want the opening tags? what have I missed here about the context?” and instead treating it like some particularly awful and weirdly contrived exam question, as in the article at the top.
The final confirmation of all this is in the question history of the question author around the same date: it’s definitely what they were working on.
Consequently, our zalgo-spewing correspondent has it right, they should use a parser, and in addition the question should’ve received feedback early to help them describe the context more clearly.
That all this was never properly clarified is a failure of moderation, and further a demonstration of how many folks struggle to challenge (or even identify) their own assumptions.
"Match open tags" obviously refers to writing a regular expression which matches opening tags, not to pair opening tags with end tags.
If you look at the regex which the OP themself suggests, it is clearly only intended to match opening tags (excluding self-closing tags), not search for corresponding end-tags.
Well, as now expressed at hopefully sufficient length, it doesn’t say solely that, unless one a) disregards the difference in phrasing, and then b) disengages any sense of purpose and practical utility and instead treats it like a badly worded test question.
It’s kind of a shibboleth, in a way, for developer sensitivity to actual needs, as opposed to getting hung up on how clever they are.
Oh, are they here? It’d be fascinating to hear from them.
Alternatively, perhaps, that hostile tone is suggesting I’m personally unqualified to interpret loosely framed questions? I suppose, since I’ve only been doing it for a few decades, I’m definitely a novice by any standard, and my tendency to observe and follow up on anomalous, incomplete, subtly conflicting, or otherwise inexplicable requirements by investigating both the timeline and substance of the original context, and the apparent motivations and outcome preferences of the author, is sheer beginner’s luck, and any uptick in stakeholder utility that from time to time accompanies amending recommendations following such investigative and analytical activity is sheer coincidence! So that must be it - as you can probably tell from all this smug, empty bravado, I’m really just sharing pure speculation, wild guesswork, total fluke, impertinent leaps of inferential faith, only just grasping at the vague outline of my own blind spots et cetera et cetera, and consequently yes, I’d love to hear the original intent restated from the horse’s mouth, too; but, for the meantime, I’ll read the tea leaves, systematically analyse and synthesise to the best of my ability, describe and discuss any substantive points of comprehension that I think might help enrich, or at any rate challenge, a reader’s perspective (including my own), and cross my fingers hoping to read, mark, and inwardly digest what new understanding or revealed wisdom I can - even when it comes, as it has there and here, via diacritical allegory and dialectical hellfire.
Or, finally, if you just want the TL;DR version, the reason I feel unshakeably comfortable asserting that the question author's actual purpose is normalising a nonconforming document into XHTML, by balancing open & close tags, is because they said so.
It's in an answer comment about halfway down the first page.
> "Can you provide a little more information on the problem you're trying to solve"
"Determining all the tags that are currently open, then compare that against the closed tags in a separate array"
If that's not enough, here are quotes from their other questions, posted in the minutes and hours prior:
"How to balance tags with PHP", and
"I need to write a function which closes the opened HTML tags."
Sometimes, we just have to bother reading what's in front of us.
I have to ask, is this series of comments some kind of performance art or some sort of social experiment? Or is this unironically how you write/speak/act?
> Sometimes, we just have to bother reading what's in front of us.
You are making an absolutely great example of that. I was absolutely not talking about the original SO post, but about the generally extremely entitled answers which assume the existence of a very specific X to the Y of a post.
If I understand correctly, you're suggesting "Who are you" wasn't directed at me personally, "the person" wasn't referring to the OP but all possible authors, and "the question" wasn't referring to, well, the original SO question at hand, but the class of all possible questions.
If so, then I see, I think: perhaps it was more intended as "Who is anyone to know the purpose and utility (of a question) better than the person who asks that question?"
I can't answer that, since I agree with the sentiment. I'm only really pondering this one specific question, made possible because there's so much extra context with which to interpret the original text. My personal hubris doesn't generalise to the class of all possible questions, and I'd struggle to sympathise with anyone making such an ambit claim.
> If I understand correctly, you're suggesting "Who are you" wasn't directed at me personally, "the person" wasn't referring to the OP but all possible authors, and "the question" wasn't referring to, well, the original SO question at hand, but the class of all possible questions.
Yes, exactly. But maybe it is less common to speak in such a general way in English than in my mother tongue.
It is a form in English; an example idiom might be “the man in the street”, which in general refers to “anyone in any street” and thereby simply means “ordinary people”.
The potential for confusion arises when we’re already talking about a particular person, and he’s outside on the road.
The author seems to be missing the point, in my opinion. While it is certainly true that one can often solve simple, seemingly innocent sub-problems within more general languages, the transitions from "I see I can solve this simple problem with regexes!" to "Then I can probably solve this other, almost identical problem as well!" and having the problem explode right in your face are subtle (almost imperceptible to a novice), and it would be a more robust solution to go for the right tools (i.e. an (x)html parser), as well as a good learning example. On a side note: regular expressions cannot - by definition - parse recursive languages. A regular expression matcher that does is not a regular expression parser but an ugly duckling in the family of context-free grammar matchers. People should learn when and how to use those.
But the original SO question does not imply that they want to solve a more complex problem. The SO asker explicitly asked for opinions, so that’s what they got. However, I absolutely think it is the right choice to choose simpler tools to solve simpler problems, as long as you are aware of the implications.
The article goes as far as to say that a parser is not the right tool.
> Not only can the task be solved with a regular expression - regular expressions are basically the only practical way to solve the problem. Which is why none of the clever answers actually suggest another way to solve the problem.
So no, the author is not missing the point at all.
The point is that a parser could very well use regexes under the hood to perform the tokenization. Because it is the right tool for the job. A language without regex-support might use something like lex to compile a lexer. Of course you can write a character-by-character lexer by hand, but this is just equivalent to what a regex would generate.
So saying "this is not possible, use a parser instead" is completely misunderstanding the relationship between lexing and parsing. I wonder how these people think a parser works?
I mean that bit is clearly wrong. An XML/HTML parser is a perfectly practical way to solve the problem.
However I completely agree that they didn't miss the point. A regex to do this might be fine for hacky things that you don't need to be robust (e.g. for searching for stuff, measuring stats, one-off scripts etc.).
Regular expressions can be as robust as you need them to be, just like any other kind of code. They are a DSL to create lexers, and they are exactly as robust (or hacky) as if you wrote the same lexer by hand.
Are you arguing that the effort required to make a regex robust and correct is larger than the effort required to make some hand-rolled character-by-character based lexer robust and correct?
Because that sounds counter-intuitive to me. A regex is a higher level DSL for lexing.
That's exactly what I'm arguing. Especially because it's very unlikely that you'd write an XML/HTML parser yourself instead of using somebody else's well-tested library.
Of course you should use an existing library if it solves the exact problem you have. Don't waste time re-implementing the wheel unless you are doing it for educational purposes. Whether such a library uses regexes or not under the hood is irrelevant as long as it works and is well tested.
But I would certainly like to hear an argument for why you think a regex is less robust than a similar manual character-by-character matcher.
The regex is surely faster for the specific case. I can't say I've seen an XHTML parser offhand that allows me to stop parsing after just the start tag. Perhaps a lazy parser could start to compete, but I'm just guessing.
Aren't most XML parsers SAX or StAX based? The only time I ran into a library that only offered a full DOM without the underlying event-based parser was whatever browsers consider the JavaScript standard library.
You're totally right! Many good stock parsers already stream things (more or less).
Still, I'm just making a comment about the overhead... I would hazard a guess that you're going to have a hard time beating a regex with an HTML parser for speed, assuming what you want can be done with both.
This is all irrelevant, because as the OP mentions, the SO question at hand cannot be solved with standards compliant parsers because self-closing tags will not be distinguishable.
Edit 1: I'm unsure if the inner <script2> is valid (X)HTML, so it might not be an issue of being unable to parse correct (X)HTML, but rather an issue of being unable to detect invalid (X)HTML. (Can someone verify?)
Edit 2: It seems Chrome chokes on the space... does anyone know if the initial space is valid? I'm pretty sure I've seen parsers that accept it...
> Edit 2: It seems Chrome chokes on the space... does anyone know if the initial space is valid? I'm pretty sure I've seen parsers that accept it...
Most browsers probably can deal with it, but it's not valid xml/html. Try passing it through a validator; it'll complain about foreign characters after `<` and then complain about a trailing `</script>`, as <script> was never opened in the first place.
But it should be easy, based on this example, to include correct HTML tags in the script which the regular expression will emit. Or if you want to recognise HTML tags in the script, you can easily obfuscate their construction in the script using string concatenation.
I don’t think any part of that is valid XML. There can’t be a space between < and the tag name, and I believe content containing tags should be in a CDATA section.
This conversation would be a lot clearer with a distinction between "regexes" and "regular languages". The former is what Perl/Python/etc. have, and the latter is a mathematical concept (and automata-based non-backtracking engines like RE2, re2c, and rust/regex are closer to this set-based definition).
With those definitions, this part of the snarky answer is wrong:
HTML is not a regular language and hence cannot be parsed by regular expressions
That is, regular expressions as found in the wild can parse more than regular languages. (And that does happen to be useful in the HTML case!)
This answer is also irrelevant, since the poster is asking for a solution with regexes, NOT regular languages:
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar).
In this post, the example given IS a regex, but it IS NOT a regular language:
<!-- .*? --> # comment
The nongreedy match of .*? isn't a mathematical construct; it implies a backtracking engine.
> This conversation would be a lot clearer with a distinction between "regexes" and "regular languages".
Very much so.
> In this post, the example given IS a regex, but it IS NOT a regular language: `<!-- .*? --> # comment` The nongreedy match of .*? isn't a mathematical construct; it implies a backtracking engine.
Actually, that's [edit: "it IS NOT a regular language"] wrong, at least in principle. If you're limiting it to only the shortest match (which is how HTML (and most other) comments actually work), then (just like `(abc)+` is shorthand for `(abc)(abc)*`) `<!-- .*? -->` is shorthand for (assuming I haven't made a mistake in the for-lack-of-a-better-word-arithmetic):
<!-- ([^ ]| [^-]| -[^-]| --[^>])* -->
That is, shortest-repetition-only can be implemented in a purely regular system.
On the other hand, if you want to allow longer matches when actually needed, then for purely-regular purposes (where it either matches or not) `<!-- .*? -->` is just a weird way of writing `<!-- .* -->` (which is quite obviously a regular language).
Yeah I think you are right -- the nongreedy match can be simply written as a more awkward pattern. I think regular language negation and intersection also help (which the rarer derivatives-based implementations seem to have). They are still regular and equivalent in power, but there's a richer set of operators for writing patterns.
I would also divide it into the "recognition problem" and the "parsing/capturing problem".
Recognition is just yes/no -- does it match? And notably that excludes common regex APIs that either return the end position (Python's re.match) or search (re.search). It is more like math -- is it in the set or not?
For the recognition problem, there's no difference between greedy and non-greedy, in terms of the result. (It does matter in terms of performance, which shows up in practice in backtracking engines!)
But parsing / capturing is more relevant to programmers. I don't remember all the details but there is some discussion on the interaction between greediness and capturing here: https://swtch.com/~rsc/regexp/regexp2.html
It looks like it can all be done in linear time and RE2 supports it fine.
So in that case maybe the 2 questions are closer than I thought. I have a very similar set of regexes that I use for parsing (my own) HTML, but I use one regex for each kind of token, and I do it with the help of an explicit stack.
I'm using Python's engine so I think it is better to separate each case, for performance reasons. But maybe with RE2 it would be fine to combine them all into one big expression as is done here, for "parallel" matching. The expression is a little weird because it's solving a specific problem and not something more general (which is useful and natural).
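Roughly the shape of it, as an illustrative sketch (not the parent's actual code; the token set is simplified):

    import re

    # One pattern per token kind, tried at the current position, plus an
    # explicit stack for nesting. Names and token set are illustrative only.
    TOKENS = [
        ('comment',      re.compile(r'<!--.*?-->', re.S)),
        ('end_tag',      re.compile(r'</(\w+)\s*>')),
        ('self_closing', re.compile(r'<(\w+)[^>]*?/>')),
        ('start_tag',    re.compile(r'<(\w+)[^>]*>')),
        ('text',         re.compile(r'[^<]+')),
    ]

    def scan(html):
        pos, stack = 0, []
        while pos < len(html):
            for kind, pat in TOKENS:
                m = pat.match(html, pos)
                if m:
                    if kind == 'start_tag':
                        stack.append(m.group(1))
                    elif kind == 'end_tag' and stack and stack[-1] == m.group(1):
                        stack.pop()  # a real implementation must recover, not just skip
                    yield kind, m.group(0)
                    pos = m.end()
                    break
            else:
                pos += 1  # stray character: skip it
        yield 'dangling', stack  # whatever never got closed

    print(list(scan('<p>one<br /><em>two</p>')))

The explicit stack is exactly the piece that pure regular expressions lack, which is why this combination gets you nesting without pretending the pattern language does it alone.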
> common regex APIs that either return the end position (Python's re.match) or search (re.search).
Yeah, I probably should have explicitly said that's what the first translation (`[^a]|a[^b]|ab[^c]|...`) was for; it's an optimization (possibly-backtracking-parser -> [ND]FA) I've used a couple of times to beat things into guaranteed O(N) time.
> But parsing / capturing is more relevant to programmers.
I'd debate "more", since that's a additional thing on top of matching and searching. Any case where you need the former, you also need the latter to even know what to capture. But it's definitely something you do frequently need.
Does your regex assume that "-->" must be prefixed by a space? This is not the case in XML. (Also, the string "--" must not occur inside a comment, so the last clause is not necessary.)
> Does your regex assume that "-->" must be prefixed by a space?
Yep, because the quoted regex assumed the same thing, and I didn't see a point in editorializing more than necessary.
> Also the string "--" must not occur inside a comment, so the last clause is not necessary.
"Must not" seems unreliable in webpage parsing. What page does your XHTML parser produce when fed text of the form `<!-- --is this a comment? <a href=-->`, for example?
Handling invalid syntax is another can of worms! It would make all parts of the tokenizer much more complex, not just comments.
In the case of XHTML, a parser is supposed to reject any document which is not well formed. HTML parsers typically try to "gracefully recover" from all syntax errors, but this is a crazy complex algorithm.
> This conversation would be a lot clearer with a distinction between "regexes" and "regular languages".
This is often an important distinction, but the point of the article is that the Stackoverflow question does not require recursion or any other non-standard regex features, and therefore can be solved using a vanilla regular expression.
So in this particular question the distinction is not important.
I love seeing the weirdo CDATA thingy in there! CDATA ftw!
E.g., you've got this enormous spec for SVG which includes CSS, but that CSS has syntax inside a style tag which could break XHTML parsers.
Amateurs out there are probably thinking, "Well, why not just compromise in the spec and tell implementers to do the same thing that HTML does to parse style tags?" Well, professionals know that cannot work for myriad reasons you can read about if you take out a college loan and remain sedentary for the required duration.
The right approach is to throw the CSS stuff inside CDATA tags to tell the parser not to parse it so things don't break. That is the way sensible, educated professionals solve this problem.
I'm only kidding!
For inline SVGs the HTML5 parser simply says, "Parse this gunk as HTML5, and use sane defaults to interpret the parsed junk in the correct svg namespace so that all the child thingies in that namespace just work."
Which it does.
Unless you're going to grab the innerHTML of the inline SVG and shove it into a file to be used later as an SVG image.
In that case you cross the invisible county line into XHTML territory where the sheriff is waiting to throw you in jail for violating the CDATA rule. In that case the XHTML parser hidden in the guts of the browser doles out the justice of an error in place of your image. Because that is the way sensible, educated professionals solve this problem. :)
My holy grail-- how do I use DOM methods to create a CDATA element to shove my style into? If I could know this then I can jump my Dodge Charger back and forth into XHTML without ever getting caught.
The 'weirdo' CDATA thing is the only thing that makes XHTML actually amenable to this approach, because XHTML is tokenizable using a regular expression-based grammar, whereas HTML without CDATA is not. As you're obviously aware, the language inside <style> elements is not suitable for XML parsing. Nor is the language inside <script> elements:
<script>
var y = "<!--";
</script>
<p>... a naive tokenizer thinks this is in a comment ...</p>
<script>
var z = "-->"
</script>
There's a weird interaction here between the Javascript and HTML parsers, because <, !, and -- are all valid Javascript operators, and you can, in theory, stack them up into syntactically valid JavaScript expressions. The behavior of the following script in a browser is... unpredictable:
<script>
var a = 1;
if (0<!--a) { document.write('since --a is 0, and !0 is true, and 0 < true (!), this should print'); }
</script>
<p>... and this should not be in a comment</p>
<script>
if (a-->-10) { document.write('a-- should still be > -10, so this should print, too'); }
</script>
The regex in the article will miss the opening <p> tags here, because it assumes that it's being given valid, tokenizable XHTML.
If I understand the HTML spec correctly, these two examples are not valid HTML.
The script element content model has the constraint that if the substring "<!--" occurs inside the content, a corresponding "-->" must occur before the </script> end-tag. (Technically, this would not be a comment though, since it is not discarded but parsed as character data.)
> Error: The text content of element script was not in the required format: Content contains the character sequence <!-- without a later occurrence of the character sequence -->.
I suspect this constraint on the script-element is exactly to avoid these parsing/lexing ambiguities.
A regex would still have to special-case script elements (and also style elements, I guess?) because their content can contain unescaped "<". But this can still be done using a regex, as far as I can tell, since there are still no recursive productions in the syntax.
I don't believe there is any two-way interaction between the HTML parser and the JavaScript parser. The HTML parser passes the character data content of the script element to the JavaScript parser, but the JavaScript parse does not have any effect back on the HTML parse. (After all, it is legal for a user agent to not support JavaScript, but this should obviously not affect the parsing of the HTML.)
(Thanks for sending me down this rabbit-hole. HTML is weird!)
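To make the script special-case concrete, the lexer rule is roughly this (a sketch; the real HTML5 rules are more involved, per the content-model constraint quoted above):

    import re

    # Script content may contain raw '<', so treat everything up to the
    # first '</script' as character data instead of tokenizing it normally.
    SCRIPT = re.compile(r'<script\b[^>]*>(.*?)</script\s*>', re.S | re.I)

    m = SCRIPT.search('<script>var x = "<p>not a tag</p>";</script>')
    print(m.group(1))  # var x = "<p>not a tag</p>";

Still no recursion required; just an extra alternative in the tokenizer.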
Part of the problem and solution you describe was due to the battle to define who is encapsulating whom, no? The W3C SVG list archive was full of people essentially asking for the ability to flow text and replicate a lot of HTML as part of native SVG. In that dream, it's certainly important to have well-defined behavior for javascript inside SVG, since you could have SVG user agents that aren't web browsers. And that means CDATA to hold the non-XML scripting and styling data.
However, at the end of that history HTML was the clear encapsulator and SVG exists either inside it or as a static image in Inkscape, the browser, or some library. So today, scripts inside an SVG are either a curiosity or security nightmare that comes to life when the user clicks "View Image" on an SVG image in their browser.
That leaves only the `<style>` tag content as a potential ambiguity. So I'm curious-- are there examples where content of a `<style>` tag inside an inline SVG causes unpredictable behavior in modern browser HTML parsers? I'm guessing there must be, but I'd like to play with a clear example.
Correct me if I'm wrong, but I believe HTML tags can still be lexed with a regular expression. The syntax of the script element is cryptic, but it does not contain any recursive productions, so it should still be possible to lex correctly with a regular expression.
>My holy grail-- how do I use DOM methods to create a CDATA element to shove my style into? If I could know this then I can jump my Dodge Charger back and forth into XHTML without ever getting caught.
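For what it's worth, the DOM does have exactly this: createCDATASection, which is available on XML documents (browsers throw a NotSupportedError if you call it on a text/html document). A sketch using Python's minidom, which implements the same DOM method:

    from xml.dom.minidom import getDOMImplementation

    # Build an XML (not HTML) document, then create a CDATA node for the style.
    doc = getDOMImplementation().createDocument(None, 'svg', None)
    style = doc.createElement('style')
    style.appendChild(doc.createCDATASection('text { fill: red; }'))
    doc.documentElement.appendChild(style)

    print(doc.documentElement.toxml())
    # <svg><style><![CDATA[text { fill: red; }]]></style></svg>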
The only part I agree with in this writing is that you don't need to be snarky to be correct. (I'd like to introduce the XY problem of the second kind, where the answerer is so confident that it is the answerer who has missed the actual question.)
Some regexes can recognize a language beyond the regular languages. They are typically available in two flavors: recursive references (Perl, Ruby, PCRE) and stackable captures (.NET). They are obscure enough that I would not recommend them, but it is patently false that regular expressions (EDIT: of practical interest) cannot be recursive.
It is possible to match individual HTML tags with regexes, but it is difficult. It cannot use a bare `\w` or `\s` because both XML/XHTML and HTML5 parsers have peculiar definitions for tag name characters and space characters. For example your `\s` will typically match various Unicode space characters, while only ASCII whitespaces are recognized in tags. There are also several notable exceptions to the parser (and external states termed the "tree construction"), so missing any of them would result in an immediate XSS. If you think you can write a correct regex for HTML tags, my quizzes [1] should make you concerned. Limiting the question to XHTML does alleviate some but not all concerns.
The distinction between recognition and parsing is correct, but parsing doesn't necessarily mean the reconstruction of parse tree. Parsing means the access to constituent nonterminals, which can be used to reconstruct parse tree but also directly used as their own (e.g. calculators). Indeed in most regex implementations you can't extract two or more strings out of each capture (Raku is a notable exception), so you can match against e.g. `(\w+)(?:,(\w+))*` but can't extract a list of comma-separated words with it. Practically speaking this means you can't extract a list of attributes with a single regex, making it unsuitable for HTML parsing anyway.
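Concretely, in Python's re:

    import re

    m = re.fullmatch(r'(\w+)(?:,(\w+))*', 'a,b,c')
    print(m.groups())                    # ('a', 'c') -- each group keeps only its last repetition
    print(re.findall(r'\w+', 'a,b,c'))   # ['a', 'b', 'c'] -- extraction needs a second pass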
> it is patently false that regular expressions cannot be recursive.
No, it's more like the term "regular expression" has gotten hijacked and nowadays gets abused to colloquially include, shall we say, irregular expressions. i.e. people basically say "regex" when they mean "some succinct pattern language with syntax similarities to (classical) regular expressions".
Regular expressions in the formal language theory do not have captures anyway. The name collision is unfortunate, but we have already established that regexes in practice means a pattern language largely modelled after theoretical regular expressions and not the theoretical regular expressions themselves. At the very least the writing could have mentioned this discrepancy.
Captures are a red herring here. They don't fundamentally alter the nature of what a regex does, which is to recognize regular languages. Pointing to them as if they're some kind of justification is like calling pilots "drivers" because drivers originally drove wagons, and wagons didn't have rubber tires like cars and planes anyway. It's completely missing that the point of the distinction between a plane and a wagon has always been the land vs. air travel, not modern features like tires or the infotainment systems or what have you.
But yes I guess it'd have been better for the writing to mention the discrepancy in any case.
> They don't fundamentally alter the nature of what a regex does, which is to recognize regular languages.
It's a bit subjective, but captures are harder than recognition. Russ Cox once noted [1] that the extraction has to be run as a separate step after the recognition, and a fast DFA can't always be used for that, suggesting they are related but different problems.
Well, if you allow an arbitrary depth of capture-group nesting, then that may be so, but it seems beside the point here. It is not clear to me that this article makes any point about extraction that is relevant to this discussion.
>At the very least the writing could have mentioned this discrepancy.
The original meaning of 'regular expression' is very specific, and has some significant implications which are lost with the now-common and less well-defined usage. Therefore, if anything needed having this discrepancy mentioned, it was your original statement "it is patently false that regular expressions cannot be recursive", as this is an issue where the distinction is crucial. It is good to see that you have now done so, though the way you have done so suggests there is nothing of practical interest in the formal definition, which, I suggest, would be patently false.
> I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar).
Note that regarding formal language and complexity theory, while it is correct that in general, arbitrary nested structures require a context free grammar (type 2 in the Chomsky hierarchy) and are thus beyond regular (type 3) [1], this statement is NOT true _if_ you limit the nesting depth with a finite constant k.
For example, if you agree to an HTML tag maximum nesting depth of, say, 100, then it can be modeled with a regular (type 3) grammar, including correct required matching of opening and closing tags, and hence you can write a regular expression that matches it as well.
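To see why bounded depth stays regular, note that you can mechanically unroll the recursion into a plain (if unwieldy) pattern; a sketch for a single element name:

    import re

    def balanced(tag, depth):
        # Regex for text whose <tag>...</tag> nesting never exceeds `depth`.
        if depth == 0:
            return r'[^<]*'
        inner = balanced(tag, depth - 1)
        return rf'[^<]*(?:<{tag}>{inner}</{tag}>[^<]*)*'

    two_deep = re.compile(balanced('b', 2))
    print(bool(two_deep.fullmatch('x<b>y<b>z</b></b>w')))      # True
    print(bool(two_deep.fullmatch('<b><b><b>z</b></b></b>')))  # False: depth 3

The pattern's size grows with the depth bound, which is the formal tell that unbounded nesting would need more than a regular grammar.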
This debate is well-documented in the theoretical linguistics literature, where some say human languages are not regular because you can always embed yet another additional relative clause in any sentence in principle without adversely affecting grammaticality, whereas others say while you could you won't find natural examples in human-written text documents where extreme nesting depth is actually found. At that point psycholinguists and theoretical linguists usually start a fight about whether memory limits are important or "just performance as opposed to competence".
The pedantic discussions can be fun and educational, but the regex based hack gets the job done in a few minutes, while the pedants are still wrestling with parsers and libraries.
... and then there's the anticipated joy of seeing the pedants' complicated, theoretically correct solution explode because the input wasn't what they assumed, in the first place.
The pedants that have that experience either become enlightened, or vociferously strident about the importance of proper, theoretically correct solutions in place of quick hacks.
The issue is that sometimes you should use a robust parser and do it properly, and sometimes a hacky regex is fine. But people forget that when arguing about which you should use.
Does the proposed regular expression really handle embedded script content correctly? From my limited understanding of HTML, pretty much only </script> counts as closing the script contents and everything else is treated as part of the script.
Does it matter? You can create a regex with lookahead for </script>. The point is that it's possible to solve the problem from SO this way, due to the nature of the problem, not that this particular expression is perfectly correct.
It seems the article does defend "it's possible" for strictly regular expressions. To allow lookahead will be context-sensitive, so not in that spirit.
The headline said that the author was specifically looking to avoid 'xhtml self-contained tags', but that doesn't mean they could assume the document they were looking in was valid XHTML - just that it might contain XHTML-style 'self-closing' tags.
I've seen plenty of plain HTML files that aren't well-formed XML yet contain <br/> tags.
Yeah, the trailing slash in <br /> is legal in HTML, but it doesn't actually make the tag self-closing. For example, <b /> is still an opening tag which requires a </b>.
It goes to show how few people are able to think for themselves.
The question originally asked is "how to match HTML tags". Not how to parse. Not how to scrape. Simply "match". To which I would say: regex is perfectly suited to the task.
Furthermore - if one simply needs to scrape content, regex is, again, perfectly suited. Scraping is not parsing, and has no real need for a full-blown DOM parsing library.
Cargo cult parrots like to say - if the HTML content changes, then one's regex will fail. Well, so will one's DOM parser.
The problem is: you cannot parse malformed (real, everyday) html with regexes.
But if you need to parse html (even malformed) generated by the same template (like a scraping situation), the whole file becomes regular, and can be parsed by a regular expression.
But if you try to parse html in general, too bad, because then you'll need to take all of HTML into consideration and will need a recursive descent parser, not a regex.
This question popped up so many times in forums in the 2000s that people got mad about it.
I haven't read the rest of the thread, but the article is even wronger than the glib replies there. HTML needn't be well formed. There are ad hoc rules which let major browsers parse broken HTML. If you do not follow the spec to the letter, you will have your tooling break on input that every browser thinks is acceptable.
Which is why you always use whatever html parsing library comes with your language. There is no simple answer in the thread because there is no simple answer in the real world.
That said, anyone who says:
>It is quite possible and not even that difficult:
(
# match all tags in XHTML but capture only opening tags
<!-- .*? --> # comment
| <!\[CDATA\[ .*? \]\]> # CData section
| <!DOCTYPE ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* >
| <\? .*? \?> # xml declaration or processing instruction
| < \w+ ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* /> # self-closing tag
| < (?<tag> \w+ ) ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > # opening tag - captured
| </ \w+ \s* > # end tag
)
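For what it's worth, that pattern is written for an engine with free-spacing mode and `(?<tag>...)` named groups; in Python's `(?P<tag>...)` spelling it compiles and behaves as advertised on simple input (a sketch, not a claim of completeness):

    import re

    XHTML_TOKEN = re.compile(r'''
        <!-- .*? -->                                     # comment
      | <!\[CDATA\[ .*? \]\]>                            # CDATA section
      | <!DOCTYPE ( " [^"]* " | ' [^']* ' | [^>/'"] )* >
      | <\? .*? \?>                                      # XML declaration / processing instruction
      | < \w+ ( " [^"]* " | ' [^']* ' | [^>/'"] )* />    # self-closing tag
      | < (?P<tag> \w+ ) ( " [^"]* " | ' [^']* ' | [^>/'"] )* >  # opening tag - captured
      | </ \w+ \s* >                                     # end tag
    ''', re.VERBOSE | re.DOTALL)

    html = '<p><br /><a href="/x">y</a></p>'
    print([m.group('tag') for m in XHTML_TOKEN.finditer(html) if m.group('tag')])
    # ['p', 'a'] -- opening tags only; <br /> and the end tags match but capture nothing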
But both the original question and that article were about XHTML. The non-well-formed mess only matters for HTML, not XHTML.
The regex is a valid answer to the stack overflow question.
The original question only mentions the author wants to ignore XHTML-style self-closing tags which is in no way implying the input to be well-formed XHTML.
You make this more complex than it actually is. HTML is basically SGML with a DTD declaring rules for tag omission/inference, empty elements, and attribute shortforms. This is what the majority of XML heads seem to struggle with, but it is nevertheless every bit as formal as XML, XML being just a proper subset of SGML, though one too small for parsing HTML.
Then HTML has a number of historical quirks including:
- the style and script elements have special rules for comments that would make older pre-CSS, pre-Javascript browsers just see markup comments and ignore those
- the URL syntax using & ampersand characters needs to be treated specially, because & starts an entity reference in SGML's default concrete syntax
- the HTML 5 spec added a procedural parsing algorithm (in addition to specifying a grammar-based spec) that basically parses every byte sequence as HTML via fallback rules; for most intents and purposes, the language recognized by these rules, taken to the extreme, is not what's commonly understood as HTML
- WHATWG have added a number of element content rules on top of the HTML 4.01/HTML 5.0 baseline with ill-defined parsing rules (such as the content models for tables and description lists); the reason is precisely that, once Ian Hickson had distilled the HTML 4.01 DTD grammar rules into the HTML 5 presentation as prose, a formal basis was no longer used for vocabulary extension
Maybe people who think a basic regex such as this is difficult, need some mentoring.
It doesn't even use lookahead/lookbehind or other more complicated features. It only uses non-greedy matching 2 times, which is the only thing that's more complicated than the basic AND/OR logic expressed by the rest of the expression.
The problem is that it doesn't reflect how browsers parse things, which matters if you were using this in a security context. E.g., here's an example it won't detect (granted, this is not technically valid, but does it matter?):
<div "> put arbitrary html here as you please (using single quotes for attributes)<div ">
Oh I've seen this many times in different forms. Especially with regexes.
You know what this is a great example of? A case where hacking makes a mess, and thinking before coding solves the problem.
The madness comes from using the wrong tool for the problem. Yes, you can hack a regex to parse XHTML, and this might be "good enough", but it is more robust, cleaner, and easier to explain if you use a lexical tokenizer and a grammar.
The lure is an illusion that comes from an initial effort assessment, where the effort to hack a quick-and-dirty regex (call this Ehack) vs the "oh, man, you mean I gotta think about the problem" effort (call this Ethink) appears as "Ehack <<< Ethink". However, it soon evolves into the scenario where "Ehack >>> Ethink", driven by the thought process "I'm almost there, this regex just needs one more tweak." Aka the gambler's fallacy: it comes into play and the sunk costs are ignored.
TL;DR - Use the right tool for the problem, even if it means a slightly larger up-front effort investment.
Yeah, and you never actually end up solving the problem. You just end up solving every edge case that comes up :D
It's the same mentality that can lead to fixing symptoms without making any real progress on the underlying issues. Sometimes that's good enough, I guess.
Based on the stackoverflow thread, and then the comments here, an interesting research paper topic would be "why do people get so passionate about regex".
Regexes are computer programs just like any other computer program. A long program, terse syntax, embedding one programming language in another programming language, etc. are risk factors for unmaintainable code, but are manageable in the same way that you manage the risk inherent in other complex computer programs.
I did always like elisp's "rx"[1] library for removing some of the terseness. Not enough to ever use it, of course, but it's an interesting idea.
IMO the biggest problem with regexes as seen in the wild is a lack of composability. If you need some kind of pattern like "[setA][setA+setB]{0,n}" then you'll copy-paste the definition of setA in both places. If you need to re-use that entire regex you'll copy-paste it again and construct a monstrous string with a really well-defined structure that isn't even slightly apparent without a reverse engineering session.
Up to a point you can solve that by just giving names to relevant sub-expressions, using a regex builder, etc, but in my experience if I'm going to write even a moderately complicated regex I'll probably be better served with something like parsec (a python implementation here [0]) in whichever language I'm currently using.
That isn't to say that regexes don't have their place -- anything quick and dirty, situations where you need to handle unsanitized input (mind you, the builtin regex engine is probably vulnerable to exponential runtime attacks) and don't want to execute a turing-complete language, etc.... I just think it has bad ergonomics for any parser you might use more than once, and I haven't yet regretted using parsec in situations where a complex regex would have sufficed.
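The copy-paste problem does have a low-tech mitigation, though: build patterns up from named string fragments. A sketch (SET_A/SET_B stand in for whatever character classes you actually need):

    import re

    SET_A = r'[a-z]'      # hypothetical "setA"
    SET_AB = r'[a-z0-9]'  # hypothetical "setA+setB"

    # "[setA][setA+setB]{0,n}" with n = 31, without pasting the classes twice:
    IDENT = rf'{SET_A}{SET_AB}{{0,31}}'
    PAIR = re.compile(rf'{IDENT}-{IDENT}')  # and the fragment itself composes

    print(bool(PAIR.fullmatch('foo42-bar7')))  # True

It doesn't fix the readability of the final compiled string, but it kills the duplication.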
Perl is great for this. It’s been a long time since I’ve written any, but with the right flags and use of qr// you can write extremely readable perlre.
Long regexes are not the root of all evil, but they're certainly a tendril of bad practice.
I would say that the OP's usage, a home-made concoction to find all opening tags in an xhtml doc, crosses the line into bad practice. Get a room and use a parser, people!
Who said regular expressions are bad practice? Regexes are great but they can get abused easily.
Looking more carefully at the SO question, I am inclined to ask "why?" at least a couple of times because I suspect the answer to the deeper problem the SO OP needs to solve can be worked out with a DOM parser. If not, then definitely a SAX parser could solve that specific problem and it would be more robust than handcrafting a fussy regex.
As far as I can tell, a SAX parser does not expose the distinction between an opening tag and a self-closing tag. The tag `<foo />` would just emit a startElement event followed by an endElement, exactly the same as `<foo></foo>`.
Which means it can't solve the specific problem the OP describes.
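Easy to confirm with a stock SAX implementation (Python's shown here):

    import xml.sax

    class Events(xml.sax.ContentHandler):
        def startElement(self, name, attrs):
            print('start', name)
        def endElement(self, name):
            print('end', name)

    # Both inputs produce the identical event stream (start foo, end foo),
    # so the self-closing distinction is invisible at the SAX level.
    xml.sax.parseString(b'<foo/>', Events())
    xml.sax.parseString(b'<foo></foo>', Events())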
I would rather use a robust, battle-tested parser and handle that edge case separately than create a regex for all of XHTML, copied off an opinionated blog, as a "hail mary".
> Parsing typically uses (at least) two steps: Tokenization, which uses regular expressions to split the input string into a sequence of syntax elements
I don't use regex for tokenization. Am I doing something wrong?
But overall I think this is an important post, even though I believe that regex is the best example of "good idea, shitty API".