> For this task, the suggestion of "use a parser" is indeed sound advice.
Perhaps technically, but it's also useless advice, because no parser exists for their particular flavor of malformed XHTML. XHTML parsers parse XHTML, which you yourself have said this wasn't:
> they were trying to fix a malformed document
So in the absence of a reference to a particular malformed-XHTML-recovering parser (which may or may not work on the specific input they have, but "try this thing" is at least actionable advice), "use a parser" amounts to "write an entire parser yourself, then use it".
> don't be a jerk about the precise semantics of a question, look at what the person needs
> "use a parser" amounts to "write an entire parser yourself, then use it"
"Use a parser" is a common answer, besides being the accepted one, and with good reason: it'll work. The world is not short of HTML parsers (although, who knows, perhaps PHP may have been short of very good parsers back in 2009). Whether they use regular expressions for tokenizing is an internal detail.
Serializing XML from the resulting in-memory structure, DOM or otherwise, closes the loop. This remains a conventional and commonplace way to normalize incoming HTML-like mush into something that can be spliced or interpolated into XML and that a strict receiver will probably accept.
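To make the parse-then-serialize loop concrete, here is a rough sketch in Python using only the standard library. The names `SoupTreeBuilder`, `VOID`, and `normalize` are made up for illustration, and the recovery rules are deliberately naive: this is not the HTML5 tree-construction algorithm. It drops comments and doctypes, does not validate attribute names, and a real lenient parser would make better choices about implied end tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, incomplete list of HTML elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SoupTreeBuilder(HTMLParser):
    """Naive recovery: turn tag soup into an ElementTree under a wrapper <div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("div")
        self.stack = [self.root]  # currently open elements, innermost last

    def _attrs(self, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive with value None.
        return {name: value or "" for name, value in attrs}

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag, self._attrs(attrs))
        if tag not in VOID:
            self.stack.append(el)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the element, never open it.
        ET.SubElement(self.stack[-1], tag, self._attrs(attrs))

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # a stray end tag with no match is silently ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def normalize(fragment: str) -> str:
    """Parse leniently, then serialize as well-formed XML."""
    builder = SoupTreeBuilder()
    builder.feed(fragment)
    builder.close()  # flush any buffered text; unclosed elements stay in the tree
    return ET.tostring(builder.root, encoding="unicode")
```

Feeding it something like `normalize('<p>one<br>two<b>bold')` should give back a string that a strict XML parser accepts, with the unclosed `<b>` and `<p>` closed on serialization. Whether the recovered tree is the one the author intended is a separate question, which is why a mature lenient parser is usually the better choice when one is available.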
Perhaps technically, but it's also useless advice, because no parser exists for their particular flavor of malformed XHTML. XHTML parsers parse XHTML, which you yourself have said this wasn't:
> they were trying to fix a malformed document
So in the absence of a reference to a particular malformed-XHTML-recovering parser (which may or may not work on the specific input they have, but "try this thing" is at least actionable advice), "use a parser" amounts to "write an entire parser yourself, then use it".
> don't be a jerk about the precise semantics of a question, look at what the person needs
Pot, kettle.