It appears that you were describing the lexer pass in your description of DokuWiki. Indeed, tokenization is a very hard problem for wikitext. We use a pegjs grammar for it, but it contains lots of lookahead/special conditions/novel extensions, etc. It's hard. Wikitext is messy precisely because it was intentionally designed to be easy and forgiving to write.
Seems like we've learned many of the same lessons building our parsers. Markup parsers do seem to be a unique thing, not really like parsing either programming languages or natural languages. If we ever meet, I'm sure we could happily share a beverage of your choice while trading stories.
There are thousands of integration tests. The "correct" output of the parser is well-known for a given input, and those test cases have been accumulating for over a decade. But the internal structure of the parser is much more fluid, and so it wasn't (historically) thought worthwhile to try to write tests against that shifting target.
I'm on the team. Part 2 of this post series should have lots of interesting technical details for y'all; be patient, I'm still writing it.
But to whet your appetite: we used https://github.com/cscott/js2php to generate a "crappy first draft" of the PHP code for our JS source. Not going for correctness, instead trying to match code style and syntax changes so that we could more easily review git diffs from the crappy first draft to the "working" version, and concentrate attention on the important bits, not the boring syntax-change-y parts.
The original legacy MediaWiki parser used a big pile of regexps and had all sorts of corner cases caused by the particular order in which the regexps were applied, etc.
Parsoid uses a PEG tokenizer, written with pegjs (we wrote a PHP backend to pegjs for this project). There are still a bunch of regexps scattered throughout the code, because they are still very useful for text processing and a valuable feature of both JavaScript and PHP as programming languages, but they are not the primary parsing mechanism. Translating the regexps was actually one of the more difficult parts, because there are some subtle differences between JS and PHP regexps.
We made a deliberate choice to switch from JS-style loose typing to strict typing in the PHP port. Whatever you consider the long-term merits of strict types to be for maintainability, programming-in-the-large, etc., they were extremely useful for the porting project itself, since they caught a bunch of non-obvious problems where the types of things were slightly different in PHP and JS. JS used anonymous objects all over the place; we used PHP associative arrays in many of these places, but found it very worthwhile to take the time to create proper typed classes during the translation where possible. It really helped clarify the interfaces and, again, catch a lot of subtle impedance mismatches during the port.
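To illustrate the general idea (this is not Parsoid code), here is a minimal Python sketch of replacing an anonymous-object-style record with a typed class. The `SourceRange` class, its fields, and the `from_loose` helper are hypothetical names invented for this example; the point is only that a typed constructor rejects a value that a loose record would silently carry along.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRange:
    """Hypothetical typed record; not an actual Parsoid class."""

    start: int
    end: int

    def __post_init__(self) -> None:
        # Reject values that a loose dict would silently accept,
        # such as a numeric string where an integer is expected.
        for name in ("start", "end"):
            value = getattr(self, name)
            if not isinstance(value, int) or isinstance(value, bool):
                raise TypeError(
                    f"{name} must be an int, got {type(value).__name__}"
                )
        if self.start > self.end:
            raise ValueError("start must not exceed end")

    def length(self) -> int:
        return self.end - self.start


def from_loose(record: dict) -> SourceRange:
    """Convert an anonymous-object-style dict into the typed class."""
    return SourceRange(start=record["start"], end=record["end"])
```

With this in place, `from_loose({"start": 2, "end": 5}).length()` works as expected, while `from_loose({"start": "2", "end": 5})` fails immediately at the boundary instead of somewhere downstream.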
We tried to narrow scope by not converting every loose interface or anonymous object to a type -- we actually converted as many things as possible to proper JS classes in the "pregame" before the port, but the important thing was to get the port done and complete as quickly as possible. We'll be continuing to tighten the type system -- as much for code documentation as anything else -- as we address code debt moving forward.
AMA, although I don't check Hacker News frequently so I can't promise to reply.
One question: did you investigate why the PHP version was so much faster than the JS one? Do you think the performance gains of the PHP version could be achieved in JS, or do you use any special feature of the PHP interpreter?
No, we haven't investigated it yet because we haven't had the time. We've filed a task in case someone wants to look at it (https://phabricator.wikimedia.org/T241968), but our hunch is that it would be incorrect to conclude that PHP is faster than JS.
A slightly longer answer is that we looked into a number of possible reasons and it's not clear there's an easy answer. There are lots of differences between the two setups, and every time we come up with a possible answer like "oh, it's the reduced network API latency" we come up with a counter like "but html2wt is also faster and it does barely any network requests". Casual investigation raises more questions than answers. So we've looked into it but don't yet have an answer that we fully believe.
If you want to dig through the history some:
https://github.com/wikimedia/parsoid/blame/6eb00df3e090b20cc...
Is a pretty good example of the porting technique. You'll see quite a decent number of lines are still unchanged from the "automatic conversion from JS".
https://github.com/wikimedia/parsoid/commit/6eb00df3e090b20c...
shows what the initial port process was like. Still quite a bit of work, but you'll see it's almost all "real" work that needs a human to think about things, not just mechanical syntax translation. The syntax translation part was done automatically.
If you look through the history earlier in 2019, you'll even see JS commits like https://github.com/wikimedia/parsoid/commit/2853a90ceda7cdfa... which are to the JS code (in production at the time) preparing the way for the PHP port. In that particular case, our tooling was doing offset conversion between JS UTF-16 and PHP UTF-8 as part of the output-testing-and-comparison QA framework we'd built for the port, and it was getting hugely confused by Gallery since Gallery was using "bogus" offsets into the source text. Since fixing the offsets was rather involved (the patchset for this commit in gerrit went through 56 revisions : https://gerrit.wikimedia.org/r/505319 ) the change was first done on the JS side, thoroughly tested, and deployed to production to ensure it had no inadvertent effects, before that now-better JS code was ported to PHP. It would have been a disaster to try to make this change in the PHP version directly during the port.
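For readers unfamiliar with the offset problem: JS strings are indexed in UTF-16 code units, while PHP strings are indexed in bytes (UTF-8 here), so the same position in a text has different numeric offsets on each side. The following Python sketch shows one straightforward way to map between them. It is an illustration only, not the tooling described above, and `utf16_to_utf8_offset` is a hypothetical name.

```python
def utf16_to_utf8_offset(text: str, utf16_offset: int) -> int:
    """Map an offset in UTF-16 code units to an offset in UTF-8 bytes.

    Raises ValueError if the offset is negative, lies past the end of
    the text, or falls in the middle of a surrogate pair.
    """
    if utf16_offset < 0:
        raise ValueError("offset must be non-negative")
    units = 0   # UTF-16 code units consumed so far
    nbytes = 0  # UTF-8 bytes consumed so far
    for ch in text:
        if units == utf16_offset:
            return nbytes
        if units > utf16_offset:
            break
        # Characters outside the Basic Multilingual Plane take two
        # UTF-16 code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
        nbytes += len(ch.encode("utf-8"))
    if units == utf16_offset:
        return nbytes
    raise ValueError("offset is past the end or splits a surrogate pair")
```

For example, in `"héllo"` the UTF-16 offset 2 (just after the `é`) corresponds to UTF-8 byte offset 3, because `é` is one UTF-16 code unit but two UTF-8 bytes. An offset that does not land on a character boundary has no meaningful counterpart on the other side, which is why the sketch raises instead of guessing.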
It's actually a 269-line test case for more serious projects at the foundation: the Offline Content Generator and Parsoid. We're allowed to have fun in the service of the greater goal.
More technical details at https://meta.wikimedia.org/wiki/Telnet_gateway