cscottnet · on Jan 18, 2022

Funny thing: I was hired by OLPC specifically to implement that book: https://blog.printf.net/articles/2011/06/18/narrative-interf...

cscottnet · on Feb 14, 2020

It appears that you were probably describing the lexer pass in your description of DokuWiki. Indeed tokenization is a very hard problem for wikitext. We use a pegjs grammar for it, but it contains lots of lookahead/special conditions/novel extensions, etc. It's hard. Wikitext is messy precisely because it was intentionally designed to be easy and forgiving to write.

Seems like we've learned many of the same lessons building our parsers. Markup parsers do seem to be a unique thing, not really like parsing either programming languages or natural languages. If we ever meet I'm sure we could happily share a beverage of your choice trading stories.

cscottnet · on Feb 14, 2020

Yes. We're not there yet, but that's the goal!

znpy · on Feb 16, 2020

awesome! keep rocking!

cscottnet · on Feb 14, 2020

There are thousands of integration tests. The "correct" output of the parser is well-known for a given input, and those test cases have been accumulating for over a decade. But the internal structure of the parser is much more fluid, and so it wasn't (historically) thought worthwhile to try to write tests against that shifting target.

cscottnet · on Feb 13, 2020

I'm on the team. Part 2 of this post series should have lots of interesting technical details for y'all; be patient, I'm still writing it.

But to whet your appetite: we used https://github.com/cscott/js2php to generate a "crappy first draft" of the PHP code for our JS source. Not going for correctness, instead trying to match code style and syntax changes so that we could more easily review git diffs from the crappy first draft to the "working" version, and concentrate attention on the important bits, not the boring syntax-change-y parts.

The original legacy Mediawiki parser used a big pile of regexps and had all sorts of corner cases caused by the particular order in which the regexps were applied, etc.

Parsoid uses a PEG tokenizer, written with pegjs (we wrote a PHP backend to pegjs for this project). There are still a bunch of regexps scattered throughout the code, because they are still very useful for text processing and a valuable feature of both JavaScript and PHP as programming languages, but they are not the primary parsing mechanism. Translating the regexps was actually one of the more difficult parts, because there are some subtle differences between JS and PHP regexps.

We made a deliberate choice to switch from JS-style loose typing to strict typing in the PHP port. Whatever you consider the long-term merits of strict types to be for maintainability, programming-in-the-large, etc, they were extremely useful for the porting project itself, since they caught a bunch of non-obvious problems where the types of things were slightly different in PHP and JS. JS used anonymous objects all over the place; we used PHP associative arrays for many of these places, but found it very worthwhile to take the time to create proper typed classes during the translation where possible; it really helped clarify the interfaces and, again, catch a lot of subtle impedance mismatches during the port.

We tried to narrow scope by not converting every loose interface or anonymous object to a type -- we actually converted as many things as possible to proper JS classes in the "pregame" before the port, but the important thing was to get the port done and complete as quickly as possible. We'll be continuing to tighten the type system -- as much for code documentation as anything else -- as we address code debt moving forward.
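To illustrate the trade-off between anonymous records and typed classes, here is a minimal sketch in Python rather than PHP or JS. The `HeadingToken` class and its fields are hypothetical and are not taken from Parsoid. It only shows the field-name checking a declared class gives you; Python dataclasses do not enforce field types at runtime, so catching type mismatches would additionally need a static checker.

```python
from dataclasses import dataclass

# Loose style: an anonymous record, analogous to a JS object literal
# or a PHP associative array. A misspelled key goes unnoticed.
loose_token = {"type": "heading", "levle": 2}  # typo: "levle"
level = loose_token.get("level")  # silently None, no error raised


@dataclass
class HeadingToken:
    """Hypothetical token type for illustration; not Parsoid's class."""
    level: int
    text: str = ""


# Typed style: the same typo fails immediately at construction time.
try:
    HeadingToken(levle=2)
    caught = False
except TypeError:
    caught = True
```

With the dict, the mistake only surfaces later as an unexpected `None`; with the class, it surfaces where the record is built.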

AMA, although I don't check hacker news frequently so I can't promise to reply.

lovasoa · on Feb 14, 2020

One question: did you investigate why the PHP version was so much faster than the JS one? Do you think the performance gains of the PHP version could be achieved in JS, or do you use any special feature of the PHP interpreter?

subbu_ss · on Feb 14, 2020

No, we haven't investigated it yet because we haven't had the time. We've filed a task so that someone can look at it ( https://phabricator.wikimedia.org/T241968 ), but our hunch is that it would be incorrect to conclude that PHP is faster than JS.

cscottnet · on Feb 17, 2020

A slightly longer answer is that we looked into a number of possible reasons and it's not clear there's an easy answer. Lots of differences between the two setups, and every time we come up with a possible answer like "oh, it's the reduced network API latency" we come up with a counter like "but html2wt is also faster and it does barely any network requests". Casual investigation raises more questions than answers. So we've looked into it but don't yet have an answer that we fully believe.

cscottnet · on Feb 13, 2020

If you want to dig through the history some: https://github.com/wikimedia/parsoid/blame/6eb00df3e090b20cc... is a pretty good example of the porting technique. You'll see quite a decent number of lines are still unchanged from the "automatic conversion from JS". https://github.com/wikimedia/parsoid/commit/6eb00df3e090b20c... shows what the initial port process was like. Still quite a bit of work, but you'll see it's almost all "real" work that needs a human to think about things, not just mechanical syntax translation. The syntax translation part was done automatically.

Then https://github.com/wikimedia/parsoid/commits/master/src/Ext/... is a not-too-atypical view of the process after the "initial working port" was done (post Aug 2019). Some nasty bugs fixed (https://github.com/wikimedia/parsoid/commit/34fcb4241aa0f3a0... a GC bug in PHP!), some more subtle bugs (PHP's crazy behavior of '$' at the end of a regexp, unless you use the 'D' flag), etc.
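For anyone who hasn't run into the '$' issue: in PCRE, which PHP uses, an unanchored-mode '$' also matches just before a trailing newline unless the 'D' (dollar-end-only) flag is set, whereas JS '$' without the 'm' flag matches only at the very end of the input. Python's `re` module happens to share the PCRE default, so the effect can be sketched there. This is an analogy in Python, not the PHP code from the port; `\Z` plays the role of the stricter end-of-string anchor.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


text = "heading\n"

# '$' matches before the final newline, so this succeeds even though
# the string does not literally end with "heading".
loose = matches(r"heading$", text)

# '\Z' matches only at the true end of the string, so this fails.
strict = matches(r"heading\Z", text)
```

A pattern that is translated character for character from JS can therefore accept inputs with a trailing newline that the JS original rejected.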

If you look through the history earlier in 2019, you'll even see JS commits like https://github.com/wikimedia/parsoid/commit/2853a90ceda7cdfa... which are to the JS code (in production at the time) preparing the way for the PHP port. In that particular case, our tooling was doing offset conversion between JS UTF-16 and PHP UTF-8 as part of the output-testing-and-comparison QA framework we'd built for the port, and it was getting hugely confused by Gallery since Gallery was using "bogus" offsets into the source text. Since fixing the offsets was rather involved (the patchset for this commit in gerrit went through 56 revisions : https://gerrit.wikimedia.org/r/505319 ) the change was first done on the JS side, thoroughly tested, and deployed to production to ensure it had no inadvertent effects, before that now-better JS code was ported to PHP. It would have been a disaster to try to make this change in the PHP version directly during the port.
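The offset conversion mentioned above is needed because JS string offsets count UTF-16 code units while PHP string offsets count UTF-8 bytes, so the same position in the source text has different numeric offsets on each side. A minimal sketch of the idea in Python follows; the function name is made up for illustration and this is not the tooling used in the port.

```python
def utf16_to_utf8_offset(text: str, utf16_offset: int) -> int:
    """Convert an offset in UTF-16 code units to an offset in UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    units = 0   # UTF-16 code units consumed so far
    nbytes = 0  # UTF-8 bytes consumed so far
    for ch in text:
        if units >= utf16_offset:
            break
        # Code points above U+FFFF take a surrogate pair in UTF-16.
        units += 2 if ord(ch) > 0xFFFF else 1
        nbytes += len(ch.encode("utf-8"))
    if units != utf16_offset:
        raise ValueError("offset is inside a surrogate pair or past the end")
    return nbytes


# "a", EURO SIGN (3 bytes in UTF-8), an emoji outside the BMP
# (2 UTF-16 units, 4 UTF-8 bytes), then "b".
sample = "a\u20ac\U0001F600b"
```

An offset that lands in the middle of a character on one side has no counterpart on the other, which is why the sketch raises an error instead of guessing.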

cscottnet · on April 1, 2016

If you use it, we will keep it up.

pavel_lishin · on April 2, 2016

Turns out, I don't know how to do that with telnet as a one-liner :/

cscottnet · on April 1, 2016

Sadly, this does seem to be a limitation of our labs infrastructure. :(

cscottnet · on April 1, 2016

Yeah, the domain setting was leaking between sessions before. Oops. Fixed now.

cscottnet · on April 1, 2016

It's actually a 269-line test case for more serious projects at the foundation: the Offline Content Generator and Parsoid. We're allowed to have fun in the service of the greater goal. More technical details at https://meta.wikimedia.org/wiki/Telnet_gateway

agumonkey · on April 1, 2016

I was meanly poking at Wales mails and hyper disrupting popup for money (you guess how much I like all this).

That said, I friggin love the telnet access point. I am a bit fed up with the ever more weighty web so simple text link + repl gets my vote.

cscottnet · on April 1, 2016

Patches welcome!