Translations of Russ Cox's Thompson NFA C Program to Rust

mleonhard · on Nov 4, 2023

If you're interested in Rust regex implementations, you may be interested in the one that I wrote in a macro: https://crates.io/crates/safe-regex . At build time, the macro generates the matcher as Rust code. The compiler then compiles and optimizes the matcher code as part of your Rust binary.

I wanted to see if it was possible to make a fast Rust regex library without using any unsafe code. It turns out that it is. Russ Cox's paper was very helpful.

porridgeraisin · on Nov 3, 2023

Ripgrep also uses Thompson nfas

keepamovin · on Nov 3, 2023

A fun experience 12 years ago was implementing Thompson NFA from the original 1960s paper into C# to improve a NLP lab's regex processing speed. Speed up was crazy, like 1200 times versus the custom bespoke matching method they had been using until then.

I had this awakening of raw primal power, like: I can take this ancient thing, in this obscure long forgotten paper, and use to cast new power and speed.

I realized not all good things are shiny and new! Some can be dusty and old.

Experts may dispute my characterization of the Thompson NFA paper as obscure and forgotten but this was in 2011 and how it seemed to me. You must remember that to an expert, something otherwise obscure will not be! Also, seemed that way when I learned that not every regex implementation used this clearly superior method.

Also, I realized and appreciated what an absolute genius Thompson was. Reading his paper was like clarity: you could feel his clarity and power. It's masterful. To appreciate his clear presentation and absolute command of the concepts that were contained in this dusty ancient document (well I mean the PDF looked that way, ha!), was just awesome. It was cool! :)

microtonal · on Nov 3, 2023

I had this awakening of raw primal power, like: I can take this ancient thing, in this obscure long forgotten paper, and use to cast new power and speed.

100%. I worked for a company a long time ago who were in one application matching tokens against word lists with a certain edit distance was a bottleneck. I replaced it by a solution that uses dictionary automata (directed acyclic finite state automata) for sets of words and Levenshtein automata as matching automata and some code to compute the intersection languages on-the-fly, and it was (IIRC) orders of magnitudes faster. Most of this was based on papers that were pretty obscure in the NLP field, even back then (the knowledge about finite state automata seems to be lost among many recent graduates, since the focus on machine learning became so much stronger).

During my MA internship, I also made a deterministic part-of-speech tagger based on an idea of Roche and Schabes [1] to convert a transformation-based tagger (Brill-style) into a deterministic finite state transducer. The company that I interned with was worried that an HMM tagger would be too slow. The deterministic tagger was very fast, but I convinced them that HMM taggers can be fast as well (and more accurate), so the internship resulted in two taggers.

[1] https://aclanthology.org/J95-2004.pdf

jvanderbot · on Nov 3, 2023

I had a similar experience with even simpler tools: Classic memoization, caching, and some basic process + instruction parallelism produced a stack that did the job in seconds compared to 10+ minutes.

It's amazing how much "production" code can be optimized if you take time to find hot loops. I really wish it was some black magic only I could do (would help with salary negotiations), but it's mundane diligence and saying "no" to anything else for a month-two.

uticus · on Nov 3, 2023

That's pretty cool, C# performance chasing can be fun or maddening. Thompson NFA was obviously a good choice based on the speedup. But after reading this article I'm wondering if it was the best choice for all-out performance:

> The rust-regex program is quite a bit faster, but primarily because it uses a different technique for this particular regex (a lazy DFA).

[edit] Point being, different techniques work better for different use cases. To deal with this aspect of performance chasing I wonder, since multi-core is more prevalent now (than in 1968), if it would be the ultimate in performance to fire off a few different techniques simultaneously on different cores and let the first one take it. That would bypass the need for analysis before choosing a technique.

masklinn · on Nov 3, 2023

> A fun experience 12 years ago was implementing Thompson NFA from the original 1960s paper into C# to improve a NLP lab's regex processing speed. Speed up was crazy, like 1200 times versus the custom bespoke matching method they had been using until then.

I don’t quite understand, were they not using regexes, or were they using regexes which triggered classic NFA misbehaviours in .net’s regex engine which thompson’s was safe from (and they didn’t use advanced features)?

keepamovin · on Nov 3, 2023

Oh, I understand what you're saying. No they weren't using native .net regexes.

As an aside, I'm not sure the specifics of the .net regex engine at the time, but you seem to be suggesting that it used an NFA. I suspect it might not have but I don't know. Certainly a few languages didn't use NFAs, yet others did indeed! :)

To your point: it was many years ago so perhaps I will get some details wrong in what follows. I apologize in advance for the fuzziness! The most clear thing in my memory was the beauty of Thompson's work and how much using it sped things up.

Anyway, their system had a custom pattern matching algorithm specific to a couple NLP domains and conceptual models they were working in, that was like regex but extended. It was an academic invention, sophisticated and ambitious, and IIRC, they hadn't fully implemented all their ideas. Turned out their implementation just wasn't particularly efficient.

IIRC, it was a recursive implementation, but it's so many years I can't recall really, especially as I was mainly focused on making it faster. This necessitated replacing much of their code, but luckily the implementation and ideas were clean enough that the replacement was almost as if a clean interface had been planned all along, IIRC.

Turns out it could be sped up mightily by replacing a lot of their implementation with Thompson NFA and then adding their custom weird stuff on top of that. I tried to push the NFA as far as it could to utilize its own concepts to implement their ideas, but some things simply couldn't fit, so had to be bolted on. Still, it was far more efficient.

That's what it was.

As another aside: I think it was Russ Cox's article online about Thompson NFA that I found and which led me down that path. IIRC, he has this super simple HTML page with maybe a powder blue background telling all about it. His article was also a great help in clearly understanding the concept. It was fun to go back and forth between article, paper and code to get it to work!

vidarh · on Nov 3, 2023

No powder blue background so maybe it's a different page or has been redesigned, but Cox has a page linking to several articles on regexps here [1].

[1] https://swtch.com/~rsc/regexp/

keepamovin · on Nov 3, 2023

That's the page! Man it was 12 years, could be my memory :)

https://swtch.com/~rsc/regexp/regexp1.html

It might have had a more "retro geocities" type design in the past, tho haha

burntsushi · on Nov 3, 2023

This is exactly what was linked and discussed in my OP. ;-)

keepamovin · on Nov 3, 2023

Yeah, very good overview, no?

matheusmoreira · on Nov 3, 2023

It's remarkable how much our ancestors achieved. I can only wonder what other treasures are hidden away in the literature. Does anyone have a collection?

kzrdude · on Nov 3, 2023

That reminds me of the Apollo guidance computer talk (available on youtube https://www.youtube.com/watch?v=B1J2RMorJXM ). "Everyone" might have seen this one, but if one hasn't, it's a gem.

matheusmoreira · on Nov 3, 2023

I hadn't!

dekhn · on Nov 3, 2023

you know there is ken thompson fanfic, right? a lot of stories about eldritch mainframe horrors being cast into the doom by long-forgotten papers?

nextaccountic · on Nov 3, 2023

Ripgrep implements many regex engines and select the best performant for a given regex

Here's an overview of them

https://docs.rs/regex-automata/latest/regex_automata/#availa...

zackoverflow · on Nov 3, 2023

Does anyone have any insight on why the PikeVM implementation is slower (it's based on the VM approach described here https://swtch.com/~rsc/regexp/regexp2.html)

I feel like compiling regex to bytecode and executing it should be faster than dealing with big graph like structures with NFAs/DFAs

Also, V8's regex engines are very fast because they get JIT compiled to machine code, so wouldn't bytecode just be one step higher in abstraction, and have worse but similar performance?

burntsushi · on Nov 3, 2023

Before getting to your actual question, it might help to look at a regex benchmark that compares engines (perhaps JITs are not the fastest in all cases!): https://github.com/BurntSushi/rebar

In particular, the `regex-lite` engine is strictly just the PikeVM without any frills. No prefilters or literal optimizations. No other engines. Just the PikeVM.

As to your question, the PikeVM is, essentially, an NFA simulation. The PikeVM just refers to the layering of capture state on top of the NFA simulation. But you can peel back the capture state and you're still left with a slow NFA simulation. I mention this because you seem to compare the PikeVM with "big graph structures with NFAs/DFAs." But the PikeVM is using a big NFA graph structure.

At a very high level, the time complexity of a Thompson NFA simulation and a DFA hints strongly at the answer to your question: searching with a Thompson NFA has worst case O(m*n) time while a DFA has worst case O(n) time, where m is proportional to the size of the regex and n is proportional to the size of the haystack. That is, for each character of the haystack, the Thompson NFA is potentially doing up to `m` amount of work. And indeed, in practice, it really does need to do some work for each character.

A Thompson NFA simulation needs to keep track of every state it is simultaneously in at any given point. And in order to compute the transition function, you need to compute it for every state you're in. The epsilon transitions that are added as part of the Thompson NFA construction (and are, crucially, what make building a Thompson NFA so fast) exacerbate this. So what happens is that you wind up chasing epsilon transitions over and over for each character.

A DFA pre-computes these epsilon closures during powerset construction. Of course, that takes worst case O(2^m) time, which is why real DFAs aren't really used in general purpose engines. Instead, lazy DFAs are used.

As for things like V8, they are backtrackers. They don't need to keep track of every state they're simultaneously in because they don't mind taking a very long time to complete some searches. But in practice, this can make them much faster for some inputs.

Feel free to ask more questions. I'll stop here.

masklinn · on Nov 3, 2023

One that seems to be missing from the list is Spencer’s, which I know of from being postgres’ (and tcl’s) and infamously supports backreferences while being quite resilient to the usual backtracking NFA issues.

burntsushi · on Nov 3, 2023

It's not really "missing" because it's not intended to be an exhaustive list of all possible approaches. :-) It is merely a list of the regex engines currently in use in the regex crate.

It's essentially the same as RE2. But with the addition of a full DFA and a public versioned API for each of the internal engines. See: https://blog.burntsushi.net/regex-internals/

The regex crate follows the RE2 tradition of not supporting features that aren't known how to implement efficiently.

mcbain · on Nov 3, 2023

Which is also the OP's (BurntSushi) work.

panyam · on Nov 3, 2023

I actually had a lot of fun doing this in typescript (though a much simpler not as prod ready version of it) - https://github.com/panyam/tlex

burntsushi · on Nov 4, 2023

Yours is far more complicated than what I did here. This is a translation of Cox's C program for a Thompson NFA simulation. Look at the source. It's only a few hundred lines. And the README should reveal that these are certainly not production ready programs. :-)

panyam · on Nov 4, 2023

Oh yeah. I started with Rus's (amazingly demystifying) blog posts and then got a bit carried away and ended up atleast covering a good chunk of the js regex test suite. The wonders of not having to work on someone else's schedule!

By the way I loved what you did and all your work around making the regex engine exposed as a "configurable" API (I was going by one of your previous posts)!! My only regret is not doing this in Rust as I got a bit scared of the learning curve!

bsaul · on Nov 3, 2023

Every time i read about rust requiring you to use indices instead of pointers for some algorithm, i wonder if this isn't just rust forcing you to implement your own memory management, with your own local pointers.

IshKebab · on Nov 3, 2023

It pretty much is but with some fairly huge benefits:

* It's still memory safe (in the traditional sense). No chance of overwriting return addresses or whatever.

* It's type safe. There's no risk of type confusion. There's an excellent crate called `typed_index_collections` that lets you use a unique index type for each vector so the index is properly typed (i.e. you have `UserIndex` and `CommentIndex` or whatever and you can't mix them up).

* You don't need to do it for your entire program.

I have used this extensively in a profiler that worked on array oriented data structures. It worked great, especially because the data was immutable.

For mutable data it would probably work less well. I haven't tried that.

burntsushi · on Nov 3, 2023

And in this particular case:

* You can choose to use handles smaller than a pointer.

* The NFA is cyclic, which makes memory management quite the chore. Even if you're using Rust's reference counted pointers. But if you use handles, the actual memory management becomes a lot simpler.

muldvarp · on Nov 3, 2023

> * You can choose to use handles smaller than a pointer.

Did you measure any performance improvements due to using small handles?

I have a medium-sized project that uses a ton of `usize`s to reference elements in `Vec`s and would love to know if there's some potential for performance improvements there.

burntsushi · on Nov 3, 2023

In this particular program I didn't do an in depth performance comparison because it wasn't really the point of the exercise.

But in the actual regex crate, I do indeed use `u32` as a representation for a handle. To a first approximation, it corresponds to about half as much memory usage.

Same deal with the aho-corasick crate.

burntsushi · on Nov 3, 2023

What parts of my own memory management did I need to implement in this exercise? And how does its risks and rewards compare to the alternative in C (or Rust)?

We have a real program in front of us. Let's try to be concrete here please.

GrumpySloth · on Nov 3, 2023

You would need to do the same in C++, since the underlying dynamic array is being resized throughout the algorithm, i.e. pointers to its elements are being invalidated. Except C++ won’t warn you about it. You’ll just corrupt memory.

And in garbage collected languages you’d do the same to avoid the cost of GC.

vidarh · on Nov 3, 2023

In C++ you can trivially opt out of using a dynamic array.

There are plenty of reasons not to use C++, including the lack of memory safety, but if you've already decided to ditch that, C++ does let you pick and choose which parts you want to use.

spoiler · on Nov 3, 2023

> does let you pick and choose which parts you want to use

Isn't that the case for most programming languages (ie, you can use the implementation you want/need), though?

vidarh · on Nov 3, 2023

Not to the extent of C++. E.g. you can't opt out of dynamic dispatch overhead in Ruby, because everything is method passing, but "virtual" in C++ is a choice.

The commitment to ensure that features do not add cost(memory, runtime) if you don't use them is something fairly uncommon in other languages, and many languages have features you can't meaningfully avoid.

ernst_klim · on Nov 3, 2023

You don't have to use an array based on realloc. You can use a vector based on linked list of memory pages, or linked list if you need deletion (or gap buffer or any other data structure depends on your needs).

GrumpySloth · on Nov 3, 2023

And at that point you’re slower.

lifthrasiir · on Nov 3, 2023

Even if it's the case, local pointers are at least better than unrestricted global pointers for the same reason that local variables are better than unrestricted global variables. One difference is that local variables generally need no reasoning to prove their safety while local pointers may need one, but that's rather a job for abstractions.

bsaul · on Nov 3, 2023

I agree, but if that's the case, then maybe zig's philosophy of simply creating local memory allocators really is the good pattern ?

lifthrasiir · on Nov 3, 2023

Yes, it is indeed a good trade-off. That strategy is called a region-based memory management in general, and in fact Rust's borrow checker is its generalization, automatically generating fine-grained small regions that are called "lifetimes" ;-) So they are valid approaches, just that Rust borrow checking is much broader that safe Rust covers wide use cases.

masklinn · on Nov 3, 2023

Isn’t it essentially the same pattern except unsafe if you don’t have a borrow checker?

If you have indices into an array you need access to that array, once the array is gone the indexes are “dead”.

If you have pointers into local memory without any sort of borrow checking once the local memory is gone you have UAFs.

vidarh · on Nov 3, 2023

Sort of. They do make many patterns far easier to reason about also without help. E.g. in this case, for any memory only used internally by the parser and NFA, guaranteeing that indices or pointers to the underlying memory isn't leaked at all is often fairly easy, and certainly easier than trying to keep precise track of what to free when.

Having it done for you automatically is certainly even better, but if you're first saddled with manual memory management, it's better than nothing.

zackoverflow · on Nov 3, 2023

Most of the time you don't need to do this, usually its the solution for when creating a graph structure with cyclic references.

But when you do, and your data is mutable, generational indices are a legit and very useful strategy that is used in C/C++ too.

See this: https://floooh.github.io/2018/06/17/handles-vs-pointers.html

pornel · on Nov 4, 2023

I think another solution could be to use an arena/memory pool to replace indices with references.

But indices are still safer than raw pointers - vec access does bounds checks. You can make indices typed if you’re worried about mixing them up.

burntsushi · on Nov 4, 2023

I mentioned slab/bump allocators in the README. But whether it works with the Ptrlist type punning is unclear to me.

ernst_klim · on Nov 3, 2023

That's correct. And you operate with weak pointers at that moment with no help from the type system. So basically that's all about bypassing the too incomplete type system.

yagni_dev · on Nov 3, 2023

It's because borrow checker is very restrictive. Indices are a sanity escape!