Doesn't handle escaped quotes, and the time complexity of that regex is very bad...

eru · on Feb 23, 2024

The time complexity for all matching a string against any fixed regular expression is O(length of string).

If you want to talk about constant factors, we need to leave our comfortable armchairs and actually benchmark.

[Just to be clear, I am talking about real regular expressions, not Franken-xpressions with back-references etc here. But what the original commenter described is well within the realm of what you can do with regular expressions.]

You are right about escaped quotes etc. That's part of why parsing with regular expressions is hard.

meindnoch · on Feb 23, 2024

The time complexity for deciding whether an N-letter string matches a regex or not, is O(N). The time complexity of finding all matches is not O(N) - which is needed in OPs case, because they want to split the string.

Also, OP's solution uses lookahead assertions, so it's not a real regular expression.

(I wonder if we can summon @burntsushi for expert opinion on this?)

eru · on Feb 24, 2024

You are right that the lookahead might be expensive.

(There's probably a way that a sufficiently smart compiler of (ir-)regular expressions can optimize this expression to be still matchable quickly; but Python's regular expression matcher is probably not that smart. I'm not sure if any real world matcher is.)

> The time complexity of finding all matches is not O(N) [...]

If you are happy find a maximal set of non-overlapping matches, you can still do it in O(N). By 'maximal' I mean that you can't greedily find another match (without removing any of the existing matches.)

A sketch of the technique is: take your pattern and wrap it up like this '.?{pattern.?}' (where ? means non-greedy repetition) and match that against your input string. You can do non-greedy repetition and the very limited form of sub-pattern capturing that you need to find all the matches of 'pattern' without breaking O(N) time.

I'm not sure whether you can find the global maximum number of non-overlapping matches, instead of a just a greedy maximum, in O(N) time.