Basically, after importing, retrieving emails is as simple as this:
get_messages, close_server = __pop3()
all_messages = get_messages()
for message in all_messages:
    fn(message)
close_server()
__pop3 returns a simple generator, while close_server shuts down the server connection and cleans up the closure itself. There's no need to retrieve/initialize all the emails at once (with the attendant memory and other overhead); they come one at a time, with the connection kept open in the background. Normally, to get that behaviour you'd have to stick the "message handling code" into the "message retrieval code", or at least pass along a "handling" function - neither of which really seems that kosher to me. Generators solved that problem handily.
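The retrieval half of that pattern might look something like the following hedged sketch built on poplib (the function name and structure are my reconstruction, not the gist's actual code):

```python
import poplib

def make_pop3(server):
    """Return a (get_messages, close_server) pair over an open
    POP3-style connection, mirroring the closure described above."""
    def get_messages():
        count, _size = server.stat()           # number of messages waiting
        for i in range(1, count + 1):          # POP3 messages are 1-indexed
            _resp, lines, _octets = server.retr(i)
            yield b'\n'.join(lines)            # hand back one message at a time

    def close_server():
        server.quit()

    return get_messages, close_server

# usage (hypothetical host and credentials):
# conn = poplib.POP3('pop.example.com')
# conn.user('me'); conn.pass_('secret')
# get_messages, close_server = make_pop3(conn)
```

Because get_messages is a generator, each message is fetched only when the consuming loop asks for it.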
Looks like you could benefit from Python's context managers (aka the "with" statement, http://www.python.org/dev/peps/pep-0343/ ). They let you do something at the start, do stuff in the middle, then guarantee that something is done at the end, even if an exception is thrown. Things that need to be closed afterwards are a great example of where with statements help.
An alternate (and, arguably, more elegant) approach to "Parsing with Regex" and "Tuples to Dictionaries" is to use named regex groups and the MatchObject's "groupdict" method like so:
import re

logpats = r'(?P<host>\S+) (?P<referrer>\S+) (?P<user>\S+) \[(?P<datetime>.*?)\] ' \
          r'"(?P<method>\S+) (?P<request>\S+) (?P<proto>\S+)" (?P<status>\S+) ' \
          r'(?P<bytes>\S+)'
logpat = re.compile(logpats)

groups = (logpat.match(line) for line in loglines)
log = (g.groupdict() for g in groups if g)
If you really like the "pipelined" approach to describing data processing, postfix languages sing at it. Factor (http://factorcode.org/) is well worth investigating.
edit: I built a little toy system for working with generators in Forth - some aspects are a bit messy, but you can still write code like
Apart from chaining generators, you can write a simple helper to pipe general expressions.
def pipe(cur_val, *fns):
    """
    Pipes `cur_val` through `fns`.

    ::

        def sqr(x): return x * x
        def negate(x): return -x

        `pipe(5, sqr, negate)`

    is the same as

    ::

        negate(sqr(5))
    """
    for fn in fns:
        cur_val = fn(cur_val)
    return cur_val
I find `pipe(5, sqr, negate)` neater than `negate(sqr(5))`, especially when the nesting is greater than 2.
This is inspired by Clojure's threading macro -> and is a refactoring of @fogus's code snippet on Stack Overflow. He used reduce to implement the loop - though loops can be implemented with reduce, and that's how you'd generally do it in Clojure et al., it isn't idiomatic Python.
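For comparison, a sketch of the reduce-based version being described (my reconstruction, not @fogus's exact snippet):

```python
from functools import reduce

def sqr(x): return x * x
def negate(x): return -x

def pipe_reduce(cur_val, *fns):
    # reduce folds each function over the running value:
    # reduce(step, (sqr, negate), 5) computes negate(sqr(5)).
    return reduce(lambda val, fn: fn(val), fns, cur_val)

pipe_reduce(5, sqr, negate)  # -25, same as negate(sqr(5))
```

It's equivalent to the explicit for-loop version; the loop just reads more naturally to most Python eyes.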
Python's iterators and generators are a perfect example of doing practical lazy evaluation. Streams in Scheme and lazy sequences in Clojure are similar in many ways.
A good programming language should try to find a sweet spot that has the best of both worlds, using lazy and eager evaluation in different situations. It's also a lot harder to convert code from eager to lazy evaluation than it is to do the other way around, so I'm thinking that lazy evaluation should be the "default", as it is in Python 3 (range vs. xrange in Python 2) and in Clojure for iteration and sequences.
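A quick illustration of why the lazy default is the convenient one (plain standard-library Python; the `expensive` helper is mine): a generator expression does no work until iterated, while the eager list version computes everything up front - and forcing a lazy sequence when you do want eagerness is just a call to list().

```python
calls = []

def expensive(x):
    calls.append(x)        # record each evaluation
    return x * x

eager = [expensive(x) for x in range(5)]   # all five run immediately
assert calls == [0, 1, 2, 3, 4]

calls.clear()
lazy = (expensive(x) for x in range(5))    # nothing has run yet
assert calls == []
first = next(lazy)                          # only now is one computed
assert first == 0 and calls == [0]

rest = list(lazy)                           # lazy -> eager is the easy direction
assert rest == [1, 4, 9, 16]
```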
Inspired by this very presentation, I recently knocked together a library for munging text streams in Python, rather than calling through to sed and grep.
I've really enjoyed this presentation. Thanks for sharing!
An implementation of "tail -f filename" in Python, given as an example (slightly altered):

import time

def follow(filename):
    f = open(filename)
    f.seek(0, 2)               # jump to the end of the file
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.1)    # no new data yet; avoid busy-waiting
            continue
        yield line

for new_line in follow('/tmp/test.file'):
    print(new_line)
I love this pdf. The one thing I've been wanting is for the original author, or someone else with another ~1gig dataset, to do the test again with pypy and give us the speed figures.
Reading the source for "Hacking Scrabble (part 1)" <http://news.ycombinator.com/item?id=3347720>, I remembered this presentation. I learned a lot from it and thought you guys might like it too.
I've got an implementation of tee in Python. Python itself comes with a reverse tee (reading from multiple files/file-like objects), but there was nothing out there for writing once to multiple files (or file-like objects).
It wasn't as easy as one would think, because using it in conjunction with something like subprocess.Popen() presents a problem: Popen requires a file object with an actual file descriptor (i.e. file.fileno()) rather than just some mock object that implements the right instance methods.
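One way around the fileno() problem is to back the tee with a real OS pipe and pump it from a background thread - a minimal sketch (class name, buffer size, and design are my assumptions, not the commenter's actual code):

```python
import os
import threading

class Tee:
    """Fan writes out to several file-like targets, while exposing a
    real file descriptor via fileno() so it can be handed to things
    like subprocess.Popen(stdout=...)."""

    def __init__(self, *targets):
        self.targets = targets
        self._read_fd, self._write_fd = os.pipe()
        self._thread = threading.Thread(target=self._pump, daemon=True)
        self._thread.start()

    def fileno(self):
        # A genuine OS-level descriptor, not a mock method.
        return self._write_fd

    def _pump(self):
        # Unbuffered so partial writes are forwarded promptly.
        with os.fdopen(self._read_fd, "rb", buffering=0) as reader:
            for chunk in iter(lambda: reader.read(4096), b""):
                for target in self.targets:
                    target.write(chunk)

    def close(self):
        os.close(self._write_fd)   # EOF for the pump thread
        self._thread.join()
```

Writers see a normal descriptor; the thread copies every chunk to each target until the write end is closed.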
Enumerators solve typical generator scenarios reasonably well and are easy to implement for your own classes. They use Fibers internally and include Enumerable. Enumerator supports both internal and external iteration.
I remember using one when I was building a Mail DB search appliance (on the backburner..) with the mail retrieval code being this:
https://gist.github.com/1014045