(Python) Generator Tricks For Systems Programmers (pdf) (dabeaz.com)
205 points by slig on Dec 13, 2011 | hide | past | favorite | 24 comments


I love generators! They produce attractive code that is understandable (once one really understands what a generator is)

I remember using one when I was building a Mail DB search appliance (on the backburner..) with the mail retrieval code being this:

https://gist.github.com/1014045

Basically, after importing, retrieving emails is as simple as this:

    get_messages, close_server = __pop3()
    all_messages = get_messages()
    for message in all_messages:
        fn(message)
    close_server()
__pop3 returns a simple generator, while close_server shuts down the server and cleans up the closure itself. No need to retrieve/initialize all the emails at once (with the memory and other overhead involved); just one at a time, with the connection kept open in the background. Normally if you wanted to do that, you'd have to stick the "message handling code" into the "message retrieval code", or at least pass along a "handling" function - neither of which really seems that kosher to me. Generators solved that problem handily.
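The gist spells out the details; the overall shape of the pattern (a factory returning a generator plus a cleanup closure) can be sketched roughly like this. `make_reader` and its in-memory message list are invented stand-ins for the real POP3 session:

```python
def make_reader(messages):
    # Hypothetical stand-in for the gist's __pop3(): the real version
    # opens a POP3 connection; here an in-memory list plays that role.
    state = {"closed": False}

    def get_messages():
        # A generator: messages are produced one at a time, on demand.
        for msg in messages:
            if state["closed"]:
                raise RuntimeError("reader already closed")
            yield msg

    def close_server():
        # The real version would QUIT the POP3 session here.
        state["closed"] = True

    return get_messages, close_server

get_messages, close_server = make_reader(["msg1", "msg2"])
received = [m for m in get_messages()]
close_server()
```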


Looks like you could benefit from Python's context managers (aka the "with" statement, http://www.python.org/dev/peps/pep-0343/ ). They let you do something at the start, do stuff in the middle, then guarantee that something is done at the end, even if an exception is thrown. Things that need to be closed afterwards are a great example of where with statements help.
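As a rough sketch of how the suggestion applies to the POP3 example above (pop3_session and its in-memory message list are hypothetical, not the commenter's code):

```python
from contextlib import contextmanager

@contextmanager
def pop3_session(messages):
    # Setup (connect/authenticate) would go here.
    try:
        yield iter(messages)  # hand a lazy iterator to the with-block
    finally:
        # Teardown runs even if the block raises -- this replaces
        # the explicit close_server() call.
        pass  # a real version would call server.quit() here

received = []
with pop3_session(["msg1", "msg2"]) as msgs:
    for m in msgs:
        received.append(m)
```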


Thanks, I'll take a look.


An alternate (and, arguably, more elegant) approach to "Parsing with Regex" and "Tuples to Dictionaries" is to use named regex groups and the MatchObject's "groupdict" method like so:

  logpats = r'(?P<host>\S+) (?P<referrer>\S+) (?P<user>\S+) \[(?P<datetime>.*?)\] '\
            r'"(?P<method>\S+) (?P<request>\S+) (?P<proto>\S+)" (?P<status>\S+) '\
            r'(?P<bytes>\S+)'
  logpat  = re.compile(logpats)
  groups  = (logpat.match(line) for line in loglines)
  log     = (g.groupdict() for g in groups if g)
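To make the fragment self-contained, here is the same pipeline run against a single invented log line (the sample line and its field values are made up for illustration):

```python
import re

logpat = re.compile(
    r'(?P<host>\S+) (?P<referrer>\S+) (?P<user>\S+) \[(?P<datetime>.*?)\] '
    r'"(?P<method>\S+) (?P<request>\S+) (?P<proto>\S+)" (?P<status>\S+) '
    r'(?P<bytes>\S+)')

# A made-up Apache-style log line for demonstration purposes.
loglines = ['127.0.0.1 - jane [10/Oct/2011:13:55:36 -0700] '
            '"GET /index.html HTTP/1.0" 200 2326']

groups = (logpat.match(line) for line in loglines)
log = (g.groupdict() for g in groups if g)
entry = next(log)  # each entry is a dict keyed by the named groups
```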


If you really like the "pipelined" approach to describing data processing, postfix languages excel at it. Factor (http://factorcode.org/) is well worth investigating.

edit: I built a little toy system for working with generators in Forth - some aspects are a bit messy, but you can still write code like

  100 upto sum dup * 100 upto { dup * } map sum -
(https://github.com/JohnEarnest/Mako/blob/master/examples/Alg...)


Apart from chaining generators, you can write a simple helper to pipe general expressions.

    def pipe(cur_val, *fns):
        """
        Pipes `cur_val` through `fns`.
        ::
            def sqr(x): return x * x

            def negate(x): return -x

        `pipe(5, sqr, negate)`

        is the same as
        ::
            negate(sqr(5))
        """
        for fn in fns:
            cur_val = fn(cur_val)
        return cur_val

I find `pipe(5, sqr, negate)` neater than `negate(sqr(5))`, especially when the nesting is greater than 2.

This is inspired by Clojure's threading macro -> and a refactoring of @fogus's code snippet on Stack Overflow. He used reduce to implement the loop - though loops can be implemented with reduce, and that's how you'd generally do it in Clojure et al., it isn't idiomatic Python.


Python's iterators and generators are a perfect example of doing practical lazy evaluation. Streams in Scheme and lazy sequences in Clojure are similar in many ways.

A good programming language should try to find a sweet spot that has the best of both worlds, using lazy and eager evaluation in different situations. It's also a lot harder to convert code from eager to lazy evaluation than it is to do the other way, so I'm thinking that lazy evaluation should be the "default", as it is in Python 3 (where range behaves like Python 2's xrange) and in Clojure for iteration and sequences.
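A tiny demonstration of the lazy/eager distinction: a generator expression performs no work until it is consumed, while a list comprehension evaluates everything up front.

```python
calls = []

def record(x):
    calls.append(x)
    return x * x

lazy = (record(n) for n in range(5))   # lazy: no calls made yet
assert calls == []

eager = [record(n) for n in range(5)]  # eager: all five calls made now
assert calls == [0, 1, 2, 3, 4]

calls.clear()
first = next(lazy)  # consuming the generator evaluates exactly one element
```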


Inspired by this very presentation, I recently knocked together a library for munging text streams in Python, rather than calling through to sed and grep.

If anyone's interested, the code's here: https://github.com/inglesp/munger.py


Is there any special reason for why you didn't use imap here https://github.com/inglesp/munger.py/blob/master/munger.py#L... ?
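For anyone unfamiliar with the suggestion: itertools.imap is Python 2's lazy map. In Python 3 the built-in map is already lazy, so a generator expression and a map call are interchangeable for simple element-wise transforms. The lines below are an illustration, not munger.py's actual code:

```python
lines = ['  a  ', '  b  ']

# generator-expression spelling
stripped_genexp = (line.strip() for line in lines)

# map spelling (itertools.imap in Python 2; the built-in map is lazy in Python 3)
stripped_map = map(str.strip, lines)

genexp_out = list(stripped_genexp)
map_out = list(stripped_map)
```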


Habit, I think. Will change it when I get the chance.


I've really enjoyed this presentation. Thanks for sharing

An implementation of "tail -f filename" in Python, given as an example (slightly altered):

  import time

  def follow(filename):
      f = open(filename)
      f.seek(0, 2)              # jump to the end of the file
      while True:
          line = f.readline()
          if not line:
              time.sleep(0.1)   # avoid spinning at 100% CPU at EOF
              continue
          yield line

  for new_line in follow('/tmp/test.file'):
      print new_line
This is so cool.


I love this pdf. The one thing I've been wanting is for the original author, or someone else with another ~1GB dataset, to do the test again with PyPy and give us the speed figures.


Running the same scripts (mostly, except my Apache log lines were a bit longer) on a 1.1GB file resulted in:

python 2.7.1: list - 15.7s / generator - 14.1s

pypy 1.7: list - 7.5s / generator - 7.7s

On a MBP 2.2GHz i7 w/8GB. Fun!


Thanks, the future looks bright. Although RAM usage would be higher for PyPy, the improvement is massive.


I checked: PyPy used about 30MB, while vanilla Python was right around 3-4MB. So an order of magnitude.


Reading the "Hacking Scrabble (part 1)" <http://news.ycombinator.com/item?id=3347720> source, I remembered this presentation. I learned a lot from it and thought you guys might like it too.


The follow-up presentation here is just as good, in my opinion: http://www.dabeaz.com/coroutines/Coroutines.pdf


Thank you! I'll read it ASAP.


I've got an implementation of tee in Python. Python itself comes with a reverse tee (reading from multiple files/file-like objects), but there was nothing out there for writing once to multiple files (or file-like objects).

It wasn't as easy as one would think, because trying to use it in conjunction with something like subprocess.Popen() presents a problem since it requires a file object that has an actual file handle (i.e. file.fileno()) rather than just some mock-object that implements the right instance methods.

I need to get it up on github and/or pypi.



That's not the tee that I'm talking about.

This takes one iterator and returns n multiple iterators:

http://docs.python.org/library/itertools.html#itertools.tee

This is what I'm talking about when I'm talking about tee:

http://linux.die.net/man/1/tee

Python has a reverse tee:

http://docs.python.org/library/fileinput.html

This iterates over input from multiple sources, but my implementation takes input from a single source and writes it to multiple destinations.
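The write-side tee described here can be sketched in a few lines. This is an illustration of the idea, not the commenter's code, and it deliberately omits fileno(), so the subprocess.Popen problem mentioned above still applies:

```python
import io

class Tee:
    """A file-like object that duplicates every write to several
    underlying streams, like the tee(1) command."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, data):
        for s in self.streams:
            s.write(data)

    def flush(self):
        for s in self.streams:
            s.flush()

# write once; the data lands in both in-memory buffers
a, b = io.StringIO(), io.StringIO()
tee = Tee(a, b)
tee.write("hello\n")
tee.flush()
```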


Is there an equivalent in Ruby?


Not as high level as Python's generator expressions, but you can use Ruby's fibers to implement generators.


Enumerator solves typical generator scenarios reasonably well, and is easy to implement for your own classes. It uses Fibers internally and includes Enumerable. Enumerator supports both internal and external iteration.



