Basically, after importing, retrieving emails is as simple as this:
get_messages, close_server = __pop3()
all_messages = get_messages()
for message in all_messages:
    fn(message)
close_server()
__pop3 returns a simple generator, while close_server shuts down the server connection and cleans up the closure itself. There's no need to retrieve/initialize all the emails at once (with the attendant memory and other overhead); they come one at a time, with the connection kept open in the background. Normally, to get that behaviour you'd have to stick the "message handling code" into the "message retrieval code", or at least pass along a "handling" function - neither of which really seems that kosher to me. Generators solved that problem handily.
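The retrieval half of that pattern might look something like the following hedged sketch built on poplib (the function name and structure are my reconstruction, not the gist's actual code):

```python
import poplib

def make_pop3(server):
    """Return a (get_messages, close_server) pair over an open
    POP3-style connection, mirroring the closure described above."""
    def get_messages():
        count, _size = server.stat()           # number of messages waiting
        for i in range(1, count + 1):          # POP3 messages are 1-indexed
            _resp, lines, _octets = server.retr(i)
            yield b'\n'.join(lines)            # hand back one message at a time

    def close_server():
        server.quit()

    return get_messages, close_server

# usage (hypothetical host and credentials):
# conn = poplib.POP3('pop.example.com')
# conn.user('me'); conn.pass_('secret')
# get_messages, close_server = make_pop3(conn)
```

Because get_messages is a generator, each message is fetched only when the consuming loop asks for it.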
Looks like you could benefit from Python's context managers (aka the "with" statement, http://www.python.org/dev/peps/pep-0343/ ). They let you do something at the start, do stuff in the middle, then guarantee that something is done at the end, even if an exception is thrown. Things that need to be closed afterwards are a great example of where with statements help.
An alternate (and, arguably, more elegant) approach to "Parsing with Regex" and "Tuples to Dictionaries" is to use named regex groups and the MatchObject's "groupdict" method like so:
import re

logpats = r'(?P<host>\S+) (?P<referrer>\S+) (?P<user>\S+) \[(?P<datetime>.*?)\] ' \
          r'"(?P<method>\S+) (?P<request>\S+) (?P<proto>\S+)" (?P<status>\S+) ' \
          r'(?P<bytes>\S+)'
logpat = re.compile(logpats)

groups = (logpat.match(line) for line in loglines)
log = (g.groupdict() for g in groups if g)
If you really like the "pipelined" approach to describing data processing, postfix languages sing at it. Factor (http://factorcode.org/) is well worth investigating.
edit: I built a little toy system for working with generators in Forth - some aspects are a bit messy, but you can still write code like
Apart from chaining generators, you can write a simple helper to pipe general expressions.
def pipe(cur_val, *fns):
    """
    Pipes `cur_val` through `fns`.

    ::

        def sqr(x): return x * x
        def negate(x): return -x

        `pipe(5, sqr, negate)`

    is the same as

    ::

        negate(sqr(5))
    """
    for fn in fns:
        cur_val = fn(cur_val)
    return cur_val
I find `pipe(5, sqr, negate)` neater than `negate(sqr(5))`, especially when the nesting is greater than 2.
This is inspired by Clojure's threading macro -> and is a refactoring of @fogus's code snippet on Stack Overflow. He used reduce to implement the loop - though loops can be implemented with reduce, and that's how you'd generally do it in Clojure et al., it isn't idiomatic Python.
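For comparison, a sketch of the reduce-based version being described (my reconstruction, not @fogus's exact snippet):

```python
from functools import reduce

def sqr(x): return x * x
def negate(x): return -x

def pipe_reduce(cur_val, *fns):
    # reduce folds each function over the running value:
    # reduce(step, (sqr, negate), 5) computes negate(sqr(5)).
    return reduce(lambda val, fn: fn(val), fns, cur_val)

pipe_reduce(5, sqr, negate)  # -25, same as negate(sqr(5))
```

It's equivalent to the explicit for-loop version; the loop just reads more naturally to most Python eyes.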
Python's iterators and generators are a perfect example of doing practical lazy evaluation. Streams in Scheme and lazy sequences in Clojure are similar in many ways.
A good programming language should try to find a sweet spot that has the best of both worlds, using lazy and eager evaluation in different situations. It's also a lot harder to convert code from eager to lazy evaluation than it is to do the other way around, so I'm thinking that lazy evaluation should be the "default", as it is in Python 3 (range vs. xrange in Python 2) and in Clojure for iteration and sequences.
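A quick illustration of why the lazy default is the convenient one (plain standard-library Python; the `expensive` helper is mine): a generator expression does no work until iterated, while the eager list version computes everything up front - and forcing a lazy sequence when you do want eagerness is just a call to list().

```python
calls = []

def expensive(x):
    calls.append(x)        # record each evaluation
    return x * x

eager = [expensive(x) for x in range(5)]   # all five run immediately
assert calls == [0, 1, 2, 3, 4]

calls.clear()
lazy = (expensive(x) for x in range(5))    # nothing has run yet
assert calls == []
first = next(lazy)                          # only now is one computed
assert first == 0 and calls == [0]

rest = list(lazy)                           # lazy -> eager is the easy direction
assert rest == [1, 4, 9, 16]
```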
Inspired by this very presentation, I recently knocked together a library for munging text streams in Python, rather than calling through to sed and grep.
I've really enjoyed this presentation. Thanks for sharing!
An implementation of "tail -f filename" in Python, given as an example (slightly altered):

import time

def follow(filename):
    f = open(filename)
    f.seek(0, 2)               # jump to the end of the file
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.1)    # no new data yet; avoid busy-waiting
            continue
        yield line

for new_line in follow('/tmp/test.file'):
    print(new_line)
I love this pdf. The one thing I've been wanting is for the original author, or someone else with another ~1gig dataset, to do the test again with pypy and give us the speed figures.
Reading the source for "Hacking Scrabble (part 1)" <http://news.ycombinator.com/item?id=3347720>, I remembered this presentation. I learned a lot from it and thought you guys might like it too.
I've got an implementation of tee in Python. Python itself comes with a reverse tee (reading from multiple files/file-like objects), but there was nothing out there for writing once to multiple files (or file-like objects).
It wasn't as easy as one would think, because using it in conjunction with something like subprocess.Popen() presents a problem: Popen requires a file object with an actual file descriptor (i.e. file.fileno()) rather than just some mock object that implements the right instance methods.
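One way around the fileno() problem is to back the tee with a real OS pipe and pump it from a background thread - a minimal sketch (class name, buffer size, and design are my assumptions, not the commenter's actual code):

```python
import os
import threading

class Tee:
    """Fan writes out to several file-like targets, while exposing a
    real file descriptor via fileno() so it can be handed to things
    like subprocess.Popen(stdout=...)."""

    def __init__(self, *targets):
        self.targets = targets
        self._read_fd, self._write_fd = os.pipe()
        self._thread = threading.Thread(target=self._pump, daemon=True)
        self._thread.start()

    def fileno(self):
        # A genuine OS-level descriptor, not a mock method.
        return self._write_fd

    def _pump(self):
        # Unbuffered so partial writes are forwarded promptly.
        with os.fdopen(self._read_fd, "rb", buffering=0) as reader:
            for chunk in iter(lambda: reader.read(4096), b""):
                for target in self.targets:
                    target.write(chunk)

    def close(self):
        os.close(self._write_fd)   # EOF for the pump thread
        self._thread.join()
```

Writers see a normal descriptor; the thread copies every chunk to each target until the write end is closed.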
Enumerators solve typical generator scenarios reasonably well and are easy to implement for your own classes. They use Fibers internally and include Enumerable. Enumerator supports both internal and external iteration.
I remember using one when I was building a Mail DB search appliance (on the backburner..) with the mail retrieval code being this:
https://gist.github.com/1014045