
I left the following comment at the blog:

For this example, idiomatic Pandas would look like this:

  def gen_stats_pandas2(dataset_pandas):
      grouped = dataset_pandas.groupby('product_id')
      stats = grouped.aggregate(['count', 'sum', 'mean'])
      return stats[[('product_order_num','count'), ('quantity','sum'), ('price','mean')]]

which is simpler and about 500x faster than gen_stats_pandas() in the example:

  %%time 
  res1 = gen_stats_pandas(dataset_pandas)

  CPU times: user 8.48 s, sys: 32 ms, total: 8.51 s
  Wall time: 8.58 s

  %%time 
  res2 = gen_stats_pandas2(dataset_pandas)

  CPU times: user 16 ms, sys: 4 ms, total: 20 ms
  Wall time: 17.6 ms


Sometimes I wonder whether actual optimization (as a craft, if you like) has been poisoned by scripting languages (Python etc.) and frameworks.

For one thing they completely changed how the speed of computers is perceived, e.g. it is actually considered fast when a web app spends "only" 20 server CPU milliseconds on a request.

And because most things are so inefficient, it is relatively easy to get speed-ups of integer factors. When looking at applications written in C++ by actual engineers, often just gaining 5 or 10 % on some functionality is appreciated, and achieving more can be seriously difficult. With software in scripting languages, improvements below a few hundred percent don't raise an eyebrow.

However, actually optimizing code in a scripting language (without changing the way a native library is used) can be seriously difficult and annoying, because most simply don't have efficient tools — obviously, if a scripting language (like Python) had something substantially faster than, say, attribute access, then that something would simply be used to power attribute access in the general case. So in C++ (or other native languages) you can actually make good use of the processor's and memory system's resources, but you just can't when writing in a scripting language.

It is also surprising that most tooling available for e.g. Python is rather immature, insufficient and cumbersome to use compared to the tooling (especially commercial) available for native code (or Java). This likely has something to do with people a) not caring about performance at all, and/or b) the fact that it is basically not possible to write fast Python code: as soon as something becomes important for performance, it is written in another language and used from Python, and for analyzing the performance of that code, which is now an external library, you don't need tools for Python.


When Python is being used for these data analytical tasks, random attribute access is not the common case. Instead, SIMD and vectorized mathematical transformations are the order of the day, and those can be specified very concisely and powerfully within the syntax of a "scripting" language.
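
For instance (a made-up sketch, not from the article), the same element-wise transformation as an explicit loop versus a vectorized NumPy expression:

  import numpy as np

  prices = np.random.rand(1000000)
  quantities = np.random.randint(1, 10, size=1000000)

  # explicit Python loop: one interpreted iteration per element
  totals_loop = [p * q for p, q in zip(prices, quantities)]

  # vectorized: the whole transformation in one expression,
  # dispatched to optimized native (and often SIMD) code
  totals_vec = prices * quantities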

More importantly, that conciseness means that that language is accessible to people who understand the right maths to use. This is frequently not your CS-trained C++/Java software engineer.

It's the age-old philosophical quandary: is it more useful to have the fastest bomber that can only be piloted by specialized pilots who must receive targeting instructions via long-distance radio, or is it better to have a moderately fast bomber that can be piloted by people who can actually see the targets?

Python is succeeding and taking over the world because businesses are overwhelmingly seeing the value of the latter approach.


Ad-hoc statistical analysis that takes a few seconds to run is a different technical case than production environments where every second is critical. For most datasets/workflows, as long as it can fit into a few GB of memory, it can run on a computer in a reasonable amount of time on any modern data analysis library like pandas or R's dplyr without having to dive into bare metal.

The exception is if you code your algorithm very poorly (as in this case) and its bad time complexity causes the run time to balloon with large amounts of data.


This comment contains some fairly pejorative language.

I can assure you that there are real-world high-performance Python systems with per-event budgets of far less than 1 ms being used for critical systems in large organisations. Sometimes with the help of compiled components, but made understandable by being coordinated in Python.

Of course these systems are atypical, but the idea that ‘actual’ engineers always use C/Java, while Python is just a scripting language, is just ignorant.


That's not what I wrote.


It's especially weird in this case because the value-proposition of pandas is the ability to do things like aggregation in 3 lines of code and not have to iterate through every row.


Well pandas iterates through every row anyway, just in native-land rather than pure Python. That's not the difference.

The trouble with this code was the choice of the quadratic algorithm (which the idiomatic pandas version avoids). You could achieve a similar speed-up as OP for the pure Python code too, by dropping the doubly-nested loop.

The point stands as usual: better algo > clever tricks :) Personally, I find the code brevity that pandas offers does not outweigh the added complexity and cognitive load of its (IMO hard to remember and reason about) API.

EDIT: pure Python version

  def gen_stats_python2(dataset_python):
      aggr = {}
      for product_id, product_order_num, quantity, price in dataset_python:
          item = aggr.get(product_id, [0, []])
          item[0] += quantity
          item[1].append(price)
          aggr[product_id] = item
      return [
          (int(product_id), len(prices), int(quantity), round(float(sum(prices)) / len(prices), 2))
          for product_id, (quantity, prices) in aggr.items()
      ]
761x speedup!

  %timeit gen_stats_python(dataset_python)
  1 loops, best of 3: 36.3 s per loop

  %timeit gen_stats_python2(dataset_python)
  10 loops, best of 3: 47.7 ms per loop


It's a lot faster if you use python's itertools and operator itemgetter:

  import itertools
  import time
  from operator import itemgetter

  def gen_stats(data):
    start = time.time()
    stats = []
    data.sort(key=itemgetter(0))
    for p_id, p_items in itertools.groupby(data, key=itemgetter(0)):
      num_orders, _, total_quantity, total_price = map(sum, zip(*p_items))
      p_id, num_orders, total_quantity = map(int, [p_id, num_orders/p_id, total_quantity])
      stats.append([p_id, num_orders, total_quantity, round(float(total_price/num_orders),2)])
    end = time.time()
    working_time = end-start
    return stats, working_time
Results:

  Dataset size 55096 records
  [[1, 8, 340, 22.55], [2, 4, 94, 15.83], [3, 2, 97, 3.89]]
  [[1, 8, 340, 22.55], [2, 4, 94, 15.83], [3, 2, 97, 3.89]]
  All results equal: True
  Runtime (s)
  gen_stats_slow: 28.489468574523926
  gen_stats_fast:  0.06000018119812012


That doesn't look faster -- 60ms? Nor clearer, at least to me :)

Additionally, you might have a bug in the code: groupby() only groups consecutive keys, so your code will fail if the dataset is not already sorted by product_id.
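
A quick illustration with toy data (not from the post):

  from itertools import groupby

  data = [1, 2, 1]
  # groupby() only merges runs of equal, adjacent keys
  print([(k, list(g)) for k, g in groupby(data)])          # [(1, [1]), (2, [2]), (1, [1])]
  print([(k, list(g)) for k, g in groupby(sorted(data))])  # [(1, [1, 1]), (2, [2])]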

I guess that's the lesson here -- use pandas to make sure you get a sane, fast, tested implementation of some common use-cases. That is, if you can remember its API, when it makes copies vs views under the hood, and all that gnarly stuff :-)
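
For example, the classic copy-vs-view trap (a hypothetical sketch, not code from the post):

  import pandas as pd

  df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

  # chained indexing may assign into a temporary copy and trigger
  # SettingWithCopyWarning; whether df itself changes is not guaranteed
  df[df['a'] > 1]['b'] = 0

  # a single .loc assignment operates unambiguously on the original frame
  df.loc[df['a'] > 1, 'b'] = 0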


Wee! Pointless optimization Saturday!

Looks like my gen_stats_eesmith is even faster. Radim's is about 400x faster, iClaudiusX is 450x faster, and eesmith is 580x faster.

Under pypy2-v5.8.0 those are 65x, 16x, and 91x, respectively. The original algorithm is 10x faster under pypy.

    import itertools
    import time
    from collections import defaultdict
    from operator import itemgetter

    def gen_stats_python(dataset_python):
        start = time.time()
        product_stats = []
        unique_products = set([x[0] for x in dataset_python])
        for product_id in unique_products:
            product_items = [x for x in dataset_python if x[0]==product_id ]
            num_orders = len(product_items)
            total_quantity = 0
            total_price = 0
            for row in product_items:
                total_quantity += row[2]
                total_price += row[3]
            avg_price = float(total_price/num_orders)
            product_stats.append([int(product_id),int(num_orders),int(total_quantity),round(avg_price,2)])
        end = time.time()
        working_time = end-start
        return product_stats,working_time

    def gen_stats_radim(dataset_python):
        start = time.time()
        aggr = {}
        for product_id, product_order_num, quantity, price in dataset_python:
            item = aggr.get(product_id, [0, []])
            item[0] += quantity
            item[1].append(price)
            aggr[product_id] = item
        return [
            [int(product_id), len(prices), int(quantity), round(float(sum(prices)) / len(prices), 2)]
            for product_id, (quantity, prices) in aggr.items()
        ], time.time() - start

    def gen_stats_iclaudiusx(data):
        start = time.time()
        stats = []
        data.sort(key=itemgetter(0))
        for p_id, p_items in itertools.groupby(data, key=itemgetter(0)):
            num_orders, _, total_quantity, total_price = map(sum, zip(*p_items))
            p_id, num_orders, total_quantity = map(int, [p_id, num_orders/p_id, total_quantity])
            stats.append([p_id, num_orders, total_quantity, round(float(total_price/num_orders), 2)])
        end = time.time()
        working_time = end - start
        return stats, working_time

    def gen_stats_eesmith(dataset_python):
        start = time.time()
        product_stats = []
        total_num_orders = defaultdict(int)
        total_quantity_by_product_id = defaultdict(int)
        total_price_by_product_id = defaultdict(float)

        for product_id, order_id, quantity, price in dataset_python:
            total_num_orders[product_id] += 1
            total_quantity_by_product_id[product_id] += quantity
            total_price_by_product_id[product_id] += price

        for product_id in total_price_by_product_id:
            num_orders = total_num_orders[product_id]
            total_quantity = total_quantity_by_product_id[product_id]
            total_price = total_price_by_product_id[product_id]

            avg_price = float(total_price)/num_orders
            product_stats.append([int(product_id),int(num_orders),int(total_quantity),round(avg_price,2)])
        end = time.time()
        working_time = end-start
        return product_stats,working_time

    reference_result = reference_dt = None
    gen_stats_functions = (gen_stats_python, gen_stats_radim,
        gen_stats_iclaudiusx, gen_stats_eesmith)
    for gen_stats_function in gen_stats_functions:
        result, dt = gen_stats_function(dataset_python)
        if reference_result is None:
            reference_result = result
            reference_dt = dt
        assert result == reference_result, (gen_stats_function.__name__,
                 result[:2], reference_result[:2])
        print(gen_stats_function.__name__, dt, reference_dt/dt)


Who doesn't like a good pointless optimisation day? In the same vein (and because what else was I going to do with my Sunday morning?) I ran yours under 3.x and came up with radim/iclaudiusx at ~430x, eesmith at ~600x, tech2 at ~850x. I didn't get a chance to try pypy. Thanks for the amusement!

    def gen_stats_tech2(dataset_python):
        start = time.time()
        product_stats = []

        # totals = (num_orders, quantity, price)
        totals = defaultdict(lambda: (0, 0, 0.0))

        for product_id, order_id, quantity, price in dataset_python:
            o_count, o_quant, o_price = totals[product_id]
            totals[product_id] = (o_count + 1, o_quant + quantity, o_price + price)

        for product_id, product_info in totals.items():
            num_orders, total_quantity, total_price = product_info
            avg_price = total_price / num_orders
            product_stats.append(
                [product_id, num_orders, int(total_quantity),
                 round(avg_price, 2)]
            )
        end = time.time()
        working_time = end - start
        return product_stats, working_time


And, just because I'm in a golfing mood:

    def gen_stats_tech2(dataset_python):
        start = time.time()

        totals = defaultdict(lambda: (0, 0, 0.0))

        for product_id, order_id, quantity, price in dataset_python:
            o_count, o_quant, o_price = totals[product_id]
            totals[product_id] = (o_count + 1, o_quant + quantity, o_price + price)

        product_stats = [
            [product_id, num_orders, total_quantity,
             round(total_price / num_orders, 2)]
            for product_id, (num_orders, total_quantity, total_price)
            in totals.items()
        ]
        end = time.time()
        working_time = end - start
        return product_stats, working_time
950x improvement. I think the former reads more pleasantly, though, and you're likely to do better with something like PyPy rather than by dragging the code through the muck.

Most improvements come from using some of the builtins, like the dictionary's .items() method rather than a getitem call per loop; removing references to globals (int), since the stored values already have an integer value; removing N calls to append and letting the list comprehension manage it; and removing a store-then-unpack.
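
If anyone wants to poke at the .items() point, a rough timeit sketch (hypothetical, not something from the post):

    from timeit import timeit

    d = {i: i for i in range(10000)}

    # per-key getitem vs iterating .items(): the latter avoids one
    # dict lookup per iteration
    t_getitem = timeit(lambda: [d[k] for k in d], number=1000)
    t_items = timeit(lambda: [v for _, v in d.items()], number=1000)
    print(t_getitem, t_items)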

Sadly I don't think the top loop can be reduced in a similar manner because it's self-referential. Still, that made for a fun hour or so; I was really hoping to hit a 1,000x improvement :/


Nice work!


Nice :)

Interestingly, the O(n) Python is still about 4x slower than the O(n) Pandas -- the same relative difference as in the original O(n^2) implementations from the blog post.

So despite the mishaps of this "Senior IT Business Analyst" choosing an unsuitable algorithm, his relative graphs pretty much stand.


I have had much fun reading this thread of comments today :)

This comparison was never about concrete (optimal) implementations but about comparing exactly the same implementations, doing exactly the same operations. Such an approach would be the same for comparing the performance of any languages, like C and Java. This is not a "best algorithm" race.

So, after a day of discussion, the result is the following: regardless of whether the implementation is quadratic or linear, the relative speed of all 3 technologies is (roughly) the same. I could not agree more, as this is primary-school math: dividing by the same denominator means the proportions stay the same :)

I plan to update my blog post, based on the many interesting comments you have all provided. I definitely think this analysis should be more thorough. Thank you for your insights.

I have many great pieces of code for inspiration, but I am struggling with providing exactly the same implementations. So it may cost me some time to provide all 3 optimized implementations that are exactly equivalent to each other.


"the relative speed of all 3 technologies is (roughly) the same"

I'll be a bit pedantic. The original essay says "pure Python", which is not a technology but a language. There are several implementations of Python, which I'll argue is the "technology". The PyPy implementation of the pure Python algorithm is 10x faster than the CPython implementation.

According to the original essay, that's close to the CPython/NumPy performance, and faster than the CPython/Pandas version.

It's tempting to omit PyPy because "nobody uses it in data science". That's a self-fulfilling argument.

PyPy is doing amazing work to support both NumPy and Pandas, but it's limited by funding. Why aren't data science people pumping money into PyPy? It would give a huge performance boost even for naive, throw-it-together algorithms.

If you don't know what PyPy can do for you, how do you learn if/when you should use it?


I have tried PyPy once and it gives a huge speed-up. I am not sure how it works with Pandas/NumPy. I have also tried the Numba JIT. Easy to use, but in my experience it saves just up to 10% of the time. There is also Cython (https://pandas.pydata.org/pandas-docs/stable/enhancingperf.h...) and it definitely works with Pandas. Have not tried it yet.
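
For reference, a minimal Numba sketch (my own toy example on a NumPy array, not the blog's dataset):

    import numpy as np
    from numba import njit

    @njit
    def total_quantity(quantities):
        # plain loop, compiled to machine code on first call
        total = 0
        for q in quantities:
            total += q
        return total

    quantities = np.random.randint(1, 10, size=1000000)
    print(total_quantity(quantities))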


If you have a pure Python implementation, then why does it matter how well PyPy works with Pandas and NumPy?

My point is a minor one. "Python" isn't a technology. "CPython" and "PyPy" are technologies. Or you can try MicroPython, Iron Python, Jython, and others.

In any case, you'll see the best idiomatic Pandas code is reported to be 500x faster than the original, while the best idiomatic Python is reported to be 950x faster. The proportions are not the same, so it's not like "dividing by the same denominator".

The rank order is unchanged, though that's a weaker observation than what you wrote.


Alright man, good luck. I know HN can be brutal.

But this was really not your best piece (technically; it may have met its marketing / traffic goals).


>> Well pandas iterates through every row anyway

The power of pandas comes - partly - from vectorized operations. This means that instead of iterating through every row in Python, it processes the data as whole arrays, transforming the operations into matrix/vector manipulations. For the matrix manipulation it uses low-level, high-performance packages borrowed from C, NumPy, etc.

Writing "pandas code" without using the pandas operations is like eating soup with a fork. It works, but it is pretty slow.


Even vectorized operations still have to go through every data item.

I'm not sure what you imagine happens inside pandas, but I assure you it's not magic.


Thanks for coding this up. The code in the blog post is, at best, incredibly misleading.


Was the comment redacted?



