It's complicated. The Army Corps of Engineers has had a civilian mandate to support flood control since 1917 [1]. Beyond that, they are also involved in large public works projects such as the building of roads and bridges, as well as Superfund clean-up sites. On top of this, they regularly receive large pork barrel grants from Congress that can siphon money into a senator's state or a congressperson's district. They also have a large contracting arm and are actually pretty well regarded for their comprehensive procurement and management process for these large public works projects.
So it's scale, politics, and history/momentum at this point.
Ilan Schnell is not "some guy". He's the original primary author of the Anaconda distribution, which is one of the main reasons that so many data scientists use Python.
NumPy is a library that provides typed multidimensional arrays and functions that run atop them. It does bundle a fallback LAPACK/BLAS and can link against an external one, but that's a side effect of providing typed arrays and is nowhere near the central purpose of the library.
Also, NumPy is implemented completely in C and Python, and makes extensive use of CPython extension hooks and knowledge of the CPython reference counting implementation, which is part of the reason why it is so hard to port to other implementations of Python.
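A tiny illustration of that central purpose, using only the standard NumPy API: every element shares a single dtype, and whole-array operations run in C rather than in a Python loop.

```python
import numpy as np

# A typed, two-dimensional array: every element shares one dtype.
a = np.arange(6, dtype=np.float64).reshape(2, 3)
print(a.dtype)          # float64

# Functions that run atop the array, elementwise or along an axis.
print(a.sum(axis=0))    # [3. 5. 7.]
print(np.exp(a).shape)  # (2, 3)
```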
Except that Docker containers play terribly with virtualization solutions. Still, some sort of configuration/infrastructure-as-code would go a long way.
This is a really interesting article and I'm glad this is getting attention. It's especially refreshing to see a theoretical treatment bridging algorithms with more "modern" hardware implementations.
One opportunity I do find in this article is real-world performance tests and determining the importance of constants in performance tuning. Frigo, Leiserson, Prokop, and Ramachandran did some really interesting work in 1999 on cache-oblivious algorithms, with applications to kernels such as the FFT and matrix multiplication. The work is theoretically extremely interesting, but in practice, hand-written or machine-generated cache-aware kernels continue to dominate. The most famous example of this is probably Kazushige Goto, whose hand-optimized GotoBLAS (now maintained under an open-source license as OpenBLAS) still provides some of the fastest linear algebra kernels in the world.
If you're interested in learning more about how the differences between the two approaches shake out in linear algebra, I recommend "Is Cache-Oblivious DGEMM Viable?" by Gunnels et al.
I don't think this would work, for a number of reasons. If it's a database that you're modifying, you can see that a lot of operations (increment, delete, etc.) will do the wrong thing if they're called twice. And even if the operations themselves are idempotent, you wouldn't be able to verify that the intended side effect was correct. This is one reason developers spend a lot of time building mock objects: to capture "side effects".
How robust do you imagine it would be to just record the call / response pairs of the mutable objects in the new code and then replay them when running the experiment on the old code?
For example, suppose you have a db object and two versions of the code new_code and old_code. You call something that looks like:
experiment.run(new_code, old_code, mutables=[db])
Then the infrastructure runs new_code normally, but records the arguments and return value of every call to db (and any other object defined as mutable). Then, the infrastructure runs old_code, but whenever a method of db is called, it tries to match it with a call made by new_code and just directly returns the return value that call returned. If it can't match the call, it signals an error, but it never actually tries to call db, thus negating the risk of side effects.
Obviously this would fail when the two versions perform different operations in the database, even semantically equivalent but non-identical operations (say one retrieves a value and increments it inside a transaction, while the other uses a stored procedure in the db to increment without fetching). But it still relaxes the constraint: now you can do this for code that has no side effects, and for code that has exactly the same side effects as represented by identical call/return pairs to mutable objects.
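To make the record/replay idea concrete, here's a minimal Python sketch. `RecordingProxy` and `ReplayProxy` are hypothetical names, not part of Scientist or any existing framework, and real matching would also need to handle call ordering and unhashable arguments.

```python
class RecordingProxy:
    """Wraps a mutable object: forwards each call and logs (method, args) -> result."""
    def __init__(self, target):
        self._target = target
        self.log = []

    def __getattr__(self, name):
        method = getattr(self._target, name)
        def wrapper(*args, **kwargs):
            result = method(*args, **kwargs)
            self.log.append((name, args, tuple(sorted(kwargs.items())), result))
            return result
        return wrapper


class ReplayProxy:
    """Replays a recorded log; never touches the real object, errors on any mismatch."""
    def __init__(self, log):
        self._log = list(log)

    def __getattr__(self, name):
        def wrapper(*args, **kwargs):
            key = (name, args, tuple(sorted(kwargs.items())))
            for i, (n, a, k, result) in enumerate(self._log):
                if (n, a, k) == key:
                    del self._log[i]   # each recorded call may be matched once
                    return result
            raise RuntimeError(f"old code made an unrecorded call: {key}")
        return wrapper
```

The harness would then run new_code against `RecordingProxy(db)` and old_code against `ReplayProxy(recorder.log)`, so the old path never touches the real database.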
> How robust do you imagine it would be to just record the call / response pairs of the mutable objects in the new code and then replay them when running the experiment on the old code? […] Obviously this would fail when the two versions perform different operations in the database, even semantically equivalent
I'd think the latter would be the common expectation by virtue of a different implementation.
Awesome, I'm a huge fan of new and innovative tools that help improve the process of refactoring and improving existing code. This looks like a really promising tool for Ruby developers, and I'm always grateful when companies and their employees invest the time and effort to release their tools to the community. I really liked the point about "buggy data" as opposed to just buggy code; I think that's a really important distinction.
A few reactions from reading through the release:
Scientist appears to be largely limited to situations where the code has no "side effects". I think this is a pretty big caveat, and it would have been helpful in the introduction/summary to see this mentioned. Similarly, I think it would be nice to point out that Scientist is a Ruby-only framework :)
You don't mention "regression test" at any point in the article, which is the term I'm most familiar with for this sort of testing. How does a Scientist "experiment" compare to a regression test over that block of code?
Anyway, thanks again for writing this up, I'll be thinking more about the Experiment testing pattern for my own projects.
> Scientist appears to be largely limited to situations where the code has no "side effects".
That's one of the things I was initially thinking too, but then as I thought about where I could have used it in the past, I could think of only a few cases where it wouldn't have been possible to keep it isolated.
For example, run your new code against a non-live data store. Example: when a user changes permissions, the old code changes the live DB, while the new code changes the temporary DB. Later (or continuously), you could also compare the databases to ensure they're identical (easier if the datastore remains constant, a bit harder if you are changing to a different storage schema or product).
Where it would be in the difficult-to-impossible range is with external services where you can't copy state and set up a secondary test service, but even then, you could record the requests made (or that would have been made) and ensure both versions would have done the same thing.
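A minimal sketch of that shadow-write idea, using plain dicts to stand in for the live and temporary stores; the function names here are hypothetical, for illustration only.

```python
def change_permissions_old(live_db, user, perms):
    # Legacy path: writes to the live store.
    live_db[user] = set(perms)

def change_permissions_new(shadow_db, user, perms):
    # Rewritten path: same intended effect, but against the shadow copy.
    shadow_db[user] = set(perms)

def stores_agree(live_db, shadow_db):
    # The later (or continuous) comparison step.
    return live_db == shadow_db

live, shadow = {}, {}
change_permissions_old(live, "alice", ["read", "write"])
change_permissions_new(shadow, "alice", ["read", "write"])
print(stores_agree(live, shadow))  # True
```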
This looks a lot like the Jupyter/IPython Notebook, which is a free and open source "scientist's notebook". If you're interested in mixing LaTeX, Markdown, and code from almost any language (Python, R, and Julia are very well-supported but there's an open kernel spec), then this might be a more appropriate tool for you to use.
The Jupyter/IPython notebook's default storage format is JSON, which makes it a little more friendly to text-based version control and also enables a static HTML view of notebooks (http://nbviewer.jupyter.org/github/ketch/teaching-numerics-w...) on GitHub.
Helen Shen wrote up a great article for Nature (http://www.nature.com/news/interactive-notebooks-sharing-the...) on how scientists are using the notebook, but it also provides a good overview of how you might use it, as well as a free interactive demo.
The originator of the notebook UI is Mathematica, if I recall correctly. You can try the web version at [1] - it doesn't quite have the elegance of the desktop one, but it is a much better notebook than Jupyter in my experience.
That patent is about a user interface for hiding/showing the code that creates particular rendered outputs, not for the general idea of a “notebook” UI.
I suspect there’s prior art (Hypercard? every spreadsheet ever? this list: https://en.wikipedia.org/wiki/List_of_graphical_user_interfa... ?), and the patent seems pretty obvious, but in any event, I don’t think other “code notebook” implementations are currently infringing this patent, and it seems relatively straightforward to work around.
Totally agree. Jupyter/IPython is an impressive tool.
I started to play with it recently and I am very happy with what it can do.
It's also very easy to install in a Docker container: https://github.com/jupyter/docker-stacks
(I haven't managed to make the persistence part work when stopping the container yet, but that's due to my inexperience with Docker.)
Have you looked at conda and http://anaconda.org? We spent a lot of time curating the most important Python packages for data science into the Anaconda distribution, and conda packages are a great format for distributing complicated software.
This is pretty cool and I think you are on the right track here. But here's the same question: are you making the process of package creation much easier?
Because I suspect until you acquire mindshare among the academics (who distribute their research as R code), this will be difficult to scale.
I suspect that your primary limitation was the Google Compute Engine infrastructure. I'm not familiar with the limitations there, but a quick search on Google turns up a fairly limited set of libraries indeed.
I thought it would be interesting to adapt your code slightly to use Numba acceleration. Here's what it looks like:
    from numba import jit

    def avg_transform(image):
        # Replace each pixel's three channels with their average, in place.
        m, n, c = image.shape
        for i in range(m):
            xi = image[i]
            for j in range(n):
                avg = xi[j].sum() / 3
                xi[j][:] = avg
        return image

    # Compile with Numba; nopython=True forbids falling back to object mode.
    fast_avg_transform = jit(avg_transform, nopython=True)
Re-reading your post, I suspect that einsum might actually be your cup of tea, but I really enjoy the simplicity and performance of using Numba for these sort of tasks.
But am I missing something? NumPy has everything you need already, natively, no? Some slicing or a dot product should get you there... no need for ufuncs or einsum, I think...
More generally, for an image im with shape (width, height, channels) and a square transformation matrix M of shape (channels, channels), you can do:
res = np.dot(im, M.T)
It will work with affine transformations as well if you add a 1 component to every pixel. It will also work with higher-dimensional images, if I'm not mistaken.
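As a quick sanity check (toy data, not from the original post): the channel-averaging filter discussed above is just the special case where every entry of M is 1/3.

```python
import numpy as np

rng = np.random.default_rng(0)
im = rng.random((4, 4, 3))  # toy "image": (height, width, channels)

# Channel averaging as a (channels, channels) matrix: every output
# channel is the mean of the three input channels.
M = np.full((3, 3), 1.0 / 3.0)
res = np.dot(im, M.T)

# The same result computed directly, for comparison.
expected = np.repeat(im.mean(axis=-1, keepdims=True), 3, axis=-1)
print(np.allclose(res, expected))  # True
```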
Numba is indeed pretty impressive, but you're not comparing exactly the same thing with this code.
In the Numba case, you're basically modifying the image in place: no new array is allocated and nothing is copied. Your pure-NumPy code, however, creates a new array (the result of np.dot) and then copies it back entirely into image.
If you write the two functions so that they both return a new numpy array and do not touch the original one, the time difference drops from 4 times faster to 2.5 times faster. That's still an impressive difference, but at the loss of a bit of flexibility.
N.B.: numpy.dot does not use broadcasting, i.e. it does not allocate a temporary array to extend the smaller operand. The function handles n-dimensional arrays by summing over the last axis of the first array and the second-to-last axis of the second array.
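For reference, a sketch of the two variants being compared; `avg_new` and `avg_inplace` are illustrative names, not code from the post.

```python
import numpy as np

def avg_new(image):
    # Allocating variant: np.dot produces a fresh result array.
    M = np.full((3, 3), 1.0 / 3.0)
    return np.dot(image, M.T)

def avg_inplace(image):
    # In-place variant: writes the per-pixel channel mean back into image,
    # with no new full-size array kept around.
    image[:] = image.mean(axis=-1, keepdims=True)
    return image
```

Timing each on the same input (one on a copy, one mutating) is what surfaces the allocation/copy difference described above.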
Thanks, I clearly wasn't being careful. I'll update my Gist...
edit: On reviewing, I think the intent of the original blog post was to modify images in place (or at least to do it as quickly as possible, with in-place filtering being acceptable). In that case, I think my comparison is fair, since NumPy doesn't offer a faster way to do the requested operation. I didn't try out einsum, but I think Numba would outperform that as well.
[1] https://en.wikipedia.org/wiki/U.S._Army_Corps_of_Engineers_c...