Bonobo runs each function in the pipeline in parallel and makes the FIFO queue plumbing and thread pool management completely transparent.
The TL;DR would be: "Write some generators or functions, link them in a graph, and they get called in order on each row of data as soon as the previous transformation node's output is ready." For example, if you have a database cursor that yields each row of a query, the next step(s) in the graph start running as soon as the first result is ready (while the cursor keeps yielding from the database without waiting for the graph to finish the current row). I did not find that easy to do with the libraries I tried.
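To make the streaming behaviour concrete, here is a minimal stdlib sketch of the idea (this is not Bonobo's actual API, just an illustration of threads connected by FIFO queues): each node runs in its own thread, so a downstream node starts processing the first row while the source is still yielding.

```python
# Minimal sketch (plain stdlib, NOT Bonobo's API) of a threaded pipeline:
# nodes run in their own threads, connected by FIFO queues, so downstream
# steps start as soon as the first row is available.
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def extract():
    # Stands in for a database cursor yielding rows one at a time.
    for i in range(3):
        yield {"id": i}

def transform(row):
    # A per-row transformation node.
    return {**row, "doubled": row["id"] * 2}

def run_node(func, inq, outq):
    # Pull rows from the input queue, apply func, push to the output queue.
    while True:
        item = inq.get()
        if item is DONE:
            outq.put(DONE)
            return
        outq.put(func(item))

def run_pipeline(source, *funcs):
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=run_node, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for row in source():
        queues[0].put(row)  # downstream threads consume while we still yield
    queues[0].put(DONE)
    results = []
    while (item := queues[-1].get()) is not DONE:
        results.append(item)
    for t in threads:
        t.join()
    return results

print(run_pipeline(extract, transform))
```

Because each node is a single thread reading from a FIFO queue, row order is preserved end to end, which is the default behaviour described above.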
The docs are clearly incomplete, to say the least. They need an example with a big dataset, one with long individual operations, and one with a non-linear graph, so it's more obvious that, of course, it's not made for uppercasing strings twice in a row.
Stay tuned. I'm very happy HN brought it to the homepage; I didn't really expect that to happen at this stage, and I understand your point. But it's a good thing for the project moving forward.
Python is my usual language of choice, but recently I picked up Go for some data processing because there were big benefits to parallelising the task, which Go made easy.
As soon as I can, I'll add comparison pages to the documentation, trying to keep them as objective as possible. I can't seriously answer this question in depth here, but it is planned, so experts in other systems can jump in and complement or correct my understanding of each one. I've used a bunch of them, but I'm by no means an expert user of each, so making it collaborative sounds like a better idea than just giving my point of view.
You're quite right; I use both pandas and bonobo, for different reasons.
Mostly, when I want a quasi-mathematical look at a dataset, pandas is my tool of choice. For the data pipeline things that reasonably fit on one computer, I use bonobo.
I'm an avid Pandas user. Some stuff at work has come up recently that calls for ETL - and I'm trying to figure out what the best tools are. Is any of your Bonobo code public? I'd be curious to see what a real-life project looks like...
No, I don't have any real-life public code available. I'll see what I can extract from old commercial projects for publication, but I can't guarantee anything.
Luigi is a simple tool from Spotify that seems to solve a similar data-workflow (DAG) problem to Bonobo. Airflow, from Airbnb, is a more complex tool, and I understand that Spotify has lately moved from Luigi to Airflow.
With the ancestor of bonobo, I was processing 5M lines of data in around an hour, including extraction, joins, API calls and a few loads. That should give a first idea of the target size.
Multiprocessing or multithreading? Why don't you market it as parallel coroutine processing? That would get me interested, because there are dozens of frameworks with overgeneral descriptions.
Today, by default, multithreading, but that's an implementation detail. Bonobo does not actually support coroutines (as in asyncio coroutines), so it would be a lie to market it that way. The plan, though, is to allow coroutines/futures in the future, for specific cases (like long-running or blocking operations where tying output order to input order doesn't matter). Still, there is a lot on the roadmap before this becomes a priority.
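To illustrate why decoupling output order from input order helps with long or blocking operations, here is a stdlib sketch (again, not a Bonobo feature today, just the general idea using `concurrent.futures`): results are collected as they complete rather than in submission order, so one slow call doesn't hold back the others.

```python
# Sketch (plain concurrent.futures, NOT a Bonobo feature) of collecting
# results from blocking calls as they complete, ignoring input order.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_call(n):
    # Stands in for a blocking operation (API call, I/O); larger n = slower.
    time.sleep(n * 0.01)
    return n

with ThreadPoolExecutor(max_workers=3) as pool:
    # Submit in the order 3, 2, 1; as_completed yields whichever finishes
    # first, so completion order generally differs from submission order.
    futures = [pool.submit(slow_call, n) for n in (3, 2, 1)]
    results = [f.result() for f in as_completed(futures)]

print(results)  # typically the fastest tasks appear first
```

The trade-off is exactly the one mentioned above: you gain throughput on blocking operations, but you give up the guarantee that rows leave a node in the order they arrived.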
I'll note that I still have a lot of work to do explaining in simple terms what Bonobo actually is, without falling into the trap of the "overgeneral description".
It's indeed intended for «small data», as opposed to «big data». I know that doesn't say much, but I basically wanted to handle small flows of data without having to bring out the big guns.
I'm preparing explanation pages for a lot of the questions I got, including comparisons, data volumes, and where it is a good fit and where it is not ...
All that will be ready well before 1.0, but for now, we're at 0.2 ...
Came here to post just that. It's called 'Bonobo', there's a picture of a gorilla, and the page keeps saying 'monkey'- as petty as it sounds, you're probably losing potential users to zoological nerdrage.
Yes, Hacker News and Twitter brutally told me I should take animal-kingdom classes ASAP ...
This being said, if any of you have a good picture of bonobos that I can use instead of the current one, I'd be really glad to replace it! It needs to be released under a free license, though.
I came here to say that! The bonobo is an ape indigenous to the left bank of the Congo river, in the rain forest of the DR Congo. To the untrained eye they look indistinguishable from chimpanzees.
Gorillas are a different genus entirely, with at least four subspecies, none of which look like chimps or bonobos.
They have some behavioral features that set them apart from chimpanzees. Apparently they use sex as a greeting. This is (sort of) anthropomorphized in a hilarious way in the Will Self novel 'Great Apes', though that may be colouring my memory of how common it is in the real creatures...