Bonobo runs each function in the pipeline in parallel and makes the FIFO queue plumbing and thread pool management completely transparent.
The TL;DR would be: "Write some generators or functions, link them in a graph, and they get called in order on each row of data as soon as the previous transformation node's output is ready." For example, if you have a database cursor that yields each row of a query, the next step(s) in the graph start running as soon as the first result is ready (while the cursor keeps yielding from the database without waiting for the graph to finish the current row). I did not find that easy to do with the libraries I tried.
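To make the streaming behaviour concrete, here is a minimal stdlib sketch of the idea (this is not Bonobo's actual API, just an illustration of threads connected by FIFO queues): each node runs in its own thread, so a downstream node starts processing the first row while the source is still yielding.

```python
# Minimal sketch (plain stdlib, NOT Bonobo's API) of a threaded pipeline:
# nodes run in their own threads, connected by FIFO queues, so downstream
# steps start as soon as the first row is available.
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def extract():
    # Stands in for a database cursor yielding rows one at a time.
    for i in range(3):
        yield {"id": i}

def transform(row):
    # A per-row transformation node.
    return {**row, "doubled": row["id"] * 2}

def run_node(func, inq, outq):
    # Pull rows from the input queue, apply func, push to the output queue.
    while True:
        item = inq.get()
        if item is DONE:
            outq.put(DONE)
            return
        outq.put(func(item))

def run_pipeline(source, *funcs):
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=run_node, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for row in source():
        queues[0].put(row)  # downstream threads consume while we still yield
    queues[0].put(DONE)
    results = []
    while (item := queues[-1].get()) is not DONE:
        results.append(item)
    for t in threads:
        t.join()
    return results

print(run_pipeline(extract, transform))
```

Because each node is a single thread reading from a FIFO queue, row order is preserved end to end, which is the default behaviour described above.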
The docs are clearly incomplete, to say the least. They need an example with a big dataset, one with long individual operations, and one with a non-linear graph, so it's more obvious that, of course, it's not made for uppercasing strings twice in a row.
Stay tuned. I'm very happy HN brought it to the homepage; I didn't really expect that to happen at this stage, and I understand your point. But it's a good thing for the project moving forward.
Python is my usual language of choice, but recently I picked up Go for some data processing because there were big benefits to parallelising the task, which Go made easy.
As soon as I can, I'll add comparison pages to the documentation, trying to keep them as objective as possible. I can't seriously answer this question in depth here, but it is planned, so experts in other systems can jump in and complement or correct my understanding of each one. I've used a bunch of them, but I'm by no means an expert user of each, so making it collaborative sounds like a better idea than just giving my point of view.
You're quite right; I use both pandas and bonobo, for different reasons.
Mostly, when I want a quasi-mathematical look at a dataset, pandas is my tool of choice. For the data pipeline things that reasonably fit on one computer, I use bonobo.
I'm an avid Pandas user. Some stuff at work has come up recently that calls for ETL - and I'm trying to figure out what the best tools are. Is any of your Bonobo code public? I'd be curious to see what a real-life project looks like...
No, I don't have any real-life public code available. I'll see what I can extract from old commercial projects for publication, but I can't guarantee anything.
Luigi is a simple tool from Spotify that seems to solve a similar data-workflow (DAG) problem to Bonobo. Airflow, from Airbnb, is a more complex tool, and I understand that Spotify has lately moved from Luigi to Airflow.
With the ancestor of bonobo, I was processing 5M lines of data in around an hour, including extraction, joins, API calls and a few loads. That should give a first idea of the target size.
Multiprocessing or multithreading? Why don't you market it as parallel coroutine processing? That would get me interested, because there are dozens of frameworks with overgeneral descriptions.
Today, by default, multithreading, but that's an implementation detail. Bonobo does not actually support coroutines (as in asyncio coroutines), so it would be a lie to market it that way. The plan, though, is to allow coroutines/futures in the future, for specific cases (like long-running or blocking operations where tying output order to input order doesn't matter). Still, there is a lot on the roadmap before this becomes a priority.
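To illustrate why decoupling output order from input order helps with long or blocking operations, here is a stdlib sketch (again, not a Bonobo feature today, just the general idea using `concurrent.futures`): results are collected as they complete rather than in submission order, so one slow call doesn't hold back the others.

```python
# Sketch (plain concurrent.futures, NOT a Bonobo feature) of collecting
# results from blocking calls as they complete, ignoring input order.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_call(n):
    # Stands in for a blocking operation (API call, I/O); larger n = slower.
    time.sleep(n * 0.01)
    return n

with ThreadPoolExecutor(max_workers=3) as pool:
    # Submit in the order 3, 2, 1; as_completed yields whichever finishes
    # first, so completion order generally differs from submission order.
    futures = [pool.submit(slow_call, n) for n in (3, 2, 1)]
    results = [f.result() for f in as_completed(futures)]

print(results)  # typically the fastest tasks appear first
```

The trade-off is exactly the one mentioned above: you gain throughput on blocking operations, but you give up the guarantee that rows leave a node in the order they arrived.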
I'll note that I still have a lot of work to do explaining in simple terms what Bonobo actually is, without falling into the trap of the "overgeneral description".
It's indeed intended for «small data», as opposed to «big data». I know that doesn't say much, but I basically wanted to handle small flows of data without having to bring out the big guns.
I'm preparing explanation pages for a lot of the questions I got, including comparisons, data volumes, and where it is a good fit and where it is not ...
All that will be ready well before 1.0, but for now, we're at 0.2 ...
Came here to post just that. It's called 'Bonobo', there's a picture of a gorilla, and the page keeps saying 'monkey'- as petty as it sounds, you're probably losing potential users to zoological nerdrage.
Yes, Hacker News and Twitter brutally told me I should take animal-kingdom classes ASAP ...
This being said, if any of you have a good picture of bonobos that I can use instead of the current one, I'd be really glad to replace it! It needs to be released under a free license, though.
I came here to say that! The bonobo is an ape indigenous to the left bank of the Congo river, in the rain forest of the DR Congo. To the untrained eye they look indistinguishable from chimpanzees.
Gorillas are a different genus entirely, with at least four subspecies, none of which look like chimps or bonobos.
They have some behavioral features that set them apart from chimpanzees. Apparently they use sex as a greeting. This is (sort of) anthropomorphized in a hilarious way in the Will Self novel 'Great Apes', though that may be colouring my memory of how common it is in the real creatures...