I use Airflow, and am a big fan. I don't think it's particularly clear, however, when you should use Airflow.

The single best reason to use Airflow is that you have some data source with a time-based axis that you want to transfer or process. For example, you might want to ingest daily web logs into a database, or have weekly statistics generated from your database.
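A minimal sketch of that first case (the DAG id, command, and schedule here are hypothetical, and the import path is the Airflow 1.10-style one): each scheduled run handles exactly one slice of the time axis, which Airflow passes in as a templated date.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in 2.x

    # Hypothetical daily ingestion: each run processes one day of web logs,
    # identified by the execution date Airflow injects as {{ ds }}.
    dag = DAG(
        dag_id="ingest_web_logs",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=True,  # missed days are backfilled automatically
    )

    ingest = BashOperator(
        task_id="load_logs_into_db",
        bash_command="load_logs --date {{ ds }}",  # placeholder command
        dag=dag,
    )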

The next best reason to use Airflow is that you have a recurring job that you want not only to happen, but whose successes and failures you want to track. For example, maybe you want to garbage-collect some files on a remote server with spotty connectivity, and you want to be emailed if it fails for more than two days in a row.
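Airflow doesn't express "failed two days in a row" directly, but the tracking and alerting side is declarative rather than hand-rolled. A sketch with hypothetical values (again, 1.10-style imports):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "retries": 2,                      # rerun a failed task before marking the run failed
        "retry_delay": timedelta(hours=1),
        "email": ["oncall@example.com"],   # hypothetical address
        "email_on_failure": True,
    }

    dag = DAG(
        dag_id="gc_remote_files",
        default_args=default_args,
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
    )

    gc = BashOperator(
        task_id="garbage_collect",
        bash_command="ssh flaky-host ./gc_old_files",  # placeholder command
        dag=dag,
    )

Each run's failures and retries then show up in the UI, so "has this been failing for two days?" is a glance at the grid rather than a grep through logs.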

Beyond those two, Airflow might be very useful, but you'll be shoehorning your use case into Airflow's capabilities.

Airflow is basically a distributed cron daemon with support for reruns and SLAs. If you're using Python for your tasks, it also includes a large collection of data abstraction layers, so that Airflow manages the named connections to your different sources and you only have to code the transfer or transform rules.
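For example (a sketch: the connection ids, table, and columns are hypothetical, and the hook imports are the 1.10 paths, which moved to provider packages in 2.x), a hook resolves a named connection stored in Airflow, so the task encodes only the transfer rule, never hosts or credentials:

    from airflow.hooks.mysql_hook import MySqlHook
    from airflow.hooks.postgres_hook import PostgresHook

    def transfer_orders(ds, **_):
        # "orders_db" and "warehouse_db" are connections managed in the Airflow UI/CLI;
        # the hooks look up hosts and credentials by name at run time.
        src = MySqlHook(mysql_conn_id="orders_db")
        dst = PostgresHook(postgres_conn_id="warehouse_db")

        rows = src.get_records(
            "SELECT id, total, created_at FROM orders WHERE DATE(created_at) = %s",
            parameters=(ds,),
        )
        dst.insert_rows(table="orders_raw", rows=rows)

Wired to a PythonOperator (with provide_context=True on 1.10), this becomes one schedulable, retryable task.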



Yes, this seems to be yet another tool that falls prey to what I think of as "The Bisquick Problem". Bisquick is a product that is basically pre-mixed flour, salt, baking powder that you can use to make pancakes, biscuits, and waffles. But why would you buy this instead of its constituent parts? Does Bisquick really save that much time? Is it worth the loss of component flexibility?

Worst of all, if you accept Bisquick, then you open the door to an explosion of Bisquick options. It's a combinatorial explosion of pre-mixed ingredients. In a dystopian future, perhaps people stop buying flour or salt, and the ONLY way you can make food is to buy the right kind of Bisquick. Might make a kind of mash-up of a baking show and Black Mirror.

Anyway, yeah, Airflow (and so many other tools) feel like Bisquick. It has all the strengths, but also all the weaknesses, of that model.


The art of software engineering is all about finding the right abstractions.

Higher-order abstractions can be a productivity boon, but they have costs when you fight their paradigm or need to regularly interact with lower layers in ways the design didn't anticipate.

Airflow and similar tools are doing four things:

A) Centralized cron for distributed systems. If you don't have a unified runtime for your system, the old approaches of Unix cron or an ad-hoc "job system" become complex, because there's no centralized management and no clarity about which scheduling tool developers should use for a given job.

B) Job state management. Jobs can fail and may need to be retried, people alerted, etc. Most scheduling systems have some way to deal with failure too, but these tools now treat it as stored state.

C) DAGs. Complex batch jobs are often composed of many stages with dependencies, and you need that state to track and retry stages independently (especially if they are costly).

D) What many of these tools also try to do is tie the computation that performs a given job to the scheduling tool itself. This now seems to be an antipattern. They also try to offer "premade" job stages or "operators" for common tasks. These are a mix of wrappers that talk to different compute systems and actual compute mechanisms themselves.

If you have the kind of system that is either sufficiently distributed, or heterogeneous enough that you can't use existing schedulers, you need something with #A. If you also need complex job management, you need #A, #B, and #C, and, having rebuilt my own many times, a standard system is better when coordinating between many engineers. What seems necessary in general is #D.
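For reference, a sketch of what #A through #C look like in Airflow terms (stage names are hypothetical, and the compute is delegated to external commands rather than done in-process, i.e. skipping #D): each stage is its own task, so its state is tracked and it can be retried without redoing costly upstream stages.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="nightly_batch",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",  # A: one centralized schedule
        default_args={"retries": 1, "retry_delay": timedelta(minutes=30)},  # B: failure handling as state
    )

    # C: stages with dependencies, tracked and retried independently.
    extract = BashOperator(task_id="extract", bash_command="make extract", dag=dag)
    transform = BashOperator(task_id="transform", bash_command="make transform", dag=dag)
    load = BashOperator(task_id="load", bash_command="make load", dag=dag)

    extract >> transform >> load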


I meant to say D seems unnecessary


Just playing devil's advocate for a bit: the horror of your Bisquick scenario depends in part on the assumption that salt, flour, etc are fungible across applications, which is not quite true. Flour, sugar, and probably other trace ingredients for managing texture benefit from using different types in different recipes. If any of those benefit from economies of scale, it could well be optimal in some sense to have mixes for everything. This is much closer to being true in software, where different circumstances demand different concrete implementations of abstractions like, say, "scheduler" (analogous to grade/type of abstract ingredient like "flour").

Ed: I should say, I really like this metaphor, and I expect it will crop up in my thinking in the future.


I realize this is a metaphor and I'm answering the metaphor and not the underlying problem, but: camping. Seriously, want some quick pancakes or donuts when you're out in the field? Bisquick and just change up how much water you add.


Also disaster survival.

Couscous, Bisquick and other low-or-no-heat, premixed, just-add-water solutions are a godsend to have when a tornado takes out your gas line or electric grid.


On top of that, such a Bisquick takes hours of learning, deployment, and troubleshooting, and has complex failure modes, compared to a few cron jobs and trivial scripts.


Assume that future kinds of Bisquick can have negative amounts of flour, salt or baking powder. Now recipes in the dystopian future just require a simple change of basis.


Yes, that is true. If you allow negative ingredients you can indeed reach all points in the characteristic state-space of baking, even when limited to picking from a huge set of proprietary Bisquicks. Which is a hopeful thought. I think.


Implying that Airflow is a simple mixture of a few ingredients is selling it quite short. There are a lot of knobs and switches in Airflow (i.e. features) that have been built and battle-tested across a lot of users. It has quite a lot of dependencies across the scheduler, webserver, CLI, and worker modes. And there is a lot of new development going into Airflow in recent months (new API, DAG serialization, making the scheduler HA).


Brawndo. It's what plants crave.


Your comment doesn't provide insight into when and when not to use it.


I guess as I've grown older I've grown wary of black-and-white thinking. The insight I would share with you is to be wary of Bisquick, but do not dismiss it outright. All creation is combination, and you won't succeed saying no to all combinations. In the same way, you won't succeed saying yes to every combination.


I think what you’re saying is, you cannot start from first principles if you want to accomplish most things, but you need to understand first principles to not misuse those things.


I guess so. Like everything, it comes with pros and cons, you know?


I would add these gotchas/recommendations:

- Airflow the ETL framework is quite bad. Just use Airflow the scheduler/orchestrator: delegate the actual data transformation to external services (serverless, Kubernetes, etc.); see the sketch after this list.

- Don't use it for tasks that don't require idempotency (e.g. a job that uses a bookmark).

- Don't use it for latency-sensitive jobs (this one should be obvious).

- Don't use sensors or cross-DAG dependencies.
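A sketch of the "orchestrate, don't compute" pattern from the first bullet (image, registry, and namespace are hypothetical; the import is the 1.10 contrib path, which moved to a provider package in 2.x):

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    dag = DAG("transform_events", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

    # Airflow only schedules and tracks this task; the actual transformation
    # runs in a container on the cluster.
    transform = KubernetesPodOperator(
        task_id="transform_events",
        name="transform-events",
        namespace="data-jobs",                           # hypothetical namespace
        image="registry.example.com/etl/transform:1.4",  # hypothetical image
        arguments=["--date", "{{ ds }}"],
        get_logs=True,
        is_delete_operator_pod=True,                     # clean up the pod afterwards
        dag=dag,
    )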

So yeah, unfortunately it's not a good fit for all use cases, but it has the right set of features for some of the most common batch workloads.

Also, Python as the DAG configuration language was a very successful idea, maybe the most important contributor to Airflow's success.


> - Don't use it for tasks that don't require idempotency (e.g. a job that uses a bookmark).

You can totally design your tasks to be idempotent, but it's up to you to make them that way. The scheduler or executor doesn't have any insight into your job.

This is why I encourage people to use a unified base operator and then pass their own Docker containers to it, like https://medium.com/bluecore-engineering/were-all-using-airfl... outlines.
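Roughly this shape, as a sketch (the class name and namespace are hypothetical; the import is the 1.10 contrib path): a thin wrapper standardizes everything except the image and its arguments.

    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    class BaseJobOperator(KubernetesPodOperator):
        """Hypothetical unified base operator: task authors supply only an image
        and arguments; namespace, logging, and pod cleanup are standardized here."""

        def __init__(self, image, arguments=None, **kwargs):
            super().__init__(
                image=image,
                arguments=arguments or [],
                name=kwargs["task_id"].replace("_", "-"),  # pod names can't contain underscores
                namespace="batch-jobs",                    # hypothetical shared namespace
                get_logs=True,
                is_delete_operator_pod=True,
                **kwargs,
            )

Individual jobs are then just BaseJobOperator(task_id="score_users", image="...", dag=dag): the Airflow side stays identical while the containers vary.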

> - Don't use it for latency-sensitive jobs (this one should be obvious).

IIRC this is being addressed in Airflow 2.0

> - Don't use sensors or cross-DAG dependencies.

This is a little extreme. I've never run into issues with cross-DAG dependencies or sensors. They make managing my DAGs way easier, because we can separate computation DAGs from loading DAGs.

context: I built/manage my company's Airflow platform. Everything is managed on k8s.


> You can totally design your tasks to be idempotent

Yes, of course. I mean that Airflow is not a good fit for tasks you don't want to be idempotent (I think most, but not all, tasks should be idempotent).
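For what it's worth, the usual trick is to have each run overwrite its own date slice, so retries and backfills converge on the same state. A sketch with hypothetical connection and table names (1.10-era import):

    from airflow.hooks.postgres_hook import PostgresHook

    def rebuild_daily_stats(ds, **_):
        """Idempotent: delete-then-insert the slice for this run's execution date."""
        hook = PostgresHook(postgres_conn_id="warehouse_db")  # hypothetical connection
        # Both statements run on one connection and are committed together.
        hook.run(
            [
                "DELETE FROM daily_stats WHERE day = %(day)s",
                "INSERT INTO daily_stats (day, n) "
                "SELECT %(day)s, count(*) FROM events WHERE day = %(day)s",
            ],
            parameters={"day": ds},
        )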

> I've never run into issues with cross-DAG dependencies

I believe the Airflow docs advise against them where possible. I can see why, from my experience: less visibility and more complexity, especially for backfills.


> context: I built/manage my company's Airflow platform. Everything is managed on k8s

My team is running Airflow on a single node but we're slowly outgrowing this setup. We're considering running jobs on k8s.

Curious what your setup looks like. Is your cluster a fixed size, or does it scale with the load?


Using the KubernetesPodOperator for everything adds a huge amount of overhead. You still need Airflow worker nodes, but they're just babysitting the K8S pods doing the real work.

I know it's 2020 and memory is cheap or whatever, but Airflow is shockingly wasteful of system resources.


re: ETL framework - you can get a lot done with the built-in Airflow operators, including the PythonOperator (bring in any Python dependency you like) and BashOperator (call any CLI, etc.). It's not drag-and-drop, but I've found it to be quite versatile.

re: idempotency - yes, make your workflow tasks idempotent.

re: latency - this is being worked on very actively. Ash (a PMC member) has committed to working on task latency almost exclusively until it's resolved.

re: sensors, there is some great work from Airbnb to improve: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+A...


If you're using a Docker or Kubernetes operator, you can't make use of the other operators or Airflow connections. Do you just work without those?


Correct, I use operators only to delegate the actual workload to an external service.


Got it. To be clear, do you still use Airflow for storing connections, or not? If not, how do you store your credentials? We've only done a POC, and we've discovered a higher-than-expected learning curve.


This is handy, thanks. I deal mostly with Luigi, and this helps me place Airflow a bit better.


Sensors are quite helpful when you don't know the exact moment your data will come in.
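A sketch of that (path and timings hypothetical; the FileSensor import is the 1.10 contrib one, airflow.sensors.filesystem in 2.x): the sensor task pokes until the file shows up, then the downstream task runs.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.sensors.file_sensor import FileSensor
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("vendor_import", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

    wait_for_dump = FileSensor(
        task_id="wait_for_vendor_dump",
        filepath="/data/incoming/vendor_dump.csv",  # hypothetical drop location
        poke_interval=300,                          # check every 5 minutes
        timeout=6 * 60 * 60,                        # fail after 6 hours of waiting
        dag=dag,
    )

    ingest = BashOperator(task_id="ingest_dump", bash_command="make ingest", dag=dag)

    wait_for_dump >> ingest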


I think it's better to place everything in one DAG if that solves the problem. If it doesn't, then sensors are OK, I guess, but I would try to avoid them otherwise.


I really disagree here. Monolithic DAGs are a bigger pain to manage.

Breaking them out into smaller DAGs makes retrying/backfilling/etc. a lot more straightforward. It also lets you reuse those pieces more easily.

We have compute DAGs that are the upstream dependency for many other DAGs. Originally, this DAG was monolithic and loaded data into one table. But because the DAG is split into computation and loading, we can easily add more downstream DAGs without changing how the first one operates.
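A sketch of how that split typically looks (DAG and task ids are hypothetical; 1.10-era imports): the loading DAG waits on the compute DAG's run for the same execution date, so new downstream DAGs can be added without touching the compute DAG.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.sensors.external_task_sensor import ExternalTaskSensor  # airflow.sensors.external_task in 2.x

    load_dag = DAG("load_reporting_table", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

    # Waits for the same execution date's run of the upstream compute DAG.
    wait_for_compute = ExternalTaskSensor(
        task_id="wait_for_compute",
        external_dag_id="compute_events",  # hypothetical upstream DAG
        external_task_id="aggregate",      # hypothetical final task in that DAG
        dag=load_dag,
    )

    load = BashOperator(task_id="load", bash_command="make load", dag=load_dag)

    wait_for_compute >> load

By default the sensor matches on execution date, so the two DAGs need the same schedule (or you pass an execution_delta).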


What do you mean by bookmark in this context?


"I have processed 5/15 records, next run I need to start at record 6". Bookmarking is a common concept for working through a large job in several small runs.


I wonder if this difference in jargon has origins in different sub-fields. I recognize that concept as "checkpoints" across different companies but I also remember seeing the term from data science folks and thinking that I just don't know the concept.


Often called a Waterline/Watermark as well.


I'd call it a watermark too; "checkpoint" can mean an intermediate point in a multi-step database transaction.


I've always called them offsets.


I would call it "a read cursor (that's updated within the transaction)", but I'm boring.


Cursor nomenclature is used to describe where in a sequence the operator is during an execution pass over a database query or similar, right?

I think I immediately understand "cursor" best in this context. I also agree it's a little boring and definitely old school :).


> The next best reason to use Airflow is that you have a recurring job that you want not only to happen, but whose successes and failures you want to track.

For this specific use case, I use healthchecks.io - trivial to deploy in almost any context that can ping a public URL. Generous free-tier limits, so I've got that going for me, which is nice.


I thought the point of Airflow was orchestration in an event-driven microservices architecture? That's what Uber uses Cadence for, at least.


I recently started investigating Airflow for our use case, and it seems to be exactly what you describe and no more. But in its niche it excels feature-wise, at least regarding the features I need to expose to the users.



