I use Airflow, and am a big fan. I don't think it's particularly clear, however, when you should use Airflow.

The single best reason to use Airflow is that you have some data source with a time-based axis that you want to transfer or process. For example, you might want to ingest daily web logs into a database, or have weekly statistics generated from your database.
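A minimal sketch of that first case (the DAG id, command, and schedule here are hypothetical, and the import path is the Airflow 1.10-style one): each scheduled run handles exactly one slice of the time axis, which Airflow passes in as a templated date.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in 2.x

    # Hypothetical daily ingestion: each run processes one day of web logs,
    # identified by the execution date Airflow injects as {{ ds }}.
    dag = DAG(
        dag_id="ingest_web_logs",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=True,  # missed days are backfilled automatically
    )

    ingest = BashOperator(
        task_id="load_logs_into_db",
        bash_command="load_logs --date {{ ds }}",  # placeholder command
        dag=dag,
    )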

The next best reason to use Airflow is that you have a recurring job that you want not only to happen, but whose successes and failures you want to track. For example, maybe you want to garbage-collect some files on a remote server with spotty connectivity, and you want to be emailed if it fails for more than two days in a row.
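Airflow doesn't express "failed two days in a row" directly, but the tracking and alerting side is declarative rather than hand-rolled. A sketch with hypothetical values (again, 1.10-style imports):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "retries": 2,                      # rerun a failed task before marking the run failed
        "retry_delay": timedelta(hours=1),
        "email": ["oncall@example.com"],   # hypothetical address
        "email_on_failure": True,
    }

    dag = DAG(
        dag_id="gc_remote_files",
        default_args=default_args,
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
    )

    gc = BashOperator(
        task_id="garbage_collect",
        bash_command="ssh flaky-host ./gc_old_files",  # placeholder command
        dag=dag,
    )

Each run's failures and retries then show up in the UI, so "has this been failing for two days?" is a glance at the grid rather than a grep through logs.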

Beyond those two, Airflow might be very useful, but you'll be shoehorning your use case into Airflow's capabilities.

Airflow is basically a distributed cron daemon with support for reruns and SLAs. If you're using Python for your tasks, it also includes a large collection of data abstraction layers, so that Airflow manages the named connections to your different sources and you only have to code the transfer or transform rules.
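For example (a sketch: the connection ids, table, and columns are hypothetical, and the hook imports are the 1.10 paths, which moved to provider packages in 2.x), a hook resolves a named connection stored in Airflow, so the task encodes only the transfer rule, never hosts or credentials:

    from airflow.hooks.mysql_hook import MySqlHook
    from airflow.hooks.postgres_hook import PostgresHook

    def transfer_orders(ds, **_):
        # "orders_db" and "warehouse_db" are connections managed in the Airflow UI/CLI;
        # the hooks look up hosts and credentials by name at run time.
        src = MySqlHook(mysql_conn_id="orders_db")
        dst = PostgresHook(postgres_conn_id="warehouse_db")

        rows = src.get_records(
            "SELECT id, total, created_at FROM orders WHERE DATE(created_at) = %s",
            parameters=(ds,),
        )
        dst.insert_rows(table="orders_raw", rows=rows)

Wired to a PythonOperator (with provide_context=True on 1.10), this becomes one schedulable, retryable task.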



Yes, this seems to be yet another tool that falls prey to what I think of as "The Bisquick Problem". Bisquick is a product that is basically pre-mixed flour, salt, baking powder that you can use to make pancakes, biscuits, and waffles. But why would you buy this instead of its constituent parts? Does Bisquick really save that much time? Is it worth the loss of component flexibility?

Worst of all, if you accept Bisquick, then you open the door to an explosion of Bisquick options. It's a combinatorial explosion of pre-mixed ingredients. In a dystopian future, perhaps people stop buying flour or salt, and the ONLY way you can make food is to buy the right kind of Bisquick. Might make a kind of mash-up of a baking show and Black Mirror.

Anyway, yeah, Airflow (and so many other tools) feel like Bisquick. It has all the strengths, but also all the weaknesses, of that model.


The art of software engineering is all about finding the right abstractions.

Higher-order abstractions can be a productivity boon, but they have costs when you fight their paradigm or need to regularly interact with lower layers in ways the design didn't anticipate.

Airflow and similar tools are doing four things:

A) Centralized cron for distributed systems. If you don't have a unified runtime for your system, the old approaches of Unix cron or an ad-hoc "job system" become complex, because there's no centralized management and no clarity about which scheduling tool developers should use for a given job.

B) Job state management. Jobs can fail and may need to be retried, people alerted, etc. Most scheduling systems have some way to deal with failure too, but these tools now treat it as stored state.

C) DAGs. Complex batch jobs are often composed of many stages with dependencies, and you need that state to track and retry stages independently (especially if they are costly).

D) What many of these tools also try to do is tie the computation that performs a given job to the scheduling tool itself. This now seems to be an antipattern. They also try to offer "premade" job stages or "operators" for common tasks. These are a mix of wrappers that talk to different compute systems and actual compute mechanisms themselves.

If you have the kind of system that is either sufficiently distributed, or heterogeneous enough that you can't use existing schedulers, you need something with #A. If you also need complex job management, you need #A, #B, and #C, and, having rebuilt my own many times, a standard system is better when coordinating between many engineers. What seems necessary in general is #D.
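For reference, a sketch of what #A through #C look like in Airflow terms (stage names are hypothetical, and the compute is delegated to external commands rather than done in-process, i.e. skipping #D): each stage is its own task, so its state is tracked and it can be retried without redoing costly upstream stages.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="nightly_batch",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",  # A: one centralized schedule
        default_args={"retries": 1, "retry_delay": timedelta(minutes=30)},  # B: failure handling as state
    )

    # C: stages with dependencies, tracked and retried independently.
    extract = BashOperator(task_id="extract", bash_command="make extract", dag=dag)
    transform = BashOperator(task_id="transform", bash_command="make transform", dag=dag)
    load = BashOperator(task_id="load", bash_command="make load", dag=dag)

    extract >> transform >> load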


I meant to say D seems unnecessary


Just playing devil's advocate for a bit: the horror of your Bisquick scenario depends in part on the assumption that salt, flour, etc are fungible across applications, which is not quite true. Flour, sugar, and probably other trace ingredients for managing texture benefit from using different types in different recipes. If any of those benefit from economies of scale, it could well be optimal in some sense to have mixes for everything. This is much closer to being true in software, where different circumstances demand different concrete implementations of abstractions like, say, "scheduler" (analogous to grade/type of abstract ingredient like "flour").

Ed: I should say, I really like this metaphor, and I expect it will crop up in my thinking in the future.


I realize this is a metaphor and I'm answering the metaphor and not the underlying problem, but: camping. Seriously, want some quick pancakes or donuts when you're out in the field? Bisquick and just change up how much water you add.


Also disaster survival.

Couscous, Bisquick and other low-or-no-heat, premixed, just-add-water solutions are a godsend to have when a tornado takes out your gas line or electric grid.


On top of that, such a Bisquick takes hours of learning, deployment, and troubleshooting, and has complex failure modes, compared to a few cron jobs and trivial scripts.


Assume that future kinds of Bisquick can have negative amounts of flour, salt or baking powder. Now recipes in the dystopian future just require a simple change of basis.


Yes, that is true. If you allow negative ingredients you can indeed reach all points in the characteristic state-space of baking, even when limited to picking from a huge set of proprietary Bisquicks. Which is a hopeful thought. I think.


Implying that Airflow is a simple mixture of a few ingredients is selling it quite short. There are a lot of knobs and switches in Airflow (i.e. features) that have been built and battle-tested across a lot of users. It has quite a lot of dependencies across the scheduler, webserver, CLI, and worker modes. And there is a lot of new development going into Airflow in recent months (new API, DAG serialization, making the scheduler HA).


Brawndo. It's what plants crave.


Your comment doesn't provide insight into when and when not to use it.


I guess as I've grown older I've grown wary of black-and-white thinking. The insight I would share with you is to be wary of Bisquick, but do not dismiss it outright. All creation is combination, and you won't succeed saying no to all combinations. In the same way, you won't succeed saying yes to every combination.


I think what you’re saying is, you cannot start from first principles if you want to accomplish most things, but you need to understand first principles to not misuse those things.


I guess so. Like everything, it comes with pros and cons, you know?


I would add these gotchas/recommendations:

- Airflow the ETL framework is quite bad. Just use Airflow the scheduler/orchestrator: delegate the actual data transformation to external services (serverless, Kubernetes, etc.); see the sketch after this list.

- Don't use it for tasks that don't require idempotency (e.g. a job that uses a bookmark).

- Don't use it for latency-sensitive jobs (this one should be obvious).

- Don't use sensors or cross-DAG dependencies.
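A sketch of the "orchestrate, don't compute" pattern from the first bullet (image, registry, and namespace are hypothetical; the import is the 1.10 contrib path, which moved to a provider package in 2.x):

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    dag = DAG("transform_events", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

    # Airflow only schedules and tracks this task; the actual transformation
    # runs in a container on the cluster.
    transform = KubernetesPodOperator(
        task_id="transform_events",
        name="transform-events",
        namespace="data-jobs",                           # hypothetical namespace
        image="registry.example.com/etl/transform:1.4",  # hypothetical image
        arguments=["--date", "{{ ds }}"],
        get_logs=True,
        is_delete_operator_pod=True,                     # clean up the pod afterwards
        dag=dag,
    )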

So yeah, unfortunately it's not a good fit for all use cases, but it has the right set of features for some of the most common batch workloads.

Also, Python as the DAG configuration language was a very successful idea, maybe the most important contributor to Airflow's success.


> - Don't use it for tasks that don't require idempotency (e.g. a job that uses a bookmark).

You can totally design your tasks to be idempotent, but it's up to you to make them that way. The scheduler or executor doesn't have any insight into your job.

This is why I encourage people to use a unified base operator and then pass their own Docker containers to it, like https://medium.com/bluecore-engineering/were-all-using-airfl... outlines.
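Roughly this shape, as a sketch (the class name and namespace are hypothetical; the import is the 1.10 contrib path): a thin wrapper standardizes everything except the image and its arguments.

    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    class BaseJobOperator(KubernetesPodOperator):
        """Hypothetical unified base operator: task authors supply only an image
        and arguments; namespace, logging, and pod cleanup are standardized here."""

        def __init__(self, image, arguments=None, **kwargs):
            super().__init__(
                image=image,
                arguments=arguments or [],
                name=kwargs["task_id"].replace("_", "-"),  # pod names can't contain underscores
                namespace="batch-jobs",                    # hypothetical shared namespace
                get_logs=True,
                is_delete_operator_pod=True,
                **kwargs,
            )

Individual jobs are then just BaseJobOperator(task_id="score_users", image="...", dag=dag): the Airflow side stays identical while the containers vary.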

> - Don't use it for latency-sensitive jobs (this one should be obvious).

IIRC this is being addressed in Airflow 2.0

> - Don't use sensors or cross-DAG dependencies.

This is a little extreme. I've never run into issues with cross-DAG dependencies or sensors. They make managing my DAGs way easier, because we can separate computation DAGs from loading DAGs.

context: I built/manage my company's Airflow platform. Everything is managed on k8s.


> You can totally design your tasks to be idempotent

Yes, of course. I mean that Airflow is not a good fit for tasks you don't want to be idempotent (I think most, but not all, tasks should be idempotent).
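For what it's worth, the usual trick is to have each run overwrite its own date slice, so retries and backfills converge on the same state. A sketch with hypothetical connection and table names (1.10-era import):

    from airflow.hooks.postgres_hook import PostgresHook

    def rebuild_daily_stats(ds, **_):
        """Idempotent: delete-then-insert the slice for this run's execution date."""
        hook = PostgresHook(postgres_conn_id="warehouse_db")  # hypothetical connection
        # Both statements run on one connection and are committed together.
        hook.run(
            [
                "DELETE FROM daily_stats WHERE day = %(day)s",
                "INSERT INTO daily_stats (day, n) "
                "SELECT %(day)s, count(*) FROM events WHERE day = %(day)s",
            ],
            parameters={"day": ds},
        )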

> I've never run into issues with cross-DAG dependencies

I believe the Airflow docs advise against them where possible. I can see why, from my experience: less visibility and more complexity, especially for backfills.


> context: I built/manage my company's Airflow platform. Everything is managed on k8s

My team is running Airflow on a single node but we're slowly outgrowing this setup. We're considering running jobs on k8s.

Curious what your setup looks like. Is your cluster a fixed size, or does it scale with the load?


Using the KubernetesPodOperator for everything adds a huge amount of overhead. You still need Airflow worker nodes, but they're just babysitting the K8S pods doing the real work.

I know it's 2020 and memory is cheap or whatever, but Airflow is shockingly wasteful of system resources.


re: ETL framework - you can get a lot done with the built-in Airflow operators, including the PythonOperator (bring in any Python dependency you like) and BashOperator (call any CLI, etc.). It's not drag-and-drop, but I've found it to be quite versatile.

re: idempotency - yes, make your workflow tasks idempotent.

re: latency - this is being worked on very actively. Ash (a PMC member) has committed to working on task latency almost exclusively until it's resolved.

re: sensors, there is some great work from Airbnb to improve: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+A...


If you're using a Docker or Kubernetes operator, you can't make use of the other operators or Airflow connections. Do you just work without those?


Correct, I use operators only to delegate the actual workload to an external service.


Got it. To be clear, do you still use Airflow for storing connections, or not? If not, how do you store your credentials? We've only done a POC, and we've discovered a higher-than-expected learning curve.


This is handy, thanks. I deal mostly with Luigi, and this helps me place Airflow a bit better.


Sensors are quite helpful when you don't know the exact moment your data will come in.
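A sketch of that (path and timings hypothetical; the FileSensor import is the 1.10 contrib one, airflow.sensors.filesystem in 2.x): the sensor task pokes until the file shows up, then the downstream task runs.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.sensors.file_sensor import FileSensor
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("vendor_import", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

    wait_for_dump = FileSensor(
        task_id="wait_for_vendor_dump",
        filepath="/data/incoming/vendor_dump.csv",  # hypothetical drop location
        poke_interval=300,                          # check every 5 minutes
        timeout=6 * 60 * 60,                        # fail after 6 hours of waiting
        dag=dag,
    )

    ingest = BashOperator(task_id="ingest_dump", bash_command="make ingest", dag=dag)

    wait_for_dump >> ingest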


I think it's better to place everything in one DAG if that solves the problem. If it doesn't, then sensors are OK, I guess, but I would try to avoid them otherwise.


I really disagree here. Monolithic DAGs are a bigger pain to manage.

Breaking them out into smaller DAGs makes retrying/backfilling/etc. a lot more straightforward. It also lets you reuse those pieces more easily.

We have compute DAGs that are the upstream dependency for many other DAGs. Originally, this DAG was monolithic and loaded data into one table. But because the DAG is split into computation and loading, we can easily add more downstream DAGs without changing how the first one operates.
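A sketch of how that split typically looks (DAG and task ids are hypothetical; 1.10-era imports): the loading DAG waits on the compute DAG's run for the same execution date, so new downstream DAGs can be added without touching the compute DAG.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.sensors.external_task_sensor import ExternalTaskSensor  # airflow.sensors.external_task in 2.x

    load_dag = DAG("load_reporting_table", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

    # Waits for the same execution date's run of the upstream compute DAG.
    wait_for_compute = ExternalTaskSensor(
        task_id="wait_for_compute",
        external_dag_id="compute_events",  # hypothetical upstream DAG
        external_task_id="aggregate",      # hypothetical final task in that DAG
        dag=load_dag,
    )

    load = BashOperator(task_id="load", bash_command="make load", dag=load_dag)

    wait_for_compute >> load

By default the sensor matches on execution date, so the two DAGs need the same schedule (or you pass an execution_delta).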


What do you mean by bookmark in this context?


"I have processed 5/15 records, next run I need to start at record 6". Bookmarking is a common concept for working through a large job in several small runs.


I wonder if this difference in jargon has origins in different sub-fields. I recognize that concept as "checkpoints" across different companies but I also remember seeing the term from data science folks and thinking that I just don't know the concept.


Often called a Waterline/Watermark as well.


I'd call it a watermark too; "checkpoint" can mean an intermediate point in a multi-step database transaction.


I've always called them offsets.


I would call it "a read cursor (that's updated within the transaction)", but I'm boring.


Cursor nomenclature is used to describe where in a sequence the operator is during an execution pass over a database query or similar, right?

I think I immediately understand "cursor" best in this context. I also agree it's a little boring and definitely old school :).


> The next best reason to use Airflow is that you have a recurring job that you want not only to happen, but whose successes and failures you want to track.

For this specific use case, I use healthchecks.io - trivial to deploy in almost any context that can ping a public URL. Generous free-tier limits, so I've got that going for me, which is nice.


I thought the point of Airflow was orchestration in an event-driven microservices architecture? That's what Uber uses Cadence for, at least.


I recently started investigating Airflow for our use case, and it seems to be exactly what you describe and no more. But in its niche it excels feature-wise, at least regarding the features I need to expose to the users.



