
Introducing Airflow at SIG (and some bonus thoughts on technology selection)

This is an account of how we introduced Airflow at SIG, describing the why, the what and the how. While writing, I began to draw parallels with earlier technology selection efforts I was involved in and thought it would be interesting to make a comparison. So that’s what I did. But first, let's look at Airflow.


Author
Marijn Dessens

Head of Development


Why we needed Airflow (or something like it)

Our marketing department might disagree, but you could say that SIG provides static code analysis as-a-service, running a little over 60,000 analyses per month. The infrastructure for that was originally built in-house, dating back to 2010, and was starting to show its age. As these things go, there’s never a moment when things explode in your face, but you can see the cracks getting bigger, so around 2019 we decided it was time to think about a new setup, preferably without reinventing the wheel too much.

Running static code analysis is not an online transaction (it’s not fast enough for that) but it’s also not a batch job (it’s triggered by new code being available, not by the clock). It’s more akin to build pipelines, so we initially looked at that type of tooling. We had experience with GitLab CI/CD and Jenkins but didn’t think they were a great fit: GitLab was too closely tied to individual repositories, and Jenkins was not software-defined enough for our taste; we would also depend heavily on plugins to get Jenkins to do what we needed.

Gauging the market and making a decision

Then, we turned to a more loosely defined category of ‘software that can orchestrate workflows’. There we discovered that many well-known tech companies had apparently dealt with similar issues, solved them and shared the results with the lesser gods out there. As is the etiquette, they all had tongue-in-cheek names like ‘Luigi’ (Spotify), ‘Pinball’ (Pinterest) and ‘Conductor’ (Netflix). However, Airbnb’s creation, Airflow, appealed to us most. It was reasonably easy to build a PoC that strung together a couple of Docker containers and passed data between them without too much Python code.

The nice thing about making technology decisions in an organization like ours is that you can be unfair; we were not in a formal tender process or anything like that. We could get Airflow to do what we wanted and it met a set of criteria we had previously created, so we could give the competitors a quick glance-over, asking ourselves ‘Is there any way in which this product is clearly superior to Airflow?’. The outcome of that exercise could be summarized as ‘no’, so we went ahead with Airflow.

Start small and learn

Then came the question of where to start. You want to start in a place where you gain some experience with how the solution behaves IRL, but where you have some room to manoeuvre if things don’t go exactly as planned. Not purely coincidentally, we were introducing a new analysis type at that time that didn’t fit well in the existing infrastructure, so that was a logical place to start. 

All in all, we got it to work, except for one fundamental problem: the Airflow API (which was still experimental back then) did not allow us to trigger more than one analysis per second. And as we all know, in Computerland, a second is an eternity. Luckily, the problem was in the API itself and not in the core, which led to our only contribution to the project so far.
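For comparison, Airflow 2’s stable REST API accepts an explicit, client-chosen `dag_run_id` per run, which sidesteps this kind of limit. A standard-library-only sketch of building such a trigger request (the base URL, DAG id and `system_id` conf key are illustrative):

```python
import json
from datetime import datetime, timezone

AIRFLOW_BASE = "http://localhost:8080"  # assumption: default webserver address


def build_trigger_request(dag_id: str, system_id: str) -> tuple[str, bytes]:
    """Build URL and JSON body for POST /api/v1/dags/{dag_id}/dagRuns
    (Airflow 2's stable REST API)."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    body = {
        # An explicit, unique run id lets us trigger many runs of the same
        # DAG within the same second; the old experimental API derived run
        # identity from the execution date, with one-second granularity.
        "dag_run_id": f"{system_id}-{ts}",
        "conf": {"system_id": system_id},
    }
    return f"{AIRFLOW_BASE}/api/v1/dags/{dag_id}/dagRuns", json.dumps(body).encode()
```

Posting this with `urllib.request` plus authentication headers is all that remains; we leave that out here.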

There were a couple of other obstacles to overcome before we could consider ourselves happy campers. Airflow’s inability to mask secrets (fixed since 2.2) forced us to tweak the UI so ‘normal’ users could not see the so-called ‘rendered templates’. We also needed third-party code to do some cleaning up, as some UI elements became unresponsive at our volumes. A telltale sign was that the numbers would no longer fit in the designated circle area once the total number of runs for some DAGs reached 100,000. Again, Airflow offers this out-of-the-box these days.
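For reference, secret masking is now configurable from airflow.cfg; a sketch of the relevant options as we understand them (the extra names to mask are illustrative):

```ini
[core]
# Mask values of variables and connection fields whose names contain
# sensitive words when showing rendered templates and logs.
hide_sensitive_var_conn_fields = True
# Extra name fragments to treat as sensitive, on top of the defaults.
sensitive_var_conn_names = api_key,token,passphrase
```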

More observations and where next

Did we run into more things? Of course, but we managed to move our entire workload to Airflow without any major incidents or technical challenges. We had to play around a bit with weights, queues and especially pools to get everything right, but all within the confines of what Airflow offers. 

Very recently, we moved to Airflow 2. We decided to combine this with a move to Kubernetes (away from the Docker Swarm-based deployment we were running with Airflow 1). This should allow us to scale better and leverage the work being done in the community on the k8s deployment. Our old deployment used hand-crafted Airflow images, the maintenance of which gave us no particular advantage, so we switched to the standard images.

Another area of improvement is tied to our strategy of giving users a shorter feedback loop on the quality of the code they commit, which implies that you might run more than one analysis on the same repo in parallel. Right now, Airflow does not know about this, because it does not know the concept of a repository (or ‘system’, as we call it in our data model). This sometimes leads to problems when importing analysis results in our database. So perhaps we need to teach Airflow the concept of a system to ensure things happen in the correct order.
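One possible direction, sketched here purely as an illustration (the run-conf layout and the function are hypothetical): let the import step inspect the other active runs of the DAG and wait if another run for the same system is in flight. The selection logic itself is plain Python:

```python
def conflicting_runs(active_run_confs, system_id):
    """Return the run ids of active runs that target the same system.

    `active_run_confs` maps run_id -> conf dict, as it could be fetched
    from Airflow's REST API (GET /api/v1/dags/{dag_id}/dagRuns filtered
    on state=running).
    """
    return sorted(
        run_id
        for run_id, conf in active_run_confs.items()
        if conf.get("system_id") == system_id
    )
```

Paired with a sensor or a retry loop before the import task, this would serialize imports per system without changing the rest of the pipeline.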

But all in all, we’re pretty happy campers so far and we’re confident that Airflow will continue to support our scaling workloads for quite some time!

A moment of reflection: Open Source vs. COTS

While writing this post, I started to compare our introduction of Airflow to technology selection processes in the past in which I was involved, not as an engineering manager but as a consultant. In all cases there was a technical problem coupled with a business problem, both sufficiently well articulated. The big difference between the earlier cases and the Airflow case is the lack of a commercial aspect in the latter. 

Being able to use open source software to solve a technical problem is really different from using COTS software. It allows you to focus on solving the actual problem you have, keeping it small at first and expanding the effort as your solution proves to work. This is quite a contrast with COTS software, whose implementation is inevitably top-down in nature. A vendor wants to talk to the higher-ups because that’s where spending decisions are made. These higher-ups will then involve project managers, buyers, HR and other stakeholders (and perhaps a bunch of consultants ;-)). These people will invariably end up in workshops, listening to polished stories and sitting through confusing demos. This is all well-intended, but it obscures the question: does this solve our technical problem? Many non-technical hours will have been spent before the answer to that question becomes clear.

So open source software will have a much lower cost of introduction than COTS, which adds to its attractiveness. Perhaps the picture is different when the problem is more ‘business-y’ (open source ERP is still not a big thing, AFAIK), but if the technical problem is clear enough, COTS software would need to offer very obvious advantages for me to even consider it.
