Containerized Data Science and Engineering - Part 1, Dockerized Data Pipelines

(This is part 1 of a two-part series of blog posts about doing data science and engineering in a containerized world.)

I'm sure you have been hearing at least some of the hype around "containers" and Docker this past year. In fact, Bryan Cantrill (CTO at Joyent) and Ben Hindman (founder of Mesosphere) recently declared that 2015 was the "year of the container" (see their webinar here). So what's all the hype about, and how does it relate to what's happening in the data science and engineering world?

If you have been living in a hole this past year, here is an introduction to containers along with some advantages of using them. In this post, however, I am going to provide some resources for those wishing to containerize their data pipelines.

Advantages of containerizing your data pipeline:

Let's say you've already got a beautiful Kafka + Spark + Cassandra data pipeline in place. Why would you want to go to the trouble of containerizing one or more of these pieces? Here are a few reasons:

  1. Ease of Deployment: By packaging parts of your pipeline as one-command-deployable components, you can easily re-deploy those components when needed. For example, I'm sure you Spark users out there have realized by now that keeping up with the rapid pace of Spark version updates can require a lot of work. With a containerized version of Spark, however, you can deploy new versions with a single command (assuming you are aware of version dependencies in your actual Spark jobs).

  2. Scalability: Because deploying containers is so easy, scaling resources is also easy. You can bring up containers very quickly to provide extra capacity when your data pipeline is stressed, then destroy them when the pipeline is underutilized.

  3. Resilience: By resilience in this context, I mean that you can easily deploy multiple containers of each component of your data pipeline and not stress about one of those containers failing. If a container fails or malfunctions, just tear it down and bring up a new one while the other running containers pick up the slack.

  4. Cloud Flexibility: If all of your data pipeline components are deployed via containers, you have the flexibility to deploy the pipeline with almost any cloud provider. When you get credits with one cloud provider, just move your pipeline over from another. Or dynamically orchestrate where your pipeline is deployed to optimize spend or resource allocation.
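To make the deployment and scaling points above concrete, here is a minimal sketch of a Docker Compose file for a standalone Spark cluster. The image name and tag (`gettyimages/spark:1.6.0-hadoop-2.6`) are illustrative assumptions, not a recommendation; substitute whatever Spark image and version you actually use.

```yaml
# docker-compose.yml -- a minimal sketch, not a production config.
# The image/tag below is a hypothetical example.
spark-master:
  image: gettyimages/spark:1.6.0-hadoop-2.6
  command: bin/spark-class org.apache.spark.deploy.master.Master
  ports:
    - "7077:7077"   # master RPC port
    - "8080:8080"   # master web UI

spark-worker:
  image: gettyimages/spark:1.6.0-hadoop-2.6
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
  links:
    - spark-master
```

With a file like this, upgrading Spark is just a matter of bumping the image tag and re-running `docker-compose up -d`, and scaling out is as simple as `docker-compose scale spark-worker=3`.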

Get started with these containerized components:

Luckily for all of us, developers have already thought a lot about containerizing common components of data pipelines. Here are some Docker images and Dockerfiles to help you quickly spin up a data pipeline:

Related projects:

Also, if you are really interested in containerized data pipelines, here are some interesting open source projects that you may want to follow, contribute to, or utilize:

  • Pachyderm Pipeline System - "Rather than thinking in terms of map or reduce jobs, pps thinks in terms of pipelines expressed within a container. A pipeline is a generic way to express computation over large datasets. Pipelines are also containerized to make them portable, isolated, and easy to monitor."
  • Luigi - "a Python (2.7, 3.3, 3.4, 3.5) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more." Specifically, check out this blog post about "Managing Containerized Data Pipeline Dependencies With Luigi."

Stay tuned for Part 2, where I will discuss doing data science in containers!
