Containerized Data Science and Engineering - Part 2, Dockerized Data Science

March 29, 2016

(This is part 2 of a two-part series of blog posts about doing data science and engineering in a containerized world; see part 1 here.)

Let's admit it, data scientists are developing some pretty sweet (and potentially valuable) models, optimizations, visualizations, etc. Unfortunately, many of these models will never actually be used because they cannot be "productionized." In fact, much of the "data science" happening in industry is happening in isolation on data scientists' laptops, and, in the cases in which data science applications are actually deployed, they are often deployed as hacky python/R scripts uploaded to AWS and run as cron jobs.

This is a huge problem and blocker for data science work in industry, as evidenced below:

"There was only one problem — all of my work was done in my local machine in R. People appreciate my efforts but they don’t know how to consume my model because it was not “productionized” and the infrastructure cannot talk to my local model. Hard lesson learned!" -Robert Chang, data scientist at Twitter

"Data engineers are often frustrated that data scientists produce inefficient and poorly written code, have little consideration for the maintenance cost of productionizing ideas, demand unrealistic features that skew implementation effort for little gain… The list goes on, but you get the point." -Jeff Magnusson, director of data platform at Stitch Fix

But don't worry! There is a better way: Dockerize your data science applications for ease of deployment, portability, and integration within your infrastructure.

Why should Data Scientists care about Docker?

The simple answer to this question is: because data scientists should want their models, dashboards, optimizations, etc. to be used. In order for data science applications to be utilized and bring value, they need to be deployed and get off of your laptop! They also need to play nice with your existing infrastructure and be easy to update and iterate on.

How does a Dockerized data science application provide these benefits?

  • However and wherever the application is deployed, you don't need to worry about dependencies. One of the pain points in deploying data science applications is managing to get all those heavyweight dependencies (numpy, scipy, pandas, scikit-learn, statsmodels, etc.) on the machine. By containerizing your data science application, you can deploy with one command regardless of your dependencies, the OS of the machine on which you are deploying, or the version of existing packages/libraries.

  • As your company's infrastructure changes or you need to scale your application, you can easily move it around or launch more instances. It is a common situation to develop a model or service without a full understanding of where that service will eventually live, how many requests it will need to process, etc. When you have your data science application containerized, you can easily move it from AWS to Azure when your company is granted cloud credits, or, to handle load, you can quickly launch one, two, or twenty instances of the application.

  • You, as a data scientist, can embrace your company's modern architecture. Instead of a cron job sitting on some machine that interacts directly with 4 different databases, how about a dockerized data science application that interacts with the other pieces of your infrastructure via a JSON API and message queues? This is much more sustainable, and your engineering organization will love how your applications don't break when there is a schema change or upgrade. You can also easily integrate data science work into the CI/CD pipelines utilized by the rest of your engineering organization. (Data scientists in the audience, don't worry: this is not scary or difficult, and we will work through a simple example below.)

A Simple Example of a Dockerized Data Science Application

Enough chit chat, let's move from a hacky python script to a sustainable dockerized data science application. In the following, I will present a very simple example of a data science application that:

  1. Utilizes technologies familiar to most data scientists (python and scikit-learn).

  2. Is Dockerized (i.e., can be built into a Docker image for deployment).

  3. Interacts outside of the running Docker container via JSON API.

A simple model to make predictions:

For this example, we are going to build a k-NN classification model (with scikit-learn) using the famous Iris dataset:

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

def predict(inputFeatures):

    # train a k-NN classifier on the Iris dataset
    iris = datasets.load_iris()
    knn = KNeighborsClassifier()
    knn.fit(iris.data, iris.target)

    # predict the species for the input features
    # (inputFeatures is a flat list of the four measurements)
    predictInt = knn.predict([inputFeatures])
    if predictInt[0] == 0:
        predictString = 'setosa'
    elif predictInt[0] == 1:
        predictString = 'versicolor'
    elif predictInt[0] == 2:
        predictString = 'virginica'
    else:
        predictString = 'null'

    return predictString

This predict function will return a species of Iris based on input features inputFeatures (sepal length, sepal width, petal length, and petal width). The dataset on which the model is trained is static in this case (i.e., loaded from the scikit-learn datasets), however it is easy to imagine how you could dynamically load in a dataset or aggregated values here via messaging, API(s), or database interactions.
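To make that concrete, here is a minimal sketch of what a dynamic loader could look like, assuming the training data lives in a CSV file whose last column is the integer label (the file path and format here are hypothetical, not part of the example above; a database query or API call would follow the same pattern):

```python
import csv

def load_training_data(path):
    """Read feature rows and integer labels from a CSV file in which
    the last column of each row is the label."""
    features, labels = [], []
    with open(path) as f:
        for row in csv.reader(f):
            features.append([float(x) for x in row[:-1]])
            labels.append(int(row[-1]))
    return features, labels
```

With a loader like this, the `knn.fit(iris.data, iris.target)` line above could become `knn.fit(*load_training_data('iris.csv'))`, so the model trains on whatever data is current when a prediction is requested.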

A JSON API to communicate predictions:

Next, we need to communicate these predictions with the rest of the world. For this we will develop our application as a simple JSON API. This is common practice for many engineering organizations operating with a microservices architecture and will allow your data application to play nice with other existing services.

Here I will utilize flask-restful for our simple API, but you could use twisted or any other framework:

from flask import Flask
from flask_restful import Resource, Api
from flask_restful import reqparse
from utils import makeprediction

app = Flask(__name__)
api = Api(app)

class Prediction(Resource):
    def get(self):

        parser = reqparse.RequestParser()
        parser.add_argument('slength', type=float, 
                 help='slength cannot be converted')
        parser.add_argument('swidth', type=float, 
                 help='swidth cannot be converted')
        parser.add_argument('plength', type=float, 
                 help='plength cannot be converted')
        parser.add_argument('pwidth', type=float, 
                 help='pwidth cannot be converted')
        args = parser.parse_args()

        prediction = makeprediction.predict([
                args['slength'],
                args['swidth'],
                args['plength'],
                args['pwidth']
            ])

        print "THE PREDICTION IS: " + str(prediction)

        return {
                'slength': args['slength'],
                'swidth': args['swidth'],
                'plength': args['plength'],
                'pwidth': args['pwidth'],
                'species': prediction
               }

api.add_resource(Prediction, '/prediction')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

So we have one GET endpoint that we will utilize to retrieve predictions for a set of features. For example, the following path:

localhost:5000/prediction?slength=1.5&swidth=0.7&plength=1.3&pwidth=0.3

will return:

{
  "pwidth": 0.3, 
  "plength": 1.3, 
  "slength": 1.5, 
  "species": "setosa", 
  "swidth": 0.7
}

where the species in the response JSON is the predicted species, based on the input features.
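Nothing scikit-learn specific is needed on the consuming side; any service that can make an HTTP request and parse JSON can use the predictions. A quick sketch of hypothetical client code (building the query string and parsing a response body like the one shown above):

```python
import json

try:
    from urllib import urlencode           # python 2
except ImportError:
    from urllib.parse import urlencode     # python 3

# build the query string for the GET request
params = {'slength': 1.5, 'swidth': 0.7, 'plength': 1.3, 'pwidth': 0.3}
url = 'http://localhost:5000/prediction?' + urlencode(params)

# parse a response body like the one shown above
response_body = ('{"pwidth": 0.3, "plength": 1.3, "slength": 1.5, '
                 '"species": "setosa", "swidth": 0.7}')
species = json.loads(response_body)['species']
print(species)
```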

A Dockerfile to build the Docker image:

In order to build a "Docker image" for our data science application, we need a Dockerfile. This Dockerfile will live in the root of our repo, will pull any necessary files and dependencies into the Docker image, and will specify a command to run when we "run" the Docker image:

FROM ubuntu:12.04

# set up pip, vim, etc.
RUN apt-get -y update --fix-missing
RUN apt-get install -y python-pip python-dev libev4 libev-dev gcc libxslt-dev libxml2-dev libffi-dev vim curl
RUN pip install --upgrade pip

# get numpy, scipy, scikit-learn and flask
RUN apt-get install -y python-numpy python-scipy
RUN pip install scikit-learn
RUN pip install flask-restful

# add our project
ADD . /

# expose the port for the API
EXPOSE 5000

# run the API (assuming the flask script above is saved as api.py
# in the repo root)
CMD ["python", "/api.py"]

That's It, Let's Deploy Our Application

That's all you need to build your first dockerized data science application (for Docker installation instructions see the Docker Website). Now, let's build the "Docker image" for our application:

docker build --force-rm=true -t pythoniris .

This command will build a Docker image called pythoniris. We could "tag" this image if we liked (e.g., pythoniris:latest) or associate it with a user/account on Docker Hub, such as dwhitena/pythoniris (Docker Hub is a public registry for Docker images, kind of like GitHub for Docker images).
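For example, tagging the image and pushing it to Docker Hub could look like the following (the dwhitena username is illustrative; substitute your own Docker Hub account):

```shell
# tag the local image under a Docker Hub user/repo name
docker tag pythoniris dwhitena/pythoniris:latest

# push the tagged image to the public registry
docker push dwhitena/pythoniris:latest
```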

If you upload the image to Docker Hub (or a private registry), deployment can be as easy as "running" the Docker image, referencing the user/imagename on Docker Hub or the registry. However, assuming you want to play around with this locally first, you can run the Docker image via:

docker run --net host -d --name myiris pythoniris

This will run the docker image as a container named myiris, as a daemon (-d), and using the same network interface as the localhost (--net host). That's it, your JSON API will now be available at localhost:5000.

See how easy that was? We went from a python script to a dockerized data science application with very little pain. Now, go forth and do data science, dockerize your data science, and deploy your data science!

The above code is available on Github here.