Originally posted here.
Machine learning and artificial intelligence are changing the game across industries, with models driving cars, preventing fraud, recommending movies, and even playing board games. These applications increasingly make decisions that directly affect users, such as flagging fraudulent transactions, approving policies (e.g., insurance policies), and matching users with service providers.
It’s not surprising that, with increased user impact, comes increased regulation for machine learning and AI. A huge precedent has recently been set by the European Union’s new General Data Protection Regulation (GDPR). In a nutshell,
“Organizations that use ML to make user-impacting decisions must be able to fully explain the data and algorithms that resulted in a particular decision.”
Your models must be held accountable for output. That’s right, the EU is already passing laws that impact the usage of models being built by your data science team!
An analysis of the GDPR (see here) suggests that it may prohibit a “wide swath of algorithms” and “could require a complete overhaul of standard and widely used algorithmic techniques.” Wow! A complete overhaul sounds painful. Let’s analyze the implications of these regulations and how you can ensure that your machine learning and AI models are prepared to be accountable (ideally without a complete overhaul).
Implications for data science workflows
Basically the GDPR, and likely other regulations that will follow in its wake, gives users a “right to an explanation” for any algorithmic decisions. Think about it. If a user is financially penalized or denied a service based on an algorithmic decision, how could data scientists/engineers ensure that this user is provided with a proper explanation for the decision, especially given how ML/AI algorithms evolve over time and operate on varying data? Well, to provide this explanation, data scientists/engineers need to do at least the following:
Simplify your analyses — Specifically, you should give priority to simple summary statistics and interpretable models over more complicated models. As stated in this analysis of the GDPR, “it stands to reason that an algorithm can only be explained if the trained model can be articulated and understood by a human.” Regulations like these may not totally kill useful modeling techniques such as neural networks, which are difficult to interpret, but they definitely motivate the usage of simple, interpretable models whenever possible. In fact, this should be a general best practice for data scientists already. As Martin Goodson states in Ten Ways Your Data Project is Going to Fail, “use a simple model you can understand. Only then move onto something more complex, and only if you need to.” Also, as Noah Lorang states in Data scientists mostly just do arithmetic and that’s a good thing, most businesses “just need good data and an understanding of what it means that is best gained using simple methods.” Many data scientists already spend most of their time cleaning, organizing, and parsing to get “good,” usable data sets that power models with a high degree of integrity, and this is not a bad thing. We should celebrate this effort to generate and use good data along with efforts to apply simple analyses to that data.
Version your data —
“How could a result be explained, especially a result of a machine learning model, without a versioned record of what data was input to generate the result and what data was output representing the result?”
Although this might seem like common sense, data versioning is far from a current best practice for data science teams in industry. Versioning the code that implements a model is not enough, because the model may behave drastically differently from one input data set to another.
Know the provenance of your data — Simple versioning of your data isn’t even enough to ensure a sufficient explanation. Even if you have the input data set for a certain run of your model, can you explain where that data came from, at what times it was generated, how it was generated, and how it was combined? As stated in the Pachyderm Data Science Bill of Rights: “Results without context is meaningless. At every step of your analysis, you need to understand where the data came from and how it reached its current state.”
Have the ability to reproduce any result — Assuming you have versioned your data and your model and know the provenance of your data, you should be able to exactly reproduce a given result. Moreover, this reproducibility gives you the ability to replay history including logs, input/output data sets, model parameters, etc., which is invaluable in identifying the cause for certain model behaviors.
These characteristics should become data science best practices regardless of regulation. However, with the inevitable increased regulation on the horizon, data scientists/engineers will feel more and more pressure to ensure that these characteristics are as standard in data science as version control and code reviews are in software engineering.
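As a rough illustration of how versioning, provenance, and reproducibility fit together, here is a minimal Python sketch. It is not how any particular tool does it: the function names (`fingerprint`, `run_with_provenance`, `reproduces`) and the toy fraud-flagging "model" are hypothetical, and it assumes the model is deterministic so that the same input always yields the same output.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Content hash used here as an immutable version ID for a data set."""
    return hashlib.sha256(data).hexdigest()


def run_with_provenance(model, model_version, raw_input, parents=()):
    """Run `model` on `raw_input` and return (output, provenance record)."""
    output = model(raw_input)
    record = {
        "model_version": model_version,
        "input_version": fingerprint(raw_input),
        "output_version": fingerprint(output),
        # Versions of the upstream data sets this input was derived from.
        "parents": list(parents),
    }
    return output, record


def reproduces(model, raw_input, record) -> bool:
    """Replay a run and check that input and output match the record."""
    return (
        fingerprint(raw_input) == record["input_version"]
        and fingerprint(model(raw_input)) == record["output_version"]
    )


# Hypothetical deterministic "model": flag transactions above a threshold.
def flag_large_transactions(raw: bytes) -> bytes:
    amounts = json.loads(raw)
    return json.dumps([amount > 1000 for amount in amounts]).encode()


if __name__ == "__main__":
    data = json.dumps([50, 2500, 999]).encode()
    output, record = run_with_provenance(flag_large_transactions, "v1", data)
    print(json.dumps(record, indent=2))
    print("reproducible:", reproduces(flag_large_transactions, data, record))
```

With a record like this stored next to every decision, "why was this transaction flagged?" can be answered by pointing to the exact input version, model version, and upstream data involved, and by replaying the run to confirm the result.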
Practically achieving this type of data science workflow
If reproducibility, data provenance, and data versioning are to be standards within data science, we need proper tools to integrate these characteristics into workflows. Ideally, these tools will be:
- Language agnostic — The language wars in data science between Python, R, Scala, and others will continue forever. We will always need a mix of languages and frameworks to enable advancements in a field as broad as data science. However, if tools enabling data versioning/provenance are language specific, they are unlikely to be integrated as standard practice.
- Infrastructure agnostic — The tools should be deployable on your existing infrastructure — locally, in the cloud, or on-prem.
- Scalable/distributed — It would be impractical to implement changes to a workflow if they were not able to scale up to production requirements.
- Non-invasive — The tools powering data versioning/provenance should be able to integrate effortlessly with existing data science applications, without a complete overhaul of the toolchain and data science workflows.
Pachyderm is an amazing open source system that meets all the above requirements and powers responsible (and innovative) data science workflows compliant with the newest machine learning and artificial intelligence regulations. I can't recommend it enough, and I mention it on this blog frequently (although, for full disclosure, I was just hired by them, so I can't be totally unbiased). Pachyderm powers data pipelines that naturally interface with a “git for data” data versioning system. In other words,
“Pachyderm provides complete provenance for your data while allowing you to utilize your existing data science applications/models written in any language and distribute your work at scale.”
Staying compliant with Pachyderm does not require the painful “complete overhaul” that the analysis of the GDPR suggests. Your existing analyses (in any language or framework including Python, R, Spark, etc.) can be part of a Pachyderm pipeline with versioned input/output and complete data provenance. All you need to do is wrap up your work in a Docker image, define a pipeline via a simple JSON specification, and send your job off to Pachyderm, which can be deployed on your existing infrastructure.
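To give a feel for what such a JSON specification might look like, here is a small Python sketch that builds one and prints it. Treat it as an assumption-laden example: the field layout (`pipeline`, `transform`, `input`) reflects my understanding of the Pachyderm pipeline spec and has changed between releases, so check the Pachyderm docs for your version. The pipeline name, image name, command, and repo name are all hypothetical.

```python
import json


def build_pipeline_spec(name, image, cmd, input_repo):
    """Assemble a pipeline description as a plain dict (assumed layout)."""
    return {
        "pipeline": {"name": name},
        # The Docker image that wraps your existing analysis, plus the
        # command to run inside it.
        "transform": {"image": image, "cmd": cmd},
        # The versioned data repository the pipeline reads from.
        "input": {"pfs": {"repo": input_repo, "glob": "/*"}},
    }


# All values below are placeholders for illustration only.
spec = build_pipeline_spec(
    name="fraud-model",
    image="example-registry/fraud-model:1.0",
    cmd=["python3", "/code/predict.py"],
    input_repo="transactions",
)

if __name__ == "__main__":
    # Write this output to a file and hand it to Pachyderm as the
    # pipeline definition.
    print(json.dumps(spec, indent=2))
```

The point of the sketch is that the analysis code itself stays untouched: the specification only names the image, the command, and the versioned input.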
To learn more about Pachyderm’s data provenance capabilities and how to integrate it into your workflow, see this article and the Pachyderm docs. Spin up Pachyderm in just a few commands and start keeping your models/analyses accountable!