Common Go for Data Science Questions
Recently, I gave the talk "Go for Data Science" at GopherCon 2016 (video here). I discussed how I transitioned from primarily using Python for data science to primarily using Go for data science. I also led everyone through an example data science project using an entirely Go-based approach.
There seemed to be a huge interest in using Go for data science, and it was awesome to see so many people asking great questions after the talk. I think we are entering the age of Go-based data science!! Anyway, I thought I would share some of the questions that I received after the talk along with my answers. I think these represent the most common questions I encounter when talking to people about Go for data science:
Question: What about all those libraries/packages in Python? Don't you find Go lacking in that regard?
Answer: Simply put, no. But let me go into a little more detail here, because I think this is the main hang-up for people considering Go for data science. The Python data science ecosystem seems so developed and rich. Things must be missing on the Go side, right?
Well, there are two responses:
First, despite having a rich ecosystem of data science tooling, the Python data science community still struggles to produce production-ready data science applications/services. This, in my opinion, is the biggest hardship for data scientists. Data scientists have a reputation for producing poorly written, inefficient code (in some cases copied straight out of Jupyter notebooks) that cannot be maintained and that does not play well with data pipelines developed by data engineers. I hear this constantly from devops and data engineers (and if you need more evidence you can follow the links in this post). Thus, if a data scientist working in Python cannot produce valuable services/applications, the question of whether Python does or does not have a rich ecosystem of tooling is a moot point. Go, on the other hand, is a natural fit for producing efficient, maintainable applications that play well with modern architectures.
Second, in reality, the majority of work that a data scientist does day-to-day is NOT training neural nets to play board games. A data scientist spends 90% of their time cleaning data, organizing data, gathering data, parsing data, and extracting fields/patterns from data (see here for evidence supporting that). Go has already proven extremely useful and efficient for these tasks! Then, for the remaining 10% of data science work, which includes training algorithms etc., Go already has a number of options. It's just that you might have to spend a little time finding the right things or implementing something here or there (see below for more resources as well), which is worth it for the huge gains in the other 90% of the work.
Question: Are there any key features missing from the Go equivalents of pandas/numpy/matplotlib that you wish existed?
Answer: There is always opportunity to add/improve functionality in these respects, but, as I mentioned above, I truly believe that there is no ecosystem-based blocker for data scientists transitioning to Go. That being said, we should definitely put some effort into visualization, either by expanding things like gonum/plot or by providing tutorials on powering things like D3.js or Dashing via Go. Also, I personally wish plotting were enabled inline in Go notebooks, and I plan to work on this very soon.
Question: How do you do machine learning using Go?
Answer: You do ML with Go the same way you do it with Python, except it is much easier to end up with maintainable, deployable ML code. The same rules of training, testing, and validation still apply, but as gophers we should also keep in mind things like clarity, simplicity, graceful handling of errors, etc. Regarding specific packages, it is a common fallacy that Go ML packages don't exist. In fact, you have most anything you could want! There is tooling for regression (e.g., here, here and here), classification (e.g., here, here and here), dimensionality reduction (e.g., here), and much more. Even if you don't find what you need there, you can enable any ML via connectors to H2O, TensorFlow, Apache Beam, or a number of other frameworks.
Question: What are the odds of numpy-style arrays getting into Go?
Answer: The most relevant news that I currently know about is this proposal for multidimensional slices. Please follow the comments/discussion on this and contribute if you are able. However, also look into gonum. They already provide a rich set of array/matrix functionality relevant to data science work.
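For readers wondering what a numpy-style array looks like in Go today, one common pattern is a dense matrix backed by a single flat slice in row-major order. The sketch below is an assumption-laden illustration of that pattern: the `Matrix` type and its functions are hypothetical and are not gonum's API, which is far more complete.

```go
package main

import (
	"errors"
	"fmt"
)

// Matrix is a hypothetical dense 2-D array stored row-major in one flat slice.
type Matrix struct {
	rows, cols int
	data       []float64
}

// NewMatrix allocates a rows x cols matrix filled with zeros.
func NewMatrix(rows, cols int) *Matrix {
	return &Matrix{rows: rows, cols: cols, data: make([]float64, rows*cols)}
}

// At returns the element at row i, column j.
func (m *Matrix) At(i, j int) float64 { return m.data[i*m.cols+j] }

// Set stores v at row i, column j.
func (m *Matrix) Set(i, j int, v float64) { m.data[i*m.cols+j] = v }

// Mul returns the matrix product a*b, or an error if the shapes do not agree.
func Mul(a, b *Matrix) (*Matrix, error) {
	if a.cols != b.rows {
		return nil, errors.New("Mul: inner dimensions do not match")
	}
	out := NewMatrix(a.rows, b.cols)
	for i := 0; i < a.rows; i++ {
		for k := 0; k < a.cols; k++ {
			aik := a.At(i, k)
			for j := 0; j < b.cols; j++ {
				out.data[i*out.cols+j] += aik * b.At(k, j)
			}
		}
	}
	return out, nil
}

func main() {
	a := NewMatrix(2, 2)
	a.Set(0, 0, 1)
	a.Set(0, 1, 2)
	a.Set(1, 0, 3)
	a.Set(1, 1, 4)

	b := NewMatrix(2, 2)
	b.Set(0, 0, 5)
	b.Set(0, 1, 6)
	b.Set(1, 0, 7)
	b.Set(1, 1, 8)

	c, err := Mul(a, b)
	if err != nil {
		fmt.Println("multiply failed:", err)
		return
	}
	fmt.Println(c.At(0, 0), c.At(1, 1))
	// prints "19 50"
}
```

Keeping the data in one contiguous slice avoids the per-row allocations of a `[][]float64`, at the cost of doing the index arithmetic yourself.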
Question: What are some advantages of developing your models, e.g., neural nets, in Go?
Answer: Integrity, integrity, integrity!! By developing your data science applications/services in Go, you can have amazing confidence that your application will behave as expected and can be deployed and maintained. On top of that, you get the efficiency, simplicity, and scaling that are already well-known attributes of Go.
Question: What are some good Go resources for modeling, analysis, and data post processing?
Answer: I'm going to list a few below. However, I have also started an effort to centralize some knowledge/training around Go-based data science here, which I think is another need in the community. If you are having trouble finding info on how to do something data-related with Go or would like another point of view, PLEASE open an issue here explaining what you would like to see. I, and/or another one of my amazing gopher data friends, will prioritize these requests and get some more training material or information out there!
The #data-science channel on gophers slack - There are a bunch of gophers here that are doing amazing data science work. They are a great resource and happy to answer questions and point to resources. If you are not part of gophers slack, join right away here.
This list of data-science related tooling.
Gonum - matrix manipulation, stats, plotting, and much more.
Pachyderm - a really great framework, written in Go, for distributing data analysis and pipelining.
Gobot - Go for IoT.
Question: Yes, Go is fast, statically compiled, and has concurrency, but if you wanted to be fast why not just encourage data scientists to work in C?
Answer: Maybe I should, based on speed? Haha. However, speed is not the only consideration, and I'm not evangelizing Go for data science based on speed (although Go is wicked fast). I think the main advantages that Go provides a data scientist relate to the production of usable, clear code that has integrity. C may be fast, but we also want data scientists to be productive in their work and to produce things that are clear and maintainable. I think Go strikes this balance.
Question: If you could pick one area where Go could be improved for data science, what would it be?
Answer: Centralization of knowledge and training around common tasks in data science. All common tasks in data science are handled quite well with Go, but a newcomer to the community might have trouble finding resources related to data analysis and data science. It would be so wonderful to bring some knowledge together from those working on anomaly detection, those working on IoT, those working on neural nets, etc. to show the amazing possibilities of doing data science work with Go and to help ease the transition to Go. As I mentioned above, PLEASE open issues here explaining what types of data-related tutorials or information you would like to see curated. Let's continue building momentum around Go for data science!!