July 26, 2016

Common Go for Data Science Questions


Recently, I gave the talk "Go for Data Science" at GopherCon 2016 (video here). I discussed how I transitioned from primarily using Python for data science to primarily using Go for data science. I also led everyone through an example data science project using an entirely Go-based approach.

There seemed to be a huge interest in using Go for data science, and it was awesome to see so many people asking great questions after the talk. I think we are entering the age of Go-based data science!! Anyway, I thought I would share some of the questions that I received after the talk along with my answers. I think these represent the most common questions I encounter when talking to people about Go for data science:

Question: What about all those libraries/packages in Python? Don't you find Go lacking in that regard?

Answer: Simply put, no. But let me go into a little more detail here, because I think this is the main hang-up for people considering Go for data science. The Python data science ecosystem seems so developed and rich. Things must be missing on the Go side, right?

Well, there are two responses:

Question: Are there any key features missing from the Go equivalents of pandas/numpy/matplotlib that you wish existed?

Answer: There is always opportunity to add/improve functionality in these respects, but, as I mentioned above, I truly believe that there is no ecosystem-based blocker for data scientists transitioning to Go. That being said, we should definitely put some effort into visualization, either by expanding things like gonum/plot or by providing tutorials on powering things like D3.js or Dashing via Go. Also, I personally wish plotting were enabled inline in Go notebooks, and I plan to work on this very soon.

Question: How do you do machine learning using Go?

Answer: You do ML with Go the same way you do it with Python, except it is much easier to end up with maintainable, deployable ML code. The same rules of training, testing, and validation still apply, but as gophers we should also keep in mind things like clarity, simplicity, graceful handling of errors, etc. Regarding specific packages, it is a common fallacy that Go ML packages don't exist. In fact you have most anything you could want! There is tooling for regression (e.g., here, here and here), classification (e.g., here, here and here), dimensionality reduction (e.g., here), and much more. Even if you don't find what you need there, you can enable any ML via connectors to H2O, TensorFlow, Apache Beam, or a number of other frameworks.

Question: What are the odds of numpy-style arrays getting into Go?

Answer: The most relevant news that I currently know about is this proposal for multidimensional slices. Please follow the comments/discussion on this and contribute if you are able. However, also look into gonum. They already provide a rich set of array/matrix functionality relevant to data science work.
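In the meantime, a 2-D array is commonly emulated in Go with a flat slice and row-major indexing. The sketch below shows that idea with a hypothetical `Matrix` type of my own; it is not gonum's API, and gonum's matrix types are far more complete and optimized.

```go
package main

import (
	"errors"
	"fmt"
)

// Matrix is a dense, row-major matrix backed by one flat slice.
// (Hypothetical type, for illustration; not gonum's API.)
type Matrix struct {
	Rows, Cols int
	Data       []float64
}

// NewMatrix wraps data as a rows x cols matrix. A nil data slice
// allocates a zeroed matrix.
func NewMatrix(rows, cols int, data []float64) (*Matrix, error) {
	if rows <= 0 || cols <= 0 {
		return nil, errors.New("dimensions must be positive")
	}
	if data == nil {
		data = make([]float64, rows*cols)
	}
	if len(data) != rows*cols {
		return nil, fmt.Errorf("got %d values, want %d", len(data), rows*cols)
	}
	return &Matrix{Rows: rows, Cols: cols, Data: data}, nil
}

// At returns the element at row i, column j.
func (m *Matrix) At(i, j int) float64 { return m.Data[i*m.Cols+j] }

// Set assigns v to the element at row i, column j.
func (m *Matrix) Set(i, j int, v float64) { m.Data[i*m.Cols+j] = v }

// Mul returns the matrix product a*b, or an error if the shapes disagree.
func Mul(a, b *Matrix) (*Matrix, error) {
	if a.Cols != b.Rows {
		return nil, fmt.Errorf("shape mismatch: %dx%d times %dx%d",
			a.Rows, a.Cols, b.Rows, b.Cols)
	}
	out, err := NewMatrix(a.Rows, b.Cols, nil)
	if err != nil {
		return nil, err
	}
	for i := 0; i < a.Rows; i++ {
		for j := 0; j < b.Cols; j++ {
			var sum float64
			for k := 0; k < a.Cols; k++ {
				sum += a.At(i, k) * b.At(k, j)
			}
			out.Set(i, j, sum)
		}
	}
	return out, nil
}

func main() {
	a, err := NewMatrix(2, 3, []float64{1, 2, 3, 4, 5, 6})
	if err != nil {
		fmt.Println(err)
		return
	}
	b, err := NewMatrix(3, 2, []float64{1, 0, 0, 1, 1, 1})
	if err != nil {
		fmt.Println(err)
		return
	}
	c, err := Mul(a, b)
	if err != nil {
		fmt.Println(err)
		return
	}
	for i := 0; i < c.Rows; i++ {
		fmt.Println(c.Data[i*c.Cols : (i+1)*c.Cols])
	}
}
```

Keeping the data in one contiguous slice avoids a separate allocation per row, and shape mismatches come back as ordinary errors instead of panics.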

Question: What are some advantages of developing your models, e.g., neural nets, in Go?

Answer: Integrity, integrity, integrity!! By developing your data science applications/services in Go you can have amazing confidence that your application will behave as expected and will be able to be deployed and maintained. On top of that, you get the efficiency, simplicity, and scaling that are already well-known attributes of Go.

Question: What are some good Go resources for modeling, analysis, and data post processing?

Answer: I'm going to list a few below. However, I have also started an effort to centralize some knowledge/training around Go-based data science here, which I think is another need in the community. If you are having trouble finding info on how to do something data-related with Go or would like another point of view, PLEASE open an issue here explaining what you would like to see. I, and/or another one of my amazing gopher data friends, will prioritize these requests and get some more training material or information out there!

Question: Yes, Go is fast, statically compiled, and has concurrency, but if you wanted to be fast why not just encourage data scientists to work in C?

Answer: Maybe I should, based on speed? Haha. However, speed is not the only consideration, and I'm not evangelizing Go for data science based on speed (although Go is wicked fast). I think the main advantages that Go provides a data scientist relate to the production of usable, clear code that has integrity. C may be fast, but we also want data scientists to be productive in their work and to produce things that are clear and maintainable. I think Go strikes this balance.

Question: If you could pick one area where Go could be improved for data science, what would it be?

Answer: Centralization of knowledge and training around common tasks in data science. All common tasks in data science are handled quite well with Go, but a newcomer to the community might have trouble finding data analysis/science-related resources. It would be so wonderful to bring some knowledge together from those working on anomaly detection, those working on IoT, those working on neural nets, etc. to show the amazing possibilities of doing data science work with Go and to help ease the transition to Go. As I mentioned above, PLEASE open issues here explaining what types of data-related tutorials or information you would like to see curated. Let's continue building momentum around Go for data science!!
