Dyed-in-the-wool software developers can get quite passionate about the relative virtues of one programming language or another, their debates sometimes threatening to transport you back to middle-school arguments about the greatest ballplayers of all time. Though their computer passions find other outlets as well, data scientists also talk about software and programming.
As you plan your own personal skill development program in statistics, analytics, and data science, will you focus on Python? R? Both? Something else? For those whose work is primarily traditional statistical research, R is clearly preferred. Data scientists, on the other hand, use both R and Python.
The recruiting firm Burtch Works periodically surveys professionals in the field, and the firm makes a distinction between predictive analytics professionals (who deal with structured data) and data scientists (who deal with text and other unstructured data). Burtch Works reports that R users are roughly evenly split between the two groups, while Python is favored 2 to 1 by data scientists. Their definition of the roles is somewhat unusual, and Burtch Works combines the groups for other analyses, but analysis of text and unstructured data is a useful sub-discipline to distinguish, and its practitioners do lean toward Python.
I spoke with a number of data scientists to get a flavor of which they use, and why.
- Dave Shirley is a data scientist at a digital marketing agency, was trained in statistics, and uses mainly R. It’s what he learned initially, and he finds it satisfies his needs: “We use R for ad-hoc analysis, regular production of Excel and HTML reports, dashboarding (HTML). If there’s something we need to do programmatically, the chances are we will do it using R.”
- Niral Upadhyaya is a data scientist at Elder Research and, likewise, prefers R: “Personally, I probably use R more than Python because it is more familiar or because it was already being used on the project, but I have used both. It really depends on the client, what they have approved, and the type of work we will be doing. For instance, in cybersecurity, I think Python is more prevalent since a lot of tools like Splunk are built upon it. Python also seems to be the choice when the problem needs deep learning. I probably use a mixture of SQL and R or Python for exploration and then I build models in either R or Python.”
- Peter Gedeck is a Senior Data Scientist at Collaborative Drug Discovery, where his work involves data collection and analysis, building and validating models, and finally making the models available to users either as web services or by embedding the whole process into applications. He also teaches the Predictive Analytics in Python series at Statistics.com. “While I used R in the past, most of the functionality I require is now available in the main data science packages in Python. This, together with the availability of excellent domain-specific solutions for chem- and bioinformatics, makes Python my preferred language for data science. I still use R for creating publication-quality graphs using ggplot. However, most of the time, my work requires embedding the analysis or model into a bigger system and in this case, a general programming language like Python is superior.”
- Leanna Kent is a Data Scientist at Elder Research who used to rely primarily on R but now also uses Python. She works mainly in analysis and building models; others help in deployment. “I prefer R, so if I have a choice I will use that. I use Python when my projects require me to. While Python is easier to read, I find it more difficult to code. With all of the different packages (base python, numpy, pandas) keeping track of data types is difficult, and I feel like I often have to hack into a solution. I also prefer the visualization capabilities in R.”
- Grant Fleming is also a Data Scientist at Elder Research whose work focuses on analysis, building models, and building datasets. He uses both R and Python: “I use R for most billable work and data science/modeling tasks, Python for working with neural networks or text data.”
- Andrew Bruce is a Principal Research Scientist at Amazon: “I use both, but now mostly R due to the type of work I’m doing – 1) exploratory analysis, 2) statistical modeling, and 3) experimental design. I use Python for problems with bigger data involving ML and projects that need to be deployed into production. A vast majority of production code at Amazon for data science is based on Python.”
- Ramon Perez is the Director of UK Operations for Elder Research: “I use the Anaconda distribution of Python with Pandas, SciKitLearn, and Plotly. Python is a general-purpose language that is easier to put into production environments for clients, and a lot of serious deep learning development is happening within Python. However, the core statistical modeling packages in R are still the gold standard, especially for time series work and Bayesian techniques.”
- Finally, I also spoke to John Elder, the Founder of Elder Research, not about R vs. Python, but about data scientists and programming more generally: “It’s useful to distinguish the software engineering perspective from the data science perspective. Professional software developers build software for wide distribution and must take the time along the way to ‘harden’ their code, making it robust, efficient and error-free. This is essential for quality deliverables and a very valuable skill. But not necessarily during the discovery phase of a project! For data scientists, most code is written during the trial-and-error research and discovery phase; it would be a waste of precious time to ‘harden’ each iteration and branch along the way.”