The Complete Data Scientist
Nowadays, because of high market demand and the rise of artificial intelligence (AI), many people call themselves “data scientists”. But what does this job title actually mean? Many articles on the internet do a great job of giving a comprehensive checklist of the computational skills necessary to be a data scientist, but to me this view is too narrow. My approach in this short article is to give a brief overview of three general areas that I feel constitute a complete data scientist. They are not listed in any preferential order.
-
Computer skills
I list below a basic core set of tools that allow a data scientist to do his or her work and to stay current:
- R, especially the tidyverse packages, and the ability to use R Markdown.
- Python, and the ability to use JupyterLab. There are many popular libraries you can learn as you go (e.g., NumPy, Pandas, Matplotlib, etc.).
- The ability to use a Git + GitHub workflow.
- The ability to find your way around one of the Big Three cloud computing platforms (e.g., AWS). This also includes a basic familiarity with Spark (Databricks is a nice platform).
- If you are going to do a lot of data engineering, then knowing how to use SQL is important.
This list will undoubtedly not satisfy everyone, but it is not meant to be exhaustive. Because of my earlier training, there are still times when I find myself reaching for SAS for various reasons. You might also find yourself having to pick up a skill depending on where you work. For example, you might find yourself working for a company that will insist you produce graphics with Tableau or program in Ruby.
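To make the SQL item in the list above a little more concrete, here is a minimal sketch of the kind of query a data scientist writes all the time: a grouped summary. It uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and values are made up purely for illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of clinic visits (names and values are illustrative only).
conn.execute("CREATE TABLE visits (clinic TEXT, wait_minutes REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("north", 12.0),
        ("north", 18.0),
        ("south", 30.0),
        ("south", 20.0),
        ("south", 25.0),
    ],
)

# Count the visits and average the wait time within each clinic.
rows = conn.execute(
    """
    SELECT clinic, COUNT(*) AS n, AVG(wait_minutes) AS mean_wait
    FROM visits
    GROUP BY clinic
    ORDER BY clinic
    """
).fetchall()
conn.close()

for clinic, n, mean_wait in rows:
    print(clinic, n, mean_wait)
```

The same SELECT ... GROUP BY pattern carries over to most production databases, although the details of each SQL dialect differ.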
-
The ability to apply statistics and solve problems.
It is important to be able to apply a broad set of statistical skills. You will use basic inferential statistics (e.g., t-tests and simple linear regression) more than you think and deep learning less than you think.
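As a rough illustration of how little machinery these basic tools need, the sketch below computes a two-sample t statistic and fits a simple linear regression using only Python's standard library. I have assumed the Welch (unequal variance) form of the t-test, and the data are made up for illustration. In real work you would more likely call an established statistics package, which would also give you p-values and confidence intervals.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's two-sample t statistic and its approximate degrees of freedom."""
    # Squared standard error contributed by each sample mean.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def simple_linear_regression(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Made-up measurements for two groups (illustrative only).
group_a = [5.1, 4.9, 5.6, 5.8, 5.0]
group_b = [4.2, 4.8, 4.5, 4.1, 4.4]
t_stat, df = welch_t(group_a, group_b)

# Made-up predictor and response (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = simple_linear_regression(x, y)

print(f"t = {t_stat:.2f} on about {df:.1f} degrees of freedom")
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

Writing the formulas out by hand once is a useful check that you understand what the packaged functions are doing for you.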
Having a basic understanding of some or all of the classic fields of statistics, like sampling, experimental design, time series, survival analysis, spatial statistics, Bayesian statistics, categorical data analysis, nonparametrics, and multivariate statistics is important. Why? Because these fields can be used for many problems and you never know what sort of problem you will be tasked to work on.
Consulting practice and experience analyzing actual data are like gold, and the more of both you get, the stronger a data scientist you will become. They teach you to stay curious, think critically, solve difficult problems, and communicate results to people who often know very little about statistics. Keep studying and evolving in your ability to apply statistics to solve problems. There are some good online courses hosted on Udemy and DataCamp, for example. Just take care to make sure that the course is taught by someone who is reasonably credentialed. There are also some great books out there that you can use to teach yourself that cool machine learning skill you’ve been hearing about :)
-
Having a background in statistical theory and math.
It has been my experience that many people who call themselves data scientists are strong with computers, but when it comes to statistics, they have a limited understanding of what they are doing and why they are doing it. This is unfortunate: what happens when you face a tough problem that has no solution worked out on the internet, or you have to derive a tricky result on your own? Or when you encounter an answer that doesn’t make sense or should be challenged?
Because of this, I think having an MS or PhD in Statistics or Data Science is crucial, and it should come from a graduate program nicely stocked with some basic math courses (e.g., calculus and linear algebra).
Ok, let’s suppose you’ve worked hard and now check the box in all three areas. What about once you are in the workforce? There are some great annual conferences that allow you to stay current and attend technical workshops as part of your continued professional development. For example, The Symposium on Data Science and Statistics comes to mind. The American Statistician comes out four times a year and gives about as nice and comprehensive an overview of what is current in statistics and data science as you will find. I highly recommend reading this journal. If possible, it is also worth the time to continue publishing peer-reviewed papers. These do not have to be only in high-impact theoretical journals within the statistics and data science realm. A solid paper evidencing an interdisciplinary collaboration can only help your professional credibility.
In closing, I think it is important to have a balanced blend of all three of these areas. To be a complete data scientist, you cannot be strong only in computer skills (area 1) while ignoring how to apply statistics to solve problems (area 2) and a basic understanding of statistical and mathematical theory (area 3). Some people do not like to hear this because it is frankly harder to transition from application to theory than it is to transition from theory to application. To illustrate the point, I was informed the other day that a very large company I have worked for in the past currently keeps several mid- and senior-level statisticians and data scientists on staff to focus on areas 2 and 3 only. As for actually turning the crank on the data analysis (area 1), all of that is now shipped overseas at greatly reduced cost. Think about it.
So that is the short definition of a complete data scientist I use in my own professional career and it has served me well so far. These are the skills I’ll be looking for when it comes to potential new data scientist hires for my team in CORE at Atrium Health.