I recently received an inquiry on how to get started in the currently hot field of data science. Perhaps a better question is: should you get started in data science?
What is data science?
Data science is vaguely defined. In many respects, it is what used to be called statistics and data analysis. Several developments have produced a flood of data: ultra-fast computers; cheap, portable sensors such as magnetometers (compasses), accelerometers (acceleration and gravity measurements), and gyroscopes (orientation); huge, compact data storage; ultra-fast networks; and web sites instrumented with detailed tracking of customers and visitors. The effort to analyze that data to make more money is data science.
Data science often encompasses “traditional” statistics and data analysis methods such as analysis of variance, maximum likelihood estimation, and both linear and non-linear regression, which is commonly performed by least squares fitting. Data science also often refers to more “modern” methods such as “machine learning” and “deep learning,” the artificial intelligence technique formerly known as artificial neural networks (ANN). Many of the methods labeled as machine learning and deep learning essentially fit mathematical models with large numbers of adjustable parameters (sometimes hundreds of thousands) to large data sets (sometimes petabytes of data, where one petabyte is one thousand trillion bytes).
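To make the idea of fitting a model with adjustable parameters concrete, here is a minimal sketch of least squares fitting for the simplest case: a straight line with two adjustable parameters. It is an illustration only. The function names `fit_line` and `sum_squared_residuals` and the sample data are hypothetical, not taken from any particular library or data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns the (slope, intercept) pair that minimizes the sum of
    squared vertical distances between the line and the data points.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x about its mean, and co-variation of x with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """The quantity that least squares fitting minimizes."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


if __name__ == "__main__":
    # Made-up, slightly noisy data lying roughly along y = 2x + 1.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 2.9, 5.2, 6.8, 9.0]
    slope, intercept = fit_line(xs, ys)
    print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
    print(f"sum of squared residuals = "
          f"{sum_squared_residuals(xs, ys, slope, intercept):.4f}")
```

Machine learning and deep learning methods typically apply the same basic idea, adjusting parameters to reduce the mismatch between model and data, to models with vastly more parameters, where no simple closed-form solution like this one is available.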
Data science, machine learning, and deep learning have all been extremely “hot” for the last several years. Deep learning, in particular, has received a great deal of attention, with highly publicized reports of success in playing the board game Go, a game that defies the brute-force, try-every-possible-move approach that ultimately enabled computers to play chess at the world champion level, as well as reports of successes in face recognition, speech recognition, and other areas.
Published reports in the last two years have claimed that deep learning (artificial neural networks) has matched or exceeded human-level pattern recognition in a number of areas. Usually this is said to require a supercomputer built from large numbers of GPUs or other high-performance computers or computer chips; by these accounts, you can’t perform human-level deep learning on your laptop or smartphone.
Google and some other super-unicorn technology companies are said to be hiring deep learning experts with substantial academic or professional track records for very high salaries, stock options, and other perks, well beyond most software engineering salaries. (A unicorn is usually defined as a privately held startup company valued at one billion dollars or more. Google is not really a startup, although many Googlers seem to think it is.)
It should be noted that most data science positions do not involve deep learning, and many companies, startups, and other potential employers currently lack the resources to construct the supercomputers that supposedly run the deep learning systems. However, a great many companies now have huge stores of data collected from customers and web site visitors. Six-figure and sometimes high-six-figure salaries are reported for some of these more common data science jobs.
Many of these six-figure jobs are located in Silicon Valley or other expensive regions with very high rents and home prices. However, even after adjusting for the high cost of living in these regions, the salaries are still substantial, just not as eye-popping as they might seem to someone in a less expensive region such as Texas.
Should you get started in data science?
The answer is not a simple one. It depends on the amount of relevant background that you have compared to competitors for the data science positions and it also depends on the degree to which data science is an employment bubble.
It is important to realize that statistics and data analysis are taught and practiced in many traditional graduate research fields including experimental particle physics (high energy physics), econometrics, actuarial science, biology, social psychology, and many others. Most of these programs, possibly all, produce far more Ph.D.’s with these skills than there are stable long-term positions in these fields. Approximately ninety to ninety-five percent of people who earn Ph.D.’s in these fields ultimately leave them, often for some type of software engineering or sometimes biotechnology or health. There is, in fact, fierce, highly qualified competition for the relatively small number of data science positions.
Most data scientists that I have met or heard of have Ph.D.’s in some quantitative or semi-quantitative field. Those without a Ph.D. have other impressive backgrounds. The few data science boot camps typically claim that, to be responsible, they accept only students with Ph.D.’s or other strong math and statistics backgrounds.
On the other hand, it is reasonable to project current trends forward and argue that the number of data science positions will grow over the longer term, following the trend toward more and more data.
However, this projection assumes that tools won’t be developed to automate and de-skill much of the currently labor-intensive work of statistics and data analysis. At present, analysis tools such as the R statistical language, Python (with NumPy, SciPy, and toolkits such as scikit-learn), MATLAB, and Mathematica require substantial human intervention to produce valid results. A highly skilled analyst must select the appropriate statistical methods and tests, assess how the data was collected and measured and how that affects the purely mathematical issues in the analysis, and perform other tasks that have proven difficult to automate.
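As a small illustration of why the choice of method still needs human judgment, the sketch below summarizes a made-up set of measurements containing one suspicious value. The numbers and the variable name `measurements` are hypothetical; only the standard Python `statistics` module is used.

```python
import statistics

# Hypothetical measurements; the last value may be a glitch or a real event.
measurements = [48, 50, 51, 52, 49, 500]

# The mean is pulled strongly toward the single extreme value.
mean_value = statistics.mean(measurements)

# The median is largely unaffected by it.
median_value = statistics.median(measurements)

if __name__ == "__main__":
    print(f"mean   = {mean_value}")
    print(f"median = {median_value}")
```

The software computes both summaries without complaint (a mean of 125 and a median of 50.5 here). Deciding which one is appropriate depends on knowing how the data was collected, for example whether the extreme value is a sensor glitch or a genuine observation, and that is exactly the kind of judgment that has proven difficult to automate.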
Also, the current data science frenzy resembles many employment bubbles that have swept through STEM (science, technology, engineering, and math) fields at least since World War II. To give a well-known example, Computer Science (CS) enrollments at US colleges and universities soared in the late 1990’s during the dot-com boom. Many of those students graduated after the dot-com bust in 2000 and were unable to find jobs, in some cases after taking on four years of expensive student loans.
After the surprise launch of Sputnik on October 4, 1957, the United States poured huge sums of money into science education generally and physics education specifically. The result was a bumper crop of Ph.D. physicists in the late 1960’s, far more than there were jobs for physicists, and a huge bust in 1969 and the early 1970’s. A similar physics employment bubble occurred in the 1980’s with the Reagan Era defense buildup, followed by a dramatic bust in about 1993 after the Cold War ended, the Superconducting Super Collider (SSC) project was cancelled during the Clinton Administration, and other cutbacks took effect.
Much of the current financing for data science, machine learning, and deep learning is speculative, coming from venture capital funds and experimental projects within large established companies such as Google. Google makes nearly all of its revenue and profit from advertising. While Google’s Go-playing deep learning system AlphaGo may be technically impressive, there is little money in playing Go.
Almost certainly, some — perhaps many — machine learning and deep learning claims will prove to be hype as has happened with previous technology fads. What the field will look like after the current wave of hype subsides remains to be seen.
Thus, anyone considering investing time and money in data science education should consider that they might graduate into a very hostile job market in a few years if the data science bubble bursts.
The bottom line is that persons with strong, up-to-date, relevant skills in statistics and data analysis should consider cashing in on the data science boom. The weaker your skills and the more training you need, the higher the risk of investing time and money in breaking into data science. Borrowing substantial amounts of money in the form of student loans or, even worse, credit card debt to finance a data science degree or certificate is especially questionable. Only after assessing these risks should you ask: how do I get into data science?
A good article/blog post on getting into data science is “Getting started in data science” by Trey Causey (dated June 7, 2014).
© 2017 John F. McGowan
About the Author
John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing gesture recognition for touch devices, video compression and speech recognition technologies. He has extensive experience developing software in C, C++, MATLAB, Python, Visual Basic and many other programming languages. He has been a Visiting Scholar at HP Labs developing computer vision algorithms and software for mobile devices. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at firstname.lastname@example.org.