Some historical background
One of the first scientific papers on data science was written by John W. Tukey of Princeton University and Bell Laboratories and submitted for peer review in 1961. It was titled The Future of Data Analysis (Tukey: 1962). Tukey was writing at a time when the first computers had already been built, some of them smaller than a building, but the challenges of computation and the new waves of data were about to bring a transformation (one that gave birth to spatial BigData, among others). One of his main contributions was to argue for the tools, and the positive attitude, needed to process large volumes of information. He also argued that data analysis should be adopted as an experimental science rather than seen as a deductive logical system. That seems obvious today, but it was revolutionary at the time: there were no personal computers and therefore no laboratories for data analysis. The last sentence of his article is intense:
Who is for the challenge?
Then, in 1974, the book Concise Survey of Computer Methods by Turing Award winner Peter Naur (the Turing Award is the so-called Nobel Prize of computer science) mentioned the concept of “data science” for the first time, opening the way for research into what we now call “data scientists”. Over the years, books, articles and technical reports have accumulated, describing increasingly innovative ways of treating data; these have gradually separated from statistics and made room for computer theorists. In this sea of literature, one definition stands out as setting the paradigm for the subject:
Data science is the link between traditional statistical methodologies, modern computer technology, and specific expert knowledge to turn data into information and knowledge.
This definition is also important because of who created it. It comes largely from the mission statement of the International Association for Statistical Computing (Saunders: 2013).
The data scientist
The concept of the data scientist has evolved beyond academic research. It has strong business roots, since the position of data analyst existed during the last century. A data or business intelligence analyst performs tasks such as complex database queries and series of groupings or aggregations, and focuses on basic descriptive statistics along with their graphical representation. Data scientists can, in theory, perform these same activities and also carry out prediction and classification using data mining and machine learning. To these skills, some authors add the ability to work with large volumes of data and the use of contextual knowledge to perform expert analysis (Saunders: 2013).
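To make the analyst side of this comparison concrete, here is a minimal sketch of a grouping followed by basic descriptive statistics, using only the Python standard library. The `sales` records and the `summarize_by` helper are hypothetical, invented purely for illustration; they do not come from any of the cited works.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records, invented for illustration only.
sales = [
    {"region": "north", "amount": 120.0},
    {"region": "north", "amount": 80.0},
    {"region": "south", "amount": 200.0},
    {"region": "south", "amount": 100.0},
    {"region": "south", "amount": 150.0},
]


def summarize_by(records, key, value):
    """Group records by `key` and compute descriptive statistics of `value`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row[value])
    return {
        name: {
            "count": len(values),
            "total": sum(values),
            "mean": mean(values),
            "median": median(values),
        }
        for name, values in groups.items()
    }


summary = summarize_by(sales, "region", "amount")
for region, stats in sorted(summary.items()):
    print(region, stats)
```

A prediction or classification step, the part usually attributed to the data scientist, would start from the same grouped data but fit a model to it rather than only describing it.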
These large volumes and their data sources are at the heart of the concept of BigData: structured or unstructured data whose size can change dynamically and which can be analyzed and stored using different techniques and at different speeds. By speed we mean not only computing capacity but also content delivery networks (CDNs). Examples abound, but the literature focuses mainly on quantitative problems.
One of the best studies of data scientist definitions was conducted by researchers at the University of Wollongong in Australia (Chatfield: 2014). They scanned databases of scientific literature, showed the annual growth of related research, and used that information to compile an interesting list of definitions from academia and industry. All the definitions converge on a figure who answers business or research questions with a combination of statistical, computational and domain-specific skills.
One of the most interesting details of the University of Wollongong article is that it lists statistics as skill number eight, below many other skills a data scientist should have. The first two positions are held by business knowledge and computer science. It should be noted that the researchers, although they make an enormous effort to collect previous work, belong to the "Faculty of Engineering and Information Sciences". What will statisticians think?
An unmissable article on the definition of the data scientist that we want to recommend is “The Sexiest Job of the 21st Century” (Harvard Business Review Magazine: 2012); it is only a five-minute read!
What does a data scientist study?
In our opinion, any scientist can be a data scientist in practice, but only some specific profiles focus on prediction or classification techniques and technologies. These subjects tend to be handled by (but are not limited to) statisticians, librarians, IT engineers, industrial engineers, and the like. We believe that any scientist can become a data scientist because, in most research areas, scientists must carry out data processing and knowledge generation themselves.
In many cases, experts in different areas, for example lawyers, sociologists or geologists, process data over which only they have control. In research, the best combination is usually that of experts who know the data working with statisticians, mathematicians, or engineers who perform the computational processing. It is a myth that a data scientist must be exclusively a statistician or an IT engineer, although it is true that such profiles can arrive much faster at specific solutions in classification, analysis, prediction and storage.
Another gigantic challenge for a data scientist is obtaining information. The supply of datasets is enormous, and more and more providers offer their datasets through REST APIs, so it is necessary to understand some key data processing concepts and specific technologies in order to be effective in a BigData world. JSON, REST APIs, SQL, storage, cloud computing and other concepts are part of the list.
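As a rough illustration of how a few of these concepts fit together, the sketch below parses a JSON document shaped like the body a REST endpoint might return, stores it in an in-memory SQLite table, and summarizes it with SQL. No real API is called: the `payload` text, its field names and the `load_readings` helper are all assumptions made up for this example.

```python
import json
import sqlite3

# Hypothetical JSON body, shaped like what a REST endpoint might return.
# No real API is called here; the fields are invented for illustration.
payload = """
[
  {"station": "A", "day": "2024-01-01", "temp_c": 21.5},
  {"station": "A", "day": "2024-01-02", "temp_c": 23.5},
  {"station": "B", "day": "2024-01-01", "temp_c": 18.0}
]
"""


def load_readings(json_text):
    """Parse the JSON text and store its rows in an in-memory SQLite table."""
    rows = json.loads(json_text)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (station TEXT, day TEXT, temp_c REAL)")
    # Named placeholders are filled from the keys of each parsed JSON object.
    conn.executemany(
        "INSERT INTO readings VALUES (:station, :day, :temp_c)", rows
    )
    conn.commit()
    return conn


conn = load_readings(payload)
averages = conn.execute(
    "SELECT station, AVG(temp_c) FROM readings GROUP BY station ORDER BY station"
).fetchall()
print(averages)
```

In a real project the JSON text would come from an HTTP request to the provider's API, and the table would more likely live in a persistent or cloud-hosted database than in memory.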
In addition, the supply of online data science courses on the main MOOC platforms has grown immensely (driven mainly by cloud platforms).
John W. Tukey, “The Future of Data Analysis”, The Annals of Mathematical Statistics, Vol. 33, No. 1, 1962, pp. 1-67. Published by: Institute of Mathematical Statistics. Stable URL: [http://www.jstor.org/stable/2237638]
Peter Naur, “Concise Survey of Computer Methods”, 397 p., Studentlitteratur, Lund, Sweden, 1974. ISBN 91-44-07881-1.
“Data Scientist: The Sexiest Job of the 21st Century”, Harvard Business Review Magazine: [http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1]