What does a data scientist actually do? Here at Good Rebels we wanted to outline a profile of this new profession, with the help of various industry leaders from academia, business and institutions. In short, we concluded that the main tasks of a data scientist are to identify data, transform it when incomplete, categorize it, prepare it for analysis, perform the analysis, visualize the results and communicate them.
1. 1 Data Scientists: Who are they? What do they do? How do they work?
Data Scientists:
Who are they?
What do they do?
How do they work?
“The sexiest job in the next ten
years will be that of the statistician.
People think I’m joking, but who
would’ve guessed that computer
engineers would’ve had the sexiest
job of the 1990s?”
Hal Varian, October 2008.
Introduction: Data Scientist, the sexiest job of the decade
- Data, data and more data
- A little bit of history
1. Where do Data Scientists come from?
- Understanding the role of each specialist
2. Data Scientists: seeking their place in the organizational chart
- The data was already in-house
- Are companies ready to listen to the Data Scientist?
3. Who needs a Data Scientist?
4. The Data Scientist skill set
- Technical skills
- Above and beyond technical skills
- How to choose your data scientist
- Struggling to find a data scientist? Train them in-house
- Supermen and superwomen? No, super teams!
5. The Data Scientist’s tools
- Data processing system construction, databases, visualization,
and data wrangling tools
- Open source or proprietary software?
6. Getting down to it: the work process
- Three obstacles to overcome before accessing data
- From data to decision... if nothing goes wrong
7. Evaluating the Data Scientist’s work
8. Trust: an essential component in the process of data science
- Ethics: science’s essential accessory
9. Data scientists in Spain today
- Who’s making the most out of data science in Spain?
10. Conclusions: still a great deal to be done
- What does the adulthood of big data look like?
The data scientist is a sort of
mix between a programmer,
an analyst, a communicator
and an adviser. A very difficult
combination to come across.
Data scientist, the sexiest job of the decade
The figure of the data scientist first emerged in the early twenty-first century. A decade
after the widespread business adoption of the Internet, Hal Varian, chief economist at
Google, predicted in an interview in October 2008: “The sexiest job in the next ten years will be
statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve
had the sexiest job of the 1990s?”
Varian, also a professor at the University of California, Berkeley, was one of the first to
recognize the strategic importance of extracting information from data, and not just
at a corporate level. “The ability to take data - to be able to understand it, to process it, to
extract value from it, to visualize it, to communicate it - that’s going to be a hugely important
skill in the next decades. And not only at the professional level, but even at the educational
level for elementary school kids, for high school kids, for college kids. Because now we really do
have essentially free and ubiquitous data. So, the complementary scarce factor is the ability to
understand that data and extract value from it”.
The truth is that in 2008 a few companies had already incorporated the position in order
to manage a volume of information hitherto unknown, due to its variety and sheer scope,
in a quest for findings relevant to the business. Until then nobody had called them “Data
Scientists”. The first to do so were DJ Patil and Jeff Hammerbacher, then heads of Data
Analytics at LinkedIn and Facebook respectively.
Eight years later, in 2016, with an increasing volume of data generated on a daily basis,
Varian’s predictions are more pertinent than ever. According to the McKinsey Global
Institute report “Game changers: Five opportunities for US growth and renewal”, the big
data industry in the United States could increase annual GDP by 325 billion dollars by
2020. According to the same report, the United States alone will face a shortage of up to
190,000 data scientists and 1.5 million professionals with enough proficiency to use big
data effectively. Between 2010 and 2020, the number of companies seeking to incorporate
the figure of a data scientist will grow by 18.7%, according to the EMC study “The
Digital Universe in 2020”. An estimated 40,000 exabytes of data will be created by 2020,
underlining the need for organizations to incorporate talent to conduct in-depth analysis of
information.
In reality, many companies (the biggest or the most pioneering ones) have already
incorporated the figure of data scientist in any one of its variations. Their sudden
appearance in the business world and the high demand for these professionals expected
over the coming years confirm that there is a growing need to process large volumes of
information and transform it into a valuable asset, given that data “in its raw state” is
not useful for companies. Only an in-depth analysis offers the chance to reveal patterns
and trends, which at the same time streamline business processes and optimize decision-
making. This is where data science emerges as the process that enables the collection,
preparation, analysis, visualization, management and preservation of large volumes
of data. Extracting valuable information from all types of sources provides solutions to
a company’s vital strategic issues, such as those related to time and cost savings, new
product development, the optimization of offers and faster and more accurate decision-
making processes.
But what does a data scientist actually do? Here at Good Rebels we wanted to outline
a profile of this new profession, with the help of various industry leaders from
academia, business and institutions. In short, we concluded that the main tasks of
a data scientist are to identify data, transform it when incomplete, categorize it,
prepare it for analysis, perform the analysis, visualize the results and communicate
them. To do this, the data scientist must have technical training in programming,
data management, statistics and data mining. And let’s not forget, aside from the
analytical part, the ability to focus on creating value for the company. This is why, in
a competitive scenario where challenges are constantly renewed and data doesn’t stop
flowing, the data scientist’s work enables managers to move from an ad hoc analysis to
an ongoing conversation with the data.
What kind of person is able to perform this task? The data scientist is a mix between a
programmer, an analyst, a communicator and an adviser, with proficiency in statistics,
technology, math and data architecture, and all this without forgetting human qualities.
Is that a very difficult skill set to find in one person? Probably so, simply because there
are not many people who can do all that.
So basically, we’re talking about a well-rounded jack-of-all-trades proficient in
mathematics, IT and data architecture, knowledgeable about business, with strong
communication skills as well as empathy... Given the practical impossibility of finding
such a person on the market, professionals refer to this ideal with labels such
as “El Dorado”, “Unicorn”, “The Data Science Superhero”, “The Dark Beast” or “The
New Renaissance Man”. An extremely powerful combination... and very hard to find,
because demand is growing and such professionals are in short supply. The solution:
training, retraining and building teams whose members together cover a
profile like the one described.
Read more:
Hal Varian interview at McKinsey.com
DJ Patil Biography
Building Data Science Teams, at Amazon.com
Data, data and more data
With countless services and connected devices, it is estimated that 90% of all data has been
generated in the last two years. In other words, more data has been created in that time than
in all of previous human history. This is also very good news for anyone who specializes in
data management and processing: they’ll probably never be short of work for the rest of
their lives.
Numerous indicators illustrate this spectacular explosion of data. For example:
- In 2020, 1.7 MB of information will be created every second for every human
being, according to EMC forecasts.
- Information is constantly being generated, which someone needs to monitor.
For example, on Google alone there are 40,000 searches every second.
- Facebook is another behemoth when it comes to data generation. Every minute,
its users send an average of 31.2 million messages and watch 2.77 million videos.
- In May 2016, Facebook and Microsoft began laying a 6,600-km underwater
cable between Europe and the US, capable of transmitting 160 TB of data per
second.
- 80% of photos will be taken with smartphones in 2017. A high percentage of
them will be shared via the Internet.
- It is estimated that in 2020 more smartphones will be in use than landlines, with
a total of 6.1 billion users worldwide.
- Also in 2020, there will be 50 billion smart devices in use worldwide, all
collecting, analyzing, and sharing data. A third of data will travel through the
cloud.
- 80% of data generated today is unstructured. This includes data found in emails,
spreadsheets, social media, the Internet, etc.
- The market for Hadoop (an open-source software framework used to store and
process large data sets across clusters of networked computers) will grow at an
annual rate of 58%, exceeding 1 billion dollars in value in 2020.
- For an average company on the Fortune 1000, an improvement of just 10% in
data accessibility will result in over $60 million of additional net income.
- Businesses that make full use of the potential of data could boost their operating
margins by up to 60%.
- Perhaps the most mind-boggling fact, and one that highlights the enormous
potential that lies ahead for the big data industry: according to MIT, less than
0.5% of all data generated right now is analyzed.
Read more:
Big Data: 20 Mind-Boggling Facts Everyone Must Read
Internet Live Stats
Big data: The next frontier for innovation, competition, and productivity
A little bit of history
The Cyclopædia of Commercial and Business Anecdotes, published in 1865 by Richard Miller
Devens, contains the first recorded reference to the term “business intelligence”. The
author described how a banker, Sir Henry Furnese, succeeded by having an understanding
of market conditions before his competitors: “Throughout Holland, Flanders, France, and
Germany, he maintained a complete and perfect train of business intelligence. The news…was thus
received first by him”, Devens writes. Furnese ultimately used this advance knowledge to
duplicitous ends and became renowned as a corrupt financier. However, he can be credited
for sowing the seeds of business intelligence.
Technology did not advance to the point where it could be considered an agent of business
intelligence until well into the 20th century. The first commercial computers arrived in the
United States in the 1950s. Hans Peter Luhn, a pioneering researcher at IBM, published in
1958 the article “A Business Intelligence System”, in which he defined business intelligence as
“the ability to apprehend the interrelationships of presented facts in such a way as to guide
action towards a desired goal”.
Luhn contemplated the development of an automatic and intelligent system, built on
document processing equipment, capable of designing target-specific action guidelines
for the various sections of any organization. With this article, Luhn, considered the father
of business intelligence, laid the foundations for information analysis and distribution to
serve the needs of a company.
It wasn’t until three decades later, in 1989 to be exact, that the analyst Howard Dresner
brought the modern definition of business intelligence into the common vernacular.
Encompassing somewhat cumbersome-sounding concepts related to data storage and data
processing, Dresner summed up the idea of business intelligence as “concepts and methods
to improve business decision-making by using fact-based support systems”.
From the 2000s, the intersection between different technologies and business needs
prompted new concepts and terminologies: data engineering, business analytics, data
mining, etc. There is currently no clear consensus on exactly where the skills of each of
these disciplines begin and end, nor to what extent some overlap with others. But what’s
clear is that they all coexist under the umbrella of big data.
Read more:
Richard Miller Devens - Cyclopædia of commercial and business
anecdotes
Hans Peter Luhn – A Business Intelligence System
Howard Dresner’s blog
The following people have
participated in this study:
Bosco Aranguren
Chief Marketing Officer, Microsoft Iberia
CMO at Microsoft Iberia since March 2017. Previously, he was
responsible for Programmatic Media Buying at Google. He joined
Google in 2010 as Industry Head Automotive, and in 2012 he
became Industry Head CPG Entertainment.
Álvaro Barbero
Chief Data Scientist at Instituto de Ingeniería del Conocimiento (IIC)
Expert in the fields of machine learning, optimization and
algorithm engineering. His work is to transform advances in
these areas into practical Big Data systems, from predictive and
recommender systems to automated text analysis and resource
optimization.
Richard Benjamins
Director of External Positioning Big Data, LUCA: Data-Driven Decisions
Director of External Positioning Big Data for Social Good at
Telefonica in Telefonica’s Chief Data Office. In his previous position
of Group Director BI Big Data he was responsible for internal
exploitation of Big Data across Telefonica. He was also Director of
Business Intelligence at Telefonica Digital, and before that he was
Director of User Modelling where he led Global BI programs.
Fuencisla Clemares
Country Manager at Google Spain & Portugal
Joined Google in 2009 as Manager of Retail and Consumer
goods; after that, she led the Telecommunications, Banking and
Insurance sectors, along with the mobile strategy for Spain. Prior
to Google, she worked for seven years as a strategic consultant at
McKinsey & Company, and later became Director of Purchasing in
the Carrefour home division.
Manuel Marín
Data Analytics Manager, PwC
Data Analytics Manager at PwC. Before that, he was Chief
Technical Officer at APARA, and applied predictive analytics
in telco, banking, insurance, energy, health, sports and retail
companies in the areas of fraud detection and customer
intelligence.
Esteban Moro
Associate Professor at Universidad Carlos III de Madrid
Esteban is a professor at Universidad Carlos III de Madrid, a
member of the Joint Institute UC3M-Santander on Big Data and
academic director of the Master of Data Science and Big Data
in Finance by AFI. He serves as a consultant for many public and
private institutions. His areas of interest are applied mathematics,
financial mathematics, viral marketing and social networks.
Felipe Ortega
Director of the Master in Data Science at Universidad Rey Juan Carlos
Assistant Professor in the Department of Theory on Signal and
Communication and Telematic Systems and Computing, School of
Telecommunications Engineering at Universidad Rey Juan Carlos (Madrid).
He is co-founder of the Data Science Lab at the Center for Intelligent
Information Systems (CETINIA) and Academic Director of the Master
in Data Science at URJC. His main areas of research are data engineering,
computational statistics, machine learning, quantitative methods, open
source software, large-scale data management and data visualization.
Pep Porrà
Business Performance Director, King.com
Business Performance Director at King.com, where he leads a team
of Data Scientists and Business Performance managers focused on
evaluating, anticipating and understanding the monetization impact of
game features. Before moving to the corporate world, he was a Statistics and
Mathematics Professor at the University of Barcelona.
Alejandro Rodríguez
Professor at Universidad Politécnica de Madrid
Professor at the Department of Computer Languages and Systems
and Software Engineering at UPM. Specialized researcher in the fields
of medical informatics, knowledge representation, expert systems
and semantic web.
Marcelo Soria
Partner at Tramontana.co
From mid-2016, partner at Tramontana.co. Between May 2014 and
May 2016, he was VP of Data Services at BBVA Data Analytics, and
before that he was Big Data // Smart Cities initiative co-leader at
BBVA.
1. Where do data scientists come from?
Where are data scientists?
More than half of these professionals are
concentrated in the United States. Spain
is ranked as the eighth country in the
world with the highest number of data
scientists in employment.
“The State of Data Science”, Stitchdata.com
DJ Patil, currently Chief Data Scientist for the US government, was the first to coin the
term “data scientist”, during his tenure at LinkedIn. But nearly a decade later, there is still
some controversy about its exact meaning, and whether or not this role differs from that
performed by data analysts in companies for many years now.
For some, the origin of data science lies in machine learning, the branch from which
prediction and classification models have been developed. Professionals trained in this
discipline were mainly mathematicians who also had the programming skills needed
to implement and test predictive models, since machine learning is an applied rather
than a purely theoretical branch of mathematics.
The huge change in the amount of data being handled by organizations is the main
driving force behind the new profile. If elements such as big data and machine learning
are added to traditional data analytics, we may well be talking about a new theoretical
discipline - and also job category - whose terms are being defined virtually at the same
time as the market creates demand. What distinguishes a data scientist is a different,
more scientific type of training, which allows them to use the very latest techniques
to work with massive data sets, improving not only the depth of exploration but also its
speed. It is a profile that combines academic training with professional experience.
Due to the current lack of consensus on their characteristics and skills, there is a wide
spectrum of professionals included in the category of data scientist. It is important, though,
that they meet a set of characteristics: they should be able to use their knowledge to
extract non-obvious information from data and empirical evidence, and also present it
in an understandable way.
Each specialist has their place and time
Data science, big data, data analytics... Terms that we’ve been hearing for years now,
but are still somewhat enshrouded in confusion when it comes to their definition and
competencies. What’s involved in each of these disciplines?
First and foremost, it’s important to stress that the role of data scientist is different from
that of an analyst who designs models or forecasts. The data scientist is not only expected
to explain the effect that the data will have on the company’s future, but also to provide
solutions that help the company to grow, both in the present and in the future.
“You cannot communicate a
relevant decision in your business
if you are not able to explain how
you got it, what data you have
used, and what processes you have
followed to break it down.”
Esteban Moro.
Data science
- Data science is a field that encompasses everything related to the cleaning
(curation), preparation and analysis of data, whether structured or unstructured.
- Data science consists of a medley of statistics, mathematics and programming,
peppered with problem-solving, data extraction using as much ingenuity as required
and the ability to scrutinize a problem from different perspectives.
- The data scientist shifts business cases to an analytical plane, develops hypotheses
and patterns, and evaluates their impact on the business. This deep analysis has
the ultimate goal of solving complex business issues efficiently and anticipating
future needs.
Big Data
- Big data refers to huge volumes of data, proprietary or third-party and usually non-
aggregated, the size of which prevents it from being processed effectively using
traditional applications.
- Big data is a term that is gaining more and more ground in firms and industries.
The analysis of data trends using sophisticated algorithms and other cutting-edge
information processing methods ultimately improves strategic decisions that are a
driving force behind business.
Data analytics
- Data analytics uses data to examine market and business trends, and to develop or
improve methods linked to productivity and cost reduction.
- The essence of data analytics is inference, which is the process of drawing
conclusions based solely on what the researcher already knows.
- Data analytics is used in many industries to help companies improve decision-
making, as well as to verify or refute existing theories and models.
“The next big challenge in the
gaming industry is to create smart
systems. To convert data into new
value for the company”.
Pep Porrà.
A hypothetical case will let us see the different processes involved in a data science project.
Let’s imagine that every day millions of images are uploaded to a restaurant review site and
they need to be catalogued: are they pictures of food? What kind of food? Or are they of a
restaurant? Of the outside or the inside?
Machine learning automatically classifies each image into its respective category.
Properly “trained”, a computer can figure out, for example, if the photo of a restaurant is
of the inside or the outside. The data scientist oversees the entire project, from selecting
the right algorithm to engineering design.
- The data scientist creates the model which allows the computer to make this
distinction, using different sources of information ranging from manually classified
images to keywords in screenshots.
- Using data engineering techniques, a data feed and storage system is created, to
which algorithms are applied on a large scale.
- Finally, the business implications of the innovation are analyzed: is it useful
for the business? Will it help the website generate more traffic? And so on.
The findings are then presented using visualization tools.
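To make the modelling step more concrete, here is a minimal sketch in Python of how manually classified examples can be used to "train" a computer to categorize new images. It is only an illustration under stated assumptions: the training data, the category names and the idea that each image arrives with a few keywords are all hypothetical, and the technique shown (a small naive Bayes classifier over keywords) is just one simple option, not necessarily what a real review site would use. A production system would most likely learn from the image pixels themselves.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, hand-labelled training data: a few keywords associated with
# each image (for example, taken from captions) and the category a person
# assigned to it.
TRAINING = [
    (["pizza", "cheese", "plate"], "food"),
    (["pasta", "plate", "sauce"], "food"),
    (["tables", "chairs", "bar"], "interior"),
    (["bar", "lamps", "tables"], "interior"),
    (["terrace", "street", "sign"], "exterior"),
    (["facade", "sign", "door"], "exterior"),
]


def train(examples):
    """Count how often each keyword appears in each category."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for keywords, label in examples:
        label_counts[label] += 1
        word_counts[label].update(keywords)
        vocabulary.update(keywords)
    return word_counts, label_counts, vocabulary


def classify(keywords, model):
    """Return the most likely category for a list of keywords."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from how common the category is in the training data.
        score = math.log(label_counts[label] / total_examples)
        words_in_label = sum(word_counts[label].values())
        for word in keywords:
            # Add-one (Laplace) smoothing so unseen keywords do not zero out a category.
            score += math.log(
                (word_counts[label][word] + 1) / (words_in_label + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify(["plate", "cheese"], model))  # food
print(classify(["street", "door"], model))   # exterior
```

The more manually classified examples the model sees, the better its guesses tend to become, which is why the labelled images mentioned above are such a valuable source of information.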
2. Data scientists: seeking their place within the organizational chart
“The problem we often find is
that data has been managed in
isolation. And then the time comes
to enable that data and there’s no
communication going on”.
Bosco Aranguren.
The data scientist isn’t a radically new profile that’s being defined from scratch.
Companies have long been resorting to in-depth data analysis as a valuable tool that
helps meet or exceed their goals. What’s changed now is the scale of this analysis:
a greater volume of data calls for a different approach, with regard both to
procedures and the purpose of the analysis.
Many experts stress the idea of rediscovering data, or rather, discovering its value
contribution to the company. The person who used to manage data, target customers or
detect products with the greatest turnover quite clearly added value to the company. But
the data scientist’s role goes much further.
The data was already in-house
It’s true that the figure of data manager has existed in companies for some time now.
Data Analytics has been used in the telecommunications industry for at least 20 years.
Banking also has been using Business Intelligence for several years, as have - somewhat
more mutedly - all major companies at the helm of their respective industries. However,
far from being a cross-disciplinary practice, data analysis has often only been applied
in specific departments, mainly in Marketing, Communication and Customer Insights.
This pigeonholing has to a certain extent undermined its importance within
the hierarchy of company priorities.
The main problem in companies without a data-focused corporate culture is that data has
often been managed in a decentralized and disorganized way. As a result of this siloed management,
each corporate department has been taking technology-related decisions it deemed the
most appropriate at any given time.
Now that the time has come to deal with data, experts are encountering barriers and
incompatibilities that hugely complicate their work. In institutions with enormous
historical repositories, grouping together and processing data files is a colossal effort, but
once this path of self-learning has been completed, the work translates into improvements
in internal processes, people management and/or customer service.
“Technically you can do just
about everything, but the
organization must then be
prepared to use it”.
Richard Benjamins.
The difference when compared to the situation in recent years is that data analytics
specialists now have much more powerful and effective technological resources, allowing
them to extract greater value from the information. Computing costs are lower, data
availability is higher and connectivity between both is greater, so this raises the chances of
finding patterns or potential case-based reasoning, helping to update the practice of using
data to improve management.
In this process of recognizing the status of data scientists, it’s vital to mention a
fundamental advance in their professional acknowledgement: they have taken on the
crucial responsibility of committing to improving company results. Their mission is
no longer limited to guiding or advising the actions of other departments, nor to crunching
data to later present it to managers responsible for decision-making. The data scientist’s
work culminates with the delivery of new business opportunities founded on the
comprehensive inspection of data.
Is the company ready to listen to the data scientist?
The data scientist in many cases faces another crucial battle to make sure that their new
status within the company is acknowledged: overcoming resistance to change. Digital
inertia is pushing many companies towards the culture of data, but in more traditional or
larger organizations, where digital natives are rarely part of the management, this can end
up being a costly journey if it is long, or traumatic if it is short.
The first leg of the company’s journey towards big data must receive firm support
from the management. There are so many departments involved (IT, Business
Intelligence, e-Commerce, Marketing, etc.), and so much coordination among them is
needed for data to flow, be shared and properly used, that only by providing resources
from the top will it be possible for change to take place. Without agility and cooperation,
there can be no results.
In companies where there’s a tendency towards convenience or resistance to change, the
data scientist might even be seen as a gatecrasher who has turned up to lecture experts on
how to run the business. Executives who have long established the rules of the game are
wary of the mathematician, who even seems to be speaking a language that is foreign to
the company.
The first step in a company’s
journey towards Big Data needs
support from top management.
This is a cultural issue: the scientific endorsement behind the data scientist’s
recommendations must tap into traditional decision-making processes, based on
experience or other types of indicators, sometimes as simple as a spreadsheet. There may
even be people who ignore the contributions of the data scientist, as they may fear being
held to a commitment to improve results: meeting KPIs can be a painful goal.
This phenomenon is repeated in all kinds of organizations, including startups, because
ultimately each person tends to protect their own teams and projects. That’s why, as we
shall see later on, empathy and communication are two of the essential non-technical
qualities required to work as a data scientist.
3. Who needs a data scientist?
In the United States, data scientist
was listed in 2016 as the job with the
best prospects, based on three factors:
job openings, salary and potential for
career development.
Source: 25 best Jobs in America, Glassdoor.com
Companies and organizations in countless industries today are embarking upon
projects related to data analysis: banking, communications, entertainment,
healthcare, education, natural resources, insurance, retail, transport, energy, etc.
Many institutions publish their big data repositories, and moreover technologies
to visualize and analyze data are generally available. This scenario facilitates
investigation as anyone with basic training can raise a company-related issue and
collect the data required to solve it.
Why would a company venture into a big data related project? The main objective is
usually to improve customer experience, but other goals include reducing costs,
refocusing marketing strategies, streamlining internal processes or improving
security. We know that we have unprecedented access to information and data.
What’s more, complex systems appear in any field of knowledge. Unpredictability
can manifest itself in all kinds of disciplines: mathematics, physics, chemistry,
engineering, programming, economics, sociology, psychology, etc. There is a
continual challenge to find order or a behavior pattern among the seemingly chaotic
nature of any system.
As a result, there is no shortage of data or, obviously, problems to solve. And there is so
much knowledge out there that it is difficult to create new knowledge, in this instance
understood as any algorithm or model to help improve business performance. Taking on
all these challenges, in addition to a solid technical background, requires huge doses of
passion and motivation. That’s why defining the criticality of the problem to be solved is
crucial for the data scientist.
But how do you define a good problem? How is it recognized, and how are resources
allocated to solve this particular issue and not another? The answer may be subjective,
depending on who you ask. But basically, a good problem should meet three
conditions:
• Demonstrate a clear and direct impact on the business.
• Prove solvable with the data at hand.
• Provide sufficient motivation to the data scientist and his/her team.
“It’s impossible to have someone
who is knowledgeable in all the
businesses in the world. The
company may have a generalist
data scientist and specialists in
the areas where business can be
developed”.
Álvaro Barbero.
The last question is who can take charge of solving such problems. In his book Building
Data Science Teams, DJ Patil sums up the essence of a guide for employing or hiring a data
scientist:
“The inventor of LinkedIn’s ‘People You May Know’ was an experimental physicist. A
computational chemist on my decision sciences team had solved a 100-year-old problem on
energy states of water. An oceanographer made major impacts on the way we identify fraud.
Perhaps most surprising was the neurosurgeon who turned out to be a wizard at identifying rich
underlying trends in the data”.
Ultimately, all scientists, whatever their training, are able to meet the challenge of
extracting information from data, as long as they convey enough passion for problem-
solving. And it is always beneficial to test the robustness of a model based on the variety of
perspectives provided by different scientific disciplines.
4.
Skills of a
data scientist
“MOOCs are very useful for
training, because they are very
specific and oriented towards a
specific objective”.
Alejandro Rodríguez.
The data scientist is not necessarily a professional with a “numbers” training. It’s
not essential to come from disciplines such as mathematics, statistics, physics or exact
sciences, although these educational backgrounds provide a useful foundation. Some data
scientists come from fields such as telecommunications, engineering or computer science,
and even from seemingly unrelated areas such as communication, economics, finance or
biomedicine.
Why? Because the most important part of their job is ultimately to analyze data: play with
it, work with it, question it, and love it. The data scientist should be a curious, creative,
innovative and even defiant person, capable of questioning the status quo. And that’s
why their training is not as decisive as their attitude is.
Technical skills
What is clear is that the data scientist’s work revolves around the combination of
technology, creativity and data. There are likely common core requirements when it comes
to their qualifications and performance, but as time goes by, the profile will gradually
diversify into multiple branches and specializations.
In short, the data scientist should be fully at ease with the following four disciplines:
• Statistics / Mathematics: they should be able to analyze databases, build models,
make statistical forecasts and distinguish what is representative from what is not.
Therefore, they should have a strong mathematical background that allows them to
control supervised models with predictive techniques (data mining, machine learning)
and unsupervised segmentation models. Prior to this modelling, they should be able to
work with all mathematical techniques of data pre-processing, and once the model is
built, of data evaluation. In short, they should be familiar with a skill set of techniques
to enable them to construct and to evaluate a predictive model, as well as apply
statistical logic to programming languages.
• Technology: as a requirement for transforming data into knowledge, the data scientist
must understand the business’s technological resources and have the know-how to implement them.
Algorithm design is key to data transformation, and calls for fluency in multiple computer
languages, as well as full knowledge of database management. It’s very important to be
proficient in automation, since many processes are repeated on a computer while the data
scientist is working on refining or calibrating the model.
“In Spain, we lack the mindset to help
people grow, take risks, even train
them to grow in their job positions”.
Fuencisla Clemares.
• Business analytics: the data scientist should speak the corporate language,
understand the company’s goals, the industry in which it operates and the
processes that drive profit and growth. Only in this way will they be able to
discern which problems can be feasibly solved through data processing, and
only by understanding the inner workings of the company will they be able
to convert data analysis into insights and valuable recommendations for the
company. Without certain knowledge of the business environment, mere
technical qualifications can lead to rejection of the “techie” or difficulty in
understanding them, or even awkward situations where all they are offered are
obvious answers.
• Communication: the data scientist will at some point have to present meticulous
and accurate results of their work - not based on experience, but on their
analysis- to professionals, often managers with decision-making powers and
extensive business experience but who lack technical training. That’s why they
should possess the ability to communicate with ease and create a dialogue tailored
to the level of their audience. It’s paramount that the result of an analytical
process be able to be understood by any manager within the company, whether
that be an engineer or a social media specialist.
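The Statistics discipline described above (build a predictive model, then evaluate it) can be illustrated with a minimal sketch in Python, one of the tools this report discusses. Everything here is an assumption made for illustration: the data is synthetic, the model is a simple least-squares line, and the helper names (train_test_split, fit_line, mean_absolute_error) are hypothetical. The point is only the workflow: hold back part of the data, fit on the rest, and compare the model against a naive baseline.

```python
import random
from statistics import mean

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and split them into train and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_absolute_error(predict, rows):
    """Average absolute gap between predictions and observed values."""
    return mean(abs(predict(x) - y) for x, y in rows)

# Hypothetical data: advertising spend (x) against sales (y), with noise.
noise = random.Random(42)
data = [(x, 3.0 * x + 10.0 + noise.uniform(-2.0, 2.0)) for x in range(40)]

train, test = train_test_split(data)
slope, intercept = fit_line(train)
baseline = mean(y for _, y in train)  # naive model: always predict the average

model_error = mean_absolute_error(lambda x: slope * x + intercept, test)
baseline_error = mean_absolute_error(lambda x: baseline, test)
print(f"model error: {model_error:.2f}, baseline error: {baseline_error:.2f}")
```

Evaluating on rows the model never saw, and against a baseline, is what separates a useful model from one that merely memorizes its training data.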
Skills above and beyond technical ones
The data scientist doesn’t subsist on technical know-how alone. Ideally, the above
capabilities are complemented by a series of personal characteristics, thereby forming
a skill set (sometimes merely utopian) that merges specialization with human
qualities.
• Creativity: the ability to bring a different perspective to the analysis by using
new methods to collect, interpret and analyze data. The technology itself is
not a differential factor from the moment that a program is made available to any
organization. That’s why the significance of know-how is vital: the tools may be the
same for everyone, but the minds handling them are not. Technological uniformity
melts down when intelligence is added, turning the results offered by a software
solution – one which may even be used by the competition - into unique ones.
• Intuition: the ability to choose between one way or another of reaching a solution
is extremely important. Experts underline the importance of applying an artistic
component to a technical working process that usually triggers a fixed sequence
(data processing, curation, modelling, etc.), but which requires an intuitive spark to
discriminate which steps are suited to critical analysis.
“To stay on top of everything and
constantly refresh one’s knowledge,
curiosity is essential”.
Marcelo Soria.
• Flexibility: Trial and error mechanisms allow us to evaluate and choose one
option or another for the work already underway, complementing - or even
rectifying - decisions made before starting the project. Mathematical models are
not unique, but are grouped into toolboxes that encompass different techniques.
Therefore, agility is required to opt for a technique or one analytical tool or
another, depending on the structure of the data, the information available, etc.
For professionals trained in theory but with little experience in the practical side
this may represent a point of weakness.
• Curiosity: understood as the ability to ask questions, to comprehend what is asked
and to envisage the right path to take. Curiosity is essential for keeping abreast of
techniques and arts, as well as for constantly refreshing one’s knowledge base. Ultimately,
this will lead the data scientist to draw meaningful inferences from the data.
• Empathy: Although their work is the result of hours and hours spent in front of a
computer, the data scientist is not a lone wolf. The human factor must be present
in their daily lives, in the sense that their work depends on collaboration with other
departments, and it is impossible to pull it off without cooperation. Accustomed
to mobility between projects and areas, the challenge lies in creating free-flowing
dialogue with other parts of the organization. What’s more, they may sometimes
have to present undesirable results to clients or superiors, further reinforcing the
importance of the personal touch.
• Pragmatism: Finally, there’s no point in all this theoretical analysis if it isn’t
accompanied by a practical impact. Technical skills are of little use if the data
scientist isn’t able to integrate into a team or convert all their analytical potential
into results that benefit the company or other working groups. Therefore, they must
be able to transfer data analysis into insights or actions with a direct impact on the
business.
“At Google, we try to work extensively
in the ecosystem, which is a word we’re
very fond of. We aren’t the ones who
are going to train people, but we can
influence other experts to encourage
such initiatives”.
Fuencisla Clemares.
How to choose your data scientist
For a profession that is still evolving, traditional recruitment processes are of no use.
Companies like Facebook, Amazon, Google or Microsoft are at the forefront of corporate use
of data science, serving as a benchmark for companies from all industries to understand the
professional profile of recruits and the type of work they perform.
It goes without saying that their technological background is critical: without the relevant
technical training, it is impossible to address the mission of data processing. That’s why
above all it is important to evaluate training and experience in mathematics and computer
science.
But we must also assess the ability to refresh knowledge, grow and learn in an ever-
changing environment, because we’re likely to recruit someone who doesn’t know which
challenges they are going to face in three years’ time.
Therefore, in the selection process it is important to test reasoning skills through problems
where it is not as important to find the right solution as it is to follow a logical process.
Nor is it uncommon to consult references seldom used in other selection processes, for
example, work developed on platforms such as GitHub.
Struggling to find a data scientist? Train them in-house
When recruiting a data processing specialist becomes a complex or financially costly chore,
some companies opt for internal promotion. Professionals already working in an area
related to data analytics are trained or re-trained in disciplines adapted to the new needs of
the company. This is a widespread and perfectly valid approach.
This re-training is favored by the trend towards standardization brought on by technology:
there are countless tools that make the prior task of data analysis and cleansing easier, and
which allow professionals already in the workforce - especially in business intelligence - to
be re-trained in data science.
The pull effect of what some describe as today’s coolest profession, along with
technological standardization, has somewhat lowered the bar of technical knowledge
required to perform the role of data scientist, which actually poses a risk that threatens the
quality of the decision-making process. The tools that automate some of the work with
less specific knowledge globalize and streamline the practice of extracting value from
data, without the need to aspire towards having a data scientist, or at least a data analyst,
on the payroll.
Where have data scientists studied?
When looking at data scientists’
academic backgrounds, it’s surprising
that Business Administration is the
second-most common course of study.
Source: “The State of Data Science”, Stitchdata.com
Another advantage of in-house training stems from the unique nature of the data
scientist’s work. Their concerns and personal motivations do not always coincide with
those of other professionals. Their passion for research - let’s not forget that we’re
talking about scientists - and their motivation to learn may actually replace the priority
levels they give to variables such as their rank in the company, advancement, salary
or responsibilities. In this regard, the profile lies halfway between professional and
academic, although we must remember that performance metrics in a company are not the
same as those at a university.
Supermen and superwomen? No, super teams!
Statistics, Technology, Analytics, Communication... Without forgetting human qualities. Is
this skill set very difficult to come across all in one person? Probably so. Simply because
there aren’t many people who can do all that. The alternative is simple: working in
multidisciplinary teams. This involves creating groups that, as a whole, satisfy all these
qualities. A collaborative effort that goes beyond the work of a single person, where the
most important thing is to create a climate where curiosity, motivation, knowledge sharing
and cooperation are encouraged.
Each team member has a clearly defined role, and does not need to know everything: the
modelling expert will work alongside the analytics expert; and the business specialist with
the head of communication. But what is important is that the generalist data scientist has
a global vision of the entire work process, which will avoid situations where, for example,
they invent a mathematical model that cannot be run with the available hardware.
The group should operate smoothly, within a dynamic rather than a rigid structure, because
once the general problem has been identified, specialists centered in a particular area can
be incorporated. Such a smooth operation, besides oiling the wheels of the team, will allow
each group member to focus on areas that most appeal to them.
“Right now, there is demand
from our Data Science
students even before they
complete their training”.
Esteban Moro.
The ideal CV
Looking to work as a data scientist? In that case, you should make sure
that your CV features as many as possible of the following skills and
qualifications:
• Programming
- R
- Python
- Spreadsheets
- JavaScript and HTML
- C/C++ or Java, Julia
• Statistics
- Descriptive and inferential statistics
- Experimental design
• Mathematics
- Functions and graphs
- Multivariable calculus
- Linear algebra
• Data management
- Database systems
- SQL
• Data communication and visualization
- Visual coding
- Data presentation
- Knowledge of audiences
• Bonus: Intuition
- Project management
- Industry knowledge
• Machine learning
- Supervised learning
- Unsupervised learning
- Reinforcement learning
And an essential complement: a good command of English, the language in which an
enormous amount of new knowledge is generated.
How much does each specialist earn?
Salaries (in the US)
Data Scientist: $113,000 / year
Big Data Specialist: $62,000 / year
Data Analyst: $60,000 / year
Source: Glassdoor.com
5.
The Data
Scientist’s Tools
“Expectations are the issue.
Companies don’t understand that
in research, there are times when
things just don’t work out”.
Alejandro Rodríguez.
Construction of data processing systems, databases,
visualization tools, and data wrangling tools
Within engineering related to the construction of data processing systems, there are three
basic tools to embark upon the analysis of huge volumes of information: Python, R and
Hadoop. While these tools are relatively new and not as widespread,
they are easier to grasp for professionals already proficient in programming languages like
Java or C.
R Project. Considered the standard among statistical programming languages, some
know it as “the golden boy” of data science. R is a free software environment dedicated to
statistical computing and graphics, compatible with UNIX, Windows, and MacOS platforms.
It is a must in data science, and being proficient in it practically guarantees a job offer,
given the increasing number of commercial applications and its advantageous versatility.
- R is free: anyone can install, use, upgrade, clone, modify, redistribute, and even
resell R. Not only does it save money on technology projects, but it also provides
constant updates, which are always useful for any statistical programming language.
- R is a high-performance language, which helps users handle large data packages,
making it a great tool for managing big data. It’s also ideal for intense and resource-
intensive simulations.
- Given all its advantages, it is increasingly popular. It has about 2 million users,
who make up an active and supportive community. There are more than 2,000
free libraries with statistical resources devoted to finance, cluster analysis, and
much more.
Any cultural change is costly or
takes a long time; and if it’s short,
it’s traumatic.
Python. Another flexible and straightforward open-source programming language. A
programmer working with Python ends up writing less code thanks to its “friendly”
features for beginners, such as code readability, simplified syntax and ease of
implementation.
- As with R, programming in Python is suited to a great deal of industries and
applications. Python powers Google’s search engine, as well as YouTube, Dropbox,
or Reddit. Institutions such as NASA, IBM, and Mozilla also depend heavily on
Python.
- Python is also free, which benefits startups and small businesses. Since the language
favors simplification, it can be handled by small teams. And a good knowledge of the
basics of this object-oriented language lets you migrate to another similar language
just by learning the syntax of the new language.
- As a high-performance language, Python is the option often chosen to construct
fast-access applications. Plus, its huge library of resources provides the necessary
help to ensure that productivity is just a few clicks away.
Hadoop. Another staple for anyone who wants to venture into the analysis of big data.
Available as an open-source framework, Hadoop facilitates the storage and processing
of huge amounts of data. It is considered the cornerstone of any flexible and forward-
thinking data platform.
- Hadoop is one of the technologies with the greatest potential for growth within the
data industry. Companies like Dell, Amazon Web Services, IBM, Yahoo, Microsoft,
Google, eBay, and Oracle are firmly committed to Hadoop’s implementation.
- One of its major benefits is to help companies with their marketing needs:
Identifying customer behavior patterns on the website, providing recommendations
and custom targeting, etc.
- Hadoop opens up great career opportunities in a wide variety of positions. Given
its relevance in many industries, Hadoop specialists can find work as an architect,
developer, administrator or data scientist.
“The reality of a data scientist’s work is
that you do not know what you’re going
to find behind the data. If you want to
work agilely, you have to be flexible and,
above all, be very practical”.
Álvaro Barbero.
Another frequent interaction in the data scientist’s work is with databases. Here it’s
common to work with NoSQL databases and with processing tools like Apache Spark and
Apache Storm, as well as with virtual machines.
Visualization tools are not as important for creating value as they are for convincing. In
this sense, they’re associated with the results communication phase and the actual work
of rediscovering the value of the data: it’s not the same to trawl through numbers as it is
to present them. Programs such as QlikView, Tableau, and Spotfire are used for this.
Finally, there’s a pretty unglamorous part of the data scientist’s work, which is a process
known as data wrangling. Raw data is often presented in a confused or imperfect way,
so the data first needs to be manually collected and cleaned up before it can be converted
into a structured format to be explored and analyzed. And this is a task that can take
up more than 50% of the data scientist’s working time, using tools like OpenRefine or
Fusion Tables.
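The kind of clean-up that data wrangling involves can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw extract, the column names and the helpers parse_spend and wrangle are all hypothetical, and dropping incomplete or duplicated rows is one possible policy among several.

```python
import csv
import io

# Hypothetical raw extract: stray spaces, inconsistent case,
# Spanish-style numbers, a duplicate and a missing value.
RAW = """customer, city ,monthly_spend
 Ana ,madrid,"1.250,50"
ana,Madrid ,"1.250,50"
Luis,BARCELONA,
Marta, Sevilla ,"980,00"
"""

def parse_spend(text):
    """Convert '1.250,50' to 1250.5; return None when the field is empty."""
    text = text.strip()
    if not text:
        return None
    return float(text.replace(".", "").replace(",", "."))

def wrangle(raw_text):
    """Return clean, de-duplicated records from the raw CSV text."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [name.strip() for name in next(reader)]
    seen, clean = set(), []
    for row in reader:
        if not row:
            continue
        record = dict(zip(header, row))
        name = record["customer"].strip().title()
        city = record["city"].strip().title()
        spend = parse_spend(record["monthly_spend"])
        if spend is None:
            continue  # assumption: incomplete rows are dropped, not imputed
        key = (name, city)
        if key in seen:
            continue  # assumption: same name and city means the same customer
        seen.add(key)
        clean.append({"customer": name, "city": city, "monthly_spend": spend})
    return clean

records = wrangle(RAW)
for record in records:
    print(record)
```

Each rule in the sketch (what counts as a duplicate, what to do with a missing value) is a judgement call that should be agreed with whoever owns the data, which is part of why this phase takes so long.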
Open source or proprietary software?
As in any area where specific software is required, data science professionals can choose
between programs marketed by private companies and open-source software.
Before embarking on a data science project, it’s very important to know exactly which
technological needs will be required to adapt resources and budgets accordingly. This is
one of the reasons why more and more companies are opting for the flexibility of open-
source alternatives. The variety of options arising from the open-source environment
has also helped to expand the use of new technologies and knowledge. Fee-charging
commercial tools that dominated the market up until recently are increasingly seeing their
prominence diminished in favor of free alternatives.
Some experts have warned about vendors who try to impose their commercial
solutions on businesses, which end up investing heavily in proprietary applications that
always have an open-source alternative. This lock-in can be avoided by adopting open-source
projects, which are scalable and can offer performance comparable to proprietary
software.
6.
Getting down to it:
the work process
“Some people get scared because
they think you want to impose an
army of mathematicians on them”.
Manuel Marín.
The coexistence between analysts and specialists in a company within mixed teams
involves starting out on a journey that will ideally culminate in the opening of new lines
of business. Results don’t sprout up from one day to the next, but data science makes once
seemingly unattainable milestones feasible.
Three obstacles before accessing data
Before buckling down to work, the data scientist first must overcome three obstacles:
1. Access to data
Many companies may amass huge amounts of customer data, but the nature of their
services includes restrictions related to security and privacy. This presents a ‘chicken
and egg’ type of dilemma: as a condition for access to data, management will want to
know the potential value it can bring to the company. No matter how much the analyst
may sound off about this, the real benefits for the company cannot be demonstrated if the
necessary data cannot be accessed.
How can we get out of this quandary? One way of doing so is by pressing on through
scaled models which progressively show the management team the benefits analytics
can bring. Access to a sample of data will help create a model that solves a specific
problem. A small-scale study of specific customers, which can trigger a decision with
immediate impact on the company, is a good starting point. Once the management team
can verify the model’s suitability, by applying it to immediate decisions, the first step
will have been taken.
In this scenario, choosing a suitable problem that has a visible impact on the business
is crucial. Therefore, the analyst needs to show their skills, intuition, and knowledge of
the business. It goes without saying that a model built from a limited sample will have
limited significance, but it is a requirement to fling open the doors of data.
“There will be a lot of demand
from companies that we could
consider more traditional”.
Bosco Aranguren.
2. Technological means
Having overcome the first obstacle, the next one appears: having the necessary
technological infrastructure to support access to data, analysis, and the exploration
of results.
It’s not about looking for a culprit if such means are not available: there might not be
anybody in the organization cognizant of the impact that data analysis can have on the
business. But, this path offers no shortcuts: if this work isn’t done, someone will have
to deal with it.
A further problem that often comes up is the decentralization of data. With disaggregated
departments and dispersed databases, each with its own access and security protocols, the
data scientist, sometimes with the help of an engineer, will have to focus on grouping the
data in one place, before they can even get to work.
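Grouping dispersed data in one place can be sketched as follows in Python. The two departmental extracts, their field names and the consolidate helper are hypothetical, and the sketch assumes that the customer identifier only differs in letter case between systems; real sources usually need more careful key matching.

```python
# Hypothetical extracts from two departments, each with its own key format.
sales = [
    {"customer_id": "C001", "last_purchase": "2016-03-01"},
    {"customer_id": "C002", "last_purchase": "2016-02-14"},
]
support = [
    {"client": "c001", "open_tickets": 2},
    {"client": "c003", "open_tickets": 1},
]

def consolidate(sales_rows, support_rows):
    """Build one record per customer and note which sources contributed."""
    merged = {}
    for row in sales_rows:
        key = row["customer_id"].upper()
        entry = merged.setdefault(key, {"customer_id": key, "sources": []})
        entry["last_purchase"] = row["last_purchase"]
        entry["sources"].append("sales")
    for row in support_rows:
        key = row["client"].upper()
        entry = merged.setdefault(key, {"customer_id": key, "sources": []})
        entry["open_tickets"] = row["open_tickets"]
        entry["sources"].append("support")
    return [merged[key] for key in sorted(merged)]

customers = consolidate(sales, support)
for customer in customers:
    print(customer)
```

Keeping a record of which source each field came from makes it easier to go back to the owning department when a value looks wrong.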
3. Human resource management
Part of data science, like any other science, is exploration. And exploration calls for a great
deal of inspiration and the lowest possible number of strict orders that stifle creativity.
Passion, perseverance, and curiosity are qualities required in this type of work, and
are often not compatible with rigid organizational structures. Therefore, managers
must be patient and understanding and, within the limits set by the pressure of
financial results, should grant the data scientist the necessary time and freedom to
move forward with his or her investigation. Once a balance has been struck between
what motivates employees and the business’s priorities, the results should start to
appear.
From data to decision... if nothing goes wrong
Once the data is available, the data scientist generally undertakes a scaled process. He or
she will have to devote much of their time to cleaning the data, and then set off on a route
that begins with small samples and will end, if all goes well, with the extraction of useful
conclusions based on a predictive model.
“Oftentimes the reason they end
up hiring you astonishes you”.
Manuel Marín.
If all goes well... Because data science is not a foolproof process. As in any research project,
there are no absolute certainties. Therefore, we must be prepared for possible failure,
however hard that may be for companies that have high expectations and often do not
consider a lack of results acceptable.
In projects involving vast databases, it’s not always necessary to use all the data.
Therefore, it is important to scale: starting with a manageable database, going back
and forth, and setting up a permanent dialogue with the person or department most
interested in the project. Then, once a small insight into the potential scope has been
gained, scaling can begin.
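Starting with a manageable subset before scaling can be illustrated with a short Python sketch. The dataset is synthetic and the helper sample_rows is hypothetical; the sketch simply assumes that a reproducible random sample is an acceptable first approximation of the full data.

```python
import random
from statistics import mean

def sample_rows(rows, fraction, seed=0):
    """Draw a reproducible random sample without replacement."""
    size = max(1, int(len(rows) * fraction))
    return random.Random(seed).sample(rows, size)

# Hypothetical dataset: 10,000 order values.
generator = random.Random(7)
orders = [generator.gauss(50.0, 10.0) for _ in range(10_000)]

# Work on 5% of the data first, then check the estimate against the full set.
pilot = sample_rows(orders, fraction=0.05)
pilot_estimate = mean(pilot)
full_value = mean(orders)
print(f"pilot: {pilot_estimate:.2f}, full: {full_value:.2f}")
```

If the pilot estimate is already close enough to support the decision at hand, the conversation with the interested department can start well before the full data is processed.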
The road to this point is sometimes littered with issues related to decision-making:
the focus of the investigation, the data to be used, the analytics to be used… Technical
knowledge does not guarantee the customization of specific projects, always subject to
unforeseen circumstances that are not covered in training centers.
The ratio between available information and decisions is very unbalanced towards
the former. The process of transforming data into decisions may lead to swathes of
information being lost, and the way the process is transmitted plays a role in this journey.
An important decision for the company cannot be conveyed if it is not backed up with
solid arguments about the source of this conclusion, which data has been used and which
processes have been followed to analyze this information and turn it into the nugget that
is the decision.
7.
Evaluating the
data scientist’s
work
In what industries can we find
data scientists?
Technology-heavy industries
account for the largest
accumulation of data scientists.
Source: “The State of Data Science”, Stitchdata.com
Mathematician George E. P. Box, considered one of the most important statisticians of
the twentieth century, famously once said: “All models are wrong, but some are useful”.
Wrong in the sense that they cannot capture all the details of a system, because if they did
that, the model would be so complex that it would contradict the very purpose of modeling.
Yet that does not render models useless; it does, however, force them to be constantly reinterpreted
and validated using empirical data and knowledge of the system itself, regardless of the
techniques or algorithms used in the analysis.
How can we measure the results of the data scientist’s work? First, we must take the
time horizon into account: benefits are never seen in the short term. The data scientist
develops a predictive model, whose execution depends on whether it is accepted by
management. Machine learning techniques will then be run on the model created to
improve accuracy.
For team leaders, it is important to emphasize the work’s practical application. It is
fundamental, especially in large companies, to ensure that algorithms do not end up simply
as beautiful theories. The responsibility of the data scientist can officially be wrapped up
once they have finished constructing their model, but personal responsibility presses on,
even at the risk of sounding gloomy, until the model is run.
Then comes the wait for results. Models are not foolproof: a key parameter may have been
left out, a wrong variable that alters the outcome may have been entered, or
the subtleties of the business may not have been grasped. Execution may also fail: the insight
might be good, but it is not put into practice in the right way.
The quality of the algorithm is not the exclusive yardstick to measure the data scientist’s
performance. Their responsibilities include some sales-related work: dealing with
customers, explaining to them what they have found, and guiding them on what to do with
their data, always using the communication skills that the data scientist - or any member
of their team - should have. This work provides another basis for evaluation.
Finally, let’s once again remind ourselves of the importance of the human factor. Data
science is not a black box enshrouded in mystery. Data scientists are not oracles, nor are
their words prophecies: the algorithm may make a specific prediction, but the option to
translate that insight to the business or not, with all the consequences that it may incur,
ultimately depends on the person who makes the decision. Hence the importance of the
human factor in the whole process.
8. Trust: an essential component in the data science process
“In terms of training, I don’t
think there is a gap between
Spain and the United States
or the United Kingdom”.
Pep Porrà.
Data is highly sensitive, especially when working with outside information. In such cases,
the customer relationship should be respectful and diplomatic: it’s their business, it’s their
data and it’s often their most valuable asset.
In some industries, companies are interested in getting a return on their data, but a lack
of experience with big data makes them hesitate before they even dare to venture into
data analytics. Younger companies are more cautious, perhaps waiting for others in their
industry to take the first step.
It’s also common for companies to take the big data route but later be reluctant to give up
their data, either because they hold back from sharing any conclusions with the market or
because they don’t even want analysts to know them. In this context, the most common
formula is: acquire the tool, train the team in the tool, and then give support.
Another delicate situation arises with the dangers of do-it-yourself data science.
There are some people who choose to blindly apply tools only after learning about them
superficially, with unpredictable results. This creates a buzz that is detrimental to the
entire data science industry, in the sense that companies don’t receive the advertised
benefits of big data, without truly understanding why they haven’t reaped the full rewards.
There are many disoriented companies that have heard the fanfare about big data and spend
lots of money without knowing what they’re spending it on, or have yet to see any
results. They need to be treated sensitively, with sound judgement and common sense,
clarifying and simplifying the guidelines for action. In an industry where the raw material
is so perilous, trust is essential.
Ethics: the essential complement to science
The data scientist takes on a strong ethical commitment, in the sense that they must
ensure a responsible use of the information given to them. In an increasingly digitalized
society where everyone unwittingly and involuntarily leaves trails, it would be possible
to invade anybody’s freedom simply by using the appropriate knowledge and powerful
servers. But nobody wants that to happen.
Ethical commitment is not just a sign of sound judgement; it is also imperative in an
information society that may face dangers that are not fully known: mass surveillance,
lack of privacy, large-scale loss of data, etc. It is therefore the data scientist’s duty to
work transparently, explaining in a simple and accessible way what their job is and how
they do it, to quash the threat to privacy that people might often associate with big data.
Few people are interested in knowing the intricacies of an algorithm, but they do want an
outline of the route that the data follows.
“Clients sometimes come
across things that they weren’t
expecting, and communicating them
requires specialists who are very
good with people”.
Felipe Ortega.
One way to ensure that data gets used ethically is to work on open data projects, where
anyone can access the data, contributing in some way to social awareness and utility. For
example, Spanish bank BBVA has launched several of these projects, designed to improve
the quality of life of citizens or to optimize efficiency in cities through the intelligent use
of information.
Open the data, give something back to society, become an aggregated data platform for
others to use for the creation of value in cutting-edge projects where altruism replaces
the quest for profit. That is the ethical commitment that many data scientists have taken on
to safeguard the good name of their specialty.
9. Data scientists in Spain today
To stay on top of everything
and constantly refresh one’s
knowledge, curiosity is essential.
Are Spanish data scientists more qualified or less qualified than other nationalities? Is there
a shortage of professionals? Will academic programs keep up with the expected demand in
the years to come?
Overall, experts agree that Spain is on a par with the leading countries in data science.
There is no shortage of highly qualified professionals or startups specializing in big data
processing which stand out among the most advanced in Europe, if not the world. The
professional level is so high that it’s not unreasonable to think of Spain as a global
powerhouse in data science.
This opportunity must be managed well to make sure it doesn’t fail. As in other scientific
disciplines, excellent professionals are going to other countries to pursue their careers. It’s
true that money draws professionals to places like California, but a high concentration does
not necessarily imply a higher level. For Spanish data scientists to prove their worth, they
should start by valuing themselves, acting with professionalism and discretion to ensure a
promising future.
The range of academic programs is also increasingly extensive in both public and private
colleges, where there are countless Master’s programs and specialized courses. This mix is
indispensable in a discipline that is permanently in coexistence with innovation and research.
So, if something were to jeopardize the advancement of data science in Spain, it wouldn’t
be the academic level of specialists, but rather some of the endemic problems provoked by
how work is organized in Spanish corporations. For example, agility when implementing
projects is not comparable to the United States, where there are far fewer bureaucratic
obstacles. Similarly, there is still a gap between academia and the business world: there’s a
lack of dynamism when integrating the work of a data scientist into the business world.
In Spain, there are claims that there is less flexibility in the labor market when it comes to
re-training. Once the professional has focused on a career path, taking the risk to change
it is more difficult than in other countries, due to a tendency towards convenience or
pigeonholing. Therefore, it is important for organizations to support their employees.
That said, Spanish professionals, as well as those from Latin America, have a bonus that
can give them a competitive advantage over their peers in the rest of the world: creativity,
understood as the ability to seek out alternative problem-solving processes that nobody
else has imagined. That fits in with and complements the empathy side. In other words,
creativity lets Spanish data scientists apply a part of art (the other is science)
to problem-solving.
“Everyone must realize that
our daily life is going to be very
dependent on and influenced
by data analysis”.
Felipe Ortega.
Who’s making the most out of data science in Spain?
Three industries are at the forefront of the implementation of data science in Spain:
banking, telecommunications, and tourism. Overall, large companies are investing more
resources into data science. These include entities such as Santander, BBVA, Telefónica,
Bankinter, Sabadell, La Caixa, Amadeus, Kayak, etc.
But this investment isn’t exclusively for large companies. More moderately-sized
companies are using data science in a very creative and innovative way, with worldwide
recognition of their work. Two examples:
Carto
http://www.cartodb.com
Founded in Madrid in 2012, originally as CartoDB. Its most popular tool is Carto
Builder, which allows visualization enthusiasts to build interactive maps from
geodata with no programming skills required. With more than 1,400 customers,
200,000 registered users and an office in New York, its goals focus on offering
large corporations an optimization tool for decision-making and predicting
consumer trends.
Stratio
http://www.stratio.com
Also founded in 2012 as an offshoot of predecessor Paradigma. Stratio develops
platforms and products from big data technologies such as Cassandra, Apache
Spark, and proprietary developments. Customers using its real-time processing
solution come from banking, insurance, tourism, and retail. More than 25
specialists in big data architecture work out of Stratio’s Madrid headquarters.
Stratio also has an office in Palo Alto, California, the heart of Silicon Valley.
10. Conclusions: still a great deal to be done
“People ask us: are you opening
up data so that everyone can do
business? Well, yes: we let others
have a better knowledge of reality
from our data”.
Marcelo Soria.
The analysis of big data has already left behind the emerging technology phase (hype cycle)
and is taking hold in many companies. Or, at the very least, certain “core” technologies are,
such as distributed databases, real-time processing and large analytical layers.
With the initial implementation being wrapped up, data science professionals are moving
towards specialization. As the field continues to grow, it is normal for it to split up into
specialties, to form an ecosystem. Companies, to some extent, are promoting this trend
because they cannot afford to properly compensate large teams of data scientists.
The same is happening in training. It’s no longer possible to offer a set of core courses,
so the range of academic content is beginning to diversify. As they define their needs,
companies will continue to increasingly demand sought-after professionals, who are
often awarded grants by the companies that recruit them or guaranteed immediate
employment upon completing their education.
Lots of companies invest huge sums into market research. Some will realize that data
science represents another data source, a new form of R&D that converts data into new
value for the company.
But big data is still in its teenage years. Many challenges lie ahead, derived from handling
large volumes of information and its conversion into useful tools.
What will the adulthood of big data look like?
Attention should be shifted from the “bigness” of data to its application. The famous
“Four Vs” of big data (Volume, Velocity, Variety and Veracity) must be expanded to
bring in a new concept: Value. This involves reducing the noise of data and increasing its
contribution.
Data science will mature, strengthen its position, gain recognition as a career and surprise
us with future discoveries. It should be designed as a tool to not only bring transparency to
the present, but to anticipate the future in a way conducive to business growth.
“It is our duty to give something back
to society. With all the information
companies have about people, they
can greatly improve their lives”.
Richard Benjamins.
This will be possible by converting data into knowledge, and that knowledge into practical
actions, whether to provide better customer service, boost efficiency through automation,
or create new business opportunities by identifying cross-sells or opening new markets.
At present, most projects related to data analysis focus on cost optimization and process
integration. In the future, predictive analysis will place emphasis on data monetization
and the delivery of new applications and business opportunities. Predictive models in
cloud environments, parallel data processing or sophisticated machine learning algorithms
will optimize or guide the decision-making process.
Ultimately, companies will have to reinvent or reinterpret themselves as
their business becomes more digital and customer proposals increasingly depend
on lessons learned from data. Companies like Siemens, defined by its CEO as “a
software company”, have already fully embarked on this process. A key element of this
evolution will be an environment of experimentation, tolerance, and
short development cycles that drives innovation.
The companies leading this evolution will be those who place the figure of data scientist
at the core of their strategy. This way, they will be able to develop the conditions (talent
acquisition, employee commitment and priority-setting) needed to place them at the head
of the race to turn data into a long-lasting and tangible competitive advantage.
In our daily lives, we are already using applications and products that come from
processing a huge amount of data: spam filters in email inboxes, recommendations on
social networks, search engine results, medical tests and prescriptions, investment funds,
etc. And with the future promised by The Internet of Things, the need to process more and
more information will only grow and grow. Our lives may end up highly conditioned, or
heavily influenced at the very least, by the analysis of all the data surrounding us.
A future, in any case, where all those involved in the analysis of big data should be very
cautious with everything related to data privacy and consumer confidence. It doesn’t matter
if our data is used to better manage our time or our money, customize the advertising we
see or improve our health. If we believe that it will improve our lives, we won’t object to
anybody’s use of it.
Annex.
Business case 1
Commerce360
What are my customers most interested in? On what day does my competition outsell me?
Are their items more expensive or cheaper than mine? When do I sell the most? Where
do my buyers live? What is their gender, their age, and how much do they spend on every
purchase?
Any business would like to know the answers to these and similar questions. Large and
medium companies can do this by allocating resources to business intelligence, but it’s
more difficult for independent traders or local stores.
That’s why Spanish bank BBVA has developed Commerce360, a tool that aims to make
business intelligence accessible to any company. Based on aggregated and anonymous data
from BBVA card payments, the application extracts indicators related to the industry and
profile of customers who buy items in a specific area.
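As a rough illustration of how indicators might be drawn from aggregated, anonymous payment data, here is a minimal Python sketch. It is not BBVA's implementation: the record layout, the zone and segment names, and the minimum group size of three are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical card-payment records: (merchant_zone, age_band, gender, amount_eur).
# None of these fields or values come from Commerce360 itself.
payments = [
    ("zone_a", "25-34", "F", 40.0),
    ("zone_a", "25-34", "F", 60.0),
    ("zone_a", "25-34", "F", 20.0),
    ("zone_a", "35-44", "M", 90.0),
    ("zone_b", "25-34", "M", 15.0),
    ("zone_b", "25-34", "M", 25.0),
    ("zone_b", "25-34", "M", 35.0),
]


def aggregate_indicators(records, min_group_size=3):
    """Return payment count and average ticket per (zone, age band, gender).

    Segments with fewer than min_group_size payments are dropped, so that no
    published indicator describes only a handful of cardholders. The
    threshold is an illustrative assumption.
    """
    groups = defaultdict(list)
    for zone, age_band, gender, amount in records:
        groups[(zone, age_band, gender)].append(amount)
    return {
        segment: {
            "payments": len(amounts),
            "avg_ticket": sum(amounts) / len(amounts),
        }
        for segment, amounts in groups.items()
        if len(amounts) >= min_group_size
    }


indicators = aggregate_indicators(payments)
for segment, stats in sorted(indicators.items()):
    print(segment, stats)
```

The single payment in the ("zone_a", "35-44", "M") segment is suppressed rather than reported, which is one simple way to keep aggregated figures from pointing back to an individual.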
“Commerce360 is a tool for retailers, where by using our information on card payments we can
provide a store with its economic activity, purchasing dynamics, socio-demographic information
on what its customers are like, age, gender, where and when they shop, etc., comparing all this
with aggregated businesses that are their competition or other businesses in the area that perform
the same type of activity,” as Marcelo Soria explains.
As a result, retailers once guided by intuition or other traditional methods have access to an
analytical tool that lets them discover the origin of their customers, measure their loyalty,
study their demographic characteristics and identify high-value customers. “For us it is a
very interesting line for democratizing access to data and data-based intelligence. This is
the future of retail,” adds Soria.
Business case 2
Smart Steps
Smart Steps is a geo-marketing program developed by Telefónica using data from its mobile
phone network. Data is aggregated and extrapolated anonymously to extract information
on user trends or behavior patterns in a specific area.
The project captures billions of data points from Telefónica’s mobile network, 365 days a
year, 24 hours a day. This data is matched with different sociodemographic and mobility
indicators (residence, means of transport, age) that can offer companies precise targeting
based on the movements of their potential customers.
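To make "aggregated and extrapolated" more concrete, the following Python sketch scales a count of devices seen on one operator's network up to an estimate of total visitors. The scaling by market share, the function name, and the minimum device threshold are assumptions for illustration only; the source does not describe how Telefónica actually performs this step.

```python
def extrapolate_visitors(observed_devices, market_share, min_devices=10):
    """Estimate total visitors in an area from one operator's device count.

    observed_devices: devices seen on the operator's network in the area.
    market_share: assumed fraction of the population using that operator.
    min_devices: hypothetical anonymity threshold; smaller groups are not
    reported at all.
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in the interval (0, 1]")
    if observed_devices < min_devices:
        return None  # too few devices to report without risking re-identification
    return round(observed_devices / market_share)


# 300 devices observed with an assumed 25% market share.
estimate = extrapolate_visitors(300, 0.25)
print(estimate)
```

A real system would also have to correct for people carrying several devices or none, which this sketch ignores.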
Smart Steps can be applied to any industry in which the movement and knowledge of the
user profile are important, such as travel and transport, tourism, or outdoor advertising.
For example, local retailers could find out whether participants in an event such as San
Fermín are regular or sporadic, where they come from, where they are staying, the length
of their visit, etc., and with this information they can tailor their sales approach.
It is also useful in the public sector, as knowing people’s movement patterns helps improve
traffic management in the city, adapt public transport, or analyze the need to build new
infrastructure. In 2014, the program was used to map out the most crime-prone areas
in London: the generated algorithm obtained an accuracy of 70% when predicting crime
hotspots.
Business case 3
Home Fire Risk Map
25,000 people are killed or injured in house fires every year in the United States. The
American Red Cross aims to reduce the number of victims through an initiative based on
big data.
The Home Fire Risk Map program identifies the locations most prone to house fires across the
country, and will be used by Red Cross volunteers to install smoke alarms and provide fire
safety courses where they’re needed most. Data suggests that 60% of fires can be prevented
simply by having a working smoke alarm and by knowing what to do in the event of a fire.
Using different open data repositories, 50 volunteers worked for over a year to create a map
that identifies high-risk areas throughout the country. First, they built a model to identify
those communities with the least amount of smoke alarm coverage. After that, another
algorithm predicted the places most prone to fires. Lastly, a third program calculated the
likelihood of injury or death when a home fire does occur. The three models and their
results come together on the map presented here.
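One way to picture how three model outputs could come together in a single ranking is sketched below in Python. The multiplicative formula, the field names, and the sample figures are assumptions made for this example; the source does not say how the volunteers actually combined their models.

```python
from dataclasses import dataclass


@dataclass
class AreaScores:
    """Hypothetical per-area outputs of the three models described above."""
    name: str
    alarm_coverage: float        # model 1: share of homes with a working alarm
    fire_probability: float      # model 2: likelihood of a home fire
    casualty_probability: float  # model 3: likelihood of injury or death given a fire


def composite_risk(area):
    """Combine the three scores into one number (an illustrative assumption).

    Risk grows when alarm coverage is low, fires are likely, and fires
    tend to cause casualties.
    """
    return (1.0 - area.alarm_coverage) * area.fire_probability * area.casualty_probability


def rank_areas(areas):
    """Return areas ordered from highest to lowest composite risk."""
    return sorted(areas, key=composite_risk, reverse=True)


# Invented sample data, not Red Cross figures.
areas = [
    AreaScores("tract_1", alarm_coverage=0.90, fire_probability=0.02, casualty_probability=0.10),
    AreaScores("tract_2", alarm_coverage=0.40, fire_probability=0.05, casualty_probability=0.20),
    AreaScores("tract_3", alarm_coverage=0.60, fire_probability=0.05, casualty_probability=0.10),
]

ranking = [area.name for area in rank_areas(areas)]
print(ranking)
```

With these invented numbers, the area with the lowest alarm coverage and highest casualty likelihood ranks first, so volunteers would visit it before the others.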
Thanks to this initiative launched in June 2016, the first month saw the installation of
400,000 smoke alarms in households across the United States, with the goal of reaching
2.5 million alarms. Smoke alarms have an average lifespan of 10 years, which signals that a
year’s work is expected to result in medium-term benefits.
Business case 4
The Huffington Post
The Huffington Post is one of the most widely read digital media outlets in the world.
It is also an environment where data analysts enjoy almost as much prominence as editors,
since much of its success is due to big data, which optimizes content, authenticates
comments, boosts advertising clout, and improves user experience.
Real-time statistics and analytical platforms define the editorial process. For HuffPost
it is essential to provide the right content to each reader straight away and in the
right format. For example, data analytics for the Parents section showed that this
demographic mainly uses mobile devices to connect, especially when children are in
bed, and is more active on weekend mornings. Content and advertising are tailored to
these habits.
The huge number of comments received on the website (more than 300 million in 2013)
also encouraged HuffPost executives to debug data to improve user experience. This
was achieved by means of conjoint analysis, a statistical technique used to evaluate the
different characteristics of a product or service. The analysis found that the quality of
comments increased with geographic proximity and among identified users, which led HuffPost to
ban anonymous comments.
Big data was also used to improve user loyalty. In collaboration with technology
company Gravity, HuffPost identified topics of interest for its readers, connecting
the most compelling content for each type of reader through what it calls “passive
personalization”. The technology also provides information on where each reader
accesses content, and helps optimize navigation around the website. With an average of
10 to 12 articles read in each session, the goal is to reach 15.
Business case 5
Hillary Clinton’s 2016 campaign
Few Americans will have heard the name Elan Kriegel. Yet millions of them were in his
sights during the 2016 presidential campaign. Kriegel led a team of 60 mathematicians and
analysts responsible for guiding, with absolute precision, each of the Democratic candidate’s
promotional activities in the campaign, from the party primaries up to the final vote.
For example, Kriegel’s team developed an algorithm that decided where to spend each cent
of the $60 million TV advertising budget during the primaries. With hundreds of local and
state TV networks scattered throughout the country, the victory over Bernie Sanders was
molded by carefully choosing the states, networks, programs, and schedules where Clinton
would convey her message to voters.
Unlike in other countries, campaigns for elections in the United States get fully customized.
Key decisions were made based on the work of analysts, such as at what time and how
to send email messages to voters, which doors canvassers would knock on, which numbers phone
bankers would dial, which voters to target via a Facebook ad, and which to address through
regular mail.
This meticulous work turned Clinton’s campaign into more of a mathematical than
inspirational exercise. A ground-breaking and efficient campaign organized around
models defined by data analysis, and which paves the way for a new era in the definition of
political campaigns, based on data culture. And in the meantime, Kriegel’s team is already
incubating the next generation of talent within the Democratic Party, unknown names for
now but which will play a key role in 2020.
100. #REBELTHINKING
REBEL THINKERS
Iñaki Bagazgoitia
Mar Castaño
Carlos Corredor
Laura Dinneen
Carlota García-Abril
Amelia Hernández
Natasha Morrison
Ellen Thomas
HAVE COLLABORATED
Fuencisla Clemares
Bosco Aranguren
Richard Benjamins
Marcelo Soria
Álvaro Barbero
Alejandro Rodríguez
Manuel Marín
Esteban Moro
Felipe Ortega
Pep Porrà
ACKNOWLEDGMENT