Self-service data preparation is becoming ever more important for data scientists – experts and Citizen Data Scientists alike - especially when focusing on extracting business value out of Big Data. Data preparation is a crucial and often time consuming phase required to pre–process information used to generate reports and train Machine Learning algorithms. In this talk we will discuss some techniques (Data Lake Dictionary, Join Recommender, etc.) that can be used as powerful tools for quickly and efficiently preparing the huge amounts of data stored in Data Lakes.
Advanced Machine Learning for Business Professionals
Self-service Big Data Preparation - Michele Stecca
1. Data Driven Innovation - Rome
Self-Service Data
Preparation
Dr. Michele Stecca
24 Feb., 2017
2. • IoT systems generate massive amounts of data
SELF-SERVICE DATA
PREPARATION
Who knows
how to do
this?
So far,
so good
BUT
• We store this huge amount of information in big
data platforms
• Then we extract value from it
3. • A "citizen data scientist" is a person who creates or generates
models that leverage predictive or prescriptive analytics but whose
primary job function is outside of the field of statistics and
analytics1
• The person is not typically a member of an analytics team. Citizen
data scientists are typically in a line of business, outside of IT and
outside of a BI team
SELF-SERVICE DATA
PREPARATION
¹ Gartner 2015 Research – “Smart Data Discovery Will Enable a New Class of Citizen Data Scientist”
4. • Big data discovery will help expand the use of big data analytics
because exploration of big data sources will occur more often,
much faster and at a lower cost per analysis, delivered by a
broader range of users with more rudimentary technical skills
• The global trend is to enable lesser skilled (i.e., citizen data
scientists) users with the ability to solve more complex problems
or access more insights using easier and quicker methods
• Through 2017, the number of citizen data scientists will grow five
times faster than the number of highly skilled data scientists
• The blending in a single tool or tightly coupled portfolio, the ease
of use, interactivity and agility of data discovery, with the richness
of analysis and scale, diversity or immediacy of big data, will be
the inception of big data discovery
SELF-SERVICE DATA
PREPARATION
5. • Gartner has developed the concept of smart big data discovery
• Preparing data, finding patterns in large, complex data and sharing
findings with other users from data remains largely manual
• Smart self-service data preparation is a smart data discovery
capability, where algorithms are used to find relationships in data
and to profile and recommend to users the best approaches to
minimize modeling time and improve quality
SELF-SERVICE DATA
PREPARATION
6. • doolytic simplifies access to big data with a modern BI user
experience and functionality
• doolytic enables smart data discovery on both structured and
unstructured data
• doolytic offers sophisticated advanced query capabilities required by
power users/citizen data scientists
• doolytic leverages supervised and unsupervised machine learning
features for further investigation
SELF-SERVICE DATA
PREPARATION
8. • Native Datalake Dictionary
• Join Recommender
• Not based on field name conventions
like traditional BI tools
• Search links between fields and draw
graphs with confidence from Datalake
Dictionary
SELF-SERVICE DATA
PREPARATION
How can
doolytic help
to discover
unknown
correlations?
9. • The algorithm suggests the user the potential correlations by
associating a degree of confidence
• The user can accept/reject recommendations
• Graphical visualization for usability
• The algorithm is scalable
SELF-SERVICE DATA
PREPARATION
10. SELF-SERVICE DATA
PREPARATION
• The network planning department needs to optimize the bandwidth allocation by user and traffic type
• Citizen data scientists are limited by the existing technology stack to high aggregation levels and small
fractions of data while performing statistical analyses
• Citizen data scientists must manually correlate data coming from different data sources (including
network probes)
solution
challenge
benefits
• Business users keep track of frequently used queries with responsive interactive dashboards and
visualizations
• Citizen data scientists drill data on the-the-fly at maximum granularity – at user, device and traffic
package level - and discover new paths and rules for network optimization through the Relation-Action
model
• Relationships among datasets are automatically recommended by a specific component
• More accurate and effective network optimizations algorithms are enabled with a wider and
deeper set of inputs
• Citizen data scientist are free to do big data discovery on their own
• Lower TCO than legacy tools
• IT department redirected from support to custom data inquiries
• ROI realized through smaller required investment in optimized network equipment
11. • Moving from manual data preparation to smart data preparation is
an important trend for IoT/big data applications
• This is particularly true when dealing with heterogeneous data such
as sensor data, structured/unstructured data, etc.
• doolytic supports the citizen data scientist by providing advanced
tools for data preparation on large datasets with the Join
Recommender
SELF-SERVICE DATA
PREPARATION
12. @steccami
SELF-SERVICE DATA
PREPARATION
• Senior Big Data Analyst, doolytic
• Ph.D. Computer Engineering, Univ. of Genoa, Italy
• Visiting Researcher, ICSI - UC Berkeley, USA
• Principal Investigator, FP6 & FP7 projects co-
funded by EU
• Author 30+ scientific papers in Computer Science
• Main interests: Big data (Hadoop, Spark, etc.), IoT