DSSG Speaker Series, 2013-08-12:
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases
Paco Nathan @pacoid
Chief Scientist, Mesosphere
1
Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. the evolution of cluster computing
DSSG, 2013-08-12
2
employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way…
however, both systems engineers and data scientists must
Process Variation Data Tools
Statistical Thinking
3
Modeling
back in the day, we worked with practices based on
data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst,
ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced the prior practices
of data modeling
because the data won’t fit on one computer anymore
4
Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
chronicled a sea change from data modeling (silos, manual
process) to the rising use of algorithmic modeling (machine
data for automation/optimization) which led in turn to the
practice of leveraging inter-disciplinary teams
5
approximately 80% of the costs for data-related projects
gets spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks that can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to socialize the problems, knocking down silos
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making process repeatable
What is needed most?
UniqueRegistration
Launchedgameslobby
NUI:TutorialMode
BirthdayMessage
ChatPublicRoomvoice
Launchedheyzapgame
ConnectivityTest:testsuitestarted
CreateNewPet
MovieViewStarted:client,community
NUI:MovieMode
BuyanItem:web
PutonClothing
Addressspaceremaining:512M
CustomerMadePurchaseCartPageStep2
FeedPet
PlayPet
ChatNow
EditPanel
ClientInventoryPanelFlipProductOver
AddFriend
Open3DWindow
ChangeSeat
TypeaBubble
VisitOwnHomepage
TakeaSnapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
Addressspaceremaining:1G
LeaveaMessage
NUI:ChatMode
NUI:FriendsMode
dv
WebsiteLogin
AddBuddy
NUI:PublicRoomMode
NUI:MyRoomMode
ClientInventoryPanelRemoveProduct
ClientInventoryPanelApplyProduct
NUI:DressUpMode
6
Team Process = Needs
discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver data products at scale to LOB end uses
apps: build smarts into product features
systems: keep infrastructure running, cost-effective
(diagram labels: analysts, engineers, inter-disciplinary leadership)
7
Team Composition = Roles
Domain Expert: business process, stakeholder
Data Scientist: data prep, discovery, modeling, etc. (data science, the introduced capability)
App Dev: software engineering, automation
Ops: systems engineering, availability
leverage non-traditional pairing among roles, to complement skills and tear down silos
8
Team Composition = Needs × Roles
[diagram: the five needs (discovery, modeling, integration, apps, systems) laid out against the four roles (Domain Expert, Data Scientist, App Dev, Ops)]
9
Alternatively, Data Roles × Skill Sets
Harlan Harris, et al.
datacommunitydc.org/blog/wp-content/uploads/
2012/08/SkillsSelfIDMosaic-edit-500px.png
Analyzing the Analyzers
Harlan Harris, Sean Murphy,
Marck Vaisman
O’Reilly, 2013
amazon.com/dp/B00DBHTE56
10
Learning Curves
difficulties in the commercial use of distributed systems
often get represented as issues of managing complexity
much of the risk in managing a data science team is about
budgeting for learning curve: some orgs practice a kind of
engineering “conservatism”, with highly structured process
and strictly codified practices – people learn a few things
well, then avoid having to struggle with learning many new
things perpetually…
that anti-pattern leads to big teams, low ROI
scale➞
complexity➞
ultimately, the challenge is about
managing learning curves within
a social context
11
Business Disruption through Data
Geoffrey Moore
Mohr Davidow Ventures, author of Crossing the Chasm
@Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the
entire Global 1000 on notice over the next decade…
data as the major force… mostly through apps –
verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
@XLDB, 2012:
complex analytics workloads are now displacing SQL
as the basis for Enterprise apps
13
Data Categories
Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
let’s now add other useful distinctions:
• Open Data
• Curated Metadata
• A/D conversion for sensors (IoT)
14
Open Data notes
successful apps incorporate three components:
• Big Data (consumer interest, personalization)
• Open Data (monetizing public data)
• Curated Metadata
most of the largest Cascading deployments leverage some
Open Data components: Climate Corp, Factual, Nokia, etc.
consider buildingeye.com, aggregate building permits:
• pricing data for home owners looking to remodel
• sales data for contractors
• imagine joining data with building inspection history,
for better insights about properties for sale…
research notes about
Open Data use cases:
goo.gl/cd995T
15
Trends in Public Administration
late 1880s – late 1920s (Woodrow Wilson)
as hierarchy, bureaucracy → only for the most educated, elite
late 1920s – late 1930s
as a business, relying on “Scientific Method”, gov as a process
late 1930s – late 1940s (Robert Dale)
relationships, behavioral-based → policy not separate from politics
late 1940s – 1980s
yet another form of management → less “command and control”
1980s – 1990s (David Osborne, Ted Gaebler)
New Public Management → service efficiency, more private sector
1990s – present (Janet & Robert Denhardt)
Digital Age → transparency, citizen-based “debugging”, bankruptcies
Adapted from:
The Roles, Actors, and Norms Necessary to
Institutionalize Sustainable Collaborative Governance
Peter Pirnejad
USC Price School of Policy
2013-05-02
Drivers, circa 2013
• governments have run out of money,
cannot increase staff and services
• better data infra at scale (cloud, OSS, etc.)
• machine learning techniques to monetize
• viable ecosystem for data products, APIs
• mobile devices enabling use cases
16
Open Data ecosystem
municipal departments: e.g., Palo Alto, Chicago, DC, etc.
publishing platforms: e.g., Junar, Socrata, etc.
aggregators: e.g., OpenStreetMap, WalkScore, etc.
data product vendors: e.g., Factual, Marinexplore, etc.
end use cases: e.g., Facebook, Climate, etc.
Data feeds structured for
public private partnerships
17
Open Data ecosystem – caveats for agencies
Required Focus
• respond to viable use cases
• not budgeting hackathons
18
Open Data ecosystem – caveats for publishers
Required Focus
• surface the metadata
• curate, allowing for joins/aggregation
• not scans as PDFs
19
Open Data ecosystem – caveats for aggregators
Required Focus
• make APIs consumable by automation
• allow for probabilistic usage
• not OSS licensing for data
20
Open Data ecosystem – caveats for data vendors
Required Focus
• supply actionable data
• track data provenance carefully
• provide feedback upstream,
i.e., cleaned data at source
• focus on core verticals
21
Open Data ecosystem – caveats for end uses
Required Focus
• address consumer needs
• identify community benefits
of the data
22
algorithmic modeling
+ machine data (Big Data)
+ curation, metadata
+ Open Data
data products, as feedback into automation
evolution of feedback loops
less about “bigness”, more about complexity
internet of things
+ A/D conversion
+ more complex analytics
accelerated evolution, additional feedback loops
orders of magnitude higher data rates
Recipes for Success
source: National Geographic
“A kind of Cambrian explosion”
source: National Geographic
23
Internet of Things
24
Trendlines
Big Data? we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Jawbone, Nike, etc.,
plus the effects of Google Glass
7+ billion people, instrumented better than … how we
have Nagios instrumenting our web servers right now
technologyreview.com/...
25
in general, apps alternate between learning patterns/rules
and retrieving similar things…
machine learning – scalable, arguably quite ad-hoc,
generally “black box” solutions, enabling you to make billion
dollar mistakes, with oh so much commercial emphasis
(i.e. the “heavy lifting”)
statistics – rigorous, much slower to evolve, confidence
and rationale become transparent, preventing you from
making billion dollar mistakes, any good commercial project
has ample stats work used in QA
(i.e.,“CYA, cover your analysis”)
once Big Data projects get beyond merely digesting
log files, optimization will likely become the next
overused buzzword :)
Learning Theory
27
Generalizations about Machine Learning…
great introduction to ML, plus a proposed categorization
for comparing different machine learning approaches:
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
toward a categorization for Machine Learning algorithms:
• representation: classifier must be represented in some
formal language that computers can handle (algorithms, data
structures, etc.)
• evaluation: evaluation function (objective function, scoring
function) is needed to distinguish good classifiers from bad
ones
• optimization: method to search among the classifiers in
the language for the highest-scoring one
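The three components above can be made concrete with a toy learner. This is only a sketch under stated assumptions: the class name `StumpLearner`, the one-feature threshold classifier, the accuracy score, and the brute-force search are all illustrative choices, not taken from the Domingos paper.

```java
import java.util.Arrays;

/** Toy learner split into representation, evaluation, and optimization (hypothetical names). */
final class StumpLearner {
    // Representation: a classifier is a single threshold on one numeric feature.
    static boolean predict(double threshold, double x) {
        return x >= threshold;
    }

    // Evaluation: fraction of training examples that the threshold classifies correctly.
    static double accuracy(double threshold, double[] xs, boolean[] labels) {
        int correct = 0;
        for (int i = 0; i < xs.length; i++) {
            if (predict(threshold, xs[i]) == labels[i]) {
                correct++;
            }
        }
        return (double) correct / xs.length;
    }

    // Optimization: exhaustive search over candidate thresholds (the observed values),
    // keeping the first candidate with the highest score.
    static double fit(double[] xs, boolean[] labels) {
        double[] candidates = xs.clone();
        Arrays.sort(candidates);
        double best = candidates[0];
        double bestScore = -1.0;
        for (double t : candidates) {
            double score = accuracy(t, xs, labels);
            if (score > bestScore) {
                bestScore = score;
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 2.0, 3.0, 6.0, 7.0, 8.0};
        boolean[] labels = {false, false, false, true, true, true};
        double t = fit(xs, labels);
        System.out.println("threshold=" + t + " accuracy=" + accuracy(t, xs, labels));
    }
}
```

Swapping any one component (a different representation, a different score, a smarter search) yields a different learner, which is the point of the categorization.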
28
Something to consider about Algorithms…
many algorithm libraries used today are based on implementations
back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead of e-commerce in
terms of data rates and sophisticated algorithms work – as Breiman
suggested in 2001 – may take a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
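As one illustration of the first and last bullets, here is a minimal Count-Min sketch, a probabilistic data structure for approximate counts over a stream. The class name, the CRC32-based hashing, and the sizes are assumptions made for this example; production libraries use stronger hash families. Its `merge` method shows the partial-aggregate idea: sketches built on separate partitions combine by simple addition.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Minimal Count-Min sketch (illustrative only; hashing scheme is an assumption). */
final class CountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] table;

    CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    // One hash function per row, derived by salting the item with the row index.
    private int bucket(String item, int row) {
        CRC32 crc = new CRC32();
        crc.update((row + ":" + item).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % width);
    }

    // Record `count` occurrences of `item` in every row.
    void add(String item, long count) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)] += count;
        }
    }

    // Estimate is the minimum over rows; collisions can only inflate a cell,
    // so the estimate never falls below the true count.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }

    // Partial aggregates: two sketches of the same shape merge cell by cell.
    void merge(CountMinSketch other) {
        if (other.depth != depth || other.width != width) {
            throw new IllegalArgumentException("sketch shapes differ");
        }
        for (int row = 0; row < depth; row++) {
            for (int col = 0; col < width; col++) {
                table[row][col] += other.table[row][col];
            }
        }
    }
}
```

Because merging is associative and commutative, each worker can sketch its own partition and a reducer can combine the results in any order.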
29
Make It Sparse…
also, take a moment to check this out…
(and related work on sparse Cholesky, etc.)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
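The tall-and-skinny QR idea can be sketched on a single machine: factor each block of rows independently (the "map" step), stack the small R factors, and factor the stack once more (the "reduce" step). The code below is a rough illustration, not the implementation from the linked work: the class name is hypothetical, dense arrays stand in for distributed sparse data, and modified Gram-Schmidt is used only for brevity where a production version would likely prefer Householder reflections for numerical stability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Single-machine sketch of tall-and-skinny QR (hypothetical names, dense data). */
final class TsqrSketch {
    // R factor of A = QR via modified Gram-Schmidt; q[j] holds column j of Q.
    static double[][] rFactor(double[][] a) {
        int m = a.length;
        int n = a[0].length;
        double[][] q = new double[n][m];
        double[][] r = new double[n][n];
        for (int j = 0; j < n; j++) {
            double[] v = new double[m];
            for (int k = 0; k < m; k++) v[k] = a[k][j];
            for (int i = 0; i < j; i++) {
                double dot = 0.0;
                for (int k = 0; k < m; k++) dot += q[i][k] * v[k];
                r[i][j] = dot;
                for (int k = 0; k < m; k++) v[k] -= dot * q[i][k];
            }
            double norm = 0.0;
            for (double x : v) norm += x * x;
            norm = Math.sqrt(norm);
            r[j][j] = norm;
            // Skip normalization for (near-)dependent columns to avoid dividing by zero.
            if (norm > 1e-12) {
                for (int k = 0; k < m; k++) q[j][k] = v[k] / norm;
            }
        }
        return r;
    }

    // "Map": factor each row block; "reduce": factor the stacked R factors.
    static double[][] tsqr(double[][] a, int blockRows) {
        int n = a[0].length;
        List<double[][]> partials = new ArrayList<>();
        for (int start = 0; start < a.length; start += blockRows) {
            int end = Math.min(start + blockRows, a.length);
            partials.add(rFactor(Arrays.copyOfRange(a, start, end)));
        }
        double[][] stacked = new double[partials.size() * n][];
        int row = 0;
        for (double[][] partial : partials) {
            for (double[] partialRow : partial) stacked[row++] = partialRow;
        }
        return rFactor(stacked);
    }

    // Gram matrix A^T A, used to check that R^T R matches A^T A.
    static double[][] gram(double[][] a) {
        int n = a[0].length;
        double[][] g = new double[n][n];
        for (double[] rowVec : a) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) g[i][j] += rowVec[i] * rowVec[j];
            }
        }
        return g;
    }
}
```

Only the small n × n factors move between steps, which is why the approach suits matrices with very many rows and few columns.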
30
Sparse Matrix Collection
for those times when you really, really need
a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida
cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research
www2.research.att.com/~yifanhu/
31
A Winning Approach…
consider that if you know priors about a system, then
you may be able to leverage low dimensional structure
within high dimensional data… what impact does that
have on sampling rates?
1. real-world data
2. graph theory for representation
3. sparse matrix factorization for production work
4. cost-effective parallel processing
for machine learning app at scale
32
Just Enough Mathematics?
having a solid background in statistics becomes vital,
because it provides formalisms for what we’re trying
to accomplish at scale
along with that, some areas of math help – regardless
of the “calculus threshold” invoked at many universities…
linear algebra: e.g., calculating algorithms for large-scale apps efficiently
graph theory: e.g., representation of problems in a calculable language
abstract algebra: e.g., probabilistic data structures in streaming analytics
topology: e.g., determining the underlying structure of the data
operations research: e.g., techniques for optimization … in other words, ROI
33
ADMM: a general approach for optimizing learners
Distributed Optimization and Statistical Learning
via the Alternating Direction Method of Multipliers
Stephen Boyd, Neal Parikh, et al., Stanford
stanford.edu/~boyd/papers/admm_distr_stats.html
“Throughout, the focus is on applications rather than theory, and a main goal is
to provide the reader with a kind of ‘toolbox’ that can be applied in many situations
to derive and implement a distributed algorithm of practical use. Though the focus
here is on parallelism, the algorithm can also be used serially, and it is interesting
to note that with no tuning, ADMM can be competitive with the best known
methods for some problems.”
“While we have emphasized applications that can be concisely explained, the
algorithm would also be a natural fit for more complicated problems in areas
like graphical models. In addition, though our focus is on statistical learning
problems, the algorithm is readily applicable in many other cases, such as in
engineering design, multi-period portfolio optimization, time series analysis,
network flow, or scheduling.”
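To give a feel for the iteration, here is ADMM in scaled form applied to a deliberately tiny problem: minimize 0.5·(x − a)² + λ·|z| subject to x = z, whose known answer is the soft-thresholded value of a. The problem choice, class name, and parameter values are assumptions for illustration; the three update steps follow the standard x-update, z-update, dual-update pattern.

```java
/** ADMM on a one-dimensional lasso-style problem (illustrative sketch, hypothetical names). */
final class AdmmSketch {
    // Soft thresholding: shrink v toward zero by kappa.
    static double softThreshold(double v, double kappa) {
        return Math.signum(v) * Math.max(Math.abs(v) - kappa, 0.0);
    }

    // minimize 0.5 * (x - a)^2 + lambda * |z|   subject to   x = z
    static double solve(double a, double lambda, double rho, int iterations) {
        double z = 0.0;
        double u = 0.0; // scaled dual variable
        for (int k = 0; k < iterations; k++) {
            // x-update: closed-form minimizer of the smooth term plus the quadratic penalty
            double x = (a + rho * (z - u)) / (1.0 + rho);
            // z-update: proximal step for the absolute-value term
            z = softThreshold(x + u, lambda / rho);
            // dual update: accumulate the constraint residual x - z
            u += x - z;
        }
        return z;
    }

    public static void main(String[] args) {
        System.out.println(solve(3.0, 1.0, 1.0, 200)); // approaches soft-threshold(3, 1)
    }
}
```

In a distributed setting the x-update would be split across workers, each holding its own slice of the data, with the z-update acting as the gather step.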
34
Enterprise Data Workflows
middleware for Big Data applications is evolving,
with commercial examples that include:
Cascading, Lingual, Pattern, etc.
Concurrent
ParAccel Big Data Analytics Platform
Actian
Anaconda supporting IPython Notebook, Pandas, Augustus, etc.
Continuum Analytics
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
36
Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ANSI SQL for ETL
37
Anatomy of an Enterprise app
J2EE for business logic
38
Anatomy of an Enterprise app
SAS for predictive models
39
Anatomy of an Enterprise app
SAS for predictive models and ANSI SQL for ETL: most of the licensing costs…
40
Anatomy of an Enterprise app
J2EE for business logic: most of the project costs…
41
source taps for Cassandra, JDBC, Splunk, etc.
Lingual: DW → ANSI SQL
Pattern: SAS, R, etc. → PMML
business logic in Java, Clojure, Scala, etc.
sink taps for Memcached, HBase, MongoDB, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
one connected DAG:
• optimization
• troubleshooting
• exception handling
• notifications
cascading.org
42
a compiler sees it all…
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
43
a compiler sees it all…
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
44
Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
to ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
Edgar Codd alluded to this (DSLs for structuring data)
in his original paper about relational model
45
Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading –
used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on
domain-specific languages (DSLs) in JVM languages which
emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
46
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
47
Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
48
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java
to define workflows out of familiar elements:
Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram of a Word Count workflow: Document Collection, Tokenize (Regex token), Scrub token, HashJoin Left against a Stop Word List (RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
data is represented as flows of tuples
operations in the flows bring functional
programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
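The elements named in the diagram (tokenize, scrub, stop-word filter, group, count) can be mimicked on a single machine with plain Java streams, which gives a sense of how a tuple flow reads as a chain of functional operations. This is an analogy only: it does not use the Cascading API, the class name is hypothetical, and the stop-word join is reduced to a set lookup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Word Count with token scrubbing as a stream pipeline (single-machine analogy). */
final class ScrubbedWordCount {
    static Map<String, Long> count(List<String> lines, Set<String> stopWords) {
        return lines.stream()
            // Tokenize: split each text line on spaces, brackets, parentheses, commas, periods
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // Scrub: trim and lower-case each token, then drop empties
            .map(token -> token.trim().toLowerCase(Locale.ROOT))
            .filter(token -> !token.isEmpty())
            // Stop-word removal: stands in for the join against the stop word list
            .filter(token -> !stopWords.contains(token))
            // GroupBy token, then Count
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }
}
```

Each stage takes a stream of values and returns another, so stages can be reordered or replaced without touching their neighbors, much like pipes in a workflow.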
49
Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Literate Programming
Don Knuth
literateprogramming.com
50
Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
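One small piece of what a planner has to do is order the steps of a workflow graph so that every step runs after its inputs. The sketch below uses a standard topological sort (Kahn's algorithm) over a map of step names; it is a generic illustration with hypothetical names and says nothing about how Cascading's own flow planner is implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Orders the steps of a workflow DAG so each step follows its inputs (generic sketch). */
final class FlowPlanSketch {
    // edges maps a step name to the steps that consume its output.
    static List<String> plan(Map<String, List<String>> edges) {
        // Count incoming edges for every step mentioned in the graph.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            inDegree.putIfAbsent(entry.getKey(), 0);
            for (String target : entry.getValue()) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        // Steps with no inputs (sources) are ready immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) ready.add(entry.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            order.add(step);
            for (String target : edges.getOrDefault(step, Collections.<String>emptyList())) {
                // A step becomes ready once all of its inputs have been scheduled.
                if (inDegree.merge(target, -1, Integer::sum) == 0) ready.add(target);
            }
        }
        if (order.size() != inDegree.size()) {
            throw new IllegalStateException("workflow graph contains a cycle");
        }
        return order;
    }
}
```

Seeing the whole graph at once is what lets a planner also check for problems (cycles, dangling steps) before any job is launched.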
51
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
The Ubiquitous Word Count
Definition:
this simple program provides an excellent test case
for parallel processing:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
a distributed computing framework that runs Word Count
efficiently in parallel at scale can handle much larger
and more interesting compute problems
count how often each word appears
in a collection of text documents
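The map and reduce pseudocode on this slide can be turned into a runnable single-process sketch. The `map` and `reduce` methods mirror the pseudocode; the `run` driver is an assumption that stands in for the framework's shuffle, grouping emitted pairs by word in memory. Splitting on whitespace is a simplification of `segment(text)`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of MapReduce Word Count (the driver is a hypothetical stand-in). */
final class WordCountMapReduce {
    // map: emit (word, 1) for each word in the document text
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return emitted;
    }

    // reduce: sum the partial counts collected for one word
    static int reduce(String word, Iterable<Integer> group) {
        int count = 0;
        for (int partialCount : group) {
            count += partialCount;
        }
        return count;
    }

    // driver: groups emitted pairs by key, the job a framework's shuffle would do
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }
}
```

Since `map` looks at one document at a time and `reduce` at one word at a time, both can be spread across machines; only the grouping step needs to move data.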
52
[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]
1 map
1 reduce
18 lines code gist.github.com/3900702
WordCount – conceptual flow diagram
cascading.org/category/impatient
53
WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps (tab-delimited text with a header line)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
54
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
GroupBy('wc')[by:['token']]
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[tail]
(edge labels in the diagram: [{2}:'doc_id', 'text'], [{1}:'token'], wc[{1}:'token'], [{2}:'token', 'count']; the GroupBy and Every steps sit inside one mapreduce job)
WordCount – generated flow diagram
55
A Thought Exercise
Consider that when a company like Caterpillar moves
into data science, they won’t be building the world’s
next search engine or social network
They will be optimizing supply chain, optimizing fuel
costs, automating data feedback loops integrated
into their equipment…
Operations Research –
crunching amazing amounts of data
$50B company, in a $250B market segment
Upcoming: tractors as drones –
guided by complex, distributed data apps
56
Alternatively…
climate.com
57
Two Avenues to the App Layer…
scale ➞
complexity➞
Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
58
Q3 1997: inflection point
four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
this effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this period
60
RDBMS
Stakeholder
SQL Query
result sets
Excel pivot tables
PowerPoint slide decks
Web App
Customers
transactions
Product
strategy
Engineering
requirements
BI
Analysts
optimized
code
Circa 1996: pre- inflection point
61
Circa 1996: pre- inflection point
“throw it over the wall”
62
RDBMS
SQL Query
result sets
recommenders
+
classifiers
Web Apps
customer
transactions
Algorithmic
Modeling
Logs
event
history
aggregation
dashboards
Product
Engineering
UX
Stakeholder Customers
DW ETL
Middleware
servlets
models
Circa 2001: post- big ecommerce successes
63
Circa 2001: post- big ecommerce successes
“data products”
64
Workflow
RDBMS
near time
batch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.
History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacity
taps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere
65
Circa 2013: clusters everywhere
“optimize topologies”
66
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
Primary Sources
67
Cluster Computing’s Dirty Little Secret
people like me make a good living by leveraging high ROI
apps based on clusters, and so the execs agree to build
out more data centers…
clusters for Hadoop/HBase, for Storm, for MySQL,
for Memcached, for Cassandra, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage; but terrible for utilization… various notions
of “cloud” help
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS”: all your workloads are belong to us
regardless of how architectures change, death and taxes
will endure: servers fail, and data must move
Google Data Center, Fox News
~2002
68
Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q:
what kinds of disruption in topologies
could this imply? because there’s
no such thing as RAM anymore…
69
Topologies
Hadoop and other topologies arose from a need for fault-
tolerant workloads, leveraging horizontal scale-out based
on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged,
which can be categorized in terms of topologies and
the CAP Theorem
(CAP triangle: C = strong consistency, A = high availability,
P = partition tolerance; “eventual consistency” is also labeled)
“You can have at most two of these properties for
any shared-data system… the choice of which
feature to discard determines the nature of your
system.” – Eric Brewer, 2000 (Inktomi/YHOO)
cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
julianbrowne.com/article/viewer/brewers-cap-theorem
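As a rough illustration of the trade-off in Brewer's quote, here is a minimal Python sketch of a single value replicated on two nodes during a network partition. Everything named here (`TwoReplicaRegister`, `Unavailable`, the `"CP"`/`"AP"` mode flag, and the "replica 0 wins" reconciliation rule) is hypothetical and invented for this example; it is not drawn from any of the systems named in this talk.

```python
class Unavailable(Exception):
    """Raised when a consistency-first store refuses a request."""


class TwoReplicaRegister:
    """Toy single-value store replicated on two nodes (hypothetical model)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [None, None]   # one copy per replica
        self.partitioned = False     # True when the replicas cannot talk

    def write(self, replica, value):
        if not self.partitioned:
            # Healthy network: update both copies together.
            self.values = [value, value]
        elif self.mode == "CP":
            # Keep consistency by giving up availability.
            raise Unavailable("cannot reach the other replica")
        else:
            # Keep availability by letting the copies diverge.
            self.values[replica] = value

    def read(self, replica):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value")
        return self.values[replica]

    def heal(self):
        """End the partition; reconcile divergent copies (assumed: replica 0 wins)."""
        self.partitioned = False
        if self.values[0] != self.values[1]:
            self.values = [self.values[0], self.values[0]]


def demo():
    ap = TwoReplicaRegister("AP")
    ap.write(0, "v1")
    ap.partitioned = True
    ap.write(1, "v2")                      # accepted, but only on replica 1
    diverged = (ap.read(0), ap.read(1))    # replica 0 serves a stale value
    ap.heal()
    return diverged, ap.read(1)
```

In `"AP"` mode the store keeps answering during the partition and the copies only agree again after `heal()`, which is one simple reading of eventual consistency. In `"CP"` mode the same write raises `Unavailable` instead.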
70
Some Topologies Other Than Hadoop…
Spark (iterative/interactive)
Titan (graph database)
Redis (data structure server)
ZooKeeper (distributed metadata)
HBase (columnar data objects)
Riak (durable key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
ParAccel (MPP)
SciDB (array database)
71
“Return of the Borg”
consider that Google is generations ahead of
Hadoop, etc., with much improved ROI on its
data centers…
Borg serves as a kind of “secret sauce” for
data center OS, with Omega as its next
evolution:
2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
72
“Return of the Borg”
Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos
The Datacenter as a Computer: An Introduction
to the Design of Warehouse-Scale Machines
Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html
73
Mesos – definitions
a common substrate for cluster computing
heterogeneous assets in your data center or cloud
made available as a homogeneous set of resources
• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation between tasks with Linux Containers (pluggable)
• fault-tolerant replicated master using ZooKeeper
• multi-resource scheduling (memory and CPU aware)
• APIs in C++, Java, Python
• web UI for inspecting cluster state
• available for Linux, Mac OSX, OpenSolaris
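One way to picture the “multi-resource scheduling (memory and CPU aware)” bullet is Dominant Resource Fairness (DRF), a policy published by several of the Mesos authors. The sketch below is a standalone toy under that assumption, not the Mesos allocator or its API; the function name `drf_allocate` and the dictionary layout are hypothetical. The numbers are the worked example from the DRF paper: a cluster of 9 CPUs and 18 GB, with framework A needing 1 CPU and 4 GB per task and framework B needing 3 CPUs and 1 GB per task.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Greedy DRF: repeatedly serve the framework with the lowest dominant share.

    capacity: dict resource -> total amount in the cluster
    demands:  dict framework -> dict resource -> amount needed per task
    Returns dict framework -> number of tasks launched.
    """
    used = {r: 0 for r in capacity}
    allocated = {f: {r: 0 for r in capacity} for f in demands}
    tasks = {f: 0 for f in demands}

    def dominant_share(f):
        # Largest fraction of any single resource held by framework f.
        return max(Fraction(allocated[f][r], capacity[r]) for r in capacity)

    while True:
        f = min(demands, key=dominant_share)  # ties: first framework listed
        need = demands[f]
        if any(used[r] + need[r] > capacity[r] for r in capacity):
            return tasks  # next task does not fit: treat the cluster as full
        for r in capacity:
            used[r] += need[r]
            allocated[f][r] += need[r]
        tasks[f] += 1


result = drf_allocate(
    capacity={"cpus": 9, "mem_gb": 18},
    demands={
        "A": {"cpus": 1, "mem_gb": 4},
        "B": {"cpus": 3, "mem_gb": 1},
    },
)
print(result)
```

With these inputs A ends up with three tasks and B with two, so each framework holds two thirds of its dominant resource (memory for A, CPU for B). Exact fractions are used instead of floats so that ties in dominant share compare reliably.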
74
Mesos – simplifies app development
CHRONOS SPARK HADOOP DPARK MPI
JVM (JAVA, SCALA, CLOJURE, JRUBY)
MESOS
PYTHON C++
75
Mesos – data center OS stack
HADOOP STORM CHRONOS RAILS JBOSS
TELEMETRY
Kernel
OS
Apps
MESOS
CAPACITY PLANNING, GUI, SECURITY, SMARTER SCHEDULING
76
Mesos – architecture
(layer diagram; labels: Kernel: Mesos; Apps: Chronos, Marathon;
Frameworks: Hadoop, Spark, Storm, Rails, JBoss, Kafka, MPI, Hive, Scalding;
JVM, Python, C++; Workloads: Web Apps, Streaming, Batch)
77
Prior Practice: Dedicated Servers
DATACENTER
• low utilization rates
• longer time to ramp up new services
78
Prior Practice: Virtualization
DATACENTER PROVISIONED VMS
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
79
Prior Practice: Static Partitioning
DATACENTER STATIC PARTITIONING
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
• static partitioning limits elasticity
80
MESOS
Mesos: One Large Pool Of Resources
DATACENTER
“We wanted people to be able to program
for the data center just like they program
for their laptop.”
Ben Hindman
81
What are the costs of Virtualization?
benchmark type      OpenVZ improvement
mixed workloads     210%-300%
LAMP (related)      38%-200%
I/O throughput      200%-500%
response time       order of magnitude
more pronounced at higher loads
82
What are the costs of Single Tenancy?
(four charts of CPU load, 0%-100%, over time t:
RAILS CPU LOAD; MEMCACHED CPU LOAD; HADOOP CPU LOAD;
COMBINED CPU LOAD (RAILS, MEMCACHED, HADOOP))
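The cost of single tenancy charted on this slide can be sketched numerically: a dedicated cluster per workload must be sized for that workload's own peak, while a shared pool only has to cover the peak of the combined load. The load profiles below are invented for illustration and are not data from the talk; the helper names `capacity_needed` and `mean_utilization` are likewise hypothetical.

```python
def capacity_needed(loads):
    """Return (dedicated, shared) capacity for per-workload load time series.

    loads: dict workload -> list of CPU demand samples (same length, same units)
    dedicated: sum of each workload's own peak (one cluster per workload)
    shared:    peak of the summed load (one pool for all workloads)
    """
    dedicated = sum(max(series) for series in loads.values())
    combined = [sum(samples) for samples in zip(*loads.values())]
    shared = max(combined)
    return dedicated, shared


def mean_utilization(loads, capacity):
    """Average fraction of `capacity` in use across all time steps."""
    combined = [sum(samples) for samples in zip(*loads.values())]
    return sum(combined) / (len(combined) * capacity)


# Hypothetical CPU demand per time step, in cores (illustrative numbers only).
loads = {
    "rails":     [20, 60, 80, 40],   # daytime web traffic
    "memcached": [10, 30, 40, 20],   # follows web traffic
    "hadoop":    [90, 20, 10, 70],   # batch jobs run off-peak
}

dedicated, shared = capacity_needed(loads)
print(dedicated, shared)
print(round(mean_utilization(loads, dedicated), 2),
      round(mean_utilization(loads, shared), 2))
```

With these made-up profiles, three dedicated clusters need 210 cores in total while one shared pool needs 130, and mean utilization rises from about 58% to about 94%. The gain depends entirely on the workloads peaking at different times.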
83
Compelling arguments for Data Center OS
• obviates the need for VMs (licensing, adios VMware)
• provides OS-level building blocks for developing new
distributed frameworks (learning curve, adios Hadoop)
• removes significant VM overhead (performance)
• requires less h/w to buy (CapEx), power and fix (OpEx)
• implies fewer VMs, thus less Ops overhead (staff)
• removes the complexity of Chef/Puppet (staff)
• allows higher utilization rates (ROI)
• reduces latency for data updates (OLTP + OLAP on same server)
• reshapes cluster resources dynamically (100s of ms vs. minutes)
• runs dev/test clusters on same h/w as production (flexibility)
• evaluates multiple versions without more h/w (vendor lock-in)
84
Opposite Ends of the Spectrum, One Substrate
Built-in /
bare metal
Hypervisors
Solaris Zones
Linux CGroups
85
Opposite Ends of the Spectrum, One Substrate
Request /
Response
Batch
86
Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure –
it’s how we build all our new services and is critical for Twitter’s
continued success at scale. It’s one of the primary keys to our
data center efficiency.”
Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
• key services run in production: analytics, typeahead, ads
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think
about resources like CPU, memory and disk
• allows services to scale and leverage a shared pool of
servers across data centers efficiently
• reduces the time between prototyping and launching
87
Case Study: Airbnb (fungible cloud infrastructure)
“We think we might be pushing data science in the field of travel
more so than anyone has ever done before… a smaller number
of engineers can have higher impact through automation on
Mesos.”
Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...
• improves resource management and efficiency
• helps advance engineering strategy of building small teams
that can move fast
• key to letting engineers make the most of AWS-based
infrastructure beyond just Hadoop
• allowed company to migrate off Elastic MapReduce
• enables use of Hadoop along with Chronos, Spark, Storm, etc.
88
Resources
Apache Project
mesos.apache.org
Mesosphere
mesosphe.re
Getting Started
mesosphe.re/tutorials
Documentation
mesos.apache.org/documentation
Research Paper
usenix.org/legacy/event/nsdi11/tech/full_papers/Hindman_new.pdf
Collected Notes/Archives
goo.gl/jPtTP
89
Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. the evolution of cluster computing
SUMMARY…
DSSG, 2013-08-12
90
(“Use Cases Across Topologies” diagram, repeated from the earlier
“Circa 2013: clusters everywhere” slides)
Circa 2013: clusters everywhere – Four-Part Harmony
91
(“Use Cases Across Topologies” diagram, repeated from the preceding slide)
Circa 2013: clusters everywhere – Four-Part Harmony
1. End Use Cases, the drivers
92
(“Use Cases Across Topologies” diagram, repeated from the preceding slide)
Circa 2013: clusters everywhere – Four-Part Harmony
2. A new kind of team process
93
(“Use Cases Across Topologies” diagram, repeated from the preceding slide)
Circa 2013: clusters everywhere – Four-Part Harmony
3. Abstraction layer as optimizing
middleware, e.g., Cascading
94
(“Use Cases Across Topologies” diagram, repeated from the preceding slide)
Circa 2013: clusters everywhere – Four-Part Harmony
4. Data Center OS, e.g., Mesos
95
Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
monthly newsletter for updates, events,
conference summaries, etc.:
liber118.com/pxn/
96

More Related Content

What's hot

Big data issues and challenges
Big data issues and challengesBig data issues and challenges
Big data issues and challengesDilpreet kaur Virk
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesKathirvel Ayyaswamy
 
Addressing Big Data Challenges - The Hadoop Way
Addressing Big Data Challenges - The Hadoop WayAddressing Big Data Challenges - The Hadoop Way
Addressing Big Data Challenges - The Hadoop WayXoriant Corporation
 
Stanford DeepDive Framework
Stanford DeepDive FrameworkStanford DeepDive Framework
Stanford DeepDive FrameworkRan Zhang
 
Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi...
Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi...Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi...
Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi...DATAVERSITY
 
Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its ChallengesKathirvel Ayyaswamy
 
An Obligatory Introduction to Data Science
An Obligatory Introduction to Data ScienceAn Obligatory Introduction to Data Science
An Obligatory Introduction to Data ScienceWesley Eldridge
 
Big data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing PlatformsBig data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing PlatformsIJERA Editor
 
Semantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextSemantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextMurad Daryousse
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceSrishti44
 
Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Gregg Barrett
 
big data Big Things
big data Big Thingsbig data Big Things
big data Big Thingspateelhs
 
Sameer Kumar Das International Conference Paper 53
Sameer Kumar Das International Conference Paper 53Sameer Kumar Das International Conference Paper 53
Sameer Kumar Das International Conference Paper 53Mr.Sameer Kumar Das
 

What's hot (20)

Big data issues and challenges
Big data issues and challengesBig data issues and challenges
Big data issues and challenges
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
 
Big data survey
Big data surveyBig data survey
Big data survey
 
Addressing Big Data Challenges - The Hadoop Way
Addressing Big Data Challenges - The Hadoop WayAddressing Big Data Challenges - The Hadoop Way
Addressing Big Data Challenges - The Hadoop Way
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Stanford DeepDive Framework
Stanford DeepDive FrameworkStanford DeepDive Framework
Stanford DeepDive Framework
 
Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi...
Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi...Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi...
Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi...
 
Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its Challenges
 
An Obligatory Introduction to Data Science
An Obligatory Introduction to Data ScienceAn Obligatory Introduction to Data Science
An Obligatory Introduction to Data Science
 
Big data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing PlatformsBig data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing Platforms
 
Business analytics
Business analyticsBusiness analytics
Business analytics
 
M.Florence Dayana
M.Florence DayanaM.Florence Dayana
M.Florence Dayana
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
Semantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextSemantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data Context
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?
 
big data Big Things
big data Big Thingsbig data Big Things
big data Big Things
 
Sameer Kumar Das International Conference Paper 53
Sameer Kumar Das International Conference Paper 53Sameer Kumar Das International Conference Paper 53
Sameer Kumar Das International Conference Paper 53
 

Similar to DSSG Speaker Series: Paco Nathan

The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...Juan Mateos-Garcia
 
Big Data: Are you ready for it? Can you handle it?
Big Data: Are you ready for it? Can you handle it? Big Data: Are you ready for it? Can you handle it?
Big Data: Are you ready for it? Can you handle it? ScaleFocus
 
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...PhD Assistance
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...Piet J.H. Daas
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxjuliennehar
 
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...Sahilakhurana
 
Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Joanne Luciano
 
Data analytics career path
Data analytics career pathData analytics career path
Data analytics career pathRubikal
 
DATA SCIENCE PPT1.pptx
DATA SCIENCE PPT1.pptxDATA SCIENCE PPT1.pptx
DATA SCIENCE PPT1.pptxDMKurnool
 
DATA SCIENCE PPT.pptx
DATA SCIENCE PPT.pptxDATA SCIENCE PPT.pptx
DATA SCIENCE PPT.pptxDMKurnool
 
Global Data Management: Governance, Security and Usefulness in a Hybrid World
Global Data Management: Governance, Security and Usefulness in a Hybrid WorldGlobal Data Management: Governance, Security and Usefulness in a Hybrid World
Global Data Management: Governance, Security and Usefulness in a Hybrid WorldNeil Raden
 
Disruptive as Usual: New Technologies and Data Value Professor Severino Mereg...
Disruptive as Usual: New Technologies and Data Value Professor Severino Mereg...Disruptive as Usual: New Technologies and Data Value Professor Severino Mereg...
Disruptive as Usual: New Technologies and Data Value Professor Severino Mereg...Data Science Society
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleDr. Radhey Shyam
 
From eGov 2.0 to eGov 3.0: The Research Agenda
From eGov 2.0 to eGov 3.0: The Research AgendaFrom eGov 2.0 to eGov 3.0: The Research Agenda
From eGov 2.0 to eGov 3.0: The Research Agendasamossummit
 
DSS_Understanding_the_paradigm_shift.pdf
DSS_Understanding_the_paradigm_shift.pdfDSS_Understanding_the_paradigm_shift.pdf
DSS_Understanding_the_paradigm_shift.pdfBizuayehuDesalegn
 

Similar to DSSG Speaker Series: Paco Nathan (20)

The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...
 
Big Data: Are you ready for it? Can you handle it?
Big Data: Are you ready for it? Can you handle it? Big Data: Are you ready for it? Can you handle it?
Big Data: Are you ready for it? Can you handle it?
 
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docx
 
The Role of Big Data Management and Analytics in Higher Education
The Role of Big Data Management and Analytics in Higher EducationThe Role of Big Data Management and Analytics in Higher Education
The Role of Big Data Management and Analytics in Higher Education
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
Data Science for Finance Interview.
Data Science for Finance Interview. Data Science for Finance Interview.
Data Science for Finance Interview.
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
 
Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020
 
Data analytics career path
Data analytics career pathData analytics career path
Data analytics career path
 
Data Analytics Career Paths
Data Analytics Career PathsData Analytics Career Paths
Data Analytics Career Paths
 
DATA SCIENCE PPT1.pptx
DATA SCIENCE PPT1.pptxDATA SCIENCE PPT1.pptx
DATA SCIENCE PPT1.pptx
 
DATA SCIENCE PPT.pptx
DATA SCIENCE PPT.pptxDATA SCIENCE PPT.pptx
DATA SCIENCE PPT.pptx
 
Global Data Management: Governance, Security and Usefulness in a Hybrid World
Global Data Management: Governance, Security and Usefulness in a Hybrid WorldGlobal Data Management: Governance, Security and Usefulness in a Hybrid World
Global Data Management: Governance, Security and Usefulness in a Hybrid World
 
Disruptive as Usual: New Technologies and Data Value Professor Severino Mereg...
Disruptive as Usual: New Technologies and Data Value Professor Severino Mereg...Disruptive as Usual: New Technologies and Data Value Professor Severino Mereg...
Disruptive as Usual: New Technologies and Data Value Professor Severino Mereg...
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycle
 
From eGov 2.0 to eGov 3.0: The Research Agenda
From eGov 2.0 to eGov 3.0: The Research AgendaFrom eGov 2.0 to eGov 3.0: The Research Agenda
From eGov 2.0 to eGov 3.0: The Research Agenda
 
DSS_Understanding_the_paradigm_shift.pdf
DSS_Understanding_the_paradigm_shift.pdfDSS_Understanding_the_paradigm_shift.pdf
DSS_Understanding_the_paradigm_shift.pdf
 

More from Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 

More from Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
DSSG Speaker Series: Paco Nathan

  • 1. DSSG Speaker Series, 2013-08-12: Learnings generalized from trends in Data Science: a 30-year retrospective on Machine Learning, a 10-year summary of Leading Data Science Teams, and a 2-year survey of Enterprise Use Cases. Paco Nathan @pacoid, Chief Scientist, Mesosphere
  • 2. Learnings generalized from trends in Data Science: 1. the practice of leading data science teams 2. strategies for leveraging data at scale 3. machine learning and optimization 4. large-scale data workflows 5. the evolution of cluster computing DSSG, 2013-08-12 2
  • 3. Statistical Thinking (diagram labels: Process, Variation, Data, Tools): employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables. This approach attempts to understand not just problems and solutions, but also the processes involved and their variances. Particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way… however, both systems engineers and data scientists must
  • 4. Modeling back in the day, we worked with practices based on data modeling 1. sample the data 2. fit the sample to a known distribution 3. ignore the rest of the data 4. infer, based on that fitted distribution that served well with ONE computer, ONE analyst, ONE model… just throw away annoying “extra” data circa late 1990s: machine data, aggregation, clusters, etc. algorithmic modeling displaced the prior practices of data modeling because the data won’t fit on one computer anymore 4
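The four "data modeling" steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the normal distribution, the function name `fit_and_infer`, and the tail-probability question are all hypothetical choices made here, not anything specified in the talk.

```python
import random
from statistics import NormalDist


def fit_and_infer(population, sample_size, threshold, seed=0):
    """Classic data-modeling workflow (hypothetical helper, for illustration).

    Returns the fitted distribution and the estimated probability that a
    value exceeds `threshold`.
    """
    rng = random.Random(seed)
    # 1. sample the data
    sample = rng.sample(population, sample_size)
    # 2. fit the sample to a known distribution (a normal is assumed here)
    fitted = NormalDist.from_samples(sample)
    # 3. ignore the rest of the data: `population` is never touched again
    # 4. infer, based on that fitted distribution
    return fitted, 1.0 - fitted.cdf(threshold)


if __name__ == "__main__":
    source = random.Random(42)
    population = [source.gauss(100, 15) for _ in range(10_000)]
    fitted, tail = fit_and_infer(population, sample_size=500, threshold=130)
    print(f"fitted mean={fitted.mean:.1f}, stdev={fitted.stdev:.1f}, P(x>130)={tail:.3f}")
```

The point of the sketch is the third step: everything outside the sample is discarded, which is workable on one machine but is exactly what algorithmic modeling at cluster scale stopped doing.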
  • 5. Two Cultures: “A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.” Statistical Modeling: The Two Cultures, Leo Breiman, 2001, bit.ly/eUTh9L. Breiman chronicled a sea change from data modeling (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization), which led in turn to the practice of leveraging inter-disciplinary teams
  • 6. What is needed most? approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log files, etc., generally by socializing the problem; unfortunately, data-related budgets tend to go into frameworks that can only be used after clean up. most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to understand the audience and their priorities ‣ learn to socialize the problems, knocking down silos ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making process repeatable [figure: word cloud of logged application event names, e.g., UniqueRegistration, CreateNewPet, FeedPet, AddFriend, WebsiteLogin]
  • 7. Team Process = Needs: discovery – help people ask the right questions; modeling – allow automation to place informed bets; integration – deliver data products at scale to LOB end uses; apps – build smarts into product features; systems – keep infrastructure running, cost-effective (diagram labels: analysts, engineers, inter-disciplinary leadership)
  • 8. Team Composition = Roles: Domain Expert – business process, stakeholder; Data Scientist – data prep, discovery, modeling, etc.; App Dev – software engineering, automation; Ops – systems engineering, availability; introduced capability – data science. leverage non-traditional pairing among roles, to complement skills and tear down silos
  • 9. Team Composition = Needs × Roles [figure: matrix of the needs (discovery, modeling, integration, apps, systems) against the roles: Domain Expert – business process, stakeholder; Data Scientist – data prep, discovery, modeling, etc.; App Dev – software engineering, automation; Ops – systems engineering, availability; introduced – data science]
  • 10. Alternatively, Data Roles × Skill Sets: Harlan Harris, et al. datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png Analyzing the Analyzers, Harlan Harris, Sean Murphy, Marck Vaisman, O’Reilly, 2013, amazon.com/dp/B00DBHTE56
  • 11. Learning Curves: difficulties in the commercial use of distributed systems often get represented as issues of managing complexity; much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering “conservatism”, with highly structured process and strictly codified practices – people learn a few things well, then avoid having to struggle with learning many new things perpetually… that anti-pattern leads to big teams, low ROI. ultimately, the challenge is about managing learning curves within a social context (chart axes: scale ➞, complexity ➞)
  • 12. Learnings generalized from trends in Data Science: 1. the practice of leading data science teams 2. strategies for leveraging data at scale 3. machine learning and optimization 4. large-scale data workflows 5. the evolution of cluster computing DSSG, 2013-08-12 12
  • 13. Business Disruption through Data. Geoffrey Moore, Mohr Davidow Ventures, author of Crossing The Chasm, @Hadoop Summit, 2012: what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade… data as the major force… mostly through apps – verticals, leveraging domain expertise. Michael Stonebraker, INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc., @XLDB, 2012: complex analytics workloads are now displacing SQL as the basis for Enterprise apps
  • 14. Data Categories Three broad categories of data Curt Monash, 2010 dbms2.com/2010/01/17/three-broad-categories-of-data • Human/Tabular data – human-generated data which fits into tables/arrays • Human/Nontabular data – all other data generated by humans • Machine-Generated data let’s now add other useful distinctions: • Open Data • Curated Metadata • A/D conversion for sensors (IoT) 14
  • 15. Open Data notes successful apps incorporate three components: • Big Data (consumer interest, personalization) • Open Data (monetizing public data) • Curated Metadata most of the largest Cascading deployments leverage some Open Data components: Climate Corp, Factual, Nokia, etc. consider buildingeye.com, aggregate building permits: • pricing data for home owners looking to remodel • sales data for contractors • imagine joining data with building inspection history, for better insights about properties for sale… research notes about Open Data use cases: goo.gl/cd995T 15
  • 16. Trends in Public Administration: late 1880s – late 1920s (Woodrow Wilson): as hierarchy, bureaucracy → only for the most educated, elite; late 1920s – late 1930s: as a business, relying on “Scientific Method”, gov as a process; late 1930s – late 1940s (Robert Dale): relationships, behavioral-based → policy not separate from politics; late 1940s – 1980s: yet another form of management → less “command and control”; 1980s – 1990s (David Osborne, Ted Gaebler): New Public Management → service efficiency, more private sector; 1990s – present (Janet & Robert Denhardt): Digital Age → transparency, citizen-based “debugging”, bankruptcies. Adapted from: The Roles, Actors, and Norms Necessary to Institutionalize Sustainable Collaborative Governance, Peter Pirnejad, USC Price School of Policy, 2013-05-02. Drivers, circa 2013: • governments have run out of money, cannot increase staff and services • better data infra at scale (cloud, OSS, etc.) • machine learning techniques to monetize • viable ecosystem for data products, APIs • mobile devices enabling use cases
  • 17. Open Data ecosystem: municipal departments (e.g., Palo Alto, Chicago, DC, etc.) → publishing platforms (e.g., Junar, Socrata, etc.) → aggregators (e.g., OpenStreetMap, WalkScore, etc.) → data product vendors (e.g., Factual, Marinexplore, etc.) → end use cases (e.g., Facebook, Climate, etc.). Data feeds structured for public private partnerships
  • 18. Open Data ecosystem – caveats for agencies (same ecosystem diagram as slide 17). Required Focus: • respond to viable use cases • not budgeting hackathons
  • 19. Open Data ecosystem – caveats for publishers (same ecosystem diagram as slide 17). Required Focus: • surface the metadata • curate, allowing for joins/aggregation • not scans as PDFs
  • 20. Open Data ecosystem – caveats for aggregators (same ecosystem diagram as slide 17). Required Focus: • make APIs consumable by automation • allow for probabilistic usage • not OSS licensing for data
  • 21. Open Data ecosystem – caveats for data vendors (same ecosystem diagram as slide 17). Required Focus: • supply actionable data • track data provenance carefully • provide feedback upstream, i.e., cleaned data at source • focus on core verticals
  • 22. Open Data ecosystem – caveats for end uses (same ecosystem diagram as slide 17). Required Focus: • address consumer needs • identify community benefits of the data
  • 23. Recipes for Success: algorithmic modeling + machine data (Big Data) + curation, metadata + Open Data; data products, as feedback into automation; evolution of feedback loops; less about “bigness”, more about complexity. internet of things + A/D conversion + more complex analytics; accelerated evolution, additional feedback loops; orders of magnitude higher data rates. “A kind of Cambrian explosion” (source: National Geographic)
  • 25. Trendlines Big Data? we’re just getting started: • ~12 exabytes/day, jet turbines on commercial flights • Google self-driving cars, ~1 Gb/s per vehicle • National Instruments initiative: Big Analog Data™ • 1m resolution satellites skyboximaging.com • open resource monitoring reddmetrics.com • Sensing XChallenge nokiasensingxchallenge.org consider the implications of Jawbone, Nike, etc., plus the effects of Google Glass 7+ billion people, instrumented better than … how we have Nagios instrumenting our web servers right now technologyreview.com/... 25
  • 26. Learnings generalized from trends in Data Science: 1. the practice of leading data science teams 2. strategies for leveraging data at scale 3. machine learning and optimization 4. large-scale data workflows 5. the evolution of cluster computing DSSG, 2013-08-12 26
  • 27. Learning Theory: in general, apps alternate between learning patterns/rules and retrieving similar things… machine learning – scalable, arguably quite ad-hoc, generally “black box” solutions, enabling you to make billion dollar mistakes, with oh so much commercial emphasis (i.e., the “heavy lifting”); statistics – rigorous, much slower to evolve, confidence and rationale become transparent, preventing you from making billion dollar mistakes, any good commercial project has ample stats work used in QA (i.e., “CYA, cover your analysis”). once Big Data projects get beyond merely digesting log files, optimization will likely become the next overused buzzword :)
  • 28. Generalizations about Machine Learning… great introduction to ML, plus a proposed categorization for comparing different machine learning approaches: A Few Useful Things to Know about Machine Learning, Pedro Domingos, U Washington, homes.cs.washington.edu/~pedrod/papers/cacm12.pdf toward a categorization for Machine Learning algorithms: • representation: classifier must be represented in some formal language that computers can handle (algorithms, data structures, etc.) • evaluation: evaluation function (objective function, scoring function) is needed to distinguish good classifiers from bad ones • optimization: method to search among the classifiers in the language for the highest-scoring one
  • 29. Something to consider about Algorithms… many algorithm libraries used today are based on implementations back when people used DO loops in FORTRAN, 30+ years ago MapReduce is Good Enough? Jimmy Lin, U Maryland umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf astrophysics and genomics are light years ahead of e-commerce in terms of data rates and sophisticated algorithms work – as Breiman suggested in 2001 – may take a few years to percolate into industry other game-changers: • streaming algorithms, sketches, probabilistic data structures • significant “Big O” complexity reduction (e.g., skytree.net) • better architectures and topologies (e.g., GPUs and CUDA) • partial aggregates – parallelizing workflows 29
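The "partial aggregates" item on this slide can be illustrated with a small Python sketch: each partition of the data is reduced to a tiny summary, and summaries are merged with an associative operation, so the work can be split across workers in any order. The names `Partial`, `summarize`, and `combine` are hypothetical and introduced here only for illustration; the slide does not describe a specific implementation.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Partial:
    """Partial aggregate for a mean: enough state to merge without raw data."""
    count: int = 0
    total: float = 0.0

    def merge(self, other: "Partial") -> "Partial":
        # Associative and commutative, so partitions can be merged in any order.
        return Partial(self.count + other.count, self.total + other.total)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def summarize(chunk):
    """Reduce one partition of the data to its partial aggregate."""
    return Partial(len(chunk), float(sum(chunk)))


def combine(partials):
    """Merge partial aggregates from all partitions into one result."""
    return reduce(Partial.merge, partials, Partial())


if __name__ == "__main__":
    chunks = [[1, 2, 3], [4, 5], [6]]   # stand-ins for data held on separate workers
    result = combine(summarize(c) for c in chunks)
    print(result.count, result.mean)
```

The same shape (a small mergeable summary per partition) also underlies many of the sketches and probabilistic data structures named in the first bullet, although those trade exactness for bounded memory.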
  • 30. Make It Sparse… also, take a moment to check this out… (and related work on sparse Cholesky, etc.) QR factorization of a “tall-and-skinny” matrix • used to solve many data problems at scale, e.g., PCA, SVD, etc. • numerically stable with efficient implementation on large-scale Hadoop clusters suppose that you have a sparse matrix of customer interactions where there are 100MM customers, with a limited set of outcomes… cs.purdue.edu/homes/dgleich stanford.edu/~arbenson github.com/ccsevers/scalding-linalg David Gleich, slideshare.net/dgleich 30
  • 31. Sparse Matrix Collection for those times when you really, really need a wide variety of sparse matrix examples… University of Florida Sparse Matrix Collection cise.ufl.edu/research/sparse/matrices/ Tim Davis, U Florida cise.ufl.edu/~davis/welcome.html Yifan Hu, AT&T Research www2.research.att.com/~yifanhu/ 31
  • 32. A Winning Approach… consider that if you know priors about a system, then you may be able to leverage low dimensional structure within high dimensional data… what impact does that have on sampling rates? 1. real-world data 2. graph theory for representation 3. sparse matrix factorization for production work 4. cost-effective parallel processing for machine learning app at scale 32
  • 33. Just Enough Mathematics? having a solid background in statistics becomes vital, because it provides formalisms for what we’re trying to accomplish at scale along with that, some areas of math help – regardless of the “calculus threshold” invoked at many universities… linear algebra e.g., calculating algorithms for large-scale apps efficiently graph theory e.g., representation of problems in a calculable language abstract algebra e.g., probabilistic data structures in streaming analytics topology e.g., determining the underlying structure of the data operations research e.g., techniques for optimization … in other words, ROI 33
  • 34. ADMM: a general approach for optimizing learners. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Stephen Boyd, Neal Parikh, et al., Stanford, stanford.edu/~boyd/papers/admm_distr_stats.html “Throughout, the focus is on applications rather than theory, and a main goal is to provide the reader with a kind of ‘toolbox’ that can be applied in many situations to derive and implement a distributed algorithm of practical use. Though the focus here is on parallelism, the algorithm can also be used serially, and it is interesting to note that with no tuning, ADMM can be competitive with the best known methods for some problems.” “While we have emphasized applications that can be concisely explained, the algorithm would also be a natural fit for more complicated problems in areas like graphical models. In addition, though our focus is on statistical learning problems, the algorithm is readily applicable in many other cases, such as in engineering design, multi-period portfolio optimization, time series analysis, network flow, or scheduling.”
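For readers who have not seen the method, the standard unscaled form of ADMM can be written out as follows. The slide itself only quotes the monograph; the notation here (objective terms $f$ and $g$, constraint data $A$, $B$, $c$, dual variable $y$, penalty parameter $\rho > 0$) follows the usual textbook statement and is supplied for orientation only.

```latex
% Problem: split the objective across two blocks of variables
\begin{aligned}
  &\text{minimize}   && f(x) + g(z) \\
  &\text{subject to} && Ax + Bz = c
\end{aligned}

% Augmented Lagrangian
L_\rho(x, z, y) = f(x) + g(z) + y^{T}(Ax + Bz - c)
                + \tfrac{\rho}{2}\,\lVert Ax + Bz - c \rVert_2^2

% ADMM iteration
\begin{aligned}
  x^{k+1} &= \operatorname*{arg\,min}_{x}\; L_\rho(x, z^{k}, y^{k}) \\
  z^{k+1} &= \operatorname*{arg\,min}_{z}\; L_\rho(x^{k+1}, z, y^{k}) \\
  y^{k+1} &= y^{k} + \rho\,\bigl(Ax^{k+1} + Bz^{k+1} - c\bigr)
\end{aligned}
```

The alternation is what makes the method attractive for distributed learners: the $x$- and $z$-updates can often be solved separately (and the $x$-update split across workers), with only the dual update tying them together.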
  • 35. Learnings generalized from trends in Data Science: 1. the practice of leading data science teams 2. strategies for leveraging data at scale 3. machine learning and optimization 4. large-scale data workflows 5. the evolution of cluster computing DSSG, 2013-08-12 35
  • 36. Enterprise Data Workflows: middleware for Big Data applications is evolving, with commercial examples that include: Cascading, Lingual, Pattern, etc. (Concurrent); ParAccel Big Data Analytics Platform (Actian); Anaconda supporting IPython Notebook, Pandas, Augustus, etc. (Continuum Analytics) (diagram labels: ETL, data prep, predictive model, data sources, end uses)
  • 37. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses ANSI SQL for ETL 37
  • 38. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses; J2EE for business logic
  • 39. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive models 39
  • 40. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses; SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…
  • 41. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses; J2EE for business logic: most of the project costs…
  • 42. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source a compiler sees it all… one connected DAG: • optimization • troubleshooting • exception handling • notifications cascading.org 42
  • 43. a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "etl" ) .addSource( "example.employee", emplTap ) .addSource( "example.sales", salesTap ) .addSink( "results", resultsTap );   SQLPlanner sqlPlanner = new SQLPlanner() .setSql( sqlStatement );   flowDef.addAssemblyPlanner( sqlPlanner ); cascading.org 43
  • 44. a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "classifier" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );   PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlModel ) ) .retainOnlyActiveIncomingFields();   flowDef.addAssemblyPlanner( pmmlPlanner ); 44
  • 45. Cascading – functional programming Key insight: MapReduce is based on functional programming, going back to LISP in the late 1950s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows: • leverages JVM and Java-based tools without any need to create new languages • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters Edgar Codd alluded to this (DSLs for structuring data) in his original paper about the relational model
  • 46. Cascading – functional programming • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010), github.com/nathanmarz/cascalog/wiki; Scalding in Scala (2012), github.com/twitter/scalding/wiki Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology Dan Woods, 2013-04-17 Forbes forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
  • 47. Functional Programming for Big Data WordCount with token scrubbing… Apache Hive: 52 lines HQL + 8 lines Python (UDF) compared to Scalding: 18 lines Scala/Cascading functional programming languages help reduce software engineering costs at scale, over time 47
  • 48. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. 48
  • 49. Workflow Abstraction – pattern language Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R data is represented as flows of tuples operations in the flows bring functional programming aspects into Java A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199 49
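To make the "plumbing" metaphor on slide 49 concrete, here is a minimal sketch of the same flow (tokenize, scrub, drop stop words, group, count) written with plain Java streams rather than the Cascading API. The class name `PipelineSketch`, the method `wordCount`, and the sample text are hypothetical, and a set lookup stands in for the HashJoin against the stop word list, so treat it as an illustration of the functional style, not as how Cascading plans the job.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** In-memory imitation of the tokenize / scrub / stop-word / count flow (not Cascading code). */
final class PipelineSketch {

    // Same delimiter set as the regex used by the WordCount app on slide 54.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    static Map<String, Long> wordCount(List<String> lines, Set<String> stopWords) {
        return lines.stream()
            // Tokenize: split every line into a stream of raw tokens.
            .flatMap(DELIMITERS::splitAsStream)
            // Scrub: trim and lower-case each token.
            .map(token -> token.trim().toLowerCase(Locale.ROOT))
            // Drop empty tokens and anything found in the stop word list
            // (a set lookup here, standing in for the join in the diagram).
            .filter(token -> !token.isEmpty() && !stopWords.contains(token))
            // GroupBy token, then Count; TreeMap keeps the output sorted by token.
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "A rain shadow is a dry area (on the lee side).",
            "The rain falls.");
        Set<String> stopWords = Set.of("a", "the", "is", "on");
        // prints {area=1, dry=1, falls=1, lee=1, rain=2, shadow=1, side=1}
        System.out.println(wordCount(lines, stopWords));
    }
}
```

Each stage is a pure function over a stream of values, which is the property that lets a planner rearrange the stages into parallel map and reduce steps. Note that adjacent delimiters such as " (" produce empty tokens, which is one reason the scrubbing step exists.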
  • 50. Workflow Abstraction – literate programming Cascading workflows generate their own visual documentation: flow diagrams. In formal terms, flow diagrams leverage a methodology called literate programming. They provide intuitive, visual representations for apps, which is great for cross-team collaboration. Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Literate Programming Don Knuth literateprogramming.com
  • 51. Workflow Abstraction – business process Following the essence of literate programming, Cascading workflows provide statements of business process. This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data). Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.). This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale.
  • 52. The Ubiquitous Word Count Definition: count how often each word appears in a collection of text documents. This simple program provides an excellent test case for parallel processing: • requires a minimal amount of code • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction • is not many steps away from useful search indexing • serves as a “Hello World” for Hadoop apps A distributed computing framework that runs Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");

      void reduce (String word, Iterator group):
        int count = 0;
        for each pc in group:
          count += Int(pc);
        emit(word, String(count));
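The map and reduce pseudocode on slide 52 can be tried out in a single process. The sketch below is a hedged, in-memory imitation: the class `WordCountSketch`, the `run` driver, and the whitespace-based stand-in for `segment(text)` are all assumptions for illustration, and the grouping step in `run` merely mimics the shuffle that a framework such as Hadoop would perform across machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process imitation of the map, shuffle, and reduce phases of Word Count. */
final class WordCountSketch {

    /** Map phase: emit (word, "1") for every whitespace-separated word in the text. */
    static List<Map.Entry<String, String>> map(String docId, String text) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(Map.entry(word, "1"));
            }
        }
        return emitted;
    }

    /** Reduce phase: sum the partial counts collected for one word. */
    static Map.Entry<String, String> reduce(String word, List<String> group) {
        int count = 0;
        for (String partialCount : group) {
            count += Integer.parseInt(partialCount);
        }
        return Map.entry(word, String.valueOf(count));
    }

    /** Driver: map every document, group values by key (the "shuffle"), then reduce each group. */
    static Map<String, String> run(Map<String, String> documents) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, String> pair : map(doc.getKey(), doc.getValue())) {
                shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, String> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : shuffled.entrySet()) {
            Map.Entry<String, String> result = reduce(entry.getKey(), entry.getValue());
            counts.put(result.getKey(), result.getValue());
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "doc01", "rain shadow rain",
            "doc02", "dry shadow");
        System.out.println(run(docs)); // prints {dry=1, rain=2, shadow=2}
    }
}
```

Because `map` looks at one document at a time and `reduce` looks at one word at a time, both phases can be spread across many machines; only the grouping step in between needs to move data.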
  • 53. Document Collection Word Count Tokenize GroupBy token Count R M 1 map 1 reduce 18 lines code gist.github.com/3900702 WordCount – conceptual flow diagram cascading.org/category/impatient 53
  • 54. WordCount – Cascading app in Java Document Collection Word Count Tokenize GroupBy token Count R M
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];

      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps (tab-delimited text with a header line)
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
  • 55. WordCount – generated flow diagram (mapreduce) [head] → Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] → Each('token')[RegexSplitGenerator[decl:'token'][args:1]] → GroupBy('wc')[by:['token']] → Every('wc')[Count[decl:'count']] → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] → [tail] Edge labels: [{2}:'doc_id', 'text'], [{1}:'token'], wc[{1}:'token'], [{2}:'token', 'count'] Document Collection Word Count Tokenize GroupBy token Count R M
  • 56. A Thought Exercise Consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network. They will be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment… Operations Research – crunching amazing amounts of data. $50B company, in a $250B market segment. Upcoming: tractors as drones – guided by complex, distributed data apps
  • 58. Two Avenues to the App Layer… scale ➞ complexity ➞ Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
  • 59. Learnings generalized from trends in Data Science: 1. the practice of leading data science teams 2. strategies for leveraging data at scale 3. machine learning and optimization 4. large-scale data workflows 5. the evolution of cluster computing DSSG, 2013-08-12 59
  • 60. Q3 1997: inflection point four independent teams were working toward horizontal scale-out of workflows based on commodity hardware this effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG MapReduce and the Apache Hadoop open source stack emerged from this period 60
  • 61. Circa 1996: pre-inflection point RDBMS Stakeholder SQL Query result sets Excel pivot tables PowerPoint slide decks Web App Customers transactions Product strategy Engineering requirements BI Analysts optimized code
  • 62. Circa 1996: pre-inflection point (same diagram as slide 61) “throw it over the wall”
  • 63. Circa 2001: post-big ecommerce successes RDBMS SQL Query result sets recommenders + classifiers Web Apps customer transactions Algorithmic Modeling Logs event history aggregation dashboards Product Engineering UX Stakeholder Customers DW ETL Middleware servlets models
  • 64. Circa 2001: post-big ecommerce successes (same diagram as slide 63) “data products”
  • 65. Circa 2013: clusters everywhere Workflow RDBMS near time batch services transactions, content social interactions Web Apps, Mobile, etc. History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacity taps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC
  • 66. Circa 2013: clusters everywhere (same diagram as slide 65) “optimize topologies”
  • 67. Primary Sources Amazon “Early Amazon: Splitting the website” – Greg Linden glinden.blogspot.com/2006/02/early-amazon-splitting-website.html eBay “The eBay Architecture” – Randy Shoup, Dan Pritchett addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf Inktomi (YHOO Search) “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff) youtu.be/E91oEn1bnXM Google “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff) youtu.be/qsan-GQaeyk perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx MIT Media Lab “Social Information Filtering for Music Recommendation” – Pattie Maes pubs.media.mit.edu/pubs/papers/32paper.ps ted.com/speakers/pattie_maes.html
  • 68. Cluster Computing’s Dirty Little Secret people like me make a good living by leveraging high ROI apps based on clusters, and so the execs agree to build out more data centers… clusters for Hadoop/HBase, for Storm, for MySQL, for Memcached, for Cassandra, for Nginx, etc. This becomes expensive! A single class of workloads on a given cluster is simpler to manage, but terrible for utilization… various notions of “cloud” help. Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS”: All your workloads are belong to us. Regardless of how architectures change, death and taxes will endure: servers fail, and data must move. Google Data Center, Fox News ~2002
  • 69. Three Laws, or more? meanwhile, architectures evolve toward much, much larger data… pistoncloud.com/ ... Rich Freitas, IBM Research Q: what kinds of disruption in topologies could this imply? because there’s no such thing as RAM anymore… 69
  • 70. Topologies Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware, because the data won’t fit on one computer anymore. A variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem. C: strong consistency; A: high availability; P: partition tolerance; eventual consistency “You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO) cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf julianbrowne.com/article/viewer/brewers-cap-theorem
  • 71. Some Topologies Other Than Hadoop… Spark (iterative/interactive) Titan (graph database) Redis (data structure server) Zookeeper (distributed metadata) HBase (columnar data objects) Riak (durable key-value store) Storm (real-time streams) ElasticSearch (search index) MongoDB (document store) ParAccel (MPP) SciDB (array database) 71
  • 72. “Return of the Borg” consider that Google is generations ahead of Hadoop, etc., with much improved ROI on its data centers… Borg serves as a kind of “secret sauce” for data center OS, with Omega as its next evolution: 2011 GAFS Omega John Wilkes, et al. youtu.be/0ZFMlO98Jkc 72
  • 73. “Return of the Borg” Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon Cade Metz wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines Luiz André Barroso, Urs Hölzle research.google.com/pubs/pub35290.html
  • 74. Mesos – definitions a common substrate for cluster computing: heterogeneous assets in your data center or cloud made available as a homogeneous set of resources • top-level Apache project • scalability to 10,000s of nodes • obviates the need for virtual machines • isolation between tasks with Linux Containers (pluggable) • fault-tolerant replicated master using ZooKeeper • multi-resource scheduling (memory and CPU aware) • APIs in C++, Java, Python • web UI for inspecting cluster state • available for Linux, Mac OSX, OpenSolaris
  • 75. Mesos – simplifies app development CHRONOS SPARK HADOOP DPARK MPI JVM (JAVA, SCALA, CLOJURE, JRUBY) MESOS PYTHON C++ 75
  • 76. Mesos – data center OS stack HADOOP STORM CHRONOS RAILS JBOSS TELEMETRY Kernel OS Apps MESOS CAPACITY PLANNING GUI SECURITY SMARTER SCHEDULING
  • 77. Mesos – architecture Mesos Kernel Chronos Marathon Apps Web Apps Streaming Batch Frameworks Hadoop Spark Storm Rails JBoss Kafka MPI Hive Scalding JVM Python C++ Workloads
  • 78. Prior Practice: Dedicated Servers DATACENTER • low utilization rates • longer time to ramp up new services 78
  • 79. Prior Practice: Virtualization DATACENTER PROVISIONED VMS • even more machines to manage • substantial performance decrease due to virtualization • VM licensing costs 79
  • 80. Prior Practice: Static Partitioning DATACENTER STATIC PARTITIONING • even more machines to manage • substantial performance decrease due to virtualization • VM licensing costs • static partitioning limits elasticity 80
  • 81. MESOS Mesos: One Large Pool Of Resources DATACENTER “We wanted people to be able to program for the data center just like they program for their laptop." Ben Hindman 81
  • 82. What are the costs of Virtualization?
      benchmark type: OpenVZ improvement
      mixed workloads: 210%-300%
      LAMP (related): 38%-200%
      I/O throughput: 200%-500%
      response time: order of magnitude
      (more pronounced at higher loads)
  • 83. What are the costs of Single Tenancy? 0% 25% 50% 75% 100% RAILS CPU LOAD MEMCACHED CPU LOAD 0% 25% 50% 75% 100% HADOOP CPU LOAD 0% 25% 50% 75% 100% t t 0% 25% 50% 75% 100% Rails Memcached Hadoop COMBINED CPU LOAD (RAILS, MEMCACHED, HADOOP) 83
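The point of the CPU load charts on slide 83 can be checked with simple arithmetic: dedicated clusters must each be sized for their own peak, while a shared pool only needs to cover the peak of the combined load. The sketch below uses made-up load figures (CPU cores per time slot) for the three workloads named on the slide; the class `TenancySketch` and all numbers are hypothetical and are not taken from the talk.

```java
/** Compares the capacity needed by single-tenant clusters with that of one shared pool. */
final class TenancySketch {

    /** Highest value in one load series. */
    static int peak(int[] load) {
        int max = 0;
        for (int value : load) {
            max = Math.max(max, value);
        }
        return max;
    }

    /** Single tenancy: every workload gets its own cluster, sized for that workload's peak. */
    static int dedicatedCapacity(int[][] loads) {
        int total = 0;
        for (int[] load : loads) {
            total += peak(load);
        }
        return total;
    }

    /** Shared pool: one cluster sized for the peak of the summed load (series assumed equal length). */
    static int sharedCapacity(int[][] loads) {
        int slots = loads[0].length;
        int[] combined = new int[slots];
        for (int[] load : loads) {
            for (int t = 0; t < slots; t++) {
                combined[t] += load[t];
            }
        }
        return peak(combined);
    }

    public static void main(String[] args) {
        // Hypothetical CPU demand (cores) over four time slots.
        int[] rails = {80, 60, 20, 10};      // busy during the day
        int[] memcached = {40, 30, 10, 10};  // follows web traffic
        int[] hadoop = {10, 20, 90, 80};     // batch jobs run at night
        int[][] loads = {rails, memcached, hadoop};
        // prints dedicated=210 shared=130
        System.out.println("dedicated=" + dedicatedCapacity(loads)
            + " shared=" + sharedCapacity(loads));
    }
}
```

With these illustrative numbers the shared pool needs 130 cores instead of 210, because the web and batch peaks fall in different time slots. The size of the saving depends entirely on how well the workloads' peaks interleave.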
  • 84. Compelling arguments for Data Center OS • obviates the need for VMs (licensing, adios VMware) • provides OS-level building blocks for developing new distributed frameworks (learning curve, adios Hadoop) • removes significant VM overhead (performance) • requires less h/w to buy (CapEx), power and fix (OpEx) • implies fewer VMs, thus less Ops overhead (staff) • removes the complexity of Chef/Puppet (staff) • allows higher utilization rates (ROI) • reduces latency for data updates (OLTP + OLAP on same server) • reshapes cluster resources dynamically (100’s ms vs. minutes) • runs dev/test clusters on same h/w as production (flexibility) • evaluates multiple versions without more h/w (vendor lock-in)
  • 85. Opposite Ends of the Spectrum, One Substrate Built-in / bare metal Hypervisors Solaris Zones Linux CGroups 85
  • 86. Opposite Ends of the Spectrum, One Substrate Request / Response Batch 86
  • 87. Case Study: Twitter (bare metal / on premise) “Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.” Chris Fry, SVP Engineering blog.twitter.com/2013/mesos-graduates-from-apache-incubation • key services run in production: analytics, typeahead, ads • Twitter engineers rely on Mesos to build all new services • instead of thinking about static machines, engineers think about resources like CPU, memory and disk • allows services to scale and leverage a shared pool of servers across data centers efficiently • reduces the time between prototyping and launching
  • 88. Case Study: Airbnb (fungible cloud infrastructure) “We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos.” Mike Curtis, VP Engineering gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven... • improves resource management and efficiency • helps advance engineering strategy of building small teams that can move fast • key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop • allowed company to migrate off Elastic MapReduce • enables use of Hadoop along with Chronos, Spark, Storm, etc.
  • 89. Resources Apache Project mesos.apache.org Mesosphere mesosphe.re Getting Started mesosphe.re/tutorials Documentation mesos.apache.org/documentation Research Paper usenix.org/legacy/event/nsdi11/tech/full_papers/Hindman_new.pdf Collected Notes/Archives goo.gl/jPtTP 89
  • 90. Learnings generalized from trends in Data Science: 1. the practice of leading data science teams 2. strategies for leveraging data at scale 3. machine learning and optimization 4. large-scale data workflows 5. the evolution of cluster computing SUMMARY… DSSG, 2013-08-12 90
  • 91. Circa 2013: clusters everywhere – Four-Part Harmony (same diagram as slide 65)
  • 92. Circa 2013: clusters everywhere – Four-Part Harmony (same diagram as slide 65) 1. End Use Cases, the drivers
  • 93. Circa 2013: clusters everywhere – Four-Part Harmony (same diagram as slide 65) 2. A new kind of team process
  • 94. Circa 2013: clusters everywhere – Four-Part Harmony (same diagram as slide 65) 3. Abstraction layer as optimizing middleware, e.g., Cascading
  • 95. Circa 2013: clusters everywhere – Four-Part Harmony (same diagram as slide 65) 4. Data Center OS, e.g., Mesos
  • 96. Enterprise Data Workflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/0636920028536.do monthly newsletter for updates, events, conference summaries, etc.: liber118.com/pxn/