This document discusses the evolution of computing architectures and data processing techniques over time. As data grew larger than what could fit on a single computer, distributed systems and topologies like Hadoop emerged. This led to a shift from traditional data modeling to algorithmic modeling using machine learning. The rise of big data, IoT, and complex analytics is now disrupting businesses by enabling new, automated data products and feedback loops. This presents opportunities for companies in various industries to optimize operations using data science.
5. First Principles
we are taught to think of computing resources
in terms of the von Neumann architecture –
in other words, we characterize computing
resources by CPU, RAM, and I/O
12. First Principles
back in the day, all the tables required for a
given database could fit onto one computer,
with one memory space, and one file space
• okay, maybe the CPU was multi-core…
• okay, maybe RAM paged out to virtual memory…
• okay, maybe the disks were in a RAID config…
or there were extra caches, or separate busses, etc.
but essentially those were incremental extensions
to a von Neumann architecture…
a machine created in his image, if you will
NB: credit should go to Eckert and Mauchly, inventors of the ENIAC
14. First Principles
a generation of computer scientists has been
taught to think “relational” – data on a DB server
RDBMS made sense, with their indexes, b-trees,
normal forms, etc.
Q: need to query bigger data?
A: simple, buy or lease a bigger DB server
however, that all changed…
some of the issues encountered in large-scale
data teams are, to put it politely, obscure
starting from first principles, let’s explore a
map of some important points to consider
17. Topologies
largely due to the rapid rise of machine data, circa late 1990s,
we use distributed systems
because the data won’t fit on one computer anymore
AMZN, EBAY, YHOO, GOOG leveraged horizontal scale-out,
based on commodity hardware
practices at LinkedIn, Apple, Facebook, Twitter, etc., followed
from those early successes
algorithmic modeling, applied to the aggregation of machine
data, allowed for Big Data to become monetized
a feedback loop evolved – refining aggregate social interactions
into data products, which in turn made web apps become
more intelligent
18. RDBMS
Circa 2001: post-big-e-commerce successes – “data products”
[diagram: Web Apps (Middleware, servlets, models) serve Customers and record customer transactions in an RDBMS plus event history in Logs; SQL queries and result sets feed DW/ETL, aggregation, and dashboards for Product, Engineering, UX, and Stakeholder audiences; Algorithmic Modeling over the logs yields recommenders + classifiers, which feed back into the Web Apps]
20. Topologies
Hadoop and other topologies arose from a need for fault-
tolerant workloads, leveraging horizontal scale-out based
on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged,
which can be categorized in terms of topologies and
the CAP Theorem
21. Hadoop, as a topology
(sources: Apache, Wikipedia)
components which implement MapReduce:
• name node / data node
• job tracker / task tracker
• submit queue
• task slots
• distributed cache
• HDFS
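the division of labor above can be seen in miniature – a toy, single-process Java sketch (class and method names hypothetical, not the Hadoop API) of the map → shuffle → reduce flow those components coordinate across a cluster:

```java
import java.util.*;

// Toy, single-process sketch of MapReduce word count.
// A real Hadoop job would split input across data nodes and
// task slots; here the three phases run sequentially in memory.
public class ToyMapReduce {

    // map phase: emit (word, 1) for each token
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+"))
            if (!token.isEmpty())
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
        return out;
    }

    // shuffle phase: group emitted values by key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // reduce phase: sum the counts for each key
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            counts.put(g.getKey(), g.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(shuffle(map("to be or not to be")));
        System.out.println(counts);  // {be=2, not=1, or=1, to=2}
    }
}
```

in a real deployment the name node/data nodes spread the input blocks and the job tracker assigns map and reduce work to task slots; only the three-phase shape survives in this sketch.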
22. Some Other Topologies…
Spark (iterative/interactive)
Titan (graph database)
Redis (in-memory data grid)
Zookeeper (distributed metadata)
HBase (columnar data objects)
Cassandra (key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
Greenplum (MPP)
SciDB (array database)
23. CAP Theorem
“You can have at most two of these properties for any shared-data
system… the choice of which feature to discard determines the
nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO)
[diagram: CAP triangle – C: strong consistency, A: high availability, P: partition tolerance – with eventual consistency on the A/P side]
cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
julianbrowne.com/article/viewer/brewers-cap-theorem
24. Access → Frameworks → CAP Theorem Forfeits
financial transactions              general ledger in RDBMS     C A x
ad-hoc queries                      RDS (hosted MySQL)          C A x
reporting, dashboards               like Pentaho                C A x
log rotation/persistence            like Riak, Cassandra        x x P
search indexes                      like ElasticSearch, Solr    x A P
static content, archives            S3 (durable storage)        x A P
key/value data objects              like HBase                  C x P
data prep, ETL, modeling at scale   like Hadoop/Cascading       C x P
graph queries                       like Titan                  C x P
26. Workflow
Circa 2013: clusters everywhere – “optimize topologies”
[diagram: Use Cases Across Topologies – Web Apps, Mobile, etc. serve Data Products to Customers; transactions and content land in an RDBMS, while social interactions and log events feed an In-Memory Data Grid (near time) and Hadoop, etc. (batch); a Cluster Scheduler and Planner allocate optimized capacity via taps; Data Scientist (discovery + modeling), App Dev (s/w dev), Ops (dashboard metrics), and Domain Expert roles bridge the introduced capability with the existing SDLC, DW, and business process for Prod/Eng]
28. Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
In their own words…
31. Modeling
back in the day, we worked with practices based on
data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst,
ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced data modeling
because the data won’t fit on one computer anymore
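the old recipe is small enough to sketch end-to-end – a hypothetical Java example (all names invented) that samples, fits a normal distribution, ignores the rest, and infers from the fit:

```java
import java.util.Random;

// Sketch of the classic data-modeling recipe:
// 1. sample the data, 2. fit the sample to a known distribution
// (here a normal), 3. ignore the rest, 4. infer from the fit.
public class DataModeling {

    // fit step: estimate mean and standard deviation from the sample
    static double[] fitNormal(double[] sample) {
        double mean = 0.0;
        for (double x : sample) mean += x;
        mean /= sample.length;
        double var = 0.0;
        for (double x : sample) var += (x - mean) * (x - mean);
        var /= (sample.length - 1);                  // unbiased sample variance
        return new double[]{ mean, Math.sqrt(var) };
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // pretend this is the full, annoyingly large data set...
        double[] population = new double[1_000_000];
        for (int i = 0; i < population.length; i++)
            population[i] = 100.0 + 15.0 * rng.nextGaussian();

        // 1. sample a tiny fraction; 3. throw the rest away
        double[] sample = new double[1000];
        for (int i = 0; i < sample.length; i++)
            sample[i] = population[rng.nextInt(population.length)];

        // 2. fit; 4. infer from the fitted distribution alone
        double[] fit = fitNormal(sample);
        System.out.printf("fitted mean=%.1f sd=%.1f%n", fit[0], fit[1]);
    }
}
```

one analyst, one model, one machine: everything downstream rests on the fitted mean and standard deviation, and the other 999,000 rows never get looked at again.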
32. Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
in other words, seeing the forest for the trees…
this paper chronicled a sea change from data modeling practices
(silos, manual process) to the rising use of algorithmic modeling
(machine data for automation/optimization)
33. Algorithmic Modeling
“The trick to being a scientist is to be open to using
a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc.,
yield dramatic increases in predictive power over earlier
modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of
ensembles, model chaining, etc.
the problems at hand have become simply too big and too
complex for ONE distribution, ONE model, ONE team…
stanford.edu/~lmackey/papers/netflix_story-nas11-slides.pdf
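bootstrap aggregation itself is compact enough to sketch – a hypothetical Java example (names invented) that trains decision stumps on bootstrap resamples and takes a majority vote, the core move behind Random Forest:

```java
import java.util.Random;

// Sketch of bootstrap aggregation ("bagging"): train many weak
// models on bootstrap resamples of the data, then vote.
// The weak model here is a 1-D decision stump.
public class Bagging {

    // learn a threshold: midpoint between the two class means
    static double trainStump(double[] xs, int[] ys) {
        double sum0 = 0, sum1 = 0; int n0 = 0, n1 = 0;
        for (int i = 0; i < xs.length; i++)
            if (ys[i] == 0) { sum0 += xs[i]; n0++; } else { sum1 += xs[i]; n1++; }
        if (n0 == 0 || n1 == 0)                 // degenerate resample:
            return (sum0 + sum1) / xs.length;   // fall back to overall mean
        return (sum0 / n0 + sum1 / n1) / 2.0;
    }

    // bagged prediction: majority vote over stumps trained on resamples
    static int predict(double[] xs, int[] ys, double query, int rounds, long seed) {
        Random rng = new Random(seed);
        int votes = 0;
        for (int r = 0; r < rounds; r++) {
            double[] bx = new double[xs.length];
            int[] by = new int[ys.length];
            for (int i = 0; i < xs.length; i++) {   // bootstrap resample,
                int j = rng.nextInt(xs.length);     // drawn with replacement
                bx[i] = xs[j]; by[i] = ys[j];
            }
            if (query > trainStump(bx, by)) votes++;
        }
        return votes * 2 > rounds ? 1 : 0;          // majority vote
    }

    public static void main(String[] args) {
        double[] xs = { 1.0, 1.2, 0.8, 4.0, 4.3, 3.9 };  // class 0 near 1, class 1 near 4
        int[] ys =    { 0,   0,   0,   1,   1,   1   };
        System.out.println(predict(xs, ys, 3.5, 101, 42L));  // prints 1
    }
}
```

the ensemble’s vote is steadier than any single resampled stump – the same intuition, at toy scale, behind the model chaining and blending that won the Netflix Prize.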
36. Attention
impromptu survey:
• how many people say they practice some kind of “Agile” process at work?
• how many people say that they DON’T practice “Agile” ?
• how many people say they are in a lean startup?
Q:
with respect to Big Data practices,
how is that working out?
Abby Fichtner vimeo.com/27797408
37. Agile Data?
some people see a reconciliation of Agile process and Big Data…
Agile Data
Russell Jurney, 2013
amazon.com/dp/1449326269
“Run like a studio, not an assembly line.”
38. Perhaps Not
great values, wrong domain…
that worked when we were building features in web apps
Agile represents industrialization of software engineering,
codifying social interactions, compartmentalizing attention
meanwhile, Data Science is inherently multi-disciplinary:
• teams of people with complementary skill sets
• actionable insights require weeks/months, not hours
• variance and statistical thinking are foreign to CS
LinkedIn-style problems circa 2011 required certain skills…
manipulating the Newtonian physics of data… that money
may be mostly off the table by now
Big Data opportunities ahead require different math?
41. Business Disruption
Geoffrey Moore
Mohr Davidow Ventures, author Crossing The Chasm / Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the entire Global 1000
on notice over the next decade… data as the major force… mostly
through apps – verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc. / XLDB, 2012:
complex analytics workloads are now displacing SQL as the basis
for Enterprise apps
Larry Page
CEO, Google / Wired, 2013:
create products and services that are 10 times better than the
competition… thousand-percent improvement requires rethinking
problems entirely, exploring the edges of what’s technically possible,
and having a lot more fun in the process
42. Business Drivers
algorithmic modeling + machine data
+ curation, metadata + Open Data
data products, as feedback into automation
evolution of feedback loops
less about “bigness”, more about complexity
internet of things + A/D conversion
+ complex analytics
accelerated evolution, additional feedback loops
orders of magnitude higher data rates
Internet of Things accelerates this process of disruption
“A kind of Cambrian explosion” (source: National Geographic)
44. A Thought Exercise
consider that when a company like Caterpillar moves
into data science, they won’t be building the world’s
next search engine or social network
they will most likely be optimizing supply chain,
optimizing fuel costs, automating data feedback
loops integrated into their equipment…
that’s a $50B company,
in a market segment worth $250B
upcoming: tractors as drones –
guided by complex, distributed data apps
Operations Research –
crunching amazing amounts of data
48. Algorithms
many algorithm libraries used today are based on implementations
back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, UMD
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead in sophisticated
algorithms work – as Breiman suggested in 2001 – which may take
a while to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
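as one concrete example of those probabilistic data structures, a minimal Bloom filter sketch in Java (a hypothetical, illustrative implementation, not any particular library):

```java
import java.util.BitSet;

// Sketch of a probabilistic data structure: a Bloom filter answers
// set membership in O(k) time and a few bits per item, at the cost
// of a tunable false-positive rate - handy for screening massive
// streams before touching disk.
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // derive k positions from two base hashes (Kirsch-Mitzenmacher trick)
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = h1 >>> 16 | h1 << 16;   // a cheap second hash
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(position(item, i));
    }

    // false negatives never happen; false positives can
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++)
            if (!bits.get(position(item, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(1 << 16, 4);
        bf.add("hadoop");
        bf.add("cascading");
        System.out.println(bf.mightContain("hadoop"));   // true
        System.out.println(bf.mightContain("mesos"));    // almost surely false
    }
}
```

false negatives never occur, and the false-positive rate is tuned by the bits-per-item and hash count – which is why such sketches work as cheap pre-filters in streaming workloads.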
50. How much does it cost you to earn $1B?
also, take a moment to check this out…
(IMHO most interesting algorithm work recently)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
Tristan Jehan
distributed algorithms for high ROI
use cases on cost-effective clustered
resources…
we’re learning how to do it right
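a simpler (and less numerically stable) cousin of that tall-and-skinny QR shows why it parallelizes: CholeskyQR forms the small k×k Gram matrix AᵀA – an aggregate that can be accumulated row-block by row-block across a cluster – and factors it. A hypothetical single-machine Java sketch:

```java
// Sketch of CholeskyQR for a tall-and-skinny matrix A (n >> k):
// form the small k-by-k Gram matrix G = A^T A (an aggregate that
// parallelizes over row blocks), then take R = chol(G), so that
// A = QR with Q = A R^-1. Simpler, but less numerically stable,
// than the TSQR approach cited above.
public class CholeskyQR {

    // G = A^T A, accumulated one row at a time (the parallelizable part)
    static double[][] gram(double[][] a) {
        int k = a[0].length;
        double[][] g = new double[k][k];
        for (double[] row : a)
            for (int i = 0; i < k; i++)
                for (int j = 0; j < k; j++)
                    g[i][j] += row[i] * row[j];
        return g;
    }

    // upper-triangular R with R^T R = G (Cholesky factorization)
    static double[][] cholesky(double[][] g) {
        int k = g.length;
        double[][] r = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = i; j < k; j++) {
                double s = g[i][j];
                for (int p = 0; p < i; p++) s -= r[p][i] * r[p][j];
                r[i][j] = (i == j) ? Math.sqrt(s) : s / r[i][i];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        double[][] a = { {1, 0}, {1, 1}, {1, 2}, {1, 3} };  // 4 x 2, tall and skinny
        double[][] r = cholesky(gram(a));
        System.out.printf("R = [[%.3f, %.3f], [0, %.3f]]%n", r[0][0], r[0][1], r[1][1]);
    }
}
```

the Gram-matrix step is exactly the kind of partial aggregate a MapReduce or Scalding job computes per block and sums, which is why tall-and-skinny factorizations fit Hadoop clusters so naturally.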
53. Personality
we have perhaps built computers (once named “electronic
brains”) in the image of John von Neumann, et al.: standalone
genius, aristotelian uber-geek, incredible capacity for memory
and logic, overbearing, not particularly cooperative…
one can almost imagine a war-time dialogue,“Get one of these
guys in the room, they’ll solve anything!” … as a result, decades
of mutually assured destruction for global strategy
Q:
have we created software engineering practices which selected for
this kind of personality? selecting for “lone wolf” guys, socially
awkward, ONE person who can understand an entire code base,
able to out-logic and out-argue the rest of the room… charming
fellow, really
have we enabled software process to box these personalities
into something resembling teams? along with overtly described
rules for social conventions… silos, in other words
54. Chasing Unicorns
silos… but didn’t that all change?
because the data won’t fit on one computer anymore
leverage with data science teams is where organizations
tear down internal silos, socializing hard problems
data won’t fit on one computer anymore, problems won’t
fit in one department anymore, the code base won’t fit
into one uber-geek’s memory recall anymore…
so we embrace distributed systems for solutions
Q:
“Why aren’t there more women in engineering?”
IMHO, we’re trying to select for a personality which
doesn’t exist, and would not resolve current challenges;
meanwhile, my data science teams run about 50/50
57. Clusters
a little secret: people like me make a good living by
leveraging high ROI apps based on clusters, and so
the execs agree to build out more data centers…
clusters for Hadoop/Hive/HBase, clusters for Memcached,
for Cassandra, for MySQL, for Storm, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage; but terrible for utilization
leveraging VMs and various notions of “cloud” helps
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS”… All your workloads are belong to us
regardless of how architectures change, death and taxes
will endure: servers fail, and data must move
(image: Google Data Center, Fox News, ~2002)
58. Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead,
with much improved ROI on data centers
John Wilkes, et al.
Borg/Omega: “10x” secret sauce
youtu.be/0ZFMlO98Jkc
[charts: CPU load over time, 0–100%, plotted separately for Rails, Memcached, and Hadoop, and as a combined CPU load (Rails, Memcached, Hadoop)]
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg
incubator.apache.org/mesos
61. Trendlines
Big Data? we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Jawbone, Nike, etc.,
plus the secondary/tertiary effects of Google Glass
7+ billion people, instrumented better than … how we
have Nagios instrumenting our web servers right now
plus the business implications given that much of the
Global 1000 is positioned to be disrupted technologyreview.com/...
62. Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q:
what kinds of evolution in topologies could
this imply?
65. Languages
JVM-based languages became popular for Big Data open source
technologies:
• partly because YHOO adopted Hadoop, etc.
• partly because Enterprise IT shops have J2EE expertise
• partly because of functional languages: Clojure, Scala
JVM has its drawbacks, especially for low-latency use cases
ample use of languages such as Python and Erlang in Big Data
practices, plus keep in mind that Google uses C++
Functional Thinking
Neal Ford
youtu.be/plSZIkLodDM
a hunch: issues about current programming languages are
secondary to culture
66. Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
67. references…
“Scalable and Flexible Machine Learning With Scala @ LinkedIn”
Vitaly Gordon [ especially see slide #9 ]
slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
Elements Of Functional Programming
Chris Reade
amazon.com/dp/0201129159
70. Organization
How Do Committees Invent?
Melvin Conway, 1968
melconway.com/research/committees.html
Manu Cornet bonkersworld.net
“Any organization that designs a system
(defined more broadly here than just
information systems) will inevitably
produce a design whose structure is a
copy of the organization’s communication
structure.”
Q:
• does this fit with software process?
• does this fit with distributed apps?
see also:
haacked.com/archive/2013/05/13/applying-conways-law.aspx
71. Cooperation
perhaps we have selected for the wrong
personality to idealize…
linkedin.com/today/post/article/20130520190305-110300724-why-nothing-not-even-software-can-eat-the-world
All long-term success depends on eliciting
the voluntary support of an ecosystem.
As the African proverb says, “If you want
to go fast, go alone; if you want to go far,
go with others.” – Geoffrey Moore
75. Architecture
Rich Hickey, Nathan Marz, Stuart Sierra, et al.:
functional programming to help reduce
costs over time
1. technical debt? this is how an organization
builds a culture to avoid it
2. Conway's Law corollary: model teams and
communication based on properties of the
desired architecture
3. also consider Mesos/Borg: schedule data
to be located where [CPU, RAM, I/O, surety]
will become available
Rich Hickey, infoq.com/presentations/Simple-Made-Easy
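point 3 above is small enough to sketch – a toy, Mesos-flavored Java example (entirely hypothetical, not the Mesos API) in which tasks sized by [CPU, RAM] are matched against per-node resource offers:

```java
import java.util.*;

// Toy sketch of Mesos/Borg-style two-level scheduling: nodes offer
// [cpu, ramGb] resources, and work is placed on the first offer
// that fits - "schedule work where resources will become available."
public class ToyScheduler {

    static Map<String, String> schedule(Map<String, int[]> offers,   // node -> [cpu, ramGb] free
                                        Map<String, int[]> tasks) {  // task -> [cpu, ramGb] needed
        Map<String, String> placement = new TreeMap<>();
        for (Map.Entry<String, int[]> task : tasks.entrySet()) {
            for (Map.Entry<String, int[]> offer : offers.entrySet()) {
                int[] free = offer.getValue(), need = task.getValue();
                if (free[0] >= need[0] && free[1] >= need[1]) {
                    free[0] -= need[0];                     // claim the resources,
                    free[1] -= need[1];                     // shrinking the offer
                    placement.put(task.getKey(), offer.getKey());
                    break;
                }
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, int[]> offers = new TreeMap<>();
        offers.put("node1", new int[]{ 4, 16 });
        offers.put("node2", new int[]{ 8, 32 });
        Map<String, int[]> tasks = new TreeMap<>();
        tasks.put("hadoop-task", new int[]{ 6, 24 });  // only fits node2
        tasks.put("rails-task",  new int[]{ 2,  4 });  // fits node1
        System.out.println(schedule(offers, tasks));   // {hadoop-task=node2, rails-task=node1}
    }
}
```

mixing workloads on shared offers is what evens out the cluster utilization shown on the previous slide; real schedulers also weigh I/O, locality, and surety, which this sketch omits.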
77. Pattern Language
structured method for solving large, complex design
problems, where the syntax of the language ensures
the use of best practices – i.e., conveying expertise
[flow diagram labels: employee, quarterly sales, leads; Join, Count; PMML classifier; bonus allocation; failure traps]
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
80. Culture
Notes from the Mystery Machine Bus
Steve Yegge, Google
goo.gl/SeRZa
consider these perspectives
in light of Conway’s Law…
“conservatism”                         “liberalism”
(mostly) Enterprise                    (mostly) Start-Up
risk management                        customer experiments
assurance                              flexibility
well-defined schema                    schema follows code
explicit configuration                 convention
type-checking compiler                 interpreted scripts
wants no surprises                     wants no impediments
Java, Scala, Clojure, etc.             PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc.    Hive, Pig, Hadoop Streaming, etc.
81. Two Avenues to the App Layer…
Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks which can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
d3js.org
What is needed?
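the first skill on that list can be made concrete with a small, hypothetical Java sketch (names invented) of programmable data preparation – trim, normalize, and drop malformed rows before any downstream framework ever sees them:

```java
import java.util.*;

// Sketch of "programmable tools that prepare data": scrub a raw
// log-style feed - trim fields, normalize case, drop malformed
// rows - making the clean-up step repeatable and automated.
public class DataPrep {

    // returns cleaned "user,action" records; malformed rows are dropped
    static List<String> clean(List<String> rawRows) {
        List<String> out = new ArrayList<>();
        for (String row : rawRows) {
            String[] fields = row.split(",");
            if (fields.length != 2) continue;              // drop malformed rows
            String user = fields[0].trim().toLowerCase();
            String action = fields[1].trim().toLowerCase();
            if (user.isEmpty() || action.isEmpty()) continue;  // drop empty fields
            out.add(user + "," + action);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "  Alice , CLICK ",
            "bob,purchase",
            "corrupted-row",      // no delimiter: dropped
            " ,view"              // empty user: dropped
        );
        System.out.println(clean(raw));  // [alice,click, bob,purchase]
    }
}
```

scripted clean-up like this is what makes the analysis repeatable; the same scrub runs unchanged whether the feed is a test fixture or a terabyte of logs.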
85. Learning Curves
difficulties in the commercial use of distributed systems
often get represented as issues of managing complexity
much of the risk in managing a data science team is about
budgeting for learning curve: some orgs practice a kind of
engineering “conservatism”, with highly structured process
and strictly codified practices – people learn a few things
well, then avoid having to struggle with learning many new
things perpetually…
that approach leads to enormous teams and low ROI
ultimately, the challenge is about
managing learning curves within
a social context
87. Management
ultimately, the challenge is about managing
learning curves within a social context
[chart axes: est. cost of individual learning, initial impl; est. cost of team re-learning, lifecycle]
some technologies constrain the
need to learn, others accelerate
re-learning prior business logic…
choose the latter, FTW!
IMHO, the “agile” part was intended to be
about shared learnings; while the “lean” part
was about how much you have on your plate
at any one time
88. blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
Throw Your Life a Curve
Whitney Johnson
Aggressively Pro-Active Learning
• deconstruction of the cognitive bias One Size Fits All
• “makes a compelling case for personal disruption”
• “plan your career around learning curves”
• hire people who learn/re-learn efficiently
89. Summary
to be competitive globally with Big Data
requires learning many technologies –
then learning the nuances of a code base for
which the team is responsible, learning the
ever-changing surprises and insights which
are hidden deep within mountains of data,
plus the ever-evolving mathematics needed
to grapple with these conditions effectively
because the data won’t fit on one computer anymore
[map: First Principles, Topologies, Languages, Modeling, Attention, Clusters, Algorithms, Trendlines, Organization, Architecture, Culture, Business, Personality, Learning Curves – you are here]
91. Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
[diagram: data sources → ETL → data prep → predictive model → end uses]
• ANSI SQL for ETL and SAS for predictive models – most of the licensing costs…
• J2EE for business logic – most of the project costs…
97. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all… cascading.org
[diagram: source taps for Cassandra, JDBC, Splunk, etc. → ETL (Lingual: DW → ANSI SQL) → data prep → predictive model (Pattern: SAS, R, etc. → PMML) → end uses via sink taps for Memcached, HBase, MongoDB, etc., with business logic in Java, Clojure, Scala, etc.]
the ETL component, expressed as ANSI SQL via Lingual:
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );
SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );
flowDef.addAssemblyPlanner( sqlPlanner );
the predictive model component, scoring a PMML export via Pattern:
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();
flowDef.addAssemblyPlanner( pmmlPlanner );
visual collaboration for the business logic is a great
way to improve how teams work together:
Literate Programming, Don Knuth
www-cs-faculty.stanford.edu/~uno/lp.html
[flow diagram labels: employee, quarterly sales, leads; Join, Count; PMML classifier; bonus allocation; failure traps]
multiple departments, working in their respective
frameworks, integrate results into a combined app,
which runs at scale on a cluster… business process
combined in a common space (DAG) for flow
planners, compiler, optimization, troubleshooting,
exception handling, notifications, security audit,
performance monitoring, etc.
102. Workflow
Circa 2013: clusters everywhere – Four-Part Harmony
[workflow diagram as on slide 26: use cases across topologies]
1. End Use Cases, the drivers
2. A new kind of team process
3. Abstraction layer as optimizing middleware, e.g., Cascading
4. Distributed OS, e.g., Mesos