This document discusses the evolution of computing architectures and data processing techniques over time. As data grew larger than what could fit on a single computer, distributed systems and topologies like Hadoop emerged. This led to a shift from traditional data modeling to algorithmic modeling using machine learning. The rise of big data, IoT, and complex analytics is now disrupting businesses by enabling new, automated data products and feedback loops. This presents opportunities for companies in various industries to optimize operations using data science.
5. First Principles
we are taught to think of computing resources
in terms of the von Neumann architecture –
in other words, we characterize computing
resources by CPU, RAM, and I/O
12. First Principles
back in the day, all the tables required for a
given database could fit onto one computer,
with one memory space, and one file space
• okay, maybe the CPU was multi-core…
• okay, maybe RAM paged out to virtual memory…
• okay, maybe the disks were in a RAID config…
or there were extra caches, or separate busses, etc.
but essentially those were incremental extensions
to a von Neumann architecture…
a machine created in his image, if you will
NB: credit should go to Eckert and Mauchly, inventors of the ENIAC
14. First Principles
a generation of computer scientists has been
taught to think “relational” – data on a DB server
RDBMS made sense, with their indexes, b-trees,
normal forms, etc.
Q: need to query bigger data?
A: simple, buy or lease a bigger DB server
however, that all changed…
some of the issues encountered in large-scale
data teams are, to put it politely, obscure
starting from first principles, let’s explore a
map of some important points to consider
17. Topologies
largely due to the rapid rise of machine data, circa late 1990s,
we use distributed systems
because the data won’t fit on one computer anymore
AMZN, EBAY, YHOO, GOOG leveraged horizontal scale-out,
based on commodity hardware
practices at LinkedIn, Apple, Facebook, Twitter, etc., followed
from those early successes
algorithmic modeling, applied to the aggregation of machine
data, allowed for Big Data to become monetized
a feedback loop evolved – refining aggregate social interactions
into data products, which in turn made web apps become
more intelligent
18. RDBMS
Circa 2001: post-big-e-commerce successes – “data products”
[diagram: Web Apps (Middleware, servlets, models) serve Customers and record customer transactions in an RDBMS plus event history in Logs; SQL queries and result sets feed DW/ETL, aggregation, and dashboards for Product, Engineering, UX, and Stakeholder audiences; Algorithmic Modeling over the logs yields recommenders + classifiers, which feed back into the Web Apps]
20. Topologies
Hadoop and other topologies arose from a need for fault-
tolerant workloads, leveraging horizontal scale-out based
on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged,
which can be categorized in terms of topologies and
the CAP Theorem
21. Hadoop, as a topology
(sources: Apache, Wikipedia)
components which implement MapReduce:
• name node / data node
• job tracker / task tracker
• submit queue
• task slots
• distributed cache
• HDFS
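the division of labor above can be seen in miniature – a toy, single-process Java sketch (class and method names hypothetical, not the Hadoop API) of the map → shuffle → reduce flow those components coordinate across a cluster:

```java
import java.util.*;

// Toy, single-process sketch of MapReduce word count.
// A real Hadoop job would split input across data nodes and
// task slots; here the three phases run sequentially in memory.
public class ToyMapReduce {

    // map phase: emit (word, 1) for each token
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+"))
            if (!token.isEmpty())
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
        return out;
    }

    // shuffle phase: group emitted values by key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // reduce phase: sum the counts for each key
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            counts.put(g.getKey(), g.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(shuffle(map("to be or not to be")));
        System.out.println(counts);  // {be=2, not=1, or=1, to=2}
    }
}
```

in a real deployment the name node/data nodes spread the input blocks and the job tracker assigns map and reduce work to task slots; only the three-phase shape survives in this sketch.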
22. Some Other Topologies…
Spark (iterative/interactive)
Titan (graph database)
Redis (in-memory data grid)
Zookeeper (distributed metadata)
HBase (columnar data objects)
Cassandra (key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
Greenplum (MPP)
SciDB (array database)
23. CAP Theorem
“You can have at most two of these properties for any shared-data
system… the choice of which feature to discard determines the
nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO)
[diagram: CAP triangle – C: strong consistency, A: high availability, P: partition tolerance – with eventual consistency on the A/P side]
cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
julianbrowne.com/article/viewer/brewers-cap-theorem
24. Access → Frameworks → CAP Theorem Forfeits
financial transactions              general ledger in RDBMS     C A x
ad-hoc queries                      RDS (hosted MySQL)          C A x
reporting, dashboards               like Pentaho                C A x
log rotation/persistence            like Riak, Cassandra        x x P
search indexes                      like ElasticSearch, Solr    x A P
static content, archives            S3 (durable storage)        x A P
key/value data objects              like HBase                  C x P
data prep, ETL, modeling at scale   like Hadoop/Cascading       C x P
graph queries                       like Titan                  C x P
26. Workflow
Circa 2013: clusters everywhere – “optimize topologies”
[diagram: Use Cases Across Topologies – Web Apps, Mobile, etc. serve Data Products to Customers; transactions and content land in an RDBMS, while social interactions and log events feed an In-Memory Data Grid (near time) and Hadoop, etc. (batch); a Cluster Scheduler and Planner allocate optimized capacity via taps; Data Scientist (discovery + modeling), App Dev (s/w dev), Ops (dashboard metrics), and Domain Expert roles bridge the introduced capability with the existing SDLC, DW, and business process for Prod/Eng]
28. Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
In their own words…
31. Modeling
back in the day, we worked with practices based on
data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst,
ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced data modeling
because the data won’t fit on one computer anymore
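the old recipe is small enough to sketch end-to-end – a hypothetical Java example (all names invented) that samples, fits a normal distribution, ignores the rest, and infers from the fit:

```java
import java.util.Random;

// Sketch of the classic data-modeling recipe:
// 1. sample the data, 2. fit the sample to a known distribution
// (here a normal), 3. ignore the rest, 4. infer from the fit.
public class DataModeling {

    // fit step: estimate mean and standard deviation from the sample
    static double[] fitNormal(double[] sample) {
        double mean = 0.0;
        for (double x : sample) mean += x;
        mean /= sample.length;
        double var = 0.0;
        for (double x : sample) var += (x - mean) * (x - mean);
        var /= (sample.length - 1);                  // unbiased sample variance
        return new double[]{ mean, Math.sqrt(var) };
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // pretend this is the full, annoyingly large data set...
        double[] population = new double[1_000_000];
        for (int i = 0; i < population.length; i++)
            population[i] = 100.0 + 15.0 * rng.nextGaussian();

        // 1. sample a tiny fraction; 3. throw the rest away
        double[] sample = new double[1000];
        for (int i = 0; i < sample.length; i++)
            sample[i] = population[rng.nextInt(population.length)];

        // 2. fit; 4. infer from the fitted distribution alone
        double[] fit = fitNormal(sample);
        System.out.printf("fitted mean=%.1f sd=%.1f%n", fit[0], fit[1]);
    }
}
```

one analyst, one model, one machine: everything downstream rests on the fitted mean and standard deviation, and the other 999,000 rows never get looked at again.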
32. Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
in other words, seeing the forest for the trees…
this paper chronicled a sea change from data modeling practices
(silos, manual process) to the rising use of algorithmic modeling
(machine data for automation/optimization)
33. Algorithmic Modeling
“The trick to being a scientist is to be open to using
a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc.,
yield dramatic increases in predictive power over earlier
modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of
ensembles, model chaining, etc.
the problems at hand have become simply too big and too
complex for ONE distribution, ONE model, ONE team…
stanford.edu/~lmackey/papers/netflix_story-nas11-slides.pdf
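bootstrap aggregation itself is compact enough to sketch – a hypothetical Java example (names invented) that trains decision stumps on bootstrap resamples and takes a majority vote, the core move behind Random Forest:

```java
import java.util.Random;

// Sketch of bootstrap aggregation ("bagging"): train many weak
// models on bootstrap resamples of the data, then vote.
// The weak model here is a 1-D decision stump.
public class Bagging {

    // learn a threshold: midpoint between the two class means
    static double trainStump(double[] xs, int[] ys) {
        double sum0 = 0, sum1 = 0; int n0 = 0, n1 = 0;
        for (int i = 0; i < xs.length; i++)
            if (ys[i] == 0) { sum0 += xs[i]; n0++; } else { sum1 += xs[i]; n1++; }
        if (n0 == 0 || n1 == 0)                 // degenerate resample:
            return (sum0 + sum1) / xs.length;   // fall back to overall mean
        return (sum0 / n0 + sum1 / n1) / 2.0;
    }

    // bagged prediction: majority vote over stumps trained on resamples
    static int predict(double[] xs, int[] ys, double query, int rounds, long seed) {
        Random rng = new Random(seed);
        int votes = 0;
        for (int r = 0; r < rounds; r++) {
            double[] bx = new double[xs.length];
            int[] by = new int[ys.length];
            for (int i = 0; i < xs.length; i++) {   // bootstrap resample,
                int j = rng.nextInt(xs.length);     // drawn with replacement
                bx[i] = xs[j]; by[i] = ys[j];
            }
            if (query > trainStump(bx, by)) votes++;
        }
        return votes * 2 > rounds ? 1 : 0;          // majority vote
    }

    public static void main(String[] args) {
        double[] xs = { 1.0, 1.2, 0.8, 4.0, 4.3, 3.9 };  // class 0 near 1, class 1 near 4
        int[] ys =    { 0,   0,   0,   1,   1,   1   };
        System.out.println(predict(xs, ys, 3.5, 101, 42L));  // prints 1
    }
}
```

the ensemble’s vote is steadier than any single resampled stump – the same intuition, at toy scale, behind the model chaining and blending that won the Netflix Prize.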
36. Attention
impromptu survey:
• how many people say they practice some kind of “Agile” process at work?
• how many people say that they DON’T practice “Agile” ?
• how many people say they are in a lean startup?
Q:
with respect to Big Data practices,
how is that working out?
Abby Fichtner vimeo.com/27797408
37. Agile Data?
some people see a reconciliation of Agile process and Big Data…
Agile Data
Russell Jurney, 2013
amazon.com/dp/1449326269
“Run like a studio, not an assembly line.”
38. Perhaps Not
great values, wrong domain…
that worked when we were building features in web apps
Agile represents industrialization of software engineering,
codifying social interactions, compartmentalizing attention
meanwhile, Data Science is inherently multi-disciplinary:
• teams of people with complementary skill sets
• actionable insights require weeks/months, not hours
• variance and statistical thinking are foreign to CS
LinkedIn-style problems circa 2011 required certain skills…
manipulating the Newtonian physics of data… that money
may be mostly off the table by now
Big Data opportunities ahead require different math?
41. Business Disruption
Geoffrey Moore
Mohr Davidow Ventures, author Crossing The Chasm / Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the entire Global 1000
on notice over the next decade… data as the major force… mostly
through apps – verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc. / XLDB, 2012:
complex analytics workloads are now displacing SQL as the basis
for Enterprise apps
Larry Page
CEO, Google / Wired, 2013:
create products and services that are 10 times better than the
competition… thousand-percent improvement requires rethinking
problems entirely, exploring the edges of what’s technically possible,
and having a lot more fun in the process
42. Business Drivers
algorithmic modeling + machine data
+ curation, metadata + Open Data
data products, as feedback into automation
evolution of feedback loops
less about “bigness”, more about complexity
internet of things + A/D conversion
+ complex analytics
accelerated evolution, additional feedback loops
orders of magnitude higher data rates
Internet of Things accelerates this process of disruption
“A kind of Cambrian explosion” (source: National Geographic)
44. A Thought Exercise
consider that when a company like Caterpillar moves
into data science, they won’t be building the world’s
next search engine or social network
they will most likely be optimizing supply chain,
optimizing fuel costs, automating data feedback
loops integrated into their equipment…
that’s a $50B company,
in a market segment worth $250B
upcoming: tractors as drones –
guided by complex, distributed data apps
Operations Research –
crunching amazing amounts of data
48. Algorithms
many algorithm libraries used today are based on implementations
back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, UMD
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead in sophisticated
algorithms work – as Breiman suggested in 2001 – which may take
a while to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
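as one concrete example of those probabilistic data structures, a minimal Bloom filter sketch in Java (a hypothetical, illustrative implementation, not any particular library):

```java
import java.util.BitSet;

// Sketch of a probabilistic data structure: a Bloom filter answers
// set membership in O(k) time and a few bits per item, at the cost
// of a tunable false-positive rate - handy for screening massive
// streams before touching disk.
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // derive k positions from two base hashes (Kirsch-Mitzenmacher trick)
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = h1 >>> 16 | h1 << 16;   // a cheap second hash
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(position(item, i));
    }

    // false negatives never happen; false positives can
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++)
            if (!bits.get(position(item, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(1 << 16, 4);
        bf.add("hadoop");
        bf.add("cascading");
        System.out.println(bf.mightContain("hadoop"));   // true
        System.out.println(bf.mightContain("mesos"));    // almost surely false
    }
}
```

false negatives never occur, and the false-positive rate is tuned by the bits-per-item and hash count – which is why such sketches work as cheap pre-filters in streaming workloads.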
50. How much does it cost you to earn $1B?
also, take a moment to check this out…
(IMHO most interesting algorithm work recently)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
Tristan Jehan
distributed algorithms for high ROI
use cases on cost-effective clustered
resources…
we’re learning how to do it right
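a simpler (and less numerically stable) cousin of that tall-and-skinny QR shows why it parallelizes: CholeskyQR forms the small k×k Gram matrix AᵀA – an aggregate that can be accumulated row-block by row-block across a cluster – and factors it. A hypothetical single-machine Java sketch:

```java
// Sketch of CholeskyQR for a tall-and-skinny matrix A (n >> k):
// form the small k-by-k Gram matrix G = A^T A (an aggregate that
// parallelizes over row blocks), then take R = chol(G), so that
// A = QR with Q = A R^-1. Simpler, but less numerically stable,
// than the TSQR approach cited above.
public class CholeskyQR {

    // G = A^T A, accumulated one row at a time (the parallelizable part)
    static double[][] gram(double[][] a) {
        int k = a[0].length;
        double[][] g = new double[k][k];
        for (double[] row : a)
            for (int i = 0; i < k; i++)
                for (int j = 0; j < k; j++)
                    g[i][j] += row[i] * row[j];
        return g;
    }

    // upper-triangular R with R^T R = G (Cholesky factorization)
    static double[][] cholesky(double[][] g) {
        int k = g.length;
        double[][] r = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = i; j < k; j++) {
                double s = g[i][j];
                for (int p = 0; p < i; p++) s -= r[p][i] * r[p][j];
                r[i][j] = (i == j) ? Math.sqrt(s) : s / r[i][i];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        double[][] a = { {1, 0}, {1, 1}, {1, 2}, {1, 3} };  // 4 x 2, tall and skinny
        double[][] r = cholesky(gram(a));
        System.out.printf("R = [[%.3f, %.3f], [0, %.3f]]%n", r[0][0], r[0][1], r[1][1]);
    }
}
```

the Gram-matrix step is exactly the kind of partial aggregate a MapReduce or Scalding job computes per block and sums, which is why tall-and-skinny factorizations fit Hadoop clusters so naturally.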
53. Personality
we have perhaps built computers (once named “electronic
brains”) in the image of John von Neumann, et al.: standalone
genius, aristotelian uber-geek, incredible capacity for memory
and logic, overbearing, not particularly cooperative…
one can almost imagine a war-time dialogue,“Get one of these
guys in the room, they’ll solve anything!” … as a result, decades
of mutually assured destruction for global strategy
Q:
have we created software engineering practices which selected for
this kind of personality? selecting for “lone wolf” guys, socially
awkward, ONE person who can understand an entire code base,
able to out-logic and out-argue the rest of the room… charming
fellow, really
have we enabled software process to box these personalities
into something resembling teams? along with overtly described
rules for social conventions… silos, in other words
54. Chasing Unicorns
silos… but didn’t that all change?
because the data won’t fit on one computer anymore
leverage with data science teams is where organizations
tear down internal silos, socializing hard problems
data won’t fit on one computer anymore, problems won’t
fit in one department anymore, the code base won’t fit
into one uber-geek’s memory recall anymore…
so we embrace distributed systems for solutions
Q:
“Why aren’t there more women in engineering?”
IMHO, we’re trying to select for a personality which
doesn’t exist, and would not resolve current challenges;
meanwhile, my data science teams run about 50/50
57. Clusters
a little secret: people like me make a good living by
leveraging high ROI apps based on clusters, and so
the execs agree to build out more data centers…
clusters for Hadoop/Hive/HBase, clusters for Memcached,
for Cassandra, for MySQL, for Storm, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage; but terrible for utilization
leveraging VMs and various notions of “cloud” helps
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS”… All your workloads are belong to us
regardless of how architectures change, death and taxes
will endure: servers fail, and data must move
(image: Google Data Center, Fox News, ~2002)
58. Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead,
with much improved ROI on data centers
John Wilkes, et al.
Borg/Omega: “10x” secret sauce
youtu.be/0ZFMlO98Jkc
[charts: CPU load over time, 0–100%, plotted separately for Rails, Memcached, and Hadoop, and as a combined CPU load (Rails, Memcached, Hadoop)]
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg
incubator.apache.org/mesos
61. Trendlines
Big Data? we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Jawbone, Nike, etc.,
plus the secondary/tertiary effects of Google Glass
7+ billion people, instrumented better than … how we
have Nagios instrumenting our web servers right now
plus the business implications given that much of the
Global 1000 is positioned to be disrupted technologyreview.com/...
62. Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q:
what kinds of evolution in topologies could
this imply?
65. Languages
JVM-based languages became popular for Big Data open source
technologies:
• partly because YHOO adopted Hadoop, etc.
• partly because Enterprise IT shops have J2EE expertise
• partly because of functional languages: Clojure, Scala
JVM has its drawbacks, especially for low-latency use cases
ample use of languages such as Python and Erlang in Big Data
practices, plus keep in mind that Google uses C++
Functional Thinking
Neal Ford
youtu.be/plSZIkLodDM
a hunch: issues about current programming languages are
secondary to culture
66. Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
67. references…
“Scalable and Flexible Machine Learning With Scala @ LinkedIn”
Vitaly Gordon [ especially see slide #9 ]
slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
Elements Of Functional Programming
Chris Reade
amazon.com/dp/0201129159
70. Organization
How Do Committees Invent?
Melvin Conway, 1968
melconway.com/research/committees.html
Manu Cornet bonkersworld.net
“Any organization that designs a system
(defined more broadly here than just
information systems) will inevitably
produce a design whose structure is a
copy of the organization’s communication
structure.”
Q:
• does this fit with software process?
• does this fit with distributed apps?
see also:
haacked.com/archive/2013/05/13/applying-conways-law.aspx
71. Cooperation
perhaps we have selected for the wrong
personality to idealize…
linkedin.com/today/post/article/20130520190305-110300724-why-nothing-not-even-software-can-eat-the-world
All long-term success depends on eliciting
the voluntary support of an ecosystem.
As the African proverb says, “If you want
to go fast, go alone; if you want to go far,
go with others.” – Geoffrey Moore
75. Architecture
Rich Hickey, Nathan Marz, Stuart Sierra, et al.:
functional programming to help reduce
costs over time
1. technical debt? this is how an organization
builds a culture to avoid it
2. Conway's Law corollary: model teams and
communication based on properties of the
desired architecture
3. also consider Mesos/Borg: schedule data
to be located where [CPU, RAM, I/O, surety]
will become available
Rich Hickey, infoq.com/presentations/Simple-Made-Easy
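point 3 above is small enough to sketch – a toy, Mesos-flavored Java example (entirely hypothetical, not the Mesos API) in which tasks sized by [CPU, RAM] are matched against per-node resource offers:

```java
import java.util.*;

// Toy sketch of Mesos/Borg-style two-level scheduling: nodes offer
// [cpu, ramGb] resources, and work is placed on the first offer
// that fits - "schedule work where resources will become available."
public class ToyScheduler {

    static Map<String, String> schedule(Map<String, int[]> offers,   // node -> [cpu, ramGb] free
                                        Map<String, int[]> tasks) {  // task -> [cpu, ramGb] needed
        Map<String, String> placement = new TreeMap<>();
        for (Map.Entry<String, int[]> task : tasks.entrySet()) {
            for (Map.Entry<String, int[]> offer : offers.entrySet()) {
                int[] free = offer.getValue(), need = task.getValue();
                if (free[0] >= need[0] && free[1] >= need[1]) {
                    free[0] -= need[0];                     // claim the resources,
                    free[1] -= need[1];                     // shrinking the offer
                    placement.put(task.getKey(), offer.getKey());
                    break;
                }
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, int[]> offers = new TreeMap<>();
        offers.put("node1", new int[]{ 4, 16 });
        offers.put("node2", new int[]{ 8, 32 });
        Map<String, int[]> tasks = new TreeMap<>();
        tasks.put("hadoop-task", new int[]{ 6, 24 });  // only fits node2
        tasks.put("rails-task",  new int[]{ 2,  4 });  // fits node1
        System.out.println(schedule(offers, tasks));   // {hadoop-task=node2, rails-task=node1}
    }
}
```

mixing workloads on shared offers is what evens out the cluster utilization shown on the previous slide; real schedulers also weigh I/O, locality, and surety, which this sketch omits.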
77. Pattern Language
structured method for solving large, complex design
problems, where the syntax of the language ensures
the use of best practices – i.e., conveying expertise
[flow diagram labels: employee, quarterly sales, leads; Join, Count; PMML classifier; bonus allocation; failure traps]
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
80. Culture
Notes from the Mystery Machine Bus
Steve Yegge, Google
goo.gl/SeRZa
consider these perspectives
in light of Conway’s Law…
“conservatism”                         “liberalism”
(mostly) Enterprise                    (mostly) Start-Up
risk management                        customer experiments
assurance                              flexibility
well-defined schema                    schema follows code
explicit configuration                 convention
type-checking compiler                 interpreted scripts
wants no surprises                     wants no impediments
Java, Scala, Clojure, etc.             PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc.    Hive, Pig, Hadoop Streaming, etc.
81. Two Avenues to the App Layer…
Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks which can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
d3js.org
What is needed?
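the first skill on that list can be made concrete with a small, hypothetical Java sketch (names invented) of programmable data preparation – trim, normalize, and drop malformed rows before any downstream framework ever sees them:

```java
import java.util.*;

// Sketch of "programmable tools that prepare data": scrub a raw
// log-style feed - trim fields, normalize case, drop malformed
// rows - making the clean-up step repeatable and automated.
public class DataPrep {

    // returns cleaned "user,action" records; malformed rows are dropped
    static List<String> clean(List<String> rawRows) {
        List<String> out = new ArrayList<>();
        for (String row : rawRows) {
            String[] fields = row.split(",");
            if (fields.length != 2) continue;              // drop malformed rows
            String user = fields[0].trim().toLowerCase();
            String action = fields[1].trim().toLowerCase();
            if (user.isEmpty() || action.isEmpty()) continue;  // drop empty fields
            out.add(user + "," + action);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "  Alice , CLICK ",
            "bob,purchase",
            "corrupted-row",      // no delimiter: dropped
            " ,view"              // empty user: dropped
        );
        System.out.println(clean(raw));  // [alice,click, bob,purchase]
    }
}
```

scripted clean-up like this is what makes the analysis repeatable; the same scrub runs unchanged whether the feed is a test fixture or a terabyte of logs.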
85. Learning Curves
difficulties in the commercial use of distributed systems
often get represented as issues of managing complexity
much of the risk in managing a data science team is about
budgeting for learning curve: some orgs practice a kind of
engineering “conservatism”, with highly structured process
and strictly codified practices – people learn a few things
well, then avoid having to struggle with learning many new
things perpetually…
that approach leads to enormous teams and low ROI
ultimately, the challenge is about
managing learning curves within
a social context
87. Management
ultimately, the challenge is about managing
learning curves within a social context
[chart axes: est. cost of individual learning, initial impl; est. cost of team re-learning, lifecycle]
some technologies constrain the
need to learn, others accelerate
re-learning prior business logic…
choose the latter, FTW!
IMHO, the “agile” part was intended to be
about shared learnings; while the “lean” part
was about how much you have on your plate
at any one time
88. blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
Throw Your Life a Curve
Whitney Johnson
Aggressively Pro-Active Learning
• deconstruction of the cognitive bias One Size Fits All
• “makes a compelling case for personal disruption”
• “plan your career around learning curves”
• hire people who learn/re-learn efficiently
89. Summary
to be competitive globally with Big Data
requires learning many technologies –
then learning the nuances of a code base for
which the team is responsible, learning the
ever-changing surprises and insights which
are hidden deep within mountains of data,
plus the ever-evolving mathematics needed
to grapple with these conditions effectively
because the data won’t fit on one computer anymore
[map: First Principles, Topologies, Languages, Modeling, Attention, Clusters, Algorithms, Trendlines, Organization, Architecture, Culture, Business, Personality, Learning Curves – you are here]
91. Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
[diagram: data sources → ETL → data prep → predictive model → end uses]
• ANSI SQL for ETL and SAS for predictive models – most of the licensing costs…
• J2EE for business logic – most of the project costs…
97. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all… cascading.org
[diagram: source taps for Cassandra, JDBC, Splunk, etc. → ETL (Lingual: DW → ANSI SQL) → data prep → predictive model (Pattern: SAS, R, etc. → PMML) → end uses via sink taps for Memcached, HBase, MongoDB, etc., with business logic in Java, Clojure, Scala, etc.]
the ETL component, expressed as ANSI SQL via Lingual:
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );
SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );
flowDef.addAssemblyPlanner( sqlPlanner );
the predictive model component, scoring a PMML export via Pattern:
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();
flowDef.addAssemblyPlanner( pmmlPlanner );
visual collaboration for the business logic is a great
way to improve how teams work together:
Literate Programming, Don Knuth
www-cs-faculty.stanford.edu/~uno/lp.html
[flow diagram labels: employee, quarterly sales, leads; Join, Count; PMML classifier; bonus allocation; failure traps]
multiple departments, working in their respective
frameworks, integrate results into a combined app,
which runs at scale on a cluster… business process
combined in a common space (DAG) for flow
planners, compiler, optimization, troubleshooting,
exception handling, notifications, security audit,
performance monitoring, etc.
102. Workflow
Circa 2013: clusters everywhere – Four-Part Harmony
[workflow diagram as on slide 26: use cases across topologies]
1. End Use Cases, the drivers
2. A new kind of team process
3. Abstraction layer as optimizing middleware, e.g., Cascading
4. Distributed OS, e.g., Mesos