A New Year in Data Science:
ML Unpaused
Data Day Texas
Austin, 2015-01-10
Paco Nathan, @pacoid
Observations about Machine Learning, Data Science,
Big Data, Open Source, Cluster Computing, Notebooks,
etc., over the past year … plus, a look ahead
Backstory
Backstory: The Sun Also Rises
Some wake early in
the morning and go
build buildings
Some gaze into the
heavens, sit back,
and explain the
process…

Clearly, provably,
our Sun revolves
around the Earth
at an observable
rate
Others create and
evaluate models to
predict the Earth’s
orbit of the Sun
Sometimes, when
the sky gods become
angry and obscure
the Sun as our due
punishment…

We grow scared and
react: sacrifices must
be offered, our plans
must change, etc.
These points are what
I’d like to discuss today
Whither Data Science?
twitter.com/josh_wills/status/198093512149958656
Feel free to disagree, but I find that definition
to be flawed…

1. That ignores DevOps (how’s that working out?)
and Visualization/Design (ditto)

2. When the CEO asks you to help explain why
revenue nose-dived over the past month…
neither field has a clue about how to model
business phenomena
Software Engineering:
implement and test a model that somebody selected

…almost ignores the matter of modeling entirely,
at least not since old school types like Dijkstra

Statistics:
measure and justify a model that somebody selected

…was never particularly good at teaching how to
model problems – as two renowned statisticians,
William Cleveland and Leo Breiman, noted
Both ļ¬elds are necessary,
but not sufļ¬cient
TheThorn in the Side of Big Data: too few artistsā€Ø
Christopher RĆ©, Stanfordā€Ø
safaribooksonline.com/library/view/strata-conference-santa/9781491900321/
part92.html
“You should think
about features and
not algorithms”
Remember EJBs?
Floyd Marinescu observed about the aftermath
of EJBs in Brief History…

Intended for building framework components,
e.g., for IBM, Oracle, Sun, but not many others

Based on RMI, prior to notions
like RESTful web services
Enterprise Java Beans: Lessons from hate-watch reality television
Maybe a handful of people in the world would
ever actually need to use EJBs, but those few
people wanted a spec

Then, for tragic political reasons (MSFT envy),
Sun Microsystems made EJBs prominent in
their Java APIs
Fortunately, we evolved: Spring, JBoss, etc.,
those came along as relatively more sane tech

Now we see the Docker thing soar, with notions
such as microservices displacing legacy cruft

(BTW, if you haven’t yet, check out Weave)
I mention this because, to me, EJB represented
a convoluted form of template thinking:
developing complex web apps
for the sake of
developing complex web apps
IRL developers and template thinking don’t
determine public policy… right?
To paraphrase Dean Wampler, consider WordCount,
a simple app written for MapReduce in Hadoop …
~50 lines of unapologetic Java that feels hella like
writing EJBs:
Compare that with functional programming, where
the same WC app is three lines of easily-read Scala
when run in Apache Spark:
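The Scala listing from this slide did not survive extraction. As a stand-in, here is a minimal sketch in plain Python (not Spark, and not the author's code) of the same split-then-tally shape. The name `word_count` and the sample lines are hypothetical.

```python
from collections import Counter
from itertools import chain

def word_count(lines):
    """Split every line into words (the flatMap step), then tally them (map + reduceByKey)."""
    words = chain.from_iterable(line.split() for line in lines)
    return Counter(words)

# Hypothetical input; a real job would read lines from a file or a cluster dataset.
counts = word_count(["to be or not", "to be"])
```

The point of the comparison holds either way: the functional form states what to compute and leaves the plumbing to the framework.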
Check out Dean’s talk at 11:00,
“Why Scala is Taking Over
the Big Data World”
Hadoop suffers because, IMHO, that convoluted
EJB style of developer-centric template thinking
staged a coup
Perhaps we could
“donate” some
OSS talent…

Send a pull
request…

Or something.
Lies, Damn Lies,
Statistics, and
Data Science
Probability got going, formally, in the 16th c. –
although interesting mathematical estimations
trace back to classical times

Arabs in the 9th c. used frequency analysis –
later rediscovered by Europeans during the
early Italian Renaissance

Statistics followed, originally more about what
we might call demographics – through 18th c.
Lies, Damn Lies, Statistics, Data Science
Laplace, Gauss, et al., bridged the fields in the
late 18th c. using distributions (what we studied
in Stats 101) to infer the probability of errors
in estimates

Much of the 19th/20th c. work was about using
goodness of fit tests, etc., justifying some distribution

• generally speaking, that requires samples

• that, in turn, implies batch windows
That kind of template thinking in action
really lurvs it some batch windows
While 19th/20th c. stats work focused on
defensibility

21st c. work, w.r.t. Big Data apps, focuses more
on predictability – plus there’s a shift in how we
make estimates…
BTW, doesn’t it seem weird to crunch through piles
of data in large batch jobs, at large expense, when
the results get used to approximate features
ultimately? Why not perform that in stream?
A fascinating, relatively new area pioneered by
relatively few people – e.g., Philippe Flajolet

Provides approximation with error bounds
using far fewer resources (RAM, CPU, etc.)
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
algorithm         use case
Bloom Filter      set membership
MinHash           set similarity
HyperLogLog       set cardinality
Count-Min Sketch  frequency summaries
DSQ               streaming quantiles
SkipList          ordered sequence search
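To make the first row concrete, here is a minimal Bloom filter sketch in Python. It is an illustration under stated assumptions, not code from the talk: the class name, the default sizes, and the use of salted SHA-256 digests as the hash family are all choices made for this example.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Assumption: derive k hash positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for user in ["alice", "bob", "carol"]:
    bf.add(user)
```

Memory stays fixed at `num_bits` no matter how many items are added; what grows instead is the false-positive rate, which is the trade the slide describes.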
E.g., ±4% could buy you two orders of magnitude
reduction in the required memory footprint for
an analytics app

OSS projects such as Algebird and BlinkDB
provide for this newer approach to the math of
approximations at scale
Oscar Boykin at 14:00,
“Aggregators: Modeling
Data Queries Functionally”

co-author of Algebird, Scalding
The Interzone
Data Science is inherently interdisciplinary

To paraphrase Chris Ré, emphasis on algorithms
is relatively minor in the grand scheme –

Especially when compared to needs for modeling
business problems effectively

To wit: beyond phenomenology, leading
into quantitative analysis and repeatable results

On the one hand, CS + Stats do not quite address
those needs…
On the other hand, Physics
does well to teach modeling –

I like to hire physicists to work
on Data teams…
They tend to get the interdisciplinary aspects:
got the math background, coding experience,
generally good at systems engineering, etc.

Not saying we should all rush out to get Physics
degrees; there’s something to be learned there,
vital for the work and priorities ahead
I mention this because we are at a crossroads,
which has more to do with the physical world –
some talks here at DDTx15 help illustrate that

Vast implications for Health Care, Transportation,
Agriculture, Energy, Gov, Manufacturing in general…

More about that
in a bit –
The Libraries
Most of the ML libraries that one encounters
today focus on two general kinds of solutions:

• convex optimization

• matrix factorization
The Libraries: Alexandria Redux
One might think of the convex optimization
in this case as a kind of curve fitting – generally
with some regularization term to avoid overfitting,
which is not good
[Figure: two example fits, labeled Good and Bad]
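As a small worked instance of "curve fitting with a regularization term", here is a sketch that fits a single slope through the origin with an L2 (ridge) penalty. The function name, the data, and the penalty value are made up for illustration; the closed form follows from setting the derivative of the convex objective to zero.

```python
def fit_slope(xs, ys, lam=0.0):
    """Slope w minimizing sum((y - w*x)**2) + lam * w**2.

    The objective is convex in w, and its minimizer has the closed form
    w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = fit_slope(xs, ys)            # no penalty: recovers the slope of the data
w_ridge = fit_slope(xs, ys, lam=14.0)  # penalty pulls the slope toward zero
```

A larger `lam` trades fidelity to the training points for a smaller, more conservative coefficient, which is how the regularization term pushes back against overfitting.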
For supervised learning, used to create classifiers:

1. categorize the expected data into N classes

2. split a sample of the data into train/test sets

3. use learners to optimize classifiers based on
the training set, to label the data into N classes

4. evaluate the classifiers against the test set,
measuring error in predicted vs. expected labels
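The four steps above can be sketched end to end in plain Python. This is an illustration only: the nearest-centroid learner, the synthetic "good"/"bad" clusters, and the 70/30 split are assumptions chosen to keep the example short, not anything prescribed by the talk.

```python
import random

def train_centroids(rows):
    """Learner: compute the mean feature vector (centroid) for each class label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Classifier: return the label whose centroid is nearest (squared distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=dist)

# Step 1: two classes, drawn here from well-separated synthetic clusters.
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "good") for _ in range(50)]
data += [([rng.gauss(10, 1), rng.gauss(10, 1)], "bad") for _ in range(50)]

# Step 2: shuffle, then split the sample into train and test sets.
rng.shuffle(data)
train, test = data[:70], data[70:]

# Step 3: fit the classifier on the training set only.
centroids = train_centroids(train)

# Step 4: error rate is the fraction of test rows whose predicted label is wrong.
errors = sum(predict(centroids, features) != label for features, label in test)
error_rate = errors / len(test)
```

Keeping the test rows out of step 3 is the whole point of the split: the error rate then estimates how the classifier behaves on data it has not seen.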
Bokay, great for security problems with simply
two classes: good guys vs. bad guys

How do you decide what the classes are
for more complex problems in business?

That’s where the matrix factorization
parts come in handy…
For unsupervised learning, which is often used
to reduce dimension:

1. create a covariance matrix of the data

2. solve for the eigenvectors and eigenvalues
of the matrix

3. select the top N eigenvectors, based on
diminishing returns for how they explain
variance in the data

4. those eigenvectors define your N classes
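A minimal sketch of steps 1 to 3, using only the standard library. It finds just the single dominant eigenvector by power iteration, which is a swapped-in simplification: a real library would compute a full decomposition, and recovering further eigenvectors (for example by deflation) is omitted here. Function names and data are hypothetical.

```python
def covariance_matrix(rows):
    """Step 1: sample covariance matrix of rows (equal-length feature lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=200):
    """Step 2 (partial): dominant eigenvalue and unit eigenvector via power iteration."""
    d = len(matrix)
    vec = [1.0] * d  # assumes the start vector is not orthogonal to the answer
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, vec

# Hypothetical 2-D data lying close to the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.0]]
cov = covariance_matrix(rows)
eigenvalue, direction = top_eigenpair(cov)

# Step 3: share of total variance explained by the top eigenvector.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
```

When one direction explains nearly all the variance, as it does for this data, the remaining dimension can be dropped with little loss, which is the "diminishing returns" test in step 3.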
An excellent overview of ML definitions
(up to this point) is given in:
A Few Useful Things to Know about Machine Learning
Pedro Domingos
CACM 55:10 (Oct 2012)
http://dl.acm.org/citation.cfm?id=2347755
To wit:

Generalization = Representation + Optimization + Evaluation
[Figure: ML workflow circa 2010, spanning representation, optimization, and evaluation. ETL into cluster/cloud, Data Prep, Features, Unsupervised Learning, Explore, Learners and Parameters, train set, test set, models, Evaluate, Optimize, Scoring, production data, use cases; data pipelines, actionable results, decisions, feedback; visualize, reporting. Callouts: “foo algorithms”, “bar developers”.]
Algorithms and developer-centric template thinking
only go so far in a workflow…
Results are shown in blue, and the real work
is highlighted in red
1. focus on features not algorithms

2. learn how to model business
problems by leveraging data

3. notice the workflows needed?

4. leave the dev-centric thinking
for odd city council meetings
Matthew Kirk 12:00
“Lessons Learned: Machine Learning
and Technical Debt”
Ted Dunning 13:00
“Computing with Chaos”
Julia Evans 15:00
“Data Pipelines. They're a lot of work!”
Christopher Johnson 16:00
“Scala Data Pipelines for Music
Recommendations”
Even so, business demands extend far beyond
what classifiers and labels alone can give us…

Businesses lurv Optimization, gobs of it; in
that context ML libraries today merely scratch
the surface

Round hole, square peg
Imagine that you compete with FedEx… how do
you optimize delivery routes for airplanes, trucks,
trains, nanodrones, hoverboards, etc.?
Which do you optimize: fuel cost,
delivery time, maintenance schedules,
minimizing lost packages?

Doesn’t sound much like online
advertising, social networks, or
any episode of Silicon Valley
ML, Unpaused
What were the origins of machine learning?

• Marvin Minsky @MIT, 1950s

• Support Vector Machines @Bell Labs, 1990s

• Google @Stanford, 1990s

• Ray Kurzweil, 2000s

Nope…
ML has been an aspect of AI research for a
long while, through several different vectors

A good early history (up to 1980s) is given in:
Machine Learning: A Historical and Methodological Analysis
Jaime Carbonell, Ryszard Michalski, Tom Mitchell
AI Magazine 4:3 (1983)
http://dx.doi.org/10.1609/aimag.v4i3.406
To wit:

task-oriented studies, knowledge acquisition, cognitive
simulation, theoretical exploration … overall, a much
broader class of optimization problems
Circa early 1980s:
An era of anticipation – AI was making inroads…

• emphasis on capturing/representing knowledge
and expertise – production use cases in medicine

• Fifth Generation Computing (parallel h/w)
in Japan, MCC, etc.

However:

• few outside academia had enough cluster compute
power – aside from 3-letter agencies and AT&T

• meanwhile ML was not yet considered “academic”
enough within academia
But…
Stock market “corrected” in 1987:
Circa early 1990s:
Some fundamental tech platforms emerge…

• Hubble Space Telescope, Human Genome Project,
WWW, electric cars relaunched

And throughout that decade:

• Linux, Java @Sun, JavaScript @Netscape

• Firefly, an initial commercial ML app
on teh interwebs @MIT Media Lab

• Rise of e-commerce leveraging horizontal
scale-out with commodity hardware
But…
Stock market “tumbled” in 2000:
Circa early 2000s:
GOOG AMZN EBAY YHOO LNKD NFLX FB TWTR
emerged out of the dust…

• web apps dominated for search, e-commerce,
social networks, etc.

• did we mention EJBs and template thinking?

• mobile picked up traction

• recommender systems went mainstream

• AI picked up with semantic web efforts…
But…
Stock market “went free-fall” in 2008:
Circa mid 2010s:
Successful e-commerce firms have IPO’ed and are
now busy building skyscrapers in downtown SF…
LinkedIn, 350 Bush
Transbay Transit
Salesforce, 415 Mission
But…
An odd truism about the hubris of the uber-wealthy
and the timing of their skyscraper projects…
Sears Tower, Chicago
Lehman Brothers, London
Fontainebleau, Las Vegas
Circa mid 2010s: Back to the Future
Businesses lurv Optimization, lots of it…

• ML circa 1985 focused on those needs, but got
knocked back to something inevitably more
Aristotelian and predictable

• Outside of Silicon Valley, we’ve made big strides

• One danger: next downturn cycle, VCs might
reshape tech industry, reverting to “safe bets”
However, a few extremely interesting
aspects have emerged…
[Figure: the circa-2010 ML workflow diagram, repeated]
We have approximation, deep learning and
symbolic regression to assist on “Features”
[Figure: the circa-2010 ML workflow diagram, repeated]
Or, maybe, cognitive computing will help on
several of the more difficult aspects of this…
Circa mid 2010s: Extremely Interesting Emerging Aspects
DeepDive @Stanford

http://deepdive.stanford.edu/
Knowledge Graph @Google

http://www.google.com/insidesearch/features/search/knowledge.html
IBM Watson

http://www.ibm.com/smarterplanet/us/en/ibmwatson/
Scaled Inference

https://scaledinference.com/
Rhetorical postures: “Is AI a good idea,
or potentially harmful?”
– per Elon Musk, et al.
Clearly: good idea
brewbot.io
Speaking of which, a highly recommended podcast
by actual data scientists drinking really good beers:
partiallyderivative.com
2015: Notebooks in Containers in the Cloud
“Keep simple things simple
and complex things possible.”
databricks.com/product
Publishing Workflows for Jupyter

Andrew Odewahn, Kyle Kelley, Rune Madsen

odewahn.github.io/publishing-workflows-for-jupyter
IPython Interactive Demo
Nature Magazine + Rackspace

nature.com/news/ipython-interactive-demo-7.21492
Makes me wonder about the “data engineer”
role … notebooks simplify ops needs, while
ultimately the domain experts wield the real
power with data
Frontstory
Frontstory: The Sun Also Rises
Some wake early in
the morning and go
build buildings
dev-centric templates
Some gaze into the
heavens, sit back,
and explain the
process…
20th c. stats
Sometimes, when
the sky gods become
angry and obscure
the Sun as our due
punishment…
VCs during recessions
Others create and
evaluate models to
predict the Earth’s
orbit of the Sun
What’s needed most
Forward Motion:
SV trend: early data scientists displace old-school
product managers

Because there are hard
problems to be solved…

Because we need
new eyes on target…

Because use cases…
Because Use Cases
Because Use Cases: Health Care
Employing tech such as deep learning and
cognitive computing for vital use cases in
health care:
“In fact, using our Topological Data Analysis system, they were
able to discover multiple types of Type 2 diabetes … huge
impact on all the hundreds of millions of people” – Ayasdi
“Nobody knows what to do with those archives … They’re just
sitting there, costing money. This is just seen as a big opportunity.
It’s like, ‘Oh, this is what we were saving this up for!’” – Enlitic
“Sloan-Kettering is also training Watson on 1,500 real-world lung
cancer cases, helping it to decipher physician notes and learn
from the hospital’s expertise in treating cancer.” – IBM Watson
Because Use Cases: Transportation
http://automatic.com/

Detects events like hard braking, acceleration – uploaded in
real-time with geolocation to a Spark Streaming pipeline …
data trends indicate road hazards, blind intersections, bad
signal placement, and other input to improve traffic planning.
Also detects inefficient vehicle operation, under-inflated tires,
poor driving behaviors, aggressive acceleration, etc.
Because Use Cases: Education
https://databricks.com/blog/2014/12/08/pearson…

Integrates Kafka + Spark Streaming + Cassandra +
Blur, running within a YARN cluster on AWS to provide
a scalable, reliable, cloud-based platform for services
that analyze student performance across product and
institution boundaries.
Delivers immersive learning experiences
designed for how students read, think,
and learn; as well as efficacy insights to
both learners and institutions which were
not possible before.

Reliability features handle Kafka node
failures, receiver failures, leader changes,
committed offset in ZK, plus adjustable
data-rate throughput.
Because Use Cases: Language, everywhere
http://idibon.com/
http://digitalreasoning.com/

Our social fabric is encoded as text documents,
and similarly it gets tested, deployed, maintained,
and monitored there – it’s the launch point for
cognitive computing.
Robert Munroe, 12:00 “Building Better
Experts: co-optimization of human and
machine intelligence at Idibon”
Andrew Trask, David Gilmore 11:00
“Deep Learning for Natural Language
Processing”
Because Use Cases: Geospatial
Advanced geo use cases throughout all levels of gov
and industry for Big Data, machine learning, graph
algorithms, approximations, etc.

If you roll trucks you probably use licenses from ESRI.

Also consider the IoT sensor data, e.g., from National
Instruments' customers – where does it go, what do
organizations use to analyze it?

These are the large-scale optimization problems
you were looking for…
http://esri.github.io/gis-tools-for-hadoop/ (and Spark)
http://thunderheadxpler.blogspot.com/
http://geotrellis.io/
http://www.oculusinfo.com/tiles/
https://databricks.com/blog/2014/12/03/app...
Because Use Cases: Telecom, Travel, Banking, etc.
http://spark-summit.org/2014/talk/stratio-streaming…

Stratio represents one of the most sophisticated
integrations for Spark Streaming – the union of
a real-time messaging bus with a complex event
processing engine: Kafka, Spark Streaming,
Cassandra, along with the Siddhi CEP engine

Telecom, in particular, is leveraging this new
streaming technology as a big win near-term

http://www.openstratio.org/
https://github.com/stratio

https://github.com/Stratio/streaming-cep-engine
BTW if you’re in Madrid next fall
check out Big Data Hispano
Because Use Cases…
Common theme: many of those use cases are
powered by Apache Spark –

Especially notice Spark Streaming, which is a big
game-changer for analytics across industry
Taylor Goetz 11:00
“Beyond the Tweeting Toaster: IoT
Streaming Analytics With Apache
Storm, Kafka, and Arduino”
Hari Shreedharan 12:00
“Real Time Data Processing Using
Spark Streaming”
Because Use Cases: Agriculture
Ag+Data Issues
http://radar.oreilly.com/2014/04/agdata.html

Data Guild whitepaper: Ag Systems + Data Outlook
http://goo.gl/OK8RFf

• livelihood for 40% of world population

• $15T/year annual GDP globally

• data-intensive issues, much legal impasse

Over a half billion small farms worldwide, and most
are family-run farms that rely on rain-fed agriculture

Nudge, and I just might propose DWave clusters
into cold craters on the Lunar South Pole with
routers @L5 and an LLO skyhook… to handle
the vector quantization demands. Or something.
Layered Sensing Networks
microsats: e.g., Planet Labs, 400 km
airships: e.g., JP Aerospace, 40 km
atmosats: e.g., Titan Aerospace, 20 km
drones: e.g., HoneyComb, 120 m
robots: e.g., Blue River, 1 m
sensors: e.g., Hortau, -0.3 m
Resources
certification:
Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
MOOCs:
Anthony Joseph
UC Berkeley

begins 2015-02-23

edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
Ameet Talwalkar
UCLA

begins 2015-04-14

edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
http://spark-summit.org/
confs:
Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015
Spark Summit East
NYC, Mar 18-19
spark-summit.org/east
Big Data Tech Con
Boston, Apr 26-28
bigdatatechcon.com
Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015
Spark Summit 2015
SF, Jun 15-17
spark-summit.org
books:
Fast Data Processing
with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do
Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/
Learning Spark
Holden Karau,
Andy Konwinski,
Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
presenter:
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates,
events, conf summaries, etc.:
liber118.com/pxn/
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do

More Related Content

What's hot

SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
Ā 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
Ā 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
Ā 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
Ā 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
Ā 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
Ā 
Machine Learning in the Cloud with GraphLab
Machine Learning in the Cloud with GraphLabMachine Learning in the Cloud with GraphLab
Machine Learning in the Cloud with GraphLabDanny Bickson
Ā 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Doug Needham
Ā 
Crowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesCrowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesAditya Parameswaran
Ā 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
Ā 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...Armando Vieira
Ā 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
Ā 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
Ā 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
Ā 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
Ā 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
Ā 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
Ā 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
Ā 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
Ā 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEuropean Data Forum
Ā 

A New Year in Data Science: ML Unpaused

  • 1. A New Year in Data Science: ML Unpaused. Data Day Texas, Austin, 2015-01-10. Paco Nathan, @pacoid
  • 2. Observations about Machine Learning, Data Science, Big Data, Open Source, Cluster Computing, Notebooks, etc., over the past year … plus, a look ahead
  • 4. Backstory: The Sun Also Rises Some wake early in the morning and go build buildings
  • 5. Backstory: The Sun Also Rises Some wake early in the morning and go build buildings
  • 6. Backstory: The Sun Also Rises Some gaze into the heavens, sit back, and explain the process…
  • 7. Backstory: The Sun Also Rises Some gaze into the heavens, sit back, and explain the process… Clearly, provably, our Sun revolves around the Earth at an observable rate
  • 8. Backstory: The Sun Also Rises Others create and evaluate models to predict the Earth’s orbit of the Sun
  • 9. Backstory: The Sun Also Rises Sometimes, when the sky gods become angry and obscure the Sun as our due punishment… We grow scared and react: sacrifices must be offered, our plans must change, etc.
  • 10. Backstory: The Sun Also Rises Sometimes, when the sky gods become angry and obscure the Sun as our due punishment… We grow scared and react: sacrifices must be offered, our plans must change, etc. These points are what I’d like to discuss today
  • 13. Whither Data Science? Feel free to disagree, but I find that definition to be flawed…
  • 14. Whither Data Science? Feel free to disagree, but I find that definition to be flawed… 1. That ignores DevOps (how's that working out?) and Visualization/Design (ditto)
  • 15. Whither Data Science? Feel free to disagree, but I find that definition to be flawed… 1. That ignores DevOps (how's that working out?) and Visualization/Design (ditto) 2. When the CEO asks you to help explain why revenue nose-dived over the past month… neither field has a clue about how to model business phenomena
  • 16. Whither Data Science? Software Engineering: implement and test a model that somebody selected …almost ignores the matter of modeling entirely, at least not since old school types like Dijkstra. Statistics: measure and justify a model that somebody selected …was never particularly good at teaching how to model problems – as two renowned statisticians, William Cleveland and Leo Breiman, noted
  • 17. Whither Data Science? Both fields are necessary, but not sufficient
  • 18. Whither Data Science? The Thorn in the Side of Big Data: too few artists. Christopher Ré, Stanford. safaribooksonline.com/library/view/strata-conference-santa/9781491900321/part92.html
  • 19. Whither Data Science? The Thorn in the Side of Big Data: too few artists. Christopher Ré, Stanford. safaribooksonline.com/library/view/strata-conference-santa/9781491900321/part92.html "You should think about features and not algorithms"
  • 21. Enterprise Java Beans: Lessons from hate-watch reality television. Floyd Marinescu observed about the aftermath of EJBs in Brief History… Intended for building framework components, e.g., for IBM, Oracle, Sun, but not many others. Based on RMI, prior to notions like RESTful web services
  • 22. Enterprise Java Beans: Lessons from hate-watch reality television. Maybe a handful of people in the world would ever actually need to use EJBs, but those few people wanted a spec. Then, for tragic political reasons (MSFT envy), Sun Microsystems made EJBs prominent in their Java APIs
  • 23. Enterprise Java Beans: Lessons from hate-watch reality television. Fortunately, we evolved: Spring, JBoss, etc., those came along as relatively more sane tech. Now we see the Docker thing soar, with notions such as microservices displacing legacy cruft (BTW, if you haven't yet, check out Weave)
  • 24. Enterprise Java Beans: Lessons from hate-watch reality television. I mention this because, to me, EJB represented a convoluted form of template thinking: developing complex web apps for the sake of developing complex web apps
  • 25. Enterprise Java Beans: Lessons from hate-watch reality television. IRL developers and template thinking don't determine public policy… right?
  • 26. Enterprise Java Beans: Lessons from hate-watch reality television. To paraphrase Dean Wampler, consider WordCount, a simple app written for MapReduce in Hadoop … ~50 lines of unapologetic Java that feels hella like writing EJBs:
  • 27. Enterprise Java Beans: Lessons from hate-watch reality television. Compare that with functional programming, where the same WC app is three lines of easily-read Scala when run in Apache Spark:
  • 28. Enterprise Java Beans: Lessons from hate-watch reality television. Compare that with functional programming, where the same WC app is three lines of easily-read Scala when run in Apache Spark: Check out Dean's talk at 11:00, "Why Scala is Taking Over the Big Data World"
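The code shown on slides 26 to 28 did not survive in this transcript. As a rough stand-in, here is a minimal sketch of the same functional shape (split lines into words, then sum a count per word) in plain Python. It is not the Scala from the talk and does not use Spark; the function name `word_count` is made up for illustration.

```python
from collections import Counter


def word_count(lines):
    """Count words across lines, mirroring a flatMap -> map -> reduceByKey shape."""
    # "flatMap": turn every line into a stream of words
    words = (word for line in lines for word in line.split())
    # "map" + "reduceByKey": Counter adds one per occurrence of each word
    return Counter(words)


counts = word_count(["the sun also rises", "the sun sets"])
print(counts.most_common(2))
```

The point of the comparison in the slides is how little ceremony the functional version needs; the logic itself fits in the two statements inside the function.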
  • 29. Enterprise Java Beans: Lessons from hate-watch reality television. Hadoop suffers because, IMHO, that convoluted EJB style of developer-centric template thinking staged a coup. Perhaps we could "donate" some OSS talent… Send a pull request… Or something.
  • 30. Lies, Damn Lies, Statistics, and Data Science
  • 31. Lies, Damn Lies, Statistics, Data Science. Probability got going, formally, in the 16th c. – although interesting mathematical estimations trace back to classical times. Arabs in the 9th c. used frequency analysis – later rediscovered by Europeans during the early Italian Renaissance. Statistics followed, originally more about what we might call demographics – through 18th c.
  • 32. Lies, Damn Lies, Statistics, Data Science. Laplace, Gauss, et al., bridged the fields in the late 18th c. using distributions (what we studied in Stats 101) to infer the probability of errors in estimates. Much of the 19th/20th c. work was about using goodness of fit tests, etc., justifying some distribution • generally speaking, that requires samples • that, in turn, implies batch windows
  • 33. Lies, Damn Lies, Statistics, Data Science. That kind of template thinking in action really lurvs it some batch windows
  • 34. Lies, Damn Lies, Statistics, Data Science. While 19th/20th c. stats work focused on defensibility, 21st c. work, w.r.t. Big Data apps, focuses more on predictability – plus there's a shift in how we make estimates… BTW, doesn't it seem weird to crunch through piles of data in large batch jobs, at large expense, when the results get used to approximate features ultimately? Why not perform that in stream?
  • 35. Lies, Damn Lies, Statistics, Data Science. A fascinating, relatively new area pioneered by relatively few people – e.g., Philippe Flajolet. Provides approximation with error bounds using far fewer resources (RAM, CPU, etc.) highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
  • 36. Lies, Damn Lies, Statistics, Data Science. Table of algorithm / use case (each row on the slide also linked to example code): Bloom Filter / set membership; MinHash / set similarity; HyperLogLog / set cardinality; Count-Min Sketch / frequency summaries; DSQ / streaming quantiles; SkipList / ordered sequence search
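To make the first row of that table concrete, here is a minimal Bloom filter sketch for approximate set membership. It is an illustration only, not the code the slide linked to: the class name, the default sizes, and the use of salted SHA-256 as the hash family are all assumptions chosen for readability rather than speed.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive several hash positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("spark")
print(seen.might_contain("spark"))
```

The memory used is fixed by `num_bits` no matter how many items are added; what grows instead is the false-positive rate, which is the trade the slides describe.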
  • 37. Lies, Damn Lies, Statistics, Data Science. E.g., ±4% could buy you two orders of magnitude reduction in the required memory footprint for an analytics app. OSS projects such as Algebird and BlinkDB provide for this newer approach to the math of approximations at scale
  • 38. Lies, Damn Lies, Statistics, Data Science. E.g., ±4% could buy you two orders of magnitude reduction in the required memory footprint for an analytics app. OSS projects such as Algebird and BlinkDB provide for this newer approach to the math of approximations at scale. Oscar Boykin at 14:00, "Aggregators: Modeling Data Queries Functionally", co-author of Algebird, Scalding
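As a small illustration of trading bounded error for memory, here is a sketch of a Count-Min Sketch for frequency summaries. It is not taken from Algebird or BlinkDB; the class name, the default width and depth, and the salted SHA-256 hashing are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary: estimates may over-count, never under-count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width counters; memory does not grow with distinct items
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)
print(sketch.estimate("a"))
```

Widening the table lowers the expected over-count and adding rows lowers the chance of a bad estimate, so the error bound is something you choose up front when sizing the structure.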
  • 40. The Interzone. Data Science is inherently interdisciplinary. To paraphrase Chris Ré, emphasis on algorithms is relatively minor in the grand scheme – especially when compared to needs for modeling business problems effectively. To wit: beyond phenomenology, leading into quantitative analysis and repeatable results. On the one hand, CS + Stats do not quite address those needs…
  • 41. The Interzone. On the other hand, Physics does well to teach modeling – I like to hire physicists to work on Data teams… They tend to get the interdisciplinary aspects: got the math background, coding experience, generally good at systems engineering, etc. Not saying we should all rush out to get Physics degrees; there's something to be learned there, vital for the work and priorities ahead
  • 42. The Interzone. I mention this because we are at a crossroads, which has more to do with the physical world – some talks here at DDTx15 help illustrate that. Vast implications for Health Care, Transportation, Agriculture, Energy, Gov, Manufacturing in general… More about that in a bit –
  • 44. The Libraries: Alexandria Redux. Most of the ML libraries that one encounters today focus on two general kinds of solutions: • convex optimization • matrix factorization
  • 45. The Libraries: Alexandria Redux. One might think of the convex optimization in this case as a kind of curve fitting – generally with some regularization term to avoid overfitting, which is not good (figure panels: Good, Bad)
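One way to see what a regularization term does to a curve fit: the simplest convex case is fitting a line through the origin with an L2 penalty, which has a closed-form answer. The sketch below is an illustration under that assumption (one feature, no intercept); the function name `ridge_slope` is made up for the example.

```python
def ridge_slope(xs, ys, penalty=0.0):
    """Slope w minimizing sum((y - w*x)**2) + penalty * w**2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + penalty).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + penalty
    return numerator / denominator


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(ridge_slope(xs, ys))                # unpenalized least-squares slope
print(ridge_slope(xs, ys, penalty=14.0))  # the penalty shrinks the slope
```

A larger penalty pulls the fitted slope toward zero, which is the mechanism that keeps a flexible model from chasing noise in the training sample.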
  • 46. The Libraries: Alexandria Redux. For supervised learning, used to create classifiers: 1. categorize the expected data into N classes 2. split a sample of the data into train/test sets 3. use learners to optimize classifiers based on the training set, to label the data into N classes 4. evaluate the classifiers against the test set, measuring error in predicted vs. expected labels
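The four steps above can be sketched end to end with a deliberately tiny learner. This is an illustration only: it uses a nearest-centroid classifier as the "learner", a made-up two-class data set, and hypothetical helper names; a real project would reach for an ML library instead.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step 2: shuffle reproducibly, then cut into train and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_centroids(train):
    """Step 3: 'learn' one mean vector (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Label a point with the class whose centroid is closest."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def error_rate(centroids, test):
    """Step 4: fraction of test rows where predicted != expected label."""
    wrong = sum(1 for features, label in test if predict(centroids, features) != label)
    return wrong / len(test)


# Step 1: two classes, with made-up and well-separated feature values
rows = [([i * 0.1, i * 0.1], "good") for i in range(8)]
rows += [([10 + i * 0.1, 10 + i * 0.1], "bad") for i in range(8)]
train, test = train_test_split(rows)
model = fit_centroids(train)
print(error_rate(model, test))
```

Keeping the test rows out of `fit_centroids` is the whole point of step 2: the error measured in step 4 is only meaningful on data the learner never saw.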
  • 47. The Libraries: Alexandria Redux. Bokay, great for security problems with simply two classes: good guys vs. bad guys. How do you decide what the classes are for more complex problems in business? That's where the matrix factorization parts come in handy…
  • 48. The Libraries: Alexandria Redux. For unsupervised learning, which is often used to reduce dimension: 1. create a covariance matrix of the data 2. solve for the eigenvectors and eigenvalues of the matrix 3. select the top N eigenvectors, based on diminishing returns for how they explain variance in the data 4. those eigenvectors define your N classes
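Steps 1 to 3 above can be sketched in plain Python for the top eigenvector only. The assumptions here are loud ones: power iteration stands in for a full eigen-solver, the helper names are hypothetical, and the four sample points are made up so that all variance lies along one direction. A numerical library would normally do this work.

```python
def covariance_matrix(rows):
    """Step 1: sample covariance matrix of the columns of `rows`."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_eigenpair(matrix, iterations=200):
    """Step 2 (top component only): power iteration on a symmetric matrix.

    Assumes the starting vector is not orthogonal to the top eigenvector.
    """
    vec = [1.0 / (i + 1) for i in range(len(matrix))]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the eigenvalue for the unit-length eigenvector.
    eigenvalue = sum(v * m for v, m in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec


points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
cov = covariance_matrix(points)
eigenvalue, direction = top_eigenpair(cov)
# Step 3: share of total variance explained by this component
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(direction, explained)
```

The `explained` ratio is what drives the "diminishing returns" choice of N in step 3: keep adding components until the next one explains too little variance to be worth it.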
  • 49. The Libraries: Alexandria Redux. An excellent overview of ML definitions (up to this point) is given in: A Few Useful Things to Know about Machine Learning, Pedro Domingos, CACM 55:10 (Oct 2012), http://dl.acm.org/citation.cfm?id=2347755 To wit: Generalization = Representation + Optimization + Evaluation
  • 50. The Libraries: Alexandria Redux. [Diagram: ML workflow circa 2010, in three phases (representation, optimization, evaluation): ETL into cluster/cloud; Data Prep; Features; Explore, Unsupervised Learning; Learners, Parameters; train set, test set; models; Optimize; Evaluate; Scoring on production data; visualize, reporting; data pipelines; use cases; actionable results; decisions, feedback; with call-outs for developers and algorithms] Algorithms and developer-centric template thinking only go so far in a workflow… Results are shown in blue, and the real work is highlighted in red
  • 51. The Libraries: Alexandria Redux. [Same workflow diagram as slide 50] 1. focus on features not algorithms 2. learn how to model business problems by leveraging data 3. notice the workflows needed? 4. leave the dev-centric thinking for odd city council meetings
  • 52. The Libraries: Alexandria Redux. [Same workflow diagram as slide 50] Matthew Kirk 12:00, "Lessons Learned: Machine Learning and Technical Debt" • Ted Dunning 13:00, "Computing with Chaos" • Julia Evans 15:00, "Data Pipelines. They're a lot of work!" • Christopher Johnson 16:00, "Scala Data Pipelines for Music Recommendations"
  • 53. The Libraries: Alexandria Redux. Even so, business demands extend far beyond what classifiers and labels alone can give us… Businesses lurv Optimization, gobs of it; in that context ML libraries today merely scratch the surface. Round hole, square peg
  • 54. The Libraries: Alexandria Redux. Imagine that you compete with FedEx… how do you optimize delivery routes for airplanes, trucks, trains, nanodrones, hoverboards, etc.? Which do you optimize: fuel cost, delivery time, maintenance schedules, minimizing lost packages? Doesn't sound much like online advertising, social networks, or any episode of Silicon Valley
  • 56. ML, Unpaused. What were the origins of machine learning? • Marvin Minsky @MIT, 1950s • Support Vector Machines @Bell Labs, 1990s • Google @Stanford, 1990s • Ray Kurzweil, 2000s Nope…
  • 57. ML, Unpaused. ML has been an aspect of AI research for a long while, through several different vectors. A good early history (up to 1980s) is given in: Machine Learning: A Historical and Methodological Analysis, Jaime Carbonell, Ryszard Michalski, Tom Mitchell, AI Magazine 4:3 (1983), http://dx.doi.org/10.1609/aimag.v4i3.406 To wit: task-oriented studies, knowledge acquisition, cognitive simulation, theoretical exploration … overall, a much broader class of optimization problems
  • 58. Circa early 1980s: An era of anticipation – AI was making inroads… • emphasis on capturing/representing knowledge and expertise – production use cases in medicine • Fifth Generation Computing (parallel h/w) in Japan, MCC, etc. However: • few outside academia had enough cluster compute power – aside from 3-letter agencies and AT&T • meanwhile ML was not yet considered "academic" enough within academia
  • 60. Circa early 1990s: Some fundamental tech platforms emerge… • Hubble Space Telescope, Human Genome Project, WWW, electric cars relaunched. And throughout that decade: • Linux, Java @Sun, JavaScript @Netscape • Firefly, an initial commercial ML app on teh interwebs @MIT Media Lab • Rise of e-commerce leveraging horizontal scale-out with commodity hardware
  • 61. But… Stock market "tumbled" in 2000:
  • 62. Circa early 2000s: GOOG AMZN EBAY YHOO LNKD NFLX FB TWTR emerged out of the dust… • web apps dominated for search, e-commerce, social networks, etc. • did we mention EJBs and template thinking? • mobile picked up traction • recommender systems went mainstream • AI picked up with semantic web efforts…
  • 63. But… Stock market "went free-fall" in 2008:
  • 64. Circa mid 2010s: Successful e-commerce firms have IPO'ed and are now busy building skyscrapers in downtown SF… (pictured: LinkedIn, 350 Bush; Transbay Transit; Salesforce, 415 Mission)
  • 65. But… An odd truism about the hubris of the uber-wealthy and the timing of their skyscraper projects… (pictured: Sears Tower, Chicago; Lehman Brothers, London; Fontainebleau, Las Vegas)
  • 66. But… An odd truism about the hubris of the uber-wealthy and the timing of their skyscraper projects…
  • 67. Circa mid 2010s: Back to the Future. Businesses lurv Optimization, lots of it… • ML circa 1985 focused on those needs, but got knocked back to something inevitably more aristotelian and predictable • Outside of Silicon Valley, we've made big strides • One danger: next downturn cycle, VCs might reshape tech industry, reverting to "safe bets". However, a few extremely interesting aspects have emerged…
  • 68. Circa mid 2010s: Extremely Interesting Emerging Aspects. [Same circa-2010 workflow diagram as slide 50, shown twice] We have approximation, deep learning and symbolic regression to assist on "Features". Or, maybe, cognitive computing will help on several of the more difficult aspects of this…
  • 69. Circa mid 2010s: Extremely Interesting Emerging Aspects. DeepDive @Stanford http://deepdive.stanford.edu/ • Knowledge Graph @Google http://www.google.com/insidesearch/features/search/knowledge.html • IBM Watson http://www.ibm.com/smarterplanet/us/en/ibmwatson/ • Scaled Inference https://scaledinference.com/
  • 70. Circa mid 2010s: Extremely Interesting Emerging Aspects. Rhetorical postures: "Is AI a good idea, or potentially harmful?" – per Elon Musk, et al.
  • 71. Circa mid 2010s: Extremely Interesting Emerging Aspects. Rhetorical postures: "Is AI a good idea, or potentially harmful?" – per Elon Musk, et al. Clearly: good idea. brewbot.io
  • 72. Circa mid 2010s: Extremely Interesting Emerging Aspects. Speaking of which, a highly recommended podcast by actual data scientists drinking really good beers: partiallyderivative.com
  • 73. Circa mid 2010s: Extremely Interesting Emerging Aspects. 2015: Notebooks in Containers in the Cloud. "Keep simple things simple and complex things possible." databricks.com/product • Publishing Workflows for Jupyter, Andrew Odewahn, Kyle Kelley, Rune Madsen, odewahn.github.io/publishing-workflows-for-jupyter • IPython Interactive Demo, Nature Magazine + Rackspace, nature.com/news/ipython-interactive-demo-7.21492
  • 74. Circa mid 2010s: Extremely Interesting Emerging Aspects. 2015: Notebooks in Containers in the Cloud. "Keep simple things simple and complex things possible." databricks.com/product • Publishing Workflows for Jupyter, Andrew Odewahn, odewahn.github.io/publishing-workflows-for-jupyter • IPython Interactive Demo, Nature Magazine + Rackspace, nature.com/news/ipython-interactive-demo-7.21492 Makes me wonder about the "data engineer" role … notebooks simplify ops needs, while ultimately the domain experts wield the real power with data
  • 76. Frontstory: The Sun Also Rises. Some wake early in the morning and go build buildings – dev-centric templates
  • 77. Frontstory: The Sun Also Rises. Some gaze into the heavens, sit back, and explain the process… – 20th c. stats
  • 78. Frontstory: The Sun Also Rises. Sometimes, when the sky gods become angry and obscure the Sun as our due punishment… – VCs during recessions
  • 79. Frontstory: The Sun Also Rises. Others create and evaluate models to predict the Earth's orbit of the Sun – What's needed most
  • 80. Forward Motion: SV trend: early data scientists displace old-school product managers. Because there are hard problems to be solved… Because we need new eyes on target… Because use cases…
  • 82. Because Use Cases: Health Care. Employing tech such as deep learning and cognitive computing for vital use cases in health care: "In fact, using our Topological Data Analysis system, they were able to discover multiple types of Type 2 diabetes … huge impact on all the hundreds of millions of people" – Ayasdi. "Nobody knows what to do with those archives … They're just sitting there, costing money. This is just seen as a big opportunity. It's like, 'Oh, this is what we were saving this up for!'" – Enlitic. "Sloan-Kettering is also training Watson on 1,500 real-world lung cancer cases, helping it to decipher physician notes and learn from the hospital's expertise in treating cancer." – IBM Watson
  • 83. Because Use Cases: Transportation. http://automatic.com/ Detects events like hard braking, acceleration – uploaded in real-time with geolocation to a Spark Streaming pipeline … data trends indicate road hazards, blind intersections, bad signal placement, and other input to improve traffic planning. Also detects inefficient vehicle operation, under-inflated tires, poor driving behaviors, aggressive acceleration, etc.
  • 84. Because Use Cases: Education. https://databricks.com/blog/2014/12/08/pearson… Integrates Kafka + Spark Streaming + Cassandra + Blur, running within a YARN cluster on AWS to provide a scalable, reliable, cloud-based platform for services that analyze student performance across product and institution boundaries. Delivers immersive learning experiences designed for how students read, think, and learn; as well as efficacy insights to both learners and institutions which were not possible before. Reliability features handle Kafka node failures, receiver failures, leader changes, committed offset in ZK, plus adjustable data-rate throughput.
  • 85. Because Use Cases: Language, everywhere. http://idibon.com/ http://digitalreasoning.com/ Our social fabric is encoded as text documents, and similarly it gets tested, deployed, maintained, and monitored there – it's the launch point for cognitive computing.
  • 86. Because Use Cases: Language, everywhere. http://idibon.com/ http://digitalreasoning.com/ Our social fabric is encoded as text documents, and similarly it gets tested, deployed, maintained, and monitored there – it's the launch point for cognitive computing. Robert Munroe, 12:00, "Building Better Experts: co-optimization of human and machine intelligence at Idibon" • Andrew Trask, David Gilmore, 11:00, "Deep Learning for Natural Language Processing"
  • 87. Because Use Cases: Geospatial. Advanced geo use cases throughout all levels of gov and industry for Big Data, machine learning, graph algorithms, approximations, etc. If you roll trucks you probably use licenses from ESRI. Also consider the IoT sensor data, e.g., from National Instruments' customers – where does it go, what do organizations use to analyze it? These are the large-scale optimization problems you were looking for… http://esri.github.io/gis-tools-for-hadoop/ (and Spark) http://thunderheadxpler.blogspot.com/ http://geotrellis.io/ http://www.oculusinfo.com/tiles/ https://databricks.com/blog/2014/12/03/app...
  • 88. Because Use Cases: Telecom, Travel, Banking, etc. http://spark-summit.org/2014/talk/stratio-streaming… Stratio represents one of the most sophisticated integrations for Spark Streaming – the union of a real-time messaging bus with a complex event processing engine: Kafka, Spark Streaming, Cassandra, along with the Siddhi CEP engine. Telecom, in particular, is leveraging this new streaming technology as a big win near-term. http://www.openstratio.org/ https://github.com/stratio https://github.com/Stratio/streaming-cep-engine BTW if you're in Madrid next fall check out Big Data Hispano
  • 89. Because Use Cases… Common theme: many of those use cases are powered by Apache Spark – especially notice Spark Streaming, which is a big game-changer for analytics across industry
  • 90. Because Use Cases… Common theme: many of those use cases are powered by Apache Spark – especially notice Spark Streaming, which is a big game-changer for analytics across industry. Taylor Goetz 11:00, "Beyond the Tweeting Toaster: IoT Streaming Analytics With Apache Storm, Kafka, and Arduino" • Hari Shreedharan 12:00, "Real Time Data Processing Using Spark Streaming"
  • 91. Because Use Cases: Agriculture. Ag+Data Issues http://radar.oreilly.com/2014/04/agdata.html Data Guild whitepaper: Ag Systems + Data Outlook http://goo.gl/OK8RFf • livelihood for 40% of world population • $15T/year annual GDP globally • data-intensive issues, much legal impasse. Over a half billion small farms worldwide, and most are family-run farms that rely on rain-fed agriculture. Layered Sensing Networks: airships, e.g., JP Aerospace, 40 km; atmosats, e.g., Titan Aerospace, 20 km; microsats, e.g., Planet Labs, 400 km; robots, e.g., Blue River, 1 m; sensors, e.g., Hortau, -0.3 m; drones, e.g., HoneyComb, 120 m. Nudge, and I just might propose DWave clusters into cold craters on the Lunar South Pole with routers @L5 and an LLO skyhook… to handle the vector quantization demands. Or something.
  • 93. certification: Apache Spark developer certificate program • http://oreilly.com/go/sparkcert • defined by Spark experts @Databricks • assessed by O'Reilly Media • establishes the bar for Spark expertise
  • 94. MOOCs: Anthony Joseph, UC Berkeley, begins 2015-02-23, edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181 • Ameet Talwalkar, UCLA, begins 2015-04-14, edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
  • 95. community: spark.apache.org/community.html • events worldwide: goo.gl/2YqJZK • video+preso archives: spark-summit.org • resources: databricks.com/spark-training-resources • workshops: databricks.com/spark-training
  • 97. confs: Strata CA, San Jose, Feb 18-20, strataconf.com/strata2015 • Spark Summit East, NYC, Mar 18-19, spark-summit.org/east • Big Data Tech Con, Boston, Apr 26-28, bigdatatechcon.com • Strata EU, London, May 5-7, strataconf.com/big-data-conference-uk-2015 • Spark Summit 2015, SF, Jun 15-17, spark-summit.org
  • 98. books: Fast Data Processing with Spark, Holden Karau, Packt (2013), shop.oreilly.com/product/9781782167068.do • Spark in Action, Chris Fregly, Manning (2015*), sparkinaction.com/ • Learning Spark, Holden Karau, Andy Konwinski, Matei Zaharia, O'Reilly (2015*), shop.oreilly.com/product/0636920028512.do
  • 99. presenter: Just Enough Math, O'Reilly, 2014, justenoughmath.com, preview: youtu.be/TQ58cWgdCpA • monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/ • Enterprise Data Workflows with Cascading, O'Reilly, 2013, shop.oreilly.com/product/0636920028536.do