A New Year in Data Science: ML Unpaused
1. A New Year in Data Science:
ML Unpaused
Data Day Texas
Austin, 2015-01-10
Paco Nathan, @pacoid
2. Observations about Machine Learning, Data Science,
Big Data, Open Source, Cluster Computing, Notebooks,
etc., over the past year… plus, a look ahead
4. Backstory: The Sun Also Rises
Some wake early in
the morning and go
build buildings
6. Backstory: The Sun Also Rises
Some gaze into the
heavens, sit back,
and explain the
process…
Clearly, provably,
our Sun revolves
around the Earth
at an observable
rate
8. Backstory: The Sun Also Rises
Others create and
evaluate models to
predict the Earth's
orbit of the Sun
9. Backstory: The Sun Also Rises
Sometimes, when
the sky gods become
angry and obscure
the Sun as our due
punishment…
We grow scared and
react: sacrifices must
be offered, our plans
must change, etc.
These points are what
I'd like to discuss today
13. Feel free to disagree, but I find that definition
to be flawed…
1. That ignores DevOps (how's that working out?)
and Visualization/Design (ditto)
2. When the CEO asks you to help explain why
revenue nose-dived over the past month…
neither field has a clue about how to model
business phenomena
Whither Data Science?
16. Software Engineering:
implement and test a model that somebody selected
…has largely ignored the matter of modeling itself,
at least since old-school types like Dijkstra
Statistics:
measure and justify a model that somebody selected
…was never particularly good at teaching how to
model problems, as two renowned statisticians,
William Cleveland and Leo Breiman, noted
Whither Data Science?
Both fields are necessary,
but not sufficient
21. Floyd Marinescu observed about the aftermath
of EJBs in Brief History…
Intended for building framework components,
e.g., for IBM, Oracle, Sun, but not many others
Based on RMI, prior to notions
like RESTful web services
Enterprise Java Beans: Lessons from hate-watch reality television
22. Maybe a handful of people in the world would
ever actually need to use EJBs, but those few
people wanted a spec
Then, for tragic political reasons (MSFT envy),
Sun Microsystems made EJBs prominent in
their Java APIs
Enterprise Java Beans: Lessons from hate-watch reality television
23. Fortunately, we evolved: Spring, JBoss, etc.,
came along as relatively more sane tech
Now we see the Docker thing soar, with notions
such as microservices displacing legacy cruft
(BTW, if you haven't yet, check out Weave)
Enterprise Java Beans: Lessons from hate-watch reality television
24. I mention this because, to me, EJB represented
a convoluted form of template thinking:
Enterprise Java Beans: Lessons from hate-watch reality television
developing complex web apps
for the sake of
developing complex web apps
25. Enterprise Java Beans: Lessons from hate-watch reality television
IRL developers and template thinking don't
determine public policy… right?
26. Enterprise Java Beans: Lessons from hate-watch reality television
To paraphrase Dean Wampler, consider WordCount,
a simple app written for MapReduce in Hadoop:
~50 lines of unapologetic Java that feels hella like
writing EJBs
[Java listing from the slide not preserved]
27. Enterprise Java Beans: Lessons from hate-watch reality television
Compare that with functional programming, where
the same WC app is three lines of easily-read Scala
when run in Apache Spark:
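The Scala listing from the slide is not preserved in this transcript. As a rough stand-in, the sketch below shows the same functional shape in plain Python: split lines into words, then reduce counts by key. It is not Spark and not the slide's code. In Spark the canonical version chains flatMap, map, and reduceByKey on an RDD. The name word_count and the sample lines are made up for illustration.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count words across lines: a flat-map (split) followed by a reduce-by-key."""
    # flat-map step: every line becomes zero or more words
    words = chain.from_iterable(line.split() for line in lines)
    # reduce-by-key step: Counter sums one occurrence per word
    return Counter(words)


# Hypothetical input, just to exercise the function
counts = word_count(["to be or not", "to be"])
print(counts)
```

The point of the comparison survives the change of language: the logic is a short pipeline of transformations, with none of the mapper/reducer class scaffolding.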
28. Enterprise Java Beans: Lessons from hate-watch reality television
Check out Dean's talk at 11:00,
"Why Scala is Taking Over
the Big Data World"
29. Enterprise Java Beans: Lessons from hate-watch reality television
Hadoop suffers because, IMHO, that convoluted
EJB style of developer-centric template thinking
staged a coup
Perhaps we could
"donate" some
OSS talent…
Send a pull
request…
Or something.
31. Probability got going, formally, in the 16th c.,
although interesting mathematical estimations
trace back to classical times
Arabs in the 9th c. used frequency analysis,
later rediscovered by Europeans during the
early Italian Renaissance
Statistics followed, originally more about what
we might call demographics, through the 18th c.
Lies, Damn Lies, Statistics, Data Science
32. Laplace, Gauss, et al., bridged the fields in the
late 18th c. using distributions (what we studied
in Stats 101) to infer the probability of errors
in estimates
Much of the 19th/20th c. work was about using
goodness of fit tests, etc., justifying some distribution
• generally speaking, that requires samples
• that, in turn, implies batch windows
Lies, Damn Lies, Statistics, Data Science
33. Lies, Damn Lies, Statistics, Data Science
That kind of template thinking in action
really lurvs it some batch windows
34. While 19th/20th c. stats work focused on
defensibility,
21st c. work, w.r.t. Big Data apps, focuses more
on predictability, plus there's a shift in how we
make estimates…
Lies, Damn Lies, Statistics, Data Science
BTW, doesn't it seem weird to crunch through piles
of data in large batch jobs, at large expense, when
the results get used to approximate features
ultimately? Why not perform that in stream?
35. A fascinating, relatively new area pioneered by
relatively few people, e.g., Philippe Flajolet
Provides approximation with error bounds
using far fewer resources (RAM, CPU, etc.)
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Lies, Damn Lies, Statistics, Data Science
36.
algorithm          use case                  example
Bloom Filter       set membership            code
MinHash            set similarity            code
HyperLogLog        set cardinality           code
Count-Min Sketch   frequency summaries       code
DSQ                streaming quantiles       code
SkipList           ordered sequence search   code
Lies, Damn Lies, Statistics, Data Science
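The "code" links in that table did not survive extraction. To make the first row concrete, here is a minimal Bloom filter sketch in Python for approximate set membership. It is an illustration under assumptions, not the code the slide linked to: the class name, the SHA-256 double-hashing scheme, and the 4% default error rate are all choices made here. The sizing formulas are the standard ones for a target false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Approximate set membership: no false negatives, tunable false positives."""

    def __init__(self, capacity, error_rate=0.04):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one digest (double hashing)
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Hypothetical usage: remember 1,000 user ids in well under a kilobyte
seen = BloomFilter(capacity=1000)
for i in range(1000):
    seen.add(f"user-{i}")
print("user-42" in seen, len(seen.bits), "bytes")
```

Anything that was added always tests as present. Items that were never added test as present only occasionally, at roughly the configured error rate.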
37. Lies, Damn Lies, Statistics, Data Science
E.g., accepting ±4% error could buy you a two orders
of magnitude reduction in the required memory
footprint for an analytics app
OSS projects such as Algebird and BlinkDB
provide for this newer approach to the math of
approximations at scale
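To give a feel for that trade-off, here is a small HyperLogLog sketch in Python for approximate distinct counts. It is a simplified illustration, not Algebird's or BlinkDB's implementation. The class name, the SHA-1 hashing, and the sample data are assumptions made here. With m registers the relative standard error is roughly 1.04 / sqrt(m), so m = 1024 gives about 3%, in the same ballpark as the ±4% above.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count using m = 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this form is valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash: the first p bits pick a register, the rest feed the rank
        h = int.from_bytes(hashlib.sha1(str(item).encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = number of leading zeros in the remaining bits, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        estimate = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            estimate = self.m * math.log(self.m / zeros)
        return estimate


# Hypothetical stream: 50,000 distinct ids, each seen twice
hll = HyperLogLog(p=10)
for _ in range(2):
    for i in range(50000):
        hll.add(f"user-{i}")
print(round(hll.count()))
```

The sketch holds 1,024 small counters no matter how many items stream through, whereas an exact count would need to keep every distinct id.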
38. Lies, Damn Lies, Statistics, Data Science
Oscar Boykin at 14:00,
"Aggregators: Modeling
Data Queries Functionally"
co-author of Algebird, Scalding
41. On the other hand, Physics
does well to teach modeling;
I like to hire physicists to work
on Data teams…
The Interzone
They tend to get the interdisciplinary aspects:
got the math background, coding experience,
generally good at systems engineering, etc.
Not saying we should all rush out to get Physics
degrees, but there's something to be learned there,
vital for the work and priorities ahead
42. I mention this because we are at a crossroads,
which has more to do with the physical world;
some talks here at DDTx15 help illustrate that
Vast implications for Health Care, Transportation,
Agriculture, Energy, Gov, Manufacturing in general…
More about that
in a bit
The Interzone
44. Most of the ML libraries that one encounters
today focus on two general kinds of solutions:
• convex optimization
• matrix factorization
The Libraries: Alexandria Redux
45. One might think of the convex optimization
in this case as a kind of curve fitting, generally
with some regularization term to avoid overfitting,
which is not good
[Figure: two example curve fits, labeled "Good" and "Bad"]
The Libraries: Alexandria Redux
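As a minimal sketch of "curve fitting with a regularization term", the Python below fits a line by gradient descent on squared error plus an L2 penalty on the slope. It is an illustration only: the function name fit_ridge, the toy data, and the learning-rate settings are assumptions, and real libraries use far better solvers. The objective is convex, so plain gradient descent reaches the global minimum.

```python
def fit_ridge(xs, ys, lam=0.0, lr=0.01, steps=5000):
    """Fit y ~ w*x + b by minimizing mean squared error + lam * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the data-fit term, plus the gradient of the penalty on w
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_plain, b_plain = fit_ridge(xs, ys, lam=0.0)  # unregularized fit
w_ridge, b_ridge = fit_ridge(xs, ys, lam=1.0)  # penalty shrinks the slope
print(w_plain, w_ridge)
```

The penalty deliberately trades some fit on the training points for a smaller weight. With noisy data and many features, that trade is what keeps a model from chasing the noise.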
46. For supervised learning, used to create classifiers:
1. categorize the expected data into N classes
2. split a sample of the data into train/test sets
3. use learners to optimize classifiers based on
the training set, to label the data into N classes
4. evaluate the classifiers against the test set,
measuring error in predicted vs. expected labels
The Libraries: Alexandria Redux
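The four steps above can be sketched in plain Python with a deliberately simple learner, a nearest-centroid classifier. Everything here is an assumption for illustration: the synthetic two-class data, the function names, and the choice of learner. A real project would reach for an ML library instead.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step 2: shuffle a copy of the labeled rows and split into train/test."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def fit_centroids(train):
    """Step 3: learn one centroid (mean feature vector) per class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Label a point with the class of the nearest centroid."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=squared_distance)


def error_rate(centroids, test):
    """Step 4: fraction of test rows whose predicted label is wrong."""
    return sum(predict(centroids, f) != y for f, y in test) / len(test)


# Step 1: two classes, here synthetic "good" and "bad" clusters
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "good") for _ in range(100)]
data += [([rng.gauss(5, 1), rng.gauss(5, 1)], "bad") for _ in range(100)]

train, test = train_test_split(data)
model = fit_centroids(train)
print(error_rate(model, test))
```

Holding out the test set is the part that matters: the error is measured on rows the learner never saw.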
47. Bokay, great for security problems with simply
two classes: good guys vs. bad guys
How do you decide what the classes are
for more complex problems in business?
That's where the matrix factorization
parts come in handy…
The Libraries: Alexandria Redux
48. For unsupervised learning, which is often used
to reduce dimension:
1. create a covariance matrix of the data
2. solve for the eigenvectors and eigenvalues
of the matrix
3. select the top N eigenvectors, based on
diminishing returns for how they explain
variance in the data
4. those eigenvectors define your N classes
The Libraries: Alexandria Redux
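The first three steps above can be sketched in plain Python. This is a simplification under stated assumptions: power iteration recovers only the single top eigenvector, where a real library would compute a full eigendecomposition, and the toy data and function names are made up for illustration.

```python
import math


def covariance_matrix(rows):
    """Step 1: sample covariance matrix of the mean-centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iterations=200):
    """Step 2 (simplified): power iteration for the largest eigenvalue/eigenvector."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d))
    return value, vec


# Toy 2-D data that varies almost entirely along the diagonal
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]]
cov = covariance_matrix(rows)
value, vec = top_eigenpair(cov)

# Step 3: share of total variance explained by the top component
explained = value / (cov[0][0] + cov[1][1])
print(vec, explained)
```

When one component explains nearly all the variance, as here, the remaining dimensions can be dropped with little loss. That is the "diminishing returns" test in step 3.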
49. An excellent overview of ML definitions
(up to this point) is given in:
The Libraries: Alexandria Redux
To wit:
Generalization = Representation + Optimization + Evaluation
A Few Useful Things to Know about Machine Learning
Pedro Domingos
CACM 55:10 (Oct 2012)
http://dl.acm.org/citation.cfm?id=2347755
52. [Figure: ML workflow diagram, circa 2010, organized under representation, optimization, and evaluation. Labeled stages: ETL into cluster/cloud; data prep; features; learners, parameters; unsupervised learning; explore; train set; test set; models; evaluate; optimize; scoring; production data; use cases; visualize, reporting; data pipelines; actionable results; decisions, feedback. A legend distinguishes developers from algorithms.]
Algorithms and developer-centric template thinking
only go so far
The Libraries: Alexandria Redux
Matthew Kirk 12:00
"Lessons Learned: Machine Learning
and Technical Debt"
Ted Dunning 13:00
"Computing with Chaos"
Julia Evans 15:00
"Data Pipelines. They're a lot of work!"
Christopher Johnson 16:00
"Scala Data Pipelines for Music
Recommendations"
53. Even so, business demands extend far beyond
what classifiers and labels alone can give us…
Businesses lurv Optimization, gobs of it; in
that context ML libraries today merely scratch
the surface
Round hole, square peg
The Libraries: Alexandria Redux
54. Imagine that you compete with FedEx… how do
you optimize delivery routes for airplanes, trucks,
trains, nanodrones, hoverboards, etc.?
Which do you optimize: fuel cost,
delivery time, maintenance schedules,
minimizing lost packages?
Doesn't sound much like online
advertising, social networks, or
any episode of Silicon Valley
The Libraries: Alexandria Redux
56. What were the origins of machine learning?
• Marvin Minsky @MIT, 1950s
• Support Vector Machines @Bell Labs, 1990s
• Google @Stanford, 1990s
• Ray Kurzweil, 2000s
Nope…
ML, Unpaused
57. ML has been an aspect of AI research for a
long while, through several different vectors
A good early history (up to 1980s) is given in:
ML, Unpaused
Machine Learning: A Historical and Methodological Analysis
Jaime Carbonell, Ryszard Michalski, Tom Mitchell
AI Magazine 4:3 (1983)
http://dx.doi.org/10.1609/aimag.v4i3.406
To wit:
task-oriented studies, knowledge acquisition, cognitive
simulation, theoretical exploration… overall, a much
broader class of optimization problems
58. An era of anticipation: AI was making inroads…
• emphasis on capturing/representing knowledge
and expertise, with production use cases in medicine
• Fifth Generation Computing (parallel h/w)
in Japan, MCC, etc.
However:
• few outside academia had enough cluster compute
power, aside from 3-letter agencies and AT&T
• meanwhile ML was not yet considered "academic"
enough within academia
Circa early 1980s:
60. Some fundamental tech platforms emerge…
• Hubble Space Telescope, Human Genome Project,
WWW, electric cars relaunched
And throughout that decade:
• Linux, Java @Sun, JavaScript @Netscape
• Firefly, an initial commercial ML app
on teh interwebs @MIT Media Lab
• Rise of e-commerce leveraging horizontal
scale-out with commodity hardware
Circa early 1990s:
62. GOOG AMZN EBAY YHOO LNKD NFLX FB TWTR
emerged out of the dust…
• web apps dominated for search, e-commerce,
social networks, etc.
• did we mention EJBs and template thinking?
• mobile picked up traction
• recommender systems went mainstream
• AI picked up with semantic web efforts…
Circa early 2000s:
64. Successful e-commerce firms have IPO'ed and are
now busy building skyscrapers in downtown SF…
Circa mid 2010s:
LinkedIn, 350 Bush
Transbay Transit
Salesforce, 415 Mission
65. An odd truism about the hubris of the uber-wealthy
and the timing of their skyscraper projects…
But…
Sears Tower, Chicago
Lehman Brothers, London
Fontainebleau, Las Vegas
67. Businesses lurv Optimization, lots of it…
• ML circa 1985 focused on those needs, but got
knocked back to something inevitably more
Aristotelian and predictable
• Outside of Silicon Valley, we've made big strides
• One danger: next downturn cycle, VCs might
reshape tech industry, reverting to "safe bets"
Circa mid 2010s: Back to the Future
However, a few extremely interesting
aspects have emerged…
68. [Figure: the circa-2010 ML workflow diagram (representation, optimization, evaluation), repeated]
We have approximation, deep learning and
symbolic regression to assist on "Features"
Or, maybe, cognitive computing will help on
several of the more difficult aspects of this…
Circa mid 2010s: Extremely Interesting Emerging Aspects
70. Circa mid 2010s: Extremely Interesting Emerging Aspects
Rhetorical postures: "Is AI a good idea,
or potentially harmful?"
- per Elon Musk, et al.
Clearly: good idea
brewbot.io
72. Circa mid 2010s: Extremely Interesting Emerging Aspects
Speaking of which, a highly recommended podcast
by actual data scientists drinking really good beers:
partiallyderivative.com
73. Circa mid 2010s: Extremely Interesting Emerging Aspects
2015: Notebooks in Containers in the Cloud
"Keep simple things simple
and complex things possible."
databricks.com/product
Publishing Workflows for Jupyter
Andrew Odewahn, Kyle Kelley, Rune Madsen
odewahn.github.io/publishing-workflows-for-jupyter
IPython Interactive Demo
Nature Magazine + Rackspace
nature.com/news/ipython-interactive-demo-7.21492
Circa mid 2010s: Extremely Interesting Emerging Aspects
Makes me wonder about the "data engineer"
role… notebooks simplify ops needs, while
ultimately the domain experts wield the real
power with data
76. Frontstory: The Sun Also Rises
Some wake early in
the morning and go
build buildings
dev-centric templates
77. Some gaze into the
heavens, sit back,
and explain the
process…
20th c. stats
Frontstory: The Sun Also Rises
78. Sometimes, when
the sky gods become
angry and obscure
the Sun as our due
punishment…
VCs during recessions
Frontstory: The Sun Also Rises
79. Others create and
evaluate models to
predict the Earth's
orbit of the Sun
What's needed most
Frontstory: The Sun Also Rises
80. Forward Motion:
SV trend: early data scientists displace old-school
product managers
Because there are hard
problems to be solved…
Because we need
new eyes on target…
Because use cases…
82. Because Use Cases: Health Care
Employing tech such as deep learning and
cognitive computing for vital use cases in
health care:
"In fact, using our Topological Data Analysis system, they were
able to discover multiple types of Type 2 diabetes… huge
impact on all the hundreds of millions of people" - Ayasdi
"Nobody knows what to do with those archives… They're just
sitting there, costing money. This is just seen as a big opportunity.
It's like, 'Oh, this is what we were saving this up for!'" - Enlitic
"Sloan-Kettering is also training Watson on 1,500 real-world lung
cancer cases, helping it to decipher physician notes and learn
from the hospital's expertise in treating cancer." - IBM Watson
83. Because Use Cases: Transportation
http://automatic.com/
Detects events like hard braking, acceleration, uploaded in
real-time with geolocation to a Spark Streaming pipeline…
data trends indicate road hazards, blind intersections, bad
signal placement, and other input to improve traffic planning.
Also detects inefficient vehicle operation, under-inflated tires,
poor driving behaviors, aggressive acceleration, etc.
84. Because Use Cases: Education
https://databricks.com/blog/2014/12/08/pearson…
Integrates Kafka + Spark Streaming + Cassandra +
Blur, running within a YARN cluster on AWS to provide
a scalable, reliable, cloud-based platform for services
that analyze student performance across product and
institution boundaries.
Delivers immersive learning experiences
designed for how students read, think,
and learn; as well as efficacy insights to
both learners and institutions which were
not possible before.
Reliability features handle Kafka node
failures, receiver failures, leader changes,
committed offset in ZK, plus adjustable
data-rate throughput.
85. Because Use Cases: Language, everywhere
http://idibon.com/
http://digitalreasoning.com/
Our social fabric is encoded as text documents,
and similarly it gets tested, deployed, maintained,
and monitored there: it's the launch point for
cognitive computing.
86. Because Use Cases: Language, everywhere
Robert Munro, 12:00 "Building Better
Experts: co-optimization of human and
machine intelligence at Idibon"
Andrew Trask, David Gilmore 11:00
"Deep Learning for Natural Language
Processing"
87. Because Use Cases: Geospatial
Advanced geo use cases throughout all levels of gov
and industry for Big Data, machine learning, graph
algorithms, approximations, etc.
If you roll trucks you probably use licenses from ESRI.
Also consider the IoT sensor data, e.g., from National
Instruments' customers: where does it go, what do
organizations use to analyze it?
These are the large-scale optimization problems
you were looking for…
http://esri.github.io/gis-tools-for-hadoop/ (and Spark)
http://thunderheadxpler.blogspot.com/
http://geotrellis.io/
http://www.oculusinfo.com/tiles/
https://databricks.com/blog/2014/12/03/app...
88. Because Use Cases: Telecom, Travel, Banking, etc.
http://spark-summit.org/2014/talk/stratio-streaming…
Stratio represents one of the most sophisticated
integrations for Spark Streaming, the union of
a real-time messaging bus with a complex event
processing engine: Kafka, Spark Streaming,
Cassandra, along with the Siddhi CEP engine
Telecom, in particular, is leveraging this new
streaming technology as a big win near-term
http://www.openstratio.org/
https://github.com/stratio
https://github.com/Stratio/streaming-cep-engine
BTW if you're in Madrid next fall
check out Big Data Hispano
89. Because Use Cases…
Common theme: many of those use cases are
powered by Apache Spark
Especially notice Spark Streaming, which is a big
game-changer for analytics across industry
90. Because Use Cases…
Taylor Goetz 11:00
"Beyond the Tweeting Toaster: IoT
Streaming Analytics With Apache
Storm, Kafka, and Arduino"
Hari Shreedharan 12:00
"Real Time Data Processing Using
Spark Streaming"
91. Because Use Cases: Agriculture
Ag+Data Issues
http://radar.oreilly.com/2014/04/agdata.html
Data Guild whitepaper: Ag Systems + Data Outlook
http://goo.gl/OK8RFf
• livelihood for 40% of world population
• $15T/year GDP globally
• data-intensive issues, much legal impasse
Over a half billion small farms worldwide, and most
are family-run farms that rely on rain-fed agriculture
Nudge, and I just might propose DWave clusters
into cold craters on the Lunar South Pole with
routers @L5 and an LLO skyhook… to handle
the vector quantization demands. Or something.
Layered Sensing Networks
• airships: e.g., JP Aerospace, 40 km
• atmosats: e.g., Titan Aerospace, 20 km
• microsats: e.g., Planet Labs, 400 km
• robots: e.g., Blue River, 1 m
• sensors: e.g., Hortau, -0.3 m
• drones: e.g., HoneyComb, 120 m
93. Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O'Reilly Media
• establishes the bar for Spark expertise
certification: