Predictive Analytics with Hadoop
Michael Hausenblas
MapR Technologies
Agenda
• Examples
• Data-driven solutions
• Obtaining big training data
• Recommendation with Mahout and Solr
• Operational considerations
Recommendation is Everywhere
• Media and advertising
• e-commerce
• Enterprise sales: recommend sales opportunities to partners; $40M revenue in year 1; 1.5B records per day; using MapR
Classification is Everywhere
• Customer 360: scoring & categorization; each customer is scored and categorized based on all their activity; data from hundreds of streams and databases; using MapR
• IP address blacklisting: 600+ variables considered for every IP address; billions of data points; using MapR
• Fraud detection: identify anomalous patterns indicating fraud, theft and criminal activity; stop phishing attempts; using MapR
• Customer examples: Fortune 100, Telco
Data-Driven Solutions
• Physics is 'simple': F = ma; E = mc²
• Human behavior is much more complex
– Which ad will they click?
– Is a behavior fraudulent? Why?
• Don't look for complex models that try to discover general rules
– The size of the dataset is the most important factor
– Simple models (n-gram, linear classifiers) with Big Data
• A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of
data. IEEE Intelligent Systems, 24(2):8-12, 2009.
The Algorithms Are Less Important
Focus on the Data
• Most algorithms come down to counting and simple math
• Invest your time where you can make a difference
– Getting more data can improve results by 2x
• e.g., add beacons everywhere to instrument user behavior
– Tweaking an ML algorithm will yield a fraction of 1%
• Data wrangling
– Feature engineering
– Moving data around
– …
Obtaining Big Training Data
• Can't really rely on experts to label the data
– Doesn‟t scale (not enough experts out there)
– Too expensive
• So how do you get the training data?
– Crowdsourcing
– Implicit feedback
• “Obvious” features
• User engagement
Using Crowdsourcing for Annotation
R. Snow, B. O'Connor, D. Jurafsky, and A. Ng. Cheap and fast –
but is it good? Evaluating non-expert annotations for natural
language tasks. EMNLP, 2008.
Quantity: $2 for 7000 annotations (leveraging Amazon Mechanical Turk and a "flat world")
Quality: 4 non-experts = 1 expert
Using “Obvious” Features for Annotation
Example: emoticons such as :) and :( act as "obvious" labels for positive and negative training examples
Leveraging Implicit Feedback
• User behavior provides valuable training data
• Google adjusts search rankings based on engagement
– Did the user click on the result?
– Did the user come back to the search page within seconds?
• Most recommendation algorithms are based solely on user activity
– What products did they view/buy?
– What ads did they click on?
• T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G.
Gay. Evaluating the accuracy of implicit feedback from clicks and query
reformulations in Web search. ACM TOIS, 25(2):1, 2007.
Increasing Exploration
We need to find ways to increase exploration to broaden our learning.
[Chart: exploration of the second page]
We can't learn much about results on page 2, 3 or 4!
Result Dithering
• Dithering is used to re-order recommendation results
– Re-ordering is done randomly
• Dithering is guaranteed to make off-line performance worse
• Dithering also has a near-perfect record of making actual performance much better
– "Made more difference than any other change"
Simple Dithering Algorithm
• Generate a synthetic score from log rank plus Gaussian noise: s = log r + N(0, ε)
• Pick the noise scale to provide the desired level of mixing: Δr ∝ r exp(ε)
• Typically ε ∈ [0.4, 0.8]
• Oh… use floor(t/T) as the random seed so results don't change too often
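A minimal R sketch of this scheme (the function name dither and the window length T are ours, for illustration):

# Dither a ranked list: synthetic score = log(rank) + Gaussian noise,
# seeded by floor(t/T) so the ordering is stable within each time window.
dither <- function(n_items, epsilon = 0.5, t = as.numeric(Sys.time()), T = 3600) {
  set.seed(as.integer(t %/% T))              # same seed for the whole window
  r <- seq_len(n_items)                      # original ranks 1..n
  s <- log(r) + rnorm(n_items, 0, epsilon)   # s = log r + N(0, epsilon)
  order(s)                                   # new presentation order
}
dither(8)   # mostly preserves the head of the list, mixes the tail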
Example: ε = 0.5
• Each line below is one dithered recommendation of 8 items
• The non-dithered recommendation would be 1, 2, …, 8
1 2 6 5 3 4 13 16
1 2 3 8 5 7 6 34
1 4 3 2 6 7 11 10
1 2 4 3 15 7 13 19
1 6 2 3 4 16 9 5
1 2 3 5 24 7 17 13
1 2 3 4 6 12 5 14
2 1 3 5 7 6 4 17
4 1 2 7 3 9 8 5
2 1 5 3 4 7 13 6
3 1 5 4 2 7 8 6
2 1 3 4 7 12 17 16
Example: ε = log 2 ≈ 0.69
• Each line below is one dithered recommendation of 8 items
• The non-dithered recommendation would be 1, 2, …, 8
1 2 8 3 9 15 7 6
1 8 14 15 3 2 22 10
1 3 8 2 10 5 7 4
1 2 10 7 3 8 6 14
1 5 33 15 2 9 11 29
1 2 7 3 5 4 19 6
1 3 5 23 9 7 4 2
2 4 11 8 3 1 44 9
2 3 1 4 6 7 8 33
3 4 1 2 10 11 15 14
11 1 2 4 5 7 3 14
1 8 7 3 22 11 2 33
Recommendations with Mahout and Solr
What is Recommendation?
The behavior of a crowd helps us understand what individuals will do…
Batch and Real-Time
• We can learn about the relationship between items every X hours
– These relationships don't change often
– People who buy Nikon D7100 cameras also buy a Nikon EN-EL15 battery
• What to recommend to Bob has to be determined in real-time
– Bob may be a new user with no history
– Bob is shopping for a camera right now, but he was buying a baby bottle an hour ago
• How do we do that?
– Mahout for the heavy number crunching
– Solr/Elasticsearch for the real-time recommendations
Real-Time Recommender Architecture
[Diagram: the complete history (log files/tables) feeds co-occurrence analysis (Mahout); the results are indexed (Solr) into index shards alongside item metadata; the web tier sends the recent history for a single user to search (Solr), which returns the recommendation. Note: all data lives in the cluster.]
Recommendations
• Alice got an apple and a puppy
• Charles got a bicycle
Recommendations
• Alice got an apple and a puppy
• Charles got a bicycle
• Bob got an apple
Recommendations
• What else would Bob like?
Log Files
[Diagram: raw log entries pairing each user (Alice, Bob, Charles) with the items they got]
History Matrix
[Diagram: users (Alice, Bob, Charles) × items matrix, with a checkmark for each observed interaction]
Co-Occurrence Matrix: Items by Items
[Diagram: items × items matrix of co-occurrence counts computed from the history matrix]
Q: How do you tell which co-occurrences are useful?
A: Let Mahout do the math…
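Under the hood, Mahout scores each item pair with a log-likelihood ratio (LLR) test on the 2×2 contingency table of counts: together, A without B, B without A, neither. A minimal R sketch of that statistic (the helper name llr is ours):

# G^2 log-likelihood ratio for a 2x2 contingency table of co-occurrence counts
llr <- function(k11, k12, k21, k22) {
  k <- matrix(c(k11, k12, k21, k22), nrow = 2, byrow = TRUE)
  expected <- outer(rowSums(k), colSums(k)) / sum(k)   # counts expected if independent
  2 * sum(ifelse(k > 0, k * log(k / expected), 0))     # higher = more anomalous
}
llr(20, 80, 100, 99800)   # a pair that co-occurs far more often than chance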
Indicator Matrix: Anomalous Co-Occurrences
[Diagram: the co-occurrence matrix reduced to its statistically significant entries]
Result: the marked row will be added to the indicator field in the item document…
Indicator Matrix
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)
That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.
Note: Data for the indicator field is added directly to the metadata for a document in the Solr index. You don't need to create a separate index for the indicators.
Internals of the Recommender Engine
Q: What should we recommend if a new user listened to 2122:Fats Domino & 303:Beatles?
A: Search the index with "indicator_artists:2122 OR indicator_artists:303"
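A sketch of how the web tier might build that query string from the user's recent history (variable names ours; the result goes to Solr's standard select handler):

recent_artists <- c("2122", "303")   # artist ids from the current session
q <- paste(sprintf("indicator_artists:%s", recent_artists), collapse = " OR ")
q   # "indicator_artists:2122 OR indicator_artists:303"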
Internals of the Recommender Engine
1710:Chuck Berry is the top recommendation
Toolbox for Predictive Analytics with Hadoop
• Mahout
– Use it for recommendations, clustering, math
• Vowpal Wabbit
– Use it for classification (but it's harder to use)
• SkyTree
– Commercial (not open source) implementation of common algorithms
Operational Considerations
• Snapshot your raw data before training
– Reproducible process
– MapR snapshots
• Leverage NFS for real-time data ingestion
– Train the model on today's data
– Learning schedule independent from ingestion schedule
• Look for HA, data protection, disaster recovery
– Predictive analytics increases revenue or reduces cost
– Quickly becomes a must-have, not a nice-to-have
Q&A: Engage with us!
mhausenblas@mapr.com
+353 86 0215164
@MapR_EMEA · maprtech · +MaprTechnologies · MapR Technologies
Analyzing Big Data with Open Source R and Hadoop
Steven Sit
IBM Silicon Valley Laboratory
Outline
• R on Hadoop
• Rationale and requirements
• Various approaches
• Using R and “Big R” packages
• Data exploration, statistical analysis and visualization
• Scale out R with data partitioning
• Distributed machine learning
• Demo
Rationale and requirements
• Productivity: use natural R syntax to access massive data in Hadoop; leverage existing packages and libraries
• Platform for analytics: common input and output data formats across analytic algorithms
• Customizability & extensibility: easily customize existing algorithms; support for both conventional data mining and large-scale ML needs
• Scalability & performance: scale to massively parallel clusters; analyze terabytes and petabytes of data
Common Patterns for Hadoop and Analytics
[Diagram: sources (streams, files, NoSQL DBs, warehouses/marts/MDM) feed Hadoop through ETL tools, Sqoop, Flume, NFS and REST APIs; on top of Hadoop sit SQL engines, database tools (DDL), exploration tools (Pig, Java, etc.), search indexes, models, analytic runtimes, and business intelligence/analytical tools connected via ODBC/JDBC and utilities.]
Quick Tour of R
• R is an interpreted language
• Open-source implementation of the S language (1976)
• Best suited for statistical analysis and modeling
• Data exploration and manipulation
• Descriptive statistics
• Predictive analytics and machine learning
• Visualization
• +++
• Can produce “publication quality graphics”
• Emerging as a competitor to proprietary platforms
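A two-minute taste of the above, using only base R and a built-in dataset:

summary(mtcars$mpg)                  # descriptive statistics
fit <- lm(mpg ~ wt, data = mtcars)   # a simple predictive model
plot(mtcars$wt, mtcars$mpg)          # visualization...
abline(fit)                          # ...with the fitted regression line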
R is hot!... but
• Quite lean as far as software goes
• Free!
• State-of-the-art algorithms
• Statistical researchers often provide their methods as R packages
• New techniques available without delay
• Commercial packages usually behind the curve
• 4700+ packages as of today
• Active and vibrant user community
• Universities are teaching R
• IT shops are adopting R
• Companies are integrating R into their products
• R jobs and demand for R skills on the rise
Unfortunately, R is not built for Big Data: working with large datasets is limited by RAM.
R and Big Data: Various approaches
• R + Hadoop streaming
• Use R as a scripting language
• Write map/reduce jobs in R
• Or Python for the mappers, and R for the reducers
• R + open-source packages for Hadoop (see the sketch after this list)
• RHadoop
• rmr2: write low-level map/reduce with an API in R
• rhbase and rhdfs: simple R APIs over Hadoop capabilities
• RHIPE
• R + SparkR + Hadoop
• R frontend to Spark's in-memory structures and data from Hadoop
• R + high-level languages
• Jaql <-> R "bridge"
• R interfaces over non-R engines
• ScaleR from Revolution R
• Oracle's ORE, Netezza's NZR, Teradata's teradataR, +++
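To make the RHadoop flavor concrete, a minimal rmr2 skeleton (a sketch that assumes the rmr2 package and a configured Hadoop backend):

library(rmr2)
out <- mapreduce(
  input  = to.dfs(1:1000),                               # write sample data to HDFS
  map    = function(k, v) keyval(v %% 10, 1),            # key: last digit of each number
  reduce = function(k, counts) keyval(k, sum(counts)))   # count per key
from.dfs(out)                                            # pull results back to R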
RHadoop Example
Given a table describing flight information for every airline spanning several decades:
"Carrier", "Year", "Month", "Day", "DepDelay", …
What's the mean departure delay ("DepDelay") for each airline for each month?
RHadoop Example (contd.)
Map: input key = NULL, value = vector (columns in a line); output key = c(Carrier, Year, Month), value = DepDelay
Shuffle: groups the map output, yielding key = c(Carrier, Year, Month), value = vector(DepDelay)
Reduce: output key = c(Carrier, Year, Month), value = mean(DepDelay)
RHadoop Example (contd.)
csvtextinputformat = function(line) keyval(NULL, unlist(strsplit(line, ",")))

deptdelay = function (input, output) {
  mapreduce(input = input,
            output = output,
            textinputformat = csvtextinputformat,
            map = function(k, fields) {
              # Skip header lines and bad records:
              if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
                deptDelay <- fields[[16]]
                # Skip records where the departure delay is "NA":
                if (!(identical(deptDelay, "NA"))) {
                  # field[9] is carrier, field[1] is year, field[2] is month:
                  keyval(c(fields[[9]], fields[[1]], fields[[2]]), deptDelay)}}},
            reduce = function(keySplit, vv) {
              keyval(keySplit[[2]],
                     c(keySplit[[3]], length(vv), keySplit[[1]], mean(as.numeric(vv))))})}

from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
Source: http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html
What is "Big R"?
• Explore, visualize, transform, and model big data using familiar R syntax and paradigms
• Scale out R with MR programming
• Partitioning of large data
• Parallel cluster execution of R code
• Distributed machine learning
• A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R
[Diagram: R clients connect to Hadoop data sources through IBM R packages; (1) pull data (summaries) to the R client, (2) push R functions right onto the data via embedded R execution, (3) scalable machine learning.]
Let's mix it up a little... with a demo interspersed with slides.
Big R Demo Data
• Publicly available “airline” data
• 22 years of actual arrival/departure information
• Every scheduled flight in the US
• 1987 onwards
• From U.S. Department of Transportation
• Cleansed version
• http://stat-computing.org/dataexpo/2009/the-data.html
Airline data description
Year 1987-2008
Month 1-12
DayofMonth 1-31
DayOfWeek 1 (Monday) - 7 (Sunday)
DepTime actual departure time (local, hhmm)
CRSDepTime scheduled departure time (local, hhmm)
ArrTime actual arrival time (local, hhmm)
CRSArrTime scheduled arrival time (local, hhmm)
UniqueCarrier unique carrier code
FlightNum flight number
TailNum plane tail number
ActualElapsedTime in minutes
CRSElapsedTime in minutes
AirTime in minutes
ArrDelay arrival delay, in minutes
Airline data description (contd.)
Origin origin IATA airport code
Dest destination IATA airport code
Distance in miles
TaxiIn taxi in time, in minutes
TaxiOut taxi out time in minutes
Cancelled was the flight cancelled?
CancellationCode reason for cancellation (A = carrier, B =
weather, C = NAS, D = security)
Diverted 1 = yes, 0 = no
CarrierDelay in minutes
WeatherDelay in minutes
NASDelay in minutes
SecurityDelay in minutes
LateAircraftDelay in minutes
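One year of this data is small enough to inspect locally with plain R before moving to Big R (the 1987.csv file name follows the data expo download above; a sketch):

air1987 <- read.csv("1987.csv", na.strings = "NA")
str(air1987[, c("UniqueCarrier", "DepDelay", "ArrDelay", "Distance")])
mean(air1987$DepDelay, na.rm = TRUE)   # overall mean departure delay, in minutes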
Explore, visualize, transform, and model Hadoop data with R
• Represent big data objects as R datatypes
• R's programming syntax and paradigm
• Data stays in HDFS
• R classes (e.g. bigr.data.frame) as proxies
• No low-level map/reduce constructs
• No underlying scripting languages

# Construct a bigr.frame to access a large data set
air <- bigr.frame(dataPath="airline_demo.csv", …)

# How many flights were flown by United or Delta?
length(UniqueCarrier[UniqueCarrier %in% c("UA", "DL")])

# Filter all flights that were delayed by 15+ minutes at departure or arrival.
airSubset <- air[air$Cancelled == 0 & (air$DepDelay >= 15 | air$ArrDelay >= 15),
                 c("UniqueCarrier", "Origin", "Dest", "DepDelay", "ArrDelay")]

# For these filtered flights, compute key statistics (# of flights,
# average flying distance and flying time), grouped by airline
summary(count(UniqueCarrier) + mean(Distance) + mean(CRSElapsedTime) ~
        UniqueCarrier, dataset = airSubset)

bigr.boxplot(air$Distance ~ air$UniqueCarrier, …)
Scale out R in Hadoop
• Support parallel / partitioned execution of R
• Works around R's memory limitations
• Execute R snippets on chunks of data
• Partitioned by key, by #rows, via sampling, …
• Follows R's "apply" model
• Parallelized seamlessly on the Map/Reduce engine

# Filter the airline data on United and Hawaiian
bf <- air[air$UniqueCarrier %in% c("HA", "UA"),]

# Build one decision-tree model per airline
models <- groupApply(data = bf, groupingColumns = list(bf$UniqueCarrier),
                     rfunction = function(df) {
                       library(rpart)
                       predcols <- c('ArrDelay', 'DepDelay', 'DepTime', 'Distance')
                       return (rpart(ArrDelay ~ ., df[,predcols]))})

# Pull the model for HA to the client
modelHA <- bigr.pull(models$HA)

# Visualize the model
prettyTree(modelHA)
Big R API: Core Classes & Methods

Connection handling: bigr.connect() and bigr.disconnect(); is.bigr.connected()
bigr.frame: modeled after R's data.frame; proxy for tabular data
bigr.vector: modeled after R's vector datatype; proxy for a column
bigr.list: (loosely) modeled after R's list; proxy for collections of serialized R objects

• Basic exploration: head(), tail(); dim(), length(), nrow(), ncol(); str(), summary()
• Selections and projections: [ and $
• Arithmetic and logical operators: +, -, *, /; &, |, !; ifelse()
• String and math functions: lots of these
• Other relational operators: table(), unique(), merge(), summary(), sort(), na.omit(), na.exclude()
• Data movement: as.data.frame, as.bigr.frame; bigr.persist
• Sampling: bigr.sample(), bigr.random()
• Visualizations: built into R packages (e.g. ggplot2)
Big R API: Core Classes & Methods (contd.)

• Summarization and aggregation: summary(), very powerful when used with "formula" notation; max(), min(), range()
• Mean, variance and standard deviation: mean(), var(), sd()
• Correlation: cov() and cor()
• Hadoop options: bigr.get.server.option(), bigr.set.server.option()
• bigr.debug(T): useful for servicing; will print out internal debugging output
• Catalog access: bigr.listfs(); bigr.listTables(), bigr.listColumns()
• groupApply(): primary function for embedded execution; can return tables or objects; run "help(groupApply)" inside R for extensive documentation
• Examining R execution logs: bigr.logs()
• Other *Apply functions: rowApply(), for running R on batches of rows; tableApply(), for running R on the entire dataset
Example: What is the average scheduled flight time, actual gate-to-gate time, and actual airtime for each city pair per year?
RHadoop implementation (>35 lines of code):

mapper.year.market.enroute_time = function(key, val) {
  # Skip header lines, cancellations, and diversions:
  if ( !identical(as.character(val['Year']), 'Year')
       & identical(as.numeric(val['Cancelled']), 0)
       & identical(as.numeric(val['Diverted']), 0) ) {
    # We don't care about direction of travel, so construct 'market'
    # with airports ordered alphabetically (e.g., LAX to JFK becomes 'JFK-LAX')
    if (val['Origin'] < val['Dest'])
      market = paste(val['Origin'], val['Dest'], sep='-')
    else
      market = paste(val['Dest'], val['Origin'], sep='-')
    # key consists of year, market
    output.key = c(val['Year'], market)
    # output gate-to-gate elapsed times (CRS and actual) + time in air
    output.val = c(val['CRSElapsedTime'], val['ActualElapsedTime'], val['AirTime'])
    return( keyval(output.key, output.val) )
  }
}

reducer.year.market.enroute_time = function(key, val.list) {
  # val.list is a list of row vectors; a data.frame is a list of column
  # vectors; plyr's ldply() is the easiest way to convert IMHO
  if ( require(plyr) )
    val.df = ldply(val.list, as.numeric)
  else { # this is as close as my deficient *apply skills can come w/o plyr
    val.list = lapply(val.list, as.numeric)
    val.df = data.frame( do.call(rbind, val.list) )
  }
  colnames(val.df) = c('crs', 'actual', 'air')
  output.key = key
  output.val = c( nrow(val.df), mean(val.df$crs, na.rm=T),
                  mean(val.df$actual, na.rm=T),
                  mean(val.df$air, na.rm=T) )
  return( keyval(output.key, output.val) )
}

mr.year.market.enroute_time = function (input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csvtextinputformat,
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time,
            backend.parameters = list(
              hadoop = list(D = "mapred.reduce.tasks=10")),
            verbose=T)
}

hdfs.output.path = file.path(hdfs.output.root, 'enroute-time')
results = mr.year.market.enroute_time(hdfs.input.path, hdfs.output.path)
results.df = from.dfs(results, to.data.frame=T)
colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'in.air')
save(results.df, file="out/enroute.time.RData")

Equivalent Big R implementation (4 lines of code):

air <- bigr.frame(dataPath = "airline.csv", dataSource = "DEL", na.string = "NA")
air$City1 <- ifelse(air$Origin < air$Dest, air$Origin, air$Dest)
air$City2 <- ifelse(air$Origin >= air$Dest, air$Origin, air$Dest)
summary(count(UniqueCarrier) + mean(ActualElapsedTime) + mean(CRSElapsedTime) +
        mean(AirTime) ~ Year + City1 + City2,
        dataset = air[air$Cancelled == 0 & air$Diverted == 0,])
Use Cases
• Where Big R works well
• When data can be partitioned cleanly…
• …and each partition fits in the memory of a server node
• In other words:
• the size of the entire data can be bigger than cluster memory
• the size of individual partitions is limited to node memory (due to R)
• Real-world customer scenarios we've seen:
• Model data from individual tractors
• Build time-series anomaly detection on IP address pairs
• Build models of each customer's behavior
• And where it doesn't…
• When building one monolithic model on the entire data
• without sampling
• without using some form of blending such as ensembles
Large scale analytics in Hadoop
• Some workloads are not logically partitionable, or the partitions are still large
• SystemML: a scalable engine running natively over Hadoop
• Delivers pre-built ML algorithms: regression, classification, clustering, etc.
• Ability to author new algorithms

# Build a model on the entire data set
model <- bigr.lm(ArrDelay ~ ., df)

# Or, build several models, partitioned by airline
models <- groupApply(
  input = air,
  groupBy = air$UniqueCarrier,
  function(df) {
    # Linear regression
    model <- bigr.lm(ArrDelay ~ ., df)
    return (model)
  })
Large scale analytics (SystemML) modules

Data Mining
• Association rules: Apriori and pattern growth for frequent itemsets; sequence miner
• Clustering: k-means
• Dimension reduction: non-negative matrix factorizations; principal component analysis (large n, small p); singular value decompositions
• Time series analysis: Granger modeling

Predictive Analytics
• Regression: linear regression for large, sparse datasets; generalized linear models
• Classification: linear logistic regression (trust region method); linear SVMs (modified finite Newton method); random decision trees for classification & regression

Ranking
• PageRank of a directed graph; HITS hubs and authorities

Optimization
• Conjugate gradient for sparse linear systems; parallel optimization for sparse linear models; stochastic gradient descent

Outlier Detection
• Recursive binning and reprojection for distance-based outlier detection

Data Exploration
• Univariate: scale, nominal, and ordinal variables; scale: min, max, range, mean, variance, moments (kurtosis, skewness), order statistics (median, quantile, iqm), outliers; categorical: mode, histograms
• Bivariate: scale/categorical: eta, F-test, grouped mean/variance/weight; categorical/categorical: Cramér's V, Pearson chi-square, Spearman

Recommender Systems
• Matrix completion algorithms

Meta Learning
• Ensemble learning; cross validation
Example: Recommendation Systems with Collaborative Filtering
[Diagram: a sparse people × movies ratings matrix is factorized into W (people × K factors) and H (K factors × movies), analyzing both similarity of people and products.]
Example: Topic Evolution in Social Media with Clustering
[Diagram: a sparse documents × tokens matrix is factorized into W (documents × K topics) and H (K topics × words).]
Example: Gaussian Non-negative Matrix Factorization
package gnmf;
import java.io.IOException;
import java.net.URISyntaxException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
public class MatrixGNMF
{
public static void main(String[] args) throws IOException, URISyntaxException
{
if(args.length < 10)
{
System.out.println("missing parameters");
System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] " +
"[k] [num mappers] [num reducers] [replication] [working directory] " +
"[final directory of w] [final directory of h]");
System.exit(1);
}
String vDir = args[0];
String wDir = args[1];
String hDir = args[2];
int k = Integer.parseInt(args[3]);
int numMappers = Integer.parseInt(args[4]);
int numReducers = Integer.parseInt(args[5]);
int replication = Integer.parseInt(args[6]);
String outputDir = args[7];
String wFinalDir = args[8];
String hFinalDir = args[9];
JobConf mainJob = new JobConf(MatrixGNMF.class);
String vDirectory;
String wDirectory;
String hDirectory;
FileSystem.get(mainJob).delete(new Path(outputDir));
vDirectory = vDir;
hDirectory = hDir;
wDirectory = wDir;
String workingDirectory;
String resultDirectoryX;
String resultDirectoryY;
long start = System.currentTimeMillis();
System.gc();
System.out.println("starting calculation");
System.out.print("calculating X = WT * V... ");
workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k);
resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
workingDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating Y = WT * W * H... ");
workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
wDirectory, outputDir);
resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating H = H .* X ./ Y... ");
workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k);
System.out.println("done");
FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
System.out.print("storing back H... ");
FileSystem.get(mainJob).delete(new Path(hDirectory));
hDirectory = workingDirectory;
System.out.println("done");
System.out.print("calculating X = V * HT... ");
workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k);
resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
workingDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating Y = W * H * HT... ");
workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
hDirectory, outputDir);
resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating W = W .* X ./ Y... ");
workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k);
System.out.println("done");
FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
System.out.print("storing back W... ");
FileSystem.get(mainJob).delete(new Path(wDirectory));
wDirectory = workingDirectory;
System.out.println("done");
package gnmf;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class UpdateWHStep2
{
static class UpdateWHStep2Mapper extends MapReduceBase
implements Mapper<TaggedIndex, MatrixVector, TaggedIndex, MatrixVector>
{
@Override
public void map(TaggedIndex key, MatrixVector value,
OutputCollector<TaggedIndex, MatrixVector> out,
Reporter reporter) throws IOException
{
out.collect(key, value);
}
}
static class UpdateWHStep2Reducer extends MapReduceBase
implements Reducer<TaggedIndex, MatrixVector, TaggedIndex, MatrixObject>
{
@Override
public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter)
throws IOException
{
MatrixVector result = null;
while(values.hasNext())
{
MatrixVector current = values.next();
if(result == null)
{
result = current.getCopy();
} else
{
result.addVector(current);
}
}
if(result != null)
{
out.collect(new TaggedIndex(key.getIndex(), TaggedIndex.TYPE_VECTOR_X),
new MatrixObject(result));
}
}
}
public static String runJob(int numMappers, int numReducers, int replication,
String inputDir, String outputDir) throws IOException
{
String workingDirectory = outputDir + System.currentTimeMillis() + "-UpdateWHStep2/";
JobConf job = new JobConf(UpdateWHStep2.class);
job.setJobName("MatrixGNMFUpdateWHStep2");
job.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputDir));
job.setOutputFormat(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(workingDirectory));
job.setNumMapTasks(numMappers);
job.setMapperClass(UpdateWHStep2Mapper.class);
job.setMapOutputKeyClass(TaggedIndex.class);
job.setMapOutputValueClass(MatrixVector.class);
package gnmf;
import gnmf.io.MatrixCell;
import gnmf.io.MatrixFormats;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class UpdateWHStep1
{
public static final int UPDATE_TYPE_H = 0;
public static final int UPDATE_TYPE_W = 1;
static class UpdateWHStep1Mapper extends MapReduceBase
implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject>
{
private int updateType;
@Override
public void map(TaggedIndex key, MatrixObject value,
OutputCollector<TaggedIndex, MatrixObject> out,
Reporter reporter) throws IOException
{
if(updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL)
{
MatrixCell current = (MatrixCell) value.getObject();
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL),
new MatrixObject(new MatrixCell(key.getIndex(), current.getValue())));
} else
{
out.collect(key, value);
}
}
@Override
public void configure(JobConf job)
{
updateType = job.getInt("gnmf.updateType", 0);
}
}
static class UpdateWHStep1Reducer extends MapReduceBase
implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector>
{
private double[] baseVector = null;
private int vectorSizeK;
@Override
public void reduce(TaggedIndex key, Iterator<MatrixObject> values,
OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
throws IOException
{
if(key.getType() == TaggedIndex.TYPE_VECTOR)
{
if(!values.hasNext())
throw new RuntimeException("expected vector");
MatrixFormats current = values.next().getObject();
if(!(current instanceof MatrixVector))
throw new RuntimeException("expected vector");
baseVector = ((MatrixVector) current).getValues();
} else
{
while(values.hasNext())
{
MatrixCell current = (MatrixCell) values.next().getObject();
if(baseVector == null)
{
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
new MatrixVector(vectorSizeK));
} else
{
if(baseVector.length == 0)
throw new RuntimeException("base vector is corrupted");
MatrixVector resultingVector = new MatrixVector(baseVector);
resultingVector.multiplyWithScalar(current.getValue());
if(resultingVector.getValues().length == 0)
throw new RuntimeException("multiplying with scalar failed");
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
resultingVector);
}
}
baseVector = null;
}
}
@Override
public void configure(JobConf job)
{
vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0);
if(vectorSizeK == 0)
Java implementation (>1500 lines of code; excerpted above)

Equivalent Big R / SystemML implementation (12 lines of code):
# Perform matrix operations, say non-negative factorization: V ~ W %*% H
V <- bigr.matrix("V.mtx")                 # initial matrix on HDFS
W <- bigr.matrix(nrow=nrow(V), ncols=k)   # initialize starting points
H <- bigr.matrix(nrow=k, ncols=ncols(V))
for (i in 1:numiterations) {
  H <- H * (t(W) %*% V / t(W) %*% W %*% H)
  W <- W * (V %*% t(H) / W %*% H %*% t(H))
}
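The same multiplicative updates can be sanity-checked in plain R on a small dense matrix; a toy sketch (ours, not from the slides):

set.seed(1)
k <- 2
V <- matrix(runif(20), 4, 5)                   # 4 x 5 matrix to factorize
W <- matrix(runif(4 * k), 4, k)                # random starting points
H <- matrix(runif(k * 5), k, 5)
for (i in 1:200) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H)   # same updates as above
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H))
}
max(abs(V - W %*% H))                          # residual shrinks toward a local optimum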
Summary: Popular R Analytics on Hadoop
• R is the preferred language for data scientists
• Many approaches exist to enable R on Hadoop, with pros and cons
• The "Big R" approach:
• Data exploration, statistical analysis and visualization with natural R syntax
• Scale out R with data partitioning
• Support for standard R tools and existing packages and libraries
• SystemML engine for distributed machine learning that provides canned algorithms, and an ability to author new ones, all via R-like syntax
[Diagram: R clients connect to an R runtime and a distributed ML runtime on Hadoop, over data sources such as Hive tables, HBase tables, and files.]
www.ibm.com/software/data/infosphere/hadoop
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Predictive Analytics with Hadoop

  • 14. © 2014 MapR Technologies 14 Result Dithering • Dithering is used to re-order recommendation results – Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better – “Made more difference than any other change”
  • 15. © 2014 MapR Technologies 15 Simple Dithering Algorithm • Generate synthetic score from log rank plus Gaussian: s = log r + N(0, ε) • Pick noise scale to provide desired level of mixing: Δr ∝ r · exp(ε) • Typically: ε ∈ [0.4, 0.8] • Oh… use floor(t/T) as seed so results don't change too often
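A minimal R sketch of this dithering algorithm (our own illustration, not code from the talk; it assumes ranks start at 1 and defaults to ε = 0.5):

    # Perturb log-rank with Gaussian noise, then re-sort.
    dither <- function(n_items = 8, epsilon = 0.5, seed = NULL) {
      if (!is.null(seed)) set.seed(seed)         # e.g. floor(t/T) keeps a session stable
      r <- 1:n_items                             # original result ranks
      s <- log(r) + rnorm(n_items, 0, epsilon)   # synthetic score
      order(s)                                   # new presentation order
    }
    dither(8, 0.5)   # a permutation of 1..8; the head stays mostly stable, the tail shuffles

Each call produces one row like those in the examples on the next two slides.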
  • 16. © 2014 MapR Technologies 16 Example: ε = 0.5 • Each line represents a recommendation of 8 items • The non-dithered recommendation would be 1, 2, …, 8
    1  2  6  5  3  4 13 16
    1  2  3  8  5  7  6 34
    1  4  3  2  6  7 11 10
    1  2  4  3 15  7 13 19
    1  6  2  3  4 16  9  5
    1  2  3  5 24  7 17 13
    1  2  3  4  6 12  5 14
    2  1  3  5  7  6  4 17
    4  1  2  7  3  9  8  5
    2  1  5  3  4  7 13  6
    3  1  5  4  2  7  8  6
    2  1  3  4  7 12 17 16
  • 17. © 2014 MapR Technologies 17 Example: ε = log 2 = 0.69 • Each line represents a recommendation of 8 items • The non-dithered recommendation would be 1, 2, …, 8
    1  2  8  3  9 15  7  6
    1  8 14 15  3  2 22 10
    1  3  8  2 10  5  7  4
    1  2 10  7  3  8  6 14
    1  5 33 15  2  9 11 29
    1  2  7  3  5  4 19  6
    1  3  5 23  9  7  4  2
    2  4 11  8  3  1 44  9
    2  3  1  4  6  7  8 33
    3  4  1  2 10 11 15 14
   11  1  2  4  5  7  3 14
    1  8  7  3 22 11  2 33
  • 18. © 2014 MapR Technologies 18 Recommendations with Mahout and Solr
  • 19. © 2014 MapR Technologies 19 What is Recommendation? The behavior of a crowd helps us understand what individuals will do…
  • 20. © 2014 MapR Technologies 20 Batch and Real-Time • We can learn about the relationship between items every X hours – These relationships don't change often – People who buy Nikon D7100 cameras also buy a Nikon EN-EL15 battery • What to recommend to Bob has to be determined in real-time – Bob may be a new user with no history – Bob is shopping for a camera right now, but he was buying a baby bottle an hour ago • How do we do that? – Mahout for the heavy number crunching – Solr/Elasticsearch for the real-time recommendations
  • 21. © 2014 MapR Technologies 21 Real-Time Recommender Architecture [diagram: the complete history (log files/tables) feeds co-occurrence analysis (Mahout); its output is indexed (Solr) into index shards alongside item metadata; the web tier sends the recent history for a single user to search (Solr), which returns the recommendation] Note: All data lives in the cluster
  • 22. © 2014 MapR Technologies 22 Recommendations Alice got an apple and a puppy Charles got a bicycle Alice Charles
  • 23. © 2014 MapR Technologies 23 Recommendations Alice got an apple and a puppy Charles got a bicycle Bob got an apple Alice Bob Charles
  • 24. © 2014 MapR Technologies 24 Recommendations ? What else would Bob like? Alice Bob Charles
  • 25. © 2014 MapR Technologies 25 Log Files Alice Bob Charles Alice Bob Charles Alice
  • 26. © 2014 MapR Technologies 26 History Matrix [users × items matrix: one row each for Alice, Bob and Charles; a ✔ marks each item a user interacted with]
  • 27. © 2014 MapR Technologies 27 Co-Occurrence Matrix: Items by Items [items × items matrix of co-occurrence counts (entries 0, 1, 2) derived from the history matrix] Q: How do you tell which co-occurrences are useful? A: Let Mahout do the math…
  • 28. © 2014 MapR Technologies 29 Indicator Matrix: Anomalous Co-Occurrences ✔ ✔ Result: The marked row will be added to the indicator field in the item document…
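Mahout scores co-occurrences with Ted Dunning's log-likelihood ratio (LLR) test, keeping only the anomalous ones as indicators. A small, self-contained R sketch of that test (our own helper, not Mahout's code; k11 counts users with both items, k12/k21 users with only one of them, k22 users with neither):

    llr <- function(k11, k12, k21, k22) {
      # Shannon entropy of a vector of counts
      H <- function(k) { N <- sum(k); p <- k[k > 0] / N; -sum(p * log(p)) }
      N <- k11 + k12 + k21 + k22
      # G-statistic: 2 * N * mutual information of the 2x2 table
      2 * N * (H(c(k11 + k12, k21 + k22)) +   # row entropy
               H(c(k11 + k21, k12 + k22)) -   # column entropy
               H(c(k11, k12, k21, k22)))      # matrix entropy
    }
    llr(25, 1000, 1000, 100000)   # large value: the co-occurrence is anomalous
    llr(1, 1, 1, 1)               # zero: no evidence of association

Pairs whose LLR clears a threshold become the ✔ entries in the indicator matrix.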
  • 29. © 2014 MapR Technologies 30 Indicator Matrix • That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine:
    id: t4
    title: puppy
    desc: The sweetest little puppy ever.
    keywords: puppy, dog, pet
    indicators: (t1)
Note: Data for the indicator field is added directly to the meta-data for a document in the Solr index. You don't need to create a separate index for the indicators.
  • 30. © 2014 MapR Technologies 31 Internals of the Recommender Engine Q: What should we recommend if a new user listened to 2122:Fats Domino & 303:Beatles? A: Search the index with “indicator_artists:2122 OR indicator_artists:303”
  • 31. © 2014 MapR Technologies 32 Internals of the Recommender Engine 1710:Chuck Berry is the top recommendation
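What that query might look like from client code, as a hedged sketch (the Solr URL, core name and result shape here are illustrative assumptions, not the talk's deployment):

    # Query Solr for items whose indicator field matches the user's recent history.
    library(httr)
    recommend <- function(history_ids) {
      q <- paste(sprintf("indicator_artists:%s", history_ids), collapse = " OR ")
      resp <- GET("http://localhost:8983/solr/items/select",
                  query = list(q = q, wt = "json", rows = 10))
      content(resp, as = "parsed")$response$docs
    }
    recommend(c("2122", "303"))   # user listened to 2122:Fats Domino and 303:Beatles

Solr's ordinary relevance ranking orders the matches, so the top search hit is the top recommendation.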
  • 32. © 2014 MapR Technologies 33 Toolbox for Predictive Analytics with Hadoop • Mahout – Use it for Recommendations, Clustering, Math • Vowpal Wabbit – Use it for Classification (but it's harder to use) • SkyTree – Commercial (not open source) implementation of common algorithms
  • 33. © 2014 MapR Technologies 34 Operational Considerations • Snapshot your raw data before training – Reproducible process – MapR snapshots • Leverage NFS for real-time data ingestion – Train the model on today's data – Learning schedule independent from ingestion schedule • Look for HA, data protection, disaster recovery – Predictive analytics increases revenue or reduces cost – Quickly becomes a must-have, not a nice-to-have
  • 34. © 2014 MapR Technologies 35 Q&A • Engage with us! mhausenblas@mapr.com +353 86 0215164 @MapR_EMEA maprtech +MaprTechnologies maprtech MapR Technologies
  • 35.
  • 36. Analyzing Big Data with Open Source R and Hadoop Steven Sit IBM
  • 37. Analyzing big data with open source R and Hadoop Steven Sit IBM Silicon Valley Laboratory
  • 38. 39 Outline • R on Hadoop • Rationale and requirements • Various approaches • Using R and “Big R” packages • Data exploration, statistical analysis and Visualization • Scale out R with data partitioning • Distributed machine learning • Demo
  • 39. 40 Rationale and requirements Productivity • Use natural R syntax to access massive data in Hadoop • Leverage existing packages and libraries Platform for analytics • Common input and output data formats across analytic algorithms Customizability & Extensibility • Easily customize existing algorithms • Support for both conventional data mining and large scale ML needs Scalability & Performance • Scale to massively parallel clusters • Analyze terabytes and petabytes of data
  • 40. 41 Common Patterns for Hadoop and Analytics [architecture diagram: sources (streams, ETL tools, Sqoop, Flume, NFS, REST API, files, NoSQL DBs) flow into Hadoop; around it sit business-intelligence tools (ODBC, JDBC, utilities), warehouses/marts/MDM, analytic runtimes and analytical tools, SQL engines and DDL, database tools (Pig, Java, etc.), exploration tools, search indexes, and models]
  • 41. 42 Quick Tour of R • R is an interpreted language • Open-source implementation of the S language (1976) • Best suited for statistical analysis and modeling • Data exploration and manipulation • Descriptive statistics • Predictive analytics and machine learning • Visualization • +++ • Can produce “publication quality graphics” • Emerging as a competitor to proprietary platforms
  • 42. 43 R is hot!... but • Quite lean as far as software goes • Free! • State of the art algorithms • Statistical researchers often provide their methods as R packages • New techniques available without delay • Commercial packages usually behind the curve • 4700+ packages as of today • Active and vibrant user community • Universities are teaching R • IT shops are adopting R • Companies are integrating R into their products • R jobs and demand for R skills on the rise • Unfortunately, R is not built for Big Data: working with large datasets is limited by RAM
  • 43. 44 R and Big Data: Various approaches • R + Hadoop streaming (see the sketch after this slide) • Use R as a scripting language • Write map/reduce jobs in R • Or Python for mappers, and R for the reducers • R + open-source packages for Hadoop • RHadoop • rmr2: Write low-level map/reduce with API in R • rhbase and rhdfs: Simple R API over Hadoop capabilities • RHIPE • R + SparkR + Hadoop • R frontend to Spark in-memory structures and data from Hadoop • R + High Level Languages • Jaql <-> R “bridge” • R interfaces over non-R engines • ScaleR from Revolution R • Oracle’s ORE, Netezza’s NZR, Teradata’s teradataR, +++
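As a concrete illustration of the first approach, here is a minimal Hadoop-streaming mapper written in plain R (a sketch using the airline schema from the later slides, where field 9 is the carrier, 1 the year, 2 the month and 16 the departure delay; the reducer and the hadoop jar invocation are omitted):

    #!/usr/bin/env Rscript
    # Streaming mapper: emit (carrier-year-month TAB DepDelay) per record.
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      f <- strsplit(line, ",")[[1]]
      if (length(f) >= 16 && f[1] != "Year" && f[16] != "NA")
        cat(paste(f[9], f[1], f[2], sep = "-"), "\t", f[16], "\n", sep = "")
    }
    close(con)

Hadoop streaming pipes input splits to this script's stdin and shuffles its tab-separated output to the reducers.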
  • 44. 45 RHadoop Example Given a table describing flight information for every airline spanning several decades: “Carrier”, “Year”, “Month”, “Day”, “DepDelay”, … What’s the mean departure delay (“DepDelay”) for each airline for each month?
  • 45. 46 RHadoop Example (contd.) Map: key = NULL, value = vector (columns in a line) → key = c(Carrier, Year, Month), value = DepDelay. Shuffle: groups the values so that key = c(Carrier, Year, Month), value = vector(DepDelay). Reduce: key = c(Carrier, Year, Month), value = mean(DepDelay).
  • 46. 47 RHadoop Example (contd.)
    csvtextinputformat = function(line) keyval(NULL, unlist(strsplit(line, ",")))

    deptdelay = function (input, output) {
      mapreduce(input = input,
                output = output,
                textinputformat = csvtextinputformat,
                map = function(k, fields) {
                  # Skip header lines and bad records:
                  if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
                    deptDelay <- fields[[16]]
                    # Skip records where departure delay is "NA":
                    if (!(identical(deptDelay, "NA"))) {
                      # field[9] is carrier, field[1] is year, field[2] is month:
                      keyval(c(fields[[9]], fields[[1]], fields[[2]]), deptDelay)
                    }
                  }
                },
                reduce = function(keySplit, vv) {
                  keyval(keySplit[[2]],
                         c(keySplit[[3]], length(vv), keySplit[[1]],
                           mean(as.numeric(vv))))
                })
    }

    from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
Source: http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html
  • 47. 48 What is “Big R” • Explore, visualize, transform, and model big data using familiar R syntax and paradigm • Scale out R with MR programming • Partitioning of large data • Parallel cluster execution of R code • Distributed Machine Learning • A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R [diagram: R clients over the data sources, with IBM R packages on both sides; three modes: (1) pull data (summaries) to the R client, (2) embedded R execution, i.e. push R functions right onto the data, (3) scalable machine learning]
  • 48. Let's mix it up a little ... ... with a demo interspersed with slides.
  • 49. 50 Big R Demo Data • Publicly available “airline” data • 22 years of actual arrival/departure information • Every scheduled flight in the US • 1987 onwards • From U.S. Department of Transportation • Cleansed version • http://stat-computing.org/dataexpo/2009/the-data.html
  • 50. 51 Airline data description
    Year: 1987-2008
    Month: 1-12
    DayofMonth: 1-31
    DayOfWeek: 1 (Monday) - 7 (Sunday)
    DepTime: actual departure time (local, hhmm)
    CRSDepTime: scheduled departure time (local, hhmm)
    ArrTime: actual arrival time (local, hhmm)
    CRSArrTime: scheduled arrival time (local, hhmm)
    UniqueCarrier: unique carrier code
    FlightNum: flight number
    TailNum: plane tail number
    ActualElapsedTime: in minutes
    CRSElapsedTime: in minutes
    AirTime: in minutes
    ArrDelay: arrival delay, in minutes
  • 51. Airline data description (contd.)
    Origin: origin IATA airport code
    Dest: destination IATA airport code
    Distance: in miles
    TaxiIn: taxi in time, in minutes
    TaxiOut: taxi out time, in minutes
    Cancelled: was the flight cancelled?
    CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
    Diverted: 1 = yes, 0 = no
    CarrierDelay: in minutes
    WeatherDelay: in minutes
    NASDelay: in minutes
    SecurityDelay: in minutes
    LateAircraftDelay: in minutes
  • 52. 53 Explore, visualize, transform, model Hadoop data with R • Represent Big Data objects as R datatypes • R's programming syntax and paradigm • Data stays in HDFS • R classes (e.g. bigr.data.frame) as proxies • No low-level map/reduce constructs • No underlying scripting languages
    # Construct a bigr.frame to access large data set
    air <- bigr.frame(dataPath="airline_demo.csv", …)

    # How many flights were flown by United or Delta?
    length(UniqueCarrier[UniqueCarrier %in% c("UA", "DL")])

    # Filter all flights that were delayed by 15+ minutes at departure or arrival.
    airSubset <- air[air$Cancelled == 0
                     & (air$DepDelay >= 15 | air$ArrDelay >= 15),
                     c("UniqueCarrier", "Origin", "Dest", "DepDelay", "ArrDelay")]

    # For these filtered flights, compute key statistics (# of flights,
    # average flying distance and flying time), grouped by airline
    summary(count(UniqueCarrier) + mean(Distance) + mean(CRSElapsedTime)
            ~ UniqueCarrier, dataset = airSubset)

    bigr.boxplot(air$Distance ~ air$UniqueCarrier, …)
  • 53. 54 Scale out R in Hadoop • Support parallel / partitioned execution of R • Work around R’s memory limitations • Execute R snippets on chunks of data • Partitioned by key, by #rows, via sampling, … • Follows R’s “apply” model • Parallelized seamlessly by the Map/Reduce engine
    # Filter the airline data on United and Hawaiian
    bf <- air[air$UniqueCarrier %in% c("HA", "UA"),]

    # Build one decision-tree model per airline
    models <- groupApply(data = bf,
                         groupingColumns = list(bf$UniqueCarrier),
                         rfunction = function(df) {
                           library(rpart)
                           predcols <- c('ArrDelay', 'DepDelay', 'DepTime', 'Distance')
                           return (rpart(ArrDelay ~ ., df[,predcols]))
                         })

    # Pull the model for HA to the client
    modelHA <- bigr.pull(models$HA)

    # Visualize the model
    prettyTree(modelHA)
  • 54. Big R API: Core Classes & Methods • Connection handling: bigr.connect() and bigr.disconnect(), is.bigr.connected() • bigr.frame: modeled after R’s data.frame; proxy for tabular data • bigr.vector: modeled after R’s vector datatype; proxy for a column • bigr.list: (loosely) modeled after R’s list; proxy for collections of serialized R objects • Basic exploration • head(), tail() • dim(), length(), nrow(), ncol() • str(), summary() • Selection and Projections • [ • $ • Arithmetic and Logical operators • +, -, /, * • &, |, ! • ifelse() • String and Math functions • Lots of these • Other relational operators • table() • unique() • merge() • summary() • sort() • na.omit(), na.exclude() • Data movement • as.data.frame(), as.bigr.frame() • bigr.persist() • Sampling • bigr.sample() • bigr.random() • Visualizations • built into R packages (e.g. ggplot2)
  • 55. Big R API: Core Classes & Methods (contd.) • Summarization and aggregation • summary(), very powerful when used with “formula” notation • max(), min(), range() • Mean, variance and standard deviation • mean() • var() • sd() • Correlation • cov() and cor() • Hadoop options • bigr.get.server.option() • bigr.set.server.option() • bigr.debug(T) • Useful for servicing • Will print out internal debugging output • Catalog access • bigr.listfs() • bigr.listTables(), bigr.listColumns() • groupApply() • Primary function for embedded execution • Can return tables or objects • Run “help(groupApply)” inside R for extensive documentation • Examining R execution logs • bigr.logs() • Other *Apply functions • rowApply(), for running R on batches of rows • tableApply(), for running R on entire dataset
  • 56. 57 Example: What is the average scheduled flight time, actual gate-to-gate time, and actual airtime for each city pair per year?
RHadoop Implementation (>35 lines of code):
    mapper.year.market.enroute_time = function(key, val) {
      # Skip header lines, cancellations, and diversions:
      if ( !identical(as.character(val['Year']), 'Year')
           & identical(as.numeric(val['Cancelled']), 0)
           & identical(as.numeric(val['Diverted']), 0) ) {
        # We don't care about direction of travel, so construct 'market'
        # with airports ordered alphabetically
        # (e.g., LAX to JFK becomes 'JFK-LAX')
        if (val['Origin'] < val['Dest'])
          market = paste(val['Origin'], val['Dest'], sep='-')
        else
          market = paste(val['Dest'], val['Origin'], sep='-')
        # key consists of year, market
        output.key = c(val['Year'], market)
        # output gate-to-gate elapsed times (CRS and actual) + time in air
        output.val = c(val['CRSElapsedTime'], val['ActualElapsedTime'], val['AirTime'])
        return( keyval(output.key, output.val) )
      }
    }

    reducer.year.market.enroute_time = function(key, val.list) {
      # val.list is a list of row vectors;
      # a data.frame is a list of column vectors;
      # plyr's ldply() is the easiest way to convert IMHO
      if ( require(plyr) )
        val.df = ldply(val.list, as.numeric)
      else {
        # this is as close as my deficient *apply skills can come w/o plyr
        val.list = lapply(val.list, as.numeric)
        val.df = data.frame( do.call(rbind, val.list) )
      }
      # columns arrive in the order the mapper emitted them: CRS, actual, air
      colnames(val.df) = c('crs', 'actual', 'air')
      output.key = key
      output.val = c( nrow(val.df), mean(val.df$crs, na.rm=T),
                      mean(val.df$actual, na.rm=T), mean(val.df$air, na.rm=T) )
      return( keyval(output.key, output.val) )
    }

    mr.year.market.enroute_time = function (input, output) {
      mapreduce(input = input,
                output = output,
                input.format = asa.csvtextinputformat,
                map = mapper.year.market.enroute_time,
                reduce = reducer.year.market.enroute_time,
                backend.parameters = list(
                  hadoop = list(D = "mapred.reduce.tasks=10") ),
                verbose=T)
    }

    hdfs.output.path = file.path(hdfs.output.root, 'enroute-time')
    results = mr.year.market.enroute_time(hdfs.input.path, hdfs.output.path)
    results.df = from.dfs(results, to.data.frame=T)
    colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'in.air')
    save(results.df, file="out/enroute.time.RData")
Equivalent Big R Implementation (4 lines of code):
    air <- bigr.frame(dataPath = "airline.csv", dataSource = "DEL", na.string="NA")
    air$City1 <- ifelse(air$Origin < air$Dest, air$Origin, air$Dest)
    air$City2 <- ifelse(air$Origin >= air$Dest, air$Origin, air$Dest)
    summary(count(UniqueCarrier) + mean(ActualElapsedTime) + mean(CRSElapsedTime)
            + mean(AirTime) ~ Year + City1 + City2,
            dataset = air[air$Cancelled == 0 & air$Diverted == 0,])
  • 57. 58 Use Cases • Where Big R works well • When data can be partitioned cleanly ... • ... and each partition fits in the memory of a server node • In other words: • size of entire data can be bigger than cluster memory • size of individual partitions limited to node memory (due to R) • Real-world customer scenarios we’ve seen: • Model data from individual tractors • Build time-series anomaly detection on IP address pairs • Build models on each customer’s behavior • And where it doesn’t ... • When building one monolithic model on the entire data • without sampling • without using some form of blending such as ensembles
  • 58. 59 Large scale analytics in Hadoop • Some workloads are not logically partitionable, or partitions are still large • SystemML - a scalable engine running natively over Hadoop: • Deliver pre-built ML algorithms • Regression, Classification, Clustering, etc. • Ability to author new algorithms
    # Build a model on the entire data set
    model <- bigr.lm(ArrDelay ~ ., df)

    # Or, build several models, partitioned by airline
    models <- groupApply(input = air,
                         groupBy = air$UniqueCarrier,
                         function(df) {
                           # Linear regression
                           model <- bigr.lm(ArrDelay ~ ., df)
                           return (model)
                         })
  • 59. 60 Large scale analytics (SystemML) modules
Data Mining
• Association: Apriori and pattern growth for frequent itemsets; sequence miner
• Clustering: K-Means
• Dimension Reduction: Non-negative Matrix Factorizations; Principal Component Analysis (large n, small p); Singular Value Decompositions
• Time Series Analysis: Granger modeling
Predictive Analytics
• Regression: Linear Regression for large, sparse datasets; Generalized Linear Models
• Classification: Linear Logistic Regression (Trust Region Method); Linear SVMs (Modified Finite Newton Method); random decision trees for classification & regression
• Ranking: PageRank of a directed graph; HITS Hubs and Authorities
• Optimization: Conjugate Gradient for sparse linear systems; parallel optimization for sparse linear models; Stochastic Gradient Descent
• Outlier Detection: recursive binning and reprojection for distance-based outlier detection
Data Exploration
• Univariate: scale, nominal, ordinal variables. Scale: min, max, range, mean, variance, moments (kurtosis, skewness), order statistics (median, quantile, iqm), outliers. Categorical: mode, histograms
• Bivariate: scale/categorical: eta, F-test, grouped mean/variance/weight; categorical/categorical: Cramér's V, Pearson chi-square, Spearman
Recommender Systems
• Matrix Completion algorithms
Meta Learning
• Ensemble Learning; Cross Validation
  • 60. 61 Example - Recommendation Systems with Collaborative Filtering [diagram: sparse ratings matrix (people × movies, stored as (row, column, value) triplets) factored into W (people × K factors) and H (K factors × movies)] Analyzing both similarity of people and products
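To make the factorization idea concrete, a toy illustration in plain R (dimensions and values are made up; a real system learns W and H from the sparse ratings, e.g. with the GNMF updates shown two slides later):

    set.seed(1)
    k <- 2                                  # latent factors
    W <- matrix(runif(4 * k), nrow = 4)     # people x factors
    H <- matrix(runif(k * 5), nrow = k)     # factors x movies
    ratings_hat <- W %*% H                  # dense matrix of predicted ratings
    ratings_hat[1, 3]                       # predicted rating: person 1, movie 3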
  • 61. 62 Example: Topic Evolution in Social Media with Clustering [diagram: sparse documents × tokens matrix (stored as (row, column, value) triplets) factored into W (documents × K topics) and H (K topics × words)]
  • 62. 63 Example: Gaussian Non-negative Matrix Factorization
Java Implementation (>1500 lines of code): [wall of Hadoop MapReduce Java code shown on the slide: a MatrixGNMF driver plus UpdateWHStep1 through UpdateWHStep5 mapper/reducer classes that shuttle W, H and the intermediates X = t(W) %*% V, Y = t(W) %*% W %*% H (then X = V %*% t(H), Y = W %*% H %*% t(H)) through HDFS between jobs to compute H = H .* X ./ Y and W = W .* X ./ Y]
Equivalent Big R / SystemML Implementation (12 lines of code):
    # Perform matrix operations, say non-negative factorization
    # V ~~ WH
    V <- bigr.matrix("V.mtx");                  # initial matrix on HDFS
    W <- bigr.matrix(nrow=nrow(V), ncols=k);    # initialize starting points
    H <- bigr.matrix(nrow=k, ncols=ncols(V));
    for (i in 1:numiterations) {
      H <- H * (t(W) %*% V / t(W) %*% W %*% H);
      W <- W * (V %*% t(H) / W %*% H %*% t(H));
    }
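These are the standard Lee-Seung multiplicative updates for NMF. A quick local sanity check in plain R (our own illustration on small dense data, not Big R) shows the reconstruction error is non-increasing under them:

    set.seed(7)
    V <- matrix(runif(20 * 10), 20, 10)    # toy non-negative matrix
    k <- 3
    W <- matrix(runif(20 * k), 20, k)
    H <- matrix(runif(k * 10), k, 10)
    err <- function() norm(V - W %*% H, "F")
    before <- err()
    for (i in 1:100) {
      H <- H * (t(W) %*% V) / (t(W) %*% W %*% H)
      W <- W * (V %*% t(H)) / (W %*% H %*% t(H))
    }
    c(before = before, after = err())      # Frobenius error shrinks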
  • 63. 64 Summary - Popular R Analytics on Hadoop • R is the preferred language for Data Scientists • Many approaches exist to enable R on Hadoop, with pros and cons • “Big R” approach: • Data exploration, statistical analysis and visualization with natural R syntax • Scale out R with data partitioning • Support for standard R tools and existing packages and libraries • SystemML engine for Distributed Machine Learning that provides canned algorithms, and an ability to author new ones, all via R-like syntax [diagram: R clients on top of an R runtime and a distributed ML runtime over Hadoop, reading Hive tables, HBase tables, and files]

Editor's Notes

  1. http://www-01.ibm.com/software/data/infosphere/biginsights/