© Hortonworks Inc. 2014
Hortonworks
Bayesian Networks with R and Hadoop
Hadoop Summit, June 2014
Ofer Mendelevitch
A bit about me
Ofer Mendelevitch
Director, Data Science @ Hortonworks
Previously: Nor1, Yahoo!, Risk Insight, Quiver
Personal blog: www.achessdad.com
What I will cover today…
• What is a Bayesian Network?
• Why I think it’s cool
• Bayesian networks with R: the bnlearn package
• Bayesian network inference with R and Hadoop
Introduction to Bayesian Networks
(with examples using R)
Example: “Asia” Bayesian Network
Each node is a random variable: yes/no
[Figure: the Asia network graph. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]
Example: “Asia” Bayesian Network
Graph structure reflects “causal” relationships
[Figure: the Asia network graph. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]
Example: “Asia” Bayesian Network
node CPT: P(node | parents)
[Figure: the Asia network graph. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]
CPT for SoB:
Tub or Cancer   Bronchitis   SoB = T   SoB = F
T               T            0.7       0.3
F               T            0.4       0.6
T               F            0.45      0.55
F               F            0.05      0.95
What is a (discrete) Bayesian Network?
(also called Bayes Nets, Belief Nets, etc)
• A network structure (DAG):
– Nodes => random variables, taking discrete values
– Edges => conditional dependencies
• E.g., lung cancer is statistically dependent on smoking
• A set of conditional probability tables (CPTs):
– Each node has a set of parents, determined by the graph
– CPT holds P(node | parent-A, parent-B, …) for each node
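As a minimal sketch of how the graph and the CPTs combine, the base-R snippet below encodes a three-node fragment (Tub-or-cancer and Bronchitis as parents of SoB) and multiplies CPT entries to get a joint probability. The SoB numbers are the illustrative CPT values from the Asia example; the two priors are made-up values, and treating both parents as root nodes is a simplifying assumption for this fragment only.

```r
# Hypothetical priors for the two parent nodes (illustration only).
p_either <- c(yes = 0.06, no = 0.94)
p_bronc  <- c(yes = 0.45, no = 0.55)

# P(SoB = yes | Tub-or-cancer, Bronchitis), taken from the example CPT.
# Rows: Tub-or-cancer; columns: Bronchitis.
p_sob_yes <- matrix(c(0.70, 0.45,
                      0.40, 0.05),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(either = c("yes", "no"),
                                    bronc  = c("yes", "no")))

# Chain rule over the DAG:
# P(either, bronc, sob) = P(either) * P(bronc) * P(sob | either, bronc)
joint <- function(either, bronc, sob) {
  p_yes <- p_sob_yes[either, bronc]
  p_sob <- if (sob == "yes") p_yes else 1 - p_yes
  unname(p_either[either] * p_bronc[bronc] * p_sob)
}

# Enumerate all 8 configurations; a valid factorization sums to 1.
states <- expand.grid(either = c("yes", "no"), bronc = c("yes", "no"),
                      sob = c("yes", "no"), stringsAsFactors = FALSE)
total <- sum(mapply(joint, states$either, states$bronc, states$sob))

print(joint("no", "yes", "yes"))
print(total)
```

Each node contributes exactly one factor, which is why the model only needs to store one table per node rather than the full joint distribution.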
Why are Bayesian Networks cool?
• Intuitive/adaptive modeling tool:
– Graphs are natural for modeling relationships
– Easy to combine data-driven learning with expert know-how
– You can start small, and add knowledge as it is acquired
• “Naturally” addresses inference with missing values
• Inference can be applied to any variable/node
– As opposed to a single (target) variable in supervised learning
Bayesian networks have been successfully used for
a variety of real-world applications
• Healthcare: medical diagnosis, genetic modeling
• Security: crime pattern analysis, terrorism risk
management
• Education: student modeling
• Finance: credit rating, predicting defaults
• Tech support: troubleshooting for computers/printers
See “Bayesian networks: a practical guide to applications”, Pourret et al
Bayesian networks with R
• http://cran.r-project.org/web/views/Bayesian.html
• We will focus on “bnlearn” (by Marco Scutari)
– Implements various structure learning algorithms (hc, tabu,
gs, iamb, mmhc, rsmax2, etc)
– Provides automated learning of CPT
– Approximate inference: “logic sampling” and “likelihood
weighting”
– Supports snow/parallel for some algorithms
Step 1: Constructing the graph
[Figure: the Asia network graph. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]
• Manually (expert knowledge)
• Automatically from data
Manual graph construction: Asia
> library(bnlearn)
> varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
> ag = empty.graph(varnames)
> # One row per directed edge: from parent to child
> arcs(ag) = data.frame(
+   "from" = c("Asia", "Smoking", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC"),
+   "to" = c("Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC", "SoB", "X-ray", "SoB"))
> graphviz.plot(ag)
Automated graph construction: Asia
> library(bnlearn)
> varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
> data(asia); names(asia) = varnames
> bg = hc(asia)
> graphviz.plot(bg)
Automated learning does not always work
perfectly…
For example:
• May not learn all the “expected” edges
• May learn in the wrong direction
Therefore, in practice it helps to:
• Provide whitelist and blacklist to the algorithm
• Pre-seed with a manual network structure, and let the
algorithm learn from there
• Ensemble learning of structure (see boot.strength)
Step 2: Learning the CPT / probabilities
[Figure: the Asia network graph. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]
CPT for SoB:
Tub or Cancer   Bronchitis   SoB = T   SoB = F
T               T            0.85      0.15
F               T            0.79      0.21
T               F            0.73      0.27
F               F            0.1       0.9
Learning CPT for each node in the graph
> fitted = bn.fit(ag, asia)
> print(fitted$SoB)
  Parameters of node SoB (multinomial distribution)
Conditional probability table:
, , Tub-or-LC = no
     Bronchitis
SoB           no        yes
  no  0.90017286 0.21373057
  yes 0.09982714 0.78626943
, , Tub-or-LC = yes
     Bronchitis
SoB           no        yes
  no  0.27737226 0.14592275
  yes 0.72262774 0.85407725
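For discrete nodes, a table like this corresponds to conditional relative frequencies in the data. The base-R sketch below shows the counting principle on a tiny made-up data frame (the ten rows are hypothetical); it illustrates the idea only and is not bnlearn's internal implementation.

```r
# Tiny made-up sample: one parent (Bronchitis) and one child (SoB).
d <- data.frame(
  Bronchitis = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  SoB        = c("yes", "yes", "yes", "no",  "yes", "no", "no", "no", "no", "no"),
  stringsAsFactors = TRUE
)

# Count co-occurrences, then normalise within each parent state (column),
# so that every column is a distribution P(SoB | Bronchitis).
counts <- table(SoB = d$SoB, Bronchitis = d$Bronchitis)
cpt <- prop.table(counts, margin = 2)

print(counts)
print(cpt)
```

With more than one parent, the same normalisation is applied within each combination of parent states, which is why the printed CPT above has one slice per value of Tub-or-LC.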
Using the BN for inference
• Given evidence: (1) visit to Asia, (2) SoB, (3) Bronchitis
• What is the probability of “lung cancer”?
[Figure: the Asia network graph. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]
Inferring with missing values
• We provide evidence (“yes” or “no” in this case) only
for those nodes where we have such evidence
• If a value is “missing” it’s just not included in the
evidence when doing inference…
This is in contrast to supervised learning, where ALL
values are typically needed for inference.
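A small base-R sketch of this idea, using inference by enumeration on a three-node fragment (Tub-or-cancer and Bronchitis as parents of SoB). The priors are made-up values and the fragment is a simplification of the Asia network; the SoB numbers reuse the illustrative CPT. Any variable left out of the evidence list is simply summed out.

```r
# Hypothetical priors; SoB table reuses the illustrative CPT values.
p_either <- c(yes = 0.06, no = 0.94)
p_bronc  <- c(yes = 0.45, no = 0.55)
p_sob_yes <- matrix(c(0.70, 0.45, 0.40, 0.05), nrow = 2, byrow = TRUE,
                    dimnames = list(either = c("yes", "no"),
                                    bronc  = c("yes", "no")))

# Full joint table over all 8 configurations.
tab <- expand.grid(either = c("yes", "no"), bronc = c("yes", "no"),
                   sob = c("yes", "no"), stringsAsFactors = FALSE)
p_yes <- p_sob_yes[cbind(tab$either, tab$bronc)]
tab$p <- unname(p_either[tab$either] * p_bronc[tab$bronc]) *
  ifelse(tab$sob == "yes", p_yes, 1 - p_yes)

# Posterior of `query` given a named list of evidence. Variables that are
# absent from the evidence are marginalised by the sum.
posterior <- function(query, evidence) {
  keep <- rep(TRUE, nrow(tab))
  for (v in names(evidence)) keep <- keep & tab[[v]] == evidence[[v]]
  tapply(tab$p[keep], tab[[query]][keep], sum) / sum(tab$p[keep])
}

post_missing  <- posterior("either", list(sob = "yes"))                # Bronchitis missing
post_observed <- posterior("either", list(sob = "yes", bronc = "yes")) # Bronchitis observed
print(post_missing)
print(post_observed)
```

The same function answers both queries; the only difference is which entries appear in the evidence list.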
Exact Inference with gRain
• The gRain package implements exact inference for
discrete Bayesian Networks using the “Junction Tree”
belief propagation algorithm
• bnlearn and gRain cooperate nicely
> library(gRain)
> jtree = compile(as.grain(fitted))
> jp = setFinding(jtree, nodes = c("Asia", "SoB", "Bronchitis"),
+                 states = c("yes", "yes", "yes"))
> print(querygrain(jp, nodes="LC")$LC)
LC
   no   yes
0.934 0.066
Approximate inference with bnlearn
bnlearn implements approximate inference: logic
sampling (aka rejection sampling) and likelihood
weighting
> # Infer probability P(SoB | Asia, Bronchitis) using logic sampling
> p1 = cpquery(fitted, event = (SoB == 'yes'),
+              evidence = (Asia == 'yes' & Bronchitis == 'yes'), method="ls")
> print(p1)
[1] 0.8014706
> # Infer probability P(SoB | Asia, Bronchitis) using likelihood weighting
> # (for "lw" the evidence is a named list of node values)
> evidence = list("yes", "yes")
> names(evidence) = c("Asia", "Bronchitis")
> p2 = cpquery(fitted, event = (SoB == 'yes'), evidence = evidence, method="lw")
> print(p2)
[1] 0.795404
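To make the difference between the two sampling schemes concrete, here is a base-R sketch on a three-node fragment (Tub-or-cancer and Bronchitis as parents of SoB). It is a hand-written illustration under assumed priors, not bnlearn's code: logic sampling draws every node and discards samples that contradict the evidence, while likelihood weighting fixes the evidence node and weights each sample by the probability of that evidence given its sampled parents.

```r
set.seed(42)

# Hypothetical priors; SoB table reuses the illustrative CPT values.
p_either <- c(yes = 0.06, no = 0.94)
p_bronc  <- c(yes = 0.45, no = 0.55)
p_sob_yes <- matrix(c(0.70, 0.45, 0.40, 0.05), nrow = 2, byrow = TRUE,
                    dimnames = list(either = c("yes", "no"),
                                    bronc  = c("yes", "no")))
n <- 100000

# Sample the root nodes from their priors.
either <- sample(c("yes", "no"), n, replace = TRUE, prob = p_either)
bronc  <- sample(c("yes", "no"), n, replace = TRUE, prob = p_bronc)
p_yes  <- p_sob_yes[cbind(either, bronc)]   # P(SoB = yes | sampled parents)

# Logic (rejection) sampling: sample SoB too, keep only SoB == "yes".
sob    <- ifelse(runif(n) < p_yes, "yes", "no")
ls_est <- mean(either[sob == "yes"] == "yes")

# Likelihood weighting: clamp SoB = "yes", weight each sample by p_yes.
lw_est <- sum(p_yes[either == "yes"]) / sum(p_yes)

# Exact answer by enumeration, for comparison.
marg  <- p_either * as.vector(p_sob_yes %*% p_bronc)
exact <- marg[["yes"]] / sum(marg)

print(c(logic_sampling = ls_est, likelihood_weighting = lw_est, exact = exact))
```

Logic sampling throws away every sample where SoB is "no", so its effective sample size shrinks as the evidence becomes rarer; likelihood weighting keeps all samples but gives them unequal weights.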
Large scale Bayes Networks
Inference with R and Hadoop
What is large?
• Number of nodes:
– 10s: Medium
– 100s: Large
– 1000s: Very large
• Number of instances:
– 100,000s to millions
Manually constructing large graphs is hard
Large scale learning in practice:
manual + automated
• Define nodes
• Seed with some known edges, based on expert
knowledge
• Augment with automated learning (e.g., hc, tabu,
rsmax2, etc)
Large scale inference: Exact or Approximate?
Method                          Pros                           Cons
Exact (Jtree), gRain            Fast inference time            Computational complexity determined
                                                               (exponentially) by largest clique size
Approximate (LS, LW), bnlearn   Can be used for any graph      Inference is often much slower
                                Not limited by “clique” size   Not accurate for rare events
About RHadoop/RMR
• An open source project, supported by Revolution Analytics
• Various sub-projects: RMR, RHDFS, RHBASE, plyrmr, etc
• We will focus on RMR
– Implement mapper/reducer code using R
• RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
• Installing RMR on HDP:
http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
http://www.research.janahang.com/install-rhadoop-on-hortonworks-hdp-2-0/
Large scale inference with R and Hadoop
[Diagram: Infer with RMR. The instances file is split into chunks 1..N; on the Hadoop cluster, RMR runs no-op mappers over the chunks and passes the rows to reducers; each reducer uses the BN model to run CPQuery and writes the results.]
Inference is embarrassingly parallel
Hadoop determines # of mappers, based on file size
So we’ll use reducers to parallelize CPQuery
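The keying trick can be tried locally in base R without a cluster. The sketch below is only a simulation of the map, shuffle and reduce steps: `rows` stands in for lines of the instances file and `score_row` is a hypothetical placeholder for the per-row query.

```r
set.seed(1)
NUM_REDUCERS <- 4
rows <- sprintf("row-%02d", 1:20)   # stand-ins for lines of the instances file

# Hypothetical per-row work; in the real job this would be the CPQuery call.
score_row <- function(row) nchar(row)

# "Map": attach a random key in 0..NUM_REDUCERS-1 to every row.
keys <- floor(runif(length(rows), 0, NUM_REDUCERS))

# "Shuffle": group rows by key, as the framework would before reducing.
groups <- split(rows, keys)

# "Reduce": every group is processed independently of the others.
results <- lapply(groups, function(chunk) vapply(chunk, score_row, integer(1)))

print(lengths(groups))
```

Because the keys are drawn uniformly at random, each reducer receives roughly the same number of rows, and the number of parallel workers is controlled by the number of reducers rather than by the input file size.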
Example: Adult dataset
• Donated by Ronny Kohavi and Barry Becker, 1996 -
http://archive.ics.uci.edu/ml/datasets/Adult
• Extracted from 1994 census data
• 48842 instances, 14 features such as:
– Age, country, occupation, marital status, capital gain, etc
– Goal: predict if income is >50K or not
…
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
…
Sample learned network structure for “adult”
Inference with RMR on adult dataset
library(rmr2)  # RMR: the RHadoop MapReduce package
NUM_REDUCERS = 4
opt = rmr.options(backend = "hadoop",
                  backend.parameters = list(hadoop = list(D = "mapreduce.reduce.memory.mb=1024",
                                                          D = paste0("mapreduce.job.reduces=", NUM_REDUCERS))))
inpFile = 'adult.test'
outFile = 'adult.out'
mapreduce(input = inpFile, input.format = "text",
          output = outFile, output.format = "csv",
          map = map_func, reduce = reduce_func)
Our mapper: passing on to reducer…
map_func <- function(., values)
{
  out_klist = list(); out_vlist = list()
  for (v in values) {
    fvec = unlist(strsplit(v, ',', fixed=TRUE))  # read row and split into columns
    if (length(fvec) < 15) { next }              # skip rows not in the expected format
    key = floor(runif(1, 0, NUM_REDUCERS))       # random key spreads rows across reducers
    out_klist = c(out_klist, key)
    out_vlist = c(out_vlist, v)
  }
  return (keyval(out_klist, out_vlist))
}
Our reducer: where all the action happens
trim <- function (x) gsub("^\\s+|\\s+$", "", x)  # strip leading/trailing whitespace
reduce_func <- function(., values)
{
  out_klist = list(); out_vlist = list()
  for (v in values) {
    increment.counter('bn-demo', 'row', 1)  # to let MR know we are still active
    fvec = sapply(strsplit(v, ',', fixed=TRUE), trim)  # read row and split into columns
    names(fvec) = c("age", "type_employer", "fnlwgt", "education", "education_num", "marital", "occupation", "relationship",
                    "race", "sex", "capital_gain", "capital_loss", "hr_per_week", "country", "income")
    pv = dataprep(fvec)  # transform to "learned" features
    evidence = as.list(pv[1, setdiff(colnames(pv), 'income')])  # every feature except the target
    prob = cpquery(fitted, event = (income == ">50K"), evidence = evidence, method="lw")
    out_klist = c(out_klist, v)
    out_vlist = c(out_vlist, format(prob, digits=2))
  }
  return (keyval(out_klist, out_vlist))
}
Example output: adult.out
26, Private, 191573, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K.,0.37
52, Private, 203635, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K.,0.14
36, Private, 68798, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.,0.019
34, Private, 31752, HS-grad, 9, Divorced, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, ?, <=50K.,0.14
59, ?, 291856, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K.,0.074
26, Private, 135848, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 10, Guatemala, <=50K.,0.03
50, Local-gov, 237356, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.89
56, Self-emp-not-inc, 140729, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K.,0.14
22, Private, 54560, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K.,0.21
45, Self-emp-inc, 88500, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.94
More information
• Detailed step-by-step guide and code used can be found on:
https://github.com/ofermend/bayes-net-r-hadoop
• Download Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
• Further reading/learning:
– http://www.bnlearn.com/
– PGM class on Coursera:
https://www.coursera.org/course/pgm
– PGM Ebook from UCL:
http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/250214.pdf
– Many others…
Thank you!
Any Questions?
Ofer Mendelevitch, ofer@hortonworks.com, @ofermend
We’re hiring! www.hortonworks.com/careers
Hortonworks training: www.hortonworks.com/training
Hortonworks blog: www.hortonworks.com/blog

More Related Content

What's hot

Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...Universitat Politècnica de Catalunya
 
Modeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverModeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverSebastian Ruder
 
Usage of Generative Adversarial Networks (GANs) in Healthcare
Usage of Generative Adversarial Networks (GANs) in HealthcareUsage of Generative Adversarial Networks (GANs) in Healthcare
Usage of Generative Adversarial Networks (GANs) in HealthcareGlobalLogic Ukraine
 
A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic...
A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic...A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic...
A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic...Databricks
 
Dynamic Two-Stage Image Retrieval from Large Multimodal Databases
Dynamic Two-Stage Image Retrieval from Large Multimodal DatabasesDynamic Two-Stage Image Retrieval from Large Multimodal Databases
Dynamic Two-Stage Image Retrieval from Large Multimodal DatabasesKonstantinos Zagoris
 
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMLAI2
 
Striving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational ModellingStriving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational ModellingMarco Wirthlin
 
Backbone can not be trained at once rolling back to pre trained network for p...
Backbone can not be trained at once rolling back to pre trained network for p...Backbone can not be trained at once rolling back to pre trained network for p...
Backbone can not be trained at once rolling back to pre trained network for p...NAVER Engineering
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsAlbert Bifet
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 Albert Bifet
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use caseinovex GmbH
 
【DL輪読会】Spectral Normalisation for Deep Reinforcement Learning: An Optimisatio...
【DL輪読会】Spectral Normalisation for Deep Reinforcement Learning: An Optimisatio...【DL輪読会】Spectral Normalisation for Deep Reinforcement Learning: An Optimisatio...
【DL輪読会】Spectral Normalisation for Deep Reinforcement Learning: An Optimisatio...Deep Learning JP
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksZak Jost
 
Using Deep Learning to Find Similar Dresses
Using Deep Learning to Find Similar DressesUsing Deep Learning to Find Similar Dresses
Using Deep Learning to Find Similar DressesHJ van Veen
 
Diving into Deep Learning (Silicon Valley Code Camp 2017)
Diving into Deep Learning (Silicon Valley Code Camp 2017)Diving into Deep Learning (Silicon Valley Code Camp 2017)
Diving into Deep Learning (Silicon Valley Code Camp 2017)Oswald Campesato
 
Recommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRecommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRitesh Sawant
 

What's hot (20)

Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
 
Modeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverModeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John Glover
 
Usage of Generative Adversarial Networks (GANs) in Healthcare
Usage of Generative Adversarial Networks (GANs) in HealthcareUsage of Generative Adversarial Networks (GANs) in Healthcare
Usage of Generative Adversarial Networks (GANs) in Healthcare
 
A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic...
A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic...A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic...
A Distributed Deep Learning Approach for the Mitosis Detection from Big Medic...
 
Icml2017 overview
Icml2017 overviewIcml2017 overview
Icml2017 overview
 
Dynamic Two-Stage Image Retrieval from Large Multimodal Databases
Dynamic Two-Stage Image Retrieval from Large Multimodal DatabasesDynamic Two-Stage Image Retrieval from Large Multimodal Databases
Dynamic Two-Stage Image Retrieval from Large Multimodal Databases
 
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
 
Deeplearning in finance
Deeplearning in financeDeeplearning in finance
Deeplearning in finance
 
Striving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational ModellingStriving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational Modelling
 
Icml2018 naver review
Icml2018 naver reviewIcml2018 naver review
Icml2018 naver review
 
Backbone can not be trained at once rolling back to pre trained network for p...
Backbone can not be trained at once rolling back to pre trained network for p...Backbone can not be trained at once rolling back to pre trained network for p...
Backbone can not be trained at once rolling back to pre trained network for p...
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data Streams
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
SEGAN: Speech Enhancement Generative Adversarial Network
SEGAN: Speech Enhancement Generative Adversarial NetworkSEGAN: Speech Enhancement Generative Adversarial Network
SEGAN: Speech Enhancement Generative Adversarial Network
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
 
【DL輪読会】Spectral Normalisation for Deep Reinforcement Learning: An Optimisatio...
【DL輪読会】Spectral Normalisation for Deep Reinforcement Learning: An Optimisatio...【DL輪読会】Spectral Normalisation for Deep Reinforcement Learning: An Optimisatio...
【DL輪読会】Spectral Normalisation for Deep Reinforcement Learning: An Optimisatio...
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial Networks
 
Using Deep Learning to Find Similar Dresses
Using Deep Learning to Find Similar DressesUsing Deep Learning to Find Similar Dresses
Using Deep Learning to Find Similar Dresses
 
Diving into Deep Learning (Silicon Valley Code Camp 2017)
Diving into Deep Learning (Silicon Valley Code Camp 2017)Diving into Deep Learning (Silicon Valley Code Camp 2017)
Diving into Deep Learning (Silicon Valley Code Camp 2017)
 
Recommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRecommendation system using collaborative deep learning
Recommendation system using collaborative deep learning
 

Similar to Bayesian Networks with R and Hadoop

Hw09 Protein Alignment
Hw09   Protein AlignmentHw09   Protein Alignment
Hw09 Protein AlignmentCloudera, Inc.
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017philippbayer
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibilityc.titus.brown
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"Portland R User Group
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document usefulssuser3c3f88
 
Project Presentation
Project PresentationProject Presentation
Project Presentationbutest
 
“Towards Multi-Step Expert Advice for Cognitive Computing” - Dr. Achim Rettin...
“Towards Multi-Step Expert Advice for Cognitive Computing” - Dr. Achim Rettin...“Towards Multi-Step Expert Advice for Cognitive Computing” - Dr. Achim Rettin...
“Towards Multi-Step Expert Advice for Cognitive Computing” - Dr. Achim Rettin...diannepatricia
 
Deployment of Randomised Optimisation Algorithms Benchmarking in DAPHNE
Deployment of Randomised Optimisation Algorithms Benchmarking in DAPHNEDeployment of Randomised Optimisation Algorithms Benchmarking in DAPHNE
Deployment of Randomised Optimisation Algorithms Benchmarking in DAPHNEUniversity of Maribor
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabSri Ambati
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbugc.titus.brown
 
Link prediction with the linkpred tool
Link prediction with the linkpred toolLink prediction with the linkpred tool
Link prediction with the linkpred toolRaf Guns
 
Intro to H2O Machine Learning in R at Santa Clara University
Intro to H2O Machine Learning in R at Santa Clara UniversityIntro to H2O Machine Learning in R at Santa Clara University
Intro to H2O Machine Learning in R at Santa Clara UniversitySri Ambati
 
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksSequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksNguyen Quang
 

Similar to Bayesian Networks with R and Hadoop (20)

2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2014 toronto-torbug
2014 toronto-torbug2014 toronto-torbug
2014 toronto-torbug
 
Hw09 Protein Alignment
Hw09   Protein AlignmentHw09   Protein Alignment
Hw09 Protein Alignment
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
Naive.pdf
Naive.pdfNaive.pdf
Naive.pdf
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
Project Presentation
Project PresentationProject Presentation
Project Presentation
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
“Towards Multi-Step Expert Advice for Cognitive Computing” - Dr. Achim Rettin...
“Towards Multi-Step Expert Advice for Cognitive Computing” - Dr. Achim Rettin...“Towards Multi-Step Expert Advice for Cognitive Computing” - Dr. Achim Rettin...
“Towards Multi-Step Expert Advice for Cognitive Computing” - Dr. Achim Rettin...
 
Deployment of Randomised Optimisation Algorithms Benchmarking in DAPHNE
Deployment of Randomised Optimisation Algorithms Benchmarking in DAPHNEDeployment of Randomised Optimisation Algorithms Benchmarking in DAPHNE
Deployment of Randomised Optimisation Algorithms Benchmarking in DAPHNE
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Distributed deep learning_over_spark_20_nov_2014_ver_2.8Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
 
Link prediction with the linkpred tool
Link prediction with the linkpred toolLink prediction with the linkpred tool
Link prediction with the linkpred tool
 
Intro to H2O Machine Learning in R at Santa Clara University
Intro to H2O Machine Learning in R at Santa Clara UniversityIntro to H2O Machine Learning in R at Santa Clara University
Intro to H2O Machine Learning in R at Santa Clara University
 
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksSequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Bayesian Networks with R and Hadoop

  • 1. © Hortonworks Inc. 2014 Hortonworks Bayesian Networks with R and Hadoop Hadoop Summit, June 2014 Ofer Mendelevitch
  • 2. A bit about me
    Ofer Mendelevitch
    Director, Data Science @ Hortonworks
    Previously: Nor1, Yahoo!, Risk Insight, Quiver
    Personal blog: www.achessdad.com
  • 3. What I will cover today…
    • What is a Bayesian Network?
    • Why I think it’s cool
    • Bayesian networks with R: the bnlearn package
    • Bayes Networks Inference with R and Hadoop
  • 4. Introduction to Bayesian Networks (with examples using R)
  • 5. Example: “Asia” Bayesian Network
    Each node is a random variable: yes/no
    [Diagram nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]
  • 6. Example: “Asia” Bayesian Network
    Graph structure reflects “causal” relationships
    [Same diagram as slide 5]
  • 7. Example: “Asia” Bayesian Network
    node CPT: P(node | parents)
    [Same diagram as slide 5]
    CPT for Shortness of breath (SoB):
      Tub or Cancer   Bronchitis   SoB = T   SoB = F
      T               T            0.7       0.3
      F               T            0.4       0.6
      T               F            0.45      0.55
      F               F            0.05      0.95
  • 8. What is a (discrete) Bayesian Network? (also called Bayes Nets, Belief Nets, etc)
    • A network structure (DAG):
      – Nodes => random variables, taking discrete values
      – Edges => conditional dependencies
        • E.g., lung cancer is statistically dependent on smoking
    • A set of conditional probability tables (CPTs):
      – Each node has a set of parents, determined by the graph
      – CPT holds P(node | parent-A, parent-B, …) for each node
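Taken together, the DAG and the CPTs define the joint distribution as a product of one CPT per node. As a sketch for the Asia network, using the arcs listed on the manual graph construction slide and single-letter labels that are shorthand introduced here (A = Visit to Asia, S = Smoking, T = Tuberculosis, L = Lung cancer, B = Bronchitis, E = Tuberculosis or cancer, X = X-ray result, D = Shortness of breath):

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

P(A, S, T, L, B, E, X, D)
  = P(A)\, P(S)\, P(T \mid A)\, P(L \mid S)\, P(B \mid S)\,
    P(E \mid T, L)\, P(X \mid E)\, P(D \mid E, B)
```

Each factor on the right is one CPT, so eight small tables stand in for a full joint table with 2^8 = 256 entries.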
  • 9. Why are Bayesian Networks cool?
    • Intuitive/adaptive modeling tool:
      – Graphs are natural for modeling relationships
      – Easy to combine data-driven learning with expert know-how
      – You can start small, and add knowledge as it is acquired
    • “Naturally” addresses inference with missing values
    • Inference can be applied to any variable/node
      – As opposed to a single (target) variable in supervised learning
  • 10. Bayesian networks have been successfully used for a variety of real-world applications
    • Healthcare: medical diagnosis, genetic modeling
    • Security: crime pattern analysis, terrorism risk management
    • Education: student modeling
    • Finance: credit rating, predicting defaults
    • Tech support: troubleshooting for computers/printers
    See “Bayesian networks: a practical guide to applications”, Pourret et al
  • 11. Bayesian networks with R
    • http://cran.r-project.org/web/views/Bayesian.html
    • We will focus on “bnlearn” (by Marco Scutari)
      – Implements various structure learning algorithms (hc, tabu, gs, iamb, mmhc, rsmax2, etc)
      – Provides automated learning of CPT
      – Approximate inference: “logic sampling” and “likelihood weighting”
      – Supports snow/parallel for some algorithms
  • 12. Step 1: Constructing the graph
    [Diagram: the Asia network]
    • Manually (expert knowledge)
    • Automatically from data
  • 13. Manual graph construction: Asia
    > library(bnlearn)
    > varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
    > ag = empty.graph(varnames)
    > arcs(ag, ignore.cycles=T) = data.frame(
    +   "from"=c("Asia", "Smoking", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC"),
    +   "to"=c("Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC", "SoB", "X-ray", "SoB"))
    > graphviz.plot(ag)
  • 14. Automated graph construction: Asia
    > library(bnlearn)
    > varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
    > data(asia); names(asia) = varnames
    > bg = hc(asia)
    > graphviz.plot(bg)
  • 15. Automated learning does not always work perfectly…
    For example:
    • May not learn all the “expected” edges
    • May learn an edge in the wrong direction
    Therefore, in practice it helps to:
    • Provide a whitelist and a blacklist to the algorithm
    • Pre-seed with a manual network structure, and let the algorithm learn from there
    • Use ensemble learning of structure (see boot.strength)
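The three remedies on this slide can be sketched with bnlearn roughly as follows. This is an illustration rather than the talk's own code: it keeps the original single-letter column names of the bundled asia data (A, S, T, L, B, E, X, D), and the specific whitelisted, blacklisted, and seeded arcs, the 50 bootstrap replicates, and the 0.85 threshold are arbitrary choices made here.

```r
library(bnlearn)

data(asia)  # columns A, S, T, L, B, E, X, D

# Whitelist: arcs that must appear (illustrative choice: Smoking -> Lung cancer).
wl <- data.frame(from = "S", to = "L")
# Blacklist: arcs that must never appear (illustrative: SoB -> Bronchitis).
bl <- data.frame(from = "D", to = "B")

# Pre-seed the search with a hand-built starting graph (illustrative: Asia -> Tub).
seed <- empty.graph(names(asia))
arcs(seed) <- matrix(c("A", "T"), ncol = 2,
                     dimnames = list(NULL, c("from", "to")))

# Hill climbing starts from the seed and respects both lists.
net <- hc(asia, start = seed, whitelist = wl, blacklist = bl)

# Ensemble view: re-learn the structure on bootstrap samples and
# keep only arcs that show up in most replicates.
set.seed(42)
strength <- boot.strength(asia, R = 50, algorithm = "hc")
avg <- averaged.network(strength, threshold = 0.85)

print(arcs(net))
print(head(strength))
```

The strength table reports, for each candidate arc, how often it was learned and in which direction, which makes it easier to see which edges the data supports consistently.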
  • 16. Step 2: Learning the CPT / probabilities
    [Diagram: the Asia network]
    CPT for Shortness of breath (SoB):
      Tub or Cancer   Bronchitis   SoB = T   SoB = F
      T               T            0.85      0.15
      F               T            0.79      0.21
      T               F            0.73      0.27
      F               F            0.1       0.9
  • 17. Learning CPT for each node in the graph
    > fitted = bn.fit(ag, asia)
    > print(fitted$SoB)
      Parameters of node SoB (multinomial distribution)
    Conditional probability table:
    , , Tub-or-LC = no
         Bronchitis
    SoB           no        yes
      no  0.90017286 0.21373057
      yes 0.09982714 0.78626943
    , , Tub-or-LC = yes
         Bronchitis
    SoB           no        yes
      no  0.27737226 0.14592275
      yes 0.72262774 0.85407725
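With complete discrete data, the default maximum-likelihood fit amounts to conditional relative frequencies: count how often each child value occurs for each parent configuration, then normalise within each configuration. A minimal base-R sketch of that idea, using eight made-up observations (hypothetical toy data, not the Asia dataset) and a single parent:

```r
# Hypothetical toy observations: 8 patients, one parent (Bronchitis), one child (SoB).
obs <- data.frame(
  Bronchitis = c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
  SoB        = c("yes", "yes", "no",  "no", "no", "yes", "no", "yes"),
  stringsAsFactors = TRUE
)

# Joint counts: rows are child values, columns are parent values.
counts <- table(SoB = obs$SoB, Bronchitis = obs$Bronchitis)

# Normalise each column so it sums to 1: this is P(SoB | Bronchitis).
cpt <- prop.table(counts, margin = 2)
print(cpt)
```

Each column of the result is a distribution over SoB for one value of Bronchitis, which is the same layout as the tables printed by bn.fit above.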
  • 18. Using the BN for inference
    • Given evidence: (1) Visit to Asia, (2) SoB, (3) Bronchitis
    • What is the likelihood of “lung cancer”?
    [Diagram: the Asia network]
  • 19. Inferring with missing values
    • We provide evidence (“yes” or “no” in this case) only for those nodes where we have such evidence
    • If a value is “missing”, it is simply left out of the evidence when doing inference
    This is in contrast to supervised learning, where ALL values are typically needed for inference.
  • 20. Exact Inference with gRain
    • The gRain package implements exact inference for discrete Bayesian Networks using the “Junction Tree” belief propagation algorithm
    • bnlearn and gRain cooperate nicely
    > jtree = compile(as.grain(fitted))
    > jp = setFinding(jtree, nodes = c("Asia", "SoB", "Bronchitis"), states = c("yes", "yes", "yes"))
    > print(querygrain(jp, nodes="LC")$LC)
    LC
       no   yes
    0.934 0.066
  • 21. Approximate inference with bnlearn
    bnlearn implements approximate inference: logic sampling (aka rejection sampling) and likelihood weighting
    > # Infer probability P(SoB | Asia, Bronchitis) using logic sampling
    > p1 = cpquery(fitted, event = eval(SoB == 'yes'), evidence = eval(Asia == 'yes' & Bronchitis == 'yes'), method="ls")
    > print(p1)
    [1] 0.8014706
    > # Infer probability P(SoB | Asia, Bronchitis) using likelihood weighting
    > evidence = list("yes", "yes")
    > names(evidence) = c("Asia", "Bronchitis")
    > p2 = cpquery(fitted, eval(SoB == 'yes'), evidence, method="lw")
    > print(p2)
    [1] 0.795404
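To make the difference between the two sampling methods concrete, here is a base-R sketch on a two-node toy network, Smoking -> LC. It does not use bnlearn, and all probabilities are hypothetical numbers chosen for illustration. The evidence is LC = yes and the query is P(Smoking = yes | LC = yes), which is small enough to check against Bayes' rule.

```r
set.seed(1)
n <- 100000

# Hypothetical CPTs for a toy network Smoking -> LC (not the fitted Asia model).
p_s <- 0.5                               # P(Smoking = yes)
p_l_given_s <- c(yes = 0.10, no = 0.01)  # P(LC = yes | Smoking)

# Exact answer by Bayes' rule, for comparison.
exact <- p_s * p_l_given_s[["yes"]] /
  (p_s * p_l_given_s[["yes"]] + (1 - p_s) * p_l_given_s[["no"]])

# Logic (rejection) sampling: sample every node, then discard
# samples that disagree with the evidence LC = yes.
s <- ifelse(runif(n) < p_s, "yes", "no")
l <- runif(n) < p_l_given_s[s]
ls_est <- mean(s[l] == "yes")
kept <- sum(l)  # only a small share of the samples survives

# Likelihood weighting: sample only the non-evidence node and weight
# each sample by the probability of the evidence given that sample.
s2 <- ifelse(runif(n) < p_s, "yes", "no")
w <- p_l_given_s[s2]
lw_est <- sum(w[s2 == "yes"]) / sum(w)

cat("exact:", exact, " logic sampling:", ls_est,
    " likelihood weighting:", lw_est, " samples kept by LS:", kept, "\n")
```

Because the evidence is rare, logic sampling throws most samples away, while likelihood weighting uses all of them. The same effect is why sampling-based inference can struggle with rare events.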
  • 22. Large scale Bayes Networks Inference with R and Hadoop
  • 23. What is large?
    • Number of nodes:
      – 10s: Medium
      – 100s: Large
      – 1000s: Very large
    • Number of instances:
      – 100,000s to millions
  • 24. Manually constructing large graphs is hard
  • 25. Large scale learning in practice: manual + automated
    • Define nodes
    • Seed with some known edges, based on expert knowledge
    • Augment with automated learning (e.g., hc, tabu, rsmax2, etc)
  • 26. Large scale inference: Exact or Approximate?
    • Exact (Jtree), gRain
      – Pros: Fast inference time
      – Cons: Computational complexity determined (exponentially) by largest clique size
    • Approximate (LS, LW), bnlearn
      – Pros: Can be used for any graph; not limited by “clique” size
      – Cons: Inference is often much slower; not accurate for rare events
  • 27. About RHadoop/RMR
    • An open source project, supported by Revolution Analytics
    • Various sub-projects: RMR, RHDFS, RHBASE, plyrmr, etc
    • We will focus on RMR
      – Implement mapper/reducer code using R
    • RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
    • Installing RMR on HDP:
      http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
      http://www.research.janahang.com/install-rhadoop-on-hortonworks-hdp-2-0/
  • 28. Large scale inference with R and Hadoop
    [Diagram labels: Instances file (Chunk 1, Chunk 2, … Chunk N), Mapper (no-op), Reducer (CPQuery), BN model, Infer with RMR, Hadoop cluster, Results]
    • Inference is embarrassingly parallel
    • Hadoop determines the number of mappers based on file size, so we’ll use reducers to parallelize CPQuery
  • 29. Example: Adult dataset
    • Donated by Ronny Kohavi and Barry Becker, 1996 - http://archive.ics.uci.edu/ml/datasets/Adult
    • Extracted from 1994 census data
    • 48842 instances, 14 features such as:
      – Age, country, occupation, marital status, capital gain, etc
      – Goal: predict if income is >50K or not
    …
    53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
    28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
    37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
    49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
    52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
    …
  • 30. Sample learned network structure for “adult”
  • 31. Inference with RMR on adult dataset
    NUM_REDUCERS = 4
    opt = rmr.options(backend = "hadoop",
                      backend.parameters = list(hadoop=list(D="mapreduce.reduce.memory.mb=1024",
                                                            D=paste0("mapreduce.job.reduces=", NUM_REDUCERS))))
    inpFile = 'adult.test'
    outFile = 'adult.out'
    mapreduce(input=inpFile, input.format="text", output=outFile, output.format="csv", map=map_func, reduce=reduce_func)
  • 32. Our mapper: passing on to reducer…
    map_func <- function(., values) {
      out_klist = list(); out_vlist = list()
      for (v in values) {
        fvec = unlist(strsplit(v, ',', fixed=T))   # Read row and split into columns
        if (length(fvec)<15) { next; }             # deal with row not in expected format
        key = floor(runif(1, 0, NUM_REDUCERS))     # random key in 0..NUM_REDUCERS-1 spreads rows across reducers
        out_klist = c(out_klist, key)
        out_vlist = c(out_vlist, v)
      }
      return (keyval(out_klist, out_vlist))
    }
  • 33. Our reducer: where all the action happens
    trim <- function (x) gsub("^\\s+|\\s+$", "", x)   # strip leading/trailing whitespace
    reduce_func <- function(., values) {
      out_klist = list(); out_vlist = list()
      for (v in values) {
        increment.counter('bn-demo', 'row', 1)   # to let MR know we are still active
        fvec = sapply(strsplit(v, ',', fixed=T), trim)   # read row and split into columns
        names(fvec)=c("age", "type_employer", "fnlwgt", "education", "education_num","marital", "occupation", "relationship", "race","sex", "capital_gain", "capital_loss", "hr_per_week", "country", "income")
        pv = dataprep(fvec)   # transform to “learned” features
        evidence = as.list(pv[1,setdiff(colnames(pv), 'income')])
        prob = cpquery(fitted, event = (income == ">50K"), evidence = evidence, method="lw")
        out_klist = c(out_klist, v)
        out_vlist = c(out_vlist, format(prob, digits=2))
      }
      return (keyval(out_klist, out_vlist))
    }
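The mapper and reducer on the two slides above follow a simple pattern: give each row a random key, let the shuffle group rows by key, and score each group independently. That pattern can be tried locally in base R without Hadoop or rmr2, as in the sketch below. The row strings and the score_row function are hypothetical placeholders introduced here, with score_row standing in for the dataprep and cpquery calls made in the reducer.

```r
set.seed(7)
NUM_REDUCERS <- 4

# Stand-ins for the CSV lines of the instances file (hypothetical data).
rows <- sprintf("row-%02d", 1:20)

# "Map" step: assign each row a random key in 0..NUM_REDUCERS-1.
keys <- floor(runif(length(rows), 0, NUM_REDUCERS))

# "Shuffle" step: group rows by key, as Hadoop would before the reduce phase.
groups <- split(rows, keys)

# Hypothetical placeholder for the per-row work done in the reducer.
score_row <- function(row) nchar(row)

# "Reduce" step: each group is processed independently of the others.
results <- lapply(groups, function(chunk) {
  data.frame(row = chunk,
             score = vapply(chunk, score_row, integer(1)),
             stringsAsFactors = FALSE)
})

out <- do.call(rbind, results)
print(table(keys))  # how many rows each simulated reducer received
```

Random keys give roughly even groups, which matters here because the per-row inference is the expensive part and the reducers are the unit of parallelism.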
  • 34. Example output: adult.out
    26, Private, 191573, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K.,0.37
    52, Private, 203635, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K.,0.14
    36, Private, 68798, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.,0.019
    34, Private, 31752, HS-grad, 9, Divorced, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, ?, <=50K.,0.14
    59, ?, 291856, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K.,0.074
    26, Private, 135848, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 10, Guatemala, <=50K.,0.03
    50, Local-gov, 237356, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.89
    56, Self-emp-not-inc, 140729, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K.,0.14
    22, Private, 54560, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K.,0.21
    45, Self-emp-inc, 88500, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.94
  • 35. More information
    • Detailed step-by-step guide and code used can be found on: https://github.com/ofermend/bayes-net-r-hadoop
    • Download Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/
    • Further reading/learning:
      – http://www.bnlearn.com/
      – PGM class on Coursera: https://www.coursera.org/course/pgm
      – PGM Ebook from UCL: http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/250214.pdf
      – Many others…
  • 36. Thank you! Any Questions?
    Ofer Mendelevitch, ofer@hortonworks.com, @ofermend
    We’re hiring! www.hortonworks.com/careers
    Hortonworks training: www.hortonworks.com/training
    Hortonworks blog: www.hortonworks.com/blog

Editor's Notes

  1. A Bayesian network is: a set of nodes; arrows between the nodes (the graph); and a conditional probability table, P(X | parents), for each node.