Bayesian Networks with R and Hadoop

© Hortonworks Inc. 2014
Hortonworks
Hadoop Summit, June 2014
Ofer Mendelevitch

A bit about me
Ofer Mendelevitch
Director, Data Science @ Hortonworks
Previously: Nor1, Yahoo!, Risk Insight, Quiver
Personal blog: www.achessdad.com

What I will cover today…
•What is a Bayesian Network?
•Why I think it’s cool
•Bayesian networks with R: the bnlearn package
•Bayesian network inference with R and Hadoop

Introduction to Bayesian Networks
(with examples using R)

Example: “Asia” Bayesian Network
Each node is a yes/no random variable.
[Figure: the “Asia” network. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]

Graph structure reflects “causal” relationships

node CPT: P(node | parents)
CPT for the SoB (shortness of breath) node, given its parents “Tub or Cancer” and “Bronchitis”:
Tub or Cancer   Bronchitis   SoB = T   SoB = F
T               T            0.7       0.3
F               T            0.4       0.6
T               F            0.45      0.55
F               F            0.05      0.95

What is a (discrete) Bayesian Network?
(also called Bayes Nets, Belief Nets, etc)
• A network structure (DAG):
– Nodes => random variables, taking discrete values
– Edges => conditional dependencies
• E.g., lung cancer is statistically dependent on smoking
• A set of conditional probability tables (CPTs):
– Each node has a set of parents, determined by the graph
– CPT holds P(node | parent-A, parent-B, …) for each node
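A CPT can be pictured as a small lookup table. The sketch below stores the illustrative SoB numbers from the earlier slide in a plain R array; the names sob_cpt, TubOrCancer and the dimension layout are assumptions made for this example and are not how bnlearn stores its tables internally.

```r
# Illustrative CPT for P(SoB | TubOrCancer, Bronchitis), values taken from the slide.
# Dimension order (an assumption of this sketch): SoB, TubOrCancer, Bronchitis.
sob_cpt <- array(
  c(0.70, 0.30,   # TubOrCancer = T, Bronchitis = T
    0.40, 0.60,   # TubOrCancer = F, Bronchitis = T
    0.45, 0.55,   # TubOrCancer = T, Bronchitis = F
    0.05, 0.95),  # TubOrCancer = F, Bronchitis = F
  dim = c(2, 2, 2),
  dimnames = list(SoB = c("T", "F"),
                  TubOrCancer = c("T", "F"),
                  Bronchitis = c("T", "F")))

# P(SoB = T | TubOrCancer = T, Bronchitis = F)
p <- sob_cpt["T", "T", "F"]

# For every combination of parent values, the probabilities over SoB sum to 1.
col_sums <- apply(sob_cpt, c(2, 3), sum)
```

Each column of the table is a full distribution over the node for one setting of its parents, which is why the sums come out to 1.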

Why are Bayesian Networks cool?
• Intuitive/adaptive modeling tool:
– Graphs are natural for modeling relationships
– Easy to combine data-driven learning with expert know-how
– You can start small, and add knowledge as it is acquired
• “Naturally” addresses inference with missing values
• Inference can be applied to any variable/node
– As opposed to a single (target) variable in supervised learning

Bayesian networks have been successfully used for a variety of real-world applications
• Healthcare: medical diagnosis, genetic modeling
• Security: crime pattern analysis, terrorism risk management
• Education: student modeling
• Finance: credit rating, predicting defaults
• Tech support: troubleshooting for computers/printers
See “Bayesian Networks: A Practical Guide to Applications”, Pourret et al.

Bayesian networks with R
• http://cran.r-project.org/web/views/Bayesian.html
• We will focus on “bnlearn” (by Marco Scutari)
– Implements various structure learning algorithms (hc, tabu, gs, iamb, mmhc, rsmax2, etc)
– Provides automated learning of CPTs
– Approximate inference: “logic sampling” and “likelihood weighting”
– Supports snow/parallel for some algorithms

Step 1: Constructing the graph
• Manually (expert knowledge)
• Automatically from data

Manual graph construction: Asia
> library(bnlearn)
> varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
> ag = empty.graph(varnames)
> # One row per directed edge (from -> to); this arc set is acyclic, so the default cycle check passes
> arcs(ag) = data.frame(
    "from" = c("Asia", "Smoking", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC"),
    "to" = c("Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC", "SoB", "X-ray", "SoB"))
> graphviz.plot(ag)

Automated graph construction: Asia
> library(bnlearn)
> varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
> data(asia); names(asia) = varnames
> bg = hc(asia)
> graphviz.plot(bg)

Automated learning does not always work perfectly…
For example:
• May not learn all the “expected” edges
• May learn an edge in the wrong direction
Therefore, in practice it helps to:
• Provide a whitelist and a blacklist to the algorithm
• Pre-seed with a manual network structure, and let the algorithm learn from there
• Use ensemble learning of structure (see boot.strength)
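The three remedies above can be sketched with bnlearn as follows. This is only an illustration under assumptions: the specific whitelisted, blacklisted and seeded arcs, the number of bootstrap replicates (R = 50) and the 0.85 threshold are picked for the example, not taken from the talk.

```r
library(bnlearn)

data(asia)
names(asia) <- c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")

# Whitelist: arcs that must be present. Blacklist: arcs that must never be present.
# Both arcs below are chosen for illustration only.
wl <- data.frame(from = "Asia", to = "Tub")
bl <- data.frame(from = "X-ray", to = "Tub-or-LC")
bg_constrained <- hc(asia, whitelist = wl, blacklist = bl)

# Pre-seed: start hill climbing from a partial, hand-built structure.
seed <- empty.graph(names(asia))
seed <- set.arc(seed, from = "Smoking", to = "LC")
bg_seeded <- hc(asia, start = seed)

# Ensemble: estimate arc strength over bootstrap resamples,
# then keep only the arcs that appear often enough.
strength <- boot.strength(asia, R = 50, algorithm = "hc")
bg_avg <- averaged.network(strength, threshold = 0.85)
```

Any of the resulting networks can be inspected with arcs() or drawn with graphviz.plot() before moving on to parameter learning.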

Step 2: Learning the CPT / probabilities
CPT for the SoB (shortness of breath) node, given its parents “Tub or Cancer” and “Bronchitis”:
Tub or Cancer   Bronchitis   SoB = T   SoB = F
T               T            0.85      0.15
F               T            0.79      0.21
T               F            0.73      0.27
F               F            0.1       0.9

Learning CPT for each node in the graph
> fitted = bn.fit(ag, asia)
> print(fitted$SoB)
  Parameters of node SoB (multinomial distribution)
Conditional probability table:
, , Tub-or-LC = no
     Bronchitis
SoB           no        yes
  no  0.90017286 0.21373057
  yes 0.09982714 0.78626943
, , Tub-or-LC = yes
     Bronchitis
SoB           no        yes
  no  0.27737226 0.14592275
  yes 0.72262774 0.85407725

Using the BN for inference
• Given evidence: (1) visit to Asia, (2) SoB, (3) Bronchitis
• What is the probability of “lung cancer”?

Inferring with missing values
• We provide evidence (“yes” or “no” in this case) only
for those nodes where we have such evidence
• If a value is “missing” it’s just not included in the
evidence when doing inference…
This is in contrast to supervised learning, where ALL
values are typically needed for inference.

Exact Inference with gRain
• The gRain package implements exact inference for
discrete Bayesian Networks using the “Junction Tree”
belief propagation algorithm
• bnlearn and gRain cooperate nicely
> library(gRain)
> jtree = compile(as.grain(fitted))
> # Node names are case-sensitive: "SoB", as defined in varnames
> jp = setFinding(jtree, nodes = c("Asia", "SoB", "Bronchitis"),
states = c("yes", "yes", "yes"))
> print(querygrain(jp, nodes="LC")$LC)
LC
no yes
0.934 0.066

Approximate inference with bnlearn
bnlearn implements approximate inference: logic sampling (aka rejection sampling) and likelihood weighting
> # Infer P(SoB = yes | Asia = yes, Bronchitis = yes) using logic sampling
> p1 = cpquery(fitted, event = (SoB == 'yes'),
evidence = (Asia == 'yes' & Bronchitis == 'yes'), method="ls")
> print(p1)
[1] 0.8014706
> # Infer the same probability using likelihood weighting (evidence is a named list)
> # Both are sampling estimates, so the values vary from run to run
> evidence = list(Asia = "yes", Bronchitis = "yes")
> p2 = cpquery(fitted, event = (SoB == 'yes'), evidence = evidence, method="lw")
> print(p2)
[1] 0.795404
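To show what the two sampling methods do under the hood, here is a base-R sketch on a three-node fragment (Tub-or-Cancer and Bronchitis as parents of SoB). It is not bnlearn's implementation. The SoB probabilities come from the illustrative CPT slide; the two prior probabilities (0.10 and 0.40) and all variable names are assumptions made up for this example.

```r
set.seed(42)
n <- 100000

# Hypothetical priors for the two parent nodes (not from the talk).
p_t <- 0.10   # P(TubOrCancer = T)
p_b <- 0.40   # P(Bronchitis = T)

# P(SoB = T | TubOrCancer, Bronchitis), from the illustrative CPT slide.
p_sob <- function(t, b) ifelse(t, ifelse(b, 0.70, 0.45), ifelse(b, 0.40, 0.05))

# Sample the parent nodes from their priors.
t <- runif(n) < p_t
b <- runif(n) < p_b

# Logic (rejection) sampling: also sample SoB, then keep only samples where SoB = T.
sob <- runif(n) < p_sob(t, b)
ls_est <- mean(t[sob])

# Likelihood weighting: fix SoB = T and weight each sample by P(SoB = T | parents).
w <- p_sob(t, b)
lw_est <- sum(w[t]) / sum(w)

# Exact answer by Bayes' rule, for comparison.
num <- p_t * (p_b * 0.70 + (1 - p_b) * 0.45)
den <- num + (1 - p_t) * (p_b * 0.40 + (1 - p_b) * 0.05)
exact <- num / den
```

Both estimates target P(TubOrCancer = T | SoB = T). Likelihood weighting uses every sample, while logic sampling discards the ones that contradict the evidence.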

Large-scale Bayesian network inference with R and Hadoop

What is large?
• Number of nodes:
– 10s: Medium
– 100s: Large
– 1000s: Very large
• Number of instances:
– 100,000s to millions

Manually constructing large graphs is hard

Large scale learning in practice: manual + automated
• Define nodes
• Seed with some known edges, based on expert knowledge
• Augment with automated learning (e.g., hc, tabu, rsmax2, etc)

Large scale inference: Exact or Approximate?
Exact (junction tree), gRain
• Pros: fast inference time
• Cons: computational complexity determined (exponentially) by largest clique size
Approximate (LS, LW), bnlearn
• Pros: can be used for any graph; not limited by “clique” size
• Cons: inference is often much slower; not accurate for rare events
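The “rare events” caveat can be illustrated with a small base-R simulation. Logic sampling keeps only the samples that match the evidence, so when the evidence is unlikely very few samples survive. The evidence probability (0.001) and the sample size are assumptions chosen for this sketch.

```r
set.seed(1)
n_samples  <- 10000
p_evidence <- 0.001   # assumed probability that a sample matches the evidence

# Each sample matches the evidence with probability p_evidence; the rest are rejected.
kept <- sum(runif(n_samples) < p_evidence)

# Worst-case standard error of a probability estimated from the kept samples only
# (the variance p * (1 - p) is at most 0.25).
std_err <- sqrt(0.25 / kept)
```

With only a handful of surviving samples the standard error is large, so either many more samples are needed or a method that does not reject samples (such as likelihood weighting) is preferable.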

About RHadoop/RMR
• An open source project, supported by Revolution Analytics
• Various sub-projects: RMR, RHDFS, RHBASE, plyrmr, etc
• We will focus on RMR
– Implement mapper/reducer code using R
• RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
• Installing RMR on HDP:
– http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
– http://www.research.janahang.com/install-rhadoop-on-hortonworks-hdp-2-0/

Large scale inference with R and Hadoop
[Figure: inference with RMR on a Hadoop cluster. The instances file is split into chunks 1..N; no-op mappers pass the rows on to reducers, and each reducer runs CPQuery against the BN model to produce the results]
Inference is embarrassingly parallel
Hadoop determines the number of mappers, based on file size
So we’ll use reducers to parallelize CPQuery

Example: Adult dataset
• Donated by Ronny Kohavi and Barry Becker, 1996 -
http://archive.ics.uci.edu/ml/datasets/Adult
• Extracted from 1994 census data
• 48842 instances, 14 features such as:
– Age, country, occupation, marital status, capital gain, etc
– Goal: predict if income is >50K or not
…
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
…

Sample learned network structure for “adult”

Inference with RMR on adult dataset
library(rmr2)
NUM_REDUCERS = 4
opt = rmr.options(backend = "hadoop",
    backend.parameters = list(hadoop = list(D = "mapreduce.reduce.memory.mb=1024",
                                            D = paste0("mapreduce.job.reduces=", NUM_REDUCERS))))
inpFile = 'adult.test'
outFile = 'adult.out'
# map_func and reduce_func are defined on the following slides
mapreduce(input=inpFile, input.format="text",
    output=outFile, output.format="csv",
    map=map_func, reduce=reduce_func)

Our mapper: passing on to reducer…
map_func <- function(., values)
{
  out_klist = list(); out_vlist = list()
  for (v in values) {
    fvec = unlist(strsplit(v, ',', fixed=TRUE))  # read row and split into columns
    if (length(fvec) < 15) { next }              # skip rows not in the expected 15-column format
    key = floor(runif(1, 0, NUM_REDUCERS))       # random key in 0..NUM_REDUCERS-1 spreads rows across reducers
    out_klist = c(out_klist, key)
    out_vlist = c(out_vlist, v)
  }
  return (keyval(out_klist, out_vlist))
}
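The key assignment in the mapper can be tried locally without Hadoop. The sketch below is a plain base-R simulation, not rmr2 code: it applies the same floor(runif(...)) rule to 1000 placeholder rows and groups them by key, roughly the way the shuffle phase hands rows to reducers. The placeholder rows and variable names are made up for the example.

```r
set.seed(7)
NUM_REDUCERS <- 4
rows <- paste("row", 1:1000)   # stand-ins for lines of the instances file

# Same rule as the mapper: a uniform integer key in 0 .. NUM_REDUCERS - 1.
keys <- floor(runif(length(rows), 0, NUM_REDUCERS))

# Group rows by key; each group would be processed by one reducer.
by_reducer <- split(rows, keys)
sizes <- sapply(by_reducer, length)
```

Because the keys are uniform, the groups end up roughly equal in size, so the expensive cpquery calls are spread evenly over the reducers.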

Our reducer: where all the action happens
library(bnlearn)  # cpquery() must be available on the worker nodes
trim <- function (x) gsub("^\\s+|\\s+$", "", x)  # strip leading and trailing whitespace
# Assumes `fitted` (the learned network) and dataprep() are defined before mapreduce() is called
reduce_func <- function(., values)
{
  out_klist = list(); out_vlist = list()
  for (v in values) {
    increment.counter('bn-demo', 'row', 1)  # to let MR know we are still active
    fvec = sapply(strsplit(v, ',', fixed=TRUE), trim)  # read row and split into columns
    names(fvec) = c("age", "type_employer", "fnlwgt", "education", "education_num", "marital", "occupation", "relationship",
                    "race", "sex", "capital_gain", "capital_loss", "hr_per_week", "country", "income")
    pv = dataprep(fvec)  # transform to “learned” features
    evidence = as.list(pv[1, setdiff(colnames(pv), 'income')])  # every feature except the target
    prob = cpquery(fitted, event = (income == ">50K"), evidence = evidence, method="lw")
    out_klist = c(out_klist, v)
    out_vlist = c(out_vlist, format(prob, digits=2))
  }
  return (keyval(out_klist, out_vlist))
}
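The reducer relies on a dataprep() helper that is not shown on the slides. The sketch below is a hypothetical stand-in, only meant to illustrate the kind of transformation involved (turning raw fields into discrete features that a discrete network can use). The function name dataprep_sketch, the chosen columns, the cut points and the level names are all made up; the actual helper in the talk's repository may differ.

```r
# Hypothetical stand-in for dataprep(); cut points and labels are illustrative only.
dataprep_sketch <- function(fvec) {
  age   <- as.numeric(fvec[["age"]])
  hours <- as.numeric(fvec[["hr_per_week"]])
  data.frame(
    age_group   = cut(age,   breaks = c(0, 30, 50, Inf), labels = c("young", "middle", "senior")),
    hours_group = cut(hours, breaks = c(0, 35, 45, Inf), labels = c("part", "full", "over")),
    education   = factor(fvec[["education"]]),
    sex         = factor(fvec[["sex"]]),
    income      = factor(fvec[["income"]])
  )
}

# One parsed input row (a named character vector, as built in the reducer).
row <- c(age = "53", hr_per_week = "40", education = "11th", sex = "Male", income = "<=50K")
pv <- dataprep_sketch(row)

# Evidence for the query: every feature except the target.
evidence <- as.list(pv[1, setdiff(colnames(pv), "income")])
```

Whatever the real transformation is, the feature names and levels it produces have to match those of the network that was fitted, otherwise the evidence cannot be applied.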

Example output: adult.out
26, Private, 191573, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K.,0.37
52, Private, 203635, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K.,0.14
36, Private, 68798, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.,0.019
34, Private, 31752, HS-grad, 9, Divorced, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, ?, <=50K.,0.14
59, ?, 291856, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K.,0.074
26, Private, 135848, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 10, Guatemala, <=50K.,0.03
50, Local-gov, 237356, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.89
56, Self-emp-not-inc, 140729, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K.,0.14
22, Private, 54560, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K.,0.21
45, Self-emp-inc, 88500, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.94

More information
• Detailed step-by-step guide and code used can be found on:
https://github.com/ofermend/bayes-net-r-hadoop
• Download Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
• Further reading/learning:
– http://www.bnlearn.com/
– PGM class on Coursera:
https://www.coursera.org/course/pgm
– PGM Ebook from UCL:
http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/250214.pdf
– Many others…

Thank you!
Any Questions?
Ofer Mendelevitch, ofer@hortonworks.com, @ofermend
We’re hiring! www.hortonworks.com/careers
Hortonworks training: www.hortonworks.com/training
Hortonworks blog: www.hortonworks.com/blog
