This document discusses using Bayesian networks to model relationships in data. It introduces Bayesian networks as directed acyclic graphs that represent conditional dependencies between random variables. The document describes approaches for finding the optimal Bayesian network structure given data, including scoring functions and dealing with issues like cycles. It also introduces BNFinder, an open-source Python library for learning Bayesian networks from data that can handle both discrete and continuous variables efficiently in parallel. Examples are given demonstrating BNFinder's ability to learn predictive models from genomic and gene expression data.
Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014
1. Understanding your data with Bayesian networks (in Python)
Bartek Wilczyński
bartek@mimuw.edu.pl
University of Warsaw
PyData Silicon Valley, May 5th 2014
2. Are you confused enough?
Or should I confuse you a bit more?
Image from xkcd.org/552/
4. There may be factors we haven't thought about
● Maybe confusion helps with learning?
● Or maybe there is an alternative explanation?
● As long as these are just cartoon models, we cannot really rule out any structure
[Figure: two alternative structures, one over "Paying attention", "Being confused" and "Correct answer", or one over "Being confused" and "Correct answer" only]
5. What do I mean by data?
Sex  Age    Smoking    Stress  Lung    Heart   Feel
M    0-20   never      N       no      no      great
F    70     sometimes  N       minor   no      OK
M    50-70  daily      Y       no      severe  not-so-well
M    20-50  daily      N       no      minor   OK
F    70     never      N       no      minor   great
F    20-50  sometimes  Y       severe  minor   not-so-well
F    20-50  never      Y       no      no      great
M    20-50  sometimes  N       minor   no      great
M    50-70  never      Y       severe  no      OK
F    0-20   never      N       no      severe  OK
M    20-50  daily      Y       no      no      OK
M    0-20   daily      N       no      no      not-so-well
M    20-50  never      N       minor   no      OK
...  ...    ...        ...     ...     ...     ...
6. Network of connections
Smoking (daily, sometimes, never)
Age (0-20, 20-50, 50-70, 70+)
Stressful job (yes, no)
Lung problems (no, minor, severe)
Heart problems (no, minor, severe)
Sex (male, female)
How did you feel this morning? (great, OK, not-so-well, terrible)
7. What is a Bayesian Network?
● A directed acyclic graph (a directed graph without cycles)
● with nodes representing random variables
● and edges between nodes representing dependencies (not necessarily causal)
● Each edge is directed from a parent to a child, so all nodes with edges pointing to a given node constitute its set of parents
● Each variable is associated with a value domain and a probability distribution conditional on its parents' values
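The definition above can be made concrete with a minimal sketch. It stores a network as a mapping from each variable to its set of parents, checks that the graph has no directed cycles, and spells out the resulting factorization of the joint distribution. The variable names and the edges are assumptions made for illustration (they follow the confused-students example), and the helper names `is_acyclic` and `factorization` are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# ASSUMPTION: a hypothetical example network, variable -> set of parents.
parents = {
    "paying_attention": set(),
    "being_confused": {"paying_attention"},
    "correct_answer": {"paying_attention"},
}

def is_acyclic(parents):
    """Return True if the parent sets describe a graph without directed cycles."""
    try:
        # TopologicalSorter expects node -> predecessors, which is exactly
        # the variable -> parents mapping used here.
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False

def factorization(parents):
    """Write the joint distribution as one conditional term per variable."""
    terms = []
    for var in sorted(parents):
        pa = sorted(parents[var])
        terms.append(f"P({var} | {', '.join(pa)})" if pa else f"P({var})")
    return " * ".join(terms)

print(is_acyclic(parents))
print(factorization(parents))
```

Each term in the printed product corresponds to one conditional probability table: a variable's distribution given the values of its parents.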
8. Back to our confused students
● Let us consider our model of confused students
● We can consider the model with an additional variable
● We need to have data on the additional variable for the model to be predictive
● Sometimes we need to use “wrong” models if they are predictive
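[Figure: network over "Paying attention", "Being confused" and "Correct answer", shown with the conditional probability table for "Being confused"]
                Paying attention
                yes     no
confused        80%     0%
not confused    20%     100%
[Figure: network over "Paying attention", "Being confused" and "Correct answer", shown with the conditional probability table for "Correct answer"]
                Paying attention
                yes     no
correct         50%     20%
incorrect       50%     80%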
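To see how such tables make a model predictive, here is a small sketch that computes P(correct | confused) by summing over the additional variable. The two tables are copied from the slide. The prior on paying attention is not given anywhere in the slides, so the value 0.5 below is purely an assumption, as is the reading that "Paying attention" is the only parent of both other variables.

```python
# Conditional probability tables from the slide, as fractions.
P_CONFUSED = {"yes": 0.8, "no": 0.0}  # P(confused | paying attention)
P_CORRECT = {"yes": 0.5, "no": 0.2}   # P(correct  | paying attention)

# ASSUMPTION: the slides give no prior for paying attention; 0.5 is made up.
P_ATTENTION = {"yes": 0.5, "no": 0.5}

def p_correct_given_confused(confused):
    """P(correct | confused=value), summing the joint over 'paying attention'."""
    joint_correct = 0.0  # accumulates P(correct, confused=value)
    marginal = 0.0       # accumulates P(confused=value)
    for attention, p_a in P_ATTENTION.items():
        p_c = P_CONFUSED[attention] if confused else 1.0 - P_CONFUSED[attention]
        marginal += p_a * p_c
        joint_correct += p_a * p_c * P_CORRECT[attention]
    return joint_correct / marginal

print(p_correct_given_confused(True))
print(p_correct_given_confused(False))
```

Under these assumptions a confused student answers correctly about 50% of the time and a non-confused one about 25% of the time, even though the model has no edge between confusion and the answer. The dependence comes entirely through the shared parent.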
9. Can we find the “best” Bayesian Network?
● Given a dataset with observations, we can try to find the “best” network topology (i.e. the best collection of parent sets)
● In order to do it automatically we need a scoring function to define what we mean by “best”
● A scoring function is useful if it can be written as a sum over variables, i.e. the best network consists of the best parent sets for the individual variables (modulo acyclicity)
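A score that "can be written as a sum over variables" might look like the sketch below. It uses a simple penalized log-likelihood (in the spirit of MDL/BIC-style scores) chosen only for illustration. It is not claimed to be the scoring function of any particular tool, the parameter count is simplified to the parent configurations actually observed, and the toy rows are made up.

```python
import math
from collections import Counter

def local_score(rows, child, parents):
    """Penalized negative log-likelihood of one variable given its parents (lower is better)."""
    parents = tuple(parents)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    context = Counter(tuple(r[p] for p in parents) for r in rows)
    # Negative log-likelihood under maximum-likelihood conditional frequencies.
    nll = -sum(n * math.log(n / context[pa]) for (pa, _), n in joint.items())
    # Complexity penalty: (simplified) number of free parameters times log(N) / 2.
    child_values = len({r[child] for r in rows})
    n_params = len(context) * (child_values - 1)
    return nll + 0.5 * math.log(len(rows)) * n_params

def network_score(rows, parent_sets):
    """Decomposable score: a plain sum of per-variable local scores."""
    return sum(local_score(rows, var, pa) for var, pa in parent_sets.items())

# ASSUMPTION: made-up toy data in which lung problems follow smoking exactly.
rows = ([{"smoking": "daily", "lung": "severe"}] * 4
        + [{"smoking": "never", "lung": "no"}] * 4)

print(local_score(rows, "lung", []))
print(local_score(rows, "lung", ["smoking"]))
```

Because the total is a sum, the parent set of each variable can be scored and optimized on its own, which is what makes an automatic search feasible.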
10. How to find the best network?
● There are generally three main approaches to defining BN scores:
– Bayesian statistics, e.g. BDe (Herskovits et al. '95)
– Information Theoretic, e.g. MDL (Lam et al. '94)
– Hypothesis testing, e.g. MMPC (Salehi et al. '10)
● There are also hybrid approaches, like the recent MIT (de Campos '06)
approach that uses information theory and hypothesis testing
● We have two issues:
– There are exponentially many potential parent sets
– The desired network needs to have no cycles
● The second issue is more important and makes the problem NP-complete
(Chickering '96)
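The first issue is easy to quantify: a variable in a network of n variables has 2^(n-1) possible parent sets. The sketch below counts and enumerates them. The optional cap on the number of parents is a common practical restriction that is assumed here for illustration and is not taken from the slides.

```python
from itertools import combinations
from math import comb

def count_parent_sets(n_variables, max_parents=None):
    """Number of candidate parent sets for one variable among n_variables."""
    others = n_variables - 1
    limit = others if max_parents is None else min(max_parents, others)
    return sum(comb(others, k) for k in range(limit + 1))

def candidate_parent_sets(variables, child, max_parents):
    """Yield every subset of the other variables with at most max_parents members."""
    others = [v for v in variables if v != child]
    for k in range(min(max_parents, len(others)) + 1):
        yield from combinations(others, k)

# Seven variables, as in the health questionnaire example.
print(count_parent_sets(7))      # all subsets of the six other variables
print(count_parent_sets(7, 2))   # at most two parents (assumed cap)
print(count_parent_sets(21))     # growth with the number of variables
```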
11. Cycles are not always a problem
● Dynamic Bayesian Networks are a variant of BN models that describe temporal dependencies
● We can safely assume that the causal links only go forward in time
● That removes the problem of cycles, as we now have two versions of each variable: “before” and “after”
[Figure: a network over X1, X2, X3, and its dynamic version in which each variable appears twice, at time t and at time t+1]
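The "before" and "after" construction can be sketched as a simple data transformation: every pair of consecutive time points becomes one row with two copies of each variable, and edges are only allowed from a time t copy to a time t+1 copy. The function names, the `@t` naming convention and the toy series are assumptions made for illustration.

```python
def unroll_time_series(series):
    """Turn consecutive time points into rows with 'before' and 'after' copies."""
    rows = []
    for before, after in zip(series, series[1:]):
        row = {f"{name}@t": value for name, value in before.items()}
        row.update({f"{name}@t+1": value for name, value in after.items()})
        rows.append(row)
    return rows

def allowed_edges(variables):
    """Edges go only from a 'before' copy to an 'after' copy, so no cycle can form."""
    return [(f"{p}@t", f"{c}@t+1") for p in variables for c in variables]

# ASSUMPTION: a made-up time series over the three variables from the figure.
series = [
    {"X1": 0, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 0, "X3": 1},
]

for row in unroll_time_series(series):
    print(row)
```

Note that a variable may depend on its own earlier value (X1 at t to X1 at t+1) and two variables may influence each other over time, yet the unrolled graph stays acyclic because every edge crosses from t to t+1.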
12. Different types of variables
● Another common situation is when we have different types of variables
● We may know that only certain types of connections are causal
● Or we may be interested only in certain types of connections
● This breaks the cycles as well
[Figure: three types of variables: Mutations, Protein expression, Diseases]
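One way to express such a restriction is to give each variable a type and allow parents only from types that come earlier in a fixed order. The sketch below assumes the order mutations, then protein expression, then diseases. Both that ordering and the variable names are hypothetical and serve only to show why no cycle can arise.

```python
# ASSUMPTION: hypothetical ordering of the three variable types.
LAYER = {"mutation": 0, "expression": 1, "disease": 2}

# ASSUMPTION: made-up variable names, each tagged with its type.
VARIABLE_TYPE = {
    "mut_A": "mutation",
    "mut_B": "mutation",
    "prot_X": "expression",
    "prot_Y": "expression",
    "disease_D": "disease",
}

def candidate_parents(child):
    """Variables allowed as parents: only those from a strictly earlier layer."""
    child_layer = LAYER[VARIABLE_TYPE[child]]
    return sorted(v for v, t in VARIABLE_TYPE.items() if LAYER[t] < child_layer)

for variable in VARIABLE_TYPE:
    print(variable, "<-", candidate_parents(variable))
```

Since every allowed edge points from a lower layer to a higher one, any combination of parent sets chosen from these candidates is automatically acyclic.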
13. BNFinder – Python library for Bayesian Networks
● A library for identification of optimal Bayesian Networks
● Works under the assumption that acyclicity is guaranteed by external constraints (disjoint sets of variables or dynamic networks)
● Relatively fast and efficient
15. Now, parallelize!
● Since we have external constraints on acyclicity, we can search for parent sets independently
● This leads to a simple parallelization scheme and good efficiency
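The scheme can be sketched as one independent task per variable, mapped over a pool of workers. This is only an illustration of the idea and not BNFinder's actual implementation. The function names, the cap of two parents and the toy scoring function are assumptions. A thread pool is used to keep the example short, while for CPU-bound scoring in CPython a `ProcessPoolExecutor` (same interface) would be the natural substitute.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def best_parent_set(child, candidates, score, max_parents=2):
    """Exhaustively pick the lowest-scoring parent set for one variable."""
    options = (ps for k in range(max_parents + 1)
               for ps in combinations(candidates, k))
    return child, min(options, key=lambda ps: score(child, ps))

def learn_network(candidates_by_child, score, workers=4):
    """Run the independent per-variable searches on a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(best_parent_set, child, cands, score)
                   for child, cands in candidates_by_child.items()]
        return dict(f.result() for f in futures)

# ASSUMPTION: a toy score that counts differences from a made-up "true" network.
TRUE_PARENTS = {"prot_X": {"mut_A"}, "prot_Y": {"mut_A", "mut_B"}}

def toy_score(child, parent_set):
    return len(set(parent_set) ^ TRUE_PARENTS[child])

candidates = {"prot_X": ["mut_A", "mut_B"], "prot_Y": ["mut_A", "mut_B"]}
print(learn_network(candidates, toy_score))
```

No coordination between the tasks is needed, because the external constraints already guarantee that any combination of the returned parent sets is acyclic.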
27. Summary
● Bayesian Networks can provide predictive models based on conditional probability distributions
● BNFinder is an effective tool for finding optimal networks given tabular data. And it's open source!
● It can be used as a command-line tool or as a library
● It can use continuous data as well as discrete
● It can be run in parallel on multiple cores (with good efficiency)
● Convenience functions (cross-validation, ROC plots) are included
http://launchpad.net/bnfinder