Data analysis and integration challenges in genomics
Uppsala, March 19, 2015
Mikael Huss, SciLifeLab / Stockholm University
Where I work
INTEGRATIVE AND TECHNOLOGY-DRIVEN RESEARCH IN HIGH-THROUGHPUT BIOLOGY
SciLifeLab – an infrastructure for massive biology
Science 328, 805 (14 May 2010)
• Inaugurated mid-2010
• Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala.
• Approximately 700 researchers
• More than 100 researchers in bioinformatics and systems biology
http://ngi-status.scilifelab.se/
National genomics facilities at SciLifeLab
Clinical Genomics Clinical biomarkers Clinical sequencing
Functional genomics
Eukaryotic Single Cell Genomics
Single Cell Proteomics
Microbial Single Cell Genomics
Karolinska High Throughput Center
(KHTC)
Bioimaging - Advanced Light Microscopy, Fluorescence Correlation Spectroscopy
Drug discovery – ADME, Antibody Therapeutics, Protein Expression &
Characterization, Lead Identification, Biophysical Screening etc.
Chemical Biology Consortium Sweden – Umeå, Uppsala, KI
Structural Biology – Protein Science Facility
National facilities at SciLifeLab
Clinical diagnostics
Affinity proteomics
Biobank profiling, Cell profiling,
Fluorescence Tissue Profiling,
Mass Cytometry, PLA Proteomics,
Protein and Peptide Arrays,
Tissue Profiling
Bioinformatics facilities
• Bioinformatics compute and storage (UPPNEX)
• Short-term support (2 weeks / 80h) + paid extension
– About 45 FTEs
• Long-term support (500h) for projects selected by
external committee
“Embedded bioinformaticians”: participate in projects on a longer-term basis
Long-term bioinformatics support group
• Currently 13 senior bioinformaticians + 2 managers
• Currently recruiting for 10 new employees and
thereby expanding from Uppsala and Stockholm to
other locations in Sweden
• Example projects (from my own work):
– Characterizing the human muscle transcriptome in connection with exercise
– Metagenomics for looking at the connection between international travel and
antibiotic resistance
– Characterizing neural stem cells in developing mouse brain
– Small RNAs involved in the CRISPR/Cas9 system in bacteria
Integrative bioinformatics initiative
(“big data” project)
• Advertising for 4 positions, 2 in Gothenburg & 2 in
Stockholm
• More in-depth support, experimental planning,
method development
• Data integration
Pilot project
Connecting layers of information
DNA: whole-genome sequencing, exome sequencing, CGH => mutations/SNVs, copy number variations, structural variations, gene fusions
RNA: RNA-seq, microarrays => mRNA isoforms, allele-specific expression, fusion transcripts, eQTLs
Proteins: high-throughput mass spectrometry => protein isoforms, post-translational modifications
My blog: Follow the Data
Machine learning, “big data”, “data science”, often in connection with life science
Published brief notes on APIs from One Codex, Google Genomics, SolveBio
Let’s get the “big data” buzzword out of the way …
… but some people are willing to go out on a limb
“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further”
(OLRAC SPS)
http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815
“There is no such thing as biomedical big data”
(Will Bush, Vanderbilt University Center for Human Genetic Research)
http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html
Genomics big data in context: Throughput
[Figure: data processed per day (terabytes, log scale from 1e+00 to 1e+06) for SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Twitter, Facebook, Baidu, NSA, Google, Ebay, Internet, World]
Genomics big data in context: Storage
[Figure: data stored (petabytes, log scale from 1 to 10000) for AZ, SciLifeLab, Spotify, Sanger, Novartis, Ebay, Facebook, Baidu, NSA, Google]
Aside: Storage & processing frameworks
Hadoop, the standard solution for “big data” in industry, has not really caught on in genomics. Why? Some ideas:
- Existing computing infrastructure is sufficient
- Or the field is focused on supercomputing solutions rather than commodity servers
- The programming/sysadmin skills and training are not there
- Many problems are not parallelizable
- Not enough flexibility for ad hoc, exploratory analysis
Spark/ADAM: a newer framework enabling more interactive, in-memory-oriented analysis
Genomics big data in context: Heterogeneity
“The size of the data is not the whole story. If the data are uniform, they can almost always be compressed and filtered with traditional methods.
You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.”
(adapted from Aleksi Kallio)
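A rough illustration of the compressibility half of this argument: the sketch below compares how far a general-purpose compressor (zlib from the Python standard library) shrinks a highly repetitive byte string versus a pseudo-random one of the same length. Both inputs are synthetic, and the helper name `compression_ratio` is made up for this example; real sequencing data would fall somewhere in between.

```python
import random
import zlib


def compression_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (lower = more compressible)."""
    return len(zlib.compress(payload, 9)) / len(payload)


# Uniform data: the same short motif repeated many times.
uniform = b"ACGT" * 25_000

# Non-uniform data: pseudo-random bytes of the same length (fixed seed for repeatability).
rng = random.Random(0)
varied = bytes(rng.getrandbits(8) for _ in range(len(uniform)))

uniform_ratio = compression_ratio(uniform)
varied_ratio = compression_ratio(varied)
```

The repetitive input compresses to a small fraction of its size, while the pseudo-random input hardly shrinks at all. Variety and continuous growth, the other factors in the quote, are not captured by this toy comparison.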
Ideas on improving data integration
1. APIs to mitigate friction in data collection and preprocessing
2. Querying “by data set”
3. Leveraging advances in machine learning
So much public data out there!
APIs
Lowering barriers to entry with APIs (application programming interfaces; ways
for a computer program to automatically retrieve information in a defined
manner).
“80% of the time of a data scientist is spent finding and preparing the data”
APIs against good reference collections mitigate the hassle of looking for the right
data sources, handling different versions/releases, etc.
We should be able to ask questions such as:
“Which gene variants in a patient have previously been associated with a specific disease?” <= addressed by SolveBio and Google Genomics (with the inclusion of the Tute annotation db)
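Conceptually, a question like this is a join between a patient's variants and a reference collection of variant-disease associations. The sketch below shows that idea with an in-memory dictionary standing in for the remote collection. It does not call SolveBio or Google Genomics, whose actual interfaces are not shown in these slides; the function name, variant identifiers and condition labels are all hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Set

# Stand-in for a curated reference collection behind an API.
# Keys are variant identifiers, values are sets of associated conditions.
# All entries are made-up placeholders, not real annotations.
REFERENCE_ASSOCIATIONS: Dict[str, Set[str]] = {
    "var_0001": {"condition_A"},
    "var_0002": {"condition_A", "condition_B"},
    "var_0003": {"condition_C"},
}


def variants_associated_with(
    patient_variants: Iterable[str],
    disease: str,
    reference: Dict[str, Set[str]] = REFERENCE_ASSOCIATIONS,
) -> List[str]:
    """Return the patient's variants that the reference links to `disease`."""
    return sorted(
        v for v in set(patient_variants) if disease in reference.get(v, set())
    )


hits = variants_associated_with(["var_0002", "var_0003", "var_9999"], "condition_B")
```

With a real API the dictionary lookup would be replaced by a request to the service, but the calling code would keep the same shape: pass in variants and a disease, get back the matching subset.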
APIs
Other questions could be, e.g.:
“Which microorganisms are found in this tissue sample?” <= addressed by the One
Codex API
“Which genes are expressed exclusively in the parathyroid gland?”
“What is the most similar expression dataset to this one that I am currently
working on?” <= partly addressed by NextBio (but it’s a commercial package!)
“Download all available sequences for arthropoda and store them as FASTQ files”
<= addressed by bionode.io
“Give me the publicly available RNA-seq sequences that support this peptide that I
found in mass spectrometry and which appears to have been translated from a
fusion transcript”
Data provenance
Researchers often want to look at processed data (avoiding the work of reprocessing everything from scratch), but they also want to know how the processing was done.
• Each data set should have an “analysis history” attached
Also important for reproducibility and paper writing
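One way to picture an attached “analysis history” is a dataset object that carries an ordered list of processing steps. The sketch below is an assumption about what such a record could look like, not an existing standard: the class names, fields, tool names and versions are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ProcessingStep:
    """One entry in the analysis history: which tool, which version, which settings."""
    tool: str
    version: str
    parameters: Tuple[Tuple[str, str], ...] = ()


@dataclass
class Dataset:
    name: str
    history: List[ProcessingStep] = field(default_factory=list)

    def record(self, tool: str, version: str, **parameters: str) -> "Dataset":
        """Append one processing step to the analysis history and return self."""
        step = ProcessingStep(tool, version, tuple(sorted(parameters.items())))
        self.history.append(step)
        return self

    def describe(self) -> List[str]:
        """Render the history as readable lines, e.g. for a methods section."""
        lines = []
        for i, step in enumerate(self.history, start=1):
            params = ", ".join(f"{k}={v}" for k, v in step.parameters)
            suffix = f" ({params})" if params else ""
            lines.append(f"{i}. {step.tool} {step.version}{suffix}")
        return lines


# Hypothetical tools and versions, purely for illustration.
counts = Dataset("example_counts")
counts.record("aligner_x", "1.0", reference="genome_build_y")
counts.record("counter_z", "0.3")
```

Because the history travels with the data, the same record can serve both reproducibility (rerun the steps) and paper writing (`describe()` gives a starting point for the methods text).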
Querying by dataset
We often want to relate our dataset to something “out there” without necessarily having a good preconception of what it could be (especially in metagenomics!).
NextBio does an interesting version of this, but it costs money (it has been acquired by Illumina) and focuses on selected types of functional studies.
Using the dataset itself, or a statistical
description of it, as a query
Jeff Jonas:
“Data finds data”
“The data is the query”
“we want to support automated data exploration in ways that are simply not possible today”
C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
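A minimal sketch of “the data is the query”, assuming each dataset has been reduced to a numeric summary profile (for instance one value per gene, in a shared gene order). The query profile is ranked against every stored profile by Pearson correlation. The choice of Pearson correlation, the function names and the toy numbers are assumptions for illustration, not a description of how NextBio or any other service works.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no ranking information
    return cov / math.sqrt(var_x * var_y)


def most_similar(
    query: Sequence[float], collection: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Rank every stored profile by its correlation with the query, best first."""
    scored = [(name, pearson(query, profile)) for name, profile in collection.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy per-gene summary profiles (same gene order in every dataset).
public_profiles = {
    "dataset_a": [1.0, 2.0, 3.0, 4.0],
    "dataset_b": [4.0, 3.0, 2.0, 1.0],
    "dataset_c": [1.0, 1.5, 3.5, 3.9],
}
ranking = most_similar([0.9, 2.1, 2.9, 4.2], public_profiles)
```

A linear scan like this would not scale to everything that has been published; a real system would need an index, but the interface (hand in a profile, get back ranked datasets) stays the same.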
Cumulative biology and metagenomics:
The unknown
http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
“Biological dark matter”
“The unknown continent”
According to one estimate, less than 1% of the viral diversity has been explored!
=> Reference databases very limited!
The unknown
In a recent paper on soil metagenomics, Titus Brown and colleagues report that:
• 80% of the 398 billion sequences they obtained could not be assembled into putative genes
• Of the sequences that could be assembled into putative genes encoding putative proteins, 60% of the proteins could not be matched to anything in the databases!
Ergo…
For metagenomics in particular, but also for other applications, we would like everything that has been published to be indexed in a better way, so that we can relate new results to it. We need a constantly growing index.
When we perform a new experiment, we could then relate our results to all of the data out there, not just the part that has made it into the official reference databases.
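One possible shape for a “constantly growing index” is an inverted index from short subsequences (k-mers) to the datasets that contain them. The toy sketch below keeps everything in memory and uses exact k-mer matches; the class name, the choice of k = 4 and the example sequences are assumptions for illustration only, and a real index would need far more compact data structures.

```python
from collections import defaultdict
from typing import Dict, Iterator, Set


class KmerIndex:
    """Toy inverted index from k-mers to the datasets that contain them."""

    def __init__(self, k: int = 4) -> None:
        self.k = k
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def _kmers(self, sequence: str) -> Iterator[str]:
        seq = sequence.upper()
        for i in range(len(seq) - self.k + 1):
            yield seq[i : i + self.k]

    def add(self, dataset_id: str, sequence: str) -> None:
        """Index a newly published sequence; entries are only ever added."""
        for kmer in self._kmers(sequence):
            self._postings[kmer].add(dataset_id)

    def query(self, sequence: str) -> Dict[str, float]:
        """Return, per dataset, the fraction of the query's distinct k-mers it shares."""
        query_kmers = set(self._kmers(sequence))
        if not query_kmers:
            return {}
        shared: Dict[str, int] = defaultdict(int)
        for kmer in query_kmers:
            for dataset_id in self._postings.get(kmer, ()):
                shared[dataset_id] += 1
        return {d: n / len(query_kmers) for d, n in shared.items()}


index = KmerIndex(k=4)
index.add("study_1", "ACGTACGTGG")
index.add("study_2", "TTTTGGGGCC")
result = index.query("ACGTACGA")
```

New studies can be added at any time with `add`, and a fresh experiment can be related to everything indexed so far with `query`, without waiting for the sequences to enter a curated reference database.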
Machine learning
Google has had great success with deep learning …
Learning to recognize cats from unlabeled YouTube videos (2012)
Neural network with “3 million neurons and 1 billion synapses”
…now it’s all over the place
Inaugural Stockholm deep learning meetup,
March 10, 2015
Deep learning
Perhaps deep learning could be used in genomics, proteomics etc to
transform diverse data sets into a more general representation which would
facilitate data integration?
New datasets can then be overlaid onto representations trained on large
collections.
Deep learning in genomics (1)
How do gene expression patterns relate to cell type and state? Classifying expression profiles into cell types is hard because cell types really form a hierarchy, where different genes are important at different levels of the hierarchy.
We may be starting to accumulate enough data to enable a deep learning approach to learn a hierarchical representation of cell state based on expression profiles (particularly with all the single-cell RNA-seq data now coming out).
First step: Casey Greene’s group (Dartmouth)
A denoising autoencoder learned a generalized
representation of breast cancer expression
profiles based on the METABRIC cohort (>2000
samples). Validated on TCGA.
The nodes in the net can be interpreted to stand
for different biological features.
Tan et al. (2015)
Deep learning in genomics (2)
Convolutional network for splice site detection
Reads the DNA sequence directly and abstracts into higher-level features.
This network learned patterns of splice sites
And also re-discovered the concept of codons
Hannes Bretschneider: http://www.psi.toronto.edu/~hannes/resources/MLCB2014-Presentation.pdf
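To make “reads the DNA sequence directly” concrete, the sketch below one-hot encodes a sequence and slides a single convolution filter along it. This is only the first-layer operation of a convolutional network, not the network from the presentation: the filter weights are fixed by hand to respond to the dinucleotide GT (commonly found at the 5' end of introns), whereas a trained network would learn its filters from data. Function names are made up for this example.

```python
from typing import List

ALPHABET = "ACGT"


def one_hot(sequence: str) -> List[List[float]]:
    """Encode DNA as rows of 4 values (A, C, G, T); unknown bases become all zeros."""
    return [
        [1.0 if base == letter else 0.0 for letter in ALPHABET]
        for base in sequence.upper()
    ]


def conv1d(encoded: List[List[float]], kernel: List[List[float]]) -> List[float]:
    """Slide a (width x 4) filter along the sequence; one activation per start position."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(ALPHABET)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        activations.append(total)
    return activations


# Hand-written filter that responds to the dinucleotide "GT".
# In a trained network these weights would be learned, not specified.
gt_filter = [
    [0.0, 0.0, 1.0, 0.0],  # filter position 0 matches G
    [0.0, 0.0, 0.0, 1.0],  # filter position 1 matches T
]

scores = conv1d(one_hot("AAGGTAAGT"), gt_filter)
best_positions = [i for i, s in enumerate(scores) if s == max(scores)]
```

Stacking several such layers, with learned filters and nonlinearities in between, is what lets a network abstract from raw bases to higher-level features.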
“Classical” machine learning
Predictive modeling as a way to integrate information from different experimental assays.
Example: ongoing mouse neural development project
A number of genome-wide experiments have been done in developing spinal cord and cortex, giving measurements (genome-wide signals) of:
- Gene expression (RNA-seq)
- Where the Sox2 transcription factor is bound in each tissue (ChIP-seq)
- How open/accessible the chromatin is (DNase-seq)
- Potential transcription factor binding sites (DNase footprints)
as well as some calculated features, such as certain interesting “DNA words” (transcription factor binding motifs) and how conserved each stretch of DNA is between mice and other organisms.
How to make some sense of all these data?
“Genome browser” view of the genomic landscape around a gene
[Figure: data tracks for the gene, conservation, “openness” and Sox2 binding; raw signal and peaks]
(borrowed from Mark Gerstein)
We decided we are most interested in
understanding differences in gene
expression between spinal cord and
cortex neurons. Can the other
measurements help?
Progressively summarized and
abstracted the raw signals into blocks
with various features => matrix of
~20,000 genes x 13 features
Use machine learning techniques to
predict relative gene expression in
cortex/spinal cord based on these
features (ongoing…)
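As a sketch of the general setup (a per-gene feature matrix on one side, relative expression on the other), the code below computes a log2 expression ratio as the prediction target and fits a plain linear model by batch gradient descent. The slides do not say which learning method the project uses, so linear regression is only a simple stand-in; the pseudocount, the two toy features and all numbers are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def log2_ratio(cortex: float, spinal_cord: float, pseudocount: float = 1.0) -> float:
    """Relative expression as log2((cortex + pc) / (spinal_cord + pc))."""
    return math.log2((cortex + pseudocount) / (spinal_cord + pseudocount))


def fit_linear(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    learning_rate: float = 0.1,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit targets ~ features . weights + bias by batch gradient descent on squared error."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, y in zip(features, targets):
            error = sum(w * x for w, x in zip(weights, row)) + bias - y
            grad_b += error
            for j in range(d):
                grad_w[j] += error * row[j]
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(row: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Predicted relative expression for one gene's feature row."""
    return sum(w * x for w, x in zip(weights, row)) + bias


# Toy data: one row per gene, two hypothetical summary features per gene.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
# Hypothetical (cortex, spinal cord) expression values for the same genes.
toy_expression = [(30.0, 7.0), (3.0, 15.0), (9.0, 9.0), (5.0, 4.0), (60.0, 6.0)]
toy_targets = [log2_ratio(c, s) for c, s in toy_expression]

toy_weights, toy_bias = fit_linear(toy_features, toy_targets)
```

With the real ~20,000 x 13 matrix one would hold out genes for evaluation and probably compare several model families; the fitted weights then give a first indication of which features carry information about the expression difference.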
Recap
Indexing and querying technology such as Google’s can help genomics researchers by, e.g.:
- Enabling programmatic access to published data (processed, but with a known analysis history) to lower the threshold for integrative analysis
- Allowing them to relate their datasets to other published data without overly relying on curated reference databases (cumulative biology)
- Facilitating ingestion into machine learning (e.g. deep learning) systems for learning general features of biological data from a very large set of samples
Extra slides

Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 

Recently uploaded (20)

Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 

Data analysis & integration challenges in genomics

  • 1. Data analysis and integration challenges in genomics Uppsala March 19, 2015 Mikael Huss, SciLifeLab / Stockholm University
  • 3. SciLifeLab – an infrastructure for massive biology (Science 328, 805; 14 May 2010). Inaugurated mid-2010. Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala. Approximately 700 researchers. More than 100 researchers in bioinformatics and systems biology.
  • 5. Clinical Genomics Clinical biomarkers Clinical sequencing Functional genomics Eukaryotic Single Cell Genomics Single Cell Proteomics Microbial Single Cell Genomics Karolinska High Throughput Center (KHTC) Bioimaging - Advanced Light Microscopy, Fluorescence Correlation Spectroscopy Drug discovery – ADME, Antibody Therapeutics, Protein Expression & Characterization, Lead Identification, Biophysical Screening etc. Chemical Biology Consortium Sweden – Umeå, Uppsala, KI Structural Biology – Protein Science Facility National facilities at SciLifeLab Clinical diagnostics Affinity proteomics Biobank profiling, Cell profiling, Fluorescence Tissue Profiling, Mass Cytometry, PLA Proteomics, Protein and Peptide Arrays, Tissue Profiling
  • 6. Bioinformatics facilities • Bioinformatics compute and storage (UPPNEX) • Short-term support (2 weeks / 80h) + paid extension – About 45 FTEs • Long-term support (500h) for projects selected by external committee “embedded bioinformaticians” Participate in projects on a longer term basis
  • 7. Long-term bioinformatics support group • Currently 13 senior bioinformaticians + 2 managers • Currently recruiting for 10 new employees and thereby expanding from Uppsala and Stockholm to other locations in Sweden • Example projects (from my own work): – Characterizing the human muscle transcriptome in connection with exercise – Metagenomics for looking at the connection between international travel and antibiotic resistance – Characterizing neural stem cells in developing mouse brain – Small RNAs involved in the CRISPR/Cas9 system in bacteria
  • 8. Integrative bioinformatics initiative (“big data” project) • Advertising for 4 positions, 2 in Gothenburg & 2 in Stockholm • More in-depth support, experimental planning, method development • Data integration
  • 9. Pilot project: Connecting layers of information. DNA (whole-genome sequencing, exome sequencing, CGH): mutations, SNVs, copy number variations, structural variations, gene fusions. RNA (RNA-seq, microarrays): mRNA isoforms, allele-specific expression, fusion transcripts, eQTLs. Proteins (high-throughput mass spectrometry): protein isoforms, post-translational modifications.
  • 11. My blog: Follow the Data Machine learning, “big data”, “data science”, often in connection with life science Published brief notes on APIs from One Codex, Google Genomics, SolveBio
  • 12. Let’s get the “big data” buzzword out of the way …
  • 13. … but some people are willing to go out on a limb “Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further” (OLRAC SPS) http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815 “There is no such thing as biomedical big data” (Will Bush, Vanderbilt University Center for Human Genetic Research) http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html
  • 14. Genomics big data in context: Throughput [Figure: data processed per day (terabytes), log scale from 1e+00 to 1e+06; entries: SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Twitter, Facebook, Baidu, NSA, Google, Ebay, Internet World]
  • 15. Genomics big data in context: Storage [Figure: data stored (petabytes), log scale from 1 to 10000; entries: AZ, SciLifeLab, Spotify, Sanger, Novartis, Ebay, Facebook, Baidu, NSA, Google]
  • 16. Aside: Storage & processing frameworks Hadoop, the standard solution for “big data” in industry, has not really caught on in genomics … Why? Some ideas: - Existing computing infrastructure is sufficient - Or, focused on supercomputing solutions rather than commodity servers - The programming/sysadmin skills and training are not there - Many problems not parallelizable - Not enough flexibility for ad hoc, exploratory analysis Spark/ADAM, new framework enabling more interactive and in-memory-oriented analysis
  • 17. Genomics big data in context: Heterogeneity “The size of the data is not the whole story. If the data are uniform, they can almost always be compressed and filtered with traditional methods. You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.” (adapted from Aleksi Kallio)
  • 18. Ideas on improving data integration 1. APIs to mitigate friction in data collection and preprocessing 2. Querying “by data set” 3. Leveraging advances in machine learning So much public data out there!
  • 19. APIs Lowering barriers to entry with APIs (application programming interfaces; ways for a computer program to automatically retrieve information in a defined manner). “80% of the time of a data scientist is spent finding and preparing the data” APIs against good reference collections mitigate the hassle of looking for the right data sources, handling different versions/releases, etc. We should be able to ask questions such as: “Which gene variants in a patient have been previously associated with a specific disease?” <= addressed by SolveBio and Google Genomics (with the inclusion of the Tute annotation db)
  • 24. APIs Other questions could be, e.g.: “Which microorganisms are found in this tissue sample?” <= addressed by the One Codex API “Which genes are expressed exclusively in the parathyroid gland?” “What is the most similar expression dataset to this one that I am currently working on?” <= partly addressed by NextBio (but it’s a commercial package!) “Download all available sequences for arthropoda and store them as FASTQ files” <= addressed by bionode.io “Give me the publicly available RNA-seq sequences that support this peptide that I found in mass spectrometry and which appears to have been translated from a fusion transcript”
  • 25. Data provenance Researchers often want to look at processed data (avoiding the work of reprocessing everything from scratch) but they want to know how the processing was done. Each data set should have an “analysis history” attached. Also important for reproducibility and paper writing.
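One way to picture the “analysis history” idea from the data provenance slide is a data set object that records each processing step as it is applied. The following is only a minimal sketch: the class names, fields, tool names (`aligner`, `counter`) and parameter values are hypothetical placeholders, not part of any system described in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ProcessingStep:
    """One entry in a data set's analysis history (all fields are illustrative)."""
    tool: str
    version: str
    parameters: Dict[str, str]


@dataclass
class DataSet:
    """A processed data set that carries its own analysis history."""
    name: str
    history: List[ProcessingStep] = field(default_factory=list)

    def record(self, tool: str, version: str, **parameters: str) -> "DataSet":
        # Append a step and return self so that calls can be chained.
        self.history.append(ProcessingStep(tool, version, dict(parameters)))
        return self

    def describe(self) -> str:
        # Render the history as text, e.g. as a starting point for a methods section.
        lines = [f"Analysis history for {self.name}:"]
        for i, step in enumerate(self.history, start=1):
            params = ", ".join(f"{k}={v}" for k, v in sorted(step.parameters.items()))
            lines.append(f"  {i}. {step.tool} {step.version} ({params})")
        return "\n".join(lines)


# Hypothetical tools and parameters, used only to exercise the sketch.
counts = DataSet("muscle_rnaseq_counts")
counts.record("aligner", "1.0", reference="hg19").record("counter", "0.2", mode="union")
print(counts.describe())
```

Keeping the history next to the data means that anyone who downloads the processed file can see how it was produced without contacting the original authors.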
  • 27. Querying by dataset: we often want to relate our dataset to something “out there” without necessarily having a good preconception of what it could be (especially in metagenomics!). NextBio does an interesting version of this but costs money (has been acquired by Illumina) and focuses on selected types of functional studies. Using the dataset itself, or a statistical description of it, as a query. Jeff Jonas: “Data finds data”, “The data is the query”. “we want to support automated data exploration in ways that are simply not possible today” C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
  • 28. Cumulative biology and metagenomics: The unknown http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html “Biological dark matter” “The unknown continent” According to one estimate, less than 1% of the viral diversity has been explored! => Reference databases very limited!
  • 29. The unknown In a recent paper on soil metagenomics, Titus Brown and colleagues report that: (1) 80% of the 398 billion sequences they obtained could not be assembled into putative genes; (2) of the cases where sequences could be assembled into putative genes which would create putative proteins, 60% of these proteins could not be matched to anything in the databases!
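The idea of using “a statistical description” of a dataset as the query can be sketched as a similarity search: summarize each dataset as a numeric profile and rank indexed datasets by how well they correlate with the query. This is only an illustration under stated assumptions. The index contents, dataset names, and the choice of Pearson correlation are all made up for the example, and it does not describe how NextBio or any other service works.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def query_by_dataset(
    query: List[float], index: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Rank every indexed profile by its correlation with the query profile."""
    scored = [(name, pearson(query, profile)) for name, profile in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy index: each profile is a summary over the same four features
# (for example, mean expression of four genes). Values are invented.
index = {
    "dataset_A": [5.0, 1.0, 0.0, 3.0],
    "dataset_B": [0.0, 4.0, 6.0, 1.0],
    "dataset_C": [4.0, 2.0, 1.0, 2.5],
}
query = [10.0, 2.0, 0.5, 6.0]
ranking = query_by_dataset(query, index)
for name, score in ranking:
    print(f"{name}: r = {score:.3f}")
```

A real index would need a compact, comparable summary for very heterogeneous data, which is where the choice of representation becomes the hard part.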
  • 30. Ergo… For metagenomics in particular, but also for other applications, we would like to have everything that has been published indexed in a better way, so we can relate new stuff to those. We need to have a constantly growing index. When we perform a new experiment, we could then relate our results to all of the data out there, not just the part that has made it into the official reference databases.
  • 31. Machine learning Google has had great success with deep learning … Learning to recognize cats from unlabeled YouTube videos (2012) Neural network with “3 million neurons and 1 billion synapses” …now it’s all over the place Inaugural Stockholm deep learning meetup, March 10, 2015
  • 32. Deep learning Perhaps deep learning could be used in genomics, proteomics etc to transform diverse data sets into a more general representation which would facilitate data integration? New datasets can then be overlaid onto representations trained on large collections.
  • 34. Deep learning in genomics (1) How do gene expression patterns relate to cell type and state? Hard problem to classify expression profiles into cell types because it is really a hierarchy where different genes are important at different levels of the hierarchy We may be starting to accumulate enough data to enable a deep learning approach to learn a hierarchical representation of cell state based on expression profiles (particularly with all the single-cell RNA-seq data now coming out) First step: Casey Greene’s group (Dartmouth) A denoising autoencoder learned a generalized representation of breast cancer expression profiles based on the METABRIC cohort (>2000 samples). Validated on TCGA. The nodes in the net can be interpreted to stand for different biological features. Tan et al. (2015)
  • 35. Deep learning in genomics (2) Convolutional network for splice site detection Reads the DNA sequence directly and abstracts into higher-level features. This network learned patterns of splice sites And also re-discovered the concept of codons Hannes Bretschneider: http://www.psi.toronto.edu/~hannes/resources/MLCB2014-Presentation.pdf
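To make “reads the DNA sequence directly” concrete, the sketch below one-hot encodes a DNA string and slides a single convolution filter over it. It is not the network from the linked presentation: in a trained convolutional network the filter weights are learned, whereas here one filter is set by hand to respond to the GT dinucleotide that typically begins an intron. The function names and the example sequence are illustrative.

```python
from typing import List

BASES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as rows of four values (A, C, G, T); unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded: List[List[float]], kernel: List[List[float]]) -> List[float]:
    """Slide a (width x 4) filter along the sequence and return one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores


# Hand-set filter (an assumption for illustration): G at the first position, T at the second.
gt_filter = [
    [0.0, 0.0, 1.0, 0.0],  # position 0 rewards G
    [0.0, 0.0, 0.0, 1.0],  # position 1 rewards T
]

activations = conv1d(one_hot("CAGGTAAGT"), gt_filter)
# A score of 2.0 means both positions of the filter matched.
hits = [i for i, a in enumerate(activations) if a == 2.0]
print(activations)
print(hits)
```

Stacking many such filters, followed by further layers, is what lets a network build up from short motifs to higher-level features.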
  • 36. “Classical” machine learning Predictive modeling as a way to integrate information from different experimental assays. Example: ongoing mouse neural development project A number of genome-wide experiments have been done in developing spinal cord and cortex; have measurements/genome-wide signals about: - Gene expression (RNA-seq) - Where the Sox2 transcription factor is bound in each tissue (ChIP-seq) - How open/accessible the chromatin is (DNase-seq) - Potential transcription factor binding sites (DNase footprints) as well as some calculated features like certain interesting “DNA words” (transcription factor binding motifs) and how conserved each stretch of DNA is between mice and other organisms. How to make some sense of all these data?
  • 37. [Figure: “genome browser” view of the genomic landscape around a gene, showing different data tracks: gene, conservation, “openness”, and Sox2 binding (raw signal and peaks)]
  • 38. (borrowed from Mark Gerstein) We decided we are most interested in understanding differences in gene expression between spinal cord and cortex neurons. Can the other measurements help? Progressively summarized and abstracted the raw signals into blocks with various features => matrix of ~20,000 genes x 13 features Use machine learning techniques to predict relative gene expression in cortex/spinal cord based on these features (ongoing…)
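The genes-by-features setup described above can be sketched as an ordinary supervised learning problem. The example below is a stand-in, not the project's actual method: the slides do not say which technique is used, so plain linear regression fitted by gradient descent is assumed here, the data are synthetic, and three features are used instead of thirteen to keep it short.

```python
import random
from typing import List, Tuple


def predict(weights: List[float], bias: float, row: List[float]) -> float:
    """Linear prediction for one gene's feature vector."""
    return bias + sum(w * x for w, x in zip(weights, row))


def fit_linear(
    X: List[List[float]], y: List[float], lr: float = 0.1, epochs: int = 500
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            err = predict(weights, bias, row) - target
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Synthetic stand-in for the genes x features matrix: 200 "genes", 3 "features".
rng = random.Random(0)
true_w = [1.5, -2.0, 0.5]  # invented effect sizes
X = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(200)]
# Target plays the role of relative expression (cortex vs spinal cord), plus noise.
y = [sum(w * x for w, x in zip(true_w, row)) + rng.gauss(0, 0.1) for row in X]

weights, bias = fit_linear(X, y)
print([round(w, 2) for w in weights], round(bias, 2))
```

With real data, the fitted weights would hint at which measurements (binding, accessibility, conservation, motifs) carry information about expression differences, though a nonlinear model may be needed.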
  • 39. Recap Indexing and querying technology such as Google’s can help genomics researchers by, e.g.: - Enabling programmatic access to published data (processed but with a known analysis history) to lower the threshold for integrative analysis - Allowing them to relate their datasets to other published data without overly relying on curated reference databases (cumulative biology) - Facilitating ingestion into machine learning (e.g. deep learning) systems for learning general features of biological data from a very large set of samples