Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015
1. Dr. Sven Nahnsen,
Quantitative Biology Center (QBiC)
Data Management for Quantitative
Biology
Lecture 1: Introduction and overview
2. Overview
• Administrative stuff (credits, requirements)
• Motivation/quick review of relevant contents (Bioinformatics I and II)
• Introduction to this lecture series
• Semester overview
3. Course requirements
To pass this course you must:
• regularly and actively participate in the weekly problem sessions,
• pass the final exam, assignments and projects
• You have to work on assignments alone
• You will work in small groups for the problem-orientated research project
4. Course credits and grading
• Credits
  - MSc Bioinfo: 4 LP, module "Wahlpflichtbereich Bioinformatik"
  - MSc Info: 4 LP, area "Wahlpflichtbereich Informatik"
• Grade
  - 30% assignments
  - 20% project
  - 50% finals
• Finals: oral exam (30 minutes) covering the contents of the whole lecture, the assignments and the project
• Finals will be scheduled at the end of the semester (Thu, 30/07/2015)
5. Recommended literature
• We will point to relevant papers during the course of the lecture series
• Important overview papers:
  - Hastings et al., 2005, Quantitative Bioscience for the 21st Century. BioScience, Vol. 55, No. 6
  - Cohen JE (2004) Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better. PLoS Biol 2(12): e439.
• Books
  - Free e-book: Data Management in Bioinformatics (http://en.wikibooks.org/wiki/Data_Management_in_Bioinformatics)
  - Lacroix, Z.; Critchlow, T. (eds): Bioinformatics: Managing Scientific Data. Morgan Kaufmann Publishers, San Francisco 2003
  - Michael E. Wall, Quantitative Biology: From Molecular to Cellular Systems. 2012. Chapman & Hall
  - Pierre Bonnet. Enterprise Data Governance: Reference and Master Data Management Semantic Modeling. 2013. Wiley
• Web resources
  - http://www.ariadne.ac.uk: Ariadne, Web Magazine for Information Professionals
  - http://www.dama.org: The Global Data Management Community
  - H.D. Ehrich: http://www.ifis.cs.tu-bs.de/node/2855
6. Recommended Software
• These software tools, frameworks and web servers will be used during the problem sessions:
  - openBIS: http://www.cisd.ethz.ch/software/openBIS
  - Galaxy: https://usegalaxy.org
  - Vaadin: https://vaadin.com/home
  - KNIME: https://www.knime.org/
7. Contact and organization
• Questions concerning the lecture/assignments
  - dmqb-ss15@informatik.uni-tuebingen.de
• Website
  - abi.inf.uni-tuebingen.de/Teaching/ws-2013-14/CPM
• Christopher Mohr (Sand 14, C322), Andreas Friedrich (Sand 14, C304)
• Dr. Sven Nahnsen (Quantitative Biology Center, Auf der Morgenstelle 10, C2P43, please send an e-mail first)
• Course material will be available on the website (see above), through social media channels (facebook.com/qbic.tuebingen, twitter.com/qbic_tue) and, if wished, as a hard copy during the lecture
8. Who am I
• Most information about me and our work can be found here: www.qbic.uni-tuebingen.de
9. Contents of this lecture
Date                    | Lecturer | Lecture (8-10 AM)
Thursday, 16 April 2015 | Nahnsen  | Introduction and overview
Thursday, 23 April 2015 | Nahnsen  | Biological Data Management
Thursday, 30 April 2015 | Czemmel  | Data sources ("Next-generation" technologies)
Dr. Stefan Czemmel
10. Contents of this lecture
Date                  | Lecturer | Lecture (8-10 AM)
Thursday, 7 May 2015  | Codrea   | Database systems (MySQL, NoSQL, etc.)
Thursday, 14 May 2015 | -        | Ascension Day (Himmelfahrt)
Thursday, 21 May 2015 | Czemmel  | LIMS and E-Lab books
Thursday, 28 May 2015 | Kenar    | Experimental Design
Dr. Marius Codrea, Erhan Kenar
11. Contents of this lecture
Date                   | Lecturer                 | Lecture (8-10 AM)
Thursday, 4 June 2015  | -                        | Corpus Christi Day (Fronleichnam)
Thursday, 11 June 2015 | Nahnsen                  | Data analysis workflows (I)
Thursday, 18 June 2015 | Nahnsen                  | Data analysis workflows (II)
Thursday, 25 June 2015 | Nahnsen                  | Standardization
Thursday, 2 July 2015  | Nahnsen                  | Big Data
Thursday, 9 July 2015  | Nahnsen                  | Integrated data management (OpenBIS, OpenBEB)
Thursday, 16 July 2015 | Nahnsen                  | Applications
Thursday, 23 July 2015 | Nahnsen                  | Exam preparation
Thursday, 30 July 2015 | Nahnsen, Mohr, Friedrich | EXAMS
12. What is your background?
Ad hoc collection from the audience, Apr. 16, 2015
• Computer Science
• Bioinformatics (immunoinformatics; user front-end; integration, visualization)
• Biology
• Drug design
• Agricultural biology (plant breeding)
• Bioinformatics (Tx, NGS)
• Geoecology
• (ecology)
• Biochemistry; Molecular Biology
• Structural Biology
• Electronic business
13. Let us brainstorm
Ad hoc collection from the audience, Apr. 16, 2015
• What is data management?
  - Rapid access to data
  - Selective access to data; database queries
  - Combine data; manipulate data efficiently
  - Big data storage/analysis
  - Curating quality
  - Data visualization
  - Make data interpretable
14. Let us brainstorm
• What is data management?
Image: http://zonese7en.com/wp-content/uploads/2014/04/Data-Management.jpg, accessed Apr 10, 2015, 11 AM
15. Data Management
• The official definition provided by DAMA (Data Management Association) International, the professional organization for those in the data management profession, is: "Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise."
• Further, the DAMA Data Management Body of Knowledge (DAMA-DMBOK) states: "Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."
Wikipedia: http://en.wikipedia.org/wiki/Data_management, accessed Mar 30, 2015, 10 PM
16. 10 data management functions according to the DAMA Data Management Body of Knowledge (DMBOK)
17. Data governance
"Planning, supervision and control over data management and use"
• Strategy
• Organization and roles
• Policies and standards
• Projects and services
• Issues
• Valuation
Source: DAMA DMBOK Guide, p. 10
Image source: http://meship.com
18. Data quality management
"Defining, monitoring and improving data quality"
• Data cleansing
• Data integrity
• Data enrichment
• Data quality
• Data quality assurance
Source: DAMA DMBOK Guide, p. 10
Image source: http://www.arcplan.com/
19. Data architecture management
• Data architecture
• Data analysis
• Data design (modeling)
Source: DAMA DMBOK Guide, p. 10
Image source: atasourceconsulting.com
20. Data development
• Analysis
• Data modeling
• Database design
• Implementation
Source: DAMA DMBOK Guide, p. 10
Image source: dataone.org
"Data development is the process of building a data set for a specific purpose. The process includes identifying what data are required and how feasible it is to obtain the data. Data development includes developing or adopting data standards in consultation with stakeholders to ensure uniform data collection and reporting, and obtaining authoritative approval for the data set." (A guide to data development, Australian Institute of Health and Welfare, Canberra, 2007)
21. Database management
• Data maintenance
• Data administration
• Database management system
Source: DAMA DMBOK Guide, p. 10
23. Reference and Master Data management
• External/internal codes
• Customer data
• Product data
• Dimension management (why do different dimensions (entities) relate to each other?)
• Taxonomy/Ontology
Source: DAMA DMBOK Guide, p. 10
[Figure labels: Master, Reference; Reference data management]
25. Data warehousing and business intelligence management
• Architecture
• Implementation
• Training and support
• Monitoring and tuning
Source: DAMA DMBOK Guide, p. 10
[Figure: data warehouse containing raw data, metadata, ..., summary data]
26. Data warehousing and business intelligence management
[Figure labels: input; data warehouse (raw data, metadata, ..., summary data); business intelligence; report]
27. Document, record and content management
• Acquisition and storage
• Backup and recovery
• Content management
• Retrieval
• Retention
Source: DAMA DMBOK Guide, p. 10
28. Metadata management
Metadata is data about data
• Architecture
• Integration
• Control
• Delivery
Source: DAMA DMBOK Guide, p. 10
29. DAMA DMBOK
• A broad collection of all disciplines and subtopics that are summarized under the umbrella of data management
• Many of these concern business-related issues, but many of the concepts apply well to the field of bioscience
• We will come back to various aspects of the DAMA DMBOK during the course
30. Data management needs in science and research
• Survey at the University of Oregon, USA (Brian Westra. "Data Services for the Sciences: A Needs Assessment". July 2010, Ariadne Issue 64, http://www.ariadne.ac.uk/issue64/westra/)
• Different scientific disciplines
31. Data management in science and research
Brian Westra. "Data Services for the Sciences: A Needs Assessment". July 2010, Ariadne Issue 64, http://www.ariadne.ac.uk/issue64/westra/, accessed Apr. 10, 2015, 11 AM
[Figure: chart over categories 1-11, with the following legend]
1 Data storage and backup
2 Making scientific data findable by others
3 Connecting data acquisition to data storage
4 Allowing or controlling access to scientific data by others
5 Documenting and tracking updates
6 Data analysis and manipulation
7 Finding and accessing related data from others
8 Connecting data storage to data analysis
9 Linking this data to publications or other assets
10 Ensuring data is secure and trustworthy
11 Others
32. Let us brainstorm
• What is Quantitative Biology?
Ad hoc collection from the audience, Apr. 16, 2015
  - Not only yes/no, but put amounts to entities
  - Huge amount of data
  - Quantitative methods to study biology
  - System-wide analysis; specific pathways
  - Make results human readable and accessible
33. Quantitative Biology
• The term quantitative biology was coined by Hastings et al., 2005.
• High-throughput methods have led to a paradigm shift in biomedical research
• Traditionally, the focus was on one-molecule-at-a-time for most bio(medical) research projects
• Now, data on whole genomes, exomes, epigenomes, transcriptomes, proteomes and metabolomes can be generated at low cost.
• The term quantitative biology is used to describe this paradigm shift. Improvements in this area have been driven mainly by two technological developments:
Hastings et al., 2005, Quantitative Bioscience for the 21st Century. BioScience, Vol. 55, No. 6
34. Technological innovations
• State-of-the-art mass spectrometers coupled to high-performance liquid chromatography through soft ionization techniques (HPLC-ESI-MS) have quickly changed the way we do proteomics, metabolomics, and lipidomics.
• Next-generation sequencing has similarly changed the way we look at genomes, epigenomes, transcriptomes, and metagenomes. Due to advances in chemistry and imaging, sequencing reactions have been parallelized on a very large scale. The comprehensiveness of the data produced by high-throughput methods makes them particularly interesting as general-purpose analytical and diagnostic techniques.
35. Technological innovations
• Imaging technologies can now produce high-resolution pictures of fine-grained cellular details at a very high speed
• Finally, methods from bioinformatics and computational biology have matured to rapidly analyze the huge raw data sets that are generated by these high-throughput technologies
36. Contents from Bioinformatics II (high-throughput technologies)
• Most of the high-throughput technologies have been introduced during the Bioinformatics II lecture
• There are specialized lectures on "Transcriptomics" and on "Computational Proteomics and Metabolomics"
• We will give a short recap of the Bioinformatics II contents that are relevant for this lecture
• More advanced topics on data generation methods will be introduced in lecture 3 by Dr. Stefan Czemmel (focus on next-generation sequencing)
37. Origin of the "Central Dogma of Molecular Biology" (Francis Crick, 1956)
The central dogma of molecular biology
• First articulation by Francis Crick in 1956
• Published in Nature in 1970
38. The central dogma: classical view
• In general, the classic view reflects how biology is (biological data are) organized
• Genomics, however, enabled a more complex view
Source: Cox Systems Biology Lab | Research, University of Toronto, Canada
40. Recap Bioinformatics II: Systems biology
• Quantitative data on various levels of biological complexity form the foundation of systems biology
• Mathematical modeling has been based on gene expression
• Recent important technological improvements allow the analysis of protein and metabolite profiles to a great depth
• These are important layers for understanding biology
• New experimental techniques pose tremendous challenges for computational analysis
41. Recap Bioinformatics II: Aims of Systems Biology
• Describe large-scale organization
• Quantitative modeling
• Describe cell as system of networks
  - Fundamental research: time-resolved quantitative understanding of living systems
  - Medicine: enable personalized medicine (e.g., improve treatment strategies for cancer patients)
  - Biotechnology: improve production, degradation, construction of synthetic organisms, etc.
42. Exp. Methods: Transcriptomics
• Extract and amplify RNA
• Hybridization on microarray
• Identify and quantify by fluorescence signal
• Sequences can be mapped back to genome
Lindsay, Nature Rev. Drug Discovery, 2003, 2, 803
43. Microarray Data Analysis
• Key problems in microarray data analysis are
  - Data normalization
  - Clustering
  - Dimension reduction
  - Diagnostics/classification
  - Network inference
  - Visualization of results
Janko Dietzsch, Nils Gehlenborg and Kay Nieselt. Mayday - a microarray data analysis workbench. Bioinformatics 2006 22(8):1010-1012
45. Genome sequencing
• 2001: initial publication
• 2003: 2nd draft "Human Genome"
• > 13 years of work and > 3*10^9 $
• 2010: 8 days, 1*10^4 $
• Today: approximately 5.5 days and < 1*10^4 $
• Future: within 3 years, the biotech company Pacific Biosciences expects a similar amount of data in < 15 min for < 1*10^3 $
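To make the scale of this cost drop concrete, the figures above can be compared directly. The following is a minimal sketch; the dictionary labels are illustrative, and the dollar values are the bounds quoted on the slide (so the ratios are rough bounds too, not measurements).

```python
# Cost figures as quoted on the slide (US dollars); labels are illustrative only.
costs = {
    "human genome draft (2001-2003)": 3e9,    # "> 3*10^9 $", a lower bound
    "2010": 1e4,                               # "1*10^4 $"
    "projected (Pacific Biosciences)": 1e3,    # "< 1*10^3 $", an upper bound
}

def fold_reduction(before: float, after: float) -> float:
    """Return how many times cheaper `after` is compared to `before`."""
    return before / after

baseline = costs["human genome draft (2001-2003)"]
for label, cost in costs.items():
    ratio = fold_reduction(baseline, cost)
    print(f"{label}: {cost:.0e} $ ({ratio:,.0f}x cheaper than the baseline)")
```

With these numbers, the 2010 price is at least 300,000 times lower than the cost of the first human genome.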
46. Status genomics/transcriptomics
• Dramatic drop in cost for genome sequencing
• Number of sequenced genomes grows continuously
• The genome is a very static snapshot of a living system
• Biological adaptation is rather slow; long-term information storage
• Proteins and their reaction products (metabolites) are much closer to reality
• Genome and transcriptome databases are an essential basis for proteomics and metabolomics research
47. Genomics vs. Proteomics
Genomics                                     | Proteomics
Genomes are rather static                    | Proteomes are dynamic (age, tissue, breakfast, ...)
~ 20 k genes                                 | up to 1000 k proteins
established technology (capillary sequencer) | emerging technologies (MS, HPLC/MS, protein chips)
59. Large-scale study data: 1000 Genomes
• Sample lists and sequencing progress
• Variant calls
• Alignments
• Raw sequence files
http://www.1000genomes.org/data
60. Large-scale study data: The Cancer Genome Atlas (TCGA)
• TCGA aims to help diagnose, treat and prevent cancer
• It explores the entire spectrum of genomic changes involved in more than 20 types of human cancer.
• Approx. 2 PB of genomic raw data
http://cancergenome.nih.gov
61. Laboratory information management systems / electronic lab books
• How can we track all information that is generated in the laboratory?
• Automated annotation of all experimental parameters is essential for reproducible science
• Currently, most experiments are recorded manually in paper lab notebooks
• Data security (intellectual property versus open data)
62. Experimental design
• Biological experiments are very complex
• Statistical significance requires a high number of biological replicates
• Often many different conditions and time points need to be considered
• One study can involve many different experiments (multi-omics studies involve different omics layers, e.g. genomics + transcriptomics + proteomics)
• All experiments come with different metadata requirements
• For various reasons the experimental design is not always balanced (e.g. 5 samples are available for group A and only 3 samples for group B)
Friedrich, A., et al. Biomed Research International, April 2015, in press.
Nahnsen, S., Drug Target, May 2015, in press.
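Whether a design is balanced can be checked mechanically from the sample sheet. The sketch below is a minimal illustration, assuming the design is given as a plain list of group labels; the function names and the sample list are hypothetical and not taken from the cited papers.

```python
from collections import Counter

def group_sizes(sample_groups):
    """Count how many samples were assigned to each experimental group."""
    return Counter(sample_groups)

def is_balanced(sample_groups):
    """A design is balanced when every group has the same number of samples."""
    return len(set(group_sizes(sample_groups).values())) <= 1

# Hypothetical sample sheet mirroring the example above: 5 samples in A, 3 in B.
design = ["A"] * 5 + ["B"] * 3
print(group_sizes(design))
print(is_balanced(design))
```

Running such a check when samples are registered makes an unbalanced design visible before the statistical analysis starts.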
63. Experimental design
Friedrich, A., et al. Biomed Research International, April 2015, in press.
Nahnsen, S., Drug Target, May 2015, in press.
64. Data analysis workflows
• Chain different (heterogeneous) tools
• Parameter handling
• Execution in high-performance computing environments made easy
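The first two points, chaining tools and handling their parameters separately, can be sketched in a few lines. This is only a toy illustration: the three "tools" are hypothetical stand-in functions, not part of any workflow system named in this lecture, and execution on a compute cluster is not covered.

```python
from functools import reduce

# Hypothetical stand-ins for heterogeneous tools; a real workflow would wrap
# command-line programs or web services instead.
def trim(reads, min_length):
    """Drop reads shorter than min_length."""
    return [r for r in reads if len(r) >= min_length]

def to_upper(reads):
    """Normalize reads to upper case."""
    return [r.upper() for r in reads]

def count_gc(reads):
    """Return the number of G/C bases per read."""
    return [sum(base in "GC" for base in r) for r in reads]

def run_workflow(data, steps, params):
    """Run the steps in order, passing each step its own parameters by name."""
    return reduce(
        lambda current, step: step(current, **params.get(step.__name__, {})),
        steps,
        data,
    )

steps = [trim, to_upper, count_gc]
params = {"trim": {"min_length": 4}}  # parameters live apart from the tool code
print(run_workflow(["acgt", "gg", "ttgca"], steps, params))
```

Keeping the parameter dictionary separate from the step list is what allows the same workflow to be rerun with different settings and the settings to be stored alongside the results.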
65. Standardization in bioinformatics
• Many world-wide bioinformatics initiatives need to rely on open standards
• Development of standards has to be a community effort
• Standardized data formats are important to guarantee
  - Sustainability
  - Independence of instrument vendors
  - Independence of analysis software
  - Exchangeability of raw data
• Standard formats increase the amount of data by a factor of x (x = 2-4)
• Many people refrain from using open standards
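The size penalty of open formats is easy to budget for. A minimal sketch, assuming only the factor range of 2-4 stated above; the 50 GB raw file is a made-up example value.

```python
def converted_size_range(raw_size_gb, low_factor=2.0, high_factor=4.0):
    """Return the (min, max) size in GB after conversion to a standard format."""
    return raw_size_gb * low_factor, raw_size_gb * high_factor

# Hypothetical 50 GB vendor-format raw file.
smallest, largest = converted_size_range(50)
print(f"Expect between {smallest:.0f} and {largest:.0f} GB after conversion")
```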
66. Big data
"Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found ..."
Source: http://en.wikipedia.org/wiki/Big_Data, accessed Apr 24, 2014
67. Big data examples
• European Organization for Nuclear Research (CERN), Geneva, Switzerland: 25 petabytes/year at the LHC (Large Hadron Collider) (~6.2 million DVDs)
[Figure: CERN LHC data]
Image source: ep.ph.bham.ac.uk, 2014
68. Big data examples
• Google processes 9.1 exabytes/year (300 million DVDs)
[Figure: Google data]
Mayer-Schönberger, 2013; ititch.com, 2014
69. Biology and Big data?
• Classical: observation of nature and its phenomena
• 1950s: breakthrough in molecular biology
In 1956, Francis Crick formulated the central dogma of molecular biology:
DNA      | carrier of genetic information
RNA      | expression of specific genes
Proteins | carry out the necessary functions in the cell
70. Big data
Vivien Marx, Biology: The big challenges of big data, Nature. 2013, doi:10.1038/498255a
71. Integrated data management in biology/biomedicine
Image: http://media.americanlaboratory.com/m/20/Article/35231-fig1.jpg
73. Data movers
[Figure labels: NGS Lab, Lab, Data Mover, Storage]
• Automatically move large to huge file-based data to a remote (central) storage
• Use an rsync routine; easy configuration using a config file
• Data mover authentication: public/private key SSH authentication
• Move data to openBIS dropboxes (individual boxes and users for each of the five member labs)
DataMover:
• Developed at ETH Zurich as part of openBIS
• http://www.cisd.ethz.ch/software/Data_Mover
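The combination of rsync, a config file and key-based SSH can be illustrated with a small sketch. This is not the DataMover implementation or its configuration format: the section name, keys, paths and host below are hypothetical, and the script only builds the command line rather than running it.

```python
import configparser
import shlex

# Hypothetical config; section and key names are illustrative, not DataMover's own format.
EXAMPLE_CONFIG = """
[mover]
source_dir = /data/ngs/outgoing/
target = mover@storage.example.org:/dropboxes/lab1/
ssh_key = /home/mover/.ssh/id_rsa
"""

def build_rsync_command(config_text):
    """Build an rsync-over-SSH command line from an INI-style config string."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    cfg = parser["mover"]
    ssh = f"ssh -i {shlex.quote(cfg['ssh_key'])}"  # public/private key authentication
    return [
        "rsync",
        "--archive",              # preserve timestamps, permissions and the directory tree
        "--partial",              # keep partially transferred files so transfers can resume
        "--remove-source-files",  # delete each source file once it has been transferred
        "-e", ssh,                # tunnel the transfer through SSH using the given key
        cfg["source_dir"],
        cfg["target"],
    ]

command = build_rsync_command(EXAMPLE_CONFIG)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where rsync and the SSH key are available; one target directory per lab would correspond to the individual dropboxes mentioned above.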
74. openBIS (meta) data store
• Open, distributed system for managing biological information
• Captures different experiment types (OMICS, imaging, screening, ...)
• Tracking, annotating and sharing of experiments, samples and datasets for distributed research
• Different servers for metadata and bulk raw data
• Underlying PostgreSQL database
• ETL routines for extraction of metadata and linking
78. Contact:
Quantitative Biology Center (QBiC)
Auf der Morgenstelle 10
72076 Tübingen · Germany
dmqb-ss15@informatik.uni-tuebingen.de
Thanks for listening. See you next week.