I've been experimenting with automating both simple and complex data analysis and report generation tasks for biological data, mostly using R and LaTeX. These slides show some of my progress and the challenges I encountered along the way.
2. Bots write the darndest things
http://www.latimes.com/local/lanow/earthquake-27-quake-strikes-near-westwood-california-rdivor,0,3229825.story#axzz2wQwc82EK
• Fill in the template (easy)
• Human-guided automation (e.g. MetaboAnalyst, intermediate)
• Intelligent/reactive writing (e.g. ~AI, advanced)
http://narrativescience.com/
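The easiest level, filling in a template, can be sketched in a few lines. This is only a minimal illustration using Python's standard `string.Template`; the template text, field names, and numbers are hypothetical and not taken from any of the tools mentioned on this slide.

```python
from string import Template

# Hypothetical report template; the placeholder names are illustrative only.
REPORT_TEMPLATE = Template(
    "A total of $n_samples samples and $n_variables variables were analyzed. "
    "$n_significant variables differed between groups (adjusted p < $alpha)."
)

def fill_report(summary):
    """Fill the template from a dict of precomputed analysis results."""
    # substitute() raises KeyError if a required field is missing,
    # which surfaces incomplete inputs instead of printing gaps.
    return REPORT_TEMPLATE.substitute(summary)

# Made-up values, for illustration only
summary = {"n_samples": 24, "n_variables": 310, "n_significant": 42, "alpha": 0.05}
report_text = fill_report(summary)
print(report_text)
```

Anything beyond this level (choosing which sentences to write, or reacting to the results) needs either human guidance or more elaborate logic.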
3. Humans + Bots
Interaction:
• Bots and humans combine in guided analyses
• Humans: make choices (based on bot guides)
• Bots: automate!
Facilitate:
• Workflow logging and template creation
• Reproducible results
Bot: Initial data and metadata parsing and quality validation (need: template input)
Human: Data cleaning and experimental design identification (use: multiple choice, dynamic GUI)
Bot: Instantiation of complex workflows
Human: Overview of bot assumptions and results
Bot: Numerical and text output generation
4. Humans + Bots write darndender things?
Choose Your Own Life Adventure!
https://github.com/dgrapov/AdventureR
5. Data Analysis Tasks
Visualization (how does it look?)
• histograms, density plots, box plots, line plots, scatter plots, networks, etc.
Statistical Analysis (what is statistically significant?)
• summary tables, ANOVA, FDR adjustment, power analysis, etc.
Exploration (what are the major patterns/trends?)
• clustering, PCA, ICA, etc.
Predictive Modeling (what explains my hypothesis?)
• mixed effects, partial least squares (O-/PLS/-DA), etc.
Network Analysis and Mapping (how are things related?)
• Functional analysis: pathway enrichment or overrepresentation
• Networks: biochemical, structural, mass spectral and empirical networks
• Mapping: projection of analysis results onto network
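As one concrete instance of the statistical tasks listed above, FDR adjustment can be written in a few lines. The sketch below implements the Benjamini-Hochberg procedure in plain Python; choosing that particular procedure is an assumption (the slide only says "FDR adjustment"), and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values, for illustration only
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
print(q_values)
```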
6. WCMC Data Analysis Reports ™
Input template: BinBase
• Inference of experimental goals from sample meta data
• Mapping variables to external databases
Tasks:
• Statistical analysis
• Clustering
• PCA
• O-PLS-DA
• Biochemical enrichment
• Network mapping
Report:
Tools:
7. Automation Challenges
Data cleaning and quality validation
• use: quality control samples; identify: precision/accuracy, normalization, batch corrections; mitigate: outliers, missing values, batch effects, etc.
Identification of experimental goals
• use: metadata; identify: main and accessory effects; choose: statistics, multivariate tests and visualizations
Integration of multiple tasks to evolve robust analyses
• tasks: statistics, multivariate, functional, networks, database mapping, etc.
Data analysis report generation
• use: R, LaTeX, markdown
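Identifying experimental goals from sample metadata might start with something like the sketch below. The heuristic is an assumption for illustration, not the method used in the tools described here: it guesses the role of each metadata column purely from how many distinct values it has. The function name, the `max_levels` threshold, and the example samples are all hypothetical.

```python
def classify_metadata(rows, max_levels=6):
    """Guess the role of each metadata column from its number of distinct values.

    Assumed heuristic: constant columns carry no information, columns that
    are unique per sample are identifiers, and columns with a few repeated
    levels are candidate experimental factors.
    """
    roles = {}
    n = len(rows)
    for column in rows[0]:
        levels = {row[column] for row in rows}
        if len(levels) == 1:
            roles[column] = "constant"
        elif len(levels) == n:
            roles[column] = "identifier"
        elif len(levels) <= max_levels:
            roles[column] = "factor"
        else:
            roles[column] = "covariate"
    return roles

# Made-up sample metadata, for illustration only
samples = [
    {"sample": "s1", "treatment": "control", "time": "0h", "organ": "liver"},
    {"sample": "s2", "treatment": "control", "time": "24h", "organ": "liver"},
    {"sample": "s3", "treatment": "drug", "time": "0h", "organ": "liver"},
    {"sample": "s4", "treatment": "drug", "time": "24h", "organ": "liver"},
]
roles = classify_metadata(samples)
print(roles)
```

A guess like this is where a human would step in: confirming which factors are main effects and which are accessory before any statistics are chosen.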
8. Challenges to automated metabolite ID mapping
Stereochemistry?
Search: catechin
Best Match:
Catechin
Biologically relevant:
D-catechin
Synonyms?
Search: UDP GlcNAc
FAIL: UDP GlcNAc
PASS: UDP-GlcNAc
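The synonym failure above comes down to a space versus a hyphen, which a simple normalization step can absorb before lookup. The sketch below is one assumed approach, not the one used by any particular database; `normalize_name`, `find_name`, and the tiny lookup table are hypothetical.

```python
import re

def normalize_name(name):
    """Lowercase a metabolite name and drop spaces, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", "", name.lower())

# Hypothetical lookup table keyed by normalized name
database_names = ["UDP-GlcNAc", "Catechin"]
lookup = {normalize_name(n): n for n in database_names}

def find_name(query):
    """Return the database name matching the query after normalization, or None."""
    return lookup.get(normalize_name(query))

match = find_name("UDP GlcNAc")
print(match)
```

Normalization of this kind does nothing for the stereochemistry problem: a query for "D-catechin" still finds no entry here, and deciding whether "Catechin" is an acceptable match remains a judgment call.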
9. Strategies for automated metabolite ID mapping (from synonym)
#1: CTS+ #2: Web query #3: Curated DB
• Use CTS to translate from synonyms to KEGG (KID) and PubChem (CID)
• Use KEGGREST and PUG to filter and choose most appropriate IDs
• Use fuzzy matching and word similarity metrics (e.g. Damerau–Levenshtein distance)
• Use KEGGREST + PubChem PUG to translate synonyms to IDs
• For KEGG ID: synonym → SID → KID
• Generate a curated DB for KEGG and CID translations +
• Include InChI Keys
• Map to other DBs
• Allow fuzzy matching on synonyms
• e.g. IDEOM http://bioinformatics.oxfordjournals.org/content/early/2012/02/04/bioinformatics.bts069
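Fuzzy matching on synonyms can be illustrated with a small edit-distance function. The sketch below implements the optimal string alignment distance, which is the restricted variant of Damerau–Levenshtein (each substring is edited at most once), not the unrestricted version. The candidate list and the `best_match` helper are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def best_match(query, candidates):
    """Return the candidate with the smallest case-insensitive distance."""
    return min(candidates, key=lambda c: osa_distance(query.lower(), c.lower()))

# Hypothetical synonym list, for illustration only
candidates = ["UDP-GlcNAc", "UDP-glucose", "Catechin"]
closest = best_match("UDP GlcNAc", candidates)
print(closest)
```

In practice a distance threshold would also be needed so that a query with no reasonable match is rejected rather than mapped to the nearest wrong name.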
10. Interactive Analysis and Report Generation
knitr (http://yihui.name/knitr/)
Analysis Report Generation
•Analysis on rails or open sandbox
•Humans facilitate robust results generation + Bots ensure reproducibility
•Generation of Methods and Results should be automatable
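One way to picture automated Methods generation is to render text from a log of the steps that were actually run. The sketch below is a minimal illustration in Python, not how knitr or any tool named here works; the step names, sentence templates, and log format are all hypothetical.

```python
# Hypothetical sentence templates keyed by analysis step name
METHOD_SENTENCES = {
    "normalize": "Data were normalized using {method} normalization.",
    "pca": "Principal components analysis was carried out on {scaling}-scaled data.",
    "fdr": "P-values were adjusted for multiple testing using the {method} procedure.",
}

def write_methods(log):
    """Turn an ordered log of (step, parameters) records into Methods text."""
    sentences = []
    for step, params in log:
        # An unknown step raises KeyError rather than being silently omitted
        sentences.append(METHOD_SENTENCES[step].format(**params))
    return " ".join(sentences)

# Made-up workflow log, for illustration only
workflow_log = [
    ("normalize", {"method": "median"}),
    ("pca", {"scaling": "unit-variance"}),
    ("fdr", {"method": "Benjamini-Hochberg"}),
]
methods_text = write_methods(workflow_log)
print(methods_text)
```

Because the text is derived from the log rather than written by hand, the Methods section cannot drift away from what was actually run.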
11. Devium 2.0
Human-guided automated data analysis and report generator
Human-guided automation could help ensure robust results by making choices which are otherwise difficult to automate.
https://github.com/dgrapov/DeviumWeb
12. MetaMapR
Linking data analysis and biology
https://github.com/dgrapov/MetaMapR
Integration of complex workflows is key to automation.
13. Future Goals
+ Workflows for complex experiments (e.g. time-course)
+ Biochemical functional analysis (pathway enrichment)
+ GUI for report generation (Devium 2.0)
+ Integrate multi-’Omic’ data sets (MetaMapR 2.0)
+ Scientific literature mining (RapportR)
+ Interactive plots and networks (JavaScript)