1. WikiGenomes and Chlambase: Microbial genomics data in Wikidata.
Tim E. Putman1, Sebastian Burgstaller-Muehlbacher1, Andra Waagmeester2, Chunlei Wu1,
Kevin Hybiske3, Benjamin M. Good1, and Andrew I. Su1
1 Department of Integrative, Structural and Computational Biology, the Scripps Research Institute, La Jolla, USA; sulab.org
2 Micelio, Antwerp, Belgium
3 Division of Allergy and Infectious Diseases, Department of Medicine, University of Washington
Motivation
Wikidata provides an extensible open framework ideal for aggregating
distributed data in a centralized database that supports:
• complex querying based on a semantic data model
• domain-specific web applications that let users both read and write data
Here, we describe the use of Wikidata to integrate microbial genomics data
using WikiGenomes and a Chlamydia-specific instance called Chlambase.
A) Semantic microbial data model consisting of a hierarchical taxonomic schema
and separate entities for gene and protein. The nodes are Wikidata ‘items’ and
‘properties’ define the relationships. B) Python-based ‘Bot’ software for gathering
data from different resources and reading and writing directly to Wikidata
(https://github.com/SuLab/WikidataIntegrator).
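The item-and-property model described in this caption can be sketched as a small set of subject-property-value statements. This is only an illustration in plain Python, not the WikidataIntegrator API. The identifiers are the ones printed in the data model figure; which items each property connects is an assumed reading of that figure, and the helper names `Statement`, `objects`, and `lineage` are hypothetical.

```python
from collections import namedtuple

# One Wikidata-style statement: subject item, property, value item.
Statement = namedtuple("Statement", ["subject", "prop", "value"])

# Labels copied from the data model figure.
LABELS = {
    "Q131065": "C. trachomatis (species)",
    "Q20800254": "C. trachomatis 434/BU (strain)",
    "Q21153861": "trpA (gene)",
    "Q21153984": "TRPA (protein)",
    "P171": "parent taxon",
    "P703": "found in taxon",
    "P688": "encodes",
    "P702": "encoded by",
}

# ASSUMPTION: these edges are one plausible reading of the figure,
# not a dump of the live Wikidata items.
STATEMENTS = [
    Statement("Q20800254", "P171", "Q131065"),    # strain -> species
    Statement("Q21153861", "P703", "Q20800254"),  # gene found in strain
    Statement("Q21153861", "P688", "Q21153984"),  # gene encodes protein
    Statement("Q21153984", "P702", "Q21153861"),  # protein encoded by gene
    Statement("Q21153984", "P703", "Q20800254"),  # protein found in strain
]


def objects(subject, prop):
    """Return all values linked to `subject` through `prop`."""
    return [s.value for s in STATEMENTS if s.subject == subject and s.prop == prop]


def lineage(taxon):
    """Follow 'parent taxon' (P171) links upward from `taxon`."""
    chain = [taxon]
    while True:
        parents = objects(chain[-1], "P171")
        # Stop at the top of the hierarchy, or if a cycle would form.
        if not parents or parents[0] in chain:
            return chain
        chain.append(parents[0])


if __name__ == "__main__":
    for qid in lineage("Q20800254"):
        print(qid, LABELS[qid])
```

Keeping genes and proteins as separate items, as the caption describes, means that each can carry its own identifiers and annotations while the 'encodes' and 'encoded by' statements tie them together.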
Data model and implementation
A) Various data sources for microbial genetic data. B) Cumulative sum of bacterial
and eukaryotic genome assemblies submitted to NCBI GenBank by year.
Scope and diversity of microbial data
Modeling microbial interactions
[Figure, panel A: network of data items, each labeled with its source.
• C. trachomatis genome (www.ncbi.nlm.nih.gov/genome/)
• Chlamydia trachomatis genes (www.ncbi.nlm.nih.gov/gene/)
• C. trachomatis trpRBA operon (www.operondb.jp/)
• C. trachomatis tryptophan synthase alpha and beta (www.uniprot.org/)
• C. trachomatis tryptophan synthase (www.rhea-db.org)
• tryptophanase (www.uniprot.org/)
• Human indoleamine 2,3-dioxygenase (www.uniprot.org/)
• indole (www.drugbank.ca/)
Citation shown in the figure: Akers et al. 2006]
A) The interactions between host, pathogen,
microbiome, and small molecules that lead to
pathogen persistence during a chlamydial infection in
humans (originally hypothesized by Caldwell et al.
2003). Blue URLs indicate the source of the data, and edges
are defined by properties in Wikidata. B) SPARQL
query results for organisms that are capable of
producing indole.
B. Organisms that produce indole
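The poster does not reproduce the query behind panel B, so it is not reconstructed here. As a rough sketch of how a SPARQL query over this data model can be composed and sent to the public Wikidata Query Service, the example below lists items that are 'found in taxon' (P703) for a given strain. The function names and the User-Agent string are hypothetical; only the property and item identifiers come from the data model figure.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_found_in_taxon_query(taxon_qid, limit=10):
    """Build a SPARQL query for items 'found in taxon' (P703) `taxon_qid`."""
    # Reject anything that is not a plain Wikidata item ID such as "Q20800254".
    if not (taxon_qid.startswith("Q") and taxon_qid[1:].isdigit()):
        raise ValueError("expected a Wikidata item ID such as 'Q20800254'")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P703 wd:{taxon_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(query):
    """Send `query` to the endpoint and return the matching item URIs.

    Requires network access; the User-Agent value is a placeholder.
    """
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "poster-example/0.1 (illustrative sketch)"}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["item"]["value"] for row in data["results"]["bindings"]]


if __name__ == "__main__":
    # Strain ID taken from the data model figure (C. trachomatis 434/BU).
    print(build_found_in_taxon_query("Q20800254", limit=5))
```

A query for indole-producing organisms would follow the same pattern, chaining additional properties from organism to gene, protein, and small molecule; the exact properties used for panel B are not stated on the poster.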
Acknowledgements
We would like to thank Lynn Schriml and Elvira Mitraka of the University of Maryland, the members
of The Apollo Project and the many members of the Wikidata community for valuable contributions
to this project.
References/Funding
Caldwell et al. 2003 (PMID:12782678)
Putman et al. 2016 (PMID:27022157)
Burgstaller-Muehlbacher et al. 2015 (PMID:26989148)
This work is supported by the National Institutes of
Health under grants GM089820 and GM114833.
Domain Specific Portals into Wikidata
WikiGenomes serves as a centralized and generalizable microbial genomics database
for the Long Tail of sequenced genomes. WikiGenomes engages domain experts by
providing integrated gene reports that are otherwise difficult or tedious to access.
WikiGenomes also provides an easy interface that supports community annotation,
which is then immediately written to Wikidata.
[Figure labels: small-molecule nodes
• L-tryptophan (www.drugbank.ca/)
• N-Formylkynurenine (www.drugbank.ca/)]
[Figure, panels A and B: data model. Items shown:
• Bacteria (Q10876), domain
• C. trachomatis (Q131065), species
• C. trachomatis 434/BU (Q20800254), strain
• trpA (Q21153861), gene
• TRPA (Q21153984), protein
Properties shown:
• parent taxon (P171)
• found in taxon (P703)
• subclass of (P279)
• encodes (P688)
• encoded by (P702)
• Entrez ID (P351)
• UniProt ID (P352)
• RefSeq ID (P637)
• locus tag (P2393)
• gen. start (P644)
• gen. stop (P645)
• molecular function (P680)
• biological process (P681)
• cell component (P682)]
Join the team!
bit.ly/genewikidata; sulab.org