Information Genetic Content (IGC): a comprehensive discovery platform for disease-gene research association
Yun Zhu, Emily Williams, Yuan Tian, Carol Munroe, John Bucci, Yutao Fu, Fiona Hyland, and Corina Shtir, Clinical Next-Gen Seq Division, Thermo Fisher Scientific Inc., 5781 Van Allen Way, Carlsbad, CA 92008, U.S.A.
Table 1. Disease annotation for the 28 identified gene clusters.
ABSTRACT
We developed Information Genetic Content (IGC), a comprehensive
knowledgebase and discovery tool for research use on human genes and
genetic disorders. IGC comprises three components: the Disease-Association
Database (DAD), the Gene Scoring Algorithm (GSA), and the Virtual Panel
Library (VPL). The DAD module contains over 400,000 associations
between over 17,000 genes and 15,000 Mendelian and complex diseases
from both expert-curated and text-mined data. The DAD module also
features a hierarchical organization of human diseases using a UMLS-
controlled vocabulary, permitting queries at any level of the disease
ontology hierarchy. The GSA module aims to prioritize genes for a specific
disease of interest. This gene scoring algorithm is distinctive in the way it
combines the strength of association and the number of associated
diseases to provide an unbiased score for each gene. In conjunction with
the DAD module, the GSA module is able to produce a list of ranked genes
for one or more diseases at any level of the disease hierarchy. The VPL
module generates optimal gene groupings by disease classification using
hierarchical-clustering-based network analysis. Genes that are involved in
the same pathological pathways are grouped into the same cluster.
INTRODUCTION
The identification of disease-associated genes is an important step towards
understanding disease mechanisms and improving future diagnosis and therapy.
However, current scientific knowledge is spread across several overlapping
databases maintained by independent groups, which makes it unclear how to
rank gene-disease research associations. To fill this gap, we developed
Information Genetic Content (IGC), a comprehensive knowledgebase and
discovery tool for research use on human genes and genetic disorders. IGC is
unique in two aspects. First, it integrates data from multiple databases into
one system. Second, it provides an unbiased scoring algorithm to rank
gene-disease research associations at any level of the disease ontology
hierarchy.
METHODS
CONCLUSIONS
We created a comprehensive, efficient, and informative engine, the IGC, to optimize
gene selection for diseases at any level of the disease ontology hierarchy:
• The DAD organizes diseases into an effective hierarchical structure for
lookup, and associates diseases with genes.
• The GSA ranks genes by clinical relevance, and summarizes the scores for
diseases at any level of the hierarchy.
• The VPL efficiently groups genes into pools by disease classifications, and
further ranks the genes within clusters by their relative importance to
diseases.
REFERENCES
1. Piñero J, Queralt-Rosinach N, Bravo A, et al. (2015) DisGeNET: a discovery platform for the dynamical exploration of
human diseases and their genes. Database 2015:bav028.
2. Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic
Acids Res 32(Database issue):D267-D270.
3. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC
Bioinformatics 9:559.
For Research Use Only. Not for use in diagnostic procedures.
© 2016 Thermo Fisher Scientific Inc. All rights reserved. All trademarks are the property of Thermo Fisher Scientific and
its subsidiaries unless otherwise specified.
Thermo Fisher Scientific • 5781 Van Allen Way • Carlsbad, CA 92008 • thermofisher.com
Figure 2. Disease-Association Database (DAD) maps genes to diseases
• DAD contains over 400,000 associations between over 17,000 genes and 15,000 Mendelian
and complex diseases from both expert-curated and text-mined data.
• DAD established gene-disease relationships based on DisGeNET [1], which scores
gene-disease associations according to expert-curated sources (e.g. CTD, ClinVar, and
Orphanet), predicted data using mouse models, and text-mining of publications.
• Figure legend: blue circles are two neurological diseases, schizophrenia and bipolar
disorder; green circles are genes associated with these two diseases.
• The Disease-Association Database (DAD) organizes diseases into an effective hierarchical
structure for lookup, using disease parent-child relationships established in the NIH Unified
Medical Language System (UMLS).
• For any disease in the hierarchical tree, the GSA computes the rank-weighted sum score
(RWSS) to summarize the strength of the gene’s association with all of its child diseases (see
below).
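As a minimal sketch of how a query at any level of the hierarchy could work, the Python below stores parent-child links in a dictionary and gathers the gene associations of a disease together with those of all its descendants. The disease and gene names, the scores, and the helper names (`CHILDREN`, `ASSOCIATIONS`, `descendants`, `genes_for`) are hypothetical illustrations, not the actual IGC schema or UMLS content.

```python
from collections import defaultdict

# Hypothetical parent -> children links (toy data, not real UMLS content).
CHILDREN = {
    "D_top": ["D_mid"],
    "D_mid": ["D_leaf1", "D_leaf2"],
}

# Hypothetical (gene, disease) -> association score.
ASSOCIATIONS = {
    ("GENE_A", "D_leaf1"): 0.6,
    ("GENE_A", "D_leaf2"): 0.4,
    ("GENE_B", "D_leaf1"): 0.3,
    ("GENE_C", "D_mid"): 0.2,
}


def descendants(disease):
    """Return the disease itself plus every disease below it in the tree."""
    found = [disease]
    stack = [disease]
    while stack:
        for child in CHILDREN.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found


def genes_for(disease):
    """Map each gene to its scores across the disease and all descendants."""
    wanted = set(descendants(disease))
    scores = defaultdict(list)
    for (gene, dis), score in ASSOCIATIONS.items():
        if dis in wanted:
            scores[gene].append(score)
    return dict(scores)


if __name__ == "__main__":
    print(genes_for("D_mid"))
```

Querying a leaf returns only that leaf's genes, while querying a parent pools the associations of its whole subtree, which is the input a per-gene summary score would need.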
Figure 3. Gene Scoring Algorithm (GSA)
Figure 5. Gene clustering identified 28 VPLs that can be well defined by
disease classifications.
Disease Key
MeSH Category   Description
C04             Neoplasms
C05             Musculoskeletal Diseases
C06             Digestive System Diseases
C07             Stomatognathic Diseases
C08             Respiratory Tract Diseases
C09             Otorhinolaryngologic Diseases
C10             Nervous System Diseases
C11             Eye Diseases
C12             Male Urogenital Diseases
C13             Female Urogenital Diseases and Pregnancy Complications
C14             Cardiovascular Diseases
C15             Hemic and Lymphatic Diseases
C16             Congenital, Hereditary, and Neonatal Diseases and Abnormalities
C17             Skin and Connective Tissue Diseases
C18             Nutritional and Metabolic Diseases
C19             Endocrine System Diseases
C20             Immune System Diseases
Rank-Weighted Sum Score (RWSS)
RWSS is an unbiased gene scoring method
that accounts for both the strength and number
of gene-disease pairs.
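The poster does not state the RWSS formula, so the sketch below is only one plausible reading, offered as an assumption: a gene's association scores are sorted in descending order and the score at rank r is weighted by 1/r before summing. Under that assumption the strongest associations dominate, while each additional associated disease still contributes a diminishing amount. The function names `rank_weighted_sum` and `rank_genes` and the 1/r weighting are hypothetical.

```python
def rank_weighted_sum(scores):
    """Hypothetical RWSS: sum of scores weighted by 1/rank (rank 1 = strongest).

    Assumption: the source does not specify the weighting; 1/rank is used
    here purely for illustration.
    """
    ordered = sorted(scores, reverse=True)
    return sum(s / rank for rank, s in enumerate(ordered, start=1))


def rank_genes(gene_scores):
    """Rank genes by descending hypothetical RWSS; ties broken by gene name."""
    totals = {g: rank_weighted_sum(s) for g, s in gene_scores.items()}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Toy scores for three made-up genes.
    toy = {"GENE_A": [0.4, 0.6], "GENE_B": [0.3], "GENE_C": [0.2, 0.2, 0.2]}
    for gene, score in rank_genes(toy):
        print(gene, round(score, 3))
```

In the toy data, a gene with three weak associations ends up ahead of a gene with one slightly stronger association, which shows how a score of this shape trades off strength against the number of gene-disease pairs.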
Among the top 5,000 genes ranked as clinically relevant by the GSA, 28 gene clusters were identified
using the WGCNA algorithm [3]. A) Hierarchical clustering of genes according to their association
patterns with 16 high-level MeSH categories relevant to inherited diseases. B) Gene cluster
association scores with the 16 MeSH disease categories are shown with p-values.
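To illustrate the idea of grouping genes by their association patterns across disease categories, here is a simplified stand-in in Python. It is not WGCNA, which is an R package and builds a weighted correlation network before cutting the tree; this sketch only applies plain average-linkage hierarchical clustering to a distance of 1 minus the Pearson correlation. The gene names, the profiles, and the `max_distance` cut-off are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_genes(profiles, max_distance=0.5):
    """Average-linkage agglomerative clustering on 1 - correlation.

    Merging stops once the closest pair of clusters is farther apart
    than max_distance (a hypothetical cut-off).
    """
    genes = sorted(profiles)
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in genes for b in genes if a < b
    }
    clusters = [[g] for g in genes]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [dist[tuple(sorted((a, b)))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Each profile: made-up scores against three disease categories.
    toy = {
        "GENE_A": [0.9, 0.1, 0.0],
        "GENE_B": [0.8, 0.2, 0.1],
        "GENE_C": [0.0, 0.1, 0.9],
        "GENE_D": [0.1, 0.0, 0.8],
    }
    print(cluster_genes(toy))
```

Genes whose scores rise and fall across the same categories end up in the same cluster, which is the intuition behind annotating each cluster with the disease categories it is most associated with.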
RESULTS
Figure 1. Overview of IGC framework
Figure 4. Gene Scoring in multiple disease hierarchies (panels: Level 1 to Level 4)
• The GSA module uses the RWSS method to prioritize genes for a specific disease of interest.
• In conjunction with the DAD module, the GSA module is able to produce a list of ranked
genes for one or more diseases at any level of the disease hierarchy.
Module #  Module Color   Gene Count  Disease Annotation
1         turquoise      530         Nervous System Diseases
2         blue           321         Nutritional and Metabolic Diseases
3         brown          307         Cardiovascular Diseases
4         yellow         280         Digestive System Diseases
5         green          253         Eye Diseases
6         red            250         Skin and Connective Tissue Diseases
7         black          229         Male and Female Urogenital Diseases
8         pink           205         Musculoskeletal Diseases
9         magenta        164         Nervous System Diseases; Nutritional and Metabolic Diseases
10        purple         150         Hemic and Lymphatic Diseases
11        greenyellow    140         Musculoskeletal Diseases; Nervous System Diseases
12        tan            137         Neoplasms
13        salmon         129         Respiratory Tract Diseases
14        cyan           111         Otorhinolaryngologic Diseases; Nervous System Diseases
15        midnightblue   90          Male Urogenital Diseases
16        lightcyan      87          Immune System Diseases; Male Urogenital Diseases; Female Urogenital Diseases and Pregnancy Complications
17        grey60         76          Stomatognathic Diseases
18        lightgreen     69          Hemic and Lymphatic Diseases; Immune System Diseases
19        lightyellow    67          Female Urogenital Diseases and Pregnancy Complications; Endocrine System Diseases
20        royalblue      63          Female Urogenital Diseases and Pregnancy Complications
21        darkred        61          Musculoskeletal Diseases; Skin and Connective Tissue Diseases
22        darkgreen      60          Musculoskeletal Diseases; Stomatognathic Diseases
23        darkgrey       55          Female and Male Urogenital Diseases; Nutritional and Metabolic Diseases
24        darkturquoise  55          Nutritional and Metabolic Diseases; Endocrine System Diseases
25        darkorange     36          Musculoskeletal Diseases; Cardiovascular Diseases
26        orange         36          Immune System Diseases
27        white          35          Endocrine System Diseases
28        skyblue        34          Immune System Diseases; Skin and Connective Tissue Diseases