2. Motivation
• Scientists typically need to integrate a spectrum of
information to successfully complete a task.
• On average a scientist or knowledge worker
spends 1 day per week searching for, integrating
and analyzing information, 50% of which is
in unstructured digital formats.
• Access to information structured according to
explicit knowledge representations or taxonomies
is a fundamental concern of all scientists.
• Moving beyond keyword search requires tools that
extend lexical matching to semantic, conceptual
and contextual levels of information. This
entails an infrastructure for indexing text segments
according to domain-specific metadata.
3. In the future ….
• Users will be involved in the design of information systems
• Publishers will charge users for value-added search
(who will build such search systems?)
• Users will search across semantically integrated data
sources and data types (how to facilitate system creation /
adoption?)
• Knowledge driven systems - rapidly built and deployed with
the engagement of domain experts in a knowledge
engineering team
4. Literature-driven, Ontology-centric
Knowledge Integration and Navigation
[Diagram labels: 500 documents, blogs, newsfeeds; Text Mining; Ontology Population; Ontology to browse; Visual Query; Reasoning; 50 sentences to read]
Content delivery using expressive semantics
5. W3C Semantic Web Technologies
• URI / LSID
• Ontologies
• Reasoners
• Query Languages
• Web Services
• Service Registries
• Agents
• Multi-Agent Systems
• Workflow Engines
• GRID / Semantic GRID
• Text Mining
• Service Oriented Architecture
6. Controlled Vocabularies Ontologies
[Diagram: spectrum from controlled vocabularies to ontologies]
• Catalog / ID
• Terms / Glossary
• Thesauri ("narrower term")
• Informal is-a
• Formal is-a
• Formal instance
• Frames (properties)
• Value restrictions
• part-of
• General logical constraints
Purposes:
• Capture knowledge: the meaning of important vocabulary
(classes, properties/relations and instance data in a domain model).
• Make the content in information sources explicit.
• Common domain terminology
• Index and query model to a repository of information.
• Basis for interoperability between information systems.
7. Lipid Ontology
> Implementation: OWL-DL
> DL Expressivity: ALCHIQ
> Uses LIPIDMAPS systematic nomenclature
> 560 Named classes
> 352 Lipid subclasses
> 71 Object properties (inc inv.)
> 4 Datatype properties
> Lipid instance: LIPIDMAPS systematic name
> Lipid Hierarchy Depth: 8 levels
[Figure labels: Graph fragment; DL Axioms; Concept Definitions; Domain Knowledge vs information system metadata]
10. Ontology-centric Knowledge Integration
• Content Delivery Platform - Automated
Document delivery from online databases
Tools for conversion to text-minable text
• Text Mining - Customized and Automated
Regular Expressions, Named Entities, Relations
• Knowledge Engineering - Ontology Creation
Domain Modeling / Customized Rapid Prototyping
• Ontology Population - Automated Instantiation
Sentences as instances / Co-occurrence and
named relations (Rules)
[Diagram labels: Content Acquisition; Domain-specific raw text]
12. Ontology Population Workflow
• Ontology based information retrieval
applies NLP to link documents to
existing ontologies
• Ontology-driven NLP - NLP that
actively uses ontological resources for
NLP tasks
• Ontological NLP - ontologies used as a
knowledge base for NLP tasks while
also exporting the results of NLP
analyses into an ontology that can then
support subsequent semantic queries to the
ontology using description logic
reasoners and A-box reasoning
• Ontology based NLP - the results of
NLP are exported to another ontology,
using external resources for text
processing
Witte et al. 2007
13. Text Mining
• Class Instance Generation from full text
– Named entity recognition (gazetteer based)
– Dictionary-based matching of text tokens to domain-
specific vocabularies (LipidBank, Lipidmaps,
KEGG, IUPAC), curated Swissprot terms and the disease
ontology of CGM
– Normalization and grounding to canonical names
• Relation Detection - Role Assertions:
– Co-occurrence and Rule-based relation detection of binary
pairs from which knowledgebase instances are generated.
Primary set of binary interactions mined from text:
– Lipid-Protein, Lipid-Disease, Protein-Disease
– Domain specific library of curated biological relations.
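The gazetteer matching and grounding steps listed on this slide can be illustrated with a minimal sketch. The tiny dictionary, the function names and the longest-match regular expression are assumptions made for illustration only; the vocabularies named above are far larger and the original system's matching rules are not described here.

```python
import re

# Hypothetical mini-gazetteer: lower-cased surface form -> (entity type, canonical name).
# Illustrative entries only; not taken from LipidBank, LIPIDMAPS, KEGG or Swiss-Prot.
GAZETTEER = {
    "sphingosine 1-phosphate": ("Lipid", "sphingosine 1-phosphate"),
    "s1p": ("Lipid", "sphingosine 1-phosphate"),
    "ceramide": ("Lipid", "ceramide"),
    "akt": ("Protein", "AKT"),
    "protein kinase b": ("Protein", "AKT"),
    "alzheimer disease": ("Disease", "Alzheimer disease"),
    "alzheimer's disease": ("Disease", "Alzheimer disease"),
}


def build_matcher(gazetteer):
    """Compile one case-insensitive pattern; longer terms are tried first."""
    terms = sorted(gazetteer, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def tag_entities(text, gazetteer=GAZETTEER):
    """Return gazetteer hits, each grounded to its canonical name."""
    hits = []
    for match in build_matcher(gazetteer).finditer(text):
        entity_type, canonical = gazetteer[match.group(0).lower()]
        hits.append({
            "surface": match.group(0),
            "type": entity_type,
            "canonical": canonical,
            "span": match.span(),
        })
    return hits


if __name__ == "__main__":
    for hit in tag_entities("S1P binds Akt, whereas ceramide does not."):
        print(hit)
```

Grounding here is just a dictionary lookup from surface form to canonical name, so two synonyms of the same lipid end up as one instance downstream.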
14. Knowledgebase Instantiation
1) Rule-based identification of sentences containing target keywords
2) Instantiation using the JENA API (http://jena.sourceforge.net/).
Target keywords found in sentences are instantiated to the corresponding
ontology class
• Lipid / Protein / Disease instances are instantiated to the respective ontology
classes (as tagged by the gazetteer)
• Binary pairs are instantiated to the respective Object Properties as role assertions
• Sentences are instantiated to the respective Datatype properties.
For each lipid identified in a sentence, the corresponding data
are instantiated to the ontology from Lipid Data Warehouse records,
requiring no further text processing:
• Lipid - LIPIDMAPS Systematic Name and its associated
Lipid - IUPAC Name, Lipid - synonyms, Lipid - Database ID.
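The instantiation steps above can be sketched as plain subject-predicate-object triples. This is not the JENA API (which is a Java library); it is a language-neutral illustration in Python. All property names (`hasSentenceText`, `hasDatabaseID`, ...), the `WAREHOUSE` dictionary and the example values are hypothetical placeholders.

```python
# Hypothetical stand-in for Lipid Data Warehouse records, keyed by canonical lipid name.
# Property names and values are placeholders, not real ontology terms or real data.
WAREHOUSE = {
    "lipidA": {
        "hasLIPIDMAPSSystematicName": "example systematic name",
        "hasIUPACName": "example IUPAC name",
        "hasSynonym": "example synonym",
        "hasDatabaseID": "EXAMPLE0001",
    },
}


def instantiate(sentence_id, sentence_text, entities, pairs, warehouse=WAREHOUSE):
    """Build triples for one sentence.

    entities: iterable of (canonical name, ontology class) as tagged by the gazetteer
    pairs:    iterable of (subject, object property, object) role assertions
    """
    triples = set()
    # Sentence text is attached through a datatype property.
    triples.add((sentence_id, "hasSentenceText", sentence_text))
    for name, ontology_class in entities:
        # Class assertion for each tagged entity.
        triples.add((name, "rdf:type", ontology_class))
        if ontology_class == "Lipid":
            # Copy warehouse fields directly; no further text processing needed.
            for prop, value in warehouse.get(name, {}).items():
                triples.add((name, prop, value))
    for subject, object_property, obj in pairs:
        # Binary pairs become role assertions.
        triples.add((subject, object_property, obj))
    return triples


if __name__ == "__main__":
    result = instantiate(
        "sent1",
        "lipidA binds protein1.",
        [("lipidA", "Lipid"), ("protein1", "Protein")],
        [("lipidA", "interactsWithProtein", "protein1")],
    )
    for triple in sorted(result):
        print(triple)
```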
15. Knowledgebase Instantiation
Rule Based Sentence Processing
<Lipid> AND <Protein> AND LipidProteinInteraction-TriggerWord, e.g. "interact", "bind", "mediate"
<Lipid> AND <Disease> AND LipidDiseaseInteraction-TriggerWord, e.g. "involve", "cause"
[Figure labels: Lipid Class; Protein Instance; Lipid Instance]
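A minimal sketch of the two rules on this slide, assuming the entity types in a sentence have already been tagged. The trigger words are the examples given above; matching them as word prefixes (so that "binds" or "involved" also fire) is an assumption, and irregular forms such as "bound" are not covered.

```python
import re

# Trigger-word stems taken from the slide's examples.
TRIGGERS = {
    "LipidProteinInteraction": ("interact", "bind", "mediate"),
    "LipidDiseaseInteraction": ("involve", "cause"),
}
# Entity types that must co-occur in the sentence for each rule.
REQUIRED = {
    "LipidProteinInteraction": ("Lipid", "Protein"),
    "LipidDiseaseInteraction": ("Lipid", "Disease"),
}


def has_trigger(sentence, stems):
    """True if any stem starts a word in the sentence (assumed prefix matching)."""
    return any(
        re.search(r"\b" + re.escape(stem) + r"\w*", sentence, re.IGNORECASE)
        for stem in stems
    )


def classify(sentence, entity_types):
    """Return the names of all rules that fire for a tagged sentence."""
    return [
        rule
        for rule, needed in REQUIRED.items()
        if set(needed) <= set(entity_types) and has_trigger(sentence, TRIGGERS[rule])
    ]


if __name__ == "__main__":
    print(classify("lipidA binds protein1.", {"Lipid", "Protein"}))
    print(classify("lipidA and protein1 were measured.", {"Lipid", "Protein"}))
```

Sentences that only show co-occurrence without a trigger word are dropped by the rules, which is the filtering step between co-occurrence sentences and interaction sentences.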
16. Knowledge Integration and Query
[Diagram labels: User; input query; Search Engine; Web content or full text papers; NLP tagging; docs tagged with relevant name entities; Ontology instantiation; "Instantiated ontology"; Knowledge; Navigation vehicle; Output for end user]
Papers identified: 262
• 121 papers with no lipid protein relations
• 141 papers contributed to ontology instantiation
• 186 lipid names
• 528 protein names
After normalisation and grounding:
• 92 Lipidmaps systematic names
• 52 IUPAC names, 412 exact synonyms, 6 broad synonyms, 319 protein names
• Cross link to 59 Lipidbank entries
Sentences:
• Co-occurrence before rules: 1356 sentences. After rules: 683 interaction sentences
• 92 Lipidmaps names instantiated to 35 classes (2.6 lipids per class)
Instantiation Time: 22 seconds
Baker CJ, Kanagasabai R, Ang WT, Veeramani A, Low HS, and Wenk MR. Towards ontology-driven navigation of the lipid bibliosphere. BMC Bioinformatics. 2008;9 Suppl 1:S5.
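The figures reported on this slide can be cross-checked with a few lines of arithmetic. The variable names are illustrative; only the numbers come from the slide, and the retained-sentence fraction is a derived value not stated in the original.

```python
# Figures as reported on the slide.
papers_total = 262
papers_without_relations = 121
papers_contributing = 141

cooccurrence_sentences = 1356   # before rules
interaction_sentences = 683     # after rules

lipidmaps_names = 92
lipid_classes = 35

# The two paper counts should add up to the total.
papers_consistent = papers_without_relations + papers_contributing == papers_total

# Fraction of co-occurrence sentences kept by the rules (derived, not on the slide).
retained_fraction = interaction_sentences / cooccurrence_sentences

# Average number of lipid names per instantiated class (slide reports 2.6).
lipids_per_class = lipidmaps_names / lipid_classes

if __name__ == "__main__":
    print(papers_consistent)
    print(round(retained_fraction, 3))
    print(round(lipids_per_class, 1))
```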
19. Complex Query Generation
[Figure labels: Informatician; Domain expert]
Find documents and sentences describing protein-lipid
interactions and the corresponding lipid synonyms.
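The example query on this slide amounts to a join over the instantiated ontology. A minimal sketch over an in-memory list of triples follows; the property names (`interactsWithProtein`, `mentions`, `partOfDocument`, `hasSynonym`) and the sample data are hypothetical, and the original system's query language and visual query tool are not reproduced here.

```python
from collections import defaultdict


def protein_lipid_evidence(triples):
    """Join role assertions with sentences, documents and lipid synonyms."""
    by_pred = defaultdict(list)
    for s, p, o in triples:
        by_pred[p].append((s, o))

    synonyms = defaultdict(set)                     # lipid -> synonyms
    for lipid, synonym in by_pred["hasSynonym"]:
        synonyms[lipid].add(synonym)
    document_of = dict(by_pred["partOfDocument"])   # sentence -> document
    mentions = defaultdict(set)                     # sentence -> entities
    for sentence, entity in by_pred["mentions"]:
        mentions[sentence].add(entity)

    rows = []
    for lipid, protein in by_pred["interactsWithProtein"]:
        for sentence in sorted(mentions):
            # Keep only sentences that mention both members of the pair.
            if {lipid, protein} <= mentions[sentence]:
                rows.append({
                    "lipid": lipid,
                    "protein": protein,
                    "sentence": sentence,
                    "document": document_of.get(sentence),
                    "synonyms": sorted(synonyms[lipid]),
                })
    return rows


if __name__ == "__main__":
    sample = [
        ("lipidA", "interactsWithProtein", "protein1"),
        ("sent1", "mentions", "lipidA"),
        ("sent1", "mentions", "protein1"),
        ("sent2", "mentions", "lipidA"),
        ("sent1", "partOfDocument", "doc1"),
        ("lipidA", "hasSynonym", "lipid-A alias"),
    ]
    for row in protein_lipid_evidence(sample):
        print(row)
```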
20. Pathway Discovery Algorithm
• Finds transitive paths across the graph
between source and target concepts.
• Path length and result size can be defined.
• Finds paths over any object properties, or over
user-defined object properties only, e.g.
protein interacts with protein
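One way to sketch path finding with these controls is a bounded breadth-first enumeration of simple paths. This is an illustrative stand-in, not the presenters' algorithm: the function name, the parameters `max_length`, `max_results` and `allowed`, and the treatment of edges as directed are all assumptions.

```python
from collections import deque


def find_paths(edges, source, target, max_length=3, max_results=10, allowed=None):
    """Enumerate simple paths from source to target, shortest first.

    edges:       iterable of (subject, object property, object), treated as directed
    max_length:  maximum number of edges in a path
    max_results: stop after this many paths
    allowed:     optional set of object properties to follow; None means any
    """
    adjacency = {}
    for s, p, o in edges:
        if allowed is None or p in allowed:
            adjacency.setdefault(s, []).append((p, o))

    results = []
    # A path alternates nodes and properties: [n0, p1, n1, p2, n2, ...]
    queue = deque([(source, [source])])
    while queue and len(results) < max_results:
        node, path = queue.popleft()
        if node == target and len(path) > 1:
            results.append(path)
            continue
        if (len(path) - 1) // 2 >= max_length:
            continue  # path already uses the maximum number of edges
        visited = set(path[::2])  # nodes already on this path
        for prop, nxt in adjacency.get(node, []):
            if nxt not in visited:
                queue.append((nxt, path + [prop, nxt]))
    return results


if __name__ == "__main__":
    graph = [
        ("lipidA", "interactsWith", "protein1"),
        ("protein1", "interactsWith", "protein2"),
        ("protein2", "involvedIn", "diseaseX"),
        ("lipidA", "involvedIn", "diseaseX"),
    ]
    for p in find_paths(graph, "lipidA", "diseaseX"):
        print(" -> ".join(p))
```

Passing `allowed={"interactsWith"}` restricts the search to a single user-defined property, which corresponds to the "protein interacts with protein" case.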
21. Pathway Knowledge Discovery
... across 2 concepts or keywords
[Figure labels: Results with multiple relations; semantic labelling]
Kanagasabai R, Low HS, Ang WT, Wenk MR, Baker CJO. Ontology-centric navigation of pathway information mined from text. Bio-Ontologies SIG: Knowledge in Biology, ISMB, July 2008.
24. 1 search term (instance or concept) generates a list of natural language questions answerable by the ontology and a direct link to answers
Ang WT, Kanagasabai R, Baker CJ. Knowledge Translation: Computing the query potential of bio-ontologies. Genome Informatics Workshop 2008. Submitted …..
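As a rough illustration of the idea, questions could be generated from templates attached to the properties whose domain matches the class of the search term. The schema, templates and query tuples below are hypothetical and are not taken from the cited work.

```python
# Hypothetical schema: property -> (domain class, range class, question template).
SCHEMA = {
    "interactsWithProtein": ("Lipid", "Protein", "Which proteins does {term} interact with?"),
    "involvedInDisease": ("Lipid", "Disease", "Which diseases is {term} involved in?"),
    "hasSynonym": ("Lipid", "Literal", "What are the synonyms of {term}?"),
}


def questions_for(term, term_class, schema=SCHEMA):
    """List questions the ontology could answer about a term, each paired
    with a triple pattern that would retrieve the answers."""
    questions = []
    for prop, (domain, _range, template) in sorted(schema.items()):
        if domain == term_class:
            questions.append({
                "question": template.format(term=term),
                "query": (term, prop, "?answer"),
            })
    return questions


if __name__ == "__main__":
    for item in questions_for("lipidA", "Lipid"):
        print(item["question"], item["query"])
```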
29. Acknowledgements
Semantic Technology Group
Christopher J. O. Baker
Kanagasabai Rajaraman
Menaka Rajapakse
Anitha Veeramani
Ang Wee Tiong
Alexander Garcia (Alumnus)
Collaborators
Markus R Wenk, NUS
Low Hong-Sang, NUS
Choo Kar Heng, I2R
Shoba Ranganathan, NUS
Suisheng Tan, I2R