Natural Language Processing for biomedical text mining - Thierry Hamon

Natural Language Processing
for biomedical text mining
Thierry Hamon
LIMSI, CNRS, Université Paris-Saclay, Orsay, France
Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France
hamon@limsi.fr
14/06/2017
1/75 Grammarly Meet-up T Hamon

Context
Most of the data are unstructured
about 90% of the data produced in 2011 (1.8 trillion
gigabytes) [Oracle, 2011]
85% of data produced in companies
Unstructured data: textual data
Important source of information
Accessing and reading are costly, time-consuming and
sometimes impossible
Need for methods for
information retrieval and information extraction

Context
In the biomedical domain, constant increase in the amount of
Scientific medical literature
Scientific papers in digital libraries or portals
Medical, pharmacological, epidemiological reports
Electronic Health Records in hospitals
Discharge summaries
Radiological reports
Patient-related textual data
documents explaining diseases to patients, health
behaviors
social media (online discussion forums, twitter messages)

Context
Example: Scientific article publications
Medline (U.S. National Library of Medicine bibliographic
database) - https://www.ncbi.nlm.nih.gov/pubmed/
Evolution of the number of references to articles in life
sciences
Citations Added to MEDLINE® per Year
Currently: More than 27 million references

What is text mining?
Objective: Extraction of useful and non-trivial knowledge from
texts
Extraction of information
useful for a given application
from textual data, i.e. written in natural language
Collecting and linking this information
Feed databases or knowledge bases with information
extracted from texts
Indirectly: allow data mining on unstructured/textual data

Data mining vs. Text mining
Data mining
Methods and algorithms to explore structured data, taken
from databases, data warehouses or knowledge bases
Objectives: Highlight rules, identify trends or behaviours
which are invisible to humans
Text mining
Methods and algorithms to explore unstructured data, i.e.
texts written in Natural Language
Objectives: Extraction and categorisation of information
available in the texts

What are text mining applications?
EHR:
Search and find relevant information, Hospital information
system
Provide synthetic views of patient-related information
EHR / Scientific literature:
Information storage in databases for statistics,
epidemiologic survey, Information system in hospital, etc.
Formalize information or knowledge
Social media:
Epidemiologic analysis, Therapeutical Patient Education,
Potential adverse drug effect identification

What information to identify?
Semantic entities: terms with semantic types
Semantic relations between entities
Temporal information related to events
Numerical information
Modifiers for identifying polarity, modality,
presence/absence, uncertainty

Needs for analysis of biomedical texts
Various resources:
Terminologies, Ontologies, Open Linked Data
Lexica, Consumer Health Vocabularies
Semantic description of entities
NLP approaches and methods:
Rule-based approaches (more or less sophisticated regular
expressions)
Machine Learning approaches (supervised,
semi-supervised, unsupervised)
Evaluation against independent reference data

Difficulties
Textual data may be noisy, sparse, multilingual
Text processing is time-consuming, may require contextual
information
Terminological and semantic variation, semantic ambiguity,
unknown or new words and terms, etc.
→ High and unpredictable number of dimensions
Complex and embedded semantic relations

Difficulties
Ambiguities of the natural language at each level:
lexicon:
spell[N] vs. spell[V], Apple[company] vs. apple[fruit]
гори[V] (a form of burn) vs. гори (inflectional form of
mountain)
syntax:
the doctor examines the patient with a stethoscope
Joe experienced severe shortness of breath and chest pain
at home while having sex, which became more unpleasant
at the emergency room.

Difficulties
Ambiguities of the natural language at each level:
semantics:
a red pencil, He reached the bank.
поділися (a form of disappear) vs. поділися (a form or lemma of
share)
pragmatics:
The chicken is ready to eat.
Margaret invited Susan for a visit, and she gave her a good
lunch.
a very pleasant patient

Difficulties
Variation in semantically similar wording:
Bayer is buying Monsanto
Bayer clinches Monsanto
Bayer and Monsanto [...] will merge
Bayer's announced acquisition of Monsanto
Monsanto-Bayer merger
Metonymy: the latest Apple/Samsung
Metaphor: Web giants, "or noir" (black gold, in French)
Spelling errors: Appel (call, in French) / Apple
Mix of Latin and Ukrainian characters (different Unicode
code points): i vs. і, o vs. о, p vs. р, y vs. у...
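
The last difficulty can be checked programmatically. The sketch below is only an illustration, not part of the work presented: it approximates the script of a letter by the first word of its Unicode character name and flags words that mix several scripts. The function names `scripts_in` and `is_mixed_script` are hypothetical.

```python
import unicodedata

def scripts_in(word):
    """Return the scripts used by the letters of `word`.

    The script is approximated by the first word of each character's
    Unicode name, e.g. "LATIN" or "CYRILLIC".
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(word):
    """True when a single word mixes letters from several scripts."""
    return len(scripts_in(word)) > 1

latin_i = "i"          # U+0069
cyrillic_i = "\u0456"  # U+0456, the Ukrainian letter that looks like "i"
print(latin_i == cyrillic_i)             # False: different code points
print(is_mixed_script("d\u0456abetes"))  # True: Cyrillic letter inside a Latin word
```

Such a check could be run before dictionary lookup, since a visually identical word with one foreign letter will not match any lexicon entry.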

Three experiments in biomedical text mining
1 Recognition of Medication, assertion, temporal information
in EHR
[Hamon and Grabar10, Périnet et al.11, Grouin et al.13, Zweigenbaum et al.13,
Hamon and Grabar14]
Work with Natalia Grabar (CNRS STL - Lille 3), Amandine Périnet
(LIM&BIO - Paris 13), Cyril Grouin, Sophie Rosset, Xavier Tannier,
Pierre Zweigenbaum (LIMSI, CNRS)
2 Mining literature for identifying risk factors
[Hamon et al.10]
Work with Martin Graña, Víctor Raggio and Hugo Naya (Institut Pasteur
de Montevideo), and Natalia Grabar (CNRS STL - Lille 3)
3 Cross-Lingual Transfer Methods for Terminology
Acquisition
[Hamon and Grabar16]
Work with Natalia Grabar (CNRS STL - Lille 3)

Mining Patients' Electronic Health Records
[Hamon and Grabar10, Périnet et al.11, Grouin et al.13, Zweigenbaum et al.13,
Hamon and Grabar14]
Description of the hospitalization
A lot of (personal) information about patients
Problems
Therapies (treatments, drugs, etc.)
Tests and analysis (lab data, etc.)
Assertions regarding facts (certainty, hypothesis, etc.)
Temporal information (useful for the clinical timeline)
The best way to record information (databases are difficult
to maintain)
BUT the texts are written by practitioners:
in a hurry, with mistakes, with little or incorrect syntactic
structures, etc.

Objectives
Identification of
Medication names given to patients
Related information (dosage, duration, frequency, mode of
administration, reason for prescription)
Assertion: certainty and uncertainty of information in
medical texts
focus on the relation {patient / medical problem}
Temporal expressions: date, time and duration of medical
events
Participation in several I2B2 Challenges

Drug-related information
[Figure: graph of drug-related information.
Node labels: methylprednisolone, solumedrol, cortisone, steroidal anti-inflammatory, digitaline, insulin; salt, sodium, lactosis, anhydrous disodium phosphate, anhydrous monosodium phosphate; acne, osteoporosis, swelling of face, arterial hypertension, ulcer of stomach, depression; allergic shock, Quincke oedema, suffocation by larynx oedema, brain oedema.
Relation labels: composition, is a, prescribed for, adverse effects, INN, DDI, FDI.
Prescription features: dosage, mode, frequency, reason, duration.]

Assertion task
[Figure: assertion categories (degree of certainty), with one example each]
Positive certainty: The patient suffers from abdominal pain
Negative certainty: The patient denies suffering from abdominal pain
Possibility: It was thought that the patient might suffer from abdominal pain
Condition: With shrimps, the patient suffers from abdominal pain
Hypothesis: The patient is to call the hospital if he suffers from abdominal pain
Example
Medication name, associated information, assertions and time expressions
The patient is currently off diuretics at this time. Daily
weights should be checked and if her weight increases by
more than 3 pounds Dr. Bockoven should be notified. The
patient was also started on calcitriol given elevation of
parathyroid hormone. Cardiovascular: Rate and rhythm:
The patient has a history of atrial fibrillation with a slow
ventricular response. Two weeks ago, the patient was
started on metoprolol 12.5 mg p.o. q.6 h. for rate control ,
however , this dose was decreased to 12.5 mg p.o. twice a
day, given some bradycardia on her telemetry. The patient
was also started on Flecainide 75 mg p.o. q.12 h. She will
continue on these two medications upon discharge.

Example
Medication name, associated information, assertions and time expressions
RRR , lots of BS's , neuro nonfocal , ext with 1+ edema. On
atenolol , zestril , norvasc , premarin , detrol , lasix 60 qd ,
nebs prn at home. Labs sig for Cr 0.7 , CK 48 , TnI .05 ,
QBC 9.5 , Hct 41.3. From CV point of view , thought to be
CHF exac. ROMI'd without events on monitor and diuresed
2L/day. IV Lasix 80 bid to start transitioned to 60 po bid.
BNP>assay. 6/17 dobut MIBI with mod sized ant septal wall
defect c/w diagonal lesion , 3/22 Echo with EF 55-60% ,
mild LAE/RAE , no WMA , mod large RV. No further CV
studies. Cont previously meds on d/c. From FEN point of
view , 2 L fluid restriction , 2 g Na restriction. Nutrition
consult , but pt very resistant to diet changes. From GI point
of view , GERD; nexium started. From pulm point of view ,
CXR c/w sl fluid overload , no focal findings , no pulm
edema. Given NC O2 and BiPAP at night.

Material
Documents
Discharge summaries: 1,249 documents (provided by the
I2B2 challenges)
2009: 649 docs in the training set, 553 docs in the test set,
17 manually annotated documents (for illustrating the
annotation guidelines)
2010: 349 annotated documents + 827 raw documents in
the training set, 477 in the test set
Assertions: 11,968 in the training set, 18,550 in the test set
2012: 190 docs in the training set, 120 docs in the test

Material
Terminologies and lexica
Medication names: RxNorm (243,869 entries) and
Therapeutic classes and groups of medication from the
FDA website
Ambiguous medication (red blood cells, magnesium, iron):
specific status during the annotation process
Medical problems: 45,898 terms (Diagnosis and
Morphology axes of the Snomed International), 476 terms
from the training set documents
Medication-related information
Regular expressions for frequency, dosage, duration and
mode of administration
52 identification rules for reasons: characterization of
Snomed Int terms and/or extracted terms as reasons
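
As an illustration of the regular-expression approach for medication-related information, here is a minimal sketch. The patterns are hypothetical simplifications written for this example; they are not the expressions used in the system, which are not given in the slides.

```python
import re

# Hypothetical, deliberately small patterns for illustration only.
PATTERNS = {
    "dosage": re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|ml|g)\b", re.I),
    "mode": re.compile(r"\b(?:p\.o\.|po|i\.v\.|iv)(?=[\s,]|$)", re.I),
    "frequency": re.compile(
        r"\b(?:q\.?\s?\d+\s?h\.?|b\.?i\.?d\.?|q\.?d\.?|prn|twice a day)(?=[\s,]|$)",
        re.I,
    ),
    "duration": re.compile(r"\bfor\s+\d+\s+(?:day|week|month)s?\b", re.I),
}

def extract_medication_info(text):
    """Return, for each information type, the list of matching strings."""
    return {
        name: [m.group(0) for m in pattern.finditer(text)]
        for name, pattern in PATTERNS.items()
    }

sentence = "the patient was started on metoprolol 12.5 mg p.o. q.6 h. for rate control"
print(extract_medication_info(sentence))
```

On this sentence the sketch finds the dosage "12.5 mg", the mode "p.o." and the frequency "q.6 h.", and no duration. Real discharge summaries need far more variants (for instance "lasix 60 qd" has no unit).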

Material
Terminologies and lexica
Assertions:
Negation: 284 markers from the NegEx resource
[Chapman et al.01]
Lexical clues: on exertion (condition)
Morphological clues: afebrile (negative certainty)
Contextual information (342 markers)
Clues in the sentence, Section headings
... could represent a multifocal pneumonic process
(possible)
ALLERGIES, SOCIAL HISTORY, lists
Lexico-syntactic patterns (137 patterns)
be to (address | request | notify) DT (office | clinic | hospital) if PB (Hypothesis)
TE to (evaluate | check | eval | consult) (from | if | with | against) PB (Possibility)
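
A marker-based assertion classifier can be sketched as follows. The marker lists are tiny illustrative samples chosen from the examples in these slides; they stand in for the NegEx markers, contextual markers and lexico-syntactic patterns mentioned above, whose actual content is not reproduced here. The category order in `MARKERS` is an assumption of this sketch.

```python
import re

# Illustrative markers only; checked in order, first match wins.
MARKERS = [
    ("hypothesis", r"\bif\b"),
    ("condition", r"\bon exertion\b"),
    ("possibility", r"\bmight\b|\bcould represent\b"),
    ("negative certainty", r"\bden(?:y|ies|ied)\b|\bafebrile\b"),
]

def classify_assertion(sentence):
    """Return the assertion category suggested by the first matching marker."""
    lowered = sentence.lower()
    for label, pattern in MARKERS:
        if re.search(pattern, lowered):
            return label
    # No marker found: the medical problem is taken as asserted.
    return "positive certainty"

print(classify_assertion("The patient denies suffering from abdominal pain"))
# negative certainty
```

A real system would also restrict each marker to a window around the medical problem and use section headings such as ALLERGIES or SOCIAL HISTORY.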

Document processing
Annotation of the documents
Use of terminological and linguistic resources and
selection and disambiguation rules
CRF-based models [Grouin et al., Minard et al.11]
Tuning of the HeidelTime system
[Strotgen and Gertz12, Hamon and Grabar14]
Design of post-processing modules for
Disambiguation and negative contexts of medication names
Computing of dependency relations between patient,
medication names and related information, or assertion
Improving the CRF-based system with extracted terms
[Aubin and Hamon06]

Enriching documents with linguistic information
[Figure: document enrichment pipeline.
Input: XML document with structural annotations.
Processing steps shown: tokenisation, named entity tagging, word and sentence segmentation, part-of-speech tagging, lemmatisation, tagging of the terms, extraction of terms, semantic tagging.
Resources shown: dictionary of named entities, specialised lexicon, terminology, ontology.
Output: XML document with linguistic and structural annotations.]
Symbolic approach: use of NLP methods
Terminological resources and
disambiguation rules
Concurrent annotations and annotation
selection
Design of post-processing modules for
Annotation disambiguation
Establishment of dependency
relations between patient,
medication names and related
information, or assertion
Annotation based on the Ogmios NLP platform
(developed during the EU Project Alvis)

Enriching documents with linguistic information
Identification of the sentences, then of the words
The patient has a history of atrial fibrillation with a slow ventricular response .
Two weeks ago , the patient was started on metoprolol 12.5 mg p.o.
q.6 h. for rate control ...

Identification of the sentences, words, lemma and
part-of-speech
The_DT patient_NN has_VBZ a_DT history_NN of_IN atrial_JJ fibrillation_NN with_IN a_DT slow_JJ ventricular_JJ response_NN .
Two_CD weeks_NNS ago_RB , the_DT patient_NN was_VBD started_VBN on_IN metoprolol_FW 12.5_CD mg_NN p.o._SYM q.6_FW h._NP for_IN rate_NN control_NN ...

Identification of the sentences, words, lemma and
part-of-speech, named entities
Named entity types shown over the text: [TIMEX3] [DOSAGE] [MODADM] [FREQ]
The_DT patient_NN has_VBZ a_DT history_NN of_IN atrial_JJ fibrillation_NN with_IN a_DT slow_JJ ventricular_JJ response_NN .
Two_CD weeks_NNS ago_RB , the_DT patient_NN was_VBD started_VBN on_IN metoprolol_FW 12.5_CD mg_NN p.o._SYM q.6_FW h._NP for_IN rate_NN control_NN ...

Identification of the sentences, words, lemma and
part-of-speech, named entities and terms with semantic
types
Named entity and semantic types shown over the text: [TIMEX3] [DOSAGE] [MODADM] [FREQ] [DRUG] [DISORDER] [DISORDER] [DISORDER]
The_DT patient_NN has_VBZ a_DT history_NN of_IN atrial_JJ fibrillation_NN with_IN a_DT slow_JJ ventricular_JJ response_NN .
Two_CD weeks_NNS ago_RB , the_DT patient_NN was_VBD started_VBN on_IN metoprolol_FW 12.5_CD mg_NN p.o._SYM q.6_FW h._NP for_IN rate_NN control_NN ...

Concurrent annotation of documents
Preparing material for document annotation
Named Entity Recognition (frequency, duration, dosage,
mode of administration)
+ internal disambiguation (avoid nested annotations of
different types and merge annotations of the same type)
Term and semantic tagging (medication and reasons,
negation and reason marker, assertion)
based on linguistic information (word and sentence
segmentation, lemmatization)
+ internal disambiguation (nested terms, parenthesized
medication names, etc.)

Time expression identification
Tuning of the HeidelTime system [Strotgen and Gertz12] for
English and French EHR
Enrichment and encoding of linguistic temporal
expressions specific to medical and clinical domain:
post-operative day #, b.i.d. meaning twice a day, day of life, etc.
Admission date as the reference or starting point for
computing relative dates and their normalised value
if the admission date is 14 June 2017, the normalised value of
2 days later is 16 June 2017.
Additional normalizations of the temporal expressions:
normalization of the durations to approximate numerical values to
avoid undefined values
external computation for some durations and frequencies due to
limitations in HeidelTime's internal arithmetic processor
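
The use of the admission date as reference point can be sketched as below. This is a standalone illustration and does not use or reproduce HeidelTime; the function `normalise_relative` and the small pattern it covers are assumptions of this example.

```python
import re
from datetime import date, timedelta

UNIT_DAYS = {"day": 1, "week": 7}

def normalise_relative(expression, admission):
    """Resolve expressions such as "2 days later" against the admission date.

    Returns an ISO date string, or None when the expression is not covered
    by this small pattern.
    """
    match = re.fullmatch(
        r"(\d+)\s+(day|week)s?\s+(later|earlier)", expression.strip().lower()
    )
    if match is None:
        return None
    amount, unit, direction = match.groups()
    offset = timedelta(days=int(amount) * UNIT_DAYS[unit])
    if direction == "earlier":
        offset = -offset
    return (admission + offset).isoformat()

print(normalise_relative("2 days later", date(2017, 6, 14)))  # 2017-06-16
```

Domain-specific expressions such as "post-operative day #" would need their own reference point (the operation date) rather than the admission date.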

Annotation selection
Processing of ambiguous medication names: laboratory
data or medication
If in a list section: status changed to medication
HOME MEDS: methadone 20 bid, imdur 120 bid, hydral taking 25
bid, lasix 20 bid, coumadin, colace, iron, nexium 40 bid
Rejection of medication names: if in allergy sections
ALLERGY: prednisone, penicillins, tamsulosin, simvastatin
Removal of drug names in negative contexts
Guessing new drug names with semantic patterns
m do mo? f [Hamon et al.13]
1 Noun phrases recognized by the term extractor YATEA
2 Stopwords rejected
3 Filtering with typical suffixes of the medication names
Diovan 160mg PO BID, HCTZ 25mg PO QD, Imdur ER 60mg PO
QD, NTG .4mg PRN CP, Norvasc 10mg PO QD, Pavachol 80mg
PO QD.

Results
Medication task
Focus on various parameters for reason identification and
guessing medication names
RUN2 RUN1 RUN3
System 0.7801 0.7681 (-0.0120) 0.7719 (-0.0082)
m 0.8142 0.8093 (-0.0049) 0.808 (-0.0062)
do 0.8234 0.8172 (-0.0062) 0.821 (-0.0024)
f 0.837 0.8304 (-0.0066) 0.8345 (-0.0025)
mo 0.8655 0.8577 (-0.0078) 0.8624 (-0.0031)
du 0.3575 0.3516 (-0.0059) 0.3505 (-0.0070)
r 0.2867 0.2759 (-0.0108) 0.2666 (-0.0201)
RUN1: All reasons
RUN2: All reasons without semantic tagging and reason markers
RUN3: All reasons without semantic tagging and use of reason markers,
guessing medication names

Results
Medication task
exact inexact
F P R F P R
System 0.7801 0.7997 0.7614 0.7792 0.8111 0.7497
m 0.8142 0.8448 0.7858 0.8304 0.8666 0.7971
do 0.8234 0.8728 0.7793 0.8503 0.8799 0.8226
f 0.837 0.8306 0.8435 0.8411 0.8436 0.8386
mo 0.8655 0.8543 0.877 0.863 0.844 0.8828
du 0.3575 0.3483 0.3673 0.3607 0.3669 0.3546
r 0.2867 0.3047 0.2708 0.3386 0.4386 0.2757
Reason: difficult to identify the exact noun phrases (-13%
between inexact and exact precision)
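
The F, P and R columns are related by the usual F-measure, the harmonic mean of precision and recall. A minimal sketch (the function name and the `beta` parameter are conventions of this example, with `beta=1` assumed for the tables above):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# System line, exact match: P = 0.7997, R = 0.7614
print(round(f_measure(0.7997, 0.7614), 4))  # 0.7801
```

The same computation applied to the inexact reason line (P = 0.4386, R = 0.2757) gives about 0.3386, which agrees with the table.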

Results
Assertion task and time expression identification
List of markers + section headings
Categories Training Test
P R F P R F
Associated to somebody else 0.96 0.80 0.88 0.84 0.74 0.79
Hypothesis 0.71 0.31 0.43 0.63 0.24 0.35
Condition 0.08 0.40 0.14 0.08 0.33 0.12
Possibility 0.46 0.57 0.51 0.51 0.47 0.49
Absent 0.92 0.75 0.82 0.87 0.75 0.81
Present 0.86 0.90 0.88 0.84 0.87 0.86
Assertions 0.82 0.82 0.82 0.80 0.80 0.80
Precision Recall F-measure
Temporal expressions 0.8611 0.8170 0.8385

Conclusion
F-measure of the system: 0.800 (avg)
Analysis of the resource contribution:
Importance of the markers
Need to include syntactic structures
Difficulty in identifying certainty degrees
few examples for condition and hypothesis

Further improvements
Medication tasks:
Duration extraction: identification of specific prepositional
phrases based on parsing
Medical problem identification: development of a specific
reasoning module
Assertion task:
Enrich resources with synonyms (WordNet)
Improving the patterns:
using syntactic dependencies
integrating semantic classes
(verbs of evidence, verbs to get in touch with somebody,
etc.)

Mining literature to identify
relations between risk factors and their pathologies
[Hamon et al.10]
Objective: Massive exploitation of Medline bibliographical
database for extracting risk factors and their associations
with health conditions
Risk factors: factors that increase a person's chance of developing a given
disease
Information on risk factors is widely spread over the web:
websites, bibliographical databases, ...
Previous works:
Genomic scientific literature (BioCreative, TREC
Genomics), clinical records (I2B2 NLP Challenge 2014),
processing of narratives [Blake04]
Data mining (KDD challenge 2004)
[Ahmad and Bath05, Cerrito04, Kolyshkina and van rooyen06]

Material
Bibliographical database Medline (titles, abstracts)
Selection of potential citations/PMIDs, i.e. containing the
sequences "risk factors" or "factor of risk"
187,544 citations selected: over 42 million word
occurrences
MeSH (thesaurus for information storage and retrieval)
Disease-related MeSH term recognition in citations

Document processing
1 Annotation of Medline citations with linguistic information
Ogmios NLP platform [Hamon et al.07]
Segmentation, POS-tagging & lemmatization -- Genia
Tagger [Tsuruoka et al.05]
Term recognition but also term extraction -- YATEA
[Aubin and Hamon06]
2 Risk factors identification

Term recognition vs. Term extraction
Term recognition: Tagging of texts with terms taken from a
terminology
Use of more or less complex methods (string matching,
terminological variant computing, semantic distances,
ML methods...)
Term extraction: Discovery of terms in texts
Identification of noun phrases which are potential terms
(term candidates)
Computing of
the strength of the term components (unithood)
the strength of the relation to the domain (termhood)
[Kageura and Umino96]

Term extraction with YATEA
Yet Another Term ExtrActor
(Aubin and Hamon, 2006)
Term extraction from French and English texts
Shallow parsing of texts
Parsing focusing on the parts of the sentence which may
contain terms (usually the noun phrases)
With
recursively applied minimal parsing patterns
endogenous learning
Term candidate decomposition in Head and Modifier
components (component syntactic role in the noun phrase)
Each component of a term candidate is also considered as
a term candidate
Unparseable noun phrases are rejected

YATEA
Yet Another Term ExtrActor
(Aubin and Hamon, 2006)
Several statistical measures are associated with each term
candidate (Number of occurrences, C-Value1, C-Value*,
etc.) [Hamon et al.14]
CPAN module: http://search.cpan.org/~thhamon/Lingua-YaTeA/
Developed during the European project ALVIS
Description of the shallow parsing with configuration files
Possibility of tuning for a domain (BioYATEA) [Golik et al.13]
For other languages: on-going work for Ukrainian and
Arabic

Texts → lemmatisation + POS tagging
22_CD yo_JJ male_NN ,_, h_NN /_SYM o_NN primitive_JJ
neuroectodermal_JJ tumor_NN with_IN mets_NNS to_TO brain_NN
and_CC spine_NN ,_, transferred_VBN from_IN Hospital1_NNP ,_,
initially_RB in_IN Dept1_NNP and_CC then_RB transferred_VBN to_TO
the_DT floor_NN ._. He_PRP was_VBD initially_RB diagnosed_VBN with_IN
a_DT thoracic_JJ gangliogliom_NN /_/ resected_VBN in_IN 2012_CD ._.
He_PRP had_VBD back_JJ pain_NN in_in 2_CD /_SYM 04_CD ,_, seen_VBN at_IN
Dept2_NNP ,_, and_CC was_be found_VBN to_TO have_VB mets_NNS to_TO
brain_NN and_CC spine_NN ._.

Texts → lemmatisation + POS tagging → term extraction
Term extraction: rule-based approaches
Identification of chunks using morpho-syntactic
information (boundaries: verbs, adverbs, etc.)

Parsing of the noun phrases to detect term candidates
1. Identification of term candidates described by parsing
patterns
Parsing pattern: JJ<M> NN<H>
(<H>: head of the noun phrase, <M>: modifier of the head)
neuroectodermal tumor → (neuroectodermal<M> tumor<H>)
shortness of breath → (shortness<H> of breath<M>)
2. Use of the previously parsed term candidates (islands of
reliability) to parse the remaining noun phrases
Example: primitive neuroectodermal tumor
Use of the already parsed term: (neuroectodermal<M> tumor<H>)
Temporary simplification (folding): primitive_JJ tumor_NN
Use of the parsing pattern JJ<M> NN<H> → (primitive<M> tumor<H>)
Unfolding: (primitive<M> (neuroectodermal<M> tumor<H>)<H>)

Texts → lemmatisation + POS tagging → term extraction → candidate terms
Candidate terms: yo male; h; o; primitive neuroectodermal tumor; mets; brain; spine; thoracic gangliogliom; back pain; mets; brain; spine; floor; ...

Texts → lemmatisation + POS tagging → term extraction → candidate terms → term ranking (frequency, term length, C-Value) → ranked term candidates
Term                              f  l  Cv1
yo male                           1  1  1.58
h                                 1  1  1
o                                 1  1  0
mets                              2  1  2
brain                             2  1  2
primitive neuroectodermal tumor   1  3  2.32
spine                             2  1  2
floor                             1  1  1
thoracic gangliogliom             1  2  1.58
back pain                         1  2  1.58
...
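
To give an idea of how a C-Value style ranking works, here is a minimal sketch of the commonly described C-value with a log2(length + 1) weight so that single-word terms can be scored. This variant is an assumption of the example: it is not claimed to be YATEA's C-Value1, and it is not meant to reproduce the figures in the table above. The frequencies used are toy values.

```python
import math

def is_nested(short, long):
    """True when the word tuple `short` occurs inside the longer tuple `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def c_values(freq):
    """freq maps a term (tuple of words) to its number of occurrences."""
    scores = {}
    for term, f in freq.items():
        containers = [t for t in freq if is_nested(term, t)]
        if containers:
            # Discount occurrences explained by longer candidate terms.
            f = f - sum(freq[t] for t in containers) / len(containers)
        scores[term] = math.log2(len(term) + 1) * f
    return scores

toy = {("back", "pain"): 1, ("tumor",): 3, ("neuroectodermal", "tumor"): 1}
for term, score in c_values(toy).items():
    print(" ".join(term), round(score, 2))
```

The discount means that a word appearing mostly inside longer terms is ranked lower than one that also occurs on its own.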

Risk factor identification
Semantico-syntactic patterns
5 patterns for risk factors and pathologies
12 patterns for handling enumerations
3 patterns for pathologies
<NP-RF> as a risk factor for <NP-P>
where
as a risk factor for: trigger sequence
<NP-RF>: noun phrases corresponding to risk factors
<NP-P>: pathologies
? and *: optional and recurrent elements
MeSH descriptors of citations
Descriptors belonging to the C (Diseases) category of MeSH

Risk factor identification
Examples
Pattern: <NP-RF-list> is a risk factor for <NP-P>
...a high intake of calcium and phosphorus is a risk
factor for the development of metabolic acidosis .
(PMID 1435825)
Pattern: risk factors for <NP-P>,? include <NP-RF-list>
...had more than one of the common risk factors for
cerebrovascular accidents , including hypertension ,
advanced age , hyperfibrinogenemia ,
diabetes mellitus , and
past history of cerebrovascular accident. (PMID 1560589)
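
The two example patterns can be approximated with plain regular expressions, as in the sketch below. This is a simplification made for illustration: the noun phrases <NP-RF> and <NP-P> are approximated by lazy wildcards instead of the noun phrases produced by the term extractor, and the names `DIRECT`, `ENUMERATION` and `extract_pairs` are hypothetical.

```python
import re

# <NP-RF> is/are/as (a) risk factor(s) for <NP-P>
DIRECT = re.compile(
    r"(?P<rf>.+?)\s+(?:is|are|as)\s+(?:an?\s+)?risk\s+factors?\s+for\s+"
    r"(?P<p>.+?)\s*(?:[.;]|$)",
    re.I,
)
# risk factors for <NP-P>,? include/including <NP-RF-list>
ENUMERATION = re.compile(
    r"risk\s+factors\s+for\s+(?P<p>.+?)\s*,?\s*includ(?:e|es|ing)\s+"
    r"(?P<rfs>.+?)\s*(?:[.;]|$)",
    re.I,
)

def extract_pairs(sentence):
    """Return a list of (risk factor, pathology) pairs found in the sentence."""
    text = " ".join(sentence.split())  # normalise spaces and line breaks
    match = ENUMERATION.search(text)
    if match:
        factors = re.split(r"\s*,\s*(?:and\s+)?", match.group("rfs"))
        return [(f, match.group("p")) for f in factors if f]
    match = DIRECT.search(text)
    if match:
        return [(match.group("rf"), match.group("p"))]
    return []

print(extract_pairs(
    "a high intake of calcium and phosphorus is a risk factor for "
    "the development of metabolic acidosis ."
))
```

On the second example above, the enumeration pattern yields five pairs, one per listed risk factor, all associated with "cerebrovascular accidents".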

Results
Application of three kinds of patterns
(1) {risk factor, pathology}, (2) risk factors, (3) pathologies
Definition of relations:
direct relations with patterns {risk factor, pathology}
combination of information provided by (2) and (3)
10,445 PMIDs provide information
313 pairs {risk factor, pathology}
15,398 pairs by combination of (2) and (3)
5,873 risk factors (2) not associated with any pathology
MeSH indexing: 5,106 pathologies and health conditions
21,584 triplets {risk factor, pathology_text?, pathology_MeSH?}
17,620 (14,895) pairs only provided by the patterns
5,717 (4,412) pairs contain MeSH descriptors as pathology

Evaluation
Evaluation of precision
ratio of correct extractions among the overall results
Manual evaluation:
no dedicated and comprehensive gold standard is available
Comparison with three relationships provided by Snomed
CT (nomenclature for organizing and exchanging clinical
data)
has causative agent: direct cause of the disorder or finding
(92,807 relations)
bacterial endocarditis has causative agent bacterium
due to: relate a clinical finding directly to its cause (25,309
relations)
acute pancreatitis due to infection
associated with: clinically relevant association between terms
without either asserting or excluding a causal or sequential
relationship between the two (36,134 relations)
fentanyl allergy has causative agent fentanyl

Evaluation
1 Quality and exhaustiveness of risk factors for a given
pathology
Evaluation by a medical doctor of 1,102 risk factors for
coronary heart disease: 88.38% precision
hypertension: {smoking; cigarette smoking; smoking history;
importance of total life consumption of cigarettes}
2 Comparison between text mining results for 20 pathologies
(3,100 extractions, about 25%) and Snomed CT causal
and associative relations (154,130 pairs)
19 extractions (0.6%) considered as already in Snomed CT
Snomed CT not dedicated to risk factors, but they may
occur
acquired immunodeficiency syndrome: {bisexuality, blood
transfusion, intravenous drug abuse }

Conclusion
Extraction of information related to risk factors
Relation with associated pathologies
Text mining approach based on semantico-syntactic
patterns
Evaluation by a medical doctor and a computer scientist
88.38% of risk factors related to coronary heart disease are
correct
about 70% of extracted pathologies are equivalent to
MeSH indexing
Snomed CT is not dedicated to the recording of risk factors,
although they may occur
⇒ Creation of a dedicated resource for risk factors is warranted

Future work
Use of other patterns, e.g. predictor, precursor ...
Machine learning methods
Knowledge representation:
homogeneous groups of risk factors
environmental, social, clinical, behavioral ...
Characterization of this information
modal, negative contexts
Geographical, demographic variation

Adaptation of Cross-Lingual Transfer Methods
for the Building of Medical Terminology in Ukrainian
Nowadays, methods and automatic tools exist for several
European languages and Japanese
[Kageura and Umino96, Cabre et al.01, Pazienza et al.05]
For many languages:
few NLP tools are available and suitable for automatic
terminology extraction
while textual data exist and terminological resources are
required

Our objective
Design of specific methods for the acquisition of such
terminological resources in Ukrainian
Approaches:
Compilation of terminological resources
Automatic building of terminologies
Observations: increasing availability of parallel bilingual corpora
Methodology: Use of specialized parallel corpora including a
low-resourced language (Ukrainian)
to build bilingual and trilingual terminologies
by the means of the cross-lingual transfer principle

Cross-lingual transfer principle
[Yarowsky et al.01, Lopez et al.02]
Hypothesis:
parallel and aligned corpora with two languages L1 and L2
syntactic or semantic annotations and information from L1
Method:
transpose these annotations or information from L1 to L2,
obtain the corresponding annotations and information in L2
Efficient way for [Zeman and Resnik08, Mcdonald et al.11]
processing multilingual texts from low-resourced
languages
creating various types of annotations: part-of-speech,
semantic categories or even acoustic and prosodic
features
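
The transfer principle can be illustrated with a small sketch: term spans annotated on the L1 side are projected onto L2 through word-level alignment links. The alignment below is written by hand for the example (in practice it would come from a statistical aligner), and `project_spans` is a hypothetical helper, not the method's actual implementation.

```python
def project_spans(spans, alignment, target_tokens):
    """Project L1 token spans onto L2.

    spans: list of (start, end) token index ranges in L1, end exclusive.
    alignment: set of (i, j) pairs linking L1 token i to L2 token j.
    Returns the L2 strings covered by the aligned tokens of each span.
    """
    projected = []
    for start, end in spans:
        targets = sorted(j for i, j in alignment if start <= i < end)
        if targets:
            low, high = targets[0], targets[-1]
            projected.append(" ".join(target_tokens[low:high + 1]))
    return projected

english = ["Hair", "loss", "called", "alopecia"]
ukrainian = ["Втрата", "волосся", ",", "що", "називається", "алопецією"]
# Hand-written links: Hair-волосся, loss-Втрата, called-називається, alopecia-алопецією
links = {(0, 1), (1, 0), (2, 4), (3, 5)}
term_spans = [(0, 2), (3, 4)]  # "Hair loss", "alopecia"

print(project_spans(term_spans, links, ukrainian))
# ['Втрата волосся', 'алопецією']
```

The sketch also shows why alignment quality matters: a single wrong link stretches or shifts the projected span.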

Drawbacks of the transfer principle
The transfer methodology depends on
the quality of the extracted information and annotation from
L1 texts
the quality of alignment
usually a statistical alignment method
depending on the size of the corpora:
the bigger the better
→ Define an approach to bypass these drawbacks

Material
Medical data in three languages (Ukrainian, French, and
English):
Ukrainian Wikipedia:
source of relevant terms
help for the word-level alignment of the MedlinePlus corpus
MedlinePlus corpus:
a collection of specialized texts
providing the basis for the building of the terminology

Medicine-related articles
from Ukrainian Wikipedia
Selection of the Ukrainian part of the Wikipedia using
medicine-related categories, such as Медицина
(medicine) or Захворювання (disorders)
Potentially covers a wide range of medical notions
Use of information in the infobox

Parallel medical corpus
[Hamon and Grabar17] -- http://natalia.grabar.free.fr/resources.php
Patient-oriented brochures in three languages (Ukrainian,
French, and English) from MedlinePlus
on several medical topics (body systems, disorders and
conditions, diagnosis and therapy, health and wellness)
created in English and then translated into several other
languages (including French and Ukrainian)
About 43,000 words for each language
English: Cancer cells grow and divide more quickly than healthy cells. Cancer treatments are made to work on these fast growing cells.
Ukrainian: Ракові клітини ростуть і діляться швидше, ніж здорові клітини. При лікуванні раку здійснюється вплив на ці клітини, що швидко ростуть.
- Tiredness / Втома
- Nausea or vomiting / Нудота або блювота
- Pain / Біль
- Hair loss called alopecia / Втрата волосся, що називається алопецією

Extraction of bilingual terminology
from the Wikipedia
Objective: complete and help the alignment method applied to
the MedlinePlus corpus
Use of content of the infoboxes
[Figure: processing chain] Ukrainian Wikipedia (medical part) → processing of the infoboxes → medical terms with MeSH codes → querying UMLS → pairs of medical terms (UK/FR and UK/EN)
Example: Цукровий діабет тип 2 → NIDDM; Type 2 Diabetes Mellitus; DID2; Diabète avec insulinorésistance

Extraction of bilingual terminology
from the MedlinePlus corpus
Illustration of the transfer methods
English: Cancer cells grow and divide more quickly than healthy cells. Cancer treatments are made to work on these fast growing cells.
Ukrainian: Ракові клітини ростуть і діляться швидше, ніж здорові клітини. При лікуванні раку здійснюється вплив на ці клітини, що швидко ростуть.
- Tiredness / Втома
- Nausea or vomiting / Нудота або блювота
- Pain / Біль
- Hair loss called alopecia / Втрата волосся, що називається алопецією

MedlinePlus Corpora (UK/FR & UK/EN)
Cleaning and manual paragraph alignment
POS tagging with TreeTagger and Flemm
FR & EN term extraction with YATEA
Transfer 1: extraction of UK terms corresponding to lines
Pairs of candidate terms (UK/FR and UK/EN)
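
One possible reading of Transfer 1: when a whole line on the French or English side of an aligned pair is a term found by the term extractor, the Ukrainian line aligned with it is taken as a candidate term. The sketch below assumes that reading. The function name and the data are hypothetical, and the set of extracted terms stands in for YATEA output.

```python
def transfer_by_line(aligned_lines, extracted_terms):
    """Return (Ukrainian line, source term) pairs for lines that are exactly one term."""
    normalized = {term.lower() for term in extracted_terms}
    pairs = []
    for source_line, uk_line in aligned_lines:
        source = source_line.strip().lower()
        # Only lines made of a single extracted term are transferred.
        if source in normalized:
            pairs.append((uk_line.strip(), source))
    return pairs


aligned_en_uk = [
    ("Tiredness", "Втома"),
    ("Pain", "Біль"),
    ("Cancer cells grow and divide more quickly than healthy cells.",
     "Ракові клітини ростуть і діляться швидше, ніж здорові клітини."),
]
en_terms = {"tiredness", "pain", "cancer cells"}  # stand-in for extractor output

candidates = transfer_by_line(aligned_en_uk, en_terms)
```

Full sentences are skipped here because the source line is longer than any single term, so terms embedded in running text need a finer-grained alignment.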

Transfer 2
MedlinePlus Corpora (UK/FR & UK/EN)
Giza++ suite (including MkCls)
MedlinePlus corpora aligned at the word level (UK/FR and UK/EN)
UK term extraction by transfer
Pairs of candidate terms (UK/FR and UK/EN)
Cross-fertilization with single-word terms
Cross-fertilization with Wikipedia pairs of medical terms
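
The word-level transfer can be illustrated with a small projection function. This is a sketch under stated assumptions, not the authors' code: the alignment is given as (source index, target index) pairs, which is not the native Giza++ output format, and keeping the smallest contiguous target span is a heuristic chosen for the example.

```python
def project_term(term_span, alignment, target_tokens):
    """Project a source term span onto the target sentence via word alignment.

    term_span: (start, end) token indices in the source sentence, end exclusive.
    alignment: iterable of (source_index, target_index) pairs.
    """
    start, end = term_span
    targets = sorted({t for s, t in alignment if start <= s < end})
    if not targets:
        return None  # no aligned word: the term cannot be transferred
    # Keep the smallest contiguous target span covering all aligned words.
    return " ".join(target_tokens[targets[0]:targets[-1] + 1])


en_tokens = ["Cancer", "cells", "grow"]
uk_tokens = ["Ракові", "клітини", "ростуть"]
word_alignment = [(0, 0), (1, 1), (2, 2)]  # hypothetical alignment

# "Cancer cells" covers source tokens 0 and 1.
uk_term = project_term((0, 2), word_alignment, uk_tokens)
```

With this kind of projection, a wrong alignment link directly produces a wrong or over-long Ukrainian candidate, so the output quality depends on the alignment quality.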

Evaluation
Performed by a native Ukrainian speaker with
knowledge of medical informatics
Manual checking of the extracted candidates: correct/incorrect
Validation:
Terms: independently in each language
Bilingual and trilingual relations
Computation of the precision of the results:
precision = correct answers / all the answers
with exact and inexact match (inexact: the correct term is included
in or includes the candidate)
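
The two precision figures can be computed as in the sketch below. The evaluation described above was manual. Here the reference term supplied for each candidate, the substring test used for inclusion, and the sample data are assumptions for illustration.

```python
def is_inexact_match(candidate, reference):
    """True when one term contains the other (substring test, an assumption)."""
    return candidate in reference or reference in candidate


def precision(judgements):
    """Share of correct answers among all answers; judgements is a list of booleans."""
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)


# Hypothetical (candidate, correct term) pairs as an evaluator might record them.
judged = [
    ("heart attack", "heart attack"),
    ("you can sleep", "sleep"),
    ("glasses of liquid", "risk factors"),
]

exact_precision = precision([cand == ref for cand, ref in judged])
inexact_precision = precision([is_inexact_match(cand, ref) for cand, ref in judged])
```

Inexact precision can never be lower than exact precision, since every exact match also satisfies the inclusion test.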

Results
Bilingual terminology from Wikipedia
357 Ukrainian medical terms (among them 177
single-word terms)
Use of the MeSH codes and UMLS:
1,428 French terms (among them, 339 single-word terms)
3,625 English terms (among them, 448 single-word terms)
These counts exceed the number of Ukrainian terms because of the MeSH
synonyms
Bilingual pairs:
1,515 Ukrainian/French term pairs (270 pairs between
single-word terms)
3,789 Ukrainian/English term pairs (405 pairs between
single-word terms)
Precision: 1 because of the collection method

Results
Bilingual terminology from the MedlinePlus - Transfer 1
436 Ukrainian terms with 0.966 precision
associated with 316 French terms and 354 English terms
282 triples between Ukrainian/French/English terms (prec.:
0.954)
63 pairs only between Ukrainian/French terms (prec.:
0.937)
115 pairs only between Ukrainian/English terms (prec.:
0.965)
Relations
involving synonyms: {втома, fatigue/tiredness},
{фаллопієва труба, trompes de fallope/trompe utérine} (fallopian
tube),
{втрата слуху/втрачається слух, hearing loss}
associating several case forms with the same English or
French form: {вагітність, pregnancy} and {вагітності,
pregnancy}

Analysis
Few errors:
mainly partial matches between two languages:
{ви можете спати, dormir/sleep} - lit. you can sleep.
{появу виразок у роті, mouth sores} - lit. (appearance of)
mouth sores
Causes of silence (missed terms):
variation due to the translation, which prevents the
Transfer 1 method from extracting the term in French or English
Догляд: matches the French title Soins but not the
English title Your care
Problem solved by the Transfer 2 method
errors in the POS tagging or term extraction strategy
Failure of the term extractor to identify French or English
terms

Results
Bilingual terminology from the MedlinePlus - Transfer 2
9,040 Ukrainian extracted terms (prec.: 0.454)
Exact match:
Higher precision of the French (0.674) and English terms (0.761)
But low number of terms: 3,671 for French, 3,597 for English
Due to the rich morphology of the Ukrainian language:
{напад, нападу} - attack, {припадків, припадки} - seizure,
{костей, кістки} - bones
Extraction of synonymous terms:
{биття, удару} - beats,
{приступам, припадків} - attacks/seizures
Relations:
3,724 pairs of Ukrainian/French terms (prec.: 0.309)
4,745 pairs of Ukrainian/English terms (prec.: 0.401)
4,724 triples of Ukrainian/French/English terms (prec.: 0.419)
Inexact match:
Higher precision: +0.40 points for the Ukrainian terms, +0.05 for
the French and English terms.
Due to the alignment quality?

Analysis
Error analysis
Most of the errors are due to alignment problems
When the alignment is correct, the Ukrainian terms are correctly
extracted by the transfer
Term analysis
Most of the extracted terms are specific to the medical domain:
{шприца, syringe}, {холестерину, cholesterol}, {фактори
ризику, risk factors}, {трахеотомією, tracheostomy}
Other terms: related or approximate notions:
{діти, children}, {здорову їжу, healthy diet}, {серцевий напад,
heart attack}, {склянок рідини, glasses of liquid}
Interesting observation:
Some French and English terms correspond to whole phrases in Ukrainian:
undercooked foods: не до кінця приготовлену їжу (lit. food which is
not fully cooked)
indolore (painless): При цьому обстеженні Ви не відчуєте жодного
болю (lit. With this exam you will feel no pain)

Conclusion
Proposal of transfer-based methods to
extract the term candidates in Ukrainian
create Ukrainian/French and Ukrainian/English term pairs
Works on freely available multilingual corpora in French,
English and Ukrainian
Resulting terminological resource: 4,588 Ukrainian medical
terms and 34,267 relations with French and English terms
→ Method suitable for building terminology in
low-resourced languages

Future Work
Bilingual word alignment with Fast-Align [Dyer et al.13]
Use of statistical and morphological cues
Use of transfer method for keyphrase extraction from scientific
papers
⇒ Ongoing work with Kyiv Institute of Cybernetics
Proposing a similar term extraction method to work with
comparable corpora

Overall conclusion
Biomedical text mining: a complex task which involves
several types of information ...
... to link together
many strategies for identifying the information
a lot of terminological and linguistic resources ...
... more or less available or difficult to build according to
languages and areas
Current challenges
concept recognition (disambiguation, normalization)
multilingual approaches
approaches for low-resourced languages
use of information issued from social media

Ongoing funded projects
Mining literature and using Open Linked Data
MIAM Project (French National Agency, 2016)
Mining literature to collect interactions existing between
drugs and food which might lead to adverse drug events
Example: Grapefruit inhibits the CYP3A4 enzyme, which
metabolizes many drugs
Objectives:
Aggregating information extracted from unstructured data with
knowledge already recorded in knowledge bases or Linked
Open Data repositories (Drugbank, Thériaque, Sider,
Diseasome, etc.)
Managing certainty and reliability of this information
Formalisation of the interactions in Linked Open Data

Drug-related information
[Figure: graph of drug-related information.
Node labels: solumedrol, methylprednisolone, cortisone, steroidal anti-inflammatory, digitaline, insulin, salt, sodium, lactosis, phosphate disodique anhydre, phosphate monosodique anhydre, acne, osteoporosis, swelling of face, arterial hypertension, ulcer of stomach, depression, allergic shock, Quincke oedema, suffocation by larynx oedema, brain oedema.
Relation and attribute labels: adverse effects, composition, is a, prescribed for, INN, DDI, FDI, prescription features, dosage, mode, frequency, reason, duration.]

Terminology acquisition for Ukrainian
Use of transfer method for keyphrase extraction from
scientific papers
Tuning of YATEA for Ukrainian
Definition and design of methods for terminological and
semantic relation acquisition

Дякую! (Thank you!)

Ahmad (Rabiah) et Bath (Peter A). --
Identification of risk factors for 15-year mortality among community-dwelling older people using Cox
regression and a genetic algorithm. Journal of Gerontology, vol. 60 (8), 2005, pp. 1052--8.
Aubin (Sophie) et Hamon (Thierry). --
Improving Term Extraction with Terminological Resources. In : Advances in Natural Language Processing
(5th International Conference on NLP, FinTAL 2006), éd. par Salakoski (Tapio), Ginter (Filip), Pyysalo
(Sampo) et Pahikkala (Tapio). pp. 380--387. --
Springer.
Blake (Catherine). --
A text mining approach to enable detection of candidate risk factors. In : Medinfo, pp. 1528--1528.
Cabré (MT), Estopà (R) et Vivaldi (J). --
Automatic term detection: a review of current systems, pp. 53--88. --
John Benjamins, 2001.
Cerrito (Patricia). --
Inside text Mining. Health management technology, vol. 25 (3), 2004, pp. 28--31.
Chapman (Wendy), Bridewell (Will), Hanbury (Paul), Cooper (Gregory) et Buchanan (Bruce). --
Evaluation of negation phrases in narrative clinical reports. In : Annual Symposium of the American Medical
Informatics Association (AMIA). --
Washington, 2001.
Dyer (Chris), Chahuneau (Victor) et Smith (Noah A.). --
A Simple, Fast, and Effective Reparameterization of IBM Model 2. In : NAACL/HLT, pp. 644--648.
Golik (Wiktoria), Bossy (Robert), Ratkovic (Zorana) et Nédellec (Claire). --
Improving term extraction with linguistic analysis in the biomedical domain. In : Proceedings of the 14th
International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'13). --
Samos, Greece, March 2013.
Grouin (Cyril), Abacha (Asma Ben), Bernhard (Delphine), Cartoni (Bruno), Deléger (Louise), Grau (Brigitte),
Ligozat (Anne-Laure), Minard (Anne-Lyse), Rosset (Sophie) et Zweigenbaum (Pierre). --
CARAMBA: Concept, Assertion, and Relation Annotation using Machine-learning Based Approaches. In :
Proceedings of the workshop I2B2 2010.
Grouin (Cyril), Grabar (Natalia), Hamon (Thierry), Rosset (Sophie), Tannier (Xavier) et Zweigenbaum
(Pierre). --
Eventual situations for timeline extraction from clinical reports. Journal of American Medical Informatics
Association, vol. 20 (5), September 2013, pp. 820--827. --
(IF: 3.609).
Hamon (Thierry) et Grabar (Natalia). --
Linguistic approach for identification of medication names and related information in clinical narratives.
Journal of American Medical Informatics Association, vol. 17 (5), Sep-Oct 2010, pp. 549--554. --
PMID: 20819862.
Tuning HeidelTime for identifying time expressions in clinical texts in English and French. In : Proceedings of
The Fifth International Workshop on Health Text Mining and Information Analysis (LOUHI2014) -- Short
paper/Poster, pp. 101--105. --
Gothenburg, Sweden, April 2014.
Adaptation of Cross-Lingual Transfer Methods for the Building of Medical Terminology in Ukrainian. In :
Proceedings of the 17th International Conference on Intelligent Text Processing and Computational
Linguistics (CICLING2016). --
Springer.
Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation. In :
Proceedings of Computational Linguistics and Intelligent Systems (COLINS 2017), pp. 10--19.
Hamon (Thierry), Nazarenko (Adeline), Poibeau (Thierry), Aubin (Sophie) et Derivière (Julien). --
A Robust Linguistic Platform for Efficient and Domain specific Web Content Analysis. In : Proceedings of
RIAO 2007. --
Pittsburgh, USA, 2007. 15 pages.

Hamon (Thierry), Graña (Martin), Raggio (Víctor), Grabar (Natalia) et Naya (Hugo). --
Identification of relations between risk factors and their pathologies or health conditions by mining scientific
literature. In : Proceedings of MEDINFO 2010, pp. 964--968. --
PMID: 20841827.
Hamon (Thierry), Grabar (Natalia) et Kokkinakis (Dimitrios). --
Medication Extraction and Guessing in Swedish, French and English. In : Proceedings of MedInfo 2013. --
Copenhagen, Denmark, August 2013.
Hamon (Thierry), Engström (Christopher) et Silvestrov (Sergei). --
Term ranking adaptation to the domain: genetic algorithm based optimisation of the C-Value. In : Proceedings
of PolTAL 2014 -- Advances in Natural Language Processing, éd. par Springer , pp. 71--83.
Kageura (K) et Umino (B). --
Methods of Automatic Term Recognition. In : National Center for Science Information Systems, pp. 1--22.
Kolyshkina (I) et van Rooyen (M). --
Text mining for insurance claim cost prediction, pp. 192--202. --
Springer-Verlag, 2006.
Lopez (Adam), Nossal (Mike), Hwa (Rebecca) et Resnik (Philip). --
Word-Level Alignment for Multilingual Resource Acquisition. In : LREC Workshop on Linguistic Knowledge
Acquisition and Representation: Bootstrapping Annotated Data. --
Las Palmas, Spain, 2002.
McDonald (Ryan), Petrov (Slav) et Hall (Keith). --
Multi-source transfer of delexicalized dependency parsers. In : EMNLP.
Minard (AL), Ligozat (AL), Ben Abacha (A), Bernhard (D), Cartoni (B), Deléger (L), Grau (B), Rosset (S),
Zweigenbaum (P) et Grouin (C). --
Hybrid methods for improving information access in clinical documents: concept, assertion, and relation
identification. J Am Med Inform Assoc, vol. 18 (5), 2011, pp. 588--93.
Pazienza (Maria Teresa), Pennacchiotti (Marco) et Zanzotto (Fabio Massimo). --
Terminology Extraction: An Analysis of Linguistic and Statistical Approaches. In : Knowledge Mining, éd. par
Sirmakessis (Spiros), pp. 255--279. --
Springer Berlin Heidelberg, 2005.
Périnet (Amandine), Grabar (Natalia) et Hamon (Thierry). --
Identification des assertions dans les textes médicaux : application à la relation {patient, problème médical}.
Traitement Automatique des Langues (TAL), vol. 52 (1), 2011, pp. 97--132.
Strötgen (Jannik) et Gertz (Michael). --
Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards. In : Proceedings of
the Eighth International Conference on Language Resources and Evaluation (LREC'12). pp. 3746--3753. --
ELRA.
Tsuruoka (Yoshimasa), Tateishi (Yuka), Kim (Jin-Dong), Ohta (Tomoko), McNaught (John), Ananiadou
(Sophia) et Tsujii (Jun'ichi). --
Developing a Robust Part-of-Speech Tagger for Biomedical Text. In : Proceedings of Advances in
Informatics - 10th Panhellenic Conference on Informatics, pp. 382--392.
Yarowsky (David), Ngai (Grace) et Wicentowski (Richard). --
Inducing multilingual text analysis tools via robust projection across aligned corpora. In : HLT.
Zeman (D) et Resnik (P). --
Cross-language parser adaptation between related languages. In : NLP for Less Privileged Languages.
Zweigenbaum (Pierre), Lavergne (Thomas), Grabar (Natalia), Hamon (Thierry), Rosset (Sophie) et Grouin
(Cyril). --
Combining an expert-based medical entity recognizer to a machine-learning system: methods and a case
study. Biomedical Informatics Insights, vol. 6 (Suppl. 1), 2013, pp. 51--62.
