2. Of course, very good search software already exists…
Search all structured and unstructured information, and any combination of the two
3. Regular Expressions
+ matches the preceding element one or more times
{m} matches the preceding element m times exactly
{m,} matches the preceding element at least m times
{m,n} matches the preceding element at least m times but not more than n times
• m ≤ n, with m, n ∈ ℕ₀ = {0, 1, 2, …}
• The element can be a literal, a literal range, an escaped wildcard, a ? wildcard, or a number
Examples:
• [abcte]+ = (cab or cat or bat or bet or tab …)
• appl[a-t]+ = (apple or apples or application or …)
• 10+ = (10 or 100 or 1000 or 100000 ...)
• [a-t]{0,10}dam = (amsterdam or dam or rotterdam or … )
• [0-9]{3}-[0-9]{4} = (123-4567 or 435-1539 or …)
• bo{1,}k{1,}* = (book or bookkeeper or Boké ...), where the trailing * is the search wildcard for any remaining characters
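The quantifier examples above can be checked with Python's standard `re` module. This is a minimal sketch: the word lists and the helper name `all_match` are illustrative, and `re.fullmatch` is used so that the whole word has to match the pattern.

```python
import re

# Patterns taken from the examples above; the word lists are illustrative.
examples = {
    r"[abcte]+": ["cab", "cat", "bat", "bet", "tab"],
    r"appl[a-t]+": ["apple", "apples", "application"],
    r"10+": ["10", "100", "1000", "100000"],
    r"[a-t]{0,10}dam": ["amsterdam", "dam", "rotterdam"],
    r"[0-9]{3}-[0-9]{4}": ["123-4567", "435-1539"],
}


def all_match(pattern, words):
    """Return True if every word matches the pattern in full."""
    return all(re.fullmatch(pattern, word) for word in words)


results = {pattern: all_match(pattern, words) for pattern, words in examples.items()}
for pattern, ok in results.items():
    print(f"{pattern!r}: {ok}")

# Counter-examples that do not satisfy the quantifiers.
no_match = [
    re.fullmatch(r"10+", "1"),                      # + needs at least one 0
    re.fullmatch(r"[0-9]{3}-[0-9]{4}", "12-4567"),  # {3} needs exactly three digits
]
```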
4. For the enthusiast: …
Building Backtracking NFA
Matches: Mississippi, mission, missing
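The pattern behind this slide's figure is not preserved in the text, so the following is only a sketch of the backtracking idea, not the automaton from the slide. It is a small recursive matcher (rather than an explicit NFA graph) for literals, `.`, and the quantifiers `*`, `+`, and `?`. The function name `match_full` and the pattern `mis+i.*` are hypothetical, chosen so that the three words above match.

```python
def match_full(pattern, text):
    """Return True if the pattern matches the whole text.

    Supports literal characters, '.', and the quantifiers '*', '+', '?'
    applied to a single preceding character.
    """
    if not pattern:
        return not text
    if len(pattern) >= 2 and pattern[1] in "*+?":
        elem, quant, rest = pattern[0], pattern[1], pattern[2:]
        # Count how many leading characters the element could consume.
        max_n = 0
        while max_n < len(text) and elem in (".", text[max_n]):
            max_n += 1
        if quant == "?":
            max_n = min(max_n, 1)
        min_n = 1 if quant == "+" else 0
        # Greedy: try the longest run first, then back off one character
        # at a time. This retry loop is the backtracking step.
        for n in range(max_n, min_n - 1, -1):
            if match_full(rest, text[n:]):
                return True
        return False
    if text and pattern[0] in (".", text[0]):
        return match_full(pattern[1:], text[1:])
    return False


pattern = "mis+i.*"  # hypothetical pattern, not taken from the slide
words = ["Mississippi", "mission", "missing", "Miami"]
matches = {word: match_full(pattern, word.lower()) for word in words}
print(matches)
```

The pattern `a*ab` against `aaab` shows why backtracking is needed: `a*` first consumes all three `a` characters, the remainder fails, and the matcher gives one back before it succeeds.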
5. Yet the era of traditional search is more or less over
• Too much data, too many hits, and no relevance ranking that always works best;
• You never know exactly what you get and what you miss;
• Too many (geographically dispersed) sources;
• Too many languages;
• All kinds of spelling variations;
• More and more non-textual formats.
7. What is Artificial Intelligence?
“State of the art”:
• Intelligent search
• Information detection and extraction
• Classification of information
• Representing knowledge
• Transferring knowledge
• Reasoning with knowledge
• Machine Learning
8. Examples of AI to improve search
• Intelligent search
• Intelligent analysis of the content of documents
• Identification and extraction of relevant information
• Classification of information
• Learning and storing knowledge about specific topics
• Machine translation
• Audio and video search
12. Information Extraction Hierarchy
• Entities: the basic units that can be found in a text, for example people,
companies, locations, products, medicines, and genes.
• Attributes: the properties of the found entities, such as a person’s job title,
age, and social security number, addresses of locations, quantities of
products, car registration numbers, and the type of organisation.
• Facts: these are relationships between entities, for example, a contractual
relationship between a company and a person.
• Events: these are interesting events or activities that involve entities, such as: “one
person speaks to another person”, “a person travels to a location”, and “a
company transfers money to another company”.
• Concepts, Sentiments or Emotions: finding abstract entities such as problems,
requests, sentiments, emotions, etc.
14. Searching for patterns instead of words
PERSON [visits | meets | lunches] PERSON
PERSON | COMPANY | ORGANIZATION [pays | wires | transfers] PERSON | COMPANY | ORGANIZATION
• Search at a higher (semantic) level.
• Automatic conjugation of verbs.
• Automatic resolution of co-references and personal pronouns.
• No need to maintain very long queries any more.
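A pattern query of this kind can be sketched as a match over tokens that already carry entity tags and verb lemmas. Everything in this example is hypothetical: the `Token` class, the `find_pattern` helper, and the sample sentence. It assumes an upstream extraction step has produced the tags and reduced each verb to its base form, which is what makes "met" match the query verb "meet".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    entity: Optional[str] = None  # e.g. "PERSON", "COMPANY"
    lemma: Optional[str] = None   # base form of a verb, e.g. "meet" for "met"


def find_pattern(tokens, subjects, verbs, objects):
    """Return (subject, verb, object) texts for each ENTITY VERB ENTITY run."""
    hits = []
    for a, v, b in zip(tokens, tokens[1:], tokens[2:]):
        if a.entity in subjects and v.lemma in verbs and b.entity in objects:
            hits.append((a.text, v.text, b.text))
    return hits


# Hypothetical output of an entity-extraction step.
sentence = [
    Token("Alice", entity="PERSON"),
    Token("met", lemma="meet"),
    Token("Bob", entity="PERSON"),
    Token("and"),
    Token("Acme", entity="COMPANY"),
    Token("wired", lemma="wire"),
    Token("Carol", entity="PERSON"),
]

parties = {"PERSON", "COMPANY", "ORGANIZATION"}
meetings = find_pattern(sentence, {"PERSON"}, {"visit", "meet", "lunch"}, {"PERSON"})
payments = find_pattern(sentence, parties, {"pay", "wire", "transfer"}, parties)
print(meetings, payments)
```

Matching on lemmas rather than surface forms is what removes the need to list every conjugation in the query.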
16. What are the main goals of document classification?
• Automatically classify documents as relevant or not relevant.
• Classify documents into various conceptual categories.
• Maximize recall (> 80%).
• Save search time.
• Find relevant documents automatically without depending too much on the search skills of an end user.
• Find things without knowing exactly what you are looking for.
17. How does this compare with other search techniques?
Machine Learning
• Supervised Document Classification
• Topic Modeling
Advanced Processing
• OCR bitmaps
• Audio Search
• Text Mining & Regular Expressions
• Visual Classification
Advanced Search
• Fuzzy & Wildcard Search
• Quorum & Proximity Search
• Ranking
• Regular Expressions
Metadata Search
• Document Properties
• File Properties
• Collection Properties
Standard Search
• Boolean Search
Figure axis: Recall, from 0% to 100%. Figure labels: Rules Based TAR, Machine Learning.
18. Which technologies are used?
Protocols supported: Random Start, Search Start (Continuous Active Learning), and Start with Topic Modeling, or a combination of all methods
Supervised Machine Learning Algorithm: Support Vector Machines (SVM)
Classifier type: Binary
Document Representation: Term Frequency–Inverse Document Frequency (TF-IDF) on full text or on extracted semantic document features* (entities)
Evaluations: 11-point precision/recall measurements in combination with 10-fold cross-validation
* Patented by ZyLAB
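To make the "11-point precision/recall" evaluation concrete, here is a minimal sketch of one common definition: interpolated precision at the recall levels 0.0, 0.1, …, 1.0. Whether the product uses exactly this variant is an assumption, and the function name and the toy ranking are illustrative.

```python
def eleven_point_precision(ranked_relevance, total_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_relevance: list of booleans, best-ranked document first.
    total_relevant: number of relevant documents in the collection.
    """
    points = []  # (recall, precision) after each retrieved document
    found = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            found += 1
        points.append((found / total_relevant, found / rank))

    curve = []
    for i in range(11):
        level = i / 10
        # Interpolated precision: the best precision at any recall >= level.
        candidates = [p for r, p in points if r >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


# Toy ranking: documents 1 and 3 are relevant, and those are all there are.
ranking = [True, False, True, False, False]
curve = eleven_point_precision(ranking, total_relevant=2)
print([round(p, 2) for p in curve])
```

In a 10-fold cross-validation the reviewed documents would be split into ten parts, each part being ranked by a classifier trained on the other nine, and the resulting curves averaged.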
20. Term Frequency (TF)–Inverse Document Frequency (IDF)
• The TF-IDF weight of a term is the product of its TF weight and its
IDF weight.
• Best-known weighting scheme in information retrieval
• The weight increases with the number of occurrences of the term within a document
• The weight increases with the rarity of the term in the collection
• Automatically down-weights non-discriminating terms: a term that occurs in every document gets an IDF weight of zero
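The weighting can be sketched in a few lines of plain Python. This example assumes raw term counts for TF and log(N / df) for IDF; other variants (log-scaled TF, smoothed IDF) are also in common use, and the source does not say which one is applied. The function name and the three sample documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    "the contract was signed",
    "the payment was wired",
    "the contract covers the payment",
]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 2) for t, v in w.items()})
```

The word "the" occurs in all three documents, so its IDF is log(3/3) = 0 and it drops out of every document vector, while "signed" (one document) outweighs "contract" (two documents).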
21. Support Vector Machines (SVM)
• Best-known text classifier so far.
• Implements automatic feature selection: selects most discriminating
features automatically.
• SVMs support the high-dimensional spaces seen in text
classification.
• SVMs have been reported to work better for text classification.
• ZyLAB is using a linear SVM which makes it very fast
• ZyLAB uses SVM as a binary classifier: one classifier per issue. Multi-
topic classification is possible by using multiple classifiers (one per
issue) at the same time.
• An SVM classifier returns a classification value in the range [0, 1]: 0 is a 100%
non-match, 1 is a 100% match. This is known as a confidence value.
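The "one binary classifier per issue" setup can be sketched as follows. A linear SVM scores a document with w · x + b; the source does not say how that signed score becomes a confidence value between 0 and 1, so the logistic function used here is an assumption. The weights, issue names, and document vector are hand-picked and hypothetical, not trained.

```python
import math


def decision_value(weights, bias, features):
    """Linear SVM decision function w . x + b over sparse features."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias


def confidence(score):
    """Squash a signed decision value into (0, 1); logistic mapping is an assumption."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical, hand-picked weights: one binary classifier per issue.
classifiers = {
    "fraud": ({"wired": 2.0, "offshore": 1.5, "lunch": -1.0}, -0.5),
    "hr": ({"lunch": 0.5, "vacation": 2.0, "wired": -1.0}, -0.5),
}

document = {"wired": 1.0, "offshore": 1.0}  # e.g. TF-IDF weights of its terms

scores = {
    issue: confidence(decision_value(w, b, document))
    for issue, (w, b) in classifiers.items()
}
print({issue: round(s, 3) for issue, s in scores.items()})
```

Because each issue has its own classifier, a document can score high on several issues at once, which is how multi-topic classification falls out of binary classifiers.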
22. Now imagine 1.2 million dimensional …
2-dimensional
3-dimensional
26. Machine Learning in practice
• Find relevant documents using standard search techniques.
• Review documents for correctness, best matching first.
• After every X new correct documents, build a classifier from the manually reviewed documents to recognize similar documents.
• Find potentially relevant documents by matching the classifier against all non-reviewed documents in the data.
• Calculate precision and recall of the classifier using 10-fold cross-validation on the training set. Calculate the precision of the return set.
• Stop if the precision and recall of the training set or of the return set are higher than a pre-agreed quality level (typically 70-80%).
• Return best-matching documents.
27. What is a stop condition?
The classifier is good enough to classify the remaining documents automatically.
“Good enough” can mean:
• Precision and recall of the classifier are both consistently > 80%.
• Precision of the classification of new documents is > 80%.
• Precision of the classification of new documents is < 10% after it first rose to > 80%.
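The three criteria can be written down as a small check that runs after every review batch. This is a sketch under stated assumptions: the function name, parameter names, and condition labels are hypothetical, the thresholds default to the 80% and 10% named above, and "consistently" is simplified to the latest cross-validated scores.

```python
def met_stop_conditions(cv_precision, cv_recall, batch_precisions,
                        target=0.80, floor=0.10):
    """Return the names of the stop conditions that currently hold.

    cv_precision, cv_recall: cross-validated classifier scores (0..1).
    batch_precisions: precision of each reviewed batch of new documents,
        oldest first.
    """
    met = []
    if cv_precision > target and cv_recall > target:
        met.append("classifier_quality")
    if batch_precisions and batch_precisions[-1] > target:
        met.append("new_documents_precise")
    # Precision collapsed after an earlier peak: few relevant documents remain.
    if (batch_precisions and batch_precisions[-1] < floor
            and max(batch_precisions) > target):
        met.append("dropped_after_peak")
    return met


print(met_stop_conditions(0.85, 0.90, []))
print(met_stop_conditions(0.50, 0.50, [0.30, 0.60]))
print(met_stop_conditions(0.50, 0.50, [0.40, 0.85, 0.05]))
```

Which of the conditions actually ends the review is a matter of prior agreement; the function only reports which ones are satisfied.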
28. Simulation on the Reuters document set
• 806,791 articles in total
• War, Civil War (GVIO): 32,615 articles (4.04%): 90% is found after reviewing only 45,000 documents, which is only 5.6% of the full corpus.
• Sports (GSPO): 35,317 articles (4.38%): 90% is found after reviewing only 32,000 documents. This is only 4% of the full corpus.
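The percentages on this slide follow directly from the article counts. The short check below uses only the numbers given above; the variable names are illustrative.

```python
corpus_size = 806_791

topics = {
    # topic code: (relevant articles, documents reviewed to find 90% of them)
    "GVIO": (32_615, 45_000),
    "GSPO": (35_317, 32_000),
}

summary = {}
for code, (relevant, reviewed) in topics.items():
    summary[code] = {
        # Share of the corpus that belongs to the topic.
        "prevalence_pct": round(100 * relevant / corpus_size, 2),
        # Share of the corpus that had to be reviewed.
        "reviewed_pct": round(100 * reviewed / corpus_size, 1),
    }
    print(code, summary[code])
```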
35. Question | Entities or patterns to address this question | Visualization options
Who is it about? | PERSON, COMPANY, ORGANIZATION, EMAIL ADDRESS | Pie chart, bar graph
What is it about? | Result of Topic Modeling (NMF) or Document-Term Correlation Matrix (A·Aᵀ) | Word cloud, word wheel
When did it happen? | DATE, TIME, MONTH, DAY, WEEK, YEAR | Timeline with bar graph
Where did it happen? | ADDRESS, CITY, COUNTRY, CONTINENT, DEPARTMENT and other geo-locations | Geographical mappings
Why did it happen? | Sentiments, emotions and cursing | Word cloud, word wheel on emotions and sentiments
How did it happen? | Custom patterns to recognize events, holistic OBJECT-PREDICATE-SUBJECT and RDF extractions | Relation graphs
How much/often did it happen? | Quantitative measures such as amounts, currencies, and other numbers. Also frequency and averages on entity occurrences. | Bar graphs
36. Matrix of question combinations (rows and columns: Who, When, Where, Why, What, How, How Much); non-empty cells per row, in column order:
Who: Centrality (Eigenvalue), Link Networks | Timeline | Geo-mapping | Centrality (Eigenvalue), Link Networks | Count, Average, Bar Graph
When: Time Line | Topic Rivers | Count, Average, Bar Graph
Where: Count, Average, Bar Graph
Why: Centrality (Eigenvalue), Link Networks | Count, Average, Bar Graph
What: Topic Rivers | Automatic Correlation Detection of synonyms | Count, Average, Bar Graph
How: Count, Average, Bar Graph
How Much: Count, Average, Bar Graph | Count, Average, Bar Graph
41. Want to learn more and see hands-on demos?
Sign up for the customer day on Thursday 30 March:
“Automatic answers to all your research questions”
Location: Amsterdam, WTC, 9:00 – 14:00
Keynote by crime reporter Peter R. de Vries:
“Looking for what is not in the file. The first lead”
www.zylab.nl/relatiedag