This presentation is based on a talk given by Pilip Yaromenka at Vitebsk Miniq #26, held on June 25, 2020:
https://community-z.com/events/miniq-qa .
About the talk:
Many people have dealt (or not) with search engines such as Solr, Elasticsearch, AWS/Google solutions, and so on, at various levels. It often happens that standard search falls short of the desired quality no matter what you do. Why can't we make it work like Google's, or even better? What do they have that we don't? The answer is semantic search. What it is, how it differs from the standard approach of any search engine, how it is done in general, and how we do it: that is what my talk is about.
2. About me
Pilip Yaromenka
EPAM Search CC and Java CC Expert, Solution Architect
Experience: 17+ years in software development, solution architecture development, team and project management, technical directorship
Business areas: Tax & Law, Publishing, Medical
Expert areas: Java (web development, desktop development, web services), Search, Semantic Web
3. [TOPIC]
Agenda
• Facts about search
• Problem statement
• Real-life example
• Semantics overview
• Ontologies
• Ontology-based approach
• Results
13. [TOPIC]
Some facts
>60% of sites
- don’t support thematic search queries such as “spring jacket” or “office chair”
- don’t support symbols and abbreviations, resulting in users missing out on perfectly relevant products if searching for “inch” when the site has used " or “in”
14. [TOPIC]
Some facts
~80%
of users wished that search engines could actually kind of “read their minds” to
produce the results they were looking for
16. [TOPIC]
Problem statement
In 80% of cases, information requests are entered in a full-text box:
• The average query contains two to four concepts (not just words, since a set of words often expresses a single concept)
• Around 10% of the queries include typos
• The inclusion of numbers and metadata in the full-text box is increasing, and it is reducing the use of the more precise “advanced forms”
[Pie chart “Search stats”: Browsing, Metadata, Boolean operators, Search box; values shown: 9%, 9%, 4%, 78%]
19. [TOPIC]
Problems with classical full-text search
TF-IDF
Term frequency–inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
Term frequency measures how frequently a term (a character sequence) occurs in a document:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
Inverse document frequency measures how important a term is; rarer terms are more important:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
Final score = TF(t) * IDF(t)
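The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine implements scoring; the toy corpus, the whitespace tokenization, and the choice to return 0 for terms that appear in no document are assumptions made for the example.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: ln(N / number of documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms score 0 instead of dividing by zero
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """Final score = TF * IDF, as on the slide."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "energy saving fridge with ice maker".split(),
    "fridge with child lock".split(),
    "smart vacuum cleaner".split(),
]

for term in ("fridge", "energy"):
    print(term, tf_idf(term, corpus[0], corpus))
```

Note that the score only counts character sequences: a word such as “without” is just another term here, and nothing in the formula captures that it negates what follows.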
25. Common user
An average user
Energy saving fridge without an ice maker!
I need some specific fridge, so please do your best to understand my specific needs.
28. No results
Not happy!
o energy saving fridge without an ice maker - NOT OK
o smart vacuum cleaner - NOT OK
o ceramic microwave - NOT OK
o bubble washing machine with sensor control - NOT OK
o sensor child-safety microwave with grill - NOT OK
o …
Is there any better approach?
30. [TOPIC]
What is “semantic”?
• Semantics - the branch of linguistics and logic concerned with meaning
• Semantic – something relating to meaning in language or logic
31. [TOPIC]
What is “semantic search”?
Wikipedia
Semantic search seeks to improve search accuracy by understanding searcher intent and the
contextual meaning of terms as they appear in the searchable dataspace, whether on the Web
or within a closed system, to generate more relevant results.
Techopedia
Semantic search is a data searching technique in which a search query aims to not only find
keywords, but to determine the intent and contextual meaning of the words a person is using
for search.
32. [TOPIC]
Semantic search
Semantic search seeks to improve search accuracy by understanding searcher intent and the
contextual meaning of terms as they appear in the searchable dataspace, whether on the Web
or within a closed system, to generate more relevant results
So, we need to save the meaning somehow…
33. [TOPIC]
Concepts vs Terms
Terms are nothing but symbols
People associate everything with concepts in their minds
A concept is an abstraction or generalization from experience or the result of a
transformation of existing ideas.
The concept is instantiated by all of its actual or potential instances, whether these are
things in the real world or other ideas.
[Image] = DOG
[Image] = TREE
35. [TOPIC]
Semantic Web
The Semantic Web is an extension of the Web through standards set by the World Wide Web Consortium (W3C). The standards promote common data formats and exchange protocols on the Web, most fundamentally the Resource Description Framework (RDF).
36. [TOPIC]
Semantic Web
Main ideas
• Everything is a concept or resource
• Every resource is described (or could be described) on the Web
• Every resource has a unique URI (real or fake)
• Resources can act as properties of each other
• Data becomes linked across system boundaries
• Resources can have complex logical relationships, just like concepts in the mind
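The ideas above can be illustrated with plain (subject, predicate, object) triples, the basic shape of RDF data. This is a sketch using only the Python standard library rather than a real RDF toolkit; the `http://example.org/appliance/` namespace and every resource name in it are hypothetical.

```python
# Hypothetical namespace; the URIs below are illustrative, not real resources.
EX = "http://example.org/appliance/"

# Each fact is a (subject, predicate, object) triple, as in RDF.
triples = {
    (EX + "Fridge", EX + "subClassOf", EX + "Appliance"),
    (EX + "Fridge", EX + "hasProperty", EX + "IceMaker"),
    (EX + "IceMaker", EX + "hasRange", EX + "Boolean"),
    (EX + "model-123", EX + "type", EX + "Fridge"),
    # The IceMaker resource is used here as a property of another resource.
    (EX + "model-123", EX + "IceMaker", "false"),
}


def objects(subject, predicate):
    """Return all objects linked to `subject` through `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


print(objects(EX + "Fridge", EX + "hasProperty"))
print(objects(EX + "model-123", EX + "IceMaker"))
```

The same URI appears once as a described resource and once as a predicate, which is one way to read “resources can act as properties of each other”.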
39. [TOPIC]
Links
• W3C starting page:
• https://www.w3.org/standards/semanticweb/
• Book:
• Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL
51. [TOPIC]
Query Classification
o energy saving fridge without an ice maker
o smart vacuum cleaner
o ceramic microwave
o bubble washing machine with sensor control
o sensor child-safety microwave with grill
Fridge and Freezers
Cookers and Ovens
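One simple way to picture query classification is a keyword lookup that maps a query to a product category. The sketch below is an assumption about how such a step might look, not the approach from the talk: only the two category names come from the slide, while the keyword sets and the `classify` helper are hypothetical.

```python
# Hypothetical keyword sets; only the two category names come from the slide.
CATEGORY_KEYWORDS = {
    "Fridge and Freezers": {"fridge", "freezer", "refrigerator"},
    "Cookers and Ovens": {"microwave", "oven", "cooker"},
}


def classify(query):
    """Return the first category whose keyword set overlaps the query tokens."""
    tokens = set(query.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:
            return category
    return None  # no known category matched


for q in ("energy saving fridge without an ice maker",
          "ceramic microwave",
          "smart vacuum cleaner"):
    print(q, "->", classify(q))
```

Queries that mention no known keyword fall through to `None`, which is where a richer vocabulary of synonyms becomes necessary.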
57. [TOPIC]
Is this such an easy task?
energy saving fridge without an ice maker
Gives us:
o energy saving fridge
o without
o an ice maker
fridge without an ice maker energy saving
Gives us:
o fridge
o without
o an ice maker energy saving
High variance, hard to process
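The variance shown above is easy to reproduce with a naive splitter that cuts the query around the word “without”. This is only a sketch of the problem; the `naive_chunks` helper is hypothetical and is not the parsing used in the talk.

```python
def naive_chunks(query, marker="without"):
    """Split a query into the text before the marker, the marker, and the text after."""
    before, found, after = query.partition(f" {marker} ")
    if not found:
        return [query]  # marker absent: the query stays a single chunk
    return [before, marker, after]


print(naive_chunks("energy saving fridge without an ice maker"))
print(naive_chunks("fridge without an ice maker energy saving"))
```

Both queries mean the same thing, yet the chunks differ: in the second one “energy saving” ends up glued to “an ice maker”, so word order alone changes what the downstream logic sees.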
61. [TOPIC]
Ontology structure provides all required information
Root: FridgeFreezer, Fridge, Refrigerator, Ice Box, Cooler, Cold Storage
Properties:
o Door open alarm: boolean
o Remote control: boolean
o Number of front doors: integer
o Number of fridge shelves: integer
o Ice maker: boolean
o Antibacterial: boolean
o Water filtration: boolean
o Reversible door: boolean
o Child lock: boolean
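The structure above could be held in code roughly as follows. This is a sketch under assumptions: it treats the first root entry (FridgeFreezer) as the class name and the remaining entries as synonyms, and the `OntologyClass` type and `resolve` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    name: str
    labels: list  # synonyms users may type (assumed reading of the "Root" line)
    properties: dict = field(default_factory=dict)  # property name -> value type


fridge = OntologyClass(
    name="FridgeFreezer",
    labels=["Fridge", "Refrigerator", "Ice Box", "Cooler", "Cold Storage"],
    properties={
        "Door open alarm": bool,
        "Remote control": bool,
        "Number of front doors": int,
        "Number of fridge shelves": int,
        "Ice maker": bool,
        "Antibacterial": bool,
        "Water filtration": bool,
        "Reversible door": bool,
        "Child lock": bool,
    },
)


def resolve(term, classes):
    """Find the class whose name or labels match `term`, ignoring case."""
    wanted = term.lower()
    for cls in classes:
        if wanted in [cls.name.lower()] + [label.lower() for label in cls.labels]:
            return cls
    return None


match = resolve("ice box", [fridge])
print(match.name, match.properties["Ice maker"])
```

With the synonyms and the property types in one place, a query term such as “ice box” resolves to the same class as “fridge”, and the type of each property tells the search logic what kind of value to expect.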
64. [TOPIC]
Query tagging
fridge without an ice maker energy saving
o “energy saving”: corresponds to EnergyRatingEU
o “ice maker”: corresponds to the boolean Ice Maker property
o “without”: provides a `false` value for the Ice Maker property
o “fridge”: defines the exact subject to be used by search logic
65. [TOPIC]
Query tagging
Input: fridge without an ice maker energy saving
Output:
fridge [
property:ice maker:false,
EnergyRatingEU:
EnergyA
EnergyA1
EnergyA2
EnergyA3
]
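A tagger that produces output of this shape can be sketched as below. It is a simplified illustration, not the implementation from the talk: the lookup tables, the `tag_query` function, and the dictionary layout of the result are assumptions, while the property name, the EnergyRatingEU name, and the EnergyA values follow the slide.

```python
# Hypothetical lookup tables; property and rating names follow the slide.
SUBJECTS = {"fridge": "fridge"}
BOOLEAN_PROPERTIES = {"ice maker": "ice maker"}
PHRASE_VALUES = {
    "energy saving": ("EnergyRatingEU", ["EnergyA", "EnergyA1", "EnergyA2", "EnergyA3"]),
}
NEGATIONS = {"without"}
ARTICLES = {"a", "an", "the"}


def tag_query(query):
    """Tag a query with a subject and property values, independent of word order."""
    tokens = [t for t in query.lower().split() if t not in ARTICLES]
    text = " ".join(tokens)
    result = {"subject": None, "properties": {}}
    for word, subject in SUBJECTS.items():
        if word in tokens:
            result["subject"] = subject
    for phrase, prop in BOOLEAN_PROPERTIES.items():
        if phrase in text:  # plain substring match; adequate only for a sketch
            # A negation word directly before the phrase flips the value to False.
            preceding = text[: text.index(phrase)].split()
            negated = bool(preceding) and preceding[-1] in NEGATIONS
            result["properties"][prop] = not negated
    for phrase, (prop, values) in PHRASE_VALUES.items():
        if phrase in text:
            result["properties"][prop] = values
    return result


print(tag_query("fridge without an ice maker energy saving"))
print(tag_query("energy saving fridge without an ice maker"))
```

Because the tags come from known concepts rather than from positions in the string, both word orders yield the same result.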
68. [TOPIC]
More complex stuff
• Natural-language question answering:
• Who was the us president when the berlin wall came down?
• What is the Rosetta stone when was it discovered and why is it so important
• Suggest the correct answer before the “search” itself
• Using user profile for context extraction:
• Get user profile from search system
• Get linked data from social networks
• Track related user activities
• Real-time event and sentiment analysis
• Link and analyze data from open sources and provide a combined result
• Provide answers according to audience emotions