3. Aardvark: Large-Scale Social Search Engine
(Horowitz and Kamvar, WWW 2010)
“64% of queries in Aardvark contain a subjective element”
(e.g., “Do you know of any great delis in Baltimore, MD?”
“What are the things/crafts/toys your children have made that made them really proud of themselves?”)
Acquired by Google in 2010 for USD 50,000,000 (approx. 53 billion KRW)
Fact search vs. consensus search
Growing demand for consensus search
9. Uhm.. Yeah.. It is noisy, but…
Online consumer posts: the 2nd most trusted form of advertising (The Nielsen Company, Q3 2011)
10. Is consensus search ever possible…?
“Best Action Movies in 2013”
Not immediately answerable with conventional search engines
Because the answer should be based on consensus, which cannot be found in any single “top-10” document
However, the answers are already on the Web
Numerous implicit votes from people on the Web and Social Networks
Only if we can process them ….
… ONLINE!
13. The Key Ideas (I)
Subdocument-level Indexing
Capture semantics from user opinion more precisely
The indexing unit is no longer a page but:
• a review within a page, if more than one review exists on the page,
• or a sentence within a review,
• or even a clause or phrase within a sentence discussing one aspect of the target entity
Maximal Coherent Semantic Unit (MCSU)
• the finest-grained indexing unit used in CONSENTO
• the maximal subsequence of words within a sentence that carries a single coherent semantics
Indexing MCSUs instead of documents enables semantic analysis to be performed at indexing time
• facilitating the online processing of consensus search at query time
14. The Key Ideas (II)
ConsensusRank: A Unique Ranking Method based on Public Sentiment
Virtually all existing ranking methods rank target objects (either documents or entities) directly, based on their relevance to the query terms
In contrast, ConsensusRank ranks entities indirectly, by aggregating the scores of the referring segments (e.g., MCSUs) that match the query context
It can be viewed as a voting process where each reviewer casts a weighted vote on an entity with respect to a query by expressing positive or negative opinions about that entity
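Read as a formula (our own notation; the paper's exact scoring function may differ), this voting view is roughly:

    \mathrm{ConsensusRank}(e, q) = \sum_{m \in M(e,\,q)} w(m) \cdot \mathrm{polarity}(m)

where M(e, q) is the set of MCSUs that refer to entity e and match the query context q, polarity(m) ∈ {+1, −1} encodes the positive or negative opinion, and w(m) weights the vote (e.g., by review quality or source).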
16. I: Parsing & Preprocessing
The current working prototype of CONSENTO is built on the movie domain
CONSENTO crawled review pages from popular movie review sites such as IMDb, Metacritic, etc.
Review contents are extracted using DOM-tree parsing and XPath queries (see the sketch below)
Extracted information includes:
entity name (i.e., movie name)
review text
date and time
review quality (e.g., “20 out of 30 people found the review helpful”)
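As a rough illustration of this step, a minimal Python sketch using lxml; the page markup, XPath expressions, and field names are hypothetical, not CONSENTO's actual ones:

    # Sketch of review-field extraction via DOM-tree parsing and XPath.
    # The <div class="review"> markup and all XPaths are made-up examples;
    # each real review site needs its own site-specific expressions.
    from lxml import html

    def extract_reviews(page_source, movie_name):
        tree = html.fromstring(page_source)
        reviews = []
        for node in tree.xpath('//div[@class="review"]'):
            reviews.append({
                "entity": movie_name,                               # movie name
                "text": " ".join(node.xpath('.//p[@class="text"]//text()')),
                "datetime": node.xpath('string(.//span[@class="date"])'),
                # e.g., "20 out of 30 people found the review helpful"
                "quality": node.xpath('string(.//span[@class="helpful"])'),
            })
        return reviews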
17. II: Contents Segmentation
Split the review contents into MCSUs
e.g., “The storyline is ridiculous, the acting is laughable, and the camera work is terrible.”
s1) “The storyline is ridiculous”
s2) “the acting is laughable”
s3) “the camera work is terrible”
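The slides do not spell out how MCSU boundaries are detected; as a crude stand-in, a naive clause splitter already reproduces the example above (a sketch, not CONSENTO's actual segmenter):

    import re

    def split_into_mcsus(sentence):
        # Naive approximation: break on commas, semicolons, and the
        # coordinating conjunctions "and"/"but". Real MCSU segmentation
        # would need proper linguistic analysis.
        parts = re.split(r'[,;]| and | but ', sentence.rstrip('.'))
        return [p.strip() for p in parts if p.strip()]

    # split_into_mcsus("The storyline is ridiculous, the acting is "
    #                  "laughable, and the camera work is terrible.")
    # -> ['The storyline is ridiculous', 'the acting is laughable',
    #     'the camera work is terrible']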
19. III: Indexing
CONSENTO indexes MCSUs in a conventional inverted index, as used in most modern search engines.
Only the mapping needs to be logically redefined, from (terms → documents) to (terms → MCSUs)
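A minimal sketch of the redefined mapping, assuming a plain in-memory index (a real deployment would sit on a Lucene-style engine):

    from collections import defaultdict

    index = defaultdict(set)          # term -> set of MCSU ids, not doc ids

    def index_mcsu(mcsu_id, text):
        for term in text.lower().split():
            index[term].add(mcsu_id)

    def match_segments(query):
        # conjunctive match: only segments containing ALL query terms
        sets = [index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()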
20. III: Indexing
* Conventional Indexing Method Example
Example review of the entity “Transformer 3”: “excellent visual effects, but plot was hard to follow”
Feature 1: visual effects (sentiment: excellent); Feature 2: plot (sentiment: hard to follow)
Document #1, bag of words: {excellent, visual, effects, plot, hard, follow}
Traditional inverted index (Term → Doc):
excellent → #1
visual → #1
effects → #1
plot → #1
hard → #1
follow → #1
Query: “excellent plot” → the system returns this document, even though “excellent” describes the visual effects, not the plot
21. III: Indexing
* Subdocument-level Indexing Example
The same review is split into two MCSU segments:
Segment 1: “excellent visual effects” / Segment 2: “plot was hard to follow”
SegmentID → (ObjectName, Feature, Sentiment)
Segment 1 → (Transformer 3, visual effects, excellent)
Segment 2 → (Transformer 3, plot, hard to follow)
Sub-document level index (Term → SegmentID, ObjectName, Feature, Sentiment):
excellent → SID1, Transformer 3, visual effects, excellent
visual → SID1, Transformer 3, visual effects, excellent
effect → SID1, Transformer 3, visual effects, excellent
plot → SID2, Transformer 3, plot, hard to follow
hard → SID2, Transformer 3, plot, hard to follow
follow → SID2, Transformer 3, plot, hard to follow
Query: “excellent plot” → doesn't match any segment
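Replaying this example in a few lines of Python makes the contrast with slide 20 concrete (illustrative only):

    segments = {
        "SID1": "excellent visual effects",
        "SID2": "plot was hard to follow",
    }

    def matches(query):
        q = set(query.lower().split())
        return [sid for sid, text in segments.items()
                if q <= set(text.split())]

    print(matches("excellent plot"))     # [] -- no single segment has both
    print(matches("excellent effects"))  # ['SID1']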
22. III: Indexing
Simply treating an MCSU as a document
Store additional information in each posting for use in the ranking stage
MCSU posting structure
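As a data structure, one plausible rendering of such a posting is the following sketch (field names are ours; the actual on-disk layout is not shown on the slides):

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class MCSUPosting:
        segment_id: str                 # e.g., "s16"
        entity_id: str                  # e.g., "e4" (Avatar)
        feature_ids: Tuple[str, ...]    # e.g., ("a2",) for soundtrack
        sentiword_ids: Tuple[str, ...]  # e.g., () if no polarity word
        review_id: str                  # key into (timestamp, review quality)
        source_id: str                  # e.g., "w3" (Metacritic)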
23. III: Indexing
* MCSU Posting Structure Example
Each posting has the form <segment id, entity id, [feature ids], [sentiword ids], review id, source id>.

Review ID (rid) → timestamp (ts), review quality (rq):
r1 → ts1, 0.8; r2 → ts2, 0.4; r3 → ts3, 0.6; r4 → ts4, 0.9; r5 → ts5, 0.4; r6 → ts6, 0.5; r7 → ts7, 0.7; r8 → ts8, 0.6; r9 → ts9, 0.8

Site Name → Source ID: IMDb → w1; Flixster → w2; Metacritic → w3; Yahoo! → w4

Feature → id: music → a1; soundtrack → a2; story → a3; plot → a4; performance → a5; acting → a6

Sentiword → id: great → m1; excellent → m2; superb → m3; tragic → m4

Entity → id: Titanic → e1; Brokeback Mountain → e2; Dark Knight → e3; Avatar → e4

Term → Postings:
Cameron → <s19, e4, [−], [m3], r7, w3>
Pandora → <s16, e4, [a2], [−], r6, w3>, <s18, e4, [−], [−], r6, w3>
tragic → <s7, e2, [a3], [m4], r3, w1>
performance → <s5, e1, [a6], [m6], r2, w1>, <s9, e2, [a6], [m3], r3, w1>, <s11, e2, [a6], [m1], r4, w1>, <s13, e3, [a6], [−], r5, w2>, <s15, e4, [a6], [−], r5, w3>, <s20, e3, [a6], [−], r8, w4>, <s21, e3, [a6], [m6], r9, w4>
soundtrack → <s4, e1, [a2], [−], r2, w1>, <s10, e2, [a2], [m2], r4, w1>, <s16, e4, [a2], [−], r6, w2>, <s22, e3, [a2], [m1], r9, w4>
plot → <s14, e3, [a4], [−], r5, w2>
acting → <s13, e4, [a6], [−], r9, w4>
music → <s2, e1, [a1], [m1], r1, w1>, <s8, e2, [a1], [m1], r3, w1>
Yeston → <s2, e1, [a1], [−], r1, w1>
story → <s1, e1, [a3], [m1], r1, w1>, <s7, e2, [a3], [−], r3, w1>, <s12, e2, [a3], [m2], r4, w1>, <s17, e4, [a3], [−], r6, w3>

Source reviews (grouped by entity; Review ID → segmented text):
Titanic
r1: (s1) the greatest love stories of all //(s2) and beautiful music from Yeston. //(s3) Everything about this movie was excellent...
r2: (s4) touching soundtrack, //(s5) and perfect handling of the known tragedy with nice performance. //(s6) This has the best love scene I have ever seen…
Brokeback Mountain
r3: (s7) beautiful tragic love story, //(s8) with great music. //(s9) superb performances in movies ever!
r4: (s10) The soundtrack is also excellent, //(s11) great performance, //(s12) excellent presentation of a love story…
The Dark Knight
r5: (s13) The performance by Heath Ledger was outstanding //(s14) and plot is amazing too…
r8: (s20) Joker shows phenomenally awesome performance!…
r9: (s21) nice performance //(s22) and backed up with great soundtrack. //(s23) excellent casting!
Avatar
r6: (s15) Navi looks very real, good performance, //(s16) beautiful soundtrack that emphasizes the vastness of Pandora, //(s17) with love story. //(s18) The world of Pandora is stunning
r7: (s19) James Cameron deserves high praise for this creation…
24. IV: Query Parsing
CONSENTO preprocesses the query and performs query expansion:
stop-word removal,
polarity-only word removal,
feature expansion,
stemming
Polarity-only word removal
“good action movie” and “great action movie” should be treated as the same query
Feature words are expanded for better recall
‘plot’ → {plot, story}
‘music’ → {music, soundtrack}
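A minimal sketch of these steps; the word lists are tiny illustrative stand-ins for CONSENTO's real lexicons, and stemming is omitted for brevity:

    STOPWORDS = {"a", "an", "the", "of", "in"}
    POLARITY_ONLY = {"good", "great", "best", "nice"}  # polarity, no feature
    FEATURE_SYNONYMS = {"plot": {"plot", "story"},
                        "music": {"music", "soundtrack"}}

    def parse_query(query):
        terms = [t for t in query.lower().split()
                 if t not in STOPWORDS and t not in POLARITY_ONLY]
        expanded = set()
        for t in terms:
            expanded |= FEATURE_SYNONYMS.get(t, {t})
        return expanded

    parse_query("good action movie")   # {'action', 'movie'}
    parse_query("great action movie")  # {'action', 'movie'} -- same query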
25. V: Retrieval
Retrieve the MCSU segments that match the query terms
Same as how conventional systems retrieve document posting lists
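In code the step is the familiar posting-list fetch, just over MCSU postings (a sketch):

    def retrieve(index, query_terms):
        # fetch and union the posting lists, one per query term, exactly
        # as a conventional engine does for documents -- except every
        # posting here points at an MCSU segment, not a whole page
        postings = []
        for term in query_terms:
            postings.extend(index.get(term, []))
        return postings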
26. VI: Ranking
Group MCSU postings by entity and aggregate the scores of the postings to compute the score of the corresponding entity
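A minimal sketch of that aggregation, assuming the MCSUPosting structure sketched at slide 22 and leaving the exact vote weighting (e.g., by review quality and source) as a pluggable function:

    from collections import defaultdict

    def consensus_rank(postings, vote):
        # vote(p) returns a signed score for posting p: positive opinions
        # count for the entity, negative opinions against it
        scores = defaultdict(float)
        for p in postings:
            scores[p.entity_id] += vote(p)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)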
31. Experimental Setup: Data Set
Movie data set
Source
•Amazon, IMDb, Metacritic, Flixster, Rotten Tomatoes, and Yahoo! Movies
Period
•2008–2010
More than 740 movies and 30K reviews
Hotel data set
The hotel data set from Ganesan and Zhai
Reviews for the hotels in 10 major cities from TripAdvisor
The authors provided us with the corrected judgment set for our tests
32. Experiment
Methods
Ganesan and Zhai's OE and QAM methods
•OE: opinion expansion
•QAM: query aspect model
Baselines
1) BM25
•b = 0.75
•k1 = 2
2) VSMBM (Lucene default)
•vector space model + Boolean model
3) ConsensusRank
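For reference, a minimal implementation of the BM25 baseline's per-term score with the stated parameters (the standard Okapi formula with a common IDF smoothing; our code, not the paper's):

    import math

    def bm25_term_score(tf, df, doc_len, avg_doc_len, n_docs, k1=2.0, b=0.75):
        # tf: term frequency in the document; df: document frequency of
        # the term; n_docs: collection size
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        return idf * tf * (k1 + 1) / (tf + norm)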