This talk discusses a set of specific tasks and scenarios related to information access within the vast space that is casually referred to as conversational AI. While most of these problems have been identified in the literature for quite some time now, progress has been limited. Apart from the inherently challenging nature of these problems, the lack of progress, in large part, can be attributed to the shortage of appropriate evaluation methodology and resources. This talk presents some recent work towards filling this gap.
In one line of research, we investigate the presentation of tabular search results in a conversational setting. Instead of generating a static summary of a result table, we complement brief summaries with clues that invite further exploration, thereby taking advantage of the conversational paradigm. One of the main contributions of this study is the development of a test collection using crowdsourcing.
Another line of work focuses on large-scale evaluation of conversational recommender systems via simulated users. Building on the well-established agenda-based simulation framework from dialogue systems research, we develop interaction and preference models specific to the item recommendation scenario. For evaluation, we compare three existing conversational movie recommender systems with both real and simulated users, and observe high correlation between the two means of evaluation.
This talk was given at the CIIR talk series at the University of Massachusetts Amherst in January 2021, as well as at the IR seminar series at the University of Glasgow in March 2021.
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
1. Krisztian Balog
University of Stavanger
@krisztianbalog krisztianbalog.com
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
2. In this talk
• My perspective on conversational information access
• Point #1: conversational aspect has not received due attention so far
• Point #2: evaluation is a bottleneck of progress
• Recent work on evaluation in the context of two specific conversational
information access scenarios
6. Traditional distinction [1,2]
Task-oriented (goal-driven)
• Aim to assist users in solving a specific task (as efficiently as possible)
• Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain
• Well-defined measure of performance that is explicitly related to task completion
Non-task-oriented (non-goal-driven)
• Aim to carry on an extended conversation ("chit-chat") with the goal of mimicking human-human interactions
• Developed for unstructured, open-domain conversations
• Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner
1. Chen et al. A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explor. Newsl. 19(2), 2017.
2. Serban et al. A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version. Dialogue & Discourse 9(1), 2018.
7. Contemporary distinction [1,2]
Task-oriented
• Aim to assist users in solving a specific task (as efficiently as possible)
• Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain
• Well-defined measure of performance that is explicitly related to task completion
Interactive QA
• Aim to provide concise, direct answers to user queries
• Dialogues are unstructured, but commonly follow a question-answer pattern; mostly open domain (dictated by the underlying data)
• Evaluated with respect to the correctness of answers (on the turn level)
Social chat
• Aim to carry on an extended conversation ("chit-chat") with the goal of mimicking human-human interactions
• Developed for unstructured, open-domain conversations
• Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner
1. Gao et al. Neural Approaches to Conversational AI. Found. Trends Inf. Retr. 13(2-3), 2019.
2. Deriu et al. Survey on Evaluation Methods for Dialogue Systems. Artif. Intell. Rev., 2020.
8. Where does Conversational Info Access fit?
(The Task-oriented / Interactive QA / Social chat comparison from slide 7, repeated.)
10. Search as a conversation
User: places to visit in Stavanger
AI: [search results shown as a figure]
11. Search as a conversation
User: places to visit in Stavanger
AI: By Stavanger you meant the city in Norway, right?
AI: Would you be interested in other cities in Norway as well?
AI: Do you want to know more about Stavanger?
12. Main differences
• Degree of personalization and long-term user state
• Support for complex (multi-step) tasks
• Answer generation vs. answer retrieval
• Dialogue setting where a screen or keyboard may not be present
• Mixed initiative
13. J.S. Culpepper, F. Diaz, M.D. Smucker editors. Research Frontiers in Information Retrieval: Report from the
Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018). SIGIR Forum 52(1), 2018.
16. Example dialogue
User: I'd like to buy new running shoes. Do you have something similar to my Nike? [Task-oriented]
AI: There is a new Pegasus model. Also, you might want to check out Asics' Gel Nimbus line, which has better ankle support.
User: Why does ankle support matter? [Interactive QA]
AI: …
User: I don't enjoy running that much lately. [Social chat]
AI: …
17. Hybrid systems
• Most commercial systems are hybrids that combine the strengths of task-specific approaches/architectures
• Hierarchical dialogue manager (sketched below):
• Broker (top-level) that manages the overall conversation process
• Collection of skills (low-level) that handle different types of conversation segments
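A minimal sketch of such a hierarchical dialogue manager; all class and method names here are illustrative, not taken from any particular commercial system:

```python
# Hypothetical broker/skills architecture: a top-level Broker classifies
# each utterance and dispatches it to a low-level skill.

class Skill:
    """Low-level component that handles one type of conversation segment."""
    def respond(self, utterance: str) -> str:
        raise NotImplementedError

class RecommenderSkill(Skill):
    def respond(self, utterance: str) -> str:
        return "Here are some movies you might like..."

class ChitChatSkill(Skill):
    def respond(self, utterance: str) -> str:
        return "Interesting! Tell me more."

class Broker:
    """Top-level component that manages the overall conversation process."""
    def __init__(self, skills, fallback):
        self.skills = skills      # intent label -> Skill
        self.fallback = fallback

    def classify(self, utterance: str) -> str:
        # Placeholder intent detector; a real broker would use a trained model.
        return "recommend" if "movie" in utterance.lower() else "chitchat"

    def handle(self, utterance: str) -> str:
        intent = self.classify(utterance)
        return self.skills.get(intent, self.fallback).respond(utterance)

broker = Broker(
    skills={"recommend": RecommenderSkill(), "chitchat": ChitChatSkill()},
    fallback=ChitChatSkill(),
)
print(broker.handle("I'd like to watch a movie tonight."))
```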
18. A disconnect
Siloed view (current practice) vs. holistic view (SWIRL'18 vision)
19. A disconnect
• Siloed view (current practice): separate architectures for the different types of systems (task-oriented, interactive QA, social chat)
• Holistic view (SWIRL'18 vision): a unified conversational info access architecture, coordinated by a broker, that supports multiple user goals
21. Progress so far (selection)
• Intent detection
• Asking clarification questions
• Query resolution
• Response retrieval
• …
• Mostly on the component level and not specific to the conversational
paradigm
22. Example: TREC CAsT [1]
• TREC Conversational Assistance Track
• Setup: Given a user utterance, retrieve an answer from a corpus of paragraphs
• Evaluation is performed with respect to answer relevance
• Main challenge is coreference resolution (see the toy sketch below)
1. http://www.treccast.ai/
User: What is throat cancer?
AI: …
User: Is it treatable?
AI: …
User: Tell me about lung cancer.
AI: …
User: What are its symptoms?
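As a toy illustration of why coreference resolution matters here (deliberately naive, not the approach of any CAsT participant), one can rewrite an utterance into a self-contained query by substituting pronouns with the most recently mentioned topic:

```python
# Naive heuristic: replace pronouns with the last-mentioned topic so that
# the query can be issued against the paragraph corpus in isolation.

def rewrite(utterance, topic):
    pronouns = {"it", "its", "they", "their"}
    out = []
    for token in utterance.split():
        word = token.strip("?.,!").lower()
        if topic and word in pronouns:
            token = topic + ("'s" if word in {"its", "their"} else "")
        out.append(token)
    return " ".join(out)

topic = "lung cancer"  # tracked from "Tell me about lung cancer."
print(rewrite("What are its symptoms?", topic))
# -> What are lung cancer's symptoms?
```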
23. Issues
• Evaluation is limited to single turns and does not consider system responses to previous user utterances
• User utterances are given beforehand and follow a fixed sequence
• Answers are limited to existing paragraphs in a corpus
• Answers are returned as a ranked list
• Coreference resolution is not specific to conversational search; it has been studied in the context of Web/session search
• Does the task reduce to passage retrieval?
24. A disconnect
• All test collections are abstractions of real tasks, but we need to make sure that we don't abstract away the conversational aspects
[Diagram: user utterances each paired with system responses and evaluated in isolation (U; S, S, …) vs. an interleaved multi-turn conversation (U, S, U, S, U)]
25. Re: SWIRL 2012 [1]
1. J. Allan, B. Croft, A. Moffat, and M. Sanderson, editors. Frontiers, Challenges, and Opportunities for Information Retrieval: Report from SWIRL 2012, the Second Strategic Workshop on Information Retrieval in Lorne. SIGIR Forum 46(1), 2012.
27. Take-away #2
We need to think (more) about what really makes a
conversational experience good/bad and develop
appropriate evaluation resources.
28. Part II
From Answer Retrieval to Answer Generation
based on:
S. Zhang, Z. Dai, K. Balog, and J. Callan. Summarizing and Exploring Tabular Data in Conversational Search. SIGIR ’20.
29. Conversational AI
(The Task-oriented / Interactive QA / Social chat comparison from slide 7, repeated.)
31. Motivation
User: What is the largest field sport stadium in the world?
Option 1: AI: The AT&T Stadium in Arlington, Texas, USA.
Option 2: AI: The AT&T Stadium in Arlington, Texas, USA, which can seat over 80,000 people.
Option 3: AI: The AT&T Stadium in Arlington, Texas, USA, home of the Dallas Cowboys, which can seat over 80,000 people.
32. What makes it a good response in a
conversational setting?
35. Motivation
User: What will the weather be like next week?
Option 1: AI: Today, it'll be sunny, with temperatures between -6 and -2 °C. Tomorrow, it'll be overcast, with temperatures between -1 and 2 °C. On Sunday, it'll be overcast, with temperatures between 1 and 2 °C. On Monday, it'll be rainy, with temperatures between 2 and 4 °C. On Tuesday, it'll be rainy, with temperatures between 3 and 4 °C. …
Option 2: AI: Today, it'll be sunny. Over the weekend, it'll be overcast with max temperatures around 2 °C. From Monday onwards, it'll be rainy with temperatures between 2 and 6 °C. And, if you think that's depressing, you should look at the wind forecast.
37. Motivation
• Approximately 10% of QA queries are answered by tabular data
• How to present tabular search results effectively in a conversational
setting?
• Summary is not static, but is conditioned on conversation context
• Summary should help drive the conversation (invite for further exploration)
38. Motivation
What is the largest field sport stadium in the world?
https://en.wikipedia.org/wiki/List_of_covered_stadiums_by_capacity
41. The anatomy of an answer
User: What is the largest field sport stadium in the world?
AI: I found a table listing the largest field stadiums in the world. [leading sentence describing the table] The largest one is the AT&T Stadium, which holds 80,000 people. [answer to the question] All the listed stadiums have a capacity of over 5,000 people, and 30 of 67 are in the US. [further information to help explore further]
42. Would people want to be engaged in a conversation?
• Experiment: augmenting answers to
NBA-related questions [1]
• User engagement increased overall
• Explicit negative feedback also
increased
• Users are less inclined to engage in a
conversation when their team lost
1. Szpektor et al. Dynamic Composition for Conversational Domain Exploration. The Web Conf., 2020.
Figure taken from Szpektor et al. [1]
43. Objective
• Create a test collection for table
summarization in a conversational
setting
• Input: query q and result table T
• Output: summary S
Illustration of the conversational table summarization task:
q: What is the largest field sport stadium in the world?
T: List of covered stadiums by capacity
S: I found a table listing the largest field stadiums in the world. The largest one is the AT&T Stadium, which holds 80,000 people. All the listed stadiums have a capacity of over 5,000 people, and 30 of 67 are in the US.
44. Queries and tables
• 200 tables sampled randomly from
the WikiTableQuestions dataset [1]
• Tables with at least 6 rows and 4
columns
• Corresponding queries either
sampled (45) or written by two of the
authors (155)
1. Pasupat and Liang. Compositional Semantic Parsing on Semi-Structured Tables. ACL-IJCNLP 2015.
Example queries and tables:
q: tell me about the movies he acted
T: Zhao_Dan#Selected_filmography
q: how many times was ed sheeran listed as the performer
T: List of number-one singles and albums in Sweden 2014
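A sketch of how such a sample could be drawn; the directory layout and CSV format below are assumptions based on the public WikiTableQuestions release, not details given on the slide:

```python
# Filter WikiTableQuestions tables to those with >= 6 rows and >= 4 columns,
# then draw a random sample of 200.
import csv
import glob
import random

def table_size(path):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    n_cols = len(rows[0]) if rows else 0
    return len(rows) - 1, n_cols  # data rows (excluding header), columns

candidates = []
for path in glob.glob("WikiTableQuestions/csv/*/*.csv"):
    n_rows, n_cols = table_size(path)
    if n_rows >= 6 and n_cols >= 4:
        candidates.append(path)

random.seed(42)  # arbitrary seed, for reproducibility only
sample = random.sample(candidates, 200)
```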
45. Creating candidate summaries
• Crowdsourcing task, mimicking a conversation with a friend
• Summary in 30-50 words (150-250 characters)
• ~15-30 seconds long when spoken
• 5 summaries per query
Instructions given to crowd workers: "Imagine talking to a friend on the phone. Your friend asks you a question, and you find the following table on the Web. Remember that your friend cannot see the table. Your goal is to let your friend capture the essential information in the table related to the question. Your summary should be short but comprehensive. Try to describe several rows or columns that you find interesting."
46. Assessments
Summary-level
• Assessors (7) are presented with the question and all candidate summaries (5)
• They are asked to select the best summary, considering language quality, relevance, and ability to drive a conversation
• The summary with the most votes is selected as the ground truth
• In 77% of the cases, at least 3 assessors agreed on which summary was best
• Top-voted summaries tend to be longer and use a richer vocabulary (more unique words)
Sentence-level
• Summaries are split into sentences (avg. 3.5)
• Assessors (3) are presented with a sentence, together with the previous and next sentences as context, and are asked to judge its relevance
• They are situated in a scenario where they need to write a short summary (to a friend)
• Highly relevant: describes the table's topic or provides a fact regarding the query
• Relevant: is not directly relevant to the question but helps to explain the contents of the table
• Non-relevant: not about the table or unrelated to the query
47. Experiments
• Goals: establish a baseline; identify challenging aspects of the task
• Setup: 5-fold cross-validation; all summaries with >0 votes used for training
• Methods: pre-trained neural language generation models (CopyNet, GPT-2, T5)
• Tables are represented as a text sequence
• A "key:value" format is used to preserve structure; the page title, table caption, and total number of rows are also added with special keys (see the sketch below)
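As a rough illustration of this linearization (the exact special keys are not shown on the slide, so the key names below, "page_title", "caption", and "num_rows", are assumptions):

```python
# Turn a table plus its metadata into a flat "key: value" text sequence
# that a sequence-to-sequence model can consume.

def linearize_table(page_title, caption, header, rows):
    parts = [
        f"page_title: {page_title}",
        f"caption: {caption}",
        f"num_rows: {len(rows)}",
    ]
    for row in rows:
        parts.extend(f"{key}: {value}" for key, value in zip(header, row))
    return " | ".join(parts)

seq = linearize_table(
    page_title="List of covered stadiums by capacity",
    caption="Largest field sport stadiums",
    header=["Stadium", "Capacity", "Country"],
    rows=[["AT&T Stadium", "80,000", "USA"]],
)
print(seq)
```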
49. Analysis
• Machine-generated summaries are fluent
• Typical mistakes:
• Wrong quantity (84% of summaries), e.g., "Purdue won 4 consecutive in the 1960s, …" => it won only 3
• Wrong reference (38% of summaries), e.g., elevation vs. prominence when asking for the highest mountain
Example of ground truth vs. auto-generated summaries:
q: What album had the most sales in 2012 in Finland?
T: List of number-one albums of 2012 (Finland)
Manual: I saw a table showing sales figures of around 10 albums of 2012. Of these "vain elamaa" by various artists ranked highest regarding sales. It is performed by various artists. The second rank is held by "koodi" by robin.
T5: I found a table of the top 10 albums of 2012. The most sold album was "vain elämä" by various artists. It sold 164,119 copies. The next highest album was "koodi" by robin.
50. Summary
• Response generation is an open research challenge in Interactive QA
• Needs to be designed specifically for the conversational paradigm and go beyond
mere passage retrieval
• Needs appropriate evaluation methodology and resources
• Test collection for summarizing and exploring tabular data in conversational
search
• Includes both summary-level and sentence-level human assessments
• Available at https://github.com/iai-group/sigir2020-tablesum
51. Part III
Towards Automated End-to-end System
Evaluation
based on:
S. Zhang and K. Balog. Evaluating Conversational Recommender Systems via User Simulation. KDD ’20.
52. Conversational AI
(The Task-oriented / Interactive QA / Social chat comparison from slide 7, repeated.)
53. Motivation
• Evaluating conversational systems with real users is time-consuming and
expensive
• Can we simulate users (for the purpose of evaluation)?
• The specific problem is conversational item recommendation
• Help users find an item they like from a large collection of items
• Elicit preferences on specific items or categories of items
54. Why simulation?
• Test-collection based ("offline") evaluation
• Possible to create a reusable test collection for a specific subtask
• Limited to a single turn, does not measure overall user satisfaction
• Human evaluation
• Possible to annotate entire conversations
• Expensive, time-consuming, does not scale
55. Objectives
• Develop a user simulator that
• produces responses that a real user would give in a certain dialog situation
• enables automatic assessment of conversational agents
• makes no assumptions about the inner workings of conversational agents
• is data-driven (requires only a small annotated dialogue corpus)
56. Problem statement
• For a given system S and user population U, the goal of user simulation U* is to predict the performance of S when used by U, denoted M(S, U)
• For two systems S1 and S2, U* should be such that if M(S1, U) < M(S2, U), then M(S1, U*) < M(S2, U*) (see the sketch below)
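One simple way to check this condition across a set of systems (a sketch, not the paper's exact protocol) is to compute the rank correlation between the orderings induced by M(·, U) and M(·, U*); the scores below are the Reward values reported on slide 70 for real users and the QRFA-Single simulator:

```python
# Kendall's tau of 1.0 means the simulated users rank the systems exactly
# as the real users do.
from scipy.stats import kendalltau

real = [8.88, 7.56, 6.04]       # M(S_i, U) for systems A, B, C (real users)
simulated = [8.04, 7.41, 6.30]  # M(S_i, U*) for the same systems (simulated)

tau, _ = kendalltau(real, simulated)
print(f"Kendall's tau between system orderings: {tau:.2f}")  # 1.00 here
```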
58. Simulation framework
The simulated user consists of natural language understanding (NLU), an interaction model, a preference model, and natural language generation (NLG); it converses with the conversational agent (response generation).
• NLU: translating an agent utterance into a structured format
59. Simulation framework
• Interaction model: determining the next user action based on the understanding of the agent's utterance
60. Simulation framework
• NLG: turning a structured response representation into natural language
61. Modeling simulated users
• Model dialogue as a Markov Decision Process (MDP)
• Every MDP is formally described by a finite state space S, a finite action set A, and transition probabilities P
• Dialogue state s_t is the state of the dialogue manager at dialogue turn t
• Dialogue action a_t represents the user action that is being communicated in turn t
• Transition probabilities: the probability of transitioning from s_t to s_{t+1}
• Markov property (illustrated below): P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)
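A minimal sketch of this property in code; the states and actions are illustrative, not taken from the paper:

```python
# The next dialogue state is sampled from P(s_{t+1} | s_t, a_t) alone,
# ignoring all earlier history.
import random

transitions = {  # (state, action) -> [(next_state, probability), ...]
    ("elicit", "disclose"): [("recommend", 0.8), ("elicit", 0.2)],
    ("recommend", "note"): [("end", 0.6), ("recommend", 0.4)],
}

def next_state(state, action):
    states, probs = zip(*transitions[(state, action)])
    return random.choices(states, weights=probs, k=1)[0]

print(next_state("elicit", "disclose"))
```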
62. Agenda-based simulation [1]
• The action agenda A is a stack-like representation for user actions that is dynamically updated
• The next user action is selected from the top of the agenda
• The agenda is updated based on whether the agent understands the user action (by giving a corresponding response)
• Accomplished goal → pull operation
• Not accomplished → push replacement action(s) (sketched in code below)
Example (user goal: C = [type = film; genre = psychology; name = ["R..", …]], R = [director =; rating =]):
User: Hello, I am looking for a movie to watch.
Bot: Great, let's do this! Start by giving me ONE movie you like and some reasons why.
Agenda: disclose(type=film), disclose(name="R.."), disclose(genre=psy.), navigate(director), navigate(rating), note, complete
User: I like the remains of the day because I like psychological movies.
Bot: Got it. About to jump into lightspeed! I'll have your movies ready for you in a flash!
Agenda: disclose(name="I.."), disclose(genre=psy.), navigate(director), navigate(rating), note, complete
User: I like Requiem for a Dream.
Bot: I'm pretty solid on a bunch of things so far, but not on this request. Can you give a different movie?
Agenda: reveal(name), disclose(name="xx"), disclose(genre=psy.), navigate(director), navigate(rating), note, complete
1. Schatzmann et al. Agenda-based User Simulation for Bootstrapping a POMDP Dialogue System. NAACL 2007.
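To make the pull/push update concrete, here is a minimal sketch of an agenda stack; this is a simplification of the framework above, and the action strings are illustrative:

```python
# Agenda as a stack: pull the top action when the agent understood it,
# otherwise replace it with one or more replacement actions.

class Agenda:
    def __init__(self, actions):
        self.stack = list(actions)  # index 0 = top of the agenda

    def next_action(self):
        return self.stack[0]

    def update(self, goal_accomplished, replacements=()):
        if goal_accomplished:
            self.stack.pop(0)               # pull: the agent understood us
        else:
            self.stack[:1] = replacements   # push replacement action(s)

agenda = Agenda([
    "disclose(name='Requiem for a Dream')",
    "disclose(genre=psychological)",
    "navigate(director)", "navigate(rating)", "note", "complete",
])
# The agent fails to understand the movie name -> push a reveal action
# and a disclose of a different movie in its place.
agenda.update(goal_accomplished=False,
              replacements=["reveal(name)", "disclose(name='...')"])
print(agenda.next_action())  # reveal(name)
```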
63. Interaction model
• The interaction model defines how the agenda should be initialized (A_0) and updated (A_t → A_{t+1})
• Action space [2]:
• Disclose: "I would like to arrange a holiday in Italy"
• Reveal: "Actually, we need to go on the 3rd of May in the evening"
• Inquire: "What other regions in Europe are like that?"
• Navigate: "Which one is the cheapest option?"
• Note: "That hotel could be a possibility"
• Complete: "Thanks for the help, bye"
2. Azzopardi et al. Conceptualizing Agent-human Interactions during the Conversational Search Process. CAIR 2018.
64. Interaction model
• QRFA model [3]: user actions Query and Request; agent actions Feedback and Accept
• CIR6 model: transitions over the six user actions Disclose, Reveal, Inquire, Navigate, Note, and Complete
3. Vakulenko et al. QRFA: A Data-Driven Model of Information Seeking Dialogues. ECIR 2019.
65. Preference model
• The preference model is meant to capture individual differences and personal tastes
• Preferences are represented as a set of attribute-value pairs
• Single Item Preference
• Partially rooted in historical user behavior
• Offers limited consistency
• Personal Knowledge Graph (PKG)
• A PKG has two types of nodes: items and attributes
• Infers the rating for an attribute by considering the ratings of items that have that attribute
66. Personal Knowledge Graphs [4]
A personal knowledge graph (PKG) is a
resource of structured information
about entities that are of personal
interest to the user
Key differences from general KGs:
• Entities of personal interest to the user
• Distinctive shape (“spiderweb” layout)
• Links between a PKG and external
sources are inherent to its nature
4. Balog and Kenter. Personal Knowledge Graphs: A Research Agenda. ICTIR 2019.
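A minimal sketch of the attribute-level inference step described on slide 65; aggregating by a simple average is an assumption here, as the exact aggregation is not shown on the slides:

```python
# Infer a preference for an attribute (e.g., a genre) from the ratings of
# items in the PKG that have that attribute. Ratings and attributes below
# are illustrative.

item_ratings = {"The Remains of the Day": 1, "Requiem for a Dream": 1,
                "Fast & Furious": -1}
item_attributes = {"The Remains of the Day": {"psychological", "drama"},
                   "Requiem for a Dream": {"psychological", "drama"},
                   "Fast & Furious": {"action"}}

def attribute_preference(attribute):
    ratings = [r for item, r in item_ratings.items()
               if attribute in item_attributes[item]]
    return sum(ratings) / len(ratings) if ratings else 0.0

print(attribute_preference("psychological"))  # 1.0 -> the user likes the genre
```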
67. Grounding in actual data
• Historical movie viewing data (ratings)
• User preference profiles are initialized based
on a sample
• The rest of the watch history is used as held-out data
68. Evaluation architecture
• Three existing conversational movie recommenders (A, B, C) are compared using both real and simulated users
• Real users: crowdsourcing
• Simulated users:
• The preference model is initialized by sampling historical preferences of a real user from MovieLens data
• The interaction model is trained based on behaviors of real human users
• Both NLU and NLG use hand-crafted templates (a sketch follows below)
[Architecture: a conversation manager connects the user, simulated or human, with the conversational agent]
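As referenced above, a sketch of what template-based NLG for the simulated user might look like; the templates are illustrative, not the ones used in the paper:

```python
# Map a structured user action plus slot values to a natural language
# utterance via hand-crafted templates.

templates = {
    "disclose": "I like {value} because I like {genre} movies.",
    "reveal":   "The movie I meant is {value}.",
    "navigate": "Who is the director of {value}?",
    "note":     "That one sounds good, I'll check it out.",
    "complete": "Thanks, bye!",
}

def generate(action, **slots):
    return templates[action].format(**slots)

print(generate("disclose", value="Requiem for a Dream",
               genre="psychological"))
```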
69. Characteristics of conversations
• (RQ1) How well do our simulation techniques capture the characteristics of conversations?
• CIR6-PKG tends to have significantly shorter average conversation length, since it terminates the dialog as soon as the user finds a recommendation they like

Method | AvgTurns (A / B / C) | UserActRatio (A / B / C) | DS-KL (A / B / C)
Real users | 9.20 / 14.84 / 20.24 | 0.374 / 0.501 / 0.500 | — / — / —
QRFA-Single | 10.52 / 12.28 / 17.51 | 0.359 / 0.500 / 0.500 | 0.027 / 0.056 / 0.029
CIR6-Single | 9.44 / 12.75 / 15.92 | 0.382 / 0.500 / 0.500 | 0.055 / 0.040 / 0.025
CIR6-PKG | 6.16 / 9.87 / 10.56 | 0.371 / 0.500 / 0.500 | 0.075 / 0.056 / 0.095
70. Performance prediction
• (RQ2) How well does the relative ordering of systems according to some measure correlate when using real vs. simulated users?
• High correlation between automatic and human evaluations

Method | Reward | Success rate
Real users | A (8.88) > B (7.56) > C (6.04) | B (0.864) > A (0.833) > C (0.727)
QRFA-Single | A (8.04) > B (7.41) > C (6.30) | B (0.836) > A (0.774) > C (0.718)
CIR6-Single | A (8.64) > B (8.28) > C (6.01) | B (0.822) > A (0.807) > C (0.712)
CIR6-PKG | A (11.12) > B (10.65) > C (9.31) | A (0.870) > B (0.847) > C (0.784)

Performance of conversational agents using real vs. simulated users, in terms of Reward and Success Rate. We show the relative ordering of agents (A-C), with evaluation scores in parentheses.
71. Realisticity
• (RQ3) Do more sophisticated simulation approaches (i.e., more advanced interaction and preference modeling) lead to more realistic simulation?
• Human raters were asked to guess which of two dialogues was performed by a human
• Trying to "fool" the human evaluator (a "reverse" Turing test)
[Two anonymized dialogues shown side by side: one between the agent and a human user, one between the agent and a simulated user]
72. Realisticity
• (RQ3) Do more sophisticated simulation approaches (i.e., more advanced interaction and preference modeling) lead to more realistic simulation?
• Our interaction model (CIR6) and personal knowledge graphs for preference modeling both bring improvements

Method | A (Win/Loss/Tie) | B (Win/Loss/Tie) | C (Win/Loss/Tie) | All (Win/Loss/Tie)
QRFA-Single | 20/39/16 | 22/33/20 | 19/43/13 | 27% / 51% / 22%
CIR6-Single | 27/30/18 | 23/33/19 | 26/40/9 | 33% / 46% / 21%
CIR6-PKG | 22/39/14 | 27/29/19 | 32/25/18 | 36% / 41% / 23%
73. Further analysis
• We analyze the reasons given when crowd workers chose the real user's dialogue, and classify them as follows
Style
• Realisticity: how realistic or human-sounding a dialog is
• Engagement: involvement of the user in the conversation
• Emotion: expressions of feelings or emotions
Content
• Response: the user does not seem to understand the agent correctly
• Grammar: language usage, including spelling and punctuation
• Length: the length of the reply
74. Summary of contributions
• A general framework for evaluating conversational recommender agents
via simulation
• Interaction and preference models to better control the conversation flow
and to ensure the consistency of responses given by the simulated user
• Experimental comparison of three conversational movie recommender
agents, using both real and simulated users
• Analysis of comments collected from human evaluation, and identification
of areas for future development
76. Summary
• Time to move away from search engines to conversation engines
• Consider what really makes a conversational experience unsuccessful/successful
• Embrace a more holistic view which integrates and supports multiple user goals
• Progress needs to be made on evaluation methodology and resources
78. Questions?
Advertisements:
• ECIR 2022 in Stavanger, Norway (ecir2022.org will go live later this week)
• Simulation for IR Evaluation workshop at SIGIR 2021