Krisztian Balog
University of Stavanger
@krisztianbalog krisztianbalog.com
What Does Conversational Information
Access Exactly Mean and How to Evaluate It?
In this talk
• My perspective on conversational information access
• Point #1: conversational aspect has not received due attention so far
• Point #2: evaluation is a bottleneck of progress
• Recent work on evaluation in the context of two specific conversational
information access scenarios
Part I
What does conversational information access
exactly mean?
Conversational information access = search AND recommendation
Where does conversational information
access fit within conversational AI?
Traditional distinction [1,2]

Task-oriented (goal-driven)
• Aim to assist users to solve a specific task (as efficiently as possible)
• Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain
• Well-defined measure of performance that is explicitly related to task completion

Non-task-oriented (non-goal-driven)
• Aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking human-human interactions
• Developed for unstructured, open domain conversations
• Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner

1. Chen et al. A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explor. Newsl. 19(2), 2017.
2. Serban et al. A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version. Dialogue & Discourse 9(1), 2018.
Contemporary distinction [1,2]

Task-oriented
• Aim to assist users to solve a specific task (as efficiently as possible)
• Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain
• Well-defined measure of performance that is explicitly related to task completion

Interactive QA
• Aim to provide concise, direct answers to user queries
• Dialogues are unstructured, but commonly follow a question-answer pattern; mostly open domain (dictated by the underlying data)
• Evaluated with respect to the correctness of answers (on the turn level)

Social chat
• Aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking human-human interactions
• Developed for unstructured, open domain conversations
• Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner

1. Gao et al. Neural Approaches to Conversational AI. Found. Trends Inf. Retr. 13(2-3), 2019.
2. Deriu et al. Survey on evaluation methods for dialogue systems. Artif. Intell. Rev., 2020.
Where does Conversational Info Access fit?
(The slide repeats the three-way taxonomy above: Task-oriented, Interactive QA, Social chat.)
But wait…
Hasn’t search always been conversational?!
Search as a conversation
User: places to visit in Stavanger
AI: By Stavanger you meant the city in Norway, right?
AI: Would you be interested in other cities in Norway as well?
AI: Do you want to know more about Stavanger?
Main differences
• Degree of personalization and long-term user state
• Support for complex (multi-step) tasks
• Answer generation vs. answer retrieval
• Dialogue setting where a screen or keyboard may not be present
• Mixed initiative
J.S. Culpepper, F. Diaz, M.D. Smucker, editors. Research Frontiers in Information Retrieval: Report from the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018). SIGIR Forum 52(1), 2018.
Do humans really converse according to
clearly separable goals?
Example dialogue
User: I’d like to buy new running shoes. Do you have something similar to my Nike? [task-oriented]
AI: There is a new Pegasus model. Also, you might want to check out Asics’ Gel Nimbus line, which has better ankle support.
User: Why does ankle support matter? [interactive QA]
AI: …
User: I don’t enjoy running that much lately. [social chat]
AI: …
Hybrid systems
• Most commercial systems are hybrid to combine the strengths of task-
specific approaches/architectures
• Hierarchical dialogue manager (a minimal sketch follows this list):
• Broker (top-level) that manages the overall conversation process
• Collection of skills (low-level) that handle different types of conversation segments
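To make the broker/skills hierarchy concrete, here is a minimal sketch; all names and the keyword-based routing are hypothetical stand-ins (production systems use trained intent classifiers):

```python
from typing import Callable, Dict

# Hypothetical skills: each handles one type of conversation segment.
def recommend_skill(utterance: str) -> str:
    return "Here are a few options you might like ..."

def qa_skill(utterance: str) -> str:
    return "Here is a direct answer to your question ..."

def chitchat_skill(utterance: str) -> str:
    return "That sounds interesting, tell me more!"

SKILLS: Dict[str, Callable[[str], str]] = {
    "recommend": recommend_skill,
    "qa": qa_skill,
    "chitchat": chitchat_skill,
}

def broker(utterance: str) -> str:
    """Top-level dialogue manager: route each utterance to a skill.
    Keyword rules stand in for a trained intent classifier."""
    text = utterance.lower()
    if "recommend" in text or "suggest" in text:
        intent = "recommend"
    elif text.endswith("?"):
        intent = "qa"
    else:
        intent = "chitchat"
    return SKILLS[intent](utterance)

print(broker("Can you suggest running shoes similar to my Nike?"))
```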
A disconnect
Siloed view (current practice) vs. holistic view (SWIRL’18 vision):
• Siloed view: separate architectures for different types of systems, with a broker dispatching between task-oriented, interactive QA, and social chat components
• Holistic view: a unified conversational info access architecture that supports multiple user goals
Take-away #1
It’s perhaps time to embrace a more
holistic view of conversational info access.
Progress so far (selection)
• Intent detection
• Asking clarification questions
• Query resolution
• Response retrieval
• …
• Mostly on the component level and not specific to the conversational
paradigm
Example: TREC CAST [1]
• TREC Conversational Assistance track
• Setup: Given a user utterance, retrieve an
answer from a corpus of paragraphs
• Evaluation is performed with respect to
answer relevance
• Main challenge is coreference resolution
1. http://www.treccast.ai/
User: What is throat cancer?
AI: …
User: Is it treatable?
AI: …
User: Tell me about lung cancer.
AI: …
User: What are its symptoms?
Issues
• Evaluation is limited to single turns and does not consider system responses
to previous user utterances
• User utterances are given beforehand and follow a fixed sequence
• Answers are a ranked list
• Coreference resolution is not specific to conversational search; it has been studied in the context of Web/session search
• task == passage retrieval?
A disconnect
• All test collections are abstractions of real tasks, but we need to make
sure that we don’t abstract away the conversational aspects
[Diagram: on the left, isolated single exchanges, each a user utterance followed by system responses (U → S, S, …); on the right, a full conversation in which user and system turns alternate and each user utterance depends on the preceding system responses (U → S → U → S → U).]
Re: SWIRL 2012 [1]
1. J. Allan, B. Croft, A. Moffat, and M. Sanderson, editors. Frontiers, Challenges, and Opportunities for Information Retrieval: Report from SWIRL 2012 the Second Strategic Workshop on Information Retrieval in Lorne. SIGIR Forum 46(1), 2012.
Take-away #2
We need to think (more) about what really makes a
conversational experience good/bad and develop
appropriate evaluation resources.
Part II
From Answer Retrieval to Answer Generation
based on:
S. Zhang, Z. Dai, K. Balog, and J. Callan. Summarizing and Exploring Tabular Data in Conversational Search. SIGIR ’20.
Conversational AI
(Recap of the Task-oriented / Interactive QA / Social chat taxonomy from Part I; this part concerns Interactive QA.)
Motivation
What is the largest field sport stadium in the world?
Option 1: The AT&T Stadium in Arlington, Texas, USA.
Option 2: The AT&T Stadium in Arlington, Texas, USA, which can seat over 80,000 people.
Option 3: The AT&T Stadium in Arlington, Texas, USA, home of the Dallas Cowboys, which can seat over 80,000 people.
What makes it a good response in a
conversational setting?
Motivation
What will the weather be like next week?
Option 1 (AI): Today, it’ll be sunny, with temperatures between -6 and -2 °C. Tomorrow, it’ll be overcast, with temperatures between -1 and 2 °C. On Sunday, it’ll be overcast, with temperatures between 1 and 2 °C. On Monday, it’ll be rainy, with temperatures between 2 and 4 °C. On Tuesday, it’ll be rainy, with temperatures between 3 and 4 °C. …
Option 2 (AI): Today, it’ll be sunny. Over the weekend, it’ll be overcast with max temperatures around 2 °C. From Monday onwards, it’ll be rainy with temperatures between 2 and 6 °C. And, if you think that’s depressing, you should look at the wind forecast.
Motivation
• Approximately 10% of QA queries are answered by tabular data
• How to present tabular search results effectively in a conversational
setting?
• Summary is not static, but is conditioned on conversation context
• Summary should help drive the conversation (inviting further exploration)
Motivation
What is the largest field sport stadium in the world?
[Result table: https://en.wikipedia.org/wiki/List_of_covered_stadiums_by_capacity]
The anatomy of an answer
What is the largest field sport stadium in the world?
AI: I found a table listing the largest field stadiums in the world. [leading sentence describing the table] The largest one is the AT&T Stadium, which holds 80,000 people. [answer to the question] All the listed stadiums have a capacity of over 5,000 people, and 30 of 67 are in the US. [further information to help explore further]
Would people want to be engaged in a conversation?
• Experiment: augmenting answers to
NBA-related questions [1]
• User engagement increased overall
• Explicit negative feedback also
increased
• Users are less inclined to engage in a
conversation when their team lost
1. Szpektor et al. Dynamic Composition for Conversational Domain Exploration. The Web Conf., 2020.
Figure taken from Szpektor et al. [1]
Objective
• Create a test collection for table
summarization in a conversational
setting
• Input: query q and result table T
• Output: summary S
Illustration of the conversational table summarization task:
q: What is the largest field sport stadium in the world?
T: List of covered stadiums by capacity
S: I found a table listing the largest field stadiums in the world. The largest one is the AT&T Stadium, which holds 80,000 people. All the listed stadiums have a capacity of over 5,000 people, and 30 of 67 are in the US.
Queries and tables
• 200 tables sampled randomly from
the WikiTableQuestions dataset [1]
• Tables with at least 6 rows and 4
columns
• Corresponding queries either
sampled (45) or written by two of the
authors (155)
1. Pasupat and Liang. Compositional Semantic Parsing on Semi-Structured Tables. ACL-IJCNLP 2015.
q: tell me about the movies he acted
T: Zhao_Dan#Selected_filmography

q: how many times was ed sheeran listed as the performer
T: List of number-one singles and albums in Sweden 2014
Creating candidate summaries
• Crowdsourcing task, mimicking a
conversation with a friend
• Summary in 30-50 words
(150-250 characters)
• ~15-30 seconds when spoken aloud
• 5 summaries per query
Instructions given to crowd workers:
“Imagine talking to a friend on the phone. Your friend asks you a question, and you find the following table on the Web. Remember that your friend cannot see the table. Your goal is to help your friend capture the essential information in the table related to the question. Your summary should be short but comprehensive. Try to describe several rows or columns that you find interesting.”
Assessments

Summary-level
• Assessors (7) are presented with the question
and all candidate summaries (5)
• They are asked to select the best summary, by
considering
• language quality
• relevance
• ability to drive a conversation
• Summary with most votes is selected as the
ground truth
• In 77% of the cases at least 3 assessors agreed on
which summary was best
• Top-voted summaries tend to be longer and use a
richer vocabulary (more unique words)
Sentence-level
• Summaries are split into sentences (avg. 3.5)
• Assessors (3) are presented with a sentence,
together with previous and next sentences as
context, and are asked to judge its relevance
• They are situated in a scenario where they
need to write a short summary (to a friend)
• Highly relevant: describes the table’s topic or
provides a fact regarding the query
• Relevant: is not directly relevant to the question
but helps to explain the contents of the table
• Non-relevant: not about the table or unrelated
to the query
Experiments
• Goals: establish a baseline, identify challenging aspects of the task
• Setup: 5-fold cross-validation; all summaries with >0 votes used for
training
• Methods: pre-trained neural language generation models (CopyNet,
GPT-2, T5)
• Tables are represented as a text sequence
• A "key:value" format is used to preserve structure. Additionally, page title, table
caption and total number of rows are also added with special keys.
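As an illustration, a minimal sketch of such a linearization (the specific key names and the example rows are assumptions; the paper's exact serialization may differ):

```python
def linearize_table(page_title, caption, header, rows):
    """Flatten a table into a "key:value" text sequence, prepending
    page title, caption, and row count with special keys."""
    parts = [
        f"page_title:{page_title}",
        f"table_caption:{caption}",
        f"total_rows:{len(rows)}",
    ]
    for row in rows:
        parts.extend(f"{key}:{value}" for key, value in zip(header, row))
    return " ".join(parts)

print(linearize_table(
    page_title="List of covered stadiums by capacity",
    caption="Largest field stadiums",
    header=["Stadium", "Capacity", "Country"],
    rows=[["AT&T Stadium", "80,000", "USA"],
          ["Example Stadium", "75,000", "N/A"]],  # illustrative row
))
```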
Results
Method ROUGE-L ROUGE-1 ROUGE-2 BLEU
CopyNet 0.030 0.041 0.012 0.80
GPT-2 0.200 0.272 0.073 5.35
T5 0.276 0.362 0.143 10.43
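For reference, overlap metrics like these can be computed with standard packages; a sketch assuming the rouge-score and nltk packages (not necessarily the evaluation scripts used in the paper):

```python
# pip install rouge-score nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "I found a table of the top 10 albums of 2012."
generated = "I saw a table showing around 10 albums of 2012."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(reference, generated)  # score(target, prediction)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})

# BLEU over tokenized text; smoothing avoids zero scores on short texts.
bleu = sentence_bleu([reference.split()], generated.split(),
                     smoothing_function=SmoothingFunction().method1)
print(round(bleu * 100, 2))  # often reported on a 0-100 scale
```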
Analysis
• Machine-generated summaries are
fluent
• Typical mistakes
• Wrong quantity (84% of summaries)
• “Purdue won 4 consecutive in the 1960s,
…” => it won only 3
• Wrong reference (38% of summaries)
• E.g., elevation vs. prominence when asking
for highest mountain
Example of ground truth vs. auto-generated summaries:
q: What album had the most sales in 2012 in Finland?
T: List of number-one albums of 2012 (Finland)
Manual: I saw a table showing sales figures of around 10 albums of 2012. Of these “vain elamaa” by various artists ranked highest regarding sales. It is performed by various artists. The second rank is held by “koodi” by robin.
T5: I found a table of the top 10 albums of 2012. The most sold album was “vain elämä” by various artists. It sold 164,119 copies. The next highest album was “koodi” by robin.
Summary
• Response generation is an open research challenge in Interactive QA
• Needs to be designed specifically for the conversational paradigm and go beyond
mere passage retrieval
• Needs appropriate evaluation methodology and resources
• Test collection for summarizing and exploring tabular data in conversational
search
• Includes both summary-level and sentence-level human assessments
• Available at https://github.com/iai-group/sigir2020-tablesum
Part III
Towards Automated End-to-end System
Evaluation
based on:
S. Zhang and K. Balog. Evaluating Conversational Recommender Systems via User Simulation. KDD ’20.
Conversational AI
(Recap of the Task-oriented / Interactive QA / Social chat taxonomy from Part I; this part concerns task-oriented systems, specifically conversational recommendation.)
Motivation
• Evaluating conversational systems with real users is time-consuming and
expensive
• Can we simulate users (for the purpose of evaluation)?
• The specific problem is conversational item recommendation
• Help users find an item they like from a large collection of items
• Elicit preferences on specific items or categories of items
Why simulation?
• Test-collection based ("offline") evaluation
• Possible to create a reusable test collection for a specific subtask
• Limited to a single turn, does not measure overall user satisfaction
• Human evaluation
• Possible to annotate entire conversations
• Expensive, time-consuming, does not scale
Objectives
• Develop a user simulator that
• produces responses that a real user would give in a certain dialog situation
• enables automatic assessment of conversational agents
• makes no assumptions about the inner workings of conversational agents
• is data-driven (requires only a small annotated dialogue corpus)
Problem statement
• For a given system S and user population U, the goal of user simulation U*
is to predict the performance of S when used by U, denoted as M(S,U)
• For two systems S1 and S2, U* should be such that if M(S1, U) < M(S2, U), then M(S1, U*) < M(S2, U*) (this ordering-preservation property is checked in the sketch below)
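This criterion can be checked directly once both sets of measurements are available; a minimal sketch, using the Reward scores reported later in this part as illustration:

```python
from itertools import combinations

def preserves_ordering(m_real, m_sim):
    """True iff for every system pair, M(S, U*) orders the systems
    the same way as M(S, U)."""
    for s1, s2 in combinations(m_real, 2):
        if (m_real[s1] - m_real[s2]) * (m_sim[s1] - m_sim[s2]) < 0:
            return False
    return True

# Reward scores for agents A, B, C (real users vs. QRFA-Single),
# taken from the results table later in this part.
real = {"A": 8.88, "B": 7.56, "C": 6.04}
simulated = {"A": 8.04, "B": 7.41, "C": 6.30}
print(preserves_ordering(real, simulated))  # True: A > B > C in both
```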
Simulation framework
The simulated user talks to the conversational agent (its response generation module) through three components:
• Natural language understanding (NLU): translating an agent utterance into a structured format
• Interaction model (backed by a preference model): determining the next user action based on the understanding of the agent’s utterance
• Natural language generation (NLG): turning a structured response representation into natural language
Modeling simulated users
• Model dialogue as a Markov Decision Process (MDP)
• Every MDP is formally described by a finite state space S, a finite action set A, and transition probabilities P
• Dialogue state s_t is the state of the dialogue manager at dialogue turn t
• Dialogue action a_t represents the user action that is being communicated in turn t
• Transition probabilities give the probability of transitioning from s_t to s_{t+1}
• Markov property: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t), i.e., the next state depends only on the current state and action (a toy rollout is sketched below)
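A toy rollout under this model (the states, actions, and transition probabilities below are invented for illustration, not taken from the paper):

```python
import random

# Toy transition probabilities P(s' | s, a); not from the paper.
TRANSITIONS = {
    ("start", "disclose"): {"elicit": 0.8, "recommend": 0.2},
    ("elicit", "reveal"): {"recommend": 1.0},
    ("recommend", "note"): {"end": 0.6, "recommend": 0.4},
}

# A fixed toy policy mapping the current state to a user action.
POLICY = {"start": "disclose", "elicit": "reveal", "recommend": "note"}

def rollout(state="start", max_turns=10):
    """Sample a dialogue trajectory. By the Markov property, the next
    state depends only on the current state and action."""
    trajectory = [state]
    for _ in range(max_turns):
        if state == "end":
            break
        dist = TRANSITIONS[(state, POLICY[state])]
        state = random.choices(list(dist), weights=list(dist.values()))[0]
        trajectory.append(state)
    return trajectory

print(rollout())  # e.g., ['start', 'elicit', 'recommend', 'end']
```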
Agenda-based simulation [1]
• The action agenda A is a stack-like representation for user actions that is dynamically updated
• The next user action is selected from the top of the agenda
• The agenda is updated based on whether the agent understands the user action (by giving a corresponding response)
• Accomplished goal → pull operation
• Not accomplished → push replacement action(s)

Example (movie recommendation; user goal C = [type = film; genre = psychology; name = [“R..”, …]], requests R = [director; rating]):
User: Hello, I am looking for a movie to watch.
Bot: Great, let’s do this! Start by giving me ONE movie you like and some reasons why.
User: I like The Remains of the Day because I like psychological movies.
Bot: Got it. About to jump into lightspeed! I’ll have your movies ready for you in a flash!
User: I like Requiem for a Dream.
Bot: I’m pretty solid on a bunch of things so far, but not on this request. Can you give a different movie?

The agenda initially holds (top to bottom): disclose(type=film), disclose(name=“R..”), disclose(genre=psy.), navigate(director), navigate(rating), note, complete. Understood actions are pulled from the top after each agent response; when the agent fails to act on the disclosed movie (a reveal(name) request), the simulated user pushes a replacement action, disclose(name=“xx”), onto the agenda. (A sketch of this update rule follows the reference below.)
1. Schatzmann et al. Agenda-based User Simulation for Bootstrapping a POMDP Dialogue System. NAACL 2007.
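A minimal sketch of the pull/push update rule from the example above (the action strings follow the slide; the data structure and function are assumptions):

```python
# Agenda after the user has disclosed the film type (top of stack first).
agenda = [
    "disclose(name='Requiem for a Dream')",
    "disclose(genre=psychological)",
    "navigate(director)",
    "navigate(rating)",
    "note",
    "complete",
]

def update_agenda(agenda, agent_understood, replacement=None):
    """Agenda-based update: pull the top action if the agent gave a
    corresponding response; otherwise replace the failed action on top
    (equivalent to pulling it and pushing a replacement)."""
    if agent_understood:
        agenda.pop(0)  # accomplished goal -> pull operation
    elif replacement is not None:
        agenda[0] = replacement  # push a replacement action
    return agenda

# The agent failed on the movie name and asked for a different movie:
update_agenda(agenda, agent_understood=False,
              replacement="disclose(name='xx')")
print(agenda[0])  # the next user action is taken from the top
```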
Interaction model
• The interaction model defines how the agenda should be initialized (A_0) and updated (A_t => A_{t+1})
• Action space [2]:
  • Disclose: “I would like to arrange a holiday in Italy”
  • Reveal: “Actually, we need to go on the 3rd of May in the evening”
  • Inquire: “What other regions in Europe are like that?”
  • Navigate: “Which one is the cheapest option?”
  • Note: “That hotel could be a possibility”
  • Complete: “Thanks for the help, bye”
2. Azzopardi et al. Conceptualizing Agent-human Interactions during the Conversational Search Process. CAIR 2018.
Interaction model (cont’d)
• QRFA model [3]: two user action types (Query, Request) and two agent action types (Feedback, Accept)
• CIR6 model: six user action types (Disclose, Reveal, Inquire, Navigate, Note, Complete) with transitions between them
3. Vakulenko et al. QRFA: A Data-Driven Model of Information Seeking Dialogues. ECIR 2019.
Preference model
• The preference model is meant to capture individual differences and personal tastes
• Preferences are represented as a set of attribute-value pairs
• Single Item Preference
  • Partially rooted in historical user behavior
  • Offers limited consistency
• Personal Knowledge Graph (PKG)
  • A PKG has two types of nodes: items and attributes
  • Infers the rating for an attribute by considering the ratings of items that have that attribute (sketched below, after the PKG definition)
Personal Knowledge Graphs [4]
A personal knowledge graph (PKG) is a
resource of structured information
about entities that are of personal
interest to the user
Key differences from general KGs:
• Entities of personal interest to the user
• Distinctive shape (“spiderweb” layout)
• Links between a PKG and external
sources are inherent to its nature
4. Balog and Kenter. Personal Knowledge Graphs: A Research Agenda. ICTIR 2019.
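A minimal sketch of the PKG-based attribute rating inference described above (the mean aggregation and the toy ratings are assumptions, not the paper's exact formulation):

```python
from statistics import mean

# Toy item ratings held by the simulated user (e.g., sampled history).
item_ratings = {
    "The Remains of the Day": 5,
    "Requiem for a Dream": 4,
    "Fight Club": 2,
}

# PKG edges from items to their attributes (two node types).
item_attributes = {
    "The Remains of the Day": {"genre:drama", "genre:psychological"},
    "Requiem for a Dream": {"genre:drama", "genre:psychological"},
    "Fight Club": {"genre:thriller"},
}

def attribute_rating(attribute):
    """Infer a rating for an attribute from the ratings of items
    that have that attribute (mean aggregation assumed)."""
    ratings = [r for item, r in item_ratings.items()
               if attribute in item_attributes[item]]
    return mean(ratings) if ratings else None

print(attribute_rating("genre:psychological"))  # 4.5
```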
Grounding in actual data
• Historical movie viewing data (ratings)
• User preference profiles are initialized based
on a sample
• The rest of the watch history is used as held-
out data
Evaluation architecture
• Three existing conversational movie recommenders (A, B, C) are compared using both real and simulated users
• Real users: crowdsourcing
• Simulated users:
  • Preference model is initialized by sampling historical preferences of a real user from MovieLens data
  • Interaction model is trained based on behaviors of real human users
  • Both NLU and NLG use hand-crafted templates
[Diagram: a conversation manager mediates between the conversational agent and either a human or a simulated user.]
Characteristics of conversations
• (RQ1) How well do our simulation techniques capture the characteristics of conversations?
• CIR6-PKG tends to have significantly shorter average conversation length, since it terminates the dialog as soon as the user finds a recommendation they like

Method      | AvgTurns (A / B / C)  | UserActRatio (A / B / C) | DS-KL (A / B / C)
Real users  | 9.20 / 14.84 / 20.24  | 0.374 / 0.501 / 0.500    | — / — / —
QRFA-Single | 10.52 / 12.28 / 17.51 | 0.359 / 0.500 / 0.500    | 0.027 / 0.056 / 0.029
CIR6-Single | 9.44 / 12.75 / 15.92  | 0.382 / 0.500 / 0.500    | 0.055 / 0.040 / 0.025
CIR6-PKG    | 6.16 / 9.87 / 10.56   | 0.371 / 0.500 / 0.500    | 0.075 / 0.056 / 0.095
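DS-KL above compares the dialogue-act distributions of simulated and real conversations via KL divergence (our reading of the metric; the paper's exact estimator may differ). A minimal sketch with additive smoothing:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over a shared set of dialogue acts; additive
    smoothing keeps unseen acts from yielding infinite divergence."""
    acts = set(p) | set(q)
    return sum(
        (p.get(a, 0.0) + eps)
        * math.log((p.get(a, 0.0) + eps) / (q.get(a, 0.0) + eps))
        for a in acts
    )

# Illustrative dialogue-act distributions (fractions of user turns).
real = {"disclose": 0.4, "navigate": 0.3, "note": 0.2, "complete": 0.1}
simulated = {"disclose": 0.45, "navigate": 0.25, "note": 0.2, "complete": 0.1}
print(round(kl_divergence(real, simulated), 4))
```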
Performance prediction
• (RQ2) How well does the relative ordering of systems according to some measure correlate when using real vs. simulated users?
• High correlation between automatic and human evaluations
Method Reward Success rate
Real users A (8.88) > B (7.56) > C (6.04) B (0.864) > A (0.833) > C (0.727)
QRFA-Single A (8.04) > B (7.41) > C (6.30) B (0.836) > A (0.774) > C (0.718)
CIR6-Single A (8.64) > B (8.28) > C (6.01) B (0.822) > A (0.807) > C (0.712)
CIR6-PKG A (11.12) > B (10.65) > C (9.31) A (0.870) > B (0.847) > C (0.784)
Performance of conversational agents using real vs. simulated users, in terms of Reward and
Success Rate. We show the relative ordering of agents (A–C), with evaluation scores in parentheses.
Realisticity
• (RQ3) Do more sophisticated simulation approaches (i.e., more advanced interaction and preference modeling) lead to more realistic simulation?
• Human raters were asked to guess which of two presented dialogues was performed by a human
• The simulator thus tries to "fool" the human evaluator (a "reverse" Turing test)
[Illustration: pairs of dialogues, one with a human user and one with a simulated user, shown side by side.]
Realisticity (cont’d)
• Our interaction model (CIR6) and personal knowledge graphs for preference modeling both bring improvements

Win/Loss/Tie counts of simulated vs. real dialogues, per agent (A, B, C) and overall:

Method      | A (Win/Loss/Tie) | B (Win/Loss/Tie) | C (Win/Loss/Tie) | All (Win/Loss/Tie)
QRFA-Single | 20 / 39 / 16     | 22 / 33 / 20     | 19 / 43 / 13     | 27% / 51% / 22%
CIR6-Single | 27 / 30 / 18     | 23 / 33 / 19     | 26 / 40 / 9      | 33% / 46% / 21%
CIR6-PKG    | 22 / 39 / 14     | 27 / 29 / 19     | 32 / 25 / 18     | 36% / 41% / 23%
Further analysis
• We analyzed the reasons given when crowd workers chose the real user’s dialogue, and classified them as follows:

Style
• Realisticity: how realistic or human-sounding a dialog is
• Engagement: involvement of the user in the conversation
• Emotion: expressions of feelings or emotions

Content
• Response: the user does not seem to understand the agent correctly
• Grammar: language usage, including spelling and punctuation
• Length: the length of the reply
Summary of contributions
• A general framework for evaluating conversational recommender agents
via simulation
• Interaction and preference models to better control the conversation flow
and to ensure the consistency of responses given by the simulated user
• Experimental comparison of three conversational movie recommender
agents, using both real and simulated users
• Analysis of comments collected from human evaluation, and identification
of areas for future development
It’s time to wrap up
Summary
• Time to move away from search engines to conversation engines
• Consider what really makes a conversational experience unsuccessful/successful
• Embrace a more holistic view which integrates and supports multiple user goals
• Progress needs to be made on evaluation methodology and resources
Acknowledgments
• Shuo Zhang (Bloomberg), Zhuyun Dai (CMU), Jamie Callan (CMU)
Questions?

Advertisements:
• ECIR 2022 in Stavanger, Norway (ecir2022.org will go live later this week)
• Simulation for IR Evaluation workshop at SIGIR 2021
More Related Content

What's hot

What's hot (20)

Scaling MQTT With Apache Kafka
Scaling MQTT With Apache KafkaScaling MQTT With Apache Kafka
Scaling MQTT With Apache Kafka
 
Future of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldFuture of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native world
 
Advanced MariaDB features that developers love.pdf
Advanced MariaDB features that developers love.pdfAdvanced MariaDB features that developers love.pdf
Advanced MariaDB features that developers love.pdf
 
Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
 
Heuristic evaluation
Heuristic evaluationHeuristic evaluation
Heuristic evaluation
 
Troubleshooting redis
Troubleshooting redisTroubleshooting redis
Troubleshooting redis
 
Alfresco Bulk Import toolのご紹介
Alfresco Bulk Import toolのご紹介Alfresco Bulk Import toolのご紹介
Alfresco Bulk Import toolのご紹介
 
ロードバランサのリソース問題を解決する ~NetScaler Clustering~
ロードバランサのリソース問題を解決する ~NetScaler Clustering~ ロードバランサのリソース問題を解決する ~NetScaler Clustering~
ロードバランサのリソース問題を解決する ~NetScaler Clustering~
 
What every data programmer needs to know about disks
What every data programmer needs to know about disksWhat every data programmer needs to know about disks
What every data programmer needs to know about disks
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
 
Taxonomy 101: Presented at Taxonomy Boot Camp 2019
Taxonomy 101: Presented at Taxonomy Boot Camp 2019Taxonomy 101: Presented at Taxonomy Boot Camp 2019
Taxonomy 101: Presented at Taxonomy Boot Camp 2019
 
MySQL Sharding: Tools and Best Practices for Horizontal Scaling
MySQL Sharding: Tools and Best Practices for Horizontal ScalingMySQL Sharding: Tools and Best Practices for Horizontal Scaling
MySQL Sharding: Tools and Best Practices for Horizontal Scaling
 
Jvm tuning for low latency application & Cassandra
Jvm tuning for low latency application & CassandraJvm tuning for low latency application & Cassandra
Jvm tuning for low latency application & Cassandra
 
PostgreSQL and Benchmarks
PostgreSQL and BenchmarksPostgreSQL and Benchmarks
PostgreSQL and Benchmarks
 
Resiliency vs High Availability vs Fault Tolerance vs Reliability
Resiliency vs High Availability vs Fault Tolerance vs  ReliabilityResiliency vs High Availability vs Fault Tolerance vs  Reliability
Resiliency vs High Availability vs Fault Tolerance vs Reliability
 
Vertica 7.0 Architecture Overview
Vertica 7.0 Architecture OverviewVertica 7.0 Architecture Overview
Vertica 7.0 Architecture Overview
 
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 

Similar to What Does Conversational Information Access Exactly Mean and How to Evaluate It?

Demo day presentation
Demo day presentationDemo day presentation
Demo day presentation
Billy Kennedy
 
DBR (Design-Based Research) in mobile learning-Mlearn2013 Doha A_Palalas C_G...
DBR (Design-Based Research) in mobile learning-Mlearn2013 Doha  A_Palalas C_G...DBR (Design-Based Research) in mobile learning-Mlearn2013 Doha  A_Palalas C_G...
DBR (Design-Based Research) in mobile learning-Mlearn2013 Doha A_Palalas C_G...
Agnieszka (Aga) Palalas, Ed.D.
 
Coach as Facilitator Please respond to the following discussion.docx
Coach as Facilitator  Please respond to the following discussion.docxCoach as Facilitator  Please respond to the following discussion.docx
Coach as Facilitator Please respond to the following discussion.docx
clarebernice
 
doore dissertation grad expo 42716 white finalb
doore dissertation grad expo 42716 white finalbdoore dissertation grad expo 42716 white finalb
doore dissertation grad expo 42716 white finalb
Stacy Doore
 

Similar to What Does Conversational Information Access Exactly Mean and How to Evaluate It? (20)

Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...
 
God Mode for designing scenario-driven skills for DeepPavlov Dream
God Mode for designing scenario-driven skills for DeepPavlov DreamGod Mode for designing scenario-driven skills for DeepPavlov Dream
God Mode for designing scenario-driven skills for DeepPavlov Dream
 
Can We Do Agile? Barriers to Agile Adoption
Can We Do Agile? Barriers to Agile AdoptionCan We Do Agile? Barriers to Agile Adoption
Can We Do Agile? Barriers to Agile Adoption
 
Midwest km pugh conversational ai and ai for conversation 190809
Midwest km pugh conversational ai and ai for conversation 190809Midwest km pugh conversational ai and ai for conversation 190809
Midwest km pugh conversational ai and ai for conversation 190809
 
Demo day presentation
Demo day presentationDemo day presentation
Demo day presentation
 
Carolyn Rosé - WESST - From Data to Design of Dynamic Support for Collaborati...
Carolyn Rosé - WESST - From Data to Design of Dynamic Support for Collaborati...Carolyn Rosé - WESST - From Data to Design of Dynamic Support for Collaborati...
Carolyn Rosé - WESST - From Data to Design of Dynamic Support for Collaborati...
 
DBR (Design-Based Research) in mobile learning-Mlearn2013 Doha A_Palalas C_G...
DBR (Design-Based Research) in mobile learning-Mlearn2013 Doha  A_Palalas C_G...DBR (Design-Based Research) in mobile learning-Mlearn2013 Doha  A_Palalas C_G...
DBR (Design-Based Research) in mobile learning-Mlearn2013 Doha A_Palalas C_G...
 
시나리오 베이스 디자인 방법론 (Scenario Based Design)
시나리오 베이스 디자인 방법론 (Scenario Based Design)시나리오 베이스 디자인 방법론 (Scenario Based Design)
시나리오 베이스 디자인 방법론 (Scenario Based Design)
 
Enriching UX Research: Tools and Processes for User Research
Enriching UX Research: Tools and Processes for User ResearchEnriching UX Research: Tools and Processes for User Research
Enriching UX Research: Tools and Processes for User Research
 
EdMedia 2017 Outstanding Paper Award
EdMedia 2017 Outstanding Paper AwardEdMedia 2017 Outstanding Paper Award
EdMedia 2017 Outstanding Paper Award
 
Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory
 
Coach as Facilitator Please respond to the following discussion.docx
Coach as Facilitator  Please respond to the following discussion.docxCoach as Facilitator  Please respond to the following discussion.docx
Coach as Facilitator Please respond to the following discussion.docx
 
Bridging the missing middle for al_tversionfinal_14_08_2014
Bridging the missing middle for al_tversionfinal_14_08_2014Bridging the missing middle for al_tversionfinal_14_08_2014
Bridging the missing middle for al_tversionfinal_14_08_2014
 
Conversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems DesignConversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems Design
 
doore dissertation grad expo 42716 white finalb
doore dissertation grad expo 42716 white finalbdoore dissertation grad expo 42716 white finalb
doore dissertation grad expo 42716 white finalb
 
Universal design for learners
Universal design for learnersUniversal design for learners
Universal design for learners
 
Interaction design: desiging user interfaces for digital products
Interaction design: desiging user interfaces for digital productsInteraction design: desiging user interfaces for digital products
Interaction design: desiging user interfaces for digital products
 
Intro to Agile and Lean UX
Intro to Agile and Lean UXIntro to Agile and Lean UX
Intro to Agile and Lean UX
 
Customer Centric: Product/Service Design for Business
Customer Centric: Product/Service Design for BusinessCustomer Centric: Product/Service Design for Business
Customer Centric: Product/Service Design for Business
 
The Innovation Engine for Team Building – The EU Aristotele Approach From Ope...
The Innovation Engine for Team Building – The EU Aristotele Approach From Ope...The Innovation Engine for Team Building – The EU Aristotele Approach From Ope...
The Innovation Engine for Team Building – The EU Aristotele Approach From Ope...
 

More from krisztianbalog

More from krisztianbalog (18)

Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
 
Personal Knowledge Graphs
Personal Knowledge GraphsPersonal Knowledge Graphs
Personal Knowledge Graphs
 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligence
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluation
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
 
Entity Search: The Last Decade and the Next
Entity Search: The Last Decade and the NextEntity Search: The Last Decade and the Next
Entity Search: The Last Decade and the Next
 
Overview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search EditionOverview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search Edition
 
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF LabOverview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
 
Entity Linking
Entity LinkingEntity Linking
Entity Linking
 
Evaluation Initiatives for Entity-oriented Search
Evaluation Initiatives for Entity-oriented SearchEvaluation Initiatives for Entity-oriented Search
Evaluation Initiatives for Entity-oriented Search
 
Entity Retrieval (tutorial organized by Radialpoint in Montreal)
Entity Retrieval (tutorial organized by Radialpoint in Montreal)Entity Retrieval (tutorial organized by Radialpoint in Montreal)
Entity Retrieval (tutorial organized by Radialpoint in Montreal)
 
Entity Retrieval (WSDM 2014 tutorial)
Entity Retrieval (WSDM 2014 tutorial)Entity Retrieval (WSDM 2014 tutorial)
Entity Retrieval (WSDM 2014 tutorial)
 
Time-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation SystemsTime-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation Systems
 
Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)
 
Multi-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation RecommendationMulti-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation Recommendation
 
Entity Retrieval (WWW 2013 tutorial)
Entity Retrieval (WWW 2013 tutorial)Entity Retrieval (WWW 2013 tutorial)
Entity Retrieval (WWW 2013 tutorial)
 
Semistructured Data Seach
Semistructured Data SeachSemistructured Data Seach
Semistructured Data Seach
 
Collection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity SearchCollection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity Search
 

Recently uploaded

DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
LeenakshiTyagi
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
RohitNehra6
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
University of Hertfordshire
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Sérgio Sacani
 

Recently uploaded (20)

Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 

What Does Conversational Information Access Exactly Mean and How to Evaluate It?

  • 1. Krisztian Balog University of Stavanger @krisztianbalog krisztianbalog.com What Does Conversational Information Access Exactly Mean and How to Evaluate It?
  • 2. In this talk • My perspective on conversational information access • Point #1: conversational aspect has not received due attention so far • Point #2: evaluation is a bottleneck of progress • Recent work on evaluation in the context of two specific conversational information access scenarios
  • 3. Part I What does conversational information access exactly mean?
  • 4. Part I What does conversational information access exactly mean? search AND recommendation =
  • 5. Where does conversational information access fit within conversational AI?
  • 6. Traditional distinction [1,2] • Aim to assist users to solve a specific task (as efficiently as possible) • Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain • Well-defined measure of performance that is explicitly related to task completion Task-oriented (goal-driven) 1. Chen et al. A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explor. Newsl. 19(2), 2017. 2. Serbian et al. A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version. Dialogue & Discourse 9(1), 2018. Non-task-oriented (non-goal-driven) • Aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking human-human interactions • Developed for unstructured, open domain conversations • Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner
  • 7. Contemporary distinction [1,2] 1. Gao et al. Neural Approaches to Conversational AI. Found. Trends Inf. Retr. 13(2-3), 2019. 2. Deriu et al. Survey on evaluation methods for dialogue systems. Artif. Intell. Rev., 2020. Task-oriented Social chat • Aim to provide concise, direct answers to user queries • Dialogues are unstructured, but commonly follow a question-answer pattern; mostly open domain (dictated by the underlying data) • Evaluated with respect to the correctness of answers (on the turn level) Interactive QA • Aim to assist users to solve a specific task (as efficiently as possible) • Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain • Well-defined measure of performance that is explicitly related to task completion • Aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking human- human interactions • Developed for unstructured, open domain conversations • Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner
  • 8. Where does Conversational Info Access fit? Task-oriented Social chat • Aim to provide concise, direct answers to user queries • Dialogues are unstructured, but commonly follow a question-answer pattern; mostly open domain (dictated by the underlying data) • Evaluated with respect to the correctness of answers (on the turn level) Interactive QA • Aim to assist users to solve a specific task (as efficiently as possible) • Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain • Well-defined measure of performance that is explicitly related to task completion • Aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking human- human interactions • Developed for unstructured, open domain conversations • Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner
  • 9. But wait… Hasn’t search always been conversational?!
  • 10. Search as a conversation places to visit in Stavanger AI
  • 11. Search as a conversation places to visit in Stavanger AI By Stavanger you meant the city in Norway, right? Would you be interested in other cities in Norway as well? Do you want to know more about Stavanger?
  • 12. Main differences • Degree of personalization and long-term user state • Support for complex (multi-step) tasks • Answer generation vs. answer retrieval • Dialogue setting where a screen or keyboard may not be present • Mixed initiative
  • 13. J.S. Culpepper, F. Diaz, M.D. Smucker editors. Research Frontiers in Information Retrieval: Report from the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018). SIGIR Forum 52(1), 2018.
  • 14. J.S. Culpepper, F. Diaz, M.D. Smucker editors. Research Frontiers in Information Retrieval: Report from the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018). SIGIR Forum 52(1), 2018.
  • 15. Do humans really converse according to clearly separable goals?
  • 16. Example dialogue I’d like to buy new running shoes. Do you have something similar to my Nike? There is a new Pegasus model. Also, you might want to check out Asics’ Gel Nimbus line, which has better ankle support. AI Why does ankle support matter? I don’t enjoy running that much lately. … AI … AI Task-oriented Interactive QA Social chat
  • 17. Hybrid systems • Most commercial systems are hybrid to combine the strengths of task- specific approaches/architectures • Hierarchical dialogue manager: • Broker (top-level) that manages the overall conversation process • Collection of skills (low-level) that handle different types of conversation segments
  • 18. A disconnect vs. Siloed view (Current practice) Holistic view (SWIRL’18 vision) Conversational info access Task- oriented Social chat Interactive QA
  • 19. A disconnect vs. Siloed view (Current practice) Holistic view (SWIRL’18 vision) • Separate architectures for different types of systems • Unified architecture that supports multiple user goals Conversational info access Task- oriented Social chat Interactive QA broker
  • 20. Take-away #1 It’s perhaps time to embrace a more holistic view of conversational info access.
  • 21. Progress so far (selection) • Intent detection • Asking clarification questions • Query resolution • Response retrieval • … • Mostly on the component level and not specific to the conversational paradigm
  • 22. Example: TREC CAST [1] • TREC Conversational Assistance track • Setup: Given a user utterance, retrieve an answer from a corpus of paragraphs • Evaluation is performed with respect to answer relevance • Main challenge is coreference resolution 1. http://www.treccast.ai/ What is throat cancer? … AI Is it treatable? … AI Tell me about lung cancer. … AI What are its symptoms?
  • 23. Issues • Evaluation is limited to single turns and does not consider system responses to previous user utterances • User utterances are given beforehand and follow a fixed sentence • Answers are limited to existing paragraphs in a corpus • Answers are a ranked list • Coreference resolution is not specific to conversational search, it’s been studied in the context of Web/session search • task == passage retrieval?
  • 24. A disconnect • All test collections are abstractions of real tasks, but we need to make sure that we don’t abstract away the conversational aspects vs. U S S … U S S … U S S … … U S S … U S S … … U S U S U
  • 25. Re: SWIRL 2012 [1] 1. J. Allan, B. Croft, A. Moffat, and M. Sanderson editors. Frontiers, Challenges, and Opportunities for Information Retrieval: Report from SWIRL 2012 the Second Strategic Workshop on Information Retrieval in Lorne. SIGIR Forum 46(1), 2012.
  • 26. Re: SWIRL 2012 [1] 1. J. Allan, B. Croft, A. Moffat, and M. Sanderson editors. Frontiers, Challenges, and Opportunities for Information Retrieval: Report from SWIRL 2012 the Second Strategic Workshop on Information Retrieval in Lorne. SIGIR Forum 46(1), 2012.
  • 27. Take-away #2 We need to think (more) about what really makes a conversational experience good/bad and develop appropriate evaluation resources.
  • 28. Part II From Answer Retrieval to Answer Generation based on: S. Zhang, Z. Dai, K. Balog, and J. Callan. Summarizing and Exploring Tabular Data in Conversational Search. SIGIR ’20.
  • 29. Conversational AI Task-oriented Social chat • Aim to provide concise, direct answers to user queries • Dialogues are unstructured, but commonly follow a question-answer pattern; mostly open domain (dictated by the underlying data) • Evaluated with respect to the correctness of answers (on the turn level) Interactive QA • Aim to assist users to solve a specific task (as efficiently as possible) • Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain • Well-defined measure of performance that is explicitly related to task completion • Aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking human- human interactions • Developed for unstructured, open domain conversations • Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner
  • 30. Motivation What is the largest field sport stadium in the world?
  • 31. Motivation What is the largest field sport stadium in the world? The AT&T Stadium in Arlington, Texas, USA. AI The AT&T Stadium in Arlington, Texas, USA, which can seat over 80.000 people. AI The AT&T Stadium in Arlington, Texas, USA, home of the Dallas Cowboys, which can seat over 80.000 people. AI Option 1: Option 2: Option 3:
  • 32. What makes it a good response in a conversational setting?
  • 34. Motivation What will the weather be like next week? AI
  • 35–36. Motivation What will the weather be like next week? • Option 1: "Today, it'll be sunny, with temperatures between -6 and -2 °C. Tomorrow, it'll be overcast, with temperatures between -1 and 2 °C. On Sunday, it'll be overcast, with temperatures between 1 and 2 °C. On Monday, it'll be rainy, with temperatures between 2 and 4 °C. On Tuesday, it'll be rainy, with temperatures between 3 and 4 °C. …" • Option 2: "Today, it'll be sunny. Over the weekend, it'll be overcast with max temperatures around 2 °C. From Monday onwards, it'll be rainy with temperatures between 2 and 6 °C. And, if you think that's depressing, you should look at the wind forecast."
  • 37. Motivation • Approximately 10% of QA queries are answered by tabular data • How to present tabular search results effectively in a conversational setting? • Summary is not static, but is conditioned on conversation context • Summary should help drive the conversation (invite for further exploration)
  • 38. Motivation What is the largest field sport stadium in the world? https://en.wikipedia.org/wiki/List_of_covered_stadiums_by_capacity
  • 39–41. The anatomy of an answer What is the largest field sport stadium in the world? "I found a table listing the largest field stadiums in the world. The largest one is the AT&T Stadium, which holds 80,000 people. All the listed stadiums have a capacity of over 5,000 people, and 30 of 67 are in the US." • Leading sentence describing the table • Answer to the question • Further information to help explore further
  • 42. Would people want to be engaged in a conversation? • Experiment: augmenting answers to NBA-related questions [1] • User engagement increased overall • Explicit negative feedback also increased • Users are less inclined to engage in a conversation when their team lost 1. Szpektor et al. Dynamic Composition for Conversational Domain Exploration. The Web Conf., 2020. Figure taken from Szpektor et al. [1]
  • 43. Objective • Create a test collection for table summarization in a conversational setting • Input: query q and result table T • Output: summary S • Illustration of the conversational table summarization task: q = "What is the largest field sport stadium in the world?"; T = "List of covered stadiums by capacity"; S = "I found a table listing the largest field stadiums in the world. The largest one is the AT&T Stadium, which holds 80,000 people. All the listed stadiums have a capacity of over 5,000 people, and 30 of 67 are in the US."
  • 44. Queries and tables • 200 tables sampled randomly from the WikiTableQuestions dataset [1] • Tables with at least 6 rows and 4 columns • Corresponding queries either sampled (45) or written by two of the authors (155) 1. Pasupat and Liang. Compositional Semantic Parsing on Semi-Structured Tables. ACL-IJCNLP 2015. q: tell me about the movies he acted T: Zhao_Dan#Selected_filmography q: how many times was ed sheeran listed as the performer T: List of number-one singles and albums in Sweden 2014
  • 45. Creating candidate summaries • Crowdsourcing task, mimicking a conversation with a friend • Summary in 30-50 words (150-250 characters) • ~15-30 seconds long spoken • 5 summaries per query Instructions given to crowd workers: Imagine talking to a friend on the phone. Your friend asks you a question, and you find the following table on the Web. Remember that your friend cannot see the table. Your goal is to help your friend capture the essential information in the table related to the question. Your summary should be short but comprehensive. Try to describe several rows or columns that you find interesting.
  • 46. Assessments • Summary-level: • Assessors (7) are presented with the question and all candidate summaries (5) • They are asked to select the best summary, considering language quality, relevance, and ability to drive a conversation • The summary with the most votes is selected as the ground truth • In 77% of the cases at least 3 assessors agreed on which summary was best • Top-voted summaries tend to be longer and use a richer vocabulary (more unique words) • Sentence-level: • Summaries are split into sentences (avg. 3.5) • Assessors (3) are presented with a sentence, together with the previous and next sentences as context, and are asked to judge its relevance • They are situated in a scenario where they need to write a short summary (to a friend) • Highly relevant: describes the table's topic or provides a fact regarding the query • Relevant: not directly relevant to the question but helps to explain the contents of the table • Non-relevant: not about the table or unrelated to the query
  • 47. Experiments • Goals: establish a baseline, identify challenging aspects of the task • Setup: 5-fold cross-validation; all summaries with >0 votes used for training • Methods: pre-trained neural language generation models (CopyNet, GPT-2, T5) • Tables are represented as a text sequence • A "key:value" format is used to preserve structure. Additionally, page title, table caption and total number of rows are also added with special keys.
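The exact serialization is not reproduced on the slide, so the following is a minimal Python sketch of the kind of "key:value" linearization described above; the special keys (page_title, caption, num_rows) and the <sep> delimiter are illustrative assumptions, not the authors' exact format.

def linearize_table(page_title, caption, header, rows):
    """Flatten a table into a "key:value" text sequence for a seq2seq model.

    A minimal sketch: the special keys below are illustrative placeholders,
    not the exact keys used in the paper.
    """
    parts = [
        f"page_title : {page_title}",
        f"caption : {caption}",
        f"num_rows : {len(rows)}",
    ]
    for row in rows:
        # Pair each cell with its column header to preserve table structure.
        parts.extend(f"{col} : {cell}" for col, cell in zip(header, row))
    return " <sep> ".join(parts)

# Example: a tiny slice of the stadium table.
text = linearize_table(
    page_title="List of covered stadiums by capacity",
    caption="Largest field sport stadiums",
    header=["Stadium", "Capacity", "Country"],
    rows=[["AT&T Stadium", "80,000", "USA"]],
)
print(text)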
  • 48. Results

    Method    ROUGE-L   ROUGE-1   ROUGE-2   BLEU
    CopyNet    0.030     0.041     0.012     0.80
    GPT-2      0.200     0.272     0.073     5.35
    T5         0.276     0.362     0.143    10.43
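For reference, scores like those above can be computed with off-the-shelf tooling; the sketch below uses Google's rouge-score package (an assumption about tooling — the authors' exact evaluation scripts may differ) on toy reference and generated summaries.

from rouge_score import rouge_scorer

reference = ("I found a table listing the largest field stadiums in the world. "
             "The largest one is the AT&T Stadium, which holds 80,000 people.")
generated = ("I found a table of the top 10 stadiums. "
             "The largest is the AT&T Stadium with 80,000 seats.")

# ROUGE-1/2 count unigram/bigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(name, round(s.fmeasure, 3))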
  • 49. Analysis • Machine-generated summaries are fluent • Typical mistakes: • Wrong quantity (84% of summaries), e.g., "Purdue won 4 consecutive in the 1960s, …" => it won only 3 • Wrong reference (38% of summaries), e.g., elevation vs. prominence when asking for the highest mountain • Example of ground truth vs. auto-generated summaries for q: "What album had the most sales in 2012 in Finland?" (T: List of number-one albums of 2012 (Finland)) • Manual: "I saw a table showing sales figures of around 10 albums of 2012. Of these 'vain elamaa' by various artists ranked highest regarding sales. It is performed by various artists. The second rank is held by 'koodi' by robin." • T5: "I found a table of the top 10 albums of 2012. The most sold album was 'vain elämä' by various artists. It sold 164,119 copies. The next highest album was 'koodi' by robin."
  • 50. Summary • Response generation is an open research challenge in Interactive QA • Needs to be designed specifically for the conversational paradigm and go beyond mere passage retrieval • Needs appropriate evaluation methodology and resources • Test collection for summarizing and exploring tabular data in conversational search • Includes both summary-level and sentence-level human assessments • Available at https://github.com/iai-group/sigir2020-tablesum
  • 51. Part III Towards Automated End-to-end System Evaluation based on: S. Zhang and K. Balog. Evaluating Conversational Recommender Systems via User Simulation. KDD ’20.
  • 52. Conversational AI • Recap of the conversational AI landscape from Part I (task-oriented, interactive QA, social chat); this part turns to the task-oriented scenario (conversational item recommendation).
  • 53. Motivation • Evaluating conversational systems with real users is time-consuming and expensive • Can we simulate users (for the purpose of evaluation)? • The specific problem is conversational item recommendation • Help users find an item they like from a large collection of items • Elicit preferences on specific items or categories of items
  • 54. Why simulation? • Test-collection based ("offline") evaluation • Possible to create a reusable test collection for a specific subtask • Limited to a single turn, does not measure overall user satisfaction • Human evaluation • Possible to annotate entire conversations • Expensive, time-consuming, does not scale
  • 55. Objectives • Develop a user simulator that • produces responses that a real user would give in a certain dialog situation • enables automatic assessment of conversational agents • makes no assumptions about the inner workings of conversational agents • is data-driven (requires only a small annotated dialogue corpus)
  • 56. Problem statement • For a given system S and user population U, the goal of user simulation U* is to predict the performance of S when used by U, denoted as M(S,U) • For two systems S1 and S2, U* should be such that if M(S1,U) < M(S2,U) then M(S1,U*) < M(S2,U*)
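The ordering-preservation requirement can be checked empirically by correlating the system scores obtained with real and simulated users; the sketch below uses Kendall's τ (my choice of statistic — the slide states only the pairwise condition), with the Reward scores reported later in this talk as example data.

from scipy.stats import kendalltau

# Reward scores M(S, U) for systems A, B, C (taken from the results slide).
real_scores      = [8.88, 7.56, 6.04]  # measured with real users U
simulated_scores = [8.04, 7.41, 6.30]  # measured with simulated users U* (QRFA-Single)

# tau = 1.0 means the simulator preserves the full system ordering.
tau, _ = kendalltau(real_scores, simulated_scores)
print(f"Kendall tau between real and simulated orderings: {tau:.2f}")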
  • 57–60. Simulation framework • The simulated user interacts with the (black-box) conversational agent through three components: • Natural language understanding (NLU): translating an agent utterance into a structured format • Response generation (interaction model + preference model): determining the next user action based on the understanding of the agent's utterance • Natural language generation (NLG): turning a structured response representation into natural language
  • 61. Modeling simulated users • Model dialogue as a Markov Decision Process (MDP) • Every MDP is formally described by a finite state space S, a finite action set A, and transition probabilities P • Dialogue state s_t is the state of the dialogue manager at dialogue turn t • Dialogue action a_t represents the user action that is being communicated in turn t • Transition probabilities: the probability of transitioning from s_t to s_{t+1} • Markov property: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)
  • 62. Agenda-based simulation [1] • The action agenda A is a stack-like representation of user actions that is dynamically updated • The next user action is selected from the top of the agenda • The agenda is updated based on whether the agent understands the user action (by giving a corresponding response) • Accomplished goal → pull operation • Not accomplished → push replacement action(s) • Example (movie recommendation): the user discloses "The Remains of the Day" (a psychological movie); when the agent cannot handle it and asks for a different movie, the disclose action on top of the agenda is replaced with disclose(name="Requiem for a Dream"), while the rest of the agenda (disclose genre, navigate director, navigate rating, note, complete) stays unchanged 1. Schatzmann et al. Agenda-based User Simulation for Bootstrapping a POMDP Dialogue System. NAACL 2007.
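A minimal Python sketch of the agenda mechanics described above, assuming a simple pop-and-replace update rule: the top action is pulled when the agent handles it, and replaced when it does not (as in the movie example). Class and action names are illustrative.

class Agenda:
    """Stack-like agenda of pending user actions (top of stack = next action)."""

    def __init__(self, actions):
        self._stack = list(reversed(actions))  # first action ends up on top

    def next_action(self):
        return self._stack[-1] if self._stack else None

    def update(self, accomplished, replacements=None):
        """Pull the top action if the agent handled it; otherwise replace it."""
        self._stack.pop()  # remove the action that was just attempted
        if not accomplished and replacements:
            # Push replacement action(s), e.g. disclose a different movie
            # when the agent failed to recognize the first one.
            self._stack.extend(reversed(replacements))

agenda = Agenda([
    ("disclose", "name=The Remains of the Day"),
    ("disclose", "genre=psychological"),
    ("navigate", "director"),
    ("navigate", "rating"),
    ("note", None),
    ("complete", None),
])
# Agent did not recognize the movie: swap in a different disclose action.
agenda.update(accomplished=False,
              replacements=[("disclose", "name=Requiem for a Dream")])
print(agenda.next_action())  # ('disclose', 'name=Requiem for a Dream')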
  • 63. Interaction model • The interaction model defines how the agenda should be initialized (A0) and updated (At → At+1) • Action space [2]: • Disclose: "I would like to arrange a holiday in Italy" • Reveal: "Actually, we need to go on the 3rd of May in the evening" • Inquire: "What other regions in Europe are like that?" • Navigate: "Which one is the cheapest option?" • Note: "That hotel could be a possibility" • Complete: "Thanks for the help, bye" 2. Azzopardi et al. Conceptualizing Agent-human Interactions during the Conversational Search Process. CAIR 2018.
  • 64. Interaction model • QRFA model [3]: user actions Query (Q) and Feedback (F); agent actions Request (R) and Answer (A) • CIR6 model: Disclose, Inquire, Navigate, Note, Reveal, Complete 3. Vakulenko et al. QRFA: A Data-Driven Model of Information Seeking Dialogues. ECIR 2019.
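To show how such an interaction model can be made data-driven, here is a sketch of estimating action-transition probabilities from an annotated dialogue corpus via maximum-likelihood counts; the toy dialogues and the estimation recipe are illustrative, not the paper's exact training procedure.

from collections import Counter, defaultdict

# Annotated dialogues as sequences of user action labels (CIR6 action space).
dialogues = [
    ["disclose", "disclose", "navigate", "note", "complete"],
    ["disclose", "reveal", "navigate", "navigate", "complete"],
    ["disclose", "inquire", "navigate", "note", "complete"],
]

counts = defaultdict(Counter)
for dialogue in dialogues:
    for cur, nxt in zip(dialogue, dialogue[1:]):
        counts[cur][nxt] += 1

# Maximum-likelihood transition probabilities P(next action | current action).
transitions = {
    cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
    for cur, nxts in counts.items()
}
print(transitions["disclose"])
# {'disclose': 0.25, 'navigate': 0.25, 'reveal': 0.25, 'inquire': 0.25}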
  • 65. Preference model • The preference model is meant to capture individual differences and personal tastes • Preferences are represented as a set of attribute–value pairs • Single Item Preference: • Partially rooted in historical user behavior • Offers limited consistency • Personal Knowledge Graph (PKG): • A PKG has two types of nodes: items and attributes • The rating for an attribute is inferred by considering the ratings of items that have that attribute
  • 66. Personal Knowledge Graphs [4] A personal knowledge graph (PKG) is a resource of structured information about entities that are of personal interest to the user Key differences from general KGs: • Entities of personal interest to the user • Distinctive shape (“spiderweb” layout) • Links between a PKG and external sources are inherent to its nature 4. Balog and Kenter. Personal Knowledge Graphs: A Research Agenda. ICTIR 2019.
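A minimal sketch of the PKG-based attribute-rating inference described above: the preference for an attribute (e.g., a genre) is derived from the user's ratings of items carrying that attribute. Averaging is an illustrative aggregation choice, and the data below is toy data.

def infer_attribute_rating(pkg_ratings, item_attributes, attribute):
    """Infer a preference score for `attribute` from rated items that have it."""
    relevant = [rating for item, rating in pkg_ratings.items()
                if attribute in item_attributes.get(item, ())]
    # Mean aggregation is an illustrative choice, not the paper's exact rule.
    return sum(relevant) / len(relevant) if relevant else None

pkg_ratings = {"The Remains of the Day": 5, "Requiem for a Dream": 4, "Cars": 2}
item_attributes = {
    "The Remains of the Day": {"drama", "psychological"},
    "Requiem for a Dream": {"drama", "psychological"},
    "Cars": {"animation"},
}
print(infer_attribute_rating(pkg_ratings, item_attributes, "psychological"))  # 4.5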
  • 67. Grounding in actual data • Historical movie viewing data (ratings) • User preference profiles are initialized based on a sample • The rest of the watch history is used as held-out data
  • 68. Evaluation architecture • Three existing conversational movie recommenders (A, B, C) are compared using both real and simulated users • Real users: crowdsourcing • Simulated users: • Preference model is initialized by sampling historical preferences of a real user from MovieLens data • Interaction model is trained on the behaviors of real human users • Both NLU and NLG use hand-crafted templates • (Diagram: a conversation manager mediating between the conversational agent and either a human or a simulated user; see the sketch below)
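Putting the architecture together, a sketch of the evaluation loop run by the conversation manager: utterances shuttle between a (black-box) agent and the simulated user until the user issues a complete action or a turn limit is hit. All class and method names (agent.respond, simulated_user.respond, etc.) are hypothetical interfaces, not the released code.

MAX_TURNS = 30

def run_dialogue(agent, simulated_user, max_turns=MAX_TURNS):
    """Run one simulated conversation; returns the full transcript."""
    transcript = []
    user_utterance = simulated_user.start()          # e.g. "I am looking for a movie"
    for _ in range(max_turns):
        agent_utterance = agent.respond(user_utterance)
        transcript.append(("user", user_utterance))
        transcript.append(("agent", agent_utterance))
        # Inside the simulated user: NLU -> interaction/preference models -> NLG.
        user_utterance = simulated_user.respond(agent_utterance)
        if simulated_user.is_complete():             # COMPLETE action reached
            break
    return transcript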
  • 69. Characteristics of conversations • (RQ1) How well do our simulation techniques capture the characteristics of conversations? • CIR6-PKG tends to have significantly shorter average conversation length, since it terminates the dialog as soon as the user finds a recommendation they like

    Method        AvgTurns (A / B / C)     UserActRatio (A / B / C)   DS-KL (A / B / C)
    Real users     9.20 / 14.84 / 20.24    0.374 / 0.501 / 0.500      — / — / —
    QRFA-Single   10.52 / 12.28 / 17.51    0.359 / 0.500 / 0.500      0.027 / 0.056 / 0.029
    CIR6-Single    9.44 / 12.75 / 15.92    0.382 / 0.500 / 0.500      0.055 / 0.040 / 0.025
    CIR6-PKG       6.16 /  9.87 / 10.56    0.371 / 0.500 / 0.500      0.075 / 0.056 / 0.095
  • 70. Performance prediction • (RQ2) How well does the relative ordering of systems according to some measure correlate when using real vs. simulated users? • High correlation between automatic and human evaluations

    Method        Reward                             Success rate
    Real users    A (8.88)  > B (7.56)  > C (6.04)   B (0.864) > A (0.833) > C (0.727)
    QRFA-Single   A (8.04)  > B (7.41)  > C (6.30)   B (0.836) > A (0.774) > C (0.718)
    CIR6-Single   A (8.64)  > B (8.28)  > C (6.01)   B (0.822) > A (0.807) > C (0.712)
    CIR6-PKG      A (11.12) > B (10.65) > C (9.31)   A (0.870) > B (0.847) > C (0.784)

  Performance of conversational agents using real vs. simulated users, in terms of Reward and Success Rate. We show the relative ordering of agents (A–C), with evaluation scores in parentheses.
  • 71. Realisticity • (RQ3) Do more sophisticated simulation approaches (i.e., more advanced interaction and preference modeling) lead to more realistic simulation? • Human raters were shown pairs of dialogues (one with a real user, one with a simulated user) and asked to guess which of the two was performed by a human • Trying to "fool" the human evaluator (a "reverse" Turing test)
  • 72. Realisticity • (RQ3) Do more sophisticated simulation approaches (i.e., more advanced interaction and preference modeling) lead to more realistic simulation? • Our interaction model (CIR6) and personal knowledge graphs for preference modeling both bring improvements

    Method        A (Win/Loss/Tie)   B (Win/Loss/Tie)   C (Win/Loss/Tie)   All (Win/Loss/Tie)
    QRFA-Single   20 / 39 / 16       22 / 33 / 20       19 / 43 / 13       27% / 51% / 22%
    CIR6-Single   27 / 30 / 18       23 / 33 / 19       26 / 40 /  9       33% / 46% / 21%
    CIR6-PKG      22 / 39 / 14       27 / 29 / 19       32 / 25 / 18       36% / 41% / 23%
  • 73. Further analysis • We analyze the reasons crowd workers gave when they chose the real user's dialogue, and classify them as follows • Style: • Realisticity: how realistic or human-sounding a dialog is • Engagement: involvement of the user in the conversation • Emotion: expressions of feelings or emotions • Content: • Response: the user does not seem to understand the agent correctly • Grammar: language usage, including spelling and punctuation • Length: the length of the reply
  • 74. Summary of contributions • A general framework for evaluating conversational recommender agents via simulation • Interaction and preference models to better control the conversation flow and to ensure the consistency of responses given by the simulated user • Experimental comparison of three conversational movie recommender agents, using both real and simulated users • Analysis of comments collected from human evaluation, and identification of areas for future development
  • 75. It’s time to wrap up
  • 76. Summary • Time to move away from search engines toward conversation engines • Consider what really makes a conversational experience successful or unsuccessful • Embrace a more holistic view that integrates and supports multiple user goals • Progress needs to be made on evaluation methodology and resources
  • 77. Acknowledgments • Shuo Zhang (Bloomberg), Zhuyun Dai (CMU), Jamie Callan (CMU)
  • 78. Questions? • Advertisements: • ECIR 2022 in Stavanger, Norway (ecir2022.org will go live later this week) • Simulation for IR Evaluation workshop at SIGIR 2021