This talk discusses a set of specific tasks and scenarios related to information access within the vast space that is casually referred to as conversational AI. While most of these problems have been identified in the literature for quite some time now, progress has been limited. Apart from the inherently challenging nature of these problems, the lack of progress, in large part, can be attributed to the shortage of appropriate evaluation methodology and resources. This talk presents some recent work towards filling this gap.
In one line of research, we investigate the presentation of tabular search results in a conversational setting. Instead of generating a static summary of a result table, we complement brief summaries with clues that invite further exploration, thereby taking advantage of the conversational paradigm. One of the main contributions of this study is the development of a test collection using crowdsourcing.
Another line of work focuses on large-scale evaluation of conversational recommender systems via simulated users. Building on the well-established agenda-based simulation framework from dialogue systems research, we develop interaction and preference models specific to the item recommendation scenario. For evaluation, we compare three existing conversational movie recommender systems with both real and simulated users, and observe high correlation between the two means of evaluation.
This talk was given at the CIIR talk series at the University of Massachusetts Amherst in January 2021, as well as at the IR seminar series at the University of Glasgow in March 2021.
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
1. Krisztian Balog
University of Stavanger
@krisztianbalog krisztianbalog.com
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
2. In this talk
• My perspective on conversational information access
• Point #1: conversational aspect has not received due attention so far
• Point #2: evaluation is a bottleneck of progress
• Recent work on evaluation in the context of two specific conversational
information access scenarios
6. Traditional distinction [1,2]
Task-oriented (goal-driven)
• Aim to assist users in solving a specific task (as efficiently as possible)
• Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain
• Well-defined measure of performance that is explicitly related to task completion
Non-task-oriented (non-goal-driven)
• Aim to carry on an extended conversation ("chit-chat") with the goal of mimicking human-human interactions
• Developed for unstructured, open-domain conversations
• Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner
1. Chen et al. A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explor. Newsl. 19(2), 2017.
2. Serban et al. A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version. Dialogue & Discourse 9(1), 2018.
7. Contemporary distinction [1,2]
Task-oriented
• Aim to assist users in solving a specific task (as efficiently as possible)
• Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain
• Well-defined measure of performance that is explicitly related to task completion
Interactive QA
• Aim to provide concise, direct answers to user queries
• Dialogues are unstructured, but commonly follow a question-answer pattern; mostly open domain (dictated by the underlying data)
• Evaluated with respect to the correctness of answers (on the turn level)
Social chat
• Aim to carry on an extended conversation ("chit-chat") with the goal of mimicking human-human interactions
• Developed for unstructured, open-domain conversations
• Objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner
1. Gao et al. Neural Approaches to Conversational AI. Found. Trends Inf. Retr. 13(2-3), 2019.
2. Deriu et al. Survey on Evaluation Methods for Dialogue Systems. Artif. Intell. Rev., 2020.
8. Where does Conversational Info Access fit?
(The Task-oriented / Interactive QA / Social chat comparison from slide 7, repeated.)
10. Search as a conversation
User: places to visit in Stavanger
AI: [search results shown as a figure]
11. Search as a conversation
User: places to visit in Stavanger
AI: By Stavanger you meant the city in Norway, right?
AI: Would you be interested in other cities in Norway as well?
AI: Do you want to know more about Stavanger?
12. Main differences
• Degree of personalization and long-term user state
• Support for complex (multi-step) tasks
• Answer generation vs. answer retrieval
• Dialogue setting where a screen or keyboard may not be present
• Mixed initiative
13. J.S. Culpepper, F. Diaz, M.D. Smucker editors. Research Frontiers in Information Retrieval: Report from the
Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018). SIGIR Forum 52(1), 2018.
16. Example dialogue
User: I'd like to buy new running shoes. Do you have something similar to my Nike? [Task-oriented]
AI: There is a new Pegasus model. Also, you might want to check out Asics' Gel Nimbus line, which has better ankle support.
User: Why does ankle support matter? [Interactive QA]
AI: …
User: I don't enjoy running that much lately. [Social chat]
AI: …
17. Hybrid systems
• Most commercial systems are hybrids that combine the strengths of task-specific approaches/architectures
• Hierarchical dialogue manager (sketched below):
• Broker (top-level) that manages the overall conversation process
• Collection of skills (low-level) that handle different types of conversation segments
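A minimal sketch of such a hierarchical dialogue manager; all class and method names here are illustrative, not taken from any particular commercial system:

```python
# Hypothetical broker/skills architecture: a top-level Broker classifies
# each utterance and dispatches it to a low-level skill.

class Skill:
    """Low-level component that handles one type of conversation segment."""
    def respond(self, utterance: str) -> str:
        raise NotImplementedError

class RecommenderSkill(Skill):
    def respond(self, utterance: str) -> str:
        return "Here are some movies you might like..."

class ChitChatSkill(Skill):
    def respond(self, utterance: str) -> str:
        return "Interesting! Tell me more."

class Broker:
    """Top-level component that manages the overall conversation process."""
    def __init__(self, skills, fallback):
        self.skills = skills      # intent label -> Skill
        self.fallback = fallback

    def classify(self, utterance: str) -> str:
        # Placeholder intent detector; a real broker would use a trained model.
        return "recommend" if "movie" in utterance.lower() else "chitchat"

    def handle(self, utterance: str) -> str:
        intent = self.classify(utterance)
        return self.skills.get(intent, self.fallback).respond(utterance)

broker = Broker(
    skills={"recommend": RecommenderSkill(), "chitchat": ChitChatSkill()},
    fallback=ChitChatSkill(),
)
print(broker.handle("I'd like to watch a movie tonight."))
```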
18. A disconnect
Siloed view (current practice) vs. holistic view (SWIRL'18 vision)
19. A disconnect
• Siloed view (current practice): separate architectures for the different types of systems (task-oriented, interactive QA, social chat)
• Holistic view (SWIRL'18 vision): a unified conversational info access architecture, coordinated by a broker, that supports multiple user goals
21. Progress so far (selection)
• Intent detection
• Asking clarification questions
• Query resolution
• Response retrieval
• …
• Mostly on the component level and not specific to the conversational
paradigm
22. Example: TREC CAsT [1]
• TREC Conversational Assistance Track
• Setup: Given a user utterance, retrieve an answer from a corpus of paragraphs
• Evaluation is performed with respect to answer relevance
• Main challenge is coreference resolution (see the toy sketch below)
1. http://www.treccast.ai/
User: What is throat cancer?
AI: …
User: Is it treatable?
AI: …
User: Tell me about lung cancer.
AI: …
User: What are its symptoms?
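As a toy illustration of why coreference resolution matters here (deliberately naive, not the approach of any CAsT participant), one can rewrite an utterance into a self-contained query by substituting pronouns with the most recently mentioned topic:

```python
# Naive heuristic: replace pronouns with the last-mentioned topic so that
# the query can be issued against the paragraph corpus in isolation.

def rewrite(utterance, topic):
    pronouns = {"it", "its", "they", "their"}
    out = []
    for token in utterance.split():
        word = token.strip("?.,!").lower()
        if topic and word in pronouns:
            token = topic + ("'s" if word in {"its", "their"} else "")
        out.append(token)
    return " ".join(out)

topic = "lung cancer"  # tracked from "Tell me about lung cancer."
print(rewrite("What are its symptoms?", topic))
# -> What are lung cancer's symptoms?
```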
23. Issues
• Evaluation is limited to single turns and does not consider system responses to previous user utterances
• User utterances are given beforehand and follow a fixed sequence
• Answers are limited to existing paragraphs in a corpus
• Answers are returned as a ranked list
• Coreference resolution is not specific to conversational search; it has been studied in the context of Web/session search
• Does the task reduce to passage retrieval?
24. A disconnect
• All test collections are abstractions of real tasks, but we need to make sure that we don't abstract away the conversational aspects
[Diagram: user utterances each paired with system responses and evaluated in isolation (U; S, S, …) vs. an interleaved multi-turn conversation (U, S, U, S, U)]
25. Re: SWIRL 2012 [1]
1. J. Allan, B. Croft, A. Moffat, and M. Sanderson, editors. Frontiers, Challenges, and Opportunities for Information Retrieval: Report from SWIRL 2012, the Second Strategic Workshop on Information Retrieval in Lorne. SIGIR Forum 46(1), 2012.
27. Take-away #2
We need to think (more) about what really makes a
conversational experience good/bad and develop
appropriate evaluation resources.
28. Part II
From Answer Retrieval to Answer Generation
based on:
S. Zhang, Z. Dai, K. Balog, and J. Callan. Summarizing and Exploring Tabular Data in Conversational Search. SIGIR ’20.
29. Conversational AI
(The Task-oriented / Interactive QA / Social chat comparison from slide 7, repeated.)
31. Motivation
User: What is the largest field sport stadium in the world?
Option 1: AI: The AT&T Stadium in Arlington, Texas, USA.
Option 2: AI: The AT&T Stadium in Arlington, Texas, USA, which can seat over 80,000 people.
Option 3: AI: The AT&T Stadium in Arlington, Texas, USA, home of the Dallas Cowboys, which can seat over 80,000 people.
32. What makes it a good response in a
conversational setting?
35. Motivation
User: What will the weather be like next week?
Option 1: AI: Today, it'll be sunny, with temperatures between -6 and -2 °C. Tomorrow, it'll be overcast, with temperatures between -1 and 2 °C. On Sunday, it'll be overcast, with temperatures between 1 and 2 °C. On Monday, it'll be rainy, with temperatures between 2 and 4 °C. On Tuesday, it'll be rainy, with temperatures between 3 and 4 °C. …
Option 2: AI: Today, it'll be sunny. Over the weekend, it'll be overcast with max temperatures around 2 °C. From Monday onwards, it'll be rainy with temperatures between 2 and 6 °C. And, if you think that's depressing, you should look at the wind forecast.
37. Motivation
• Approximately 10% of QA queries are answered by tabular data
• How to present tabular search results effectively in a conversational
setting?
• Summary is not static, but is conditioned on conversation context
• Summary should help drive the conversation (invite for further exploration)
38. Motivation
What is the largest field sport stadium in the world?
https://en.wikipedia.org/wiki/List_of_covered_stadiums_by_capacity
41. The anatomy of an answer
User: What is the largest field sport stadium in the world?
AI: I found a table listing the largest field stadiums in the world. [leading sentence describing the table] The largest one is the AT&T Stadium, which holds 80,000 people. [answer to the question] All the listed stadiums have a capacity of over 5,000 people, and 30 of 67 are in the US. [further information to help explore further]
42. Would people want to be engaged in a conversation?
• Experiment: augmenting answers to
NBA-related questions [1]
• User engagement increased overall
• Explicit negative feedback also
increased
• Users are less inclined to engage in a
conversation when their team lost
1. Szpektor et al. Dynamic Composition for Conversational Domain Exploration. The Web Conf., 2020.
Figure taken from Szpektor et al. [1]
43. Objective
• Create a test collection for table
summarization in a conversational
setting
• Input: query q and result table T
• Output: summary S
Illustration of the conversational table summarization task:
q: What is the largest field sport stadium in the world?
T: List of covered stadiums by capacity
S: I found a table listing the largest field stadiums in the world. The largest one is the AT&T Stadium, which holds 80,000 people. All the listed stadiums have a capacity of over 5,000 people, and 30 of 67 are in the US.
44. Queries and tables
• 200 tables sampled randomly from
the WikiTableQuestions dataset [1]
• Tables with at least 6 rows and 4
columns
• Corresponding queries either
sampled (45) or written by two of the
authors (155)
1. Pasupat and Liang. Compositional Semantic Parsing on Semi-Structured Tables. ACL-IJCNLP 2015.
Example queries and tables:
q: tell me about the movies he acted
T: Zhao_Dan#Selected_filmography
q: how many times was ed sheeran listed as the performer
T: List of number-one singles and albums in Sweden 2014
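A sketch of how such a sample could be drawn; the directory layout and CSV format below are assumptions based on the public WikiTableQuestions release, not details given on the slide:

```python
# Filter WikiTableQuestions tables to those with >= 6 rows and >= 4 columns,
# then draw a random sample of 200.
import csv
import glob
import random

def table_size(path):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    n_cols = len(rows[0]) if rows else 0
    return len(rows) - 1, n_cols  # data rows (excluding header), columns

candidates = []
for path in glob.glob("WikiTableQuestions/csv/*/*.csv"):
    n_rows, n_cols = table_size(path)
    if n_rows >= 6 and n_cols >= 4:
        candidates.append(path)

random.seed(42)  # arbitrary seed, for reproducibility only
sample = random.sample(candidates, 200)
```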
45. Creating candidate summaries
• Crowdsourcing task, mimicking a conversation with a friend
• Summary in 30-50 words (150-250 characters)
• ~15-30 seconds long when spoken
• 5 summaries per query
Instructions given to crowd workers: "Imagine talking to a friend on the phone. Your friend asks you a question, and you find the following table on the Web. Remember that your friend cannot see the table. Your goal is to let your friend capture the essential information in the table related to the question. Your summary should be short but comprehensive. Try to describe several rows or columns that you find interesting."
46. Assessments
Summary-level
• Assessors (7) are presented with the question and all candidate summaries (5)
• They are asked to select the best summary, considering language quality, relevance, and ability to drive a conversation
• The summary with the most votes is selected as the ground truth
• In 77% of the cases, at least 3 assessors agreed on which summary was best
• Top-voted summaries tend to be longer and use a richer vocabulary (more unique words)
Sentence-level
• Summaries are split into sentences (avg. 3.5)
• Assessors (3) are presented with a sentence, together with the previous and next sentences as context, and are asked to judge its relevance
• They are situated in a scenario where they need to write a short summary (to a friend)
• Highly relevant: describes the table's topic or provides a fact regarding the query
• Relevant: is not directly relevant to the question but helps to explain the contents of the table
• Non-relevant: not about the table or unrelated to the query
47. Experiments
• Goals: establish a baseline; identify challenging aspects of the task
• Setup: 5-fold cross-validation; all summaries with >0 votes used for training
• Methods: pre-trained neural language generation models (CopyNet, GPT-2, T5)
• Tables are represented as a text sequence
• A "key:value" format is used to preserve structure; the page title, table caption, and total number of rows are also added with special keys (see the sketch below)
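As a rough illustration of this linearization (the exact special keys are not shown on the slide, so the key names below, "page_title", "caption", and "num_rows", are assumptions):

```python
# Turn a table plus its metadata into a flat "key: value" text sequence
# that a sequence-to-sequence model can consume.

def linearize_table(page_title, caption, header, rows):
    parts = [
        f"page_title: {page_title}",
        f"caption: {caption}",
        f"num_rows: {len(rows)}",
    ]
    for row in rows:
        parts.extend(f"{key}: {value}" for key, value in zip(header, row))
    return " | ".join(parts)

seq = linearize_table(
    page_title="List of covered stadiums by capacity",
    caption="Largest field sport stadiums",
    header=["Stadium", "Capacity", "Country"],
    rows=[["AT&T Stadium", "80,000", "USA"]],
)
print(seq)
```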
49. Analysis
• Machine-generated summaries are fluent
• Typical mistakes:
• Wrong quantity (84% of summaries), e.g., "Purdue won 4 consecutive in the 1960s, …" => it won only 3
• Wrong reference (38% of summaries), e.g., elevation vs. prominence when asking for the highest mountain
Example of ground truth vs. auto-generated summaries:
q: What album had the most sales in 2012 in Finland?
T: List of number-one albums of 2012 (Finland)
Manual: I saw a table showing sales figures of around 10 albums of 2012. Of these "vain elamaa" by various artists ranked highest regarding sales. It is performed by various artists. The second rank is held by "koodi" by robin.
T5: I found a table of the top 10 albums of 2012. The most sold album was "vain elämä" by various artists. It sold 164,119 copies. The next highest album was "koodi" by robin.
50. Summary
• Response generation is an open research challenge in Interactive QA
• Needs to be designed specifically for the conversational paradigm and go beyond
mere passage retrieval
• Needs appropriate evaluation methodology and resources
• Test collection for summarizing and exploring tabular data in conversational
search
• Includes both summary-level and sentence-level human assessments
• Available at https://github.com/iai-group/sigir2020-tablesum
51. Part III
Towards Automated End-to-end System
Evaluation
based on:
S. Zhang and K. Balog. Evaluating Conversational Recommender Systems via User Simulation. KDD ’20.
52. Conversational AI
(The Task-oriented / Interactive QA / Social chat comparison from slide 7, repeated.)
53. Motivation
• Evaluating conversational systems with real users is time-consuming and
expensive
• Can we simulate users (for the purpose of evaluation)?
• The specific problem is conversational item recommendation
• Help users find an item they like from a large collection of items
• Elicit preferences on specific items or categories of items
54. Why simulation?
• Test-collection based ("offline") evaluation
• Possible to create a reusable test collection for a specific subtask
• Limited to a single turn, does not measure overall user satisfaction
• Human evaluation
• Possible to annotate entire conversations
• Expensive, time-consuming, does not scale
55. Objectives
• Develop a user simulator that
• produces responses that a real user would give in a certain dialog situation
• enables automatic assessment of conversational agents
• makes no assumptions about the inner workings of conversational agents
• is data-driven (requires only a small annotated dialogue corpus)
56. Problem statement
• For a given system S and user population U, the goal of user simulation U* is to predict the performance of S when used by U, denoted M(S, U)
• For two systems S1 and S2, U* should be such that if M(S1, U) < M(S2, U), then M(S1, U*) < M(S2, U*) (see the sketch below)
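One simple way to check this condition across a set of systems (a sketch, not the paper's exact protocol) is to compute the rank correlation between the orderings induced by M(·, U) and M(·, U*); the scores below are the Reward values reported on slide 70 for real users and the QRFA-Single simulator:

```python
# Kendall's tau of 1.0 means the simulated users rank the systems exactly
# as the real users do.
from scipy.stats import kendalltau

real = [8.88, 7.56, 6.04]       # M(S_i, U) for systems A, B, C (real users)
simulated = [8.04, 7.41, 6.30]  # M(S_i, U*) for the same systems (simulated)

tau, _ = kendalltau(real, simulated)
print(f"Kendall's tau between system orderings: {tau:.2f}")  # 1.00 here
```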
58. Simulation framework
The simulated user consists of natural language understanding (NLU), an interaction model, a preference model, and natural language generation (NLG); it converses with the conversational agent (response generation).
• NLU: translating an agent utterance into a structured format
59. Simulation framework
• Interaction model: determining the next user action based on the understanding of the agent's utterance
60. Simulation framework
• NLG: turning a structured response representation into natural language
61. Modeling simulated users
• Model dialogue as a Markov Decision Process (MDP)
• Every MDP is formally described by a finite state space S, a finite action set A, and transition probabilities P
• Dialogue state s_t is the state of the dialogue manager at dialogue turn t
• Dialogue action a_t represents the user action that is being communicated in turn t
• Transition probabilities: the probability of transitioning from s_t to s_{t+1}
• Markov property (illustrated below): P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)
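A minimal sketch of this property in code; the states and actions are illustrative, not taken from the paper:

```python
# The next dialogue state is sampled from P(s_{t+1} | s_t, a_t) alone,
# ignoring all earlier history.
import random

transitions = {  # (state, action) -> [(next_state, probability), ...]
    ("elicit", "disclose"): [("recommend", 0.8), ("elicit", 0.2)],
    ("recommend", "note"): [("end", 0.6), ("recommend", 0.4)],
}

def next_state(state, action):
    states, probs = zip(*transitions[(state, action)])
    return random.choices(states, weights=probs, k=1)[0]

print(next_state("elicit", "disclose"))
```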
62. Agenda-based simulation [1]
• The action agenda A is a stack-like representation for user actions that is dynamically updated
• The next user action is selected from the top of the agenda
• The agenda is updated based on whether the agent understands the user action (by giving a corresponding response)
• Accomplished goal → pull operation
• Not accomplished → push replacement action(s) (sketched in code below)
Example (user goal: C = [type = film; genre = psychology; name = ["R..", …]], R = [director =; rating =]):
User: Hello, I am looking for a movie to watch.
Bot: Great, let's do this! Start by giving me ONE movie you like and some reasons why.
Agenda: disclose(type=film), disclose(name="R.."), disclose(genre=psy.), navigate(director), navigate(rating), note, complete
User: I like the remains of the day because I like psychological movies.
Bot: Got it. About to jump into lightspeed! I'll have your movies ready for you in a flash!
Agenda: disclose(name="I.."), disclose(genre=psy.), navigate(director), navigate(rating), note, complete
User: I like Requiem for a Dream.
Bot: I'm pretty solid on a bunch of things so far, but not on this request. Can you give a different movie?
Agenda: reveal(name), disclose(name="xx"), disclose(genre=psy.), navigate(director), navigate(rating), note, complete
1. Schatzmann et al. Agenda-based User Simulation for Bootstrapping a POMDP Dialogue System. NAACL 2007.
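To make the pull/push update concrete, here is a minimal sketch of an agenda stack; this is a simplification of the framework above, and the action strings are illustrative:

```python
# Agenda as a stack: pull the top action when the agent understood it,
# otherwise replace it with one or more replacement actions.

class Agenda:
    def __init__(self, actions):
        self.stack = list(actions)  # index 0 = top of the agenda

    def next_action(self):
        return self.stack[0]

    def update(self, goal_accomplished, replacements=()):
        if goal_accomplished:
            self.stack.pop(0)               # pull: the agent understood us
        else:
            self.stack[:1] = replacements   # push replacement action(s)

agenda = Agenda([
    "disclose(name='Requiem for a Dream')",
    "disclose(genre=psychological)",
    "navigate(director)", "navigate(rating)", "note", "complete",
])
# The agent fails to understand the movie name -> push a reveal action
# and a disclose of a different movie in its place.
agenda.update(goal_accomplished=False,
              replacements=["reveal(name)", "disclose(name='...')"])
print(agenda.next_action())  # reveal(name)
```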
63. Interaction model
• The interaction model defines how the agenda should be initialized (A_0) and updated (A_t → A_{t+1})
• Action space [2]:
• Disclose: "I would like to arrange a holiday in Italy"
• Reveal: "Actually, we need to go on the 3rd of May in the evening"
• Inquire: "What other regions in Europe are like that?"
• Navigate: "Which one is the cheapest option?"
• Note: "That hotel could be a possibility"
• Complete: "Thanks for the help, bye"
2. Azzopardi et al. Conceptualizing Agent-human Interactions during the Conversational Search Process. CAIR 2018.
64. Interaction model
• QRFA model [3]: user actions Query and Request; agent actions Feedback and Accept
• CIR6 model: transitions over the six user actions Disclose, Reveal, Inquire, Navigate, Note, and Complete
3. Vakulenko et al. QRFA: A Data-Driven Model of Information Seeking Dialogues. ECIR 2019.
65. Preference model
• The preference model is meant to capture individual differences and personal tastes
• Preferences are represented as a set of attribute-value pairs
• Single Item Preference
• Partially rooted in historical user behavior
• Offers limited consistency
• Personal Knowledge Graph (PKG)
• A PKG has two types of nodes: items and attributes
• Infers the rating for an attribute by considering the ratings of items that have that attribute
66. Personal Knowledge Graphs [4]
A personal knowledge graph (PKG) is a
resource of structured information
about entities that are of personal
interest to the user
Key differences from general KGs:
• Entities of personal interest to the user
• Distinctive shape (“spiderweb” layout)
• Links between a PKG and external
sources are inherent to its nature
4. Balog and Kenter. Personal Knowledge Graphs: A Research Agenda. ICTIR 2019.
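A minimal sketch of the attribute-level inference step described on slide 65; aggregating by a simple average is an assumption here, as the exact aggregation is not shown on the slides:

```python
# Infer a preference for an attribute (e.g., a genre) from the ratings of
# items in the PKG that have that attribute. Ratings and attributes below
# are illustrative.

item_ratings = {"The Remains of the Day": 1, "Requiem for a Dream": 1,
                "Fast & Furious": -1}
item_attributes = {"The Remains of the Day": {"psychological", "drama"},
                   "Requiem for a Dream": {"psychological", "drama"},
                   "Fast & Furious": {"action"}}

def attribute_preference(attribute):
    ratings = [r for item, r in item_ratings.items()
               if attribute in item_attributes[item]]
    return sum(ratings) / len(ratings) if ratings else 0.0

print(attribute_preference("psychological"))  # 1.0 -> the user likes the genre
```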
67. Grounding in actual data
• Historical movie viewing data (ratings)
• User preference profiles are initialized based
on a sample
• The rest of the watch history is used as held-out data
68. Evaluation architecture
• Three existing conversational movie recommenders (A, B, C) are compared using both real and simulated users
• Real users: crowdsourcing
• Simulated users:
• The preference model is initialized by sampling historical preferences of a real user from MovieLens data
• The interaction model is trained based on behaviors of real human users
• Both NLU and NLG use hand-crafted templates (a sketch follows below)
[Architecture: a conversation manager connects the user, simulated or human, with the conversational agent]
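As referenced above, a sketch of what template-based NLG for the simulated user might look like; the templates are illustrative, not the ones used in the paper:

```python
# Map a structured user action plus slot values to a natural language
# utterance via hand-crafted templates.

templates = {
    "disclose": "I like {value} because I like {genre} movies.",
    "reveal":   "The movie I meant is {value}.",
    "navigate": "Who is the director of {value}?",
    "note":     "That one sounds good, I'll check it out.",
    "complete": "Thanks, bye!",
}

def generate(action, **slots):
    return templates[action].format(**slots)

print(generate("disclose", value="Requiem for a Dream",
               genre="psychological"))
```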
69. Characteristics of conversations
• (RQ1) How well do our simulation techniques capture the characteristics of conversations?
• CIR6-PKG tends to have significantly shorter average conversation length, since it terminates the dialog as soon as the user finds a recommendation they like

Method | AvgTurns (A / B / C) | UserActRatio (A / B / C) | DS-KL (A / B / C)
Real users | 9.20 / 14.84 / 20.24 | 0.374 / 0.501 / 0.500 | — / — / —
QRFA-Single | 10.52 / 12.28 / 17.51 | 0.359 / 0.500 / 0.500 | 0.027 / 0.056 / 0.029
CIR6-Single | 9.44 / 12.75 / 15.92 | 0.382 / 0.500 / 0.500 | 0.055 / 0.040 / 0.025
CIR6-PKG | 6.16 / 9.87 / 10.56 | 0.371 / 0.500 / 0.500 | 0.075 / 0.056 / 0.095
70. Performance prediction
• (RQ2) How well does the relative ordering of systems according to some measure correlate when using real vs. simulated users?
• High correlation between automatic and human evaluations

Method | Reward | Success rate
Real users | A (8.88) > B (7.56) > C (6.04) | B (0.864) > A (0.833) > C (0.727)
QRFA-Single | A (8.04) > B (7.41) > C (6.30) | B (0.836) > A (0.774) > C (0.718)
CIR6-Single | A (8.64) > B (8.28) > C (6.01) | B (0.822) > A (0.807) > C (0.712)
CIR6-PKG | A (11.12) > B (10.65) > C (9.31) | A (0.870) > B (0.847) > C (0.784)

Performance of conversational agents using real vs. simulated users, in terms of Reward and Success Rate. We show the relative ordering of agents (A-C), with evaluation scores in parentheses.
71. Realisticity
• (RQ3) Do more sophisticated simulation approaches (i.e., more advanced interaction and preference modeling) lead to more realistic simulation?
• Human raters were asked to guess which of two dialogues was performed by a human
• Trying to "fool" the human evaluator (a "reverse" Turing test)
[Two anonymized dialogues shown side by side: one between the agent and a human user, one between the agent and a simulated user]
72. Realisticity
• (RQ3) Do more sophisticated simulation approaches (i.e., more advanced interaction and preference modeling) lead to more realistic simulation?
• Our interaction model (CIR6) and personal knowledge graphs for preference modeling both bring improvements

Method | A (Win/Loss/Tie) | B (Win/Loss/Tie) | C (Win/Loss/Tie) | All (Win/Loss/Tie)
QRFA-Single | 20/39/16 | 22/33/20 | 19/43/13 | 27% / 51% / 22%
CIR6-Single | 27/30/18 | 23/33/19 | 26/40/9 | 33% / 46% / 21%
CIR6-PKG | 22/39/14 | 27/29/19 | 32/25/18 | 36% / 41% / 23%
73. Further analysis
• We analyze the reasons given when crowd workers chose the real user's dialogue, and classify them as follows
Style
• Realisticity: how realistic or human-sounding a dialog is
• Engagement: involvement of the user in the conversation
• Emotion: expressions of feelings or emotions
Content
• Response: the user does not seem to understand the agent correctly
• Grammar: language usage, including spelling and punctuation
• Length: the length of the reply
74. Summary of contributions
• A general framework for evaluating conversational recommender agents
via simulation
• Interaction and preference models to better control the conversation flow
and to ensure the consistency of responses given by the simulated user
• Experimental comparison of three conversational movie recommender
agents, using both real and simulated users
• Analysis of comments collected from human evaluation, and identification
of areas for future development
76. Summary
• Time to move away from search engines to conversation engines
• Consider what really makes a conversational experience unsuccessful/successful
• Embrace a more holistic view which integrates and supports multiple user goals
• Progress needs to be made on evaluation methodology and resources
78. Questions?
Advertisements:
• ECIR 2022 in Stavanger, Norway (ecir2022.org will go live later this week)
• Simulation for IR Evaluation workshop at SIGIR 2021