AI for human communication is about recognizing, parsing, understanding, and generating natural language. The concept of natural language itself is evolving. A key focus is the analysis, interpretation, and generation of verbal and written language. Other language focus areas include haptic, sonic, and visual language, data, and interaction.
This research deck summarizes information from the Forrester Digital Transformation Conference in May 2017. It compiles selected copy and visuals from conference presentations and recent Forrester research reports. Contents are organized into the following sections:
▪ Digital transformation
Overview of AI for human communication
• Natural language processing (NLP) is the confluence of artificial intelligence (AI) and linguistics.
• A key focus is the analysis, interpretation, and generation of verbal and written language.
• Other language focus areas include audible and visual language, data, and interaction.
• Formal programming languages enable computers to process natural language and other types of data.
• Symbolic reasoning employs rules and logic to frame arguments, make inferences, and draw conclusions.
• Machine learning (ML) is an area of AI and NLP that solves problems using statistical techniques, large data sets, and probabilistic reasoning.
• Deep learning (DL) is a type of machine learning that uses layered artificial neural networks.
[Diagram: AI for human communication. Deep learning sits within machine learning, within artificial intelligence. Around human communication: natural language processing (NLP | NLU | NLG); interaction (dialog, gesture, emotion, haptic); audible language (speech, sound); visual language (2D/3D/4D); written language (verbal, text); formal language processing; symbolic reasoning; data.]
Text analytics
Text mining is the discovery by computer of new, previously
unknown information, by automatically extracting it from
different written resources. A key element is the linking
together of the extracted information to form new
facts or new hypotheses to be explored further by more
conventional means of experimentation.
Text analytics is the investigation of concepts, connections,
patterns, correlations, and trends discovered in written
sources. Text analytics examines linguistic structure and applies statistical, semantic, and machine-learning techniques to discern entities (names, dates, places, terms) and their attributes, as well as relationships, concepts, and even sentiments. It extracts these 'features' to databases or semantic stores for further analysis, automates classification and processing of source documents, and exploits visualization for exploratory analysis.
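As a concrete illustration, here is a minimal entity-extraction sketch in Python using the open-source spaCy library (spaCy and the en_core_web_sm model are not named in the deck; they stand in for any NER-capable toolkit):

```python
# Minimal sketch of entity extraction with spaCy (an assumption, not the
# deck's toolkit). Assumes the small English model has been installed via:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a support center in Austin on March 3, 2017.")

# Discern entities (names, dates, places) and their types
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., Apple ORG; Austin GPE; March 3, 2017 DATE
```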
IM messages, email, call center logs, customer service survey
results, claims forms, corporate documents, blogs, message
boards, and websites are providing companies with enormous
quantities of unstructured data — data that is information-rich
but typically difficult to get at in a usable way.
Text analytics goes beyond search to turn documents and
messages into data. It extends Business Intelligence (BI) and
data mining and brings analytical power to content
management. Together, these complementary technologies
have the potential to turn knowledge management into
knowledge analytics.
Toward understanding diagrams using recurrent networks and deep learning
Source: AI2
Diagrams are rich and diverse. The top row depicts inter-class variability of visual illustrations; the bottom row shows intra-class variation for the water cycle category.
[Diagram: candidate relationships are encoded as relationship feature vectors, passed through fully connected layers into a stacked LSTM, which emits add / no-change decisions that build up the diagram parse graph.]
Architecture for inferring DPGs from diagrams. The LSTM-based network exploits global constraints such as overlap, coverage, and layout to select a subset of relations among thousands of candidates to construct a DPG.
Sample question answering results (the left column is the diagram; the second column shows the answer chosen; the third shows the nodes and edges in the DPG that Dqa-Net decided to attend to, indicated by red highlights):
• The diagram depicts the life cycle of: a) frog 0.924, b) bird 0.02, c) insecticide 0.054, d) insect 0.002
• How many stages of growth does the diagram feature? a) 4 0.924, b) 2 0.02, c) 3 0.054, d) 1 0.002
• What comes before second feed? a) digestion 0.0, b) first feed 0.15, c) indigestion 0.0, d) oviposition 0.85
Diagrams represent complex concepts, relationships and events, often
when it would be difficult to portray the same information with natural
images. Diagram Parse Graphs (DPG) model the structure of diagrams.
RNN+LSTM-based syntactic parsing of diagrams learns to infer DPGs.
Adding a DPG-based attention model enables semantic interpretation and
reasoning for diagram question answering.
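A toy sketch of the idea in PyTorch (this is not AI2's code; dimensions, layer sizes, and names are invented): candidate-relationship feature vectors pass through fully connected layers into a stacked LSTM, which emits an add / no-change decision per candidate.

```python
# Toy sketch of a stacked-LSTM relation selector; all sizes are made up.
import torch
import torch.nn as nn

class RelationSelector(nn.Module):
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())  # FC layers
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)  # stacked LSTM
        self.decide = nn.Linear(hidden, 2)  # per-step logits: add vs. no change

    def forward(self, candidates):            # (batch, T, feat_dim)
        h, _ = self.lstm(self.encode(candidates))
        return self.decide(h)                  # (batch, T, 2)

scores = RelationSelector()(torch.randn(1, 10, 32))  # 10 candidate relations
print(scores.shape)  # torch.Size([1, 10, 2])
```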
Summarization
[Diagram: a taxonomy of summaries.
• Input document — source size (single-document, multi-document); specificity (domain-specific, general); genre; scale.
• Purpose — audience (generic, query-oriented); usage (indicative, informative); expansiveness (background, just-the-news).
• Output document — derivation (extract, abstract); partiality (neutral, evaluative); conventionality (fixed, floating); form.]
Summarization classification
Automatic summarization is the process of shortening a text
document with software, in order to create a summary with the
major points of the original document. Genres of summary
include:
• Single-document vs. multi-document source — based on one
text vs. fusing together many texts. E.g., for multi-document
summaries we may want one summary with common
information, or similarities and differences among documents,
or support and opposition to specific ideas and concepts.
• Generic vs. query-oriented — provides author’s view vs.
reflects user’s interest.
• Indicative vs. informative — what’s it about (quick
categorization) vs. substitute for reading it (content
processing).
• Background vs. just-the-news — assumes reader’s prior
knowledge is poor vs. up-to-date.
• Extract vs. abstract — lists fragments of text vs. re-phrases
content coherently.
[Diagram: extractive summarization pipeline. Input document(s) → pre-processing (normalizer, segmenter, stemmer, stop-word eliminator) yields a list of sentences and a list of pre-processed words for each sentence → processing (clustering, learning, scoring) yields a list of clusters and sentence scores, given a summary size → extraction takes the highest-scoring sentences and reorders them → summary.]
Extractive summarization process
• Preprocessing reads and cleans up data (including removal of stop words, numbers, punctuation, and short words; stemming; lemmatization), and builds the document-term matrix.
• Processing vectorizes and scores sentences, which may entail heuristic, statistical, linguistic, graph-based, and machine learning methods.
• Extraction selects, orders, and stitches together the highest-scoring sentences, and presents the summary (a toy end-to-end sketch follows).
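A toy end-to-end version of these three stages in plain Python (the stop list and the frequency-based scoring rule are deliberate simplifications of the methods listed above):

```python
# Toy extractive summarizer: preprocess, score, extract.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}

def summarize(text, k=2):
    # Preprocess: split into sentences, lowercase, drop stop words
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = [[w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
                 for s in sentences]
    # Process: score each sentence by the corpus frequency of its words
    freq = Counter(w for words in tokenized for w in words)
    scores = [sum(freq[w] for w in words) / max(len(words), 1) for words in tokenized]
    # Extract: take the k highest-scoring sentences, reorder by original position
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
    return " ".join(sentences[i] for i in top)
```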
[Diagram: truncated SVD of the term-document matrix: a 6×4 terms-by-documents matrix ≈ (6×4 terms-by-topics matrix) × (4×4 diagonal topic-importance matrix) × (4×4 topics-by-documents matrix).]
Latent semantic analysis
• LSA is a technique of
distributional semantics for
analyzing relationships
between a set of documents
and the terms they contain by
producing a set of concepts
related to the documents and
terms.
• LSA finds smaller (lower-rank)
matrices that closely
approximate the document-
term matrix by picking the
highest assignments for each
word to topic, and each topic
to document, and dropping
the ones not of interest.
• The contexts in which a certain word exists or does not exist determine the similarity of the documents. (A scikit-learn sketch follows.)
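A minimal LSA sketch with scikit-learn (the corpus is invented; TruncatedSVD performs the lower-rank approximation described above):

```python
# Minimal LSA: TF-IDF document-term matrix, then truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats purr and meow", "dogs bark loudly",
        "kittens meow at cats", "puppies and dogs bark"]
dtm = TfidfVectorizer().fit_transform(docs)   # documents x terms
lsa = TruncatedSVD(n_components=2)            # keep 2 latent topics
doc_topics = lsa.fit_transform(dtm)           # documents x topics
print(doc_topics)  # similar docs (cat docs vs. dog docs) land near each other
```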
Source: Andrius Knispelis, ISSUU
[Diagram: LDA in plate notation. α is a parameter that sets the prior on the per-document topic distributions; β is a parameter that sets the prior on the per-topic word distributions. Θ is the topic distribution for document i; Z is the topic for the j'th word in document i; W is the observed word. The inner plate repeats over the N words of a document, the outer plate over the M documents.]
LATENT DIRICHLET ALLOCATION
A topic model developed by David Blei, Andrew Ng, and Michael Jordan in 2003. It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.
[Diagram: a words × documents document-term matrix (e.g., gensim's tfidf.mm and wordids.txt files) is factored into a words × topics topic model (model.lda).]
Source: Andrius Knispelis, ISSUU
Preprocess the data:
Text corpus depends on the
application domain.
It should be contextualised since the
window of context will determine
what words are considered to be
related.
The only observable features for the
model are words. Experiment with
various stoplists to make sure only
the right ones are getting in.
The training corpus can be different from the documents the model will be scored on. A good general-purpose corpus is Wikipedia.
Train the model:
The key parameter is the number of topics. Again, this depends on the domain.
Other parameters are alpha and beta.
You can leave them aside to begin
with and only tune later.
A good place to start is gensim, a free Python library.
Score it on new documents:
The goal of the model is not to label
documents, but rather to give them a
unique fingerprint so that they can be
compared to each other in a
humanlike fashion.
Evaluate the performance:
Evaluation depends on the
application.
Use Jensen-Shannon Distance as
similarity metric.
Evaluation should show whether the
model captures the right aspects
compared to a human.
Also it will show what distance
threshold is still being perceived as
similar enough.
Use perplexity to see if your model is
representative of the documents
you’re scoring it on.
LDA process
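A minimal sketch of this process with gensim, the library named above (the corpus, topic count, and parameters are toy values):

```python
# Minimal gensim LDA: preprocess, train, score a new document.
from gensim import corpora, models

texts = [["cat", "meow", "kitten"], ["dog", "bark", "puppy"],
         ["kitten", "cat", "purr"], ["puppy", "dog", "leash"]]  # tokenized, stoplisted
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      alpha="auto", passes=10)   # alpha can be tuned later

new_doc = dictionary.doc2bow(["cat", "purr"])
print(lda.get_document_topics(new_doc))          # the document's topic fingerprint
```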
LDA topic modeling process
[Diagram: preprocessing (tokenization, lemmatization, stop-word removal) → vector space model (dictionaries, bag-of-words) → LDA (tuning parameters) → topics and their words.]
Step 1: Select β
• The term distribution β is determined for each topic by β ∼ Dirichlet(δ).
Step 2: Select α
• The proportions θ of the topic distribution for the document w are determined by θ ∼ Dirichlet(α).
Step 3: Iterate
• For each of the N words w_i:
- (a) Choose a topic z_i ∼ Multinomial(θ).
- (b) Choose a word w_i from a multinomial probability distribution conditioned on the topic z_i: p(w_i | z_i, β).
* β is the term distribution of topics and contains the probability of a word occurring in a given topic.
* The process is purely based on frequency and co-occurrence of words.
• Clean documents of as much noise as possible, for example:
- Lowercase all the text
- Replace all special characters and do n-gram tokenizing
- Lemmatize: reduce words to their root form, e.g., "reviews" and "reviewing" to "review"
- Remove numbers (e.g., "2017") and remove HTML tags and symbols
• Create the document-term matrix, dictionaries, and corpus of bag-of-words
• Pass through the LDA algorithm and evaluate (a NumPy sketch of the generative steps above follows)
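A sketch of the three generative steps above in NumPy (vocabulary size, topic count, and priors are toy values):

```python
# Simulate LDA's generative process for one document.
import numpy as np

rng = np.random.default_rng(0)
V, K, N = 6, 2, 8            # vocabulary size, topics, words per document
delta, alpha = 0.5, 0.5

beta = rng.dirichlet([delta] * V, size=K)  # Step 1: per-topic word distributions
theta = rng.dirichlet([alpha] * K)         # Step 2: this document's topic proportions

doc = []
for _ in range(N):                         # Step 3: for each word w_i
    z = rng.choice(K, p=theta)             # (a) choose topic z_i ~ Multinomial(theta)
    doc.append(rng.choice(V, p=beta[z]))   # (b) choose word w_i ~ p(w_i | z_i, beta)
print(doc)                                 # word ids for one synthetic document
```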
• Correlated topic model — CTM allows topics to be correlated, leading to better
prediction and more robustness to overfitting.
• Dynamic topic model — DTM models how each individual topic changes over
time.
• Supervised LDA — sLDA associates an external variable with each document,
which defines a one-to-one correspondence between latent topics and user tags.
• Relational topic model — RTM predicts which documents a new document is
likely to be linked to. (E.g., tracking activities on Facebook in order to predict a
reaction to an advertisement.)
• Hierarchical topic model — HTM draws the relationship between one topic and
another (which LDA does not) and indicates the level of abstraction of a topic
(which CTM correlation does not).
• Structural topic model — STM provides fast, transparent, replicable analyses that
require few a priori assumptions about the texts under study. STM includes
covariates of interest. Unlike LDA, topics can be correlated and each document
has its own prior distribution over topics, defined by covariate X rather than
sharing a mean, allowing word use within a topic to vary by covariate U.
Advanced topic modeling techniques
Query-focused multi-document summarization
[Diagram: query-focused multi-document summarization pipeline. Input docs → sentence segmentation (all sentences from documents) → sentence simplification (all sentences plus simplified versions) → content selection via sentence extraction (LLR, MMR), guided by the query → extracted sentences → information ordering → sentence realization → summary.]
• Multi-document summarization aims to capture the important information of a set of documents related to the same topic and present it in a brief, representative, and pertinent summary.
• Query-driven summarization encodes criteria as search specs. The user needs only certain types of information (e.g., I know what I want! — don't confuse me with drivel!). The system processes specs top-down to filter or analyze text portions. Templates or frames order information and shape presentation of the summary. (A sketch of MMR selection follows.)
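A sketch of MMR (maximal marginal relevance) content selection in Python: greedily pick sentences that are relevant to the query but not redundant with the summary so far. The `similarity` argument is a placeholder for, e.g., cosine similarity over TF-IDF vectors; LLR scoring is not shown.

```python
# Greedy MMR selection over candidate sentences.
def mmr_select(query, sentences, similarity, k=3, lam=0.7):
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def mmr_score(s):
            relevance = similarity(s, query)
            # Penalize similarity to anything already selected
            redundancy = max((similarity(s, t) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```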
Source: A Beginner’s Guide to Recurrent Networks and LSTMs
Long short-term memory (LSTM)
• Long short-term memory (LSTM) empowers an RNN with longer-term recall. This allows the model to make more context-aware predictions.
• LSTM has gates that act as differentiable RAM memory. Access
to memory cells is guarded by “read”, “write” and “erase” gates.
• Starting from the bottom of the diagram, the triple arrows show
where information flows into the cell at multiple points. That
combination of present input and past cell state is fed into the
cell itself, and also to each of its three gates, which will decide
how the input will be handled.
• The black dots are the gates themselves, which determine
respectively whether to let new input in, erase the present cell
state, and/or let that state impact the network’s output at the
present time step. S_c is the current state of the memory cell,
and g_y_in is the current input to it. Remember that each gate
can be open or shut, and they will recombine their open and
shut states at each step. The cell can forget its state, or not; be
written to, or not; and be read from, or not, at each time step,
and those flows are represented here. (A NumPy sketch of one gated step follows.)
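One LSTM time step in NumPy, showing the gates described above (weights are random stand-ins, not a trained model):

```python
# One LSTM step: input ("write"), forget ("erase"), and output ("read") gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # W, U, b stack the parameters for all three gates plus the candidate
    # cell update into 4*d rows.
    z = W @ x + U @ h + b
    d = c.shape[0]
    i = sigmoid(z[0:d])        # "write" gate: let new input in?
    f = sigmoid(z[d:2*d])      # "erase"/forget gate: keep the old state?
    o = sigmoid(z[2*d:3*d])    # "read" gate: expose the state as output?
    g = np.tanh(z[3*d:4*d])    # candidate update from present input + past state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

d, n = 4, 3                    # hidden size, input size
rng = np.random.default_rng(0)
h, c = lstm_step(rng.normal(size=n), np.zeros(d), np.zeros(d),
                 rng.normal(size=(4*d, n)), rng.normal(size=(4*d, d)),
                 np.zeros(4*d))
```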
Symbolic methods
• Declarative languages (logic)
• Imperative languages: C, C++, Java, etc.
• Hybrid languages (Prolog)
• Rules — theorem provers, expert systems
• Frames — case-based reasoning, model-based reasoning
• Semantic networks, ontologies
• Facts, propositions
Symbolic methods can find information by inference and can explain their answers.
Non-symbolic methods
• Neural networks — knowledge encoded in the weights of the neural network, for embeddings, thought vectors
• Genetic algorithms
• Graphical models — Bayesian reasoning
• Support vector machines
Neural knowledge representation is mainly about perception; the issue is its lack of common sense (there is a lot of inference involved in everyday human reasoning).
Knowledge representation and reasoning
Knowledge representation and reasoning asks:
• What does any agent—human, animal, electronic, mechanical—need to know to behave intelligently?
• What computational mechanisms allow this knowledge to be manipulated?
[Diagram: the core NLG engine is wrapped by successive layers of rules: core engine ruleset, vertical ruleset, client ruleset.]
Source: Arria
NLG rulesets
• Core ruleset — general purpose rules used in
almost every application of the NLG engine. These
capture knowledge about data processing and
linguistic communication in general, independent
of the particular domain of application.
• Vertical ruleset — rules encoding knowledge
about the specific industry vertical or domain in
which the NLG engine is being used. Industry
vertical rulesets are constantly being refined via
ongoing development, embodying knowledge
about data processing and linguistic
communication, which is common to different
clients in the same vertical.
• Client ruleset — rules that are specific to the client for whom the NLG engine is being configured. These rules embody the particular expertise in data processing and linguistic communication that is unique to a client application. (A toy sketch of this layering follows.)
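A toy illustration of the layering idea, not Arria's implementation (rule names and values are invented): lookups resolve client-first, then vertical, then core, so the most specific ruleset wins.

```python
# Toy layered-ruleset resolution: client overrides vertical overrides core.
core = {"round_numbers": True, "tone": "neutral"}
vertical = {"tone": "clinical", "units": "mmol/L"}  # e.g., a healthcare vertical
client = {"tone": "reassuring"}                      # one client's preference

def resolve(rule):
    for ruleset in (client, vertical, core):         # most specific first
        if rule in ruleset:
            return ruleset[rule]
    raise KeyError(rule)

print(resolve("tone"))   # "reassuring": the client rule wins
print(resolve("units"))  # "mmol/L": falls back to the vertical ruleset
```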
Summarization, and algorithms to make text quantifiable, allow us to
derive insights from large amounts of unstructured text data.
Unstructured text has been slower to yield to the kinds of analysis that
many businesses are starting to take for granted.
We are beginning to gain the ability to do remarkable things with
unstructured text data.
First, the use of neural networks and deep learning for text offers the ability
to build models that go beyond just counting words to actually representing
the concepts and meaning in text quantitatively.
These examples start simple and eventually demonstrate the breakthrough
capabilities realized by the application of sentence embedding and
recurrent neural networks to capturing the semantic meaning of text.
Machine Learning
Machine Learning is a type of Artificial Intelligence that provides
computers with the ability to learn without being explicitly
programmed.
[Diagram: labeled data → machine learning algorithm (training) → learned model; new data → learned model → prediction.]
Machine learning provides various techniques that can learn from and make predictions on data.
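The definition above in code: a minimal scikit-learn train-and-predict loop (the data set is invented):

```python
# Labeled data -> learning algorithm -> learned model -> prediction.
from sklearn.linear_model import LogisticRegression

X_train = [[1, 0], [0, 1], [1, 1], [0, 0]]  # labeled data (features)
y_train = [1, 0, 1, 0]                       # labels

model = LogisticRegression().fit(X_train, y_train)  # training yields a learned model
print(model.predict([[1, 0]]))                       # prediction on new data
```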
Source: Narrative Science
Machine learning
Source: Lukas Masuch
Deep Learning
Architecture
A deep neural network consists of a hierarchy of layers, whereby each layer
transforms the input data into more abstract representations (e.g. edge ->
nose -> face). The output layer combines those features to make predictions.
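A minimal deep network in PyTorch matching this description (layer sizes are arbitrary):

```python
# Each Linear+ReLU layer re-represents its input more abstractly;
# the output layer combines those features into predictions.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),  # raw pixels -> low-level features (edges)
    nn.Linear(128, 64), nn.ReLU(),   # low-level -> higher-level features
    nn.Linear(64, 10),               # output layer: class scores
)
print(net(torch.randn(1, 784)).shape)  # torch.Size([1, 10])
```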
Source: Narrative Science
Deep learning
Source: Lukas Masuch
Source: Narrative Science
Why deep learning
for NLP?
Deep Learning in NLP
Syntax Parsing
SyntaxNet (Parsey McParseface) tags each word with a part-of-speech tag and determines the syntactic relationships between words in the sentence, with 94% accuracy compared to human performance of 96%.
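SyntaxNet itself is not required to try this kind of parsing: spaCy, a different open-source library, exposes the same sort of output, a part-of-speech tag and a syntactic head and relation per word (the en_core_web_sm model is an assumption):

```python
# Part-of-speech tags and dependency relations with spaCy (not SyntaxNet).
import spacy

nlp = spacy.load("en_core_web_sm")
for tok in nlp("Deep learning parses natural language sentences."):
    print(tok.text, tok.pos_, tok.dep_, tok.head.text)  # word, POS, relation, head
```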
Source: Narrative Science
Deep learning can be
used to parse syntax
of natural language
sentences.
Source: Lukas Masuch
Deep Learning in NLP
Generating Text
To train the RNN, insert characters sequentially and
predict the probabilities of the next letter.
Backpropagate error and update RNN’s weights to
increase the confidence of the correct letter (green)
and decrease the confidence of all other letters (red).
Trained on structured Wikipedia markdown, the network learns to spell English words completely from scratch and to copy general syntactic structures.
Source: Narrative Science
Deep learning networks
can learn to spell
correctly and generate
texts with appropriate
syntactic structures.
Source: Lukas Masuch
Deep Learning in NLP
Generating Text
To generate text, we feed a character into the trained RNN and get a distribution
over what characters are likely to come next (red = likely). We sample from this
distribution, and feed it right back in to get the next letter.
This highlighted neuron gets very excited (green = excited, blue = not excited) when
the RNN is inside the [[ ]] markdown environment and turns off outside of it.
The RNN is likely using this neuron to remember if it is inside a URL or not.
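A sketch of this sampling loop in NumPy, with a random matrix standing in for the trained RNN (the vocabulary and weights are invented):

```python
# Sample-and-feed-back loop for character generation.
import numpy as np

rng = np.random.default_rng(0)
vocab = list("ab[] ")                           # toy character set
W = rng.normal(size=(len(vocab), len(vocab)))   # stand-in for trained weights

def step(char_idx):                  # stand-in for one trained-RNN step
    logits = W[char_idx]
    p = np.exp(logits - logits.max())
    return p / p.sum()               # distribution over next characters

idx, out = 0, []
for _ in range(20):
    idx = rng.choice(len(vocab), p=step(idx))  # sample the next character...
    out.append(vocab[idx])                      # ...and feed it right back in
print("".join(out))
```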
Source: Narrative Science
How deep learning
generates text from
character-level
embeddings.
Source: Lukas Masuch
Deep Learning in NLP
Neural Machine Translation
Google’s Neural Machine Translation system, based on a deep LSTM seq2seq architecture, reduces translation errors by 55% to 85% compared to traditional approaches.
Source: Narrative Science
Deep learning neural
machine translations are
increasingly accurate.
Source: Lukas Masuch
Deep Learning in NLP
Neural Conversational Model
Neural Chatbot: Predicts the next sentence given the previous
sentences in a conversation.
The LSTM remembers facts, understands contexts and performs
common sense reasoning in the trained domain.
Human: what is the purpose of life ?
Machine: to serve the greater good .
Human: what is the purpose of living ?
Machine: to live forever
Human: what is the purpose of existence?
Machine: to find out what happens when
we get to the planet earth
Human: where are you now ?
Machine: i ’m in the middle of nowhere .
Human: what is the purpose of dying ?
Machine: to have a life
Human: what is the purpose of being
intelligent ?
Machine: to find out what it is.
Human: what is the purpose of emotions ?
Machine: I don ’t know .
Philosophical Q&A – Network trained with OpenSubtitles (movie subtitles)
Source: Narrative Science
Deep learning LSTM
neural networks are
being used to generate
human-machine
conversations.
Source: Lukas Masuch
Source: Narrative Science
Toward multi-modal deep
learning and language
generation