"Graph Convolution for Multimodal Information Extraction from Visually Rich Documents" presented by Chloé Laurent (MLM Conseil)

Graph Convolution for Multimodal
Information Extraction from Visually
Rich Documents
Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao
Alibaba Group
Presented by Chloé Laurent
WiMLDS Paper Study Session - April 16th 2020

Abstract
How to extract pre-defined entities from
VRDs?
• Visually Rich Documents (VRDs): purchase
receipts, insurance policy documents,
custom declaration forms...
• Visual and layout information is essential for
document understanding: text serialized
into a classic one-dimensional sequence is not
enough.
• Introduction of a graph convolution based
model to combine textual and visual
information for information extraction.

Challenges of IE from VRDs
• How to effectively incorporate visual cues from the
document?
• What about the scalability of the task?

Contributions of this paper
• Computes graph embeddings for each
text segment with graph convolutions
• Graph embeddings are combined with
text embeddings to feed into a
standard BiLSTM-CRF for Information
Extraction
This method (March 2019)
outperforms BiLSTM-CRF baselines
on two real-world datasets.
Differ from baseline

Information Extraction
• Process of extracting structured information from unstructured documents
• Recent progress in this area has essentially been made on plain-text documents
Shaolei Wang, Yue Zhang, Wanxiang Che, and Ting Liu. 2018. Joint extraction of entities and relations based on a novel graph scheme. In IJCAI, pages 4461–4467.

Document Modeling
• Text segments are generated by an Optical
Character Recognition (OCR) system
• Each text segment comprises
its position and the text
within it
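
A minimal sketch of how one OCR text segment could be represented, assuming an axis-aligned bounding box given by its top-left corner, width, and height. The class name `TextSegment` and its field names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TextSegment:
    """One OCR text segment: the recognized text plus its bounding box."""
    text: str
    x: float       # left edge of the box (assumed convention)
    y: float       # top edge of the box (assumed convention)
    width: float
    height: float

    @property
    def center(self) -> Tuple[float, float]:
        """Center point of the bounding box."""
        return (self.x + self.width / 2, self.y + self.height / 2)


segment = TextSegment(text="Total", x=10, y=20, width=40, height=10)
print(segment.center)
```

A document is then simply a list of such segments, each of which becomes one node of the graph.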

Nodes Embedding
• Nodes represent text
segments
• Embedded using a single layer
of BiLSTM

Feature Extraction
• Edges represent visual dependencies between two nodes (relative shapes and distance)
• Horizontal and vertical distance between the two text boxes
• Aspect ratio of width and height of the two text boxes
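
The edge features listed above can be sketched as follows. This is an illustration only: the function name `edge_features`, the `(x, y, w, h)` box layout, the sign convention of the distances, and the choice to normalize the shape ratios by the height of the first box are all assumptions, not details confirmed by the slides.

```python
def edge_features(box_i, box_j):
    """Visual features of the edge from segment i to segment j.

    Each box is an (x, y, w, h) tuple. Returns
    [dx, dy, w_i / h_i, h_j / h_i, w_j / h_i] (assumed layout).
    """
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    if hi <= 0:
        raise ValueError("box_i must have a positive height")
    return [
        xj - xi,   # horizontal distance between the two boxes
        yj - yi,   # vertical distance between the two boxes
        wi / hi,   # aspect ratio of box i
        hj / hi,   # height of box j relative to box i
        wj / hi,   # width of box j relative to the height of box i
    ]


print(edge_features((0, 0, 20, 10), (30, 5, 40, 20)))
```

Dividing by the height of the current box keeps the shape features independent of the scan resolution, which is one plausible reason to prefer ratios over raw pixel sizes.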

Graph Convolution
• Convolution is defined on the node-edge-node triplets (ti, rij, tj) instead of on the node
alone
• For node ti, features hij for each neighbor tj are extracted using a multi-layer perceptron
(MLP) network.

Node-Edge-Node triplet
• Combines visual features directly into
the neighbor representation
• The information of the current node is
copied across the neighbors
where || is the concatenation operation
The neighbor features can
potentially learn where to
attend given the current
node
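
The formula that the "||" note belongs to was shown as an image on the slide and is not part of this text. A hedged reconstruction from the bullet points (an MLP applied to the concatenated triplet), where the function name g is an assumption used only for illustration:

```latex
h_{ij} = g(t_i, r_{ij}, t_j) = \mathrm{MLP}\left([\, t_i \,\|\, r_{ij} \,\|\, t_j \,]\right)
```

Here t_i and t_j are the node embeddings, r_ij is the edge embedding, and h_ij is the feature extracted for neighbor t_j of node t_i.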

Focus on Graph Convolution Networks
• This model applies convolution
directly on the graph to model
the text segment graph of VRDs
• Explicit edge embedding into the
graph convolution network,
which models the relationship
between nodes
Source: Zonghan Wu et al., 2019

Self-Attention Mechanism
• In this model, graph convolution
is defined based on the
self-attention mechanism.
• Compute the output hidden
representation of each node by
attending to its neighbors
• Outputs are fed as inputs to the
next layer of graph convolution.
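
A minimal sketch of attending over the neighbors of one node, in plain Python. It assumes that each neighbor feature h_ij is scored with a learned vector, that the scores pass through a LeakyReLU and a softmax, and that the weighted sum goes through a ReLU. The names `attend` and `w_a`, the LeakyReLU slope, and the choice of ReLU are illustrative assumptions.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attend(neighbor_features, w_a):
    """Return (attention weights, new node representation).

    neighbor_features: list of h_ij vectors, one per neighbor.
    w_a: attention weight vector of the same length as each h_ij.
    """
    if not neighbor_features:
        raise ValueError("a node needs at least one neighbor")
    # One scalar score per neighbor.
    scores = [
        leaky_relu(sum(w * h for w, h in zip(w_a, h_ij)))
        for h_ij in neighbor_features
    ]
    # Numerically stable softmax over the neighbors.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Attention-weighted sum of neighbor features, then ReLU (assumption).
    dim = len(neighbor_features[0])
    pooled = [
        sum(a * h[k] for a, h in zip(alphas, neighbor_features))
        for k in range(dim)
    ]
    return alphas, [max(0.0, v) for v in pooled]


alphas, new_t = attend([[1.0, 0.0], [1.0, 0.0]], w_a=[1.0, 1.0])
print(alphas, new_t)
```

Stacking layers then amounts to recomputing the h_ij from the new node representations and calling `attend` again.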

BiLSTM-CRF with Graph Embeddings
xi : Input token sequence of text segment
e(xi) : Word2Vec vectors as token embeddings
t'i: Graph embedding of the node
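
A short sketch of how the graph embedding could be combined with the token embeddings before the BiLSTM-CRF, assuming the combination is a plain concatenation of t'i onto every token vector of the segment. The function name `combine_embeddings` is hypothetical.

```python
def combine_embeddings(token_embeddings, graph_embedding):
    """Append the segment's graph embedding t'_i to each token embedding e(x_i)."""
    return [list(e) + list(graph_embedding) for e in token_embeddings]


tokens = [[0.1, 0.2], [0.3, 0.4]]   # two tokens, 2-dimensional embeddings
graph = [9.0]                        # 1-dimensional graph embedding
print(combine_embeddings(tokens, graph))
```

Every token of a segment receives the same graph embedding, so the sequence tagger sees the layout context at each step.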

Training
• Custom annotation system to facilitate the labelling of ground truth
data
• Labelling of the values for each pre-defined entity and their locations
(bounding boxes)
• IOB tagging format; the label O is assigned to all tokens in empty text segments
• Graph convolution layers and BiLSTM-CRF extractors are trained
together
• Multi-task learning approach to improve prediction accuracy
(segment classification task)
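
The IOB labelling mentioned above can be illustrated with a small helper. The function name `iob_tags`, the span format `(start, end_exclusive, label)`, and the entity labels are hypothetical and chosen only for this example.

```python
def iob_tags(tokens, entity_spans):
    """Tag tokens in IOB format.

    entity_spans: list of (start, end_exclusive, label) token spans.
    Tokens outside every span get the label "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"invalid span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for k in range(start + 1, end):
            tags[k] = f"I-{label}"          # remaining tokens of the entity
    return tags


print(iob_tags(["ACME", "Trading", "Ltd"], [(0, 3, "SELLER")]))
print(iob_tags(["Thank", "you"], []))       # segment without any entity
```

A segment that contains no entity ends up with "O" on every token, which matches the rule for empty text segments.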

Results
• Performs on par with the baselines
on "simple" entities (invoice
number and date)
• Clearly outperforms them on entities
which cannot be represented by
text alone (price, tax, buyer,
seller)
Datasets: Value-Added Invoices (Chinese), International Purchase Receipts (English)

Thank you for listening!
All figures are extracted from the main article discussed during this talk,
except where otherwise mentioned

Sources
• Main article: https://arxiv.org/pdf/1903.11279v1.pdf
• Information Extraction: https://arxiv.org/pdf/1708.03743.pdf
• Graph Convolutional Network: https://tkipf.github.io/graph-convolutional-networks/
• Information Extraction from graphs: https://arxiv.org/pdf/1810.13083.pdf
• Graph Convolution Survey: https://arxiv.org/pdf/1901.00596.pdf
• Node classification by GCN: https://www.experoinc.com/post/node-classification-by-graph-convolutional-network
• Graph Embedding: https://www-cs.stanford.edu/people/jure/pubs/graphrepresentation-ieee17.pdf
• Similar approach: https://clgiles.ist.psu.edu/pubs/CVPR2017-connets.pdf
• Neural Architectures for Named Entity Recognition: https://arxiv.org/pdf/1603.01360.pdf
