"Graph Convolution for Multimodal Information Extraction from Visually Rich Documents" presented by Chloé Laurent (MLM Conseil)
1. Graph Convolution for Multimodal
Information Extraction from Visually
Rich Documents
Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao
Alibaba Group
Presented by Chloé Laurent
WiMLDS Paper Study Session - April 16th 2020
2. Abstract
How to extract pre-defined entities from
VRDs?
• Visually Rich Documents (VRDs): purchase
receipts, insurance policy documents,
customs declaration forms...
• Visual and layout information is essential for
document understanding: text serialized
into a classic one-dimensional sequence is not
enough.
• Introduction of a graph-convolution-based
model that combines textual and visual
information for information extraction.
3. Challenges of IE from VRDs
• How to effectively incorporate visual cues from the
document?
• What about the scalability of the task?
4. Contributions of this paper
• Computes graph embeddings for each
text segment with graph convolutions
• Graph embeddings are combined with
text embeddings to feed into a
standard BiLSTM-CRF for Information
Extraction
• This method (March 2019) outperforms BiLSTM-CRF baselines
on two real-world datasets.
(Figure callout: the parts of the model that differ from the baseline)
5. Information Extraction
• Process of extracting structured information from unstructured documents
• Recent progress in this area has mostly been made on plain-text documents
Shaolei Wang, Yue Zhang, Wanxiang Che, and Ting Liu. 2018. Joint extraction of entities and relations based on a novel graph scheme. In IJCAI, pages 4461–4467.
6. Document Modeling
• Text segments are generated by an Optical
Character Recognition (OCR) system
• Each text segment comprises
its position and the text
within it
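
As a rough illustration of this document model, the sketch below stores each OCR text segment as a bounding box plus its text. The `TextSegment` class, its field names, the `document_from_ocr` helper, and the sample receipt values are all hypothetical; the slides do not specify a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSegment:
    """One OCR text segment: its position (bounding box) and its text."""
    text: str
    x: float       # left edge of the bounding box (assumed pixel units)
    y: float       # top edge of the bounding box
    width: float
    height: float


def document_from_ocr(ocr_rows):
    """Build the list of graph nodes from (text, x, y, width, height) rows."""
    return [TextSegment(text, x, y, w, h) for text, x, y, w, h in ocr_rows]


# Made-up OCR output for a small receipt.
ocr_rows = [
    ("Invoice No. 12345", 40, 20, 200, 24),
    ("Total", 40, 300, 60, 24),
    ("12.50", 260, 300, 70, 24),
]
segments = document_from_ocr(ocr_rows)
print(len(segments), segments[2].text)
```

Each `TextSegment` then becomes one node of the document graph.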
7. Node Embedding
• Nodes represent text
segments
• Embedded using a single layer
of BiLSTM
8. Feature Extraction
• Edges represent
visual dependencies between
two nodes (relative shapes
and distance)
• Edge features:
  - horizontal and vertical distance between the two text boxes
  - aspect ratio of width and height of the two text boxes
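
The two visual cues on this slide can be turned into an edge feature vector along the following lines. This is a minimal sketch: the exact components, the choice of measuring distances between top-left corners, and the normalisation by the height of the first box are assumptions, and `edge_features` is a hypothetical helper name.

```python
def edge_features(box_i, box_j):
    """Edge features between two text boxes given as (x, y, width, height).

    Returns [horizontal distance, vertical distance,
             aspect ratio of box i, height of j relative to i,
             width of j relative to the height of i].
    """
    x_i, y_i, w_i, h_i = box_i
    x_j, y_j, w_j, h_j = box_j
    return [
        float(x_j - x_i),   # horizontal distance (assumed: between left edges)
        float(y_j - y_i),   # vertical distance (assumed: between top edges)
        w_i / h_i,          # aspect ratio of the current box
        h_j / h_i,          # relative height of the neighbor box
        w_j / h_i,          # relative width of the neighbor box
    ]


r_ij = edge_features((0, 0, 100, 20), (50, 40, 60, 10))
print(r_ij)
```

Dividing by the height of the current box keeps the shape features independent of the scan resolution, which is one plausible reason to prefer ratios over raw pixel sizes.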
9. Graph Convolution
• Convolution is defined on the node-edge-node triplets (t_i, r_ij, t_j) instead of on the node
alone
• For node t_i, the features h_ij of each neighbor t_j are extracted using a multi-layer perceptron
(MLP) network.
10. Node-Edge-Node triplet
• Combines visual features directly into
the neighbor representation
• The information of the current node is
copied across the neighbors
h_ij = MLP([t_i || r_ij || t_j]), where || is the concatenation operation
• The neighbor features can
potentially learn where to
attend given the current
node
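
The triplet construction can be sketched in plain Python as follows: the current node embedding, the edge features, and the neighbor embedding are concatenated and passed through a small MLP. The layer sizes, the ReLU activation, the random initialisation, and the helper names `init_mlp` and `triplet_features` are assumptions made for illustration only.

```python
import random


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialised two-layer MLP (untrained, for shape checking)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    return {"W1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "W2": matrix(out_dim, hidden_dim), "b2": [0.0] * out_dim}


def triplet_features(t_i, r_ij, t_j, mlp):
    """h_ij = MLP([t_i || r_ij || t_j]) for one node-edge-node triplet."""
    z = list(t_i) + list(r_ij) + list(t_j)          # concatenation
    hidden = [max(0.0, v) for v in linear(mlp["W1"], mlp["b1"], z)]  # ReLU
    return linear(mlp["W2"], mlp["b2"], hidden)


t_i = [0.1, 0.2, 0.3]                 # toy embedding of the current node
t_j = [0.4, 0.5, 0.6]                 # toy embedding of one neighbor
r_ij = [50.0, 40.0, 5.0, 0.5, 3.0]    # toy edge features
mlp = init_mlp(in_dim=11, hidden_dim=8, out_dim=4)
h_ij = triplet_features(t_i, r_ij, t_j, mlp)
print(len(h_ij))
```

Because t_i appears in every triplet of node i, its information is repeated across all of that node's neighbor features.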
11. Focus on Graph Convolution Networks
• This model defines convolution
directly on the graph to model
the text segment graph of VRDs
• Edges are explicitly embedded into the
graph convolution network,
which models the relationship
between nodes
Source: Zonghan Wu et al., 2019
12. Self-Attention Mechanism
• In this model, graph convolution
is defined based on the self-
attention mechanism.
• The output hidden
representation of each node is computed by
attending to its neighbors
• Outputs are fed as inputs to the
next layer of graph convolution.
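
A minimal sketch of attention over a node's neighbor features is given below. It assumes the common recipe of scoring each neighbor with a learned vector, applying LeakyReLU, normalising with a softmax, and taking the weighted sum followed by a ReLU; the slides do not spell out these details, and `w_a`, `attend`, and the toy values are hypothetical.

```python
import math


def leaky_relu(v, slope=0.2):
    return v if v > 0 else slope * v


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(neighbor_features, w_a):
    """Return (new node representation, attention weights).

    neighbor_features: list of h_ij vectors, one per neighbor j.
    w_a: attention vector with the same dimension as each h_ij.
    """
    scores = [leaky_relu(sum(w * h for w, h in zip(w_a, h_ij)))
              for h_ij in neighbor_features]
    alphas = softmax(scores)
    dim = len(neighbor_features[0])
    pooled = [sum(a * h_ij[k] for a, h_ij in zip(alphas, neighbor_features))
              for k in range(dim)]
    return [max(0.0, v) for v in pooled], alphas   # assumed ReLU activation


h_neighbors = [[1.0, 0.0], [0.0, 1.0]]
new_t_i, alphas = attend(h_neighbors, w_a=[1.0, 1.0])
print(new_t_i, alphas)
```

Stacking layers then amounts to feeding `new_t_i` back in as the node embedding for the next round of triplet features.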
13. BiLSTM-CRF with Graph Embeddings
x_i: input token sequence of the text segment
e(x_i): Word2Vec vectors used as token embeddings
t'_i: graph embedding of the node
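
How the graph embedding reaches the tagger can be sketched as follows, assuming that "combined" means concatenating the segment's graph embedding to every token embedding of that segment. The helper name `bilstm_inputs` and the toy vectors are hypothetical, and the BiLSTM-CRF itself is not shown.

```python
def bilstm_inputs(token_embeddings, graph_embedding):
    """Append the segment-level graph embedding to each token embedding."""
    return [list(e) + list(graph_embedding) for e in token_embeddings]


# Toy values: three tokens with 2-d Word2Vec-style vectors, 3-d graph embedding.
token_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
graph_embedding = [1.0, 2.0, 3.0]

inputs = bilstm_inputs(token_embeddings, graph_embedding)
print(len(inputs), len(inputs[0]))
```

Every token in a segment sees the same layout-aware vector, so the sequence tagger can condition its predictions on where the segment sits in the document.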
14. Training
• Custom annotation system to facilitate the labelling of ground truth
data
• Labelling of the values for each pre-defined entity and their locations
(bounding boxes)
• IOB tagging format; the label O is assigned to all tokens in empty text segments
• Graph convolution layers and BiLSTM-CRF extractors are trained
together
• Multi-task learning approach to improve prediction accuracy
(segment classification task)
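
The IOB labelling step can be illustrated with a short helper. This is a sketch under assumptions: `iob_tags` is a hypothetical function, the entity is given as a half-open token span, and the `PRICE` label and tokenisation are made up for the example.

```python
def iob_tags(tokens, entity_span=None, label=None):
    """Tag one text segment in IOB format.

    entity_span: (start, end) token indices, end exclusive, or None when
    the segment contains no entity (every token then gets the label O).
    """
    tags = ["O"] * len(tokens)
    if entity_span is None:
        return tags
    start, end = entity_span
    for k in range(start, end):
        prefix = "B-" if k == start else "I-"
        tags[k] = prefix + label
    return tags


print(iob_tags(["Total", ":", "12", ".", "50"], (2, 5), "PRICE"))
print(iob_tags(["Thank", "you"]))
```

The first call marks the price tokens with B-/I- tags; the second shows a segment without any entity, where every token is labelled O.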
15. Results
• Performs about as well as the baselines
on "simple" entities (invoice
number and date)
• Clearly outperforms the baselines on entities
that cannot be identified from
text alone (price, tax, buyer,
seller)
Datasets: Value-Added Invoices (Chinese) and International Purchase Receipts (English)
16. Thank you for listening!
All figures are extracted from the main article discussed during this talk,
except where mentioned otherwise
17. Sources
• Main article: https://arxiv.org/pdf/1903.11279v1.pdf
• Information Extraction: https://arxiv.org/pdf/1708.03743.pdf
• Graph Convolutional Network: https://tkipf.github.io/graph-convolutional-networks/
• Information Extraction from graphs: https://arxiv.org/pdf/1810.13083.pdf
• Graph Convolution Survey: https://arxiv.org/pdf/1901.00596.pdf
• Node classification by GCN: https://www.experoinc.com/post/node-classification-by-graph-convolutional-network
• Graph Embedding: https://www-cs.stanford.edu/people/jure/pubs/graphrepresentation-ieee17.pdf
• Similar approach: https://clgiles.ist.psu.edu/pubs/CVPR2017-connets.pdf
• Neural Architectures for Named Entity
Recognition: https://arxiv.org/pdf/1603.01360.pdf