This document summarizes and compares concept models for cross-language information retrieval, covering both explicit and latent models. It describes explicit models such as Explicit Semantic Analysis (ESA), which use concepts defined in external resources, and latent models such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), which derive concepts from unlabeled text. Evaluating these models on two datasets, it finds that LSA outperforms LDA and that CL-ESA achieves results comparable to LSA. It concludes that explicit models such as CL-ESA can perform as well as or better than latent models for cross-language retrieval.
1. Digital Enterprise Research Institute www.deri.ie
Explicit vs. Latent Concept Models for Cross-Language
Information Retrieval
Nitish Aggarwal
DERI, NUI Galway
firstname.lastname@deri.org
Tuesday, 26th June, 2012
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
DERI, Reading Group
Enabling Networked Knowledge
2. Based On:
Title:
“Explicit vs. Latent Concept Models for Cross-Language
Information Retrieval”
Authors:
Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg,
Steffen Staab
Published:
International Joint Conference on Artificial Intelligence, 2009
3. Overview
Introduction
Cross-lingual information retrieval (CLIR)
Concept Model
Explicit Semantics
Latent Semantics
Evaluation
Conclusion
4. Introduction: CLIR
Cross Lingual Information Retrieval
Many documents and web sites are written in different languages
Retrieve all information without a language barrier
Query and documents are in different languages
5. Introduction: CLIR
CLIR based on Machine Translation
Translate queries or documents
Reduces the problem to monolingual retrieval
– Issues:
– MT is not available for all language pairs
– Increases vocabulary mismatch
6. Introduction: CLIR
Interlingua or concept-based CLIR
Use a language-independent representation
– Map all queries and documents, in any language, into a concept space
– Define a concept space and a relevance function
7. Concept Model
Document in concept space
Di = {t1, t2, t3, …, tn}
Each token ti is mapped into the concept space
– Association with every concept
Composite semantics of all tokens
– Σ ti , Π ti
Types of concept model
Explicit
Latent / implicit
8. Concept Model: Explicit
Intuition: define concepts from external resources
Definition of concepts
– Wikipedia articles, tagged web pages
Cover a broad range of vocabulary and languages
Example
Wikipedia-based Explicit Semantic Analysis (ESA)
9. Concept Model: ESA
Explicit concept space
Di = {t1, t2, t3, …, tn}
ti = {w1a1 + w2a2 + … + wnan} (a weighted vector over Wikipedia articles ai)
Composite semantics of all tokens
– Σ ti
[Figure: a query and documents mapped to concepts such as "University", "Student", "Education"]
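The mapping above can be sketched in a few lines. This is a minimal, illustrative sketch of the ESA idea: the inverted index below is a toy stand-in (real ESA uses TF-IDF weights of terms over Wikipedia articles), and the concept names and weights are assumptions, not data from the paper.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> {Wikipedia concept: weight}.
# Real ESA derives these weights via TF-IDF over Wikipedia article text.
INDEX = {
    "university": {"Education": 2.0, "Student": 1.5},
    "lecture":    {"Education": 1.0, "Student": 0.5},
}

def esa_vector(tokens):
    """Composite semantics of all tokens: the sum of their concept vectors (Σ ti)."""
    vec = defaultdict(float)
    for t in tokens:
        for concept, weight in INDEX.get(t, {}).items():
            vec[concept] += weight
    return dict(vec)

print(esa_vector(["university", "lecture"]))  # {'Education': 3.0, 'Student': 2.0}
```

A document and a query are each mapped this way, and their relatedness is then measured in the shared concept space rather than in the word space.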
10. Cross-lingual ESA
Extension of ESA
Use Wikipedia cross-language links
Linked articles define same concepts in different languages
Per-language inverted index over shared concept URIs:
EN: Word1 → w1·URI1 + w2·URI2 + … + wn·URIn … Wordn → w1·URI1 + … + wn·URIn
DE: Word1 → w1·URI1 + w2·URI2 + … + wn·URIn … Wordn → w1·URI1 + … + wn·URIn
ES: Word1 → w1·URI1 + w2·URI2 + … + wn·URIn … Wordn → w1·URI1 + … + wn·URIn
Semantic vectors in the shared space:
Term@en → w11·URI1 + w12·URI2 + … + w1n·URIn
Term@de → w11·URI1 + w12·URI2 + … + w1n·URIn
Vectors are compared with cosine relatedness
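The cosine relatedness step can be sketched as follows. This assumes both texts have already been mapped to vectors over the same concept URIs (which is exactly what the cross-language links provide); the URIs and weights here are illustrative, not taken from the paper.

```python
import math

# Hypothetical concept vectors for an English and a German text, both
# expressed over shared Wikipedia concept URIs via cross-language links.
en_doc = {"URI:University": 1.0, "URI:Student": 0.5}
de_doc = {"URI:University": 0.75, "URI:Student": 1.0}

def cosine(u, v):
    """Cosine relatedness of two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because the vectors live in the same URI space, the same function scores any language pair without translation.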
11. Concept Model: Latent
Intuition: semantic space of latent concepts
Definition of latent concepts
– Clusters of similar things define a latent concept
Latent Concept1 (Food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching
Latent Concept2 (Animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster
"Look at this cute hamster munching on a piece of broccoli"
(40% Latent Concept1 and 60%Latent Concept2)
12. Concept Model: Latent
[Figure: latent concepts (LC1, LC2, LC3) are derived from a training corpus; queries and documents are then mapped into this latent concept space]
13. Latent Semantic Analysis (LSA)
Definition
Dimensionality reduction to find latent concepts
Approach
Build a term-document matrix M
Perform singular value decomposition (SVD) on M
Approximate M by keeping the top N singular values
– The N singular values reflect N different latent concepts
– U defines the term-concept correlations
– V defines the document-concept correlations
Cross-lingual LSA
Use a parallel corpus
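The steps above can be sketched with NumPy. This is a toy monolingual example with an illustrative 3×3 matrix; for cross-lingual LSA the paper trains on a parallel corpus instead (conceptually, stacking the term rows of both languages over aligned document columns).

```python
import numpy as np

# Toy term-document matrix M (rows: terms, columns: documents).
M = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

# Singular value decomposition: M = U · diag(s) · Vt.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

N = 2  # number of latent concepts to keep
U_N, s_N, Vt_N = U[:, :N], s[:N], Vt[:N, :]
# U_N: term-concept correlations; Vt_N.T: document-concept correlations.

# Rank-N approximation of M; queries and documents are compared
# in this N-dimensional latent concept space.
M_approx = U_N @ np.diag(s_N) @ Vt_N
```

Keeping only the top N singular values discards the low-variance directions, which is what collapses near-synonymous terms onto the same latent concept.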
14. Latent Dirichlet Allocation (LDA)
Definition
Generative model
– Each document is a mixture of latent concepts (topics)
– Each topic generates words; the parameters are learned from the corpus
Approach
Topic distributions are given a Dirichlet prior
Fit corpus- and document-level parameters using a variational
Expectation Maximization (EM) procedure
Cross-lingual LDA
Use a parallel corpus
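The fitting step can be sketched with scikit-learn, whose LDA implementation uses variational inference as described above. The document-term counts below are an illustrative toy corpus, not data from the paper.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (rows: documents, columns: terms).
X = np.array([
    [4, 3, 0, 0],  # mostly "food" terms
    [0, 0, 5, 2],  # mostly "animal" terms
    [2, 1, 2, 2],  # a mixed document
])

# Fit a 2-topic model; fit_transform returns each document's topic mixture.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # rows are distributions that sum to 1
```

Each row of `doc_topics` is the per-document topic distribution, i.e. the "40% Concept1 / 60% Concept2" kind of mixture from the earlier example.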
15. Evaluation
Parallel corpora
All documents are translated into many languages
Relevance assessment
Use documents in one language as queries to retrieve documents
in another language
Translated document = relevant document
– No manual relevance assessment is needed
Measures used
Mean reciprocal rank (MRR)
Average score over all language pairs
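The MRR measure used above is simple to state in code: for each query, take the reciprocal of the rank at which the relevant (i.e. translated) document appears, and average over queries. A minimal sketch:

```python
def mean_reciprocal_rank(ranks):
    """Mean reciprocal rank.

    `ranks` holds, per query, the 1-based rank at which the relevant
    (translated) document was retrieved, or None if it was not retrieved.
    """
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 2, 4, None]))  # (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```

In the paper's setup this score is then averaged over all language pairs to compare the models.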
16. Evaluation: Datasets
Multilingual corpora
Multext corpus
– 3,066 Q/A pairs from the Official Journal of the European Community
JRC-Acquis corpus
– 21,000 legislative documents of the European Union
– 3,000 randomly selected documents were used as queries
Set up
English, German and French documents were used
Split dataset for latent topic extraction
– 60% learning, 40% testing
17. Evaluation: Datasets
Wikipedia
Snapshot
– 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
– Collection of 166,484 articles
CL-ESA: Use cross-language links to align concepts across
languages
LSA/LDA: Wikipedia as parallel corpus
– Use it as training corpus for latent concepts extraction
18. Evaluation: Parameter
Cross-lingual ESA
Problem
– Too many concepts
Solution
– Use only the m highest concept weights
LSI/LDA
Problem
– Computational costs increase with number of topics
Solution
– Use fixed number of latent topics
21. Conclusion
Parameter tuning
ESA performs well for m = 10,000
A maximum of 500 topics was tested for LSI
– Not maximal performance, but results appear to converge
Results
LSA performs better than LDA
CL-ESA and LSA achieve comparable results
– Explicit vs. latent
Explicit models can perform as well as or better than latent models