Digital Enterprise Research Institute                                                   www.deri.ie




          Explicit vs. Latent Concept Models for Cross-
                 Language Information Retrieval

                                                  Nitish Aggarwal
                                                 DERI, NUI Galway
                                           firstname.lastname@deri.org




 Tuesday, 26th June, 2012
 Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
 DERI, Reading Group
                                                                  Enabling Networked Knowledge
Based On:




            Title:
                   “Explicit vs. Latent Concept Models for Cross-Language
                    Information Retrieval”

            Authors:
                   Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg,
                    Steffen Staab

            Published:
                    International Joint Conference on Artificial Intelligence, 2009




Overview




            Introduction
                   Cross lingual information retrieval (CLIR)
            Concept Model
                   Explicit Semantics
                   Latent Semantics
            Evaluation
            Conclusion




Introduction: CLIR




            Cross Lingual Information Retrieval
                   Many documents, web sites
                    are written in different languages


                   Retrieve all information without
                    a language barrier


                   Query and documents are in different
                     languages




Introduction: CLIR




            CLIR based on Machine Translation
                   Translation of queries or documents
                   Reduces the problem to monolingual retrieval
                       – Issues:
                               – MT is not available for all language pairs
                               – Increases vocabulary mismatch




Introduction: CLIR




      Interlingua or concept based
            Use a language-independent representation
                – Map all queries and documents in different languages to a concept space
                – Define a concept space and a relevance function




            [Figure: language-independent representation]



Concept Model

            Document in concept space
                   Di = {t1, t2, t3, …, tn}
                   ti in space
                       – Association with every concept
                   Composite semantics of all tokens
                       – Σti, Πti


            Types of concept model
                   Explicit
                   Latent/implicit

            [Figure: token ti placed in a space spanned by the concepts C1, C2, C3]


Concept Model: Explicit

            Intuition: define concepts from external resources
                   Definition of concepts
                       – Wikipedia articles, tagged web pages
                   Covers a broad range of vocabulary and languages
            Example
                   Wikipedia-based Explicit Semantic Analysis (ESA)




Concept Model: ESA

            Explicit concept space
                   Di = {t1, t2, t3, …, tn}
                   ti = w1a1 + w2a2 + … + wnan
                   Composite semantics of all tokens
                       – Σti

            [Figure: query and docs placed in a concept space with axes University, Student, Education]
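
A minimal sketch of how an ESA-style document vector could be composed, assuming a tiny hand-made table of term-to-concept weights. The table `TERM_CONCEPTS`, its weights, and the function name `esa_vector` are illustrative assumptions, not values or code from the paper. Each token maps to a weighted vector over concepts, and the document vector is the sum over its tokens (the Σti composition).

```python
from collections import defaultdict

# Hypothetical association weights between terms and Wikipedia-style concepts.
# In a real ESA index these weights would be derived from the article texts.
TERM_CONCEPTS = {
    "student": {"University": 0.6, "Student": 0.9, "Education": 0.5},
    "lecture": {"University": 0.7, "Education": 0.8},
    "exam": {"Student": 0.4, "Education": 0.6},
}

def esa_vector(tokens):
    """Sum the concept vectors of all known tokens (the sum-over-tokens composition)."""
    doc = defaultdict(float)
    for tok in tokens:
        # Tokens missing from the table contribute nothing.
        for concept, weight in TERM_CONCEPTS.get(tok, {}).items():
            doc[concept] += weight
    return dict(doc)

vec = esa_vector(["student", "lecture", "unknown"])
print(vec)
```

The result is a sparse vector keyed by concept name, so queries and documents can be compared directly in concept space.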


Cross-lingual ESA

            Extension of ESA
                   Use Wikipedia cross-language links
                   Linked articles define the same concepts in different languages

            [Figure: per-language inverted indexes (EN, DE, ES) map each word to a weighted concept vector
                         Word_i → w1*URI1 + w2*URI2 + … + wn*URIn
                     Term@en and Term@de are each mapped to such a vector
                         Term → w11*URI1 + w12*URI2 + … + w1n*URIn
                     and the cosine of the two vectors gives their semantic relatedness]
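
The cross-lingual step can be sketched as follows, under stated assumptions: the link table `CROSS_LINKS`, the article titles, and the weights are made up for illustration, and the helper names are hypothetical. Language-specific concept vectors are re-keyed by a shared concept id via cross-language links and then compared with the cosine.

```python
import math

# Hypothetical cross-language links: (language, article title) -> shared concept id.
CROSS_LINKS = {
    ("en", "University"): "C1", ("de", "Universität"): "C1",
    ("en", "Student"): "C2", ("de", "Student"): "C2",
    ("en", "Bread"): "C3", ("de", "Brot"): "C3",
}

def to_shared_space(lang, vector):
    """Re-key a language-specific concept vector by language-independent concept ids."""
    shared = {}
    for article, weight in vector.items():
        concept = CROSS_LINKS.get((lang, article))
        if concept is not None:  # articles without a cross-language link are dropped
            shared[concept] = shared.get(concept, 0.0) + weight
    return shared

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query_en = to_shared_space("en", {"University": 0.8, "Student": 0.6})
doc_de = to_shared_space("de", {"Universität": 0.8, "Student": 0.6})
other_de = to_shared_space("de", {"Brot": 1.0})

print(cosine(query_en, doc_de), cosine(query_en, other_de))
```

In this toy setup the German document about the same concepts scores close to 1 against the English query, while the unrelated one scores 0.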




Concept Model: Latent

            Intuition: semantic space of latent concepts
                   Definition of latent concepts
                       – A cluster of similar things defines a latent concept


                   Latent Concept 1 (food)          Latent Concept 2 (animals)
                       30% broccoli                     20% chinchillas
                       15% bananas                      20% kittens
                       10% breakfast                    20% cute
                       10% munching                     15% hamster

                   "Look at this cute hamster munching on a piece of broccoli"
                       (40% Latent Concept 1 and 60% Latent Concept 2)




Concept Model: Latent




            [Figure: latent concepts LC1, LC2, LC3 are derived from a training corpus;
                     the query and docs are then placed in the space spanned by LC1, LC2, LC3]




Latent Semantic Analysis (LSA)

            Definition
                   Dimensionality reduction to find latent concepts
            Approach
                   Build term-document matrix M
                   Perform singular value decomposition (SVD) on M: M = U Σ V^T


                   Approximate M by keeping the top N singular values
                       – N singular values reflect N different latent concepts
                       – U defines term-concept correlation
                       – V defines document-concept correlation
            Cross-lingual LSA
                   Use parallel corpus
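
The LSA steps above can be sketched on a toy matrix. This is only an illustration under assumptions: the term-document counts are invented, and a plain power iteration with deflation is swapped in for a library SVD routine so that the example needs nothing beyond the standard library. The function name `top_singular_triplets` is hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular_triplets(M, n, iters=200):
    """Approximate the n largest singular triplets (sigma, u, v) of M."""
    residual = [row[:] for row in M]
    triplets = []
    for _ in range(n):
        Rt = transpose(residual)
        v = [1.0 / (j + 1) for j in range(len(Rt))]  # fixed, non-degenerate start vector
        for _ in range(iters):
            v = matvec(Rt, matvec(residual, v))      # one power-iteration step on R^T R
            norm = math.sqrt(sum(x * x for x in v))
            if norm == 0.0:
                break
            v = [x / norm for x in v]
        u = matvec(residual, v)
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma == 0.0:
            break
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before searching for the next one.
        residual = [[residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
                    for i in range(len(u))]
    return triplets

# Hypothetical term-document counts: rows are terms, columns are documents.
M = [
    [2.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
]
triplets = top_singular_triplets(M, n=2)
for sigma, u, v in triplets:
    print(round(sigma, 4), [round(x, 3) for x in v])
```

Each triplet corresponds to one latent concept: `u` holds the term-concept weights and `v` the document-concept weights. For the cross-lingual case, one common construction is to build M from concatenated parallel documents so that terms of both languages share the same latent space.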


Latent Dirichlet Allocation (LDA)


            Definition
                   Generative model
                       – Each document is a mixture of latent concepts (topics)
                       – Topics generate the words of a document; the parameters are learned from the corpus


            Approach
                   Topic distribution is assumed to have a Dirichlet prior
                   Fit corpus- and document-level parameters using a variational
                    Expectation Maximization (EM) procedure


            Cross-lingual LDA
                   Use parallel corpus
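
A small sketch of the generative story behind LDA, assuming two fixed, hand-written topics. In actual LDA the topic-word distributions are learned (for example with the variational EM procedure named above); this example only samples from the model and does no inference. The names `TOPICS`, `sample_dirichlet`, and `generate_document` are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha_k, 1) samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topic-word distributions (learned from data in real LDA).
TOPICS = {
    "food": {"broccoli": 0.5, "bananas": 0.3, "munching": 0.2},
    "animals": {"hamster": 0.4, "kittens": 0.3, "cute": 0.3},
}

def generate_document(n_words, alpha, rng):
    """Sample a topic mixture, then a topic and a word for each position."""
    names = list(TOPICS)
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=theta)[0]  # pick a topic for this word
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
theta, words = generate_document(8, alpha=[0.5, 0.5], rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and topic-word distributions that most plausibly produced them.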



Evaluation




            Parallel corpora
                   All documents are translated into many languages


            Relevance assessment
                   Use documents in one language as queries to retrieve documents
                    in another language
                   Translated document = relevant document
                       – No manual relevance assessment is needed


            Measures used
                   Mean reciprocal rank (MRR)
                   Average score over all language pairs
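
A minimal sketch of the MRR measure in this mate-retrieval setup, where each query document has exactly one relevant document (its translation). The document ids and rankings below are invented for illustration, and the function name is an assumption.

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: query id -> ranked list of doc ids; relevant: query id -> its single relevant doc id."""
    total = 0.0
    for qid, ranked in rankings.items():
        target = relevant[qid]
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
        # A translation that is not retrieved at all contributes 0.
    return total / len(rankings) if rankings else 0.0

# Hypothetical English queries with ranked German results.
rankings = {
    "q1@en": ["d1@de", "d7@de", "d3@de"],  # translation found at rank 1
    "q2@en": ["d9@de", "d2@de", "d4@de"],  # translation found at rank 2
    "q3@en": ["d5@de", "d6@de", "d8@de"],  # translation not retrieved
}
relevant = {"q1@en": "d1@de", "q2@en": "d2@de", "q3@en": "d3@de"}

mrr = mean_reciprocal_rank(rankings, relevant)
print(mrr)
```

Averaging this value over every ordered language pair would give the single score per method described above.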

Evaluation: Datasets


            Multilingual corpora
                   Multext Corpus
                       – 3,066 Q/A pairs from the Official Journal of the European Community
                   JRC-Acquis Corpus
                       – 21,000 legislative documents of the European Union
                       – 3,000 randomly selected documents used as queries



            Set up
                   English, German and French documents were used
                   Split dataset for latent topic extraction
                       – 60% learning, 40% testing




Evaluation: Datasets




            Wikipedia
                   Snapshot
                       – 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
                       – Collection of 166,484 articles



                   CL-ESA: Use cross-language links for concepts in different
                    languages


                   LSA/LDA: Wikipedia as parallel corpus
                       – Use it as training corpus for latent concept extraction




Evaluation: Parameter




            Cross-lingual ESA
                   Problem
                       – Too many concepts
                   Solution
                       – Only use the m highest values of each concept vector


            LSI/LDA
                   Problem
                       – Computational costs increase with number of topics
                   Solution
                       – Use fixed number of latent topics
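
The truncation used for cross-lingual ESA can be sketched as keeping only the m largest entries of a sparse concept vector. The example vector and the function name `truncate_top_m` are illustrative assumptions.

```python
import heapq

def truncate_top_m(vector, m):
    """Keep only the m highest-weighted concepts of a sparse concept vector."""
    if m <= 0:
        return {}
    top = heapq.nlargest(m, vector.items(), key=lambda kv: kv[1])
    return dict(top)

# Hypothetical concept weights for one document.
full = {"University": 0.9, "Bread": 0.1, "Student": 0.5, "Education": 0.3}
pruned = truncate_top_m(full, 2)
print(pruned)
```

Smaller values of m make vectors cheaper to store and compare, at the risk of discarding weakly associated but still relevant concepts.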




Evaluation: Results



            Multext Dataset




Evaluation: Results



            JRC-Acquis Dataset




Conclusion



            Parameter tuning
                   ESA performs well for m = 10,000
                   Maximum of 500 topics tested for LSI
                       – Not maximal performance, but seems to converge


            Results
                   LSA performs better than LDA
                   Comparable results for CL-ESA and LSA
                       – Explicit vs. implicit
                   Explicit model performs better than latent models




"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 

Recently uploaded (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 

Cross-Language Info Retrieval Models

  • 1. Digital Enterprise Research Institute, www.deri.ie
      Explicit vs. Latent Concept Models for Cross-Language Information Retrieval
      Nitish Aggarwal, DERI, NUI Galway, firstname.lastname@deri.org
      DERI Reading Group, Tuesday, 26th June, 2012
      Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
      Enabling Networked Knowledge
  • 2. Based On:
      - Title: “Explicit vs. Latent Concept Models for Cross-Language Information Retrieval”
      - Authors: Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, Steffen Staab
      - Published: International Joint Conference on Artificial Intelligence, 2009
  • 3. Overview
      - Introduction
          - Cross-lingual information retrieval (CLIR)
      - Concept Model
          - Explicit Semantics
          - Latent Semantics
      - Evaluation
      - Conclusion
  • 4. Introduction: CLIR
      - Cross-Lingual Information Retrieval
          - Many documents and web sites are written in different languages
          - Retrieve all information without a language barrier
          - Query and documents are in different languages
  • 5. Introduction: CLIR
      - CLIR based on Machine Translation (MT)
          - Translation of queries or documents
          - Reduces the problem to monolingual retrieval
          - Issues:
              - MT is not available for all language pairs
              - Translation increases vocabulary mismatch
  • 6. Introduction: CLIR
      - Interlingua or concept based
          - Use a language-independent representation
              - Map all queries and documents in different languages to a concept space
              - Define a concept space and a relevance function
      [Figure: language-independent representation]
  • 7. Concept Model
      - Document in concept space
          - Di = {t1, t2, t3 … tn}
          - ti in space
              - Association with every concept
          - Composite semantics of all tokens
              - Σti, Πti
      - Types of concept model
          - Explicit
          - Latent/implicit
      [Figure: token ti in a concept space with axes C1, C2, C3]
  • 8. Concept Model: Explicit
      - Intuition: define concepts from external resources
          - Definition of concepts
              - Wikipedia articles, tagged web pages
          - Cover a broad range of vocabulary and language
      - Example
          - Wikipedia-based Explicit Semantic Analysis (ESA)
  • 9. Concept Model: ESA
      - Explicit concept space
          - Di = {t1, t2, t3 … tn}
          - ti = {w1a1 + w2a2 … + wnan}
          - Composite semantics of all tokens
              - Σti
      [Figure: query and docs placed in a concept space with axes University, Student, Education]
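
The ESA composition on slide 9 (each token ti is a weighted combination of articles, and a document is the sum Σti) can be sketched roughly as follows. The toy index `TERM_CONCEPTS`, its weights, and the function name `esa_vector` are hypothetical illustrations, not taken from the slides; real ESA derives the weights from Wikipedia article text.

```python
from collections import Counter

# Hypothetical toy index: term -> {concept (article title): weight}.
# In real ESA these weights come from Wikipedia article text; here they are made up.
TERM_CONCEPTS = {
    "student": {"University": 0.8, "Education": 0.6},
    "lecture": {"University": 0.5, "Education": 0.7},
    "exam": {"Education": 0.9, "Student": 0.4},
}


def esa_vector(tokens):
    """Sum the concept vectors of all known tokens (the sum-of-t_i composition)."""
    doc = Counter()
    for tok in tokens:
        # Tokens missing from the index contribute nothing.
        for concept, weight in TERM_CONCEPTS.get(tok, {}).items():
            doc[concept] += weight
    return dict(doc)


vec = esa_vector(["student", "exam", "unknown"])
```

Here `vec` holds one weight per concept that any token touched, so a query and a document can be compared in the same concept space.
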
  • 10. Cross-lingual ESA
      - Extension of ESA
          - Use Wikipedia cross-language links
          - Linked articles define the same concepts in different languages
      [Figure: per-language inverted indexes (EN, DE, ES) map each word, Word1 … Wordn, to a weighted concept vector w1*URI1 + w2*URI2 … wn*URIn. Term@en and Term@de are each mapped to a vector w11*URI1 + w12*URI2 … w1n*URIn, and semantic relatedness is the cosine of the two vectors.]
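
A minimal sketch of the cross-lingual step on slide 10, assuming concept vectors are already available per language: a German vector is re-keyed onto English concept identifiers through cross-language links and then compared by cosine. The link table `DE_TO_EN`, the example vectors, and both function names are hypothetical.

```python
import math

# Hypothetical cross-language links: German article title -> English article title.
DE_TO_EN = {"Universität": "University", "Bildung": "Education"}


def to_interlingual(vec, links):
    """Re-key a concept vector onto shared concept ids; unlinked concepts are dropped."""
    out = {}
    for concept, weight in vec.items():
        if concept in links:
            shared = links[concept]
            out[shared] = out.get(shared, 0.0) + weight
    return out


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query_en = {"University": 1.0, "Education": 1.0}
doc_de = {"Universität": 2.0, "Bildung": 2.0, "Mensa": 1.0}
score = cosine(query_en, to_interlingual(doc_de, DE_TO_EN))
```

Dropping unlinked concepts is one possible design choice for this sketch; the slides do not say how articles without a cross-language link are handled.
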
  • 11. Concept Model: Latent
      - Intuition: semantic space of latent concepts
          - Definition of latent concepts
              - A cluster of similar things defines a latent concept
      - Latent Concept 1 (food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching
      - Latent Concept 2 (animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster
      - Example: “Look at this cute hamster munching on a piece of broccoli” (40% Latent Concept 1 and 60% Latent Concept 2)
  • 12. Concept Model: Latent
      [Figure: latent concepts LC1, LC2, LC3 are derived from a training corpus; docs and query are represented in the space of LC1, LC2, LC3]
  • 13. Latent Semantic Analysis (LSA)
      - Definition
          - Dimensionality reduction to find latent concepts
      - Approach
          - Build a term-document matrix M
          - Perform singular value decomposition (SVD) on M
          - Approximate M by taking the top N singular values
              - N singular values reflect N different latent concepts
              - U defines term-concept correlation
              - V defines document-concept correlation
      - Cross-lingual LSA
          - Use a parallel corpus
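
The LSA steps on slide 13 can be sketched with NumPy as below. This is an illustration under stated assumptions: the matrix values are made up, the function names `lsa` and `fold_in` are hypothetical, and the fold-in step (projecting a new term vector into concept space) is a common LSA convention that the slides do not describe.

```python
import numpy as np

# Toy term-document matrix M (rows: terms, columns: documents); values are made up.
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])


def lsa(matrix, n):
    """Truncated SVD: keep the top n singular values and their vectors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :n], s[:n], Vt[:n, :]


# U_n: term-concept matrix, Vt_n: concept-document matrix, s_n: concept strengths.
U_n, s_n, Vt_n = lsa(M, 2)

# Rank-2 approximation of M built from the two strongest latent concepts.
M_approx = U_n @ np.diag(s_n) @ Vt_n


def fold_in(term_counts):
    """Project a new term-count vector into the latent concept space."""
    return (U_n.T @ term_counts) / s_n


q = fold_in(np.array([1.0, 1.0, 0.0, 0.0]))
```

For the cross-lingual variant, one would presumably build M from a parallel corpus so that terms of both languages share the same document columns; that step is not shown here.
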
  • 14. Latent Dirichlet Allocation (LDA)
      - Definition
          - Generative model
              - Documents are mixtures of latent concepts (topics)
              - Topics generate the words of a document; the parameters are learned from the corpus
      - Approach
          - Topic distribution is assumed to have a Dirichlet prior
          - Fit corpus- and document-level properties using a variational Expectation Maximization (EM) procedure
      - Cross-lingual LDA
          - Use a parallel corpus
  • 15. Evaluation
      - Parallel corpora
          - All documents are translated into many languages
      - Relevance assessment
          - Use documents in one language as queries to retrieve documents in another language
          - Translated document = relevant document
              - No manual relevance assessment is needed
      - Measures used
          - Mean reciprocal rank (MRR)
          - Average score over all language pairs
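
Since each query on slide 15 has exactly one relevant document (its translation), mean reciprocal rank reduces to averaging 1/rank of that document. A small sketch, with hypothetical document ids and a hypothetical function name:

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: query id -> ranked list of doc ids.
    relevant: query id -> the single relevant doc id (the translation).
    A query whose relevant document is not retrieved contributes 0."""
    total = 0.0
    for qid, ranked in rankings.items():
        target = relevant[qid]
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)


# Made-up rankings: the translation is found at ranks 1, 2 and 4.
rankings = {
    "q1": ["d1_fr", "d7_fr"],
    "q2": ["d9_fr", "d2_fr"],
    "q3": ["d5_fr", "d6_fr", "d8_fr", "d3_fr"],
}
relevant = {"q1": "d1_fr", "q2": "d2_fr", "q3": "d3_fr"}
mrr = mean_reciprocal_rank(rankings, relevant)
```

The score for one language pair would then be averaged with the scores of the other pairs, as the slide's last bullet indicates.
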
  • 16. Evaluation: Datasets
      - Multilingual corpora
          - Multext corpus
              - 3066 Q/A pairs from the Official Journal of the European Community
          - JRC-Acquis corpus
              - 21,000 legislative documents of the European Union
              - We randomly selected 3,000 documents as queries
      - Set up
          - English, German and French documents were used
          - Split dataset for latent topic extraction
              - 60% learning, 40% testing
  • 17. Evaluation: Datasets
      - Wikipedia
          - Snapshot
              - 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
              - Collection of 166,484 articles
          - CL-ESA: use cross-language links for concepts in different languages
          - LSA/LDA: Wikipedia as parallel corpus
              - Use it as training corpus for latent concept extraction
  • 18. Evaluation: Parameter
      - Cross-lingual ESA
          - Problem
              - Too many concepts
          - Solution
              - Only use the highest m values
      - LSI/LDA
          - Problem
              - Computational costs increase with the number of topics
          - Solution
              - Use a fixed number of latent topics
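
The CL-ESA pruning on slide 18 (only use the highest m values) might look like the following sketch, assuming a concept vector stored as a dict. The function name `keep_top_m` and the example weights are hypothetical.

```python
import heapq


def keep_top_m(vec, m):
    """Keep only the m concepts with the highest weights."""
    return dict(heapq.nlargest(m, vec.items(), key=lambda kv: kv[1]))


full = {"University": 0.9, "Education": 0.7, "Banana": 0.05, "Student": 0.4}
pruned = keep_top_m(full, 2)
```

Truncating in this way keeps vectors sparse, which bounds the cost of the later cosine comparison by m rather than by the total number of concepts.
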
  • 19. Evaluation: Results
      - Multext dataset
  • 20. Evaluation: Results
      - JRC-Acquis dataset
  • 21. Conclusion
      - Parameter tuning
          - ESA performs well for m = 10,000
          - A maximum of 500 topics was tested for LSI
              - Not maximal performance, but seems to converge
      - Results
          - LSA performs better than LDA
          - Comparable results for CL-ESA and LSA
              - Explicit vs. implicit
          - The explicit model performs better than the latent model