Text categorization with Lucene and Solr

Tommaso Teofili
tommaso [at] apache [dot] org

About me

• ASF member having fun with:
  - Lucene / Solr
  - Hama
  - UIMA
  - Stanbol
  - … some others
• SW engineer @ Adobe R&D

Agenda

• Classification
• Lucene classification module
• Solr text categorization services
• Conclusions

Classification

• Let the algorithm assign one or more labels (classes) to some item,
  given some previous knowledge
  - Spam filter
  - Tagging system
  - Digit recognition system
  - Text categorization
  - etc.

Classification? Why with Lucene?

The short story

• Lucene already has a lot of features for common information retrieval needs
  - Postings
  - Term vectors
  - Statistics
  - Positions
  - TF / IDF
  - maybe Payloads
  - etc.
• We can avoid bringing in new components for classification by just
  leveraging what we get for free from Lucene

The (slightly) longer story #1

• While playing with NLP stuff
• Needed to implement a Naïve Bayes classifier
  - Not possible to plug in anything that required touching the architecture
  - Not really interested in (near) real time performance
• Iteration 1
  - Plain in-memory Java stuff
• Iteration 2
  - Same stuff, but using Lucene instead of loading things into memory
    - Much faster :)

The (slightly) longer story #2

• So I realized
  - Lucene stores so many features you can take advantage of for free
  - Therefore writing the classification algorithm is relatively simple
  - In many cases you’re just not adding anything to the architecture
    - Your Lucene index was already there for searching
  - The Lucene index is, to some extent, already a model which we just need
    to “query” with the proper algorithm
  - And it is fast enough

Lucene classification module

• Work in progress on trunk
• LUCENE-4345
  - Establishing a classification API
  - Currently with two implementations
    - Naïve Bayes
    - k-nearest neighbor

Lucene classification module

• Classifier API
  - Training
  - void train(atomicReader, textFieldName, classFieldName, analyzer)
    throws IOException
    - atomicReader : the reader on the Lucene index to use for classification
      - still unsure whether an IndexReader would be better
    - textFieldName : the name of the field which contains the documents’ text
    - classFieldName : the name of the field which contains the class assigned
      to existing documents
    - analyzer : the Analyzer used for analyzing the unseen texts

Lucene classification module

• Classifier API
  - Classifying
  - ClassificationResult assignClass(String text) throws IOException
    - text : the unseen text of the document to classify
    - ClassificationResult : the object containing the assigned class along
      with the related score

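Putting the two calls together, a minimal usage sketch might look like the
following (the field names "text" and "cat", the choice of the kNN
implementation and the exact signatures are assumptions based on the API
described above; LUCENE-4345 was still work in progress at the time, so
details may differ between Lucene versions):

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.classification.Classifier;
    import org.apache.lucene.classification.KNearestNeighborClassifier;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.util.Version;

    public class ClassifierSketch {

      // atomicReader points at an index whose documents hold the body text in
      // "text" and the manually assigned category in "cat"
      public static String classify(AtomicReader atomicReader, String unseenText)
          throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        Classifier classifier = new KNearestNeighborClassifier(10); // k = 10
        classifier.train(atomicReader, "text", "cat", analyzer);
        ClassificationResult result = classifier.assignClass(unseenText);
        return result.getAssignedClass(); // the predicted category
      }
    }
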
K-nearest neighbor classifier

• Fairly simple classification algorithm
• Given some new unseen item
• I search my knowledge base for the k items which are nearest to the new one
• I get the k classes assigned to those k nearest items
• I assign to the new item the class that is most frequent among the k
  returned items

K-nearest neighbor classifier

• How can we do this in Lucene?
  - We have the vector space model for representing documents as vectors and
    computing distances
  - The Lucene MoreLikeThis module can do a lot of the work
  - Given a new document
    - It’s represented as a MoreLikeThisQuery, which filters out overly
      frequent words and helps keep only the tokens relevant for finding the
      neighbors
    - The query is executed, returning only the first k results
    - The results are then scanned to find the most frequent class, which is
      assigned with a score of classFreq / k

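Conceptually, the kNN classifier boils down to something like the following
sketch (field names, the analyzer and the plain majority vote are
illustrative; the real implementation lives in the Lucene classification
module):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queries.mlt.MoreLikeThisQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class KnnSketch {

      // Returns the most frequent "cat" value among the k documents most similar
      // to the unseen text, i.e. a plain majority vote over the neighbors.
      public static String knnClass(IndexSearcher searcher, Analyzer analyzer,
                                    String unseenText, int k) throws IOException {
        MoreLikeThisQuery query =
            new MoreLikeThisQuery(unseenText, new String[]{"text"}, analyzer, "text");
        TopDocs neighbors = searcher.search(query, k);

        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (ScoreDoc sd : neighbors.scoreDocs) {
          String cls = searcher.doc(sd.doc).get("cat"); // "cat" must be stored
          Integer count = votes.get(cls);
          votes.put(cls, count == null ? 1 : count + 1);
        }

        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
          if (e.getValue() > bestCount) {
            best = e.getKey();
            bestCount = e.getValue();
          }
        }
        return best; // its score would be bestCount / (double) k
      }
    }
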
Naïve Bayes classifier

• Slightly more complicated
• Based on probabilities
• C = argmax( P(d|c) * P(c) )
  - P(d|c) : likelihood
  - P(c) : prior
  - With some assumptions:
    - bag of words assumption: positions don't matter
    - conditional independence: the feature probabilities are independent
      given a class

Naïve Bayes classifier

• Prior calculation is easy
  - It’s the relative frequency of each class: #docsWithClassC / #docs
• Likelihood is easy too, because of the bag of words assumption
  - P(d|c) := P(x1,..,xn|c) == P(x1|c) * ... * P(xn|c)
  - So we just need the probabilities of single terms
    - P(x|c) := (tf of x in documents with class c + 1) /
      (#terms in docs with class c + #docs)

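In code, scoring a single class from these two quantities could look like the
sketch below (log space is used to avoid floating point underflow; the
parameter names are hypothetical, but all of these counts can be read from
the Lucene index and the add-one smoothing mirrors the formula above):

    import java.util.List;
    import java.util.Map;

    public class NaiveBayesScoreSketch {

      // Score of one class = prior * likelihood, computed in log space.
      // argmax over all classes picks the winner.
      public static double logScore(int docsWithClassC, int totalDocs,
                                    long termsInClassC,
                                    Map<String, Long> termFreqInClassC,
                                    List<String> tokensOfUnseenText) {
        double logPrior = Math.log((double) docsWithClassC / totalDocs);
        double logLikelihood = 0d;
        for (String term : tokensOfUnseenText) {
          Long tf = termFreqInClassC.get(term);
          double p = ((tf == null ? 0L : tf) + 1d) / (termsInClassC + totalDocs);
          logLikelihood += Math.log(p);
        }
        return logPrior + logLikelihood;
      }
    }
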
Naïve Bayes classifier

• Does the bag of words assumption affect the classifier’s precision?
  - Yes, in theory
    - in text documents (nearby) words are strongly correlated
  - Not always in practice
    - depending on your index data, it may or may not have an impact

Using different indexes

• The Classifier API uses an AtomicReader to get the data for training
• It doesn’t have to be the very same index used for everyday indexing /
  searching
  - For performance reasons
  - For enhancing classifier effectiveness
    - Using more specific analyzers
    - Indexing data in a different way
      - e.g. one big document for each class and use kNN (with a small k)
        or TF-IDF similarity

Things to consider - bootstrapping

• How are your first documents classified?
  - Manually
    - Categories are already there in the documents
    - Someone is explicitly charged to do that (e.g. article authors) at some
      point in time
  - (semi) automatically
    - Using some existing service / library
      - With or without human supervision
  - In either case the classifier needs to be fed something in order to be
    effective

Things to consider – tokenizing

• How is your content field tokenized?
  - Whitespace
    - It doesn’t work for every language
  - Standard
  - Sentence
  - What about using N-grams?
  - What about using shingles?

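As an illustration, a shingle-producing analyzer along these lines could be
built on the Lucene 4.x analysis API (a sketch; the version constant and the
shingle sizes are assumptions):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Emits lowercased word shingles (2- and 3-word grams) for each field value.
    public class ShingleAnalyzerSketch extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
        TokenStream sink = new LowerCaseFilter(Version.LUCENE_40, source);
        sink = new ShingleFilter(sink, 2, 3);
        return new TokenStreamComponents(source, sink);
      }
    }
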
Things to consider - filtering

• Some words may / should be filtered out while
  - Training
  - Classifying
• Often
  - Stopwords
  - Punctuation
  - Tokens with irrelevant PoS tags

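In the analyzer sketched under the previous slide, stopword removal is just
one more filter in the chain, placed before the ShingleFilter so shingles
don’t carry stopwords; StandardTokenizer already drops most punctuation,
while PoS-based filtering would need an external tagger wrapped in a custom
TokenFilter. A minimal fragment, assuming the default English stopword set
and imports for StopFilter and StandardAnalyzer:

    // Inside createComponents() of the analyzer sketched above, before the ShingleFilter:
    sink = new StopFilter(Version.LUCENE_40, sink, StandardAnalyzer.STOP_WORDS_SET);
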
Raw benchmarking

• Tried both algorithms on a ~1M docs index
  - Naïve Bayes is affected by the # of classes
  - kNN is affected by k being large
• Neither of them took more than 1-2 minutes to train, even with a great
  number of classes or large k values

From Lucene to Solr

• The Lucene classifiers can easily be used in Solr
  - As specific search services
    - A classification-based MoreLikeThis
  - While indexing
    - For automatic text categorization

Classification based MLT

• Use case:
  - “give me all the documents that belong to the same category as a new,
    not yet indexed document”
  - Slightly different from basic MLT, since it does not return the nearest
    docs
  - Useful if the user doesn’t want / need to index the document but still
    wants to find all the other documents in the same category, whatever
    that category means

Classification based MLT

• ClassificationMLTHandler
  - String d = req.getParams().get(DOC);
  - ClassificationResult r = classifier.assignClass(d);
  - String c = r.getAssignedClass();
  - req.getSearcher().search(new TermQuery(new Term(classFieldName, c)), rows);

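Wrapped in a plain method, the core of such a handler could look like the
sketch below (the "cat" field name and the SolrIndexSearcher plumbing are
assumptions; in the actual handler these lines would live in
handleRequestBody()):

    import java.io.IOException;

    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.classification.Classifier;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.solr.search.SolrIndexSearcher;

    public class ClassificationMLTSketch {

      // Classify the unseen text, then return the top `rows` documents that
      // were indexed with the same category.
      public static TopDocs sameCategoryDocs(SolrIndexSearcher searcher,
                                             Classifier classifier,
                                             String unseenText,
                                             int rows) throws IOException {
        ClassificationResult result = classifier.assignClass(unseenText);
        String assignedClass = result.getAssignedClass();
        return searcher.search(new TermQuery(new Term("cat", assignedClass)), rows);
      }
    }
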
Automatic text categorization

• Once a doc reaches Solr
• We can use the Lucene classifiers to automatically assign the document’s
  category
• We can leverage existing Solr facilities for enhancing the indexing pipeline
  - An UpdateChain can be decorated with one or more UpdateRequestProcessors

Automatic text categorization

• Configuration
  - <updateRequestProcessorChain name="ctgr">
      <processor class="solr.CategorizationUpdateRequestProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

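For the chain to actually run, it also has to be referenced by the update
handler (or selected per request via the update.chain parameter); a minimal
sketch of the wiring, assuming the chain name above:

    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">ctgr</str>
      </lst>
    </requestHandler>
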
Automatic text categorization

• CategorizationUpdateRequestProcessor
• void processAdd(AddUpdateCommand cmd) throws IOException
  - String text = (String) solrInputDocument.getFieldValue("text");
  - String cat = classifier.assignClass(text).getAssignedClass();
  - solrInputDocument.addField("cat", cat);
  - Every now and then we need to retrain to pick up the latest documents in
    the index, but that can be done in the background without affecting
    performance

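Expanded into a full (hypothetical) processor, following the
UpdateRequestProcessor contract of passing the command down the chain; the
field names and the way the classifier is obtained are assumptions:

    import java.io.IOException;

    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.classification.Classifier;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class CategorizationUpdateRequestProcessor extends UpdateRequestProcessor {

      private final Classifier classifier; // trained against the existing index

      public CategorizationUpdateRequestProcessor(Classifier classifier,
                                                  UpdateRequestProcessor next) {
        super(next);
        this.classifier = classifier;
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        String text = (String) doc.getFieldValue("text");
        if (text != null) {
          ClassificationResult result = classifier.assignClass(text);
          doc.addField("cat", result.getAssignedClass());
        }
        super.processAdd(cmd); // hand the document over to the next processor
      }
    }
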
Automatic text categorization

• CategorizationUpdateRequestProcessor
• Finer grained control
  - Use automatic text categorization only if a value does not already exist
    for the “cat” field
  - Add the classifier output class to the “cat” field only if it’s above a
    certain score

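Both refinements fit naturally into the processAdd() sketched above; a
fragment, where the 0.5 score threshold is an arbitrary illustrative value:

    // Only classify when no category was supplied, and only trust confident results.
    if (doc.getFieldValue("cat") == null) {
      ClassificationResult result = classifier.assignClass(text);
      if (result.getScore() > 0.5) {
        doc.addField("cat", result.getAssignedClass());
      }
    }
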
Wrap up

• Simple classifiers with little or no effort
• No architecture change
• Available to both Lucene and Solr
• Still reasonably fast
• A lot more can be done
  - Implement a MaxEnt Lucene-based classifier
    - which takes word correlation into account

Thanks!
