SlideShare a Scribd company logo
1 of 42
Download to read offline
Intro and Tutorial
        W3C Corpus Processing
             Advanced Topics
                      Summary




Unstructured Information Processing with
             Apache UIMA
      NYC Search and Discovery Meetup


             Pablo Ariel Duboue, PhD

              IBM TJ Watson Research Center
                      19 Skyline Dr.
                  Hawthorne, NY 10603


                 February 24th, 2010


                  Pablo Duboue     Apache UIMA
Intro and Tutorial
                    W3C Corpus Processing
                         Advanced Topics
                                  Summary


Outline

  1   Intro and Tutorial
         What is UIMA
         Mini-Tutorial

  2   W3C Corpus Processing
       TREC Enterprise Track

  3   Advanced Topics
        Custom Flow Controllers
        UIMA Asynchronous Scale-out
        Things I’m interested in improving



                              Pablo Duboue     Apache UIMA
Intro and Tutorial
                    W3C Corpus Processing      UIMA
                         Advanced Topics       Tutorial
                                  Summary


Outline

  1   Intro and Tutorial
         What is UIMA
         Mini-Tutorial

  2   W3C Corpus Processing
       TREC Enterprise Track

  3   Advanced Topics
        Custom Flow Controllers
        UIMA Asynchronous Scale-out
        Things I’m interested in improving



                              Pablo Duboue     Apache UIMA
Intro and Tutorial
                W3C Corpus Processing      UIMA
                     Advanced Topics       Tutorial
                              Summary


What is UIMA



     UIMA is a framework, a means to integrate text or other
     unstructured information analytics.
     Reference implementations available for Java, C++ and
     others.
     An Open Source project under the umbrella of the Apache
     Foundation.




                          Pablo Duboue     Apache UIMA
Intro and Tutorial
                   W3C Corpus Processing      UIMA
                        Advanced Topics       Tutorial
                                 Summary


Analytics Frameworks



     Find all telephone numbers in running text
         (((([0-9]{3}))|[0-9]{3})-?
         [0-9]{3}-?[0-9]{4}
     Nice but...
         How are you going to feed this further processing?
         What about finding non-standard proper names in text?
         Acquiring technology from external vendors, free software
         projects, etc?




                             Pablo Duboue     Apache UIMA
Intro and Tutorial
                  W3C Corpus Processing      UIMA
                       Advanced Topics       Tutorial
                                Summary


In-line Annotations


      Modify text to include annotations
          This/DET happy/ADJ puppy/N
      It gets very messy very quickly
          (S (NP (This/DET happy/ADJ puppy/N) (VP eats/V (NP
          (the/DET bone/N)))
      Annotations can easily cross boundaries of other
      annotations
          He said <confidential>the project can’t go own. The
          funding is lacking.</confidential>




                            Pablo Duboue     Apache UIMA
Intro and Tutorial
                    W3C Corpus Processing        UIMA
                         Advanced Topics         Tutorial
                                  Summary


Standoff Annotations


     Standoff annotations
          Do not modify the text
          Keep the annotations as offsets within the original text
     Most analytics frameworks support standoff annotations.
     UIMA is built with standoff annotations at its core.
     Example:
      He said the project can’t go own.        The funding is lacking.

      0123456789012345678901235678901234567890123456789012345678

          Sentence Annotation: 0-33, 36-58.
          Confidential Annotation: 8-58.



                              Pablo Duboue       Apache UIMA
Intro and Tutorial
                W3C Corpus Processing      UIMA
                     Advanced Topics       Tutorial
                              Summary


Type Systems


     Key to integrating analytic packages developed by
     independent vendors.
     Clear metadata about
         Expected Inputs
             Tokens, sentences, proper names, etc
         Produced Outputs
             Parse trees, opinions, etc

     The framework creates an unified typesystem for a given
     set of annotators being run.



                          Pablo Duboue     Apache UIMA
Intro and Tutorial
                W3C Corpus Processing      UIMA
                     Advanced Topics       Tutorial
                              Summary


Many frameworks



     Besides UIMA
         http://incubator.apache.org/uima
     LingPipe
         http://alias-i.com/lingpipe/
     Gate
         http://gate.ac.uk/




                          Pablo Duboue     Apache UIMA
Intro and Tutorial
               W3C Corpus Processing      UIMA
                    Advanced Topics       Tutorial
                             Summary


UIMA Advantages



    Apache Licensed
    Enterprise-ready code quality
    Demonstrated scalability
    Developed by experts in building frameworks
    Interoperable (C++, Java, others)




                         Pablo Duboue     Apache UIMA
Intro and Tutorial
                W3C Corpus Processing      UIMA
                     Advanced Topics       Tutorial
                              Summary


About The Speaker


     PhD in CS, Columbia University (NY)
         Natural Language Generation
     Joined IBM Research in 2005
         Worked in
             Question Answering
             Expert Search
             DeepQA (Jeopardy!)
     Recently joined the UIMA group
         mailto:pablo.duboue@gmail.com




                          Pablo Duboue     Apache UIMA
Intro and Tutorial
                    W3C Corpus Processing      UIMA
                         Advanced Topics       Tutorial
                                  Summary


Outline

  1   Intro and Tutorial
         What is UIMA
         Mini-Tutorial

  2   W3C Corpus Processing
       TREC Enterprise Track

  3   Advanced Topics
        Custom Flow Controllers
        UIMA Asynchronous Scale-out
        Things I’m interested in improving



                              Pablo Duboue     Apache UIMA
Intro and Tutorial
               W3C Corpus Processing      UIMA
                    Advanced Topics       Tutorial
                             Summary


UIMA Concepts


    Common Annotation Structure or CAS
        Subject of Analysis (SofA or View)
        JCas
    Feature Structures
        Annotations
    Indices and Iterators
    Analysis Engines (AEs)
        AEs descriptors




                         Pablo Duboue     Apache UIMA
Intro and Tutorial
                  W3C Corpus Processing      UIMA
                       Advanced Topics       Tutorial
                                Summary


Room annotator

     From the UIMA tutorial, write an Analysis Engine that
     identifies room numbers in text.
     Yorktown patterns: 20-001, 31-206, 04-123 (Regular
                 Expression Pattern: [0-9][0-9]-[0-2][0-9][0-9])
     Hawthorne patterns: GN-K35, 1S-L07, 4N-B21 (Regular
                 Expression Pattern: [G1-4][NS]-[A-Z][0-9])
     Steps:
       1   Define the CAS types that the annotator will use.
       2   Generate the Java classes for these types.
       3   Write the actual annotator Java code.
       4   Create the Analysis Engine descriptor.
       5   Test the annotator.

                            Pablo Duboue     Apache UIMA
Intro and Tutorial
              W3C Corpus Processing      UIMA
                   Advanced Topics       Tutorial
                            Summary


Editing a Type System




                        Pablo Duboue     Apache UIMA
Intro and Tutorial
                                        W3C Corpus Processing                  UIMA
                                             Advanced Topics                   Tutorial
                                                      Summary


The XML descriptor

  <?xml version= " 1 . 0 " encoding= "UTF−8" ?>
    < t y p e S y s t e m D e s c r i p t i o n xmlns= " h t t p : / / uima . apache . org / r e s o u r c e S p e c i f i e r " >
        <name> Tu to ri al Ty pe Sys te m < / name>
        < d e s c r i p t i o n >Type System D e f i n i t i o n f o r t h e t u t o r i a l examples −
               as o f E x e r c i s e 1< / d e s c r i p t i o n >
        <vendor>Apache Software Foundation < / vendor>
        <version> 1 . 0 < / version>
        <types>
            <typeDescription>
               <name>org . apache . uima . t u t o r i a l . RoomNumber< / name>
               < d e s c r i p t i o n >< / d e s c r i p t i o n >
               <supertypeName>uima . t c a s . A n n o t a t i o n < / supertypeName>
               <features>
                   <featureDescription>
                       <name> b u i l d i n g < / name>
                       < d e s c r i p t i o n > B u i l d i n g c o n t a i n i n g t h i s room< / d e s c r i p t i o n >
                       <rangeTypeName>uima . cas . S t r i n g < / rangeTypeName>
                   </ featureDescription>
               </ features>
            </ typeDescription>
        < / types>
    < / typeSystemDescription>




                                                     Pablo Duboue              Apache UIMA
Intro and Tutorial
                                      W3C Corpus Processing                UIMA
                                           Advanced Topics                 Tutorial
                                                    Summary


The AE code

  package org . apache . uima . t u t o r i a l . ex1 ;

  import j a v a . u t i l . regex . Matcher ;
  import j a v a . u t i l . regex . P a t t e r n ;

  import org . apache . uima . analysis_component . JCasAnnotator_ImplBase ;
  import org . apache . uima . j c a s . JCas ;
  import org . apache . uima . t u t o r i a l . RoomNumber ;

  /∗∗
   ∗ Example a n n o t a t o r t h a t d e t e c t s room numbers u s i n g
   ∗ Java 1 . 4 r e g u l a r e x p r e s s i o n s .
   ∗/
  public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
    p r i v a t e P a t t e r n mYorktownPattern =
                P a t t e r n . compile ( "   b [ 0 − 4 ]   d−[0 −2]d   d   b " ) ;

      p r i v a t e P a t t e r n mHawthornePattern =
                  P a t t e r n . compile ( "   b [ G1−4][NS] −[A ]   d   d   b " ) ;
                                                                   −Z

      public void process ( JCas aJCas ) {
        / / next s l i d e
      }
  }



                                                  Pablo Duboue             Apache UIMA
Intro and Tutorial
                              W3C Corpus Processing          UIMA
                                   Advanced Topics           Tutorial
                                            Summary


The AE code (cont.)


   public void process ( JCas aJCas ) {
     / / g e t document t e x t
     S t r i n g docText = aJCas . getDocumentText ( ) ;
     / / search f o r Yorktown room numbers
     Matcher matcher = mYorktownPattern . matcher ( docText ) ;
     i n t pos = 0 ;
     while ( matcher . f i n d ( pos ) ) {
         / / found one − c r e a t e a n n o t a t i o n
        RoomNumber a n n o t a t i o n = new RoomNumber ( aJCas ) ;
         a n n o t a t i o n . s e t B e g i n ( matcher . s t a r t ( ) ) ;
         a n n o t a t i o n . setEnd ( matcher . end ( ) ) ;
         a n n o t a t i o n . s e t B u i l d i n g ( " Yorktown " ) ;
         a n n o t a t i o n . addToIndexes ( ) ;
        pos = matcher . end ( ) ;
     }
     / / search f o r Hawthorne room numbers
     // ..
   }




                                        Pablo Duboue         Apache UIMA
Intro and Tutorial
            W3C Corpus Processing      UIMA
                 Advanced Topics       Tutorial
                          Summary


UIMA Document Analyzer




                      Pablo Duboue     Apache UIMA
Intro and Tutorial
             W3C Corpus Processing      UIMA
                  Advanced Topics       Tutorial
                           Summary


UIMA Document Analyzer (cont)




                       Pablo Duboue     Apache UIMA
Intro and Tutorial
                    W3C Corpus Processing
                                               TREC Enterprise Track
                         Advanced Topics
                                  Summary


Outline

  1   Intro and Tutorial
         What is UIMA
         Mini-Tutorial

  2   W3C Corpus Processing
       TREC Enterprise Track

  3   Advanced Topics
        Custom Flow Controllers
        UIMA Asynchronous Scale-out
        Things I’m interested in improving



                              Pablo Duboue     Apache UIMA
Intro and Tutorial
                W3C Corpus Processing
                                           TREC Enterprise Track
                     Advanced Topics
                              Summary


An Example



    TREC 2006 enterprise track
    Search for experts in W3C Website
        Given topic, find expert in topic
    The IBM Enterprise Track 2006 Team
        Guillermo Averboch, Jennifer Chu-Carroll, Pablo A Duboue,
        David Gondek, J William Murdock and John Prager .




                          Pablo Duboue     Apache UIMA
Intro and Tutorial
                              W3C Corpus Processing
                                                                 TREC Enterprise Track
                                   Advanced Topics
                                            Summary


Corpus Processing: Generalities




     A simplified view of our TREC 2006 Enterprise Track Indexing Pipeline (actual pipeline has 23 components)

    Reader        Low Level Processing        Higher Level Analytics                 Analytics Consumers



     Binary                      HTML         Enterprise    Expertise                       Pseudo
                  LibMagic                                                 Lucene                           EKDB
    Collection                  Detagger       Member        Detector                      Document
                  Annotator                                                Indexer                         Indexer
     Reader                     Annotator     Annotator     Aggregate                      Constructor




                                            Pablo Duboue         Apache UIMA
Intro and Tutorial
                  W3C Corpus Processing
                                             TREC Enterprise Track
                       Advanced Topics
                                Summary


Pipeline

     Reader
           Binary Collection Reader
     Low Level Processing
           LibMagic Annotator
           HTML Detagger Annotator
     Higher Level Analytics
           Enterprise Member Annotator
           Expertise Detector Aggregate
     Analytics Consumers
           Lucene Indexer
           Pseudo Document Constructor
           EKDB Indexer


                            Pablo Duboue     Apache UIMA
Intro and Tutorial
                 W3C Corpus Processing
                                            TREC Enterprise Track
                      Advanced Topics
                               Summary


Binary Collection Reader




     Reads the TREC XML format
     300,000+ documents (a full crawl of the w3.org site)
     Binary format, to allow auto-detection of file type,
     encoding, etc.




                           Pablo Duboue     Apache UIMA
Intro and Tutorial
                W3C Corpus Processing
                                           TREC Enterprise Track
                     Advanced Topics
                              Summary


LibMagic Annotator




     Uses “magic” numbers to heuristically guess the file type.
     JNI wrapper to libmagic in Linux.
     Non-supported file types are dropped.
     UIMA can run this remotely from a Windows machine.




                          Pablo Duboue     Apache UIMA
Intro and Tutorial
                W3C Corpus Processing
                                           TREC Enterprise Track
                     Advanced Topics
                              Summary


HTML Detagger Annotator




     For documents identified as HTML, parse them and extract
     the text.
     Perform also encoding detection (utf-8 vs. iso-8859-1).
     Other detaggers (not shown) are applied to other file
     formats.




                          Pablo Duboue     Apache UIMA
Intro and Tutorial
                 W3C Corpus Processing
                                            TREC Enterprise Track
                      Advanced Topics
                               Summary


Enterprise Member Annotator




     Detects inside running text the occurrence of any variant of
     the 1,000+ experts for the Enterprise Track.
     Dictionary extended with name variants.
     Simple TRIE-based implementation.




                           Pablo Duboue     Apache UIMA
Intro and Tutorial
                 W3C Corpus Processing
                                            TREC Enterprise Track
                      Advanced Topics
                               Summary


Expertise Detector Aggregate




     Hierarchical aggregate of 16 annotators leveraging existing
     technology into a new “expertise detection” annotator.
     Includes a named-entity detector and a relation detector for
     semantic patterns.




                           Pablo Duboue     Apache UIMA
Intro and Tutorial
                 W3C Corpus Processing
                                            TREC Enterprise Track
                      Advanced Topics
                               Summary


Lucene Indexer




     Integration with Open Source technology.
     Indexes the tokens from the text.
     The UIMA framework also contains JuruXML, an indexer
     for semantic information.




                           Pablo Duboue     Apache UIMA
Intro and Tutorial
                W3C Corpus Processing
                                           TREC Enterprise Track
                     Advanced Topics
                              Summary


Pseudo Document Constructor




     Uses the name occurrences to create a “pseudo”
     document with all text surrounding each expert name.
     The pseudo documents are indexed off-line.




                          Pablo Duboue     Apache UIMA
Intro and Tutorial
                 W3C Corpus Processing
                                            TREC Enterprise Track
                      Advanced Topics
                               Summary


EKDB Indexer




     Stores extracted entities and relations in a relational
     database.
     Standards-based (JDBC, RDF).
     Employed in a variety of research applications for search
     and inference.




                           Pablo Duboue     Apache UIMA
Intro and Tutorial
                                               Flow Controllers
                    W3C Corpus Processing
                                               UIMA AS
                         Advanced Topics
                                               Improvements
                                  Summary


Outline

  1   Intro and Tutorial
         What is UIMA
         Mini-Tutorial

  2   W3C Corpus Processing
       TREC Enterprise Track

  3   Advanced Topics
        Custom Flow Controllers
        UIMA Asynchronous Scale-out
        Things I’m interested in improving



                              Pablo Duboue     Apache UIMA
Intro and Tutorial
                                               Flow Controllers
                    W3C Corpus Processing
                                               UIMA AS
                         Advanced Topics
                                               Improvements
                                  Summary


Custom Flow Controllers


     UIMA allows you to specify which AE will receive the CAS
     next, based on all the annotations on the CAS.
      examples/descriptors/flow_controller/WhiteboardFlowController.xml

           FlowController implementing a simple version of the
           “whiteboard” flow model. Each time a CAS is received, it
           looks at the pool of available AEs that have not yet run on
           that CAS, and picks one whose input requirements are
           satisfied. Limitations: only looks at types, not features.
           Does not handle multiple Sofas or CasMultipliers.




                              Pablo Duboue     Apache UIMA
Intro and Tutorial
                                               Flow Controllers
                    W3C Corpus Processing
                                               UIMA AS
                         Advanced Topics
                                               Improvements
                                  Summary


Outline

  1   Intro and Tutorial
         What is UIMA
         Mini-Tutorial

  2   W3C Corpus Processing
       TREC Enterprise Track

  3   Advanced Topics
        Custom Flow Controllers
        UIMA Asynchronous Scale-out
        Things I’m interested in improving



                              Pablo Duboue     Apache UIMA
Intro and Tutorial
                                       Flow Controllers
            W3C Corpus Processing
                                       UIMA AS
                 Advanced Topics
                                       Improvements
                          Summary


UIMA AS: ActiveMQ



                      ActiveMQ
                                               UIMA AS
                       Broker
                                                 AEs

                                         queue

                  queue


                        Client




                      Pablo Duboue     Apache UIMA
Intro and Tutorial
                                        Flow Controllers
             W3C Corpus Processing
                                        UIMA AS
                  Advanced Topics
                                        Improvements
                           Summary


UIMA AS: Wrapping Primitive AEs




                       Pablo Duboue     Apache UIMA
Intro and Tutorial
                                               Flow Controllers
                    W3C Corpus Processing
                                               UIMA AS
                         Advanced Topics
                                               Improvements
                                  Summary


UIMA AS: More information



     http://incubator.apache.org/uima/doc-uimaas-what.html

     http://svn.apache.org/viewvc/incubator/uima/uimadiscretionary{-}{}{}as/trunk/

     uima-as-distr/src/main/readme/README?view=markup

     http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0

     discretionary{-}{}{}incubating/docs-uima-as/html/uima_async_scaleout/uima_

     async_scaleout.html




                              Pablo Duboue     Apache UIMA
Intro and Tutorial
                                               Flow Controllers
                    W3C Corpus Processing
                                               UIMA AS
                         Advanced Topics
                                               Improvements
                                  Summary


Outline

  1   Intro and Tutorial
         What is UIMA
         Mini-Tutorial

  2   W3C Corpus Processing
       TREC Enterprise Track

  3   Advanced Topics
        Custom Flow Controllers
        UIMA Asynchronous Scale-out
        Things I’m interested in improving



                              Pablo Duboue     Apache UIMA
Intro and Tutorial
                                              Flow Controllers
                   W3C Corpus Processing
                                              UIMA AS
                        Advanced Topics
                                              Improvements
                                 Summary


Descriptorless Primitive Analysis Engines
     Java 5 introduced annotations for metadata associated
     with Java classes.
          The UIMA primitive AEs descriptor files fall squarely within
          their intended use cases.
     Example:
     import org . apache . uima . a n n o t a t i o n . ∗ ;

      @UimaPrimitive (
        d e s c r i p t i o n = " I d e n t i f i e s room numbers on t e x t f
        typesystem= " org / apache / uima / examples / roomtypes
      )
      public class RoomNumberAnnotator extends JCasAn
      ...

                             Pablo Duboue     Apache UIMA
Intro and Tutorial
                                            Flow Controllers
                 W3C Corpus Processing
                                            UIMA AS
                      Advanced Topics
                                            Improvements
                               Summary


CASless JCas Types

     A common use case within UIMA is to do some processing
     and then kept some results outside the CAS.
         These results are used with application level logic, outside
         the UIMA framework.
     Because the JCas types are only available when tied to a
     CAS, they cannot be used within application logic.
         Copying information to POJOs that re-create the JCas
         types is a frequent and tedious task.
     Proposal: have JCasGen produce both CAS-backed and
     CAS-less implementation of the type defined in the type
     system.
         With methods to bridge between the two.


                           Pablo Duboue     Apache UIMA
Intro and Tutorial
                W3C Corpus Processing
                     Advanced Topics
                              Summary


Summary

    UIMA is a production ready framework for unstructured
    information processing.
    UIMA is a framework and it contains little or no annotators.
    It is an efficient framework that requires commitment on
    behalf of its practitioners.

    Outlook
        As an open source project, new contributors are always
        welcomed.
        There are a number of things I am personally interested in
        working with other people interested in UIMA.


                          Pablo Duboue     Apache UIMA

More Related Content

Similar to UIMA

2017 03 25 Microsoft Hacks, How to code efficiently
2017 03 25 Microsoft Hacks, How to code efficiently2017 03 25 Microsoft Hacks, How to code efficiently
2017 03 25 Microsoft Hacks, How to code efficientlyBruno Capuano
 
Pquery_presentation_03 2
Pquery_presentation_03 2Pquery_presentation_03 2
Pquery_presentation_03 2Alexey Bychko
 
OWASP WTE - Now in the Cloud!
OWASP WTE - Now in the Cloud!OWASP WTE - Now in the Cloud!
OWASP WTE - Now in the Cloud!Matt Tesauro
 
Building User-Centred Websites with Drupal
Building User-Centred Websites with DrupalBuilding User-Centred Websites with Drupal
Building User-Centred Websites with Drupalamanda etches
 
Between a SPA and a JAMstack: Building Web Sites with Nuxt/Vue, Strapi and wh...
Between a SPA and a JAMstack: Building Web Sites with Nuxt/Vue, Strapi and wh...Between a SPA and a JAMstack: Building Web Sites with Nuxt/Vue, Strapi and wh...
Between a SPA and a JAMstack: Building Web Sites with Nuxt/Vue, Strapi and wh...Adam Khan
 
OBJECT ORIENTED ROGRAMMING With Question And Answer Full
OBJECT ORIENTED ROGRAMMING With Question And Answer  FullOBJECT ORIENTED ROGRAMMING With Question And Answer  Full
OBJECT ORIENTED ROGRAMMING With Question And Answer FullManas Rai
 
Pharo Status ESUG 2014
Pharo Status ESUG 2014Pharo Status ESUG 2014
Pharo Status ESUG 2014ESUG
 
Pharo Status ESUG 2014
Pharo Status ESUG 2014Pharo Status ESUG 2014
Pharo Status ESUG 2014Marcus Denker
 
Example Of Import Java
Example Of Import JavaExample Of Import Java
Example Of Import JavaMelody Rios
 
Arakoon: A distributed consistent key-value store
Arakoon: A distributed consistent key-value storeArakoon: A distributed consistent key-value store
Arakoon: A distributed consistent key-value storeNicolas Trangez
 
Empowering Magnolia for Enterprise Use Cases - Experience Report
Empowering Magnolia for Enterprise Use Cases - Experience ReportEmpowering Magnolia for Enterprise Use Cases - Experience Report
Empowering Magnolia for Enterprise Use Cases - Experience Reportbkraft
 
Succeding with the Apache SOA stack
Succeding with the Apache SOA stackSucceding with the Apache SOA stack
Succeding with the Apache SOA stackJohan Edstrom
 
Tampere Technical University - Seminar Presentation in testind day 2016 - Sca...
Tampere Technical University - Seminar Presentation in testind day 2016 - Sca...Tampere Technical University - Seminar Presentation in testind day 2016 - Sca...
Tampere Technical University - Seminar Presentation in testind day 2016 - Sca...Sakari Hoisko
 

Similar to UIMA (20)

2017 03 25 Microsoft Hacks, How to code efficiently
2017 03 25 Microsoft Hacks, How to code efficiently2017 03 25 Microsoft Hacks, How to code efficiently
2017 03 25 Microsoft Hacks, How to code efficiently
 
Pquery_presentation_03 2
Pquery_presentation_03 2Pquery_presentation_03 2
Pquery_presentation_03 2
 
OWASP WTE - Now in the Cloud!
OWASP WTE - Now in the Cloud!OWASP WTE - Now in the Cloud!
OWASP WTE - Now in the Cloud!
 
Building User-Centred Websites with Drupal
Building User-Centred Websites with DrupalBuilding User-Centred Websites with Drupal
Building User-Centred Websites with Drupal
 
Between a SPA and a JAMstack: Building Web Sites with Nuxt/Vue, Strapi and wh...
Between a SPA and a JAMstack: Building Web Sites with Nuxt/Vue, Strapi and wh...Between a SPA and a JAMstack: Building Web Sites with Nuxt/Vue, Strapi and wh...
Between a SPA and a JAMstack: Building Web Sites with Nuxt/Vue, Strapi and wh...
 
OBJECT ORIENTED ROGRAMMING With Question And Answer Full
OBJECT ORIENTED ROGRAMMING With Question And Answer  FullOBJECT ORIENTED ROGRAMMING With Question And Answer  Full
OBJECT ORIENTED ROGRAMMING With Question And Answer Full
 
Java platform
Java platformJava platform
Java platform
 
Pharo Status ESUG 2014
Pharo Status ESUG 2014Pharo Status ESUG 2014
Pharo Status ESUG 2014
 
Pharo Status ESUG 2014
Pharo Status ESUG 2014Pharo Status ESUG 2014
Pharo Status ESUG 2014
 
TAO DAYS - API (IT Session)
TAO DAYS - API (IT Session)TAO DAYS - API (IT Session)
TAO DAYS - API (IT Session)
 
Example Of Import Java
Example Of Import JavaExample Of Import Java
Example Of Import Java
 
Python Workshop
Python WorkshopPython Workshop
Python Workshop
 
T4 presentation
T4 presentationT4 presentation
T4 presentation
 
Maven Introduction
Maven IntroductionMaven Introduction
Maven Introduction
 
Arakoon: A distributed consistent key-value store
Arakoon: A distributed consistent key-value storeArakoon: A distributed consistent key-value store
Arakoon: A distributed consistent key-value store
 
Recipes for Drupal distributions
Recipes for Drupal distributionsRecipes for Drupal distributions
Recipes for Drupal distributions
 
Empowering Magnolia for Enterprise Use Cases - Experience Report
Empowering Magnolia for Enterprise Use Cases - Experience ReportEmpowering Magnolia for Enterprise Use Cases - Experience Report
Empowering Magnolia for Enterprise Use Cases - Experience Report
 
Succeding with the Apache SOA stack
Succeding with the Apache SOA stackSucceding with the Apache SOA stack
Succeding with the Apache SOA stack
 
TAO DAYS - Free advanced items (User Session)
TAO DAYS - Free advanced items (User Session)TAO DAYS - Free advanced items (User Session)
TAO DAYS - Free advanced items (User Session)
 
Tampere Technical University - Seminar Presentation in testind day 2016 - Sca...
Tampere Technical University - Seminar Presentation in testind day 2016 - Sca...Tampere Technical University - Seminar Presentation in testind day 2016 - Sca...
Tampere Technical University - Seminar Presentation in testind day 2016 - Sca...
 

More from otisg

Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)otisg
 
Lucandra
LucandraLucandra
Lucandraotisg
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 
Probabilistic Retrieval
Probabilistic RetrievalProbabilistic Retrieval
Probabilistic Retrievalotisg
 
Faceted Search and Solr
Faceted Search and SolrFaceted Search and Solr
Faceted Search and Solrotisg
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 

More from otisg (6)

Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)
 
Lucandra
LucandraLucandra
Lucandra
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Probabilistic Retrieval
Probabilistic RetrievalProbabilistic Retrieval
Probabilistic Retrieval
 
Faceted Search and Solr
Faceted Search and SolrFaceted Search and Solr
Faceted Search and Solr
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 

Recently uploaded

Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 

Recently uploaded (20)

Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 

UIMA

  • 1. Intro and Tutorial W3C Corpus Processing Advanced Topics Summary Unstructured Information Processing with Apache UIMA NYC Search and Discovery Meetup Pablo Ariel Duboue, PhD IBM TJ Watson Research Center 19 Skyline Dr. Hawthorne, NY 10603 February 24th, 2010 Pablo Duboue Apache UIMA
  • 2. Intro and Tutorial W3C Corpus Processing Advanced Topics Summary Outline 1 Intro and Tutorial What is UIMA Mini-Tutorial 2 W3C Corpus Processing TREC Enterprise Track 3 Advanced Topics Custom Flow Controllers UIMA Asynchronous Scale-out Things I’m interested in improving Pablo Duboue Apache UIMA
  • 3. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary Outline 1 Intro and Tutorial What is UIMA Mini-Tutorial 2 W3C Corpus Processing TREC Enterprise Track 3 Advanced Topics Custom Flow Controllers UIMA Asynchronous Scale-out Things I’m interested in improving Pablo Duboue Apache UIMA
  • 4. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary What is UIMA UIMA is a framework, a means to integrate text or other unstructured information analytics. Reference implementations available for Java, C++ and others. An Open Source project under the umbrella of the Apache Foundation. Pablo Duboue Apache UIMA
  • 5. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary Analytics Frameworks Find all telephone numbers in running text (((([0-9]{3}))|[0-9]{3})-? [0-9]{3}-?[0-9]{4} Nice but... How are you going to feed this further processing? What about finding non-standard proper names in text? Acquiring technology from external vendors, free software projects, etc? Pablo Duboue Apache UIMA
  • 6. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary In-line Annotations Modify text to include annotations This/DET happy/ADJ puppy/N It gets very messy very quickly (S (NP (This/DET happy/ADJ puppy/N) (VP eats/V (NP (the/DET bone/N))) Annotations can easily cross boundaries of other annotations He said <confidential>the project can’t go own. The funding is lacking.</confidential> Pablo Duboue Apache UIMA
  • 7. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary Standoff Annotations Standoff annotations Do not modify the text Keep the annotations as offsets within the original text Most analytics frameworks support standoff annotations. UIMA is built with standoff annotations at its core. Example: He said the project can’t go own. The funding is lacking. 0123456789012345678901235678901234567890123456789012345678 Sentence Annotation: 0-33, 36-58. Confidential Annotation: 8-58. Pablo Duboue Apache UIMA
  • 8. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary Type Systems Key to integrating analytic packages developed by independent vendors. Clear metadata about Expected Inputs Tokens, sentences, proper names, etc Produced Outputs Parse trees, opinions, etc The framework creates an unified typesystem for a given set of annotators being run. Pablo Duboue Apache UIMA
  • 9. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary Many frameworks Besides UIMA http://incubator.apache.org/uima LingPipe http://alias-i.com/lingpipe/ Gate http://gate.ac.uk/ Pablo Duboue Apache UIMA
  • 10. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary UIMA Advantages Apache Licensed Enterprise-ready code quality Demonstrated scalability Developed by experts in building frameworks Interoperable (C++, Java, others) Pablo Duboue Apache UIMA
  • 11. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary About The Speaker PhD in CS, Columbia University (NY) Natural Language Generation Joined IBM Research in 2005 Worked in Question Answering Expert Search DeepQA (Jeopardy!) Recently joined the UIMA group mailto:pablo.duboue@gmail.com Pablo Duboue Apache UIMA
  • 12. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary Outline 1 Intro and Tutorial What is UIMA Mini-Tutorial 2 W3C Corpus Processing TREC Enterprise Track 3 Advanced Topics Custom Flow Controllers UIMA Asynchronous Scale-out Things I’m interested in improving Pablo Duboue Apache UIMA
  • 13. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary UIMA Concepts Common Annotation Structure or CAS Subject of Analysis (SofA or View) JCas Feature Structures Annotations Indices and Iterators Analysis Engines (AEs) AEs descriptors Pablo Duboue Apache UIMA
  • 14. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary Room annotator From the UIMA tutorial, write an Analysis Engine that identifies room numbers in text. Yorktown patterns: 20-001, 31-206, 04-123 (Regular Expression Pattern: [0-9][0-9]-[0-2][0-9][0-9]) Hawthorne patterns: GN-K35, 1S-L07, 4N-B21 (Regular Expression Pattern: [G1-4][NS]-[A-Z][0-9]) Steps: 1 Define the CAS types that the annotator will use. 2 Generate the Java classes for these types. 3 Write the actual annotator Java code. 4 Create the Analysis Engine descriptor. 5 Test the annotator. Pablo Duboue Apache UIMA
  • 15. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary Editing a Type System Pablo Duboue Apache UIMA
  • 16. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary The XML descriptor <?xml version= " 1 . 0 " encoding= "UTF−8" ?> < t y p e S y s t e m D e s c r i p t i o n xmlns= " h t t p : / / uima . apache . org / r e s o u r c e S p e c i f i e r " > <name> Tu to ri al Ty pe Sys te m < / name> < d e s c r i p t i o n >Type System D e f i n i t i o n f o r t h e t u t o r i a l examples − as o f E x e r c i s e 1< / d e s c r i p t i o n > <vendor>Apache Software Foundation < / vendor> <version> 1 . 0 < / version> <types> <typeDescription> <name>org . apache . uima . t u t o r i a l . RoomNumber< / name> < d e s c r i p t i o n >< / d e s c r i p t i o n > <supertypeName>uima . t c a s . A n n o t a t i o n < / supertypeName> <features> <featureDescription> <name> b u i l d i n g < / name> < d e s c r i p t i o n > B u i l d i n g c o n t a i n i n g t h i s room< / d e s c r i p t i o n > <rangeTypeName>uima . cas . S t r i n g < / rangeTypeName> </ featureDescription> </ features> </ typeDescription> < / types> < / typeSystemDescription> Pablo Duboue Apache UIMA
  • 17. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary The AE code package org . apache . uima . t u t o r i a l . ex1 ; import j a v a . u t i l . regex . Matcher ; import j a v a . u t i l . regex . P a t t e r n ; import org . apache . uima . analysis_component . JCasAnnotator_ImplBase ; import org . apache . uima . j c a s . JCas ; import org . apache . uima . t u t o r i a l . RoomNumber ; /∗∗ ∗ Example a n n o t a t o r t h a t d e t e c t s room numbers u s i n g ∗ Java 1 . 4 r e g u l a r e x p r e s s i o n s . ∗/ public class RoomNumberAnnotator extends JCasAnnotator_ImplBase { p r i v a t e P a t t e r n mYorktownPattern = P a t t e r n . compile ( " b [ 0 − 4 ] d−[0 −2]d d b " ) ; p r i v a t e P a t t e r n mHawthornePattern = P a t t e r n . compile ( " b [ G1−4][NS] −[A ] d d b " ) ; −Z public void process ( JCas aJCas ) { / / next s l i d e } } Pablo Duboue Apache UIMA
  • 18. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary The AE code (cont.) public void process ( JCas aJCas ) { / / g e t document t e x t S t r i n g docText = aJCas . getDocumentText ( ) ; / / search f o r Yorktown room numbers Matcher matcher = mYorktownPattern . matcher ( docText ) ; i n t pos = 0 ; while ( matcher . f i n d ( pos ) ) { / / found one − c r e a t e a n n o t a t i o n RoomNumber a n n o t a t i o n = new RoomNumber ( aJCas ) ; a n n o t a t i o n . s e t B e g i n ( matcher . s t a r t ( ) ) ; a n n o t a t i o n . setEnd ( matcher . end ( ) ) ; a n n o t a t i o n . s e t B u i l d i n g ( " Yorktown " ) ; a n n o t a t i o n . addToIndexes ( ) ; pos = matcher . end ( ) ; } / / search f o r Hawthorne room numbers // .. } Pablo Duboue Apache UIMA
  • 19. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary UIMA Document Analyzer Pablo Duboue Apache UIMA
  • 20. Intro and Tutorial W3C Corpus Processing UIMA Advanced Topics Tutorial Summary UIMA Document Analyzer (cont) Pablo Duboue Apache UIMA
  • 21. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary Outline 1 Intro and Tutorial What is UIMA Mini-Tutorial 2 W3C Corpus Processing TREC Enterprise Track 3 Advanced Topics Custom Flow Controllers UIMA Asynchronous Scale-out Things I’m interested in improving Pablo Duboue Apache UIMA
  • 22. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary An Example TREC 2006 enterprise track Search for experts in W3C Website Given topic, find expert in topic The IBM Enterprise Track 2006 Team Guillermo Averboch, Jennifer Chu-Carroll, Pablo A Duboue, David Gondek, J William Murdock and John Prager . Pablo Duboue Apache UIMA
  • 23. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary Corpus Processing: Generalities A simplified view of our TREC 2006 Enterprise Track Indexing Pipeline (actual pipeline has 23 components) Reader Low Level Processing Higher Level Analytics Analytics Consumers Binary HTML Enterprise Expertise Pseudo LibMagic Lucene EKDB Collection Detagger Member Detector Document Annotator Indexer Indexer Reader Annotator Annotator Aggregate Constructor Pablo Duboue Apache UIMA
  • 24. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary Pipeline Reader Binary Collection Reader Low Level Processing LibMagic Annotator HTML Detagger Annotator Higher Level Analytics Enterprise Member Annotator Expertise Detector Aggregate Analytics Consumers Lucene Indexer Pseudo Document Constructor EKDB Indexer Pablo Duboue Apache UIMA
  • 25. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary Binary Collection Reader Reads the TREC XML format 300,000+ documents (a full crawl of the w3.org site) Binary format, to allow auto-detection of file type, encoding, etc. Pablo Duboue Apache UIMA
  • 26. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary LibMagic Annotator Uses “magic” numbers to heuristically guess the file type. JNI wrapper to libmagic in Linux. Non-supported file types are dropped. UIMA can run this remotely from a Windows machine. Pablo Duboue Apache UIMA
  • 27. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary HTML Detagger Annotator For documents identified as HTML, parse them and extract the text. Perform also encoding detection (utf-8 vs. iso-8859-1). Other detaggers (not shown) are applied to other file formats. Pablo Duboue Apache UIMA
  • 28. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary Enterprise Member Annotator Detects inside running text the occurrence of any variant of the 1,000+ experts for the Enterprise Track. Dictionary extended with name variants. Simple TRIE-based implementation. Pablo Duboue Apache UIMA
  • 29. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary Expertise Detector Aggregate Hierarchical aggregate of 16 annotators leveraging existing technology into a new “expertise detection” annotator. Includes a named-entity detector and a relation detector for semantic patterns. Pablo Duboue Apache UIMA
  • 30. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary Lucene Indexer Integration with Open Source technology. Indexes the tokens from the text. The UIMA framework also contains JuruXML, an indexer for semantic information. Pablo Duboue Apache UIMA
  • 31. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary Pseudo Document Constructor Uses the name occurrences to create a “pseudo” document with all text surrounding each expert name. The pseudo documents are indexed off-line. Pablo Duboue Apache UIMA
  • 32. Intro and Tutorial W3C Corpus Processing TREC Enterprise Track Advanced Topics Summary EKDB Indexer Stores extracted entities and relations in a relational database. Standards-based (JDBC, RDF). Employed in a variety of research applications for search and inference. Pablo Duboue Apache UIMA
  • 33. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary Outline 1 Intro and Tutorial What is UIMA Mini-Tutorial 2 W3C Corpus Processing TREC Enterprise Track 3 Advanced Topics Custom Flow Controllers UIMA Asynchronous Scale-out Things I’m interested in improving Pablo Duboue Apache UIMA
  • 34. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary Custom Flow Controllers UIMA allows you to specify which AE will receive the CAS next, based on all the annotations on the CAS. examples/descriptors/flow_controller/WhiteboardFlowController.xml FlowController implementing a simple version of the “whiteboard” flow model. Each time a CAS is received, it looks at the pool of available AEs that have not yet run on that CAS, and picks one whose input requirements are satisfied. Limitations: only looks at types, not features. Does not handle multiple Sofas or CasMultipliers. Pablo Duboue Apache UIMA
  • 35. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary Outline 1 Intro and Tutorial What is UIMA Mini-Tutorial 2 W3C Corpus Processing TREC Enterprise Track 3 Advanced Topics Custom Flow Controllers UIMA Asynchronous Scale-out Things I’m interested in improving Pablo Duboue Apache UIMA
  • 36. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary UIMA AS: ActiveMQ ActiveMQ UIMA AS Broker AEs queue queue Client Pablo Duboue Apache UIMA
  • 37. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary UIMA AS: Wrapping Primitive AEs Pablo Duboue Apache UIMA
  • 38. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary UIMA AS: More information http://incubator.apache.org/uima/doc-uimaas-what.html http://svn.apache.org/viewvc/incubator/uima/uimadiscretionary{-}{}{}as/trunk/ uima-as-distr/src/main/readme/README?view=markup http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0 discretionary{-}{}{}incubating/docs-uima-as/html/uima_async_scaleout/uima_ async_scaleout.html Pablo Duboue Apache UIMA
  • 39. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary Outline 1 Intro and Tutorial What is UIMA Mini-Tutorial 2 W3C Corpus Processing TREC Enterprise Track 3 Advanced Topics Custom Flow Controllers UIMA Asynchronous Scale-out Things I’m interested in improving Pablo Duboue Apache UIMA
  • 40. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary Descriptorless Primitive Analysis Engines Java 5 introduced annotations for metadata associated with Java classes. The UIMA primitive AEs descriptor files fall squarely within their intended use cases. Example: import org . apache . uima . a n n o t a t i o n . ∗ ; @UimaPrimitive ( d e s c r i p t i o n = " I d e n t i f i e s room numbers on t e x t f typesystem= " org / apache / uima / examples / roomtypes ) public class RoomNumberAnnotator extends JCasAn ... Pablo Duboue Apache UIMA
  • 41. Intro and Tutorial Flow Controllers W3C Corpus Processing UIMA AS Advanced Topics Improvements Summary CASless JCas Types A common use case within UIMA is to do some processing and then kept some results outside the CAS. These results are used with application level logic, outside the UIMA framework. Because the JCas types are only available when tied to a CAS, they cannot be used within application logic. Copying information to POJOs that re-create the JCas types is a frequent and tedious task. Proposal: have JCasGen produce both CAS-backed and CAS-less implementation of the type defined in the type system. With methods to bridge between the two. Pablo Duboue Apache UIMA
  • 42. Intro and Tutorial W3C Corpus Processing Advanced Topics Summary Summary UIMA is a production ready framework for unstructured information processing. UIMA is a framework and it contains little or no annotators. It is an efficient framework that requires commitment on behalf of its practitioners. Outlook As an open source project, new contributors are always welcomed. There are a number of things I am personally interested in working with other people interested in UIMA. Pablo Duboue Apache UIMA