Content Extraction with Apache Tika
     Jukka Zitting | Tika committer, co-author of Tika in Action




© 2012 Adobe Systems Incorporated. All Rights Reserved.
Content Extraction with Apache Tika

    Introduction to Apache Tika
    Full text extraction with Tika
    Tika and Solr - the ExtractingRequestHandler
    Tika and Lucene - direct feeding of the index
         forked parsing
         link extraction




Introduction to Apache Tika
                                                          section 1 / 4




Introduction to Apache Tika




                          The Apache Tika™ toolkit detects and extracts
                          metadata and structured text content from various
                          documents using existing parser libraries.

Problem domain




The Tika solution



[Diagram: a Document flows into Tika, which produces two outputs — the
Content ("It is a truth universally acknowledged, that a single man in
possession of a good fortune, must be in want of a wife...") and the
Metadata (dc:title=Pride and Prejudice, dc:creator=Jane Austen, dc:date=1813).]




Project background

    Brief history
         2007   Tika started in the Apache Incubator
         2008   Tika graduates into a Lucene subproject
         2010   Tika becomes a standalone TLP
         2011   Tika 1.0 released
         2011   Tika in Action published
    Latest release is Apache Tika 1.2
         thousands of known media types
             most with associated type detection patterns
         dozens of supported document formats
             including all major office formats
         basic language detection
         etc.
    For more information: http://tika.apache.org/


Ohloh summary (http://www.ohloh.net/p/tika)




Full text extraction with Tika
                                                          section 2 / 4




Demo: tika-app-1.2.jar

    https://github.com/jukka/tika-demo
    java -jar tika-app-1.2.jar




tika-app as a command line tool

$ java -jar tika-app-1.2.jar --xhtml /path/to/document.doc

$ java -jar tika-app-1.2.jar --text http://example.com/document

$ java -jar tika-app-1.2.jar --metadata < document.doc

$ cat document.doc | java -jar tika-app-1.2.jar --text | grep foo

$ java -jar tika-app-1.2.jar --help



Tika’s Java API

    Divided into two layers
         The Tika facade: org.apache.tika.Tika
         Lower-level interfaces like Parser, Detector, etc.

    Use the Tika facade by default
         Provides simple support for most common use cases
         Example: new Tika().parseToString(new File("/path/to/document.doc"))

    Use the lower-level interfaces for more power or flexibility
         Allows fine-grained control of Tika functionality
         More complicated programming model
             Parsed content handled as XHTML SAX events
         Not all functionality is exposed through the Tika facade
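The two layers can be sketched side by side. This is a minimal illustration, assuming tika-core and tika-parsers 1.x are on the classpath; the class and method names here (`TwoLayers`, `withFacade`, `withParser`) are made up for the example.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class TwoLayers {

    // Layer 1: the facade — one call for the common case
    static String withFacade(File file) throws Exception {
        return new Tika().parseToString(file);
    }

    // Layer 2: the Parser interface — explicit control over the
    // content handler, metadata and parse context
    static String withParser(File file) throws Exception {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }
}
```

Both return the same text; the lower-level form is what you extend when you need a custom ContentHandler, as in the link-extraction example later in the deck.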




Tika and Solr - the ExtractingRequestHandler
                                                          section 3 / 4




ExtractingRequestHandler

    aka Solr Cell
    http://wiki.apache.org/solr/ExtractingRequestHandler

     “Solr's ExtractingRequestHandler uses Tika
     to allow users to upload binary files to Solr and
     have Solr extract text from it and then index it.”

    For example:
     $ curl "http://localhost:8983/solr/update/extract?literal.id=document&commit=true" -F "file=@document.doc"

    Supports both text and metadata extraction
         with plenty of configurable options




ExtractingRequestHandler parameters

    Helping Tika do its job
         resource.name=document.doc - Helps Tika’s automatic type detection
         resource.password=secret - Allows Tika to read encrypted documents
         passwordsFile=/path/to/password-file - Resource name to password mappings
             for example: .*.pdf$ = pdf-secret
    Capturing special content
         xpath=//a - Capture only content inside elements that match the specified query
         capture=h1 - Capture content inside specific elements to a separate field
         captureAttr=true - Capture attributes into separate fields named after the element
    Mapping field names
         lowernames=true - Normalize metadata field names to “content_type”, etc.
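A passwordsFile is just regex-to-password lines, matched against the resource name. The lookup it implies can be sketched in plain Java; the file names, passwords and the `PasswordMapping` class are invented for illustration, not part of Solr or Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PasswordMapping {

    // In-memory stand-in for a passwordsFile: each entry maps a
    // resource-name regex to the password to try for matching files
    static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(".*\\.pdf$", "pdf-secret");
        RULES.put(".*\\.docx?$", "office-secret");
    }

    // First matching rule wins, as with the passwordsFile
    static String passwordFor(String resourceName) {
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            if (resourceName.matches(rule.getKey())) {
                return rule.getValue();
            }
        }
        return null; // no password known for this resource
    }
}
```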

Tika and Lucene - direct feeding of the index
                                                          section 4 / 4




Using the Tika facade to feed Lucene

// Index first part of the document
String text = new Tika().parseToString(new File("/path/to/document.doc"));
document.add(new TextField("text", text, Field.Store.NO));


// Index the full document
Reader reader = new Tika().parse(new File("/path/to/document.doc"));
document.add(new TextField("text", reader));


// Index also some metadata
Metadata metadata = new Metadata();
Reader reader = new Tika().parse(
   new FileInputStream("/path/to/document.doc"), metadata);
document.add(new TextField("text", reader));
document.add(new StringField("type",
   metadata.get(Metadata.CONTENT_TYPE), Field.Store.YES));

Things to consider

    What if the document is larger than your memory?
         Index only first N bytes/characters?
         WriteOutContentHandler supports an explicit write limit
             Enabled by default in the Tika facade, see get/setMaxStringLength()

    What if the document is malformed or intentionally broken?
         Could cause denial of service problems
             Might even crash the entire JVM due to bugs in native libraries in the JDK!
         SecureContentHandler monitors parsing and terminates it if things look bad
             Enabled by default in the Tika facade

    Ultimate solution: forked parsing and the Tika server
         Parse documents in separate, sandboxed JVM processes
         A document could fail to parse, but your application won’t crash
         Code is already there, but still a bit tricky to set up
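The write-limit case can be handled entirely through the facade. A minimal sketch, assuming tika-core/tika-parsers on the classpath (the `Capped` class and `firstChars` name are illustrative):

```java
import java.io.File;

import org.apache.tika.Tika;

public class Capped {

    // Returns at most `limit` characters of extracted text; the cap is
    // enforced by the WriteOutContentHandler behind parseToString()
    static String firstChars(File file, int limit) throws Exception {
        Tika tika = new Tika();
        tika.setMaxStringLength(limit);
        return tika.parseToString(file);
    }
}
```

With the limit in place, parsing stops early instead of buffering an arbitrarily large document in memory.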

Link extraction for web crawlers

    The LinkContentHandler class can be used to extract all links from a document
         Also works with links in things like PDF, MS Word and email documents
         Use TeeContentHandler to combine with other ways of capturing content

// for example
LinkContentHandler lch = new LinkContentHandler();
BodyContentHandler bch = new BodyContentHandler();
new Tika().getParser().parse(..., new TeeContentHandler(lch, bch), ...);

System.out.println("Content: " + bch);
for (Link link : lch.getLinks()) {
   System.out.println("Link: " + link);
}
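A filled-in version of that sketch might look like the following; the use of AutoDetectParser for the elided arguments and the `LinkExtractor`/`linksOf` names are assumptions, not part of the slide.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;

public class LinkExtractor {

    // Parses the document once, capturing links and body text together
    static List<Link> linksOf(File file) throws Exception {
        LinkContentHandler lch = new LinkContentHandler();
        BodyContentHandler bch = new BodyContentHandler();
        try (InputStream stream = new FileInputStream(file)) {
            new AutoDetectParser().parse(stream,
                    new TeeContentHandler(lch, bch),
                    new Metadata(), new ParseContext());
        }
        return lch.getLinks();
    }

    public static void main(String[] args) throws Exception {
        for (Link link : linksOf(new File(args[0]))) {
            System.out.println("Link: " + link.getUri());
        }
    }
}
```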


Questions?
                                                          http://tika.apache.org/




