SlideShare a Scribd company logo
1 of 36
Download to read offline
Column Stride Fields aka. DocValues
Simon Willnauer @ Lucene Revolution 2011

PMC Member & Core Comitter Apache Lucene
simonw@apache.org / simonw@jteam.nl
2
Column Stride Fields aka. DocValues

Agenda
‣ What is this all about? aka. The Problem!

‣ The more native solution

‣ DocValues - current state and future

‣ Questions?




                                              3
What is this all about? - Inverted Index

Lucene is basically an inverted index - used to find terms QUICKLY!

1   The old night keeper keeps the keep in the town    term     freq   Posting list
2   In the big old house in the big old gown.           and       1    6
                                                         big      2    23
3   The house in the town had the big old keep
                                                       dark       1    6
4   Where the old night keeper never did sleep.
                                                         did      1    4
5   The night keeper keeps the keep in the night       gown       1    2
6   And keeps in the dark and sleeps in the light.      had       1    3
                                                      house       2    23
Table with 6 documents                                    in      5    <1> <2> <3> <5> <6>
                                                       keep       3    135
                                                      keeper      3    145
                                                      keeps       3    156
                                        TermsEnum       light     1    6
                                                       never      1    4
                                                       night      3    145
         IndexWriter                                     old      4    1234
                                                       sleep      1    4
                                                      sleeps      1    6
                                                         the      6    <1> <2> <3> <4> <5> <6>
                                                       town       2    13
                                                      where       1    4
Intersecting posting lists

Yet, once we found the right terms the game starts....

        Posting Lists (document IDs)
         5    10   11    55     57   59   77   88
                                                     AND Query
         1    10   13    44     55   79   88   99




                        score


 What goes into the score? PageRank?, ClickFeedback?


                                                                 5
How to store scoring factors?

Lucene provides 2 ways of storing data

  • Stored Fields (document to String or Binary mapping)
  • Inverted Index (term to document mapping)

What if we need here is one or more values per document!

  • document to value mapping
Why not use Stored Fields?




                                                           6
Using Stored Fields


• Stored Fields serve a different purpose
   • loading body or title fields for result rendering / highlighting
   • very suited for loading multiple values
• With Stored Fields you have one indirection per document resulting in
 going to disk twice for each document

  • on-disk random access is too slow
  • remember Lucene could score millions of documents even if you just
    render the top 10 or 20!




                                                                          7
Stored Fields under the hood

                             Document
                                                     title:        title:
                                   id: 108232
                                                  Deutschland    Germany




                                                absolute file pointers
                       [...][...][93438][...]

  Field Index (.fdx)




                         [...][id:108232title:Deutschlandtitle:Germany][...]

  Field Data (.fdt)
                        numFields(vint) [ fieldid(vint) length(vint) payload ]


                                                                                8
Stored Fields - accessing a field



                            1       Lookup filepointer in .fdx


                       [...][...][93438][...]

  Field Index (.fdx)

                        2         Scan on .fdt until you find the field by ID


                         [...][id:108232title:Deutschlandtitle:Germany][...]

  Field Data (.fdt)
                        numFields(vint) [ fieldid(vint) length(vint) payload ]




                                                                                9
Alternatives?

Lucene can un-invert a field into FieldCache




                               un-invert
 weight                                         term     freq    Posting list
  5.8
                                                 1.0         1   16
  1.0
                                                 2.7         1   23
  2.7
                     parse
  2.7                                            3.2         1   7
  4.3         convert to datatype                4.3         1   4
  7.9
                                                 4.7         1   8
  1.0

  3.2                                            5.8         1   0

  4.7
                                                 7.9         1   59
  7.9
          array per field /                       9.0         1   10
  9.0        segment


float 32                                    string / byte[]                      10
FieldCache - is fast once loaded, once!

• Constant time lookup DocID to value
• Efficient representation
   • primitive array
   • low GC overhead
• loading can be slow (realtime can be a problem)
• must parse values
• builds unnecessary term dictionary
• always memory resident


                                                    11
FieldCache - loading

Simple Benchmark

• Indexing 100k, 1M and 10M random floats
• not analyzed no norms
• load field into FieldCache from optimized index


                 100k Docs     1M Docs     10M Docs


                   122 ms      348 ms       3161 ms


Remember, this is only one field! Some apps have many fields to load to
FieldCache

                                                                        12
FieldCache works fine! - if...

• you have enough memory
• you can afford the loading time
• merge is fast enough (for FieldCache you need to index the terms)



What if you canʼt? Like when you are in a very restricted
environment?

• 3 Billion Android installations world wide and growing - 2 MB Heap!
• with 100 Million Documents one field takes 30 seconds to load
• 2 phase Distributed Search

                                                                        13
Summary

• Stored Fields are not fast enough for random access
• FieldCache is fast once loaded
   • abuses a reverse index
   • must convert to String and from String
   • requires fair amount of memory
• Lucene is missing native data-structure for primitive per-document
 values




                                                                       14
Column Stride Fields aka. DocValues

Agenda
‣ What is this all about? aka. The Problem!

‣ The more native solution

‣ DocValues - current state and future

‣ Questions?




                                              15
The more native solution - Column Stride Fields

• A dense column based storage
• 1 value per document
• accepts primitives - no conversion from / to String
   • int & long
   • float & double
   • byte[ ]
• each field has a DocValues Type but can still be indexed or stored
• Entirely optional


                                                                      16
Simple Layout - even on disk

                             1 column per field and segment

                               field: time     field: id   field: page_rank
      1 value per document
                              1288271631431      1             3.2
                              1288271631531      5             4.5
                              1288271631631      3             2.3
                              1288271631732      4            4.44
                              1288271631832      6             6.7
                              1288271631932      9             7.8
                              1288271632032      8             9.9
                              1288271632132      7            10.1
                              1288271632233     12            11.0
                              1288271632333     14            33.1
                              1288271632433     22             0.2
                              1288271632533     32             1.4
                              1288271632637     100           55.6
                              1288271632737     33             2.2
                              1288271632838     34             7.5
                              1288271632938     35             3.2
                              1288271633038     36             3.4
                              1288271633138     37             5.6
                              1288271632333     38            45.0

                                 int64        int32         float 32
                                                                           17
Numeric Types - Int

                    Number of bit depend on the numeric
                    range in the field:                                                                        field: id
                                                                            7 - bit per doc                      1
                                                                                                                 5
                                                                                                                 3
                                     Math.max(1, (int) Math.ceil(
                                               Math.log(1+maxValue)/Math.log(2.0))                               4
                                               );
                                                                                                                 6
                                                                                                                 9




                                                                                              Random Access
                                                                                                                 8
                                                                                                                 7
                                                                                                                12
                                                                                                                14
• Integer are stored dense based on PackedInts                                                                  22
                                                                                                                32
• Space depends on the value-range per segment                                                                  100
                                                                                                                33
  Example: [1, 100] maps to [0, 99] requires 7 bit per doc                                                      34
                                                                                                                35
                                                                                                                36

• Floats are stored without compression                                                                         37
                                                                                                                38

   • either 32 or 64 bit per value
                                                                                                                         18
Arbitrary Values - The byte[] variants

• Length Variants:
   • Fixed / Variable
• Store Variants:
   • Straight or Referenced
                                        fixed / straight                   fixed / deref
                                              data                        offsets          data

                                            10/01/2011                      0            10/01/2011

                                            12/01/2011                      10           12/01/2011




                                                          Random Access
                        Random Access



                                            10/04/2011                      20           10/04/2011
                                            10/06/2011                      30
                                                                                         10/06/2011
                                            10/05/2011                      40
                                                                                         10/05/2011
                                            10/01/2011                      50
                                                                                         10/01/2011
                                            10/07/2011                      60
                                                                                         10/07/2011
                                            10/04/2011                      20

                                            10/04/2011                      20

                                            10/04/2011                      20                        19
DocValues - Memory Requirements

• RAM Resident - random access
   • similar to FieldCache
   • bytes are stored in byte-block pools
   • currently limited to 2GB per segment
• On-Disk - sequential access
   • almost no JVM heap memory
   • files should be in FS cache for fast access
   • possible use MemoryMapped Buffers


                                                  20
Lets look at the API - Indexing


Adding DocValues follows existing patterns, simply use Fieldable

       Document doc = new Document();
       float pageRank = 10.3f;
       DocValuesField valuesField = new DocValuesField("pageRank");
       valuesField.setFloat(pageRank);
       doc.add(valuesField);
       writer.addDocument(doc);




Sometimes the field should also be indexed, stored or needs term-
vectors

       String titleText = "The quick brown fox";
       Field field = new Field("title", titleText , Store.NO, Index.ANALYZED);
       DocValuesField titleDV = new DocValuesField("title");
       titleDV.setBytes(new BytesRef(titleText), Type.BYTES_VAR_DEREF);
       field.setDocValues(titleDV);




                                                                                 21
Looking at the API - Search / Retrieve


On disk sequential access is exposed through DocValuesEnum

 IndexReader reader = ...;
 DocValues values = reader.docValues("pageRank");
 DocValuesEnum floatEnum = values.getEnum();
 int doc = 0;
 FloatsRef ref = floatEnum.getFloat(); // values are filled when iterating

 while((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) {
   double value = ref.floats[0];
 }

 // equivalent to ...

 int doc = 0;
 while((doc = floatEnum.advance(doc+1)) != DocValuesEnum.NO_MORE_DOCS) {
   double value = ref.floats[0];
 }




DocValuesEnum is based on DocIdSetIterator just like Scorer or
DocsEnum

                                                                             22
Looking at the API - Search / Retrieve


RAM Resident API is very similar to FieldCache


IndexReader reader = ...;
DocValues values = reader.docValues("pageRank");
Source source = values.getSource();
double value = source.getFloat(x);

// still allows iterating over the RAM resident values

DocValuesEnum floatEnum = source.getEnum();
int doc;
FloatsRef ref = floatEnum.getFloat();
while((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) {
  value = ref.floats[0];
}




DocValuesEnum still available on RAM Resident API




                                                                     23
Can I add my own DocValues Implementation?

• DocValues are integrated into Flexible Indexing
• IndexWriter / IndexReader write and read DocValues via a Codec
• DocValues Types are fixed (int, float32, float64 etc.) but implementations
 are Codec specific

• A Codec provides access to DocValuesComsumer and
 DocValuesProducer

  • allows implementing application specific serialzation
  • customize compression techniques



                                                                       24
Quick detour - Codecs


     IndexWriter                   IndexReader




                        Flex API

                        Codec

                    Directory

                   FileSystem


                                                 25
Quick detour - Codecs


       IndexWriter                         IndexReader


                     write              read



DocValuesConsumer                              DocValuesProducer




                             Flex API

                             Codec
                                                                   26
Remember the loading FieldCache benchmark?

Simple Benchmark

• Indexing 100k, 1M and 10M random floats
• not analyzed no norms
• loading field into FieldCache from optimized index vs. loading
DocValues field

                         100k Docs 1M Docs          10M Docs

           FieldCache      122 ms       348 ms       3161 ms

            DocValues       7 ms         10 ms        90 ms



Loading is 100 x faster - no un-inverting, no string parsing
                                                                  27
QPS - FieldCache vs. DocValues

          Task        QPS DocValues   QPS FieldCache     % change
       AndHighHigh         3.51             3.41            2.9%
        PKLookup          46.06             44.87           2.7%
       AndHighMed         37.09             36.48           1.7%
          Fuzzy2          17.70             17.50           1.1%
          Fuzzy1          27.15             27.21           -0.2%
          Phrase           4.12             4.13            -0.2%
        SpanNear           2.00             2.01            -0.5%
       SloppyPhrase        1.98             2.02            -2.0%
           Term           35.29             36.05           -2.1%
        OrHighMed          4.73             4.93            -4.1%
        OrHighHigh         3.99             4.18            -4.5%
         Wildcard         12.97             13.60           -4.6%
          Prefix3         15.86             16.70           -5.0%
         IntNRQ            2.72             2.91            -6.5%



6 Search Threads 20 JVM instances, 5 instances per task run 50 times on 12 core
Xeon / 24 GB RAM - all queries wrapped with a CustomScoreQuery
                                                                                  28
Column Stride Fields aka. DocValues

Agenda
‣ What is this all about? aka. The Problem!

‣ The more native solution

‣ DocValues - current state and future

‣ Questions?




                                              29
DocValues - current state

• Currently still in a branch
   • Some minor JavaDoc issues
   • needs some cleanups

• Landing on trunk very soon
   • issue is already opened and active




                                          30
DocValues - current features

• Fully customizable via Codecs
• User can control memory usage per field
• Suitable for environments where memory is tight
• Compact and native representation on disk and in RAM
• Fast Loading times
• Comparable to FieldCache (small overhead)
• Direct value access even when on disk (single seek)



                                                         31
DocValues - what is next?

• the ultimate goal for DocValues is to be update-able
   • changing a per-document values without reindexing
   • users can replace existing values directly for each document
   • each field by itself will be update-able
• Will be available in Lucene 4.0 once released ;)




                                                                    32
DocValues - Updates

• Lucene has write-once policy for files
   • Changing in place is not a good idea - Consistency / Corruption!
• Problem is comparable to norms or deleted docs
   • updating norms requires re-writing the entire norms array (1 byte per
    Document with in memory copy-on-write)

  • same is true for deleted docs while cost is low (1 bit per document)
• DocValues will use a stacked-approach instead




                                                                           33
DocValues - Updates



                                IndexWriter                      update


                             merge
                                                                     (id: 3, value: 777)



coalesced store                      DocValues store
  docID   field: permission            docID   field: permission     update stack
    0           777                     0           777

    1           707                     1           707

    2           644                     2           644

    3           777                     3           644              (id: 5, value: 644)

    4           777                     4           777
                                                                     (id: 6, value: 777)
    5           644                     5           664
                                        6           664              (id: 5, value: 777)
    6           777

   ...

    n                                                                                      34
Use-Cases

• Scoring based on frequently changing values
    • click feedback
    • iterative algorithms like page rank
    • user ratings
• Restricted environments like Android
• Realtime Search (fast loading times)
• frequently changing fields
    • if the fields content is not searched!
• fast field fetching / alternative to stored fields (Distributed Search)
                                                                          35
Questions?




Thank you for your attention!




                                36

More Related Content

What's hot

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberFlink Forward
 
Mysteries of the binary log
Mysteries of the binary logMysteries of the binary log
Mysteries of the binary logMats Kindahl
 
Analyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitAnalyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitDataWorks Summit
 
Numeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and SolrNumeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and SolrVadim Kirilchuk
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBrendan Gregg
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowDataWorks Summit
 
Extending Druid Index File
Extending Druid Index FileExtending Druid Index File
Extending Druid Index FileNavis Ryu
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Lucidworks
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 

What's hot (20)

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
 
Mysteries of the binary log
Mysteries of the binary logMysteries of the binary log
Mysteries of the binary log
 
Analyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitAnalyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and Profit
 
Numeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and SolrNumeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and Solr
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 
Extending Druid Index File
Extending Druid Index FileExtending Druid Index File
Extending Druid Index File
 
Pregel and giraph
Pregel and giraphPregel and giraph
Pregel and giraph
 
ELK introduction
ELK introductionELK introduction
ELK introduction
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
SPARQL Cheat Sheet
SPARQL Cheat SheetSPARQL Cheat Sheet
SPARQL Cheat Sheet
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
 
Introduction to ELK
Introduction to ELKIntroduction to ELK
Introduction to ELK
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 

Viewers also liked

Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Lucidworks (Archived)
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Lucidworks
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
学術コンテンツサービスでの活用事例@Lucene/Solr勉強会(2015.5.13)
学術コンテンツサービスでの活用事例@Lucene/Solr勉強会(2015.5.13)学術コンテンツサービスでの活用事例@Lucene/Solr勉強会(2015.5.13)
学術コンテンツサービスでの活用事例@Lucene/Solr勉強会(2015.5.13)Ikki Ohmukai
 
第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJP
第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJP第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJP
第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJPYahoo!デベロッパーネットワーク
 
MySQLドライバの改良と軽量O/Rマッパーの紹介
MySQLドライバの改良と軽量O/Rマッパーの紹介MySQLドライバの改良と軽量O/Rマッパーの紹介
MySQLドライバの改良と軽量O/Rマッパーの紹介kwatch
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題とN-bestの提案
第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題とN-bestの提案第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題とN-bestの提案
第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題とN-bestの提案Yahoo!デベロッパーネットワーク
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 

Viewers also liked (20)

Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
ドリコムJenkins勉強会資料
ドリコムJenkins勉強会資料ドリコムJenkins勉強会資料
ドリコムJenkins勉強会資料
 
Block join toranomaki
Block join toranomakiBlock join toranomaki
Block join toranomaki
 
Lucene/Solr Revolution2015参加レポート
Lucene/Solr Revolution2015参加レポートLucene/Solr Revolution2015参加レポート
Lucene/Solr Revolution2015参加レポート
 
学術コンテンツサービスでの活用事例@Lucene/Solr勉強会(2015.5.13)
学術コンテンツサービスでの活用事例@Lucene/Solr勉強会(2015.5.13)学術コンテンツサービスでの活用事例@Lucene/Solr勉強会(2015.5.13)
学術コンテンツサービスでの活用事例@Lucene/Solr勉強会(2015.5.13)
 
第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJP
第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJP第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJP
第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJP
 
Nlp4 l intro-20150513
Nlp4 l intro-20150513Nlp4 l intro-20150513
Nlp4 l intro-20150513
 
MySQLドライバの改良と軽量O/Rマッパーの紹介
MySQLドライバの改良と軽量O/Rマッパーの紹介MySQLドライバの改良と軽量O/Rマッパーの紹介
MySQLドライバの改良と軽量O/Rマッパーの紹介
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題とN-bestの提案
第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題とN-bestの提案第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題とN-bestの提案
第17回Lucene/Solr勉強会 #SolrJP – Apache Lucene Solrによる形態素解析の課題とN-bestの提案
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 

Similar to DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon

HPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaHPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaTed Dunning
 
Hippo meetup: enterprise search with Solr and elasticsearch
Hippo meetup: enterprise search with Solr and elasticsearchHippo meetup: enterprise search with Solr and elasticsearch
Hippo meetup: enterprise search with Solr and elasticsearchLuca Cavanna
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localyticsandrew311
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBasedave_revell
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google wayEduard Hildebrandt
 
Scalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsScalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsLars Nielsen
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Cassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapestCassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapestDuyhai Doan
 
How to Fail at Kafka
How to Fail at KafkaHow to Fail at Kafka
How to Fail at Kafkaconfluent
 
Facebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeFacebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeDataWorks Summit
 
Bh ad-12-stealing-from-thieves-saher-slides
Bh ad-12-stealing-from-thieves-saher-slidesBh ad-12-stealing-from-thieves-saher-slides
Bh ad-12-stealing-from-thieves-saher-slidesMatt Kocubinski
 
Cassandra for the ops dos and donts
Cassandra for the ops   dos and dontsCassandra for the ops   dos and donts
Cassandra for the ops dos and dontsDuyhai Doan
 
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisliang chen
 
Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work...
Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work...Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work...
Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work...Ceph Community
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...thelabdude
 
Fiche Online: A Vision for Digitizing All Documents Fiche
Fiche Online: A Vision for Digitizing All Documents FicheFiche Online: A Vision for Digitizing All Documents Fiche
Fiche Online: A Vision for Digitizing All Documents FicheChristopher Brown
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksShalin Shekhar Mangar
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
So you want to liberate your data?
So you want to liberate your data?So you want to liberate your data?
So you want to liberate your data?Mogens Heller Grabe
 

Similar to DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon (20)

HPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaHPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with Katta
 
Hippo meetup: enterprise search with Solr and elasticsearch
Hippo meetup: enterprise search with Solr and elasticsearchHippo meetup: enterprise search with Solr and elasticsearch
Hippo meetup: enterprise search with Solr and elasticsearch
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google way
 
Scalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsScalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data Systems
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Cassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapestCassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapest
 
How to Fail at Kafka
How to Fail at KafkaHow to Fail at Kafka
How to Fail at Kafka
 
Facebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeFacebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage Challenge
 
Bh ad-12-stealing-from-thieves-saher-slides
Bh ad-12-stealing-from-thieves-saher-slidesBh ad-12-stealing-from-thieves-saher-slides
Bh ad-12-stealing-from-thieves-saher-slides
 
Cassandra for the ops dos and donts
Cassandra for the ops   dos and dontsCassandra for the ops   dos and donts
Cassandra for the ops dos and donts
 
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
 
Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work...
Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work...Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work...
Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work...
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
 
Fiche Online: A Vision for Digitizing All Documents Fiche
Fiche Online: A Vision for Digitizing All Documents FicheFiche Online: A Vision for Digitizing All Documents Fiche
Fiche Online: A Vision for Digitizing All Documents Fiche
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networks
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
 
So you want to liberate your data?
So you want to liberate your data?So you want to liberate your data?
So you want to liberate your data?
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 

Recently uploaded

Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Recently uploaded (20)

Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon

  • 1. Column Stride Fields aka. DocValues Simon Willnauer @ Lucene Revolution 2011 PMC Member & Core Comitter Apache Lucene simonw@apache.org / simonw@jteam.nl
  • 2. 2
  • 3. Column Stride Fields aka. DocValues Agenda ‣ What is this all about? aka. The Problem! ‣ The more native solution ‣ DocValues - current state and future ‣ Questions? 3
  • 4. What is this all about? - Inverted Index Lucene is basically an inverted index - used to find terms QUICKLY! 1 The old night keeper keeps the keep in the town term freq Posting list 2 In the big old house in the big old gown. and 1 6 big 2 23 3 The house in the town had the big old keep dark 1 6 4 Where the old night keeper never did sleep. did 1 4 5 The night keeper keeps the keep in the night gown 1 2 6 And keeps in the dark and sleeps in the light. had 1 3 house 2 23 Table with 6 documents in 5 <1> <2> <3> <5> <6> keep 3 135 keeper 3 145 keeps 3 156 TermsEnum light 1 6 never 1 4 night 3 145 IndexWriter old 4 1234 sleep 1 4 sleeps 1 6 the 6 <1> <2> <3> <4> <5> <6> town 2 13 where 1 4
  • 5. Intersecting posting lists Yet, once we found the right terms the game starts.... Posting Lists (document IDs) 5 10 11 55 57 59 77 88 AND Query 1 10 13 44 55 79 88 99 score What goes into the score? PageRank?, ClickFeedback? 5
  • 6. How to store scoring factors? Lucene provides 2 ways of storing data • Stored Fields (document to String or Binary mapping) • Inverted Index (term to document mapping) What if we need here is one or more values per document! • document to value mapping Why not use Stored Fields? 6
  • 7. Using Stored Fields • Stored Fields serve a different purpose • loading body or title fields for result rendering / highlighting • very suited for loading multiple values • With Stored Fields you have one indirection per document resulting in going to disk twice for each document • on-disk random access is too slow • remember Lucene could score millions of documents even if you just render the top 10 or 20! 7
  • 8. Stored Fields under the hood Document title: title: id: 108232 Deutschland Germany absolute file pointers [...][...][93438][...] Field Index (.fdx) [...][id:108232title:Deutschlandtitle:Germany][...] Field Data (.fdt) numFields(vint) [ fieldid(vint) length(vint) payload ] 8
  • 9. Stored Fields - accessing a field 1 Lookup filepointer in .fdx [...][...][93438][...] Field Index (.fdx) 2 Scan on .fdt until you find the field by ID [...][id:108232title:Deutschlandtitle:Germany][...] Field Data (.fdt) numFields(vint) [ fieldid(vint) length(vint) payload ] 9
  • 10. Alternatives? Lucene can un-invert a field into FieldCache un-invert weight term freq Posting list 5.8 1.0 1 16 1.0 2.7 1 23 2.7 parse 2.7 3.2 1 7 4.3 convert to datatype 4.3 1 4 7.9 4.7 1 8 1.0 3.2 5.8 1 0 4.7 7.9 1 59 7.9 array per field / 9.0 1 10 9.0 segment float 32 string / byte[] 10
  • 11. FieldCache - is fast once loaded, once! • Constant time lookup DocID to value • Efficient representation • primitive array • low GC overhead • loading can be slow (realtime can be a problem) • must parse values • builds unnecessary term dictionary • always memory resident 11
  • 12. FieldCache - loading Simple Benchmark • Indexing 100k, 1M and 10M random floats • not analyzed no norms • load field into FieldCache from optimized index 100k Docs 1M Docs 10M Docs 122 ms 348 ms 3161 ms Remember, this is only one field! Some apps have many fields to load to FieldCache 12
  • 13. FieldCache works fine! - if... • you have enough memory • you can afford the loading time • merge is fast enough (for FieldCache you need to index the terms) What if you canʼt? Like when you are in a very restricted environment? • 3 Billion Android installations world wide and growing - 2 MB Heap! • with 100 Million Documents one field takes 30 seconds to load • 2 phase Distributed Search 13
  • 14. Summary • Stored Fields are not fast enough for random access • FieldCache is fast once loaded • abuses a reverse index • must convert to String and from String • requires fair amount of memory • Lucene is missing native data-structure for primitive per-document values 14
  • 15. Column Stride Fields aka. DocValues Agenda ‣ What is this all about? aka. The Problem! ‣ The more native solution ‣ DocValues - current state and future ‣ Questions? 15
  • 16. The more native solution - Column Stride Fields • A dense column based storage • 1 value per document • accepts primitives - no conversion from / to String • int & long • float & double • byte[ ] • each field has a DocValues Type but can still be indexed or stored • Entirely optional 16
  • 17. Simple Layout - even on disk 1 column per field and segment field: time field: id field: page_rank 1 value per document 1288271631431 1 3.2 1288271631531 5 4.5 1288271631631 3 2.3 1288271631732 4 4.44 1288271631832 6 6.7 1288271631932 9 7.8 1288271632032 8 9.9 1288271632132 7 10.1 1288271632233 12 11.0 1288271632333 14 33.1 1288271632433 22 0.2 1288271632533 32 1.4 1288271632637 100 55.6 1288271632737 33 2.2 1288271632838 34 7.5 1288271632938 35 3.2 1288271633038 36 3.4 1288271633138 37 5.6 1288271632333 38 45.0 int64 int32 float 32 17
  • 18. Numeric Types - Int Number of bit depend on the numeric range in the field: field: id 7 - bit per doc 1 5 3 Math.max(1, (int) Math.ceil( Math.log(1+maxValue)/Math.log(2.0)) 4 ); 6 9 Random Access 8 7 12 14 • Integer are stored dense based on PackedInts 22 32 • Space depends on the value-range per segment 100 33 Example: [1, 100] maps to [0, 99] requires 7 bit per doc 34 35 36 • Floats are stored without compression 37 38 • either 32 or 64 bit per value 18
  • 19. Arbitrary Values - The byte[] variants • Length Variants: • Fixed / Variable • Store Variants: • Straight or Referenced fixed / straight fixed / deref data offsets data 10/01/2011 0 10/01/2011 12/01/2011 10 12/01/2011 Random Access Random Access 10/04/2011 20 10/04/2011 10/06/2011 30 10/06/2011 10/05/2011 40 10/05/2011 10/01/2011 50 10/01/2011 10/07/2011 60 10/07/2011 10/04/2011 20 10/04/2011 20 10/04/2011 20 19
  • 20. DocValues - Memory Requirements • RAM Resident - random access • similar to FieldCache • bytes are stored in byte-block pools • currently limited to 2GB per segment • On-Disk - sequential access • almost no JVM heap memory • files should be in FS cache for fast access • possible use MemoryMapped Buffers 20
  • 21. Lets look at the API - Indexing Adding DocValues follows existing patterns, simply use Fieldable Document doc = new Document(); float pageRank = 10.3f; DocValuesField valuesField = new DocValuesField("pageRank"); valuesField.setFloat(pageRank); doc.add(valuesField); writer.addDocument(doc); Sometimes the field should also be indexed, stored or needs term- vectors String titleText = "The quick brown fox"; Field field = new Field("title", titleText , Store.NO, Index.ANALYZED); DocValuesField titleDV = new DocValuesField("title"); titleDV.setBytes(new BytesRef(titleText), Type.BYTES_VAR_DEREF); field.setDocValues(titleDV); 21
  • 22. Looking at the API - Search / Retrieve On disk sequential access is exposed through DocValuesEnum IndexReader reader = ...; DocValues values = reader.docValues("pageRank"); DocValuesEnum floatEnum = values.getEnum(); int doc = 0; FloatsRef ref = floatEnum.getFloat(); // values are filled when iterating while((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) { double value = ref.floats[0]; } // equivalent to ... int doc = 0; while((doc = floatEnum.advance(doc+1)) != DocValuesEnum.NO_MORE_DOCS) { double value = ref.floats[0]; } DocValuesEnum is based on DocIdSetIterator just like Scorer or DocsEnum 22
  • 23. Looking at the API - Search / Retrieve RAM Resident API is very similar to FieldCache IndexReader reader = ...; DocValues values = reader.docValues("pageRank"); Source source = values.getSource(); double value = source.getFloat(x); // still allows iterating over the RAM resident values DocValuesEnum floatEnum = source.getEnum(); int doc; FloatsRef ref = floatEnum.getFloat(); while((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) { value = ref.floats[0]; } DocValuesEnum still available on RAM Resident API 23
  • 24. Can I add my own DocValues Implementation? • DocValues are integrated into Flexible Indexing • IndexWriter / IndexReader write and read DocValues via a Codec • DocValues Types are fixed (int, float32, float64 etc.) but implementations are Codec specific • A Codec provides access to DocValuesComsumer and DocValuesProducer • allows implementing application specific serialzation • customize compression techniques 24
  • 25. Quick detour - Codecs IndexWriter IndexReader Flex API Codec Directory FileSystem 25
  • 26. Quick detour - Codecs IndexWriter IndexReader write read DocValuesConsumer DocValuesProducer Flex API Codec 26
  • 27. Remember the loading FieldCache benchmark? Simple Benchmark • Indexing 100k, 1M and 10M random floats • not analyzed no norms • loading field into FieldCache from optimized index vs. loading DocValues field 100k Docs 1M Docs 10M Docs FieldCache 122 ms 348 ms 3161 ms DocValues 7 ms 10 ms 90 ms Loading is 100 x faster - no un-inverting, no string parsing 27
  • 28. QPS - FieldCache vs. DocValues Task QPS DocValues QPS FieldCache % change AndHighHigh 3.51 3.41 2.9% PKLookup 46.06 44.87 2.7% AndHighMed 37.09 36.48 1.7% Fuzzy2 17.70 17.50 1.1% Fuzzy1 27.15 27.21 -0.2% Phrase 4.12 4.13 -0.2% SpanNear 2.00 2.01 -0.5% SloppyPhrase 1.98 2.02 -2.0% Term 35.29 36.05 -2.1% OrHighMed 4.73 4.93 -4.1% OrHighHigh 3.99 4.18 -4.5% Wildcard 12.97 13.60 -4.6% Prefix3 15.86 16.70 -5.0% IntNRQ 2.72 2.91 -6.5% 6 Search Threads 20 JVM instances, 5 instances per task run 50 times on 12 core Xeon / 24 GB RAM - all queries wrapped with a CustomScoreQuery 28
  • 29. Column Stride Fields aka. DocValues Agenda ‣ What is this all about? aka. The Problem! ‣ The more native solution ‣ DocValues - current state and future ‣ Questions? 29
  • 30. DocValues - current state • Currently still in a branch • Some minor JavaDoc issues • needs some cleanups • Landing on trunk very soon • issue is already opened and active 30
  • 31. DocValues - current features • Fully customizable via Codecs • User can control memory usage per field • Suitable for environments where memory is tight • Compact and native representation on disk and in RAM • Fast Loading times • Comparable to FieldCache (small overhead) • Direct value access even when on disk (single seek) 31
  • 32. DocValues - what is next? • the ultimate goal for DocValues is to be update-able • changing a per-document values without reindexing • users can replace existing values directly for each document • each field by itself will be update-able • Will be available in Lucene 4.0 once released ;) 32
  • 33. DocValues - Updates • Lucene has write-once policy for files • Changing in place is not a good idea - Consistency / Corruption! • Problem is comparable to norms or deleted docs • updating norms requires re-writing the entire norms array (1 byte per Document with in memory copy-on-write) • same is true for deleted docs while cost is low (1 bit per document) • DocValues will use a stacked-approach instead 33
  • 34. DocValues - Updates IndexWriter update merge (id: 3, value: 777) coalesced store DocValues store docID field: permission docID field: permission update stack 0 777 0 777 1 707 1 707 2 644 2 644 3 777 3 644 (id: 5, value: 644) 4 777 4 777 (id: 6, value: 777) 5 644 5 664 6 664 (id: 5, value: 777) 6 777 ... n 34
  • 35. Use-Cases • Scoring based on frequently changing values • click feedback • iterative algorithms like page rank • user ratings • Restricted environments like Android • Realtime Search (fast loading times) • frequently changing fields • if the fields content is not searched! • fast field fetching / alternative to stored fields (Distributed Search) 35
  • 36. Questions? Thank you for your attention! 36