Lucene/Solr Revolution 2016 Conference Report

Shinpei Nakata, Search Core Team, ECPD, Rakuten Inc.
twitter: @shinpeinkt
Dec/13/2016, 19th Lucene/Solr Study Meetup (第19回 Lucene/Solr 勉強会)
GranTokyo South Tower

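Lucene/Solr Revolution
• An event hosted by Lucidworks, focused on Lucene/Solr
– In the same vein as Cassandra Summit, Spark Summit, etc.
– Registration fee: $1095
– 6th edition this year? (The first was reportedly held in Boston in 2011)
• Perhaps the 2010 event, from the Lucid Imagination days, is not counted?
– Held this time in Boston, MA, USA
• Scale
– 2 days
– Attendees: 800+
– 56 talks, 63 speakers
– Many committers (17 talks)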

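About me and Rakuten
• Rakuten, Inc., Search Core Team
– Develops search engines for internal services
– Bug fixing, implementation, proposing new features
• Lately playing with Solr 6
• OSC program
– A chance, mainly for engineers, to attend an international conference once every two years
– Fortunately I got to go this year
• Personal activities
– Go as a hobby
– Blog: http://shinpei.github.io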

Lucene/Solr Revolution Overview
• Data science (11)
– Relevancy tuning, Recommendation, BigData
• Ecosystem (10)
– Combination with other software (Docker, UI, Spark, CI, Durability)
• Exploring Solr (14)
– Streaming (Solr 6), Security, Numeric points (Solr 6)
• Keynote (7)
– a.k.a. big-company use cases: IBM Watson, Salesforce, Commvault...
• Use case (17)
– SIE, Bloomberg, Flipkart, Rakuten, Tech consultants...

Data Science
• "Working with Deeply Nested Documents in Apache Solr", Anshum Gupta, Alisa Zhila, IBM Watson
– How to handle deeply nested documents in Solr
– With recent Solr features you can do quite a lot
• Deeply nested document?
– e.g., replies to comments on a blog article
(Figure: a blog post with a title, its Comments with titles, and their Replies with titles)

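(Recap) Nested Document [1/3]
• Lucene can only hold a flat index
– Parents and children are both stored as independent documents
– A parent and its children are placed in a contiguous docid range
– The layout lets you walk sequentially from child to parent and from parent to child
• An extra field distinguishes parents from children
– e.g., <bool fieldName="isParent">false</bool>
(Figure: a Lucene index segment with docids 1 to 4; nested documents, marked Child or Parent, are stored at consecutive docids)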

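(Recap) Nested Document [2/3]
• Finding parents through their children
– Parents (articles) that have a child (a reply on the blog) containing "Elastic"
(Figure: docids 1 to 8 marked Child or Parent; one child document contains "Elastic")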

(Recap) Nested Document [3/3]
1. Searching children
q=text:"Elastic"&fq=isParent:false
2. Searching parents from children (Block Join Query)
q={!parent which="isParent:true"}text:"Elastic"
(Figure: docids 1 to 8 and the "Elastic" document, shown once per query)
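To make the docid layout concrete, here is a minimal Python sketch that mimics a block join over a flat list. It is not Solr or Lucene code: the list index stands in for the docid, the sample documents and function names are made up for illustration, and it assumes each block is stored with the children first and their parent last.

```python
# Flat "index": the list position plays the role of the docid.
# Assumption: each block is stored as children first, parent last.
DOCS = [
    {"id": "c1", "isParent": False, "text": "Great post on Solr"},
    {"id": "c2", "isParent": False, "text": "I like Elastic better"},
    {"id": "p1", "isParent": True, "text": "Solr and other search engines"},
    {"id": "c3", "isParent": False, "text": "A feature is missing"},
    {"id": "p2", "isParent": True, "text": "Handy Solr features"},
]


def search_children(docs, term):
    """Docids of child documents whose text contains the term."""
    return [i for i, d in enumerate(docs)
            if not d["isParent"] and term in d["text"]]


def parents_of(docs, child_docids):
    """Walk forward from each child to the parent that closes its block."""
    parents = []
    for docid in child_docids:
        while not docs[docid]["isParent"]:
            docid += 1
        if docid not in parents:
            parents.append(docid)
    return parents


children = search_children(DOCS, "Elastic")
parents = parents_of(DOCS, children)
print(children, [DOCS[i]["id"] for i in parents])
```

The first function corresponds to query 1 (children only), and chaining the two corresponds to query 2 (parents of matching children). Because blocks are contiguous, finding a parent is a short forward scan rather than a join over the whole index.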

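Deeply Nested Document
• Basically the same as Nested
– A document at any level is one independent document
• IDs are unique regardless of level
• Introduce a "path"
– To distinguish fields that share a name but sit at different levels
• "1.blog-posts.comments.title"
• "2.blog-posts.comments.replies.title"
• For convenience, a Preprocessor is provided
– Pass it nested JSON and it assigns paths and IDs automatically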
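The kind of preprocessing described here can be sketched in a few lines of Python. This is not the speakers' Preprocessor: the function name `flatten`, the ID scheme (a simple counter), and the exact path format (level number, a dot, then the dotted path, as used in the queries later in this deck) are assumptions for illustration. It also assumes every list value holds child objects.

```python
import itertools


def flatten(node, path, level=1, ids=None):
    """Turn one nested JSON object into flat docs, children before parent.

    Assumptions: every list value is a list of child objects, and
    IDs are sequential integers that are unique across all levels.
    """
    ids = ids if ids is not None else itertools.count(1)
    flat = []
    doc = {"path": f"{level}.{path}"}
    for key, value in node.items():
        if isinstance(value, list):
            for child in value:
                flat.extend(flatten(child, f"{path}.{key}", level + 1, ids))
        else:
            doc[key] = value
    doc["id"] = next(ids)  # assigned after the children, so the parent comes last
    flat.append(doc)
    return flat


post = {
    "title": "Solr and other search engines",
    "comments": [
        {"text": "Great post on Solr", "replies": [{"text": "Exactly!"}]},
        {"text": "I like Elastic better"},
    ],
}

flat_docs = flatten(post, "blog-posts")
for d in flat_docs:
    print(d["id"], d["path"])
```

Emitting children before their parent keeps each block contiguous, which is what block join queries rely on.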

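Example
• Blog post
– Comment
• Reply to the comment
(Figure: example tree in which every node carries a sentiment label: neutral, negative, or positive. Node texts in slide order: "About Solr and other search engines"; "A great post on Solr"; "Exactly!"; "I like Elastic better"; "Introducing Solr's handy features"; "An important feature has been left out!"; "Isn't that a different version?"; "Elastic implemented it first, though")
* This example is a translation of the one used in the talk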

Searching
• Searching parents from comments (children)
q={!parent which="path:1.blog-posts"}(path:2.blog-posts.comments AND sentiment:positive)

Searching
• Searching a specific level (Replies, level 2)
q={!child of="path:2.blog-posts.comments"}path:2.blog-posts.comments AND sentiment:negative
&fq=path:3.blog-posts.comments.replies

Response
• Uses ChildDocTransformerFactory (Solr 5.3+)
q={!parent which="path:2.blog-posts.comments"}path:3.blog-posts.comments.replies AND sentiment:positive
&fl=*,[child parentFilter=path:2.blog-posts.comments childFilter=path:3.blog-posts.comments.replies limit=50]
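These queries contain braces, quotes, and spaces, so they need URL encoding before being sent over HTTP. The following Python sketch builds the `q` and `fl` parameters shown on this slide and encodes them with the standard library. The helper name `nested_query_params` is made up for illustration, and the sketch stops at the query string because the host and collection are not given in the slides.

```python
from urllib.parse import parse_qs, urlencode


def nested_query_params(parent_path, child_path, sentiment, limit=50):
    """Build the block join query plus the [child] transformer field list."""
    return {
        "q": (f'{{!parent which="path:{parent_path}"}}'
              f"path:{child_path} AND sentiment:{sentiment}"),
        "fl": (f"*,[child parentFilter=path:{parent_path} "
               f"childFilter=path:{child_path} limit={limit}]"),
    }


params = nested_query_params(
    "2.blog-posts.comments", "3.blog-posts.comments.replies", "positive"
)
query_string = urlencode(params)  # percent-encodes braces, quotes, spaces
print(query_string)
```

Decoding `query_string` with `parse_qs` gives back the original parameter values, which is a quick way to check that nothing was mangled by the encoding.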

And more…
• Wildcards + Level Numbering
q={!parent which="path:2.*"}path:3.blog-posts.*.keywords AND text:Solr
&fq=path:2.blog-posts.title OR path:2.blog-posts.body
• Faceting
– Block Join Faceting (from Solr 5.5)

Reference
• "Solr's Nesting: On Solr's Capabilities to Handle (Deeply) Nested Document Structures", https://medium.com/@alisazhila/solr-s-nesting-on-solr-s-capabilities-to-handle-deeply-nested-document-structures-50eeaaa4347a#.w8plg0muk
• "Nested Objects in Solr", Solr 'n Stuff, http://yonik.com/solr-nested-objects/
• "Working with deeply nested documents in apache solr", http://www.slideshare.net/anshumg/working-with-deeply-nested-documents-in-apache-solr
• "Block-Join 虎の巻", 16th Lucene/Solr Study Meetup (第16回 Lucene/Solr 勉強会), http://www.slideshare.net/ebisawashinobu/block-join-toranomaki

Ecosystem
• "Rebalancing API for Solr Cloud", Bloomreach, Netflix
– Introduces the Rebalancing API that arrived with Solr 6

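Background
• Bloomreach, Personalization as a product
– A company that hosts personalization services
– A different collection per customer company, ~160M docs
• Managing Solr Cloud is hard
– Multiple cores, collections, ranking, configuration
– "QPS has grown, so let's go from 2 cores to 4"
• But how...?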

Rebalancing API
• Rebalance API
– Scaling Strategy
• Auto Shard
• Redistribute
• Replace
• Scale Up
• Scale Down
• Remove Dead Nodes
– Allocation Strategy
• Least Used Node
• Unused Node
– Size Based Sharding
– Discovery Based Redistribution

Example 1: Re-sharding
• Prepare a separate shard and merge into it first
– Then split with IndexSplitter
http://host:port/solr/admin/collections?
action=REBALANCE
&scaling_strategy=AUTO_SHARD
&shards=4
&collection=collection_name
(Figure: cores on nodes passing through a Merge step and then a Split step)

Example 2: Migration
• Migrating cores
action=REBALANCE
&scaling_strategy=REDISTRIBUTE
(Figure: nodes and their cores)

Example 3: Horizontal Scaling
• Goal: add redundancy
action=REBALANCE
&scaling_strategy=SCALE_UP
&num-replicas=2
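The three examples differ only in their parameters, so the requests can be assembled by one small helper. The Python sketch below is an illustration only: `rebalance_url` is a made-up name, the parameter names and values are copied from the slides, and only the three strategy values that appear there are accepted. Whether a given Solr build exposes this action at all is outside what the sketch can verify.

```python
from urllib.parse import urlencode

# Only the strategy values that appear on the slides; others are not guessed.
KNOWN_STRATEGIES = {"AUTO_SHARD", "REDISTRIBUTE", "SCALE_UP"}


def rebalance_url(base, strategy, extra=None):
    """Build a Collections API URL for the REBALANCE action."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown scaling strategy: {strategy}")
    params = {"action": "REBALANCE", "scaling_strategy": strategy}
    params.update(extra or {})
    return f"{base}/solr/admin/collections?{urlencode(params)}"


# "http://host:port" is the placeholder used on the slide, not a real host.
reshard = rebalance_url(
    "http://host:port", "AUTO_SHARD",
    {"shards": 4, "collection": "collection_name"},
)
scale_up = rebalance_url("http://host:port", "SCALE_UP", {"num-replicas": 2})
print(reshard)
print(scale_up)
```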

Performance
#Doc | Re-indexing | Open source Solr split shard | BloomReach Rebalance API | BloomReach Rebalance API with parallel split
~10K | 2 - 3 min | 35 - 40 secs | 30 - 35 secs | 15 - 20 secs
~100K | 6 - 7 min | 3 - 3.5 min | 2.5 - 3 mins | 40 - 55 secs
~1M | 35 min | 13 - 15 mins | 10 - 12 mins | 2 - 3 mins
~10M | 1h 15 min | 28 - 30 mins | 21 - 24 mins | 3 - 4 mins
~150M | 7h~ | Timeout | ~ 1 hour | 18 - 20 mins
c.f. http://engineering.bloomreach.com/solrcloud-rebalance-api/
• Fast because no reindexing is involved
• Beyond splitting the index, core configuration is migrated automatically as well

Exploring Solr
• "The Evolution of Lucene & Solr Numerics from Strings to Points", Steve Rowe, Lucidworks
– Looks back at how Lucene/Solr has handled numbers, from the viewpoint of how the internal data structures evolved
– Reports benchmarks of the latest Dimensional Points

String Representation of Numbers
1. Initially, numbers were kept as Strings
2. Solr's Int/Long/Float/Double were base-10 variable-width Strings
3. To sort numerically, you were told to pad with zeros
e.g., 15 → 0000015
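The reason for the padding advice is that variable-width decimal strings sort lexicographically, not numerically. A short Python illustration, with sample values chosen arbitrarily and a width of 7 to match the example above:

```python
values = [15, 7, 123]

# Variable-width strings compare character by character: "123" < "15" < "7".
plain = sorted(str(v) for v in values)

# Zero-padding to a fixed width makes string order agree with numeric order.
padded = sorted(str(v).zfill(7) for v in values)

print(plain)   # ['123', '15', '7']
print(padded)  # ['0000007', '0000015', '0000123']
```

The padding width has to cover the largest value that will ever be indexed, which is one of the practical drawbacks of this approach.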

String Representation of Numbers
• 2000, Lucene 0.0.1
– Modified UTF-8 terms
• Null is 2 bytes
• 2008, Lucene 2.4
– UTF-8 terms
• 2012, Lucene 4.0
– Binary terms

Space for Fast Computation
• 2005, Lucene 1.4: FieldCache appears
– Data can now be held in memory
• 2009, Lucene 2.9/Solr 1.4
– Trie numerics introduced
• 2016, Lucene/Solr 6.0
– Trie numerics give way to Dimensional Points

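(Recap) Trie Numerics
• Numbers are stored in a trie
– Range searches become faster
• Generates the minimal range query needed
– The splitting granularity is set by the precision step
c.f., https://epic.awi.de/17813/1/Sch2007br.pdf
(Figure: a trie of indexed terms: 4 → 42 → {421, 423}; 4 → 44 → {445, 446, 448}; 5 → 52 → {521, 522})
intField:[423 TO 599]
Without the trie: intField:423 OR intField:424 OR intField:425 OR intField:426 OR ...
With the trie: intField:423 OR intField:44 OR intField:5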
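The range decomposition can be sketched with decimal prefixes, matching the figure. This is a simplified model, not Lucene's implementation: Lucene's trie fields work on bits with a configurable precision step, whereas this sketch uses one decimal digit per level, and the function name `prefix_terms` is made up. It also enumerates every prefix needed to cover the range, while the slide lists only the prefixes that actually occur in the sample index.

```python
def prefix_terms(lo, hi, digits=3):
    """Cover the inclusive range [lo, hi] with as few decimal prefixes as possible.

    A prefix such as "44" stands for every 3-digit value 440..449.
    """
    terms = []
    while lo <= hi:
        shift = 0
        # Grow the block while it stays aligned and inside the range.
        while shift + 1 < digits:
            block = 10 ** (shift + 1)
            if lo % block == 0 and lo + block - 1 <= hi:
                shift += 1
            else:
                break
        block = 10 ** shift
        terms.append(str(lo // block).zfill(digits - shift))
        lo += block
    return terms


terms = prefix_terms(423, 599)
print(len(terms), terms)
```

For [423 TO 599] this yields 15 terms (423 to 429 individually, the prefixes 43 to 49, and the prefix 5) instead of 177 single-value terms, which is the saving the trie structure buys.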

Making Use of the Index
– DocValues introduced
• A FieldCache that is built in at index time
– Flexible indexing
• Codecs introduced; the index format becomes customizable

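Toward More Efficient Splitting
– Auto prefix terms
• Adjusts the split ranges automatically, to avoid cases where a statically chosen precision step is inefficient
– Removed in Lucene/Solr 6.2 (LUCENE-7317)
• Because Dimensional Points can take its place
• 2016, Dimensional Points introduced
– Replace all numeric types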

Dimensional Points
1. Values are fixed-width (up to 128 bits, i.e. 1-16 bytes)
2. 1D-8D (1-8 dimensions)
3. Block k-d tree

k dimension tree
1. Split along the X axis
2. Split along the Y axis
3. Split along the X axis (2nd time)
A block k-d tree stops splitting once a cell is down to a set number of points
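The splitting procedure can be sketched as a small in-memory 2D tree in Python. This is a toy model under stated assumptions, not Lucene's BKD implementation: the axes simply alternate X, Y, X as on the slides, splits are at the median, leaves stay in memory instead of being written to disk, and `build`, `search`, and `max_leaf` are names chosen for illustration.

```python
def build(points, max_leaf=2, depth=0):
    """Split on alternating axes until a cell holds at most max_leaf points."""
    if len(points) <= max_leaf:
        return {"leaf": sorted(points)}
    axis = depth % 2  # 0 = X, 1 = Y
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "axis": axis,
        "split": pts[mid][axis],
        "left": build(pts[:mid], max_leaf, depth + 1),   # values <= split
        "right": build(pts[mid:], max_leaf, depth + 1),  # values >= split
    }


def search(node, lo, hi):
    """Return the points inside the inclusive box lo..hi."""
    if "leaf" in node:
        return [p for p in node["leaf"]
                if all(lo[i] <= p[i] <= hi[i] for i in range(2))]
    axis, split = node["axis"], node["split"]
    found = []
    if lo[axis] <= split:  # the box may reach into the left cell
        found += search(node["left"], lo, hi)
    if hi[axis] >= split:  # the box may reach into the right cell
        found += search(node["right"], lo, hi)
    return found


POINTS = [(1, 9), (2, 3), (3, 7), (4, 1), (5, 5), (6, 8), (7, 2), (8, 6)]
tree = build(POINTS)
hits = sorted(search(tree, (2, 2), (6, 7)))
print(hits)
```

A range query only descends into cells whose extent can overlap the query box, so most leaf blocks are never touched. Stopping at `max_leaf` points is what makes this a "block" k-d tree rather than one point per leaf.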

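Dimensional Points
1. Values are fixed-width (up to 128 bits, i.e. 1-16 bytes)
2. 1D-8D (1-8 dimensions)
3. Block k-d tree
4. Points are sorted
5. Once splitting gets a cell down to a set number of points or fewer, it is written to disk as a leaf block
6. An in-memory binary tree holds the mapping to the blocks
7. Adaptive optimal partitioning
– Balanced according to density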

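Features of Dimensional Points
• Still Lucene only
– Solr support is tracked in SOLR-8396
• Multi-valued fields are supported
• Retrieving values is not supported
– Add store=true as well
• Sorting and faceting are not supported either
– Use DocValues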

Replacing the Numeric Types
• 1D Native
– LongPoint, IntPoint, DoublePoint, FloatPoint, BinaryPoint
• 1D 128bit
– BigIntegerPoint, InetAddressPoint
• 1D – 4D Range
– LongRangeField, IntRangeField, DoubleRangeField, FloatRangeField
• 2D Geospatial
– LatLonPoint
• 3D Geospatial
– Geo3DPoint

Benchmark (1)
• McCandless benchmark & Adrien Grand re-run
– 36% faster at query time
– 71% faster at index time
– 66% less disk
– 85% less memory
...Isn't that too good?

Benchmark
• Fixed range query
• 25M NYC taxi data
• Three kinds of Long field
– Trie numerics, precision step 8
– Point fields
– Trie numerics, maximum precision step

Benchmark
Field type | Indexing time | Index size
Points | 31s | 1.2GiB
Trie | 53s | 1.6GiB
Single-precision Trie | 19s | 0.7GiB
• 24 fields: 6 string, 1 text, 2 long, 1 int, 14 double
http://www.slideshare.net/lucidworks/the-evolution-of-lucene-solr-numerics-from-strings-to-points-presented-by-steve-rowe-lucidworks

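Benchmark
field | cardinality | hits | field type | time
passenger_count | 10 | 7.5M | IntPoint | 86ms
passenger_count | 10 | 7.5M | TrieInt/8 | 114ms
passenger_count | 10 | 7.5M | TrieInt/32 | 116ms
pick_up_date_time | 4.1M | 10.4M | LongPoint | 69ms
pick_up_date_time | 4.1M | 10.4M | TrieLong/16 | 105ms
pick_up_date_time | 4.1M | 10.4M | TrieLong/64 | 365ms
trip_distance | 4,754 | 9.6M | DoublePoint | 116ms
trip_distance | 4,754 | 9.6M | TrieDouble/16 | 92ms
trip_distance | 4,754 | 9.6M | TrieDouble/64 | 105ms
http://www.slideshare.net/lucidworks/the-evolution-of-lucene-solr-numerics-from-strings-to-points-presented-by-steve-rowe-lucidworks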

References
• "The Evolution of Lucene & Solr Numerics from Strings to Points", Steve Rowe, Lucidworks, http://www.slideshare.net/lucidworks/the-evolution-of-lucene-solr-numerics-from-strings-to-points-presented-by-steve-rowe-lucidworks
• "Fun with flexible indexing", Michael McCandless, http://blog.mikemccandless.com/2010/10/fun-with-flexible-indexing.html

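Impressions from attending the conference…
• Looks very worthwhile for people doing core development
– Many of the new features are contributed by companies
– The community benefits, and the companies gain on maintenance
– A chance to meet many committers
• Though as for the Elasticsearch-leaning Lucene committers... (omitted)
– You get a feel for the energy of the Lucene/Solr community
– G1GC, OK!
• Future
– IBM Watson looks set to become a fairly large use case
– Yonik conveyed high expectations for Expressions
• Some business content as well
– For better or worse, it is not purely technical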

We are hiring!
• Rakuten tech blog
– http://techblog.rakuten.co.jp/
• Rakuten Engineer hiring
– http://corp.rakuten.co.jp/careers/
