SlideShare a Scribd company logo
1 of 22
Download to read offline
1Berlin | 07/02/2012 | zanox | Presentation title
TABLE OF CONTENTS
1. ABOUT ZANOX AND ME
2. REPORTING IN GENERAL
3. ANALYTICS
4. QUESTIONS
2Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
THE ZANOX NETWORK
3Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
WHO AM I?
4
â—Ź Senior Architect at Zanox
â—Ź 7+ years experience in building distributed search systems based on Lucene
Lucid Certified Apache Solr/Lucene Developer
â—Ź 5+ years of using Hadoop and 2+ years of using HBase for data mining
Cloudera Certified Developer for Apache Hadoop and HBase
â—Ź I have applied different machine-learning techniques mainly to optimise
resource usage while performing distributed search during my PhD
● See my book: “Beyond Centralised Search Engines
An Agent-Based Filtering Framework”
â—Ź How can you reach me?
â—Ź dragan.milosevic@zanox.com
â—Ź http://www.linkedin.com/in/draganmilosevic
Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
1. ABOUT ZANOX AND ME
2. REPORTING IN GENERAL
3. ANALYTICS
4. QUESTIONS
5Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
REPORTING IN GENERAL
6
Tracking
Custom Map Reduce Distributed Reporting
M
S
S
S
JBoss
S
S
S
S
S
S
S
S
S
M
S
S
S
S
S
S
S
S
S
S
S
S
M
S
S
S
S
S
S
S
S
S
S
S
S
C
C
C
C
C
C
C
D
D
L
G
Y
D
D
L
F
M
T
T
T
T
T
T
T
T
O
O
O
D
D
D
D
IIIII
IIII
III
IIIIIIII
IIIIIIIIII
IIIIII
DD
D
DD
D
D
D D
D
D D
D D
Q
QQQQQ
R
R
R
R
R
R
D D
D D
D
D
D
D
DD
D
D
D
D
D
D D
D D
D D
D D
D D
D D
D D
OKOK
00:0003:0006:0009:0012:0015:0018:0021:0000:0001:0002:0003:0004:0005:0006:00
D D
D D
D D
D D
D D
D D
D D
D D
D D
00:00:00.00000:00:00.00500:00:00.01000:00:00.01500:00:00.02000:00:00.02500:00:00.03000:00:00.03500:00:00.04000:00:00.04500:00:00.05000:00:00.05500:00:00.06000:00:00.06500:00:00.07000:00:00.07500:00:00.08000:00:00.08500:00:00.09000:00:00.09500:00:00.10000:00:00.20000:00:00.30000:00:00.40000:00:00.50000:00:00.60000:00:00.65000:00:00.70000:00:00.75000:00:00.80000:00:00.85000:00:00.90000:00:00.91000:00:00.92000:00:00.93000:00:00.94000:00:00.95000:00:0000:02:0000:04:0000:06:0000:08:0000:10:0000:10:2000:10:4000:11:0000:11:2000:11:4000:12:0000:14:0000:16:0000:18:0000:20:0000:22:0000:24:0000:26:0000:28:00
Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
Name Searcher
Role Knows how to build reports for queries
Able to load assigned indexes from HDFS
Design RPC Server
Query optimiser
Lucene searcher with MMap directories
Report aggregation component
Name Merger
Role Knows how to build sub-queries for searchers
Able to assign indexes to searchers
Aware of available searchers
Design RPC Server
Report aggregation component
Sorting and master-data management
Name Observer
Role Checks the health of mergers and searchers
Automatic deployment of clusters
Repair unhealthy mergers and searchers
Provide information about observed clusters
Design RPC Server
Name Client
Role Keep track of available mergers
Rerouting of failed queries
Take care of load when assigning queries
Design JBoss application
RPC client for communication with mergers
Name Tracking Server
Role Tracks events such as sales, clicks, views …
Writes tracked events to HDFS
…
Design Proprietary application
Name Database server
FTP server
Log server
Role Provides data needed for reporting
…
Design …
Name Google
Yahoo
Miva
Role Provides search engine cost data
…
Design …
Data-Import Data producers are pushing data during a whole day
MR Jobs Reporting jobs are started at midnight
It takes around 6 hours to process data from yesterday
Index loading normally lasts no more that 30 minutes
Reporting There are multiple search clusters for redundancy
Every cluster has one merger that coordinates searchers
The health of every cluster is ensured by observer
Searchers are grouped based on their capabilities
Within 100ms queries are propagated to best searchers
Capable to provide reports in sub-second time
ANALYTICS
1. ABOUT ZANOX AND ME
2. REPORTING IN GENERAL
3. ANALYTICS
4. QUESTIONS
7Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
Searching
2012-09-03
2012-09-03
2012-09-03
2012-09-03
2012-09-04
2012-09-04
2012-09-04
2012-09-04
ANALYTICS
8Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
Date PublisherAdvertiser Keyword Commission
JimHUK Auto insurance company 23.56 €
TomDEVK Health insurance 9.24 €
BobDEVK Health insurance 43.81 €
BillHUK Cheap auto insurance 7.22 €
AlHUK Auto insurance company 43.07 €
MikeHUK Auto insurance company 29.98 €
NickDEVK Health insurance 12.03 €
PhilDEVK Cheap auto insurance 9.72 €
Amount
345.06 €
81.35 €
512.19 €
83.76 €
892.12 €
98.17 €
108.19 €
83.21 €
OptimizationSharding
Commission | Amount
23.56 | 345.06 €
9.24 | 81.35 €
43.81 | 512.19 €
7.22 | 83.76 €
43.07 | 892.12 €
29.98 | 98.17 €
12.03 | 108.19 €
9.72 | 83.21 €
2012-09-03
2012-09-03
2012-09-03
2012-09-04
2012-09-04
2012-09-04
ANALYTICS
9Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
Date Advertiser Keyword Publisher – Commission | Amount
HUK Auto insurance company Jim – 23.56 | 345.06 €
DEVK Health insurance
Tom – 9.24 | 81.35 €
Bob – 43.81 | 512.19 €
HUK Cheap auto insurance Bill – 7.22 | 83.76 €
HUK Auto insurance company
Al – 43.07 | 892.12 €
Mike – 29.98 | 98.17 €
DEVK Health insurance Nick – 12.03 | 108.19 €
DEVK Cheap auto insurance Phil – 9.72 | 83.21 €
Searching OptimizationSharding
Searching
ANALYTICS
10Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
2012-09-03
2012-09-03
2012-09-03
2012-09-04
2012-09-04
2012-09-04
Date Advertiser Publisher – Commission | Amount
HUK Jim – 23.56 | 345.06 €
DEVK
Tom – 9.24 | 81.35 €
Bob – 43.81 | 512.19 €
HUK Bill – 7.22 | 83.76 €
HUK
Al – 43.07 | 892.12 €
Mike – 29.98 | 98.17 €
DEVK Nick – 12.03 | 108.19 €
DEVK
Keyword
Auto insurance company
Health insurance
Cheap auto insurance
Auto insurance company
Health insurance
Cheap auto insurance Phil – 9.72 | 83.21 €
Q
FILTER
GROUP BY
Jim
PUBLISHER
HUK
ADVERTISER
HUK
ADVERTISER
R
VALUES
103,83 €
COMMISSION
1419,11 €
AMOUNT
SEARCHING
AGGREGATION
LOADING
What does each
searcher when
query comes?
OptimizationSharding
Can we further
improve loading
of documents?
Yes we can!
Find out which
dimensions are
important filters
and use them
while indexing
documents
Searching
ANALYTICS
11Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
2012-09-03
2012-09-03
2012-09-03
2012-09-04
2012-09-04
2012-09-04
Date Advertiser Publisher – Commission | Amount
HUK Jim – 23.56 | 345.06 €
DEVK
Tom – 9.24 | 81.35 €
Bob – 43.81 | 512.19 €
HUK Bill – 7.22 | 83.76 €
HUK
Al – 43.07 | 892.12 €
Mike – 29.98 | 98.17 €
DEVK Nick – 12.03 | 108.19 €
DEVK
Keyword
Auto insurance company
Health insurance
Cheap auto insurance
Auto insurance company
Health insurance
Cheap auto insurance Phil – 9.72 | 83.21 €
Q
FILTER
GROUP BY
Jim
PUBLISHER
HUK
ADVERTISER
HUK
ADVERTISER
OptimizationSharding
R
VALUES
103,83 €
COMMISSION
1419,11 €
AMOUNT
Searching
ANALYTICS
12Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
â—Ź There are 9 dimensions in total, where several
â—Ź Have huge cardinality (>1 million different values) and
â—Ź At the same time are rarely used as filter, meaning
â—Ź They are excellent candidates for being put together with measures.
â—Ź There are more than 50 measures that
â—Ź Are freely combined in user-defined formulas, but at the same time
â—Ź Form only few groups based on usage patterns, meaning
â—Ź They contribute to better compression because fields become bigger.
â—Ź Ordering in which documents are inserted in index is driven by
â—Ź Dimensions that are known to be the most important filters, meaning
â—Ź Faster loading as needed documents are closer to each other.
OptimizationSharding
OptimizationSearching Sharding
Sharding based on hashing
ANALYTICS
13Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
Distributed Reporting
M
S
S
S
S
S
S
S
S
S
S
S
S
M
S
S
S
S
S
S
S
S
S
S
S
S O
M
S
S
S
S
S
S
S
S
S
S
S
S O
O
R
VALUES
16,55 €
COMMISSION
523,21 €
AMOUNT
R
VALUES
103,83 €
COMMISSION
1419,11 €
AMOUNT
R
VALUES
78,72 €
COMMISSION
879,34 €
AMOUNT
R
VALUES
134,23 €
COMMISSION
1622,56 €
AMOUNT
R
VALUES
336,16 €
COMMISSION
4464,09 €
AMOUNT
R R R R
Q
FILTER
GROUP BY
Jim
PUBLISHER
HUK
ADVERTISER
HUK
ADVERTISER
QQQQQ
R
VALUES
2,83 €
COMMISSION
19,87 €
AMOUNT
RSharding based on hashing Query is sent to each and every searcher
Searchers work hard to produce reports
Huge amount of data sent over network
Goals for alternative sharding Think very well while distributing data
Network is your friend and precious resource
Produce same report with less searchers
Sharding based on selected filters
OptimizationShardingSearching
ANALYTICS
14Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
M
S S S S S R
VALUES
336,16 €
COMMISSION
4464,09 €
AMOUNT
Q
FILTER
GROUP BY
Jim
PUBLISHER
HUK
ADVERTISER
HUK
ADVERTISER
QQ
VALUES
133,33 €
COMMISSION
2454,22 €
AMOUNT
RRR
VALUES
202,83 €
COMMISSION
2009,87 €
AMOUNT
R
OptimizationShardingSearching
ANALYTICS
15Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
â—Ź Sharding based on hashing randomly distributes data
â—Ź But that is not needed because no relevance will be computed.
â—Ź Queries have to be send to each and every searcher, being suboptimal
because much more data have to be transmitted over the network.
â—Ź The slowest searcher determine final response time due to completeness.
â—Ź How we partition based on filter fields?
â—Ź Order dimensions based on their usage frequency as filter in particular group.
â—Ź Greater usage frequency means more important for sharding.
â—Ź Examples of dimensions that are found to be excellent for sharding are:
time related dimensions and advertiser name dimension.
â—Ź Merger uses knowledge about which data each shard has while routing.
â—Ź Final result is more stable system and much better QPS.
ANALYTICS
16Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
3rdGroup2ndGroup1stGroup
Publisher
Advertiser
Keyword
Publisher
Keyword
Publisher
Advertiser
Distributed Reporting
M
S
S
S
S
S
S
S
S
S
S
S
S
M
S
S
S
S
S
S
S
S
S
S
S
S O
M
S
S
S
S
S
S
S
S
S
S
S
S O
O
OptimizationShardingSearching
We have created separate
searcher group for every
possible combination of
dimensions. Therefore we
are well prepared for each
query that can hit us.
But we are attacked where
we haven’t thought that it
is possible. One new field
have been requested and
that has spoiled everything
17Berlin | 07/02/2012 | zanox | Presentation title
3rdGroup2ndGroup1stGroup
Publisher
Advertiser
Keyword
Publisher
Keyword
Publisher
Advertiser
M
4thGroup
Publisher
5thGroup
Advertiser
6thGroup
Advertiser
Keyword
7thGroup
Keyword
Q
FILTER
GROUP BY
Jim
PUBLISHER
Keyword
HUK
ADVERTISER
Insurance
KEYWORD
Website
8thGroup
Advertiser
Website
Keyword
9thGroup
Website
Advertiser
Publisher
10thGroup
Website
11thGroup
Website
Keyword
12thGroup
Website
Advertiser
13thGroup
Website
Publisher
Publisher
There are so
many groups
but no one
has fields
Website
Publisher
Keyword
Creating a separate searcher group for each and every possible
combination of dimensions in each query is not feasible because
number of searcher-groups grows extremely fast (exponentially)
Number of dimensions Number of searcher groups
3 23 – 1 = 7
4 24 – 1 = 15
… …
9 29 – 1 = 511
Solution is to analyze already processed queries to discard low
performing groups and to create new ones for emerging queries
ANALYTICS
OptimizationShardingSearching
OptimizationShardingSearching
18Berlin | 07/02/2012 | zanox | Presentation title
3rdGroup2ndGroup1stGroup
Publisher
Advertiser
Keyword
Publisher
Keyword
Publisher
Advertiser
4thGroup
Publisher
5thGroup
Advertiser
6thGroup
Advertiser
Keyword
7thGroup
Keyword
8thGroup
Advertiser
Website
Keyword
9thGroup
Website
Advertiser
Publisher
10thGroup
Website
11thGroup
Website
Keyword
12thGroup
Website
Advertiser
13thGroup
Website
Publisher
queries
451236
response
45.54 ms
queries
9
response
15.54 ms
queries
221128
response
82.87 ms
queries
2876136
response
23.54 ms
queries
236
response
92.33 ms
queries
2
response
147.82 ms
queries
228732
response
61.03 ms
queries
89
response
215.04 ms
queries
15
response
105.02 ms
queries
5644
response
29.76 ms
queries
4
response
563.04 ms
queries
32
response
87.03 ms
queries
543765
response
35.87 ms
14thGroup
Website
Publisher
Keyword
M
Q
FILTER
GROUP BY
Jim
PUBLISHER
Keyword
HUK
ADVERTISER
Insurance
KEYWORD
Website
Publisher
There are so
many groups
but no one
has fields
Website
Publisher
Keyword
ANALYTICS
19Berlin | 07/02/2012 | zanox | Presentation title
3rdGroup9thGroup1stGroup
Publisher
Advertiser
Keyword
Website
Publisher
Publisher
Advertiser
4thGroup
Publisher
11thGroup
Website
Keyword
queries
20236
response
45.54 ms
Advertiser
queries
91234
response
15.54 ms
queries
221128
response
82.87 ms
queries
286136
response
23.54 ms
queries
22336
response
92.33 ms
14thGroup
Website
Publisher
Keyword
queries
32546
response
727.23 ms
M
Q
FILTER
GROUP BY
Jim
PUBLISHER
HUK
ADVERTISER
Insurance
KEYWORD
Website
Publisher
15thGroup
Keyword
Publisher
Fields
Website
Publisher
Keyword
queries
9874
response
129.14 ms
Publisher
Keyword
queries
22678
response
989.28 msFields
=
We optimized system
by adding specialized
group that is excellent
for found slow queries
There is no
group that is
optimized
for queries
having fields
Publisher
Keyword
ANALYTICS
OptimizationShardingSearching
ANALYTICS
20Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
â—Ź Hadoop makes possible to easily build new indexes:
â—Ź Optimize utilization of Hadoop resources because
â—Ź Better indexes can be always build by low priority jobs.
â—Ź Built OLAP continuously adapts itself to queries and
â—Ź There are no more expensive queries in long run.
â—Ź How can we decide which group is best of query at hand?
● That’s query routing, which we solved either by
â—Ź Asking searchers only to return number of hits and
then to generate report if being selected, or simpler
â—Ź By using heuristic based on dimension cardinality.
â—Ź But query routing is not a topic of this talk.
OptimizationShardingSearching
1. ABOUT ZANOX AND ME
2. REPORTING IN GENERAL
3. QUERY ROUTING
4. QUESTIONS
21Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
22Berlin | 08/05/2013 | zanox | Zanox Reporting Systems

More Related Content

What's hot

Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezDataWorks Summit
 
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...DataWorks Summit
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Zhenxiao Luo
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Spark Summit
 
HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastDataWorks Summit
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarSpark Summit
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at UberDataWorks Summit
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Tugdual Grall
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 

What's hot (20)

Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraph
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the Beast
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 

Viewers also liked

Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intropascaldimassimo
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
Lucene
LuceneLucene
LuceneMatt Wood
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalogmarkgrover
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simonlucenerevolution
 
Lucandra
LucandraLucandra
Lucandraotisg
 
HCatalog
HCatalogHCatalog
HCatalogGetInData
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...Lucidworks
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Spark Summit
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMLucidworks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartLucidworks
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 

Viewers also liked (20)

Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Lucene
LuceneLucene
Lucene
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
Lucandra
LucandraLucandra
Lucandra
 
HCatalog
HCatalogHCatalog
HCatalog
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 

Similar to Analytics in olap with lucene & hadoop

How to Make Your Move to the Cloud with Confidence
How to Make Your Move to the Cloud with ConfidenceHow to Make Your Move to the Cloud with Confidence
How to Make Your Move to the Cloud with ConfidenceCloud Spectator
 
apidays Paris 2022 - Sustainable API Green Score, Yannick Tremblais (Groupe R...
apidays Paris 2022 - Sustainable API Green Score, Yannick Tremblais (Groupe R...apidays Paris 2022 - Sustainable API Green Score, Yannick Tremblais (Groupe R...
apidays Paris 2022 - Sustainable API Green Score, Yannick Tremblais (Groupe R...apidays
 
Cloud Storage Options: The True Costs
Cloud Storage Options:  The True CostsCloud Storage Options:  The True Costs
Cloud Storage Options: The True CostsHitachi Vantara
 
Proudly Carbon Neutral - Our Services
Proudly Carbon Neutral - Our ServicesProudly Carbon Neutral - Our Services
Proudly Carbon Neutral - Our ServicesProudly Carbon Neutral
 
Optimal Search Engine Marketing
Optimal Search Engine MarketingOptimal Search Engine Marketing
Optimal Search Engine MarketingBernd Skiera
 
Appcelerator: Customer testimonial and demo (VAB Fleet Services - Diederik De...
Appcelerator: Customer testimonial and demo (VAB Fleet Services - Diederik De...Appcelerator: Customer testimonial and demo (VAB Fleet Services - Diederik De...
Appcelerator: Customer testimonial and demo (VAB Fleet Services - Diederik De...ACA IT-Solutions
 
Changing the Economics of IT
Changing the Economics of ITChanging the Economics of IT
Changing the Economics of ITValencell, Inc.
 
E&S Joint Working Group Meeting - AAMGA Technology Conference 2014
E&S Joint Working Group Meeting - AAMGA Technology Conference 2014E&S Joint Working Group Meeting - AAMGA Technology Conference 2014
E&S Joint Working Group Meeting - AAMGA Technology Conference 2014Angelyn Treutel
 
When smarter procurement means...
When smarter procurement  means...When smarter procurement  means...
When smarter procurement means...OpusCapita
 
Getting Managers to Ride the Cloud
Getting Managers to Ride the CloudGetting Managers to Ride the Cloud
Getting Managers to Ride the CloudDavid Amaya
 
Logistiek manager van het Jaar - sessie Robuust Plannen
Logistiek manager van het Jaar - sessie Robuust PlannenLogistiek manager van het Jaar - sessie Robuust Plannen
Logistiek manager van het Jaar - sessie Robuust PlannenBas Van Bree
 
Horizon 2013 The Journey from Savings Opportunity to Savings Realization - Ca...
Horizon 2013 The Journey from Savings Opportunity to Savings Realization - Ca...Horizon 2013 The Journey from Savings Opportunity to Savings Realization - Ca...
Horizon 2013 The Journey from Savings Opportunity to Savings Realization - Ca...Zycus
 
cloud economics - Toronto FSI Symposium - October 2016
cloud economics - Toronto FSI Symposium - October 2016cloud economics - Toronto FSI Symposium - October 2016
cloud economics - Toronto FSI Symposium - October 2016Amazon Web Services
 
Esouag r12 presentation
Esouag r12 presentationEsouag r12 presentation
Esouag r12 presentationBerry Clemens
 
Migrations leader workshop 11.19
Migrations leader workshop 11.19Migrations leader workshop 11.19
Migrations leader workshop 11.19Cassie Mayes
 
CSCMP 2014: Traditional vs Cloud-Based WMS -- Cornerstone V Track 18 Warehous...
CSCMP 2014: Traditional vs Cloud-Based WMS -- Cornerstone V Track 18 Warehous...CSCMP 2014: Traditional vs Cloud-Based WMS -- Cornerstone V Track 18 Warehous...
CSCMP 2014: Traditional vs Cloud-Based WMS -- Cornerstone V Track 18 Warehous...InterimONE
 
Paradox Routing Tool Overview and Quick Tour
Paradox Routing Tool Overview and Quick TourParadox Routing Tool Overview and Quick Tour
Paradox Routing Tool Overview and Quick Tourparadoxsci
 

Similar to Analytics in olap with lucene & hadoop (20)

How to Make Your Move to the Cloud with Confidence
How to Make Your Move to the Cloud with ConfidenceHow to Make Your Move to the Cloud with Confidence
How to Make Your Move to the Cloud with Confidence
 
apidays Paris 2022 - Sustainable API Green Score, Yannick Tremblais (Groupe R...
apidays Paris 2022 - Sustainable API Green Score, Yannick Tremblais (Groupe R...apidays Paris 2022 - Sustainable API Green Score, Yannick Tremblais (Groupe R...
apidays Paris 2022 - Sustainable API Green Score, Yannick Tremblais (Groupe R...
 
Cloud Storage Options: The True Costs
Cloud Storage Options:  The True CostsCloud Storage Options:  The True Costs
Cloud Storage Options: The True Costs
 
Proudly Carbon Neutral - Our Services
Proudly Carbon Neutral - Our ServicesProudly Carbon Neutral - Our Services
Proudly Carbon Neutral - Our Services
 
Optimal Search Engine Marketing
Optimal Search Engine MarketingOptimal Search Engine Marketing
Optimal Search Engine Marketing
 
Cloud Economics
Cloud EconomicsCloud Economics
Cloud Economics
 
Appcelerator: Customer testimonial and demo (VAB Fleet Services - Diederik De...
Appcelerator: Customer testimonial and demo (VAB Fleet Services - Diederik De...Appcelerator: Customer testimonial and demo (VAB Fleet Services - Diederik De...
Appcelerator: Customer testimonial and demo (VAB Fleet Services - Diederik De...
 
SD-WAN Economics 101 - VeloCloud
SD-WAN Economics 101 - VeloCloudSD-WAN Economics 101 - VeloCloud
SD-WAN Economics 101 - VeloCloud
 
Changing the Economics of IT
Changing the Economics of ITChanging the Economics of IT
Changing the Economics of IT
 
E&S Joint Working Group Meeting - AAMGA Technology Conference 2014
E&S Joint Working Group Meeting - AAMGA Technology Conference 2014E&S Joint Working Group Meeting - AAMGA Technology Conference 2014
E&S Joint Working Group Meeting - AAMGA Technology Conference 2014
 
When smarter procurement means...
When smarter procurement  means...When smarter procurement  means...
When smarter procurement means...
 
Getting Managers to Ride the Cloud
Getting Managers to Ride the CloudGetting Managers to Ride the Cloud
Getting Managers to Ride the Cloud
 
Logistiek manager van het Jaar - sessie Robuust Plannen
Logistiek manager van het Jaar - sessie Robuust PlannenLogistiek manager van het Jaar - sessie Robuust Plannen
Logistiek manager van het Jaar - sessie Robuust Plannen
 
Horizon 2013 The Journey from Savings Opportunity to Savings Realization - Ca...
Horizon 2013 The Journey from Savings Opportunity to Savings Realization - Ca...Horizon 2013 The Journey from Savings Opportunity to Savings Realization - Ca...
Horizon 2013 The Journey from Savings Opportunity to Savings Realization - Ca...
 
cloud economics - Toronto FSI Symposium - October 2016
cloud economics - Toronto FSI Symposium - October 2016cloud economics - Toronto FSI Symposium - October 2016
cloud economics - Toronto FSI Symposium - October 2016
 
Esouag r12 presentation
Esouag r12 presentationEsouag r12 presentation
Esouag r12 presentation
 
LogCOST
LogCOSTLogCOST
LogCOST
 
Migrations leader workshop 11.19
Migrations leader workshop 11.19Migrations leader workshop 11.19
Migrations leader workshop 11.19
 
CSCMP 2014: Traditional vs Cloud-Based WMS -- Cornerstone V Track 18 Warehous...
CSCMP 2014: Traditional vs Cloud-Based WMS -- Cornerstone V Track 18 Warehous...CSCMP 2014: Traditional vs Cloud-Based WMS -- Cornerstone V Track 18 Warehous...
CSCMP 2014: Traditional vs Cloud-Based WMS -- Cornerstone V Track 18 Warehous...
 
Paradox Routing Tool Overview and Quick Tour
Paradox Routing Tool Overview and Quick TourParadox Routing Tool Overview and Quick Tour
Paradox Routing Tool Overview and Quick Tour
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Recently uploaded

Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]Commonwealth
 
The Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfThe Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfGale Pooley
 
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...Pooja Nehwal
 
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779Delhi Call girls
 
00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptx00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptxFinTech Belgium
 
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With RoomVIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Roomdivyansh0kumar0
 
Dividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxDividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxanshikagoel52
 
Q3 2024 Earnings Conference Call and Webcast Slides
Q3 2024 Earnings Conference Call and Webcast SlidesQ3 2024 Earnings Conference Call and Webcast Slides
Q3 2024 Earnings Conference Call and Webcast SlidesMarketing847413
 
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | â‚ą5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | â‚ą5k To 25k With Room...VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | â‚ą5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | â‚ą5k To 25k With Room...Suhani Kapoor
 
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdfFinTech Belgium
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptxFinTech Belgium
 
The Economic History of the U.S. Lecture 30.pdf
The Economic History of the U.S. Lecture 30.pdfThe Economic History of the U.S. Lecture 30.pdf
The Economic History of the U.S. Lecture 30.pdfGale Pooley
 
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...ssifa0344
 
Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex
 
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...Call Girls in Nagpur High Profile
 
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service NashikHigh Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...ssifa0344
 

Recently uploaded (20)

Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]
 
The Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfThe Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdf
 
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
 
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
 
00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptx00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptx
 
Commercial Bank Economic Capsule - April 2024
Commercial Bank Economic Capsule - April 2024Commercial Bank Economic Capsule - April 2024
Commercial Bank Economic Capsule - April 2024
 
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With RoomVIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
 
Dividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxDividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptx
 
Q3 2024 Earnings Conference Call and Webcast Slides
Q3 2024 Earnings Conference Call and Webcast SlidesQ3 2024 Earnings Conference Call and Webcast Slides
Q3 2024 Earnings Conference Call and Webcast Slides
 
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | â‚ą5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | â‚ą5k To 25k With Room...VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | â‚ą5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | â‚ą5k To 25k With Room...
 
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
 
The Economic History of the U.S. Lecture 30.pdf
The Economic History of the U.S. Lecture 30.pdfThe Economic History of the U.S. Lecture 30.pdf
The Economic History of the U.S. Lecture 30.pdf
 
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
 
Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024
 
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
 
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service NashikHigh Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
 
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
 

Analytics in olap with lucene & hadoop

  • 1. 1Berlin | 07/02/2012 | zanox | Presentation title
  • 2. TABLE OF CONTENTS 1. ABOUT ZANOX AND ME 2. REPORTING IN GENERAL 3. ANALYTICS 4. QUESTIONS 2Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  • 3. THE ZANOX NETWORK 3Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  • 4. WHO AM I? 4 â—Ź Senior Architect at Zanox â—Ź 7+ years experience in building distributed search systems based on Lucene Lucid Certified Apache Solr/Lucene Developer â—Ź 5+ years of using Hadoop and 2+ years of using HBase for data mining Cloudera Certified Developer for Apache Hadoop and HBase â—Ź I have applied different machine-learning techniques mainly to optimise resource usage while performing distributed search during my PhD â—Ź See my book: “Beyond Centralised Search Engines An Agent-Based Filtering Framework” â—Ź How can you reach me? â—Ź dragan.milosevic@zanox.com â—Ź http://www.linkedin.com/in/draganmilosevic Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  • 5. 1. ABOUT ZANOX AND ME 2. REPORTING IN GENERAL 3. ANALYTICS 4. QUESTIONS 5Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  • 6. REPORTING IN GENERAL 6 Tracking Custom Map Reduce Distributed Reporting M S S S JBoss S S S S S S S S S M S S S S S S S S S S S S M S S S S S S S S S S S S C C C C C C C D D L G Y D D L F M T T T T T T T T O O O D D D D IIIII IIII III IIIIIIII IIIIIIIIII IIIIII DD D DD D D D D D D D D D Q QQQQQ R R R R R R D D D D D D D D DD D D D D D D D D D D D D D D D D D D D OKOK 00:0003:0006:0009:0012:0015:0018:0021:0000:0001:0002:0003:0004:0005:0006:00 D D D D D D D D D D D D D D D D D D 00:00:00.00000:00:00.00500:00:00.01000:00:00.01500:00:00.02000:00:00.02500:00:00.03000:00:00.03500:00:00.04000:00:00.04500:00:00.05000:00:00.05500:00:00.06000:00:00.06500:00:00.07000:00:00.07500:00:00.08000:00:00.08500:00:00.09000:00:00.09500:00:00.10000:00:00.20000:00:00.30000:00:00.40000:00:00.50000:00:00.60000:00:00.65000:00:00.70000:00:00.75000:00:00.80000:00:00.85000:00:00.90000:00:00.91000:00:00.92000:00:00.93000:00:00.94000:00:00.95000:00:0000:02:0000:04:0000:06:0000:08:0000:10:0000:10:2000:10:4000:11:0000:11:2000:11:4000:12:0000:14:0000:16:0000:18:0000:20:0000:22:0000:24:0000:26:0000:28:00 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems Name Searcher Role Knows how to build reports for queries Able to load assigned indexes from HDFS Design RPC Server Query optimiser Lucene searcher with MMap directories Report aggregation component Name Merger Role Knows how to build sub-queries for searchers Able to assign indexes to searchers Aware of available searchers Design RPC Server Report aggregation component Sorting and master-data management Name Observer Role Checks the health of mergers and searchers Automatic deployment of clusters Repair unhealthy mergers and searchers Provide information about observed clusters Design RPC Server Name Client Role Keep track of available mergers Rerouting of failed queries Take care of load when assigning queries Design JBoss application RPC client for communication with mergers Name Tracking Server Role Tracks events such as sales, clicks, views … Writes tracked events to HDFS … Design Proprietary application Name Database server FTP server Log server Role Provides data needed for reporting … Design … Name Google Yahoo Miva Role Provides search engine cost data … Design … Data-Import Data producers are pushing data during a whole day MR Jobs Reporting jobs are started at midnight It takes around 6 hours to process data from yesterday Index loading normally lasts no more that 30 minutes Reporting There are multiple search clusters for redundancy Every cluster has one merger that coordinates searchers The health of every cluster is ensured by observer Searchers are grouped based on their capabilities Within 100ms queries are propagated to best searchers Capable to provide reports in sub-second time ANALYTICS
  • 7. 1. ABOUT ZANOX AND ME 2. REPORTING IN GENERAL 3. ANALYTICS 4. QUESTIONS 7Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  • 8. Searching 2012-09-03 2012-09-03 2012-09-03 2012-09-03 2012-09-04 2012-09-04 2012-09-04 2012-09-04 ANALYTICS 8Berlin | 08/05/2013 | zanox | Zanox Reporting Systems Date PublisherAdvertiser Keyword Commission JimHUK Auto insurance company 23.56 € TomDEVK Health insurance 9.24 € BobDEVK Health insurance 43.81 € BillHUK Cheap auto insurance 7.22 € AlHUK Auto insurance company 43.07 € MikeHUK Auto insurance company 29.98 € NickDEVK Health insurance 12.03 € PhilDEVK Cheap auto insurance 9.72 € Amount 345.06 € 81.35 € 512.19 € 83.76 € 892.12 € 98.17 € 108.19 € 83.21 € OptimizationSharding Commission | Amount 23.56 | 345.06 € 9.24 | 81.35 € 43.81 | 512.19 € 7.22 | 83.76 € 43.07 | 892.12 € 29.98 | 98.17 € 12.03 | 108.19 € 9.72 | 83.21 €
  • 9. 2012-09-03 2012-09-03 2012-09-03 2012-09-04 2012-09-04 2012-09-04 ANALYTICS 9Berlin | 08/05/2013 | zanox | Zanox Reporting Systems Date Advertiser Keyword Publisher – Commission | Amount HUK Auto insurance company Jim – 23.56 | 345.06 € DEVK Health insurance Tom – 9.24 | 81.35 € Bob – 43.81 | 512.19 € HUK Cheap auto insurance Bill – 7.22 | 83.76 € HUK Auto insurance company Al – 43.07 | 892.12 € Mike – 29.98 | 98.17 € DEVK Health insurance Nick – 12.03 | 108.19 € DEVK Cheap auto insurance Phil – 9.72 | 83.21 € Searching OptimizationSharding
  • 10. Searching ANALYTICS 10Berlin | 08/05/2013 | zanox | Zanox Reporting Systems 2012-09-03 2012-09-03 2012-09-03 2012-09-04 2012-09-04 2012-09-04 Date Advertiser Publisher – Commission | Amount HUK Jim – 23.56 | 345.06 € DEVK Tom – 9.24 | 81.35 € Bob – 43.81 | 512.19 € HUK Bill – 7.22 | 83.76 € HUK Al – 43.07 | 892.12 € Mike – 29.98 | 98.17 € DEVK Nick – 12.03 | 108.19 € DEVK Keyword Auto insurance company Health insurance Cheap auto insurance Auto insurance company Health insurance Cheap auto insurance Phil – 9.72 | 83.21 € Q FILTER GROUP BY Jim PUBLISHER HUK ADVERTISER HUK ADVERTISER R VALUES 103,83 € COMMISSION 1419,11 € AMOUNT SEARCHING AGGREGATION LOADING What does each searcher when query comes? OptimizationSharding Can we further improve loading of documents? Yes we can! Find out which dimensions are important filters and use them while indexing documents
  • 11. Searching ANALYTICS 11Berlin | 08/05/2013 | zanox | Zanox Reporting Systems 2012-09-03 2012-09-03 2012-09-03 2012-09-04 2012-09-04 2012-09-04 Date Advertiser Publisher – Commission | Amount HUK Jim – 23.56 | 345.06 € DEVK Tom – 9.24 | 81.35 € Bob – 43.81 | 512.19 € HUK Bill – 7.22 | 83.76 € HUK Al – 43.07 | 892.12 € Mike – 29.98 | 98.17 € DEVK Nick – 12.03 | 108.19 € DEVK Keyword Auto insurance company Health insurance Cheap auto insurance Auto insurance company Health insurance Cheap auto insurance Phil – 9.72 | 83.21 € Q FILTER GROUP BY Jim PUBLISHER HUK ADVERTISER HUK ADVERTISER OptimizationSharding R VALUES 103,83 € COMMISSION 1419,11 € AMOUNT
  • 12. Searching ANALYTICS 12Berlin | 08/05/2013 | zanox | Zanox Reporting Systems â—Ź There are 9 dimensions in total, where several â—Ź Have huge cardinality (>1 million different values) and â—Ź At the same time are rarely used as filter, meaning â—Ź They are excellent candidates for being put together with measures. â—Ź There are more than 50 measures that â—Ź Are freely combined in user-defined formulas, but at the same time â—Ź Form only few groups based on usage patterns, meaning â—Ź They contribute to better compression because fields become bigger. â—Ź Ordering in which documents are inserted in index is driven by â—Ź Dimensions that are known to be the most important filters, meaning â—Ź Faster loading as needed documents are closer to each other. OptimizationSharding
  • 13. OptimizationSearching Sharding Sharding based on hashing ANALYTICS 13Berlin | 08/05/2013 | zanox | Zanox Reporting Systems SEARCHING AGGREGATION LOADING SEARCHING AGGREGATION LOADING SEARCHING AGGREGATION LOADING SEARCHING AGGREGATION LOADING SEARCHING AGGREGATION LOADING Distributed Reporting M S S S S S S S S S S S S M S S S S S S S S S S S S O M S S S S S S S S S S S S O O R VALUES 16,55 € COMMISSION 523,21 € AMOUNT R VALUES 103,83 € COMMISSION 1419,11 € AMOUNT R VALUES 78,72 € COMMISSION 879,34 € AMOUNT R VALUES 134,23 € COMMISSION 1622,56 € AMOUNT R VALUES 336,16 € COMMISSION 4464,09 € AMOUNT R R R R Q FILTER GROUP BY Jim PUBLISHER HUK ADVERTISER HUK ADVERTISER QQQQQ R VALUES 2,83 € COMMISSION 19,87 € AMOUNT RSharding based on hashing Query is sent to each and every searcher Searchers work hard to produce reports Huge amount of data sent over network Goals for alternative sharding Think very well while distributing data Network is your friend and precious resource Produce same report with less searchers
  • 14. Sharding based on selected filters OptimizationShardingSearching ANALYTICS 14Berlin | 08/05/2013 | zanox | Zanox Reporting Systems SEARCHING AGGREGATION LOADING SEARCHING AGGREGATION LOADING M S S S S S R VALUES 336,16 € COMMISSION 4464,09 € AMOUNT Q FILTER GROUP BY Jim PUBLISHER HUK ADVERTISER HUK ADVERTISER QQ VALUES 133,33 € COMMISSION 2454,22 € AMOUNT RRR VALUES 202,83 € COMMISSION 2009,87 € AMOUNT R
  • 15. OptimizationShardingSearching ANALYTICS 15Berlin | 08/05/2013 | zanox | Zanox Reporting Systems â—Ź Sharding based on hashing randomly distributes data â—Ź But that is not needed because no relevance will be computed. â—Ź Queries have to be send to each and every searcher, being suboptimal because much more data have to be transmitted over the network. â—Ź The slowest searcher determine final response time due to completeness. â—Ź How we partition based on filter fields? â—Ź Order dimensions based on their usage frequency as filter in particular group. â—Ź Greater usage frequency means more important for sharding. â—Ź Examples of dimensions that are found to be excellent for sharding are: time related dimensions and advertiser name dimension. â—Ź Merger uses knowledge about which data each shard has while routing. â—Ź Final result is more stable system and much better QPS.
  • 16. ANALYTICS 16Berlin | 08/05/2013 | zanox | Zanox Reporting Systems 3rdGroup2ndGroup1stGroup Publisher Advertiser Keyword Publisher Keyword Publisher Advertiser Distributed Reporting M S S S S S S S S S S S S M S S S S S S S S S S S S O M S S S S S S S S S S S S O O OptimizationShardingSearching
  • 17. We have created separate searcher group for every possible combination of dimensions. Therefore we are well prepared for each query that can hit us. But we are attacked where we haven’t thought that it is possible. One new field have been requested and that has spoiled everything 17Berlin | 07/02/2012 | zanox | Presentation title 3rdGroup2ndGroup1stGroup Publisher Advertiser Keyword Publisher Keyword Publisher Advertiser M 4thGroup Publisher 5thGroup Advertiser 6thGroup Advertiser Keyword 7thGroup Keyword Q FILTER GROUP BY Jim PUBLISHER Keyword HUK ADVERTISER Insurance KEYWORD Website 8thGroup Advertiser Website Keyword 9thGroup Website Advertiser Publisher 10thGroup Website 11thGroup Website Keyword 12thGroup Website Advertiser 13thGroup Website Publisher Publisher There are so many groups but no one has fields Website Publisher Keyword Creating a separate searcher group for each and every possible combination of dimensions in each query is not feasible because number of searcher-groups grows extremely fast (exponentially) Number of dimensions Number of searcher groups 3 23 – 1 = 7 4 24 – 1 = 15 … … 9 29 – 1 = 511 Solution is to analyze already processed queries to discard low performing groups and to create new ones for emerging queries ANALYTICS OptimizationShardingSearching
  • 18. OptimizationShardingSearching 18Berlin | 07/02/2012 | zanox | Presentation title 3rdGroup2ndGroup1stGroup Publisher Advertiser Keyword Publisher Keyword Publisher Advertiser 4thGroup Publisher 5thGroup Advertiser 6thGroup Advertiser Keyword 7thGroup Keyword 8thGroup Advertiser Website Keyword 9thGroup Website Advertiser Publisher 10thGroup Website 11thGroup Website Keyword 12thGroup Website Advertiser 13thGroup Website Publisher queries 451236 response 45.54 ms queries 9 response 15.54 ms queries 221128 response 82.87 ms queries 2876136 response 23.54 ms queries 236 response 92.33 ms queries 2 response 147.82 ms queries 228732 response 61.03 ms queries 89 response 215.04 ms queries 15 response 105.02 ms queries 5644 response 29.76 ms queries 4 response 563.04 ms queries 32 response 87.03 ms queries 543765 response 35.87 ms 14thGroup Website Publisher Keyword M Q FILTER GROUP BY Jim PUBLISHER Keyword HUK ADVERTISER Insurance KEYWORD Website Publisher There are so many groups but no one has fields Website Publisher Keyword ANALYTICS
  • 19. 19Berlin | 07/02/2012 | zanox | Presentation title 3rdGroup9thGroup1stGroup Publisher Advertiser Keyword Website Publisher Publisher Advertiser 4thGroup Publisher 11thGroup Website Keyword queries 20236 response 45.54 ms Advertiser queries 91234 response 15.54 ms queries 221128 response 82.87 ms queries 286136 response 23.54 ms queries 22336 response 92.33 ms 14thGroup Website Publisher Keyword queries 32546 response 727.23 ms M Q FILTER GROUP BY Jim PUBLISHER HUK ADVERTISER Insurance KEYWORD Website Publisher 15thGroup Keyword Publisher Fields Website Publisher Keyword queries 9874 response 129.14 ms Publisher Keyword queries 22678 response 989.28 msFields = We optimized system by adding specialized group that is excellent for found slow queries There is no group that is optimized for queries having fields Publisher Keyword ANALYTICS OptimizationShardingSearching
  • 20. ANALYTICS 20Berlin | 08/05/2013 | zanox | Zanox Reporting Systems â—Ź Hadoop makes possible to easily build new indexes: â—Ź Optimize utilization of Hadoop resources because â—Ź Better indexes can be always build by low priority jobs. â—Ź Built OLAP continuously adapts itself to queries and â—Ź There are no more expensive queries in long run. â—Ź How can we decide which group is best of query at hand? â—Ź That’s query routing, which we solved either by â—Ź Asking searchers only to return number of hits and then to generate report if being selected, or simpler â—Ź By using heuristic based on dimension cardinality. â—Ź But query routing is not a topic of this talk. OptimizationShardingSearching
  • 21. 1. ABOUT ZANOX AND ME 2. REPORTING IN GENERAL 3. QUERY ROUTING 4. QUESTIONS 21Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  • 22. 22Berlin | 08/05/2013 | zanox | Zanox Reporting Systems