OCTOBER 13-16, 2016 • AUSTIN, TX
Streaming Aggregation, New Horizons for Search
Erick Erickson
Workplace Partners, LLC.
Who am I?
• Erick Erickson
• Lucene/Solr committer
• PMC member
• Independent Consultant (Workplace
Partners, LLC)
• Not the Red State Guy
• XKCD fan
My favorite XKCD cartoon
http://xkcd.com/722/
Agenda
• High-level introduction to why you should care
about Streaming Aggregation (SA hereafter)
• High-level view of Parallel SQL processing built
on SA
• High-level view of Streaming Expressions
• Samples from a mortgage database
• Joel Bernstein will do a deep-dive right after this
presentation
• Assuming you are familiar with Solr concepts
Why SA?
• Solr has always had “issues” when
dealing with very large result sets
• Data returned had to be read from disk
and decompressed
• “Deep paging” paid this price too
• Entire result set returned at once == lots
of memory
Quick Overview of SA
• Built on the “export” capabilities introduced in
Solr 4.10
• Exports “tuples” which must be populated from
docValues fields
• Only exports primitive types, e.g. numeric,
string etc.
• Work can be distributed in parallel to worker
nodes
• Can scale to limits of hardware, 10s of millions of
rows a second with ParallelStreams (we think)
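One way to picture how work is spread over worker nodes: each tuple is assigned to a worker by hashing a key field, so every tuple for a given key lands on the same worker and per-key aggregates stay correct. The sketch below is plain Java, not SolrJ. The class name, the `workerFor` helper and the use of `String.hashCode()` are assumptions made for illustration; Solr's own partitioning may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hash-partitions keys across a fixed number of workers. */
final class WorkerPartitioner {

    /** Returns the index (0..numWorkers-1) of the worker that owns this key. */
    static int workerFor(String key, int numWorkers) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numWorkers);
    }

    /** Splits the keys into one bucket per worker. */
    static List<List<String>> partition(List<String> keys, int numWorkers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(workerFor(key, numWorkers)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> agencies = Arrays.asList("FDIC", "HUD", "FRS", "FDIC", "HUD", "NCUA");
        List<List<String>> buckets = partition(agencies, 2);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("worker " + i + " -> " + buckets.get(i));
        }
    }
}
```

Because the assignment depends only on the key, two workers never see rows for the same agency, which is what lets each one aggregate independently.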
DocValues
• DocValues are basic to SA: they are the only fields
that can be specified in the “fl” list of a
Streaming Aggregation query
• Only Solr “primitive” types (int/tint, long/tlong,
string) are allowed in DocValues fields
• Defined per-field in schema.xml
• Specifically, cannot be solr.TextField-derived
• The Solr doc may contain any field types at all; the
DocValues restriction applies only to the fields that
may be exported in “tuples” for SA
We can do SQL in Solr!
select
agency_code, count(*), sum(loan_amount),
avg(loan_amount), min(loan_amount),
max(loan_amount), avg(applicant_income)
from hmda
where phonetic_name='(eric)'
group by agency_code
having (avg(applicant_income) > 50)
order by agency_code asc
And that’s not all!
• We can program arbitrary operations on complete
result sets
• We can parallelize processing across Solr nodes
• We can process very large result sets in limited
memory
• Design processing rate is 400K rows/node/second
Streaming Aggregation == glue
• Solr is built for returning the top N documents
•  Top N is usually small, e.g. 20 docs
•  Decompress to return fields (fl list)
•  Solr commonly deals with billions of documents
• Analytics:
•  Often memory intensive, especially in distributed
mode, if they can be done at all
•  Are becoming more important to this thing we call
“search”
•  Increasingly important in the era of “big data”
Use the Right Tool
• Three “modes”
• Streaming Aggregation to do arbitrary
operations on large result sets – SolrJ
• Streaming Expressions for a non-Java way to
access Streaming Aggregation – HTTP and SolrJ
• Parallel SQL to do selected SQL operations on
large result sets - SolrJ
• SA’s sweet spot: batch operations
• Complements Solr’s capabilities, applies to
different problems
Why not use an RDBMS?
•  Well, if it’s the best tool, you should
•  RDBMSs are not good search engines though
•  Find the average mortgage value for all
users with a name that sounds like “erick”
•  erik, erich, eric, aerick, erick, arik
•  Critical point: The “tuples” processed can be
those that satisfy any arbitrary Solr query
Why not use Spark?
•  Well, if it’s the best tool, you should
•  I’m still trying to understand when one is
preferable to the other
•  SA only needs Solr, no other infrastructure
Why not just use Solr?
• Well, if it’s the best tool, you should
• What I’d do: exhaust Solr’s capabilities then apply
SA to those kinds of problems that OOB Solr isn’t
satisfactory for, especially those that require
processing very large result sets
How does SA work?
• Simple example of how to get a bunch of rows
back and “do something” with them from a Solr
collection
• You can process multiple streams from entirely
different collections if you choose!
• It’s usually a good idea to sort return sets
• Process all of one kind of thing then move on
• Could write the results to file, connector, etc.
Sample Data
• Data set of approx 200M mortgages. Selected
fields:
• Year
• Loan amount (thousands)
• Agency (FDIC, FRS, HUD)
• Reason for loan
• Reason for denial
• No personal data, I added randomly generated
names to illustrate search
Use SA through SolrJ
•  The basic pattern is:
•  Create a Solr query
•  Feed it to the appropriate stream
•  Process the “tuples”
•  Right, what’s a “tuple”? A wrapper for a map:
•  keys are the Solr field names
•  values the contents of those fields: must be docValues
•  Why this restriction? Because getting stored fields is
expensive
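To make the “wrapper for a map” idea concrete, here is a minimal stand-in written in plain Java. It is not the SolrJ `Tuple` class: `MiniTuple` and its constructor are hypothetical, and only the `getString`/`getLong` accessor names are borrowed from the code on the following slides.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a tuple: a thin wrapper around a field-name -> value map. */
final class MiniTuple {
    private final Map<String, Object> fields;

    MiniTuple(Map<String, Object> fields) {
        // Defensive copy so later changes to the caller's map do not leak in.
        this.fields = new HashMap<>(fields);
    }

    /** Returns the field as a String, or null when the field is absent. */
    String getString(String name) {
        Object value = fields.get(name);
        return value == null ? null : value.toString();
    }

    /** Returns the field as a long; numeric values of any width are accepted. */
    long getLong(String name) {
        Object value = fields.get(name);
        if (value instanceof Number) {
            return ((Number) value).longValue();
        }
        throw new IllegalArgumentException("Field is missing or not numeric: " + name);
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("agency_code", "FDIC");
        fields.put("loan_amount", 237);
        MiniTuple tuple = new MiniTuple(fields);
        System.out.println(tuple.getString("agency_code") + " " + tuple.getLong("loan_amount"));
    }
}
```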
Code example
• Here’s a bit of code that
• Accesses a 2-shard SolrCloud collection
• Computes the average mortgage by “agency”,
e.g. HUD, OTS, OCC, OFS, FDIC, NCUA
• For a 217M dataset, 335K results (untuned) took
2.1 seconds
Code example
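import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

String zkHost = "169.254.80.84:2181";         // ZooKeeper address for the SolrCloud cluster
Map<String, String> params = new HashMap<>();
params.put("q", "phonetic_name:eric");
params.put("fl", "loan_amount,agency_code");  // docValues fields only
params.put("sort", "agency_code asc");        // sorted so each agency arrives as one run
params.put("qt", "/export");                  // stream the whole result set
….
CloudSolrStream stream = new CloudSolrStream(zkHost, "hmda", params);
stream.open();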
More code
while (true) {
  Tuple tuple = stream.read();
  if (tuple.EOF) {
    break;  // the EOF tuple only marks the end of the stream
  }
  // per-tuple processing goes here (next slide)
}
Last Code
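// thisAgency, sum and count are declared before the read loop
String newAgency = tuple.getString("agency_code");
long loanAmount = tuple.getLong("loan_amount");
if (!newAgency.equals(thisAgency)) {
  // Agency changed: report the finished group, then reset the counters
  if (thisAgency != null) {
    log(thisAgency + " average: " + ((double) sum / count));
  }
  thisAgency = newAgency;
  sum = 0;
  count = 0;
}
// Add this row to the current agency's counters
sum += loanAmount;
count++;
// After the loop ends, log the average for the last agency as well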
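The three code slides above are fragments, so here is the same roll-up as one self-contained sketch that runs without a Solr cluster. It assumes the rows arrive sorted by agency code, as the `sort` parameter requests. The `Row` class stands in for a tuple and `averageByAgency` is a hypothetical helper name, not part of SolrJ.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: averages loan amounts per agency from rows already sorted by agency. */
final class AgencyAverages {

    /** One exported row; stands in for a tuple carrying agency_code and loan_amount. */
    static final class Row {
        final String agencyCode;
        final long loanAmount;

        Row(String agencyCode, long loanAmount) {
            this.agencyCode = agencyCode;
            this.loanAmount = loanAmount;
        }
    }

    /**
     * Walks the rows once, keeping counters for the current agency only.
     * Correct only when the input is sorted by agency code.
     */
    static Map<String, Double> averageByAgency(List<Row> sortedRows) {
        Map<String, Double> averages = new LinkedHashMap<>();
        String current = null;
        long sum = 0;
        long count = 0;
        for (Row row : sortedRows) {
            if (current != null && !current.equals(row.agencyCode)) {
                // Agency changed: emit the finished group and reset the counters.
                averages.put(current, (double) sum / count);
                sum = 0;
                count = 0;
            }
            current = row.agencyCode;
            sum += row.loanAmount;
            count++;
        }
        if (current != null) {
            averages.put(current, (double) sum / count);  // flush the last group
        }
        return averages;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("FDIC", 100), new Row("FDIC", 200), new Row("HUD", 50));
        averageByAgency(rows).forEach(
            (agency, avg) -> System.out.println(agency + " average loan: " + avg));
    }
}
```

The sort is what keeps the working state small: only one agency's counters are live at a time, however many rows stream past.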
More interestingly
•  Using SA, you can:
•  Join across completely different collections
•  Manipulate data in arbitrary ways to suit your use-case
•  Distribute this load across the Solr nodes in a
collection
•  Unlike standard search, SA can use cycles on all the
replicas of a shard
•  Process zillions of buckets without blowing up
memory
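As an illustration of why joining across collections is cheap once both streams are sorted on the join key, here is a small merge-join sketch in plain Java. It is not Solr's implementation: the class name, the `String[]` row layout and the assumption that the right-hand side has at most one row per key are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: inner-joins two lists that are both sorted by their first column. */
final class SortedMergeJoin {

    /**
     * loans:    rows of [agency_code, loan_amount], many per agency, sorted by agency_code.
     * agencies: rows of [agency_code, agency_name], one per agency, sorted by agency_code.
     */
    static List<String> join(List<String[]> loans, List<String[]> agencies) {
        List<String> joined = new ArrayList<>();
        int right = 0;
        for (String[] loan : loans) {
            // Advance the right side until its key is no longer behind the left key.
            while (right < agencies.size() && agencies.get(right)[0].compareTo(loan[0]) < 0) {
                right++;
            }
            if (right < agencies.size() && agencies.get(right)[0].equals(loan[0])) {
                joined.add(loan[0] + "," + loan[1] + "," + agencies.get(right)[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> loans = Arrays.asList(
            new String[] {"FDIC", "100"}, new String[] {"FDIC", "200"},
            new String[] {"HUD", "50"}, new String[] {"OCC", "75"});
        List<String[]> agencies = Arrays.asList(
            new String[] {"FDIC", "Federal Deposit Insurance Corporation"},
            new String[] {"FRS", "Federal Reserve System"},
            new String[] {"HUD", "Department of Housing and Urban Development"});
        join(loans, agencies).forEach(System.out::println);
    }
}
```

Each side is read once, front to back, so neither stream has to be held in memory in full.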
Parallel SQL
• Use from SolrJ
• The work can be distributed across multiple
“worker” nodes
• Operations can be combined into complex
statements
• Let’s do our previous example with ParallelSQL
• Currently trunk/6.0 only, due to the Java 8
requirement for the SQL parser. No plan to put it in 5.x
Parallel SQL
•  SQL “select” is mapped to Solr Search
•  Order by, Group by and Having are all supported
•  Certain aggregations are supported
•  count, sum, avg, min, max
•  You can get crazy here:
•  having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
•  Following query with numWorkers=2, 612K rows
•  383ms
Sample SQL
select
agency_code, count(*), sum(loan_amount),
avg(loan_amount), min(loan_amount),
max(loan_amount)
from hmda
where phonetic_name='(erich)'
group by agency_code
order by agency_code asc
Sample SQL
select
agency_code, count(*), sum(loan_amount),
avg(loan_amount), min(loan_amount),
max(loan_amount)
from hmda                       -- collection name
where phonetic_name='(eric)'    -- Solr search
group by agency_code            -- Solr field
order by agency_code asc        -- Solr field
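For readers who want to see what the sample query computes, the sketch below reproduces its group by / order by semantics, plus a `having`-style filter, over an in-memory list in plain Java. It says nothing about how Solr executes the SQL. The `Loan` class and both helper names are assumptions for illustration; `java.util.LongSummaryStatistics` happens to track exactly count, sum, avg, min and max.

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: group-by aggregation equivalent to the sample SQL, done in memory. */
final class GroupByAgency {

    /** Stand-in for one matching document. */
    static final class Loan {
        final String agencyCode;
        final long loanAmount;

        Loan(String agencyCode, long loanAmount) {
            this.agencyCode = agencyCode;
            this.loanAmount = loanAmount;
        }
    }

    /** group by agency_code, order by agency_code asc (TreeMap keeps keys sorted). */
    static Map<String, LongSummaryStatistics> groupByAgency(List<Loan> loans) {
        Map<String, LongSummaryStatistics> groups = new TreeMap<>();
        for (Loan loan : loans) {
            groups.computeIfAbsent(loan.agencyCode, k -> new LongSummaryStatistics())
                  .accept(loan.loanAmount);
        }
        return groups;
    }

    /** having avg(loan_amount) > threshold. */
    static Map<String, LongSummaryStatistics> havingAverageAbove(
            Map<String, LongSummaryStatistics> groups, double threshold) {
        Map<String, LongSummaryStatistics> kept = new TreeMap<>();
        for (Map.Entry<String, LongSummaryStatistics> entry : groups.entrySet()) {
            if (entry.getValue().getAverage() > threshold) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Loan> loans = Arrays.asList(
            new Loan("FRS", 1), new Loan("FDIC", 5), new Loan("FDIC", 972),
            new Loan("FRS", 3000), new Loan("FDIC", 100));
        groupByAgency(loans).forEach((agency, stats) -> System.out.println(
            agency + " count=" + stats.getCount() + " sum=" + stats.getSum()
                + " avg=" + stats.getAverage() + " min=" + stats.getMin()
                + " max=" + stats.getMax()));
    }
}
```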
Parallel SQL in SolrJ
Map<String, String> params = new HashMap<>();
params.put(CommonParams.QT, "/sql");
params.put("numWorkers", "2");
params.put("sql",
    "select agency_code, count(*), sum(loan_amount), avg(loan_amount), " +
    "min(loan_amount), max(loan_amount), avg(applicant_income) " +
    "from hmda where phonetic_name='eric' " +
    "group by agency_code " +
    "having (avg(applicant_income) > 50) " +
    "order by agency_code asc");
SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params);
Parallel SQL in SolrJ
SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params);
try {
  stream.open();
  while (true) {
    Tuple tuple = stream.read();
    if (tuple.EOF) {
      break;  // the EOF tuple only marks the end of the stream
    }
    dumpTuple(tuple);
    log("");
  }
} finally {
  if (stream != null) stream.close();
}
Sample tuples returned
agency_code=FDIC
max(loan_amount)=972.0
sum(loan_amount)=53307.0
count(*)=224.0
avg(loan_amount)=237.97767857142858
min(loan_amount)=5.0
Sample tuples returned
agency_code=FRS
max(loan_amount)=3000.0
sum(loan_amount)=179702.0
count(*)=834.0
avg(loan_amount)=215.47002398081534
min(loan_amount)=1.0
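The aggregates in each sample tuple are internally consistent: avg(loan_amount) is sum(loan_amount) divided by count(*). A short check in plain Java, using only the numbers shown above (the class and method names are hypothetical):

```java
/** Illustrative only: re-derives the averages in the sample tuples from sum and count. */
final class TupleArithmetic {

    /** avg = sum / count, computed in floating point as the tuple values are. */
    static double average(double sum, double count) {
        return sum / count;
    }

    public static void main(String[] args) {
        System.out.println("FDIC avg: " + average(53307.0, 224.0));
        System.out.println("FRS avg:  " + average(179702.0, 834.0));
    }
}
```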
Current Gotchas
• All fields must be lower case (possibly with
underscores)
• Trunk (6.0) only; a 5.x backport is not
planned. (Calcite)
• Requires solrconfig entries
• Only nodes hosting collections can act as worker
nodes (But not necessarily the queried collection)
• Be prepared to dig; documentation is still
evolving
Streaming expressions
• Provide a simple query language for SolrCloud
that merges search with parallel computing
without Java programming
• Operations can be nested
Streaming Expressions
• Can access at least two ways:
• HTTP
• SolrJ
Streaming Expressions
• Operations:
• search
• merge – can be used with separate collections
• group
• unique
• top
• parallel
Example Code
curl --data-urlencode \
  'stream=group(
      search(hmda,q="*:*",
             fl="id,agency_code",
             sort="agency_code asc"),
      by="agency_code asc")' \
  http://169.254.80.84:8981/solr/hmda/stream
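The curl call sends the expression as a URL-encoded `stream` form parameter. The sketch below shows roughly what that encoding step looks like when done from Java with the standard library. It only builds the request body and sends nothing; `StreamExpressionBody` and `formBody` are hypothetical names. Note that Java's encoder writes spaces as `+`, which form decoding treats as a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Illustrative only: builds the form body for a streaming-expression request. */
final class StreamExpressionBody {

    /** Returns "stream=" followed by the URL-encoded expression. */
    static String formBody(String expression) {
        try {
            return "stream=" + URLEncoder.encode(expression, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String expression = "group(search(hmda,q=\"*:*\",fl=\"id,agency_code\","
            + "sort=\"agency_code asc\"),by=\"agency_code asc\")";
        System.out.println(formBody(expression));
    }
}
```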
Response
{"result-set":{"docs":[
  {"agency_code":"FDIC","_MAPS_":[
    {"agency_code":"FDIC","id":"2004_CD1.CSV_3955"}
    …]},
  {"agency_code":"NCUA","_MAPS_":[
    {"agency_code":"NCUA","id":"2004_CD1.CSV_2816"}
    …]},
  {"EOF":true,"RESPONSE_TIME":4}]}}
Future Enhancements
• This capability is quite new (Solr 5.2), with
significant enhancements in every release
• Some is still “baking” in trunk/6.0
• A JDBC Driver so any Java application can treat
Solr like a SQL database, e.g. for visualization
• More user-friendly interfaces (widgets?)
• More docs, how-tos, etc.
• “Select Into”
No time for (some)
•  Oh My. Subclasses of TupleStream:
•  MetricStream
•  RollupStream (for high cardinality faceting)
•  UniqueStream
•  FilterStream (Set operations)
•  MergeStream
•  ReducerStream
•  SolrStream for non-SolrCloud
No time for (cont)
• Parallel execution details
• Distributing SA across “Worker nodes”
• All of the Parallel SQL composition
possibilities
• All of the Streaming Expression
operations
Resources
• Ref guide for streaming expressions:
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
• Solr user’s list:
http://lucene.apache.org/solr/resources.html
• Joel Bernstein’s blogs:
http://joelsolr.blogspot.com/2015/04/in-line-streaming-aggregation.html
• Parallel SQL Solr JIRA:
https://issues.apache.org/jira/browse/SOLR-7560
Resources (cont)
• Streaming expressions JIRA:
https://issues.apache.org/jira/browse/SOLR-7377
• Background for SA:
http://heliosearch.org/streaming-aggregation-for-solrcloud/
• Background for Parallel SQL:
http://heliosearch.org/heliosearch-sql-sub-project/
• Getting the code, compiling, etc.:
https://wiki.apache.org/solr/HowToContribute
Thanks!

More Related Content

What's hot

MongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB
 
Building BI Publisher Reports using Templates
Building BI Publisher Reports using TemplatesBuilding BI Publisher Reports using Templates
Building BI Publisher Reports using Templatesp6academy
 
MongoDB Performance Tuning
MongoDB Performance TuningMongoDB Performance Tuning
MongoDB Performance TuningPuneet Behl
 
Neo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j GraphDay Seattle- Sept19- neo4j basic trainingNeo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j GraphDay Seattle- Sept19- neo4j basic trainingNeo4j
 
Migrating from RDBMS to MongoDB
Migrating from RDBMS to MongoDBMigrating from RDBMS to MongoDB
Migrating from RDBMS to MongoDBMongoDB
 
Oracle Forms : Timers
Oracle Forms : TimersOracle Forms : Timers
Oracle Forms : TimersSekhar Byna
 
Json in Postgres - the Roadmap
 Json in Postgres - the Roadmap Json in Postgres - the Roadmap
Json in Postgres - the RoadmapEDB
 
JPA and Hibernate Performance Tips
JPA and Hibernate Performance TipsJPA and Hibernate Performance Tips
JPA and Hibernate Performance TipsVlad Mihalcea
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfEric Xiao
 
Connection Resiliency and Command Interception in Entity Framework
Connection Resiliency and Command Interception in Entity FrameworkConnection Resiliency and Command Interception in Entity Framework
Connection Resiliency and Command Interception in Entity FrameworkMuhammad Umar
 
An Introduction to Celery
An Introduction to CeleryAn Introduction to Celery
An Introduction to CeleryIdan Gazit
 
Oracle Fusion HCM Payroll Process Flow.pdf
Oracle Fusion HCM Payroll Process Flow.pdfOracle Fusion HCM Payroll Process Flow.pdf
Oracle Fusion HCM Payroll Process Flow.pdfFeras Ahmad
 
Getting out of Callback Hell in PHP
Getting out of Callback Hell in PHPGetting out of Callback Hell in PHP
Getting out of Callback Hell in PHPArul Kumaran
 
MongoDB for Time Series Data: Schema Design
MongoDB for Time Series Data: Schema DesignMongoDB for Time Series Data: Schema Design
MongoDB for Time Series Data: Schema DesignMongoDB
 
How Shopify Scales Rails
How Shopify Scales RailsHow Shopify Scales Rails
How Shopify Scales Railsjduff
 

What's hot (20)

MongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB Memory Management Demystified
MongoDB Memory Management Demystified
 
Building BI Publisher Reports using Templates
Building BI Publisher Reports using TemplatesBuilding BI Publisher Reports using Templates
Building BI Publisher Reports using Templates
 
MongoDB Performance Tuning
MongoDB Performance TuningMongoDB Performance Tuning
MongoDB Performance Tuning
 
Neo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j GraphDay Seattle- Sept19- neo4j basic trainingNeo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j GraphDay Seattle- Sept19- neo4j basic training
 
Migrating from RDBMS to MongoDB
Migrating from RDBMS to MongoDBMigrating from RDBMS to MongoDB
Migrating from RDBMS to MongoDB
 
Oracle Forms : Timers
Oracle Forms : TimersOracle Forms : Timers
Oracle Forms : Timers
 
Json in Postgres - the Roadmap
 Json in Postgres - the Roadmap Json in Postgres - the Roadmap
Json in Postgres - the Roadmap
 
JPA and Hibernate Performance Tips
JPA and Hibernate Performance TipsJPA and Hibernate Performance Tips
JPA and Hibernate Performance Tips
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
 
Connection Resiliency and Command Interception in Entity Framework
Connection Resiliency and Command Interception in Entity FrameworkConnection Resiliency and Command Interception in Entity Framework
Connection Resiliency and Command Interception in Entity Framework
 
MYSQL
MYSQLMYSQL
MYSQL
 
Sql server T-sql basics ppt-3
Sql server T-sql basics  ppt-3Sql server T-sql basics  ppt-3
Sql server T-sql basics ppt-3
 
An Introduction to Celery
An Introduction to CeleryAn Introduction to Celery
An Introduction to Celery
 
SQL
SQLSQL
SQL
 
Oracle Fusion HCM Payroll Process Flow.pdf
Oracle Fusion HCM Payroll Process Flow.pdfOracle Fusion HCM Payroll Process Flow.pdf
Oracle Fusion HCM Payroll Process Flow.pdf
 
Graphdatabases
GraphdatabasesGraphdatabases
Graphdatabases
 
Getting out of Callback Hell in PHP
Getting out of Callback Hell in PHPGetting out of Callback Hell in PHP
Getting out of Callback Hell in PHP
 
MongoDB for Time Series Data: Schema Design
MongoDB for Time Series Data: Schema DesignMongoDB for Time Series Data: Schema Design
MongoDB for Time Series Data: Schema Design
 
Spark etl
Spark etlSpark etl
Spark etl
 
How Shopify Scales Rails
How Shopify Scales RailsHow Shopify Scales Rails
How Shopify Scales Rails
 

Similar to Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick Erickson

What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB.
What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB.What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB.
What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB.Jim Czuprynski
 
Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Lucidworks
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformationLars Marius Garshol
 
IOOF IT System Modernisation
IOOF IT System ModernisationIOOF IT System Modernisation
IOOF IT System ModernisationMongoDB
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.GeeksLab Odessa
 
Webinar: Index Tuning and Evaluation
Webinar: Index Tuning and EvaluationWebinar: Index Tuning and Evaluation
Webinar: Index Tuning and EvaluationMongoDB
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature PreviewYonik Seeley
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
Solr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceSolr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceChitturi Kiran
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaLucidworks
 
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, ClouderaParallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, ClouderaLucidworks
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...
Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...
Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...confluent
 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdutionXuan-Chao Huang
 

Similar to Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick Erickson (20)

What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB.
What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB.What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB.
What's Your Super-Power? Mine is Machine Learning with Oracle Autonomous DB.
 
Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Webinar: What's New in Solr 6
Webinar: What's New in Solr 6
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformation
 
Master tuning
Master   tuningMaster   tuning
Master tuning
 
IOOF IT System Modernisation
IOOF IT System ModernisationIOOF IT System Modernisation
IOOF IT System Modernisation
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
 
Webinar: Index Tuning and Evaluation
Webinar: Index Tuning and EvaluationWebinar: Index Tuning and Evaluation
Webinar: Index Tuning and Evaluation
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature Preview
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
Solr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceSolr as a Spark SQL Datasource
Solr as a Spark SQL Datasource
 
OpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collectorOpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collector
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
 
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, ClouderaParallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Make Your Data Searchable With Solr in 25 Minutes
Make Your Data Searchable With Solr in 25 MinutesMake Your Data Searchable With Solr in 25 Minutes
Make Your Data Searchable With Solr in 25 Minutes
 
Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...
Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...
Closing the Loop in Extended Reality with Kafka Streams and Machine Learning ...
 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdution
 

More from Lucidworks

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceLucidworks
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesLucidworks
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchLucidworks
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondLucidworks
 

More from Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick Erickson

  • 1. OCTOBER 13-16, 2016 • AUSTIN, TX
  • 2. Streaming Aggregation, New Horizons for Search Erick Erickson Workplace Partners, LLC.
  • 3. Who am I? • Erick Erickson • Lucene/Solr committer • PMC member • Independent Consultant (Workplace Partners, LLC) • Not the Red State Guy • XKCD fan
  • 4. My favorite XKCD cartoon http://xkcd.com/722/
  • 5. Agenda • High-level introduction to why you should care about Streaming Aggregation (SA hereafter) • High-level view of Parallel SQL processing built on SA • High-level view of Streaming Expressions • Samples from a mortgage database • Joel Bernstein will do a deep-dive right after this presentation • Assuming you are familiar with Solr concepts
  • 6. Why SA? • Solr has always had “issues” when dealing with very large result sets • Data returned had to be read from disk and decompressed • “Deep paging” paid this price too • Entire result set returned at once == lots of memory
  • 7. Quick Overview of SA • Built on the “export” capabilities introduced in Solr 4.10 • Exports “tuples” which must be populated from docValues fields • Only exports primitive types, e.g. numeric, string etc. • Work can be distributed in parallel to worker nodes • Can scale to limits of hardware, 10s of millions of rows a second with ParallelStreams (we think)
  • 8. DocValues • DocValues are basic to SA; they are the only fields that can be specified in the “fl” list of a Streaming Aggregation query • Only Solr “primitive” types (int/tint, long/tlong, string) are allowed in DocValues fields • Defined per-field in schema.xml • Specifically, cannot be Solr.TextField-derived • The Solr doc may contain any field types at all; the DocValues restriction is only on the fields that may be exported in “tuples” for SA
  • 9. We can do SQL in Solr! select agency_code, count(*), sum(loan_amount), avg(loan_amount), min(loan_amount), max(loan_amount), avg(applicant_income) from hmda where phonetic_name='(eric)' group by agency_code having (avg(applicant_income) > 50) order by agency_code asc
  • 10. And that’s not all! • We can program arbitrary operations on complete result sets • We can parallelize processing across Solr nodes • We can process very large result sets in limited memory • Design processing rate is 400K rows/node/second
  • 11. Streaming Aggregation == glue • Solr is built for returning the top N documents • Top N is usually small, e.g. 20 docs • Decompress to return fields (fl list) • Solr commonly deals with billions of documents • Analytics: • Often memory intensive, especially in distributed mode, if they can be done at all • Are becoming more important to this thing we call “search” • Increasingly important in the era of “big data”
  • 12. Use the Right Tool • Three “modes” • Streaming Aggregation to do arbitrary operations on large result sets – SolrJ • Streaming Expressions for a non-Java way to access Streaming Aggregation – HTTP and SolrJ • Parallel SQL to do selected SQL operations on large result sets – SolrJ • SA’s sweet spot: batch operations • Complements Solr’s capabilities, applies to different problems
  • 13. Why not use an RDBMS? •  Well, if it’s the best tool, you should •  RDBMSs are not good search engines though •  Find the average mortgage value for all users with a name that sounds like “erick” •  erik, erich, eric, aerick, erick, arik •  Critical point: The “tuples” processed can be those that satisfy any arbitrary Solr query
  • 14. Why not use Spark? •  Well, if it’s the best tool, you should •  I’m still trying to understand when one is preferable to the other •  SA only needs Solr, no other infrastructure
  • 15. Why not just use Solr? • Well, if it’s the best tool, you should • What I’d do: exhaust Solr’s capabilities then apply SA to those kinds of problems that OOB Solr isn’t satisfactory for, especially those that require processing very large result sets
  • 16. How does SA work? • Simple example of how to get a bunch of rows back and “do something” with them from a Solr collection • You can process multiple streams from entirely different collections if you choose! • It’s usually a good idea to sort return sets • Process all of one kind of thing then move on • Could write the results to file, connector, etc.
  • 17. Sample Data • Data set of approx 200M mortgages. Selected fields: • Year • Loan amount (thousands) • Agency (FDIC, FRS, HUD) • Reason for loan • Reason for denial • No personal data, I added randomly generated names to illustrate search
  • 18. Use SA through SolrJ •  The basic pattern is: •  Create a Solr query •  Feed it to the appropriate stream •  Process the “tuples” •  Right, what’s a “tuple”? A wrapper for a map: •  keys are the Solr field names •  values the contents of those fields: must be docValues •  Why this restriction? Because getting stored fields is expensive
  • 19. Code example • Here’s a bit of code that • Accesses a 2-shard SolrCloud collection • Computes the average mortgage by “agency”, e.g. HUD, OTS, OCC, OFS, FDIC, NCUA • For a 217M dataset, 335K results (untuned) took 2.1 seconds
  • 20. Code example String zkHost = "169.254.80.84:2181"; Map params = new HashMap(); params.put("q", "phonetic_name:eric"); params.put("fl", "loan_amount,agency_code"); params.put("sort", "agency_code asc"); params.put("qt", "/export"); …. CloudSolrStream stream = new CloudSolrStream(zkHost, "hmda", params); stream.open();
  • 21. More code while (true) { Tuple tuple = stream.read(); if (tuple.EOF) { break; } // next slide in here }
  • 22. Last Code String thisAgency = tuple.getString("agency_code"); long loanAmount = tuple.getLong("loan_amount"); if (thisAgency.equals(currentAgency)) { add_to_current_counters } else { log(average for this agency); reset_for_next_agency }
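The loop on slides 21 and 22 can be sketched end to end without a running Solr instance. This is a minimal sketch, assuming each tuple is stood in for by a plain `Map<String, Object>` (not SolrJ's `Tuple`) and that rows arrive sorted by `agency_code`, as the `sort` parameter on slide 20 requests. The class name `AgencyAverages` and the helper `row` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AgencyAverages {

    // Computes the average loan_amount per agency_code in one pass.
    // Assumes rows are already sorted by agency_code, so only one
    // running sum and count are held in memory at a time.
    public static Map<String, Double> averageByAgency(Iterable<Map<String, Object>> rows) {
        Map<String, Double> averages = new LinkedHashMap<>();
        String currentAgency = null;
        long sum = 0;
        long count = 0;

        for (Map<String, Object> row : rows) {
            String thisAgency = (String) row.get("agency_code");
            long loanAmount = ((Number) row.get("loan_amount")).longValue();

            if (currentAgency != null && !currentAgency.equals(thisAgency)) {
                // Agency changed: emit the finished bucket and reset.
                averages.put(currentAgency, (double) sum / count);
                sum = 0;
                count = 0;
            }
            currentAgency = thisAgency;
            sum += loanAmount;
            count++;
        }
        // Emit the final bucket, which no agency change will flush.
        if (currentAgency != null) {
            averages.put(currentAgency, (double) sum / count);
        }
        return averages;
    }

    // Hypothetical helper that builds one stand-in tuple.
    public static Map<String, Object> row(String agency, long amount) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("agency_code", agency);
        m.put("loan_amount", amount);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("FDIC", 100));
        rows.add(row("FDIC", 200));
        rows.add(row("HUD", 50));
        System.out.println(averageByAgency(rows));
    }
}
```

Because the input is sorted, the sketch never holds more than one bucket's counters, which is the reason the slides recommend sorting result sets before processing them.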
  • 23. More interestingly •  Using SA, you can: •  Join across completely different collections •  Manipulate data in arbitrary ways to suit your use-case •  Distribute this load across the solr nodes in a collection •  Unlike standard search, SA can use cycles on all the replicas of a shard •  Process zillions of buckets without blowing up memory
  • 24. Parallel SQL • Use from SolrJ • The work can be distributed across multiple “worker” nodes • Operations can be combined into complex statements • Let’s do our previous example with Parallel SQL • Currently trunk/6.0 only, due to the Java 8 requirement for the SQL parser. No plan to put it in 5.x
  • 25. Parallel SQL • SQL “select” is mapped to Solr Search • Order by, Group by and Having are all supported • Certain aggregations are supported • count, sum, avg, min, max • You can get crazy here: • having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10)) • Following query with numWorkers=2, 612K rows • 383ms
  • 26. Sample SQL select agency_code, count(*), sum(loan_amount), avg(loan_amount), min(loan_amount), max(loan_amount) from hmda where phonetic_name='(erich)' group by agency_code order by agency_code asc
  • 27. Sample SQL select agency_code, count(*), sum(loan_amount), avg(loan_amount), min(loan_amount), max(loan_amount) from hmda <- collection name where phonetic_name='(eric)' group by agency_code order by agency_code asc
  • 28. Sample SQL select agency_code, count(*), sum(loan_amount), avg(loan_amount), min(loan_amount), max(loan_amount) from hmda where phonetic_name='(eric)' <- Solr search group by agency_code order by agency_code asc
  • 29. Sample SQL select agency_code, count(*), sum(loan_amount), avg(loan_amount), min(loan_amount), max(loan_amount) from hmda where phonetic_name='(eric)' group by agency_code <- Solr field order by agency_code asc <- Solr field
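To make the aggregate columns concrete, here is a minimal sketch of what `count(*)`, `sum`, `avg`, `min` and `max`, grouped and ordered by `agency_code`, compute. It is written in plain Java over in-memory rows and illustrates the semantics only; it is not how Solr executes the query. The `Loan` class and `GroupedLoanStats` name are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupedLoanStats {

    // Hypothetical stand-in for one matching document.
    public static final class Loan {
        final String agencyCode;
        final long loanAmount;

        public Loan(String agencyCode, long loanAmount) {
            this.agencyCode = agencyCode;
            this.loanAmount = loanAmount;
        }
    }

    // group by agency_code, order by agency_code asc (TreeMap keeps keys sorted).
    public static Map<String, LongSummaryStatistics> statsByAgency(List<Loan> loans) {
        return loans.stream().collect(Collectors.groupingBy(
                loan -> loan.agencyCode,
                TreeMap::new,
                Collectors.summarizingLong(loan -> loan.loanAmount)));
    }

    public static void main(String[] args) {
        List<Loan> loans = Arrays.asList(
                new Loan("HUD", 120), new Loan("FDIC", 80), new Loan("HUD", 60));
        statsByAgency(loans).forEach((agency, s) -> System.out.println(
                agency + " count=" + s.getCount() + " sum=" + s.getSum()
                + " avg=" + s.getAverage() + " min=" + s.getMin() + " max=" + s.getMax()));
    }
}
```

Unlike the streaming loop, this sketch holds every group in memory at once, so it is only meant to show what each output column means.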
  • 30. Parallel SQL in SolrJ Map params = new HashMap(); params.put(CommonParams.QT, "/sql"); params.put("numWorkers", "2"); params.put("sql", "select agency_code, count(*), sum(loan_amount), avg(loan_amount), " + "min(loan_amount), max(loan_amount), avg(applicant_income) from hmda where phonetic_name='eric' " + "group by agency_code " + "having (avg(applicant_income) > 50) " + "order by agency_code asc"); SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params);
  • 31. Parallel Sql in SolrJ Map params = new HashMap(); params.put(CommonParams.QT, "/sql"); params.put("numWorkers", "2"); params.put("sql", "select agency_code, count(*), sum(loan_amount), avg(loan_amount), " + "min(loan_amount), max(loan_amount), avg(applicant_income) from hmda where phonetic_name='eric' " + "group by agency_code " + "having (avg(applicant_income) > 50) " + "order by agency_code asc");
  • 32. Parallel SQL in SolrJ SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params); try { stream.open(); while (true) { Tuple tuple = stream.read(); dumpTuple(tuple); log(""); if (tuple.EOF) { break; } } } finally { if (stream != null) stream.close(); }
  • 33. Parallel SQL in SolrJ SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params); try { stream.open(); while (true) { Tuple tuple = stream.read(); if (tuple.EOF) { break; } dumpTuple(tuple); } } finally { if (stream != null) stream.close(); }
  • 36. Current Gotchas • All fields must be lower case (possibly with underscores) • Trunk (6.0) only; a 5.x release is not planned (Calcite) • Requires solrconfig entries • Only nodes hosting collections can act as worker nodes (but not necessarily the queried collection) • Be prepared to dig; documentation is also evolving
  • 37. Streaming expressions • Provide a simple query language for SolrCloud that merges search with parallel computing without Java programming • Operations can be nested
  • 38. Streaming Expressions • Can access at least two ways: • HTTP • SolrJ
  • 39. Streaming Expressions • Operations: • search • merge – can be used with separate collections • group • unique • top • parallel
  • 40. Example Code curl --data-urlencode 'stream=group( search(hmda,q="*:*", fl="id,agency_code", sort="agency_code asc"), by="agency_code asc")' http://169.254.80.84:8981/solr/hmda/stream
  • 41. Example Code curl --data-urlencode 'stream=group( search(hmda,q="*:*", fl="id,agency_code", sort="agency_code asc"), by="agency_code asc")' http://169.254.80.84:8981/solr/hmda/stream
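For callers that assemble expressions in code rather than on the command line, the nesting shown in the curl example can be sketched as plain string composition. This is a minimal sketch under stated assumptions: the `StreamExpressionBuilder` class and its methods are hypothetical helpers, not part of SolrJ, and the sketch only builds and URL-encodes the `stream` form parameter without sending it. `URLEncoder` writes spaces as `+`, which is a valid form encoding but may differ byte for byte from what curl produces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class StreamExpressionBuilder {

    // Builds a search(...) expression over one collection.
    public static String search(String collection, String q, String fl, String sort) {
        return "search(" + collection
                + ",q=\"" + q + "\""
                + ",fl=\"" + fl + "\""
                + ",sort=\"" + sort + "\")";
    }

    // Wraps an inner expression in group(...), mirroring the slide's nesting.
    public static String group(String inner, String by) {
        return "group(" + inner + ",by=\"" + by + "\")";
    }

    // URL-encodes the expression as the "stream" form parameter.
    public static String formBody(String expression) {
        try {
            return "stream=" + URLEncoder.encode(expression, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String expr = group(
                search("hmda", "*:*", "id,agency_code", "agency_code asc"),
                "agency_code asc");
        System.out.println(expr);
        System.out.println(formBody(expr));
    }
}
```

Keeping each operation as a function that returns a string makes the nesting explicit: any expression can be passed as the inner argument of another.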
  • 44. Future Enhancements • This capability is quite new (Solr 5.2), with significant enhancements every release • Some is still “baking” in trunk/6.0 • A JDBC Driver so any Java application can treat Solr like a SQL database, e.g. for visualization • More user-friendly interfaces (widgets?) • More docs, how-tos, etc. • “Select Into”
  • 45. No time for (some) •  Oh My. Subclasses of TupleStream: •  MetricStream •  RollupStream (for high cardinality faceting) •  UniqueStream •  FilterStream (Set operations) •  MergeStream •  ReducerStream •  SolrStream for non-SolrCloud
  • 46. No time for (cont) • Parallel execution details • Distributing SA across “Worker nodes” • All of the Parallel SQL composition possibilities • All of the Streaming Expression operations
  • 47. Resources • Ref guide for streaming expressions: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions • Solr user’s list: http://lucene.apache.org/solr/resources.html • Joel Bernstein’s blogs: http://joelsolr.blogspot.com/2015/04/in-line-streaming-aggregation.html • Parallel SQL Solr JIRA: https://issues.apache.org/jira/browse/SOLR-7560
  • 48. Resources (cont) • Streaming expressions JIRA: https://issues.apache.org/jira/browse/SOLR-7377 • Background for SA: http://heliosearch.org/streaming-aggregation-for-solrcloud/ • Background for Parallel SQL: http://heliosearch.org/heliosearch-sql-sub-project/ • Getting the code, compiling, etc.: https://wiki.apache.org/solr/HowToContribute