3. Who am I?
• Erick Erickson
• Lucene/Solr committer
• PMC member
• Independent Consultant (Workplace
Partners, LLC)
• Not the Red State Guy
• XKCD fan
5. Agenda
• High-level introduction to why you should care
about Streaming Aggregation (SA hereafter)
• High-level view of Parallel SQL processing built
on SA
• High-level view of Streaming Expressions
• Samples from a mortgage database
• Joel Bernstein will do a deep-dive right after this
presentation
• Assuming you are familiar with Solr concepts
6. Why SA?
• Solr has always had “issues” when
dealing with very large result sets
• Data returned had to be read from disk and decompressed
• “Deep paging” paid this price too
• Entire result set returned at once == lots
of memory
7. Quick Overview of SA
• Built on the “export” capabilities introduced in
Solr 4.10
• Exports “tuples” which must be populated from
docValues fields
• Only exports primitive types, e.g. numeric,
string etc.
• Work can be distributed in parallel to worker
nodes
• Can scale to limits of hardware, 10s of millions of
rows a second with ParallelStreams (we think)
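• For example (host and port below are placeholders), the /export handler that SA is built on can be hit directly over HTTP:
http://localhost:8983/solr/hmda/export?q=*:*&sort=agency_code+asc&fl=agency_code,loan_amount
• It streams the entire sorted result set; both sort and fl are required and must name docValues fields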
8. DocValues
• DocValues are basic to SA; they are the only fields that can be specified in the “fl” list of a Streaming Aggregation query
• Only Solr “primitive” types (int/tint, long/tlong,
string) are allowed in DocValues fields
• Defined per-field in schema.xml
• Specifically, they cannot be solr.TextField-derived
• The Solr doc may contain any field types at all; the DocValues restriction only applies to the fields that may be exported in “tuples” for SA
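• For the sample data, schema.xml entries along these lines (exact attributes illustrative) make fields exportable:
<field name="agency_code" type="string" indexed="true" stored="true" docValues="true"/>
<field name="loan_amount" type="tlong" indexed="true" stored="true" docValues="true"/>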
9. We can do SQL in Solr!
select
agency_code, count(*), sum(loan_amount),
avg(loan_amount), min(loan_amount),
max(loan_amount), avg(applicant_income)
from hmda
where phonetic_name='(eric)'
having (avg(applicant_income) > 50)
group by agency_code
order by agency_code asc
10. And that’s not all!
• We can program arbitrary operations on complete
result sets
• We can parallelize processing across Solr nodes
• We can process very large result sets in limited
memory
• Design processing rate is 400K rows/node/
second
11. Streaming Aggregation == glue
• Solr is built for returning the top N documents
• Top N is usually small, e.g. 20 docs
• Decompress to return fields (fl list)
• Solr commonly deals with billions of documents
• Analytics:
• Often memory intensive, especially in distributed mode, if they can be done at all
• Are becoming more important to this thing we call
“search”
• Increasingly important in the era of “big data”
12. Use the Right Tool
• Three “modes”
• Streaming Aggregation to do arbitrary
operations on large result sets – SolrJ
• Streaming Expressions for a non-Java way to access Streaming Aggregations – HTTP and SolrJ
• Parallel SQL to do selected SQL operations on
large result sets - SolrJ
• SA’s sweet spot: batch operations
• Complements Solr’s capabilities, applies to
different problems
13. Why not use an RDBMS?
• Well, if it’s the best tool, you should
• RDBMSs are not good search engines though
• Find the average mortgage value for all
users with a name that sounds like “erick”
• erik, erich, eric, aerick, erick, arik
• Critical point: The “tuples” processed can be
those that satisfy any arbitrary Solr query
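• The “sounds like” matching above comes from a phonetic analysis chain; a fieldType roughly like this (illustrative, not necessarily the demo’s exact schema) does the trick:
<fieldType name="phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>
• phonetic_name can be a TextField because it only appears in the query, never in the exported “fl” list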
14. Why not use Spark?
• Well, if it’s the best tool, you should
• I’m still trying to understand when one is
preferable to the other
• SA only needs Solr, no other infrastructure
15. Why not just use Solr?
• Well, if it’s the best tool, you should
• What I’d do: exhaust Solr’s capabilities then apply
SA to those kinds of problems that OOB Solr isn’t
satisfactory for, especially those that require
processing very large result sets
16. How does SA work?
• Simple example of how to get a bunch of rows
back and “do something” with them from a Solr
collection
• You can process multiple streams from entirely
different collections if you choose!
• It’s usually a good idea to sort return sets
• Process all of one kind of thing then move on
• Could write the results to file, connector, etc.
17. Sample Data
• Data set of approx 200M mortgages. Selected
fields:
• Year
• Loan amount (thousands)
• Agency (FDIC, FRS, HUD)
• Reason for loan
• Reason for denial
• No personal data; I added randomly generated names to illustrate search
18. Use SA through SolrJ
• The basic pattern is:
• Create a Solr query
• Feed it to the appropriate stream
• Process the “tuples”
• Right, what’s a “tuple”? A wrapper for a map:
• keys are the Solr field names
• values are the contents of those fields (must be docValues)
• Why this restriction? Because getting stored fields is
expensive
19. Code example
• Here’s a bit of code that
• Accesses a 2-shard SolrCloud collection
• Computes the average mortgage by “agency”,
e.g. HUD, OTS, OCC, OFS, FDIC, NCUA
• For a 217M-document dataset, 335K results (untuned) took 2.1 seconds
20. Code example
String zkHost = "169.254.80.84:2181";
Map<String, String> params = new HashMap<>();
params.put("q", "phonetic_name:eric");
params.put("fl", "loan_amount,agency_code");
params.put("sort", "agency_code asc");
params.put("qt", "/export");
….
CloudSolrStream stream = new CloudSolrStream(zkHost, "hmda", params);
stream.open();
21. More code
while (true) {
  Tuple tuple = stream.read();
  if (tuple.EOF) {
    break;
  }
  // next slide in here
}
22. Last Code
// thisAgency, total and count are declared before the while loop
String newAgency = tuple.getString("agency_code");
long loanAmount = tuple.getLong("loan_amount");
if (newAgency.equals(thisAgency)) {
  // same agency bucket: accumulate the running counters
  total += loanAmount;
  count++;
} else {
  // sorted by agency_code, so the previous bucket is complete
  if (count > 0) {
    log(thisAgency + " average: " + (total / (double) count));
  }
  thisAgency = newAgency;  // reset for the next agency
  total = loanAmount;
  count = 1;
}
23. More interestingly
• Using SA, you can:
• Join across completely different collections
• Manipulate data in arbitrary ways to suit your use-case
• Distribute this load across the Solr nodes in a collection
• Unlike standard search, SA can use cycles on all the
replicas of a shard
• Process zillions of buckets without blowing up
memory
24. Parallel SQL
• Use from SolrJ
• The work can be distributed across multiple
“worker” nodes
• Operations can be combined into complex
statements
• Let’s do our previous example with ParallelSQL
• Currently trunk/6.0 only due to the Java 8 requirement for the SQL parser. No plan to put it in 5.x
25. Parallel SQL
• SQL “select” is mapped to Solr Search
• Order by, Group by and Having are all supported
• Certain aggregations are supported
• count, sum, avg, min, max
• You can get crazy here:
• having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
• The following query, with numWorkers=2 over 612K rows: 383ms
26. Sample SQL
select
agency_code, count(*), sum(loan_amount),
avg(loan_amount), min(loan_amount),
max(loan_amount)
from hmda
where phonetic_name='(erich)'
group by agency_code
order by agency_code asc
27. Sample SQL
select
agency_code, count(*), sum(loan_amount),
avg(loan_amount), min(loan_amount),
max(loan_amount)
from hmda <- collection name
where phonetic_name='(eric)'
group by agency_code
order by agency_code asc
28. Sample SQL
select
agency_code, count(*), sum(loan_amount),
avg(loan_amount), min(loan_amount),
max(loan_amount)
from hmda
where phonetic_name='(eric)' <- Solr search
group by agency_code
order by agency_code asc
29. Sample SQL
select
agency_code, count(*), sum(loan_amount),
avg(loan_amount), min(loan_amount),
max(loan_amount)
from hmda
where phonetic_name='(eric)'
group by agency_code <- Solr field
order by agency_code asc <- Solr field
30. Parallel SQL in SolrJ
Map<String, String> params = new HashMap<>();
params.put(CommonParams.QT, "/sql");
params.put("numWorkers", "2");
params.put("sql", "select agency_code, count(*), sum(loan_amount), avg(loan_amount), " +
  "min(loan_amount), max(loan_amount), avg(applicant_income) from hmda " +
  "where phonetic_name='eric' " +
  "group by agency_code " +
  "having (avg(applicant_income) > 50) " +
  "order by agency_code asc");
SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params);
32. Parallel SQL in SolrJ
SolrStream stream = new SolrStream("http://ericks-mac-pro:8981/solr/hmda", params);
try {
  stream.open();
  while (true) {
    Tuple tuple = stream.read();
    dumpTuple(tuple);
    log("");
    if (tuple.EOF) {
      break;
    }
  }
} finally {
  if (stream != null) stream.close();
}
36. Current Gotchas
• All fields must be lower case (possibly with underscores)
• Trunk (6.0) only; a backport to 5.x (5.4?) is not planned (Calcite)
• Requires solrconfig entries
• Only nodes hosting collections can act as worker nodes (but not necessarily the queried collection)
• Be prepared to dig, documentation is also
evolving
37. Streaming expressions
• Provide a simple query language for SolrCloud
that merges search with parallel computing
without Java programming
• Operations can be nested
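• As a flavor of the syntax (the expression below is illustrative, using the sample data's fields), the earlier per-agency rollup can be written as a nested expression and sent to the /stream handler:
rollup(
  search(hmda, q="phonetic_name:eric", fl="agency_code,loan_amount", sort="agency_code asc", qt="/export"),
  over="agency_code",
  count(*), sum(loan_amount), avg(loan_amount))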
44. Future Enhancements
• This capability is quite new (Solr 5.2), with significant enhancements every release
• Some is still “baking” in trunk/6.0
• A JDBC Driver so any Java application can treat
Solr like a SQL database, e.g. for visualization
• More user-friendly interfaces (widgets?)
• More docs, how to’s, etc.
• “Select Into”
45. No time for (some)
• Oh My. Subclasses of TupleStream:
• MetricStream
• RollupStream (for high cardinality faceting)
• UniqueStream
• FilterStream (Set operations)
• MergeStream
• ReducerStream
• SolrStream for non-SolrCloud
46. No time for (cont)
• Parallel execution details
• Distributing SA across “Worker nodes”
• All of the Parallel SQL composition
possibilities
• All of the Streaming Expression
operations