SlideShare a Scribd company logo
1 of 34
Download to read offline
Parallel SQL and Streaming Expressions in
Apache Solr 6
Shalin Shekhar Mangar
@shalinmangar
Lucidworks Inc.
Introduction
• Shalin Shekhar Mangar
• Lucene/Solr Committer
• PMC Member
• Senior Solr Consultant with Lucidworks Inc.
The standard
for enterprise
search.
of Fortune 500
uses Solr.
90%
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete,
highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations
and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
Solr Key Features
Why SQL
• Simple, well-known interface to data inside Solr
• Hides the complexity of Solr and its various features
• Possible to optimise the query plan according to best-practices
automatically
• Distributed Joins done simply and well
Solr 6: Parallel SQL
• Parallel execution of SQL across SolrCloud collections
• Compiled to SolrJ Streaming API (TupleStream) which is a general
purpose parallel computing framework for Solr
• Executed in parallel over SolrCloud worker nodes
• SolrCloud collections are relational ‘tables’
• JDBC thin client as a SolrJ client
Solr’s SQL Interface
SQL Interface at a glance
• SQL over Map/Reduce — for high cardinality aggregations and
distributed joins
• SQL over Facets — high performance, moderate cardinality
aggregations
• SQL with Solr powered search queries
• Fully integrated with SolrCloud
• SQL over JDBC or HTTP — http://host:port/solr/collection1/sql
Limited vs Unlimited SELECT
• select movie, director from IMDB
Returns the entire result set! Return fields must be DocValues
• select movie, directory from IMDB limit 100
Returns specified number of records. It can sort by score and
retrieve any stored field
• select movie, director from IMDB order by rating desc, num_voters
desc
Search predicates
• select movie, director from IMDB where actor = ‘bruce’
• select movie, director from IMDB where actor = ‘(bruce tom)’
• select movie, director from IMDB where rating = ‘[8 TO *]’
• select movie, director from IMDB where (actor = ‘(bruce tom)’ AND
rating = ‘[8 TO *]’)
Search predicates are Solr queries specified inside single-quotes
Can specify arbitrary boolean clauses
Select DISTINCT
• select distinct actor_name from IMDB
• Map/Reduce implementation — Tuples are shuffled to worker
nodes and operation is performed by workers
• JSON Facet implementation — operation is ‘pushed down’ to Solr
Stats aggregations
• select count(*), sum(num_voters) from IMDB
• Computed using Solr’s StatsComponent under the hood
• count, sum, avg, min, max are the supported aggregations
• Always pushed down into the search engine
GROUP BY Aggregations
• select actor_name, director, count(*), sum(num_voters) from IMDB
group by actor_name, director having count(*) > 5 and
sum(num_voters) > 1000 order by sum(num_voters) desc
• Has a map/reduce implementation (shuffle) and a JSON Facet
implementation (push down)
• Multi-dimensional, high cardinality aggregations are possible with
the map/reduce implementation
JDBC
• Part of SolrJ
• SolrCloud Aware Load Balancing
• Connection has ‘aggregationMode’ parameter that can switch
between map_reduce or facet
• jdbc:solr://SOLR_ZK_CONNECTION_STRING?
collection=COLLECTION_NAME&aggregationMode=facet
Inside Parallel SQL
Solr’s Parallel Computing Framework
• Streaming API
• Streaming Expressions
• Shuffling
• Worker collections
• Parallel SQL
Streaming API
• Java API for parallel computation
• Real-time Map/Reduce and Parallel Relational Algebra
• Search results are streams of tuples (TupleStream)
• Transformed in parallel by Decorator streams
• Transformations include group by, rollup, union, intersection,
complement, joins
• org.apache.solr.client.solrj.io.*
Streaming API
• Streaming Transformation
Operations that transform the underlying streams e.g. unique,
group by, rollup, union, intersection, complement, join etc
• Streaming Aggregation
Operations that gather metrics and compute aggregates e.g. sum,
count, average, min, max etc
Streaming Expressions
• String Query Language and Serialisation format for the Streaming
API
• Streaming expressions compile to TupleStream
• TupleStream serialise to Streaming Expressions
• Human friendly syntax for Streaming API accessible to non-Java
folks as well
• Can be used directly via HTTP to SolrJ
Streaming Expressions
Streaming Expressions
• Stream Sources
The origin of a TupleStream
search, jdbc, facet, stats, topic
• Stream Decorators
Wrap other stream functions and perform operations on the stream
complement, hashJoin, innerJoin, merge, intersect, top, unique
• Many streams can be paralleled across worker collections
Shuffling
• Shuffling is pushed down to Solr
• Sorting is done by /export handler which stream-sorts entire result sets
• Partitioning is done by HashQParserPlugin which is a filter that
partitions on arbitrary fields
• Tuples (search results) start streaming instantly to worker nodes never
requiring a spill to the disk.
• All replicas shuffle in parallel for the same query which allows for
massively parallel IO and huge throughputs.
Worker collections
• Regular SolrCloud collections
• Perform streaming aggregations using the Streaming API
• Receive shuffled streams from the replicas
• Over an HTTP endpoint: /stream
• May be empty or created just-in-time for specific analytical queries
or have data as any regular SolrCloud collection
• The goal is to separate processing from data if necessary
Parallel SQL
• The Presto parser compiles SQL to a TupleStream
• TupleStream is serialised to a Streaming Expression and sent over
the wire to worker nodes
• Worker nodes convert the Streaming Expression back into a
TupleStream
• Worker nodes open() and read() the TupleStream in parallel
What’s next
Graph traversals via
streaming expressions
• Shortest path
• Node walking/gathering
• Distributed Gremlin
implementation
Machine learning
models
• LogisticRegressionQuery
• LogitStream
• More to come
Take actions based on
AI driven alerts
• DaemonStreams
• AlertStream
• ModelStream
More, more, more!
• UpdateStream
• Publish-subscribe
• Calcite integration
• Better JDBC support
References
• Joel Bernstein’s Blog — http://joelsolr.blogspot.in/
• https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
• https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
• Parallel SQL by Joel Bernstein — https://www.youtube.com/watch?
v=baWQfHWozXc
• Streaming Aggregations by Erick Erickson — https://www.youtube.com/
watch?v=n5SYlw0vSFw
Thank you
shalin@apache.org
@shalinmangar

More Related Content

What's hot

Numeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and SolrNumeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and SolrVadim Kirilchuk
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Edureka!
 
The vector space model
The vector space modelThe vector space model
The vector space modelpkgosh
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Lucidworks
 
Performance of Microservice frameworks on different JVMs
Performance of Microservice frameworks on different JVMsPerformance of Microservice frameworks on different JVMs
Performance of Microservice frameworks on different JVMsMaarten Smeets
 
OSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchOSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchNETWAYS
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Consuming RealTime Signals in Solr
Consuming RealTime Signals in Solr Consuming RealTime Signals in Solr
Consuming RealTime Signals in Solr Umesh Prasad
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in luceneLucidworks (Archived)
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...ForgeRock
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in NetflixDanny Yuan
 

What's hot (20)

Search@flipkart
Search@flipkartSearch@flipkart
Search@flipkart
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Numeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and SolrNumeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and Solr
 
Splunk ES Asset & Identity
Splunk ES Asset & IdentitySplunk ES Asset & Identity
Splunk ES Asset & Identity
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
 
Performance of Microservice frameworks on different JVMs
Performance of Microservice frameworks on different JVMsPerformance of Microservice frameworks on different JVMs
Performance of Microservice frameworks on different JVMs
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
OSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchOSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearch
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Elk stack
Elk stackElk stack
Elk stack
 
Consuming RealTime Signals in Solr
Consuming RealTime Signals in Solr Consuming RealTime Signals in Solr
Consuming RealTime Signals in Solr
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in lucene
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 

Similar to Parallel SQL and Streaming Expressions in Apache Solr 6

Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
AI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsAI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsDataWorks Summit
 
Webinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphWebinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphLucidworks
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Lucidworks
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudAnshum Gupta
 
Parallel SQL for SolrCloud
Parallel SQL for SolrCloudParallel SQL for SolrCloud
Parallel SQL for SolrCloudJoel Bernstein
 
Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco
Parallel Computing with SolrCloud: Presented by Joel Bernstein, AlfrescoParallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco
Parallel Computing with SolrCloud: Presented by Joel Bernstein, AlfrescoLucidworks
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaSpark Summit
 
Solr/Elasticsearch for CF Developers (and others)
Solr/Elasticsearch for CF Developers (and others)Solr/Elasticsearch for CF Developers (and others)
Solr/Elasticsearch for CF Developers (and others)Mary Jo Sminkey
 
Solr at zvents 6 years later & still going strong
Solr at zvents   6 years later & still going strongSolr at zvents   6 years later & still going strong
Solr at zvents 6 years later & still going stronglucenerevolution
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scaleAnshum Gupta
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0Anshum Gupta
 
ITB2017 - Slaying the ORM dragons with cborm
ITB2017 - Slaying the ORM dragons with cbormITB2017 - Slaying the ORM dragons with cborm
ITB2017 - Slaying the ORM dragons with cbormOrtus Solutions, Corp
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relationJay Bharat
 

Similar to Parallel SQL and Streaming Expressions in Apache Solr 6 (20)

Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
AI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsAI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analytics
 
Webinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphWebinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and Graph
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Parallel SQL for SolrCloud
Parallel SQL for SolrCloudParallel SQL for SolrCloud
Parallel SQL for SolrCloud
 
Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco
Parallel Computing with SolrCloud: Presented by Joel Bernstein, AlfrescoParallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco
Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
 
Solr/Elasticsearch for CF Developers (and others)
Solr/Elasticsearch for CF Developers (and others)Solr/Elasticsearch for CF Developers (and others)
Solr/Elasticsearch for CF Developers (and others)
 
Solr at zvents 6 years later & still going strong
Solr at zvents   6 years later & still going strongSolr at zvents   6 years later & still going strong
Solr at zvents 6 years later & still going strong
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scale
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0
 
ITB2017 - Slaying the ORM dragons with cborm
ITB2017 - Slaying the ORM dragons with cbormITB2017 - Slaying the ORM dragons with cborm
ITB2017 - Slaying the ORM dragons with cborm
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 

More from Shalin Shekhar Mangar

Solr BoF (Birds of a Feather) session at Fifth Elephant 2018
Solr BoF (Birds of a Feather) session at Fifth Elephant 2018Solr BoF (Birds of a Feather) session at Fifth Elephant 2018
Solr BoF (Birds of a Feather) session at Fifth Elephant 2018Shalin Shekhar Mangar
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Shalin Shekhar Mangar
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksShalin Shekhar Mangar
 
Inside Solr 5 - Bangalore Solr/Lucene Meetup
Inside Solr 5 - Bangalore Solr/Lucene MeetupInside Solr 5 - Bangalore Solr/Lucene Meetup
Inside Solr 5 - Bangalore Solr/Lucene MeetupShalin Shekhar Mangar
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Shalin Shekhar Mangar
 
GIDS2014: SolrCloud: Searching Big Data
GIDS2014: SolrCloud: Searching Big DataGIDS2014: SolrCloud: Searching Big Data
GIDS2014: SolrCloud: Searching Big DataShalin Shekhar Mangar
 
Get involved with the Apache Software Foundation
Get involved with the Apache Software FoundationGet involved with the Apache Software Foundation
Get involved with the Apache Software FoundationShalin Shekhar Mangar
 

More from Shalin Shekhar Mangar (11)

Solr BoF (Birds of a Feather) session at Fifth Elephant 2018
Solr BoF (Birds of a Feather) session at Fifth Elephant 2018Solr BoF (Birds of a Feather) session at Fifth Elephant 2018
Solr BoF (Birds of a Feather) session at Fifth Elephant 2018
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
 
Intro to Apache Solr
Intro to Apache SolrIntro to Apache Solr
Intro to Apache Solr
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networks
 
Inside Solr 5 - Bangalore Solr/Lucene Meetup
Inside Solr 5 - Bangalore Solr/Lucene MeetupInside Solr 5 - Bangalore Solr/Lucene Meetup
Inside Solr 5 - Bangalore Solr/Lucene Meetup
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
GIDS2014: SolrCloud: Searching Big Data
GIDS2014: SolrCloud: Searching Big DataGIDS2014: SolrCloud: Searching Big Data
GIDS2014: SolrCloud: Searching Big Data
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
SolrCloud and Shard Splitting
SolrCloud and Shard SplittingSolrCloud and Shard Splitting
SolrCloud and Shard Splitting
 
Get involved with the Apache Software Foundation
Get involved with the Apache Software FoundationGet involved with the Apache Software Foundation
Get involved with the Apache Software Foundation
 

Recently uploaded

EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?Watsoo Telematics
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 

Recently uploaded (20)

EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 

Parallel SQL and Streaming Expressions in Apache Solr 6

  • 1.
  • 2. Parallel SQL and Streaming Expressions in Apache Solr 6 Shalin Shekhar Mangar @shalinmangar Lucidworks Inc.
  • 3. Introduction • Shalin Shekhar Mangar • Lucene/Solr Committer • PMC Member • Senior Solr Consultant with Lucidworks Inc.
  • 4. The standard for enterprise search. of Fortune 500 uses Solr. 90%
  • 5. • Full text search (Info Retr.) • Facets/Guided Nav galore! • Lots of data types • Spelling, auto-complete, highlighting • Cursors • More Like This • De-duplication • Apache Lucene • Grouping and Joins • Stats, expressions, transformations and more • Lang. Detection • Extensible • Massive Scale/Fault tolerance Solr Key Features
  • 6. Why SQL • Simple, well-known interface to data inside Solr • Hides the complexity of Solr and its various features • Possible to optimise the query plan according to best-practices automatically • Distributed Joins done simply and well
  • 7. Solr 6: Parallel SQL • Parallel execution of SQL across SolrCloud collections • Compiled to SolrJ Streaming API (TupleStream) which is a general purpose parallel computing framework for Solr • Executed in parallel over SolrCloud worker nodes • SolrCloud collections are relational ‘tables’ • JDBC thin client as a SolrJ client
  • 9. SQL Interface at a glance • SQL over Map/Reduce — for high cardinality aggregations and distributed joins • SQL over Facets — high performance, moderate cardinality aggregations • SQL with Solr powered search queries • Fully integrated with SolrCloud • SQL over JDBC or HTTP — http://host:port/solr/collection1/sql
  • 10. Limited vs Unlimited SELECT • select movie, director from IMDB Returns the entire result set! Return fields must be DocValues • select movie, directory from IMDB limit 100 Returns specified number of records. It can sort by score and retrieve any stored field • select movie, director from IMDB order by rating desc, num_voters desc
  • 11. Search predicates • select movie, director from IMDB where actor = ‘bruce’ • select movie, director from IMDB where actor = ‘(bruce tom)’ • select movie, director from IMDB where rating = ‘[8 TO *]’ • select movie, director from IMDB where (actor = ‘(bruce tom)’ AND rating = ‘[8 TO *]’) Search predicates are Solr queries specified inside single-quotes Can specify arbitrary boolean clauses
  • 12. Select DISTINCT • select distinct actor_name from IMDB • Map/Reduce implementation — Tuples are shuffled to worker nodes and operation is performed by workers • JSON Facet implementation — operation is ‘pushed down’ to Solr
  • 13. Stats aggregations • select count(*), sum(num_voters) from IMDB • Computed using Solr’s StatsComponent under the hood • count, sum, avg, min, max are the supported aggregations • Always pushed down into the search engine
  • 14. GROUP BY Aggregations • select actor_name, director, count(*), sum(num_voters) from IMDB group by actor_name, director having count(*) > 5 and sum(num_voters) > 1000 order by sum(num_voters) desc • Has a map/reduce implementation (shuffle) and a JSON Facet implementation (push down) • Multi-dimensional, high cardinality aggregations are possible with the map/reduce implementation
  • 15.
  • 16. JDBC • Part of SolrJ • SolrCloud Aware Load Balancing • Connection has ‘aggregationMode’ parameter that can switch between map_reduce or facet • jdbc:solr://SOLR_ZK_CONNECTION_STRING? collection=COLLECTION_NAME&aggregationMode=facet
  • 18. Solr’s Parallel Computing Framework • Streaming API • Streaming Expressions • Shuffling • Worker collections • Parallel SQL
  • 19. Streaming API • Java API for parallel computation • Real-time Map/Reduce and Parallel Relational Algebra • Search results are streams of tuples (TupleStream) • Transformed in parallel by Decorator streams • Transformations include group by, rollup, union, intersection, complement, joins • org.apache.solr.client.solrj.io.*
  • 20. Streaming API • Streaming Transformation Operations that transform the underlying streams e.g. unique, group by, rollup, union, intersection, complement, join etc • Streaming Aggregation Operations that gather metrics and compute aggregates e.g. sum, count, average, min, max etc
  • 21. Streaming Expressions • String Query Language and Serialisation format for the Streaming API • Streaming expressions compile to TupleStream • TupleStream serialise to Streaming Expressions • Human friendly syntax for Streaming API accessible to non-Java folks as well • Can be used directly via HTTP to SolrJ
  • 23. Streaming Expressions • Stream Sources The origin of a TupleStream search, jdbc, facet, stats, topic • Stream Decorators Wrap other stream functions and perform operations on the stream complement, hashJoin, innerJoin, merge, intersect, top, unique • Many streams can be paralleled across worker collections
  • 24. Shuffling • Shuffling is pushed down to Solr • Sorting is done by /export handler which stream-sorts entire result sets • Partitioning is done by HashQParserPlugin which is a filter that partitions on arbitrary fields • Tuples (search results) start streaming instantly to worker nodes never requiring a spill to the disk. • All replicas shuffle in parallel for the same query which allows for massively parallel IO and huge throughputs.
  • 25. Worker collections • Regular SolrCloud collections • Perform streaming aggregations using the Streaming API • Receive shuffled streams from the replicas • Over an HTTP endpoint: /stream • May be empty or created just-in-time for specific analytical queries or have data as any regular SolrCloud collection • The goal is to separate processing from data if necessary
  • 26. Parallel SQL • The Presto parser compiles SQL to a TupleStream • TupleStream is serialised to a Streaming Expression and sent over the wire to worker nodes • Worker nodes convert the Streaming Expression back into a TupleStream • Worker nodes open() and read() the TupleStream in parallel
  • 27.
  • 29. Graph traversals via streaming expressions • Shortest path • Node walking/gathering • Distributed Gremlin implementation
  • 31. Take actions based on AI driven alerts • DaemonStreams • AlertStream • ModelStream
  • 32. More, more, more! • UpdateStream • Publish-subscribe • Calcite integration • Better JDBC support
  • 33. References • Joel Bernstein’s Blog — http://joelsolr.blogspot.in/ • https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions • https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface • Parallel SQL by Joel Bernstein — https://www.youtube.com/watch? v=baWQfHWozXc • Streaming Aggregations by Erick Erickson — https://www.youtube.com/ watch?v=n5SYlw0vSFw