SlideShare a Scribd company logo
1 of 19
1
What problem are we solving?
• Map/Reduce can be used for aggregation…
  • Currently being used for totaling, averaging, etc
• Map/Reduce is a big hammer
  • Simpler tasks should be easier
    • Shouldn’t need to write JavaScript
    • Avoid the overhead of JavaScript engine
• We’re seeing requests for help in handling
  complex documents
  • Select only matching subdocuments or arrays
How will we solve the problem?
• Our new aggregation framework
  • Declarative framework
    • No JavaScript required
  • Describe a chain of operations to apply
  • Expression evaluation
    • Return computed values
  • Framework: we can add new operations easily
  • C++ implementation
    • Higher performance than JavaScript
Aggregation - Pipelines
• Aggregation requests specify a pipeline
• A pipeline is a series of operations
• Conceptually, the members of a collection
  are passed through a pipeline to produce a
  result
  • Similar to a command-line pipe
Pipeline Operations
• $match
  • Uses a query predicate (like .find({…})) as a filter
• $project
  • Uses a sample document to determine the shape
    of the result (similar to .find()’s optional argument)
    • This can include computed values
• $unwind
  • Hands out array elements one at a time
• $group
  • Aggregates items into buckets defined by a key
Pipeline Operations (continued)
• $sort
  • Sort documents
• $limit
  • Only allow the specified number of documents to
    pass
• $skip
  • Skip over the specified number of documents
Projections
• $project can reshape results
  • Include or exclude fields
  • Computed fields
    • Arithmetic expressions, including built-in functions
    • Pull fields from nested documents to the top
    • Push fields from the top down into new virtual
      documents
Unwinding
• $unwind can “stream” arrays
  • Array values are doled out one at time in the
    context of their surrounding documents
  • Makes it possible to filter out elements before
    returning
Grouping
• $group aggregation expressions
  • Define a grouping key as the _id of the result
  • Total grouped column values: $sum
  • Average grouped column values: $avg
  • Collect grouped column values in an array or set:
    $push, $addToSet
  • Other functions
    • $min, $max, $first, $last
Sorting
• $sort can sort documents
  • Sort specifications are the same as today, e.g.,
    $sort:{ key1: 1, key2: -1, …}
Computed Expressions
• Available in $project operations
• Prefix expression language
  • Add two fields: $add:[“$field1”, “$field2”]
  • Provide a value for a missing field:
    $ifNull:[“$field1”, “$field2”]
  • Nesting: $add:[“$field1”, $ifNull:[“$field2”,
    “$field3”]]
  • Other functions….
    • And we can easily add more as required
Computed Expressions (continued)
• String functions
  • toUpper, toLower, substr
• Date field extraction
  • Get year, month, day, hour, etc, from ISODate
• Date arithmetic
• Null value substitution (like MySQL ifnull(),
  Oracle nvl())
• Ternary conditional
  • Return one of two values based on a predicate
Demo
Demo files are at https://gist.github.com/1401585
Usage Tips
• Use $match in a pipeline as early as possible
  • The query optimizer can then choose to scan an
    index and avoid scanning the entire collection
• Use $sort in a pipeline as early as possible
  • The query optimizer can then be used to choose
    an index to scan instead of sorting the result
Driver Support
• Initial version is a command
  • For any language, build a JSON database object,
    and execute the command
    • In the shell: db.runCommand({ aggregate :
      <collection-name>, pipeline : {…} });
  • Beware of command result size limit
    • Document size limit is 16MB
Sharding support
• Initial release will support sharding
• Mongos analyzes pipeline, and forwards
  operations up to $group or $sort to shards;
  combines shard server results and returns
  them
When is this being released?
• In final development now
  • Adding an explain facility
• Expect to see this in the near future
Future Plans
• More optimizations
• $out pipeline operation
  • Saves the document stream to a collection
  • Similar to M/R $out, but with sharded output
  • Functions like a tee, so that intermediate results
    can be saved
mongodb-aggregation-may-2012

More Related Content

What's hot

Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, RocanaSolr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Lucidworks
 
Things you can find in the plan cache
Things you can find in the plan cacheThings you can find in the plan cache
Things you can find in the plan cache
sqlserver.co.il
 

What's hot (20)

Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, RocanaSolr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
 
SAX
SAXSAX
SAX
 
Query handlingbytheserver
Query handlingbytheserverQuery handlingbytheserver
Query handlingbytheserver
 
Asp #2
Asp #2Asp #2
Asp #2
 
Things you can find in the plan cache
Things you can find in the plan cacheThings you can find in the plan cache
Things you can find in the plan cache
 
Data centric Metaprogramming by Vlad Ulreche
Data centric Metaprogramming by Vlad UlrecheData centric Metaprogramming by Vlad Ulreche
Data centric Metaprogramming by Vlad Ulreche
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
You got schema in my json
You got schema in my jsonYou got schema in my json
You got schema in my json
 
Getting started with influx Db and Grafana Installation Guide
Getting started with influx Db and Grafana Installation GuideGetting started with influx Db and Grafana Installation Guide
Getting started with influx Db and Grafana Installation Guide
 
Up and Running with the Typelevel Stack
Up and Running with the Typelevel StackUp and Running with the Typelevel Stack
Up and Running with the Typelevel Stack
 
Utilizing the OpenNTF Domino API
Utilizing the OpenNTF Domino APIUtilizing the OpenNTF Domino API
Utilizing the OpenNTF Domino API
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
 
Head first latex
Head first latexHead first latex
Head first latex
 
Hadoop spark online demo
Hadoop spark online demoHadoop spark online demo
Hadoop spark online demo
 
Towards sql for streams
Towards sql for streamsTowards sql for streams
Towards sql for streams
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
 
Entity framework
Entity frameworkEntity framework
Entity framework
 
SQL on Big Data using Optiq
SQL on Big Data using OptiqSQL on Big Data using Optiq
SQL on Big Data using Optiq
 

Similar to mongodb-aggregation-may-2012

AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
DataWorks Summit
 

Similar to mongodb-aggregation-may-2012 (20)

MongoDB's New Aggregation framework
MongoDB's New Aggregation frameworkMongoDB's New Aggregation framework
MongoDB's New Aggregation framework
 
No sql Database
No sql DatabaseNo sql Database
No sql Database
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
cb streams - gavin pickin
cb streams - gavin pickincb streams - gavin pickin
cb streams - gavin pickin
 
New T-SQL Features in SQL Server 2012
New T-SQL Features in SQL Server 2012 New T-SQL Features in SQL Server 2012
New T-SQL Features in SQL Server 2012
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
 
Skillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet app
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Spring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffSpring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard Wolff
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentation
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Dapper
DapperDapper
Dapper
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy Cross
 

More from Chris Westin

Mysql proxy presentation_yahoo
Mysql proxy presentation_yahooMysql proxy presentation_yahoo
Mysql proxy presentation_yahoo
Chris Westin
 

More from Chris Westin (20)

Data torrent meetup-productioneng
Data torrent meetup-productionengData torrent meetup-productioneng
Data torrent meetup-productioneng
 
Gripshort
GripshortGripshort
Gripshort
 
Ambari hadoop-ops-meetup-2013-09-19.final
Ambari hadoop-ops-meetup-2013-09-19.finalAmbari hadoop-ops-meetup-2013-09-19.final
Ambari hadoop-ops-meetup-2013-09-19.final
 
Cluster management and automation with cloudera manager
Cluster management and automation with cloudera managerCluster management and automation with cloudera manager
Cluster management and automation with cloudera manager
 
Building low latency java applications with ehcache
Building low latency java applications with ehcacheBuilding low latency java applications with ehcache
Building low latency java applications with ehcache
 
SDN/OpenFlow #lspe
SDN/OpenFlow #lspeSDN/OpenFlow #lspe
SDN/OpenFlow #lspe
 
cfengine3 at #lspe
cfengine3 at #lspecfengine3 at #lspe
cfengine3 at #lspe
 
Nimbula lspe-2012-04-19
Nimbula lspe-2012-04-19Nimbula lspe-2012-04-19
Nimbula lspe-2012-04-19
 
mongodb-brief-intro-february-2012
mongodb-brief-intro-february-2012mongodb-brief-intro-february-2012
mongodb-brief-intro-february-2012
 
Stingray - Riverbed Technology
Stingray - Riverbed TechnologyStingray - Riverbed Technology
Stingray - Riverbed Technology
 
Replication and replica sets
Replication and replica setsReplication and replica sets
Replication and replica sets
 
Architecting a Scale Out Cloud Storage Solution
Architecting a Scale Out Cloud Storage SolutionArchitecting a Scale Out Cloud Storage Solution
Architecting a Scale Out Cloud Storage Solution
 
FlashCache
FlashCacheFlashCache
FlashCache
 
Large Scale Cacti
Large Scale CactiLarge Scale Cacti
Large Scale Cacti
 
MongoDB: An Introduction - July 2011
MongoDB:  An Introduction - July 2011MongoDB:  An Introduction - July 2011
MongoDB: An Introduction - July 2011
 
Practical Replication June-2011
Practical Replication June-2011Practical Replication June-2011
Practical Replication June-2011
 
MongoDB: An Introduction - june-2011
MongoDB:  An Introduction - june-2011MongoDB:  An Introduction - june-2011
MongoDB: An Introduction - june-2011
 
Ganglia Overview-v2
Ganglia Overview-v2Ganglia Overview-v2
Ganglia Overview-v2
 
Mysql Proxy Presentation Yahoo
Mysql Proxy Presentation YahooMysql Proxy Presentation Yahoo
Mysql Proxy Presentation Yahoo
 
Mysql proxy presentation_yahoo
Mysql proxy presentation_yahooMysql proxy presentation_yahoo
Mysql proxy presentation_yahoo
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

mongodb-aggregation-may-2012

  • 1. 1
  • 2. What problem are we solving? • Map/Reduce can be used for aggregation… • Currently being used for totaling, averaging, etc • Map/Reduce is a big hammer • Simpler tasks should be easier • Shouldn’t need to write JavaScript • Avoid the overhead of JavaScript engine • We’re seeing requests for help in handling complex documents • Select only matching subdocuments or arrays
  • 3. How will we solve the problem? • Our new aggregation framework • Declarative framework • No JavaScript required • Describe a chain of operations to apply • Expression evaluation • Return computed values • Framework: we can add new operations easily • C++ implementation • Higher performance than JavaScript
  • 4. Aggregation - Pipelines • Aggregation requests specify a pipeline • A pipeline is a series of operations • Conceptually, the members of a collection are passed through a pipeline to produce a result • Similar to a command-line pipe
  • 5. Pipeline Operations • $match • Uses a query predicate (like .find({…})) as a filter • $project • Uses a sample document to determine the shape of the result (similar to .find()’s optional argument) • This can include computed values • $unwind • Hands out array elements one at a time • $group • Aggregates items into buckets defined by a key
  • 6. Pipeline Operations (continued) • $sort • Sort documents • $limit • Only allow the specified number of documents to pass • $skip • Skip over the specified number of documents
  • 7. Projections • $project can reshape results • Include or exclude fields • Computed fields • Arithmetic expressions, including built-in functions • Pull fields from nested documents to the top • Push fields from the top down into new virtual documents
  • 8. Unwinding • $unwind can “stream” arrays • Array values are doled out one at time in the context of their surrounding documents • Makes it possible to filter out elements before returning
  • 9. Grouping • $group aggregation expressions • Define a grouping key as the _id of the result • Total grouped column values: $sum • Average grouped column values: $avg • Collect grouped column values in an array or set: $push, $addToSet • Other functions • $min, $max, $first, $last
  • 10. Sorting • $sort can sort documents • Sort specifications are the same as today, e.g., $sort:{ key1: 1, key2: -1, …}
  • 11. Computed Expressions • Available in $project operations • Prefix expression language • Add two fields: $add:[“$field1”, “$field2”] • Provide a value for a missing field: $ifNull:[“$field1”, “$field2”] • Nesting: $add:[“$field1”, $ifNull:[“$field2”, “$field3”]] • Other functions…. • And we can easily add more as required
  • 12. Computed Expressions (continued) • String functions • toUpper, toLower, substr • Date field extraction • Get year, month, day, hour, etc, from ISODate • Date arithmetic • Null value substitution (like MySQL ifnull(), Oracle nvl()) • Ternary conditional • Return one of two values based on a predicate
  • 13. Demo Demo files are at https://gist.github.com/1401585
  • 14. Usage Tips • Use $match in a pipeline as early as possible • The query optimizer can then choose to scan an index and avoid scanning the entire collection • Use $sort in a pipeline as early as possible • The query optimizer can then be used to choose an index to scan instead of sorting the result
  • 15. Driver Support • Initial version is a command • For any language, build a JSON database object, and execute the command • In the shell: db.runCommand({ aggregate : <collection-name>, pipeline : {…} }); • Beware of command result size limit • Document size limit is 16MB
  • 16. Sharding support • Initial release will support sharding • Mongos analyzes pipeline, and forwards operations up to $group or $sort to shards; combines shard server results and returns them
  • 17. When is this being released? • In final development now • Adding an explain facility • Expect to see this in the near future
  • 18. Future Plans • More optimizations • $out pipeline operation • Saves the document stream to a collection • Similar to M/R $out, but with sharded output • Functions like a tee, so that intermediate results can be saved