11. Documents are Rich Data Structures
{
  first_name: 'Paul',
  surname: 'Miller',
  cell: 447557505611,
  city: 'London',
  location: [45.123, 47.232],
  profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
• Typed fields
• Fields can contain arrays
• Fields can contain an array of sub-documents
12. Do More With Your Data
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
Rich Queries
Find everybody in London
with a car built between
1970 and 1980
Geospatial
Find all of the car owners
within 5km of Trafalgar Sq.
Text Search
Find all the cars described
as having leather seats
Aggregation
Calculate the average value
of Paul’s car collection
Map Reduce
What is the ownership
pattern of colors by
geography over time?
(is purple trending up in
China?)
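To make the aggregation example concrete, here is a plain-Python sketch (not MongoDB syntax) that mirrors the arithmetic of averaging Paul's car values from the sample document above. In MongoDB itself this would be an aggregation pipeline, typically `$unwind` on `cars` followed by `$group` with `$avg`:

```python
# Sample document from the slide (elided fields dropped for brevity)
paul = {
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# "Calculate the average value of Paul's car collection"
values = [car["value"] for car in paul["cars"]]
average_value = sum(values) / len(values)
print(average_value)  # 215000.0
```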
13. Automatic Sharding
Three types: hash-based, range-based, location-aware
Increase or decrease capacity as you go
Automatic balancing
18. Storage Engine Architecture in 3.2
[Architecture diagram: example workloads (content repo, IoT sensor backend, ad service, customer analytics, archive) all sit on top of the MongoDB Query Language (MQL) + native drivers and the MongoDB document data model, backed by the pluggable storage engines supported in MongoDB 3.2 — WiredTiger, MMAPv1, in-memory (beta), encrypted, and 3rd-party engines — with management and security spanning the stack.]
21. What Type of Servers
• RAM
– 64–256 GB+
• Fast IO Systems
– RAID-10/SSDs
• Many cores
– Compress/Decompress
– Encrypt/Decrypt
– Aggregation queries
22. What about a SAN?
• Mostly Random Disk Access
• IOPS
• Need dedicated IOPS or performance will vary
• Configure your SAN properly
• Suitability of any IO system will depend upon IOPS
25. • Sum of disk space across shards must be greater than the required storage size
Disk Space: How Many Shards Do I Need?
26. • Sum of disk space across shards must be greater than the required storage size
Disk Space: How Many Shards Do I Need?
Example
Data Size = 9 TB
WiredTiger Compression Ratio: .33
Storage size = 3 TB
Server disk capacity = 2 TB
2 Shards Required
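The arithmetic behind this example can be sketched as follows (the compression ratio and per-server disk capacity are the slide's assumed figures):

```python
import math

data_size_tb = 9          # raw data size
compression_ratio = 0.33  # WiredTiger compression ratio from the slide
server_disk_tb = 2        # usable disk per shard

storage_size_tb = data_size_tb * compression_ratio   # ~3 TB on disk
shards_needed = math.ceil(storage_size_tb / server_disk_tb)
print(shards_needed)  # 2
```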
27. • Working set should fit in RAM
– Sum of RAM across shards > working set
• Working set = indexes plus the set of frequently accessed documents
• Working set in RAM means:
– Shorter latency
– Higher throughput
RAM: How Many Shards Do I Need?
28. • Measuring Index Size
– db.coll.stats() – index size of collection
• Estimate frequently accessed documents
– Ex: total size of documents accessed
per day
RAM: How Many Shards Do I Need?
29. • Measuring Index Size
– db.coll.stats() – index size of collection
• Estimate frequently accessed documents
– Ex: total size of documents accessed
per day
RAM: How Many Shards Do I Need?
Example
Working Set = 428 GB
Server RAM = 128 GB
428/128 = 3.34
4 Shards Required
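The RAM sizing in this example is a simple ceiling division, which can be sketched as:

```python
import math

working_set_gb = 428   # indexes + frequently accessed documents
server_ram_gb = 128    # RAM per shard server

# 428 / 128 = 3.34, so round up to the next whole shard
shards_needed = math.ceil(working_set_gb / server_ram_gb)
print(shards_needed)  # 4
```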
30. • Measure max sustained query rate of a single server (with replication)
– build a prototype and measure
• Assume sharding overhead of 20-30%
Query Rate: How Many Shards Do I Need?
31. • Measure max sustained query rate of a single server (with replication)
– build a prototype and measure
• Assume sharding overhead of 20-30%
Query Rate: How Many Shards Do I Need?
Example
Require: 50K ops/sec
Prototype performance: 20K ops/sec (1 replica set)
4 Shards Required: 80K ops/sec * .7 = 56K ops/sec
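Assuming the prototype figure is 20K ops/sec per replica set (the 56K result only works out that way), the sizing math can be sketched as, using the worst case of the 20–30% sharding overhead range:

```python
import math

required_ops = 50_000     # target throughput for the application
per_shard_ops = 20_000    # measured on a single replica-set prototype
sharding_overhead = 0.30  # assume 20-30% overhead; use the worst case

effective_per_shard = per_shard_ops * (1 - sharding_overhead)  # 14K ops/sec
shards_needed = math.ceil(required_ops / effective_per_shard)
print(shards_needed)  # 4
```

Check: 4 shards × 20K ops/sec × 0.7 = 56K ops/sec, which covers the 50K requirement.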
33. Configure Them Properly
• Default OS Settings Often Don’t Provide Optimal Performance
• See MongoDB Production Notes
– https://docs.mongodb.org/manual/administration/production-notes
• Also Review:
– Amazon EC2: https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
– Azure: https://docs.mongodb.org/ecosystem/platforms/windows-azure/
34. Server/OS Configuration
• Server configuration recommendations
– XFS
– Turn off atime and diratime
– NOOP scheduler
– File descriptor limits
– Disable transparent huge pages and NUMA
– Read ahead of 32
– Separate data volumes for data files, the journal, and the log.
– Change the default TCP keepalive time to 300 seconds.
35. These are important
• Ignore them and your performance may suffer
• The first 100 lines of the MongoDB log identify
suboptimal OS settings
38. Tailor MongoDB Schema to Application
Workload
• Design schema to provide good query performance
• Schema design will impact required number of shards!
Application
Query Workload
{
  Name: 'john',
  Height: 12,
  Address: {…}
}
db.cust.find({…})
db.cust.aggregate({…})
39. Compare Alternative Schemas
• Build a spreadsheet
• Calculate # of shards for each schema
• Estimate query performance
– # of documents
– # of inserts
– # of deletes
– Required indexes
– Number of documents inspected
– Number of documents sent across network
43. Embedding
• Advantages
– Retrieve all relevant information in a single query/document
– Avoid implementing joins in application code
– Update related information as a single atomic operation
• MongoDB doesn’t offer multi-document transactions
• Limitations
– Large documents mean more overhead if most fields are not relevant
– Might mean replicating data
– 16 MB document size limit
44. Referencing
• Advantages
– Smaller documents
– Less likely to reach 16 MB document limit
– Infrequently accessed information not accessed on every query
– No duplication of data
• Limitations
– Two queries required to retrieve information
– Cannot update related information atomically
49. Data From Vital Signs Monitoring Device
{
deviceId: 123456,
spO2: 88,
pulse: 74,
bp: [128, 80],
ts: ISODate("2013-10-16T22:07:00.000-0500")
}
• One document per minute per device
• Equivalent to a relational, row-per-reading approach
50. Document Per Hour (By minute)
{
deviceId: 123456,
spO2: { 0: 88, 1: 90, …, 59: 92},
pulse: { 0: 74, 1: 76, …, 59: 72},
bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]},
ts: ISODate("2013-10-16T22:00:00.000-0500")
}
• Store per-minute data at the hourly level
• Update-driven workload
• 1 document per device per hour
51. Characterizing Write Differences
• Example: data generated every minute
• Recording the data for 1 patient for 1 hour:
Document Per Event
60 inserts
Document Per Hour
1 insert, 59 updates
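A small in-memory sketch of the two write patterns (the dicts here stand in for collections; real code would issue `insert_one`/`update_one` against MongoDB):

```python
# Simulate recording 60 per-minute readings for one device over one hour.

# Document-per-event schema: every reading is its own insert.
event_docs = [{"deviceId": 123456, "minute": m, "pulse": 74} for m in range(60)]
inserts_per_event = len(event_docs)  # 60 inserts

# Document-per-hour (bucket) schema: the first reading inserts the hourly
# document, the remaining 59 readings update it in place.
hour_doc = None
inserts_per_hour = updates_per_hour = 0
for m in range(60):
    if hour_doc is None:
        hour_doc = {"deviceId": 123456, "pulse": {m: 74}}  # insert
        inserts_per_hour += 1
    else:
        hour_doc["pulse"][m] = 74                          # update (a $set)
        updates_per_hour += 1

print(inserts_per_event, inserts_per_hour, updates_per_hour)  # 60 1 59
```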
52. Characterizing Read Differences
• Want to graph 24 hours of vital signs for a patient:
• Read performance is greatly improved
Document Per Event
1440 reads
Document Per Hour
24 reads
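The read counts follow directly from the two schemas:

```python
hours_to_graph = 24
reads_per_event_schema = hours_to_graph * 60  # one document per minute
reads_per_hour_schema = hours_to_graph        # one document per hour
print(reads_per_event_schema, reads_per_hour_schema)  # 1440 24
```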
53. Characterizing Memory and Storage
Differences
                        Document Per Minute   Document Per Hour
Number of documents     52.6 B                876 M
Total index size        6364 GB               106 GB
_id index               1468 GB               24.5 GB
{ts: 1, deviceId: 1}    4895 GB               81.6 GB
Document size           92 bytes              758 bytes
Database size           4503 GB               618 GB
• 100K devices
• 1 year's worth of data
54.–56. Characterizing Memory and Storage Differences (cont.)
• Number of documents: 100000 × 365 × 24 × 60 ≈ 52.6 B (per minute); 100000 × 365 × 24 = 876 M (per hour)
• Total index size, at ~130 index bytes per document: 52.6 B × 130 bytes ≈ 6364 GB; 876 M × 130 bytes ≈ 106 GB
• Database size, from the average document sizes: 52.6 B × 92 bytes ≈ 4503 GB; 876 M × 758 bytes ≈ 618 GB
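The table's headline numbers can be reproduced with a quick back-of-the-envelope script (sizes in GiB; the ~130 index bytes per document is the figure from the slide's callouts):

```python
GIB = 1024 ** 3
devices = 100_000
minutes_per_year = 365 * 24 * 60
hours_per_year = 365 * 24

# Number of documents under each schema
docs_minute_schema = devices * minutes_per_year  # ~52.6 B
docs_hour_schema = devices * hours_per_year      # 876 M

# Total index size, assuming ~130 index bytes per document
index_gb_minute = docs_minute_schema * 130 / GIB  # ~6364 GB
index_gb_hour = docs_hour_schema * 130 / GIB      # ~106 GB

# Database size from the average document sizes on the slide
db_gb_minute = docs_minute_schema * 92 / GIB      # ~4503 GB
db_gb_hour = docs_hour_schema * 758 / GIB         # ~618 GB

print(round(index_gb_minute), round(db_gb_minute))  # 6364 4503
print(round(index_gb_hour), round(db_gb_hour))      # 106 618
```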
63. When Sharding
• If you care about initial performance, you must pre-split
• Otherwise, initial performance will be slow
• (hash-based sharding automatically pre-splits the collection)
70. Best Practices
1. Use servers with specifications that will provide good MongoDB performance
– 64+ GB RAM, many cores, many IOPS (RAID-10/SSDs)
2. Calculate How Many Shards?
1. Calculate required RAM and Disk Space
2. Build a prototype to determine the ops/sec capacity of a server
3. Do the math
3. Configure OS for Optimal MongoDB Performance
– See MongoDB Production Notes
– Review logs for warnings (Don’t ignore)
71. Best Practices (cont.)
4. Create a Document Schema
– Denormalized
5. Tailor schema to application workload
– Use application queries to guide schema design decisions
– Consider alternative schemas
– Compare cluster size (# of shards) and performance
– Build a spreadsheet
72. Best Practices
6. Loading Data
– Loader Hardware ~= MongoDB hardware
– Many threads
– Many mongos
7. Pre-split
– Ensure query workload is evenly distributed across the cluster from the start
Here we have greatly reduced the relational data model for this application to two tables. In reality no database has just two tables; it is much more common to have hundreds or thousands. As a developer, where do you begin with a data model that complex? If you're building an app you're really thinking about just a handful of common things, like products, and these can be represented in a document much more easily than in a complex relational model, where the data is broken up in a way that doesn't really reflect the way you think about the data or write an application.
MongoDB provides horizontal scale-out for databases using a technique called sharding, which is transparent to applications. Sharding distributes data across multiple physical partitions called shards. Sharding allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application.
MongoDB supports three types of sharding:
• Range-based Sharding. Documents are partitioned across shards according to the shard key value. Documents with shard key values “close” to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range-based queries.
• Hash-based Sharding. Documents are uniformly distributed according to an MD5 hash of the shard key value. Documents with shard key values “close” to one another are unlikely to be co-located on the same shard. This approach guarantees a uniform distribution of writes across shards, but is less optimal for range-based queries.
• Tag-aware Sharding. Documents are partitioned according to a user-specified configuration that associates shard key ranges with shards. Users can optimize the physical location of documents for application requirements such as locating data in specific data centers.
MongoDB automatically balances the data in the cluster as the data grows or the size of the cluster increases or decreases.
Sharding is transparent to applications; whether there is one or one hundred shards, the application code for querying MongoDB is the same. Applications issue queries to a query router that dispatches the query to the appropriate shards.
For key-value queries that are based on the shard key, the query router will dispatch the query to the shard that manages the document with the requested key. When using range-based sharding, queries that specify ranges on the shard key are only dispatched to shards that contain documents with values within the range. For queries that don’t use the shard key, the query router will dispatch the query to all shards and aggregate and sort the results as appropriate. Multiple query routers can be used with a MongoDB system, and the appropriate number is determined based on performance and availability requirements of the application.
High Availability – Ensure application availability during many types of failures
Disaster Recovery – Address the RTO and RPO goals for business continuity
Maintenance – Perform upgrades and other maintenance operations with no application downtime
Secondaries can be used for a variety of purposes – failover, hot backup, rolling upgrades, data locality and privacy, and workload isolation