This document summarizes MetLife's experience adopting MongoDB as an enterprise database. It describes MetLife's system architecture choices including geographic placement across data centers, server hardware, storage, and networking configurations. It also discusses how MetLife designed the system for high availability, disaster recovery, security, monitoring, workload management, and automation to meet enterprise standards. Next steps include further automating operations and introducing additional solutions for backup, monitoring integration, workload management, and performance benchmarking.
Operationalizing the Value of MongoDB: The MetLife Experience
1. Page 1
Operationalizing the value of MongoDB (MetLife experience)
Thrills and challenges of building MongoDB operations in a large enterprise
2. Page 2
A Journey
When new technology meets enterprise standards:
- advantages and restrictions of large enterprises
- it is always a journey
- decisions we have to live with
3. Page 3
Highly Successful Adoption of New Technology
for a Fortune 50 Enterprise Organization
• Unknown technology
– Proves to be capable
• New platform
– Quickly matures
• Untested for the Enterprise
– Delivers success
• Many new things to learn
– Become experts in time
4. Page 4
Disclaimer
• The content in this presentation represents MetLife's choices and MongoDB
Inc.’s recommendations for MetLife’s specific use case. By no means is this a
“universal blueprint for success” and it doesn’t necessarily represent
MongoDB Inc.'s recommendations for all use cases.
• In particular, because some fixed decisions predated the MongoDB
implementation, MetLife's deployment may require some "manual intervention"
(specifically in case of DR), whereas other, differently organized
deployments might not.
6. Page 6
Basic System Architecture Decisions
• Company Data Center vs. Public Cloud Placement
Control vs. ease of use
MetLife: Compliance requirements dictate company data center(s) placement.
• Server type and sizes
Enterprise class servers vs. Pizza boxes
MetLife: More cost effective to run on enterprise class servers - 2x8 Core CPU, 512 GB RAM
• Virtualization
VM vs. “Bare Metal”
MetLife: Data nodes – physical servers, Configuration Servers and MongoS – VMs.
• SAN vs. Local storage
Flexibility of SAN vs. performance of local storage
MetLife: Local storage enclosures. 600 GB SAS drives.
• Network
Dedicated LAN for MongoDB replication
MetLife: No dedicated LANs for the MongoDB installation.
7. Page 7
Business Requirements and System Topology
Business requirements:
- mission critical application
- loss of an entire data center for an indefinite time should not limit application functionality in any way
- significant data growth is expected, as well as a significant increase in the number of users
These requirements drive the system topology:
a. Geographic placement
MetLife: Geographically dispersed cluster, spanning two data centers
b. Sharded cluster vs. Replica set
MetLife: Sharded cluster for elastic horizontal scalability
c. Number of nodes in the replica set
MetLife: Minimum of 6 to ensure full operability in case of one data center loss.
d. Writes and reads geography
MetLife: Write concern is chosen per business function; reads are mostly "secondary preferred".
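The write and read geography described above can be sketched as a small lookup that picks a write concern per business function and builds a connection string with a secondary-preferred read preference. This is only an illustration: the business function names, host names, and the specific write-concern choices are hypothetical and do not come from the presentation. `readPreference`, `w`, and `journal` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode

# Hypothetical business functions mapped to write-concern settings.
# The names and the chosen values are assumptions for illustration only.
WRITE_CONCERNS = {
    "policy_update": {"w": "majority", "j": True},  # should survive a failover
    "activity_log": {"w": 1, "j": False},           # acknowledged by the primary only
}

DEFAULT_WRITE_CONCERN = {"w": 1, "j": False}


def write_concern_for(business_function):
    """Return a copy of the write concern assumed for a business function."""
    return dict(WRITE_CONCERNS.get(business_function, DEFAULT_WRITE_CONCERN))


def mongo_uri(mongos_hosts, business_function):
    """Build a connection string that reads from secondaries when possible."""
    wc = write_concern_for(business_function)
    options = {
        "readPreference": "secondaryPreferred",
        "w": wc["w"],
        "journal": "true" if wc["j"] else "false",
    }
    return "mongodb://" + ",".join(mongos_hosts) + "/?" + urlencode(options)


if __name__ == "__main__":
    # Hypothetical mongos host names.
    print(mongo_uri(["mongos1.example:27017", "mongos2.example:27017"], "policy_update"))
```

Keeping the mapping in one place makes it easier to review which business functions pay the latency cost of a cross-data-center majority write.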
8. Page 8
System topology
[Diagram: sharded cluster topology spanning Data Center 1 and Data Center 2]
- Data Center 1 (local): Local Prod Replica 1, Primary Prod, Local Prod Hidden Replica (for backups)
- Data Center 2 (remote): Remote Prod Replica 1, Remote Prod Replica 2, Remote Prod Hidden Replica (for backups)
- Each of the six replica boxes is annotated "2 shards comprise this"
- Configuration Servers 1, 2, and 3
- Four MongoS servers
- Two Backup Solution boxes
9. Page 9
System Setup for Availability and DR
System has to comply with MetLife’s enterprise standard for availability and DR
(No single points of failure):
a. Replica sets
MetLife: 6-member replica sets (3 in each data center), including 2 hidden replicas for backup purposes.
There are 5 voting members (the hidden replica in the DR data center has 0 votes), and 2 replicas in
the primary data center have higher priority.
b. Mongo Configuration servers
MetLife: 3 configuration servers (2 in the primary data center and 1 in the DR data center). Loss of an
entire data center halts cluster balancing, but not application functionality.
c. MongoS
MetLife: 4 MongoS servers (2 in each data center). All active.
d. Application servers connectivity
MetLife: MongoDB drivers on the application servers are configured to use all MongoS servers, but in a
different order on each server, for pseudo load balancing.
e. DR exercise
MetLife: DR exercise is conducted yearly and includes all database and application infrastructure to
ensure complete operability from DR data center.
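The replica set layout in item (a) can be written down as a member list and checked for voting majority after a data center loss. The sketch below is an assumption-laden illustration: host names and the `dc` tag values are hypothetical, and the exact priority numbers are guesses (the slide only says two primary-site members have higher priority). The fields `priority`, `votes`, `hidden`, and `tags` are standard replica set member settings, and hidden members must have priority 0.

```python
def build_members():
    """Six members, three per data center, one hidden backup member per site."""
    return [
        # Primary data center: two higher-priority members plus a hidden, voting member.
        {"_id": 0, "host": "dc1-a.example:27017", "priority": 2, "votes": 1,
         "hidden": False, "tags": {"dc": "dc1"}},
        {"_id": 1, "host": "dc1-b.example:27017", "priority": 2, "votes": 1,
         "hidden": False, "tags": {"dc": "dc1"}},
        {"_id": 2, "host": "dc1-bkp.example:27017", "priority": 0, "votes": 1,
         "hidden": True, "tags": {"dc": "dc1"}},
        # DR data center: two normal members plus a hidden member with 0 votes.
        {"_id": 3, "host": "dc2-a.example:27017", "priority": 1, "votes": 1,
         "hidden": False, "tags": {"dc": "dc2"}},
        {"_id": 4, "host": "dc2-b.example:27017", "priority": 1, "votes": 1,
         "hidden": False, "tags": {"dc": "dc2"}},
        {"_id": 5, "host": "dc2-bkp.example:27017", "priority": 0, "votes": 0,
         "hidden": True, "tags": {"dc": "dc2"}},
    ]


def can_elect_primary(members, lost_dc):
    """True if the surviving voters still hold a strict majority of all votes."""
    total_votes = sum(m["votes"] for m in members)
    surviving = sum(m["votes"] for m in members if m["tags"]["dc"] != lost_dc)
    return surviving > total_votes // 2


if __name__ == "__main__":
    members = build_members()
    print("lose dc2:", can_elect_primary(members, "dc2"))
    print("lose dc1:", can_elect_primary(members, "dc1"))
```

Under these assumed vote assignments, losing the DR data center leaves 3 of 5 votes, so a primary can still be elected automatically. Losing the primary data center leaves only 2 of 5 votes, which is consistent with the disclaimer's note that DR may require manual intervention in this deployment.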
10. Page 10
System Setup for Recoverability
System has to comply with MetLife’s enterprise standard for recoverability:
Backup and Recovery strategy.
MetLife:
- Daily backups in both data centers (alternating).
- Backups of hidden replicas are performed with mongod brought down and the balancer stopped.
- Due to the database size, backup is performed at the file system level.
- At the same time, the configuration server is backed up using mongodump.
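The backup routine above can be sketched as a small planner that alternates the backup site daily and lists the steps in order. This is a sketch only: the data center names, the ordering of the first two steps, and the two restart steps at the end are assumptions that the slide does not state. `sh.stopBalancer()` and `mongodump` are real MongoDB tools, but the step strings are descriptions, not commands this code executes.

```python
from datetime import date

DATA_CENTERS = ("dc1", "dc2")  # hypothetical site names


def backup_site_for(day):
    """Alternate between the two sites on consecutive days."""
    return DATA_CENTERS[day.toordinal() % 2]


def backup_steps(day):
    """Return the ordered, human-readable backup steps for a given day."""
    site = backup_site_for(day)
    return [
        "stop the balancer (for example sh.stopBalancer() from a mongos)",
        f"shut down mongod on the hidden replicas in {site}",
        f"copy the data files at the file-system level in {site}; "
        "in parallel, run mongodump against a configuration server",
        f"restart mongod on the hidden replicas in {site} (assumed step)",
        "re-enable the balancer (assumed step)",
    ]


if __name__ == "__main__":
    for step in backup_steps(date.today()):
        print("-", step)
```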
Current challenges.
MetLife:
- No point-in-time recovery
- No easy way to restore one specific database
Using MMS Backup solution.
MetLife:
- MMS Backup can solve some of our current challenges.
- For compliance reasons, we cannot use the MMS cloud backup solution in AWS.
- We are currently looking into running the MMS Backup solution on premises.
11. Page 11
Security
System has to comply with MetLife’s enterprise standard for data security:
Authentication and authorization.
MetLife:
- The original build on MongoDB 2.2 had very limited options: in-database authentication,
and read or read/write permissions at the database level.
- Biggest concerns: for authentication, no password policy enforcement; for authorization,
excessive application permissions.
- MetLife's MongoDB 2.6 goals: for authentication, Active Directory; for authorization,
custom-built roles with the minimum set of permissions required by the application.
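A custom role with a minimal permission set, as targeted for MongoDB 2.6, could be described by a `createRole` command document like the one built below. The role name, database name, collection names, and the chosen actions are hypothetical. The document shape (`createRole`, `privileges` with `resource` and `actions`, and `roles`) follows the `createRole` database command introduced in 2.6. The sketch only builds the document; sending it to a server is left out.

```python
def least_privilege_role(role_name, db_name, collections, actions):
    """Build a createRole command document scoped to specific collections."""
    return {
        "createRole": role_name,
        "privileges": [
            {
                "resource": {"db": db_name, "collection": coll},
                "actions": list(actions),
            }
            for coll in collections
        ],
        "roles": [],  # inherit nothing from built-in roles
    }


if __name__ == "__main__":
    # Hypothetical application that only reads and writes two collections.
    cmd = least_privilege_role(
        "appReadWriteLimited",
        "appdb",
        ["customers", "policies"],
        ["find", "insert", "update"],
    )
    print(cmd)
```

Granting per-collection actions instead of a database-wide read/write role is one way to address the "excessive application permissions" concern listed above.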
LDAP integration
MetLife:
- Integration with Active Directory (AD) using Linux PAM
- Third-party product for secure server/AD communications
- Currently mixed mode (both AD and in-database) authentication
Data-at-rest encryption
MetLife: Data-at-rest encryption is implemented using a third-party product (Linux file system /
device encryption).
Audit.
MetLife:
- Tactical: MongoDB 2.6 audit capability can do the job.
- Strategic: Database activity audit is performed by third party product.
12. Page 12
Monitoring and Alerting
System has to comply with MetLife’s enterprise standard for monitoring and
alerting:
Hardware monitoring
MetLife: No munin-node monitoring. Using standard enterprise Linux server monitoring toolset
owned by MetLife
MongoDB monitoring with MMS
MetLife: Currently using MMS in the cloud for monitoring and alerting. Alerts are sent via SMS and
e-mail to responsible individuals in operations as well as to monitored group mailboxes.
BMC MongoDB Patrol KM as an alternative monitoring solution
MetLife: Third party Knowledge Modules are standard monitoring/alerting tools for MetLife’s
enterprise databases. Currently engaged in MongoDB KM beta-testing.
Integrating monitoring/alerting to the enterprise incident management system
MetLife: Currently no integration. Two approaches in parallel:
- In-house process that parses the JSON attachment from an MMS alert e-mail and creates an incident ticket
- Third party KM is natively integrated with enterprise incident management system
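The in-house approach (parse the JSON attachment of an alert e-mail, then create a ticket) might look roughly like the sketch below, using only the Python standard library. The attachment fields (`event`, `host`, `status`) and the ticket layout are hypothetical; the presentation does not describe the alert schema or the incident management system's API, so the final hand-off to that system is omitted.

```python
import email
import json
from email import policy


def extract_alert_json(raw_message):
    """Return the first application/json attachment as a dict, or None."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/json":
            payload = part.get_payload(decode=True)  # bytes after transfer decoding
            return json.loads(payload.decode("utf-8"))
    return None


def to_incident_ticket(alert):
    """Map an alert dict to a ticket dict; all field names are assumptions."""
    event = alert.get("event", "unknown")
    host = alert.get("host", "unknown")
    return {
        "summary": f"MongoDB alert: {event} on {host}",
        "severity": "high" if alert.get("status") == "OPEN" else "info",
        "source": "MMS",
    }
```

A caller would read the raw message bytes from the monitored mailbox, call `extract_alert_json`, and pass the result of `to_incident_ticket` to whatever interface the incident management system exposes.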
13. Page 13
Workload Management and Automation
System has to reliably support business SLAs and be efficient to manage:
Workload management and resource sharing.
MetLife: Workload management and resource sharing is one of the bigger challenges. MongoDB 2.6
does not have an in-database mechanism for managing different workloads, which makes
resource sharing problematic.
- Potential options: cgroups in RHEL 6
MMS automation (installation, upgrades).
MetLife: Engaged with MongoDB for MMS automation beta-testing.
14. Page 14
Next Steps in our Journey
• Automation (installation, upgrades, maintenance).
• MMS backup solution (on premises).
• Monitoring/alerting integration with an incident management system.
• Workload management / resource sharing solution.
• Introduction of an arbiter to existing replica sets (3rd data center).
• Performance benchmarking toolset.