It was a lot of fun bringing exciting emerging technology into the rigid enterprise infrastructure ecosystem. And then the real work began. How do you make the new technology operational? Learn from MetLife’s journey of operationalizing MongoDB to a level that complies with large-enterprise requirements for High Availability, Recoverability, Security, Monitoring, Alerting, Workload Management and Automation.
Operationalizing the Value of MongoDB: The MetLife Experience
Thrills and challenges of building MongoDB
operations in a large enterprise
When new technology meets enterprise standards:
- advantages and restrictions of large enterprises
- it is always a journey
- decisions we have to live with
Highly Successful Adoption of New Technology
for a Fortune 50 Enterprise Organization
• Unknown technology
– Proves to be capable
• New platform
– Quickly matures
• Untested for the Enterprise
– Delivers success
• Many new things to learn
– Become experts in time
• The content in this presentation represents MetLife's choices and MongoDB
Inc.’s recommendations for MetLife’s specific use case. By no means is this a
“universal blueprint for success” and it doesn’t necessarily represent
MongoDB Inc.'s recommendations for all use cases.
• In particular, because some fixed decisions predated the
MongoDB implementation, MetLife's deployment may require some
“manual intervention” (specifically in case of DR), whereas other, differently organized
deployments might not.
Basic System Architecture Decisions
• Company Data Center vs. Public Cloud Placement
Control vs. ease of use
MetLife: Compliance requirements dictate company data center(s) placement.
• Server type and sizes
Enterprise class servers vs. Pizza boxes
MetLife: More cost effective to run on enterprise class servers - 2x8 Core CPU, 512 GB RAM
VM vs. “Bare Metal”
MetLife: Data nodes – physical servers, Configuration Servers and MongoS – VMs.
• SAN vs. Local storage
Flexibility of SAN vs. performance of local storage
MetLife: Local storage enclosures. 600 GB SAS drives.
Dedicated LAN for MongoDB replication
MetLife: No dedicated LAN for the MongoDB installation.
Business Requirements and System Topology
- mission critical application
- loss of entire data center for indefinite time should not limit the application functionality in
- significant data growth is expected, as well as a significant increase in the number of users
Drive system topology:
a. Geographic placement
MetLife: Geographically dispersed cluster, spanning two data centers
b. Sharded cluster vs. Replica set
MetLife: Sharded cluster for elastic horizontal scalability
c. Number of nodes in the replica set
MetLife: Minimum of 6 to ensure full operability in case of one data center loss.
d. Writes and reads geography
MetLife: Business function driven write-concern implementation, reads are mostly
[Figure: cluster topology spanning Data Center 1 and Data Center 2]
System Setup for Availability and DR
System has to comply with MetLife’s enterprise standard for availability and DR
(No single points of failure):
a. Replica sets
MetLife: 6-member replica sets (3 in each data center), including 2 hidden replicas for backup purposes.
5 voting members (the hidden replica in the DR data center has 0 votes). 2 replicas in the primary data center
have higher priority.
b. MongoDB configuration servers
MetLife: 3 configuration servers (2 in the primary data center and 1 in the DR data center). Loss of an entire data
center halts cluster balancing, but not application functionality.
c. MongoS servers
MetLife: 4 MongoS servers (2 in each data center). All active.
d. Application servers connectivity
MetLife: MongoDB drivers on application servers are configured to use all MongoS but in a different
order for pseudo load balancing.
e. DR exercise
MetLife: A DR exercise is conducted yearly and includes all database and application infrastructure to
ensure complete operability from the DR data center.
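The replica set layout in item (a) can be sketched as a configuration document, together with the vote arithmetic for losing either data center. This is a minimal sketch, not MetLife's actual configuration: the host names, the priority value of 2, and the placement of one hidden replica in each data center are assumptions made for illustration. The field names (`_id`, `host`, `priority`, `votes`, `hidden`, `tags`) are the standard replica set member fields.

```python
def replica_set_config(name):
    """Build a 6-member replica set document in the shape used by rs.initiate()."""
    def member(i, host, dc, priority=1, votes=1, hidden=False):
        return {"_id": i, "host": host, "priority": priority,
                "votes": votes, "hidden": hidden, "tags": {"dc": dc}}

    members = [
        # Primary data center: two preferred data nodes plus a hidden backup node.
        member(0, "dc1-node1.example:27018", "dc1", priority=2),  # hypothetical host
        member(1, "dc1-node2.example:27018", "dc1", priority=2),  # hypothetical host
        member(2, "dc1-backup.example:27018", "dc1", priority=0, hidden=True),
        # DR data center: two data nodes plus a hidden, non-voting backup node.
        member(3, "dc2-node1.example:27018", "dc2"),
        member(4, "dc2-node2.example:27018", "dc2"),
        member(5, "dc2-backup.example:27018", "dc2", priority=0, votes=0, hidden=True),
    ]
    return {"_id": name, "members": members}


def can_elect_primary(config, lost_dc):
    """True if the surviving members hold a strict majority of all configured
    votes and at least one of them is eligible to become primary."""
    total_votes = sum(m["votes"] for m in config["members"])
    survivors = [m for m in config["members"] if m["tags"]["dc"] != lost_dc]
    surviving_votes = sum(m["votes"] for m in survivors)
    has_candidate = any(m["priority"] > 0 for m in survivors)
    return has_candidate and surviving_votes > total_votes // 2


cfg = replica_set_config("shard0")
print(can_elect_primary(cfg, "dc2"))  # DR data center lost: 3 of 5 votes remain
print(can_elect_primary(cfg, "dc1"))  # primary data center lost: 2 of 5 votes remain
```

Under these assumptions, losing the DR data center leaves a voting majority in the primary data center, while losing the primary data center does not leave a majority in DR. That is consistent with the earlier note that this deployment may require manual intervention in a DR scenario.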
System Setup for Recoverability
System has to comply with MetLife’s enterprise standard for recoverability:
Backup and Recovery strategy.
- Daily backups in both data centers (alternating).
- Backups of hidden replicas are performed with mongod brought down. Balancer is
- Due to the database size, the backup is performed at the file system level.
- At the same time, a backup of the configuration server is performed using mongodump.
- No point-in-time recovery
- No easy way to restore one specific database
Using MMS Backup solution.
- MMS Backup is capable of solving some of our current challenges.
- Due to compliance reasons, we cannot use the MMS cloud backup solution in AWS.
- Currently looking into the option of running the MMS Backup solution on premises.
System has to comply with MetLife’s enterprise standard for data security:
Authentication and authorization.
- Original build in MongoDB 2.2 had very limited options: in-database authentication
and read or read/write permission at the database level.
- Biggest concerns: authentication (no password policy enforcement) and authorization
(excessive application permissions).
- MetLife’s MongoDB 2.6 goals are: authentication via Active Directory, authorization via
custom-built roles with the minimum set of permissions required by the application.
- Integration with Active Directory (AD) using Linux PAM
- Third-party product for secure Server/AD communications
- Currently mixed mode (both AD and in-database) authentication
MetLife: Data-at-rest encryption is implemented using third-party product (LINUX file system /
- Tactical: MongoDB 2.6 audit capability can do the job.
- Strategic: Database activity audit is performed by third party product.
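The "custom-built roles with the minimum set of permissions" goal can be illustrated with the `createRole` command that MongoDB 2.6 introduced. The following is a minimal sketch that only builds the command document. The role, database, and collection names are hypothetical, and the chosen actions are an example rather than MetLife's actual permission set.

```python
def least_privilege_role(role_name, db_name, collection_actions):
    """Build a createRole command document that grants only the listed
    actions on the listed collections and inherits from no other role."""
    privileges = [
        {"resource": {"db": db_name, "collection": coll},
         "actions": sorted(actions)}
        for coll, actions in collection_actions.items()
    ]
    # "createRole" must be the first key of the command document;
    # Python dicts preserve insertion order, so it is.
    return {"createRole": role_name, "privileges": privileges, "roles": []}


# Hypothetical application role: read/write on one collection, read-only on another.
command = least_privilege_role(
    "appCustomerRole",
    "customerdb",
    {"profiles": ["find", "insert", "update"], "reference": ["find"]},
)
print(command)
```

With a driver such as PyMongo, a document like this could be passed to the database's `command` method by an administrative user. The point of the sketch is that the application account receives named actions on named collections instead of a database-wide read/write grant.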
Monitoring and Alerting
System has to comply with MetLife’s enterprise standard for monitoring and alerting:
MetLife: No munin-node monitoring. Using standard enterprise Linux server monitoring toolset
owned by MetLife
MongoDB monitoring with MMS
MetLife: Currently using MMS in the cloud for monitoring and alerting. Alerts are sent via SMS and
e-mail to responsible individuals in operations as well as to monitored group mailboxes.
BMC MongoDB Patrol KM as an alternative monitoring solution
MetLife: Third party Knowledge Modules are standard monitoring/alerting tools for MetLife’s
enterprise databases. Currently engaged in MongoDB KM beta-testing.
Integrating monitoring/alerting to the enterprise incident management system
MetLife: Currently no integration. Two approaches in parallel:
- In-house written process to parse JSON attachment from MMS alert e-mail and create
- Third party KM is natively integrated with enterprise incident management system
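The first approach, parsing the JSON attachment of an alert e-mail, might look roughly like the sketch below. It uses only the Python standard library. The JSON field names (`event`, `host`, `status`) and the shape of the resulting incident record are hypothetical. They are not taken from the actual MMS alert format or from any incident management system.

```python
import json
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_alert(raw_email: bytes) -> dict:
    """Return the first JSON attachment of an e-mail as a dictionary."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/json":
            return json.loads(part.get_content())
    raise ValueError("no JSON attachment found")


def to_incident(alert: dict) -> dict:
    """Map an alert to a hypothetical incident record."""
    event = alert.get("event", "unknown")
    host = alert.get("host", "unknown")
    return {
        "summary": f"MongoDB alert: {event} on {host}",
        "severity": "high" if alert.get("status") == "OPEN" else "info",
        "details": alert,
    }


def sample_alert_email() -> bytes:
    """Build a made-up alert e-mail so the sketch can be run end to end."""
    msg = EmailMessage()
    msg["Subject"] = "Alert"
    msg["From"] = "alerts@example.com"
    msg["To"] = "dba-team@example.com"
    msg.set_content("See the attached alert details.")
    payload = json.dumps(
        {"event": "HOST_DOWN", "host": "dc1-node1.example:27018", "status": "OPEN"}
    ).encode("utf-8")
    msg.add_attachment(payload, maintype="application", subtype="json",
                       filename="alert.json")
    return msg.as_bytes()


incident = to_incident(extract_alert(sample_alert_email()))
print(incident["summary"])
```

In a real integration the incident record would be handed to the incident management system's own interface, which is outside the scope of this sketch.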
Workload Management and Automation
System has to reliably support business SLAs and be efficient to manage:
Workload management and resource sharing.
MetLife: Workload management and resource sharing is one of the bigger challenges. MongoDB 2.6
does not have an in-database mechanism for managing different workloads, which makes
resource sharing problematic.
- Potential options: cgroups in RHEL 6
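To make the cgroups option more concrete, the sketch below renders a group definition in the `cgconfig.conf` syntax used by libcgroup on RHEL 6. It only illustrates what such a definition could look like: the group names, CPU shares, and memory limits are hypothetical values, not a tested or recommended MetLife setting.

```python
def render_cgroup(name: str, cpu_shares: int, memory_limit: str) -> str:
    """Render one cgconfig.conf group with a CPU weight and a memory cap."""
    lines = [
        f"group {name} {{",
        "    cpu {",
        f"        cpu.shares = {cpu_shares};",
        "    }",
        "    memory {",
        f"        memory.limit_in_bytes = {memory_limit};",
        "    }",
        "}",
    ]
    return "\n".join(lines) + "\n"


# Two hypothetical mongod workloads sharing one server with different weights.
print(render_cgroup("mongod_online", 768, "300G"))
print(render_cgroup("mongod_batch", 256, "150G"))
```

Processes still have to be assigned to the groups (for example through the libcgroup rules configuration), which this sketch leaves out.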
MMS automation (installation, upgrades).
MetLife: Engaged with MongoDB for MMS automation beta-testing.
Next Steps in our Journey
• Automation (installation, upgrades, maintenance).
• MMS backup solution (on premises).
• Monitoring/alerting integration with an incident management system.
• Workload management / resource sharing solution.
• Introduction of arbiter to existing replica sets (3rd data center).
• Performance benchmarking toolset.