Operationalizing the Value of MongoDB: The MetLife Experience


It was a lot of fun bringing exciting emerging technology into the rigid enterprise infrastructure ecosystem. And then the real work began: how do you make the new technology operational? Learn from MetLife’s journey of operationalizing MongoDB to a level compliant with large-enterprise requirements for high availability, recoverability, security, monitoring, alerting, workload management, and automation.

Published in: Technology

  1. Operationalizing value of MongoDB (MetLife experience)
     Thrills and challenges of building MongoDB operations in a large enterprise
  2. A Journey
     When new technology meets enterprise standards:
     - advantages and restrictions of large enterprises
     - it is always a journey
     - decisions we have to live with
  3. Highly Successful Adoption of New Technology for a Fortune 50 Enterprise Organization
     • Unknown technology: proves to be capable
     • New platform: quickly matures
     • Untested for the enterprise: delivers success
     • Many new things to learn: become experts in time
  4. Disclaimer
     • The content in this presentation represents MetLife's choices and MongoDB Inc.’s recommendations for MetLife’s specific use case. By no means is this a “universal blueprint for success”, and it doesn’t necessarily represent MongoDB Inc.'s recommendations for all use cases.
     • In particular, because some fixed decisions predated the MongoDB implementation, MetLife's deployment may require some “manual intervention” (specifically in case of DR), whereas other, differently organized deployments might not.
  5. Introducing “The Wall”
  6. Basic System Architecture Decisions
     • Company data center vs. public cloud placement: control vs. ease of use
       MetLife: Compliance requirements dictate company data center(s) placement.
     • Server type and sizes: enterprise-class servers vs. pizza boxes
       MetLife: More cost-effective to run on enterprise-class servers (2x8-core CPU, 512 GB RAM).
     • Virtualization: VM vs. “bare metal”
       MetLife: Data nodes on physical servers; configuration servers and MongoS on VMs.
     • SAN vs. local storage: flexibility of SAN vs. performance of local storage
       MetLife: Local storage enclosures with 600 GB SAS drives.
     • Network: dedicated LAN for MongoDB replication
       MetLife: No dedicated LANs for the MongoDB installation.
  7. Business Requirements and System Topology
     Business requirements:
     - mission-critical application
     - loss of an entire data center for an indefinite time should not limit the application functionality in any way
     - significant data growth is expected, as well as a significant increase in the number of users
     These requirements drive the system topology:
     a. Geographic placement
        MetLife: Geographically dispersed cluster, spanning two data centers.
     b. Sharded cluster vs. replica set
        MetLife: Sharded cluster for elastic horizontal scalability.
     c. Number of nodes in the replica set
        MetLife: Minimum of 6 to ensure full operability in case of one data center loss.
     d. Writes and reads geography
        MetLife: Business-function-driven write-concern implementation; reads are mostly “secondary preferred”.
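
The "business-function-driven write concern" idea from slide 7 can be pictured as a lookup from business function to write concern, with reads defaulting to "secondary preferred". This is only a minimal sketch: the function names (`policy_update`, `activity_log`), the chosen write-concern levels, and the helper `operation_options` are hypothetical and are not taken from MetLife's deployment.

```python
# Hypothetical mapping from business function to a MongoDB write concern
# document. Both the function names and the levels are illustrative only.
WRITE_CONCERN_BY_FUNCTION = {
    "policy_update": {"w": "majority", "wtimeout": 5000},  # durable across members
    "activity_log": {"w": 1},                              # acknowledged by primary only
}

# Fallback for business functions that are not listed above (an assumption).
DEFAULT_WRITE_CONCERN = {"w": 1}

# Slide 7 states that reads are mostly "secondary preferred".
DEFAULT_READ_PREFERENCE = "secondaryPreferred"


def operation_options(business_function):
    """Return the write concern and read preference for one business function."""
    write_concern = WRITE_CONCERN_BY_FUNCTION.get(
        business_function, DEFAULT_WRITE_CONCERN
    )
    return {
        "writeConcern": dict(write_concern),  # copy so callers cannot mutate the table
        "readPreference": DEFAULT_READ_PREFERENCE,
    }
```

Keeping the table in one place means the durability decision is reviewed per business function rather than scattered through application code.
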
  8. System topology (diagram)
     • Data Center 1: Primary Prod, Local Prod Replica 1, Local Prod Hidden Replica for backups
     • Data Center 2: Remote Prod Replica 1, Remote Prod Replica 2, Remote Prod Hidden Replica for backups
     • Each of the six replica boxes is labelled “2 SHARDS comprise this”.
     • Also shown: Configuration Servers 1, 2 and 3, four MongoS servers, and two Backup Solution boxes.
  9. System Setup for Availability and DR
     System has to comply with MetLife’s enterprise standard for availability and DR (no single points of failure):
     a. Replica sets
        MetLife: 6-member replica sets (3 in each data center), including 2 hidden replicas for backup purposes; 5 voting members (the hidden replica in the DR data center has 0 votes); the 2 replicas in the primary data center have higher priority.
     b. Mongo configuration servers
        MetLife: 3 configuration servers (2 in the primary data center and 1 in the DR data center). Loss of an entire data center halts cluster balancing, but not the application functionality.
     c. MongoS
        MetLife: 4 MongoS servers (2 in each data center). All active.
     d. Application server connectivity
        MetLife: MongoDB drivers on application servers are configured to use all MongoS, but in a different order, for pseudo load balancing.
     e. DR exercise
        MetLife: A DR exercise is conducted yearly and includes all database and application infrastructure to ensure complete operability from the DR data center.
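
The replica set layout and the "different order" MongoS lists from slide 9 can be sketched as plain data. This is an illustration under stated assumptions, not MetLife's actual configuration: the host names, the priority values 2 and 1, the `dc` tag, and the helper names are all hypothetical. The member fields used (`hidden`, `priority`, `votes`, `tags`) are standard replica set configuration fields.

```python
def build_replica_set_config(set_name, primary_dc_hosts, dr_dc_hosts):
    """Build a replica set config document.

    Assumption: the last host listed for each data center is the hidden
    replica used for backups.
    """
    members = []
    for dc, hosts in (("primary", primary_dc_hosts), ("dr", dr_dc_hosts)):
        for position, host in enumerate(hosts):
            hidden = position == len(hosts) - 1
            members.append({
                "_id": len(members),
                "host": host,
                "tags": {"dc": dc},
                "hidden": hidden,
                # Hidden members must have priority 0; 2 and 1 are assumed values.
                "priority": 0 if hidden else (2 if dc == "primary" else 1),
                # Only the hidden replica in the DR data center has 0 votes.
                "votes": 0 if hidden and dc == "dr" else 1,
            })
    return {"_id": set_name, "members": members}


def can_elect_primary_without(config, lost_dc):
    """True if the surviving members hold a strict majority of all votes."""
    total = sum(m["votes"] for m in config["members"])
    surviving = sum(
        m["votes"] for m in config["members"] if m["tags"]["dc"] != lost_dc
    )
    return surviving > total // 2


def rotated_mongos_uri(mongos_hosts, app_server_index):
    """List every MongoS, starting at a different one per application server."""
    start = app_server_index % len(mongos_hosts)
    ordered = mongos_hosts[start:] + mongos_hosts[:start]
    return "mongodb://" + ",".join(ordered) + "/"
```

With 3 votes in the primary data center and 2 in DR, this arithmetic says a primary can still be elected if the DR site is lost, but not automatically if the primary site is lost. That would be consistent with the disclaimer's remark about "manual intervention" in case of DR, although the slides do not spell out the mechanism.
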
  10. System Setup for Recoverability
      System has to comply with MetLife’s enterprise standard for recoverability:
      Backup and recovery strategy. MetLife:
      - Daily backups in both data centers (alternating).
      - Backups of hidden replicas are performed with mongod brought down. The balancer is stopped.
      - Due to the database size, backup is performed at the file system level.
      - At the same time, a backup of the configuration server is performed using mongodump.
      Current challenges. MetLife:
      - No point-in-time recovery.
      - No easy way to restore one specific database.
      Using the MMS Backup solution. MetLife:
      - MMS Backup is capable of solving some of our current challenges.
      - Due to compliance reasons, we cannot use the MMS cloud backup solution in AWS.
      - Currently looking into the option of running the MMS Backup solution on premises.
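
The backup routine on slide 10 can be written down as a small runbook generator. This is a sketch only: the site labels, the day-parity rule for alternating, and the restart steps at the end are assumptions (the slide lists what is stopped and backed up, not how service is resumed).

```python
from datetime import date

# Hypothetical labels for the two data centers.
DATA_CENTERS = ("dc1", "dc2")


def backup_site_for(day):
    """Alternate the backup site daily, using the parity of the day's ordinal."""
    return DATA_CENTERS[day.toordinal() % 2]


def backup_steps(site):
    """Return the ordered steps for one nightly backup at the given site."""
    return [
        "stop the balancer",
        f"shut down mongod on the hidden replicas in {site}",
        f"copy the data files of the hidden replicas in {site} at the file system level",
        "back up a configuration server with mongodump",
        # The two steps below are assumed; the slide does not describe them.
        f"restart mongod on the hidden replicas in {site}",
        "start the balancer",
    ]


def runbook(day):
    """Combine site selection and steps into one plan for the given day."""
    site = backup_site_for(day)
    return {"date": day.isoformat(), "site": site, "steps": backup_steps(site)}
```

Because the hidden replicas are offline during the copy, the file-system snapshot is consistent per member, which fits the slide's note that neither point-in-time recovery nor single-database restore comes easily from this approach.
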
  11. Security
      System has to comply with MetLife’s enterprise standard for data security:
      Authentication and authorization. MetLife:
      - The original build on MongoDB 2.2 had very limited options for database authentication, with only read-only or read/write permissions at the database level.
      - Biggest concerns: authentication (no password policy enforcement) and authorization (excessive application permissions).
      - MetLife’s MongoDB 2.6 goals: authentication via Active Directory; authorization via custom-built roles with the least set of permissions required by the application.
      LDAP integration. MetLife:
      - Integration with Active Directory (AD) using Linux PAM.
      - Third-party product for secure server/AD communications.
      - Currently mixed-mode (both AD and in-database) authentication.
      Data-at-rest encryption. MetLife:
      - Data-at-rest encryption is implemented using a third-party product (Linux file system / device encryption).
      Audit. MetLife:
      - Tactical: MongoDB 2.6 audit capability can do the job.
      - Strategic: Database activity audit is performed by a third-party product.
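
A least-privilege custom role of the kind slide 11 sets as a 2.6 goal can be expressed as a `createRole` command document. The sketch below only builds that document; the role, database, and collection names are hypothetical, and the helper `create_role_command` is not part of any driver.

```python
def create_role_command(role_name, db_name, collection_actions):
    """Build a createRole command granting only the listed actions.

    collection_actions maps a collection name to the privilege actions
    (for example "find", "insert", "update") the application needs on it.
    """
    privileges = [
        {
            "resource": {"db": db_name, "collection": collection},
            "actions": sorted(actions),
        }
        for collection, actions in sorted(collection_actions.items())
    ]
    # An empty "roles" list means nothing is inherited from built-in roles.
    return {"createRole": role_name, "privileges": privileges, "roles": []}


# Hypothetical application role: read and write cases, read-only lookups.
example_command = create_role_command(
    "claimsAppRole",
    "claims",
    {"cases": ["insert", "find", "update"], "lookups": ["find"]},
)
```

The resulting document would be passed to the database's command interface by an administrator; listing actions per collection makes any excess permission visible in review.
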
  12. Monitoring and Alerting
      System has to comply with MetLife’s enterprise standard for monitoring and alerting:
      Hardware monitoring. MetLife:
      - No munin-node monitoring. Using the standard enterprise Linux server monitoring toolset owned by MetLife.
      MongoDB monitoring with MMS. MetLife:
      - Currently using MMS in the cloud for monitoring and alerting. Alerts are sent via SMS and e-mail to responsible individuals in operations as well as to monitored group mailboxes.
      BMC MongoDB Patrol KM as an alternative monitoring solution. MetLife:
      - Third-party Knowledge Modules (KMs) are the standard monitoring/alerting tools for MetLife’s enterprise databases. Currently engaged in MongoDB KM beta testing.
      Integrating monitoring/alerting with the enterprise incident management system. MetLife:
      - Currently no integration. Two approaches in parallel:
        - An in-house process that parses the JSON attachment from an MMS alert e-mail and creates an incident ticket.
        - The third-party KM, which is natively integrated with the enterprise incident management system.
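
The in-house approach on slide 12 (parse the JSON attachment of an alert e-mail, then create an incident ticket) might look roughly like this with Python's standard library. The alert field names (`host`, `event`, `status`), the ticket fields, and the sample message are all hypothetical; the slides do not describe the real alert schema or the incident system's API.

```python
import json
from email import message_from_string
from email.message import EmailMessage


def alert_payloads(raw_email):
    """Return every JSON attachment in the e-mail as a parsed object."""
    message = message_from_string(raw_email)
    payloads = []
    for part in message.walk():
        if part.get_content_type() == "application/json":
            raw = part.get_payload(decode=True)  # undoes the transfer encoding
            payloads.append(json.loads(raw.decode("utf-8")))
    return payloads


def to_incident(alert):
    """Map one alert to a ticket dict (the field names are assumptions)."""
    host = alert.get("host", "unknown host")
    event = alert.get("event", "unknown event")
    return {
        "summary": f"{host}: {event}",
        "severity": "high" if alert.get("status") == "OPEN" else "info",
    }


def sample_alert_email():
    """Build a made-up alert e-mail so the parser can be exercised offline."""
    message = EmailMessage()
    message["Subject"] = "Alert"
    message.set_content("See attached alert details.")
    body = {"host": "dc1-data1.example.com", "event": "HOST_DOWN", "status": "OPEN"}
    message.add_attachment(
        json.dumps(body).encode("utf-8"),
        maintype="application",
        subtype="json",
        filename="alert.json",
    )
    return message.as_string()
```

Calling `[to_incident(a) for a in alert_payloads(sample_alert_email())]` yields one ticket dict; posting it to the incident management system is left out because that interface is not described in the slides.
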
  13. Workload Management and Automation
      System has to reliably support business SLAs and be efficient to manage:
      Workload management and resource sharing. MetLife:
      - Workload management and resource sharing is one of the bigger challenges. MongoDB 2.6 does not have an in-database mechanism for managing different workloads, which makes resource sharing problematic.
      - Potential option: cgroups in RHEL 6.
      MMS automation (installation, upgrades). MetLife:
      - Engaged with MongoDB for MMS automation beta testing.
  14. Next Steps in our Journey
      • Automation (installation, upgrades, maintenance).
      • MMS backup solution (on premises).
      • Monitoring/alerting integration with an incident management system.
      • Workload management / resource sharing solution.
      • Introduction of an arbiter to existing replica sets (3rd data center).
      • Performance benchmarking toolset.