Cloud system configurations and their dependencies can quickly grow into the thousands of virtual machine, network and storage components. Once software is included, the number of components can easily rise into six figures.
Frequent releases using continuous integration and deployment tools make a repository of these components and their relationships absolutely critical to cloud system integrity and quality of service, no matter what cloud management tools you use.
System configurations are more naturally represented in a graph database than in the relational schemas used by traditional IT management products.
Our talk will explore how we use Neo4j to create a live, active, self-updating repository service, containing nearly all virtual hardware, network and software components and their dependencies, enabling continuous deployment in any cloud environment at scale.
Neo4j for Cloud Management at Scale
1. David Brian Ward
CEO, Telegraph Hill Software
www.telegraphhillsoftware.com
535 Mission Street
San Francisco CA 94105
Neo4j for Cloud Management at Scale
3. Big ITIL Framework, Asset Heavy IT
A shared CMDB as the single system of record is a key ITIL best practice for good reason.
4. Pre-DevOps, Pre-Cloud Ops Engineering
Would you build an IT edifice like a cathedral today?
Origins of DevOps Culture:
• Open Source
• Cloud Computing
• Need for Dev Speed
5. Early Cloud, Early DevOps Days
Shopping at the Open Source Bazaar – a new high-quality DevOps tool every week?
6. On Sale: DevOps for 90% Off!
A Graph Database Can Make It Possible!
7. Evolving our DevOps Approach
Cloud DevOps should combine an integration hub architecture with deployment automation, enabling the evolution and migration of infra, apps and tools.
8. MacGyver and Neo4j for Business Critical DevOps
9. • API integration with infra, tools and apps
• A batch scheduler for near-real-time config
• Data aggregation across services
• Monitoring and alerting integration
• Script execution via the JSR-223 scripting API
• An extensible UI and dashboards
• Infrastructure mapping
• Tools for building engineering, operations, infosec, QA and finance use cases
• High availability and multi-site
MacGyver Service Basics
10. • Perfect for capturing and modeling interdependencies
• Cypher’s ad hoc query capability can’t be beat
• Easy to extend; build more relationships and layers incrementally
• Great join/traversal capability
• Flexible and scalable vs rigid; the dataset can easily evolve and grow in complexity and structure
• Easy to consume
• Natural for infrastructure mapping and enterprise architecture
• JSON native, FTW
Neo4J is Perfect for MacGyver
11. “Do we have a single point of failure among any of our services?”
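This question is answerable as a short Cypher query. A sketch only: the `Service` and `ServerInstance` labels, the `RUNS_ON` relationship and the property names are illustrative, not MacGyver’s actual schema.

```cypher
// Find services backed by exactly one instance (a single point of failure).
// Labels, relationship and properties are assumed for illustration.
MATCH (svc:Service)-[:RUNS_ON]->(inst:ServerInstance)
WITH svc, count(inst) AS instanceCount
WHERE instanceCount = 1
RETURN svc.name AS singlePointOfFailure
```

The same traversal extends naturally to shared load balancers, subnets or storage volumes by matching additional dependency edges.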
12. Continuous integration
Test automation
Release packaging
High availability/failover
Server, network and environment provisioning
Monitoring
App performance monitoring
Cloud cost management
Audit
Change management
Continuous deployment
Service discovery
Micro-service cloud migration
Self-healing systems
System and security maintenance
Operational cost and capacity management
Blue/green deployment
API management and security
And more every day
MacGyver Use Cases at Lending Club
13. Network:
• A10
• ASA (Adaptive Security Appliance)
• DNS
• F5
Software Code and Deployment:
• Artifactory
• GitHub
• Jenkins
• Spring Framework
Authentication and Access:
• AACS (advanced access content system)
• LDAP
• Microsoft Active Directory & SSO
Cloud Service Providers:
• AWS
• CloudStack
• vSphere
User Interface & Dashboards:
• Leftronic
• Grafana (Graphite)
Notification and Collaboration:
• HipChat
• PagerDuty
• SMTP
Operational Documentation:
• Jira/Confluence
Database Platforms:
• Mongo
• JDBC (Oracle, MySQL)
Performance Monitoring:
• Catchpoint
• L7
• New Relic
System Config and Management:
• Nimble
• Pure Storage
• Puppet
• SaltStack
Monitoring and Logs:
• SignalFx
• Splunk
MacGyver Tools and Services:
• ‘Health check’ / bootstrap
• Micro-service registry
MacGyver Integrations at Lending Club
14. DevOps Evolution with MacGyver
Business Critical DevOps must anticipate the evolution of infra, apps and tools. E.g.,
• Physical -> vSphere -> AWS
• Monolithic -> Micro Services
• Jenkins -> AWS CodeDeploy
• Bonus: Vendor Independence
15. MacGyver Micro Services – Service Discovery
Problem:
Keeping track of many rapidly-changing services
Solution:
All app servers phone home to MacGyver and are stored in Neo4j as ‘App Instance’ nodes. Deployment and release automation assure a real-time database of deployed services. New services get auto-discovered by MacGyver.
• Low maintenance
• Easy scalability
• Low latency ad hoc query capability
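The phone-home flow above can be sketched as a single Cypher upsert; the `AppInstance` label, property names and parameters are illustrative assumptions, not the production schema.

```cypher
// Upsert run each time an app server checks in.
MERGE (a:AppInstance {host: $host, appId: $appId})
SET a.revision = $revision,
    a.lastContact = timestamp()
RETURN a
```

Because MERGE matches or creates, new services are auto-discovered on their first check-in with no separate registration step.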
16. MacGyver Micro Services – Deployment
Problem:
Highly manual and tedious releases
Difficult to answer questions like:
– What pool should I deploy to?
– Is the most recent revision ‘live’ right now?
– Are live pool revisions in sync in different environments?
Solution:
Utilize app check-ins and Neo4j to expose info about live and dark pools, enabling us to automate deployments and build on our existing monitoring automation.
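As a sketch of the kind of query this enables (the `Pool` and `VirtualServer` labels, the `CONTAINS` relationship and all properties are assumptions):

```cypher
// Which revisions are live for an app, per environment?
MATCH (p:Pool {state: 'live'})-[:CONTAINS]->(vs:VirtualServer {appId: $appId})
RETURN p.environment AS env, collect(DISTINCT vs.revision) AS liveRevisions
```

A `liveRevisions` list with more than one entry, or lists that differ across environments, answers the “in sync?” question directly.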
18. Deployment and Release Automation
• Blue-green deployment (diagram): a Service Group of eight servers, split into a “Live” pool and a “Dark” pool.
19. Deployment and Release Automation
• Blue-green deployment (diagram): the Service Group of eight servers.
20. Deployment and Release Automation
• Blue-green deployment (diagram): the Service Group with a “Draining” pool and a “Live” pool.
21. Deployment and Release Automation
• Blue-green deployment (diagram): the Service Group at pool cut-over.
28. What’s Next? Reactive DevOps
As with the F-35 cockpit, the sky is no longer the limit.
29. CLOSING THOUGHTS
• MacGyver provides an integration architecture that enables scalability, enrichment, evolution and migration
• MacGyver and Neo4j enable you to evolve your infrastructure using best-of-breed components, all the while running your business critical systems with integrity
• MacGyver dramatically reduces the amount of software development required for even the most sophisticated DevOps use cases
• Perfect for managing your hybrid infrastructure while staying ahead of Dev
32. QUESTIONS? FOR MORE INFO…
https://github.com/if6was9/macgyver
https://github.com/if6was9/neorx
Ashley Sun – asun@lendingclub.com, @ashleycsun
Rob Schoening - rschoening@lendingclub.com, @rschoening
David Ward – david.ward@thpii.com
Sarah Lewis – sarah.lewis@thpii.com, info@thpii.com
Ask audience questions:
Technical DevOps staff?
Program staff?
Local/remote?
We happen to believe if you know where you’ve been, you have a better chance of knowing where you are going. So a bit of personal database history.
Non-SQL and application-specific databases have been around from the beginning of the computer age. But they rarely won out against relational databases, whose designs built on SQL and set-theory provided a general repository for multiple applications.
How many here have heard of Essbase?
My experience with NoSQL began twenty years ago, when I ran product engineering for Arbor/Hyperion/Oracle. We created a NoSQL OLAP database called Essbase, the first commercially successful OLAP database, used for near-real-time financial data analysis in multi-dimensional data cubes. Essbase is still around after 20 years and is now a $1B+ line of business for Oracle.
But given the cost of computing, it usually made sense to run everything through a relational database. And other database architectures evolved more slowly as the relational platforms grew their functionality (e.g., object/relational mapping, etc.). Only the power of personal computers to perform real-time drill through made Essbase a viable business choice compared to an RDBMS solution.
By 2005, however, we entered a Renaissance of database innovation, enabled by two huge trends: Cheap virtual computing power/storage (cloud); and the Open Source software movement. The cost of using a NoSQL/application specific database plummeted, and the application solution quality in most cases offsets the higher life-cycle costs of using these application specific databases.
Solution designers can now choose between dozens of repositories based on open-source projects for any mission-critical application need: relational; map/reduce; key/value; in-memory; object; document retrieval; network; etc. And Graph, of course!
In 2000-2009, we were implementing ITIL management use cases for major corporations. These were mostly based on the then popular ITIL best-practices approach to managing IT as if it were a services company within every company. As shown above.
Working with large corporations, and following the era’s best practices, we powered through and built solutions based on using a best-of-breed management suite and a relational CMDB as the “source of truth” about all IT objects deployed.
Although ITIL implementations can vary, ideally, multiple “discovery” systems from multiple legacy IT management suites feed data daily into an enterprise CMDB. IT services performed via the suite all operate on data from the CMDB which provides a common view of all the objects under IT control. The CMDB ideally acts as an integration hub and single source of truth for IT apps. Auditable IT controls and mechanisms for cost management were achieved.
But at great cost:
In 2000-2009, even though open source Linux was spreading rapidly, almost all IT management used what Eric Raymond called “The Cathedral” approach to building IT (one big edifice under the control of a master architect) as opposed to the open source “Bazaar” (lots of independents working on smaller components). (This is a simplification; read the book, etc.)
ITIL was an asset-heavy solution for asset-heavy infrastructure that didn’t change fast. IT infrastructure was “asset heavy”: Big price tags and too-long depreciation schedules deterred updates and improvements, even as technology evolved more rapidly. IT infrastructure and applications were undiscoverable: Just figuring out what was deployed over the years, and their dependences on other infra and apps often meant pulling plugs to see what broke. IT infrastructure and applications were “snowflakes”: Each beautiful in its own way. One found few common solutions to common IT management requirements, even obvious ones such as monitoring and logging.
CMDB data is at best a day out of date. Dynamic infrastructure, whether VMware or cloud, is rarely handled. Data quality is an ongoing headache: fitting evolving config data into a fixed relational schema is manpower-intensive and error-prone. Most CMDBs are built on relational schemas, which are fine for entirely “standard” components but handle neither new component types nor common component dependencies well (database queries are complex, support costs high, etc.).
Tribal differences (mainframe vs open system servers, developers vs operations) usually limit the scope of the solution to production environments. IT could rarely move fast enough for Development.
In the end, most IT functions never used CMDB as the ‘source of truth’ -- too many config items still leaked into production via multiple deployment routes. While the CMDB became central to IT service requests and change control, unless it is near real-time, it can never be an integration hub for IT tools.
That being said, ITIL can still make sense for slower moving and highly regulated and risk averse organizations. Fact is, not everybody needs to move fast. And the ITIL suite of recommended apps is truly based on best-practices, albeit from a prior age. And cloud-hosted SaaS alternatives now exist (ServiceNow, e.g.), which addresses some of the limitations.
In 2010, we began experimenting with AWS, recognizing that Amazon was introducing a serious disruption that eliminated much of the cost and friction of deploying and operating systems. The proliferation of high-value open source projects was now undeniable. We realized Raymond’s “Bazaar” was going to win over the “Cathedral”.
Impressed with how the AWS service and open-source components could solve so many of the issues we had encountered with asset-heavy IT, we began developing our DevOps framework, realizing the potential for addressing the limitations of asset-heavy IT management.
We set out to build an asset-light DevOps framework that would create a reliable, real-time “source of truth” integration hub for any tool hosted in a cloud.
Since we were emphasizing flexibility, agility and evolvability, we chose a Hawaiian name, “Ho’olilo”, a word meaning “change”. (The usual response to this name has been “Huh?” But we liked the sound of Hawaiian words – even Hawaiian curse words make people feel happy.)
Virtual system configurations and their dependencies can quickly grow into the thousands of virtual machine, network and storage components. Once software and data repositories are included, the number of components can quickly rise into six figures.
In such environments, IT no longer “operates” assets, but manages virtualized infrastructure via software. IT necessarily relies on custom software to do its jobs, integrating a rapidly evolving set of tools.
And with the rapid release cycles of web apps, there is no longer any time for Development to hand off to Operations. Dev must merge with Ops and address operational concerns as another aspect of web app code. More fundamentally, Ops has to stay ahead of Dev. We believe that frequent releases using continuous integration and deployment tools makes a repository of virtual components and dependency relationships absolutely critical to cloud system integrity and quality of service no matter what tools you use.
The rest of our talk will explore how we helped Lending Club uses Neo4J to create a live, active, self-updating repository service, containing nearly all its virtual hardware, network and software components and their dependencies, enabling continuous deployment and operation integrity in any cloud environment, architected for evolution.
(All the ideas and development I’m now going to discuss are from our brilliant colleague Rob Schoening, who heads DevOps at Lending Club.)
An alternative we rejected was the PaaS approach, where we would commit our future to a PaaS provider. This seemed risky: who knew whether our PaaS provider could be relied on over time in such a dynamic ecosystem? Most companies don’t start greenfield, with the freedom to choose a PaaS provider (e.g., Engine Yard, Heroku, etc.). Most start with a motley hybrid of physical and virtual, and need to migrate from there. Making a big technology gamble on “we’re going to run the entire company on xxx” is risky; even with the best vendors, it rarely works out in the long run. The cloud ecosystem is changing too rapidly, and complexity vectors in from all directions.
Another alternative we rejected was the “we’ll do everything using chef/puppet or ansible”. Not to denigrate these highly successful products, or their value for particular functions, such as base platform/system config, but their one-tool/boil the ocean approach is ultimately limiting, limitations we saw at some of our clients, and the resulting tool proliferation.
Another alternative was to use a modern cloud-hosted ITIL suite (ServiceNow). It is very expensive (too expensive for many firms), you don’t need all its services from the get-go, and it is not necessarily extensible.
So the approach that we took was simply to start knitting things together and make it *appear* like a PaaS. It didn’t have to be perfect…just get the job done.
One reaction is “Couldn’t you do the same thing with a bunch of scripts?” The answer is: yes. But then you would wake up one day and find that you have… a bunch of scripts. We didn’t want to create a magnet for technical debt either. Where it gets challenging is when your scripts need metadata that is spread around. If you collect the metadata in one place, then instead of every operational initiative starting with an API integration, you can simply write a query returning the data you already have. (CMDB lesson learned.)
Software is Software – Dev and Ops no longer separate domains, but joined by the essential nature of any software development endeavor.
Fitting the business model – what’s right for GE not right for 3-person startup not right for a fast-growth fintech, etc.
Finally, our goal was to design for potential failure of any and all components -- not just site backups/DR or BCP (essential for web/micro services Ops).
Borrowing a name from an old TV series whose hero could create complex solutions from nearly free household items, the framework is named “MacGyver”. (Turns out Rob is also better at marketing than myself too. We’ve dropped our poor attempts at Hawaiian.)
Initial experiments with a “cmdb” used MySQL as the repo, but we quickly recalled how difficult schema management would be. Next we experimented with Mongo: easier to query, but it still imposed heavy schema administration duties. Then we became aware of Neo4j, a network database with strong Java affinity. The more we tested, the clearer it became that a graph is the most natural representation not only of infrastructure components, but of all virtual components, networks, applications and data repositories, and their shared dependencies, which are so clumsy to represent relationally.
And the more we tested with Neo4J, the more we realized how natural our virtual component networks fit, how simplified it made enriching the repository over time, and how easy it was for all our users and developers to perform queries using its SQL-like language.
Worth repeating: The trends we intended to exploit via MacGyver were: Use open source components, tools and repositories, integrated using web service technologies.
At Lending Club, Continuous integration and deployment were the first use cases, because Dev was moving fast, and Ops needed to get in front. Running scripts out of Jenkins is far and away the most effective way to get some effective DevOps going. In fact, it’s DevOps job one.
But there are things you can’t easily accomplish with Jenkins:
--Integration with virtual infra to collect metadata
--Polling and event handling for monitoring and alerts
--Orchestration via JSR using aggregated data
MacGyver services are constantly polling or being called by dozens of tools, and the framework was intended to have local enhancements matching any cloud service or suite of tools. A flexible API and Plug-in development kit was essential.
MacGyver includes a JSR-223 script execution engine and accepts any scripting language compatible with JSR-223.
An easy-to-use UI based on the Vaadin project.
Implementation is fully redundant, fault tolerant and multi-site.
Note the ease of dependency representation, and the ease of query, once you get the hang of it.
1. Get the entities in place with continuous scanning
2. Enrich entities with attributes
3. Use the entities and attributes to derive relationships and formalize them in the graph data model
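Step 3 might look like the following Cypher sketch, joining two node sets on a shared attribute and formalizing the match as a relationship (all names here are illustrative):

```cypher
// Derive IN_SUBNET edges from a matching subnetId attribute.
MATCH (a:AppInstance), (s:Subnet)
WHERE a.subnetId = s.subnetId
MERGE (a)-[:IN_SUBNET]->(s)
```

An index on the join property keeps this from becoming a full cartesian scan as the graph grows.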
Instead of ‘all ITIL’, use cases evolve with business necessity
Over two years, DevOps use cases accumulated dramatically, as this slide shows.
“Are servers in the correct security zones?”
“What is the correct AWS VPC placement for this application?”
Deliver abstraction across multiple Load Balancer implementations
etc.
Most recent has come microservice enablement, including service discovery.
LC services also now implement a “health check” service transmitting messages from all servers, allowing the implementation of service self-healing use cases.
Order by network, compute, sw, etc.
Over the past two years, MacGyver integrations have also accumulated dramatically, as this slide shows. Adding integrations has proved simple and extensible, allowing the company to upgrade and migrate to more advanced services and tools as needed.
Pulling the data and metadata from these tools and services into Neo4J along with their dependencies is the enrichment necessary for the most advanced DevOps use cases.
Perhaps the most powerful of MacGyver’s proven use cases is the ability to evolve applications and infrastructure over time without jeopardizing operational integrity.
Most companies start DevOps with legacy data centers which then need to evolve into hybrids before reaching cloud-native. MacGyver enables this evolution.
LC can and has upgraded and swapped out tools over time, and never locked into the capabilities of a single PaaS, vendor or all-in tool. LC can not only choose best of breed, but can avoid IT vendor lock-in, with its resulting financial leverage.
LC is now moving to cloud native infrastructure and micro-services while never slowing down its SaaS service application innovation or putting its business at risk. Migrations have never required “burned bridges” or throw-switch migrations with no back-out plan.
Grown from 5 to 139 in the past year alone.
See following slides.
Every service/app has 2 pools, one of which is live or dark at any time
Concept of “Pools”
Concept of “Pools”
-let old connections ‘drain’ out
- When # connections reaches close to 0, we cut the pool over
Concept of “Pools”: live pool, dark pool, drain pool, cut over pool
This concept of pools wasn’t articulated in the load balancer, nor did the app servers have any notion of what pool they belong to
By taking advantage of Neo4j’s ability to map relationships we were able to create ‘Pool’ nodes that ultimately allowed us to automate deployments
GET FROM LC
Auto Scaling Groups are attached to Elastic Load Balancers
ELBs distribute traffic to EC2 Instances
ASGs contain Ec2Instances
ASGs and EC2Instances relate back to a Subnet, which is contained in a VPC, which is owned by an Account
VPC has multiple subnets that contain ASGs, instances, ELBs
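That topology can be captured with a handful of MERGE statements; the labels and relationship names below are illustrative, not the exact model used:

```cypher
// Sketch of the AWS topology described above.
MERGE (acct:Account {id: $accountId})
MERGE (vpc:Vpc {id: $vpcId})
MERGE (vpc)-[:OWNED_BY]->(acct)
MERGE (sn:Subnet {id: $subnetId})
MERGE (sn)-[:IN_VPC]->(vpc)
MERGE (asg:AutoScalingGroup {name: $asgName})
MERGE (asg)-[:IN_SUBNET]->(sn)
MERGE (elb:Elb {name: $elbName})
MERGE (asg)-[:ATTACHED_TO]->(elb)
MERGE (inst:Ec2Instance {id: $instanceId})
MERGE (inst)-[:IN_SUBNET]->(sn)
MERGE (asg)-[:CONTAINS]->(inst)
MERGE (elb)-[:ROUTES_TO]->(inst)
```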
CodeDeploy Details page for an app cluster created from Neo4j info
LC’s biggest “new” chunk of functionality is the release automation built on top of AWS Code Deploy (https://aws.amazon.com/codedeploy/).
Code Deploy does all the heavy lifting. MacGyver does the orchestration and presents it as a PaaS to our engineering team.
And event aggregation now joins events from GitHub, Jenkins, EC2, CodeDeploy, HipChat, and NewRelic.
Here's a simple example: we index GitHub commit data to learn who's who, and index class names to learn who the experts are for particular code functions. Now when we see performance problems, we already know the code (often down to the line number) and can send every developer a performance trending report.
There are many Info Sec use cases that will benefit from such enrichment and aggregation now in their DevOps pipeline.
Also it's starting to be used by one engineering team to do modeling. Architects are now using it to do a more enterprise oriented top down model of services through the whole SDLC. The EDH team is starting to use it to model Hadoop job relationships. We use it in new ways all the time.
Historical audit and time domain analyses, metrics over time?
AI and machine learning eventually? The necessary data will exist.
Hi, I’m David Ward, founder and CEO of Telegraph Hill Software, a software development consultancy here in SF providing on-site development teams. SaaS stacks, DevOps, machine learning, analytics and mobile are a few of the things we build for our clients.
Today I’ll be sharing an innovative use of Neo4J we developed with our premier client, Lending Club, over the past few years.
Call out to load balancer for info on server state
Combine with app Instance info
Virtual Server node: app ID, revision, # of connections, state (active vs inactive).
By grouping these Virtual Server nodes into pools based on their ‘state’, etc., we created pools in Neo4j. This is where it gets interesting.
App Instances report to MacGyver and get saved to Neo4j.
Macgyver queries the load balancer for info and saves that to Neo4j.
By arranging the data in a way that’s useful to us in Neo4j, we formulate Pools and VirtualService nodes.
We gain a lot of visibility into app state that we never had before
Able to constantly monitor app state, developers who ask us “is my app live?” can self-service thru MacGyver.
Again, gaining visibility to data that was not exposed before and arranging it in a way so that it becomes useful to us.
- This all happened very naturally: we started with app instances, then extended the relationships and created new nodes and mappings until we got to Nimbles
- Very easy to build new layers and relationships on top of already-existing ones
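The pool derivation described in these notes might be sketched as (labels and properties assumed):

```cypher
// Group VirtualServer nodes into Pool nodes by state.
MATCH (vs:VirtualServer {appId: $appId})
MERGE (p:Pool {appId: $appId, state: vs.state})
MERGE (p)-[:CONTAINS]->(vs)
```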
“If storage volume #3 goes down, what services will be impacted?”
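This closing question reduces to a variable-length dependency traversal; the labels, the `DEPENDS_ON` edge and the depth bound are illustrative assumptions:

```cypher
// Everything within five dependency hops of the failed volume.
MATCH (vol:StorageVolume {name: 'volume-3'})<-[:DEPENDS_ON*1..5]-(svc:Service)
RETURN DISTINCT svc.name AS impactedService
```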