In my previous years’ talks at DevOps Enterprise Summit, I spoke about starting and scaling DevOps at Capital One, and about the importance of Open Source, Open Technology, and Innovation in DevOps.
This year, I will present Capital One’s journey of maturing in DevOps and Continuous Delivery. My presentation will cover our current areas of focus: Delivery Pipeline, Flow and Measurements. I will also share some of the problems we faced and what we did to solve them.
14. @TopoPal
Deliver High Quality Working Software Faster
• Across LOBs, Shared Services and 3rd Parties
• Tested end-to-end
• All dependencies are satisfied
• How fast? ASAP?
Pipeline must have 16 gates
Source code version control
Optimum branching strategy
Static analysis
> 80% Code coverage
Vulnerability scan
Open source scan
Artifact version control
Auto provision
Immutable servers
Integration testing
Performance testing
Build, Deploy, Testing automated for every commit
Automated Change Order
Zero downtime release
Feature Toggle
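To make the gate idea concrete, here is a minimal sketch of a gated pipeline runner. It is illustrative only, not Capital One’s actual tooling; the gate names, thresholds, and the `metrics` dictionary are hypothetical stand-ins for a few of the gates listed above.

```python
# Minimal sketch of a gated pipeline runner (illustrative, not the real tool).
# Each gate inspects build metrics and returns True/False; a commit is
# promotable only if every gate passes.

def static_analysis_gate(metrics):
    """Fail the build on any blocker-level static analysis finding."""
    return metrics["blocker_issues"] == 0

def coverage_gate(metrics):
    """Enforce the > 80% code coverage gate from the slide."""
    return metrics["coverage"] > 0.80

def vulnerability_gate(metrics):
    """Fail on any high-severity finding from the vulnerability scan."""
    return metrics["high_vulns"] == 0

GATES = [static_analysis_gate, coverage_gate, vulnerability_gate]

def run_pipeline(metrics):
    """Run every gate in order; return (passed, first failing gate or None)."""
    for gate in GATES:
        if not gate(metrics):
            return False, gate.__name__
    return True, None

passed, failed_gate = run_pipeline(
    {"coverage": 0.85, "blocker_issues": 0, "high_vulns": 1}
)
```

The point of the structure is that adding a sixteenth gate is just adding a function to the list; the runner itself never changes.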
Risks are real
• Intentional damage
• Unintentional damage
• Untested code in production
But….
There is a better way
Hypothesis
• DevOpsSec & CI/CD provide better controls
• A model with ~30 practices can satisfy audit and compliance
• If everything is source code, no one needs access to production
• For emergency, “Break Glass”
Result
Production Release: Once / sprint → 1+ / day
# of Applications with Release Automation: 20+
Max. # of Releases in 1 day for 1 Application: 34
With “Segregation of Duties”
Good Morning everyone.
My name is Tapabrata Pal, I go by Topo.
My twitter handle is @TopoPal.
And I am an engineer at Capital One.
This is my third year at this conference. And let me tell you that it’s an honor to be back on this stage for the third time and speak to you all. It’s like a homecoming for me.
Most of you know Capital One is a credit card company. We are one of the largest in the US with over 70 million accounts. Many know that we are also one of the nation’s largest banks. Fewer, however, realize that we are a founder-led, 20-year-old technology company. Our “youngest” competitor is 108 years old. In that sense, we are a startup in this industry.
A typical bank organization will largely procure third-party software for its internal and customer-facing operations. Over the past five years we have transformed ourselves into an organization that truly builds its own software and develops its own solutions.
That is a different DNA.
Today, we are hyper-focused on how we can get more productive, move quicker, get things to market faster, and constantly iterate.
• We build on the public cloud, leveraging continuous integration and delivery methods to deploy our products into production.
• We build using microservices architecture and RESTful APIs, using Open Source, and practicing DevOpsSec and Continuous Delivery.
I joined Capital One six years ago.
I started as an Enterprise Architect, and I have been involved with Capital One’s DevOps journey from the beginning. Many times I led key efforts around DevOpsSec adoption, scaling DevOps across the enterprise, and the Open Source governance that formalizes open source adoption. I led the creation of our Enterprise DevOpsSec Strategy and helped stand up our Shared Delivery Tools platform. That led to my move from Enterprise Architecture to the Shared Technology organization.
Currently I am the product manager of our Shared Continuous Delivery Tools Platform that offers typical DevOps tools for the enterprise as services.
In the meantime, a few of us at Capital One developed the Hygieia DevOps Dashboard and open sourced it. It’s the first open source product from Capital One. I am the community manager and one of the core contributors. I can’t tell you enough about my excitement around Hygieia. Since its launch in July 2015, it has become very popular. Google “DevOps dashboard” and the first non-ad hit you get is the Hygieia GitHub repo! Many large enterprises are either using it or testing it out. It won “Open Source Rookie of 2015”. If any of you are using it or thinking of it and need some help or have new feature ideas, send me a quick note – @TopoPal – or open a GitHub issue.
Overall, I am loving it. Let me tell you the best part of my job: I get to learn new things every day. This is especially true when you are in the middle of an awesome transformation and have been a part of it.
Our Agile and DevOps transformation over the last 5 years has been quite successful. At a high level, we have transformed ourselves from waterfall to agile, across the board; from manual build, code promotion, testing, and release to full automation. The exceptions are off-the-shelf products and prehistoric legacy products. Every new product created over the past 4 years follows agile and DevOpsSec principles. We have moved from vertical silos of Dev, Ops, QA, and Support to autonomous product-based teams. We are not fully there yet, but we are getting there.
The biggest of these three transformations is the fact that we went from a mostly outsourced company to a mostly in-sourced one. We are continuously hiring skilled engineers.
From vertical silos such as Dev Team, Ops Team, QA Team…. we now have Product Teams. Autonomous teams with everyone needed to develop a product.
As I said this is my third year on this stage.
In 2014, I shared with you our DevOpsSec Strategy, Initial successes of our automation efforts and also shared our success story around scaling DevOpsSec in an Enterprise Scale.
In 2015, last year, I shared our success stories around an Engineering Transformation – not just DevOps. An awesome transformation that I will always be proud of being involved in. I shared how Open Source, Open Technology, Innovation and Sharing changed Capital One culture drastically.
This time, I am going to share our learnings around DevOps maturity through measurement and continuous improvements.
Let me start with where we left off last time – our typical success story – before and after. This is the before-and-after for one of our biggest product lines. It has more than 250 engineers on the product team, which includes dev, qa, ops – everyone. A single GitHub repository with application code, test code, and infrastructure and configuration code. The application runs on public cloud infrastructure.
As you can imagine the result of automation and shift-left are quite apparent in these numbers. Builds every 15 minutes, automated testing, automated deployment to all the environments – it’s all good; and it really is. But this is not where we want to stop. In particular, we are not happy with our deployment frequency.
Let me put a disclaimer right here. For us, a deployment means real application code change. It does not include content changes, style changes, network changes, database changes, system resource changes and so on. Whether we should count those too or not is a different topic of discussion. All I am saying is that a deployment here represents a set of new application code, in whatever form, being installed in production. In other words, the deployment number here is a small subset of production changes that are going on.
2016, the year of “What’s in your pipeline”.
We have been asking our teams a very simple question “What’s in your pipeline”. We have been doing DevOpsSec for a while – for about 5 years now. Our engineers know what DevOps means – or not!
In my honest opinion, we need to stop defining DevOps. Instead of asking what DevOps is, we should be asking Why do we need DevOps.
In my point of view, the answer is very straightforward: the goal is to deliver high quality working software faster. Now, we all know what each of these words means. But “faster”? How fast? This is confusing in many ways… What is a good number? Why do we need to go that fast? We used to do 1 release per quarter… now we release every month – isn’t that fast enough?
To be honest, I cannot answer how fast is optimum or how fast is feasible. But there is scientific proof that faster is better. There is also evidence that frequent deployment is better. And so faster and more frequent is better – an indicator of a high-performing IT organization.
It came through loud and clear in the DevOps Survey. If you have not read it yet, do it tonight.
So, faster is better.
I kept thinking… is there a scientific proof? I know Nicole will tell me that the survey is scientific. And it is. I am thinking I have read this proof somewhere… years ago when I was a kid.
Let me digress here a little bit.
It was Bernoulli in the early 18th century.
Based on Bernoulli’s work, you can explain why, when the flow of an incompressible fluid is constricted, the fluid velocity increases, the dynamic fluid pressure decreases, and the energy remains constant.
In essence, science proved long ago that smaller chunks of change delivered in a continuous flow through a pipeline increase velocity and create less pressure…
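For the record, the physics behind the analogy is Bernoulli’s equation for steady, incompressible flow along a streamline, together with the continuity equation:

```latex
% Bernoulli's equation: pressure + dynamic pressure + hydrostatic term is constant
p + \tfrac{1}{2}\rho v^{2} + \rho g h = \text{constant}

% Continuity: for incompressible flow, volumetric flow rate is conserved
A_{1} v_{1} = A_{2} v_{2}
```

When the cross-section $A$ shrinks, continuity forces the velocity $v$ up, and Bernoulli’s equation then forces the pressure $p$ down – smaller pipe, faster flow, less pressure.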
In order to increase delivery speed, we started looking at pipelines. Continuous Delivery pipelines. With all the automation, shift-left, and practices that we developed over the years, we now wanted to build pipelines that magically take a commit from the commit stage to production with zero touch.
Easier said than done. We looked at some sample pipelines that people started creating – both inside at Capital One and outside…
We found some rather interesting ones…
Like some pipelines that never end.
Some pipelines are so complex that you don’t know where they start and where they end.
And then, this is the most popular type of pipelines – you need an army to support the pipeline.
I have also seen pipelines that just build and deploy – but as far as I am concerned, a pipeline that does not have security embedded and does not have test automation is not a pipeline. Period. I really don’t care about the rest of the things the pipeline does.
So, to summarize, we had these tasks for ourselves
1. Design and implement a pipeline
2. Measure and identify bottlenecks
3. Fix bottlenecks
Let me share what we did on each of these areas and then I will share with you the outcome.
I call them the 10 commandments – in hexadecimal. We as an enterprise have come together on these criteria to measure our DevOpsSec success. Every product team is tracked on these – all the way up to the CIO, who sees the progress at the line-of-business level.
We spent a lot of time on discussing what to measure, how to measure and how to interpret what we measured. We attacked it from different angles.
Around this time last year, Jez Humble showed me the survey that he and Nicole Forsgren were working on. It looked very promising and we agreed to participate.
We did some proof of concepts, went back and forth on many things; I think we had some good influence on two aspects of the survey: Security and Test Data Management.
We went to the executive leaders, got a green signal from our CIO, and then ran the survey in many teams across the enterprise. It produced some interesting and encouraging numbers. First, and most importantly, we are moving in the right direction and we need to keep doing what we have been doing… in fact, some of our new large product initiatives are on par with the top industry performers. The survey also pointed out a few areas where we need to double down.
We also used our own Hygieia dashboard and the newest features in Hygieia that we developed to improve “speed”. We spent a lot of time brainstorming this topic. Speed of what? What is flowing through the pipeline? Business value? Features? Intent? Code? We could not come up with a foolproof method to track the speed of delivery of business value and features.
What we have is a way to track each code commit through delivery lifecycle stages – from Commit stage to Production Deployed stage. The beauty of this is that we can now see the “wait time” between two stages. Why is this important?
1. In our opinion, you can speed up by reducing these wait times. Why are commits waiting X hours before being deployed to the Dev environment? Maybe a lack of automation? Maybe the infrastructure is unstable?
2. The team can decide which “wait time” needs to be reduced to speed up the pipeline. You do not want the team to spend time reducing a 10-minute build cycle to 5 minutes when the test cases take a few hours to finish.
3. The teams decide what to do and they will do it. Believe me. You just need to make it transparent. In some cases, you need a bigger effort when it comes to processes.
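The wait-time idea can be sketched in a few lines. This is illustrative only – it is not Hygieia’s actual data model – and the stage names and timestamps are hypothetical.

```python
# Illustrative sketch of per-commit wait-time tracking (not Hygieia's actual
# data model): record a timestamp when a commit enters each lifecycle stage,
# then compute the gap between consecutive stages.
from datetime import datetime

# Hypothetical stage timestamps for a single commit.
stages = [
    ("commit",      datetime(2016, 5, 2, 9, 0)),
    ("build",       datetime(2016, 5, 2, 9, 15)),
    ("deploy_dev",  datetime(2016, 5, 2, 13, 15)),  # a 4-hour wait: worth a look
    ("deploy_prod", datetime(2016, 5, 2, 14, 15)),
]

def wait_times(stages):
    """Return {(from_stage, to_stage): hours waited} for consecutive stages."""
    return {
        (a[0], b[0]): (b[1] - a[1]).total_seconds() / 3600
        for a, b in zip(stages, stages[1:])
    }

waits = wait_times(stages)
bottleneck = max(waits, key=waits.get)  # the transition the team should attack first
```

Once every commit carries timestamps like these, the biggest gap between stages is visible to everyone – which is exactly the transparency that lets the team decide what to fix.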
Both measurements showed us a few things that we needed to address at an enterprise level.
First are the process and technical bottlenecks to getting to production. By decisively selecting the public cloud, we had our arms around the technical bottleneck. Technically, we can now deploy to production by clicking a button. But who knew clicking a button is so difficult?
The core of this bottleneck is what all big enterprises face… CAB! The Change Approval Board. Before going to CAB, you need pre-approvals before getting approvals, and then change management, and review of change management. I am sure it is much more complicated than it sounds. This year we worked very closely with our Audit, Compliance and Risk offices to take a deep dive into our processes and how we can do a better job. We have developed a hypothesis and we are testing it out. Let me share, at a high level, what that hypothesis is… but before that, let me state that we started from a set of common beliefs.
We proved via empirical data that trunk-based development is better for Continuous Delivery, and this is what we want our teams to follow. But it is hard to enforce this on hundreds of product teams. So what we came up with is this simple formula:
If the team’s goal is to deploy 3 times a day, CI takes 30 minutes, CD takes 3 hours, and the production deployment takes 1 hour, then the team must merge code to the release branch within about 3 hours of the original commit.
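One way to read that arithmetic – an assumption on my part, not the talk’s exact model – is that within an 8-hour working day, the last merge must still leave enough time for CI, CD, and the production deployment to finish:

```python
# Sketch of the merge-window arithmetic, assuming an 8-hour working day.
ci = 0.5           # hours for continuous integration
cd = 3.0           # hours for the continuous delivery stages
prod_deploy = 1.0  # hours for the production deployment
workday = 8.0      # hours (an assumed working day)

pipeline = ci + cd + prod_deploy   # 4.5 hours from merge to production
merge_window = workday - pipeline  # 3.5 -> "about 3 hours" to merge a commit
```

The useful property of stating the goal this way is that it turns “merge to trunk often” from a policy into simple arithmetic each team can compute from its own pipeline timings.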
These risks are real. You cannot deny that. But… there is a better way to mitigate them.
Remarkable Results.
Production Release (code releases only): from 1 / sprint to 10
Preparing to test this hypothesis was by no means easy. We ended up developing our own release automation tool, which has an onboarding process that ensures a team follows the required practices, creates a change order automatically, and approves it.
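To give a feel for what “creates a change order automatically and approves it” might look like, here is a hypothetical sketch. It is not the actual tool; the practice names, record fields, and auto-approval rule are all invented for illustration.

```python
# Hypothetical sketch of an auto-approved change order (not the actual tool):
# the release automation builds a change order from pipeline evidence and
# auto-approves it only when every required practice has passing evidence.

REQUIRED_PRACTICES = ["static_analysis", "code_coverage", "vulnerability_scan"]

def create_change_order(app, version, evidence):
    """Build a change order record; auto-approve when all evidence passes."""
    approved = all(evidence.get(p) == "pass" for p in REQUIRED_PRACTICES)
    return {
        "app": app,
        "version": version,
        "evidence": evidence,
        "status": "approved" if approved else "needs_review",
    }

order = create_change_order(
    "demo-app", "1.4.2",
    {"static_analysis": "pass", "code_coverage": "pass",
     "vulnerability_scan": "pass"},
)
```

The design point is that the approval decision is derived from recorded pipeline evidence rather than a meeting, which is what lets the change order satisfy segregation-of-duties controls without a manual CAB.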
We also forked the LGTM GitHub review tool and enhanced it with many configurable rules that helped us in this work. We wanted to give it back to LGTM, but they proposed that we host it as a new product.
In the very near future, we are going to open source the modified LGTM as a new tool.
We are also going to open source those 30 practices as a model.
And in case you haven’t noticed in DevOps news, there is another DevOpsSec tool that we open sourced this year – it’s called Cloud Custodian.
I will end by sharing a picture of my favorite T-Shirt.
“All of Chuck Norris’s Change Controls are FullCycle… and they are always APPROVED”.