You have heard how great infrastructure as code is. But your organization already has existing infrastructure that was created manually and is now active in production, growing to an unmanageable level. How do you bring it all under code? This talk covers how we at Samsung R&D Canada did exactly that with Terraform, including the lessons we learned along the way.
Transforming Infrastructure into Code - Importing existing cloud resources using Hashicorp's Terraform
1. Transforming
Production Infrastructure into Code
Importing Existing Cloud Resources with Hashicorp’s Terraform
Shih Oon Liong
Vancouver Hashicorp User Meetup Group
November 22nd 2017, Hootsuite HQ
2. Samsung R&D Canada
Agenda
• Before Infrastructure As Code
• Importing Infrastructure into Code
• Lessons Learned
• Benefits with Infrastructure as Code with Terraform
• Q&A
4. Samsung R&D Canada
Shih Oon Liong
Senior Operations Engineer
Technical Operations and Performance Engineering
Samsung Research & Development Canada
Previously: Hootsuite, Invoke, Syncapse, Nudge
@mechastorm
so@liong.ca
Who am I?
5. Samsung R&D Canada
• Samsung Research & Development Canada (SRCA)
What is Samsung R&D Canada?
• Part of the Global Research & Development
division within Samsung Electronics
• Manages the development and operations for
Samsung Cloud Service
• Works in conjunction with a global operations and
development team
8. Samsung R&D Canada
Infrastructure at “Teddy” (not a real name)
Dedicated Amazon Web Service account per environment
40 – 50 services in a period of 2 years
~10 development teams across 4 different time zones
4 Operations teams globally – Korea, India, Canada & America, Poland
Before Infrastructure as Code
9. Samsung R&D Canada
Teddy’s launch story
Initial launch & setup in one country – all other countries replicate the
same setup
Short deadlines – little time to reiterate
Knowledge migration - on-premise hardware to cloud infrastructure
Before Infrastructure as Code
10. Samsung R&D Canada
Managing infrastructure at Teddy
Manual set up of every AWS Resource
Step-by-step documentation of the settings and names to configure on
each resource
Before Infrastructure as Code
11. Samsung R&D Canada
After 2 – 3 years of running
Before Infrastructure as Code
Takes days to configure a service in a new environment
Inconsistencies across environments. (Worked in Env B! Broke in Env A!)
Long onboarding time for new teams
A new AWS region required
13. Samsung R&D Canada
Why Terraform?
Infrastructure as Code (IaC)
Repeatable / Idempotent
Cloud-agnostic – same workflow regardless of provider
Collaboration
Importing Infrastructure into Code
14. Samsung R&D Canada
Short-term Objectives
Import all AWS resources in production into code
Re-deploy code into new AWS region and account
Transforming Infrastructure with Terraform
Long-term Objectives
Recreate other environments with same code
Infrastructure as Code
16. Samsung R&D Canada
What did we use
Transforming Infrastructure with Terraform
Terraform – terraform import
Get infrastructure into terraform statefile
Terraforming
Get infrastructure into HCL
(Hashicorp Configuration Language)
18. Samsung R&D Canada
terraform import
Imports existing infrastructure. This allows you to take resources you've
created by some other means and bring them under Terraform management
Added in 0.7.x (August 2016)
https://www.terraform.io/docs/import/index.html
Transforming Infrastructure with Terraform
19. Samsung R&D Canada
Transforming Infrastructure with Terraform
Importing a single resource
terraform import aws_instance.web i-123456789
Terraform Resource Type
Resource Custom Name
Resource Cloud ID
20. Samsung R&D Canada
terraform import - Limitations
Only imports into the Terraform statefile
Does not yet generate HCL code – a future version of Terraform may
support this
Before importing, you must manually define an HCL resource configuration
block for the resource (can be overridden)
Transforming Infrastructure with Terraform
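A minimal sketch of the stub that must exist before the import runs (the resource name and instance ID here are illustrative, not from the actual Teddy infrastructure):

```hcl
# Stub required before running `terraform import` -
# attributes can be filled in afterwards by comparing
# against the `terraform plan` output
resource "aws_instance" "web" {
}
```

With that stub in place, `terraform import aws_instance.web i-123456789` writes the live resource's state into the statefile.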
23. Samsung R&D Canada
terraforming
import AWS resources into Terraform HCL or Statefile code
Requires Ruby 2.1 and Terraform 0.9.3 or higher
By Daisuke Fujita (dtan4)
https://github.com/dtan4/terraforming
Transforming Infrastructure with Terraform
24. Samsung R&D Canada
Transforming Infrastructure with Terraform
Importing ec2 instances
export AWS_REGION=us-west-2
terraforming ec2 --profile=myaccount | tee "ec2.tf"
As there is no region flag in the CLI,
we need to define the AWS region
The type of AWS resource to generate from
The AWS credential profile to authenticate with
terraforming only outputs to screen.
You need to do your own custom
output redirection
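For each instance, terraforming's generated HCL looks roughly like this (all attribute values are illustrative; note it emits optional flags such as ebs_optimized and source_dest_check, and names the resource after the instance name and ID):

```hcl
resource "aws_instance" "web-01-i-123456789" {
  ami               = "ami-0abcd1234"
  instance_type     = "t2.medium"
  subnet_id         = "subnet-0abc1234"
  ebs_optimized     = false
  source_dest_check = true

  tags = {
    Name = "web-01"
  }
}
```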
25. Samsung R&D Canada
terraforming - limitations
Generates HCL code for ALL resources of a type in a region
No way to limit to a specific resource
Regions with 100s of resources – slow to generate. Limited by API
rate limits
Transforming Infrastructure with Terraform
28. Samsung R&D Canada
terraform import
Great for importing specific
resources into statefile
Support for majority of
resources (AWS)
terraform import limitations
Does not generate HCL code
Will only import to statefile
terraforming
Great for generating HCL
code from existing AWS
resources
terraforming limitations
imports everything to HCL or
statefile
Import Workflow – Tools
29. Samsung R&D Canada
Break up each service into its own repository
Shared resources in its own repository
(e.g. common subnets, security groups)
Terraform Remote State Storage for each repository, each environment
Retrieving metadata from remote states
data.terraform_remote_state.abc
Only import existing resources – do not change existing resources yet
terraform plan = zero changes
Import Workflow – Conventions
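A sketch of how one repository reads shared IDs from another's remote state, in the Terraform 0.9–0.11 syntax current at the time (bucket, key, and output names are illustrative):

```hcl
# service-a reading the vpc repository's remote state
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config {
    bucket = "my-terraform-states"
    key    = "vpc/terraform.tfstate"
    region = "us-west-2"
  }
}

resource "aws_instance" "service_a" {
  # no hardcoded subnet ID - pulled from the vpc state's outputs
  subnet_id = "${data.terraform_remote_state.vpc.subnet_id}"
}
```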
30. Samsung R&D Canada
1. Map it
2. Pick a Service
3. Setup Repository
4. Import Resource into HCL
5. Import Resource into StateFile
6. Verify - Terraform Plan = Zero Changes
Import Workflow – Steps
31. Samsung R&D Canada
1. Map it
Get a physical architecture diagram of the whole system
Import Workflow – Steps
Disclaimer: Not a real actual physical architecture diagram
32. Samsung R&D Canada
1. Map it
2. Pick a Service – Order to start with. Example
Common Networking – VPC
Common Security Groups
Kafka Queueing Service
Service A
Import Workflow – Steps
• Lowest dependency
• Will be reused a lot
33. Samsung R&D Canada
2. Pick a Service
3. Setup Repository
Set up Git Repo
Set up remote state configurations
Import Workflow – Steps
vpc
common-sg
kafka-q
service-a
Terraform remote
state dependency
34. Samsung R&D Canada
3. Setup Repository
4. Import Resource into HCL
Have a central repo for all terraforming outputs
Pick a resource and copy/paste
Import Workflow – Steps
36. Samsung R&D Canada
4. Import Resource into HCL
5. Import Resource into Terraform State file
Run terraform import on the desired resource
Import Workflow – Steps
37. Samsung R&D Canada
5. Import Resource into Terraform State file
6. Verify - Run terraform plan
Should report zero changes
Any changes – manually configure the HCL resource until in sync
Import Workflow – Steps
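Steps 5 and 6 together form one short cycle per resource (names and IDs are illustrative):

```shell
# Import one resource into the statefile
terraform import aws_instance.web i-123456789

# Verify: should report "0 to add, 0 to change, 0 to destroy".
# If not, adjust the HCL and re-run plan until it is clean.
terraform plan
```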
38. Samsung R&D Canada
Took 3 – 4 weeks to import ~30 core services (1 - 2 persons)
We were lucky
Good Documentation
Similar architecture for a majority of services
Result
40. Samsung R&D Canada
1. Good documentation
2. Communicate / train early
3. Terraform import gotchas
Lessons Learned
41. Samsung R&D Canada
Lessons Learned – Good Documentation
Detailed architecture diagram – always helps!
A map to your services
Helps to plan what service you want to import into code
Maps out dependencies
42. Samsung R&D Canada
Lessons Learned – Communication / Training
Communicate early - make everyone aware that how infrastructure
should be managed from now on
Training early
have everyone ready to use Terraform the moment code is imported
new services can be started on Terraform immediately – stop the flow of
manual configured resources
43. Samsung R&D Canada
Lessons Learned – Communication / Training
Importing existing infrastructure
Great onboarding process for new engineers
44. Samsung R&D Canada
Lessons Learned – Terraform Import Gotchas
Slow & Steady
Run your imports resource by resource (especially for more unique
services)
Import one resource
Plan
Repeat
Automation???
45. Samsung R&D Canada
Lessons Learned – Terraform Import Gotchas
terraforming ec2
Will import EVERYTHING about an EC2 instance in HCL code including
optional properties (example: ebs_optimized, source_dest_check)
This complicates running terraform import afterwards
Simplify it (Your mileage may vary)
Remove optional data from HCL code
(example: ebs_optimized, source_dest_check)
Then run terraform import
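A before/after sketch of the simplification (attribute values are illustrative):

```hcl
# As generated by terraforming (trimmed):
# resource "aws_instance" "web" {
#   ami               = "ami-0abcd1234"
#   instance_type     = "t2.medium"
#   ebs_optimized     = false
#   source_dest_check = true
# }

# Simplified before running terraform import -
# optional flags removed, keeping only what we manage:
resource "aws_instance" "web" {
  ami           = "ami-0abcd1234"
  instance_type = "t2.medium"
}
```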
46. Samsung R&D Canada
Lessons Learned – Terraform Import Gotchas
terraform import aws_security_group
Will import AWS security group with two types of rules
Inline security group rules
Individual security group stanzas (aws_security_group_rule)
It is dependent on usage on which to keep
resource "aws_security_group" "ssh" {
  ...
  ingress {
    from_port = 22
    to_port   = 22
    ...
  }
  egress {
    ...
  }
}

resource "aws_security_group_rule" "ssh-22" {
  type      = "ingress"
  from_port = 22
  to_port   = 22
  ...
}
HCL representation (terraform import will actually import the JSON into tfstate)
47. Samsung R&D Canada
Lessons Learned – Terraform Import Gotchas
Terraform State manipulation
Remove a resource from tfstate
terraform state rm aws_instance.web
Rename a resource
terraform state mv aws_instance.web aws_instance.server
Move a resource to a module
terraform state mv aws_instance.web
module.mymod.aws_instance.server
49. Samsung R&D Canada
Visibility
Consistency
Collaboration
Reusability
Benefits of IaC with Terraform
50. Samsung R&D Canada
See how a service's infrastructure is defined in code
Benefits of IaC with Terraform
Visibility
51. Samsung R&D Canada
Notice inconsistencies with different
services and environments
Benefits of IaC with Terraform
Consistency
Maintain consistency across services
and environments (to a point)
52. Samsung R&D Canada
Open up infrastructure architecture to developers
Benefits of IaC with Terraform
Collaboration
Working in pairs – reviewer and reviewee
53. Samsung R&D Canada
Similar architectures can use a common module
Benefits of IaC with Terraform
Reusability
No hardcoded values for shared services
(e.g. Kafka queueing)
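A sketch of what that reuse looks like: a service with the common EC2 + ELB shape instantiating a shared module (the module path and variable names are hypothetical):

```hcl
# Hypothetical shared module for the common service shape
module "service_a" {
  source        = "../modules/ec2-elb-service"
  name          = "service-a"
  instance_type = "t2.medium"

  # Shared values come from remote state, not hardcoded IDs
  subnet_id = "${data.terraform_remote_state.vpc.subnet_id}"
}
```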
54. Samsung R&D Canada
What we still need to do – Future with Terraform
Onboarding more engineers
(different timezones)
Centralized Terraform Workflow
Terraform Enterprise
Atlantis (https://atlantis.run)
Security compliance
Thanks to <host>. I am here today to present the experiences that the TechOps teams at Samsung Research Canada recently went through in moving our production environment into code using Terraform, along with some good conventions.
These are the topics that I will be covering today. In summary, what I hope to show today is that
even if you already have infrastructure running, you can still use Terraform on it.
So a brief introduction of myself. My name is Shih Oon and I am a senior operations engineer at Samsung Research & Development Canada. I am part of the TechOps teams and I am primarily responsible for the shared services within our operations teams. I have been working in software engineering and operations for the last 10 years in numerous organisations small and big like Hootsuite and Invoke. I am a Malaysian who has grown up and worked in London, UK and have been in Canada for the last 5 years or so.
Let us quickly cover what is Samsung Research Canada. This will help you understand the scope of our transformation and the scale that we had to collaborate at.
Here is a lovely picture of our office in Burnaby. We are usually called Samsung Research Canada or SRCA for short.
SRCA is part of the Global Research & Development division – which is part of Samsung Electronics.
Our office manages the software engineering and technical operations for various cloud services of Samsung.
I should emphasize that our office is not the only team that works on those services. We collaborate with numerous colleagues from all around the world.
And finally, before we move to the main content, just a bit of a disclaimer. The story I will be speaking about is true, but I can't go into the specifics of the product it relates to. So for the purpose of this presentation, we will be talking about the infrastructure around a cloud service at Samsung called Teddy.
So what did we have before moving towards infrastructure as code
Infrastructure at Teddy is pretty straightforward for those that are familiar with large cloud services.
We run a dedicated AWS account per environment – and in general we have about 3 – 4 different environments.
Over the first two years of the product, the infrastructure now includes 40 – 50 different microservices. Keep in mind these are non-dockerized services – just ec2 instances with elastic load balancers – no auto scaling enabled on a majority of these services.
These services are developed by about ~10-20 different development teams that span across 4 different time zones. These development teams are supported by at least 4 Operations (or TechOps) teams across the globe.
So as you can see, Teddy’s engineering team structure is a complex beast from the start.
When Teddy was initially launched, it was all about getting to market fast. Even though Samsung is a large conglomerate, our cloud services are managed like startups where speed to launch is of key importance. We launched Teddy in one country and it was a huge success.
Launching to other countries was ramped up – giving very little time to iterate/improve on the existing setup from the initial country launch.
In addition, our TechOps team’s exposure to the cloud was limited, a majority of our team were previously managing on-premise infrastructure.
All of this factored into an infrastructure at Teddy that was challenging to manage. It resulted in AWS resources that were created manually on the web console – each individual security group, ec2 instance, and elb was hand-crafted. We had a very dedicated team of operators who made sure that all the names and settings were consistent across the environments.
So from first glance, this all looked great. We got to launch on time and the infrastructure passed all compliance checks.
But after a couple of years of running, we noticed we were slowing down and problems started cropping up. A common story that most here may have already experienced.
As more services got launched in the product, Operators would spend days to configure and deploy it in a new environment – and replicate that again in another environment. Remember we had 3 – 4 environments to deploy that same service in.
As more services got launched, we started noticing inconsistencies in the setup across environments. This was especially so when we had situations where it broke in one environment but worked in the other.
In addition, as we grew the engineering team for this product, there was a long ramp up time for the teams (especially the TechOps team) in new offices to learn the infrastructure.
Finally, a new product requirement came – we had to replicate all on this on a new AWS region in a new AWS account.
So what did we do to import our infrastructure into code.
We picked Terraform as our tool to manage our infrastructure. Everyone here should be familiar with Terraform to a certain degree. For us it was an ideal fit because
- IaC was first class citizen in the tool.
- We can manage infrastructure in a repeatable & safe manner especially with its Plan and apply workflow
- It was not tied to a specific cloud provider. Yes you still need to write code specific to each cloud provider, but we still get to retain the same workflow of plan and apply.
- Terraform makes it easier for us to collaborate simply due to the fact that it is infrastructure as code.
We had to begin properly formulating how we could replicate the whole production environment in a new region and account. No one was going to manually create the AWS resources for 50 - 60 services from scratch.
We had to begin embracing infrastructure as code or we would never be competitive in our deliverables.
So our first objectives initially was
- to import all the AWS resources from production into code of some sort. Essentially capture what is currently in production in some codified state
- Then redeploy that code into a new environment
By meeting those short term objectives, we could then get our environments consistent by deprecating the manually created ones in redeploying them from code in other environments.
Finally we can then begin embracing Infrastructure as Code.
So how were we going to do this with Terraform
So the first path we looked into – could we just destroy and recreate everything in Terraform?
In other words, burn it all down?
In our case, no we could not do that. We had far too many production systems and with our processes, it would take too much time.
But for anyone else who is trying to solve the same situation as us, I would strongly suggest if you could just destroy and recreate, that would be a much easier path. Unfortunately for us, we had far too many services to do that safely.
So in the end, we used two tools to import existing infrastructure into Terraform
First we used a new feature that was released in 0.7 called terraform import – this was used primarily to get the infrastructure definition into our tfstate file.
We used terraform import in conjunction with a 3rd party tool called Terraforming. This was used to get the infrastructure definition into HCL code – think of it as a code generator.
So what is Terraform import
It’s a CLI command within the main Terraform command – used to import existing resources from a provider.
It has been available since 0.7.x (August 2016), with some changes in 0.10.x.
It is primarily used to import one resource into the Terraform state file.
All you do is pass the resource type, a custom name and the actual ID of the resource in your cloud provider, then Terraform will do the rest.
There are some limitations though to terraform import.
It only works on Terraform Statefiles
It does not generate any HCL code. Future version may support this later.
And recently, terraform import requires you to define the HCL code for that resource before you even run terraform import (this can be overridden though).
I should also add that even though most AWS resources can be imported with terraform import, there are still a couple that are not yet supported.
So our workflow to import our production into Terraform was to
1. Define the HCL code for the resource that we plan to import
2. Run terraform import
3. Terraform will then generate the JSON and put it directly in the Terraform statefile.
But you may ask, how do we generate this HCL code?
This is where Terraforming helps.
So terraforming is a CLI tool built in Ruby. What it does well is import AWS resources into Terraform HCL code – in other words, it generates HCL code based on the resources that already exist in your cloud account. It can also generate the statefile, but there are some limitations to that which I will explain later.
Terraforming as a CLI tool is still a bit clunky to use.
To use it, you need to first define your AWS region.
Then running the terraforming command, you define
- The type of AWS resource
- The credential profile to use
- And then finally redirect the output to a file of your choice. As of now, Terraforming only outputs to the command line.
There are some limitations though.
Terraforming will generate code for EVERY resource that exists in that region for that account.
There is no way to tell it to just import one.
So if your account has hundreds of resources in one region, then this tool will either take a very long time or the AWS API will throttle or time you out.
This is what I mean here.
When you run terraforming, it will pull data for ALL resources in your AWS account. In this case, it pulls every ec2 instance, including random ones like testing instances.
This all ends up in the final code output.
So what was our final workflow like
As we just saw, terraform import and terraforming have their benefits and limitations.
We primarily used terraform import to import into the statefile, and then terraforming to generate the actual HCL code
Our philosophy when we import these resources were these
Each service should be its own repository – even for shared resources like networking/VPC
We always use terraform remote state storage – and link up each state using the terraform_remote_state data source.
And finally we only concentrate on importing existing resources. Don’t try to change the existing infrastructure yet. What this means is that when we run plan after importing, it should just say zero changes.
Here are the steps we took.
Let us quickly go through each step.
First we first mapped it.
We were lucky that we kept a very detailed diagram of our architecture. It was managed manually and it was a very laborious process. But it greatly helped us in understanding the scope of the project and what services we can import.
Then we picked a service to import.
We started first with a service that had the least amount of dependency but will be reused a lot as our first priority.
Example here is VPC that has common networking resources such as the VPC and internet gateway. It has zero dependencies when you create it but other services will require it when they are created.
This allows you to work in a logical manner with minimal blockers. As you import more services, it starts getting easier because common/shared resources would have already been imported previously.
After picking a service, we begin setting up the repository for the service.
This diagram represents an example of how the git repos are organized and how they relate to each other. Each service here is in its own git repository and these lines represent who needs to read its remote state. As you can see you are nearly mimicking the actual architecture to a degree.
We now get to the meat and potatoes of our step, first we want to import the resource into HCL code.
If it's not done yet, we set up a central repository where all the outputs from terraforming are stored. Then we pick a resource and copy/paste it into our individual services.
As you can see here we have two separate repos, our service repo and then our terraforming output repo.
Our terraforming repo contains outputs from all resources in AWS.
We copy the HCL code corresponding to our service and paste to our service repo.
You will notice that the name of the resource is changed – we recommend changing this to something sane for you, as by default terraforming names the resource with the name and resource ID.
So now we have the HCL code definition of our resource.
We now import that same resource into our state file using terraform import.
Then we verify we have successfully imported our resource correctly.
We run terraform plan and make sure it reports zero changes.
If there are changes, we go back to the HCL code and modify it to be in sync with what it needs to be. This was only needed for a minority of resources.
All in all, it took 3 – 4 weeks to import all the core services of our product.
It was quite a smooth running process primarily because
- we had very detailed documentation of the architecture that made it easy for us to pick which services to import and understand their dependencies
- A lot of services had similar architecture, so once we identified a pattern, it was easy to copy/paste or used modules to simplify the HCL code.
What did we learn from all of this?
Good documentation always helps!
Even though everyone hates it, you really appreciate the efforts in documentation when you have a project like this.
Get everyone ready for the change to terraform.
We had to communicate to not only our local team but our overseas teams that infrastructure should be managed differently using Terraform. This also included having to train our colleagues how to use it.
In addition, new services started to be set up using Terraform. We needed to stop the flow of manually configured infrastructure or else it would never end.
The other thing we learnt was that importing resources into Terraform was great training for newcomers. I was only 2-3 months into my time with Samsung, but after going through this project, I became fully familiar with the architecture and infrastructure of the product.
Some lessons we learnt when using Terraform Import
- do it slowly, bit by bit. Yes, it can be a bit daunting, but depending on how you work, importing and verifying one resource at a time helps minimize mistakes.
We could have automated this importing process, but the benefits we would have gained at the time were not worthwhile – hopefully Terraform improves this somehow.
When terraforming ec2, be aware that it generates ALL properties of an aws_instance – including all the little flags like ebs_optimized, source_dest_check and so on.
This can get complicated when you run terraform import afterwards.
For us, we just simplified the definition (this may differ for your infrastructure).
In addition, when running terraform import for security groups, be aware that Terraform will import security group rules TWICE – one definition in the inline security group and another in individual stanzas.
The diagram shows the HCL code representation of what I mean.
In terms of deciding which to keep, it really depends on which you prefer and how flexible you want the security groups to be. Ask me for opinions on this later if you wish.
And finally here are some useful commands in terraform that helps manage the resources in the state file such as
Removing a resource
Just run rm
Renaming a resource
And moving a resource to a module – which is just renaming the resource with a module prefix.
So what are the benefits that we saw with using Terraform.
Visibility – we have a full view on how our infrastructure is defined in code. This was especially useful for security reviews such as what IPs we have whitelisted and why.
We were able to properly define and show newcomers how the infrastructure for a service is defined, and allow other teams to reuse it if it fits their needs.
Consistency – after we imported our resources into Terraform, we could finally notice differences in minor aspects of our resources such as settings and naming.
And then, using Terraform, we could begin enforcing consistency.
It also finally allowed us to collaborate on infrastructure with developers. And we could get into working in pairs – reviewing each other's codebase and providing more visibility to other team members on how we work.
Finally, we got reusability. As mentioned, we managed to get services that had a similar architecture to use a common module, for example.
We also avoided hardcoded values for resources like VPC IDs and security group IDs by using remote state linking.
Even then we still have a lot to do,
- We still got more engineers to onboard and train – in multiple timezones and different workstation setups.
- We need to roll out a centralized Terraform workflow as we scale up. We are looking into Terraform Enterprise but there are budgetary processes to follow. Meanwhile we are looking to roll out Atlantis to accelerate our IaC efforts.
- Finally we need to figure out the security compliance around Terraform – how we will approve and apply infrastructure changes while still being compliant with our security and change policies.
So thank you again to <group> for inviting me to give this talk and to <host> for hosting this. And let me just add that Samsung Canada is always hiring!
So any questions!