SlideShare a Scribd company logo
1 of 66
Download to read offline
Scalable
      System Operations


Joshua Hoffman
                          1
About This Talk

• Set of principles        • Software
• Operations Engineering   • Techniques
• Tumblr project           • Example code
• Server management        • Best practices
• Massively automated      • Open source



                                              2
About Me




           3
About Me

    • 1995: CompUSA
      Intro to The Internet
    • 2000: Guru Labs
      Sun, Cisco, Red Hat
    • 2002: Red Hat
      Sys admin courseware




                              4
About Me
    • 2004: Fortress Systems
      Anti-spam/malware
    • 2005: Red Hat
      Virtualization cert
      Remote learning
      Defined “cloud”
    • 2011: Tumblr
      Lead Systems Eng




                               5
The Problem




  Deploying new servers is very repetitive and slow.
                (and we hate that)
http://www.flickr.com/people/mc4army/
                                                       6
The Job We Don’t Want




http://www.flickr.com/people/redjar/
                                      7
The Solution




               8
Install OS
                    Firmware
Configure OS
                    Configure BIOS
Install software
                    Set up BMC
Configure software
                    Inventory
Add to DNS
                    Stress testing
Add to monitoring
                    Network config
Add to trending




                                     9
The Goal Is Clear




                    10
Time To Strategize




http://www.flickr.com/people/irishwildcat/
                                            11
Time To Strategize
                                            Use open source?
                                            Which?
                                            Buy software?
                                            Which?
                                            Write software?
                                            Mix and match?



http://www.flickr.com/people/irishwildcat/
                                                               12
The Choice Principle




The time to make a decision is a function of the possible choices.

  http://www.flickr.com/people/3059349393/
                                                                     13
Rapid Software Research




                          14
Rapid Software Research


                1. Define
                2. Gather
                3. Disqualify
                4. Rank




                                15
Rank




http://www.flickr.com/people/nirak/
                                            16
Rank

                                            • Modularity
                                            • Compliance
                                            • Novelty
                                            • Disruption



http://www.flickr.com/people/nirak/
                                                           17
My Requirements

 • Asset inventory
 • State management
 • Robust API
 • Event triggers



                      18
My Requirements

    • Modular
    • Flexible
    • Extensible
    • Fast



                   19
My Requirements


Manage physical hardware
as easily as virtual machines.




                                 20
The Usual Suspects
     • Cobbler
     • Foreman
     • Satellite
     • Orchestra
     • Racktables
     • Clusto


                     21
But Wait!




            22
Data Entry




   “Just import the data supplied by the hardware vendor...”

http://www.flickr.com/people/mwichary/
                                                               23
Missing Requirements
      Firmware
      Configure BIOS
      Set up BMC
      Inventory
      Stress testing
      Network config
      Add to monitoring
      Add to trending




                          24
We have to write software!




http://www.flickr.com/people/argen/
                                     25
We have to write software!


                                     • Delivery Schedule
                                     • Scope Creep
                                     • Maintenance
                                     • Documentation


http://www.flickr.com/people/argen/
                                                           26
Tumblr Management Stack

       • iPXE
       • Invisible Touch
       • Collins
       • Phil
       • Kickstart
       • Puppet



                           27
The Glue Principle




                          Unix Rule of Parsimony:
                Write a big program only when it is clear by
                demonstration that nothing else will do.
http://www.flickr.com/people/kodomut/
                                                               28
The Standards Principle




The nice thing about standards is that you have so many to choose from.
                          -Andrew Tanenbaum
     http://www.flickr.com/people/usfwssoutheast/
                                                                          29
The Simplicity Principle




                   Unix Rule of Simplicity:
 Design for simplicity; add complexity only where you must.

http://www.flickr.com/people/haldanemartin/
                                                              30
The 3:00 AM Principle




It must be obvious to someone woken up from a sound sleep at 3:00 am.

     http://www.flickr.com/people/joi/
                                                                        31
The Don’t Break The OS Principle




The software should NOT prevent the OS from working as expected.
   http://www.flickr.com/photos/philmanker/
                                                                   32
The Amnesia Principle




      Given enough time, you WILL forget why you did that.

http://www.flickr.com/people/zach_a/
                                                             33
Tumblr Management Stack
       • iPXE
       • Invisible Touch
       • Collins
       • Phil
       • Kickstart
       • Puppet


                           34
Why not pxelinux?




http://www.flickr.com/people/digiart2001/
                                           35
Why not pxelinux?


                                           • TFTP
                                           • Flat files




http://www.flickr.com/people/digiart2001/
                                                         36
iPXE




http://www.flickr.com/people/chiotsrun/
                                                37
iPXE

                                                • HTTP, FTP, iSCSI
                                                • Scriptable
                                                • Variables
                                                • Dynamic



http://www.flickr.com/people/chiotsrun/
                                                                     38
ISC DHCP For iPXE
  # subnet for the provisioning vlan
  subnet <%= subnet %> netmask <%= netmask %> {
    option domain-name          "<%= option_domain_name %>";
    option routers              <%= option_routers %>;
    option domain-name-servers <%= option_dns_servers.map{|i| "#{i}"}.join(", ") -%>;
    option subnet-mask          <%= option_subnet_mask %>;
    default-lease-time          21600;
    max-lease-time              43200;
    range                       <%= range_start %> <%= range_end %>;
    # If a pxe request comes in from ipxe send the config url
    if exists user-class and option user-class = "iPXE" {
        filename "<%= ipxe_config_url %>"; # http://foo.example.com/ipxe/${net0/mac}
    # For all other pxe requests send ipxe
    } else {
        next-server <%= next_server %>; # tftp server
        filename "<%= filename %>";      # path to ipxe binary on tftp server
    }
  }
}

                                                                                        39
Fedora LiveCD Tools
lang en_US.UTF-8
keyboard us
timezone US/Eastern
auth --useshadow --enablemd5
selinux --enforcing
firewall --disabled
repo --name=centos   --baseurl=http://127.0.0.1/pub/repo/centos/os/6.2
repo --name=infra    --baseurl=http://127.0.0.1/pub/repo/infra/6.2
repo --name=epel     --baseurl=http://127.0.0.1/repo/epel/6/x86_64/

%packages --excludedocs
@core
dracut
dracut-kernel
device-mapper
device-mapper-event
%end


                                                                         40
Invisible Touch Kickstart
# Invisible Touch Live OS image
%include centos-6.2-livecd-minimal.ks
%packages --excludedocs
it
%end
%post
cat > /etc/issue <<EoF
Invisible Touch Live OS v0.0.4
Kernel r
EoF
# set ipmi to start at boot up
/sbin/chkconfig ipmi on
# configure rsyslog
cat >> /etc/rsyslog.conf <<EoF
# invisible touch
local0.*                                /var/log/it.log
local0.*                                /dev/tty7
EoF
%end

                                                          41
Invisible Touch Utilities

        • lshw
        • lldpd
        • Breakin
        • ipmitool
        • Bash scripts



                            42
lshw
<node id="disk:1" claimed="true" class="disk" handle="SCSI:04:00:01:00">
  <description>ATA Disk</description>
  <product>ST91000640NS</product>
  <vendor>Seagate</vendor>
  <physid>0.1.0</physid>
  <businfo>scsi@4:0.1.0</businfo>
  <logicalname>/dev/sdf</logicalname>
  <dev>8:80</dev>
  <version>n/a</version>
  <serial>9XG0ETB8</serial>
  <size units="bytes">1000204886016</size>
  <configuration>
    <setting id="ansiversion" value="5" />
    <setting id="signature" value="000e1763" />
  </configuration>
  <capabilities>
    <capability id="partitioned">Partitioned disk</capability>
    <capability id="partitioned:dos">MS-DOS partition table</capability>
  </capabilities>
</node>



lshw generates hardware info XML

                                                                           43
lldpd
<interface label="Interface" name="eth0" via="LLDP" rid="1" age="0 day, 00:01:03">
 <chassis label="Chassis">
  <id label="ChassisID" type="mac">78:19:f7:88:60:c0</id>
  <name label="SysName">core01.dfw01</name>
  <descr label="SysDescr">Juniper Networks, Inc. ex4500-40f</descr>
  <capability label="Capability" type="Bridge" enabled="on" />
  <capability label="Capability" type="Router" enabled="on" />
 </chassis>
 <port label="Port">
  <id label="PortID" type="local">608</id>
  <descr label="PortDescr">ge-0/0/3.0</descr>
  <mfs label="MFS">1514</mfs>
  <auto-negotiation label="PMD autoneg" supported="no" enabled="yes">
   <advertised label="Adv" type="10Base-T" hd="no" fd="yes" />
   <current label="MAU oper type">unknown</current>
  </auto-negotiation>
 </port>
 <vlan label="VLAN" vlan-id="666" pvid="yes">DFW01-PROVISIONING</vlan>
 <lldp-med label="LLDP-MED">
  <device-type label="Device Type">Network Connectivity Device</device-type>
  <capability label="Capability" type="Capabilities" />
 </lldp-med>
</interface>


    lldpctl outputs network info in XML

                                                                                     44
Breakin




Stress testing framework



                           45
Breakin

     • Standard tools
     • LINPACK
     • Extensible
     • Bash scripts



                        46
Invisible Touch

    Firmware
    Configure BIOS
    Set up BMC
    Inventory
    Stress testing
    Network config




                     47
Collins
• Asset management system in Scala
• REST API
• Client libraries in Ruby, Python and Bash
• Shell tool for scripting and automation
• Callback system for hooking into events
• Granular permissions model
• Flexible web and API based provisioning
• Remote power management
• IP Address allocation and management
• Distributed mode for spanning data centers
                                               48
Collins Docs




               49
Collins Search




                 50
Collins Asset Details




                        51
Collins Provisioning




                       52
Phil

• iPXE dispatcher
• Kickstart generator
• Light Ruby app
• Collins API client



                        53
Server Intake Workflow

   1. Rack and stack
   2. Power on
   3. Enter physical data




                            54
Server Intake Process
1. Server boots iPXE via DHCP/PXE
2. iPXE gets config from Phil
3. Phil sends Invisible Touch
4. IT updates firmware (if needed)
5. IT configures BIOS
6. IT configures BMC
7. IT uploads inventory data to Collins
8. IT starts stress tests
9. IT powers down server



                                          55
Provisioning Workflow

1. Search Collins
2. Choose Profile, Role, Pool
3. Click button




                               56
Provisioning Process
1. Server boots iPXE via DHCP/PXE
2. iPXE gets config from Phil
3. Phil sends install image
4. Install image gets Kickstart from Phil
5. Install runs Puppet in %post
6. End of %post calls back to Collins
7. Collins triggers vlan update
8. Collins triggers monitoring/trending
9. Added to production if “all green”


                                            57
Result




          Fast, scalable, no hassle provisioning!

http://www.flickr.com/people/mc4army/
                                                    58
Hurdles




http://www.flickr.com/people/ligynnek/
                                                  59
Hurdles


                                          PXE kickstart w/ multiple NICs
                                          Network set up in %post
                                          Virident SSD set up in %post




http://www.flickr.com/people/ligynnek/
                                                                           60
PXE Kickstart / Multiple NICs
                     Phil iPXE config
     initrd <%= os_install_url %>/images/initrd.img
     kernel <%= os_install_url %>/images/vmlinuz ip=dhcp ksdevice=${mac}




                Phil kickstart snippet
     # network
     network --bootproto=dhcp




                                                                           61
%post Network Set Up
                Phil kickstart snippet
# Bond Interface: <%= bond.name %>

cat > /etc/sysconfig/network-scripts/ifcfg-<%= bond.name %> <<EoF
DEVICE=<%= bond.name %>
BONDING_OPTS="<%= bond.options %>"
BOOTPROTO=static
IPADDR=<%= bond.address %>
NETMASK=<%= bond.netmask %>
GATEWAY=<%= bond.gateway %>
EoF




                                                                    62
%post Virident SSD Set Up
   # Start the virident daemon
   /etc/init.d/vgcd start
   # create a device node
   mknod /dev/vgca0 b 252 0
   # create a mount point
   mkdir -p /var/lib/mysql
   # create partitions
   parted -s /dev/vgca0 mklabel msdos
   parted -s /dev/vgca0 unit s mkpart primary ext2 2048 100%
   # make another device node
   mknod /dev/vgca0p1 b 252 1
   # make the filesystem
   /sbin/mkfs.xfs -f -d su=64k,sw=3 -l size=32m,su=16k /dev/vgca0p1
   # create fstab entry
   echo "/dev/vgca0p1       /var/lib/mysql   xfs    noauto     0 0" >>/etc/fstab
   # create virident config
   cat > /etc/sysconfig/vgcd.conf << EoF
   RESCAN_MD=1
   RESCAN_LVM=1
   MOUNT_POINTS="/var/lib/mysql"
   RESCAN_MOUNT=1
   EoF
   # mount the virident
   mount /var/lib/mysql




                                                                                   63
Lessons Learned

• Modularity is very important
• Hardware always has issues at scale
• Use modern Bash syntax
• 4 hour burn-in is not enough



                                        64
Yes, we’re hiring!

   Joshua Hoffman
  joshua@tumblr.com
    tumblr.com/jobs




                      65
66

More Related Content

Similar to Scalable system operations presentation

Smart Bombs: Mobile Vulnerability and Exploitation
Smart Bombs: Mobile Vulnerability and ExploitationSmart Bombs: Mobile Vulnerability and Exploitation
Smart Bombs: Mobile Vulnerability and ExploitationSecureState
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010Christopher Brown
 
Owning windows 8 with human interface devices
Owning windows 8 with human interface devicesOwning windows 8 with human interface devices
Owning windows 8 with human interface devicesNikhil Mittal
 
The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...Eric Reiche
 
DEF CON 27 - workshop - RICHARD GOLD - mind the gap
DEF CON 27 - workshop - RICHARD GOLD - mind the gapDEF CON 27 - workshop - RICHARD GOLD - mind the gap
DEF CON 27 - workshop - RICHARD GOLD - mind the gapFelipe Prado
 
Continuous delivery applied
Continuous delivery appliedContinuous delivery applied
Continuous delivery appliedMike McGarr
 
Demystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels CampDemystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels CampAndré Baptista
 
Mobile for PHP developers
Mobile for PHP developersMobile for PHP developers
Mobile for PHP developersIvo Jansch
 
5 cro tools that i can't live without
5 cro tools that i can't live without5 cro tools that i can't live without
5 cro tools that i can't live withoutCraig Sullivan
 
Continuous Delivery Applied (AgileDC)
Continuous Delivery Applied (AgileDC)Continuous Delivery Applied (AgileDC)
Continuous Delivery Applied (AgileDC)Mike McGarr
 
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021Teemu Tiainen
 
Php Development In The Cloud
Php Development In The CloudPhp Development In The Cloud
Php Development In The CloudIvo Jansch
 
Made for Each Other: Microservices + PaaS
Made for Each Other: Microservices + PaaSMade for Each Other: Microservices + PaaS
Made for Each Other: Microservices + PaaSVMware Tanzu
 
10 Things You Probably Didn't Know About Plone
10 Things You Probably Didn't Know About Plone10 Things You Probably Didn't Know About Plone
10 Things You Probably Didn't Know About PloneJazkarta, Inc.
 
Platform Clouds, Containers, Immutable Infrastructure Oh My!
Platform Clouds, Containers, Immutable Infrastructure Oh My!Platform Clouds, Containers, Immutable Infrastructure Oh My!
Platform Clouds, Containers, Immutable Infrastructure Oh My!Stuart Charlton
 
Metasploitation part-1 (murtuja)
Metasploitation part-1 (murtuja)Metasploitation part-1 (murtuja)
Metasploitation part-1 (murtuja)ClubHack
 
2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the TrenchesChris Dagdigian
 
Continuous delivery applied (DC CI User Group)
Continuous delivery applied (DC CI User Group)Continuous delivery applied (DC CI User Group)
Continuous delivery applied (DC CI User Group)Mike McGarr
 

Similar to Scalable system operations presentation (20)

Smart Bombs: Mobile Vulnerability and Exploitation
Smart Bombs: Mobile Vulnerability and ExploitationSmart Bombs: Mobile Vulnerability and Exploitation
Smart Bombs: Mobile Vulnerability and Exploitation
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010
 
Owning windows 8 with human interface devices
Owning windows 8 with human interface devicesOwning windows 8 with human interface devices
Owning windows 8 with human interface devices
 
The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...
 
Lrug
LrugLrug
Lrug
 
DEF CON 27 - workshop - RICHARD GOLD - mind the gap
DEF CON 27 - workshop - RICHARD GOLD - mind the gapDEF CON 27 - workshop - RICHARD GOLD - mind the gap
DEF CON 27 - workshop - RICHARD GOLD - mind the gap
 
Continuous delivery applied
Continuous delivery appliedContinuous delivery applied
Continuous delivery applied
 
Demystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels CampDemystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels Camp
 
Mobile for PHP developers
Mobile for PHP developersMobile for PHP developers
Mobile for PHP developers
 
5 cro tools that i can't live without
5 cro tools that i can't live without5 cro tools that i can't live without
5 cro tools that i can't live without
 
Continuous Delivery Applied (AgileDC)
Continuous Delivery Applied (AgileDC)Continuous Delivery Applied (AgileDC)
Continuous Delivery Applied (AgileDC)
 
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
Zero Trust And Best Practices for Securing Endpoint Apps on May 24th 2021
 
Php Development In The Cloud
Php Development In The CloudPhp Development In The Cloud
Php Development In The Cloud
 
Made for Each Other: Microservices + PaaS
Made for Each Other: Microservices + PaaSMade for Each Other: Microservices + PaaS
Made for Each Other: Microservices + PaaS
 
10 Things You Probably Didn't Know About Plone
10 Things You Probably Didn't Know About Plone10 Things You Probably Didn't Know About Plone
10 Things You Probably Didn't Know About Plone
 
Platform Clouds, Containers, Immutable Infrastructure Oh My!
Platform Clouds, Containers, Immutable Infrastructure Oh My!Platform Clouds, Containers, Immutable Infrastructure Oh My!
Platform Clouds, Containers, Immutable Infrastructure Oh My!
 
What is this cloud thing?
What is this cloud thing?What is this cloud thing?
What is this cloud thing?
 
Metasploitation part-1 (murtuja)
Metasploitation part-1 (murtuja)Metasploitation part-1 (murtuja)
Metasploitation part-1 (murtuja)
 
2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the Trenches
 
Continuous delivery applied (DC CI User Group)
Continuous delivery applied (DC CI User Group)Continuous delivery applied (DC CI User Group)
Continuous delivery applied (DC CI User Group)
 

More from james tong

数据库系统设计漫谈
数据库系统设计漫谈数据库系统设计漫谈
数据库系统设计漫谈james tong
 
Migrating from MySQL to PostgreSQL
Migrating from MySQL to PostgreSQLMigrating from MySQL to PostgreSQL
Migrating from MySQL to PostgreSQLjames tong
 
Oracle 性能优化
Oracle 性能优化Oracle 性能优化
Oracle 性能优化james tong
 
Cap 理论与实践
Cap 理论与实践Cap 理论与实践
Cap 理论与实践james tong
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...james tong
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentationjames tong
 
The right read optimization is actually write optimization
The right read optimization is actually write optimizationThe right read optimization is actually write optimization
The right read optimization is actually write optimizationjames tong
 
My sql ssd-mysqluc-2012
My sql ssd-mysqluc-2012My sql ssd-mysqluc-2012
My sql ssd-mysqluc-2012james tong
 
Evaluating ha alternatives my sql tutorial2
Evaluating ha alternatives my sql tutorial2Evaluating ha alternatives my sql tutorial2
Evaluating ha alternatives my sql tutorial2james tong
 
Troubleshooting mysql-tutorial
Troubleshooting mysql-tutorialTroubleshooting mysql-tutorial
Troubleshooting mysql-tutorialjames tong
 
Understanding performance through_measurement
Understanding performance through_measurementUnderstanding performance through_measurement
Understanding performance through_measurementjames tong
 
我对后端优化的一点想法 (2012)
我对后端优化的一点想法 (2012)我对后端优化的一点想法 (2012)
我对后端优化的一点想法 (2012)james tong
 
设计可扩展的Oracle应用
设计可扩展的Oracle应用设计可扩展的Oracle应用
设计可扩展的Oracle应用james tong
 
我对后端优化的一点想法.pptx
我对后端优化的一点想法.pptx我对后端优化的一点想法.pptx
我对后端优化的一点想法.pptxjames tong
 
Enqueue Lock介绍.ppt
Enqueue Lock介绍.pptEnqueue Lock介绍.ppt
Enqueue Lock介绍.pptjames tong
 
Oracle数据库体系结构简介.ppt
Oracle数据库体系结构简介.pptOracle数据库体系结构简介.ppt
Oracle数据库体系结构简介.pptjames tong
 
Cassandra简介.ppt
Cassandra简介.pptCassandra简介.ppt
Cassandra简介.pptjames tong
 

More from james tong (17)

数据库系统设计漫谈
数据库系统设计漫谈数据库系统设计漫谈
数据库系统设计漫谈
 
Migrating from MySQL to PostgreSQL
Migrating from MySQL to PostgreSQLMigrating from MySQL to PostgreSQL
Migrating from MySQL to PostgreSQL
 
Oracle 性能优化
Oracle 性能优化Oracle 性能优化
Oracle 性能优化
 
Cap 理论与实践
Cap 理论与实践Cap 理论与实践
Cap 理论与实践
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentation
 
The right read optimization is actually write optimization
The right read optimization is actually write optimizationThe right read optimization is actually write optimization
The right read optimization is actually write optimization
 
My sql ssd-mysqluc-2012
My sql ssd-mysqluc-2012My sql ssd-mysqluc-2012
My sql ssd-mysqluc-2012
 
Evaluating ha alternatives my sql tutorial2
Evaluating ha alternatives my sql tutorial2Evaluating ha alternatives my sql tutorial2
Evaluating ha alternatives my sql tutorial2
 
Troubleshooting mysql-tutorial
Troubleshooting mysql-tutorialTroubleshooting mysql-tutorial
Troubleshooting mysql-tutorial
 
Understanding performance through_measurement
Understanding performance through_measurementUnderstanding performance through_measurement
Understanding performance through_measurement
 
我对后端优化的一点想法 (2012)
我对后端优化的一点想法 (2012)我对后端优化的一点想法 (2012)
我对后端优化的一点想法 (2012)
 
设计可扩展的Oracle应用
设计可扩展的Oracle应用设计可扩展的Oracle应用
设计可扩展的Oracle应用
 
我对后端优化的一点想法.pptx
我对后端优化的一点想法.pptx我对后端优化的一点想法.pptx
我对后端优化的一点想法.pptx
 
Enqueue Lock介绍.ppt
Enqueue Lock介绍.pptEnqueue Lock介绍.ppt
Enqueue Lock介绍.ppt
 
Oracle数据库体系结构简介.ppt
Oracle数据库体系结构简介.pptOracle数据库体系结构简介.ppt
Oracle数据库体系结构简介.ppt
 
Cassandra简介.ppt
Cassandra简介.pptCassandra简介.ppt
Cassandra简介.ppt
 

Recently uploaded

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Scalable system operations presentation

  • 1. Scalable System Operations Joshua Hoffman 1
  • 2. About This Talk • Set of principles • Software • Operations Engineering • Techniques • Tumblr project • Example code • Server management • Best practices • Massively automated • Open source 2
  • 4. About Me • 1995: CompUSA Intro to The Internet • 2000: Guru Labs Sun, Cisco, Red Hat • 2002: Red Hat Sys admin courseware 4
  • 5. About Me • 2004: Fortress Systems Anti-spam/malware • 2005: Red Hat Virtualization cert Remote learning Defined “cloud” • 2011: Tumblr Lead Systems Eng 5
  • 6. The Problem Deploying new servers is very repetitive and slow. (and we hate that) http://www.flickr.com/people/mc4army/ 6
  • 7. The Job We Don’t Want http://www.flickr.com/people/redjar/ 7
  • 9. Install OS Firmware Configure OS Configure BIOS Install software Set up BMC Configure software Inventory Add to DNS Stress testing Add to monitoring Network config Add to trending 9
  • 10. The Goal Is Clear 10
  • 12. Time To Strategize Use open source? Which? Buy software? Which? Write software? Mix and match? http://www.flickr.com/people/irishwildcat/ 12
  • 13. The Choice Principle The time to make a decision is a function of the possible choices. http://www.flickr.com/people/3059349393/ 13
  • 15. Rapid Software Research 1. Define 2. Gather 3. Disqualify 4. Rank 15
  • 17. Rank • Modularity • Compliance • Novelty • Disruption http://www.flickr.com/people/nirak/ 17
  • 18. My Requirements • Asset inventory • State management • Robust API • Event triggers 18
  • 19. My Requirements • Modular • Flexible • Extensible • Fast 19
  • 20. My Requirements Manage physical hardware as easily as virtual machines. 20
  • 21. The Usual Suspects • Cobbler • Foreman • Satellite • Orchestra • Racktables • Clusto 21
  • 22. But Wait! 22
  • 23. Data Entry “Just import the data supplied by the hardware vendor...” http://www.flickr.com/people/mwichary/ 23
  • 24. Missing Requirements Firmware Configure BIOS Set up BMC Inventory Stress testing Network config Add to monitoring Add to trending 24
  • 25. We have to write software! http://www.flickr.com/people/argen/ 25
  • 26. We have to write software! • Delivery Schedule • Scope Creep • Maintenance • Documentation http://www.flickr.com/people/argen/ 26
  • 27. Tumblr Management Stack • iPXE • Invisible Touch • Collins • Phil • Kickstart • Puppet 27
  • 28. The Glue Principle Unix Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do. http://www.flickr.com/people/kodomut/ 28
  • 29. The Standards Principle The nice thing about standards is that you have so many to choose from. -Andrew Tanenbaum http://www.flickr.com/people/usfwssoutheast/ 29
  • 30. The Simplicity Principle Unix Rule of Simplicity: Design for simplicity; add complexity only where you must. http://www.flickr.com/people/haldanemartin/ 30
  • 31. The 3:00 AM Principle It must be obvious to someone woken up from a sound sleep at 3:00 am. http://www.flickr.com/people/joi/ 31
  • 32. The Don’t Break The OS Principle The software should NOT prevent the OS from working as expected. http://www.flickr.com/photos/philmanker/ 32
  • 33. The Amnesia Principle Given enough time, you WILL forget why you did that. http://www.flickr.com/people/zach_a/ 33
  • 34. Tumblr Management Stack • iPXE • Invisible Touch • Collins • Phil • Kickstart • Puppet 34
  • 36. Why not pxelinux? • TFTP • Flat files http://www.flickr.com/people/digiart2001/ 36
  • 38. iPXE • HTTP, FTP, iSCSI • Scriptable • Variables • Dynamic http://www.flickr.com/people/chiotsrun/ 38
  • 39. ISC DHCP For iPXE # subnet for the provisioning vlan subnet <%= subnet %> netmask <%= netmask %> {     option domain-name "<%= option_domain_name %>";     option routers <%= option_routers %>;     option domain-name-servers <%= option_dns_servers.map{|i| "#{i}"}.join(", ") -%>;     option subnet-mask <%= option_subnet_mask %>;     default-lease-time 21600;     max-lease-time 43200;     range <%= range_start %> <%= range_end %>;     # If a pxe request comes in from ipxe send the config url     if exists user-class and option user-class = "iPXE" {         filename "<%= ipxe_config_url %>"; # http://foo.example.com/ipxe/${net0/mac}     # For all other pxe requests send ipxe     } else {         next-server <%= next_server %>; # tftp server         filename "<%= filename %>"; # path to ipxe binary on tftp server     } } } 39
  • 40. Fedora LiveCD Tools lang en_US.UTF-8 keyboard us timezone US/Eastern auth --useshadow --enablemd5 selinux --enforcing firewall --disabled repo --name=centos --baseurl=http://127.0.0.1/pub/repo/centos/os/6.2 repo --name=infra --baseurl=http://127.0.0.1/pub/repo/infra/6.2 repo --name=epel --baseurl=http://127.0.0.1/repo/epel/6/x86_64/ %packages --excludedocs @core dracut dracut-kernel device-mapper device-mapper-event %end 40
  • 41. Invisible Touch Kickstart # Invisible Touch Live OS image %include centos-6.2-livecd-minimal.ks %packages --excludedocs it %end %post cat > /etc/issue <<EoF Invisible Touch Live OS v0.0.4 Kernel r EoF # set ipmi to start at boot up /sbin/chkconfig ipmi on # configure rsyslog cat >> /etc/rsyslog.conf <<EoF # invisible touch local0.* /var/log/it.log local0.* /dev/tty7 EoF %end 41
  • 42. Invisible Touch Utilities • lshw • lldpd • Breakin • ipmitool • Bash scripts 42
  • 43. lshw <node id="disk:1" claimed="true" class="disk" handle="SCSI:04:00:01:00"> <description>ATA Disk</description> <product>ST91000640NS</product> <vendor>Seagate</vendor> <physid>0.1.0</physid> <businfo>scsi@4:0.1.0</businfo> <logicalname>/dev/sdf</logicalname> <dev>8:80</dev> <version>n/a</version> <serial>9XG0ETB8</serial> <size units="bytes">1000204886016</size> <configuration> <setting id="ansiversion" value="5" /> <setting id="signature" value="000e1763" /> </configuration> <capabilities> <capability id="partitioned">Partitioned disk</capability> <capability id="partitioned:dos">MS-DOS partition table</capability> </capabilities> </node> lshw generates hardware info XML 43
  • 44. lldpd <interface label="Interface" name="eth0" via="LLDP" rid="1" age="0 day, 00:01:03"> <chassis label="Chassis"> <id label="ChassisID" type="mac">78:19:f7:88:60:c0</id> <name label="SysName">core01.dfw01</name> <descr label="SysDescr">Juniper Networks, Inc. ex4500-40f</descr> <capability label="Capability" type="Bridge" enabled="on" /> <capability label="Capability" type="Router" enabled="on" /> </chassis> <port label="Port"> <id label="PortID" type="local">608</id> <descr label="PortDescr">ge-0/0/3.0</descr> <mfs label="MFS">1514</mfs> <auto-negotiation label="PMD autoneg" supported="no" enabled="yes"> <advertised label="Adv" type="10Base-T" hd="no" fd="yes" /> <current label="MAU oper type">unknown</current> </auto-negotiation> </port> <vlan label="VLAN" vlan-id="666" pvid="yes">DFW01-PROVISIONING</vlan> <lldp-med label="LLDP-MED"> <device-type label="Device Type">Network Connectivity Device</device-type> <capability label="Capability" type="Capabilities" /> </lldp-med> </interface> lldpctl outputs network info in XML 44
  • 46. Breakin • Standard tools • LINPACK • Extensible • Bash scripts 46
  • 47. Invisible Touch Firmware Configure BIOS Set up BMC Inventory Stress testing Network config 47
  • 48. Collins • Asset management system in Scala • REST API • Client libraries in Ruby, Python and Bash • Shell tool for scripting and automation • Callback system for hooking into events • Granular permissions model • Flexible web and API based provisioning • Remote power management • IP Address allocation and management • Distributed mode for spanning data centers 48
  • 53. Phil • iPXE dispatcher • Kickstart generator • Light Ruby app • Collins API client 53
  • 54. Server Intake Workflow 1. Rack and stack 2. Power on 3. Enter physical data 54
  • 55. Server Intake Process 1. Server boots iPXE via DHCP/PXE 2. iPXE gets config from Phil 3. Phil sends Invisible Touch 4. IT updates firmware (if needed) 5. IT configures BIOS 6. IT configures BMC 7. IT uploads inventory data to Collins 8. IT starts stress tests 9. IT powers down server 55
  • 56. Provisioning Workflow 1. Search Collins 2. Choose Profile, Role, Pool 3. Click button 56
  • 57. Provisioning Process 1. Server boots iPXE via DHCP/PXE 2. iPXE gets config from Phil 3. Phil sends install image 4. Install image gets Kickstart from Phil 5. Install runs Puppet in %post 6. End of %post calls back to Collins 7. Collins triggers vlan update 8. Collins triggers monitoring/trending 9. Added to production if “all green” 57
  • 58. Result Fast, scalable, no hassle provisioning! http://www.flickr.com/people/mc4army/ 58
  • 60. Hurdles PXE kickstart w/ multiple NICs Network set up in %post Virident SSD set up in %post http://www.flickr.com/people/ligynnek/ 60
  • 61. PXE Kickstart / Multiple NICs Phil iPXE config initrd <%= os_install_url %>/images/initrd.img kernel <%= os_install_url %>/images/vmlinuz ip=dhcp ksdevice=${mac} Phil kickstart snippet # network network --bootproto=dhcp 61
  • 62. %post Network Set Up Phil kickstart snippet # Bond Interface: <%= bond.name %> cat > /etc/sysconfig/network-scripts/ifcfg-<%= bond.name %> <<EoF DEVICE=<%= bond.name %> BONDING_OPTS="<%= bond.options %>" BOOTPROTO=static IPADDR=<%= bond.address %> NETMASK=<%= bond.netmask %> GATEWAY=<%= bond.gateway %> EoF 62
  • 63. %post Virident SSD Set Up # Start the virident daemon /etc/init.d/vgcd start # create a device node mknod /dev/vgca0 b 252 0 # create a mount point mkdir -p /var/lib/mysql # create partitions parted -s /dev/vgca0 mklabel msdos parted -s /dev/vgca0 unit s mkpart primary ext2 2048 100% # make another device node mknod /dev/vgca0p1 b 252 1 # make the filesystem /sbin/mkfs.xfs -f -d su=64k,sw=3 -l size=32m,su=16k /dev/vgca0p1 # create fstab entry echo "/dev/vgca0p1 /var/lib/mysql xfs noauto 0 0" >>/etc/fstab # create virident config  cat > /etc/sysconfig/vgcd.conf << EoF RESCAN_MD=1 RESCAN_LVM=1 MOUNT_POINTS="/var/lib/mysql" RESCAN_MOUNT=1 EoF # mount the virident mount /var/lib/mysql 63
  • 64. Lessons Learned • Modularity is very important • Hardware always has issues at scale • Use modern Bash syntax • 4 hour burn-in is not enough 64
  • 65. Yes, we’re hiring! Joshua Hoffman joshua@tumblr.com tumblr.com/jobs 65
  • 66. 66