2. Please switch your mobile phones to silent
No fire alarms scheduled. In the event of an
alarm, please follow directions of NCC staff
15:20 - 16:00  Lightning talks
16:30 - 17:30  Birds of a feather sessions
19:30  Dinner (now full) - entrance via Goldsmith Street
3. Campus disasters – Are you ready for Storm Desmond 2, too?
Richard du Feu, Lancaster University
4. About Lancaster University
»Campus based
»On top of a hill (immune
from flooding?)
»Fibre rich
»Up to 50% of power
generated on site (CHP,
2.3 MW wind turbine)
»7,000 students living
on campus
12/04/2017 Campus disasters - Are you ready for Storm Desmond 2, too?
Campus from the air (with power) – Chad Conway
5. Overview
»Emergency planning
»Storm Desmond
› Situation
› Initial response
› Short term efforts (until power restoration)
› Post-incident
»Longer term developments
6. Planning – Project Hydra
»Two emergency planning
exercises over the last decade
»Not always taken seriously by
many on the ground
»All exercises ended after 12 hours
»Weaknesses identified and some
rectification steps taken
7. Planning – Testing backup power
»Regular data centre
generator testing
»Annual power failure tests
to data centre backup
power systems
»UPS calibration runs aimed for
every 6 months, in reality
annually around Christmas
8. 5th December 2015 - Storm Desmond
» Up to 340mm of rain in 24 hours in
the Lune catchment
» 1600 cubic metres per second
(Olympic swimming pool in 1.5
seconds)
» Substation for Lancaster flooded
» Flood defences designed for 1 in 100
year flood
» 61,000 homes without power for
c48 hours
» Lancaster University without power
for 4 days
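The bracketed swimming-pool comparison above is easy to verify (a sketch in Python; the 2,500 m³ Olympic-pool volume is an assumed nominal figure, not from the slides):

```python
# Rough check of the claim that 1,600 m^3/s fills an Olympic pool in
# about 1.5 seconds. Pool volume is an assumed nominal figure:
# 50 m x 25 m x 2 m = 2,500 m^3.
pool_volume_m3 = 50 * 25 * 2
peak_flow_m3_per_s = 1600

seconds_per_pool = pool_volume_m3 / peak_flow_m3_per_s
print(f"{seconds_per_pool:.2f} s per Olympic pool")  # 1.56 s per Olympic pool
```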
9. Immediate effects
» Power to the University went off at
22:45 on Saturday 5th December
» Students generally left in their rooms
until morning - if they're asleep
it's not a problem
» Water ran out (it's pumped, but
students were OK with that)
» Sewage stopped flowing (it's pumped,
but students were OK with that!)
» UPS batteries went flat, WiFi stopped…
21st century Maslow
10. Initial response
»Emergency management team
(EMT) called
»The EMT's space had generator-backed
networking, but no power
to the rest of the building!
»Decant space for students is generally
on multimode fibre fed from
distribution switches on UPSs – no
networking in the emergency space
11. Short term
»EMT had not planned for
emergencies lasting longer
than 12 hours
»Mobile cell batteries ran out
»Campus radio repeater needed
moving due to battery issues
»Zero communication available
away from a small number of
buildings with power and
POTS phones
12. Communications
»Limited WiFi
»Very limited 3G
»No (useful) radio
»How to coordinate people to
tackle problems as they
become apparent?
› IM is the way forward, however
which one - Skype, Jabber, Facebook
Messenger?
13. Magic yellow boxes of internet
»For various events we have
Wireless APs mounted in
waterproof cases
»Ideal for getting wireless
outside buildings with power
and networking
»Point-to-point radio links allow
boxes to be generator powered
14. Power restoration to data centres
»Data centres are generator
backed up
»No loss of service on loss of
grid power (the plan worked!)
»Transition from generator back
to mains appeared to go OK…
until the UPS went flat
»A sticky switch that had failed in
tests (but was OK on retry) failed
15. Power restoration
»UPSs
»Type B breakers
»Client authentication
»Failures at power on
»BMS/Air conditioning
»Leaks
How well does your network
restore?
16. Other things to be aware of
»In a crisis everyone is busy –
help them out, particularly the
porters and security team
»Maglocks!
»ACLs and your NOC
»Single laptop screen makes
much of the response difficult
»Overwhelming of monitoring
17. Summary
»Test generators and UPSs
»Have emergency exercises
› Limit them to likely scenarios
»Put in enough SM fibre
› 1 core per 12 data lines?
»Control costs
»NOC generator backed up.
»In a crisis be prepared to
be flexible
»If you need to fix something,
go incognito
»Make sure your decant space is
known and low on SPOFs
»Post incident if you offer a
solution it will be taken up…
»The window for money is very
small (days)
»It’s all about the WiFis.
18. Are we ready for Desmond 2?
»Most access switches
connected directly to
generator backed up locations
»Increased fibre count in
specification
»Replaced Campus radio system
to be digital and more resilient
»Are we ready? No. Will it be
less painful? Yes.
19. Thanks for listening;
Any questions?
22. WHY IS MANAGING WI-FI IMPORTANT:
MISSION CRITICAL
• 2015 HE survey: mission critical 72.5%
• Weapons-grade Wi-Fi
• A standalone service, not an afterthought to your wired network
GROWTH
• Cablecom found 92% of students with 2 devices, as high as 6
• SAP: at the end of 2013 there were more mobile devices than people
• iPass: 888% Wi-Fi growth worldwide since 2013
EXPECTATIONS
• As speeds increase, so will expectations
• Wi-Fi is the next utility - PERVASIVE
• Five 9s availability
• Many devices don't come with an Ethernet port now!
STRATEGIC RESOURCE
• NSS scores - it is important to students
• Young people (that's your students) prefer the internet to daylight, hot water and sleep!
• Critical to enable mobile working - BYOD
23. SOUND FAMILIAR:
Inherited a Wi-Fi network
• No design methodology
• No documentation
• How do you solve these issues…
Implementing a new Wi-Fi network
• No standards to work from
• No usage/requirements
The all-wireless network
• Why do we need wires? It's the 21st century, for heaven's sake
• 50 users, high-speed transfers please
Gigabit Wi-Fi
• Why can't I transfer at gigabit speeds?
24. YOU WILL ENCOUNTER SOME OF THESE:
• SSID overhead
• Bandwidth steering
• Wired cameras, BMS, access control?
• Co-channel interference
• Good coverage, but performance is awful
• We want to go all wireless
• I don't understand what you're telling me
• ROI justification
• Just add more APs for better wireless
• Multiple controllers
25. KEY THEME 1 – TALK A LANGUAGE PEOPLE CAN UNDERSTAND
Why is Wi-Fi so difficult to communicate?
• Lack of understanding
• Common language
• More complex than wired
Vendor truth
• What's on the box is rarely in the tin
• Stretching the truth
• Get your vendor/partner in to present - make them earn your money
Understanding
• Ubiquitous Wi-Fi is not broadband
• 802.11ac does not mean gigabit networking
Training
• You and your engineers need to understand the technology to explain it
• When budgets are tight, training goes first - resist
• CWNA - CWNP - vendor neutral
How to throw coffee accurately
• Try throwing coffee so it lands in the same place
• Now try with people moving in front
• Now make sure everyone gets enough to drink from one cup
26. KEY THEME 2 - UNDERSTAND YOUR ENVIRONMENT
Visual
• Internal environment
• External environment - what's around you?
Stakeholders
• Engage with stakeholders (MAP THEM)
• Challenge stakeholders
• Questionnaires
Usage
• Patterns inform your design
• Monitoring/management tools
• VoIP, video, RTLS
Peer support
• Other institutions
• JISC
• Local council - opportunities?
Vendor
• Direct engagement with the vendor
• Best practice documentation
• Architecture & roadmaps
Experts
• Engage with a specialist designer/surveyor
• Full surveys - NOT JUST COVERAGE
27. KEY THEME 3 – SURVEY/DESIGN
NATURE VS NURTURE
Stakeholders: academic, business, student, external business partner, designer
KNOWLEDGE
• Informs your design
• Informs your vendor selection
• Comes from understanding your environment
Architecture
• It's not just your front-end design
• Get your architecture right - 14 controllers
• Design for VoIP, video, RTLS
Wireless SLA
• Do you have one?
• Survey/design guide - do you have one?
• What's the relationship?
Design for capacity
• Green means nothing
• In the absence of numbers, assume high usage
Function over form
• If the stakeholder overrides the design - get it in writing
• If the project cuts corners - get it in writing
Set your deliverables
• Can form a KPI
• A benchmark for success
• Demonstrates ROI
28. KEY THEME 4 - VENDOR SELECTION:
Informed design
• Any vendor will work (almost)
• As long as it's standards based
• "Ours is better - theirs doesn't work" is a myth
Requirements/information gathering
• Your vendor informs your design
• Critical to leverage all features
• Road mapping
What do you want from wireless?
• Features - which vendor says they can deliver?
• Compare features - you didn't know you wanted that
• Don't forget the "vision": tools, management, integration!
The real story
• Look for non-vendor use cases
• Peer institutions will give you the true story
• On one visit set up by a vendor I was told "switching is good, Wi-Fi - don't touch it!"
29. KEY THEME 5 - FINANCIALS
Wired vs wireless
• 25 yr. investment
• Full control
• Faster
Project sign-off
• Cutting corners? Put it in writing - it's not a warranted design
• Specify the long-term benefits
Training & staffing
• You have an X million service; you need X to leverage it
• None of this happens without knowledge
• Making the case for a specialist
Savings
• Good designs save money: 62 APs vs 144 APs
• Average wasted APs = 20% in older installations, as high as 32%!
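The savings point is simple arithmetic. A hedged sketch: only the 62 vs 144 AP counts come from the slide; the per-AP installed cost is an assumed illustrative figure.

```python
# Illustrative cost of over-provisioning APs. AP_INSTALLED_COST is an
# assumed figure (GBP per AP including cabling and licensing); the two
# design sizes are the slide's 62-AP surveyed design vs a 144-AP one.
AP_INSTALLED_COST = 600
naive_design, surveyed_design = 144, 62

aps_avoided = naive_design - surveyed_design
print(f"APs avoided: {aps_avoided}")                              # APs avoided: 82
print(f"Capital saved: ~GBP {aps_avoided * AP_INSTALLED_COST:,}") # Capital saved: ~GBP 49,200
```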
33. LSE’s Campus Refresh:
It’s not just about the tin
Networkshop 45, 12/04/2017
Campus Networking
Ed Spick
LSE Network Manager
Matt Bernstein
LSE Senior Network Architect
34. Contents
• About LSE and its Campus Network
(Ed)
• Reshaping team and environment (Ed)
• Refreshing tin and topology (Matt)
• Looking to the future (Matt)
35. About
• est. in 1895
“for the betterment of society”
• 200+ public events per year
• 18 Nobel Laureates
amongst its Alumni
• 2nd in the world (QS 2017)
for the Social Sciences
• 10,800 students
• 5,000 undergraduate
• 5,800 graduate
• 3,300 staff
36. LSE’s Campus
• 40 buildings in Central London
• Students from over 150 countries
• Major Capital Development projects
• 10 Halls of Residence (out of scope)
41. Challenges of Team Growth
• Tuckman’s stages cycling
• Growing out of small team mindset
• Having a Mythical Man-Month problem?
• Brooks’ Law of ramp-up time or “operational drag”
Forming - Storming - Norming - Performing - ?
43. The same
facility after:
• Extensive planning
• 8 weekends’ migration
• Cabinet replacement
• Estates coordination
• Air conditioning
44. Changes in
campus topology
• November 2011
• 4½ core locations
• Some resilience
• Laser / WiFi links
• Lots of SPoFs
• DCs on campus
network
• Little evidence of
design
[Diagram: November 2011 topology – 4½ core locations (CS-R, CS-V6, CS-A, CS-TC2, CS-LTC, CS-V2, CS-S) with a mix of 100Mbps, 1Gbps and 10G links, an 802.11a wireless link and a PacketShaper on the LMN links to ULCC and KCL; halls, Library, Towers and other campus buildings hang off single links, with several marked SPoFs and one spare unused cable]
45. Changes in
campus topology
• April 2017
• Two core locations
• Air-gapped DCs
• Regular pattern
emerging
• 10Gb/s nearly everywhere
[Diagram: April 2017 topology – Core-2 and Core-3 pairs split across STC and TW2 (VSL, Nexus FabricPath and chassis cluster links); 10Gb/s links to Telecity Powergate and Imperial College; halls (NH, GH, LK, BW, SW, HH, BS, RB, CS, PF) and campus buildings (95A, 1KW, PAR, KSW, PEA, 50L, NAB, 32L, LRB, LCH, LAK, KGS, SHF, SAR, SAW, CLM, COW, OLD, ALD, STC, TW2, CON, COL, QUE, Security Lodge) attached in a regular pattern]
46. Campus Core
• 6509 VS4O pair retained
• for WiSMs + L2 buildings
• C6807 Sup6T VS4O new core
• pure L3
• largest attached network is a /31
• Lots of interfaces; can connect
30 buildings at 4 x 10GbE
• 2 new core locations
47. Firewall &
Janet links
[Diagram: 2011 firewall and Janet links – cores CS-S, CS-A, CS-V2 and CS-V6 with S-A, V-A and S-V trunks; an HA pair of ISG2000 firewalls carrying VPN/DMZ; Lonman1 and Lonman2 Cisco 7206 routers with a BGP link between them; LMN links to ULCC and KCL via media convertors and a Packeteer; Maths, Finance and FMG systems outside the firewall; a Telecoms firewall bypass; Eduroam and Halls traffic; one path marked "not currently working"]
49. Consolidating the Access Layer
• C3850 does not physically fit into some of our current facilities
• L2 but IP Base licensing for Netflow, TrustSec on every access port
• “multi-Gig” interfaces (100Mb/1Gb/2.5Gb/5Gb/10Gb)
• Programmable ASIC, Cisco’s strategic platform
53. Zoning and NAC
• necessary to support multiple
“tenants” on one campus
• legal and regulatory compliance
• TrustSec driven by ISE
• Supports mobility e.g. moving
equipment around Campus
• Fundamental driver for our
Business Case
Example tenants: an EPOS system, climate change research data, CCTV, etc.
55. Lessons Learned
Challenges
• External review to make the case for change
• Managing “organic growth”
of campus network
• Supporting the team as it grows
• Coping with legacy environments
Opportunities
• Agree regular maintenance windows
with your HEI
• Reserve roles for long term planning
and investment
• Engage with Campus development projects
• Align with new product roadmaps
and reference architectures
56. Q & A and Credits
• Thanks to all of the Network Team
• Shameless plug – we’re recruiting – join us!
2x96 core single mode into every building from 2 generator backed up Data centres
Wind turbine feeds grid direct rather than into campus
Many of the 7,000 are overseas
Discovered a couple of buildings that were only connected to one data centre due to misconfigurations
Identified lack of resilient fibre to 4,500 rooms on campus (still to be resolved, 6 years on, while a new route is established)
Full exercises with fire service, building occupants involved, EMT called in and discussions as to whether a DC should be shut down.
Many felt they were too busy to be involved and had better things to do (I may have shared some of those views)
Monthly generator tests
Bypass switch and transfer of power back to mains tested annually; it usually fails, and the retest succeeds a couple of weeks later.
UPS calibration runs critical as many UPSs only mark batteries as bad when they’re gassing and the gas corrodes the main board!
Battery maintenance critical as without good batteries they’re a waste of time.
Sorry, my degree is in weather and river flow forecasting.
Taken 12 hours after the flood peak
Substation still underwater even though river has dropped by about 2m at this point
This was a 1-in-1,500-year flood based on historic data. In this case a combination of a warm December (2 degrees higher) and high pressure stalling the system over the North West caused record rainfall.
Largest discharge ever recorded in an English river
61,000 homes, 2 hospitals, 2 Universities, Sections of West Coast main line
Severe flooding further north in Cumbria limited available emergency resource
Please read the news JSD before calling every hour wanting an update!
Middle of the night gave time to establish a plan
It became apparent very quickly that power would be off for a while; students, where possible, were sent home. As many were overseas this was a challenge.
Those awake who went to see the porters were fine with no power, no water and no sewage removal; however, when the WiFi went away all hell broke loose.
Emergency systems batteries lasted long enough for it not to be a safety problem
Short-notice decant space is generally in those buildings with normally low networking demand, and consequently equipped with older equipment.
Often lacking in fibre provision.
Why multimode? Historically cheap, and genuine optics much cheaper (50%) than single mode.
The first-choice EMT space was in our building; however, the generator only covered networking equipment and no office or meeting-room space.
The EMT had no contingency planning for longer than 12 hours, so the gold commander insisted on staying until the crisis was over. If a crisis has a chance of lasting more than 12 hours, make sure a rotation system is in place.
Mobile sites are probably worse at battery maintenance than you would hope
Even if they were up, the demands of 7,000 additional mobile clients that would normally use WiFi seriously harm their usefulness.
With no WiFi, VoIP phones or 3G, communication becomes limited to radio, sneakernet, RFC 1149 and RFC 2549.
The analogue radio system, at the end of its useful life, meant only 1 channel was available campus-wide, which clearly was not enough; then the lead-acid battery powering the repeater ran low. Power was maintained by taking 100kg of batteries up 14 floors of a tower and up a vertical ladder.
Yellow jackets – go incognito otherwise you get swamped.
Before carrying a 40kg generator up 14 flights of stairs test it.
After enough service was restored to the decant space, this became the largest problem.
Broadly they suffered from the peer-to-peer searching problem: the more people that used them, the more noise there was, and the more time people spent reading stuff that was largely irrelevant.
Everyone had their own preferred IM system and insisted theirs was better, which just caused frustration and did not help with communication.
Vital to limit the number of people in an IM group to those coordinating response. Those fixing stuff need to know what they are going to fix and little else. Danger of too many cooks and not enough bottle washers
Doing a bit of research into RFC6214
Some incidents afterwards have requested magic yellow boxes of internet
While trivial it’s worth having a way to deploy wireless rapidly outdoors such as in entrance ways or natural high volume footfall places
Ours came from sporting event run every other year where WiFi in grandstands has become important
If p2p radios such as Ubiquiti links are in high places, a mobile box can be roughly pointed at the base station and just work. If really necessary it could be powered from a tiny generator, an inverter and a car battery, or directly off 48V batteries.
UPSs, after a complete drain and perhaps imperfect battery maintenance, cause issues – 6 went bang. The question is: is the lack of networking down to lack of power or lack of UPS?
We've all got things wrong that we want to fix, but most of the time it works. People love installing Type B breakers because for most things it's the correct one; UPSs and large numbers of switches really benefit from Type C.
Auth servers were swamped, and edge switches coming up before their uplinks caused clients to fall back. Some did not recover; automatic systems were not 100% effective.
Some devices failed at power-on; luckily firmware had been pushed out to some earlier and a reboot occurred. Had firmware been pushed that required a boot ROM update, there would be a danger of unclean power corrupting the flash.
Many air-conditioning units failed to restore for various reasons. Temperature sensing is vital in every comms room.
One building failed when rain was blown through the wall onto the floor above, leaking all over equipment. The UPS let its magic smoke out. This only became apparent after power was restored and the building didn't come back.
With monitoring, try not to focus on small issues like an AP or two down. Worry about the building, although that can be hard to spot in a sea of red; it's worth having different views.
Much of the time when we had no power there was relatively little to be done, because we had service where it was needed. More work was needed each time a trailer generator arrived to connect some services. Most of the time was spent ensuring a building-by-building list of things to check was ready so service could be checked quickly. As staff were told to stay away until power was restored, and most students had gone home, this may have been overkill.
The people you rely on day to day are your security team, for access to buildings or anything else. Helping them out where you can will win lots of brownie points.
Access control partially lives with networking at Lancaster. Exactly why is lost in the depths of time; it may be because when we need access to fix something we really need it, and we have quite a wide variety of skills from various projects. Some systems are maintained elsewhere, and when an architect gets involved maglocks are the weapon of choice. When these fail, buildings fail insecure, so much of security's time is taken up checking buildings are secure and using things like cable ties or padlocks to 'secure' the doors.
If you have secured your infrastructure so it's only accessible from your NOC, and your NOC is without power, life becomes pretty difficult. Have a plan B. Be sure your bastion hosts have your favourite tools available (clusterSSH, X Windows).
If you're forced out of your NOC, then trying to fix large numbers of things while watching what is going on is a major challenge with a single small laptop screen. Basically, make sure your plan B has a big screen and access to everything. I ended up with a 30m power lead from our generator-backed comms room to my desk.
Monitoring: make sure you have different hierarchical views of the infrastructure, to prevent an AP becoming a black hole for time when the router above it is really the problem.
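The hierarchical-view idea can be sketched as a small dependency-aware alert filter (a hypothetical illustration, not the monitoring system used at Lancaster; all names and the `Node` structure are invented):

```python
# Suppress alerts for devices whose upstream parent is already down, so
# an unreachable router doesn't bury you in hundreds of AP alarms.
from typing import Optional

class Node:
    def __init__(self, name: str, parent: Optional["Node"] = None):
        self.name, self.parent, self.up = name, parent, True

def actionable_alerts(nodes: list[Node]) -> list[str]:
    """Report only the topmost failed device on each path to the root."""
    alerts = []
    for n in nodes:
        if n.up:
            continue
        # Walk towards the root; a failed ancestor means this node's
        # alarm is a symptom, not the root cause - suppress it.
        p = n.parent
        while p is not None and p.up:
            p = p.parent
        if p is None:
            alerts.append(n.name)
    return alerts

core = Node("core-router")
bldg = Node("building-switch", parent=core)
aps = [Node(f"ap-{i}", parent=bldg) for i in range(3)]
bldg.up = False
for ap in aps:
    ap.up = False
print(actionable_alerts([core, bldg] + aps))  # ['building-switch']
```

The same walk, run per building, gives the "different views" the note asks for: one alarm per failed building rather than a sea of red APs.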
Test, test and test again. If something fails and works a second time try and understand why.
The exercises are a pain and disruptive to a day’s work but really valuable as long as problems identified are acted upon.
You can never have enough fibre. Allowing for 4 cores per switch gives a huge amount of future proofing, to something like 2x100GE per switch; 2 cores allows for 2x10GE on BiDi. Is that realistic for 20 years' time, the realistic life of cabling? GigE was only ratified 18 years ago and is now very much old technology for uplinks; in 18 years' time, will 100GE be old technology for switch uplinks?
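The core-count arithmetic above can be sketched as follows (a hedged illustration of the BiDi vs duplex trade-off described; the helper function is invented for illustration):

```python
# How many point-to-point uplinks a per-switch fibre allocation can
# carry: BiDi optics run a full-duplex link over 1 core, conventional
# duplex optics need a pair (2 cores).
def uplinks_supported(cores: int, cores_per_link: int) -> int:
    return cores // cores_per_link

# 2 cores per switch: 2 x 10GE using BiDi (1 core per link)
print(uplinks_supported(2, cores_per_link=1))  # 2
# 4 cores per switch: 2 uplinks (e.g. 2 x 100GE) on duplex optics
print(uplinks_supported(4, cores_per_link=2))  # 2
```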
Control costs: bringing all your switch uplinks back to data centres is not cheap, particularly on genuine optics. Third-party optics swing the economics towards a collapsed core, or distribution switches sitting in a data centre, while also allowing cheaper edge switches without using stacking.
If you can, get your NOC generator backed up.
When it all goes wrong and you’re stuck with 7,000 people needing catering for and no likely chance of power being restored don’t be surprised if the most useful thing you can do is serve cups of tea at 3am.
Go incognito – if there is no way to get information out and it looks like you know something you will not be able to move for people asking questions which you cannot answer.
Decant space will be in lightly used buildings you are loath to spend money on. It's worth treating them like your flagship building and making sure those spaces have some of the best APs out there.
In the days after a crisis many a meeting will be held. If you say 'yes, we can make that more resilient by doing x, y or z', it will be assumed you have committed to doing it.
Between those meetings, make sure you put the time into working out what you need to spend money on to help in future. Your window is under a week; if you miss it, you'll be expected to deliver without additional money.
A core and aggregation/distribution refresh is currently happening: an ideal time to consider the response to Desmond and limit the effects, moving every switch to dual uplinks with each uplink following mostly diverse fibre routes to 2 generator-backed data centres.
The economics only add up with the addition of third-party optics. Note they are cheaper, but don't go too cheap; reprogrammables give some investment protection. In fact the distribution count can drop so far that, with third-party optics, it's cheaper to have distribution in DCs rather than in buildings. Consider BiDis to keep fibre counts down.
The radio system has been updated to a multiple-repeater digital system giving 4 time slots, dynamically moved between repeaters depending on demand. You get what you pay for, and it wasn't cheap; however, given the poor communications during Desmond it's well worth it.
Are we ready? No, and believing you have all bases covered is a good route to failure. Are we more prepared than before? Definitely, which is lucky: 1-in-100-year events are calculated on historic probabilities from when winter atmospheric temperatures were lower, and the warmer the atmosphere, the more water it can hold and the more rain can fall.
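One way to make the "1-in-100-year" point concrete: the annual exceedance probability compounds over an asset's life. A worked sketch; the 30-year horizon is an assumed cabling lifetime, and it assumes independent years and a stationary climate, which the note itself questions.

```python
# Probability of seeing at least one "1-in-100-year" event over a
# multi-year horizon, assuming independent years and stationary climate.
p_annual = 1 / 100        # annual exceedance probability
years = 30                # assumed asset lifetime

p_at_least_one = 1 - (1 - p_annual) ** years
print(f"{p_at_least_one:.0%} chance over {years} years")  # 26% chance over 30 years
```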
Explain the coffee cup metaphor – communicating things to non technical people
In this presentation we wanted to talk not just about the equipment that has gone into LSE’s Campus refresh, but also about some of the other changes we had to make in the team and the environment, to support the network and its redevelopment. So I’ll be introducing you to LSE and its network, and how we reshaped the team, and Matt will talk through the changes in topology and equipment that marked our journey to what, at the time, was coined the “network of the future”.
The London School of Economics or LSE for short, was founded by 4 Fabians and it still takes its public mission to widen the social purpose of and access to academia very seriously. LSE runs over 200 public events a year, attracting the “great and the good” of the liberal elite, visiting speakers have included Nelson Mandela, the Dalai Lama and Bill Gates to name a few. It has become world renowned for the research and study of Social Sciences and boasts Nobel Laureates among its Alumni. LSE is a mid-sized institution that prioritises Quality over Quantity.
LSE is located around the Aldwych in Central London, with all the competition for space and costs that that entails. The redevelopment and expansion of the Campus has been a major part of the school’s strategy for a number of years. LSE, like many other HEIs, markets its London location to an international student body. Currently there are two major capital development projects in flight, shown by the darker colours on the map, and it is this relatively dynamic nature of the LSE estate that has provided us with some challenges and opportunities with regard to networking. LSE also has 10 Halls of Residence within walking distance of Campus, but for the purpose of our talk and our refresh project, they are out of scope.
Those 30 campus buildings around the Aldwych are connected by a dedicated fibre optic duct network, shown here by the dark lines. These were bequeathed to us by a "SPoF" project in 2010-11 which addressed Single Points of Failure in the campus network by essentially “flood wiring” the streets around Aldwych with ducts, so that most if not all LSE’s buildings could be connected by dedicated, resiliently routed fibre to two core locations. However, what we found was that several recent new build projects did not install fibre to core locations using the SPoF ducts but cut corners by connecting buildings to other campus buildings in an “organic” way. One of the other network issues LSE had was that this relatively complex data network had been supported by a team of 4 staff:
The team structure was relatively flat and focussed on support rather than development
There was a mixture of permanent, fixed term and contractor roles
And there were separate Telecoms and Data Teams, which after the School’s migration to VoIP duplicated areas of service. Putting VoIP on the network quickly showed that the network had fundamental operational problems and design issues.
Around 2012 we had a new Director who saw the issues on the Network as his first big opportunity, so we had an external audit of the network in 2013; this helped secure initial funding to fix immediate issues with the school’s Janet uplinks, firewalls and distribution network. This was followed a year later by a full external review of the network and the team, which led to a Business Case that was approved by the school’s IT Committee in 2015. This unlocked the funding to reorganise the team, procure an Access Layer and Core refresh, and a NAC and Zoning solution to address security issues on the network – more of which later.
In 2014 we began to benefit from brokering an agreement with the School for 3 day-long Maintenance Windows a year, represented by the red dots. These windows allowed us to plan major surgery on the network along with regular code and patching updates
So the new team structure now has:
Telephony support merged with Networks
2 Senior roles dedicated to Architecture and Programme Delivery
2 additional Specialist roles to develop technologies and deliver projects
2 additional junior support roles
This has provided depth and structure with different levels of specialism offering team members direction and opportunities to grow, indeed we have a vacant Specialist role that we are recruiting to now.
One of the biggest changes in LSE’s network has been this rapid growth of the team, and whether or not your team has grown in size there are some challenges all teams face when they change which may affect the support and performance of a network.
Conflicts can occur in a team whenever there are personnel changes and we went through a few “storm cycles” before we found our norm
Growing out of a small team mindset can also be difficult when established roles are challenged by newcomers, when what may have been single-person processes need to be documented for others to support, and when we need to think about the long term rather than the firefighting we may have been used to.
There is also Fred Brooks’ observation that adding resources to a (delayed software) project can actually cause it to go slower …
And finally there is the “operational drag” that recruiting, familiarising and skilling up new team members can have on your team
Beyond the team environment, we also faced physical challenges with some of our comms rooms. Some of you may recognise what happens to a comms room when a team gets too small to manage it and it grows in such a way you may think it will overwhelm you:
We had a variety of issues with some of our comms rooms which prevented us installing the new campus network, including but not limited to poor cabinet access, little or no comms room cooling (see the fans), comms racks not deep enough for modern switches, poor physical security, and a cabinet layout that led to a congested and overheated environment due to the use of overlength patch leads. Believe it or not there were over 50 switches in this comms room four years ago … So what did we do? We spoke with our colleagues in Estates who helpfully informed us that they had plans to demolish part of the facility (and the building), but after much discussion and planning they gave us 3 new walls, rear access doors and air conditioning(!) And coordinating with our cabling contractor, they replaced all the racks with proper cabinets over 8 weekends, so that we could install and migrate to the new access and distribution layer network.
With a smaller team we simply could not think about, let alone achieve, such a transformation. I hope this gives you an idea of some of the challenges we had at a team and comms room level. So I’ll hand you over now to Matt for a higher level view of the campus topology.
-> MB
These before & after photos are a fantastic way of illustrating improvements to our IT director. Of course, there's more to an enterprise network than tidy comms rooms.
-> MB – Here's the LSE campus network 5½ years ago. It's a bit like the "before" shot.
* A flat in-theory L3 network, but with L2 VLANs (including the data centres) stretching across the core
* Slow PVST and a large untuned OSPF backbone competing to cause the greater outage when a bird flies through a laser link
* 4½ core locations, each driven by single Cisco 6509 chassis
* a multi-vendor mix in distribution, usually a single switch, typically Cisco 3560G, fanning out as a single star
* several hundred ancient HP Procurve switches in the access layer, with single uplinks and the odd bit of daisy-chaining
So far we've disentangled this to ..
.. a simpler network which looks more designed, and behaves more predictably.
* phase 1 gave us 10Gb/s to Janet, phase 2 extended this to distribution with Cisco 4500X pairs
* routed campus network, no VLANs spanning buildings
* DC networks segregated at layer 1
* two core locations, using Cisco's flagship "VS4O" configuration
* we built a new core with two chassis and four sup2ts
* and then demoted it by commissioning our new sup6t-based core last year
* we call the old core "core-2"; it does a little L2, routing smaller buildings
* the new core "core-3" does no L2, and is unlikely to run out of TCAM ;)
* we put in 24 fibre pairs between core locations via diverse routes around the large hole in the middle of our campus
* our relationship with Estates has become more strategic: we've got Cap Dev arguing our case for two new purpose-built campus core locations, one in the existing central building site and one in the purple block, once LSE have demolished and rebuilt the old Cancer Research building. This will take several years to bear fruit.
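What "routed campus, no VLANs spanning buildings" means in practice can be sketched with a small, purely illustrative IOS fragment (interface names and addressing are made up): each inter-building link becomes a routed point-to-point OSPF interface, so a failure is handled by routing rather than spanning tree:

```
! Illustrative only -- interface and addressing are hypothetical.
! Distribution-facing core port: routed, not switched, so no
! VLAN (and no STP domain) spans the link between buildings.
interface TenGigabitEthernet1/1/1
 description Link to building distribution pair
 no switchport
 ip address 10.254.0.1 255.255.255.252
 ip ospf network point-to-point
```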
* This was how LSE used to connect to Janet
* Some relative horrors here, which fortunately I don’t have time to go into
* Luckily no laser links here, but all the other points about how the campus network was in 2011 apply to this critical part of the topology
* not much fibre between core locations
* Today, once again, we have something which looks designed
* don't worry too much about the detail, just enjoy the symmetry! :)
* 10Gb/s wherever it’s needed
* high resilience, with diverse fibre at the router, firewall, core-3, core-2 and DC switching layers
* fast convergence
We are replacing our legacy HPs, our C2960s from Telephony and separate switches for wireless. This month we passed the half-way stage of the deployment, with more access ports served by our new 48-port switches than all previous generations of 24-port switch.
The 3850 has Cisco’s new programmable silicon: new features like VXLAN and MPLS can be implemented in “hardware” (meaning not on the CPU) simply by releasing new software. It was initially marketed as able to control WAPs (anyone here do that?) but is now the centrepiece of Cisco’s programmable “campus fabric”.
It’s a new platform and has more than its fair share of bugs, but while we are concentrating on “speeds and feeds” we are keeping it simple, treating the switches as pure L2. We’re not running IOS 16 in production yet, but we are starting to look at some of the new functionality in our lab.
It's plumbed together with 10Gb/s everywhere. For those who like bipartite graphs, here's the L1.
C6807 / Sup6T VS4O in core
C4500X VSS pairs in Distribution
C3850 stacks in Access
..all with 10Gb/s links. Use of VSS gives us..
* a modern, single-star topology
* lots of LACP, which converges quickly
* no need for protocol timers
* no VRRP; STP only to prevent loops
* dynamic routing very simple -> network predictable
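As a sketch of what the single-star, LACP-based topology means at the access layer, here is a minimal, illustrative IOS fragment (interface numbers are hypothetical) for a 3850 stack dual-homed to a distribution VSS pair. Because the VSS pair presents itself as one logical switch, the two uplinks bundle into a single port-channel with no STP-blocked link:

```
! Illustrative only -- interface numbers are made up.
! One 10Gb/s uplink from each stack member to each VSS member,
! bundled with LACP ("mode active") into one logical uplink.
interface range TenGigabitEthernet1/1/1, TenGigabitEthernet2/1/1
 channel-group 1 mode active
!
interface Port-channel1
 description Uplink to distribution VSS pair
 switchport mode trunk
```

If a physical link or chassis fails, LACP simply drops a bundle member; convergence does not wait on spanning-tree or routing-protocol timers.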
* LSE’s students are typically affluent. The Library looks like the Apple Store, with several thousand devices connected via 70 WAPs.
In 2012 we had about 300 WAPs, nearly every one through a power injector, most of which didn’t support 802.11n but did support 802.11b
We started designing the wireless network, considering the RF
Recruited 3 students to survey the whole campus and produce coverage maps
Now most of our traffic volume is, by far, wireless
We support 802.11n everywhere, 802.11ac
* Consolidated SSIDs
* Consolidated access networks: resilient DHCP etc
* WISM 2s, now 3 x HA pairs, up for refresh in the next year or so
* Latest Cisco APs anticipating breaking the gigabit barrier
Wireless usage is always growing; 2016 dipping below 2015 is likely related to the building works
Security was the main driver of the NotF programme
* When our new Information Security Manager, Dr Jethro Perkins, started, I showed him what I'd been uncovering, and he was instrumental in pushing for change, in particular for zoning the campus network so we can segregate sensitive research data and other things with legal and regulatory requirements. We owe him immeasurably for the improvements we have been able to make.
As time has gone on, the number of applications requiring zoning has only increased.
Plugging things into “the network”
Old technologies (anyone here run VRF-lite?) won't cut it if you want dozens of zones
* Jethro wanted a pomegranate
* which path did my flow take? What security policy might block it? Are there any bottlenecks? “The network is slow.”
* we've spent millions on tin, how can we answer these sorts of fundamental questions?
* we’ve already made good use of NETCONF on our Juniper firewalls to automate the campus/DC zone split via a single Perl script.
Looking to the automated future, it’s great news that Cisco is waking up and smelling the NETCONF
YANG and other standard technologies help multivendor environments
* dare to hope NETCONF and YANG could substantially replace SNMP in the next few years
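To make the contrast with SNMP concrete, here is a minimal Python sketch of the kind of payload a NETCONF client sends: a structured XML `<edit-config>` RPC built with the standard library. The element names under `<configuration>` are hypothetical, not Juniper's actual schema, and a real script would send this over an SSH session with a NETCONF client library rather than just printing it:

```python
# Illustrative only: builds the XML body of a NETCONF <edit-config>
# RPC using the stdlib. The zone elements are hypothetical, not a
# real vendor schema; a real client would send this over SSH.
import xml.etree.ElementTree as ET

NS = "urn:ietf:params:xml:ns:netconf:base:1.0"

def build_edit_config(zone_name: str) -> str:
    """Return a serialized <edit-config> RPC targeting the candidate datastore."""
    rpc = ET.Element(f"{{{NS}}}rpc", {"message-id": "1"})
    edit = ET.SubElement(rpc, f"{{{NS}}}edit-config")
    target = ET.SubElement(edit, f"{{{NS}}}target")
    ET.SubElement(target, f"{{{NS}}}candidate")
    config = ET.SubElement(edit, f"{{{NS}}}config")
    zones = ET.SubElement(config, "configuration")     # hypothetical schema
    zone = ET.SubElement(zones, "security-zone")       # hypothetical schema
    ET.SubElement(zone, "name").text = zone_name
    return ET.tostring(rpc, encoding="unicode")

xml_body = build_edit_config("research-data")
print(xml_body)
```

Unlike an SNMP SET of loose OIDs, the whole change is one structured, schema-validated document that the device applies (or rejects) as a unit.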
APIC-EM is Cisco’s controller for campus networks. It allows us to write policy which it can then implement across the network: things like managing QoS become achievable. With our 3850s we can look into automating VXLAN overlays, with a control plane, so for those horrible use cases which want a shared broadcast domain across the campus network there’s a safer way to do it. We might even be able to segment the network just by writing a little Python!
If you run a Cisco network and haven’t yet looked at DevNet, I’d recommend you do.
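As a flavour of that "little Python", here is a minimal sketch of talking to a controller's northbound REST API. The host and credentials are hypothetical, and the exact endpoints should be checked against the APIC-EM documentation on DevNet; to keep the sketch self-contained, it only prepares the authentication request rather than sending it:

```python
# Illustrative only: prepares (does not send) the POST that would
# exchange credentials for a service ticket on a campus controller.
# Host, credentials and endpoint path are assumptions to be checked
# against the controller's own API documentation.
import json
import urllib.request

CONTROLLER = "https://apic-em.example.ac.uk"  # hypothetical host

def ticket_request(username: str, password: str) -> urllib.request.Request:
    """Build the authentication POST for the controller's REST API."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"{CONTROLLER}/api/v1/ticket",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = ticket_request("netops", "secret")
print(req.full_url, req.method)
```

The returned ticket would then go in a header on subsequent calls, so one script can drive policy across the whole estate instead of configuring boxes one by one.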
To summarise:
* the School wanted an external review: it took years before we had a formal Business Case to present
* we had a strong case, and the School invested in the team as well as the tin; neither is without its challenges, but both are bearing fruit
* the network has gone from something which worked some of the time, through a dependable utility, to a business enabler
* a symbiotic relationship with Estates is worth working towards
* vendors are stakeholders: you need to decide your balance between their vision and potential lock-in