1. Half a Year with
Contrail in Production
krzysztof.kowalik@allegrogroup.com
michal.dopierala@allegrogroup.com
2. ➢ 57% of Polish e-commerce
➢ 14,000,000 users
Who are we?
3. ➢ Providing cloud/IaaS services to internal customers since 2011
➢ We support dev/test/prod environments for 50 developer teams in private
and public cloud
➢ As part of another team, we provide a deployment platform (Mesos/Marathon)
for the Allegro platform
Team - scope
4. What have we done since last OCUG?
➢ New region PL-KRA with ~100 compute nodes
➢ Test environment with an MX10 as gateway
➢ Migration of client VMs from Essex to Icehouse with Contrail
➢ ~100 new compute nodes in PL-POZ
➢ Problem solving
8. Fixed in build 44
SQL:
nova.instance_faults (message):
internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory
/etc/qemu-ifdown: could not launch network script
On the hypervisor:
Jun 14 15:42:31 host kernel: [8026284.365930] lowmem_reserve[]: 0 0 0 0
Jun 14 15:42:31 host kernel: [8026284.365988] 0 pages HighMem/MovableOnly
Jun 14 15:42:31 host kernel: [8026284.366178] Out of memory: Kill process 140217 (qemu-system-x86) score 16 or sacrifice child
Issues - memory leak
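The kernel line above carries the PID, process name, and OOM score of the killed guest. A minimal sketch (our own helper, not part of any tooling shown here) for pulling those fields out of a log so qemu victims can be matched back to `nova.instance_faults` entries:

```python
import re

# Hypothetical helper: extract PID, process name and OOM score from a
# kernel "Out of memory" log line.
OOM_RE = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) \((?P<name>[^)]+)\) score (?P<score>\d+)"
)

def parse_oom_line(line):
    """Return (pid, process name, score), or None if not an OOM-kill line."""
    m = OOM_RE.search(line)
    if not m:
        return None
    return int(m.group("pid")), m.group("name"), int(m.group("score"))

line = ("Jun 14 15:42:31 host kernel: [8026284.366178] Out of memory: "
        "Kill process 140217 (qemu-system-x86) score 16 or sacrifice child")
print(parse_oom_line(line))  # (140217, 'qemu-system-x86', 16)
```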
9. Debug: Exec[munin-node-configure](provider=posix): Executing check 'munin-node-configure --shell | grep ln'
Debug: Executing 'munin-node-configure --shell | grep ln'
Debug: /Stage[main]/Munin::Common/Exec[munin-node-configure]/onlyif: # The following plugins caused errors:
Debug: /Stage[main]/Munin::Common/Exec[munin-node-configure]/onlyif: # ip_:
Debug: /Stage[main]/Munin::Common/Exec[munin-node-configure]/onlyif: # Nothing printed to stdout
Debug: /Stage[main]/Munin::Common/Exec[munin-node-configure]/onlyif: # No valid suggestions
Debug: /Stage[main]/Munin::Common/Exec[munin-node-configure]/onlyif: # mysql_:
Debug: /Stage[main]/Munin::Common/Exec[munin-node-configure]/onlyif: # Non-zero exit during autoconf (2)
Issues - Munin can generate latency
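One way to confirm that a monitoring command such as `munin-node-configure` is the latency source is simply to time it. A sketch under our own assumptions (generic timing wrapper, not Munin- or Puppet-specific):

```python
import subprocess
import sys
import time

# Hypothetical helper: run a command and report how long it took.
# Long wall-clock times here point at the plugin autoconf runs seen
# in the Puppet debug output.
def time_command(cmd):
    """Run cmd, return (elapsed seconds, return code)."""
    start = time.perf_counter()
    rc = subprocess.run(cmd, capture_output=True).returncode
    return time.perf_counter() - start, rc

# Harmless stand-in command; replace with munin-node-configure on a node.
elapsed, rc = time_command([sys.executable, "-c", "pass"])
print("%.3fs rc=%d" % (elapsed, rc))
```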
11. Problem
➢ When a flow is created, Contrail keeps track of routes that can potentially
modify the flow action. During VM spawning, contrail-vrouter-agent also
re-evaluates flows that are not affected by the new route
Symptom
➢ High latency to all VMs within the affected subnet
Solution
➢ A custom build (build 70 of R2.02, or commit
6992575a03a08f703edb8f0a7622a457dbdbdeee), in which the agent re-evaluates
a flow only if the flow is affected by the new route
Issues - flows re-evaluation
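The fix can be sketched as a filter (illustrative Python with our own names, not the agent's actual code): only flows whose destination falls inside the new route's prefix are re-evaluated, instead of every flow in the VRF.

```python
import ipaddress

# Illustrative sketch of the fixed behaviour: select only the flows
# affected by a newly added route prefix.
def flows_to_reevaluate(flows, new_route):
    net = ipaddress.ip_network(new_route)
    return [f for f in flows if ipaddress.ip_address(f["dst"]) in net]

flows = [{"dst": "10.0.1.5"}, {"dst": "10.0.2.9"}, {"dst": "10.0.1.200"}]
print(flows_to_reevaluate(flows, "10.0.1.0/24"))
```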
12. ➢ A duplicated IP triggers multiple next-hop (NH) updates on many compute
nodes, generating latency
A script that helps us:
#!/usr/bin/python
import sys

import requests

# List every VM interface known to the analytics API of the given
# Contrail controller, one line per interface.
cc_ip = sys.argv[1]
r = requests.get('http://%s:9081/analytics/virtual-machines' % cc_ip)
for href_vm in r.json():
    vm = requests.get(href_vm['href']).json()['UveVirtualMachineAgent']
    try:
        for iface in vm['interface_list']:
            print(vm['uuid'], iface['active'], iface['ip_address'],
                  iface['mac_address'], iface['virtual_network'], iface['uuid'])
    except Exception as e:
        print(e, vm['uuid'], vm['interface_list'])
Issues - duplicated IP
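A hypothetical follow-up to the script above (our helper, not shown in the talk): once the (IP, network, VM) tuples are collected, duplicates are simply keys claimed by more than one VM within the same virtual network.

```python
from collections import Counter

# Hypothetical helper: flag IPs reported by more than one VM
# inside the same virtual network.
def find_duplicate_ips(records):
    counts = Counter((ip, net) for ip, net, _vm in records)
    return sorted(key for key, n in counts.items() if n > 1)

records = [
    ("10.0.0.4", "net-a", "vm-1"),
    ("10.0.0.4", "net-a", "vm-2"),  # duplicate within net-a
    ("10.0.0.4", "net-b", "vm-3"),  # same IP, different network: fine
]
print(find_duplicate_ips(records))  # [('10.0.0.4', 'net-a')]
```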
14. Experimenting with flow_cache_timeout:
➢ mesh environment with different client needs
➢ experimental hypervisors
➢ parameter set to 30
Since commit “507fda3d5deb22c6549d4fd253624bea44534b73” ;-) there is no need
to worry about flow_cache_timeout
Issues - flow_cache_timeout
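For reference, a minimal fragment of `/etc/contrail/contrail-vrouter-agent.conf` with the value used above; the section name follows stock Contrail configs, so verify it against your build:

```ini
# /etc/contrail/contrail-vrouter-agent.conf (fragment)
[DEFAULT]
# Seconds an idle flow stays cached before eviction
flow_cache_timeout=30
```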
16. Since kernel 3.13.0-63
Modified script: /opt/contrail/bin/if-vhost0
#!/bin/bash
source /opt/contrail/bin/vrouter-functions.sh
# Load the vrouter module only if vhost0 does not exist yet
if [ ! -L /sys/class/net/vhost0 ]; then
    insert_vrouter &>> $LOG
fi
# Bring pkt1 and pkt2 back up; they stay down after reboot on this kernel
/sbin/ip l s pkt1 up
/sbin/ip l s pkt2 up
Issues - interfaces pkt1 and pkt2 down after reboot
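A small check (ours, hypothetical, not part of the Contrail scripts) that finds pktN interfaces still down in `ip -o link` output, i.e. exactly what the extra `ip` commands in the script fix:

```python
import re

# Hypothetical helper: scan `ip -o link` output for pktN interfaces
# whose flag list lacks UP. Splitting on "," avoids matching LOWER_UP.
def down_pkt_interfaces(ip_link_output):
    down = []
    for line in ip_link_output.splitlines():
        m = re.match(r"\d+:\s+(pkt\d+):\s+<([^>]*)>", line)
        if m and "UP" not in m.group(2).split(","):
            down.append(m.group(1))
    return down

sample = (
    "6: pkt1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN\n"
    "7: pkt2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UNKNOWN\n"
)
print(down_pkt_interfaces(sample))  # ['pkt1']
```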
17. ➢ network performance issues - hping (look out for the 1k bug)
hping3 -c 100000 -p 8080 -i u100000 -S -n 169.254.0.17
➢ introspects
➢ tcpdump
How to debug
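To make an hping run like the one above actionable, a percentile summary of the RTT samples (illustrative helper, not part of hping) makes occasional spikes stand out where an average would hide them:

```python
# Hypothetical helper: nearest-rank percentiles over latency samples
# in milliseconds; p99 exposes rare spikes the mean would smooth away.
def percentiles(samples, points=(50, 95, 99)):
    s = sorted(samples)
    return {p: s[min(len(s) - 1, int(len(s) * p / 100))] for p in points}

rtts = [1.0] * 990 + [250.0] * 10   # mostly 1 ms, a few 250 ms spikes
print(percentiles(rtts))  # {50: 1.0, 95: 1.0, 99: 250.0}
```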