2. The Plan
1. What is Ceph?
2. What is Mirantis OpenStack?
3. How does Ceph fit into OpenStack?
4. What has Fuel ever done for Ceph?
5. What does it look like?
6. Things we’ve done
7. Disk partitioning for Ceph OSD
8. Cephx authentication settings
9. Types of VM migrations
10. Live VM migrations with Ceph
11. Things we left undone
12. Diagnostics and troubleshooting
13. Resources
3. What is Ceph?
Ceph is a free clustered storage platform that provides unified
object, block, and file storage.
Object Storage: RADOS objects support snapshotting, replication, and consistency.
Block Storage: RBD block devices are thinly provisioned over RADOS objects and can be accessed by QEMU via the librbd library.
File Storage: CephFS metadata servers (MDS) provide a POSIX-compliant overlay over RADOS.
(Diagram: clients reach RADOS either through the kernel module or through librbd, speaking the RADOS protocol to the OSDs and Monitors.)
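To get a feel for the block storage path, an RBD image can be created and then either mapped through the kernel module or handed to QEMU via librbd. A minimal sketch; the pool and image names here are made up:

  rbd create --pool rbd --size 1024 test    # 1 GB thin-provisioned image
  rbd map rbd/test                          # kernel module path: exposes a /dev/rbd* device
  qemu-img info rbd:rbd/test                # librbd path: QEMU/qemu-img talk to RADOS directly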
4. What is Mirantis OpenStack?
OpenStack is an open source cloud computing platform.
(Diagram: OpenStack components around a VM: Nova provisions the VM, Cinder provides volumes for it, Glance provides images and stores them in Swift, and Swift stores objects.)
Mirantis ships hardened OpenStack packages and provides the Fuel utility to simplify deployment of OpenStack and Ceph.
Fuel uses Cobbler, MCollective, and Puppet to discover nodes, provision the OS, and set up OpenStack services.
(Diagram: on the Fuel master node, Nailgun serializes the cluster configuration for Astute, which orchestrates deployment: Cobbler provisions the target node, and the MCollective agent on it starts Puppet, which configures OpenStack services using facts gathered on the node.)
5. How does Ceph fit into OpenStack?
RBD drivers for OpenStack make libvirt
configure the QEMU interface to librbd.
(Diagram: OpenStack drives libvirt, which configures QEMU; QEMU uses librbd over librados to reach the OSDs and Monitors.)
Ceph benefits:
Multi-node striping and redundancy for block storage (Cinder volumes and Nova ephemeral drives)
Copy-on-write cloning of images to volumes and instances
Unified storage pool for all types of storage (object, block, POSIX)
Live migration of Ceph-backed instances
Problems: sensitivity to clock drift, multi-site (async replication in
Emperor), block storage density (erasure coding in Firefly), Swift
API gap (rbd backend for Swift)
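For reference, the libvirt disk definition that ends up driving the QEMU/librbd interface looks roughly like this sketch; the pool, image name, monitor address, Cephx user, and secret UUID are placeholders:

  <disk type='network' device='disk'>
    <driver name='qemu' type='raw'/>
    <source protocol='rbd' name='volumes/volume-00000001'>
      <host name='192.168.0.1' port='6789'/>
    </source>
    <auth username='volumes'>
      <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
    </auth>
    <target dev='vda' bus='virtio'/>
  </disk>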
6. What has Fuel ever done for Ceph?
1. Fuel deploys Ceph Monitors and OSDs on dedicated nodes or
in combination with OpenStack components.
(Diagram: controller nodes 1-3 run ceph-mon, storage nodes 1-n run ceph-osd, and compute nodes 1-n run nova with the Ceph client; all nodes are connected by the management and storage networks.)
2. Creates partitions for OSDs when nodes are provisioned.
3. Creates separate RADOS pools and sets up Cephx
authentication for Cinder, Glance, and Nova.
4. Configures Cinder, Glance, and Nova to use the RBD backend with the right pools and credentials (see the sketch after this list).
5. Deploys RADOS Gateway (S3 and Swift API frontend to
Ceph) behind HAProxy on controller nodes.
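As a rough sketch of what items 3 and 4 boil down to on the OpenStack side (the option names are the Grizzly/Havana-era RBD driver settings; the exact values Fuel writes are assumptions):

  # cinder.conf
  volume_driver=cinder.volume.drivers.rbd.RBDDriver
  rbd_pool=volumes
  rbd_user=volumes
  rbd_secret_uuid=<libvirt secret for client.volumes>

  # glance-api.conf
  default_store=rbd
  rbd_store_pool=images
  rbd_store_user=images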
7. What does it look like?
Select storage options ⇒ assign roles to nodes ⇒ allocate disks (shown as Fuel UI screenshots).
8. Things we’ve done
1. Set the right GPT type GUIDs on OSD and journal partitions
for udev automount rules
2. ceph-deploy: set up root SSH between Ceph nodes
3. Basic Ceph settings: cephx, pool size, networks
4. Cephx: ceph auth command line can’t be split
5. Rados Gateway: has to use Inktank’s fork of FastCGI; set an infinite revocation interval for UUID auth tokens to work
6. Patch Cinder to convert non-raw images when creating an RBD-backed volume from Glance
7. Patch Nova: clone RBD-backed Glance images into RBD-backed ephemeral volumes, pass RBD user to qemu-img
8. Ephemeral RBD: disable SSH key injection, set up Nova,
libvirt, and QEMU for live migrations
9. Disk partitioning for Ceph OSD
Flow of disk partitioning information during discovery,
configuration, provisioning, and deployment:
(Diagram: disk allocation in the Fuel UI is stored by Nailgun as ceph-osd role volumes in openstack.json and passed as ks_spaces to Cobbler; on the target node, the MCollective agent’s pmanager scans the disks with parted and creates OSD and journal partitions with sgdisk, setting their GPT types; Facter then reports osd_devices_list to Puppet, whose ceph::osd class runs ceph-deploy to create the OSDs and journals.)
GPT partition type GUIDs according to ceph-disk:
JOURNAL_UUID = '45b0969e-9b03-4f30-b4c6-b4b80ceff106'
OSD_UUID     = '4fbd7e29-9d25-41b8-afd0-062c0ceff05d'
If more than one device is allocated for OSD Journal, journal
devices are evenly distributed between OSDs.
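The GPT types can be set with sgdisk, for example (device, sizes, and partition numbers are made up; whether pmanager uses exactly these flags is an assumption, but the type GUIDs are the ones listed above):

  sgdisk --new=1:0:+10G --change-name=1:'ceph journal' \
         --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
  sgdisk --new=2:0:0 --change-name=2:'ceph data' \
         --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdb

Ceph’s udev rules match on these type GUIDs, which is what makes the automount from item 1 on the previous slide work.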
10. Cephx authentication settings
Monitor ACL is the same for all Cephx users:
allow r
OSD ACLs vary per OpenStack component:
Glance: allow class-read object_prefix rbd_children,
        allow rwx pool=images
Cinder: allow class-read object_prefix rbd_children,
        allow rwx pool=volumes,
        allow rx pool=images
Nova:   allow class-read object_prefix rbd_children,
        allow rwx pool=volumes,
        allow rx pool=images,
        allow rwx pool=compute
Watch out: Cephx is easily tripped up by unexpected whitespace in
ceph auth command line parameters, so we have to keep them all
on a single line.
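For example, the Cinder user above is created with a single one-line command roughly like this (the client.volumes name is an assumption; note that the whole caps string stays on one line):

  ceph auth get-or-create client.volumes mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images'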
11. Types of VM migrations
OpenStack:
Live vs offline: Is VM stopped during migration?
Block vs shared storage vs volume-backed: Is VM data shared
between nodes? Is VM metadata (e.g. libvirt domain
XML) shared?
Libvirt:
Native vs tunneled: Is VM state transferred directly between
hypervisors or tunneled by libvirtd?
Direct vs peer-to-peer: Is migration controlled by libvirt client or by
source libvirtd?
Managed vs unmanaged: Is migration controlled by libvirt or by
hypervisor itself?
Our type:
Live, volume-backed*, native, peer-to-peer, managed.
12. Live VM migrations with Ceph
Enable native peer-to-peer live migration:
(Diagram: Nova and libvirtd on the source compute node, running VM-A, VM-B, and VM-C, hand VM-C over to libvirtd and Nova on the destination compute node, which already runs VM-D and VM-E.)
libvirt VIR_MIGRATE_* flags: LIVE, PEER2PEER,
UNDEFINE_SOURCE, PERSIST_DEST
Patch Nova to decouple shared volumes from shared libvirt
metadata logic during live migration
Set VNC listen address to 0.0.0.0 and block VNC from outside
the management network in iptables
Open ports 49152+ between computes for QEMU migrations
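A sketch of the corresponding nova.conf settings on the compute nodes (the option names are the Grizzly/Havana-era ones; the exact values Fuel writes are assumptions):

  live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST
  vncserver_listen=0.0.0.0
  vncserver_proxyclient_address=<management IP of this node>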
13. Things we left undone
1. Non-root user with sudo for ceph-deploy
2. Calculate PG numbers based on the number of OSDs (rule of thumb after this list)
3. Ceph public network should go to a second storage network
instead of management
4. Dedicated Monitor nodes, list all Monitors in ceph.conf on
each Ceph node
5. Multi-backend configuration for Cinder
6. A better way to configure pools for OpenStack services (than
CEPH_ARGS in the init script)
7. Make Nova update VM’s VNC listen address to
vncserver_listen of the destination compute after migration
8. Replace ’qemu-img convert’ with clone_image() in
LibvirtDriver.snapshot() in Nova
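For item 2, the commonly cited rule of thumb from the Ceph documentation (an approximation, not something Fuel implements yet) is roughly:

  pg_num ≈ (number of OSDs × 100) / replica count, rounded up to the nearest power of two
  # e.g. 30 OSDs with 3 replicas: 30 * 100 / 3 = 1000, so pg_num = 1024
  ceph osd pool set volumes pg_num 1024
  ceph osd pool set volumes pgp_num 1024    # pgp_num must follow pg_num for rebalancing to happen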
14. Diagnostics and troubleshooting
ceph -s
ceph osd tree
cinder create 1
rados df
qemu-img convert -O raw cirros.qcow2 cirros.raw
glance image-create --name cirros-raw --is-public yes \
    --container-format bare --disk-format raw < cirros.raw
nova boot --flavor 1 --image cirros-raw vm0
nova live-migration vm0 node-3
disk partitioning failed during provisioning – check if traces of
previous partition tables are left on any drives
’ceph-deploy config pull’ failed – check if the node can ssh to the primary controller over the management network
HEALTH_WARN: clock skew detected – check your ntpd settings,
make sure your NTP server is reachable from all nodes
ENOSPC when storing small objects in RGW – try setting a
smaller rgw object stripe size