adventures with cgroups for resource control

What?

Control Groups, aka cgroups – see the docs.

Resource control and monitoring.

Some examples follow for throttling I/O speed and limiting memory for a process control group (cgroup).

They’re used heavily in containers (like lxc or docker) to limit things like memory use, i/o requests, network traffic etc. They can also expose some useful stats (e.g. how much disk activity / cpu time / memory a container is using/has used).

Why?

In my case, I’m trying to share resources among multiple containers (websites) … and stop any one from gobbling up all resources and making the response times of the others suck.

How?

Setting up cgroup support

If you’re using systemd, everything’s weird and messy (aka, I know they’re different, but I’ve not looked into what’s going on).

I’m used to everything being lumped together, and since I’m not using systemd on these servers, /sys/fs/cgroup contains all the types/controllers together in a single hierarchy:

mount -t cgroup cgroup /sys/fs/cgroup
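To check what is actually mounted, you can filter /proc/mounts for the cgroup filesystem type. This is a minimal sketch: the function name list_cgroup_mounts is made up for illustration, and the optional file argument is only there so it can be pointed at a saved copy of /proc/mounts.

```shell
# list_cgroup_mounts [FILE]
# Prints "mountpoint options" for every mount of type "cgroup" in FILE
# (default: /proc/mounts). The function name is illustrative only.
list_cgroup_mounts() {
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk '$3 == "cgroup" { print $2, $4 }' "${1:-/proc/mounts}"
}

# Example:
# list_cgroup_mounts
```

With the single combined mount above, this should print one line for /sys/fs/cgroup; on a systemd box you would expect one line per controller instead.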

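There are also a couple of kernel command-line parameters needed to properly turn on memory tracking: cgroup_enable=memory enables the memory controller, and swapaccount=1 enables swap accounting (the memory.memsw.* files). You may need to fix your kernel command line (at boot, via grub) so that /proc/cmdline contains: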

cgroup_enable=memory swapaccount=1
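A quick way to see whether those parameters are already set is to look for them in /proc/cmdline. A minimal sketch follows; the function name has_cmdline_flag is hypothetical, and the optional second argument exists only so the check can be run against another file.

```shell
# has_cmdline_flag FLAG [FILE]
# Succeeds if FLAG appears as a whole word in FILE (default: /proc/cmdline).
# The function name is illustrative, not part of any standard tool.
has_cmdline_flag() {
    flag=$1
    file=${2:-/proc/cmdline}
    # Put each whitespace-separated word on its own line, then match exactly.
    tr -s ' \t' '\n\n' < "$file" | grep -qxF -- "$flag"
}

for f in cgroup_enable=memory swapaccount=1; do
    if has_cmdline_flag "$f"; then
        echo "$f: present"
    else
        echo "$f: missing"
    fi
done
```

If either one is reported missing, the memory.* or memory.memsw.* files may not show up in the cgroup directory until you reboot with the parameter added.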

Creating a cgroup

While you can create an arbitrary cgroup quite easily –

mkdir /sys/fs/cgroup/david

there’s normally no need – as things like docker or lxc will do this for you. At least for ‘lxc’ it creates them in :

/sys/fs/cgroup/lxc/<container-name>

But for the sake of demonstrating, once you’ve created your cgroup, adding a process to it is straightforward – e.g. pid 12345 –

echo 12345 >> /sys/fs/cgroup/david/tasks

Any child processes of pid 12345 will also be in the ‘david’ cgroup.

So, the /sys/fs/cgroup/david/tasks file lists the PIDs of all processes in that cgroup.

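A process can belong to many cgroups: one per mounted hierarchy. With everything mounted together as above there is a single hierarchy, so /proc/12345/cgroup has one line, which might look like: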

1:cpuset,cpu,cpuacct,blkio,memory,devices,freezer,net_cls,perf_event,net_prio:/david
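Each line of that file has the form hierarchy-id:controller-list:path. If you want the path for one particular controller, something like the sketch below works; the function name cgroup_path_for is made up for illustration.

```shell
# cgroup_path_for CONTROLLER [FILE]
# Prints the cgroup path of the hierarchy that includes CONTROLLER.
# FILE defaults to /proc/self/cgroup. The function name is illustrative.
cgroup_path_for() {
    awk -F: -v want="$1" '
        {
            n = split($2, ctrls, ",")
            for (i = 1; i <= n; i++) {
                if (ctrls[i] == want) {
                    # Everything after the second colon is the cgroup path.
                    path = $0
                    sub(/^[^:]*:[^:]*:/, "", path)
                    print path
                    exit
                }
            }
        }
    ' "${2:-/proc/self/cgroup}"
}

# Example: which cgroup is pid 12345 in for the blkio controller?
# cgroup_path_for blkio /proc/12345/cgroup
```

For the line shown above, asking for blkio (or any of the other listed controllers) prints /david.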

Disk I/O throughput (bps)

You can specify a bytes-per-second limit for a specific device, identified by its major:minor number.

echo "major:minor bytes_per_second" > /sys/fs/cgroup/whatever/blkio.throttle.read_bps_device

echo "major:minor bytes_per_second" > /sys/fs/cgroup/whatever/blkio.throttle.write_bps_device

Example: setting a 10 MB/s limit on writes to the device 202:8 (use ‘lsblk’ to find the major:minor number of the device behind /path/to/filesystem)

echo "202:8 $(( 10 * 1024 * 1024 )) " > /sys/fs/cgroup/whatever/blkio.throttle.write_bps_device

To disable a limit, set it to zero – i.e.

echo "202:8 0" > /sys/fs/cgroup/lxc/container-name/blkio.throttle.write_bps_device
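If you set these limits often, the arithmetic can be wrapped up in a small helper. This is only a sketch: set_write_limit is a made-up name, and it assumes you pass the cgroup directory, the device's major:minor pair and a limit in MB/s.

```shell
# set_write_limit CGROUP_DIR MAJ:MIN MB_PER_SEC
# Writes "MAJ:MIN bytes" to blkio.throttle.write_bps_device in CGROUP_DIR.
# A limit of 0 removes the throttle. The function name is illustrative.
set_write_limit() {
    cgdir=$1
    dev=$2
    mb=$3
    bytes=$(( mb * 1024 * 1024 ))
    echo "$dev $bytes" > "$cgdir/blkio.throttle.write_bps_device"
}

# Example (needs root and a real cgroup):
# set_write_limit /sys/fs/cgroup/lxc/container-name 202:8 10
# set_write_limit /sys/fs/cgroup/lxc/container-name 202:8 0   # remove it again
```

The same pattern should work for reads by writing to blkio.throttle.read_bps_device instead.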

Notes:

Most of the time, the limit only becomes apparent when you call sync/fsync, e.g.
```
dd if=/dev/zero of=/path/to/whatever.dd bs=4k count=10240 conv=fdatasync
```
Or when you outgrow the kernel’s write buffer cache – see
```
/proc/sys/vm/dirty_background_bytes
```
Or if you drop the kernel’s page cache –
```
echo 3 >/proc/sys/vm/drop_caches
```

Dropping the page cache matters for read limits: a cached file (obviously) doesn’t lead to a disk read, so the cgroup restraint isn’t used.

Disk Weighting

Additionally, you can assign an arbitrary weight to a container and, assuming you’re using the CFQ disk scheduler, prioritise one container’s disk access over another’s, e.g.:

echo 200 > /sys/fs/cgroup/lxc/container1/blkio.weight

From memory, the default is 500.
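As a rough worked example of what a weight means: assuming the weights are proportional and both containers are hammering the same disk, each one gets roughly its weight divided by the sum of all the weights. The helper below (weight_share is a made-up name) just does that arithmetic; it is an approximation, not something measured.

```shell
# weight_share MINE OTHER...
# Prints the approximate integer percentage of disk time a cgroup with
# weight MINE gets when competing with cgroups of the OTHER weights.
# Assumes purely proportional sharing; the function name is illustrative.
weight_share() {
    mine=$1
    total=0
    for w in "$@"; do
        total=$(( total + w ))
    done
    echo $(( mine * 100 / total ))
}

weight_share 200 500   # container1 at 200, one other container at 500
```

This prints 28, i.e. container1 would get a bit under a third of the disk time while the other container is busy. When the other container is idle, the weight shouldn't hold container1 back.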

Disk Usage

See /sys/fs/cgroup/lxc/container1/blkio.io_service_bytes which lists something like :

202:112 Read 106496
202:112 Write 0
202:112 Sync 0
202:112 Async 106496
202:112 Total 106496
202:80 Read 92684288
202:80 Write 59547648
202:80 Sync 59547648
202:80 Async 92684288
202:80 Total 152231936

So you can report based on disk (e.g. 202:112 is the root file system and 202:80 is what’s mounted for the website’s document root). Use ‘lsblk’ to match up to partition/device.
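For reporting, the per-device Total lines are usually the interesting ones. A minimal sketch that pulls them out and converts to KiB; the function name io_totals_kib is made up for illustration.

```shell
# io_totals_kib FILE
# Prints "MAJ:MIN KiB" for each per-device "Total" line in a
# blkio.io_service_bytes file. A final grand-total line of the form
# "Total N", if present, is skipped because its second field is a number.
# The function name is illustrative.
io_totals_kib() {
    awk '$2 == "Total" { printf "%s %d\n", $1, $3 / 1024 }' "$1"
}

# Example:
# io_totals_kib /sys/fs/cgroup/lxc/container1/blkio.io_service_bytes
```

For the sample output above this gives 104 KiB for 202:112 and 148664 KiB for 202:80.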

Memory limiting

Useful knobs to meddle with :

memory.limit_in_bytes – Limit the cgroup to this much RAM
1. For a 100 MB limit: echo $(( 100 * 1024 * 1024 )) > /sys/fs/cgroup/lxc/container1/memory.limit_in_bytes
2. If memory.memsw.limit_in_bytes is larger than memory.limit_in_bytes, the container will start swapping once it hits the RAM limit, and can swap up to the difference between the two.
memory.memsw.limit_in_bytes – Limit the cgroup to this much RAM + swap combined
1. For a 100 MB limit: echo $(( 100 * 1024 * 1024 )) > /sys/fs/cgroup/lxc/container1/memory.memsw.limit_in_bytes
memory.usage_in_bytes – How much RAM is currently in use
1. So to get the usage in MB: echo $(( $(< /sys/fs/cgroup/lxc/container1/memory.usage_in_bytes ) / ( 1024 * 1024 ) ))
memory.memsw.usage_in_bytes – How much RAM + swap is currently in use.
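Putting the usage and limit knobs together, you can report how close a container is to its limit. A minimal sketch, with the made-up function name mem_usage_percent, that takes the cgroup directory as its only argument:

```shell
# mem_usage_percent CGROUP_DIR
# Prints memory.usage_in_bytes as an integer percentage of
# memory.limit_in_bytes. The function name is illustrative.
mem_usage_percent() {
    usage=$(cat "$1/memory.usage_in_bytes")
    limit=$(cat "$1/memory.limit_in_bytes")
    echo $(( usage * 100 / limit ))
}

# Example:
# mem_usage_percent /sys/fs/cgroup/lxc/container1
```

If no limit has been set, memory.limit_in_bytes holds a huge number, so this will just print 0.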

Memory Fail Counters

There’s also memory.failcnt and memory.memsw.failcnt which are incremented each time the container hits a memory limit you’ve assigned.

Note these numbers can jump quickly and may not necessarily indicate a problem – as e.g. reading/writing a large file may push the cgroup’s memory usage up (as it buffers it) causing it to hit the memory limit a few hundred thousand times in a 5 minute period.

You can reset the failcnt counters by writing 0 to them.
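Because the raw number is cumulative, the rate of change is often more telling than the value itself. The sketch below samples memory.failcnt twice and prints the difference; failcnt_delta is a made-up name, and the interval is whatever you pass to sleep.

```shell
# failcnt_delta CGROUP_DIR SECONDS
# Prints how much memory.failcnt increased over the given interval.
# The function name is illustrative.
failcnt_delta() {
    before=$(cat "$1/memory.failcnt")
    sleep "$2"
    after=$(cat "$1/memory.failcnt")
    echo $(( after - before ))
}

# Example: failures over the next 60 seconds
# failcnt_delta /sys/fs/cgroup/lxc/container1 60
```

The same approach should work for memory.memsw.failcnt by changing the file name.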