<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>... all the running ...</title>
    <description>&quot;Now, here, you see, it takes all the running you can do, to keep in the same place&quot; - Red Queen
</description>
    <link>http://blog.alltherunning.com/</link>
    <atom:link href="http://blog.alltherunning.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 18 Jul 2022 01:47:58 +0000</pubDate>
    <lastBuildDate>Mon, 18 Jul 2022 01:47:58 +0000</lastBuildDate>
    <generator>Jekyll v3.9.2</generator>
    
      <item>
        <title>Ubuntu Jammy disables ssh-rsa</title>
        <description>&lt;p&gt;Have you upgrade to Ubuntu Jammy lately, and have SSH access or git breaking? If
so, you have come to the right place!&lt;/p&gt;

&lt;p&gt;Ubuntu Jammy (22.04) launched recently, and one of the biggest changes is that
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; is &lt;a href=&quot;jammy-release-notes&quot;&gt;disabled by default&lt;/a&gt; in the version of
OpenSSH it ships with.&lt;/p&gt;

&lt;p&gt;There is a lot of confusion on the internet, and docs still seem to be a bit
sketchy, so I hope this will help someone out!&lt;/p&gt;

&lt;h1 id=&quot;points-to-note&quot;&gt;Points to note&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;There is a key &lt;em&gt;type&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt;. This is the ‘default’ key that OpenSSH has been
generating. You probably have a key of this type. &lt;strong&gt;This is not disabled,
yet.&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There is a key &lt;em&gt;algorithm&lt;/em&gt;, also named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt;. &lt;strong&gt;This is the one that is
disabled.&lt;/strong&gt; It uses the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; key &lt;em&gt;type&lt;/em&gt;, along with the SHA-1 hash, for
authentication in SSH. SHA-1 is now considered broken, and should be
replaced with SHA-256 or SHA-512.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To allow for continuing use of the key &lt;em&gt;type&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt;, &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc8332&quot;&gt;RFC8332&lt;/a&gt;
defined two new key &lt;em&gt;algorithms&lt;/em&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-512&lt;/code&gt;. These have
been supported by major operating systems for a while.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;SSH clients and servers negotiate and use the stronger algorithms if they
are supported, and fall back to the weaker algorithm if not.
Therefore, you may be using your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; key &lt;em&gt;type&lt;/em&gt; with a bunch of different
servers with varying key &lt;em&gt;algorithms&lt;/em&gt; without realising it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Ubuntu Jammy, as an SSH Client, will now refuse to talk to a server if it
tries to use the weaker &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; key &lt;em&gt;algorithm&lt;/em&gt; for SSH.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; key can still be used; however, the server you
are talking to MUST support the newer key algorithms.&lt;/p&gt;

&lt;p&gt;Unfortunately, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt; support is still making its way into major
software. See the list below for more information.&lt;/p&gt;

&lt;h1 id=&quot;testing&quot;&gt;Testing&lt;/h1&gt;

&lt;p&gt;To test if a server supports &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-512&lt;/code&gt;, do the following&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh -o PubkeyAcceptedKeyTypes=rsa-sha2-256 &amp;lt;user&amp;gt;@&amp;lt;server&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can also test how the connection behaves with only the legacy &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; algorithm by doing&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh -o PubkeyAcceptedKeyTypes=ssh-rsa &amp;lt;user&amp;gt;@&amp;lt;server&amp;gt;&lt;/code&gt;&lt;/p&gt;
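&lt;p&gt;If you want to see which signature algorithms a server advertises, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;server-sig-algs&lt;/code&gt; extension (RFC8308) carries exactly that, and recent OpenSSH
clients print it in verbose mode. The exact debug wording varies between
versions, so treat this as a sketch&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ssh -v &amp;lt;user&amp;gt;@&amp;lt;server&amp;gt; 2&amp;gt;&amp;amp;1 | grep server-sig-algs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;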

&lt;p&gt;If the first test breaks, this means the software doesn’t support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt;. You can,
in order of preference:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;allow-list the particular server,&lt;/li&gt;
  &lt;li&gt;upgrade to a newer version (check the software list below), or&lt;/li&gt;
  &lt;li&gt;change to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ed25519&lt;/code&gt; keys (see the example below).&lt;/li&gt;
&lt;/ul&gt;
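&lt;p&gt;For the last option, generating a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ed25519&lt;/code&gt; key is a one-liner (the
paths below are just the OpenSSH defaults)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Generate a new ed25519 keypair (written to ~/.ssh/id_ed25519 by default)
ssh-keygen -t ed25519
# Copy the new public key to the server
ssh-copy-id -i ~/.ssh/id_ed25519.pub &amp;lt;user&amp;gt;@&amp;lt;server&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;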

&lt;h1 id=&quot;allow-listing-servers&quot;&gt;Allow listing servers&lt;/h1&gt;

&lt;p&gt;You can set this in your SSH config (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.ssh/config&lt;/code&gt;) for each server you want
to use the weaker key with.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Host &amp;lt;hostname&amp;gt;
    PubkeyAcceptedKeyTypes +ssh-rsa
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;software-needing-update&quot;&gt;Software needing update&lt;/h1&gt;

&lt;p&gt;You might be running a particular application that breaks now that you are
connecting to it from Jammy. This is because a lot of SSH servers traditionally
only support the basic SHA-1 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt;, and have not implemented
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt;. This includes many SSH libraries, like paramiko and mina, which
other software uses to build SSH/git server functionality.&lt;/p&gt;

&lt;p&gt;These libraries have released newer versions which support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt;, but
as we engineers know, you can mark a thing as deprecated for a LONG TIME and
people will keep using it, only upgrading once things break. :)&lt;/p&gt;

&lt;p&gt;Here is a list of links to the relevant software&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.openssh.com/txt/release-8.7&quot;&gt;openssh&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://bugs.launchpad.net/ubuntu/+source/paramiko/+bug/1961979&quot;&gt;paramiko&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://jira.atlassian.com/browse/BSERV-13013&quot;&gt;bitbucket&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/SSHD-895&quot;&gt;mina&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://bugs.chromium.org/p/gerrit/issues/detail?id=13930&quot;&gt;gerrit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hope this helps to clear the confusion! Feel free to reach out with suggestions
/ improvements.&lt;/p&gt;

</description>
        <pubDate>Tue, 05 Jul 2022 12:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/ubuntu/2022/07/05/jammy-ssh.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/ubuntu/2022/07/05/jammy-ssh.html</guid>
        
        
        <category>ubuntu</category>
        
      </item>
    
      <item>
        <title>Huge packet losses with OVN</title>
        <description>&lt;p&gt;A service provided by the Nectar Cloud is ‘Tenant Networks’, where a user can
create their own networks in their tenancy to connect VMs together. Tenant
Networks have the following features:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;They are private to the tenant, in the sense that there can be multiple e.g.
&lt;em&gt;192.168.1.0/24&lt;/em&gt; networks created by different users and they are all isolated
from each other&lt;/li&gt;
  &lt;li&gt;Networks can span availability zones, which means VMs in Queensland and
Melbourne can be connected to the same network.&lt;/li&gt;
  &lt;li&gt;Traffic from these networks is NAT’ed for egress to the internet&lt;/li&gt;
  &lt;li&gt;Similarly, ingress traffic can be NAT’ed via Floating IPs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The type of software that drives this is called a Software Defined Network (SDN).
The SDN Nectar is using is MidoNet. Due to MidoNet being unmaintained and the
OpenStack community moving to OVN, we are currently migrating our SDN from
MidoNet to OVN.&lt;/p&gt;

&lt;h1 id=&quot;infrastructure&quot;&gt;Infrastructure&lt;/h1&gt;

&lt;p&gt;Nectar has Availability Zones (AZs) all over Australia. Compute Nodes in each
AZ are on private RFC1918 networks. For an SDN to work properly, Compute Nodes
in each AZ need to be able to reach any other Compute Node. To do that, we created
an additional overlay network named ‘WAGNET’.&lt;/p&gt;

&lt;p&gt;To create WAGNET, Network Nodes in each AZ form a mesh-like network with other
AZs using tunnels over the public Internet. A simplified diagram of two AZs with
Compute and Network nodes looks like this&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/wagnet.png&quot; alt=&quot;WAGNET&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this diagram, a tenant has VMs in two AZs. They have created a Tenant
Network (192.168.1.0/24) which is purely virtual. Traffic destined for a VM in
another AZ is encapsulated by the Compute Node, then further encapsulated by
the Network Node and passed over the Internet.&lt;/p&gt;

&lt;h1 id=&quot;ovn-testing&quot;&gt;OVN Testing&lt;/h1&gt;

&lt;p&gt;As part of our testing, we benchmark test networks before and after migration
using iperf3 to see if there is any performance difference. Unfortunately, this
testing revealed a huge traffic drop in some places when the network was changed
from MidoNet to OVN.&lt;/p&gt;

&lt;p&gt;An example output of the iperf3 test looks like&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;root@jakeo1:~# iperf3 -c 192.168.2.150
Connecting to host 192.168.2.150, port 5201
[  5] local 192.168.2.211 port 50058 connected to 192.168.2.150 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   102 KBytes   833 Kbits/sec    9   2.83 KBytes       
[  5]   1.00-2.00   sec  0.00 Bytes    0.00 bits/sec   15   5.66 KBytes       
[  5]   2.00-3.00   sec  31.1 KBytes   255 Kbits/sec   12   2.83 KBytes       
[  5]   3.00-4.00   sec  31.1 KBytes   255 Kbits/sec   10   2.83 KBytes       
[  5]   4.00-5.00   sec  31.1 KBytes   255 Kbits/sec    6   2.83 KBytes       
[  5]   5.00-6.00   sec  31.1 KBytes   255 Kbits/sec    9   1.41 KBytes       
[  5]   6.00-7.00   sec  31.1 KBytes   255 Kbits/sec    9   5.66 KBytes       
[  5]   7.00-8.00   sec  0.00 Bytes    0.00 bits/sec    9   1.41 KBytes       
[  5]   8.00-9.00   sec  31.1 KBytes   255 Kbits/sec   11   5.66 KBytes       
[  5]   9.00-10.00  sec  0.00 Bytes    0.00 bits/sec   10   2.83 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   288 KBytes   236 Kbits/sec  100             sender
[  5]   0.00-10.04  sec   238 KBytes   194 Kbits/sec                  receiver

iperf Done.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Comparatively, a good iperf3 run looks like this (reverse direction, using the -R flag)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;root@jakeo1:~# iperf3 -c 192.168.2.150 -R
Connecting to host 192.168.2.150, port 5201
Reverse mode, remote host 192.168.2.150 is sending
[  5] local 192.168.2.211 port 50062 connected to 192.168.2.150 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  51.6 MBytes   433 Mbits/sec                  
[  5]   1.00-2.00   sec  69.7 MBytes   585 Mbits/sec                  
[  5]   2.00-3.00   sec  69.6 MBytes   584 Mbits/sec                  
[  5]   3.00-4.00   sec  69.7 MBytes   585 Mbits/sec                  
[  5]   4.00-5.00   sec  70.5 MBytes   591 Mbits/sec                  
[  5]   5.00-6.00   sec  70.3 MBytes   589 Mbits/sec                  
[  5]   6.00-7.00   sec  70.5 MBytes   591 Mbits/sec                  
[  5]   7.00-8.00   sec  70.3 MBytes   589 Mbits/sec                  
[  5]   8.00-9.00   sec  70.4 MBytes   591 Mbits/sec                  
[  5]   9.00-10.00  sec  70.1 MBytes   588 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.03  sec   686 MBytes   573 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   683 MBytes   573 Mbits/sec                  receiver

iperf Done.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is obvious that the bitrate is horrible, and the small congestion window
(Cwnd) size leads us to believe that packets are being dropped, causing
congestion control to keep shrinking the Cwnd.&lt;/p&gt;

&lt;p&gt;On the surface, this was a very interesting problem because it affected VMs in
different availability zones (AZs), and in different directions. Nectar has AZs
all over Australia. We started doing iperf3 tests between sites in different
locations.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/midonet_ovn_old.png&quot; alt=&quot;MidoNet vs OVN&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For Monash and Auckland, OVN’s throughput is basically nothing (a few kbps). For
OVN ingress to QRISCloud, it is also basically nothing. Strangely, for Swinburne
it is the opposite - egress traffic was the one that dropped substantially.&lt;/p&gt;

&lt;h1 id=&quot;debugging&quot;&gt;Debugging&lt;/h1&gt;

&lt;p&gt;Since there wasn’t an easily discernible pattern at first sight, we suspected
that this might be a combination of problems. After ruling out hardware,
software and site config, we started tcpdumping iperf3 tests.&lt;/p&gt;

&lt;p&gt;After a long few days we found the problem. Below is a tcpdump from two hosts.
On the left is a hypervisor, and on the right is a network node. They are
capturing the same flow of an iperf3 test.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/tcpdump.png&quot;&gt;&lt;img src=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/tcpdump.png&quot; alt=&quot;tcpdump&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that two packets are combined into one somewhere between the
two capture points.&lt;/p&gt;

&lt;p&gt;We can see that the first 3 packets (&lt;em&gt;#35, #37, #38&lt;/em&gt;) on the left are the same as
the first 3 packets (&lt;em&gt;#158, #159, #160&lt;/em&gt;) on the right. All is good.&lt;/p&gt;

&lt;p&gt;Next up is where the problem starts. Packet 46 on the left is a &lt;strong&gt;1472&lt;/strong&gt; byte packet.
On the right, this appears as a &lt;strong&gt;2820&lt;/strong&gt; byte packet.&lt;/p&gt;

&lt;p&gt;Looking deeper, there are a few things to note here&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;It seems that 2 of the packets on the left (&lt;em&gt;#46, #47&lt;/em&gt;) are combined into 1 (&lt;em&gt;#164&lt;/em&gt;)&lt;/li&gt;
  &lt;li&gt;This can be derived from the screenshot because
    &lt;ul&gt;
      &lt;li&gt;Left packet - outer length = &lt;em&gt;1472 bytes&lt;/em&gt; (shown)&lt;/li&gt;
      &lt;li&gt;Left packet - inner length of TCP data = &lt;em&gt;1348 bytes&lt;/em&gt; (shown)&lt;/li&gt;
      &lt;li&gt;Therefore, there must be &lt;em&gt;1472 - 1348 = 124 bytes&lt;/em&gt; of header&lt;/li&gt;
      &lt;li&gt;Right packet - outer length = &lt;em&gt;2820 bytes&lt;/em&gt; (shown)&lt;/li&gt;
      &lt;li&gt;Right packet - inner length = &lt;em&gt;2696 bytes&lt;/em&gt; (not displayed, can be derived
from subtracting header)&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;2 * 1348 bytes&lt;/em&gt; (left packet data) = &lt;em&gt;2696 bytes&lt;/em&gt; (right packet data)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;the checksum of the innermost TCP packets on the right (&lt;em&gt;#164, #165, #166,
#167&lt;/em&gt;) appears to be invalid&lt;/li&gt;
  &lt;li&gt;So 8 packets were smooshed together
    &lt;ul&gt;
      &lt;li&gt;&lt;em&gt;#46 + #47 = #164&lt;/em&gt;&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;#48 + #49 = #165&lt;/em&gt;&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;#50 + #51 = #166&lt;/em&gt;&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;#52 + #53 = #167&lt;/em&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Finally &lt;em&gt;#54&lt;/em&gt; = &lt;em&gt;#168&lt;/em&gt;. This looks like a valid packet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A tcpdump further down the line indicates the smooshed packets &lt;em&gt;#164-#167&lt;/em&gt; never
made it to the final destination. This resulted in lost packets in iperf3,
causing TCP to drop the window size.&lt;/p&gt;

&lt;p&gt;One of the things that does this is &lt;a href=&quot;https://www.kernel.org/doc/html/latest/networking/segmentation-offloads.html#generic-receive-offload&quot;&gt;Generic Receive
Offload&lt;/a&gt;
(more information at the end). We started toggling offloads off in our
environment, and that confirmed GRO was the culprit!&lt;/p&gt;
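&lt;p&gt;For reference, toggling GRO is done with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ethtool&lt;/code&gt;; the interface name
below is illustrative&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Lowercase -k shows the current offload settings
ethtool -k eth0 | grep generic-receive-offload
# Uppercase -K changes them - turn GRO off
ethtool -K eth0 gro off
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;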

&lt;p&gt;Before&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ubuntu@jakeo3:~$ iperf3 -c 192.168.2.11
Connecting to host 192.168.2.11, port 5201
[  5] local 192.168.2.13 port 54716 connected to 192.168.2.11 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  71.1 KBytes   582 Kbits/sec   13   2.63 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    9   2.63 KBytes
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    8   3.95 KBytes
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    8   2.63 KBytes
[  5]   4.00-5.00   sec  43.4 KBytes   356 Kbits/sec    7   1.32 KBytes
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    9   2.63 KBytes
[  5]   6.00-7.00   sec  42.1 KBytes   345 Kbits/sec    7   2.63 KBytes
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    9   2.63 KBytes
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    8   2.63 KBytes
[  5]   9.00-10.00  sec  42.1 KBytes   345 Kbits/sec    7   2.63 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   199 KBytes   163 Kbits/sec   85             sender
[  5]   0.00-10.05  sec   150 KBytes   122 Kbits/sec                  receiver
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ubuntu@jakeo3:~$ iperf3 -c 192.168.2.11
Connecting to host 192.168.2.11, port 5201
[  5] local 192.168.2.13 port 54720 connected to 192.168.2.11 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  38.5 MBytes   323 Mbits/sec    0   5.61 MBytes
[  5]   1.00-2.00   sec  58.8 MBytes   493 Mbits/sec    5   3.93 MBytes
[  5]   2.00-3.00   sec  60.0 MBytes   503 Mbits/sec    0   3.93 MBytes
[  5]   3.00-4.00   sec  61.2 MBytes   514 Mbits/sec    0   3.93 MBytes
[  5]   4.00-5.00   sec  55.0 MBytes   461 Mbits/sec   73   2.88 MBytes
[  5]   5.00-6.00   sec  51.2 MBytes   430 Mbits/sec   32   1.48 MBytes
[  5]   6.00-7.00   sec  26.2 MBytes   220 Mbits/sec   17   1.11 MBytes
[  5]   7.00-8.00   sec  23.8 MBytes   199 Mbits/sec    0   1.18 MBytes
[  5]   8.00-9.00   sec  18.8 MBytes   157 Mbits/sec    1    908 KBytes
[  5]   9.00-10.00  sec  16.2 MBytes   136 Mbits/sec    7    685 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   410 MBytes   344 Mbits/sec  135             sender
[  5]   0.00-10.05  sec   407 MBytes   340 Mbits/sec                  receiver
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;recap&quot;&gt;Recap&lt;/h1&gt;

&lt;p&gt;Remember at the beginning we said it seemed to be a combination of problems? It turns
out that GRO at different points was messing different flows up. We had to do
multiple iperf3 tests and tcpdumps to figure out which links were not optimal,
and fix them accordingly. The writeup so far is simplified to a single case; the
full story is more complicated. If you are not bored yet, read the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;On a VM, TCP Segmentation Offload (TSO) or Generic Segmentation Offload (GSO) is used.
An application needing to send a big chunk of data over the network needs to
break this into small pieces (segmentation). In software this segmentation is
done by the CPU, and can be CPU intensive. With TSO/GSO, an application dumps the
big chunk to the NIC, which performs the segmentation. Wikipedia explains it better than
I do.&lt;/li&gt;
  &lt;li&gt;GRO is the opposite of GSO. GRO takes segmented packets and combines them back
together.&lt;/li&gt;
  &lt;li&gt;A good way to debug this in a &lt;strong&gt;new&lt;/strong&gt; system would be to turn off all offloads,
start graphing iperf3 results, and then turn them back on one by one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;all-fixed&quot;&gt;All fixed!&lt;/h1&gt;

&lt;p&gt;Once we had identified all the problematic links, our graph looked like this&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/midonet_ovn_new.png&quot; alt=&quot;MidoNet vs OVN New&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We believe that we have identified all the links that are not optimal. This gave
us the confidence to continue our migration to OVN.&lt;/p&gt;
</description>
        <pubDate>Fri, 25 Mar 2022 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/2022/03/25/huge_packet_loss_ovn.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/2022/03/25/huge_packet_loss_ovn.html</guid>
        
        
        <category>openstack</category>
        
      </item>
    
      <item>
        <title>Australia Day</title>
        <description>&lt;p&gt;In my first year in Australia, I was pretty excited when Australia Day came
around and mentioned it to a colleague in passing. Unexpectedly, the colleague
scoffed and said “Bogan Day”.&lt;/p&gt;

&lt;p&gt;At that time, I did not understand the difference between Australia Day and
Singapore’s National Day. Growing up in Singapore, National Day was a big thing.
You see, Singapore was probably the only country that &lt;a href=&quot;https://en.wikipedia.org/wiki/History_of_Singapore#:~:text=On%209%20August%201965%2C%20the,become%20a%20sovereign%2C%20independent%20nation.&quot;&gt;gained independence against
its will&lt;/a&gt;. We basically got kicked out of the house as a young nation by our
older brother (Malaysia). Forced to survive on our own, we made it by working
damn hard and “punching above our weight”.&lt;/p&gt;

&lt;p&gt;Every year, come National Day, it was a day for us to be proud of. On that day,
our family would sit down in front of the TV in the evening to watch the National
Day Parade. It starts off with a grand military parade, where we get to show off
our shiny military hardware. It reminds every Singaporean male, who is conscripted
into &lt;a href=&quot;https://en.wikipedia.org/wiki/National_service_in_Singapore&quot;&gt;2 years of mandatory military service&lt;/a&gt;, that there is something bigger than
self. It reminds us that the 2 years of blood, sweat and toil each of us gave
mean we can stand up on the world stage as an independent country.&lt;/p&gt;

&lt;p&gt;I was expecting Australia Day to have the same sort of patriotic, unifying
effect on the Australians. Unfortunately, as I learnt more about it, I realised
how divisive this day is. Much has been said about the hurt to the Aboriginal
community; there’s nothing more I can add to this conversation. Two articles
this year especially stood out for me. Firstly, a white Australian’s perspective
on how &lt;a href=&quot;https://theshot.net.au/general-news/as-a-white-australian-heres-what-australia-day-means-to-me-fuck-all/&quot;&gt;Australia Day didn’t mean much to him&lt;/a&gt;. Secondly, a fresh take from
Stan Grant on how &lt;a href=&quot;https://www.abc.net.au/news/2021-01-26/changing-australia-day-means-nothing-without-change-stan-grant/13088122&quot;&gt;changing the date is the easy way out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I shy away from personally commenting on issues like this - politics can be
divisive, even when it doesn’t have to be. I know that I will never be accepted
as a true Australian (and that is fine), so I don’t draw attention, or pretend
my words matter.&lt;/p&gt;

&lt;p&gt;But what I’d like to say is how proud I am to be part of this country.
Understand that racism and biases exist in almost every place in the world. For
Australians to be able to acknowledge their existence, and to have a conversation
about it, means that we have the ability to change for the better. This, in my
humble opinion, puts us way ahead of other countries. I hope more Australians
will recognise that this current debate is actually a strength, not a weakness.&lt;/p&gt;

</description>
        <pubDate>Tue, 26 Jan 2021 12:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/thoughts/2021/01/26/australia-day.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/thoughts/2021/01/26/australia-day.html</guid>
        
        
        <category>thoughts</category>
        
      </item>
    
      <item>
        <title>How we broke our national object storage and no one noticed</title>
        <description>&lt;p&gt;During the last 5 years in Nectar, I admit we’ve broken a number of things.
However, one of the most memorable incidents in Nectar occurred just last week,
when we messed up our national swift cluster. Fortunately, we did not lose any
data (fingers crossed), and no one really noticed.&lt;/p&gt;

&lt;h1 id=&quot;about-the-national-swift-cluster&quot;&gt;About the national swift cluster&lt;/h1&gt;

&lt;p&gt;Swift is the software that powers the object storage for Nectar. The backend
storage servers reside in different institutions all over Australia.
Configuration on these servers, as with all of our servers, is managed with
Puppet.&lt;/p&gt;

&lt;p&gt;An object on the Swift cluster is 3x replicated and stored in 3 different
geographical locations. This protects against a local disaster at any
institution.&lt;/p&gt;

&lt;h1 id=&quot;how-this-happen&quot;&gt;How did this happen?&lt;/h1&gt;

&lt;p&gt;We were doing a Swift upgrade when Puppet pushed out a bad version of the Swift
config. This config, while not obviously wrong, caused
the majority of the backend servers to think that the objects hosted on them
were misplaced.&lt;/p&gt;

&lt;p&gt;In more detail, Swift uses a config value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;swift_hash_path_suffix&lt;/code&gt; to determine
placement of objects - i.e. when you put an object on the Swift cluster, which
(3x) backend servers should this object be written to. This config value has to
be the same across all API and storage nodes to ensure a consistent view of the
cluster.&lt;/p&gt;

&lt;p&gt;Due to our changes, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;swift_hash_path_suffix&lt;/code&gt; was changed from a value enclosed
in quotes to a value without quotes on some storage nodes. Approximately 75%
of the storage nodes were affected. The API nodes were not affected.&lt;/p&gt;

&lt;h1 id=&quot;what-happened-then&quot;&gt;What happened then?&lt;/h1&gt;

&lt;p&gt;The servers, with the wrong config, now incorrectly decided that objects on them
were ‘misplaced’, and started moving them to the ‘right’ place in the cluster.
As part of this movement, the original objects were ‘quarantined’ - saved to a
location on the same disk so that recovery is possible.&lt;/p&gt;

&lt;p&gt;As more and more objects were quarantined, disks started filling up. A few nodes
hit 100%, which meant no new data could be written to them. This is dangerous
because if too many disks are full, writes to the cluster can stop completely.&lt;/p&gt;

&lt;h1 id=&quot;oh-crap&quot;&gt;Oh crap&lt;/h1&gt;

&lt;p&gt;Once the issue was identified, we quickly pushed out the correct version of the
config file. This halted the runaway process that was filling up our disks.
However, we now had a few big issues:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We had thousands, if not millions, of objects quarantined, and many disks
were full.&lt;/li&gt;
  &lt;li&gt;Due to the objects being moved, many objects in our cluster now had fewer
than 3 copies in the correct locations. If a disaster were to strike now, it
could wipe out the remaining copy of an object.&lt;/li&gt;
  &lt;li&gt;Service levels were being affected. E.g. if a user tried to write an object to
the cluster, and all 3 destination disks were full, the write would fail.&lt;/li&gt;
  &lt;li&gt;We did not yet know if we had lost data. This was the biggest issue on our minds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;how-we-fixed-it&quot;&gt;How we fixed it&lt;/h1&gt;

&lt;p&gt;We decided that we needed to work the problem from a few angles.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Our immediate priority was to free up some disk space so replication and
writes could work. To do this, we needed to clean up the quarantined objects.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To make sure data was safe, we needed to check through all the objects and
figure out which ones now had 3, 2, 1, or 0 copies.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our strategy was as follows:&lt;/p&gt;

&lt;h2 id=&quot;free-up-space&quot;&gt;Free up space&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Look at the quarantined objects. For each object, find out whether it is
supposed to be on this disk, or on some other disk in the cluster.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the quarantined object belongs on this disk, move it out of quarantine to its
rightful location on the same disk. If there is already an object in that
location, delete the quarantined object to free up disk space. As the object (a
file) stays on the same disk, this is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mv&lt;/code&gt; operation, which is fast. (fast path)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the quarantined object does not belong on this disk, query
swift.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If there are already 2 copies in swift, delete this object. We assume swift
will replicate the third copy once there is enough disk space. (slow path)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;pre&gt;&lt;code class=&quot;language-mermaid&quot;&gt;       graph TD;
           A[check object]--&amp;gt;B{belong to disk?};
           B--&amp;gt;|yes| C[mv];
           B--&amp;gt;|no| D[check swift];
           D--&amp;gt;E{has 2 copies?};
           E--&amp;gt;|yes| F[rm];
           E--&amp;gt;|no| G[leave];
&lt;/code&gt;&lt;/pre&gt;
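&lt;p&gt;To decide between the fast and slow paths, each script needed to know where a
quarantined file actually belongs. A rough sketch of how that can be queried,
assuming Swift’s usual on-disk layout (paths are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Quarantined objects end up under the disk's quarantined/ directory.
# swift-object-info reads a .data file and prints the account/container/
# object name plus the ring locations (the disks it should live on).
swift-object-info /srv/node/d1/quarantined/objects/&amp;lt;hash&amp;gt;/&amp;lt;timestamp&amp;gt;.data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;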

&lt;h3 id=&quot;detailed-decision-on-freeing-up-space&quot;&gt;Detailed decision on freeing up space&lt;/h3&gt;

&lt;p&gt;To be perfectly safe, one should delete a quarantined object only if there are
3 copies. We chose to delete on 2 copies for a few reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Many objects had &amp;lt; 3 copies simply because at least one copy had been quarantined.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The ‘check and mv’ (fast path) can quickly restore a copy on another disk.
This script runs in parallel, one instance per disk. At a point in time an
object might only have 2 copies (disks A and B), but once the script runs on disk
C, the quarantined object moves back to its rightful location. We wanted to
process all objects through the fast path as quickly as possible to get the
maximum number of copies back.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Freeing up space is the priority because we needed replication working.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We did not want to block writes.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;check-all-objects&quot;&gt;Check all objects&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;In parallel, we started checking all objects, notifying if a 0-copy object was
found.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For each such object, look through all possible locations to find out if there is a
copy in quarantine. If there is, restore it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once a copy is restored, the cluster can serve this object normally again.&lt;/p&gt;

&lt;h1 id=&quot;lessons-learnt&quot;&gt;Lessons learnt&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Quotes matter in ini files, who knew?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Swift is pretty damn resilient&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;“&lt;em&gt;Most outages are caused by config changes&lt;/em&gt;” - This has been repeated by many
in the industry, and unfortunately Nectar has contributed to the statistics.
One needs to be careful doing any sort of config change, no matter how trivial
it seems. Bringing up a new service is much simpler in comparison.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;thanks&quot;&gt;Thanks&lt;/h1&gt;

&lt;p&gt;Many thanks to fellow operators from the different sites pulling together to fix
this issue - Matt and Karl from UTas, Glenn from Intersect, Swe from Monash.
Also thanks to my fellow Core Services Operators for hacking up scripts on the
go and using Slack as a DVCS.&lt;/p&gt;

&lt;p&gt;Cheers to Swift and OpenStack for building a damn fine product. With a commercial
closed-source product we would probably be up shit creek waiting for a
vendor to fly in. The nature of open source let us inspect everything that Swift
was doing and hack up a fix in a few hours.&lt;/p&gt;
</description>
        <pubDate>Wed, 01 Jul 2020 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/2020/07/01/how_we_broke_swift.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/2020/07/01/how_we_broke_swift.html</guid>
        
        
        <category>openstack</category>
        
      </item>
    
      <item>
        <title>Using CoreOS on OpenStack</title>
        <description>&lt;p&gt;Most instances on the Nectar Cloud run Linux (Ubuntu, CentOS). On Nectar’s
Linux images, a provisioning tool called &lt;a href=&quot;https://cloudinit.readthedocs.io/&quot;&gt;cloud-init&lt;/a&gt; runs on first boot, which
inserts your SSH key and other user data into the instance. This allows you to
log in to your instance securely using SSH keys, and also to run any scripts for
software installation when your instance first boots up.&lt;/p&gt;

&lt;p&gt;CoreOS uses a different provisioning tool called
&lt;a href=&quot;https://coreos.com/ignition/docs/latest/&quot;&gt;Ignition&lt;/a&gt; instead of cloud-init. This
means that extra steps are necessary to boot up a CoreOS instance and inject
your SSH key.&lt;/p&gt;

&lt;h1 id=&quot;short-way-just-ssh-key&quot;&gt;Short way (just SSH key)&lt;/h1&gt;

&lt;ol&gt;
  &lt;li&gt;If you do not know where your SSH public key is, you can get it from Nectar
Dashboard, on the left menu under &lt;em&gt;Project&lt;/em&gt; &amp;gt; &lt;em&gt;Compute&lt;/em&gt; &amp;gt; &lt;em&gt;Key Pairs&lt;/em&gt;. Or you
can use the CLI to get it
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack keypair show --public-key &amp;lt;NAME&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Create the following file as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-data.json&lt;/code&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
  &quot;ignition&quot;: {
    &quot;version&quot;: &quot;3.0.0&quot;
  },
  &quot;passwd&quot;: {
    &quot;users&quot;: [
      {
        &quot;name&quot;: &quot;core&quot;,
        &quot;sshAuthorizedKeys&quot;: [
          &quot;ssh-rsa AAAAB3NzaC1c2EAA...dzP&quot;
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Boot up using CLI
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack server create --image fedora-coreos-31 --flavor m3.small \
--user-data user-data.json fedora-coreos-instance
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;
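&lt;p&gt;Once the instance is up, you can verify the key was injected by logging in as
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core&lt;/code&gt; user (the user we defined in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-data.json&lt;/code&gt;)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ssh core@&amp;lt;instance-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;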

&lt;h1 id=&quot;long-way&quot;&gt;Long way&lt;/h1&gt;

&lt;p&gt;To build an Ignition configuration file, one has to create a YAML config and
use the FCOS Configuration Transpiler (FCCT) to convert it into JSON. See
&lt;a href=&quot;https://docs.fedoraproject.org/en-US/fedora-coreos/producing-ign/&quot;&gt;Fedora CoreOS pages for more
examples&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;FCCT is provided as a container, so to run it we need something like podman.
Podman is not installed on Ubuntu by default, so we need to install it.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-data.yaml&lt;/code&gt; like this. In this example we only set an SSH key,
but this method is not limited to SSH keys.&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;variant: fcos
version: 1.0.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - &quot;ssh-rsa AAAAB3NzaC1c2EAA...dzP&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;podman&lt;/code&gt; by following the &lt;a href=&quot;https://podman.io/getting-started/installation.html&quot;&gt;Ubuntu
instructions on podman’s
site&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Run fcct
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;podman run -i --rm quay.io/coreos/fcct:release --pretty \
--strict &amp;lt; user-data.yaml &amp;gt; user-data.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;You should get the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-data.json&lt;/code&gt; as the previous (short) example.&lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Thu, 18 Jun 2020 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/coreos/2020/06/18/coreos-openstack.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/coreos/2020/06/18/coreos-openstack.html</guid>
        
        
        <category>openstack</category>
        
        <category>coreos</category>
        
      </item>
    
      <item>
        <title>State of the Cloud 2019</title>
        <description>&lt;p&gt;Now that 2020 is upon us, I thought it might be a good idea to generate some
statistics about the Nectar Cloud for 2019.&lt;/p&gt;

&lt;h1 id=&quot;instances&quot;&gt;Instances&lt;/h1&gt;

&lt;p&gt;In 2019, Nectar Cloud ran a total of &lt;strong&gt;70,371&lt;/strong&gt; instances.&lt;/p&gt;

&lt;h2 id=&quot;vcpu-time&quot;&gt;VCPU time&lt;/h2&gt;

&lt;p&gt;These instances ran for a total of &lt;strong&gt;9,203,375 days, 19 hours 17 minutes and 59
seconds&lt;/strong&gt; of &lt;em&gt;VCPU time&lt;/em&gt;&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. That is around &lt;strong&gt;25,214&lt;/strong&gt; VCPU years!&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;mean VCPU time&lt;/em&gt; is about &lt;strong&gt;130 days&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;mode VCPU time&lt;/em&gt; is &lt;strong&gt;365 days&lt;/strong&gt;, which means there were lots of single core
instances running through the year.&lt;/p&gt;

&lt;h2 id=&quot;flavour&quot;&gt;Flavour&lt;/h2&gt;

&lt;p&gt;The most popular flavour is &lt;strong&gt;m2.large&lt;/strong&gt; (4 VCPU). There were &lt;strong&gt;26,750&lt;/strong&gt; such
instances.&lt;/p&gt;

&lt;h2 id=&quot;end&quot;&gt;End&lt;/h2&gt;

&lt;p&gt;Statistics were generated from &lt;a href=&quot;https://gnocchi.xyz/&quot;&gt;Gnocchi&lt;/a&gt;. Nectar logs the
start/end times of each instance in Gnocchi, as well as a host of other data. As
a Nectar user, you can use the Gnocchi API to access metrics for your resources.&lt;/p&gt;
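&lt;p&gt;As a quick sketch using the gnocchi CLI (the resource id and metric name here
are illustrative; available metrics depend on what is collected for your
project)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# List your instance resources known to Gnocchi
gnocchi resource list --type instance
# Show the measures recorded for one metric on an instance
gnocchi measures show vcpus --resource-id &amp;lt;instance-uuid&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;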

&lt;p&gt;Let me know if this has been interesting, or if there are any other stats you
want to see!&lt;/p&gt;

&lt;h3 id=&quot;footnote&quot;&gt;Footnote&lt;/h3&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;VCPU time is (Number of VCPU) * (Running time). For example, if an
instance has 2 VCPU and has been running for 1 hour, VCPU time is 2 hours. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Mon, 06 Apr 2020 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack,/nectar/2020/04/06/state-of-the-cloud-2019.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack,/nectar/2020/04/06/state-of-the-cloud-2019.html</guid>
        
        
        <category>openstack</category>
        
        <category>nectar</category>
        
      </item>
    
      <item>
        <title>The year in review</title>
        <description>&lt;p&gt;Things are winding down at the end of the year, so I thought it might be helpful
to jot down some of what I did this year, to help gauge whether I am
growing professionally.&lt;/p&gt;

&lt;h1 id=&quot;puppet-catalog-difference-tests&quot;&gt;Puppet catalog difference tests&lt;/h1&gt;

&lt;p&gt;This took up a lot of time, but I felt it was really necessary. We wanted to
greatly improve our testing to make sure we don’t merge puppet code that breaks
the cloud.
&lt;a href=&quot;https://status.cloud.google.com/incident/cloud-networking/19009&quot;&gt;Many&lt;/a&gt;
&lt;a href=&quot;https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/&quot;&gt;outages&lt;/a&gt;
in major cloud providers last year were due to config changes, so preventing
this from happening to NeCTAR was a big priority.&lt;/p&gt;

&lt;p&gt;Our new tests now generate a list of differences in catalogs for puppet nodes.
This greatly helps humans reviewing the changes, as they can now see the actual
resources and nodes being updated by each change.&lt;/p&gt;

&lt;p&gt;Key things in this topic are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Getting all sites’ control repos under control of our CI/CD - so that Jenkins
can trigger tests on changes&lt;/li&gt;
  &lt;li&gt;Wrangling &amp;gt;5 years of legacy puppet code into something that resembles modern
day puppet best practices, and into a consistent format across sites - so
tests work across all sites.&lt;/li&gt;
  &lt;li&gt;Figuring out deprecation strategy for code/config - so we don’t break
existing code and yet allow us to move forward quicker&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not totally done yet, but all the technical challenges have already been
resolved.&lt;/p&gt;

&lt;h1 id=&quot;cellsv1-to-cellsv2&quot;&gt;CellsV1 to CellsV2&lt;/h1&gt;

&lt;p&gt;Huge effort by the whole team. The fact that we managed to pull it off without
downtime is impressive. Basically, in CellsV1, the Core Services database holds all
the information about instances, but in CellsV2 the sites’ databases hold this
information.&lt;/p&gt;

&lt;p&gt;For my part, the bulk of the work involved making sure the CellsV1 and CellsV2
databases were consistent before the switch, and writing code to fix up any
inconsistencies. Boring manual work, but it had to be done. This allowed us to
finally get rid of all the legacy CellsV1 patches!&lt;/p&gt;

&lt;h1 id=&quot;rollout-of-yubikeys&quot;&gt;Rollout of Yubikeys&lt;/h1&gt;

&lt;p&gt;With the increase in attacks, I felt that it was time to beef up our security.
Fortunately, our manager was supportive and we managed to buy some YubiKeys. I
have been experimenting with integrating them into our systems. Right now we
have started using YubiKeys for:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Keeping Keystone credentials safe using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pass&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Sharing passwords using the same&lt;/li&gt;
  &lt;li&gt;SSH forwarding&lt;/li&gt;
  &lt;li&gt;EYAML&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;unified-user-handling-in-puppet&quot;&gt;Unified user handling in Puppet&lt;/h1&gt;

&lt;p&gt;Our puppet code has grown over the years, so code to manage users (for different
systems) was in multiple places. Because of this, adding a new operator to the
cloud meant making multiple changes in different repos. This proved to be a
fair bit of technical debt, and is also a security issue when an operator
account isn’t removed cleanly in all places when they leave.&lt;/p&gt;

&lt;p&gt;To solve this, I created a way to define users in just one place in Puppet. From
this, different systems which need to create users can just read it. Yes this is
the &lt;a href=&quot;https://xkcd.com/927/&quot;&gt;universal standard&lt;/a&gt;, I promise.&lt;/p&gt;

&lt;h1 id=&quot;security-security-security&quot;&gt;Security, security, security&lt;/h1&gt;

&lt;ol&gt;
  &lt;li&gt;Added eyaml support in Puppet&lt;/li&gt;
  &lt;li&gt;Moved rsyslog to using SSL; this will allow for centralised logging. Future
work will let sites send their logs to us so we have a single pane of glass
for observing events. This can be useful, for example, in tracing a user’s
request to boot an instance - we will be able to trace it all the way from API
(in Core Services) down to the compute node (at the site).&lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Thu, 19 Dec 2019 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/others/2019/12/19/year-in-review.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/others/2019/12/19/year-in-review.html</guid>
        
        
        <category>others</category>
        
      </item>
    
      <item>
        <title>Passing entropy to virtual machines</title>
        <description>&lt;p&gt;Recently, when we were working on testing new images with Magnum, I found that
the newest Fedora Atomic 29 images were taking a long time to boot up. A closer
look using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nova console-log&lt;/code&gt; revealed that they were getting stuck at boot
with the following error.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[   12.220574] audit: type=1130 audit(1555723526.895:78): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-machine-id-commit comm=&quot;systemd&quot; exe=&quot;/usr/lib/systemd/systemd&quot; hostname=? addr=? terminal=? res=success'
[   12.248050] audit: type=1130 audit(1555723526.906:79): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journal-catalog-update comm=&quot;systemd&quot; exe=&quot;/usr/lib/systemd/systemd&quot; hostname=? addr=? terminal=? res=success'
[ 1061.103725] random: crng init done
[ 1061.108094] random: 7 urandom warning(s) missed due to ratelimiting
[ 1063.306231] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The number between the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[]&lt;/code&gt; is the number of seconds since boot; as you can
see, the VM was stuck for &amp;gt;1000 secs waiting for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;crng init&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It turns out that in some newer images, boot will block waiting for sufficient
entropy. Entropy, or randomness, is used in operating systems for important
things like random number generation. Traditionally machines generate their
entropy by looking at inputs that are random, e.g. disk writes, mouse movement.
(Fun experiment: On a Linux machine do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cat /dev/random&lt;/code&gt;, wait for the output
to stop and move your mouse.)&lt;/p&gt;

&lt;p&gt;A fresh VM has very few avenues to collect entropy, so unfortunately if it
doesn’t have enough it may block. Luckily the smart people at QEMU have
a solution called &lt;a href=&quot;https://wiki.qemu.org/Features/VirtIORNG&quot;&gt;VirtIO RNG&lt;/a&gt;, which
involves passing entropy from the host hypervisor to the VM. This allows the
VM to seed its entropy pool and happily continue booting.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://wiki.openstack.org/wiki/LibvirtVirtioRng&quot;&gt;OpenStack config docs&lt;/a&gt;
point out that you need to set this in two places: on the flavor and on the
image. The flavor setting controls whether an image booted with this flavor is
allowed to drain the host’s entropy, and at what rate it is allowed to do so.
Finding the correct rate is a bit of trial and error, as you want a rate high enough
that the VM will not block, but low enough that a malicious VM will not
be able to totally drain a hypervisor’s entropy.&lt;/p&gt;

&lt;p&gt;With a bit of testing, and advice from our fellow OpenStack Operators at
Catalyst NZ, we have found the following values to work for us:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hw_rng:allowed='True'&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hw_rng:rate_bytes='24'&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hw_rng:rate_period='5000'&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
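&lt;p&gt;These are flavor extra specs, so they are set with the OpenStack CLI like so
(the flavor name is illustrative). The image side needs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hw_rng_model=virtio&lt;/code&gt;, as
per the config docs linked above.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Allow VMs of this flavor to draw entropy from the host, rate limited
openstack flavor set m1.small \
  --property hw_rng:allowed='True' \
  --property hw_rng:rate_bytes='24' \
  --property hw_rng:rate_period='5000'
# Tell nova to attach a virtio-rng device for this image
openstack image set --property hw_rng_model='virtio' &amp;lt;image&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;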

&lt;p&gt;NOTE: as our friends from Catalyst point out, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rate_period&lt;/code&gt; is specified in
milliseconds, not seconds as some documentation states.&lt;/p&gt;

&lt;h3 id=&quot;same-rate-different-periods&quot;&gt;Same rate, different periods?&lt;/h3&gt;
&lt;p&gt;One of the questions we had to answer was: what is the effect of the period
setting? For example, if we know that we need 1 byte/second of entropy, is
it better to set&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;1 byte in 1 second, or&lt;/li&gt;
  &lt;li&gt;5 bytes in 5 seconds?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both of these settings translate to the same effective &lt;em&gt;rate&lt;/em&gt;, but their
performance characteristics can be very different.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://wiki.qemu.org/Features/VirtIORNG#Effect_of_the_period_parameter&quot;&gt;this
note&lt;/a&gt;,
it was suggested that a smaller period is better because it means your process
will block for a shorter time. However, a longer period can have some advantages
in a cloud environment.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;in the case of a benign VM, we want it to be able to burst as much as possible.
If it needs 5 bytes it can get them in the first second rather than waiting until the
5th second&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;in the case of a malicious VM, being able to block it for 5 seconds means other
VMs will have a better chance to consume entropy&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hence, we have set a relatively large period on our environment.&lt;/p&gt;
</description>
        <pubDate>Tue, 01 Oct 2019 12:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/2019/10/01/entrophy-in-virtual-machines.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/2019/10/01/entrophy-in-virtual-machines.html</guid>
        
        
        <category>openstack</category>
        
      </item>
    
      <item>
        <title>Kubernetes With Loadbalancer</title>
        <description>&lt;p&gt;In &lt;a href=&quot;/nectar/openstack/2018/09/03/openstack-magnum.html&quot;&gt;Kubernetes Part I&lt;/a&gt;, we discussed how to spin up a kubernetes cluster easily
on Nectar. In this post, we will discuss how to host an application and access
it externally.&lt;/p&gt;

&lt;p&gt;To begin, you should already have a working cluster. If you do not, head back to
the previous post and follow the steps.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Check that your cluster is working
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Start a container image. We use nginx as an example
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl run nginx --image nginx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;This command will start a &lt;em&gt;pod&lt;/em&gt; with a &lt;em&gt;container&lt;/em&gt; inside it running the
nginx image.  On Kubernetes, the smallest runnable unit is a &lt;em&gt;pod&lt;/em&gt;, which holds
one (or more) &lt;em&gt;containers&lt;/em&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Check that your pod has started up and is running.
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get pods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Now that you have a pod working, we need a way of getting to it from the
Internet. In Nectar Cloud, we can do this by creating a load balancer. A
load balancer has a public (floating) IP, and redirects traffic sent to this public
IP to one or more private addresses. Use the following yaml to create your load
balancer. Save it as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nginxservice.yaml&lt;/code&gt;.&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: nginxservice
  labels:
    app: nginx
  annotations:
    loadbalancer.openstack.org/floating-network-id: 'e48bdd06-cc3e-46e1-b7ea-64af43c74ef8'
spec:
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
  selector:
    run: nginx
  type: LoadBalancer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Note that the uuid in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loadbalancer.openstack.org/floating-network-id&lt;/code&gt; annotation
refers to a network in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;melbourne&lt;/code&gt;. If your cluster is in a different AZ, you
might want to choose a floating IP network closer to where your cluster is for
routing efficiency. However, even without it, things still work! That’s the
beauty of the Nectar Advanced Network - no matter which AZ the traffic ingresses
from, it is still able to make its way to your VM on the Nectar Cloud.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Run it as
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl create -f nginxservice.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the public IP of the load balancer
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get services
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You should be able to browse to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://&amp;lt;ip&amp;gt;&lt;/code&gt; and see the nginx welcome page.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;If this doesn’t work, you might not have the correct security groups applied.
Find the port the IP is on:
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack floating ip list --floating-ip-address 103.6.252.52 -c Port -f value
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Apply a security group that has an HTTP rule to that port, or, if you do
not already have one, create it first.&lt;/p&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack security group create http
openstack security group rule create --ingress --dst-port 80 http
openstack port set --security-group http fe008711-7469-4c44-8489-46abbc8b1774
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;This is an external load balancer (external to kubernetes), and is created in
Neutron. You can see the loadbalancer in Neutron by doing
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;neutron lbaas-loadbalancer-list
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;
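
&lt;p&gt;As a quick sanity check before moving on: while the load balancer is still being
built, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXTERNAL-IP&lt;/code&gt; column shows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;pending&amp;gt;&lt;/code&gt;, so something like the sketch
below should confirm everything is wired up (the address is just the example
floating IP from the steps above):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Watch until EXTERNAL-IP changes from &amp;lt;pending&amp;gt; to a floating IP
kubectl get services --watch

# Then check that nginx answers on port 80
curl -I http://103.6.252.52
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;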

&lt;p&gt;More details on what we have just done:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;We started an external &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LoadBalancer&lt;/code&gt; service in Kubernetes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Kubernetes understands that it has to create this loadbalancer (externally)
by calling out to the openstack neutron provider.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloud-provider-openstack&lt;/code&gt; plugin in kubernetes then creates the different
pieces that make it all work, namely the floating ip, load balancer, pool,
listener and members. These are all openstack resources. It mirrors them to the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LoadBalancer&lt;/code&gt; service you see in kubernetes when you do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl get
services&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The plugin configures all of them and gets the floating IP displayed in
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl get services&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
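
&lt;p&gt;If you are curious about those individual openstack pieces, the same neutron CLI
used above can list them too. A minimal sketch, assuming the lbaas v2 commands
shown earlier (substitute your own pool name or id):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# The listener that accepts traffic on port 80
neutron lbaas-listener-list

# The pool of backends behind the listener
neutron lbaas-pool-list

# The members (your kubernetes nodes) inside a pool
neutron lbaas-member-list &amp;lt;pool&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;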
</description>
        <pubDate>Sun, 28 Apr 2019 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/kubernetes/2019/04/28/kubernetes-with-loadbalancer.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/kubernetes/2019/04/28/kubernetes-with-loadbalancer.html</guid>
        
        
        <category>openstack</category>
        
        <category>kubernetes</category>
        
      </item>
    
      <item>
        <title>GitLab and Kubernetes Integration</title>
        <description>&lt;p&gt;In the previous blog post we described &lt;a href=&quot;/nectar/openstack/2018/09/03/openstack-magnum.html&quot;&gt;how to spin up a kubernetes cluster on
Nectar&lt;/a&gt;. Around the same time, I
also learned that the University of Melbourne has a &lt;a href=&quot;https://gitlab.unimelb.edu.au/&quot;&gt;self-hosted
gitlab&lt;/a&gt;. To my delight, I found out that GitLab
has &lt;a href=&quot;https://about.gitlab.com/kubernetes/&quot;&gt;Kubernetes integration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This means that, if you are in UniMelb (or have a self-hosted gitlab), you can
run CI/CD using Nectar cloud, without having to set up any infrastructure!&lt;/p&gt;

&lt;h1 id=&quot;spin-up-cluster&quot;&gt;Spin up cluster&lt;/h1&gt;
&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Spin up a kubernetes (k8s) cluster and create the config directory.
&lt;a href=&quot;/nectar/openstack/2018/09/03/openstack-magnum.html&quot;&gt;Instructions&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Run the following
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl create clusterrolebinding permissive-binding --clusterrole=cluster-admin \
  --user=admin --user=kubelet --group=system:serviceaccounts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the default secret name (in format &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;default-token-xxxx&lt;/code&gt;)
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get secrets
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the token for this secret
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl describe secrets/default-token-xxxx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the API URL.
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cat $KUBECONFIG
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Look for a section like&lt;/p&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;clusters:
- cluster:
    server: https://192.168.1.1:6443
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the CA cert. In the directory where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$KUBECONFIG&lt;/code&gt; is stored, do
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cat ca.pem
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;
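
&lt;p&gt;As an aside, if you prefer to grab the raw token in one go instead of reading it
out of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl describe&lt;/code&gt;, a jsonpath query should also work (substitute the
actual secret name from step 3):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Tokens are stored base64-encoded in the secret
kubectl get secret default-token-xxxx -o jsonpath='{.data.token}' | base64 --decode
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;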

&lt;h1 id=&quot;configure-repo&quot;&gt;Configure Repo&lt;/h1&gt;
&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;In the GitLab repo, navigate to &lt;strong&gt;Operations&lt;/strong&gt; - &lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Fill in the cluster details that you got from the previous steps.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;In the list of Applications, install the following in order:
    &lt;ol&gt;
      &lt;li&gt;Helm Tiller&lt;/li&gt;
      &lt;li&gt;GitLab Runner&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.gitlab-ci.yml&lt;/code&gt; file. For example, if I want to run yamllint on my
code, an example file will look like this (see the note on job images after
this list):&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;before_script:
  - apt-get update
  - apt-get install -y python yamllint
  - apt-get install -y python3-pkg-resources python3-setuptools
  - python --version
  - which python

yaml-lint:
  script:
    - find . -type f \( -iname &quot;*.yaml&quot; -o -iname &quot;*.eyaml&quot; \) | xargs yamllint
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Commit and push the change. When the change is pushed to GitLab, a runner
will start up and run the job specified by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.gitlab-ci.yml&lt;/code&gt;. Jobs can be
viewed from the &lt;strong&gt;CI/CD&lt;/strong&gt; tab.&lt;/li&gt;
&lt;/ol&gt;
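
&lt;p&gt;One note on job images, as promised above: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apt-get&lt;/code&gt; calls in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;before_script&lt;/code&gt; assume the job runs in a Debian-like container, so it is safer
to pin one explicitly with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image&lt;/code&gt; keyword. A minimal sketch (the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu:18.04&lt;/code&gt; tag is just an illustrative choice, not something the
integration requires):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Pin the container image every job runs in
image: ubuntu:18.04

before_script:
  - apt-get update
  - apt-get install -y yamllint
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;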

&lt;h1 id=&quot;limitations&quot;&gt;Limitations&lt;/h1&gt;
&lt;p&gt;At this point in time, the GitLab CE integration supports only one kubernetes
cluster per repo; there is no way to have separate dev/test/prod clusters for a
repo.&lt;/p&gt;

&lt;p&gt;Also, it &lt;a href=&quot;https://gitlab.com/gitlab-org/gitlab-ce/issues/29398&quot;&gt;does not support
RBAC&lt;/a&gt;, which means the
integration will have full permissions on the cluster. You really want to
dedicate one k8s cluster to one repo, and not have any other containers running
on that cluster.&lt;/p&gt;
</description>
        <pubDate>Thu, 20 Sep 2018 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/nectar/openstack/2018/09/20/gitlab-kubernetes-integration.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/nectar/openstack/2018/09/20/gitlab-kubernetes-integration.html</guid>
        
        
        <category>nectar</category>
        
        <category>openstack</category>
        
      </item>
    
  </channel>
</rss>
