<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>... all the running ...</title>
    <description>&quot;Now, here, you see, it takes all the running you can do, to keep in the same place&quot; - Red Queen
</description>
    <link>http://blog.alltherunning.com/</link>
    <atom:link href="http://blog.alltherunning.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 18 Jul 2022 01:47:58 +0000</pubDate>
    <lastBuildDate>Mon, 18 Jul 2022 01:47:58 +0000</lastBuildDate>
    <generator>Jekyll v3.9.2</generator>
    
      <item>
        <title>Ubuntu Jammy disables ssh-rsa</title>
        <description>&lt;p&gt;Have you upgrade to Ubuntu Jammy lately, and have SSH access or git breaking? If
so, you have come to the right place!&lt;/p&gt;

&lt;p&gt;Ubuntu Jammy (22.04) launched recently, and one of the biggest changes is that
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; is &lt;a href=&quot;jammy-release-notes&quot;&gt;disabled by default&lt;/a&gt; in the version of
OpenSSH it ships with.&lt;/p&gt;

&lt;p&gt;There is a lot of confusion on the internet, and docs still seem to be a bit
sketchy, so I hope this will help someone out!&lt;/p&gt;

&lt;h1 id=&quot;points-to-note&quot;&gt;Points to note&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;There is a key &lt;em&gt;type&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt;. This is the ‘default’ key that OpenSSH has been
generating. You probably have a key of this type. &lt;strong&gt;This is not disabled,
yet.&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There is a key &lt;em&gt;algorithm&lt;/em&gt;, also named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt;. &lt;strong&gt;This is the one that is
disabled.&lt;/strong&gt; It uses the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; key &lt;em&gt;type&lt;/em&gt;, along with the SHA-1 hash, for
authentication in SSH. SHA-1 is now considered broken, and should be
replaced with SHA-256 or SHA-512.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To allow for continuing use of the key &lt;em&gt;type&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt;, &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc8332&quot;&gt;RFC8332&lt;/a&gt;
defined two new key &lt;em&gt;algorithms&lt;/em&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-512&lt;/code&gt;. These have
been supported by major operating systems for a while.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;SSH clients and servers negotiate and use the stronger algorithms if they
are supported, and fall back to the weaker algorithm if not.
Therefore, you may be using your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; key &lt;em&gt;type&lt;/em&gt; with a bunch of different
servers with varying key &lt;em&gt;algorithms&lt;/em&gt; without realising it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Ubuntu Jammy, as an SSH Client, will now refuse to talk to a server if it
tries to use the weaker &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; key &lt;em&gt;algorithm&lt;/em&gt; for SSH.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; key can still be used; however, the server you
are talking to MUST support the newer key algorithms.&lt;/p&gt;

&lt;p&gt;Unfortunately, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt; support is still making its way into major
software. See the list below for more information.&lt;/p&gt;

&lt;h1 id=&quot;testing&quot;&gt;Testing&lt;/h1&gt;

&lt;p&gt;To test if a server supports &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-512&lt;/code&gt;, do the following&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh -o PubkeyAcceptedKeyTypes=rsa-sha2-256 &amp;lt;user&amp;gt;@&amp;lt;server&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can also test how the connection behaves with only the legacy &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt; algorithm by doing&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh -o PubkeyAcceptedKeyTypes=ssh-rsa &amp;lt;user&amp;gt;@&amp;lt;server&amp;gt;&lt;/code&gt;&lt;/p&gt;
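&lt;p&gt;If you want to see which signature algorithms a server advertises, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;server-sig-algs&lt;/code&gt; extension (RFC8308) carries exactly that, and recent OpenSSH
clients print it in verbose mode. The exact debug wording varies between
versions, so treat this as a sketch&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ssh -v &amp;lt;user&amp;gt;@&amp;lt;server&amp;gt; 2&amp;gt;&amp;amp;1 | grep server-sig-algs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;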

&lt;p&gt;If the first test breaks, this means the software doesn’t support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt;. You can,
in order of preference:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;allow-list the particular server,&lt;/li&gt;
  &lt;li&gt;upgrade to a newer version (check the software list below), or&lt;/li&gt;
  &lt;li&gt;change to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ed25519&lt;/code&gt; keys (see the example below).&lt;/li&gt;
&lt;/ul&gt;
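&lt;p&gt;For the last option, generating a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ed25519&lt;/code&gt; key is a one-liner (the
paths below are just the OpenSSH defaults)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Generate a new ed25519 keypair (written to ~/.ssh/id_ed25519 by default)
ssh-keygen -t ed25519
# Copy the new public key to the server
ssh-copy-id -i ~/.ssh/id_ed25519.pub &amp;lt;user&amp;gt;@&amp;lt;server&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;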

&lt;h1 id=&quot;allow-listing-servers&quot;&gt;Allow listing servers&lt;/h1&gt;

&lt;p&gt;You can set this in your SSH config (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.ssh/config&lt;/code&gt;) for each server you want
to use the weaker key with.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Host &amp;lt;hostname&amp;gt;
    PubkeyAcceptedKeyTypes +ssh-rsa
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;software-needing-update&quot;&gt;Software needing update&lt;/h1&gt;

&lt;p&gt;You might be running a particular application that breaks now that you are
connecting to it from Jammy. This is because a lot of SSH servers traditionally
only support the basic SHA-1 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-rsa&lt;/code&gt;, and have not implemented
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt;. This includes many SSH libraries, like paramiko and mina, which
other software uses to build SSH/git server functionality.&lt;/p&gt;

&lt;p&gt;These libraries have released newer versions which support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsa-sha2-256&lt;/code&gt;, but
as we engineers know, you can mark a thing as deprecated for a LONG TIME and
people will keep using it, only upgrading once things break. :)&lt;/p&gt;

&lt;p&gt;Here is a list of links to the relevant software&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.openssh.com/txt/release-8.7&quot;&gt;openssh&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://bugs.launchpad.net/ubuntu/+source/paramiko/+bug/1961979&quot;&gt;paramiko&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://jira.atlassian.com/browse/BSERV-13013&quot;&gt;bitbucket&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/SSHD-895&quot;&gt;mina&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://bugs.chromium.org/p/gerrit/issues/detail?id=13930&quot;&gt;gerrit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hope this helps to clear the confusion! Feel free to reach out with suggestions
/ improvements.&lt;/p&gt;

</description>
        <pubDate>Tue, 05 Jul 2022 12:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/ubuntu/2022/07/05/jammy-ssh.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/ubuntu/2022/07/05/jammy-ssh.html</guid>
        
        
        <category>ubuntu</category>
        
      </item>
    
      <item>
        <title>Huge packet losses with OVN</title>
        <description>&lt;p&gt;A service provided by the Nectar Cloud is ‘Tenant Networks’, where a user can
create their own networks in their tenancy to connect VMs together. Tenant
Networks have the following features:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;They are private to the tenant, in the sense that there can be multiple e.g.
&lt;em&gt;192.168.1.0/24&lt;/em&gt; networks created by different users and they are all isolated
from each other&lt;/li&gt;
  &lt;li&gt;Networks can span availability zones, which means VMs in Queensland and
Melbourne can be connected to the same network.&lt;/li&gt;
  &lt;li&gt;Traffic from these networks is NAT’ed for egress to the internet&lt;/li&gt;
  &lt;li&gt;Similarly, ingress traffic can be NAT’ed via Floating IPs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The type of software that drives this is called a Software Defined Network (SDN).
The SDN Nectar is using is MidoNet. Due to MidoNet being unmaintained and the
OpenStack community moving to OVN, we are currently migrating our SDN from
MidoNet to OVN.&lt;/p&gt;

&lt;h1 id=&quot;infrastructure&quot;&gt;Infrastructure&lt;/h1&gt;

&lt;p&gt;Nectar has Availability Zones (AZs) all over Australia. Compute Nodes in each
AZ are on private RFC1918 networks. For an SDN to work properly, Compute Nodes
in each AZ need to be able to reach any other Compute Node. To do that, we created
an additional overlay network named ‘WAGNET’.&lt;/p&gt;

&lt;p&gt;To create WAGNET, Network Nodes in each AZ form a mesh-like network with other
AZs using tunnels over the public Internet. A simplified diagram of two AZs with
Compute and Network nodes looks like this&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/wagnet.png&quot; alt=&quot;WAGNET&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this diagram, a tenant has VMs in two AZs. They have created a Tenant
Network (192.168.1.0/24) which is purely virtual. Traffic destined for a VM in
another AZ is encapsulated by the Compute Node, then further encapsulated by
the Network Node and passed over the Internet.&lt;/p&gt;

&lt;h1 id=&quot;ovn-testing&quot;&gt;OVN Testing&lt;/h1&gt;

&lt;p&gt;As part of our testing, we benchmark test networks before and after migration
using iperf3 to see if there is any performance difference. Unfortunately, this
testing revealed a huge traffic drop in some places when the network was changed
from MidoNet to OVN.&lt;/p&gt;

&lt;p&gt;An example output of the iperf3 test looks like&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;root@jakeo1:~# iperf3 -c 192.168.2.150
Connecting to host 192.168.2.150, port 5201
[  5] local 192.168.2.211 port 50058 connected to 192.168.2.150 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   102 KBytes   833 Kbits/sec    9   2.83 KBytes       
[  5]   1.00-2.00   sec  0.00 Bytes    0.00 bits/sec   15   5.66 KBytes       
[  5]   2.00-3.00   sec  31.1 KBytes   255 Kbits/sec   12   2.83 KBytes       
[  5]   3.00-4.00   sec  31.1 KBytes   255 Kbits/sec   10   2.83 KBytes       
[  5]   4.00-5.00   sec  31.1 KBytes   255 Kbits/sec    6   2.83 KBytes       
[  5]   5.00-6.00   sec  31.1 KBytes   255 Kbits/sec    9   1.41 KBytes       
[  5]   6.00-7.00   sec  31.1 KBytes   255 Kbits/sec    9   5.66 KBytes       
[  5]   7.00-8.00   sec  0.00 Bytes    0.00 bits/sec    9   1.41 KBytes       
[  5]   8.00-9.00   sec  31.1 KBytes   255 Kbits/sec   11   5.66 KBytes       
[  5]   9.00-10.00  sec  0.00 Bytes    0.00 bits/sec   10   2.83 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   288 KBytes   236 Kbits/sec  100             sender
[  5]   0.00-10.04  sec   238 KBytes   194 Kbits/sec                  receiver

iperf Done.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Comparatively, a good iperf3 run looks like this (reverse direction, using the -R flag)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;root@jakeo1:~# iperf3 -c 192.168.2.150 -R
Connecting to host 192.168.2.150, port 5201
Reverse mode, remote host 192.168.2.150 is sending
[  5] local 192.168.2.211 port 50062 connected to 192.168.2.150 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  51.6 MBytes   433 Mbits/sec                  
[  5]   1.00-2.00   sec  69.7 MBytes   585 Mbits/sec                  
[  5]   2.00-3.00   sec  69.6 MBytes   584 Mbits/sec                  
[  5]   3.00-4.00   sec  69.7 MBytes   585 Mbits/sec                  
[  5]   4.00-5.00   sec  70.5 MBytes   591 Mbits/sec                  
[  5]   5.00-6.00   sec  70.3 MBytes   589 Mbits/sec                  
[  5]   6.00-7.00   sec  70.5 MBytes   591 Mbits/sec                  
[  5]   7.00-8.00   sec  70.3 MBytes   589 Mbits/sec                  
[  5]   8.00-9.00   sec  70.4 MBytes   591 Mbits/sec                  
[  5]   9.00-10.00  sec  70.1 MBytes   588 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.03  sec   686 MBytes   573 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   683 MBytes   573 Mbits/sec                  receiver

iperf Done.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is obvious that the bitrate is horrible, and the small congestion window
(Cwnd) size leads us to believe that packets are being dropped, causing
congestion control to keep shrinking the Cwnd.&lt;/p&gt;

&lt;p&gt;On the surface, this was a very interesting problem because it affected VMs in
different availability zones (AZs), and in different directions. Nectar has AZs
all over Australia. We started doing iperf3 tests between sites in different
locations.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/midonet_ovn_old.png&quot; alt=&quot;MidoNet vs OVN&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For Monash and Auckland, OVN’s throughput is basically nothing (a few kbps). For
OVN ingress to QRISCloud, it is also basically nothing. Strangely, for Swinburne
it is the opposite - egress traffic was the one that dropped substantially.&lt;/p&gt;

&lt;h1 id=&quot;debugging&quot;&gt;Debugging&lt;/h1&gt;

&lt;p&gt;Since there wasn’t an easily discernible pattern at first sight, we suspected
that this might be a combination of problems. After ruling out hardware,
software and site config, we started tcpdumping iperf3 tests.&lt;/p&gt;

&lt;p&gt;After a long few days we found the problem. Below is a tcpdump from two hosts.
On the left is a hypervisor, and on the right is a network node. They are
capturing the same flow of an iperf3 test.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/tcpdump.png&quot;&gt;&lt;img src=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/tcpdump.png&quot; alt=&quot;tcpdump&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that two packets are combined into one somewhere between the
two capture points.&lt;/p&gt;

&lt;p&gt;We can see that the first 3 packets (&lt;em&gt;#35, #37, #38&lt;/em&gt;) on the left are the same as
the first 3 packets (&lt;em&gt;#158, #159, #160&lt;/em&gt;) on the right. All is good.&lt;/p&gt;

&lt;p&gt;Next up is where the problem starts. Packet 46 on the left is a &lt;strong&gt;1472&lt;/strong&gt; byte packet.
On the right, this appears as a &lt;strong&gt;2820&lt;/strong&gt; byte packet.&lt;/p&gt;

&lt;p&gt;Looking deeper, there are a few things to note here&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;It seems that 2 of the packets on the left (&lt;em&gt;#46, #47&lt;/em&gt;) are combined into 1 (&lt;em&gt;#164&lt;/em&gt;)&lt;/li&gt;
  &lt;li&gt;This can be derived from the screenshot because
    &lt;ul&gt;
      &lt;li&gt;Left packet - outer length = &lt;em&gt;1472 bytes&lt;/em&gt; (shown)&lt;/li&gt;
      &lt;li&gt;Left packet - inner length of TCP data = &lt;em&gt;1348 bytes&lt;/em&gt; (shown)&lt;/li&gt;
      &lt;li&gt;Therefore, there must be &lt;em&gt;1472 - 1348 = 124 bytes&lt;/em&gt; of header&lt;/li&gt;
      &lt;li&gt;Right packet - outer length = &lt;em&gt;2820 bytes&lt;/em&gt; (shown)&lt;/li&gt;
      &lt;li&gt;Right packet - inner length = &lt;em&gt;2696 bytes&lt;/em&gt; (not displayed, can be derived
from subtracting header)&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;2 * 1348 bytes&lt;/em&gt; (left packet data) = &lt;em&gt;2696 bytes&lt;/em&gt; (right packet data)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;the checksum of the innermost TCP packets on the right (&lt;em&gt;#164, #165, #166,
#167&lt;/em&gt;) appears to be invalid&lt;/li&gt;
  &lt;li&gt;So 8 packets were smooshed together
    &lt;ul&gt;
      &lt;li&gt;&lt;em&gt;#46 + #47 = #164&lt;/em&gt;&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;#48 + #49 = #165&lt;/em&gt;&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;#50 + #51 = #166&lt;/em&gt;&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;#52 + #53 = #167&lt;/em&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Finally &lt;em&gt;#54&lt;/em&gt; = &lt;em&gt;#168&lt;/em&gt;. This looks like a valid packet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A tcpdump further down the line indicates the smooshed packets &lt;em&gt;#164-#167&lt;/em&gt; never
made it to the final destination. This resulted in lost packets in iperf3,
causing TCP to drop the window size.&lt;/p&gt;

&lt;p&gt;One of the things that does this is &lt;a href=&quot;https://www.kernel.org/doc/html/latest/networking/segmentation-offloads.html#generic-receive-offload&quot;&gt;Generic Receive
Offload&lt;/a&gt;
(more information at the end). We started toggling offloads off in our
environment, and that confirmed GRO was the culprit!&lt;/p&gt;
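&lt;p&gt;For reference, toggling GRO is done with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ethtool&lt;/code&gt;; the interface name
below is illustrative&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Lowercase -k shows the current offload settings
ethtool -k eth0 | grep generic-receive-offload
# Uppercase -K changes them - turn GRO off
ethtool -K eth0 gro off
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;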

&lt;p&gt;Before&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ubuntu@jakeo3:~$ iperf3 -c 192.168.2.11
Connecting to host 192.168.2.11, port 5201
[  5] local 192.168.2.13 port 54716 connected to 192.168.2.11 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  71.1 KBytes   582 Kbits/sec   13   2.63 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    9   2.63 KBytes
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    8   3.95 KBytes
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    8   2.63 KBytes
[  5]   4.00-5.00   sec  43.4 KBytes   356 Kbits/sec    7   1.32 KBytes
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    9   2.63 KBytes
[  5]   6.00-7.00   sec  42.1 KBytes   345 Kbits/sec    7   2.63 KBytes
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    9   2.63 KBytes
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    8   2.63 KBytes
[  5]   9.00-10.00  sec  42.1 KBytes   345 Kbits/sec    7   2.63 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   199 KBytes   163 Kbits/sec   85             sender
[  5]   0.00-10.05  sec   150 KBytes   122 Kbits/sec                  receiver
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ubuntu@jakeo3:~$ iperf3 -c 192.168.2.11
Connecting to host 192.168.2.11, port 5201
[  5] local 192.168.2.13 port 54720 connected to 192.168.2.11 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  38.5 MBytes   323 Mbits/sec    0   5.61 MBytes
[  5]   1.00-2.00   sec  58.8 MBytes   493 Mbits/sec    5   3.93 MBytes
[  5]   2.00-3.00   sec  60.0 MBytes   503 Mbits/sec    0   3.93 MBytes
[  5]   3.00-4.00   sec  61.2 MBytes   514 Mbits/sec    0   3.93 MBytes
[  5]   4.00-5.00   sec  55.0 MBytes   461 Mbits/sec   73   2.88 MBytes
[  5]   5.00-6.00   sec  51.2 MBytes   430 Mbits/sec   32   1.48 MBytes
[  5]   6.00-7.00   sec  26.2 MBytes   220 Mbits/sec   17   1.11 MBytes
[  5]   7.00-8.00   sec  23.8 MBytes   199 Mbits/sec    0   1.18 MBytes
[  5]   8.00-9.00   sec  18.8 MBytes   157 Mbits/sec    1    908 KBytes
[  5]   9.00-10.00  sec  16.2 MBytes   136 Mbits/sec    7    685 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   410 MBytes   344 Mbits/sec  135             sender
[  5]   0.00-10.05  sec   407 MBytes   340 Mbits/sec                  receiver
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;recap&quot;&gt;Recap&lt;/h1&gt;

&lt;p&gt;Remember at the beginning we said it seemed to be a combination of problems? It turns
out that GRO at different points was messing different flows up. We had to do
multiple iperf3 tests and tcpdumps to figure out which links were not optimal,
and fix them accordingly. The writeup so far is simplified to a single case; the
full story is more complicated. If you are not bored yet, read the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;On a VM, TCP Segmentation Offload (TSO) or Generic Segmentation Offload (GSO) is used.
An application needing to send a big chunk of data over the network needs to
break this into small pieces (segmentation). In software this segmentation is
done by the CPU, and can be CPU intensive. With TSO/GSO, an application dumps the
big chunk to the NIC, which performs the segmentation. Wikipedia explains it better than
I do.&lt;/li&gt;
  &lt;li&gt;GRO is the opposite of GSO. GRO takes segmented packets and combines them back
together.&lt;/li&gt;
  &lt;li&gt;A good way to debug this in a &lt;strong&gt;new&lt;/strong&gt; system would be to turn off all offloads,
start graphing iperf3 results, and then turn them back on one by one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;all-fixed&quot;&gt;All fixed!&lt;/h1&gt;

&lt;p&gt;Once we had identified all the problematic links, our graph looked like this&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2022-03-25-slow_iperf3_ovn/midonet_ovn_new.png&quot; alt=&quot;MidoNet vs OVN New&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We believe that we have identified all the links that are not optimal. This gave
us the confidence to continue our migration to OVN.&lt;/p&gt;
</description>
        <pubDate>Fri, 25 Mar 2022 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/2022/03/25/huge_packet_loss_ovn.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/2022/03/25/huge_packet_loss_ovn.html</guid>
        
        
        <category>openstack</category>
        
      </item>
    
      <item>
        <title>Australia Day</title>
        <description>&lt;p&gt;In my first year in Australia, I was pretty excited when Australia Day came
around and mentioned it to a colleague in passing. Unexpectedly, the colleague
scoffed and said “Bogan Day”.&lt;/p&gt;

&lt;p&gt;At that time, I did not understand the difference between Australia Day and
Singapore’s National Day. Growing up in Singapore, National Day was a big thing.
You see, Singapore was probably the only country that &lt;a href=&quot;https://en.wikipedia.org/wiki/History_of_Singapore#:~:text=On%209%20August%201965%2C%20the,become%20a%20sovereign%2C%20independent%20nation.&quot;&gt;gained independence against
its will&lt;/a&gt;. We basically got kicked out of the house as a young nation by our
older brother (Malaysia). Forced to survive on our own, we made it by working
damn hard and “punching above our weight”.&lt;/p&gt;

&lt;p&gt;Every year, come National Day, it was a day for us to be proud of. On that day,
our family would sit down in front of the TV in the evening to watch the National
Day Parade. It starts off with a grand military parade, where we get to show off
our shiny military hardware. It reminds every Singaporean male, who is conscripted
into &lt;a href=&quot;https://en.wikipedia.org/wiki/National_service_in_Singapore&quot;&gt;2 years of mandatory military service&lt;/a&gt;, that there is something bigger than
self. It reminds us that the 2 years of blood, sweat and toil each of us gave
mean we can stand up on the world stage as an independent country.&lt;/p&gt;

&lt;p&gt;I was expecting Australia Day to have the same sort of patriotic, unifying
effect on the Australians. Unfortunately, as I learnt more about it, I realised
how divisive this day is. Much has been said about the hurt to the Aboriginal
community; there’s nothing more I can add to this conversation. Two articles
this year especially stood out for me. Firstly, a white Australian’s perspective
on how &lt;a href=&quot;https://theshot.net.au/general-news/as-a-white-australian-heres-what-australia-day-means-to-me-fuck-all/&quot;&gt;Australia Day didn’t mean much to him&lt;/a&gt;. Secondly, a fresh take from
Stan Grant on how &lt;a href=&quot;https://www.abc.net.au/news/2021-01-26/changing-australia-day-means-nothing-without-change-stan-grant/13088122&quot;&gt;changing the date is the easy way out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I shy away from personally commenting on issues like this - politics can be
divisive, even when it doesn’t have to be. I know that I will never be accepted
as a true Australian (and that is fine), so I don’t draw attention, or pretend
my words matter.&lt;/p&gt;

&lt;p&gt;But what I’d like to say is how proud I am to be part of this country.
Understand that racism and biases exist in almost every place in the world. For
Australians to be able to acknowledge their existence, and to have a conversation
about it, means that we have the ability to change for the better. This, in my
humble opinion, puts us way ahead of other countries. I hope more Australians
will recognise that this current debate is actually a strength, not a weakness.&lt;/p&gt;

</description>
        <pubDate>Tue, 26 Jan 2021 12:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/thoughts/2021/01/26/australia-day.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/thoughts/2021/01/26/australia-day.html</guid>
        
        
        <category>thoughts</category>
        
      </item>
    
      <item>
        <title>How we broke our national object storage and no one noticed</title>
        <description>&lt;p&gt;During the last 5 years in Nectar, I admit we’ve broken a number of things.
However, one of the most memorable incidents in Nectar occurred just last week,
when we messed up our national swift cluster. Fortunately, we did not lose any
data (fingers crossed), and no one really noticed.&lt;/p&gt;

&lt;h1 id=&quot;about-the-national-swift-cluster&quot;&gt;About the national swift cluster&lt;/h1&gt;

&lt;p&gt;Swift is the software that powers the object storage for Nectar. The backend
storage servers reside in different institutions all over Australia.
Configuration on these servers, as with all of our servers, is managed with
Puppet.&lt;/p&gt;

&lt;p&gt;An object on the Swift cluster is 3x replicated and stored in 3 different
geographical locations. This protects against a local disaster at any
institution.&lt;/p&gt;

&lt;h1 id=&quot;how-this-happen&quot;&gt;How did this happen?&lt;/h1&gt;

&lt;p&gt;We were doing a Swift upgrade when Puppet pushed out a bad version of the Swift
config. This config, while not obviously wrong, caused
the majority of the backend servers to think that the objects hosted on them
were misplaced.&lt;/p&gt;

&lt;p&gt;In more detail, Swift uses a config value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;swift_hash_path_suffix&lt;/code&gt; to determine
placement of objects - i.e. when you put an object on the Swift cluster, which
(3x) backend servers should this object be written to. This config value has to
be the same across all API and storage nodes to ensure a consistent view of the
cluster.&lt;/p&gt;

&lt;p&gt;Due to our changes, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;swift_hash_path_suffix&lt;/code&gt; was changed from a value enclosed
in quotes to a value without quotes on some storage nodes. Approximately 75%
of the storage nodes were affected. The API nodes were not affected.&lt;/p&gt;

&lt;h1 id=&quot;what-happened-then&quot;&gt;What happened then?&lt;/h1&gt;

&lt;p&gt;The servers, with the wrong config, now incorrectly decided that objects on them
were ‘misplaced’, and started moving them to the ‘right’ place in the cluster.
As part of this movement, the original objects were ‘quarantined’ - saved to a
location on the same disk so that recovery is possible.&lt;/p&gt;

&lt;p&gt;As more and more objects were quarantined, disks started filling up. A few nodes
hit 100%, which meant no new data could be written to them. This is dangerous
because if too many disks are full, writes to the cluster can stop completely.&lt;/p&gt;

&lt;h1 id=&quot;oh-crap&quot;&gt;Oh crap&lt;/h1&gt;

&lt;p&gt;Once the issue was identified, we quickly pushed out the correct version of the
config file. This halted the runaway process that was filling up our disks.
However, we now had a few big issues:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We had thousands, if not millions, of objects quarantined, and many disks
were full.&lt;/li&gt;
  &lt;li&gt;Due to the objects being moved, many objects in our cluster now had fewer
than 3 copies in the correct locations. If a disaster were to strike now, it
could wipe out the remaining copy of an object.&lt;/li&gt;
  &lt;li&gt;Service levels were being affected. E.g. if a user tried to write an object to
the cluster, and all 3 destination disks were full, the write would fail.&lt;/li&gt;
  &lt;li&gt;We did not yet know if we had lost data. This was the biggest issue on our minds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;how-we-fixed-it&quot;&gt;How we fixed it&lt;/h1&gt;

&lt;p&gt;We decided that we needed to work the problem from a few angles.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Our immediate priority was to free up some disk space so replication and
writes could work. To do this, we needed to clean up the quarantined objects.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To make sure data was safe, we needed to check through all the objects and
figure out which ones now had 3, 2, 1, or 0 copies.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our strategy was as follows:&lt;/p&gt;

&lt;h2 id=&quot;free-up-space&quot;&gt;Free up space&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Look at the quarantined objects. For each object, find out whether it is
supposed to be on this disk, or on some other disk in the cluster.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the quarantined object belongs on this disk, move it out of quarantine to its
rightful location on the same disk. If there is already an object in that
location, delete the quarantined object to free up disk space. As the object (a
file) stays on the same disk, this is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mv&lt;/code&gt; operation, which is fast. (fast path)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the quarantined object does not belong on this disk, query
swift.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If there are already 2 copies in swift, delete this object. We assume swift
will replicate the third copy once there is enough disk space. (slow path)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;pre&gt;&lt;code class=&quot;language-mermaid&quot;&gt;       graph TD;
           A[check object]--&amp;gt;B{belong to disk?};
           B--&amp;gt;|yes| C[mv];
           B--&amp;gt;|no| D[check swift];
           D--&amp;gt;E{has 2 copies?};
           E--&amp;gt;|yes| F[rm];
           E--&amp;gt;|no| G[leave];
&lt;/code&gt;&lt;/pre&gt;
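&lt;p&gt;To decide between the fast and slow paths, each script needed to know where a
quarantined file actually belongs. A rough sketch of how that can be queried,
assuming Swift’s usual on-disk layout (paths are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Quarantined objects end up under the disk's quarantined/ directory.
# swift-object-info reads a .data file and prints the account/container/
# object name plus the ring locations (the disks it should live on).
swift-object-info /srv/node/d1/quarantined/objects/&amp;lt;hash&amp;gt;/&amp;lt;timestamp&amp;gt;.data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;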

&lt;h3 id=&quot;detailed-decision-on-freeing-up-space&quot;&gt;Detailed decision on freeing up space&lt;/h3&gt;

&lt;p&gt;To be perfectly safe, one should delete a quarantined object only if there are
3 copies. We chose to delete on 2 copies for a few reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Many objects had &amp;lt; 3 copies simply because at least one copy had been quarantined.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The ‘check and mv’ (fast path) can quickly restore a copy on another disk.
This script runs in parallel, one instance per disk. At a point in time an
object might only have 2 copies (disks A and B), but once the script runs on disk
C, the quarantined object moves back to its rightful location. We wanted to
process all objects through the fast path as quickly as possible to get the
maximum number of copies back.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Freeing up space is the priority because we needed replication working.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We did not want to block writes.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;check-all-objects&quot;&gt;Check all objects&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;In parallel, we started checking all objects, notifying if a 0-copy object was
found.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For each such object, look through all possible locations to find out if there is a
copy in quarantine. If there is, restore it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once a copy is restored, the cluster can serve this object normally again.&lt;/p&gt;

&lt;h1 id=&quot;lessons-learnt&quot;&gt;Lessons learnt&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Quotes matter in ini files, who knew?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Swift is pretty damn resilient&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;“&lt;em&gt;Most outages are caused by config changes&lt;/em&gt;” - This has been repeated by many
in the industry, and unfortunately Nectar has contributed to the statistics.
One needs to be careful doing any sort of config change, no matter how trivial
it seems. Bringing up a new service is much simpler in comparison.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;thanks&quot;&gt;Thanks&lt;/h1&gt;

&lt;p&gt;Many thanks to fellow operators from the different sites pulling together to fix
this issue - Matt and Karl from UTas, Glenn from Intersect, Swe from Monash.
Also thanks to my fellow Core Services Operators for hacking up scripts on the
go and using Slack as a DVCS.&lt;/p&gt;

&lt;p&gt;Cheers to Swift and OpenStack for building a damn fine product. With a commercial
closed-source product we would probably be up shit creek waiting for a
vendor to fly in. The nature of open source let us inspect everything that Swift
was doing and hack up a fix in a few hours.&lt;/p&gt;
</description>
        <pubDate>Wed, 01 Jul 2020 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/2020/07/01/how_we_broke_swift.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/2020/07/01/how_we_broke_swift.html</guid>
        
        
        <category>openstack</category>
        
      </item>
    
      <item>
        <title>Using CoreOS on OpenStack</title>
        <description>&lt;p&gt;Most instances on the Nectar Cloud run Linux (Ubuntu, CentOS). On Nectar’s
Linux images, a provisioning tool called &lt;a href=&quot;https://cloudinit.readthedocs.io/&quot;&gt;cloud-init&lt;/a&gt; runs on first boot, which
inserts your SSH key and other user data into the instance. This allows you to
log in to your instance securely using SSH keys, and also to run any scripts for
software installation when your instance first boots up.&lt;/p&gt;

&lt;p&gt;CoreOS uses a different provisioning tool called
&lt;a href=&quot;https://coreos.com/ignition/docs/latest/&quot;&gt;Ignition&lt;/a&gt; instead of cloud-init. This
means that extra steps are necessary to boot up a CoreOS instance and inject
your SSH key.&lt;/p&gt;

&lt;h1 id=&quot;short-way-just-ssh-key&quot;&gt;Short way (just SSH key)&lt;/h1&gt;

&lt;ol&gt;
  &lt;li&gt;If you do not know where your SSH public key is, you can get it from Nectar
Dashboard, on the left menu under &lt;em&gt;Project&lt;/em&gt; &amp;gt; &lt;em&gt;Compute&lt;/em&gt; &amp;gt; &lt;em&gt;Key Pairs&lt;/em&gt;. Or you
can use the CLI to get it
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack keypair show --public-key &amp;lt;NAME&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Create the following file as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-data.json&lt;/code&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
  &quot;ignition&quot;: {
    &quot;version&quot;: &quot;3.0.0&quot;
  },
  &quot;passwd&quot;: {
    &quot;users&quot;: [
      {
        &quot;name&quot;: &quot;core&quot;,
        &quot;sshAuthorizedKeys&quot;: [
          &quot;ssh-rsa AAAAB3NzaC1c2EAA...dzP&quot;
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Boot up using CLI
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack server create --image fedora-coreos-31 --flavor m3.small \
--user-data user-data.json fedora-coreos-instance
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;
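&lt;p&gt;Once the instance is up, you can verify the key was injected by logging in as
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core&lt;/code&gt; user (the user we defined in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-data.json&lt;/code&gt;)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ssh core@&amp;lt;instance-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;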

&lt;h1 id=&quot;long-way&quot;&gt;Long way&lt;/h1&gt;

&lt;p&gt;To build an Ignition configuration file, one has to create a YAML config and
use the FCOS Configuration Transpiler (FCCT) to convert it into JSON. See
&lt;a href=&quot;https://docs.fedoraproject.org/en-US/fedora-coreos/producing-ign/&quot;&gt;Fedora CoreOS pages for more
examples&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;FCCT is provided as a container, so to run it we need something like podman.
Podman is not installed on Ubuntu by default, so we need to install it.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-data.yaml&lt;/code&gt; like this. In this example we only set an SSH key,
but this method is not limited to SSH keys.&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;variant: fcos
version: 1.0.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - &quot;ssh-rsa AAAAB3NzaC1c2EAA...dzP&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;podman&lt;/code&gt; by following the &lt;a href=&quot;https://podman.io/getting-started/installation.html&quot;&gt;Ubuntu
instructions on podman’s
site&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Run fcct
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;podman run -i --rm quay.io/coreos/fcct:release --pretty \
--strict &amp;lt; user-data.yaml &amp;gt; user-data.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;You should get the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-data.json&lt;/code&gt; as the previous (short) example.&lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Thu, 18 Jun 2020 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/coreos/2020/06/18/coreos-openstack.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/coreos/2020/06/18/coreos-openstack.html</guid>
        
        
        <category>openstack</category>
        
        <category>coreos</category>
        
      </item>
    
      <item>
        <title>State of the Cloud 2019</title>
        <description>&lt;p&gt;Now that 2020 is upon us, I thought it might be a good idea to generate some
statistics about the Nectar Cloud for 2019.&lt;/p&gt;

&lt;h1 id=&quot;instances&quot;&gt;Instances&lt;/h1&gt;

&lt;p&gt;In 2019, Nectar Cloud ran a total of &lt;strong&gt;70,371&lt;/strong&gt; instances.&lt;/p&gt;

&lt;h2 id=&quot;vcpu-time&quot;&gt;VCPU time&lt;/h2&gt;

&lt;p&gt;These instances ran for a total of &lt;strong&gt;9,203,375 days, 19 hours 17 minutes and 59
seconds&lt;/strong&gt; of &lt;em&gt;VCPU time&lt;/em&gt;&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. That is around &lt;strong&gt;25,214&lt;/strong&gt; VCPU years!&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;mean VCPU time&lt;/em&gt; is about &lt;strong&gt;130 days&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;mode VCPU time&lt;/em&gt; is &lt;strong&gt;365 days&lt;/strong&gt;, which means there were lots of single core
instances running through the year.&lt;/p&gt;

&lt;h2 id=&quot;flavour&quot;&gt;Flavour&lt;/h2&gt;

&lt;p&gt;The most popular flavour is &lt;strong&gt;m2.large&lt;/strong&gt; (4 VCPU). There were &lt;strong&gt;26,750&lt;/strong&gt; such
instances.&lt;/p&gt;

&lt;h2 id=&quot;end&quot;&gt;End&lt;/h2&gt;

&lt;p&gt;Statistics were generated from &lt;a href=&quot;https://gnocchi.xyz/&quot;&gt;Gnocchi&lt;/a&gt;. Nectar logs the
start/end times of each instance in Gnocchi, as well as a host of other data. As
a Nectar user, you can use the Gnocchi API to access metrics for your resources.&lt;/p&gt;
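&lt;p&gt;As a quick sketch using the gnocchi CLI (the resource id and metric name here
are illustrative; available metrics depend on what is collected for your
project)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# List your instance resources known to Gnocchi
gnocchi resource list --type instance
# Show the measures recorded for one metric on an instance
gnocchi measures show vcpus --resource-id &amp;lt;instance-uuid&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;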

&lt;p&gt;Let me know if this has been interesting, or if there are any other stats you
want to see!&lt;/p&gt;

&lt;h3 id=&quot;footnote&quot;&gt;Footnote&lt;/h3&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;VCPU time is (Number of VCPU) * (Running time). For example, if an
instance has 2 VCPU and has been running for 1 hour, VCPU time is 2 hours. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Mon, 06 Apr 2020 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack,/nectar/2020/04/06/state-of-the-cloud-2019.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack,/nectar/2020/04/06/state-of-the-cloud-2019.html</guid>
        
        
        <category>openstack</category>
        
        <category>nectar</category>
        
      </item>
    
      <item>
        <title>The year in review</title>
        <description>&lt;p&gt;Things are winding down at the end of the year, so I thought it might be helpful
to jot down some of what I did this year, to help gauge whether I am
growing professionally.&lt;/p&gt;

&lt;h1 id=&quot;puppet-catalog-difference-tests&quot;&gt;Puppet catalog difference tests&lt;/h1&gt;

&lt;p&gt;This took up a lot of time, but I felt it was really necessary. We wanted to
greatly improve our testing to make sure we don’t merge puppet code that breaks
the cloud.
&lt;a href=&quot;https://status.cloud.google.com/incident/cloud-networking/19009&quot;&gt;Many&lt;/a&gt;
&lt;a href=&quot;https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/&quot;&gt;outages&lt;/a&gt;
in major cloud providers last year were due to config changes, so preventing
this from happening to NeCTAR was a big priority.&lt;/p&gt;

&lt;p&gt;Our new tests now generate a list of differences in catalogs for puppet nodes.
This greatly helps humans reviewing the changes, as they can now see the actual
resources and nodes being updated by each change.&lt;/p&gt;

&lt;p&gt;Key things in this topic are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Getting all sites’ control repos under control of our CI/CD - so that Jenkins
can trigger tests on changes&lt;/li&gt;
  &lt;li&gt;Wrangling &amp;gt;5 years of legacy puppet code into something that resembles modern
day puppet best practices, and into a consistent format across sites - so
tests work across all sites.&lt;/li&gt;
  &lt;li&gt;Figuring out deprecation strategy for code/config - so we don’t break
existing code and yet allow us to move forward quicker&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not totally done yet, but all the technical challenges have already been
resolved.&lt;/p&gt;

&lt;h1 id=&quot;cellsv1-to-cellsv2&quot;&gt;CellsV1 to CellsV2&lt;/h1&gt;

&lt;p&gt;Huge effort by the whole team. The fact that we managed to pull it off without
downtime is impressive. Basically, in CellsV1, the Core Services database holds all
the information about instances, but in CellsV2 the sites’ databases hold this
information.&lt;/p&gt;

&lt;p&gt;For my part, the bulk of the work involved making sure the CellsV1 and CellsV2
databases were consistent before the switch, and writing code to fix up any
inconsistencies. Boring manual work, but it had to be done. This allowed us to
finally get rid of all the legacy CellsV1 patches!&lt;/p&gt;

&lt;h1 id=&quot;rollout-of-yubikeys&quot;&gt;Rollout of Yubikeys&lt;/h1&gt;

&lt;p&gt;With the increase in attacks, I felt that it was time to beef up our security.
Fortunately, our manager was supportive and we managed to buy some YubiKeys. I
have been experimenting with integrating them into our systems. Right now we
have started using YubiKeys for:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Keeping Keystone credentials safe using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pass&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Sharing passwords using the same&lt;/li&gt;
  &lt;li&gt;SSH forwarding&lt;/li&gt;
  &lt;li&gt;EYAML&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;unified-user-handling-in-puppet&quot;&gt;Unified user handling in Puppet&lt;/h1&gt;

&lt;p&gt;Our puppet code has grown over the years, so code to manage users (for different
systems) was in multiple places. Because of this, adding a new operator to the
cloud meant making multiple changes in different repos. This proved to be a
fair bit of technical debt, and is also a security issue when an operator
account isn’t removed cleanly in all places when they leave.&lt;/p&gt;

&lt;p&gt;To solve this, I created a way to define users in just one place in Puppet. From
this, different systems which need to create users can just read it. Yes this is
the &lt;a href=&quot;https://xkcd.com/927/&quot;&gt;universal standard&lt;/a&gt;, I promise.&lt;/p&gt;

&lt;h1 id=&quot;security-security-security&quot;&gt;Security, security, security&lt;/h1&gt;

&lt;ol&gt;
  &lt;li&gt;Added eyaml support in Puppet&lt;/li&gt;
  &lt;li&gt;Moved rsyslog to using SSL; this will allow for centralised logging. Future
work will let sites send their logs to us so we have a single pane of glass
for observing events. This can be useful, for example, in tracing a user’s
request to boot an instance - we will be able to trace it all the way from API
(in Core Services) down to the compute node (at the site).&lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Thu, 19 Dec 2019 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/others/2019/12/19/year-in-review.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/others/2019/12/19/year-in-review.html</guid>
        
        
        <category>others</category>
        
      </item>
    
      <item>
        <title>Passing entropy to virtual machines</title>
        <description>&lt;p&gt;Recently, when we were working on testing new images with Magnum, I found that
the newest Fedora Atomic 29 images were taking a long time to boot up. A closer
look using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nova console-log&lt;/code&gt; revealed that they were getting stuck at boot
with the following error.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[   12.220574] audit: type=1130 audit(1555723526.895:78): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-machine-id-commit comm=&quot;systemd&quot; exe=&quot;/usr/lib/systemd/systemd&quot; hostname=? addr=? terminal=? res=success'
[   12.248050] audit: type=1130 audit(1555723526.906:79): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journal-catalog-update comm=&quot;systemd&quot; exe=&quot;/usr/lib/systemd/systemd&quot; hostname=? addr=? terminal=? res=success'
[ 1061.103725] random: crng init done
[ 1061.108094] random: 7 urandom warning(s) missed due to ratelimiting
[ 1063.306231] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The number between the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[]&lt;/code&gt; is the number of seconds since boot; as you can
see, the VM was stuck for &amp;gt;1000 secs waiting for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;crng init&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It turns out that in some newer images, boot will block waiting for sufficient
entropy. Entropy, or randomness, is used in operating systems for important
things like random number generation. Traditionally machines generate their
entropy by looking at inputs that are random, e.g. disk writes, mouse movement.
(Fun experiment: On a Linux machine do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cat /dev/random&lt;/code&gt;, wait for the output
to stop and move your mouse.)&lt;/p&gt;

&lt;p&gt;A fresh VM has very few avenues to collect entropy, so unfortunately if it
doesn’t have enough it may block. Luckily the smart people at QEMU have
a solution called &lt;a href=&quot;https://wiki.qemu.org/Features/VirtIORNG&quot;&gt;VirtIO RNG&lt;/a&gt;, which
involves passing entropy from the host hypervisor to the VM. This allows the
VM to seed its entropy pool and happily continue booting.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://wiki.openstack.org/wiki/LibvirtVirtioRng&quot;&gt;OpenStack config docs&lt;/a&gt;
point out that you need to set this in two places: on the flavor and on the
image. The flavor setting controls whether an image booted with this flavor is
allowed to drain the host’s entropy, and at what rate it is allowed to do so.
Finding the correct rate is a bit of trial and error, as you want a rate high enough
that the VM will not block, but low enough that a malicious VM will not
be able to totally drain a hypervisor’s entropy.&lt;/p&gt;

&lt;p&gt;With a bit of testing, and advice from our fellow OpenStack Operators at
Catalyst NZ, we have found the following values to work for us:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hw_rng:allowed='True'&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hw_rng:rate_bytes='24'&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hw_rng:rate_period='5000'&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
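&lt;p&gt;These are flavor extra specs, so they are set with the OpenStack CLI like so
(the flavor name is illustrative). The image side needs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hw_rng_model=virtio&lt;/code&gt;, as
per the config docs linked above.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Allow VMs of this flavor to draw entropy from the host, rate limited
openstack flavor set m1.small \
  --property hw_rng:allowed='True' \
  --property hw_rng:rate_bytes='24' \
  --property hw_rng:rate_period='5000'
# Tell nova to attach a virtio-rng device for this image
openstack image set --property hw_rng_model='virtio' &amp;lt;image&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;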

&lt;p&gt;NOTE: as our friends from Catalyst point out, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rate_period&lt;/code&gt; is specified in
milliseconds, not seconds as some documentation states.&lt;/p&gt;

&lt;h3 id=&quot;same-rate-different-periods&quot;&gt;Same rate, different periods?&lt;/h3&gt;
&lt;p&gt;One of the questions we had to answer was: what is the effect of the period
setting? For example, if we know that we need 1 byte/second of entropy, is
it better to set&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;1 byte in 1 second, or&lt;/li&gt;
  &lt;li&gt;5 bytes in 5 seconds?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both of these settings translate to the same effective &lt;em&gt;rate&lt;/em&gt;, but their
performance characteristics can be very different.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://wiki.qemu.org/Features/VirtIORNG#Effect_of_the_period_parameter&quot;&gt;this
note&lt;/a&gt;,
it was suggested that a smaller period is better because it means your process
will block for a shorter time. However, a longer period can have some advantages
in a cloud environment.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;in the case of a benign VM, we want it to be able to burst as much as possible.
If it needs 5 bytes it can get them in the first second rather than waiting until the
5th second&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;in the case of a malicious VM, being able to block it for 5 seconds means other
VMs will have a better chance to consume entropy&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hence, we have set a relatively large period on our environment.&lt;/p&gt;
</description>
        <pubDate>Tue, 01 Oct 2019 12:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/2019/10/01/entrophy-in-virtual-machines.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/2019/10/01/entrophy-in-virtual-machines.html</guid>
        
        
        <category>openstack</category>
        
      </item>
    
      <item>
        <title>Kubernetes With Loadbalancer</title>
        <description>&lt;p&gt;In &lt;a href=&quot;/nectar/openstack/2018/09/03/openstack-magnum.html&quot;&gt;Kubernetes Part I&lt;/a&gt;, we discussed how to spin up a kubernetes cluster easily
on Nectar. In this post, we will discuss how to host an application and access
it externally.&lt;/p&gt;

&lt;p&gt;To begin, you should already have a working cluster. If you do not, head back to
the previous post and follow the steps.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Check that your cluster is working
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Start a container image. We use nginx as an example
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl run nginx --image nginx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;This command will start a &lt;em&gt;pod&lt;/em&gt; with a &lt;em&gt;container&lt;/em&gt; inside it running the
nginx image.  On Kubernetes, the smallest runnable unit is a &lt;em&gt;pod&lt;/em&gt;, which holds
one (or more) &lt;em&gt;containers&lt;/em&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Check that your pod has started up and is running.
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get pods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Now that you have a pod working, we need a way of getting to it from the
Internet. In Nectar Cloud, we can do this by creating a load balancer. A
load balancer has a public (floating) IP, and redirects traffic sent to this public
IP to one or more private addresses. Use the following yaml to create your load
balancer. Save it as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nginxservice.yaml&lt;/code&gt;.&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: nginxservice
  labels:
    app: nginx
  annotations:
    loadbalancer.openstack.org/floating-network-id: 'e48bdd06-cc3e-46e1-b7ea-64af43c74ef8'
spec:
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
  selector:
    run: nginx
  type: LoadBalancer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Note that the uuid in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loadbalancer.openstack.org/floating-network-id&lt;/code&gt; annotation
refers to a network in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;melbourne&lt;/code&gt;. If your cluster is in a different AZ, you
might want to choose a floating IP network closer to where your cluster is for
routing efficiency. However, even without it, things still work! That’s the
beauty of the Nectar Advanced Network - no matter which AZ the traffic ingresses
from, it is still able to make its way to your VM on the Nectar Cloud.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Run it as
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl create -f nginxservice.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the public IP of the load balancer
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get services
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You should be able to browse to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://&amp;lt;ip&amp;gt;&lt;/code&gt; and see the nginx welcome page.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;If this doesn’t work, you might not have the correct security groups applied.
Find the port the IP is on:
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack floating ip list --floating-ip-address 103.6.252.52 -c Port -f value
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Apply a security group that has an HTTP rule to that port, or, if you do
not already have one, create it first.&lt;/p&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack security group create http
openstack security group rule create --ingress --dst-port 80 http
openstack port set --security-group http fe008711-7469-4c44-8489-46abbc8b1774
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;This is an external load balancer (external to kubernetes), and is created in
Neutron. You can see the loadbalancer in Neutron by doing
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;neutron lbaas-loadbalancer-list
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;
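
&lt;p&gt;As a quick sanity check before moving on: while the load balancer is still being
built, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXTERNAL-IP&lt;/code&gt; column shows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;pending&amp;gt;&lt;/code&gt;, so something like the sketch
below should confirm everything is wired up (the address is just the example
floating IP from the steps above):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Watch until EXTERNAL-IP changes from &amp;lt;pending&amp;gt; to a floating IP
kubectl get services --watch

# Then check that nginx answers on port 80
curl -I http://103.6.252.52
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;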

&lt;p&gt;More details on what we have just done:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;We started an external &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LoadBalancer&lt;/code&gt; service in Kubernetes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Kubernetes understands that it has to create this loadbalancer (externally)
by calling out to the openstack neutron provider.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloud-provider-openstack&lt;/code&gt; plugin in kubernetes then creates the different
pieces that make it all work, namely the floating ip, load balancer, pool,
listener and members. These are all openstack resources. It mirrors them to the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LoadBalancer&lt;/code&gt; service you see in kubernetes when you do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl get
services&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The plugin configures all of them and gets the floating IP displayed in
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl get services&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
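
&lt;p&gt;If you are curious about those individual openstack pieces, the same neutron CLI
used above can list them too. A minimal sketch, assuming the lbaas v2 commands
shown earlier (substitute your own pool name or id):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# The listener that accepts traffic on port 80
neutron lbaas-listener-list

# The pool of backends behind the listener
neutron lbaas-pool-list

# The members (your kubernetes nodes) inside a pool
neutron lbaas-member-list &amp;lt;pool&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;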
</description>
        <pubDate>Sun, 28 Apr 2019 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/openstack/kubernetes/2019/04/28/kubernetes-with-loadbalancer.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/openstack/kubernetes/2019/04/28/kubernetes-with-loadbalancer.html</guid>
        
        
        <category>openstack</category>
        
        <category>kubernetes</category>
        
      </item>
    
      <item>
        <title>GitLab and Kubernetes Integration</title>
        <description>&lt;p&gt;In the previous blog post we described &lt;a href=&quot;/nectar/openstack/2018/09/03/openstack-magnum.html&quot;&gt;how to spin up a kubernetes cluster on
Nectar&lt;/a&gt;. Around the same time, I
also learned that the University of Melbourne has a &lt;a href=&quot;https://gitlab.unimelb.edu.au/&quot;&gt;self-hosted
gitlab&lt;/a&gt;. To my delight, I found out that GitLab
has &lt;a href=&quot;https://about.gitlab.com/kubernetes/&quot;&gt;Kubernetes integration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This means that, if you are in UniMelb (or have a self-hosted gitlab), you can
run CI/CD using Nectar cloud, without having to set up any infrastructure!&lt;/p&gt;

&lt;h1 id=&quot;spin-up-cluster&quot;&gt;Spin up cluster&lt;/h1&gt;
&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Spin up a kubernetes (k8s) cluster and create the config directory.
&lt;a href=&quot;/nectar/openstack/2018/09/03/openstack-magnum.html&quot;&gt;Instructions&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Run the following
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl create clusterrolebinding permissive-binding --clusterrole=cluster-admin \
  --user=admin --user=kubelet --group=system:serviceaccounts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the default secret name (in format &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;default-token-xxxx&lt;/code&gt;)
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get secrets
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the token for this secret
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl describe secrets/default-token-xxxx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the API URL.
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cat $KUBECONFIG
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Look for a section like&lt;/p&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;clusters:
- cluster:
    server: https://192.168.1.1:6443
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Get the CA cert. In the directory where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$KUBECONFIG&lt;/code&gt; is stored, do
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cat ca.pem
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;
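
&lt;p&gt;As an aside, if you prefer to grab the raw token in one go instead of reading it
out of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl describe&lt;/code&gt;, a jsonpath query should also work (substitute the
actual secret name from step 3):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Tokens are stored base64-encoded in the secret
kubectl get secret default-token-xxxx -o jsonpath='{.data.token}' | base64 --decode
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;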

&lt;h1 id=&quot;configure-repo&quot;&gt;Configure Repo&lt;/h1&gt;
&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;In the GitLab repo, navigate to &lt;strong&gt;Operations&lt;/strong&gt; - &lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Fill in the cluster details that you got from the previous steps.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;In the list of Applications, install the following in order:
    &lt;ol&gt;
      &lt;li&gt;Helm Tiller&lt;/li&gt;
      &lt;li&gt;GitLab Runner&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.gitlab-ci.yml&lt;/code&gt; file. For example, if I want to run yamllint on my
code, an example file will look like this (see the note on job images after
this list):&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;before_script:
  - apt-get update
  - apt-get install -y python yamllint
  - apt-get install -y python3-pkg-resources python3-setuptools
  - python --version
  - which python

yaml-lint:
  script:
    - find . -type f \( -iname &quot;*.yaml&quot; -o -iname &quot;*.eyaml&quot; \) | xargs yamllint
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Commit and push the change. When the change is pushed to GitLab, a runner
will start up and run the job specified by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.gitlab-ci.yml&lt;/code&gt;. Jobs can be
viewed from the &lt;strong&gt;CI/CD&lt;/strong&gt; tab.&lt;/li&gt;
&lt;/ol&gt;
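
&lt;p&gt;One note on job images, as promised above: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apt-get&lt;/code&gt; calls in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;before_script&lt;/code&gt; assume the job runs in a Debian-like container, so it is safer
to pin one explicitly with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image&lt;/code&gt; keyword. A minimal sketch (the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu:18.04&lt;/code&gt; tag is just an illustrative choice, not something the
integration requires):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Pin the container image every job runs in
image: ubuntu:18.04

before_script:
  - apt-get update
  - apt-get install -y yamllint
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;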

&lt;h1 id=&quot;limitations&quot;&gt;Limitations&lt;/h1&gt;
&lt;p&gt;At this point in time, the GitLab CE integration supports only one kubernetes
cluster per repo; there is no way to have separate dev/test/prod clusters for a
repo.&lt;/p&gt;

&lt;p&gt;Also, it &lt;a href=&quot;https://gitlab.com/gitlab-org/gitlab-ce/issues/29398&quot;&gt;does not support
RBAC&lt;/a&gt;, which means the
integration will have full permissions on the cluster. You really want to
dedicate one k8s cluster to one repo, and not have any other containers running
on that cluster.&lt;/p&gt;
</description>
        <pubDate>Thu, 20 Sep 2018 00:00:00 +0000</pubDate>
        <link>http://blog.alltherunning.com/nectar/openstack/2018/09/20/gitlab-kubernetes-integration.html</link>
        <guid isPermaLink="true">http://blog.alltherunning.com/nectar/openstack/2018/09/20/gitlab-kubernetes-integration.html</guid>
        
        
        <category>nectar</category>
        
        <category>openstack</category>
        
      </item>
    
  </channel>
</rss>
