You don't need to use the complex stateful overlay networks. At Stripe we have a network overlay using IPv6 and the Linux kernel's built-in stateless tunnel device, so there's effectively unlimited addresses with no coordination between worker machines and no iptables port remapping.
The basic idea is you create a SIT tunnel device and assign it an IPv6 /64 composed of two parts:
1. A network prefix between 32 and 56 bits long. This prefix is the same for all machines in the network.
2. A subnet derived from the host bits of the machine's IPv4 address, i.e. the part of the address not covered by the netmask.
For example, if your IPv4 addresses are allocated from 192.168.1.0/24 and the machine has 192.168.1.155, then the network prefix should be 56 bits long (64 - (32 - 24)) and the machine's prefix is `xxxx:xxxx:xxxx:xx9B::/64`.
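The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration of the address math, not the actual tooling: the function name `machine_prefix` and the overlay prefix `fd00:1234:5678:ab00::/56` are made up here to stand in for the `xxxx` placeholders.

```python
import ipaddress

def machine_prefix(overlay_prefix: str, ipv4_network: str, machine_ipv4: str) -> ipaddress.IPv6Network:
    """Derive a machine's IPv6 /64 from the shared overlay prefix and its IPv4 host bits."""
    overlay = ipaddress.IPv6Network(overlay_prefix)
    v4net = ipaddress.IPv4Network(ipv4_network)
    addr = ipaddress.IPv4Address(machine_ipv4)
    if addr not in v4net:
        raise ValueError(f"{addr} is not inside {v4net}")

    # Number of IPv4 bits left over once the netmask is removed (8 for a /24).
    host_bits = 32 - v4net.prefixlen
    if overlay.prefixlen + host_bits != 64:
        raise ValueError("overlay prefix length must equal 64 - (32 - IPv4 prefix length)")

    # Keep only the host bits of the IPv4 address (155 == 0x9B in the example).
    host_part = int(addr) & int(v4net.hostmask)

    # Place the host bits directly above the low 64 bits, which stay free for pods.
    base = int(overlay.network_address) | (host_part << 64)
    return ipaddress.IPv6Network((base, 64))

# Hypothetical overlay prefix; substitute whatever prefix your network uses.
print(machine_prefix("fd00:1234:5678:ab00::/56", "192.168.1.0/24", "192.168.1.155"))
```

With those assumed inputs the last hextet of the prefix ends in `9b`, matching the `xx9B` in the example above.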
The Linux kernel knows how to wrap the IPv6 with IPv4 so it can route within your local network to any other machine with a similarly configured tunnel device. If you want to send packets to 192.168.1.200 then they get addressed to `xxxx:xxxx:xxxx:xxC8::1` or whatever, they'll transit the IPv4 network like normal, and on arrival the receiving machine's kernel will strip off the IPv4 wrapper and route the IPv6 locally.
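To make the stateless part concrete, here is a rough Python model of the lookup the sending side performs: given only a destination IPv6 address, recover the IPv4 address to put on the outer packet. The kernel does this internally; the function name `tunnel_endpoint` and the `fd00:1234:5678:ab00::/56` prefix are hypothetical, chosen for illustration.

```python
import ipaddress

def tunnel_endpoint(dest_ipv6: str, overlay_prefix: str, ipv4_network: str) -> ipaddress.IPv4Address:
    """Recover the IPv4 address of the machine that owns an overlay IPv6 address."""
    overlay = ipaddress.IPv6Network(overlay_prefix)
    v4net = ipaddress.IPv4Network(ipv4_network)
    dest = ipaddress.IPv6Address(dest_ipv6)
    if dest not in overlay:
        raise ValueError(f"{dest} is not inside the overlay {overlay}")

    host_bits = 32 - v4net.prefixlen
    if overlay.prefixlen + host_bits != 64:
        raise ValueError("overlay prefix length must equal 64 - (32 - IPv4 prefix length)")

    # The IPv4 host bits sit just above the low 64 bits of the IPv6 address.
    host_part = (int(dest) >> 64) & ((1 << host_bits) - 1)

    # Re-attach the shared IPv4 network bits to get the full outer destination.
    return ipaddress.IPv4Address(int(v4net.network_address) | host_part)

# Hypothetical overlay prefix; "c8" in the fourth hextet is 200 in decimal.
print(tunnel_endpoint("fd00:1234:5678:abc8::1", "fd00:1234:5678:ab00::/56", "192.168.1.0/24"))
```

No lookup table or shared state is involved: the destination address alone is enough to work out where the encapsulated packet should go.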
How's this useful? Well, if each machine has a /64 prefix then each pod can be allocated an IPv6 within that prefix without coordinating with other machines. Let's say the pod gets `xxxx:xxxx:xxxx:xxC8::aaaa:bbbb:cccc`. Anything with a correctly configured tunnel and that pod IP can route it traffic, no proxy or iptables needed.
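A minimal sketch of what coordination-free allocation could look like on one machine, assuming pods get a random 64-bit suffix inside the machine's /64. The class `PodAddressAllocator`, the reserved suffixes, and the example prefix are all assumptions for illustration, not a description of any real CNI plugin.

```python
import ipaddress
import secrets

class PodAddressAllocator:
    """Hands out pod addresses from one machine's /64 using only local state."""

    def __init__(self, node_prefix: str):
        self.network = ipaddress.IPv6Network(node_prefix)
        if self.network.prefixlen != 64:
            raise ValueError("expected a /64 per machine")
        self.in_use = set()  # suffixes currently assigned on this machine

    def allocate(self) -> ipaddress.IPv6Address:
        while True:
            suffix = secrets.randbits(64)
            # Assumption: keep ::0 and ::1 out of the pool (e.g. for the tunnel device).
            if suffix in (0, 1) or suffix in self.in_use:
                continue
            self.in_use.add(suffix)
            return ipaddress.IPv6Address(int(self.network.network_address) | suffix)

    def release(self, address: ipaddress.IPv6Address) -> None:
        self.in_use.discard(int(address) & ((1 << 64) - 1))

# Hypothetical /64 for the machine at 192.168.1.200.
allocator = PodAddressAllocator("fd00:1234:5678:abc8::/64")
pod_ip = allocator.allocate()
print(pod_ip, pod_ip in allocator.network)
```

The only bookkeeping is a set local to the machine, since no other machine can ever hand out an address from this /64.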
> Well, if each machine has a /64 prefix then each pod can be allocated an IPv6 within that prefix without coordinating with other machines. Let's say the pod gets `xxxx:xxxx:xxxx:xxC8::aaaa:bbbb:cccc`. Anything with a correctly configured tunnel and that pod IP can route it traffic, no proxy or iptables needed.
I’ve been meaning for a while now to experiment with this same idea in Erlang. I.e., hack up the Erlang runtime to use an IPv6 address as its PID type, such that each Erlang node running on a machine gets its own /64 subnet to hand out; and each Erlang actor-process on that node gets an IP allocated from its node’s /64 range.
This could just be a way of letting Erlang nodes talk to each other through tunnels. Or it could be a way of having Erlang “VMs” exposed directly to the Internet as their own little machines.
This is super useful for home networks; I do this for my k8s cluster hosted on a bunch of Pis.
But in production, I'd rather my ability to launch a new pod not be dependent on a DHCP server being reachable and functional. In that case, this particular trick is rather neat, since assignment of IP addresses is fully static/local (without having to agree upfront what range of IPs each node can use for bringing pods online), while retaining the benefit of everything being directly routable. You can now also run a ridiculous number of pods on a single node.
Though to counter your point, you don't actually need to use an external DHCP server in my example either; you can just define the block you're giving the server via the macvlan/ipvlan plugin, and I presume, again, that it works with both IPv4 and IPv6.
So I guess my wider point is, k8s probably doesn't need to be replaced to have the networking work how you like.
In some deployments we remove the VPC CNI and use an overlay anyway because of the integrations we need. You do lose Security Groups but that's not a big deal if you aren't using them in your deployment anyway.
Oh sure, there are lots of good reasons for not using their CNI, not least of which is needing to worry about IP space utilization. I was just trying to point out that overlays aren't a requirement.
Well, an overlay network by definition is always going to be transparent to you. But no, the AWS VPC CNI does not create an overlay. It's just layer 3. It works by adding ENIs to your worker nodes with secondary IP addresses on them, and those secondary IPs are from your VPC's address space. See:
Yep, another reason it does that is to spread the risk and limit the need for tons of unchecked (no src/dst check) addresses in your VPC.
In the end there is no ideal scenario; it boils down to what works best for the use case (or what is the 'least worst' solution). Sometimes it gets you down, but those imperfections can turn a churn job into an interesting one.
On my mobile so hard to go into detail, but take a look at 6rd and the Linux 6to4 tunnel driver. You can assign a prefix to the overlay, then each machine's IPv4 becomes a subnet, and packets can be routed by the kernel knowing only the destination pod's IPv6 address.
Product teams deploying code to our Kubernetes clusters are strongly recommended to use resource limits, and we're going to make that a hard requirement at some point.
We haven't noticed unusual CPU throttling, though we do have some workloads that turned out to be burstier than expected and had to adjust their CPU limits to match.
Note that when it comes to subtle Linux thread scheduling behavior, your experience will depend on which runtime you use, and if using runc then which version of the Linux kernel your workers run. We weren't affected by the CFS bug introduced in Linux v4.18 because we never ran Kubernetes workloads on a machine with the affected kernel, and if a similar bug occurs in the future it might not affect workloads running within gVisor or Firecracker.
Additionally, Stripe has historically cared more about security than efficiency. This led to an architecture where services run on dedicated VMs, which naturally strands capacity and reduces the impact of bugs that appear at high utilization and/or high core count.
I believe you did run into the CFS bug in non-Kubernetes workloads, though, specifically with Hadoop tasks. An engineer told me about a workaround he devised using cpuset.
That's likely a different bug. From what I understand, the CFS bug being discussed was introduced in Linux v4.18 and fixed in v5.3, and we have not used a kernel within that range in our Hadoop clusters.
Not from Stripe, but I've seen pretty bad CPU throttling.
I often see quotas get exhausted by short bursts that don't show up in metrics, which then causes CFS throttling even though it looks like the pod is nowhere near its limit. I've also struggled with application startup requiring far more CPU than steady-state runtime, leading to ridiculously slow startup times if you had a low limit.
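To put rough numbers on the burst problem, here is a deliberately simplified Python model of CFS bandwidth control, assuming the default 100 ms period and a burst that starts exactly at a period boundary. The function name and the workload figures (8 threads, 200 ms of CPU work per burst, a limit of 1 CPU) are invented for illustration; the real scheduler hands out quota in smaller slices per CPU, so treat this as a sketch only.

```python
def burst_latency_ms(work_ms: float, threads: int, quota_ms: float, period_ms: float = 100.0):
    """Return (wall-clock ms to finish the burst, number of periods that ended throttled)."""
    elapsed = 0.0
    remaining = work_ms
    throttled_periods = 0
    while remaining > 0:
        # CPU time usable in one period: capped by the quota and by threads * period.
        budget = min(quota_ms, threads * period_ms)
        used = min(remaining, budget)
        remaining -= used
        if remaining > 0:
            # Budget ran out with work left: wait for the next period to begin.
            elapsed += period_ms
            if quota_ms < threads * period_ms:
                throttled_periods += 1
        else:
            # Burst finished part-way through this period.
            elapsed += used / threads
    return elapsed, throttled_periods

# Hypothetical workload: 200 ms of CPU work spread over 8 threads, once per second.
limited = burst_latency_ms(work_ms=200, threads=8, quota_ms=100)        # limit of 1 CPU
unlimited = burst_latency_ms(work_ms=200, threads=8, quota_ms=float("inf"))
average_cores = 200 / 1000  # CPU-ms used per wall-clock ms over one second

print(limited, unlimited, average_cores)
```

Under these assumptions the pod averages 0.2 cores over the second, well under its limit of 1, yet the burst takes 112.5 ms instead of 25 ms because the 8 threads burn through the 100 ms quota in the first 12.5 ms of the period and then sit throttled until the next one.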
So far our solution has been to just remove CPU limits, but hoping things will get better.
Removing the limits really improved our latency tail, and so far hasn't resulted in CPU saturation at the node level, but your mileage may vary.