You don't need to use the complex stateful overlay networks. At Stripe we have a...

ignoramous · on Nov 29, 2020

> At Stripe we have a network overlay using IPv6 and the Linux kernel's built-in stateless tunnel device...

Can you please expand on this and if possible point to some references that'd help in case I want to go down this route myself? Thx.

jmillikin · on Nov 29, 2020

I posted briefly elsewhere in the thread, but am now back at a computer with a keyboard so can give a more detailed answer.

First, background reading:

https://en.wikipedia.org/wiki/6to4

https://en.wikipedia.org/wiki/IPv6_rapid_deployment

The basic idea is you create a SIT tunnel device and assign it an IPv6 /64 composed of two parts:

1. A network prefix between 32 and 56 bits long. This prefix is the same for all machines in the network.

2. A subnet derived from the machine's IPv4 address, minus the netmask.

For example, if your IPv4 addresses are allocated from 192.168.1.0/24 and the machine has 192.168.1.155, then the network prefix should be 56 bits long (64 - (32 - 24)) and the machine's prefix is `xxxx:xxxx:xxxx:xx9B::/64`.
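The prefix arithmetic above can be sketched with Python's standard `ipaddress` module. This is only an illustration of the address math, not the tunnel setup itself; the ULA prefix `fd00:1234:5678:ab00::/56` stands in for the `xxxx:xxxx:xxxx:xx` placeholder and the function name is made up for the example.

```python
import ipaddress

def machine_prefix(overlay_prefix: str, ipv4_pool: str, machine_ipv4: str) -> ipaddress.IPv6Network:
    """Derive a machine's /64 from the shared overlay prefix and its IPv4 address."""
    overlay = ipaddress.IPv6Network(overlay_prefix)
    pool = ipaddress.IPv4Network(ipv4_pool)
    host = ipaddress.IPv4Address(machine_ipv4)
    if host not in pool:
        raise ValueError(f"{host} is not inside {pool}")

    # Bits of the IPv4 address left over once the netmask is removed (8 for a /24).
    host_bits = 32 - pool.prefixlen
    if overlay.prefixlen + host_bits != 64:
        raise ValueError("overlay prefix length must equal 64 - (32 - IPv4 prefix length)")

    # Keep only the host part of the IPv4 address (155 == 0x9b in the example).
    suffix = int(host) & ((1 << host_bits) - 1)
    # Place it directly after the overlay prefix, i.e. just above the low 64 bits.
    base = int(overlay.network_address) | (suffix << 64)
    return ipaddress.IPv6Network((base, 64))

# Hypothetical overlay prefix; the IPv4 values are the ones from the example above.
print(machine_prefix("fd00:1234:5678:ab00::/56", "192.168.1.0/24", "192.168.1.155"))
```

With those inputs the host byte 155 becomes the `9b` at the end of the fourth group, matching the `xx9B` in the example.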

The Linux kernel knows how to wrap the IPv6 with IPv4 so it can route within your local network to any other machine with a similarly configured tunnel device. If you want to send packets to 192.168.1.200 then they get addressed to `xxxx:xxxx:xxxx:xxC8::1` or whatever, they'll transit the IPv4 network like normal, and on arrival the receiving machine's kernel will strip off the IPv4 wrapper and route the IPv6 locally.
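The reason no per-destination state is needed is that the IPv4 tunnel endpoint can be recomputed from the destination IPv6 address alone. A rough sketch of that reverse mapping, written in user-space Python purely for illustration (the kernel does the equivalent internally; the function name and the ULA prefix are assumptions for the example):

```python
import ipaddress

def tunnel_endpoint(dest_ipv6: str, overlay_prefix: str, ipv4_pool: str) -> ipaddress.IPv4Address:
    """Recover the IPv4 address of the machine that owns an overlay IPv6 address."""
    overlay = ipaddress.IPv6Network(overlay_prefix)
    pool = ipaddress.IPv4Network(ipv4_pool)
    dest = ipaddress.IPv6Address(dest_ipv6)
    if dest not in overlay:
        raise ValueError(f"{dest} is not inside the overlay {overlay}")

    # The IPv4 host bits sit immediately after the overlay prefix.
    host_bits = 32 - pool.prefixlen
    shift = 128 - overlay.prefixlen - host_bits
    suffix = (int(dest) >> shift) & ((1 << host_bits) - 1)

    # Re-attach the IPv4 network part that was dropped when building the prefix.
    return ipaddress.IPv4Address(int(pool.network_address) | suffix)

# Hypothetical overlay prefix; 0xc8 == 200 as in the example above.
print(tunnel_endpoint("fd00:1234:5678:abc8::1", "fd00:1234:5678:ab00::/56", "192.168.1.0/24"))
```

Because the mapping is a pure function of the address, every machine computes the same answer without exchanging any routing state.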

How's this useful? Well, if each machine has a /64 prefix then each pod can be allocated an IPv6 within that prefix without coordinating with other machines. Let's say the pod gets `xxxx:xxxx:xxxx:xxC8::aaaa:bbbb:cccc`. Anything with a correctly configured tunnel and that pod IP can route it traffic, no proxy or iptables needed.
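As a minimal sketch of the "no coordination" part, a machine could pick pod addresses at random from its own /64 and only check against its local allocations. The function name, the in-memory `in_use` set, and the choice to skip `::` and `::1` are assumptions for the example, not a description of any particular implementation.

```python
import ipaddress
import secrets

def allocate_pod_address(machine_prefix: str, in_use: set) -> ipaddress.IPv6Address:
    """Pick an unused address inside this machine's /64, using only local state."""
    net = ipaddress.IPv6Network(machine_prefix)
    if net.prefixlen != 64:
        raise ValueError("expected the machine's /64 prefix")
    while True:
        # 64 random bits for the interface identifier.
        iid = secrets.randbits(64)
        # Assumption: leave ::0 and ::1 free for the machine itself.
        if iid in (0, 1):
            continue
        addr = net[iid]
        if addr not in in_use:
            in_use.add(addr)
            return addr

# Hypothetical /64 for the machine at 192.168.1.200.
local_allocations = set()
print(allocate_pod_address("fd00:1234:5678:abc8::/64", local_allocations))
```

Since no other machine ever allocates from this /64, the only collision check needed is against the local set.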

derefr · on Nov 29, 2020

> Well, if each machine has a /64 prefix then each pod can be allocated an IPv6 within that prefix without coordinating with other machines. Let's say the pod gets `xxxx:xxxx:xxxx:xxC8::aaaa:bbbb:cccc`. Anything with a correctly configured tunnel and that pod IP can route it traffic, no proxy or iptables needed.

I’ve been meaning for a while now to experiment with this same idea in Erlang. I.e., hack up the Erlang runtime to use an IPv6 address as its PID type, such that each Erlang node running on a machine gets its own /64 subnet to hand out; and each Erlang actor-process on that node gets an IP allocated from its node’s /64 range.

This could just be a way of letting Erlang nodes talk to each other through tunnels. Or it could be a way of having Erlang “VMs” exposed directly to the Internet as their own little machines.

simonebrunozzi · on Nov 29, 2020

This deserves a blog post in itself. Please be kind and share it with the world! I bet you will hire a few engineers as a result of this blog post :)

bogomipz · on Nov 29, 2020

I think you might mean a SIT (simple internet transition) interface and not SIP? In case anyone is interested, this is a quick read on setting this up:

https://kogitae.fr/debianipv6-debian-wiki.htm

jmillikin · on Nov 29, 2020

Yes, sorry, SIT -- it's been a while since I set it up and I forgot some of the details.

dang · on Nov 29, 2020

We've fixed that typo in the GP comment now.

rualca · on Nov 29, 2020

Outstanding post. Thank you for taking the time to share this gem.

ownagefool · on Nov 29, 2020

I haven't used it, but doesn't this suit the average use case?

https://www.cni.dev/plugins/main/macvlan/

Basically just do normal ipv4 via your dhcp server rather than an overlay.

-- Edit

For argument's sake, I just set this up:

root@nas:/opt/cni/bin# ./dhcp daemon

cat /etc/cni/net.d/01-macvlan.conf

    {
      "name": "mynet",
      "type": "macvlan",
      "master": "eno1",
      "ipam": {
        "type": "dhcp",
        "routes": [{ "dst": "192.168.1.0/24" }]
      }
    }

PODIP = 192.168.1.181:8096

Works in my browser; so routes correctly.

Got its dhcp from my pihole.

daenney · on Nov 29, 2020

This is super useful for home networks, I do this for my k8s cluster hosted by a bunch of pi's.

But in production, I'd rather my ability to launch a new pod not be dependent on a DHCP server being reachable and functional. In that case, this particular trick is rather neat, since assignment of IP addresses is fully static/local (without having to agree upfront what range of IPs each node can use for bringing pods online), while retaining the benefit of everything being directly routable. You can now also run a ridiculous amount of pods on a single node.

ownagefool · on Nov 30, 2020

Yeah, I don't do this in production.

Though to counter your point, you don't actually need to use an external DHCP server in my example either. You can just define the block you're giving the server via the macvlan/ipvlan plugin, and I presume, again, it works with both IPv4 and IPv6.
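One way to read "define the block you're giving the server" is to swap the `dhcp` IPAM for the CNI `host-local` IPAM plugin with a fixed range per node. The sketch below generates such a config; the slicing scheme, function name, and per-node size are assumptions for illustration, and a real setup would also need to keep addresses already in use on the LAN (such as the gateway) out of the ranges.

```python
import ipaddress
import json

def macvlan_conf(master: str, subnet: str, node_index: int, addrs_per_node: int) -> dict:
    """Build a macvlan CNI config that hands this node a fixed slice of the subnet."""
    net = ipaddress.IPv4Network(subnet)
    # Hypothetical scheme: node N gets the Nth consecutive slice after the network address.
    start = net.network_address + 1 + node_index * addrs_per_node
    end = start + addrs_per_node - 1
    if end >= net.broadcast_address:
        raise ValueError("slice does not fit inside the subnet")
    return {
        "name": "mynet",
        "type": "macvlan",
        "master": master,
        "ipam": {
            "type": "host-local",
            "ranges": [[{
                "subnet": str(net),
                "rangeStart": str(start),
                "rangeEnd": str(end),
            }]],
            "routes": [{"dst": str(net)}],
        },
    }

print(json.dumps(macvlan_conf("eno1", "192.168.1.0/24", node_index=2, addrs_per_node=32), indent=2))
```

Unlike the IPv6 scheme earlier in the thread, this does require agreeing up front which slice each node owns.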

So I guess my wider point is, k8s probably doesn't need to be replaced to have the networking work how you like.

bogomipz · on Nov 29, 2020

Indeed and if you bring up a Kubernetes cluster today on AWS using their EKS there is no overlay network. There's a CNI but no overlay.

oneplane · on Nov 29, 2020

In some deployments we remove the VPC CNI and use an overlay anyway because of the integrations we need. You do lose Security Groups but that's not a big deal if you aren't using them in your deployment anyway.

bogomipz · on Nov 29, 2020

Oh sure, there are lots of good reasons for not using their CNI, not least of which is needing to worry about IP space utilization. I was just trying to point out overlays aren't a requirement.

AlphaSite · on Nov 29, 2020

I would need to double check, but isn’t it still running an overlay? Just one which is transparent to you?

bogomipz · on Nov 29, 2020

Well, an overlay network by definition is always going to be transparent to you. But no, the AWS VPC CNI does not create an overlay. It's just layer 3. It works by adding ENIs to your worker nodes with secondary IP addresses on them. And those secondary IPs are from your VPC's address space. See:

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/c...
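Because every pod consumes a secondary IP on one of the node's ENIs, the number of pods a node can hold follows from simple arithmetic. A small sketch, assuming the commonly cited secondary-IP calculation in which each ENI's primary address is not given to pods and two extra slots are counted for host-network pods; the ENI and address counts below are illustrative inputs, since the real limits depend on the instance type.

```python
def max_pods(enis: int, ipv4_per_eni: int, host_network_pods: int = 2) -> int:
    """Pod capacity of a node when each pod takes one secondary IP from an ENI."""
    # One address per ENI is its primary IP, so only the rest are available to pods.
    secondary_ips = enis * (ipv4_per_eni - 1)
    # Assumption: pods on the host network use no ENI address, hence the extra slots.
    return secondary_ips + host_network_pods

# Illustrative values: 3 ENIs with 10 IPv4 addresses each.
print(max_pods(enis=3, ipv4_per_eni=10))
```

This is also where the IP space concern comes from: every one of those addresses is drawn from the VPC's own subnets.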

oneplane · on Nov 30, 2020

Yep, another reason it does that is to spread the risk and limit the need for tons of unchecked (no src/dst check) addresses in your VPC.

In the end there is no ideal scenario; boils down to what works best for the use case (or what is the 'least worst' solution). Sometimes it gets you down, but those imperfections can turn a churn job into an interesting one.

fesc · on Nov 29, 2020

Could you elaborate on that a little bit? This sounds interesting and I'm not sure where to start researching about this.

jmillikin · on Nov 29, 2020

On my mobile so hard to go into detail, but take a look at 6rd and the Linux 6to4 tunnel driver. You can assign a prefix to the overlay, then each machine's IPv4 becomes a subnet, and packets can be routed by the kernel knowing only the destination pod's IPv6 address.

eric_khun · on Nov 29, 2020

I'm curious about your practices regarding CPU limits at Stripe. Have you noticed severe CPU throttling? What are your guidelines on this?

jmillikin · on Nov 29, 2020

Product teams deploying code to our Kubernetes clusters are strongly recommended to use resource limits, and we're going to make that a hard requirement at some point.

We haven't noticed unusual CPU throttling, though we do have some workloads that turned out to be burstier than expected and had to adjust their CPU limits to match.

Note that when it comes to subtle Linux thread scheduling behavior, your experience will depend on which runtime you use, and if using runc then which version of the Linux kernel your workers run. We weren't affected by the CFS bug introduced in Linux v4.18 because we never ran Kubernetes workloads on a machine with the affected kernel, and if a similar bug occurs in the future it might not affect workloads running within gVisor or Firecracker.

Additionally, Stripe has historically cared more about security than efficiency. This led to an architecture where services run on dedicated VMs, which naturally strands capacity and reduces the impact of bugs that appear at high utilization and/or high core count.

xyzzy_plugh · on Nov 29, 2020

I believe you did run into the CFS bug in non-Kubernetes workloads, though, specifically with Hadoop tasks. An engineer told me about a workaround he devised using cpuset.

jmillikin · on Nov 29, 2020

That's likely a different bug. From what I understand the CFS bug being discussed was introduced in Linux v4.18 and fixed in v5.3, and we have not used a kernel within that range in our Hadoop clusters.

mnahkies · on Nov 29, 2020

Not from Stripe, but I've seen pretty bad CPU throttling.

We often see quotas get exhausted through short bursts that don't show up in metrics, which then causes CFS throttling to occur even though it looks like the pod is nowhere near its limit. We also struggled with application startup requiring far more CPU than at runtime, leading to ridiculously slow startup times if you had a low limit.
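A toy model shows how a burst can be throttled while the averaged metric looks idle. It assumes the default 100 ms CFS period and that a CPU limit translates to a quota of limit × period per period; the function name and the demand numbers are made up for illustration, and the model ignores everything else the scheduler does.

```python
def simulate_cfs(cpu_limit: float, demand_ms: list, period_ms: float = 100.0):
    """Return (throttled periods, average cores used) for per-period CPU demand."""
    quota_ms = cpu_limit * period_ms  # CPU time allowed in each period
    backlog_ms = 0.0                  # work deferred because the quota ran out
    throttled = 0
    used_total_ms = 0.0
    for demand in demand_ms:
        wanted = backlog_ms + demand
        used = min(wanted, quota_ms)
        if wanted > quota_ms:
            throttled += 1            # the cgroup hit its quota in this period
        backlog_ms = wanted - used
        used_total_ms += used
    avg_cores = used_total_ms / (period_ms * len(demand_ms))
    return throttled, avg_cores

# Hypothetical workload: one 80 ms burst of CPU time (e.g. 4 threads for 20 ms),
# then idle for the rest of the second, with a 0.5 CPU limit.
print(simulate_cfs(0.5, [80] + [0] * 9))
```

In this made-up run the burst is throttled once, yet the one-second average is well under a tenth of a core, far below the 0.5 CPU limit.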

So far our solution has been to just remove CPU limits, but hoping things will get better.

Removing the limits really improved our latency tail, and so far hasn't resulted in CPU saturation at the node level, but your mileage may vary.

powerbook5300CS · on Nov 29, 2020

Did you know that the Linux kernel has a bug that makes CPU limits for containers extra costly?

https://github.com/kubernetes/kubernetes/issues/67577

https://github.com/torvalds/linux/commit/512ac999d2755d2b710...

If I recall correctly you need 4.18+ to get the fix.

dilyevsky · on Nov 29, 2020

We’ve seen just regular throttling too, especially with Erlang VM and Go workloads.