I have to disagree with the primary point being offered up in this post. In my career I’ve met quite a few network engineers who don’t think sufficiently outside the box. They typically say “this is how it’s done, this is how it works.” The idea that we can’t come up with a better protocol than BGP to manage the interconnections of the Internet is absurd. We don’t say “well security isn’t a programming problem, it’s a human problem, so there’s no point in making safer programming languages.” BGP is a ridiculous system of complete trust just like many older protocols.
If we sat down and started designing a routing protocol to handle the advertisement of routing tables on the internet, would we come up with BGP again? Would the new protocol potentially offer up fixes for some of the most common and obvious problems with BGP? Then there’s potential for improvement.
The network industry resists and slows badly needed disruption in several key areas through the dogmatic defense of the status quo that we typically hear from its constituents.
This is why so many of us pray for SDN. It’s the only way to wrest control of networking away from those who typically have dogmatic vendor based training rather than a “network science and theory” education that leads to innovation.
Although this may very well be the mentality of many rank-and-file network engineers you come across, it isn't at all an accurate representation of those in charge of the architecture and design of larger networks. Even the more open-minded engineers tend to be pragmatic, though, and there have already been numerous proposals and initiatives that have reached near-consensus acceptance as good ideas, such as BCP38 and IRR-generated prefix lists, yet still have abysmal adoption rates. Solutions already exist for most problems and can be implemented case by case, which is a far lower barrier than upending BGP altogether. Throwing BGP out in favor of an entirely new protocol has an order of magnitude lower probability of seeing widespread adoption.
What's lacking in many existing cases is simply (business) will. There are many businesses who will not allocate resources to make the Internet a better place if it doesn't improve their bottom line, and there are many more businesses who lack the resources to even be able to consider adopting anything beyond the bare minimum.
At the heart of it, most carrier networks lose money most quarters, and smaller ISPs barely scrape by. It's only a handful of very large, oligopoly eyeball networks like Comcast that are quite profitable.
>This is why so many of us pray for SDN. It’s the only way to wrest control of networking
Why not take the standards approach and draft a protocol spec? At some point, you'll need to interoperate with those horribly dogmatic vendor hugging networks that serve the majority of traffic on the internet.
>who typically have dogmatic vendor based training rather than a “network science and theory” education that leads to innovation.
I've seen this as well, and it can be frustrating. Part of me believes this is a defense mechanism used to prevent developers from pushing problems into the network. The other part believes that network engineering has drawn individuals to the field based on earnings potential rather than interest in the work, and the talent just isn't there. Both are anecdotal, so grain of salt and all.
> At some point, you'll need to interoperate with those horribly dogmatic vendor hugging networks that serve the majority of traffic on the internet.
And this has been their defense all along. It serves the status quo to imply that transitions must be difficult; IPv6 is the most obvious example today. Of course, as DJB pointed out, a backwards-compatible IPv6 would have rolled out by now, at the cost of a single /64. Dual stack was literally the most difficult possible way to accomplish the IPv6 transition.
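For a sense of what "backwards compatible at the cost of a single small block" could look like, here is a sketch in Python of the address-embedding idea. This is not DJB's exact proposal; it borrows the well-known 64:ff9b::/96 translator prefix from RFC 6052, which standardizes the same family of idea (every IPv4 address mapped to exactly one IPv6 address inside a single block):

```python
import ipaddress

# Translator prefix: the entire IPv4 space fits in the low 32 bits
# of this one /96 (which itself fits inside a single /64).
PREFIX = ipaddress.IPv6Network("64:ff9b::/96")

def v4_to_v6(v4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of the translator prefix."""
    v4_int = int(ipaddress.IPv4Address(v4))
    return ipaddress.IPv6Address(int(PREFIX.network_address) + v4_int)

def v6_to_v4(v6: str) -> ipaddress.IPv4Address:
    """Recover the embedded IPv4 address (assumes v6 is inside PREFIX)."""
    low32 = int(ipaddress.IPv6Address(v6)) & 0xFFFFFFFF
    return ipaddress.IPv4Address(low32)

print(v4_to_v6("192.0.2.1"))          # 64:ff9b::c000:201
print(v6_to_v4("64:ff9b::c000:201"))  # 192.0.2.1
```

The mapping is reversible with no per-host state, which is why a translator at the edge can serve legacy hosts without every network dual-stacking first.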
Now let’s look at the BGP situation. As long as the outcome is the same (same routing tables result in nominal situations), any two ISPs could choose to use a different protocol for their interconnection. Same thing that has happened with a lot of supplemental networking protocols.
There are BGP alternatives on deck. What we don’t need is high profile network engineers speaking out against them and in favor of the status quo. The truth is, networking is the one part of the industry that isn’t improving at the same pace as the rest. And all the improvements that are happening seem to be related to single links and switching, whereas the defense of networking dogma is related to routing. According to the holy bible of networking, switching is easy, routing is hard. Routers must be memory restricted with small memory footprints to artificially create SKUs. Routers must operate on multi-minute timers rather than seconds. Building routing tables is an insanely intense and time consuming process. Blah blah blah.
The reality is, almost all of these problems are contrived to maintain the status quo and so network vendors can ignore the pace of hardware and instead set SKUs at resource levels that force you to use the “appropriate” piece of hardware for the appropriate task.
Another smell that should make this obvious to everyone is that almost every major company has had high level engineers make some statement of how network companies are ripping you off, and explaining how they bought white box network solutions that route layer 3 at wire speed, converge rapidly, and have tons of memory. Not to mention the vast improvement something like well designed SDN can provide.
The network companies rely on the fact that most companies don’t have the labor or expertise to use white label solutions. But that doesn’t disprove the point. That proves the point that the state of hardware is “super cheap components provide all the switching and routing performance and memory needed for almost 99% of networking purposes.” Companies like Cisco are playing the Intel +25Mhz game. Supposedly disrupting companies do so at an awfully measured pace. And the network staff generally will open up to off brand switches but start to shy away at routing, especially edge routing.
As an industry, networking needs to be disrupted and it needs a severe model change. People involved in networking need to get on board, and those who don’t and insist on fighting for the status quo need to be left behind.
I don't think it's any defense - it's plain fact. You're going to have to interoperate with other networks. BGP is the lingua franca for doing so (until another comes along). You can draft a spec, or add capabilities to BGP, but at some point your network needs to speak the same language as others or it's just an island.
>There are BGP alternatives on deck. What we don’t need is high profile network engineers speaking out against them and in favor of the status quo. The truth is, networking is the one part of the industry that isn’t improving at the same pace as the rest. And all the improvements that are happening seem to be related to single links and switching, whereas the defense of networking dogma is related to routing. According to the holy bible of networking, switching is easy, routing is hard. Routers must be memory restricted with small memory footprints to artificially create SKUs. Routers must operate on multi-minute timers rather than seconds. Building routing tables is an insanely intense and time consuming process. Blah blah blah.
I don't know where you're seeing this. The rage around Clos networks has sent everyone after small-table SoCs. Anyone operating at scale is running some form of commodity (aka BRCM) chip as a router.
And when you get to routers, the Jericho (purchased by Broadcom, found in the Cisco NCS) or the Juniper Paradise (found in the PTX) are both capable of taking in internet routing tables using either HMC or off-chip memory in a pizza box form factor and in a fixed pipeline chip. Your statements aren't aligning with the reality of the platforms available today.
>Another smell that should make this obvious to everyone is that almost every major company has had high level engineers make some statement of how network companies are ripping you off, and explaining how they bought white box network solutions that route layer 3 at wire speed, converge rapidly, and have tons of memory.
This last part reads like the grass is always greener, and doesn't consider a lot of the hard work and time that goes into developing network hardware.
First, a conflation: white box can entail NPUs or ASICs, which have different implications for FIB memory - the majority of these major companies (read: FB, MSFT, GOOG, AMZN) use ASICs with some on-chip memory (in the dozens of MB at best). Convergence speed is primarily driven by the network operating system and its protocol implementation (aka something you need to write, or contract out to a vendor).
What those high-level engineers don't tell you is how hard it is to create white box solutions. And they aren't vendor-free paradises either - you have to sign an NDA with the chip manufacturer and use their SDK. All of those things like implementing counters, make-before-break, prioritization, and the control-plane and data-plane interfaces are up to you. And when that's all done, you need to perform countless hours of testing.
> In my career I’ve met quite a few network engineers who don’t think sufficiently outside the box. They typically say “this is how it’s done, this is how it works.”
Completely agree; it is my experience as well. I believe this is because, at least historically, very few network engineers are interested in software.
This is slowly changing, but it will take "a generation".
I'm a network engineer myself, and I can tell you your evaluation of the industry is pretty spot on. I believe a big part of the solution will be so-called white-box hardware. This will break the Cisco strangle-hold and force network operators to have deeper technical knowledge.
As an NSP architect I can tell you the hardest resource to procure is experienced talent that can operate a production network without trying to move fast and break things.
Buying hardware with usable software is pretty critical when dealing with stateful services.
I can see where SDN might be applied to build your own machinery for intra-domain routing, but how does it help in the broader internet where BGP is applied?
Sure, I guess if you developed a new mechanism for resolving transitive policy with baked-in authentication, and got everyone to agree to use it, you wouldn't need to convince the router vendors. But it seems like the first two things are harder than the third.
The biggest problem with replacing BGP is the adoption. BGP is "good enough." What's the incentive? If you inject BGP routes into the new protocols, you'll have some of the same issues, right?
Slightly aside of BGP at internet scale: we've recently switched to using BGP internally within our network for kubernetes + load balancer (nginx) routing (announcement / discovery), amongst other things, and I can tell you it's one of the best moves we've ever made. The simplicity, performance and reliability are brilliant.
We did the same thing but opted for the much simpler kube-router, which is more k8s-specific and doesn't have nearly as many moving parts (easier to troubleshoot as a result). Bonus point: it uses LVS for service proxy load balancing.
OT: is this really "simple" to set up, or do you mean that it's simple once it's running? What do you mean by simplicity?
I'm actually looking forward to using k8s in our network, too. But I think the simplest solution for HA k8s is using keepalived paired with nginx/haproxy at the master level and adding a keepalived VIP for every service.
(At least for our 3-node master, 3-node worker setup.)
(I evaluated the calico + BGP option too, but it looked way too complex for a k8s starter.)
And there’s plenty of other resources out there. IMO if you’re designing a highly available platform, BGP is one of the simplest parts of that complex system.
In my experience you’re likely to run into much more finicky problems with higher layer systems including failover delays, reliance on casting systems and state drift.
Kubernetes is very complicated, but BGP is not. If the curve for BGP is too steep, you can run OSPF inside your racks with BGP announcements to the spine. This is pretty common, as OSPF can use area 0 with little to zero configuration. BGP leaf-to-spine is standard in layer 3 networks.
Indeed, completely for fast, simple routing (rather than higher level, more complex failover).
Slightly off topic, but in my experience most (general) internal network loops stem from misconfigured devices bridging networks (at the desktop level, network device level, or VM / SDN level).
I may be old, but if you have a network loop inside your own infrastructure, wouldn't it be way more sane to handle that at layer 2? For instance, with properly configured spanning tree?
Spanning tree just prunes links from the network to eliminate loops. Sometimes the link that got pruned happened to be the fastest path from point A to point B. So, sometimes you can get a more efficient network if you leave those links in, but that requires more sophisticated routing.
Routing is a Layer 3 issue. Loops are a Layer 2 issue.
Multiple paths on a Layer 3 network provide redundancy when used with a routing protocol such as BGP.
Multiple paths on a Layer 2 network provide redundancy when used with a protocol such as LACP. Without proper design, 2 network segments connected with multiple links will trigger Spanning Tree and shut down one of the links. And depending on the configuration, it may or may not unshut if the active link pair goes down.
Also, it allows loops to be used as multiple paths from different points in the net. Say you have a rack at the south and one at the north, and there is a loop through a rack in the middle. For failover, you'll take any path that is up. But for normal operation, you'd prefer to take the shortest path.
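A toy illustration of that pruning cost: a minimal Dijkstra sketch over a hypothetical three-rack triangle, comparing the full layer 3 topology against a spanning tree that has blocked the direct link (rack names and costs are made up):

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path costs from src over a weighted adjacency dict."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(pq, (nd, nbr))
    return dist

# Triangle: south--middle, middle--north, plus a direct south--north link.
full = {
    "south":  {"middle": 1, "north": 1},
    "middle": {"south": 1, "north": 1},
    "north":  {"south": 1, "middle": 1},
}
# Spanning tree with the direct south--north link pruned to break the loop.
pruned = {
    "south":  {"middle": 1},
    "middle": {"south": 1, "north": 1},
    "north":  {"middle": 1},
}

print(dijkstra(full, "south")["north"])    # 1 - L3 routing keeps the direct link
print(dijkstra(pruned, "south")["north"])  # 2 - STP forced the detour via middle
```

Same physical links, but the layer 2 answer permanently gives up the shortest path, while layer 3 routing can use it normally and still fail over.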
Technology doesn't magically fix human problems, but good uses of technology can play to human strengths so that humans make fewer mistakes, or the mistakes have fewer bad consequences. BGP can and should be implemented so that the human operators are making the fewest possible mistakes and those mistakes have the least bad consequences. I doubt that protocol changes are what is needed, but that doesn't mean sitting on our hands is the right choice.
As I understand BGP, any updates published by any AS may arbitrarily re-route/disrupt traffic, i.e. it is not sufficient for me to update my systems; everyone must update all systems for the overall routing infrastructure to be protected. That sounds a lot like a protocol change to me.
> any updates published by any AS may arbitrarily re-route/disrupt traffic
Fortunately, most ISPs do strict filtering on announcements from customer ASes. It's when mid-to-large ISPs fuck up (or neglect the filtering part) that it becomes a problem.
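A rough sketch of what that strict per-customer filtering amounts to, in Python. The prefix list and customer block here are hypothetical documentation ranges; real deployments generate these lists from IRR data:

```python
import ipaddress

def build_filter(allowed_prefixes):
    """Strict-mode filter: accept an announcement only if it equals,
    or is a more-specific of, a prefix registered for that customer."""
    nets = [ipaddress.ip_network(p) for p in allowed_prefixes]
    def accept(announcement: str) -> bool:
        ann = ipaddress.ip_network(announcement)
        return any(ann.version == n.version and ann.subnet_of(n)
                   for n in nets)
    return accept

# Hypothetical customer registered to announce 203.0.113.0/24 only.
accept = build_filter(["203.0.113.0/24"])
print(accept("203.0.113.0/24"))   # True  - their registered block
print(accept("203.0.113.0/25"))   # True  - more-specific of their block
print(accept("198.51.100.0/24"))  # False - someone else's space, dropped
```

The mechanism is simple; the thread's point stands that the hard part is getting every transit provider to actually deploy it.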
Yes we do; there are problems that beg to be fixed. Most current complaints are about security, and in one way or another the fixes require plenty of CPU horsepower, which most routers don't have - the CPU pegging while suddenly switching uplinks and retrieving a full view is still a typical picture at a modern ISP. What does it take to prevent something like NLRI spoofing? Plenty of layers of data security.
And it still won't fix the human problem underneath it.
Security-aware BGP versions like S-BGP, soBGP and psBGP address most of the security concerns, add plenty of router load, and still... don't solve some of the human problems.
But the current publicly made suggestions for replacing BGP with something completely different lack even a proper understanding of the design goals of the original BGP standard.
Many years ago I ran a small hosting ISP, ~10 cabinets. One of the choices I made and still stand by was I used Linux-based routers.
We could throw tons of CPU horsepower at it and the fast path and the slow path were the same thing. We had deep Linux expertise, but only passing Cisco expertise, so staying with the Linux networking stack required less specialized knowledge.
But, while most people were worrying about the size of the BGP tables, or trimming the announcements they were receiving, we could take full feeds and never had memory problems. We could run strong filtering, to make sure we weren't sending or receiving junk like bogons and that our users weren't sending spoofed source addresses, without worrying that adding one more rule would run past the ASIC memory and send everything down the slow path. We had high availability. All for around $2500 in hardware.
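For illustration, the kind of checks involved (BCP38-style egress filtering plus a bogon drop) can be sketched in a few lines of Python. The address blocks here are hypothetical documentation ranges and the bogon list is deliberately incomplete:

```python
import ipaddress

# A few well-known bogon ranges (not a complete list).
BOGONS = [ipaddress.ip_network(p) for p in
          ("0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8",
           "172.16.0.0/12", "192.168.0.0/16")]

OUR_SPACE = ipaddress.ip_network("203.0.113.0/24")  # hypothetical allocation

def drop_outbound(src: str) -> bool:
    """BCP38-style egress check: drop anything our users send
    with a source address outside our own allocation."""
    return ipaddress.ip_address(src) not in OUR_SPACE

def drop_inbound(src: str) -> bool:
    """Drop packets arriving with a bogon source address."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in BOGONS)

print(drop_outbound("10.1.2.3"))     # True  - spoofed, not our space
print(drop_inbound("192.168.0.5"))   # True  - bogon source
print(drop_inbound("198.51.100.7"))  # False - ordinary routable source
```

On a Linux router this logic lives in netfilter rules rather than Python, but the point from the comment holds: on general-purpose hardware the rule count is bounded by RAM and CPU, not by scarce ASIC TCAM.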
It always seemed crazy to me that they would put such small CPUs in the high end routers. I know they tried to have that CPU do almost nothing, and the ASICs do all the heavy lifting. But for our needs, having an insanely fast CPU and dumb interfaces was perfectly adequate, and let us do full BGP feeds on a $1200 router.
The only real trick we used was a single network interface with VLANs, with the multiple network connections terminated at the switch. We were able to get feeds via fiber that we could just terminate in an SFP module in the switch. We had 4-core CPUs, so this gave us plenty of horsepower to handle packet storms or DDoS instead of livelocking the kernel with too many interrupts (it took a while for the kernel to switch from interrupts to polling, and if packets ramped up too quickly the kernel couldn't switch if you didn't have more CPUs than interfaces).
I run a not too small hosting ISP today, and ten years ago I used routers based on commodity hardware running OpenBSD. Back when we were at Gbps scale and most customers were on 100Mb links, this worked fine. However, it did not scale to 10Gbps+, let alone the 100Gbps+ of capacity we're at today.
Besides the obvious case of aggregate performance not keeping up, performance for individual customers on 1Gbps links was not as good either. We found that individual TCP session throughput routed through commodity hardware was noticeably lower than through a layer 3 ASIC-based switch, despite various attempts at tweaking kernel sysctl variables.
Also, because nobody operates commodity-based routers at carrier scale, you cannot trust any open source BGP daemon's implementation of BGP confederations, if they even have one at all. Beyond that, I wouldn't really trust any vendor other than Cisco or Juniper for carrier-scale routing.
So even if someone were to create a popular, open source routing project making use of DPDK to increase performance scalability on commodity hardware, I'd still stick to Cisco or Juniper solutions like CSR 1000V or vMX for the foreseeable future if I really wanted to use commodity hardware, as the full feature set is going to be proven and mature. I'd want to see any new product stand up well for 5+ years at carrier scale production use before I'd give it any serious consideration.
Even then, I'm not sure how well DPDK will stand up to the UDP reflection volumetric attacks which now dominate DDoS. The top packets/s levels achieved are barely sufficient for simple routing, and would likely drop precipitously with even a modest ACL in place, whereas ASIC-based routers can still do full line rate with ACLs that have hundreds of terms.
So what are our options for securing BGP-like functionality? The thing that triggered me to tell this story in reply was the part about how most routers don't have the CPU to do the processing that enhanced security would require.
At the time, I was deploying Pentium D routers in the multi-GHz range, while the hot-shot Cisco routers were running 100MHz MIPS CPUs, IIRC.
I know they put a ton of emphasis on the ASICs, but seems like they maybe need to splurge a bit on the CPUs. :-) That being said, I haven't really been doing much networking these days, I now just work for a single org, we let the facility do the heavy routing.
Not sure how far back you're going, or which model of Cisco router you're referring to. I first cut my teeth on networking working at a Tier 2 carrier back in 2005 on Cisco GSR's (12000 series) which use a 200MHz R5000 MIPS CPU, but they were already quite long in the tooth at the time and were one of the few remaining networks still running them. And I do recall implementing ACL's to be an issue. Not sure if you were referring to ACL's or BGP policies themselves when referring to securing BGP.
These days, a full BGP capable router starts with the ASR1001 for Cisco which starts with a 32bit 1.5GHz CPU + 4GB RAM on the RP1 and goes up to a 64bit quad-core 2.2GHz CPU + 8-64GB on the RP3. Juniper land, the lowest end non-EOS router would be the MX80 which comes with a 1.33GHz PPC CPU and 2GB RAM, but unofficially Juniper will always steer you towards the MX104 with a 1.8GHz CPU and 4GB of RAM. I can't comment on the ASR's, but we have MX80's and we can do line rate ACL's with hundreds of terms in ASICs without issue. BGP policies themselves are still handled by the CPU though, and are indeed quite slow on the MX80's with full convergence taking up to 20 minutes or so. We've mostly relegated them to core switching duty at this point though, and are waiting for the new MX204's (800Gbps in 1U) to become mature enough before replacing them.
Moving up a bit, we use MX480's as well, currently with routing engines that have 2GHz Intel CPU's and 4GB of RAM; I believe these are already EOS, but not enough of an issue for us to buy upgraded routing engines for this platform. These do full table re-convergence in about a minute or two. I believe quad-core 1.8GHz is standard now, with up to hex-core 2GHz available. I don't think CPU's are really much of an issue anymore, although obviously still not as fast as what you can find easily enough in commodity hardware. The ASICs handle most things flawlessly though.
Well... As someone outside of that arm of the industry, I have to wonder about what exposure there is to spectre, or what kind of patches are coming out for affected machines.
JunOS is based on FreeBSD, which doesn't have a fix yet. Not sure about the different variants of Cisco IOS. You wouldn't run untrusted code on a router though, so spectre would not be a concern. In fact, it's probably better not to patch for it given the performance degradation.
That worked fine back then, but in-kernel routing/switching does not scale beyond 1-2 Mpps. Much less if you have complicated iptables rules.
We're getting there with DPDK and similar technologies, but right now, if you're running a high-bandwidth network, there's really no alternative to buying ASIC-based routing equipment from the big vendors. Even more if you're targeted by DDoS attacks, which represent the worst case: tens of gigabits of <100B packets.
I've done mostly the same thing you describe and it absolutely works and works well. You throw much less money into hardware, you don't need to rely on vendor-specific experts and expensive certification training and you don't have to pay out the nose for vendor maintenance agreements and licensing. It also requires that your people have real understanding of how routing and switching protocols actually work rather than being glorified [insert vendor] sales consultants. Add some programming and you have SDN and an incredible amount of power and flexibility with very little siloed, non-adaptable expertise.
If we don't replace BGP, people with BGP access are going to keep "accidentally" redirecting traffic bound for AWS and other data centers to themselves so they can man-in-the-middle crypto mining and steal millions.
I don't think we need a new BGP; we just need to fix the old BGP. I have yet to see a practical and realistic alternative. Also, don't forget that MPLS in its current state is very interdependent with BGP, and MPLS is really awesome.
Yes, BGP has some weaknesses, but perhaps calling for a new BGP is throwing the baby out with the bathwater.
You know what I think is more of a problem: vendor market dominance built on closed source proprietary boxen on proprietary platforms designed for vendor lock-in.
Also, some have referenced the power of the ASICs in these black boxen over COTS hardware, but these days that's not true, because there are open source systems with ASICs in them.
Not exactly related to BGP, but I did an entire infrastructure upgrade for a company. I talked to every vendor in existence - Extreme, Juniper, Cisco, Brocade, etc. I ended up going with Ubiquiti and saving over $50k. These vendors are gouging customers who already have problems with IT infrastructure budgets, and I'm tired of dealing with them (and their often non-spec implementations of protocols).
On top of that, you would be amazed at how many "network engineers" don't know anything but cisco, and couldn't route their way out of a paper bag in anything else, this result of the vendor lock-in is much more of a problem.
Let's also not forget the NSA backdooring Cisco and other vendors' systems.
A bit off topic obviously, but that's how I feel overall about Apple's UI experience, or product design in general. UI interaction, if you think about it, is also a protocol of sorts - between the user and the device.
Say the user scans the preferences window, finds the checkbox they were looking for, picks one, clicks on it, it reacts and is marked as such, and so on. Now try to reduce the number of actions and decisions needed from the user there. Maybe they don't have to click an "Apply" button when done. Or maybe there are only 2 checkboxes. Or they are not even in the preference menus at all.
The relevance of "you cannot solve people problems with technology" point is debatable. Sure, some problems you can't solve, but a properly designed system can definitely reduce the risk of human error - we've seen this in fields such as aviation and medicine (to some extent).
But what isn't debatable is the fact that BGP is a security nightmare, and there's really no reason for internet routing to be so vulnerable. It's amazingly simple to divert traffic using BGP, and it has been so for years - https://www.youtube.com/watch?v=S0BM6aB90n8.
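The core mechanic behind most of those diversions is just longest-prefix match: a more-specific announcement wins, no matter who originated it. A minimal sketch with hypothetical prefixes and AS labels:

```python
import ipaddress

def best_route(table, dst):
    """Longest-prefix match: routers prefer the most specific route,
    regardless of who announced it - the root of classic BGP hijacks."""
    dst = ipaddress.ip_address(dst)
    matches = [(net, origin) for net, origin in table if dst in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Legitimate origin announces its whole /22.
table = [(ipaddress.ip_network("198.51.100.0/22"), "legitimate AS")]
print(best_route(table, "198.51.100.10"))  # legitimate AS

# Attacker announces a more-specific /24 covering the same space.
table.append((ipaddress.ip_network("198.51.100.0/24"), "hijacker AS"))
print(best_route(table, "198.51.100.10"))  # hijacker AS
```

Nothing in base BGP validates that the hijacker is entitled to originate that /24; that check has to be bolted on via filtering or RPKI origin validation.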
>"From time to time, I run across (yet another) article about why Border Gateway Protocol (BGP) is so bad, and how it needs to be replaced. This one, for instance, is a recent example."
Yet if you read the linked article this author references in that last sentence, it does not suggest replacing BGP at all. The referenced post even states:
>"How do we fix this? Well, aside from making sure that anyone touching BGP knows exactly what they’re doing? Not much."
And then the referenced article goes on to mention RPKI and the MANRS frameworks which are not routing protocols nor are they meant to replace BGP.
Operator and designer mistakes happen; the goal should be to make simple systems easy to build. Complex networks will need thoughtful solutions, and better designs will require fewer operator interventions.
I could be petty and pick apart the article for mistakes, but I agree with the gist: if you consider your needs and environment and use the right design, you might not hate the protocol(s) you end up with.
We wouldn't need BGP at all if they hadn't ditched flow routing from IPv6 before it even became a proper draft, and if IP in general didn't assume we still use thicknet everywhere.
Costs, costs, costs. As your IP service becomes commoditized, fewer, not more, engineers make bigger changes.
Interestingly, expert systems reduce, not improve, the quality of the operating engineers: http://journals.sagepub.com/doi/full/10.1177/000183921775169...
Mostly this makes no sense because the things BGP cares about are orthogonal to mere IPv4 or IPv6 addressing.
But if there is a difference, IPv6 means fewer prefixes are needed for any given provider. Since IPv6 prefixes are huge and unfragmented, we can usually give a provider enough address space in their initial allocation to last them forever, and if they exceptionally outgrow it, larger space is available, whereas with IPv4 there's no chance anybody can give you a /16 even if you warrant one.
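To put numbers on that, a back-of-the-envelope sketch assuming the common practice of a /32 initial provider allocation with a /48 per customer site:

```python
import ipaddress

# A typical initial IPv6 allocation for a provider is a /32; each
# customer site commonly gets a /48 out of it.
alloc = ipaddress.ip_network("2001:db8::/32")
sites = 2 ** (48 - alloc.prefixlen)
print(sites)  # 65536 customer /48s from one unfragmented block

# Compare: an entire IPv4 /16 is 65536 single addresses,
# and nobody can get one of those anymore.
print(ipaddress.ip_network("203.0.0.0/16").num_addresses)  # 65536
```

One contiguous IPv6 allocation serves as many whole customer sites as an unobtainable IPv4 /16 serves individual addresses, which is why a provider can keep announcing a single prefix instead of a patchwork.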
Oh yeah, nothing makes a mess of your network like someone coming in and saying "we need to remove these IPs from the middle of our advertised blocks for security reasons".
I think we're going to see less of this thanks to IPv6's design: renumbering is less of a pain than it was with IPv4, so instead of splitting off a route in the middle of your address space and bloating the BGP table, you can just move to a different subnet entirely.
Network operators dislike buying new hardware. As policies percolate through tiers of providers, the raw number of announced prefixes becomes a sticking point.