Earlier this week, Ethan Banks wrote a very nice article about Mellanox’s dual spine and leaf network in support of a large number of 10GbE access ports. After describing the scaled-up network design, he reviews 8 observations about it, not to label anything good or bad, but merely to call out specific points to consider. Entirely coincidentally (Ethan lives close to us, but I am pretty sure he cannot peek through our windows), we had gone through a similar exercise this week, documenting the choices and limitations of spine and leaf networks. And as always, the conclusions are not ones of right or wrong, but of awareness of choices and their consequences.
The Mellanox design Ethan describes employs an extra spine layer; we have seen and heard the same from Arista and others, some calling it a spine-spine or similar. Nitpicking perhaps, but adding a spine layer to a spine and leaf network and still calling it a spine and leaf network is like adding a docking station, screen and keyboard to a phone and still calling it a phone. It’s a computer that can make calls. And a spine and leaf network with an extra spine is a fat tree.
Ethan observes in his first few points that the sheer amount of cables and optics is astonishing. Let me try and put some numbers around that statement. If you build an approximately 3:1 oversubscribed spine and leaf network out of generic switches, I would probably use a Trident2 based ToR switch with 48 SFP+/10GbE ports and 6 QSFP/40GbE ports. The 48 ports should serve most rack deployments with single homed servers; only really dense or heavily multi homed servers would need more ports. I will use 4 of the QSFP ports to connect to my spine. That leaves 2 QSFPs on each leaf to be used for extra access ports, at the cost of some oversubscription. Or, if I wanted, I could use these as extra spine connections, lowering my oversubscription to 2:1. As a spine, and staying away from chassis based systems, I would pick a 32xQSFP/40GbE switch.
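The oversubscription arithmetic above is simple enough to sketch in a few lines of Python. Nothing here comes from a vendor datasheet beyond the port counts already stated; it is just the ratio of access bandwidth to uplink bandwidth for the leaf switch described:

```python
# Leaf oversubscription sketch for the ToR described above:
# 48x10GbE access ports plus 6xQSFP, each QSFP = 4x10GbE equivalent.
QSFP_10G_EQUIV = 4

def oversubscription(access_10g_ports, uplink_qsfps):
    """Ratio of access bandwidth to uplink bandwidth (both in 10GbE units)."""
    return access_10g_ports / (uplink_qsfps * QSFP_10G_EQUIV)

print(oversubscription(48, 4))  # 4 QSFP uplinks, 2 spare -> 3.0, i.e. 3:1
print(oversubscription(48, 6))  # all 6 QSFPs as uplinks -> 2.0, i.e. 2:1
```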
The largest spine and leaf network I could build out of this combination contains 16 spine switches and 96 leaf switches. In a spine and leaf design I need to connect each leaf to each spine, and I have the equivalent of 16x10GbE to use out of each leaf, which I can connect to at most 16 spines. With each spine receiving 1x10GbE from each leaf, I can build out to 96 leafs to fill out my 32xQSFP, or 96x10GbE equivalent, spine switch capacity. 96 leaf switches give me 4608 10GbE access ports at 3:1 oversubscription, or 5376 at a slightly worse oversubscription if I use the extra QSFP ports on my leaf switches. To support these access ports, I need 3072 10GbE fabric ports: 16 on each of the 96 leaf switches, matched by another 1536 on the spine side. That means 1536 switch interconnect cables. And 3072 10GbE short reach optics, because the vast majority of connections between spines and leafs are likely to be at distances that DAC cables cannot cover (not to mention that most high density 10GbE switches are designed without PHYs, limiting the use of passive cables to usually 5m or less).
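The whole sizing exercise falls out of the port counts mechanically. Here is a back-of-the-envelope sketch, using only the assumptions already stated: a leaf with 4 QSFP uplinks broken out to 16x10GbE, a spine usable as 96x10GbE, and one port plus one optic at each end of every leaf-to-spine link:

```python
# Back-of-the-envelope sizing for the 10GbE-fabric spine and leaf design.
LEAF_UPLINKS_10G = 4 * 4       # 4 QSFP uplinks broken out to 16x10GbE
SPINE_CAPACITY_10G = 96        # 32xQSFP spine run as 96x10GbE equivalent

max_spines = LEAF_UPLINKS_10G  # one 10GbE from each leaf to each spine
max_leafs = SPINE_CAPACITY_10G # one 10GbE per leaf fills a spine at 96

access_ports = max_leafs * 48               # at 3:1 oversubscription
access_ports_max = max_leafs * (48 + 2 * 4) # using the 2 spare leaf QSFPs

links = max_leafs * LEAF_UPLINKS_10G  # leaf-to-spine 10GbE links
fabric_ports = 2 * links              # one switch port at each link end
optics = fabric_ports                 # one short reach optic per port

print(max_spines, max_leafs)              # 16 96
print(access_ports, access_ports_max)     # 4608 5376
print(links, fabric_ports, optics)        # 1536 3072 3072
```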
I’ll let you do the math on the cost of that infrastructure. And the installation. And the maintenance. And the sheer complexity of running a 10GbE from each leaf to each spine. If you want to reduce some of this complexity, you can switch to using 40GbE/QSFPs between the leafs and spines, but by doing so you reduce the maximum size of the network. Each leaf will now contribute 1 QSFP worth of interconnect to each spine, so a 32xQSFP spine can only support 32 leaf switches, or 1536 10GbE access ports (1792 if you use the 2 extra leaf QSFP ports). And you probably noticed that I left no room on the spine switches to actually connect this spine and leaf network to the rest of the network infrastructure. Taking a few QSFPs for that reduces the size of the network even further.
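The 40GbE variant follows the same pattern, with the QSFP now the unit of interconnect instead of a broken-out 10GbE lane. A minimal sketch of that trade-off:

```python
# Same fabric, but with native 40GbE QSFP links between leaf and spine.
# Each leaf now burns one whole QSFP per spine instead of one 10GbE lane,
# so the 32xQSFP spine caps the fabric at 32 leaf switches.
SPINE_QSFPS = 32
LEAF_UPLINK_QSFPS = 4

max_spines = LEAF_UPLINK_QSFPS  # one QSFP from each leaf to each spine
max_leafs = SPINE_QSFPS         # 32 leafs fill each spine

print(max_leafs * 48)           # access ports at 3:1 -> 1536
print(max_leafs * (48 + 2 * 4)) # with the 2 spare leaf QSFPs -> 1792
```

The cabling gets simpler (384 links instead of 1536), but the maximum fabric shrinks by a factor of three.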
Even in the spine, spine and leaf network (or spine, leaf and ToR in Mellanox terminology), while the amount of cabling between the ToR (someone explain to me why that is not the leaf of the network?) and the aggregation leaf may be reduced, the cabling between the aggregation leaf and the spine still follows the same model as above.
We often focus on the cost of switches and the derived cost per port. Of course the cost per port is important, but don’t fool yourself by taking the cost of a switch and dividing it by the number of ports it supports. It makes for great quotes in press releases, but the actual cost of that port is way higher the moment you add in the overhead required to connect it to the rest of the network. And for spine and leaf networks, a very large portion of the cost of the rest of the network is cables and optics. Even for reasonably priced optics, that piece of the infrastructure is likely going to cost more than all of the spine switches together. Even when you create one of these multispined animals…[Today’s fun fact: The human spine contains 120 muscles and approximately 220 individual ligaments.]
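To make the naive-versus-loaded cost-per-port gap concrete, here is an illustrative sketch. Every dollar figure below is hypothetical, not taken from the article or any price list; the point is the shape of the math, not the numbers:

```python
# Illustrative only: hypothetical prices showing how optics and cabling
# inflate the real cost per access port of a spine and leaf fabric.
SWITCH_PRICE = 20_000  # hypothetical price per leaf or spine switch
OPTIC_PRICE = 300      # hypothetical 10GbE short reach optic
CABLE_PRICE = 50       # hypothetical fiber patch cable

leafs, spines = 96, 16
access_ports = leafs * 48
fabric_links = leafs * 16   # 1536 leaf-to-spine 10GbE links
optics = 2 * fabric_links   # one optic at each link end

switch_cost = (leafs + spines) * SWITCH_PRICE
interconnect_cost = optics * OPTIC_PRICE + fabric_links * CABLE_PRICE

print(f"naive $/port:  {SWITCH_PRICE / 48:.0f}")
print(f"loaded $/port: {(switch_cost + interconnect_cost) / access_ports:.0f}")
```

With these made-up numbers the loaded cost per port comes out at roughly 1.7x the naive switch-price-divided-by-ports figure, and the optics alone approach the cost of all the spine switches, which is the point of the paragraph above.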