Getting true 4-way SLI to work

PCI Express has always been incredibly confusing to me, and there isn’t much written about it for laymen. But an old AnandTech piece finally made it clear what is happening.

Modern Intel processors are directly connected to memory and to peripherals. The smaller processors have 28 serial lanes and the Xeon has 40. These are very fast, about 1GBps per lane in version 3. (Small aside: Thunderbolt underneath is actually just the PCI Express protocol sent over a long distance.) And all the peripherals, from USB 3 to Thunderbolt to SATA to graphics cards, have to share. The CPU has three big connections: a quad channel to memory; a bunch of lanes that go mainly to the bus slots and graphics cards; and some lanes that go to the PCH, the peripheral controller chipset for SATA, Ethernet, etc., which is called X99 for Haswell-E or Z170 for Skylake.
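To put a number on "about 1GBps per lane", here is a minimal sketch of the arithmetic. The 8 GT/s signalling rate and 128b/130b encoding are the standard PCI Express 3.0 figures, not something stated above, and the function names are made up for illustration. It ignores packet and protocol overhead, so real throughput is somewhat lower.

```python
# Rough PCIe 3.0 bandwidth arithmetic, per direction.
# Assumptions (not from the article): 8 GT/s per lane, 128b/130b encoding.
GT_PER_S = 8.0          # gigatransfers per second on one lane
ENCODING = 128 / 130    # fraction of bits that carry payload

def lane_gbytes_per_s():
    """Approximate usable GB/s for one PCIe 3.0 lane."""
    return GT_PER_S * ENCODING / 8  # divide by 8 to go from bits to bytes

def link_gbytes_per_s(lanes):
    """Approximate usable GB/s for a link of the given width."""
    return lanes * lane_gbytes_per_s()

for width in (1, 4, 8, 16):
    print(f"{width}x: {link_gbytes_per_s(width):.2f} GB/s")
```

So a full 16x slot works out to a bit under 16 GB/s each way, which is why the round "1GBps per lane" figure is good enough for planning.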
Graphics cards are the big consumers and can handle up to 16 lanes’ worth of data. But if you do the math, 4 video cards at 16 lanes each is 16×4 = 64 lanes, far more than the CPU has. So what’s the solution? Well, it’s the same as networking: install a switch. In most cases, without a switch, supporting 4 cards means you end up with 4x/4x/4x/4x, so each card gets only ¼ of the bandwidth. But if you install a PLX switch, you can time-multiplex: when you access one card, it gets the full 16x. So the overall bandwidth is the same, but each card can burst at the full rate.
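The lane math can be sketched in a few lines. This is a deliberately simplified model that assumes the board splits the CPU lanes evenly across the cards and rounds down to a common link width; real motherboards often use uneven splits, so treat the helper names and the even-split rule as illustrative assumptions.

```python
# Simplified lane budget: how wide can each slot be without a switch?
VALID_WIDTHS = (16, 8, 4, 2, 1)  # common PCIe link widths, widest first

def lanes_needed(cards, width=16):
    """Lanes required to give every card a full-width link."""
    return cards * width

def even_split_width(cpu_lanes, cards):
    """Widest common link each card gets if lanes are divided evenly."""
    share = cpu_lanes // cards
    for width in VALID_WIDTHS:
        if width <= share:
            return width
    return 0  # not even one lane per card

print("4 cards at 16x need", lanes_needed(4), "lanes")
for cpu_lanes in (28, 40):
    width = even_split_width(cpu_lanes, 4)
    print(f"{cpu_lanes} CPU lanes, 4 cards: {width}x each without a switch")
```

Under this model a 28-lane part lands on 4x per card, which is the ¼ bandwidth case described above.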
For the true nerd: adding a switch like this does add latency. So some designs give card 1 eight lanes direct to the processor and then mux the other cards against just 8. This makes a single GPU very fast but penalizes the other cards.
The ultimate is to use two PLX 8747 chips, and that is how you build a switch which gives everyone 16 lanes of burst capacity, although the aggregate is always 16. This chip takes two 16x links and multiplexes them into a single 16x. So if you have two such chips, you can get 16x across 4 slots. Note that this is going to hide a lot of connectors underneath the video cards, particularly the last one.
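Here is a toy model of that two-switch layout, mostly to show the difference between burst and shared bandwidth. It assumes each switch has one 16x uplink feeding two 16x slots (slots 0 and 1 on the first chip, 2 and 3 on the second) and that cards active at the same moment split their switch's uplink evenly. The slot numbering and the even-sharing rule are assumptions for illustration, not how the chip actually arbitrates.

```python
# Toy model: two switches, each with a 16x uplink and two 16x slots.
UPLINK_LANES = 16
SLOTS_PER_SWITCH = 2

def effective_lanes(active_slots):
    """Map each active slot (0..3) to its share of the uplink, in lanes."""
    result = {}
    for slot in sorted(active_slots):
        switch = slot // SLOTS_PER_SWITCH
        # Count how many active cards sit behind the same switch.
        sharing = sum(1 for s in active_slots
                      if s // SLOTS_PER_SWITCH == switch)
        result[slot] = UPLINK_LANES / sharing
    return result

print(effective_lanes({0}))           # one card bursting alone
print(effective_lanes({0, 1}))        # two cards behind the same switch
print(effective_lanes({0, 1, 2, 3}))  # everyone transferring at once
```

A card transferring alone sees the full 16 lanes, while two cards behind the same chip effectively split them.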
In the older Sandy Bridge there were only 20 total lanes, but with faster peripherals the Haswell limits are 28 and 40. Even so, you need PLX chips to get four cards running at full rate.
Most games do not need 16x, so this is really for machine learning, where you push models in and read them out.
Now in truth the CPU doesn’t actually send data to a peripheral. That would be really inefficient. What is actually happening is that all PCI Express devices can be DMA masters. So the CPU tells the graphics card “read the memory here” and the GPU uses the PCI Express bus to fetch it. That’s why the 16x is important if there is a huge batch of data. In the old days this was split between the northbridge, which connects the CPU and GPUs to memory, and the southbridge, which is a giant multiplexer for all the slower peripherals.
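To see why link width matters for a big batch, here is a back-of-the-envelope transfer time using the round 1GBps per lane figure. The 8 GB batch size is just an example value, and the calculation ignores DMA setup and protocol overhead.

```python
# Back-of-the-envelope DMA transfer time at different link widths.
GB_PER_S_PER_LANE = 1.0  # the round PCIe 3.0 figure used in this post

def transfer_seconds(gigabytes, lanes):
    """Seconds to move a batch over a link of the given width."""
    return gigabytes / (lanes * GB_PER_S_PER_LANE)

batch_gb = 8  # hypothetical batch size, chosen only for illustration
for lanes in (16, 8, 4):
    print(f"{batch_gb} GB over {lanes}x: {transfer_seconds(batch_gb, lanes):.2f} s")
```

The same batch takes four times as long over 4x as over 16x, and that gap repeats on every batch you push to the card.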
Demand for lanes keeps growing. The new M.2 SSDs need 4x PCI Express lanes alone, and that is per SSD! And Thunderbolt 3 needs up to four lanes for 40Gbps mode. So that’s a good reason to get a Xeon-class processor if you are really doing heavy server or machine learning work where burst speeds matter.
Another detail: all of this is just about GPU to memory. SLI is different, a specialized secondary bus used just for processing display renderings. In this mode there is a separate SLI bridge, and the master graphics card farms out work to the other three cards. The name originally stood for scan line interleave (NVIDIA now calls it Scalable Link Interface), and literally each GPU calculates a set of scan lines and reports them back. It used to be alternating scan lines, but now it can be alternate frames. Or half of the image goes to a master and half to a slave. Or you can hand off antialiasing of a single image and each GPU does a part of the screen. Note you typically don’t need SLI for machine learning apps. Instead you split the batches of training data across the cards and then average the weights.
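The "batch, then average the weights" idea can be sketched without any GPU at all. This toy example fits a single weight w in y = w * x, gives each simulated worker its own shard of the data, and averages the workers' updated weights after every step. The model, the data, the learning rate, and the function names are all made up for illustration; real frameworks usually average gradients across cards, which comes to the same thing here because every worker starts from the same weight and the shards are the same size.

```python
# Toy data-parallel training: each "GPU" takes a step on its own shard,
# then the resulting weights are averaged.

def sgd_step(w, batch, lr):
    """One gradient step on mean squared error for y = w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

def data_parallel_step(w, data, n_workers, lr):
    """Shard the data, step each shard independently, average the weights."""
    shards = [data[i::n_workers] for i in range(n_workers)]  # needs len(data) >= n_workers
    local_weights = [sgd_step(w, shard, lr) for shard in shards]
    return sum(local_weights) / len(local_weights)

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # true weight is 3
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, n_workers=4, lr=0.01)
print(f"learned w = {w:.4f}")
```

Nothing in this scheme needs the cards to talk to each other over an SLI bridge; the weights only have to travel over PCI Express to wherever the averaging happens.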
OK, this is the last detail. In truth there isn’t, of course, a single CPU. Modern machines have multiple cores and caches. So what the memory sees is just what the last cache is missing. Most of the time, hopefully, the CPU is only reading from the cache, and the GPUs are processing just in their VRAM. But when there is a large write, like a big batch change, that is when this really matters.