Building deep learning machines 2019


Ok, it’s been a year since I’ve written down a recommendation for building a gigantic computer or any computer for that matter. I haven’t been building many of these now that a basic laptop like the MacBook Pro 15″ is so powerful. And I haven’t been doing much gaming and most of the machine learning work happens in the cloud.
But there is still room for a custom machine for a dedicated researcher doing machine learning work. If you have free grad student labor, it makes more sense to use a $20K machine than to pay $5/hour for machine learning if you are going to use it day and night for more than two years (that is, $20K buys 4,000 cloud hours, or about 167 days of continuous processing, so the machine pays for itself well inside that window).
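The break-even arithmetic is easy to sketch in Python. The $20K machine price and the $5/hour cloud rate come from the paragraph above; the function name is made up, and ignoring electricity, maintenance and resale value is a simplifying assumption, so treat this as a rough estimate rather than a real cost model.

```python
def break_even_days(machine_cost_usd, cloud_rate_usd_per_hour, hours_per_day=24):
    """Days of use after which owning the machine is cheaper than renting.

    Assumption: electricity, maintenance and resale value are ignored.
    """
    break_even_hours = machine_cost_usd / cloud_rate_usd_per_hour
    return break_even_hours / hours_per_day


if __name__ == "__main__":
    days = break_even_days(20_000, 5.0)
    print(f"Break-even after about {days:.0f} days of continuous use")
```

If the machine only runs 8 hours a day, pass `hours_per_day=8` and the break-even stretches out to 500 days.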
Tim Dettmers has some great insights and recommendations having built seven inference machines. Slav Ivanov has a good set of specific recommendations.

Graphics Card: Titan RTX or RTX 2080 Ti

The main thing to ensure is that you have selected a card which works with your workloads. If you use 16-bit models, then you should use the RTX (Turing) models as they are tuned for 16-bit performance. The now older GTX 1080 Ti and the like can be used just for 32-bit since they don’t have 16-bit optimizations.
But the most important thing is how big a model you are running. You need:

  • 11GB+ if you are chasing state-of-the-art scores in training
  • 8GB if you are running research models
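To get a feel for why 16-bit matters for fitting a model, here is a minimal sketch that estimates the memory taken by the weights alone. The 4 bytes per FP32 value and 2 bytes per FP16 value are standard; the function name and the 100-million-parameter example are hypothetical. Gradients, optimizer state and activations come on top of this and usually take a lot more, so this is a lower bound, not a sizing tool.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_gib(num_params, precision="fp32"):
    """Memory for the model weights only, in GiB (a lower bound on VRAM use)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    params = 100_000_000  # hypothetical model size
    for precision in ("fp32", "fp16"):
        print(f"{precision}: {weights_gib(params, precision):.2f} GiB for weights")
```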

So let’s look at the performance of the all-important GPU on machine learning models. The latest development is the new architecture that nVidia has produced.
Two years ago, the Pascal architecture was king of the hill with the Titan Xp running the GP102 chip; then, at the high end, there was the Volta GV100 in the Titan V. Now the new king is the Turing architecture, with the TU102 in the Titan RTX. They are really breaking price records here: it costs $2500. Wow, that’s a pretty penny.
Looking at the benchmarks, the already very expensive RTX 2080 Ti is about 10-15% slower, mainly because the Titan RTX has 24GB of VRAM, so for big loads it can keep more of the workload in VRAM and handle larger batches. Net, net, if you have the budget, the Titan RTX is the one to get.

GPU Cooling

With four GPUs, you are going to have four cards stuffed together, so cooling them is a big problem. You can lose all kinds of performance if the cards hit their thermal limits. You for sure want blower cards that pull air in from the front and exhaust it out the back.
And it is not easy to get nVidia fan control working with multiple GPUs under Linux. Coolbits doesn’t work well for multiple GPUs. The solution I’ve found is to get a big motherboard and try not to stack GPUs next to each other. 2 GPUs in a big server case isn’t a bad idea.

GPU Recommendations

nVidia Titan RTX. The absolute best performance with 24GB of vRAM; this is quite a future-proofed unit. If you are concerned about cost, remember that you don’t necessarily need to match GPU cards when you are running separate loads, so you could get one 24GB Titan RTX and then buy more as needed. These are a mighty $2400 each.
ASUS ROG Strix RTX 2080 Ti. This has 11GB of vRAM and about 10% less performance. I hesitate to call it a budget choice, but it is about half the price of the Titan RTX at about $1300 each.

CPU: AMD Ryzen Threadripper 2950X

The CPU has a major role to play because it preprocesses data and in many cases will do the upload to the GPUs. Here are some requirements.

CPU: 2 Cores per GPU minimum

The main thing to realize is what the CPU does: it doesn’t do the computation, it preprocesses the data, and there are two strategies for that.
With the first, you overlap preprocessing with training so the CPU preprocesses while the GPU trains. That is, you load a mini-batch into main memory from the disk, preprocess the mini-batch on the CPU, then load the data into the GPU, and finally the GPU does the training.
This first strategy benefits from a CPU with lots of cores, so you want 2 CPU cores per GPU, and you get an additional 5% or less for each additional core per GPU.
The second strategy is to preprocess all the data first; then the CPU loads it in from the disk, transfers it to the GPU, and the GPU runs it. With this second one you need just one core per GPU and adding more cores doesn’t help, but of course this is not overlapped, so it is likely slower.
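The first, overlapped strategy is a classic producer/consumer setup. Here is a minimal sketch using only the Python standard library: a loader thread preprocesses mini-batches and puts them on a bounded queue while the main thread "trains" on them. The `preprocess` and `train_step` functions are stand-ins I made up; no real GPU is involved, and real data loaders typically use several worker processes rather than one thread, which is where the extra cores per GPU come in.

```python
import queue
import threading


def preprocess(batch):
    # Stand-in for CPU work such as decoding and augmentation.
    return [x * 2 for x in batch]


def train_step(batch):
    # Stand-in for the GPU training step; returns a fake "loss".
    return sum(batch)


def loader(batches, out_queue):
    """Producer: preprocess mini-batches on the CPU and queue them."""
    for batch in batches:
        out_queue.put(preprocess(batch))
    out_queue.put(None)  # sentinel: no more data


def train_overlapped(batches, prefetch=2):
    """Consumer: train on one batch while the loader prepares the next."""
    q = queue.Queue(maxsize=prefetch)  # bounded so the loader cannot run far ahead
    worker = threading.Thread(target=loader, args=(batches, q), daemon=True)
    worker.start()
    losses = []
    while True:
        batch = q.get()
        if batch is None:
            break
        losses.append(train_step(batch))
    worker.join()
    return losses


if __name__ == "__main__":
    print(train_overlapped([[1, 2], [3, 4], [5, 6]]))  # prints [6, 14, 22]
```

The second strategy is the same code with the queue removed: preprocess everything up front, then loop over the stored batches.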

CPU Clock Rates

When running deep learning models, the CPUs are also typically running at 100%, so you get some benefit from higher CPU clock rates. Looking at old benchmarks, a 2.6GHz core is about 3% slower than a 3.5GHz core, so although the cores are maxed out, clock rate doesn’t make that much difference.
You get more from GPU improvements in general, so a decent 4-8 core CPU running at 3GHz is a good match for a 4-GPU setup.

AMD CPU Recommendations

Looking 18 months later at CPUs, the big change is the arrival of the 9th generation Intel architecture, the so-called Coffee Lake Refresh, and AMD now has the 2nd generation Ryzen. In the last review, the Xeon E5-1630 v3 was a good choice because you can overclock it and it has 40 PCIe lanes, but the AMDs have more like 60.
Looking at Tom’s Hardware Gamer and Desktop reviews, you want something called HEDT, or high-end desktop, or a workstation-class machine (along with a matching motherboard). Here are some more recommendations:
AMD Ryzen Threadripper 2950X. This is a follow-on to the 1950X recommended in 2017 in PC Parts Picker. In many ways, we are going to use the same system since AMD (unlike Intel) is not constantly changing the motherboards needed, so you can use the same basic parts. It has the same 16 core/32 thread layout as the 1950X. The main improvement is the move from 14nm to 12nm. Its big brother is the monstrous 2990WX with 32 cores and 64 threads, useful for heavily multithreaded jobs like video editing, but probably overkill for this purpose. They both have 64 PCIe lanes, so plenty in case you need to feed several GPUs running ML jobs at the same time. Running at 3.5GHz, this is a perfect CPU to feed lots of GPUs, although it does cost $900. And it uses the same X399 boards as the first generation Threadripper.
AMD Ryzen 7 2700X. This is the budget pick at $400 with 8 cores and 16 threads running at 3.7GHz, and it is the minimum needed to serve 4 GPUs.

Intel CPU Recommendations

In the last year, Intel has rejiggered their line. At the end of 2017, they basically had the old Broadwell-E Xeon on 2011 v3 sockets and then the new Skylake for Core i7 on 1151. In those days, if you went to 2011, you got ECC as the main benefit. Now, they have Core i9, Xeon W and Xeon Scalable, which are roughly prosumer, workstation and server products.
The Xeon W is the processor line for workstations; it uses the new 2066 pinout and requires the C422 chipset. It uses the Skylake-W processor variant. Intel is always changing pinouts, which means that motherboards don’t carry over, so you spend a lot of time understanding new lines. As before, support for large memory and ECC is the reason to buy these. They have 48 lanes. Note that there are special Mac versions that go into the iMac Pro. Right now it is nearly impossible to build your own workstation because these are such limited parts. A comparable part might be the Xeon W-2145 with 8 cores/16 threads for $1600. But you can get the Supermicro X11 SRA (Amazon), which only has three full-length slots (16/16/8), or the ASUS WS C422 Pro SE (Amazon), which has three slots with true 16/16/16. So not a bad choice.
So moving to the LGA 1151 consumer line, you can get, without ECC:
Intel Core i7-9700K. At $400 this is the other choice; it has 8 cores and 8 threads. It is an 1151-pin part, so it has only 16 available lanes, but there is safety in picking Intel, so this is not a bad choice.


Although water cooling sounds neat, the truth is that in a big case with a well-designed quiet fan, there is really no difference, and an air cooler is much simpler. The Noctua NH-U14S has long been a favorite of mine.

RAM: 16GBx4 ECC DDR4-2666

The main change here is the use of “pinned” memory. This means your data sits in a place where the GPU can find it and transfer it to its vRAM without any CPU involvement. In this case, memory speed improvements don’t matter. Even with CPU-mediated transfers, overclocked RAM results in a 3% speed improvement at most.
Also, in terms of capacity, you need enough memory for everything to fit, so having 64GB makes sense. As an aside, you probably really want ECC RAM at this level if you are running loads for a long time. Unless you need over 384GB, you can use unbuffered RAM.
A year ago, you could only get DDR4-2133 RAM, but now the new Ryzen allows up to DDR4-2666 RAM, which helps performance marginally. You can get 16GB per stick from Crucial, or 32GB per stick for high-density configurations (only specialized motherboards support it). The big tradeoff is that if you go to 8 slots’ worth of memory with dual-ranked RAM, then you can only run at DDR4-1866, so don’t overbuy unless you really need lots of RAM.


Motherboard Requirements: PCIe Lanes: 8 per GPU

The big thing is not to focus too much on the PCIe lanes in your CPU. These lanes are used to move data from dRAM to the VRAM of your GPU.
The conventional wisdom is that you want the full 16 lanes available for each GPU to transfer from the CPU. But in fact, this transfer time isn’t that important. The main point here is that the difference between 16 and 8 PCIe lanes from CPU to GPU is 2ms on a 216ms ImageNet pass, or about 1%, so most of the time this doesn’t matter much.
This means that with most Intel systems, you can have a dual GPU system where each gets 8 lanes and you should have about the same performance.
With multiple GPU training, you really only need 8 PCIe lanes per GPU, so 32 lanes total through the motherboard for a 4 GPU system; make sure that is what is available. This normally requires a PCIe switch in the motherboard, which only specialized units have.
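A quick way to sanity-check a build is to add up the lanes. This sketch uses the 8 lanes per GPU from above and the 4 lanes per M.2 NVMe drive mentioned in the storage section; the function names are made up, and it deliberately ignores chipset-attached lanes and PCIe switches, which real boards use to stretch the budget.

```python
def lanes_needed(num_gpus, lanes_per_gpu=8, num_nvme=0, lanes_per_nvme=4):
    """Total PCIe lanes wanted by the GPUs plus any NVMe drives."""
    return num_gpus * lanes_per_gpu + num_nvme * lanes_per_nvme


def fits(cpu_lanes, **build):
    """True if the CPU's lanes cover the build (simplified: no switch, no chipset lanes)."""
    return lanes_needed(**build) <= cpu_lanes


if __name__ == "__main__":
    build = dict(num_gpus=4, num_nvme=3)
    print("lanes needed:", lanes_needed(**build))  # 4*8 + 3*4 = 44
    for name, lanes in [("64-lane CPU", 64), ("40-lane CPU", 40), ("16-lane CPU", 16)]:
        print(name, "OK" if fits(lanes, **build) else "not enough lanes")
```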
On the Intel side, each generation requires a different motherboard, but with AMD, the same X399 motherboards will work for Ryzen 1 or 2. For Intel, the latest overclockable line is the Z390 series, which pairs with the overclockable K processors. But for machine learning, you don’t need to overclock. You want reliability.
To get four GPUs onto a single motherboard, you will need ATX at a minimum; eATX will give you an additional slot if you need it. We’ve used this in the past for NVMe cards if you run out of slots.

AMD Motherboard Recommendations

As an example, here is a new set of boards designed for the Ryzen 2, as reviewed by Anandtech and Tom’s Hardware:
ASUS X399 ROG Zenith Extreme. This is their flagship gamer board. It has room for 4 full-length GPUs running 16/8/16/8 lanes, which should work OK for machine learning, but obviously all 16 would be better. It also has three M.2 slots. The big issue is that it is hard to fit an air cooler on the CPU as the PCIe slot is very close to it. That is because they wanted a sixth slot for a 4 lane PCI Express card to run 10GbE Ethernet.
Gigabyte Aorus X399 Extreme (Tom’s Hardware) is an eATX board, so you can put four graphics cards in with 16/8/16/8 lanes available plus a single lane card as well. It supports 128GB in 8 slots with ECC available. It has room for three M.2 NVMe cards in 2280 and a single 22110 form factor. And it has onboard 802.11ac WiFi. It’s an expensive $400 board though, and Gigabyte has been OK for me but doesn’t have the same reputation as ASUS.
MSI MEG X399 Creation. Like the ASUS, this is a monster system with 16 PCI Express lanes to each of the GPUs, and there is a riser that attaches up to three M.2 cards. Also, if you don’t completely populate the graphics slots, it does include a special riser card that lets you attach four additional NVMe cards on an x16 slot. The main issue is the lack of 10Gb Ethernet, which is becoming important for machines like this.
Then there are less premium boards; they don’t have the bells and whistles but are better deals:
Gigabyte X399 Aorus Gaming 7. This is one step down from the Extreme. It has 4 slots at 16/8/16/8 and there is a 4 lane slot, but using it prevents one GPU from being installed. And it has two M.2 22110 and one 2280 on the motherboard.
ASRock X399 Taichi. (Anandtech) This is an ATX board that is also 16/8/16/8 and is a good budget choice at $280. It doesn’t have any of the fancy features like 10GbE or 802.11ad, but for a server, it isn’t clear that you need all this given that most of the time it is grinding away.

Intel Motherboard Recommendations

In the past, the ASUS Workstation (WS) line has been my go-to for Intel-based systems. Now for the new 9th generation, you can get the WS Z390 Pro, for instance, and I’ve had good luck with the WS X99 for the older LGA 2011 v3, which works with the old Broadwell Xeons and allows a 16/8/8/8 configuration with the 40 PCIe lanes from that chip.

Mass Storage SSDs and HDs

These days it only makes sense to get an NVMe SSD for the boot drive and for swapping, and these aren’t expensive anymore. Most modern motherboards let you put three M.2 NVMe cards in, so you can have a configuration that is all M.2. These take 4 PCI Express lanes each, so it is nice to have Ryzen’s 64 lanes. I would allocate them as system, boot and data drives.
Then you need a place to put all the data that you are running. Most datasets these days are pretty small, so a RAID-1 10TB hard drive is not a bad choice.
The market has slowed down some for this category, but Tom’s Hardware recommends that you stick with 1TB SSDs as the most economical choice, and Anandtech also has some recommendations, although the high-end SSD market has plateaued in performance:

  1. AData XPG Gammix S11. These are super fast, but they are no longer available and have been replaced by the AData SX8200 Pro at $180.
  2. Mushkin Pilot. At $190, this is another low-cost choice.
  3. Samsung 970 Evo. This is slightly lower performance than the 970 Pro, but at $250 it is much cheaper.
  4. Samsung 970 Pro. You can’t go too wrong with using these, but the AData is apparently faster. At $350 each, prices have really fallen, but they are still super expensive.
  5. HP EX920. This is last year’s model (the EX950 is the new one), but you can get a great value on closeouts at $180 for 1TB!

Power Supplies

You want a big power supply because they decrease in efficiency by 20% over their life. So, for instance, a 4-GPU system with 250 watts TDP per card plus the CPU can easily require 1250 watts, and you need 20% over that, so a 1.6kW supply isn’t unusual.
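Here is the sizing rule as a small sketch. The 250 watts per GPU, the 1250 watt total and the 20% figure come from the paragraph above; the 250 watts for the CPU and everything else is my assumption to reach that total, and I am treating the 20% as a derating (the supply should cover the load at 80% of its rating), rounded up to the next 100 watts. Adding a flat 20% on top instead gives 1500 watts, so either way you land in the 1.5-1.6kW class.

```python
import math


def psu_rating_watts(num_gpus, gpu_tdp_w=250, rest_of_system_w=250,
                     derate_pct=20, step_w=100):
    """Return (load in watts, suggested PSU rating in watts).

    Assumptions: rest_of_system_w covers the CPU, drives and fans, and the
    supply should carry the full load at (100 - derate_pct)% of its rating.
    """
    load_w = num_gpus * gpu_tdp_w + rest_of_system_w
    required_w = load_w * 100 / (100 - derate_pct)
    rating_w = math.ceil(required_w / step_w) * step_w  # round up to a common size
    return load_w, rating_w


if __name__ == "__main__":
    load, rating = psu_rating_watts(4)
    print(f"load {load} W, suggested supply {rating} W")  # load 1250 W, suggested supply 1600 W
```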

Bitcoin Mining Rigs

These rigs are mostly not appropriate for machine learning because mining doesn’t need many PCIe lanes per card, so they aren’t built to supply them. Other than these, though, there are no motherboards designed to take more than 4 graphics cards.
