
Increase usable cloud GPU memory by up to 6.6% by disabling ECC

On all the major cloud providers, you may have noticed that the amount of GPU memory on certain instance types is a little bit less than advertised. For example, Amazon's g4dn instances contain NVIDIA T4 GPUs, which according to NVIDIA come with 16GB of GDDR6 memory.

Figure 1. NVIDIA T4 specifications

Logging into an AWS g4dn.xlarge instance and running nvidia-smi shows something a little bit different:

Figure 2. nvidia-smi on g4dn.xlarge

There's only 15GB! Where did the last gigabyte go?

Error correction code (ECC) memory

The answer lies in NVIDIA's implementation of error-correcting memory for their professional / datacenter GPUs. Error-correcting memory uses some mathematical magic so that the processor can detect and/or correct bit flips that occur in memory. Bit flips are rare but do happen from time to time, due to causes like hardware failure and cosmic rays. There's no free lunch here, though: every error correction scheme requires additional space to store redundant data.

For example, a common scheme on CPUs uses 72 bits of storage for every 64 bits of usable memory. You can actually see this when you compare a stick of server memory with ECC support against desktop memory without ECC support. Both of these sticks have a usable space of 16GB, but the server memory has 9 memory chips on one side and the desktop memory only has 8.
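To put numbers on that scheme: 72 physical bits for every 64 usable bits means a 16GB-usable server stick needs 16 * 72/64 = 18GB of physical chips, which lines up with the 9-versus-8 chip count. A quick sanity check in the shell:

```shell
# 72 physical bits per 64 usable bits: a 16GB-usable ECC DIMM carries
# 16 * 72/64 = 18GB of physical chips (hence 9 chips instead of 8 per side).
echo $((16 * 72 / 64))   # prints 18
```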

Figure 3. 16GB ECC memory for CPU (9 memory chips per side)

Figure 4. 16GB non-ECC memory for CPU (8 memory chips per side)

On many of their datacenter GPUs, NVIDIA implements ECC by repurposing some of the existing memory capacity to store the redundant data. When the GPU is advertised as having 16GB of memory, that 16GB refers to the physical amount of memory, not the amount left over after ECC. After the ECC overhead, only 15GB of usable space remains, which is exactly what we're seeing on the AWS instance. Another, less obvious cost of enabling ECC is 1/16 of the memory bandwidth: the 320 GB/s of advertised bandwidth on the T4 becomes more like 320 * 15/16 = 300 GB/s.
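The same back-of-envelope arithmetic for the T4's 1/16 ECC overhead can be checked in the shell:

```shell
# With ECC on, 1/16 of both capacity and bandwidth goes to redundancy.
echo "usable memory: $((16 * 15 / 16)) GB"          # 15 of the advertised 16 GB
echo "effective bandwidth: $((320 * 15 / 16)) GB/s" # ~300 of the advertised 320 GB/s
```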

Note that this doesn't apply to GPUs that use HBM memory, such as the V100 and A100, since HBM has additional storage for ECC built into each memory chip. Unlike GPUs with regular GDDR memory, that extra ECC capacity is not counted as part of the advertised memory capacity.

Getting the 1GB back

The really interesting thing about ECC memory on GPUs is that it's configurable in software. From the Linux command line, it's a single command to turn off ECC on NVIDIA GPUs:

sudo nvidia-smi -g X --ecc-config=0

where X is the GPU ID number. The new setting takes effect on the next reboot.

After rebooting, we can see the full 16GB of GPU memory on our instance now, 6.6% more than before:

Figure 5. nvidia-smi on g4dn.xlarge with ECC off

In an environment like Kubernetes, a simple daemonset could be used to achieve the same effect on new nodes added to the cluster.
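As a sketch of what that daemonset might look like (the container image, node label, and GPU resource name below are assumptions about your cluster, and each node still needs a reboot before the setting takes effect):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-gpu-ecc
spec:
  selector:
    matchLabels:
      name: disable-gpu-ecc
  template:
    metadata:
      labels:
        name: disable-gpu-ecc
    spec:
      # Only schedule on GPU nodes; the label key/value is cluster-specific.
      nodeSelector:
        accelerator: nvidia
      containers:
      - name: disable-ecc
        image: nvidia/cuda:12.2.0-base-ubuntu22.04
        # Disable ECC on all GPUs on the node, then idle so the pod stays up.
        command: ["sh", "-c", "nvidia-smi --ecc-config=0 && sleep infinity"]
        securityContext:
          privileged: true
        resources:
          limits:
            nvidia.com/gpu: 1
```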

To re-enable ECC, run:

sudo nvidia-smi -g X --ecc-config=1

and reboot again.

Is it worth it?

Obviously, turning off a feature like ECC that can impact data integrity should not be done without considering the trade-offs. The pros and cons are summarized below:

Pros:

  • More GPU memory available to hold larger datasets
  • Proportionally more GPU memory bandwidth, improving the performance of memory-bound GPU applications

Cons:

  • Higher chance of data corruption, including silent data corruption. A 2009 study by Google found that a widely varying fraction (0.05%-20%) of CPU memory modules experienced at least one correctable memory error per year, but it's hard to say what error rates look like on today's GPU memory.

For many applications, an output that is invalid once in a million runs is acceptable, or errors can be detected and retried through a separate mechanism (e.g. checkpointing). In these cases, turning off ECC can be the right choice.

At Exafunction, we’re constantly searching for ways to make GPUs more efficient. This article is part of our series on GPU Tips and Tricks, where we’ll share interesting insights and learnings we found along the way. Stay tuned for more content about how you can get the most out of your GPU in deep learning applications!