Cybersecurity researchers from the Pacific Northwest National Laboratory (PNNL) have discovered vulnerabilities in Nvidia DGX systems that expose them to side-channel and covert-channel attacks.
The vulnerabilities are microarchitectural in nature and can affect both local and remote systems. The team reverse engineered the cache hierarchy and showed how an attack launched from one GPU can reach the L2 cache of a connected GPU (the accelerators are linked by Nvidia’s proprietary NVLink interconnect) and cause contention there.
While reverse engineering the caches and examining the overall Non-Uniform Memory Access (NUMA) configuration, the team found that “the level 2 cache on each GPU caches data for any memory pages mapped to the physical memory of this GPU (even from a remote GPU).”
This makes it possible to create contention in a remote GPU’s cache simply by allocating memory on the target GPU, which is a key building block for covert and side channels. Such attacks bypass isolation-based defenses, such as the partitioning mechanisms that can be enabled for processes running on the same GPU.
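To illustrate why cache contention is enough to build a covert channel, here is a minimal Python sketch of the classic prime-and-probe idea. It is a toy simulation, not the researchers' code: the cache geometry (4 sets, 4 ways, 128-byte lines, LRU replacement), the class `SetAssociativeCache`, and the functions `addresses_for_set` and `send_bits` are all assumptions made for the example. On real hardware the receiver cannot see hits and misses directly and would infer them from access latency.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy LRU set-associative cache. Sizes are illustrative, not Nvidia's."""

    def __init__(self, num_sets=4, ways=4, line_size=128):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        cache_set = self.sets[index]
        if tag in cache_set:
            cache_set.move_to_end(tag)      # refresh LRU position
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)   # evict the least recently used line
        cache_set[tag] = True
        return False


def addresses_for_set(cache, set_index, base_tag, count):
    """Build `count` addresses that all map to the same cache set."""
    return [((base_tag + i) * cache.num_sets + set_index) * cache.line_size
            for i in range(count)]


def send_bits(bits):
    """Sender and receiver share one cache set; a 1 is signaled by evictions."""
    l2 = SetAssociativeCache()
    receiver_lines = addresses_for_set(l2, 0, 0, l2.ways)
    sender_lines = addresses_for_set(l2, 0, 1000, l2.ways)
    received = []
    for bit in bits:
        for address in receiver_lines:      # prime: fill the set
            l2.access(address)
        if bit:                             # sender touches the set only for a 1
            for address in sender_lines:
                l2.access(address)
        # probe: any miss means the sender evicted the receiver's lines
        misses = sum(not l2.access(address) for address in receiver_lines)
        received.append(1 if misses > 0 else 0)
    return received


if __name__ == "__main__":
    print(send_bits([1, 0, 1, 1, 0]))
```

In this model the sender and receiver never exchange data directly; the only shared state is the cache set they both map to, which is the role the remote L2 cache plays in the attack described above.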
The attacks are carried out entirely at the user level and require no special privileges. The attack model challenges the assumptions behind previous GPU-based attacks and significantly broadens the threat model for multi-GPU servers.
Measures to prevent exploitation include static or dynamic partitioning of shared resources. In multi-user environments, each GPU can be divided into separate GPU instances, each with its own dedicated and isolated path through cache and memory.
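The effect of static partitioning can be shown with a similar toy model. The Python sketch below is only an illustration under assumed parameters (8 sets, 4 ways, LRU replacement), and the names `PartitionedCache` and `probe_misses` are hypothetical. It does not describe how Nvidia hardware implements GPU instances.

```python
from collections import OrderedDict


class PartitionedCache:
    """Toy LRU cache whose sets are statically divided among partitions."""

    def __init__(self, num_sets=8, ways=4, partitions=2):
        self.ways = ways
        self.partitions = partitions
        self.sets_per_partition = num_sets // partitions
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, tenant, line):
        """Return True on a hit; on a miss, fill the line and return False."""
        # The tenant picks the partition; the line number picks a set inside it.
        partition = tenant % self.partitions
        index = (partition * self.sets_per_partition
                 + line % self.sets_per_partition)
        cache_set = self.sets[index]
        key = (tenant, line)
        if key in cache_set:
            cache_set.move_to_end(key)
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)   # evict the least recently used line
        cache_set[key] = True
        return False


def probe_misses(partitions):
    """Count how many of tenant 0's lines tenant 1 manages to evict."""
    cache = PartitionedCache(partitions=partitions)
    stride = cache.sets_per_partition
    victim_lines = [i * stride for i in range(cache.ways)]
    attacker_lines = [(100 + i) * stride for i in range(cache.ways)]
    for line in victim_lines:               # tenant 0 fills one set
        cache.access(0, line)
    for line in attacker_lines:             # tenant 1 tries to evict it
        cache.access(1, line)
    return sum(not cache.access(0, line) for line in victim_lines)


if __name__ == "__main__":
    print("shared cache:", probe_misses(1), "misses")
    print("partitioned cache:", probe_misses(2), "misses")
```

With a single shared partition the second tenant evicts all of the first tenant's lines; with two partitions the tenants never touch the same set, so the contention signal a covert or side channel depends on disappears.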