
GPUaaS Node Roles

Each GPUaaS node can take on one or more of the following roles.

Controller Node

The Controller node orchestrates GPUaaS services within your region. Allocating sufficient resources prevents service disruptions.

Reserve at least 16GB of memory for the Controller role.

Worker Node

Worker nodes have GPU devices mapped to them and provide the CPU and memory resources for the end-user containers that run locally on them. Correct resource allocation is crucial for efficient workload processing and to avoid performance bottlenecks.

System memory and CPU requirements depend on the vRAM of the GPUs in the node and on the sharing ratio. The sharing ratio determines how many workloads can share the same GPU.

Reserve vRAM * sharing ratio * 1.1 as a baseline. For example, if your GPU has 16GB of vRAM and a sharing ratio of 2, ensure you have 16 * 2 * 1.1 = 35.2GB of memory.
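
For concreteness, here is a minimal sketch of that calculation in Python. The function and variable names are illustrative only and are not part of GPUaaS.

    def worker_memory_gb(vram_gb, sharing_ratio, headroom=1.1):
        """Baseline system memory (GB) to reserve on a worker node.

        Implements the sizing rule above: vRAM * sharing ratio * 1.1,
        where the 1.1 factor provides 10% headroom.
        """
        return vram_gb * sharing_ratio * headroom

    # Worked example from the text: a 16GB GPU shared by 2 workloads.
    print(worker_memory_gb(16, 2))  # 35.2 (GB)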

The worker nodes should also provide sufficient physical storage (ideally fast NVMe storage) for the ephemeral workload storage presented to the containers. The amount required depends on the storage size options presented to end users, the number of GPU cards present, and the configured sharing ratio. Reserve #GPUs * sharing ratio * maximum storage allocation per end-user workload, as sketched below.
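
The storage rule can be sketched the same way; the names and the example values (4 GPUs, 100GB per workload) are assumptions for illustration.

    def worker_storage_gb(num_gpus, sharing_ratio, max_workload_storage_gb):
        """Baseline ephemeral storage (GB) to reserve on a worker node.

        Implements the rule above: #GPUs * sharing ratio * maximum
        storage allocation per end-user workload.
        """
        return num_gpus * sharing_ratio * max_workload_storage_gb

    # Example with assumed values: 4 GPUs, sharing ratio 2, 100GB per workload.
    print(worker_storage_gb(4, 2, 100))  # 800 (GB)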

Storage Service Node

The Storage Service node provides persistent storage to instances. Provide adequate CPU (16 cores) and memory (16GB) alongside sufficient storage capacity presented via an LVM volume group. The volume group can be mapped to a local storage drive/array or to a remote SAN over iSCSI or Fibre Channel.
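
As a minimal sketch of preparing such a volume group on a local drive, the standard LVM tools can be invoked from Python as below. The device path and volume group name are assumptions; substitute values appropriate to your environment.

    import subprocess

    DEVICE = "/dev/sdb"          # assumed local drive dedicated to GPUaaS storage
    VG_NAME = "gpuaas_storage"   # assumed volume group name

    # Initialise the drive as an LVM physical volume, then create the
    # volume group that the Storage Service role will consume.
    subprocess.run(["pvcreate", DEVICE], check=True)
    subprocess.run(["vgcreate", VG_NAME, DEVICE], check=True)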

Service Gateway Node

The Service Gateway node provides external access to services running on worker nodes. Allocating adequate memory (8GB) ensures reliable and consistent external access.