How do I access the GPUs I subscribed to?
By logging into your GPUaaS instance, you can interact directly with the GPU(s) that were assigned during subscription.
Can I run `nvidia-smi`?
Yes, `nvidia-smi` is functional on GPUaaS instances.
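For example, a minimal Python sketch (not an official hosted·ai tool) that shells out to `nvidia-smi` to list the GPUs visible inside your instance could look like this:

```python
# Minimal sketch: list the GPUs visible inside a GPUaaS instance by
# querying nvidia-smi. Assumes the NVIDIA driver utilities are on PATH.
import subprocess

def list_visible_gpus():
    """Return one 'index, name, total memory' line per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    for gpu in list_visible_gpus():
        print(gpu)
```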
Won't I get out-of-memory errors if sharing a GPU with others?
hosted·ai supports full GPU memory partitioning, preventing other users from causing out-of-memory (OOM) errors.
How does hosted·ai deal with physical vRAM limitations?
hosted·ai allows a GPU device to be sold multiple times over by managing memory context switching.
The scheduler pairs system memory with the GPU card memory and opportunistically pages memory in and out of the card without the consumer being aware of it. In this manner, each end user can subscribe to up to the full vRAM capacity of the card. The service provider incurs an additional cost in providing more system RAM capacity, but at a significantly lower cost than GPU vRAM.
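As a back-of-the-envelope illustration of that cost trade-off (the card size, sharing ratio, and per-GB prices below are assumptions, not hosted·ai figures):

```python
# Illustrative only: estimate the host RAM a provider needs to back
# oversubscribed vRAM, and compare the cost to buying more GPU vRAM.
# All numbers below are hypothetical assumptions, not hosted·ai pricing.

CARD_VRAM_GB = 48               # physical vRAM on one GPU card
SHARING_RATIO = 4               # each tenant may subscribe up to the full card
COST_PER_GB_SYSTEM_RAM = 3.0    # assumed $/GB for system RAM
COST_PER_GB_GPU_VRAM = 25.0     # assumed $/GB for GPU vRAM

# Worst case: every tenant fills its full subscription, so paged-out
# contexts for (SHARING_RATIO - 1) tenants must live in system RAM.
backing_ram_gb = CARD_VRAM_GB * (SHARING_RATIO - 1)

print(f"Extra system RAM needed: {backing_ram_gb} GB")
print(f"Cost via system RAM: ${backing_ram_gb * COST_PER_GB_SYSTEM_RAM:,.0f}")
print(f"Cost via extra vRAM: ${backing_ram_gb * COST_PER_GB_GPU_VRAM:,.0f}")
```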
Is the vRAM capacity split between workloads?
Each Team has a hard limit on the amount of GPU vRAM it can allocate, based on its Pool subscriptions. This limit ensures customers cannot use more vRAM than they have been allocated. The data stored in their allocated memory is isolated from other users.
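Conceptually, the hard limit behaves like an admission check against the Team's Pool subscription. The sketch below illustrates that behaviour and is not hosted·ai's implementation:

```python
# Conceptual sketch of a per-Team vRAM hard limit (not hosted·ai code).
class TeamVramQuota:
    def __init__(self, subscribed_gb: float):
        self.subscribed_gb = subscribed_gb  # limit from the Pool subscription
        self.allocated_gb = 0.0

    def try_allocate(self, request_gb: float) -> bool:
        """Admit the allocation only if it stays within the hard limit."""
        if self.allocated_gb + request_gb > self.subscribed_gb:
            return False  # request rejected; other Teams are unaffected
        self.allocated_gb += request_gb
        return True

quota = TeamVramQuota(subscribed_gb=24)
print(quota.try_allocate(16))  # True  -> 16 GB of the 24 GB limit used
print(quota.try_allocate(12))  # False -> would exceed the 24 GB hard limit
```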
How are TFLOPs shared?
hosted·ai uses a time-based approach to divide GPU resource availability when oversubscribed. The Time Quantum (TQ) is set to 30 seconds by default. This means each task can use the GPU for a block of 30 seconds.
You can reduce the Time Quantum to 2 seconds for more rapid switching between tasks. Be aware that this setting increases overhead and may be less efficient for longer-running tasks. TQ can be set on a per Pool basis. Contact the Support Team for assistance if you would like to tailor this value.
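To see why a smaller TQ trades responsiveness for overhead, consider this toy calculation; the context-switch cost is an assumed figure used purely for illustration:

```python
# Toy model of the Time Quantum trade-off. The context-switch cost is an
# assumed illustrative value, not a measured hosted·ai figure.
SWITCH_COST_S = 0.2   # assumed time lost per context switch (seconds)

for tq_s in (30, 2):  # default TQ vs the shorter 2-second setting
    overhead = SWITCH_COST_S / (tq_s + SWITCH_COST_S)
    worst_case_wait = tq_s  # a newly arriving task may wait up to one TQ
    print(f"TQ={tq_s:>2}s: ~{overhead:.1%} switching overhead, "
          f"up to {worst_case_wait}s wait for a turn")
```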
What is the performance impact of the Time Quantum?
Tasks that run longer than the TQ get a guaranteed minimum share of the GPU, based on the Sharing Ratio applied to the GPUaaS Pool. When no other tasks are running on the same physical GPU, they can use it fully.
Can you explain the different scheduler modes?
The scheduler supports three GPU sharing modes that determine how GPU resources are allocated between users: Security, Balanced, and Performance. These modes correspond to different configurations of Temporal and Spatial scheduling.
Temporal mode means that Time Division Multiplexing (TDM) is utilised to share access to the GPU resources, while Spatial mode means that multiple team tasks can run on the GPU at the same time. The admin chooses the mode to apply per pool when the pool is initialised, via a slider whose options range from Performance mode (Spatial scheduling) to Security mode (Temporal scheduling). The highest security setting enables full memory zeroing between tenants, ensuring that no runtime vRAM state is readable by tasks between time slots. There is therefore a trade-off between lower security with higher performance on the left side of the slider and higher security with lower performance on the right side.
Balanced mode provides a middle ground between performance and security. It still uses Temporal scheduling to share GPU resources, but memory zeroing between task executions is disabled, improving efficiency while maintaining a good level of workload separation. It is recommended for internal or trusted environments where performance is important but basic isolation is still needed.
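One compact way to picture the three modes is as combinations of scheduling style and memory zeroing. The mapping below is a conceptual sketch derived from the descriptions above, not a hosted·ai configuration format:

```python
# Conceptual mapping of the three GPU sharing modes (not hosted·ai config syntax).
from dataclasses import dataclass

@dataclass(frozen=True)
class SharingMode:
    scheduling: str                   # "Temporal" (TDM) or "Spatial" (concurrent tasks)
    zero_vram_between_tenants: bool   # full memory zeroing between time slots

MODES = {
    "Performance": SharingMode(scheduling="Spatial",  zero_vram_between_tenants=False),
    "Balanced":    SharingMode(scheduling="Temporal", zero_vram_between_tenants=False),
    "Security":    SharingMode(scheduling="Temporal", zero_vram_between_tenants=True),
}

for name, mode in MODES.items():
    print(f"{name:<11} -> {mode.scheduling} scheduling, "
          f"memory zeroing {'on' if mode.zero_vram_between_tenants else 'off'}")
```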
When in Temporal mode, each team subscribes to a guaranteed minimum share of the GPU, and they can elastically increase or decrease this amount depending on how much of the processing time they need to consume. The TDM time slot is controlled by the ‘Time Quantum’ (TQ) parameter: a smaller TQ value means that context switching between users is more frequent, which reduces response latency but also increases overhead (less efficient), whereas a larger TQ decreases overhead but makes processes less responsive. The sharing ratio applied to a pool of GPU cards determines how many times the service provider will oversubscribe the GPU resource and, therefore, the maximum number of users that may be sharing a card at any time.
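As a simple illustration of how the sharing ratio bounds each team's guaranteed minimum share (the ratios below are hypothetical):

```python
# Illustrative only: how a pool's sharing ratio bounds each team's share.
def guaranteed_min_share(sharing_ratio: int) -> float:
    """With up to N teams oversubscribed onto one card, each is guaranteed at least 1/N."""
    return 1.0 / sharing_ratio

for ratio in (2, 4, 5):   # hypothetical sharing ratios
    print(f"Sharing ratio {ratio}: at most {ratio} teams per card, "
          f"each guaranteed >= {guaranteed_min_share(ratio):.0%} of GPU time "
          f"(more when others are idle)")
```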
The scheduler continuously monitors processing activity and quickly detects when tasks fall idle before the allocated TQ slot expires. When tasks fall idle, the scheduler opportunistically deploys other tasks that have work to do rather than allowing the GPU to sit idle. The scheduler applies a credit algorithm that records how much of its allocated time allowance each team is consuming; if a team is not utilising its full allocated share, credit is applied to that team. If a team has more credit than the currently executing team and suddenly needs to execute a task, the scheduler will pre-empt the running task in favour of the team with higher credit, rewarding it for having accumulated credit while allowing others to execute earlier. Latency-sensitive processes, such as inferencing tasks, benefit most in this scenario. In most cases, this means that inferencing tasks behave just as they would on a dedicated GPU, but with the considerable advantage of allowing other users to consume the idle cycles in between.
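The credit mechanism can be sketched as a small simulation; this is a simplified illustration of the behaviour described above, not the actual scheduler:

```python
# Simplified sketch of a credit-based GPU scheduler, illustrating the idea
# described above (not the hosted·ai implementation).
from dataclasses import dataclass, field
from collections import deque

TQ = 2.0  # Time Quantum in seconds (illustrative value)

@dataclass
class Team:
    name: str
    share: float                                 # guaranteed fraction of GPU time
    credit: float = 0.0
    tasks: deque = field(default_factory=deque)  # remaining seconds of work

def schedule_slot(teams):
    """Run one TQ slot: accrue credit by share, run the richest ready team."""
    for t in teams:
        t.credit += t.share * TQ                 # every team earns its allocated share
    ready = [t for t in teams if t.tasks]
    if not ready:
        return None                              # nothing to run this slot
    runner = max(ready, key=lambda t: t.credit)  # highest credit wins / pre-empts
    used = 0.0
    while runner.tasks and used < TQ:            # stop early if the work finishes
        burst = min(runner.tasks[0], TQ - used)
        runner.tasks[0] -= burst
        used += burst
        if runner.tasks[0] <= 0:
            runner.tasks.popleft()
    runner.credit -= used                        # pay for the GPU time actually used
    return runner.name, used

# A mostly idle inferencing team accumulates credit, so its short request
# pre-empts the long-running training job as soon as it arrives.
training = Team("training", share=0.5, tasks=deque([60.0]))
inference = Team("inference", share=0.5)
for step in range(4):
    if step == 2:
        inference.tasks.append(0.3)              # latency-sensitive request arrives
    print(step, schedule_slot([training, inference]))
```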