# hosted·ai GPUaaS Infrastructure Management API Guide

## Overview

The hosted·ai GPUaaS (GPU as a Service) platform provides comprehensive REST APIs for infrastructure-level management of GPU resources.

## GPU Pool Management APIs

### Pool Provisioning

#### Create GPU Pool
- **Endpoint**: `POST /gpuaas/pool/create`
- **Purpose**: Create a new GPU pool with specified configuration
- **Key Parameters**:
  - `sharing_ratio`: Overcommit ratio (1-10x)
  - `time_quantum_in_sec`: Time slicing quantum (10-120 seconds)
  - `security_level`: Isolation level (low/medium/high)
  - `attach_gpu_ids`: GPU devices to assign to pool

#### Add Pool to Existing GPUaaS
- **Endpoint**: `POST /gpuaas/pool/add`
- **Purpose**: Add additional pools to existing GPUaaS instances

#### Delete GPU Pool
- **Endpoint**: `DELETE /gpuaas/pool/{poolId}/delete`
- **Purpose**: Remove GPU pools from infrastructure

### Pool Discovery and Monitoring

#### List GPU Pools
- **Endpoint**: `GET /gpuaas/{gpuaasId}/pool/list`
- **Purpose**: Retrieve all pools in a GPUaaS instance

#### Get Pool Details
- **Endpoint**: `GET /gpuaas/pool/{poolId}`
- **Purpose**: Get detailed pool configuration and status

#### Pool Metrics
- **Endpoint**: `GET /gpuaas/pool/{poolId}/metrics`
- **Purpose**: Retrieve pool performance and utilization metrics
- **Parameters**: `time_grouping` for metric aggregation
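
As a minimal sketch of calling the metrics endpoint above: the path and the `time_grouping` parameter name come from this guide, while the base URL and the accepted grouping values (e.g. `hour`) are assumptions.

```python
from urllib.parse import urlencode

def pool_metrics_url(base_url: str, pool_id: int, time_grouping: str) -> str:
    """Build the pool-metrics request URL documented above.

    `time_grouping` controls metric aggregation; the accepted values
    (e.g. "hour", "day") are an assumption, not confirmed by this guide.
    """
    query = urlencode({"time_grouping": time_grouping})
    return f"{base_url}/gpuaas/pool/{pool_id}/metrics?{query}"
```

For example, `pool_metrics_url("https://api.example.com", 7, "hour")` yields the GET target for pool 7's hourly metrics.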

### GPU Assignment Management

#### Assign GPUs to Pool
- **Endpoint**: `POST /gpuaas/pool/{poolId}/add_gpus`
- **Purpose**: Assign available GPU devices to pools
- **Parameters**: `gpu_uuids` array of GPU identifiers

#### Remove GPUs from Pool
- **Endpoint**: `POST /gpuaas/pool/{poolId}/remove_gpus`
- **Purpose**: Unassign GPU devices from pools
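
Both assignment endpoints take the same `gpu_uuids` array, so a small payload builder can serve `add_gpus` and `remove_gpus` alike. The `gpu_uuids` field is documented above; any body structure beyond that field is an assumption.

```python
def gpu_assignment_payload(gpu_uuids: list[str]) -> dict:
    """Request body shared by .../add_gpus and .../remove_gpus.

    The guide documents a `gpu_uuids` array of GPU identifiers; the
    exact body shape beyond that field is an assumption.
    """
    if not gpu_uuids:
        raise ValueError("at least one GPU UUID is required")
    return {"gpu_uuids": list(gpu_uuids)}
```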

#### List Available GPUs
- **Endpoint**: `GET /gpuaas/{gpuaasId}/pool/available_gpus`
- **Purpose**: Get list of unassigned GPU devices available for pool assignment

#### List All Pool GPUs
- **Endpoint**: `GET /gpuaas/{gpuaasId}/pool/all_gpus`
- **Purpose**: Get complete inventory of GPU devices in pools

## GPU Node Management APIs

### Node Provisioning

#### Add GPU Node
- **Endpoint**: `POST /gpuaas/node/add`
- **Purpose**: Add new GPU nodes to infrastructure
- **Key Parameters**:
  - `name`: Node identifier
  - `region_id`: Target region
  - `node_ip`: Node IP address
  - `username`: SSH username
  - `port`: SSH port
  - `is_controller_node`: Controller role flag
  - `is_worker_node`: Worker role flag
  - `is_gateway_service`: Gateway role flag
  - `is_storage_service`: Storage role flag
  - `volume_group`: Storage configuration

#### Update Node Configuration
- **Endpoint**: `PUT /gpuaas/node/{nodeId}/edit`
- **Purpose**: Modify existing node configurations

#### Remove GPU Node
- **Endpoint**: `DELETE /gpuaas/node/{nodeId}/delete`
- **Purpose**: Remove nodes from infrastructure

### Node Lifecycle Operations

#### Initialize Node
- **Endpoint**: `GET /gpuaas/node/init?gpuaas_node_id={nodeId}`
- **Purpose**: Initialize GPU node for service

#### Deinitialize Node
- **Endpoint**: `GET /gpuaas/node/deinit?gpuaas_node_id={nodeId}`
- **Purpose**: Safely deinitialize GPU node

#### Reboot Node
- **Endpoint**: `POST /gpuaas/node/reboot?node_id={nodeId}`
- **Purpose**: Restart a GPU node

#### Join GPU Pool
- **Endpoint**: `POST /gpuaas/node/join`
- **Purpose**: Add node to GPU pool cluster
- **Parameters**: `joinee_node_id`, `region_id`

#### Leave GPU Pool
- **Endpoint**: `POST /gpuaas/node/leave?node_id={nodeId}`
- **Purpose**: Remove node from GPU pool cluster
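
The join/leave pair above can be sketched as request builders; the paths and the `joinee_node_id`/`region_id` parameters are from this guide, but representing requests as `(method, path, body)` tuples is just an illustration convention.

```python
def join_pool_request(joinee_node_id: int, region_id: int) -> tuple[str, str, dict]:
    """(method, path, body) for adding a node to a GPU pool cluster."""
    body = {"joinee_node_id": joinee_node_id, "region_id": region_id}
    return ("POST", "/gpuaas/node/join", body)

def leave_pool_request(node_id: int) -> tuple[str, str, dict]:
    """(method, path, body) for removing a node from its pool cluster."""
    return ("POST", f"/gpuaas/node/leave?node_id={node_id}", {})
```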

### Node Discovery and Status

#### List GPU Nodes
- **Endpoint**: `GET /gpuaas/node/list?region_id={regionId}`
- **Purpose**: Get all GPU nodes in specified region

#### Get Node Details
- **Endpoint**: `GET /gpuaas/node/{nodeId}`
- **Purpose**: Retrieve detailed node information and status

#### Test Node Connectivity
- **Endpoint**: `GET /gpuaas/node/test_conn?gpuaas_node_id={nodeId}`
- **Purpose**: Verify node network connectivity and SSH access

#### Get Node Logs
- **Endpoint**: `GET /gpuaas/node/{nodeId}/logs`
- **Purpose**: Retrieve node operation logs
- **Parameters**: `per_page`, `page` for pagination
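
Draining the paginated logs endpoint might look like the sketch below. `get_page` stands in for the HTTP GET against `/gpuaas/node/{nodeId}/logs`; 1-based pages and stopping on a short page are assumptions about the pagination scheme.

```python
from typing import Callable

def fetch_all_logs(get_page: Callable[[int, int], list[dict]],
                   per_page: int = 50) -> list[dict]:
    """Collect every log entry from the paginated endpoint above.

    `get_page(page, per_page)` is a stand-in for the HTTP call; the
    loop assumes pagination is 1-based and ends when a page comes
    back with fewer than `per_page` entries.
    """
    logs, page = [], 1
    while True:
        batch = get_page(page, per_page)
        logs.extend(batch)
        if len(batch) < per_page:
            return logs
        page += 1
```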

#### Storage Metrics
- **Endpoint**: `GET /gpuaas/node/{nodeId}/storage_metrics`
- **Purpose**: Get node storage utilization metrics

### Node GPU Discovery

#### Scan GPUs
- **Endpoint**: `GET /gpuaas/node/scan_gpus?gpuaas_node_id={nodeId}`
- **Purpose**: Initiate GPU hardware discovery on node

#### Scan NPUs
- **Endpoint**: `GET /gpuaas/node/scan_npus?gpuaas_node_id={nodeId}`
- **Purpose**: Initiate NPU hardware discovery on node

#### GPU Scan Status
- **Endpoint**: `GET /gpuaas/node/scan_gpus/status?scan_id={scanId}`
- **Purpose**: Check status of GPU discovery operation

#### NPU Scan Status
- **Endpoint**: `GET /gpuaas/node/scan_npus/status?scan_id={scanId}`
- **Purpose**: Check status of NPU discovery operation
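
Since scans are asynchronous (start a scan, then check its status by `scan_id`), a client typically polls. In this sketch `get_status` stands in for the status GET; the terminal status strings (`"completed"`, `"failed"`) are assumptions, as the guide does not list scan-status values.

```python
import time
from typing import Callable

def wait_for_scan(get_status: Callable[[str], str], scan_id: str,
                  poll_interval: float = 2.0, timeout: float = 120.0) -> str:
    """Poll the scan-status endpoint until the scan settles.

    `get_status(scan_id)` stands in for the HTTP call; the terminal
    values "completed" and "failed" are assumed, not documented here.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(scan_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"scan {scan_id} did not settle within {timeout}s")
```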

## High Availability and Control Plane APIs

#### Enable GPUaaS
- **Endpoint**: `POST /gpuaas/enable?controller_node_id={nodeId}`
- **Purpose**: Enable GPUaaS service on controller node

#### Disable Node
- **Endpoint**: `POST /gpuaas/node/{nodeId}/disable`
- **Purpose**: Disable GPUaaS functionality on node

#### SSH Key Management
- **Endpoint**: `GET /gpuaas/node/{nodeId}/ssh_keys`
- **Purpose**: Retrieve SSH public keys for node access

## Region and Resource Discovery APIs

#### List Regions
- **Endpoint**: `GET /gpuaas/list/regions`
- **Purpose**: Get available regions for GPUaaS deployment

#### Get GPUaaS Status
- **Endpoint**: `GET /gpuaas/{gpuaasId}`
- **Purpose**: Check GPUaaS instance status and configuration

## Overcommit Configuration

### Sharing Ratio (Overcommit)
- **Range**: 1x to 10x overcommit ratio
- **Purpose**: Configure how many virtual GPU instances can share physical GPU resources
- **1x**: No overcommit (dedicated GPU access)
- **2x-10x**: Overcommitted sharing with time-slicing

### Time Quantum Scheduling
- **Range**: 10-120 seconds
- **Purpose**: Define time slice duration for GPU sharing
- **10-60 sec**: Lower latency; workloads switch context more frequently
- **60-120 sec**: Higher efficiency; each workload gets longer uninterrupted access

### Security Levels
- **Performance (low)**: Optimized for maximum throughput
- **Balanced (medium)**: Balance between security and performance
- **Security (high)**: Maximum isolation between workloads
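
The three configuration knobs above can be checked client-side before submitting a pool-create request. The bounds below mirror only what this guide documents (1-10x overcommit, 10-120 s quantum, low/medium/high); server-side validation may be stricter.

```python
VALID_SECURITY_LEVELS = {"low", "medium", "high"}

def validate_pool_config(sharing_ratio: int, time_quantum_in_sec: int,
                         security_level: str) -> None:
    """Check pool parameters against the documented ranges.

    Raises ValueError on the first out-of-range value; bounds are
    taken from this guide and may not match server-side rules exactly.
    """
    if not 1 <= sharing_ratio <= 10:
        raise ValueError("sharing_ratio must be between 1 and 10")
    if not 10 <= time_quantum_in_sec <= 120:
        raise ValueError("time_quantum_in_sec must be between 10 and 120")
    if security_level not in VALID_SECURITY_LEVELS:
        raise ValueError("security_level must be low, medium, or high")
```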

## Node Types and Roles

### Controller Node
- Manages GPU pool orchestration
- Handles scheduling and resource allocation
- Maintains cluster state and coordination

### Worker Node
- Provides GPU compute resources
- Executes workloads assigned by controller
- Reports resource utilization

### Service Gateway Node
- Handles external connectivity
- Manages ingress/egress traffic
- Provides load balancing

### Storage Service
- Manages persistent storage
- Handles data persistence and retrieval
- Provides shared storage across nodes

## Status Values

### Pool Status
- `DEPLOYMENT_SCHEDULED`: Pool creation queued
- `VM_CREATION`: Virtual machines being created
- `SETTING_UP`: Pool configuration in progress
- `ACTIVE`: Pool operational and ready
- `ERROR_WHILE_DEPLOYING`: Pool creation failed
- `DELETING`: Pool removal in progress

### Node Status
- `uninitialized`: Node added but not configured
- `initializing`: Node setup in progress
- `initialized`: Node ready for use
- `k8_initializing`: Kubernetes setup in progress
- `k8_initialized`: Kubernetes ready
- `healthy`: Node operational
- `error`: Node experiencing issues
- `resetting`: Node being reset

### GPU Assignment Status
- `UNASSIGNED`: GPU available for assignment
- `ASSIGNED`: GPU allocated to pool
- `ASSIGNING`: GPU assignment in progress
- `UNASSIGNING`: GPU being removed from pool
- `UNAVAILABLE`: GPU not available for assignment
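
Per the status table above, only `UNASSIGNED` marks a GPU as eligible for pool assignment, so filtering an inventory is a one-liner. The inventory dict shape (`uuid`, `status` keys) is an assumption about the list endpoints' response format.

```python
def assignable_gpus(inventory: list[dict]) -> list[str]:
    """Pick GPUs that can be added to a pool right now.

    Only the UNASSIGNED status marks a GPU as available per the
    status table above; the dict keys "uuid" and "status" are an
    assumed response shape, not confirmed by this guide.
    """
    return [g["uuid"] for g in inventory if g["status"] == "UNASSIGNED"]
```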

## Usage Examples

### Creating a GPU Pool with Overcommit
```json
POST /gpuaas/pool/create
{
  "gpuaas_id": 1,
  "pool": {
    "pool_name": "high-throughput-pool",
    "attach_gpu_ids": ["gpu-uuid-1", "gpu-uuid-2"],
    "sharing_ratio": 4,
    "time_quantum_in_sec": 30,
    "security_level": "medium"
  }
}
```

### Adding a GPU Node
```json
POST /gpuaas/node/add
{
  "name": "gpu-worker-01",
  "region_id": 1,
  "node_ip": "192.168.1.100",
  "username": "admin",
  "port": 22,
  "is_controller_node": false,
  "is_worker_node": true,
  "is_gateway_service": false,
  "is_storage_service": false
}
```