# hosted·ai GPUaaS Infrastructure Management API Guide
## Overview
The hosted·ai GPUaaS (GPU as a Service) platform exposes REST APIs for infrastructure-level management of GPU resources, covering GPU pools, GPU nodes, high availability, and the control plane.
## GPU Pool Management APIs
### Pool Provisioning
#### Create GPU Pool
- **Endpoint**: `POST /gpuaas/pool/create`
- **Purpose**: Create a new GPU pool with specified configuration
- **Key Parameters**:
  - `sharing_ratio`: Overcommit ratio (1x-10x)
  - `time_quantum_in_sec`: Time-slicing quantum (10-120 seconds)
  - `security_level`: Isolation level (low/medium/high)
  - `attach_gpu_ids`: GPU devices to assign to the pool
#### Add Pool to Existing GPUaaS
- **Endpoint**: `POST /gpuaas/pool/add`
- **Purpose**: Attach additional pools to an existing GPUaaS instance
#### Delete GPU Pool
- **Endpoint**: `DELETE /gpuaas/pool/{poolId}/delete`
- **Purpose**: Remove a GPU pool from the infrastructure
### Pool Discovery and Monitoring
#### List GPU Pools
- **Endpoint**: `GET /gpuaas/{gpuaasId}/pool/list`
- **Purpose**: Retrieve all pools in a GPUaaS instance
#### Get Pool Details
- **Endpoint**: `GET /gpuaas/pool/{poolId}`
- **Purpose**: Get detailed pool configuration and status
#### Pool Metrics
- **Endpoint**: `GET /gpuaas/pool/{poolId}/metrics`
- **Purpose**: Retrieve pool performance and utilization metrics
- **Parameters**: `time_grouping` for metric aggregation
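A minimal request sketch, assuming `time_grouping` is passed as a query parameter; the pool ID and the `hour` grouping value are illustrative placeholders, not confirmed values.
```json
GET /gpuaas/pool/42/metrics?time_grouping=hour
```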
### GPU Assignment Management
#### Assign GPUs to Pool
- **Endpoint**: `POST /gpuaas/pool/{poolId}/add_gpus`
- **Purpose**: Assign available GPU devices to pools
- **Parameters**: `gpu_uuids` array of GPU identifiers
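An example request in the same format as the Usage Examples below; the pool ID and GPU UUIDs are placeholders.
```json
POST /gpuaas/pool/42/add_gpus
{
  "gpu_uuids": ["gpu-uuid-1", "gpu-uuid-2"]
}
```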
#### Remove GPUs from Pool
- **Endpoint**: `POST /gpuaas/pool/{poolId}/remove_gpus`
- **Purpose**: Unassign GPU devices from pools
#### List Available GPUs
- **Endpoint**: `GET /gpuaas/{gpuaasId}/pool/available_gpus`
- **Purpose**: Get list of unassigned GPU devices available for pool assignment
#### List All Pool GPUs
- **Endpoint**: `GET /gpuaas/{gpuaasId}/pool/all_gpus`
- **Purpose**: Get complete inventory of GPU devices in pools
## GPU Node Management APIs
### Node Provisioning
#### Add GPU Node
- **Endpoint**: `POST /gpuaas/node/add`
- **Purpose**: Add a new GPU node to the infrastructure
- **Key Parameters**:
  - `name`: Node identifier
  - `region_id`: Target region
  - `node_ip`: Node IP address
  - `username`: SSH username
  - `port`: SSH port
  - `is_controller_node`: Controller role flag
  - `is_worker_node`: Worker role flag
  - `is_gateway_service`: Gateway role flag
  - `is_storage_service`: Storage role flag
  - `volume_group`: Storage configuration
#### Update Node Configuration
- **Endpoint**: `PUT /gpuaas/node/{nodeId}/edit`
- **Purpose**: Modify an existing node's configuration
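A minimal sketch, assuming the edit endpoint accepts the same fields as `POST /gpuaas/node/add` and supports partial updates; the node ID and new SSH port are hypothetical.
```json
PUT /gpuaas/node/12/edit
{
  "port": 2222
}
```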
#### Remove GPU Node
- **Endpoint**: `DELETE /gpuaas/node/{nodeId}/delete`
- **Purpose**: Remove a node from the infrastructure
### Node Lifecycle Operations
#### Initialize Node
- **Endpoint**: `GET /gpuaas/node/init?gpuaas_node_id={nodeId}`
- **Purpose**: Initialize GPU node for service
#### Deinitialize Node
- **Endpoint**: `GET /gpuaas/node/deinit?gpuaas_node_id={nodeId}`
- **Purpose**: Safely deinitialize GPU node
#### Reboot Node
- **Endpoint**: `POST /gpuaas/node/reboot?node_id={nodeId}`
- **Purpose**: Reboot the specified GPU node
#### Join GPU Pool
- **Endpoint**: `POST /gpuaas/node/join`
- **Purpose**: Add node to GPU pool cluster
- **Parameters**: `joinee_node_id`, `region_id`
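A hedged example, assuming both parameters are sent in the request body; the IDs shown are placeholders.
```json
POST /gpuaas/node/join
{
  "joinee_node_id": 12,
  "region_id": 1
}
```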
#### Leave GPU Pool
- **Endpoint**: `POST /gpuaas/node/leave?node_id={nodeId}`
- **Purpose**: Remove node from GPU pool cluster
### Node Discovery and Status
#### List GPU Nodes
- **Endpoint**: `GET /gpuaas/node/list?region_id={regionId}`
- **Purpose**: Get all GPU nodes in specified region
#### Get Node Details
- **Endpoint**: `GET /gpuaas/node/{nodeId}`
- **Purpose**: Retrieve detailed node information and status
#### Test Node Connectivity
- **Endpoint**: `GET /gpuaas/node/test_conn?gpuaas_node_id={nodeId}`
- **Purpose**: Verify node network connectivity and SSH access
#### Get Node Logs
- **Endpoint**: `GET /gpuaas/node/{nodeId}/logs`
- **Purpose**: Retrieve node operation logs
- **Parameters**: `per_page`, `page` for pagination
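A minimal sketch, assuming `per_page` and `page` are query parameters; the node ID and page size are illustrative.
```json
GET /gpuaas/node/12/logs?per_page=50&page=1
```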
#### Storage Metrics
- **Endpoint**: `GET /gpuaas/node/{nodeId}/storage_metrics`
- **Purpose**: Get node storage utilization metrics
### Node GPU Discovery
#### Scan GPUs
- **Endpoint**: `GET /gpuaas/node/scan_gpus?gpuaas_node_id={nodeId}`
- **Purpose**: Initiate GPU hardware discovery on node
#### Scan NPUs
- **Endpoint**: `GET /gpuaas/node/scan_npus?gpuaas_node_id={nodeId}`
- **Purpose**: Initiate NPU hardware discovery on node
#### GPU Scan Status
- **Endpoint**: `GET /gpuaas/node/scan_gpus/status?scan_id={scanId}`
- **Purpose**: Check status of GPU discovery operation
#### NPU Scan Status
- **Endpoint**: `GET /gpuaas/node/scan_npus/status?scan_id={scanId}`
- **Purpose**: Check status of NPU discovery operation
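A hedged sketch of the two-step discovery flow (trigger a scan, then poll its status), assuming the scan call returns a `scan_id` in its response; the node ID and scan ID are placeholders.
```json
GET /gpuaas/node/scan_gpus?gpuaas_node_id=12

GET /gpuaas/node/scan_gpus/status?scan_id=scan-42
```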
## High Availability and Control Plane APIs
#### Enable GPUaaS
- **Endpoint**: `POST /gpuaas/enable?controller_node_id={nodeId}`
- **Purpose**: Enable GPUaaS service on controller node
#### Disable Node
- **Endpoint**: `POST /gpuaas/node/{nodeId}/disable`
- **Purpose**: Disable GPUaaS functionality on node
#### SSH Key Management
- **Endpoint**: `GET /gpuaas/node/{nodeId}/ssh_keys`
- **Purpose**: Retrieve SSH public keys for node access
## Region and Resource Discovery APIs
#### List Regions
- **Endpoint**: `GET /gpuaas/list/regions`
- **Purpose**: Get available regions for GPUaaS deployment
#### Get GPUaaS Status
- **Endpoint**: `GET /gpuaas/{gpuaasId}`
- **Purpose**: Check GPUaaS instance status and configuration
## Overcommit Configuration
### Sharing Ratio (Overcommit)
- **Range**: 1x to 10x overcommit ratio
- **Purpose**: Configure how many virtual GPU instances can share physical GPU resources
- **1x**: No overcommit (dedicated GPU access)
- **2x-10x**: Overcommitted sharing with time-slicing
### Time Quantum Scheduling
- **Range**: 10-120 seconds
- **Purpose**: Define time slice duration for GPU sharing
- **10-60 sec**: Lower scheduling latency, with more frequent context switching
- **60-120 sec**: Higher efficiency, with longer uninterrupted GPU access per workload
### Security Levels
- **Performance (low)**: Optimized for maximum throughput
- **Balanced (medium)**: Balance between security and performance
- **Security (high)**: Maximum isolation between workloads
## Node Types and Roles
### Controller Node
- Manages GPU pool orchestration
- Handles scheduling and resource allocation
- Maintains cluster state and coordination
### Worker Node
- Provides GPU compute resources
- Executes workloads assigned by controller
- Reports resource utilization
### Service Gateway Node
- Handles external connectivity
- Manages ingress/egress traffic
- Provides load balancing
### Storage Service
- Manages persistent storage
- Handles data persistence and retrieval
- Provides shared storage across nodes
## Status Values
### Pool Status
- `DEPLOYMENT_SCHEDULED`: Pool creation queued
- `VM_CREATION`: Virtual machines being created
- `SETTING_UP`: Pool configuration in progress
- `ACTIVE`: Pool operational and ready
- `ERROR_WHILE_DEPLOYING`: Pool creation failed
- `DELETING`: Pool removal in progress
### Node Status
- `uninitialized`: Node added but not configured
- `initializing`: Node setup in progress
- `initialized`: Node ready for use
- `k8_initializing`: Kubernetes setup in progress
- `k8_initialized`: Kubernetes ready
- `healthy`: Node operational
- `error`: Node experiencing issues
- `resetting`: Node being reset
### GPU Assignment Status
- `UNASSIGNED`: GPU available for assignment
- `ASSIGNED`: GPU allocated to pool
- `ASSIGNING`: GPU assignment in progress
- `UNASSIGNING`: GPU being removed from pool
- `UNAVAILABLE`: GPU not available for assignment
## Usage Examples
### Creating a GPU Pool with Overcommit
```json
POST /gpuaas/pool/create
{
  "gpuaas_id": 1,
  "pool": {
    "pool_name": "high-throughput-pool",
    "attach_gpu_ids": ["gpu-uuid-1", "gpu-uuid-2"],
    "sharing_ratio": 4,
    "time_quantum_in_sec": 30,
    "security_level": "medium"
  }
}
```
### Adding a GPU Node
```json
POST /gpuaas/node/add
{
  "name": "gpu-worker-01",
  "region_id": 1,
  "node_ip": "192.168.1.100",
  "username": "admin",
  "port": 22,
  "is_controller_node": false,
  "is_worker_node": true,
  "is_gateway_service": false,
  "is_storage_service": false
}
```