Azure AI Foundry Blog

AI Inference Task Migration from CPU to GPU: Methodology Overview

xinyuwei
Aug 03, 2025

Please refer to my repo for more AI resources; you are welcome to star it:

https://github.com/xinyuwei-david/david-share.git 

This repository ties the entire methodology together with a minimal example: first identify computational hotspots on the CPU, then rewrite loops characterized by high parallelism, sequential memory access, and simple branching as CUDA kernels to offload the CPU and unleash GPU computing power. A single run compares CPU and GPU execution times, separates transfer and computation overheads, and verifies the numerical error of the results. This quickly validates the feasibility and expected gains of a CPU → GPU migration and provides a template for subsequent large-scale migration, pipeline optimization, and MIG resource partitioning in a real business environment.

Overall Approach Steps

The overall approach is divided into four steps:

  1. Single-machine software optimization
  2. CPU-to-GPU migration of compute-intensive tasks
  3. Application architecture splitting
  4. High-level elasticity and disaster recovery extension

The following content will explain these four parts in detail.

A100 Technical Specifications

Testing was conducted on an Azure NC26 A100 GPU VM. The specifications of this GPU VM are broken down below.

| Component | Full Name | Quantity / Scale | Compute Logic | Main Function or Description |
|---|---|---|---|---|
| GPU | Graphics Processing Unit | 1 | Single physical chip | The whole A100 compute card |
| GPC | Graphics Processing Cluster | 7 | Fixed 7 clusters | Top-level scheduling + graphics pipeline |
| TPC | Texture Processing Cluster | 56 | 7 GPC × 8 TPC | Each TPC contains 2× SM + texture frontend |
| SM | Streaming Multiprocessor | 108 | 56 TPC × 2 = 112 → 108 enabled | CUDA instruction execution cluster; integrates shared memory/registers |
| Warp Scheduler | Warp Scheduler | 432 | 108 SM × 4 | Each SM has 4 schedulers; each selects 1 ready warp per cycle to issue instructions, supporting dual issue; a single SM can launch ≤4 warps and issue ≤8 instructions per clock cycle in the ideal case |
| FP32 CUDA Core | FP32 Core | 6,912 | 108 SM × 64 | Single-precision ALU; peak 19.5 TFLOPS |
| INT32 CUDA Core | INT32 Core | 6,912 | Shared with FP32 ALU | 32-bit integer arithmetic |
| FP16 CUDA Core | FP16 Core | 6,912 | Shared with FP32 ALU | Half precision; peak 78 TFLOPS (2:1 vs FP32) |
| Tensor Core | 3rd-Gen Tensor Core | 432 | 108 SM × 4 | FP16/BF16 312 TFLOPS; TF32 156 TFLOPS; INT8 624 TOPS |
| Memory Controller | HBM2e MC | 8 | Fixed | Each controller is 512-bit; total bus width 4,096-bit |
| HBM2e Stacks | High-Bandwidth Memory | 6 (3D stacked) | | 80 GB total capacity with 1.55 TB/s bandwidth |
| L2 Cache | Level-2 Cache | 40 MB | Globally shared | Shared among all SMs |
| Max Resident Warp | Maximum Resident Warps | 48 per SM; 5,184 per card | 1,536 threads/SM ÷ 32 | Dynamic concurrency limit |
| Max Resident Thread | Maximum Resident Threads | 1,536 per SM; 165,888 per card | 108 SM × 1,536 | Dynamic concurrency limit |
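
Several of these figures can be confirmed directly on the VM with cudaGetDeviceProperties; a minimal sketch (device index 0 is assumed to be the A100):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);                 // device 0 is assumed to be the A100
    std::printf("name               : %s\n", prop.name);
    std::printf("SM count           : %d\n", prop.multiProcessorCount);          // 108 per the table above
    std::printf("max threads per SM : %d\n", prop.maxThreadsPerMultiProcessor);  // 1,536 per the table above
    std::printf("L2 cache           : %d MB\n", prop.l2CacheSize >> 20);         // 40 MB per the table above
    std::printf("memory bus width   : %d-bit\n", prop.memoryBusWidth);
    std::printf("global memory      : %.1f GB\n", prop.totalGlobalMem / 1e9);
    return 0;
}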

Hardware structure diagram:

GPU Chip (Total: 1 A100)
└─ GPC (Total: 7)
   └─ TPC (Total: 56, 7 GPC × 8 TPC)
      └─ SM (Streaming Multiprocessor) (Total: 108)
         ├─ 4 Warp Schedulers (4 issue ports; up to 4 warps can be selected and issued each cycle)
         │  ├─ warp 0 (each warp = group of 32 threads)
         │  ├─ warp 1
         │  ├─ warp 2
         │  └─ warp 3 (up to 4 warps can be active per cycle)
         └─ Execution Resources (Hardware Units)
            ├─ 64 CUDA Cores (FP32 cores): standard floating-point cores, as in regular computers
            ├─ 4 Tensor Cores (matrix multiplication cores): specialized matrix computation units
            └─ No RT Cores (ray tracing cores): the A100 does not include ray-tracing hardware

Hierarchy view:

GPU > GPC > TPC > SM > Warp Scheduler (4 Warps per cycle) > Warp (48 warps resident) > Thread > Instruction
Each thread executes the instructions its warp issues to the CUDA cores or Tensor Cores.

Each SM has 4 warp schedulers.
Each scheduler selects one ready warp per cycle; each warp has 32 threads executing the same instruction synchronously (SIMT).
Hence, a single SM can launch up to 4 warps per cycle.
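
To make the thread → warp → SM mapping concrete, here is a hedged launch-configuration sketch; the kernel, buffer names, and problem size are illustrative placeholders, not taken from the article:

#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles one element.
__global__ void my_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) out[i] = in[i] * 2.0f;
}

void launch_example(const float* d_in, float* d_out, int n) {
    // blockDim.x = 256 → 8 warps per block (256 / 32 threads per warp).
    // The A100 keeps up to 48 warps (1,536 threads) resident per SM, so a grid of
    // a few hundred blocks is enough to occupy all 108 SMs.
    const int block = 256;                        // a multiple of the 32-thread warp size
    const int grid  = (n + block - 1) / block;    // enough blocks to cover every element
    my_kernel<<<grid, block>>>(d_in, d_out, n);
}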

Single SM schematic:

1× SM (Streaming Multiprocessor)
├─ Warp Scheduler 0..3: each selects 1 ready warp per cycle
├─ Issue: if two instructions target different functional units, a scheduler can
│  "dual issue" → up to 2 instructions per scheduler per cycle, up to 8 per SM
├─ SIMT: the 32 threads of the selected warp execute the same instruction in lockstep
└─ Execution units: FP32 cores ×64 │ INT32 cores ×64 │ Tensor Cores ×4 │ Load/Store units

(Up to 4 warps can be selected and start execution in one clock cycle;
 32 threads = 1 warp; up to 48 warps can be resident (active/suspended) per SM,
 i.e. a maximum of 1,536 threads per SM)

 

Step 1: Single-Machine Software Optimization

Overall idea:

  • Optimize existing C++ code to reduce unnecessary CPU overhead (memory allocation, thread management, I/O optimizations).
  • Optimize CPU NUMA deployment and core affinity.
sudo apt install hwloc
lstopo --no-io --no-bridges --of txt > topology.txt
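
Beyond binding whole containers (shown later), individual worker threads can also be pinned inside a process; a minimal Linux-specific sketch, where the core ID is a placeholder chosen from the L3 groups listed below:

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single CPU core (Linux only).
// core_id is a placeholder; pick it from the L3 group the workload should stay in.
bool pin_current_thread_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Usage sketch: call pin_current_thread_to_core(3) at the start of a worker thread
// to keep it inside L3 Group 0 (cores 0-7) from the topology below.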

 

Example system topology:

CPU Topology

L3 Cache Structure:

  • 3 groups of L3 cache, each 32MB.
  • Grouping:
    • L3 Group 0: Cores 0-7
    • L3 Group 1: Cores 8-15
    • L3 Group 2: Cores 16-23

Core layout:

  • 24 physical cores with no hyperthreading.
  • Each core has dedicated L1d/L1i/L2 caches.

Optimization strategy (Mermaid notation):

graph LR
MIG0 --> L3Group0(Cores 0-7)
MIG1 --> L3Group1(Cores 8-15)
MIG2 --> L3Group2(Cores 16-23)

 

Container binding examples:

# MIG container 0 bound to L3 Group 0
docker run -d \
  --gpus '"device=0"' \
  --cpuset-cpus 0-7 \
  -e CUDA_VISIBLE_DEVICES=0 \
  your_image

# MIG container 1 bound to L3 Group 1
docker run -d \
  --gpus '"device=1"' \
  --cpuset-cpus 8-15 \
  -e CUDA_VISIBLE_DEVICES=0 \
  your_image

# MIG container 2 bound to L3 Group 2
docker run -d \
  --gpus '"device=2"' \
  --cpuset-cpus 16-23 \
  -e CUDA_VISIBLE_DEVICES=0 \
  your_image

 

Verification example:

docker run -it --rm --name gpu_test --gpus '"device=0"' --cpuset-cpus 0-7 -e CUDA_VISIBLE_DEVICES=0 ubuntu:22.04
apt update && apt install -y hwloc
lstopo --no-io --of txt

 

Container CPU affinity benefits:

| Metric | Shared L3 | Dedicated L3 | Improvement |
|---|---|---|---|
| L3 Cache Hit Rate | 68% | 96% | +41% |
| Memory Latency | 89 ns | 61 ns | -31% |
| Compute Throughput | 1.2 TFLOPS | 1.8 TFLOPS | +50% |

Step 2: CPU-to-GPU Migration Assessment for Compute-Intensive Tasks

Overall approach:

  • Migration: identify CPU hotspots and migrate them to the GPU using CUDA.
  • GPU parallelism: leverage CUDA Streams for pipeline architecture to increase GPU utilization.

Migration Workflow

Step 1: Hotspot identification with perf tool

perf record -F 99 -g ./your_app
perf report -g "graph,0.5,caller"

Example perf output:

| Overhead | Command | Shared Object | Symbol |
|---|---|---|---|
| 62.3% | your_app | your_app | [.] heavy_compute_function |
| 18.7% | your_app | your_app | [.] data_preprocessing |

Example code perf_demo.cpp:

#include <vector>
#include <random>
#include <cmath>
#include <algorithm>
#include <numeric>
#include <chrono>
#include <iostream>

// Function annotations to disable inlining for perf clarity
#define NOINLINE __attribute__((noinline))

// Hotspot 1: heavy trigonometric usage
NOINLINE void hot_trig(std::vector<double>& dst) {
    for (double& v : dst) {
        double t = std::sin(v);
        v = t * std::cos(t) + std::sqrt(t);
        v += std::sin(v) * std::cos(v);
    }
}

// Hotspot 2: STL sorting
NOINLINE void hot_sort(std::vector<double>& dst) {
    std::sort(dst.begin(), dst.end());
}

// Hotspot 3: vector accumulation
NOINLINE double hot_accumulate(const std::vector<double>& src) {
    return std::accumulate(src.begin(), src.end(), 0.0);
}

int main() {
    constexpr std::size_t N = 200'000;
    constexpr int ITER = 500;

    std::mt19937_64 rng(42);
    std::uniform_real_distribution<> dist(0.0, 1000.0);
    std::vector<double> data(N);
    for (double& v : data) v = dist(rng);

    double checksum = 0.0;
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < ITER; ++i) {
        hot_trig(data);
        hot_sort(data);
        checksum += hot_accumulate(data);
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "checksum = " << checksum << "\n"
              << "elapsed  = " << elapsed.count() << " s\n";
    return 0;
}

 

Compile with:

g++ -O0 -g -fno-omit-frame-pointer -fno-inline perf_demo.cpp -o perf_demo

 

Perf report command:

sudo perf report --stdio --sort symbol --children no --percent-limit 0 | head -40

Key results:

| Symbol | Time % |
|---|---|
| hot_sort | 45.3% |
| hot_trig | 32.8% |
| hot_accumulate | 20.4% |

Step 2: Migration feasibility evaluation

| Metric | Suitable for GPU | Not suitable for GPU |
|---|---|---|
| Compute Density (FLOPs/Byte) | > 10 | < 1 |
| Parallelism | Data parallelism > 1000 | Strong data dependency |
| Branch Complexity | Simple branching (if/else < 5%) | Complex branching (switch) |
| Memory Access Pattern | Sequential/contiguous access | Random access |

Compute Density:

  • Measures average number of floating-point operations per byte of data moved.
  • High FLOPs/Byte implies computation can amortize data transfers, making GPU effective.
  • Low FLOPs/Byte indicates memory-bound workload, CPU likely better.
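
As a rough worked example (using the element-wise formula from Step 3 below): `sqrtf(v) * sinf(v) / logf(v + 1.0f)` is on the order of five floating-point operations per element, while each element moves about 8 bytes (one 4-byte float read, one written). That is roughly 5 / 8 ≈ 0.6 FLOPs/Byte, a memory-bound profile on its own unless batching or on-GPU data reuse amortizes the transfers.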

Parallelism:

  • Number of independent data elements or tasks to execute in parallel.
  • GPU achieves high throughput by thousands of concurrent threads.
  • Parallelism below hundreds is insufficient to saturate GPU.

Branch Complexity:

  • Branch instructions cause thread divergence in GPU warps.
  • Simple branching ensures high efficiency.

Memory Access:

  • GPUs benefit from coalesced (contiguous) memory accesses.
  • Random or irregular access patterns degrade performance.

Quantitative example: calculate FLOPs/Byte using perf counters

sudo perf stat -x, -e r01C7 -e r02C7 -e r04C7 -e r08C7 -e r412E ./perf_demo

Where counters correspond to:

| Counter | Meaning |
|---|---|
| r01C7 | scalar double-precision FLOPs |
| r02C7 | 128-bit packed FLOPs |
| r04C7 | 256-bit packed FLOPs |
| r08C7 | 512-bit packed FLOPs |
| r412E | Last-level cache misses (× 64 bytes per miss) |

Script to calculate FLOPs and bytes moved:

#!/usr/bin/env bash
BIN=./perf_demo

# perf -x, emits one CSV line per event; the first field of each line is the count.
mapfile -t counts < <(sudo perf stat -x, -e r01C7 -e r02C7 -e r04C7 -e r08C7 -e r412E "$BIN" 2>&1 | awk -F, '{print $1}')

a=${counts[0]:-0}; b=${counts[1]:-0}; c=${counts[2]:-0}; d=${counts[3]:-0}; m=${counts[4]:-0}

# scalar = 1 FLOP, 128-bit packed = 2, 256-bit = 4, 512-bit = 8 FLOPs per retired instruction
FLOPS=$((a + 2*b + 4*c + 8*d))
BYTES=$((m * 64))

echo "FLOPs       : $FLOPS"
echo "Bytes       : $BYTES"
if [ "$BYTES" -gt 0 ]; then
  echo "FLOPs/Byte  : $(echo "scale=3; $FLOPS / $BYTES" | bc -l)"
else
  echo "FLOPs/Byte  : N/A (0 Bytes)"
fi

 

Parallelism estimation

sudo perf stat -e task-clock,context-switches ./perf_demo

Example:

32946.85 ms task-clock # 1 CPU utilized

Wall-clock ~32.95 sec → parallelism ≈ 1 (serial execution).

Branch divergence

sudo perf stat -e branches,branch-misses ./perf_demo

Example:

  • 98 million branches
  • 3.4 million branch misses → miss rate ≈ 3.5% (<5%, good)

Memory access pattern

sudo perf stat -e cache-references,cache-misses ./perf_demo

Example:

  • 210 million cache references
  • 11 million cache misses → miss rate ≈ 5.2% (fairly sequential)

Summary

| Metric | Measurement or Conclusion | Migration Suitability |
|---|---|---|
| Compute Density | ≈ 0.5 FLOPs/Byte (low) | Uncertain/low |
| Parallelism | 1 (serial) | ❌ Not suitable |
| Branching | 3.5% miss rate | ✅ Suitable |
| Memory Access | 5.2% miss rate (fairly sequential) | ✅ Suitable |

Step 3: CUDA Migration Implementation

Example of migrating a scalar, element-wise mathematical transformation:

CPU version:

void process_data_cpu(const float* in, float* out, int N) {
    for (int i = 0; i < N; ++i)
        out[i] = std::sqrt(in[i]) * std::sin(in[i]) / std::log(in[i] + 1.0f);
}

GPU kernel:

__global__ void process_data_kernel(const float* in, float* out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < N; i += stride) {
        float v = in[i];
        out[i] = sqrtf(v) * sinf(v) / logf(v + 1.0f);
    }
}

Core idea: take the loop that applies a scalar formula over a large vector and move it from sequential CPU execution to thousands of parallel GPU threads, keeping the input/output logic and the formula unchanged.

A complete GPU-vs-CPU comparison example, process_gpu.cu, is provided in the repository; it includes error checking, timing, memory allocation, and correctness verification, and achieves roughly a 9x speedup with negligible numerical error.
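
Below is a condensed sketch of what such a comparison harness typically looks like; the problem size, block size, and the CUDA_CHECK macro are illustrative choices, not the exact contents of the repository file:

#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CUDA_CHECK(call)                                                      \
    do { cudaError_t e = (call); if (e != cudaSuccess) {                      \
        std::fprintf(stderr, "CUDA error %s at %s:%d\n",                      \
                     cudaGetErrorString(e), __FILE__, __LINE__);              \
        std::exit(1); } } while (0)

__global__ void process_data_kernel(const float* in, float* out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < N; i += stride) {
        float v = in[i];
        out[i] = sqrtf(v) * sinf(v) / logf(v + 1.0f);
    }
}

int main() {
    const int N = 1 << 24;                       // illustrative problem size
    std::vector<float> h_in(N), h_out(N);
    for (int i = 0; i < N; ++i) h_in[i] = 1.0f + (i % 1000) * 0.001f;

    float *d_in = nullptr, *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_in,  N * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_out, N * sizeof(float)));

    cudaEvent_t t0, t1;                          // GPU-side timing
    CUDA_CHECK(cudaEventCreate(&t0));
    CUDA_CHECK(cudaEventCreate(&t1));

    CUDA_CHECK(cudaEventRecord(t0));
    CUDA_CHECK(cudaMemcpy(d_in, h_in.data(), N * sizeof(float), cudaMemcpyHostToDevice));
    process_data_kernel<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    CUDA_CHECK(cudaMemcpy(h_out.data(), d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaEventRecord(t1));
    CUDA_CHECK(cudaEventSynchronize(t1));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, t0, t1));

    // Spot-check one element against the CPU formula for correctness.
    float ref = std::sqrt(h_in[42]) * std::sin(h_in[42]) / std::log(h_in[42] + 1.0f);
    std::printf("GPU total (H2D + kernel + D2H) = %.3f ms, err[42] = %e\n",
                ms, std::fabs(ref - h_out[42]));

    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}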

Compile and run with:

nvcc -O3 -std=c++17 process_gpu.cu -o process_gpu
./process_gpu

Typical output:

CPU time           = 288.920 ms
GPU time  (total)  = 33.453 ms
GPU time  (kernel) = 26.006 ms
max abs err = 9.536743e-07  |  max rel err = 3.537088e-07

 

Step 4: Performance Optimization Tips

Memory coalescing

// Assume consecutive threads hold consecutive `row` values (row = threadIdx.x).
// Inefficient: neighbouring threads touch addresses `width` elements apart (strided)
value = data[row * width + col];

// Efficient: with a column-major layout, neighbouring threads read adjacent
// elements and the accesses coalesce into few memory transactions
value = data[col * height + row];

 

Use fast math intrinsics

// Standard device math (same formula as the Step 3 kernel): higher accuracy, more instructions
out[i] = sqrtf(v) * sinf(v) / logf(v + 1.0f);

// Fast-math intrinsics (__sinf/__logf/__fdividef), or compile with nvcc --use_fast_math:
// lower precision, noticeably higher throughput for transcendental-heavy kernels
out[i] = __fdividef(sqrtf(v) * __sinf(v), __logf(v + 1.0f));

 

Shared memory optimization

// Inside a kernel: stage one element per thread into on-chip shared memory
__shared__ float tile[256];                  // one tile per thread block
tile[threadIdx.x] = input[global_idx];       // global → shared copy, one element per thread
__syncthreads();                             // wait until the whole tile is loaded
// Collaborative computations within the block now operate on tile[] instead of global memory

 

CUDA Stream Pipeline Architectures

The key lies in what overlaps with what — across batches? inside a batch? or different tasks? Below is a table describing common modes:

| Pattern | Typical Streams | Overlapping Dimensions | Example (times: H2D=4 ms, Kernel=8 ms, D2H=4 ms) | Extra Techniques | Suitable Scenario |
|---|---|---|---|---|---|
| A. Single stream serial | 1 | None; H2D → Kernel → D2H serial | Single-image Gaussian blur in the default stream | None | Debugging or functional verification |
| B. Per-batch single stream rotation (pipeline across batches) | ≥3 streams (stream0 batch0, stream1 batch1, etc.) | Overlaps H2D/Kernel/D2H of different batches; serial inside a batch | Camera 30 FPS inference; timeline overlaps transfers and compute across batches | Pinned memory recommended (not mandatory) | Continuous streaming inference or ETL |
| C. Copy stream + compute stream separated per batch | 2-3 streams per batch (H2D, Kernel, D2H) | Overlaps H2D/Kernel/D2H inside the same batch, plus across batches | Large matrices with 200 MB batch data; high hardware utilization across the 3 pipeline steps | Must use pinned host memory + event synchronization | Large batch size or heavy PCIe usage |
| D. Concurrent kernel multi-tenant | N streams (one per model/task) | Completely different kernels/tasks execute concurrently | Multi-model A100 MIG service running ResNet50 and BERT concurrently | Requires concurrent kernel support; no event dependencies needed | Multi-model inference, microservices, A/B testing |

1. Default Stream Essence

  • All CUDA programs have an implicit default stream (stream 0).
  • Operations in default stream, including asynchronous APIs, execute serially.
  • No overlap between kernel execution and data transfers.
  • Comparable to single-lane highway: next op waits for previous.

2. Default stream bottlenecks

Typical timeline with CPU saturated but GPU underutilized:

0-5 ms   : H2D copy
5-10 ms  : idle wait (CPU processing)
10-20 ms : kernel execution
20-25 ms : idle wait (CPU postprocessing)
25-30 ms : D2H copy

Leads to under 50% GPU utilization; CPU and GPU alternate idling.

3. Necessity of multiple streams

| Scenario | Default Stream Enough | Multi-Stream Needed |
|---|---|---|
| Simple single task | ✅ | |
| CPU-GPU pipeline | | ✅ |
| Multiple concurrent tasks | | ✅ |
| Real-time streaming | | ✅ |

Multi-stream can overlap copies with kernel executions, improving GPU utilization >60%.

4. Advantages:

  • Critical for boosting GPU utilization.
  • Directly addresses the pattern of 100% CPU usage alongside low GPU utilization.
  • Common developer misunderstanding: CUDA does not automatically parallelize everything.
  • Code changes are minimal (~50 lines).
  • Zero hardware cost.
  • On Azure, the A100 supports streams per MIG instance.

Example:

cudaStream_t mig_streams[3];
for (int i = 0; i < 3; ++i) {
    cudaSetDevice(i);
    cudaStreamCreate(&mig_streams[i]);
}

5. Minimal working example

cudaStream_t s0, s1, s2;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

for (int i = 0; i < batches; ++i) {
    cudaStream_t cur_stream;
    switch (i % 3) {
        case 0: cur_stream = s0; break;
        case 1: cur_stream = s1; break;
        case 2: cur_stream = s2; break;
    }

    // host[i] should be pinned (cudaHostAlloc) for these copies to be truly asynchronous
    cudaMemcpyAsync(dev_buf, host[i], size, cudaMemcpyHostToDevice, cur_stream);
    kernel<<<grid, block, 0, cur_stream>>>(dev_buf);
    cudaMemcpyAsync(host[i], dev_buf, size, cudaMemcpyDeviceToHost, cur_stream);
}

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
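
For Pattern C in the table above (separate copy and compute streams), pinned host memory and event-based ordering are required; here is a minimal sketch in which the buffer names, sizes, and kernel are placeholders:

// Pattern C sketch: dedicated copy and compute streams, ordered with an event.
float *h_in = nullptr, *h_out = nullptr, *d_buf = nullptr;
cudaHostAlloc(&h_in,  bytes, cudaHostAllocDefault);   // pinned host memory → true async copies
cudaHostAlloc(&h_out, bytes, cudaHostAllocDefault);
cudaMalloc(&d_buf, bytes);

cudaStream_t copy_stream, compute_stream;
cudaStreamCreate(&copy_stream);
cudaStreamCreate(&compute_stream);
cudaEvent_t h2d_done;
cudaEventCreate(&h2d_done);

cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, copy_stream);
cudaEventRecord(h2d_done, copy_stream);                // mark the end of the upload
cudaStreamWaitEvent(compute_stream, h2d_done, 0);      // the kernel waits only on this event
kernel<<<grid, block, 0, compute_stream>>>(d_buf);
cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, compute_stream);
cudaStreamSynchronize(compute_stream);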

 

Step 3: Application Architecture Splitting

Background:

After CPU optimization and GPU migration, issues remain:

  • GPU utilization high but CPU stays >70%
  • CPU/GPU peak loads misaligned, making single-node scaling hard
  • Different tenants have varying CPU and GPU scaling demands

When any of the above appear, the CPU-intensive logic and GPU compute logic should be decoupled into separate microservices.

1. Splitting decision matrix

| Dimension | Candidate A (Keep on CPU) | Candidate B (Keep on GPU) | Decision Criteria |
|---|---|---|---|
| Compute Density | FLOPs/Byte < 3 | FLOPs/Byte > 10 | GPU if B |
| Parallelism | < 1k | > 10k | Only high parallelism suits GPU |
| Call Latency | P99 < 2 ms | P99 < 5 ms | Latency-sensitive logic stays on CPU |
| State Coupling | Strong | Weak | Strong coupling: postpone splitting |
| Data Size | KB-level | MB-level | Large data prefers GPU batch processing |

If 3 or more criteria hit 'split' side → proceed with microservices.

2. Deployment process

Phase 1: Service boundaries & tech stack selection

  • Design protobuf interfaces defining input/output.
  • Measure serialization and gRPC RTT: packet < 1 MB, RTT < 1 ms.
  • Communication modes:
    • Real-time: gRPC (Unary or bidirectional streaming)
    • Async batch: Kafka / AMQP

Phase 2: CPU-VM (Service-CPU)

  • Extract CPU hotspot code into separate repo and Docker image.
  • Use thread pools, NUMA affinity, jemalloc from step 1.
  • Deploy on general-purpose VM (e.g., D8s v5), replicas scaled per CPU utilization (~70%).

Phase 3: GPU-VM (Service-GPU)

  • Extract GPU kernel and streams to independent process.
  • Assign MIG resources per tenant with 1:1 mapping.
  • Expose gRPC supporting dynamic batch size.

Phase 4: Communication and fallback

sequenceDiagram
Client->>CPU-Svc: Business HTTP request (JSON/REST)
CPU-Svc->>GPU-Svc: gRPC call (<1 ms)
note right of CPU-Svc: start timer
GPU-Svc-->>CPU-Svc: Inference result
CPU-Svc-->>Client: Final response
CPU-Svc->>CPU-Svc: if latency > 2 ms\n  switch to local fallback

 

Phase 5: CI/CD & rollback

  • Maintain paired images: service-cpu:{sha} and service-gpu:{sha}.
  • Use Helm / Argo Rollouts for blue-green and canary deployments.
  • Gradually ramp GPU side traffic from 10% to full.

Phase 6: Monitoring & autoscaling

| Component | Key Metrics | Scaling Strategy |
|---|---|---|
| CPU-Svc | cpu_util, req_qps | HPA scales up when CPU > 70% and QPS is high |
| GPU-Svc | nvidia_gpu_utilization, mig_mem_used | Scale down if GPU < 50%; scale up or add MIG slices if > 80% |
| Pipeline | rpc_latency_p95, fallback_count | Alert on rising fallback count |

3. Typical Implementation Cases

Case A: E-commerce Recommendation (Feature engineering + Transformer inference)

| Metric | Monolith | After Splitting |
|---|---|---|
| CPU Utilization | 90% | 50% |
| GPU Utilization | 40% | 75% |
| P99 Latency | 10 ms | 6 ms (GPU normal) / 11 ms (GPU fallback) |

Proto example:

message InferenceReq  { 
  repeated float sparse = 1; 
  repeated int64 dense = 2; 
}
message InferenceResp { 
  repeated int64 item_id = 1; 
  repeated float score = 2; 
}
service RecGPU { 
  rpc Predict(InferenceReq) returns (InferenceResp); 
}
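
For completeness, a minimal unary-call sketch on the CPU-Svc side, assuming the C++ stub generated from the proto above (the generated header name and the 2 ms deadline are assumptions):

#include <chrono>
#include <grpcpp/grpcpp.h>
#include "rec_gpu.grpc.pb.h"   // assumed name of the header generated from the proto above

// One synchronous Predict call with a 2 ms deadline; on failure or timeout the
// caller switches to the local CPU fallback path described in Phase 4.
bool predict(RecGPU::Stub& stub, const InferenceReq& req, InferenceResp* resp) {
    grpc::ClientContext ctx;
    ctx.set_deadline(std::chrono::system_clock::now() + std::chrono::milliseconds(2));
    grpc::Status status = stub.Predict(&ctx, req, resp);
    return status.ok();
}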

 

Case B: Real-time Video (Demultiplexing + Super-Resolution)

  • CPU VM: ffmpeg demux + H.264 decoding (e.g., c6i.4xlarge)
  • GPU VM: A100 40GB, 3×1g.10gb MIG slices for super-res models
  • Run 8 streams of 1080p 60fps each, end-to-end latency < 40 ms

gRPC bidirectional streaming excerpt:

auto stream = stub->Process(&ctx);
for (;;) {
    Frame f = pull_frame();   // CPU decode
    stream->Write(f);         // Async H2D
    Frame out;
    if (stream->Read(&out)) push_to_encoder(out);
}

 

4. Fallback / Circuit Breaker Strategy

| Trigger | Fallback Action | Recovery Conditions | Monitoring Metrics |
|---|---|---|---|
| RTT > 2 ms ×3 or error rate > 5% | Use CPU version; enqueue to Kafka GPU queue | 30 s of RTT < 1 ms and error < 1% | gpu_fallback_total, rpc_latency_p95, rpc_error_ratio |
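
A hedged sketch of how this trigger/recovery logic could look on the CPU-Svc side; the class, thresholds, and probing assumption are illustrative rather than taken from the repository:

#include <atomic>
#include <chrono>

// Illustrative circuit breaker for the GPU path: trip after 3 consecutive slow or
// failed gRPC calls, close again after 30 s of healthy samples (thresholds follow
// the table above). Recovery assumes occasional probe requests still hit the GPU path.
class GpuCircuitBreaker {
public:
    bool allow_gpu_path() const { return !open_.load(); }

    void record(bool ok, std::chrono::microseconds rtt) {
        using namespace std::chrono;
        if (!ok || rtt > milliseconds(2)) {            // trigger condition from the table
            if (++consecutive_bad_ >= 3) {
                open_ = true;
                healthy_since_ = steady_clock::time_point{};
            }
        } else {
            consecutive_bad_ = 0;
            if (open_) {
                auto now = steady_clock::now();
                if (healthy_since_ == steady_clock::time_point{}) healthy_since_ = now;
                else if (now - healthy_since_ >= seconds(30)) open_ = false;  // recover
            }
        }
    }

private:
    std::atomic<bool> open_{false};
    int consecutive_bad_ = 0;
    std::chrono::steady_clock::time_point healthy_since_{};
};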

5. Grafana Core Dashboards

  • GPU-Svc: GPU utilization per slice, gRPC latency histogram.
  • CPU-Svc: CPU usage ratio, fallback counters.
  • Pipeline: P99 RPC latency, RPC error ratio.

6. Common Issues

| Issue | Symptom | Mitigation |
|---|---|---|
| gRPC deserialization takes > 1 ms | CPU-Svc high latency | Use proto zero-copy and pinned host memory |
| Batch too large causes tail latency spike | Elevated P95 latency | Use dynamic batching with a latency target (e.g., 4 ms) |
| MIG memory fragmentation | Inference crashes | Fix slice size; nightly MIG rebuild |

7. Implementation Timeline (5 Weeks Template)

| Week | Focus |
|---|---|
| W1 | Service boundaries and protobuf design |
| W2 | Extract CPU logic → Service-CPU |
| W3 | Extract GPU logic → Service-GPU |
| W4 | gRPC, circuit breaker, Prometheus monitoring |
| W5 | GPU service canary 10% → full rollout → decommission monolith |

You can replace parameters with your own business specifics and quickly achieve CPU-GPU microservice decoupling + fallback protection + full-link observability.

Step 4: High-Level Elasticity and Disaster Recovery Extension

Goal:

On top of the decoupled CPU-VM ↔ GPU-VM microservices, further implement:

  • Cross-region disaster recovery
  • Fully rollbackable canary deployments
  • Serverless elastic scaling of CPU backend

If you do not have global traffic or extreme availability requirements, any of these can be adopted selectively.

Multi-Region / Multi-Cluster Disaster Recovery

| Solution | Topology | Core Components | Typical Latency | Suitable Scenario |
|---|---|---|---|---|
| Active-Active | 🇺🇸 ↔ 🇸🇬 Anycast GSLB | Global DNS (Route 53/GSLB), Istio multi-primary, CockroachDB or Spanner | < 100 ms | Global users > 1 million, evenly distributed |
| Active-Passive | 🇺🇸 (Primary) ↔ 🇪🇺 (Backup) | DNS weighted routing + health checks; periodic backups | 30-60 s switch | 95% of traffic concentrated in a single region |
| Zonal Failover (Same Region) | AZ-A ↔ AZ-B | Kubernetes topology spread, GPU VM image sync | < 10 s | Multi-AZ within a single cloud provider |

Implementation checklist:

  • Global ingress: Anycast + GeoDNS with <5s health probe downtime failover.
  • GPU model checkpoint: object storage + incremental rsync; primary→backup delay <60s.
  • Data layer: cross-region Spanner / CockroachDB or async dual writes via Kafka + Debezium.
  • Disaster drills: monthly manual failover for 15 min validating RPO=0 and RTO<60s.

GPU & CPU Mixed Canary Deployment

Example traffic split:

Client
  │
  ├── Istio Ingress (v1 90%, v2 10%)
  │     ├── Service-CPU-v1 ←────┐
  │     └── Service-CPU-v2 ←───┤ Argo Rollouts
  │                           └─ Service-GPU-v1/v2

Sample TrafficSplit YAML:

apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: svc-cpu-split        # name assumed for illustration
spec:
  service: svc-cpu           # root (apex) service receiving client traffic
  backends:
  - service: svc-cpu-v1
    weight: 90
  - service: svc-cpu-v2
    weight: 10

Two-dimensional canary principle:

  • Keep CPU and GPU versions in sync (schemaVersion label).
  • Progressively rollout CPU 10% → GPU 10% → CPU 100% → GPU 100%.

Auto rollback criteria:

  • gpu_fail_ratio > 1% or rpc_latency_p95 increases 30% in 2 minutes.

Argo Rollouts example:

successCondition: result.gpu_fail_ratio < 0.01
failureLimit: 1
metrics:
  - name: gpu_fail_ratio
    interval: 1m

Conclusion

Migrating C++-dominant applications from CPU to GPU is a systematic process requiring comprehensive consideration of hardware conditions, software tools, and code architecture. This article provides step-by-step guidelines covering analysis, tool selection, code restructuring, and performance tuning.

Practical migration success demonstrates that thorough preliminary analysis, proper CUDA usage, and patient tuning yield significant GPU acceleration. Understand that GPU acceleration is not universal — advantages manifest in cases of high algorithm parallelism and large data scales.

Published Aug 03, 2025
Version 1.0