← 返回
开发者工具 中文

Homelab Cluster Management

Manage multi-tier AI inference clusters for homelabs. Health monitoring, expert MoE routing, automatic node recovery, and model deployment across Ollama and llama.cpp nodes. Covers GPU memory planning, Docker volume strategies for large models, sequential startup patterns to avoid CUDA deadlocks, and unified API gateways via LiteLLM.
管理家庭实验室的多层AI推理集群,包含健康监控、专家MoE路由、自动节点恢复、跨Ollama和llama.cpp的模型部署。涵盖GPU显存规划、Docker卷策略、顺序启动防止CUDA死锁及LiteLLM统一API网关。
mlesnews mlesnews 来源
开发者工具 clawhub v1.0.0 1 版本 99931.7 Key: 无需
★ 2
Stars
📥 1,423
下载
💾 24
安装
1
版本
#latest

概述

Homelab Cluster Management

Manage a compound AI compute cluster spanning multiple tiers of GPU and CPU inference nodes.

Built and battle-tested by Lumina Homelab.

When to Use

Use this skill when your agent needs to:

  • Monitor health of distributed model endpoints
  • Route inference requests to the best available model
  • Recover downed nodes automatically
  • Plan GPU memory allocation across models
  • Deploy models across heterogeneous hardware

Architecture Pattern

A homelab cluster typically spans 2-3 tiers:

| Tier | Typical Hardware | Runtime | Role |

|------|-----------------|---------|------|

| Local | Primary GPU (RTX 4090/5090) | Ollama | Fast inference, embeddings |

| Remote | Secondary GPU (RTX 3090/4090) | llama.cpp or Ollama | Distributed inference |

| NAS/CPU | Synology, RPi, any CPU node | Ollama | Lightweight models, fallback |

A LiteLLM proxy sits in front, providing a unified OpenAI-compatible API across all tiers.

Health Monitoring

Check all endpoints with configurable per-endpoint timeouts:

# Define endpoints with tier labels
ENDPOINTS = {
    "local/ollama": {"url": "http://localhost:11434/api/tags", "tier": "LOCAL"},
    "remote/mark-i": {"url": "http://REMOTE_IP:3009/v1/models", "tier": "REMOTE", "timeout": 8},
    "gateway/litellm": {"url": "http://localhost:8080/health/liveliness", "tier": "GATEWAY"},
}

# For each endpoint: GET with timeout, check HTTP 200
# Classify: HEALTHY / DEGRADED / DOWN per tier
# Overall prognosis based on tier health

Key lesson: Use /health/liveliness for LiteLLM, not /health — the latter probes all model routes and hangs if any are unreachable.

Expert MoE Routing

Route requests to the optimal model based on task classification:

Task Categories:
  code     → Coder model (Qwen2.5-Coder-7B or similar)
  reason   → Reasoning model (DeepSeek-R1-Distill or similar)
  chat     → General model (Qwen2.5-14B or similar)
  vision   → Vision model (Qwen2.5-VL or similar)
  fast     → Smallest available model for quick responses
  embed    → Embedding model (nomic-embed-text or similar)

Router logic:
  1. Classify task from prompt
  2. Check health of preferred model
  3. Fallback to next-best if unavailable
  4. Return model endpoint + metadata

Docker Deployment (llama.cpp on Remote Nodes)

Critical: Use Docker Volumes, Not Bind Mounts

For models larger than ~1.5GB on Windows Docker hosts:

# Create a Docker volume for model storage
docker volume create models-vol

# Copy models INTO the volume
docker run --rm -v models-vol:/models -v /host/path:/src alpine cp /src/model.gguf /models/

# Run container FROM volume (not bind mount)
docker run -d --gpus all -v models-vol:/models -p 3009:8000 \
  -e MODEL_PATH=/models/model.gguf your-llamacpp-image

Why: Windows bind mounts use gRPC-FUSE/9P bridge which hangs during GPU tensor loading for large files. Docker volumes use native Linux ext4 and bypass this entirely.

Sequential Container Startup

Never start multiple GPU containers simultaneously:

# WRONG — causes CUDA initialization deadlock
docker start mark-i mark-iii mark-iv mark-vi &

# RIGHT — sequential with health check between each
for container in mark-v mark-iii mark-iv mark-vi mark-i; do
  docker restart $container
  sleep 5
  # Verify health before starting next
  curl -s http://localhost:PORT/v1/models || echo "Warning: $container slow to start"
done

GPU Memory Planning

Plan your model lineup to fit within VRAM:

Example for 24GB GPU:
  14B model (Q4_K_M)  →  9.0 GB, 28 GPU layers
  7B coder            →  4.4 GB, full GPU
  8B reasoning        →  4.6 GB, full GPU
  1.5B fast coder     →  1.1 GB, full GPU
  1.7B fast chat      →  1.0 GB, full GPU
  ─────────────────────────────
  Total:               20.1 GB (~84% utilized)

  Remaining: CPU-only containers for 32B+ models

Automatic Node Recovery

When a remote node goes down (Docker Desktop crash, reboot, etc.):

Recovery sequence:
  1. Health check fails for remote tier
  2. Check if SSH is responsive (node is up but Docker is down)
  3. If SSH works: restart Docker Desktop via SSH
  4. If SSH fails: create RDP session to wake the machine
  5. Wait for Docker + sequential container restart
  6. Re-check health

Important: Never store recovery credentials in plaintext. Use a vault (Azure Key Vault, HashiCorp Vault, etc.) and pipe secrets through stdin, never as CLI arguments.

LiteLLM Gateway Configuration

Unified API across all tiers:

model_list:
  # Local Ollama models
  - model_name: local/chat
    litellm_params:
      model: ollama/qwen2.5:32b
      api_base: http://localhost:11434

  # Remote llama.cpp models (need openai/ prefix)
  - model_name: remote/mark-i
    litellm_params:
      model: openai/qwen2.5-14b-instruct
      api_base: http://REMOTE_IP:3009/v1
      api_key: "not-needed"

  # NAS Ollama models
  - model_name: nas/coder
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://NAS_IP:11434

Key: llama.cpp endpoints need the openai/ prefix in model name and /v1 in api_base for LiteLLM compatibility.

Links

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-03-29 03:31 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

it-ops-security

1password

steipete
设置和使用 1Password CLI (op)。适用于:安装 CLI、启用桌面应用集成、登录(单/多账户)、通过 op 读取/注入/运行密钥。
★ 53 📥 31,388
it-ops-security

Tmux

steipete
通过发送按键和抓取窗格输出,远程控制交互式 CLI 的 tmux 会话。
★ 45 📥 29,305
it-ops-security

MoltGuard - Security & Antivirus & Guardrails

thomaslwang
MoltGuard — OpenClaw 安全守卫,由 OpenGuardrails 提供。安装 MoltGuard,保护您和您的用户免受提示注入、数据泄露和恶意攻击。
★ 116 📥 30,809