Monitoring

概述

Complexity Levels

Level	Tools	Setup Time	Best For
-------	-------	------------	----------
Minimal	UptimeRobot, Healthchecks.io	15 min	Side projects, MVPs
Standard	Uptime Kuma, Sentry, basic Grafana	1-2 hours	Small teams, startups
Professional	Prometheus, Grafana, Loki, Alertmanager	1-2 days	Production systems
Enterprise	Datadog, New Relic, or full OSS stack	Ongoing	Large-scale operations

The Three Pillars

Pillar	What It Answers	Tools
--------	-----------------	-------
Metrics	"How is the system performing?"	Prometheus, Grafana, Datadog
Logs	"What happened?"	Loki, ELK, CloudWatch
Traces	"Why is this request slow?"	Jaeger, Tempo, Sentry

Quick Start by Use Case

"I just want to know if it's down"

→ UptimeRobot (free) or Uptime Kuma (self-hosted). See simple.md.

"I need to debug production errors"

→ Sentry with your framework SDK. 5-minute setup. See apm.md.

"I want real observability"

→ Prometheus + Grafana + Loki. See prometheus.md.

"I need to centralize logs"

→ Loki for simple, ELK for complex queries. See logs.md.

What to Monitor

Applications (RED Method)

Rate — requests per second
Errors — error rate by endpoint
Duration — latency (p50, p95, p99)

Infrastructure (USE Method)

Utilization — CPU, memory, disk usage
Saturation — queue depth, load average
Errors — hardware/system errors

Alerting Principles

Do	Don't
----	-------
Alert on symptoms (user impact)	Alert on causes (CPU high)
Include runbook link	Require investigation to understand
Set appropriate severity	Make everything P1
Require action	Alert on "interesting" metrics

Alert fatigue kills monitoring. If alerts are ignored, you have no monitoring.

For alert configuration, severities, and on-call setup, see alerting.md.

Cost Comparison

Solution	Monthly Cost (small)	Monthly Cost (medium)
----------	---------------------	----------------------
UptimeRobot	Free	$7
Uptime Kuma	$5 (VPS)	$5 (VPS)
Sentry	Free / $26	$80
Grafana Cloud	Free tier	$50+
Datadog	$15/host	$23/host + features
Self-hosted stack	$10-20 (VPS)	$50-100 (VPS)

Common Mistakes

Starting with Prometheus/Grafana when Uptime Kuma would suffice
No alerting (dashboards nobody watches)
Too many alerts (alert fatigue → ignored)
Missing runbooks (alert fires, nobody knows what to do)
Not monitoring from outside (only internal checks)
Storing logs forever (cost explodes)

版本历史

共 1 个版本

v1.0.0 当前

2026-03-28 12:32 安全安全

安全检测

腾讯云安全 (Keen)

安全，无风险

查看报告

腾讯云安全 (Sanbu)