K8s in Practice: Building a Fully Open-Source, Self-Hosted Synthetic Monitoring Platform with k6 + Blackbox Exporter
I was recently handed a new task: build a self-hosted synthetic monitoring system that can run inside our existing Kubernetes cluster, quantify the availability, performance, and certificate health of our core internet-facing services, and support SLO and error-budget management. After some research I put together an "active probing + transaction scripting + performance baseline" stack on top of Blackbox Exporter + Prometheus + k6 + Grafana. This post documents the design rationale, technology choices, deployment manifests, metric and alerting formulas, dashboards, and iteration suggestions, as a reference for teams rolling out something similar.
Why Build Your Own Synthetic Monitoring
Traditional monitoring (host metrics / APM / logs) is mostly "passive" observation. Synthetic monitoring actively simulates requests and transactions from the user's perspective on a schedule, in order to:
- Detect problems in external dependencies or network paths early
- Verify that critical paths are actually reachable after a release and establish performance baselines
- Cover risks outside the application code: certificates, DNS, link jitter, timeouts
- Align performance and availability with SLOs/SLAs and quantify error-budget consumption
Core capabilities:
1. Basic reachability: HTTP / HTTPS / DNS / TCP / ICMP
2. Complex transactions: login, order placement, payment, queries, multi-API orchestration
3. Performance baselines: p95 / p99 / throughput / error rate
4. Certificate days remaining, TLS version, redirect chain
5. End-to-end observability: unified metrics → rules → alerts → dashboards
Overall Architecture

Component Selection
Component | Responsibility | Key output
--- | --- | ---
Blackbox Exporter | Protocol-level probes (HTTP/HTTPS/TCP/DNS/ICMP) | probe_* metrics
k6 | Complex transaction scenarios / load baselines / synthetic transactions | http_req_* / checks / custom metrics
StatsD Exporter (optional) | Aggregates k6 metrics for Prometheus to scrape | statsd_metric_*
Prometheus | Scraping, storage, aggregation, rule evaluation | raw & recorded metrics
Recording Rules | Compute availability, p95 latency, certificate days remaining, etc. | instance:* series
Alert Rules | Threshold & SLO-deviation alerts | alert events
Grafana | Visualization and SLO dashboards | panels / single stats
Pushgateway (optional) | Ad-hoc / batch script pushes | custom probe metrics
Project Setup (fully open-source)
1. Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
2. Blackbox Exporter
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackbox-exporter-config
  namespace: monitoring
data:
  blackbox.yml: |
    modules:
      http_2xx:
        prober: http
        timeout: 5s
        http:
          method: GET
          preferred_ip_protocol: "ip4"
      http_tls:
        prober: http
        timeout: 5s
        http:
          method: GET
          tls_config:
            insecure_skip_verify: false
      tcp_connect:
        prober: tcp
        timeout: 5s
      icmp:
        prober: icmp
        timeout: 3s
      dns_udp:
        prober: dns
        dns:
          query_name: "example.com."
          query_type: "A"
          transport_protocol: "udp"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blackbox-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      containers:
        - name: blackbox-exporter
          image: prom/blackbox-exporter:v0.25.0
          args:
            - --config.file=/config/blackbox.yml
          ports:
            - containerPort: 9115
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          volumeMounts:
            - name: cfg
              mountPath: /config
      volumes:
        - name: cfg
          configMap:
            name: blackbox-exporter-config
---
apiVersion: v1
kind: Service
metadata:
  name: blackbox-exporter
  namespace: monitoring
spec:
  selector:
    app: blackbox-exporter
  ports:
    - port: 9115
      targetPort: 9115
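Before wiring Prometheus to it, the exporter can be sanity-checked directly. From any pod in the cluster (e.g. a temporary curl pod), a successful probe returns probe_success 1 among the probe_* metrics:
curl "http://blackbox-exporter.monitoring.svc.cluster.local:9115/probe?module=http_2xx&target=https://example.com"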
3. Prometheus (plain manifests, no Operator)
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 30s
    rule_files:
      - /etc/prometheus/rules.yml
    scrape_configs:
      - job_name: blackbox_http
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://example.com
              - https://grafana.com
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter.monitoring.svc.cluster.local:9115
      - job_name: blackbox_tls
        metrics_path: /probe
        params:
          module: [http_tls]
        static_configs:
          - targets:
              - https://example.com
              - https://k6.io
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter.monitoring.svc.cluster.local:9115
      - job_name: blackbox_icmp
        metrics_path: /probe
        params:
          module: [icmp]
        static_configs:
          - targets:
              - 1.1.1.1
              - 8.8.8.8
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter.monitoring.svc.cluster.local:9115
      - job_name: k6_statsd
        static_configs:
          - targets: ['statsd-exporter.monitoring.svc.cluster.local:9102']
  rules.yml: |
    groups:
      - name: recording_service
        rules:
          - record: instance:probe_availability:ratio
            expr: avg_over_time(probe_success[5m])
          - record: instance:probe_latency_p95:seconds
            expr: quantile_over_time(0.95, probe_duration_seconds[5m])
          - record: instance:probe_tls_cert_days_left
            expr: (probe_ssl_earliest_cert_expiry - time()) / 86400
          - record: job:k6_http_req_p95:seconds
            expr: histogram_quantile(0.95, sum(rate(http_req_duration_bucket[5m])) by (le, test_name))
      - name: alerts_service
        rules:
          - alert: SyntheticHighLatency
            expr: instance:probe_latency_p95:seconds > 0.8
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "High latency (p95 > 800ms) {{ $labels.instance }}"
          - alert: SyntheticDown
            expr: instance:probe_availability:ratio < 0.95
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "Availability < 95% {{ $labels.instance }}"
          - alert: CertExpiringSoon
            expr: instance:probe_tls_cert_days_left < 30
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "TLS cert expires in < 30 days {{ $labels.instance }}"
          - alert: CertExpiringCritical
            expr: instance:probe_tls_cert_days_left < 7
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "TLS cert expires in < 7 days {{ $labels.instance }}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.retention.time=15d
            - --storage.tsdb.path=/prometheus
            - --web.enable-lifecycle
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: cfg
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 1
              memory: 2Gi
      volumes:
        - name: cfg
          configMap:
            name: prometheus-config
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app: prometheus
4. StatsD Exporter (receives k6 metrics)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: statsd-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: statsd-exporter
  template:
    metadata:
      labels:
        app: statsd-exporter
    spec:
      containers:
        - name: statsd-exporter
          image: prom/statsd-exporter:v0.26.0
          args:
            - --statsd.listen-udp=:9125
            - --web.listen-address=:9102
          ports:
            - name: statsd-udp
              containerPort: 9125
              protocol: UDP
            - name: metrics
              containerPort: 9102
---
apiVersion: v1
kind: Service
metadata:
  name: statsd-exporter
  namespace: monitoring
spec:
  selector:
    app: statsd-exporter
  ports:
    - name: metrics
      port: 9102
      targetPort: 9102
    - name: statsd-udp
      port: 9125
      protocol: UDP
      targetPort: 9125
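One caveat worth calling out: by default statsd_exporter turns k6's timer metrics (sent as k6.http_req_duration etc.) into summaries named k6_http_req_duration, so the histogram _bucket series referenced by the recording rules above would not exist. A mapping config is one way to close that gap. The sketch below is illustrative (the ConfigMap name and bucket values are assumptions; check the field names against your statsd_exporter version); it would be mounted into the Deployment above and passed via --statsd.mapping-config. Tag-derived labels such as test_name additionally require k6 to emit tags (K6_STATSD_ENABLE_TAGS=true).
apiVersion: v1
kind: ConfigMap
metadata:
  name: statsd-exporter-mapping
  namespace: monitoring
data:
  mapping.yml: |
    mappings:
      # Strip the "k6." prefix so names match the recording rules (http_req_duration_bucket, ...)
      - match: "k6.*"
        name: "$1"
        observer_type: histogram
        histogram_options:
          # k6 reports durations in milliseconds
          buckets: [50, 100, 250, 500, 800, 1000, 2500, 5000]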
5. k6 with StatsD output (the official image is enough)
k6 supports --out statsd out of the box; the target address is taken from the K6_STATSD_ADDR environment variable. If you prefer Prometheus Remote Write, build an extended binary with xk6. The example here uses StatsD.
CronJob (periodic transactions + light load)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: k6-synthetic-cron
  namespace: monitoring
spec:
  schedule: "*/5 * * * *"   # run every 5 minutes
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: k6
              image: grafana/k6:0.49.0
              env:
                - name: K6_STATSD_ADDR
                  value: statsd-exporter.monitoring.svc.cluster.local:9125
              args:
                - run
                - --vus
                - "5"
                - --duration
                - "1m"
                - --out
                - statsd
                - /scripts/synthetic.js
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
          volumes:
            - name: scripts
              configMap:
                name: k6-scripts
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: k6-scripts
  namespace: monitoring
data:
  synthetic.js: |
    import http from 'k6/http';
    import { check, sleep } from 'k6';
    import { Trend } from 'k6/metrics';

    const loginTrend = new Trend('business_login_duration');

    export const options = {
      thresholds: {
        http_req_duration: ['p(95)<800'],
        http_req_failed: ['rate<0.02'],
        business_login_duration: ['p(95)<400'],
      },
    };

    export default function () {
      const res = http.get('https://example.com/');
      check(res, {
        'status 200': r => r.status === 200,
        'body non-empty': r => r.body && r.body.length > 0,
      });
      const t0 = Date.now();
      // stand-in for real business logic (e.g. a login call)
      sleep(Math.random() * 0.2);
      loginTrend.add(Date.now() - t0);
      sleep(1);
    }
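Before shipping the script to the CronJob, it can be exercised locally (assuming a local k6 install) to confirm the checks and thresholds behave as expected:
k6 run --vus 5 --duration 1m synthetic.js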
6. Grafana
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.4.5
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: "admin123"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
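The Grafana Deployment above starts with no data sources, so Prometheus would have to be added manually through the UI. A possible refinement (a sketch; the ConfigMap name is an assumption, and the Deployment would need a matching volume/volumeMount) is to provision it declaratively under /etc/grafana/provisioning/datasources:
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus.monitoring.svc.cluster.local:9090
        isDefault: true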
Port-forward for local debugging:
kubectl -n monitoring port-forward svc/grafana 3000:3000
7. Alertmanager (optional, minimal example)
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    route:
      receiver: default
    receivers:
      - name: default
        webhook_configs:
          - url: http://example-webhook.local/alert
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.27.0
          args:
            - --config.file=/etc/alertmanager/alertmanager.yml
          ports:
            - containerPort: 9093
          volumeMounts:
            - name: cfg
              mountPath: /etc/alertmanager
      volumes:
        - name: cfg
          configMap:
            name: alertmanager-config
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  ports:
    - port: 9093
      targetPort: 9093
Point Prometheus at Alertmanager (add below the global block in prometheus.yml):
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.monitoring.svc.cluster.local:9093"]
Metric Design and Formulas
1. Availability of a target over the last 5m: availability_ratio = avg_over_time(probe_success[5m])
2. p95 probe latency (Blackbox HTTP): p95_latency = quantile_over_time(0.95, probe_duration_seconds[5m])
3. Certificate days remaining: cert_days_left = (probe_ssl_earliest_cert_expiry - time()) / 86400
4. k6 scenario p95: k6_http_p95 = histogram_quantile(0.95, sum(rate(http_req_duration_bucket[5m])) by (le, test_name))
5. Error-budget burn rate (example SLO of 99% availability): error_rate = 1 - availability_ratio, budget_consumption_rate = error_rate / (1 - 0.99)
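To make item 5 concrete, the burn rate can be recorded and alerted on directly. A sketch that could be appended to rules.yml, assuming the 99% SLO above (14.4 is the common "fast burn" multiplier: roughly 2% of a 30-day budget consumed within one hour):
          - record: instance:error_budget_burn_rate:ratio
            expr: (1 - avg_over_time(probe_success[1h])) / (1 - 0.99)
          - alert: ErrorBudgetFastBurn
            expr: instance:error_budget_burn_rate:ratio > 14.4
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Fast error-budget burn on {{ $labels.instance }}"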
Grafana Dashboard Suggestions
Key panels:
1. Overall status overview (Stat panels + traffic-light coloring)
2. Availability single stat + error-budget burndown. Query: 1 - instance:probe_availability:ratio
3. HTTP probe p95: instance:probe_latency_p95:seconds
4. Certificate days remaining: instance:probe_tls_cert_days_left
5. k6 transaction p95: job:k6_http_req_p95:seconds
6. k6 error rate: sum(rate(http_req_failed{test_name!=""}[5m])) / sum(rate(http_reqs[5m]))
7. DNS / TCP failure count: sum(increase(probe_failed_due_to_dns_lookup[15m])) by (instance)
(Export the dashboard JSON and keep it in Git for version control.)
Operations and Troubleshooting
Scenario | Where to look
--- | ---
All probes failing | Blackbox Exporter Pod / Service / DNS / network policies
Latency spike | Upstream response times, packet loss, egress bandwidth, rate limiting
Certificate metrics missing | Target is not TLS, or not probed through the http_tls module
k6 metrics missing | StatsD UDP packet loss (cross-node traffic); consider a DaemonSet with node-local endpoints
Prometheus OOM | Tighten retention, split instances, or add remote_write
Alert storms | Increase for durations, aggregate with group_by, adopt multi-window SLO alerts or Alertmanager inhibition/routes
Cost and Optimization
- Control metric cardinality: split targets into separate jobs and avoid label-combination explosions
- Consolidate k6 transaction scripts to reduce CronJob Pod start/stop overhead
- Lower the scrape frequency to 30s–60s for targets that rarely change
- Record p95 over both 5m and 30m windows so short-term jitter does not page directly (see the sketch below)
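A minimal sketch of that dual-window pair, reusing the gauge-based quantile from the recording rules; alert thresholds would reference the 30m series so 5m spikes only show up on dashboards:
          - record: instance:probe_latency_p95_5m:seconds
            expr: quantile_over_time(0.95, probe_duration_seconds[5m])
          - record: instance:probe_latency_p95_30m:seconds
            expr: quantile_over_time(0.95, probe_duration_seconds[30m])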
Working Alongside "Passive" Monitoring
Layer | Active (Synthetic) | Passive (Metrics/APM/Logs)
--- | --- | ---
Entry-point checks | Real reachability over external paths | Fine-grained internal component metrics
Certificates / DNS | Proactive early warning | Usually not covered
Transaction flows | Unified scripts | APM trace reconstruction
Rollback decisions | Fast verification of the effect | Confirmation that internal resource usage is healthy
Best practice: layer the alerts — external availability is the top-level trigger, internal metrics assist with localization.
Final Result
1. Overview dashboard: availability status of every target (green/amber/red)
2. Latency distribution: overlaid p50 / p95 / p99 trends
3. Transaction response-time step chart (k6 Trend)
4. Top-N panel of certificate days remaining, highlighting values below the threshold
5. Error-budget burndown curve
Summary
With the setup above you now have, on Kubernetes:
- Multi-protocol active availability probing
- Transaction-level and performance simulation with k6
- Unified metric collection / recording / alerting
- End-to-end visibility into certificates, latency, and availability
- Room to grow (multi-region, dynamic targets, SLO-driven operation)
If you are building something similar, I would love to compare notes on implementation details and SLO design. I hope this post helps you get your own synthetic monitoring off the ground.



































