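End-to-End R&D Automation System Based on OpenClaw + Claude Code - Full-Pipeline Monitoring and Instrumentation Plan
Covers the full workflow: requirements analysis → PRD design → technical design → API design → AI coding → unit testing → integration testing → CI/CD → K8s deployment → automated UI acceptance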
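| Layer | Monitored targets | Key metric types | Collection interval |
|---|---|---|---|
| L1 - Agent application layer | Agents for each R&D role | Task execution metrics, LLM call metrics, business conversion metrics | Real-time (15s) |
| L2 - Service layer | Jenkins/Docker/K8s | CI/CD pipeline metrics, container metrics, cluster metrics | Real-time (15s) |
| L3 - Infrastructure layer | Node/Pod/Network | CPU, memory, disk, network I/O | High frequency (5s) |
| L4 - Business layer | R&D process efficiency | Stage duration, success rate, quality score | Medium frequency (1min) |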
status project_id priority
complexity_level outcome
token_type model_name
intervention_type resolution_status
version approval_status project_id
review_rounds change_count
requirement_category
revision_reason requester_role
design_type complexity_score
component_type layer
api_count specification_format
debt_category
language file_type project_id
complexity lines_of_code
model endpoint streaming
review_outcome issue_severity
task_type
test_type coverage_area
coverage_type module
test_suite parallel_workers
severity detection_stage status
test_category
environment deployment_strategy outcome
pipeline_stage artifact_size_mb
pod_name namespace deployment
rollback_reason time_to_rollback
resource_type container_name
job_name result branch
node_label
job_name stage outcome
node_name executor_type
container_name image host
container_name memory_limit
container_name interface
container_name check_name
pod namespace phase
deployment namespace
node condition status
pod container reason
workspace resource_type
- service_requests_total{service, endpoint, method} - requests per second
- service_errors_total{service, error_type} - share of requests that fail
- service_request_duration_seconds{service, endpoint, quantile} - request latency distribution (P50/P90/P99)
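As a rough illustration, the three service metrics above might be registered with the Python prometheus-client library as sketched below. The service name prd-agent, the endpoint /generate, and the helper record_request are hypothetical. The sketch uses a Histogram instead of a quantile label, on the assumption that P50/P90/P99 are derived at query time with histogram_quantile.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A private registry keeps this sketch isolated from the global default registry.
registry = CollectorRegistry()

REQUESTS = Counter(
    "service_requests_total", "Total requests handled",
    ["service", "endpoint", "method"], registry=registry,
)
ERRORS = Counter(
    "service_errors_total", "Total failed requests",
    ["service", "error_type"], registry=registry,
)
LATENCY = Histogram(
    "service_request_duration_seconds", "Request latency in seconds",
    ["service", "endpoint"], registry=registry,
)


def record_request(service, endpoint, method, duration_s, error_type=None):
    """Hypothetical helper: record one handled request and its outcome."""
    REQUESTS.labels(service=service, endpoint=endpoint, method=method).inc()
    LATENCY.labels(service=service, endpoint=endpoint).observe(duration_s)
    if error_type is not None:
        ERRORS.labels(service=service, error_type=error_type).inc()


# Two sample requests against a hypothetical service; the second one fails.
record_request("prd-agent", "/generate", "POST", 0.42)
record_request("prd-agent", "/generate", "POST", 1.7, error_type="timeout")

# Text exposition format, i.e. what a /metrics endpoint would return.
exposition = generate_latest(registry).decode()
```

With these series in place, the request rate would come from rate(service_requests_total[5m]) and the error share from dividing the two rates.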
ALERT: AgentHighFailureRate
IF: sum(rate(req_agent_tasks_total{status="failed"}[5m])) / sum(rate(req_agent_tasks_total[5m])) > 0.1
FOR: 5m
LABELS: {severity="warning", team="ai-agents"}
ANNOTATIONS:
summary: "Agent task failure rate exceeds 10%"
description: "{{ $value | humanizePercentage }} of tasks failed over the past 5 minutes"
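To make the AgentHighFailureRate arithmetic concrete, here is a small sketch of what the ratio of two rate() expressions works out to over one window. Because both rates share the same 5-minute window, the ratio equals failed increments divided by total increments. The function name, the "success" status value, and the sample counts are hypothetical, and counter resets are ignored for simplicity.

```python
def failure_ratio(start, end):
    """Ratio of failed task increments to all task increments in a window.

    start / end map a status label value to the cumulative counter value
    at the beginning and end of the window.
    """
    deltas = {status: end[status] - start.get(status, 0) for status in end}
    total = sum(deltas.values())
    return deltas.get("failed", 0) / total if total else 0.0


# Hypothetical counter snapshots taken 5 minutes apart.
window_start = {"success": 900, "failed": 40}
window_end = {"success": 980, "failed": 52}

ratio = failure_ratio(window_start, window_end)  # 12 failed out of 92 tasks
would_be_pending = ratio > 0.1  # the rule fires only if this holds for the full FOR period
```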
ALERT: HighLLMLatency
IF: histogram_quantile(0.95, rate(coding_agent_llm_inference_latency_seconds_bucket[5m])) > 30
FOR: 10m
LABELS: {severity="critical", team="platform"}
ANNOTATIONS:
summary: "LLM API P95 latency exceeds 30 seconds"
description: "Current P95 latency is {{ $value }} seconds, severely impacting coding efficiency"
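The HighLLMLatency rule relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for illustration only; the bucket boundaries and counts are made up, and it assumes the lowest bucket starts at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds with cumulative observation counts.
healthy = [(5, 60), (10, 85), (30, 96), (60, 99), (math.inf, 100)]
degraded = [(5, 40), (10, 70), (30, 90), (60, 99), (math.inf, 100)]

p95_healthy = histogram_quantile(0.95, healthy)    # lands in the 10s-30s bucket
p95_degraded = histogram_quantile(0.95, degraded)  # lands in the 30s-60s bucket
```

In the first case the estimate stays under the 30-second threshold; in the second it crosses it, which is the condition the rule watches for. The estimate is only as precise as the bucket layout, so a boundary at or near 30 seconds helps.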
ALERT: AbnormalTokenConsumption
IF: sum(increase(coding_agent_llm_tokens_total[1h])) > 1000000
FOR: 1h
LABELS: {severity="warning", team="finance"}
ANNOTATIONS:
summary: "Hourly token consumption exceeds 1 million"
description: "{{ $value }} tokens consumed over the past hour, which may incur high costs"
ALERT: HighBuildFailureRate
IF: sum(rate(jenkins_builds_total{result="FAILURE"}[30m])) / sum(rate(jenkins_builds_total[30m])) > 0.2
FOR: 30m
LABELS: {severity="critical", team="devops"}
ANNOTATIONS:
summary: "Jenkins build failure rate exceeds 20%"
description: "Failure rate over the past 30 minutes is {{ $value | humanizePercentage }}"
ALERT: FrequentDeploymentRollbacks
IF: increase(deploy_agent_rollbacks_total[1h]) > 3
FOR: 1h
LABELS: {severity="critical", team="release"}
ANNOTATIONS:
summary: "Multiple deployment rollbacks within 1 hour"
description: "{{ $value }} rollbacks in the past hour; deployment quality needs attention"
ALERT: PodCrashLooping
IF: increase(kube_pod_container_status_restarts_total[1h]) > 5
FOR: 1h
LABELS: {severity="critical", team="platform"}
ANNOTATIONS:
summary: "Pod is crash looping"
description: "Pod {{ $labels.pod }} restarted {{ $value }} times within 1 hour"
ALERT: NodeResourceExhausted
IF: (1 - sum(kube_node_status_condition{condition="Ready", status="true"}) / count(kube_node_status_condition{condition="Ready", status="true"})) > 0.2
FOR: 5m
LABELS: {severity="critical", team="infra"}
ANNOTATIONS:
summary: "More than 20% of nodes are unavailable"
description: "Cluster health is degrading; check immediately"
ALERT: CodeCoverageDrop
IF: test_agent_code_coverage_percent{coverage_type="branch"} < 80
FOR: 1d
LABELS: {severity="warning", team="quality"}
ANNOTATIONS:
summary: "Branch coverage is below 80%"
description: "Current branch coverage is {{ $value }}%, below the quality gate"
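Purpose: management view; shows the overall health and efficiency of the end-to-end R&D process
Purpose: AI engineer view; monitors each Agent's runtime status and LLM usage
Purpose: DevOps team view; monitors the build/test/deploy pipelines
Purpose: finance/operations view; monitors resource usage and cost optimization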
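Dependency installation: pip install prometheus-client
Core code structure:
- Counter records task execution counts
- Histogram records duration distributions
- Gauge records real-time state
- Summary records quantile statistics
- @monitor_task decorator provides uniform instrumentation
- /metrics HTTP endpoint for Prometheus to scrape

Dependency installation: npm install prom-client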
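Core code structure:
- prom-client library defines the metrics
- /metrics endpoint
- async_hooks tracks asynchronous task context
- opentelemetry provides distributed tracing

Plugin installation: Prometheus Metrics Plugin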
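Configuration highlights:
- Default Job Collector automatically collects build metrics
- Custom Metrics collects business-specific metrics
- prometheus.gauge() defines custom metrics
- /prometheus endpoint (default port 8080)

Core components: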
- <service>_<feature>_<metric_type> naming format with explicit units (seconds/bytes)
- /health and /ready endpoints
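The Python instrumentation structure listed earlier (Counter, Histogram, Gauge, and a @monitor_task decorator) could be sketched roughly as follows. The metric name req_agent_tasks_total and its status label are taken from the AgentHighFailureRate rule; the other metric names, the "success" status value, and the generate_prd function are hypothetical. Summary is left out because the Python client's Summary exposes only a count and a sum, so quantiles are assumed to come from the Histogram at query time.

```python
import functools
import time

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A private registry keeps this sketch isolated from the global default registry.
registry = CollectorRegistry()

TASKS = Counter(
    "req_agent_tasks_total", "Tasks executed, by outcome",
    ["status"], registry=registry,
)
DURATION = Histogram(
    "req_agent_task_duration_seconds", "Task duration in seconds",
    registry=registry,
)
IN_PROGRESS = Gauge(
    "req_agent_tasks_in_progress", "Tasks currently running",
    registry=registry,
)


def monitor_task(func):
    """Record outcome, duration, and concurrency for every call of func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        IN_PROGRESS.inc()
        start = time.perf_counter()
        status = "success"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "failed"
            raise
        finally:
            # Runs on both paths, so failures are measured and counted too.
            DURATION.observe(time.perf_counter() - start)
            TASKS.labels(status=status).inc()
            IN_PROGRESS.dec()
    return wrapper


@monitor_task
def generate_prd(requirement):
    """Hypothetical agent task used only to exercise the decorator."""
    if not requirement:
        raise ValueError("empty requirement")
    return f"PRD for: {requirement}"


prd = generate_prd("user login flow")
try:
    generate_prd("")
except ValueError:
    pass  # the failure is still counted under status="failed"
```

To expose these metrics for scraping, one option is prometheus_client.start_http_server(8000, registry=registry), which serves the registry over HTTP; the port is an arbitrary choice here.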