PromQL is the core query language of the Prometheus monitoring system. It is used to select, filter, aggregate, and compute over time series data.

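1. Basic Concepts

Time Series: A stream of timestamped samples uniquely identified by a metric name and a set of key-value pairs (labels). For example: http_requests_total{job="api-server", instance="10.0.0.1:8080"}.
Metric Name: The name of the thing being measured, such as http_requests_total.
Labels: Key-value pairs attached to a metric that distinguish its different dimensions (for example, different instances, status codes, or methods). Labels are the basis for filtering and aggregation in PromQL.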
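
To make the identity rule concrete, here is a minimal Python sketch (not Prometheus code) that models a series identity as the set of its labels. The helper name series_id is hypothetical; the one Prometheus-specific detail it relies on is that the metric name is stored internally as the label __name__.

```python
def series_id(metric_name, labels):
    """Build a hashable identity from the metric name and the labels."""
    # Prometheus keeps the metric name as the reserved label __name__.
    return frozenset({"__name__": metric_name, **labels}.items())


a = series_id("http_requests_total", {"job": "api-server", "instance": "10.0.0.1:8080"})
b = series_id("http_requests_total", {"instance": "10.0.0.1:8080", "job": "api-server"})
c = series_id("http_requests_total", {"job": "api-server", "instance": "10.0.0.2:8080"})

print(a == b)  # label order does not matter
print(a == c)  # a different instance value means a different series
```

Changing any label value, or adding or removing a label, yields a new series under this model.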

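2. Basic Syntax

a. Instant Vector Selector

Selects the most recent sample (at the evaluation time, or at a specified time) of every time series with the given metric name.

# Select all time series named 'http_requests_total'
http_requests_total
# Select 'http_requests_total' and filter by label:
# only series whose job label equals "api-server"
http_requests_total{job="api-server"}
# Filter on several labels, separated by commas:
# series with job "api-server" and status "200"
http_requests_total{job="api-server", status="200"}
# Match label values with a regular expression (the pattern is fully anchored):
# job label value starts with "node"
http_requests_total{job=~"node.*"}
# instance label value is not equal to "10.0.0.1:9090"
http_requests_total{instance!="10.0.0.1:9090"}
# instance label value does not start with "test"
http_requests_total{instance!~"test.*"}
# Use regex alternation inside a label matcher to express "or":
# series whose job is "api-server" or "frontend"
http_requests_total{job=~"api-server|frontend"}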
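
The four matcher types (=, !=, =~, !~) can be sketched in Python as follows. This is a simplified model, not Prometheus code: the function names and sample data are hypothetical, and Python's re module stands in for the RE2 syntax that Prometheus uses, so exotic patterns may behave differently. It does reflect two documented rules: regex matchers are fully anchored, and a missing label is treated as an empty string.

```python
import re


def matches(labels, name, op, value):
    """Evaluate one label matcher against a dict of labels."""
    actual = labels.get(name, "")  # a missing label counts as the empty string
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # fully anchored
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


def select(series, matchers):
    """Keep the label sets that satisfy every matcher (a comma means AND)."""
    return [s for s in series if all(matches(s, *m) for m in matchers)]


series = [
    {"job": "api-server", "status": "200"},
    {"job": "api-server", "status": "500"},
    {"job": "frontend", "status": "200"},
    {"job": "node-exporter", "status": "200"},
]
picked = select(series, [("job", "=~", "api-server|frontend"), ("status", "=", "200")])
print(picked)
```

Because of the anchoring, the matcher job=~"node" selects nothing here, while job=~"node.*" selects the node-exporter series.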

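b. Range Vector Selector

Selects the samples recorded over a period of time leading up to the evaluation time, for every time series with the given metric name. This is the input needed to compute rates, increases, and similar values.

Syntax: <vector selector>[<duration>]

# All samples of 'http_requests_total' from the last 5 minutes
http_requests_total[5m]
# Samples of 'http_requests_total' with job "api-server" from the last hour
http_requests_total{job="api-server"}[1h]
# Samples of 'cpu_usage' from the last 30 seconds
cpu_usage[30s]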

Duration units:
ms: milliseconds
s: seconds
m: minutes
h: hours
d: days (a day is always 24h)
w: weeks (a week is always 7d)
y: years (a year is always 365d)
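
As an illustration of how these units combine, the following Python sketch converts a duration string to seconds. The function name duration_seconds is hypothetical and the parser is deliberately looser than PromQL's own: it accepts concatenated durations such as 1h30m, but it does not check that units appear from longest to shortest.

```python
import re

# Seconds per unit: a day is 24h, a week is 7d, a year is 365d.
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                "d": 86400, "w": 604800, "y": 31536000}
_TOKEN = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")
_WHOLE = re.compile(r"(?:\d+(?:ms|s|m|h|d|w|y))+")


def duration_seconds(text):
    """Convert a duration such as '5m' or '1h30m' to seconds."""
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in _TOKEN.findall(text))


print(duration_seconds("5m"))
print(duration_seconds("1h30m"))
```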

c. Offset Modifier

Shifts the time a selector reads from into the past, relative to the query evaluation time.

Syntax: <vector selector> offset <duration> or <vector selector>[<range>] offset <duration>

# Current value of 'http_requests_total'
http_requests_total
# Value of 'http_requests_total' 1 hour ago
http_requests_total offset 1h
# Samples of 'http_requests_total' in the 5-minute window that ended 2 days ago
http_requests_total[5m] offset 2d
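
The arithmetic behind range and offset can be sketched as below. The function selection_window is a hypothetical helper, and the model is simplified: it ignores the lookback that instant selectors use to find the most recent sample, and it says nothing about whether the window boundaries are inclusive, which has changed between Prometheus versions.

```python
def selection_window(eval_time, range_seconds=0, offset_seconds=0):
    """Return (start, end) of the time span a selector reads, in Unix seconds."""
    end = eval_time - offset_seconds   # offset moves the window back in time
    start = end - range_seconds        # the range extends backwards from there
    return start, end


# http_requests_total[5m] offset 2d, evaluated at t = 1_000_000
print(selection_window(1_000_000, range_seconds=300, offset_seconds=172_800))
```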

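3. Operators

a. Arithmetic Binary Operators

Perform arithmetic between two instant vectors, between an instant vector and a scalar, or between two scalars.

+ (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (exponentiation)

# Add two metrics (series are paired by matching label sets)
http_requests_total + http_errors_total
# Success ratio (assuming error_rate is the error ratio)
1 - error_rate
# CPU utilization (assuming the metrics cpu_used and cpu_total exist)
cpu_used / cpu_total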

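b. Comparison Binary Operators

Compare two instant vectors, an instant vector and a scalar, or two scalars. By default a comparison acts as a filter: elements for which it is false are dropped. Between two vectors, elements are paired by matching label sets.

== (equal), != (not equal), > (greater than), < (less than), >= (greater than or equal), <= (less than or equal)

# Series whose request total is greater than 100
http_requests_total > 100
# Series whose error count equals 0
http_errors_total == 0
# With the bool modifier the comparison yields 0 or 1 instead of filtering:
# every series is returned, with value 1 where the condition holds and 0 otherwise
http_requests_total > bool 100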
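
The difference between filtering and the bool modifier is easy to see in a small Python model. This is only a sketch for the vector-versus-scalar case; the function name compare and the sample values are made up, and the vector is reduced to a dict from an instance name to a value.

```python
import operator


def compare(vector, op, scalar, bool_modifier=False):
    """Compare each sample of an instant vector with a scalar."""
    if bool_modifier:
        # bool: keep every element, replace its value by 1 or 0
        return {labels: 1.0 if op(value, scalar) else 0.0 for labels, value in vector.items()}
    # default: keep only the elements for which the comparison holds
    return {labels: value for labels, value in vector.items() if op(value, scalar)}


totals = {"10.0.0.1:8080": 250.0, "10.0.0.2:8080": 40.0}
print(compare(totals, operator.gt, 100))                      # like: http_requests_total > 100
print(compare(totals, operator.gt, 100, bool_modifier=True))  # like: http_requests_total > bool 100
```

The filtered form keeps the original sample values, which is why it suits alerting expressions, while the bool form is useful when the 0/1 result feeds into further arithmetic.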

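c. Logical/Set Binary Operators

Combine two instant vectors as sets, matching elements by their label sets.

and (intersection): keeps the elements of the left vector for which the right vector has an element with exactly the same label set.
or (union): returns all elements of the left vector plus the elements of the right vector whose label sets do not appear on the left.
unless (complement): drops from the left vector the elements for which the right vector has an element with exactly the same label set.

# Series present in both A and B
A and B
# All series of A, plus the series of B that are not in A
A or B
# Series of A that are not in B
A unless B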
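
A rough Python model of the three set operators, assuming each vector is a dict keyed by its label set (with the metric name left out of the key). The function names and data are hypothetical, and the on/ignoring modifiers are not modeled.

```python
def vec_and(left, right):
    """Left elements whose label set also exists on the right."""
    return {labels: value for labels, value in left.items() if labels in right}


def vec_or(left, right):
    """All left elements, plus right elements with label sets unseen on the left."""
    result = dict(left)
    for labels, value in right.items():
        result.setdefault(labels, value)
    return result


def vec_unless(left, right):
    """Left elements whose label set does not exist on the right."""
    return {labels: value for labels, value in left.items() if labels not in right}


A = {(("instance", "a"),): 1.0, (("instance", "b"),): 2.0}
B = {(("instance", "b"),): 20.0, (("instance", "c"),): 30.0}
print(vec_and(A, B))
print(vec_or(A, B))
print(vec_unless(A, B))
```

In all three cases the values come from the left side whenever a label set exists on both sides; the right side only decides membership.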

d. Unary Operators

+ (unary plus), - (negation, for example -http_requests_total)

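4. Functions

PromQL provides a rich set of functions and operators for processing time series data.

a. Aggregation Operators

Aggregate the elements of a single instant vector into a new vector with fewer elements, either across all label dimensions or grouped with a by/without clause.

sum, min (minimum), max (maximum), avg (average), count (number of elements), stddev (standard deviation), stdvar (variance), count_values (number of elements per distinct value)

# Total request count across all series
sum(http_requests_total)
# Total request count per job
sum(http_requests_total) by (job)
# Average CPU usage grouped by every label except job
avg(cpu_usage) without (job)
# For each status code, count how many series share each distinct sample value;
# the sample value is written to the "count" label of the result.
# (To count the series per status code, use count(http_requests_total) by (status).)
count_values("count", http_requests_total) by (status)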
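
The grouping performed by by and without can be sketched in Python as follows. This is an illustrative model for sum only; the function name sum_by and the sample data are assumptions, and the vector is represented as a list of (labels, value) pairs.

```python
from collections import defaultdict


def sum_by(vector, by=None, without=None):
    """Sum samples after grouping their label sets with by or without."""
    if by is None and without is None:
        by = ()  # no clause: everything collapses into one group
    totals = defaultdict(float)
    for labels, value in vector:
        if by is not None:
            key = {k: v for k, v in labels.items() if k in by}       # keep only these labels
        else:
            key = {k: v for k, v in labels.items() if k not in without}  # drop these labels
        totals[frozenset(key.items())] += value
    return {tuple(sorted(key)): total for key, total in totals.items()}


vector = [
    ({"job": "api", "instance": "a"}, 10.0),
    ({"job": "api", "instance": "b"}, 5.0),
    ({"job": "web", "instance": "c"}, 2.0),
]
print(sum_by(vector))                         # like: sum(m)
print(sum_by(vector, by=("job",)))            # like: sum(m) by (job)
print(sum_by(vector, without=("instance",)))  # like: sum(m) without (instance)
```

With this data, by (job) and without (instance) produce the same groups, because job and instance are the only labels present.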

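b. Range Vector Functions

Take a range vector as input and return an instant vector.

rate(): per-second average rate of increase over the range. Most commonly used with counters.
irate(): per-second instant rate of increase, based on the last two data points in the range.
increase(): total increase over the range.
delta(): difference between the first and last value in the range; the gauge counterpart of increase().
changes(): number of times the sample value changed within the range.
deriv(): per-second derivative of the series, estimated by simple linear regression.
holt_winters(): smoothed value of the series (double exponential smoothing), intended for gauges. It does not forecast; predict_linear() is the function for linear extrapolation. Prometheus 3.0 renamed it to double_exponential_smoothing().

# Per-second average rate of increase of http_requests_total over the last 5 minutes
rate(http_requests_total[5m])
# Total increase of http_requests_total over the last hour
increase(http_requests_total[1h])
# Change of cpu_temp over the last 10 minutes (gauge)
delta(cpu_temp[10m])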
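
To show why rate() and increase() suit counters, here is a simplified Python sketch of the reset handling. It is not Prometheus' algorithm: the real functions also extrapolate to the edges of the range and divide by the range duration, whereas this version only uses the time actually covered by the samples. The function names and sample data are hypothetical.

```python
def simple_increase(samples):
    """Total counter increase over (timestamp, value) samples, handling resets."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop in value is read as a counter restart from zero.
        total += current if current < previous else current - previous
    return total


def simple_rate(samples):
    """Per-second average increase, or None with fewer than two samples."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]  # assumes strictly increasing timestamps
    return simple_increase(samples) / elapsed


# One sample every 15 seconds; the counter resets between t=30 and t=45.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(simple_increase(samples))
print(simple_rate(samples))
```

A plain last-minus-first difference would report a negative change for this data, which is why delta() is reserved for gauges.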

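c. Other Commonly Used Functions

abs(): absolute value
ceil(), floor(): round up / down to the nearest integer
clamp_min(), clamp_max(): enforce a lower / upper bound on sample values
label_replace(): add or rewrite a label
timestamp(): timestamp of each sample in the given vector, in seconds since the Unix epoch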

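5. Key Points

Instant Vector: a set of time series, each with a single sample, all at the same point in time.
Range Vector: a set of time series, each with the samples from a span of time.
Vector Matching: when a binary operation is applied to two vectors, Prometheus has to decide which elements pair up. By default, elements pair up when their label sets are identical. The ignoring modifier (exclude certain labels) or the on modifier (consider only certain labels) controls the matching.
Counter vs Gauge: knowing the metric type is essential. rate() and increase() are meant for counters (values that only increase, or reset to zero on restart). Gauges (values that can go up and down) are used with delta() or compared directly.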
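
One-to-one matching with the on modifier can be modeled in a few lines of Python. This sketch covers division only, uses hypothetical names and data, and does not handle division by zero (PromQL follows floating-point rules there, while Python raises an exception). Group modifiers for many-to-one matching are not modeled; duplicates simply raise an error.

```python
def divide_on(left, right, on):
    """One-to-one left / right, pairing elements by the labels listed in on."""
    def key(labels):
        return tuple((name, labels.get(name, "")) for name in on)

    right_index = {}
    for labels, value in right:
        k = key(labels)
        if k in right_index:
            raise ValueError("duplicate match key on the right-hand side")
        right_index[k] = value

    result = {}
    for labels, value in left:
        k = key(labels)
        if k not in right_index:
            continue  # elements without a partner are dropped
        if k in result:
            raise ValueError("duplicate match key on the left-hand side")
        result[k] = value / right_index[k]
    return result


errors = [({"instance": "a", "status": "500"}, 2.0), ({"instance": "b", "status": "500"}, 1.0)]
totals = [({"instance": "a"}, 40.0), ({"instance": "b"}, 10.0), ({"instance": "c"}, 5.0)]
# like: errors / on(instance) totals
print(divide_on(errors, totals, on=("instance",)))
```

Without on (or ignoring), the two sides above would never pair up, because the status label exists only on the left.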

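Examples:

# Per-second HTTP request rate of the API server, grouped by status code and method
# (assumes the metric carries status and method labels)
sum by (status, method) (rate(http_requests_total{job="api-server"}[5m]))
# Per-instance error ratio over the last 5 minutes (error requests / total requests);
# both sides are aggregated to the instance label so that their label sets match
sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m]))
# Instances whose CPU utilization exceeds 80% (100% minus the average idle share)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80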
