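Background

Grafana builds a node graph from two data sets: nodes and edges.

For the node and edge fields, see: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/node-graph/

The goal is to use Prometheus queries to turn the metrics below into the data the node graph needs.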

from:

# counter
servicegraph_span_metrics_span_count_total
{name="span_name", parentName="span_root", status="Error"} 1

# histogram, same labels as above
servicegraph_span_metrics_span_latency_bucket
servicegraph_span_metrics_span_latency_sum
servicegraph_span_metrics_span_latency_count

to:

node:

id, title, mainstat, secondarystat, arc__green, arc__red, arc__gray

edge:

id, source, target

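  1. Each span points from name to parentName.

  2. mainstat shows the count of each span.

  3. secondarystat shows the p90 latency of each span.

  4. arc__green, arc__red, and arc__gray show the number of successful, errored, and unset-status spans respectively.

Grafana cannot build a node graph from data reshaped on the rendering side. The graph has to be built directly from the data the backend returns, so the PromQL queries themselves must return results that already match the node and edge fields.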

Writing the queries

edge

Range vector window: 3d.

increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d])

  1. Use label_join to build the edge id in the form name:parentName.

label_join(increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d]), "id", ":","name","parentName")

  2. Building on step 1, use label_replace to create a new label, source, from name.

label_replace(label_join(increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d]), "id", ":","name","parentName"), "source", "$1", "name", "(.*)")

  3. Building on step 2, use label_replace to create a new label, target, from parentName.

label_replace(label_replace(label_join(increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d]), "id", ":","name","parentName"), "source", "$1", "name", "(.*)"), "target","$1","parentName","(.*)")

  4. Building on step 3, aggregate once more to get the final result, merging series that share the same id, source, and target. The original series still carry Prometheus labels such as instance, so without this aggregation the result would contain duplicate records.

sum by(id, source, target) (label_replace(label_replace(label_join(increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d]), "id", ":","name","parentName"), "source", "$1", "name", "(.*)"), "target","$1","parentName","(.*)"))
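
The four edge steps can be traced outside Prometheus with a small emulation. The sketch below is only an illustration in plain Python, not a Prometheus client: the sample series, the instance values, and the helper implementations are assumptions made for the example, and label_replace handles only "$1".

```python
import re
from collections import defaultdict

# Emulates the PromQL functions on (labels, value) pairs.

def label_join(series, dst, sep, *src_labels):
    """dst = the values of src_labels joined with sep."""
    out = []
    for labels, value in series:
        new = dict(labels)
        new[dst] = sep.join(labels.get(name, "") for name in src_labels)
        out.append((new, value))
    return out

def label_replace(series, dst, replacement, src, regex):
    """Set dst from src; the regex must match the whole label value."""
    pattern = re.compile(regex)
    out = []
    for labels, value in series:
        new = dict(labels)
        match = pattern.fullmatch(labels.get(src, ""))
        if match:
            # Only "$1" is supported in this sketch.
            new[dst] = replacement.replace("$1", match.group(1) or "")
        out.append((new, value))
    return out

def sum_by(series, *keep):
    """Sum values of series that agree on the kept labels."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels.get(name, "") for name in keep)] += value
    return dict(totals)

# Hypothetical increase() results scraped from two instances.
samples = [
    ({"name": "span_name", "parentName": "span_root", "instance": "a:9090"}, 3.0),
    ({"name": "span_name", "parentName": "span_root", "instance": "b:9090"}, 2.0),
]

step1 = label_join(samples, "id", ":", "name", "parentName")
step2 = label_replace(step1, "source", "$1", "name", "(.*)")
step3 = label_replace(step2, "target", "$1", "parentName", "(.*)")
edges = sum_by(step3, "id", "source", "target")
print(edges)
```

The two per-instance series collapse into a single edge record once they are summed by id, source, and target. Because the regex has to match the entire label value, a pattern such as "(.)" would only match single-character values and leave the new label unset for anything longer.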

node

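Node processing is largely the same as edge processing. The difficulties are:

  1. How to fold query results into labels so they can be read back later.

  2. How to combine the latency and count queries into a single table.

Increase window: same as for edges.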

increase(servicegraph_span_metrics_span_latency_count[3d])

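1. Put the count into a label

count_values returns the sample value as a label under a name of our choosing.

Note that the vector passed to count_values still has to be aggregated with sum first, so that labels such as instance do not split the result.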

count_values("count", sum(increase(servicegraph_span_metrics_span_latency_count[3d])) by (name, parentName)) by(name, parentName)
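
The effect of summing before count_values can be seen in a small emulation. This is a sketch in plain Python under stated assumptions: the sample series and instance values are made up, and format_value only approximates how Prometheus renders a sample value as a label string.

```python
from collections import defaultdict

def sum_by(series, *keep):
    """Sum values of series that agree on the kept labels."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels.get(k, "") for k in keep)] += value
    return [(dict(zip(keep, key)), total) for key, total in totals.items()]

def format_value(value):
    # Approximation: whole numbers are rendered without a decimal point.
    return str(int(value)) if float(value).is_integer() else repr(float(value))

def count_values(series, dst, *by):
    """Per group, count how many series share each value; the value becomes label dst."""
    counts = defaultdict(int)
    for labels, value in series:
        key = tuple(labels.get(k, "") for k in by) + (format_value(value),)
        counts[key] += 1
    return [({**dict(zip(by, key[:-1])), dst: key[-1]}, n) for key, n in counts.items()]

# Hypothetical increase() results scraped from two instances.
raw = [
    ({"name": "span_name", "parentName": "span_root", "instance": "a:9090"}, 3.0),
    ({"name": "span_name", "parentName": "span_root", "instance": "b:9090"}, 2.0),
]

without_sum = count_values(raw, "count", "name", "parentName")
with_sum = count_values(sum_by(raw, "name", "parentName"), "count", "name", "parentName")
print(without_sum)
print(with_sum)
```

Without the inner sum, each instance contributes its own count label, so one span yields two rows. With the sum, there is a single row per (name, parentName) whose count label holds the total and whose sample value is 1.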

2. Put the latency into a label

count_values("latency",histogram_quantile(0.9, sum(increase(servicegraph_span_metrics_span_latency_bucket[3d])) by (le, name, parentName))) by(name, parentName)

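3. Combine the two queries into one table

Join the two results with * on(name, parentName) together with group_left.

This takes the left-hand query as the base, joins the two results on name and parentName, and carries over the latency label from the right-hand side. Since each count_values result here has the sample value 1, the product is also 1 and only the labels matter.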

count_values("count", sum(increase(servicegraph_span_metrics_span_latency_count[3d])) by (name, parentName)) by(name, parentName) * on(name, parentName) group_left(latency) count_values("latency",histogram_quantile(0.9, sum(increase(servicegraph_span_metrics_span_latency_bucket[3d])) by (le, name, parentName))) by(name, parentName)
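
The join can be pictured as a lookup keyed on the on(...) labels. The following is a minimal sketch in plain Python rather than Prometheus code; the function name join_group_left and the sample label values are assumptions made for the example.

```python
def join_group_left(left, right, on, include):
    """Emulate: left * on(on...) group_left(include...) right."""
    right_index = {}
    for labels, value in right:
        key = tuple(labels.get(name, "") for name in on)
        if key in right_index:
            # The right-hand ("one") side must be unique per key.
            raise ValueError(f"duplicate right-hand series for {key}")
        right_index[key] = (labels, value)

    joined = []
    for labels, value in left:
        key = tuple(labels.get(name, "") for name in on)
        if key not in right_index:
            continue  # series without a partner are dropped
        right_labels, right_value = right_index[key]
        new = dict(labels)  # keep every label of the left-hand side
        for name in include:
            if name in right_labels:
                new[name] = right_labels[name]
        joined.append((new, value * right_value))
    return joined

# Hypothetical count_values outputs for one span.
count_side = [({"name": "span_name", "parentName": "span_root", "count": "5"}, 1)]
latency_side = [({"name": "span_name", "parentName": "span_root", "latency": "120"}, 1)]

rows = join_group_left(count_side, latency_side, ("name", "parentName"), ("latency",))
print(rows)
```

Each output row keeps the left-hand labels, gains the latency label from its right-hand partner, and has the sample value 1 * 1.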

4. Replace labels

Wrap the joined query from step 3 in label_replace calls to produce id, title, mainstat, and secondarystat.

label_replace(label_replace(label_replace(label_replace(count_values("count", sum(increase(servicegraph_span_metrics_span_latency_count[3d])) by (name, parentName)) by(name, parentName) * on(name, parentName) group_left(latency) count_values("latency",histogram_quantile(0.9, sum(increase(servicegraph_span_metrics_span_latency_bucket[3d])) by (le, name, parentName))) by(name, parentName), "id","$1","name","(.*)"),"title","$1","name","([^:]+)(:.*)?"), "mainstat","$1","count","(.*)"),"secondarystat","$1 ms","latency","(.*)")
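
A deeply nested expression like this is easy to get wrong by a bracket or an operator. One option is to assemble the string from its parts. The sketch below does that in plain Python; WINDOW, PREFIX, and the label_replace helper are names introduced here for illustration, and the script only builds the query text without sending it to Prometheus.

```python
WINDOW = "3d"
PREFIX = "servicegraph_span_metrics_span_latency"

def label_replace(expr, dst, replacement, src, regex):
    """Wrap expr in one PromQL label_replace call."""
    return f'label_replace({expr}, "{dst}", "{replacement}", "{src}", "{regex}")'

# Step 1: span count folded into the "count" label.
count_expr = (
    f'count_values("count", sum(increase({PREFIX}_count[{WINDOW}])) '
    f"by (name, parentName)) by (name, parentName)"
)

# Step 2: p90 latency folded into the "latency" label.
latency_expr = (
    f'count_values("latency", histogram_quantile(0.9, '
    f"sum(increase({PREFIX}_bucket[{WINDOW}])) by (le, name, parentName))) "
    f"by (name, parentName)"
)

# Step 3: join both results on name and parentName.
joined = f"{count_expr} * on(name, parentName) group_left(latency) {latency_expr}"

# Step 4: derive the node graph fields, innermost replacement first.
replacements = [
    ("id", "$1", "name", "(.*)"),
    ("title", "$1", "name", "([^:]+)(:.*)?"),
    ("mainstat", "$1", "count", "(.*)"),
    ("secondarystat", "$1 ms", "latency", "(.*)"),
]
node_query = joined
for dst, replacement, src, regex in replacements:
    node_query = label_replace(node_query, dst, replacement, src, regex)

print(node_query)
```

Changing WINDOW in one place keeps the count and latency sides on the same range, which also keeps the node query aligned with the edge query.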