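Background
A Grafana node graph is built from two data sets: nodes and edges.
For the node and edge fields, see: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/node-graph/
The goal is to use Prometheus queries to convert the metrics below into the data the node graph needs.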
from:
# counter
servicegraph_span_metrics_span_count_total
{name="span_name", parentName="span_root", status="Error"} 1
# histogram, same labels as above
servicegraph_span_metrics_span_latency_bucket
servicegraph_span_metrics_span_latency_sum
servicegraph_span_metrics_span_latency_count
to:
node:
id, title, mainstat, secondarystat, arc__green, arc__red, arc__gray
edge:
id, source, target
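Each span points from name to parentName.
mainstat shows the count of each span.
secondarystat shows the p90 latency of each span.
arc__green, arc__red and arc__gray show the span's success, error and undefined counts respectively.
Grafana cannot build a node graph from data that is reshaped at render time; the panel has to be built directly from the data the backend returns. The PromQL queries therefore have to return results that already satisfy the node and edge field requirements.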
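The Grafana node graph documentation states that the values of all arc__ fields on a node should add up to 1, so per-status counts would need converting to fractions before they can drive the arcs. A minimal Python sketch of that normalization, assuming the status label takes the values "Ok", "Error" and "Unset" (only "Error" appears in the sample metric; the other two values and the function name are assumptions made for illustration):

```python
def arc_fractions(counts_by_status):
    """Turn per-status span counts into arc fractions that sum to 1."""
    total = sum(counts_by_status.values())
    if total == 0:
        # No spans at all: show a fully gray ring.
        return {"arc__green": 0.0, "arc__red": 0.0, "arc__gray": 1.0}
    ok = counts_by_status.get("Ok", 0)        # assumed success value
    error = counts_by_status.get("Error", 0)  # value seen in the sample metric
    other = total - ok - error                # everything else counts as undefined
    return {
        "arc__green": ok / total,
        "arc__red": error / total,
        "arc__gray": other / total,
    }


arcs = arc_fractions({"Ok": 6, "Error": 3, "Unset": 1})
print(arcs)
```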
Writing the queries
edge
Range vector window: 3d.
increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d])
Use label_join to build the edge id in the form name:parentName.
label_join(increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d]), "id", ":","name","parentName")
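To make the effect of label_join concrete, here is a small Python emulation of what it does to a single series' label set. It is a sketch of the semantics only, not Prometheus code; the function below is a hypothetical stand-in and the sample labels are taken from the metric shown in the background section.

```python
def label_join(labels, dst, sep, *src_labels):
    """Emulate PromQL label_join for a single series' label set."""
    out = dict(labels)
    # Missing source labels contribute an empty string.
    out[dst] = sep.join(labels.get(src, "") for src in src_labels)
    return out


series = {"name": "span_name", "parentName": "span_root", "status": "Error"}
edge = label_join(series, "id", ":", "name", "parentName")
print(edge["id"])  # span_name:span_root
```

The original labels are kept; only the id label is added.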
Building on the label_join result, use label_replace to derive a new label, source, from name.
label_replace(label_join(increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d]), "id", ":","name","parentName"), "source", "$1", "name", "(.*)")
Building on that, use label_replace again to derive a new label, target, from parentName.
label_replace(label_replace(label_join(increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d]), "id", ":","name","parentName"), "source", "$1", "name", "(.*)"), "target","$1","parentName","(.*)")
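The behaviour of label_replace can be sketched in Python as follows. This is an approximation for illustration, assuming Python's re module behaves like Prometheus' RE2 for these simple patterns; the helper is hypothetical. Two points it tries to capture: the regex has to match the whole source value, and a source label that does not exist is treated as an empty string, which leaves the destination label unset.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Emulate PromQL label_replace for a single series' label set."""
    # The regex must match the whole source value; a missing label is "".
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    # Translate Go-style $1 references into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    value = match.expand(template)
    out = dict(labels)
    if value:
        out[dst] = value
    else:
        out.pop(dst, None)  # an empty value means the label is absent
    return out


series = {"name": "span_name", "parentName": "span_root"}
edge = label_replace(series, "source", "$1", "name", "(.*)")
edge = label_replace(edge, "target", "$1", "parentName", "(.*)")
print(edge["source"], edge["target"])  # span_name span_root
```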
Finally, filter and aggregate to get the end result: aggregate over id, source and target. The original series still carry Prometheus labels such as instance, so without this aggregation the result would contain multiple duplicate records.
sum by(id, source, target) (label_replace(label_replace(label_join(increase(servicegraph_span_metrics_span_latency_count{parentName!="root"}[3d]), "id", ":","name","parentName"), "source", "$1", "name", "(.*)"), "target","$1","parentName","(.*)"))
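The reason for the final sum by can be seen in a small Python emulation. The sample data is made up: two scrape targets (hypothetical instance values host-1 and host-2) report the same edge, and grouping by id, source and target collapses them into one record.

```python
from collections import defaultdict


def sum_by(samples, keys):
    """Emulate PromQL `sum by (keys)` over (labels, value) pairs."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Only the grouping labels survive; everything else is dropped.
        group = tuple(labels.get(k, "") for k in keys)
        totals[group] += value
    return dict(totals)


samples = [
    ({"id": "a:b", "source": "a", "target": "b", "instance": "host-1"}, 3.0),
    ({"id": "a:b", "source": "a", "target": "b", "instance": "host-2"}, 4.0),
    ({"id": "c:b", "source": "c", "target": "b", "instance": "host-1"}, 1.0),
]
edges = sum_by(samples, ("id", "source", "target"))
print(edges)
```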
node
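Node processing is largely the same as edge processing. The difficult parts are:
How to turn query results into labels so that they can be read and processed further.
How to combine the latency and count queries into a single table.
Increase window: the same as for edges.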
increase(servicegraph_span_metrics_span_latency_count[3d])
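1. Put the count into a label
count_values turns each sample value into a label on the returned series.
Note that the vector passed to count_values still has to be aggregated with sum first, so that labels such as instance do not split the result.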
count_values("count", sum(increase(servicegraph_span_metrics_span_latency_count[3d])) by (name, parentName)) by(name, parentName)
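A Python emulation of count_values may help show where the value ends up. This is a sketch under assumptions: the helper names are hypothetical and format_value only approximates how Prometheus renders a float as a label string.

```python
from collections import Counter


def format_value(value):
    """Render a sample value as a label string (approximate formatting)."""
    return str(int(value)) if float(value).is_integer() else repr(float(value))


def count_values(label, samples, by):
    """Emulate PromQL `count_values(label, v) by (by)` over (labels, value) pairs."""
    counts = Counter()
    for labels, value in samples:
        group = tuple((k, labels.get(k, "")) for k in by)
        counts[group + ((label, format_value(value)),)] += 1
    # Each output series carries the group labels plus the new label;
    # its sample value is how many input series had that value.
    return [(dict(key), float(n)) for key, n in counts.items()]


# One series per (name, parentName), as produced by the inner sum.
aggregated = [({"name": "span_name", "parentName": "span_root"}, 12.0)]
result = count_values("count", aggregated, ("name", "parentName"))
print(result)
```

Because increase() extrapolates over the window, the real value can be fractional (for example 11.98), and it would then appear in the label in that form.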
2. Put the latency into a label
count_values("latency",histogram_quantile(0.9, sum(increase(servicegraph_span_metrics_span_latency_bucket[3d])) by (le, name, parentName))) by(name, parentName)
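For intuition about the p90 value that ends up in the latency label, the following Python sketch re-implements the linear interpolation that histogram_quantile performs on classic cumulative buckets. It is a simplification that only handles 0 < q <= 1 and non-negative bucket bounds, and the bucket boundaries in the example are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    Simplified sketch: requires 0 < q <= 1 and a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


buckets = [(10.0, 2.0), (50.0, 8.0), (100.0, 10.0), (math.inf, 10.0)]
p90 = histogram_quantile(0.9, buckets)
print(p90)
```

The result is an estimate whose precision depends on the bucket layout, which is worth keeping in mind when it is displayed as secondarystat.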
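3. Combine the two queries into one table
Join the two results with * on(name, parentName) and group_left.
This treats the left-hand query as the base, matches series on name and parentName, and copies the latency label over from the right-hand side. The product itself is not used; the multiplication only serves as a join on the labels.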
count_values("count", sum(increase(servicegraph_span_metrics_span_latency_count[3d])) by (name, parentName)) by(name, parentName) * on(name, parentName) group_left(latency) count_values("latency",histogram_quantile(0.9, sum(increase(servicegraph_span_metrics_span_latency_bucket[3d])) by (le, name, parentName))) by(name, parentName)
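The join can be pictured with a short Python emulation of `left * on(...) group_left(...) right`. This is a sketch of the matching rules only; the function name and the sample label values are hypothetical.

```python
def mul_on_group_left(left, right, on, include):
    """Emulate `left * on(on) group_left(include) right` over (labels, value) pairs."""
    index = {}
    for labels, value in right:
        key = tuple(labels.get(k, "") for k in on)
        if key in index:
            # The right-hand ("one") side must be unique per match key.
            raise ValueError("duplicate series on the right-hand side")
        index[key] = (labels, value)
    joined = []
    for labels, value in left:
        key = tuple(labels.get(k, "") for k in on)
        if key not in index:
            continue  # unmatched series are dropped
        right_labels, right_value = index[key]
        merged = dict(labels)  # keep every label of the left-hand side
        for name in include:
            if name in right_labels:
                merged[name] = right_labels[name]  # copy the listed labels over
        joined.append((merged, value * right_value))
    return joined


left = [({"name": "a", "parentName": "b", "count": "12"}, 1.0)]
right = [({"name": "a", "parentName": "b", "latency": "75"}, 1.0)]
rows = mul_on_group_left(left, right, ("name", "parentName"), ("latency",))
print(rows)
```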
4. Replace the labels
label_replace(label_replace(label_replace(label_replace(count_values("count", sum(increase(servicegraph_span_metrics_span_latency_count[3d])) by (name, parentName)) by(name, parentName) * on(name, parentName) group_left(latency) count_values("latency",histogram_quantile(0.9, sum(increase(servicegraph_span_metrics_span_latency_bucket[3d])) by (le, name, parentName))) by(name, parentName), "id","$1","name","(.*)"),"title","$1","name","([^:]+)(:.*)?"), "mainstat","$1","count","(.*)"),"secondarystat","$1 ms","latency","(.*)")
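Because label_replace anchors its regex to the whole label value, the choice of pattern decides whether a field is set at all. The Python check below illustrates this for the title pattern, assuming Python's re module matches RE2 for these simple expressions; the span name "checkout:POST /pay" is a made-up example.

```python
import re

TITLE_PATTERN = r"([^:]+)(:.*)?"  # pattern used for the title label


def first_group(pattern, value):
    """Return capture group 1 if the pattern matches the whole value, else None."""
    match = re.fullmatch(pattern, value)
    return match.group(1) if match else None


print(first_group(TITLE_PATTERN, "checkout:POST /pay"))  # checkout
print(first_group(TITLE_PATTERN, "span_name"))           # span_name
print(first_group(r"(.)", "span_name"))                  # None
```

A pattern such as (.) only matches single-character values, so it would leave the destination label unset for ordinary span names, whereas (.*) copies the full value.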