The following alarm configuration can be used as a reference:
- # Sample alarm rules.
- rules:
- # Rule unique name, must be ended with `_rule`.
- service_resp_time_rule:
- metrics-name: service_resp_time
- op: ">"
- threshold: 500
- period: 10
- count: 1
- silence-period: 5
- message: Response time of service {name} is more than 500ms in last 10 minutes.
- service_sla_rule:
- # Indicator value need to be long, double or int
- metrics-name: service_sla
- op: "<"
- threshold: 8000
- # The length of time to evaluate the metric
- period: 10
- # How many times after the metric match the condition, will trigger alarm
- count: 2
- # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
- silence-period: 3
- message: Successful rate of service {name} is lower than 80% in last 10 minutes.
- service_p90_sla_rule:
- # Indicator value need to be long, double or int
- metrics-name: service_p90
- op: ">"
- threshold: 500
- period: 10
- count: 1
- silence-period: 5
- message: 90% response time of service {name} is more than 500ms in last 10 minutes
- service_instance_resp_time_rule:
- metrics-name: service_instance_resp_time
- op: ">"
- threshold: 500
- period: 10
- count: 1
- silence-period: 5
- message: Response time of service instance {name} is more than 500ms in last 10 minutes.
- endpoint_avg_rule:
- metrics-name: endpoint_avg
- op: ">"
- threshold: 500
- period: 10
- count: 1
- silence-period: 5
- message: Response time of endpoint {name} is more than 500ms in last 10 minutes.
-
- webhooks:
- # - http://127.0.0.1/notify/
- # - http://127.0.0.1/go-wechat/

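SkyWalking periodically evaluates the collected tracing data against the configured alarm rules (for example service response time or service response-time percentiles) and sends an alarm message when a threshold is reached. Alarm messages are delivered by calling a webhook endpoint, which users define themselves, so developers can implement any notification channel, such as email or SMS, inside that endpoint. Alarm messages can also be viewed in RocketBot.
The default alarm rules are shown below. They are located in the alarm-settings.yml file in the config folder under the SkyWalking installation directory: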
- rules:
- # Rule unique name, must be ended with `_rule`.
- service_resp_time_rule:
- metrics-name: service_resp_time
- op: ">"
- threshold: 1000
- period: 10
- count: 3
- silence-period: 5
- message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
- service_sla_rule:
- # Metrics value need to be long, double or int
- metrics-name: service_sla
- op: "<"
- threshold: 8000
- # The length of time to evaluate the metrics
- period: 10
- # How many times after the metrics match the condition, will trigger alarm
- count: 2
- # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
- silence-period: 3
- message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
- service_p90_sla_rule:
- # Metrics value need to be long, double or int
- metrics-name: service_p90
- op: ">"
- threshold: 1000
- period: 10
- count: 3
- silence-period: 5
- message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
- service_instance_resp_time_rule:
- metrics-name: service_instance_resp_time
- op: ">"
- threshold: 1000
- period: 10
- count: 2
- silence-period: 5
- message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
- # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
- # Because the number of endpoint is much more than service and instance.
- #
- # endpoint_avg_rule:
- # metrics-name: endpoint_avg
- # op: ">"
- # threshold: 1000
- # period: 10
- # count: 2
- # silence-period: 5
- # message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
-
- webhooks:
- # - http://127.0.0.1/notify/
- # - http://127.0.0.1/go-wechat/

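The file above defines four default rules.
Attribute reference table
Attribute | Meaning |
---|---|
metrics-name | Metric name defined in the OAL script |
threshold | Threshold, compared against the metrics-name value using the operator below |
op | Comparison operator; can be >, <, or = |
period | Length of the time window over which the metric is checked against the alarm rule, in minutes |
count | Number of times the condition must be met before an alarm message is sent |
silence-period | Period after an alarm during which the same alarm message is suppressed |
message | Content of the alarm message |
include-names | List of services to which this rule applies |
webhooks configures the addresses that are called when an alarm is raised.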
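The interplay of period, count, and silence-period can be illustrated with a minimal sketch. It assumes one metric value is checked per minute and is a simplified model for illustration, not SkyWalking's actual implementation; the class name AlarmRuleSketch and its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified, hypothetical model of evaluating one alarm rule once per minute. */
public class AlarmRuleSketch {
    private final String op;          // ">", "<" or "="
    private final double threshold;
    private final int period;         // window length, in checks (minutes)
    private final int count;          // matches needed inside the window
    private final int silencePeriod;  // checks to stay silent after an alarm
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int silenceCountdown = 0;

    public AlarmRuleSketch(String op, double threshold, int period, int count, int silencePeriod) {
        this.op = op;
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    private boolean matches(double value) {
        switch (op) {
            case ">": return value > threshold;
            case "<": return value < threshold;
            case "=": return value == threshold;
            default: throw new IllegalArgumentException("Unsupported op: " + op);
        }
    }

    /** Feeds one per-minute value; returns true if an alarm would be sent for this check. */
    public boolean check(double value) {
        // Keep only the last `period` match results.
        window.addLast(matches(value));
        if (window.size() > period) {
            window.removeFirst();
        }
        long matched = window.stream().filter(Boolean::booleanValue).count();
        // While silenced, swallow the alarm and count down.
        if (silenceCountdown > 0) {
            silenceCountdown--;
            return false;
        }
        if (matched >= count) {
            silenceCountdown = silencePeriod;
            return true;
        }
        return false;
    }
}
```

With the default service_resp_time_rule values (op ">", threshold 1000, period 10, count 3, silence-period 5), this model sends an alarm on the third slow minute and then stays silent for the next five checks.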
To test the alarm feature, write the endpoints below in a new project named skywalking_alarm.
AlarmController
- import org.springframework.web.bind.annotation.GetMapping;
- import org.springframework.web.bind.annotation.RestController;
-
- @RestController
- public class AlarmController {
-
- // Sleep for 1.5 seconds on every call to simulate a timeout that triggers an alarm
- @GetMapping("/timeout")
- public String timeout(){
- try {
- Thread.sleep(1500);
- } catch (InterruptedException e) {
- e.printStackTrace();
- }
-
- return "timeout";
- }
- }

This endpoint simulates a slow response; after it has been called several times, an alarm is generated.
WebHooks
- import com.sf.saas.skywalking_alarm.pojo.AlarmMessage;
- import org.springframework.web.bind.annotation.GetMapping;
- import org.springframework.web.bind.annotation.PostMapping;
- import org.springframework.web.bind.annotation.RequestBody;
- import org.springframework.web.bind.annotation.RestController;
-
- import java.util.ArrayList;
- import java.util.List;
-
- @RestController
- public class WebHooks {
-
- private List<AlarmMessage> lastList = new ArrayList<>();
-
- @PostMapping("/webhook")
- public void webhook(@RequestBody List<AlarmMessage> alarmMessageList){
- lastList = alarmMessageList;
- }
-
- @GetMapping("/show")
- public List<AlarmMessage> show(){
- return lastList;
- }
- }
-
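The controller above keeps the most recent payload in a plain field that is written by the webhook request and read by /show, possibly on different request threads. One possible hardening is sketched below, assuming Java 10 or later; the class name LastAlarmsHolder is hypothetical and is not part of the original project.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical thread-safe holder for the most recently received alarm list. */
public class LastAlarmsHolder<T> {
    // Starts with an empty immutable list so readers never see null.
    private final AtomicReference<List<T>> last = new AtomicReference<>(List.of());

    /** Stores an immutable copy, so later changes to `incoming` are not visible to readers. */
    public void update(List<T> incoming) {
        last.set(List.copyOf(incoming));
    }

    /** Returns the latest stored list; the returned list cannot be modified. */
    public List<T> snapshot() {
        return last.get();
    }
}
```

In the controller, the webhook method would call update(alarmMessageList) and the show method would return snapshot().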

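When an alarm is raised, SkyWalking calls the webhook endpoint. The endpoint must accept POST requests and receive its parameter via @RequestBody. The request body has the following format: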
- [{
- "scopeId": 1,
- "scope": "SERVICE",
- "name": "serviceA",
- "id0": 12,
- "id1": 0,
- "ruleName": "service_resp_time_rule",
- "alarmMessage": "alarmMessage xxxx",
- "startTime": 1560524171000
- }, {
- "scopeId": 1,
- "scope": "SERVICE",
- "name": "serviceB",
- "id0": 23,
- "id1": 0,
- "ruleName": "service_resp_time_rule",
- "alarmMessage": "alarmMessage yyy",
- "startTime": 1560524171000
- }]
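To make the payload fields concrete, here is a minimal sketch that turns one entry of the array into a single line of notification text. The class name AlarmTextSketch and the output layout are assumptions chosen for illustration; startTime is interpreted as milliseconds since the Unix epoch, which matches the sample value above.

```java
import java.time.Instant;

/** Hypothetical helper that renders one webhook entry as a single text line. */
public class AlarmTextSketch {
    public static String format(String scope, String name, String ruleName,
                                String alarmMessage, long startTime) {
        // Convert epoch milliseconds to an ISO-8601 UTC timestamp.
        String time = Instant.ofEpochMilli(startTime).toString();
        return "[" + time + "] " + scope + " " + name
                + " (" + ruleName + "): " + alarmMessage;
    }

    public static void main(String[] args) {
        // Values taken from the first entry of the sample payload.
        System.out.println(format("SERVICE", "serviceA", "service_resp_time_rule",
                "alarmMessage xxxx", 1560524171000L));
        // prints "[2019-06-14T14:56:11Z] SERVICE serviceA (service_resp_time_rule): alarmMessage xxxx"
    }
}
```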

AlarmMessage
- public class AlarmMessage {
- private int scopeId;
- private String name;
- private int id0;
- private int id1;
- // Alarm message text
- private String alarmMessage;
- // Time at which the alarm was raised
- private long startTime;
-
- public int getScopeId() {
- return scopeId;
- }
-
- public void setScopeId(int scopeId) {
- this.scopeId = scopeId;
- }
-
- public String getName() {
- return name;
- }
-
- public void setName(String name) {
- this.name = name;
- }
-
- public int getId0() {
- return id0;
- }
-
- public void setId0(int id0) {
- this.id0 = id0;
- }
-
- public int getId1() {
- return id1;
- }
-
- public void setId1(int id1) {
- this.id1 = id1;
- }
-
- public String getAlarmMessage() {
- return alarmMessage;
- }
-
- public void setAlarmMessage(String alarmMessage) {
- this.alarmMessage = alarmMessage;
- }
-
- public long getStartTime() {
- return startTime;
- }
-
- public void setStartTime(long startTime) {
- this.startTime = startTime;
- }
-
- @Override
- public String toString() {
- return "AlarmMessage{" +
- "scopeId=" + scopeId +
- ", name='" + name + '\'' +
- ", id0=" + id0 +
- ", id1=" + id1 +
- ", alarmMessage='" + alarmMessage + '\'' +
- ", startTime=" + startTime +
- '}';
- }
- }

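This entity class receives the alarm information passed to the webhook endpoint.
First, modify the alarm rule configuration file and change the webhook address to: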
- webhooks:
- - http://127.0.0.1:8089/webhook
Then restart SkyWalking.
1. Upload skywalking_alarm.jar to the /usr/local/skywalking directory.
2. Start the skywalking_alarm application and wait until it has started successfully.
- java -javaagent:/usr/local/skywalking/apache-skywalking-apm-bin/agent/skywalking-agent.jar -Dskywalking.agent.service_name=skywalking_alarm -jar skywalking_alarm.jar
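3. Call the endpoint repeatedly at http://<VM IP>:8089/timeout
4. Continue until an alarm appears.
5. View the alarm information at http://<VM IP>:8089/show
The response shows that the alarm information has been received. In production, the webhook endpoint can be integrated with SMS, email, and similar platforms, so that the responsible people are notified quickly when an alarm fires and faults are handled faster.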
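One way to connect several notification channels to the webhook endpoint is sketched below. The Notifier interface and the AlarmDispatcherSketch class are hypothetical names introduced for illustration; real SMS or email delivery would live inside concrete Notifier implementations, which are not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fan-out of one alarm text to several notification channels. */
public class AlarmDispatcherSketch {
    /** One notification channel, for example SMS or email. */
    public interface Notifier {
        void send(String text);
    }

    private final List<Notifier> notifiers = new ArrayList<>();

    public void register(Notifier notifier) {
        notifiers.add(notifier);
    }

    /** Sends the text to every channel and returns how many deliveries succeeded. */
    public int dispatch(String text) {
        int delivered = 0;
        for (Notifier notifier : notifiers) {
            try {
                notifier.send(text);
                delivered++;
            } catch (RuntimeException e) {
                // A failing channel must not prevent the remaining channels from being tried.
                System.err.println("Notifier failed: " + e.getMessage());
            }
        }
        return delivered;
    }
}
```

The webhook method could call dispatch once per received alarm message. Isolating each channel in its own try block is a deliberate choice here, so that an unavailable SMS gateway does not suppress the email notification.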