
SkyWalking: Alarm Feature in Practice

from(endpoint.*).filter(responsecode in [404,500,503]);

The following alarm configuration can be used as a reference:

    # Sample alarm rules.
    rules:
      # Rule unique name, must be ended with `_rule`.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 500
        period: 10
        count: 1
        silence-period: 5
        message: Response time of service {name} is more than 500ms in last 10 minutes.
      service_sla_rule:
        # Indicator value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metric
        period: 10
        # How many times after the metric match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 3
        message: Successful rate of service {name} is lower than 80% in last 10 minutes.
      service_p90_sla_rule:
        # Indicator value need to be long, double or int
        metrics-name: service_p90
        op: ">"
        threshold: 500
        period: 10
        count: 1
        silence-period: 5
        message: 90% response time of service {name} is more than 500ms in last 10 minutes
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 500
        period: 10
        count: 1
        silence-period: 5
        message: Response time of service instance {name} is more than 500ms in last 10 minutes.
      endpoint_avg_rule:
        metrics-name: endpoint_avg
        op: ">"
        threshold: 500
        period: 10
        count: 1
        silence-period: 5
        message: Response time of endpoint {name} is more than 500ms in last 10 minutes.
    webhooks:
    #  - http://127.0.0.1/notify/
    #  - http://127.0.0.1/go-wechat/

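At regular intervals, SkyWalking evaluates the collected tracing data against the configured alarm rules (for example service response time or response time percentiles) and sends an alarm message when a threshold is reached. Alarm messages are delivered by calling webhook endpoints. Users define these endpoints themselves, so developers can implement any notification channel in them, such as email or SMS. Alarm messages can also be viewed in RocketBot.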

The default alarm rules are shown below. They are located in the alarm-settings.yml file in the config folder under the SkyWalking installation directory:

    rules:
      # Rule unique name, must be ended with `_rule`.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 3
        message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
      service_p90_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_p90
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 5
        message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
      # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
      # Because the number of endpoint is much more than service and instance.
      #
      # endpoint_avg_rule:
      #   metrics-name: endpoint_avg
      #   op: ">"
      #   threshold: 1000
      #   period: 10
      #   count: 2
      #   silence-period: 5
      #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
    webhooks:
    #  - http://127.0.0.1/notify/
    #  - http://127.0.0.1/go-wechat/

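The file above defines four default rules:

  1. The average service response time exceeds 1 second in 3 of the last 10 minutes
  2. The service success rate is below 80% (threshold 8000) in 2 of the last 10 minutes
  3. The 90th-percentile service response time exceeds 1 second in 3 of the last 10 minutes
  4. The average service instance response time exceeds 1 second in 2 of the last 10 minutes

The attributes used in a rule are as follows.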

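Attribute reference table

| Attribute | Meaning |
| --- | --- |
| metrics-name | Metric name defined in the OAL script |
| threshold | Threshold, interpreted together with metrics-name and the comparison operator below |
| op | Comparison operator; >, < and = can be set |
| period | Length of the time window, in minutes, over which the metric is evaluated |
| count | Number of times the condition must be matched within the period before an alarm message is sent |
| silence-period | How long the same alarm is suppressed after it has been triggered; defaults to the value of period |
| message | Content of the alarm message |
| include-names | List of services this rule applies to |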
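To make the interplay of period, count and silence-period concrete, here is a minimal Java sketch. It is a simplified model built only from the attribute descriptions above, not SkyWalking's actual implementation. The class name `AlarmRuleSketch` and the method `check` are hypothetical, only the `>` operator is modeled, and it assumes one metric value arrives per minute.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of a single alarm rule (hypothetical, for illustration only). */
final class AlarmRuleSketch {
    private final long threshold;
    private final int period;         // window length in minutes
    private final int count;          // matching minutes needed to raise an alarm
    private final int silencePeriod;  // checks to stay quiet after an alarm
    private final Deque<Long> window = new ArrayDeque<>();
    private int silenceCountdown = 0;

    AlarmRuleSketch(long threshold, int period, int count, int silencePeriod) {
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    /**
     * Feeds the metric value of one minute into the window.
     * Returns true when this check would send an alarm (op is fixed to ">").
     */
    boolean check(long valueForThisMinute) {
        window.addLast(valueForThisMinute);
        if (window.size() > period) {
            window.removeFirst(); // keep only the last `period` minutes
        }
        if (silenceCountdown > 0) {
            silenceCountdown--;   // still silenced by an earlier alarm
            return false;
        }
        long matches = window.stream().filter(v -> v > threshold).count();
        if (matches >= count) {
            silenceCountdown = silencePeriod;
            return true;
        }
        return false;
    }
}
```

With the values of the default service_resp_time_rule (threshold 1000, period 10, count 3, silence-period 5), three slow minutes in a row raise the alarm on the third check in this model, and the following five checks stay silent even if the service remains slow.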

webhooks configures the URLs that are called when an alarm is triggered.

3.4.2 Test code for the alarm feature

To test the alarm feature, create a project named skywalking_alarm and write the following endpoints.

AlarmController

    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class AlarmController {

        // Sleeps 1.5 seconds on every call to simulate a slow response that triggers the alarm.
        @GetMapping("/timeout")
        public String timeout() {
            try {
                Thread.sleep(1500);
            } catch (InterruptedException e) {
                // Restore the interrupt flag instead of swallowing the interruption.
                Thread.currentThread().interrupt();
            }
            return "timeout";
        }
    }

This endpoint simulates slow responses. After it has been called repeatedly, alarm messages are generated.

WebHooks

    import com.sf.saas.skywalking_alarm.pojo.AlarmMessage;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RestController;

    import java.util.ArrayList;
    import java.util.List;

    @RestController
    public class WebHooks {

        // Most recent batch of alarms; volatile because requests are served by different threads.
        private volatile List<AlarmMessage> lastList = new ArrayList<>();

        // Called by SkyWalking when alarms are triggered.
        @PostMapping("/webhook")
        public void webhook(@RequestBody List<AlarmMessage> alarmMessageList) {
            lastList = alarmMessageList;
        }

        // Returns the batch received by the most recent webhook call.
        @GetMapping("/show")
        public List<AlarmMessage> show() {
            return lastList;
        }
    }

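When an alarm is triggered, SkyWalking calls the webhook endpoint. The endpoint must accept POST requests and receive its parameter through @RequestBody. The request body has the following format: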

    [{
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "serviceA",
        "id0": 12,
        "id1": 0,
        "ruleName": "service_resp_time_rule",
        "alarmMessage": "alarmMessage xxxx",
        "startTime": 1560524171000
    }, {
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "serviceB",
        "id0": 23,
        "id1": 0,
        "ruleName": "service_resp_time_rule",
        "alarmMessage": "alarmMessage yyy",
        "startTime": 1560524171000
    }]

AlarmMessage

    // The payload fields scope and ruleName are not mapped here; this relies on the
    // JSON mapper ignoring unknown properties, as Spring's default configuration does.
    public class AlarmMessage {

        private int scopeId;
        private String name;
        private int id0;
        private int id1;
        // Text of the alarm message
        private String alarmMessage;
        // Time at which the alarm was raised (epoch milliseconds)
        private long startTime;

        public int getScopeId() {
            return scopeId;
        }

        public void setScopeId(int scopeId) {
            this.scopeId = scopeId;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public int getId0() {
            return id0;
        }

        public void setId0(int id0) {
            this.id0 = id0;
        }

        public int getId1() {
            return id1;
        }

        public void setId1(int id1) {
            this.id1 = id1;
        }

        public String getAlarmMessage() {
            return alarmMessage;
        }

        public void setAlarmMessage(String alarmMessage) {
            this.alarmMessage = alarmMessage;
        }

        public long getStartTime() {
            return startTime;
        }

        public void setStartTime(long startTime) {
            this.startTime = startTime;
        }

        @Override
        public String toString() {
            return "AlarmMessage{" +
                    "scopeId=" + scopeId +
                    ", name='" + name + '\'' +
                    ", id0=" + id0 +
                    ", id1=" + id1 +
                    ", alarmMessage='" + alarmMessage + '\'' +
                    ", startTime=" + startTime +
                    '}';
        }
    }

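This entity class receives the alarm information sent to the webhook endpoint.

3.4.3 Deployment and testing

First, edit the alarm rule configuration file and change the webhook address to: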

    webhooks:
      - http://127.0.0.1:8089/webhook

Then restart SkyWalking.

1. Upload skywalking_alarm.jar to the /usr/local/skywalking directory.

2. Start the skywalking_alarm application and wait until it has started successfully.

    java -javaagent:/usr/local/skywalking/apache-skywalking-apm-bin/agent/skywalking-agent.jar -Dskywalking.agent.service_name=skywalking_alarm -jar skywalking_alarm.jar

3. Call the endpoint repeatedly. Its address is http://<VM-IP>:8089/timeout

4. Keep calling it until an alarm appears.


5. View the alarm information through the endpoint http://<VM-IP>:8089/show
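The repeated calls in step 3 can be scripted rather than issued by hand. The following is a minimal sketch, assuming Java 11 or later for `java.net.http.HttpClient`. The class `TimeoutCaller` and its method `callRepeatedly` are hypothetical helpers and not part of the original project.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper that calls an endpoint several times in sequence. */
final class TimeoutCaller {

    /** Calls the endpoint `times` times and returns how many calls answered HTTP 200. */
    static int callRepeatedly(URI endpoint, int times) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofSeconds(10)) // generous: /timeout sleeps 1.5 s per call
                .GET()
                .build();
        int ok = 0;
        for (int i = 0; i < times; i++) {
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    ok++;
                }
            } catch (IOException e) {
                // Treat a failed or timed-out call as unsuccessful and keep going.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ok;
    }
}
```

For example, `TimeoutCaller.callRepeatedly(URI.create("http://127.0.0.1:8089/timeout"), 200)` would keep the endpoint busy for about five minutes at 1.5 seconds per call; the host 127.0.0.1 is an assumption and should be replaced by the VM's address. Because the default service_resp_time_rule uses count: 3, the slow responses need to span several separate minutes before an alarm can be expected.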

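As the response of the /show endpoint shows, the alarm information has been received. In production, the webhook endpoint can be connected to SMS, email or similar platforms, so that the responsible staff are notified as soon as an alarm occurs and faults are handled faster.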
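As a starting point for such an integration, the sketch below turns one alarm entry into a single line of text that could be handed to a mail or SMS client. It is only an illustration: the class `AlarmNotificationFormatter` and the output layout are assumptions, and the actual sending is left out because it depends on the platform used. The `ruleName` argument comes from the webhook payload shown earlier and is not a field of the AlarmMessage class above.

```java
import java.time.Instant;

/** Hypothetical helper that formats one alarm entry as notification text. */
final class AlarmNotificationFormatter {

    static String format(String name, String ruleName, String alarmMessage, long startTimeMillis) {
        // startTime in the webhook payload is an epoch timestamp in milliseconds;
        // Instant.toString() renders it as ISO-8601 in UTC.
        String time = Instant.ofEpochMilli(startTimeMillis).toString();
        return "[SkyWalking alarm] " + name
                + " | rule: " + ruleName
                + " | at: " + time
                + " | " + alarmMessage;
    }
}
```

Inside the webhook method, each received entry could be passed through `format` and the result forwarded to whichever notification service is in use.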
