Alerts
The following section provides details on GR alerts.
BFD Alerts
The following table lists the alerts for the rule group BFD with interval-seconds set to 60.
Alert Rule |
Severity |
Duration (in mins) |
Type |
---|---|---|---|
BFD-Link-Fail |
critical |
1 |
Communication Alarm |
Expression: sum by (namespace,pod,status) (bgp_speaker_bfd_status{status='BFD_STATUS'}) == 0 Description: This alert is generated when the BFD link associated with BGP peering is down. |
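The BFD-Link-Fail expression groups the BFD status samples by namespace, pod, and status, and fires for every group whose sum is 0. The following Python sketch mimics that grouping on in-memory samples. The sample data, the pod names, and the helper names `sum_by` and `bfd_link_fail` are hypothetical and are used only for illustration.

```python
from collections import defaultdict


def sum_by(samples, label_names):
    """Mimic PromQL `sum by (...)`: group samples by labels and add their values."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in label_names)
        totals[key] += value
    return dict(totals)


def bfd_link_fail(samples):
    """Return the (namespace, pod, status) groups whose summed BFD status is 0."""
    # Keep only the series selected by {status='BFD_STATUS'}.
    selected = [
        (labels, value)
        for labels, value in samples
        if labels.get("status") == "BFD_STATUS"
    ]
    totals = sum_by(selected, ("namespace", "pod", "status"))
    return sorted(key for key, total in totals.items() if total == 0)


# Hypothetical samples: one pod reports a non-zero status, the other reports 0.
samples = [
    ({"namespace": "ns-1", "pod": "bgp-speaker-0", "status": "BFD_STATUS"}, 1.0),
    ({"namespace": "ns-1", "pod": "bgp-speaker-1", "status": "BFD_STATUS"}, 0.0),
]
firing = bfd_link_fail(samples)
print(firing)
```

In this sketch, only the group for `bgp-speaker-1` is returned, because its summed status equals 0.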
GR Alerts
The following table lists the alerts for the rule group GR with interval-seconds set to 60.
Alert Rule |
Severity |
Duration (in mins) |
Type |
---|---|---|---|
Cache-POD-Replication-Immediate-Local |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(geo_replication_total{ReplicationNode='CACHE_POD', ReplicationSyncType='Immediate',ReplicationReceiver='local', ReplicationRequestType='Response',status='success'}[1m]))/sum by (namespace) (increase(geo_replication_total{ReplicationNode='CACHE_POD', ReplicationSyncType='Immediate',ReplicationReceiver='local', ReplicationRequestType='Request'}[1m])))*100 < 90 Description: This alert is generated when the success rate of CACHE_POD sync type:Immediate and replication receiver:Local is below threshold value. |
|||
Cache-POD-Replication-Immediate-Remote |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(geo_replication_total{ReplicationNode='CACHE_POD', ReplicationSyncType='Immediate',ReplicationReceiver='remote', ReplicationRequestType='Response',status='success'}[1m]))/sum by (namespace) (increase(geo_replication_total{ReplicationNode='CACHE_POD', ReplicationSyncType='Immediate',ReplicationReceiver='remote', ReplicationRequestType='Request'}[1m])))*100 < 90 Description: This alert is generated when the success rate of CACHE_POD sync type:Immediate and replication receiver:Remote is below threshold value. |
|||
Cache-POD-Replication-PULL-Remote |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(geo_replication_total{ReplicationNode='CACHE_POD', ReplicationSyncType='PULL',ReplicationReceiver='remote', ReplicationRequestType='Response',status='success'}[1m]))/sum by (namespace) (increase(geo_replication_total{ReplicationNode='CACHE_POD', ReplicationSyncType='PULL',ReplicationReceiver='remote', ReplicationRequestType='Request'}[1m])))*100 < 90 Description: This alert is generated when the success rate of CACHE_POD sync type:PULL and replication receiver:Remote is below threshold value. |
|||
ETCD-Replication-Immediate-Local |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(geo_replication_total{ReplicationNode='ETCD', ReplicationSyncType='Immediate',ReplicationReceiver='local', ReplicationRequestType='Response',status='success'}[1m]))/sum by (namespace) (increase(geo_replication_total{ReplicationNode='ETCD', ReplicationSyncType='Immediate',ReplicationReceiver='local', ReplicationRequestType='Request'}[1m])))*100 < 90 Description: This alert is generated when the success rate of ETCD sync type:Immediate and replication receiver:Local is below threshold value. |
|||
ETCD-Replication-Immediate-Remote |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(geo_replication_total{ReplicationNode='ETCD', ReplicationSyncType='Immediate',ReplicationReceiver='remote', ReplicationRequestType='Response',status='success'}[1m]))/sum by (namespace) (increase(geo_replication_total{ReplicationNode='ETCD', ReplicationSyncType='Immediate',ReplicationReceiver='remote', ReplicationRequestType='Request'}[1m])))*100 < 90 Description: This alert is generated when the success rate of ETCD sync type:Immediate and replication receiver:Remote is below threshold value. |
|||
ETCD-Replication-PULL-Remote |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(geo_replication_total{ReplicationNode='ETCD', ReplicationSyncType='PULL',ReplicationReceiver='remote', ReplicationRequestType='Response',status='success'}[1m]))/sum by (namespace) (increase(geo_replication_total{ReplicationNode='ETCD', ReplicationSyncType='PULL',ReplicationReceiver='remote', ReplicationRequestType='Request'}[1m])))*100 < 90 Description: This alert is generated when the success rate of ETCD sync type:PULL and replication receiver:Remote is below threshold value. |
|||
Heartbeat-Remote-Site |
critical |
- |
Communication Alarm |
Expression: sum by (namespace) (increase(geo_monitoring_total{ControlActionNameType='RemoteMsgHeartbeat',status!='success'}[1m])) > 0 Description: This alert is triggered when the periodic heartbeat to the remote site fails. |
|||
Local-Site-POD-Monitoring |
critical |
- |
Communication Alarm |
Expression: sum by (namespace,AdminNode) (increase(geo_monitoring_total{ControlActionNameType='MonitorPod'}[1m])) > 0 Description: This alert is triggered when local site pod monitoring failures breach the configured threshold for the pod mentioned in Label:{{$labels.AdminNode}}. |
|||
PEER-Replication-Immediate-Local |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(geo_replication_total{ReplicationNode='PEER', ReplicationSyncType='Immediate',ReplicationReceiver='local', ReplicationRequestType='Response',status='success'}[1m]))/sum by (namespace) (increase(geo_replication_total{ReplicationNode='PEER', ReplicationSyncType='Immediate',ReplicationReceiver='local', ReplicationRequestType='Request'}[1m])))*100 < 90 Description: This alert is generated when the success rate of PEER sync type:Immediate and replication receiver:Local is below threshold value. |
|||
PEER-Replication-Immediate-Remote |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(geo_replication_total{ReplicationNode='PEER', ReplicationSyncType='Immediate',ReplicationReceiver='remote', ReplicationRequestType='Response',status='success'}[1m]))/sum by (namespace) (increase(geo_replication_total{ReplicationNode='PEER', ReplicationSyncType='Immediate',ReplicationReceiver='remote', ReplicationRequestType='Request'}[1m])))*100 < 90 Description: This alert is generated when the success rate of PEER sync type:Immediate and replication receiver:Remote is below threshold value. |
|||
RemoteCluster-PODFailure |
critical |
- |
Communication Alarm |
Expression: sum by (namespace,AdminNode) (increase(geo_monitoring_total{ControlActionNameType='RemoteClusterPodFailure'}[1m])) > 0 Description: This alert is generated when a pod failure is detected on the remote site for the pod mentioned in Label:{{$labels.AdminNode}}. |
|||
RemoteMsgNotifyFailover |
critical |
1 |
Communication Alarm |
Expression: sum by (namespace,status) (increase(geo_monitoring_total{ControlActionNameType='RemoteMsgNotifyFailover',status!='success'}[1m])) > 0 Description: This alert is generated when the transient role RemoteMsgNotifyFailover has failed for the reason mentioned in Label:{{$labels.status}}. |
|||
RemoteMsgNotifyPrepareFailover |
critical |
1 |
Communication Alarm |
Expression: sum by (namespace,status) (increase(geo_monitoring_total{ControlActionNameType='RemoteMsgNotifyPrepareFailover',status!='success'}[1m])) > 0 Description: This alert is generated when the transient role RemoteMsgNotifyPrepareFailover has failed for the reason mentioned in Label:{{$labels.status}}. |
|||
RemoteSite-RoleMonitoring |
critical |
- |
Communication Alarm |
Expression: sum by (namespace,AdminNode) (increase(geo_monitoring_total{ControlActionNameType='RemoteSiteRoleMonitoring'}[1m])) > 0 Description: This alert is generated when RemoteSiteRoleMonitoring detects a role inconsistency for an instance on the partner rack and accordingly changes the role of the respective instance on the local rack to Primary. The impacted instance is in Label:{{$labels.AdminNode}}. |
|||
ResetRoleApi-Initiated |
critical |
- |
Communication Alarm |
Expression: sum by (namespace,status) (increase(geo_monitoring_total{ControlActionNameType='ResetRoleApi'}[1m])) > 0 Description: This alert is generated when ResetRoleApi is initiated with the state transition of roles mentioned in Label:{{$labels.status}}. |
|||
TriggerGRApi-Initiated |
critical |
- |
Communication Alarm |
Expression: sum by (namespace,status) (increase(geo_monitoring_total{ControlActionNameType='TriggerGRApi'}[1m])) > 0 Description: This alert is generated when TriggerGRApi is initiated for the reason mentioned in Label:{{$labels.status}}. |
|||
VIP-Monitoring-Failures |
critical |
- |
Communication Alarm |
Expression: sum by (namespace,AdminNode) (increase(geo_monitoring_total{ControlActionNameType='MonitorVip'}[1m])) > 0 Description: This alert is generated when GR is triggered upon detecting VIP monitoring failures for the VIP and instance mentioned in Label:{{$labels.AdminNode}}. |
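The replication alerts in the GR table share one pattern: the one-minute increase in successful responses is divided by the one-minute increase in requests, multiplied by 100, and compared against 90. The following Python sketch shows that arithmetic under stated assumptions. The function names and the input counts are hypothetical, and a zero request count is modeled as NaN, which is how PromQL treats a division of 0 by 0 (the comparison then does not match and the alert does not fire).

```python
import math

THRESHOLD_PERCENT = 90.0  # threshold used by the replication expressions


def success_rate_percent(success_responses, requests):
    """Return the replication success rate as a percentage."""
    if requests == 0:
        # No requests in the window: 0/0 is NaN, as in PromQL.
        return math.nan
    return 100.0 * success_responses / requests


def replication_alert_fires(success_responses, requests,
                            threshold=THRESHOLD_PERCENT):
    """Return True when the success rate is strictly below the threshold."""
    # Any comparison with NaN is False, so an idle window never fires.
    return success_rate_percent(success_responses, requests) < threshold


# Hypothetical one-minute increases (success responses, requests).
print(replication_alert_fires(85, 100))  # rate is 85 percent
print(replication_alert_fires(95, 100))  # rate is 95 percent
print(replication_alert_fires(0, 0))     # no traffic in the window
```

The monitoring alerts in the same table are simpler: they fire when the one-minute increase of the matching counter is greater than 0.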
CDL Alerts
The following table lists the alerts for the rule group CDL with interval-seconds set to 60.
Alert Rule |
Severity |
Duration (in mins) |
Type |
---|---|---|---|
GRPC-Connections-Remote-Site |
critical |
1 |
Communication Alarm |
Expression: sum by (namespace, pod, systemId) (remote_site_connection_status) != 4 Description: This alert is generated when the number of GRPC connections to the remote site is not equal to 4. |
|||
Inter-Rack-CDL-Replication |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(datastore_requests_total{local_request="0", errorCode="0"}[1m]))/sum by (namespace) (increase(datastore_requests_total{local_request="0"}[1m])))*100 < 90 Description: This alert is generated when the Inter-rack CDL replication success rate is below threshold value. |
|||
Intra-Rack-CDL-Replication |
critical |
1 |
Communication Alarm |
Expression: (sum by (namespace) (increase(datastore_requests_total{local_request="1", errorCode="0"}[1m]))/sum by (namespace) (increase(datastore_requests_total{local_request="1"}[1m])))*100 < 90 Description: This alert is generated when the Intra-rack CDL replication success rate is below threshold value. |
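The two CDL replication expressions differ only in the `local_request` label: "0" selects inter-rack requests and "1" selects intra-rack requests, and `errorCode="0"` selects the successful ones. The following Python sketch illustrates that label filtering on hypothetical one-minute increases. The sample values and the function names are assumptions. As a simplification, a namespace with requests but no successful samples yields a rate of 0 percent here, whereas PromQL vector matching would return no result for it.

```python
from collections import defaultdict


def cdl_success_rates(increases, local_request):
    """Return the per-namespace success rate (percent) for one request type."""
    ok = defaultdict(float)
    total = defaultdict(float)
    for labels, value in increases:
        if labels["local_request"] != local_request:
            continue  # keep only inter-rack ("0") or intra-rack ("1") samples
        total[labels["namespace"]] += value
        if labels["errorCode"] == "0":
            ok[labels["namespace"]] += value
    # Skip namespaces without requests to avoid dividing by zero.
    return {ns: 100.0 * ok[ns] / total[ns] for ns in total if total[ns] > 0}


def firing_namespaces(rates, threshold=90.0):
    """Return the namespaces whose success rate is below the threshold."""
    return sorted(ns for ns, rate in rates.items() if rate < threshold)


# Hypothetical increases of datastore_requests_total over one minute.
increases = [
    ({"namespace": "ns-1", "local_request": "0", "errorCode": "0"}, 80.0),
    ({"namespace": "ns-1", "local_request": "0", "errorCode": "5"}, 20.0),
    ({"namespace": "ns-1", "local_request": "1", "errorCode": "0"}, 99.0),
    ({"namespace": "ns-1", "local_request": "1", "errorCode": "5"}, 1.0),
]
inter_rack = firing_namespaces(cdl_success_rates(increases, "0"))
intra_rack = firing_namespaces(cdl_success_rates(increases, "1"))
print(inter_rack, intra_rack)
```

With these sample values, the inter-rack rate is 80 percent and would raise Inter-Rack-CDL-Replication, while the intra-rack rate is 99 percent and would not raise Intra-Rack-CDL-Replication.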