bulk_task_ongoing
Description: Gauge metric indicating the number of bulk tasks that are being processed at any given point in time
Sample Query: bulk_task_ongoing
Labels:
-
Label: db
Label Description: DB name
Example: session
-
Label: slot_shard_id
Label Description: The slot shard id
Example: 1, 2
-
Label: slot_instance_id
Label Description: The slot instance id
Example: 1, 2
-
Label: cdl_slice
Label Description: The name of the logical cdl slice
Example: session
bulk_task_total
Description: Total number of bulk tasks with processing status
Sample Query: bulk_task_total
Labels:
-
Label: db
Label Description: DB name
Example: session
-
Label: slot_shard_id
Label Description: The slot shard id
Example: 1, 2
-
Label: slot_instance_id
Label Description: The slot instance id
Example: 1, 2
-
Label: cdl_slice
Label Description: The name of the logical cdl slice
Example: session
-
Label: status
Label Description: Processing status of bulk task
Example: timeout, skipped, aborted, completed_last_record, completed
cdl_ep_to_slot_request_tps
Description: Recording rule for endpoint to slot request TPS measurement
Sample Query: cdl_ep_to_slot_request_tps
Labels:
-
Label: pod
Label Description: Endpoint pod name from which the metric is generated
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, UpdateFlags
-
Label: errorCode
Label Description: The errorCode in the DB response for deletion
Example: 0, 502
cdl_ep_to_slot_response_time
Description: Recording rule for endpoint to slot response time measurement
Sample Query: cdl_ep_to_slot_response_time
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, UpdateFlags
Labels:
cdl_geo_replication_enabled
Description: Gauge metric indicating geo replication status. The value is 1 if geo replication is enabled, otherwise 0
Sample Query: cdl_geo_replication_enabled
cdl_index_record_capacity
Description: Total index record capacity of CDL
Sample Query: cdl_index_record_capacity{db="session"}
Labels:
cdl_slice_state
Description: CDL slice active state information in GR instance-awareness. If value is 1 then slice is active
Sample Query: cdl_slice_state
Labels:
cdl_slot_record_capacity
Description: Total slot record capacity of CDL
Sample Query: cdl_slot_record_capacity{db="session"}
Labels:
cdl_slot_size_capacity
Description: Total slot size capacity of CDL
Sample Query: cdl_slot_size_capacity{db="session"}
Labels:
consumer_kafka_nonprocessed_records_total
Description: Total count of kafka records left unprocessed because they originated from the same pod
Sample Query: sum(consumer_kafka_nonprocessed_records_total)by(shardId,instanceId)
Labels:
consumer_kafka_records_duration_seconds
Description: Time taken to process consumed kafka records
Sample Query: sum(irate(consumer_kafka_records_duration_seconds[5m]))by(shardId,instance_id)
Labels:
consumer_kafka_records_total
Description: Total count of records consumed from kafka
Sample Query: sum(irate(consumer_kafka_records_total[5m]))by(shardId,instance_id)
Labels:
datastore_internal_requests_duration_seconds
Description: Time taken for processing of internal datastore requests
Sample Query: sum(datastore_internal_requests_duration_seconds)by(operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: RemoteBulkRead, RemoteBulkReadIndexing, GetChecksumRemoteSlot
Labels:
datastore_requests_duration_seconds
Description: Total time taken for processing requests at cdl-ep
Sample Query: sum(irate(datastore_requests_duration_seconds{errorCode="0",local_request="1"}[5m])) by (operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, Find, FindByUK, GetCdlStatus, UpdateFlags
Labels:
-
Label: errorCode
Label Description: The errorCode in the DB response
Example: 0, 400, 403, 404, 409, 413, 501, 502, 503, 507, 508
Labels:
datastore_requests_total
Description: Total count of requests received at cdl-ep
Sample Query: sum(irate(datastore_requests_total{errorCode="0",local_request="1"}[5m])) by (operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, Find, FindByUK, GetCdlStatus, UpdateFlags
Labels:
-
Label: errorCode
Label Description: The errorCode in the DB response
Example: 0, 400, 403, 404, 409, 413, 501, 502, 503, 507, 508
Labels:
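The request duration and request count metrics at cdl-ep can be combined to estimate the average processing time per operation. The Python sketch below only assembles the PromQL expression as a string. It assumes that datastore_requests_duration_seconds accumulates processing time as a counter (which the irate() sample query suggests but this reference does not state), and the helper name avg_latency_query is hypothetical.

```python
def avg_latency_query(window: str = "5m", error_code: str = "0",
                      local_request: str = "1") -> str:
    """Build a PromQL expression for average request time per operation.

    Assumption: datastore_requests_duration_seconds is a cumulative sum of
    seconds, so dividing its rate by the request rate yields seconds/request.
    """
    # Same label filter as the sample queries for both metrics.
    selector = f'{{errorCode="{error_code}",local_request="{local_request}"}}'
    duration = ("sum(irate(datastore_requests_duration_seconds"
                f"{selector}[{window}])) by (operation)")
    total = ("sum(irate(datastore_requests_total"
             f"{selector}[{window}])) by (operation)")
    # Both sides are grouped by operation, so the division matches one-to-one.
    return f"{duration} / {total}"


if __name__ == "__main__":
    print(avg_latency_query())
```

The resulting expression can be pasted into any PromQL console. If the duration metric turns out to be a histogram or summary rather than a plain counter, the expression would need to use the corresponding _sum and _count series instead.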
db_records_softdelete_total
Description: Total count of records for the db which are in soft delete/purge state because purgeOnEval is set
Sample Query: sum(avg(db_records_softdelete_total{notify="1"})by(notify))
Labels:
db_records_total
Description: Total count of records for the db. The following values can be derived from this metric:
1. Total record count - Query: sum(avg(db_records_total{namespace="$namespace",session_type="total",appInstanceId="0"})by(systemId,cdl_slice))
2. Slice-wise record count - Query: sum(avg(db_records_total{namespace="$namespace",session_type="total",appInstanceId="0"})by(systemId,cdl_slice))by(cdl_slice)
3. System ID based count - Query: sum(avg(db_records_total{namespace="$namespace",session_type="total",appInstanceId="0"})by(systemId,cdl_slice))by(systemId)
4. Sessions grouped by session type - Query: avg(db_records_total{namespace="$namespace",session_type!="total"}) by (session_type)
Sample Query: sum(avg(db_records_total{namespace="$namespace",session_type="total",appInstanceId="0"})by(systemId,cdl_slice))
Labels:
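The db_records_total queries use a $namespace placeholder, which has to be replaced with a real namespace before the query can run outside a dashboard. The Python sketch below substitutes the placeholder and builds a URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query). The base URL, the namespace value, and the helper names are hypothetical, and treating $namespace as a dashboard variable is an assumption.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Query templates copied from the db_records_total description.
_BASE = ('sum(avg(db_records_total{namespace="$namespace",'
         'session_type="total",appInstanceId="0"})by(systemId,cdl_slice))')
TEMPLATES = {
    "total": _BASE,
    "by_slice": _BASE + "by(cdl_slice)",
    "by_system": _BASE + "by(systemId)",
    "by_session_type": ('avg(db_records_total{namespace="$namespace",'
                        'session_type!="total"}) by (session_type)'),
}


def build_query_url(base_url: str, name: str, namespace: str) -> str:
    """Return an instant-query URL for one of the templates above."""
    expr = TEMPLATES[name].replace("$namespace", namespace)
    # urlencode percent-escapes braces, quotes, and other reserved characters.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def query_from_url(url: str) -> str:
    """Decode the PromQL expression back out of a URL (useful for checks)."""
    return parse_qs(urlparse(url).query)["query"][0]


if __name__ == "__main__":
    # Hypothetical Prometheus address and namespace.
    print(build_query_url("http://prometheus.example:9090", "by_slice", "cdl"))
```

The URL can then be fetched with any HTTP client. The sketch stops at URL construction so that it runs without a live Prometheus server.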
dpapp_internal_requests_total
Description: Total count of internal dp app requests
Sample Query: sum(dpapp_internal_requests_total)by(operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: RemoteBulkRead, RemoteBulkReadIndexing, GetChecksumRemoteSlot
Labels:
duplicate_slot_records_deleted
Description: Total slot records deleted due to duplicate slot data found
Sample Query: duplicate_slot_records_deleted
Labels:
find_no_record_total
Description: Total count of find requests for which no records are sent back
Sample Query: sum(find_no_record_total)by(not_found_in,operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: FindByUk, FindTagsByUk, Find
Labels:
findall_records_bucket
Description: Total number of findAll requests received, grouped by the number of records sent in the response
Sample Query: sum(irate(findall_records_bucket[5m]))by(bucket)
Labels:
-
Label: bucket
Label Description: Buckets grouped by number of records
Example: =0, <=10, <=20, <=50, <=100, >100
Labels:
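To make the bucket label values concrete, the Python sketch below maps a findAll response size to one of the example bucket values. The boundaries are taken from the example list above. Treating the buckets as disjoint ranges (rather than cumulative, as in Prometheus histograms) is an assumption, and the helper name bucket_for is hypothetical.

```python
# Upper bounds taken from the bucket label examples; treating the buckets
# as disjoint ranges is an assumption.
BOUNDS = [(0, "=0"), (10, "<=10"), (20, "<=20"), (50, "<=50"), (100, "<=100")]


def bucket_for(record_count: int) -> str:
    """Return the bucket label value for a findAll response of this size."""
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    # The first bound that fits wins, so each count lands in one bucket.
    for upper, label in BOUNDS:
        if record_count <= upper:
            return label
    return ">100"


if __name__ == "__main__":
    for n in (0, 7, 20, 21, 150):
        print(n, bucket_for(n))
```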
index_init_sync_duration_seconds
Description: Time taken by the index to sync with local and remote peers during startup
Sample Query: sum(index_init_sync_duration_seconds)by(shardId,instance_id)
Labels:
index_rebalanced_keys_total
Description: Total number of index keys that have been rebalanced
Sample Query: index_rebalanced_keys_total
Labels:
indexing_audit_deleted_keys_total
Description: Total number of unique keys and primary keys deleted during index auditing
Sample Query: sum(irate(indexing_audit_deleted_keys_total{errorCode="0",key_type="unique"}[5m]))by(shardId,instance_id)
Labels:
indexing_audit_duration_seconds
Description: Total time taken for performing the indexing audit
Sample Query: indexing_audit_duration_seconds{pod=~".*"}
indexing_audit_total
Description: Total times the indexing audit was run
Sample Query: indexing_audit_total{pod=~".*"}
indexing_is_leader
Description: Indicates whether the indexing instance is the leader or a follower
Sample Query: indexing_is_leader
Labels:
indexing_kafka_replication_delay_seconds
Description: Total delay in replicating index records from kafka to the index
Sample Query: sum(irate(indexing_kafka_replication_delay_seconds[5m]))by(shardId,instance_id)
Labels:
indexing_operation_duration_seconds
Description: Time taken for response of indexing operations sent from cdl ep to index app
Sample Query: sum(irate(indexing_operation_duration_seconds{errorCode="0"}[5m])) by (operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, GetByPk, GetByUk
Labels:
-
Label: errorCode
Label Description: The errorCode in the DB response
Example: 0, 404, 500, 503
Labels:
indexing_operation_total
Description: Total count of indexing operations sent from cdl ep to index app
Sample Query: sum(irate(indexing_operation_total{errorCode="0"}[5m])) by (operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, GetByPk, GetByUk
Labels:
-
Label: errorCode
Label Description: The errorCode in the DB response
Example: 0, 404, 500, 503
Labels:
indexing_overwrites_total
Description: Total number of indexing set operations in which an existing index record is overwritten
Sample Query: sum(indexing_overwrites_total)by(key_type,shardId)
Labels:
indexing_records_total
Description: Total count of records in the indexing
Sample Query: indexing_records_total{pod=~".*"}
Labels:
indexing_requests_duration_seconds
Description: Time taken for response of indexing requests received
Sample Query: sum(irate(indexing_requests_duration_seconds{errorCode="0",isKafka="1"}[5m])) by (operation)
Labels:
indexing_requests_total
Description: Total number of requests received at index pod
Sample Query: sum(irate(indexing_requests_total{errorCode="0",isKafka="1"}[5m])) by (operation)
Labels:
inmemory_indexing_operation_duration_seconds
Description: Total time taken for responses to requests from cdl-ep to cdl-index pod
Sample Query: sum(inmemory_indexing_operation_duration_seconds{errorCode="0"})by(operation)
Labels:
inmemory_indexing_operation_total
Description: Total count of operations from cdl-ep to cdl-index pod
Sample Query: sum(inmemory_indexing_operation_total)by(operation,shardId)
Labels:
kafka_connection_status
Description: Kafka connection status
Sample Query: kafka_connection_status
Labels:
kafka_producer_downtime_op_total
Description: Total number of operations added to the downtime cache while the kafka producer is not available
Sample Query: sum(kafka_producer_downtime_op_total) by (pod,reason)
Labels:
kafka_producer_pending_publish_total
Description: Total count of messages pending to be published to kafka
Sample Query: kafka_producer_pending_publish_total{pod=~".*"}
kafka_producer_republished_total
Description: Total count of requests republished by kafka producer
Sample Query: kafka_producer_republished_total
Labels:
kafka_producer_requests_duration_seconds
Description: Total time taken by kafka producer to process requests
Sample Query: sum(irate(kafka_producer_requests_duration_seconds[5m])) by (topic)
Labels:
kafka_producer_requests_total
Description: Total count of requests sent towards kafka
Sample Query: kafka_producer_requests_total
Labels:
kafka_records_replayed_total
Description: Total number of records published to kafka due to leader-switchover or kafka-reconnection
Sample Query: sum(kafka_records_replayed_total{reason="leader_switchover"})by(shardId,instance_id)
Labels:
-
Label: reason
Label Description: The reason for replaying kafka records
Example: leader_switchover, kafka_reconnection
notification_ep_connection_total
Description: Total number of connections from CDL to notification endpoint
Sample Query: notification_ep_connection_total
Labels:
notification_streaming_enabled
Description: CDL to Notification endpoint streaming connection status. If streaming is enabled then value is 1
Sample Query: notification_streaming_enabled
overwritten_index_records_deleted
Description: Total number of records deleted due to overwritten/duplicate unique keys at index
Sample Query: overwritten_index_records_deleted
Labels:
overwritten_index_records_skipped
Description: Total number of unprocessed stale records due to queue being full
Sample Query: overwritten_index_records_skipped
Labels:
records_notification_duration_seconds
Description: Time taken for notification sent towards notification endpoint
Sample Query: sum(irate(records_notification_duration_seconds[5m])) by (shardId,instance_id,notification_type)
Labels:
-
Label: notification_type
Label Description: Type of the notification
Example: TIMER_EXPIRED, RECORD_CONFLICT, BULK_TASK_NOTIFICATION
Labels:
records_notification_retry_count
Description: Total notification retries by the slot app
Sample Query: sum(irate(records_notification_retry_count[5m])) by (shardId,instance_id)
Labels:
records_notification_total
Description: Total count of notifications sent towards notification endpoint
Sample Query: sum(irate(records_notification_total[5m])) by (shardId,instance_id,notification_type)
Labels:
-
Label: notification_type
Label Description: Type of the notification
Example: TIMER_EXPIRED, RECORD_CONFLICT, BULK_TASK_NOTIFICATION
Labels:
remote_requests_dropped_total
Description: Total number of remote requests that have been dropped
Sample Query: remote_requests_dropped_total
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, UpdateFlags
Labels:
remote_site_connection_status
Description: CDL endpoint to remote site cdl-ep connection count
Sample Query: sum(remote_site_connection_status)by(pod,systemId)
Labels:
remote_site_connections_total
Description: Total number of remote site connections configured per endpoint pod
Sample Query: remote_site_connections_total
Labels:
slot_checksum_mismatch_total
Description: Total number of checksum mismatches
Sample Query: sum(irate(slot_checksum_mismatch_total[5m]))by(slot_shard_id)
Labels:
slot_geo_replication_requests_duration_seconds
Description: Time taken to send the response of slot geo replication
Sample Query: sum(irate(slot_geo_replication_requests_duration_seconds[5m]))by(systemId,operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: CREATE, DELETE, UPDATE, UPDATEFLAGS
Labels:
slot_geo_replication_requests_total
Description: Total number of requests for slot geo replication
Sample Query: sum(irate(slot_geo_replication_requests_total[5m]))by(systemId,operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: CREATE, DELETE, UPDATE, UPDATEFLAGS
Labels:
slot_init_sync_duration_seconds
Description: Time taken by the slot to sync with local and remote peers during startup
Sample Query: sum(slot_init_sync_duration_seconds)by(shardId,instance_id)
Labels:
slot_operation_duration_seconds
Description: Time taken for response of operations sent from cdl ep to slot app
Sample Query: sum(irate(slot_operation_duration_seconds{errorCode="0",local_request="1"}[5m])) by (operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, Find, UpdateFlags
Labels:
slot_operation_total
Description: Total count of operations sent from cdl ep to slot app
Sample Query: sum(irate(slot_operation_total{errorCode="0",local_request="1"}[5m])) by (operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Create, Update, Delete, Find, UpdateFlags
Labels:
slot_purged_sessions_duration_seconds
Description: Time taken for purging sessions at slot due to next eval timer expiry and purge=true
Sample Query: sum(irate(slot_purged_sessions_duration_seconds{errorCode="0"}[5m]))by(shardId,instance_id)
Labels:
Example: Get index record failure, Invalid Slice Name received
Labels:
slot_purged_sessions_total
Description: Total number of sessions purged at slot due to next eval timer expiry and purge=true
Sample Query: sum(irate(slot_purged_sessions_total{errorCode="0"}[5m]))by(shardId,instance_id)
Labels:
slot_reconciled_records_total
Description: Total number of reconciled records
Sample Query: sum(slot_reconciled_records_total)by(systemId,slot_shard_id,slot_instance_id)
Labels:
slot_reconciliation_duration_seconds
Description: Total time taken to execute reconciliation
Sample Query: sum(slot_reconciliation_duration_seconds{isError="0"})by(slot_shard_id)
Labels:
slot_reconciliation_total
Description: Total number of reconciliations triggered by checksum mismatch
Sample Query: sum(slot_reconciliation_total{isError="0"})by(slot_shard_id)
Labels:
slot_records_size_total
Description: Total size of records in bytes in the slot
Sample Query: sum(slot_records_size_total)
Labels:
slot_records_total
Description: Total count of records in the slot
Sample Query: sum(slot_records_total{session_type="total"}) by(pod)
Labels:
-
Label: bucket
Label Description: The bucket grouped by size
Example: <=1kb, 2kb, 4kb, 8kb
Labels:
slot_requests_duration_seconds
Description: Time taken for response of requests received at slot app
Sample Query: sum(irate(slot_requests_duration_seconds{errorCode="0"}[5m])) by (errorCode)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Get, Create, Delete
Labels:
slot_requests_total
Description: Total count of requests received at slot app
Sample Query: sum(irate(slot_requests_total{errorCode="0"}[5m])) by (operation)
Labels:
-
Label: operation
Label Description: The type of DB operation
Example: Get, Create, Delete
Labels:
slot_stale_record_duration_seconds
Description: Time taken by the slot to process the stale slot records
Sample Query: slot_stale_record_duration_seconds
Labels:
-
Label: reason
Label Description: The reason for stale record deletion
Example: find_all_notify, stale_check_enabled
slot_stale_record_total
Description: Total count of stale slot record deletions processed
Sample Query: slot_stale_record_total
Labels:
-
Label: reason
Label Description: The reason for stale record deletion
Example: find_all_notify, stale_check_enabled