Scale-Up Your Network Monitoring Strategy Using Telemetry

Are you monitoring your network using traditional polling methods such as SNMP, Syslog, and CLI? If yes, does the data that you extract from your network help you answer these questions?
  • What percentage of the network bandwidth does the network traffic currently consume?

  • Do all the links in the network run at a hundred percent utilization rate?

  • If an unmanned router fails, is the network operator notified in real time about the issue and its related consequences?

  • Is the CPU over- or under-utilized?

  • Can the efficiency of the network be calculated based on traffic and data loss?

  • What are the possible performance issues that cause traffic loss or network latency?

  • How do you proactively prevent issues that may arise? Does the data support the study of network patterns in real time?

These traditional methods use a pull model to request information at regular intervals. The data that you collect may help you to efficiently monitor your network of a manageable size. However, as your network grows in complexity and scale, the data that you poll may be insufficient for efficient and effective monitoring. Additionally, the polling methods are resource-intensive, and network operators face information gaps in the data that they collect. With the pull model, the network device (the server) sends data only when the data collector (the client) requests it. Initiating such requests requires continual manual intervention. This manual intervention makes this model unsuitable, and limits automation and the ability to scale. It inhibits the visibility of the network and therefore provides inefficient control of the network. You need monitoring strategy that adds resiliency and stability to your network.

Telemetry does just that. Telemetry uses a push model that automatically streams data from a network device. Instead of a collector requesting data at periodic intervals, the network device streams operational data in real time.

Telemetry focuses on the power of scale, speed, and automation. With the power of flexibility, you can select data of interest from the routers and transmit it in a structured format to remote management stations for monitoring. Using the finer granularity and higher frequency of data available through telemetry, DevOps (development and operations) engineers in your organization can quickly locate and investigate issues as soon as they occur. They can, thus, collaborate to monitor and have better control over the network.

The following image shows the comparative benefits of streaming telemetry data using the telemetry push model over traditional pull models. The pull models create resource bottlenecks that prevent retrieving valuable operational data from the router. On the other hand, the push model is designed to remove such bottlenecks and deliver data efficiently.

Figure 1. Comparison Between Traditional Pull Models and Telemetry Push Model

Watch this video to see how telemetry data can unlock the intelligence of data in your network to proactively predict and troubleshoot issues.


Note


Starting from Cisco IOS XR, Release 7.0.1, Telemetry is part of the base image (<platform>-mini-x.iso). In earlier releases, Telemetry was part of the Manageability package (<platform>-mgbl-3.0.0.0-<release>.x86_64.rpm).


This article describes the benefits of using telemetry data and the various methods to stream meaningful data from your network device:

Benefits of Shifting Network Monitoring from Pull Models to Telemetry Push Model

Real-time telemetry data is useful in:

  • Managing network remotely: The primary benefit of telemetry is the ability it offers you as an end user to monitor the state of a network element remotely. After the network is deployed, you cannot be physically present at the network site to find out what works, and what is cumbersome. With telemetry, those insights can be analyzed, leveraged, and acted upon from a remote location.

  • Optimizing traffic: When link utilization and packet drops in a network are monitored at frequent intervals, it is easier to add or remove links, re-direct traffic, modify policing, and so on. With technologies like fast reroute, the network can switch to a new path and re-route faster than the traditional SNMP poll interval mechanism. Streaming telemetry data helps in providing quick response time for faster transport of traffic.

  • Preventive troubleshooting: Network state indicators, network statistics, and critical infrastructure information are exposed to the application layer, where they are used to enhance operational performance and to reduce troubleshooting time. The finer granularity and higher frequency of data available through telemetry enables better performance monitoring and therefore, better troubleshooting.

  • Visualizing data: Telemetry data acts as a data lake that analytics toolchains and applications use to visualize valuable insights into your network deployments.

  • Monitoring and controlling distributed devices: The monitoring function is decoupled from the storage and analysis functions. This decoupling helps to reduce device dependency, while providing flexibility to transform data using pipelines. These pipelines are utilities that consume telemetry data, transform it, and forward the resulting content to a downstream, typically off-the-shelf, consumer. The supported downstream consumers include Apache Kafka, Influxdata, Prometheus, and Grafana.

Streaming telemetry, thus, converts the monitoring process into a Big Data proposition that enables the rapid extraction and analysis of massive data sets to improve decision-making.

Review Mechanisms to Stream Telemetry Data from a Router to a Destination

Telemetry data can be streamed using either cadence-driven or event-driven mechanisms.

Figure 2. Cadence-driven and Event-driven Telemetry

Cadence-driven Telemetry

Cadence-driven telemetry continually streams data (operational statistics and state transitions) at a configured cadence. The higher frequency of the data that is continuously streamed helps you closely identify emerging patterns in the network.

The following image shows a continuous stream of data after a configured time interval:

Figure 3. Cadence-driven Telemetry

Event-driven Telemetry

Event-driven telemetry optimizes data that is collected at the receiver and streams data only when a state transition occurs and thus optimizes data that is collected at the receiver. For example, EDT streams data about interface state transitions, IP route updates, and so on.

The following image shows a stream of data after a state change:

Figure 4. Event-driven Telemetry

Learn About the Elements that Enable Streaming Telemetry Data

These elements are the building blocks in enabling telemetry in a network.

Sensor Path

The sensor path describes a YANG path or a subset of data definitions in a YANG data model within a container. In a YANG model, the sensor path can be specified to end at any level in the container hierarchy.

A YANG module defines a data model through the data of the router, and the hierarchical organization and constraints on that data.

YANG defines four node types. Each node has a name. Depending on the node type, the node either defines a value or contains a set of child nodes. The nodes types for data modeling are:
  • leaf node - contains a single value of a specific type

  • leaf-list node - contains a sequence of leaf nodes

  • list node - contains a sequence of leaf-list entries, each of which is uniquely identified by one or more key leaves

  • container node - contains a grouping of related nodes that have only child nodes, which can be any of the four node types


Note


Only the following Sysadmin data models are supported for streaming telemetry:
  • Cisco-IOS-XR-sysadmin-entity-mib

  • Cisco-IOS-XR-sysadmin-entity-sensor-mib

  • Cisco-IOS-XR-sysadmin-envmon-ui

  • Cisco-IOS-XR-sysadmin-asic-errors-ael

  • Cisco-IOS-XR-sysadmin-show-media


For more information about data models, see the Programmability Configuration Guide for Cisco NCS 5000 Series Routers.

The following table shows few examples of sensor paths:

Table 1. Sensor Paths

Feature

Sensor Path

CPU

Cisco-IOS-XR-wdsysmon-fd-oper:system-monitoring/cpu-utilization

Memory

Cisco-IOS-XR-nto-misc-oper:memory-summary/nodes/node/summary

Interface

Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters

Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/data-rate

openconfig-interfaces:interfaces/interface

Node summary

Cisco-IOS-XR-nto-misc-oper:memory-summary/nodes/node/summary

Forwarding information base (FIB)

Cisco-IOS-XR-fib-common-oper:fib-statistics/nodes/node/drops

Cisco-IOS-XR-fib-common-oper:fib/nodes/node/protocols/protocol/vrfs/vrf/summary

MPLS Traffic engineering (MPLS-TE)

Cisco-IOS-XR-mpls-te-oper:mpls-te/tunnels/summary

Cisco-IOS-XR-ip-rsvp-oper:rsvp/interface-briefs/interface-brief

Cisco-IOS-XR-mpls-te-oper:mpls-te/fast-reroute/protections/protection

Cisco-IOS-XR-mpls-te-oper:mpls-te/signalling-counters/signalling-summary

Cisco-IOS-XR-mpls-te-oper:mpls-te/p2p-p2mp-tunnel/tunnel-heads/tunnel-head

MPLS Label distribution protocol (MPLS-LDP)

Cisco-IOS-XR-mpls-ldp-oper:mpls-ldp/nodes/node/bindings-summary-all

Cisco-IOS-XR-mpls-ldp-oper:mpls-ldp/global/active/default-vrf/summary

Cisco-IOS-XR-mpls-ldp-oper:mpls-ldp/nodes/node/default-vrf/neighbors/neighbor

Routing

Cisco-IOS-XR-clns-isis-oper:isis/instances/instance/statistics-global

Cisco-IOS-XR-clns-isis-oper:isis/instances/instance/neighbors/neighbor

Cisco-IOS-XR-ip-rib-ipv4-oper:rib/rib-table-ids/rib-table-id/summary-protos/summary-proto

Cisco-IOS-XR-clns-isis-oper:isis/instances/instance/levels/level/adjacencies/adjacency

Cisco-IOS-XR-ipv4-bgp-oper:bgp/instances/instance/instance-active/default-vrf/process-info

Cisco-IOS-XR-ip-rib-ipv6-oper:ipv6-rib/rib-table-ids/rib-table-id/summary-protos/summary-proto


Note


Use specific paths to avoid streaming data that you may not be interested. For example, if you want to stream information about only the summary of MPLS-TE, use sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/autotunnel/mesh/summary instead of sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te sensor path.


In the sensor path configuration, the schema node identifier can be configured with or without a leading slash.

An MDT-capable device, such as a router, associates the sensor path to the nearest container path in the model. The router encodes and streams the container path within a single telemetry message. A receiver receives data about all the containers and leaf nodes at and below this container path. The router streams telemetry data, for one or more sensor-paths, at the configured frequency (Cadence-driven Telemetry), or when an event occurs (Event-driven Telemetry), to one or more collectors through subscribed sessions.

Subscription

A subscription binds one or more sensor paths and destinations.

The collector uses the subscription to receive updates about the state of data on the router. A subscription can consist of one or more sensor paths. The data for the paths that you have subscribed starts streaming until the session is terminated by the collector or the telemetry subscription configuration is removed to cancel the subscription.

The following example shows subscription SUB1 that associates a sensor-group, sample interval and destination group.


Router(config)#telemetry model-driven
Router(config-model-driven)#subscription SUB1
Router(config-model-driven-subs)#sensor-group-id SGROUP1 sample-interval 10000
Router(config-model-driven-subs)#strict-timer

Note


With a strict-timer configured for the sample interval, the data collection starts exactly at the configured time interval allowing a more deterministic behavior to stream data. In 32-bit platforms, strict-timer can be configured only under the subscription. Whereas, 64-bit platforms support configurtion at global level in addition to the subscription level. However, configuring at the global level will affect all configured subscriptions.

Router(config)#telemetry model-driven
Router(config-model-driven)#strict-timer

Encoder

Data that is streamed from a router can be encoded using one of these formats:

  • GPB encoding: Configuring for GPB encoding requires metadata in the form of compiled .proto files. A .proto file describes the GPB message format which is used to stream data. The .proto files are available at Cisco Network Telemetry Proto in Github.
    • Compact GPB encoding: Data is streamed in a compressed format and not in a self-descriptive format. A .proto file corresponding to each sensor-path must be used by the collector to decode the streamed data.

    • Self-describing GPB encoding: Data streamed for each sensor path is in a self-describing and ASCII text format. A single .proto file, telemetry.proto, is used by the collector to decode any sensor path data. Self-describing GPB encoding is easier to manage because it needs single .proto file to decode any sensor path data, even though the message size is large.

  • JSON encoding: Data is streamed in strings of keys and its values in a human-readable format.

Transport

In the telemetry push model, the router streams telemetry data using a transport protocol. The generated data is encapsulated into the desired format using encoders.

Model-Driven Telemetry (MDT) data is streamed through these supported transport protocols:
  • Google Protocol RPC (gRPC): used for both dial-in and dial-out modes.


    Note


    gRPC protocol is not supported over Multiprotocol Label Switching (MPLS) including explicit-null label.


  • Transmission Control Protocol (TCP): used for only dial-out mode.

  • User Datagram Protocol (UDP): used for only dial-out mode. Because UDP is connectionless, the UDP destination is shown as Active irrespective of the state of the collector. This is not ideally suitable for a busy network. If a message is dropped by the network before it reaches the collector, the protocol does not resend the data.


Note


Telemetry data is streamed out of the router using an Extensible Manageability Services Deamon (emsd) process. The data of interest is subscribed through subscriptions and streamed through gRPC, TCP or UDP sessions. However, a combination of gRPC, TCP and UDP sessions with more than 150 active sessions leads to emsd crash or process restart.


gRPC Network Management Interface

gRPC Network Management Interface (gNMI) is a gRPC-based network management protocol used to modify, install or delete configuration from network devices. It is also used to view operational data, control and generate telemetry streams from a target device to a data collection system. It uses a single protocol to manage configurations and stream telemetry data from network devices.

For the list of gNMI RPCs, see the Programmability Configuration Guide.

gRPC Network Operations Interface

gRPC Network Operations Interface (gNOI) defines a set of gRPC-based microservices for executing operational commands on network devices. Extensible Manageability Services (EMS) gNOI is the Cisco IOS XR implementation of gNOI. gNOI uses gRPC as the transport protocol and the configuration is same as that of gRPC.

For the list of gNOI RPCs, see the Programmability Configuration Guide.

TLS Authentication

The gRPC protocol supports Transport Layer Security (TLS) for encrypting data. By default, model-driven telemetry uses TLS to dial-out.

When TLS is enabled, the server sends a certificate to authenticate it with the collector. The collector validates the certificate verifying which certificate authority has signed it and generates session keys to encrypt the session.

The TLS certificate must be copied at the /misc/config/grpc/dialout/ path. If only the protocol grpc command is configured, by default, TLS is enabled and the hostname defaults to the IP address of the destination. In addition, in the certificate, configure the Common Name (CN) as protocol grpc tls-hostname <>.

The following output shows the certificate that gRPC uses to establish a dialout session:

Router#run
[node:]$ls -l /misc/config/grpc/dialout/
total 4
-rw-r--r-- 1 root root 4017 dialout.pem

To bypass the TLS option, use grpc no-tls command.


Note


Although TLS provides secure communication between servers and clients, TLS version 1.0 may pose a security threat. You can now disable TLS version 1.0 using the grpc tlsv1-disable command.


Explore the Methods to Establish a Telemetry Session

A telemetry session can be initiated using: either the dial-out mode or the dial-in mode. Although the modes to establish a telemetry session are different, both modes use the same data model and stream the same data.

Dial-Out Mode

In a dial-out mode, the router dials out to the receiver to establish a subscription-based telemetry session. Because the router initiates the connection, there is no need to manage the ports for inbound traffic. In this default mode of operation, the protocols you use to establish a session gives you the flexibility to chose between simplicity (TCP) and security (gRPC). A simple protocol requires only accessibility to the socket on the collector. A secure protocol, additionally, offers security capabilities to authenticate and encrypt the session. You can, therefore, secure your collector, and establish a much advanced method of communication with the router. If the connection between the router and the destination is lost, the router re-establishes the connection with the destination and continues to push data again. However, data transmitted during the time of reconnection is lost.

To explore the dial-out mode, and to create a dial-out session, see Establish a Model-Driven Telemetry Session from a Router to a Collector.

Dial-In Mode

In a dial-in mode, a collector dials in to the router, and subscribes dynamically to one or more sensor paths specified in a subscription. The router is open for connections from the collector. This mode is useful to establish a single channel of communication with the router. Because the collector establishes the session, there is no need to create destinations in the configuration. Additionally, the protocol (gRPC) used to establish a session provides advanced security capabilities to authenticate and encrypt the session. If the connection between the router and the collector is lost, the session is cancelled. The collector must reconnect to the router to restart streaming data. Only gRPC supports dial-in session.

To explore the dial-in mode, and to create a dial-in session, see Establish a Model-Driven Telemetry Session from a Collector to a Router.

Identify the Telemetry Session Suitable for Your Network

The transport protocols and encoding formats in your network help you determine which mode is suitable for your needs. The encoding efficiency is determined by the space that data occupies on the wire, memory utilization, and the amount of data that you plan to stream from the router.

  • Use TCP dial-out mode if you plan to stream telemetry data using a simple setup with a single router and collector. It is simple to configure and does not require extensive knowledge about protocols. It removes the need to manage ports for inbound connections.

  • Use gRPC dial-out mode if your setup involves scaling out to many devices or needs encryption of your data. This mode removes the need to manage ports for inbound connections.

  • Use gRPC dial-in mode if you are already using gRPCin your network and you want your sessions to be dynamic without having the data streamed to fixed destinations. This mode is convenient if you prefer a centralized way configuring your network and requesting operational data.