Troubleshooting Cisco Network Elements with the USE Method

The USE Method is a model for troubleshooting a system that is in distress when you don't know exactly what the nature of the problem is.

For example, if users in a specific part of your network are complaining of slowness, disconnects, and poor application performance, you can probably narrow your troubleshooting to two or three switches or routers. However, because the problem description is so vague (we all love the "it's slow!" report, right?), it's hard to know where to start with detailed troubleshooting on those specific switches/routers.

That's where the USE Method comes in.

I learned about the USE Method while reading Brendan Gregg's blog. Brendan is a very skilled performance engineer specializing in UNIX systems. To quote Brendan:

The USE Method can be summarized as: For every resource, check utilization, saturation, and errors.

Resource: all physical [network element] functional components (eg, CPU, memory)

Utilization: the average time that the resource was busy servicing work

Saturation: the degree to which the resource has extra work which it can't service, often queued

Errors: the count of error events

In this post, I adapt the USE Method to Cisco network devices and show how their physical resources (CPUs, different areas and types of memory, interfaces, and more) can be methodically examined in the three dimensions of utilization, saturation, and errors. Be sure to read Brendan's blog post to understand the logic behind the USE Method and to gain insight into how to apply it.
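To make the "for every resource, check utilization, saturation, and errors" idea concrete, here is a minimal Python sketch that expands a list of resources into a checklist. The resource names and the `use_checklist` helper are hypothetical and only for illustration; the tables in this post name the actual components and commands.

```python
from itertools import product

# Hypothetical resource names, chosen only to illustrate the idea.
RESOURCES = ["control plane CPU", "control plane memory", "data plane"]

# The three dimensions named by the USE Method.
DIMENSIONS = ["utilization", "saturation", "errors"]


def use_checklist(resources, dimensions=DIMENSIONS):
    """Return every (resource, dimension) pair that needs to be checked."""
    return list(product(resources, dimensions))


# Print one checkbox per resource/dimension combination.
for resource, dimension in use_checklist(RESOURCES):
    print(f"[ ] {resource}: {dimension}")
```

Working through the list pair by pair is the point of the method: nothing gets skipped just because it didn't look suspicious at first glance.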

Table of Contents

Routers (ISRg2, ISR 4000, ASR 1000, CSR 1000V) running IOS or IOS-XE
UADP-Based Catalyst Switches (3650, 3850, 4500E Sup8E) running IOS-XE
Catalyst 6500/6800 Series

Routers (ISRg2, ISR 4000, ASR 1000, CSR 1000V) running IOS or IOS-XE

The different colors denote the grouping of physical components.

Component	Type	Metric
Control Plane CPU	Utilization	show proc cpu: general CPU busyness, e.g. "CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%". In "5%/1%", 5% is the total and 1% is the time spent in interrupt context (CEF switching). show proc cpu history: histograms; where was the CPU X seconds/minutes ago? What was peak vs. average?
Control Plane CPU	Saturation	show proc cpu extended: run queue lengths, response times. show logg | inc CPUHOG: is a process sitting on the CPU too long, not willingly giving it up?
Control Plane CPU	Errors	?
Control Plane Memory	Utilization	show proc memory: used/free memory. "Processor" memory is for IOS, processes, etc.; "I/O" memory is for storing packets while they're being switched through the box. show proc memory sorted: is a process hogging memory? Is a process leaking memory?
Control Plane Memory	Saturation	show buffers, show buffers failures: allocation failures, "no memory" failures. show logg: malloc errors.
Control Plane Memory	Errors	show logg. Execute Generic Online Diagnostics (GOLD) tests: diagnostic start ... (run memory test); show diagnostic events; show diagnostic result ...
Data Plane (ISR 4k, ASR 1k, CSR1000V)	Utilization	show platform hardware qfp active datapath utilization: the aggregate in/out data plane utilization.
Data Plane (ISR 4k, ASR 1k, CSR1000V)	Errors	show platform hardware qfp active statistics drop detail: packet drop reasons and counters.
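If you collect the show proc cpu summary line from several devices, it helps to split it into numbers you can compare. The following is a minimal Python sketch that parses the line quoted in the table above. The function name and dictionary keys are my own, and it assumes the line looks exactly like that example; other software releases may format it differently.

```python
import re

# Matches the summary line quoted in the table above.
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu_line(text):
    """Extract the CPU percentages from 'show proc cpu' output.

    Returns None if the summary line is not found.
    """
    match = CPU_LINE.search(text)
    if match is None:
        return None
    total, interrupt, one_min, five_min = map(int, match.groups())
    return {
        "five_sec_total": total,
        "five_sec_interrupt": interrupt,
        # Total minus interrupt time is what was spent in processes.
        "five_sec_process": total - interrupt,
        "one_min": one_min,
        "five_min": five_min,
    }


sample = "CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%"
print(parse_cpu_line(sample))
```

Separating the interrupt figure from the process figure matters because a high interrupt percentage points at traffic being switched by the CPU, while a high process percentage points at a busy process.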

UADP-Based Catalyst Switches (3650, 3850, 4500E Sup8E) running IOS-XE

The different colors denote the grouping of physical components.

Component	Type	Metric
Control Plane CPU (IOS-XE/Linux processes)	Utilization	show processes cpu detailed; show processes cpu detailed | exclude 0.00 (processes with non-zero CPU utilization)
Control Plane CPU (iosd threads)	Utilization	show processes cpu detailed process iosd sorted
Control Plane CPU	Saturation	show platform punt statistics port-asic 0 cpuq -1 direction rx: the number of port-asics depends on platform type and model; "cpuq -1" lists all queues (if you know the specific queue you want to view, substitute its value); look at the "dropped" counters; look for a high packet rate (see "CPU Punt Path Architecture on UADP-Based Switches" / CiscoLive BRKCRS-3146). show platform punt client: look for a high number of packets in a queue over multiple runs of the command; look for incrementing counters in the "failures" columns. show pds tag all | include Active|Tags| (reveals some stats and the name of the queue; see "Decoding CPU Queues on UADP-Based Switches"). show platform punt tx. show logg | inc CPUHOG: is a process sitting on the CPU too long, not willingly giving it up?
Control Plane CPU	Errors	?
Control Plane Memory (IOS-XE/Linux processes)	Utilization	show processes memory sorted (sorts by RSS, descending)
Control Plane Memory (iosd process memory)	Utilization	show processes memory detailed process iosd sorted
Control Plane Memory	Saturation	show buffers: allocation failures, "no memory" failures. show logg: malloc errors.
Control Plane Memory	Errors	show logg. Execute Generic Online Diagnostics (GOLD) tests: diagnostic start ... (run memory test); show diagnostic events; show diagnostic result ...
Data Plane TCAM	Utilization	show platform tcam utilization asic all: each row in the output shows max and used values in the format "XX/YY", where XX is the number of entries and YY is the number of masks. The Security ACEs line can be misleading: PACL/VACL/RACL limits can be reached before the overall security TCAM limit of 3072 entries (see "Troubleshoot Security ACL TCAM Exhaustion on Catalyst 3850 Switches").
Data Plane TCAM	Saturation	show logg (look for messages indicating TCAM is full: MAC addresses can't be learned; ACEs cannot be installed in hardware)
Data Plane TCAM	Errors	?
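Several of the saturation checks above come down to running a command more than once and looking for counters that keep climbing. Here is a minimal Python sketch of that comparison. It assumes you have already pulled the counters out of the command output into dictionaries; the counter names in the example are made up, and the parsing step is left out because the output layout varies by platform and release.

```python
def incrementing_counters(before, after):
    """Return {name: delta} for counters that grew between two snapshots.

    Counters that are missing from the first snapshot, or that went
    down (for example after a clear or a reload), are ignored.
    """
    deltas = {}
    for name, new_value in after.items():
        old_value = before.get(name)
        if old_value is not None and new_value > old_value:
            deltas[name] = new_value - old_value
    return deltas


# Hypothetical counter names and values from two runs of the same command.
first_run = {"queue 3 dropped": 10, "queue 7 dropped": 0, "queue 9 failures": 4}
second_run = {"queue 3 dropped": 25, "queue 7 dropped": 0, "queue 9 failures": 4}

print(incrementing_counters(first_run, second_run))
```

A counter that is large but static is usually old news; one that grows between runs is happening right now and is worth chasing first.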

Catalyst 6500/6800 Series

The different colors denote the grouping of physical components.

Component	Type	Metric
Control Plane CPU	Utilization	show proc cpu: general CPU busyness, e.g. "CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%". In "5%/1%", 5% is the total and 1% is the time spent in interrupt context (CEF switching). show proc cpu history: histograms; where was the CPU X seconds/minutes ago? What was peak vs. average?
Data Plane (Switch Fabric)	Utilization	show fabric utilization all
Data Plane (Switch Fabric)	Errors	show fabric channel-counters
Data Plane (TCAM)	Utilization	show platform hardware capacity pfc show tcam counts
Data Plane (TCAM)	Saturation	show platform hardware capacity pfc show tcam counts
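The "peak vs. average" question from the CPU row applies to any utilization figure you sample over time. As a rough Python sketch, assuming you have already gathered a list of percentage samples by polling a device: the `summarize_utilization` name and the 80% default threshold are arbitrary choices of mine, not Cisco guidance.

```python
from statistics import mean


def summarize_utilization(samples, threshold=80):
    """Summarize a list of utilization percentages.

    The threshold default is an arbitrary illustration value.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return {
        "peak": max(samples),
        "average": mean(samples),
        # How many samples were above the threshold.
        "over_threshold": sum(1 for s in samples if s > threshold),
    }


# Hypothetical five-second CPU samples, in percent.
print(summarize_utilization([5, 4, 90, 3]))
```

A low average with a high peak suggests short bursts, which an average-only view would hide.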