Objectives
Be able to quickly isolate problematic nodes in the datacenter
Become familiar with troubleshooting in NX-OS
Understand Nexus 5000 and Nexus 2000 platform details
Gain comfort using Nexus 5000 and Nexus 2000 day to day
BRKCRS-3145
Cisco Public
Problem Isolation
A problem well stated is a problem half solved
Troubleshooting Tool #1
A current, accurate diagram
Physical ports
Logical ports
Spanning-tree root and blocked ports
[Example diagram: N7k-1 and N7k-2 (RSTP root) form vPC domain 100 with peer-link Po100 (e1/2, 2/2) and peer-keepalive e1/1 - e1/1. Below them, N5k-1 and N5k-2 form vPC domain 101 (peer-link Po101 on e1/1, 1/2) and N5k-3 and N5k-4 form vPC domain 102 (peer-link Po102 on e1/1, 1/2). vPCs po1 and Po2 join the layers over e3/1, e3/2, e4/1, e4/2 on the N7k side and e1/30, e1/31 on the N5k side. N5k-5 attaches over e1/10 - e1/10 and e1/12 - e1/12, with one port marked STP BLK.]
show_tech_out.gz
Logging
Often overlooked, but very important
show logging logfile
Basis for tracing events chronologically
Try using start-time or last
N5k-1# show logging logfile start-time 2011 Mar 9 20:00:00
2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/1 is down (None)
2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/3 is down (None)
N5k-1# show logging last ?
<1-9999> Enter number of lines to display
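Once a logfile has been copied off the switch, the same time-window filtering can be done offline. The following is a minimal Python sketch, assuming each entry begins with a "YYYY Mon D HH:MM:SS" timestamp as in the output above; the function names are illustrative and the first sample line is made up for the example.

```python
from datetime import datetime

def parse_log_time(line):
    """Return the timestamp at the start of a logfile line, or None."""
    parts = line.split()
    if len(parts) < 4:
        return None
    try:
        # Example prefix: "2011 Mar 9 20:17:18"
        return datetime.strptime(" ".join(parts[:4]), "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # wrapped continuation line or unrelated text

def events_since(lines, start):
    """Keep (timestamp, line) pairs whose timestamp is at or after start."""
    kept = []
    for line in lines:
        stamp = parse_log_time(line)
        if stamp is not None and stamp >= start:
            kept.append((stamp, line))
    return kept

sample = [
    # Hypothetical earlier entry, added only to show the filter working.
    "2011 Mar 9 19:59:01 esc-n5548-1 %ETHPORT-5-IF_UP: Interface Ethernet1/1 is up",
    "2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/1 is down (None)",
    "2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/3 is down (None)",
]
recent = events_since(sample, datetime(2011, 3, 9, 20, 0, 0))
for stamp, line in recent:
    print(stamp.isoformat(), line.split(": ", 1)[-1])
```

Keeping the parsed timestamp next to each line makes it easy to line these events up against output from other switches.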
NX-OS
Operation Tips
Tab auto-complete and ? help match only within the current context, but a command
from a higher-level mode will still execute if it is available.
N5k-3(config-if)# switch?
switchport Configure switchport parameters <=== matching in config-if mode
N5k-3(config-if)# switchn?
switchname Configure system's host name
NX-OS
Operation Tips
CLI list and grep
NX-OS
File Structure
Mounts can fill up. Watch /var/tmp: it is cleared only by a reload or with TAC assistance.
A full /var/tmp can cause upgrade errors and unexpected log messages
N5k-1# show system internal flash
Mount-on                  1K-blocks  Used     Available  Use%  Filesystem
/                         204800     111460   93340      55    /dev/root
/proc                     0          0        0          0     proc
/sys                      0          0        0          0     none
/isan                     1536000    453760   1082240    30    none
/var/tmp                  131072     108      130964     1     none
/var/sysmgr               512000     4700     507300     1     none
/var/sysmgr/ftp           204800     48604    156196     24    none
/var/sysmgr/ftp/cores     20480      0        20480      0     none
/callhome                 32768      0        32768      0     none
/dev/shm                  262144     95936    166208     37    none
/volatile                 61440      0        61440      0     none
/debug                    2048       4        2044       1     none
/dev/mqueue               0          0        0          0     none
/mnt/cfg/0                39257      4332     32898      12    /dev/sda5
/mnt/cfg/1                37242      4332     30987      13    /dev/sda6
/var/sysmgr/startup-cfg   102400     3112     99288      4     none
/dev/pts                  0          0        0          0     devpts
/mnt/plog                 56192      1784     54408      4     /dev/mtdblock2
/mnt/pss                  39273      6058     31187      17    /dev/sda4
/bootflash                859848     768664   47504      95    /dev/sda3
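A saved copy of this output can be scanned for mounts that are close to full. The sketch below is illustrative Python, assuming one mount per line with six whitespace-separated columns (Mount-on, 1K-blocks, Used, Available, Use%, Filesystem); the 80 percent threshold and the function name are assumptions, not values from the platform.

```python
def mounts_over(text, threshold=80):
    """Return (mount, use_percent) pairs at or above threshold."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: Mount-on 1K-blocks Used Available Use% Filesystem
        if len(fields) != 6 or not fields[4].isdigit():
            continue  # header, blank, or wrapped line
        use_percent = int(fields[4])
        if use_percent >= threshold:
            flagged.append((fields[0], use_percent))
    return flagged

sample = """\
Mount-on    1K-blocks  Used    Available  Use%  Filesystem
/var/tmp    131072     108     130964     1     none
/bootflash  859848     768664  47504      95    /dev/sda3
"""
for mount, used in mounts_over(sample):
    print(f"{mount} is {used}% full")
```

Run against the output above, this would point at /bootflash, which is the kind of condition worth catching before an upgrade.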
NX-OS
File Structure
volatile: filesystem is virtual, use as scratch if needed
Obviously volatile, will not survive a reload
log: filesystem is in root /
N5k-1# debug logfile CiscoLive_debugs
N5k-1# show debug
Output forwarded to file CiscoLive_debugs (size: 4194304 bytes)
Debug level is set to Minor(1)
N5k-1# dir log:
          0    Apr 04 01:14:01 2011  CiscoLive_debugs
         31    Mar 11 11:38:35 2011  dmesg
          0    Mar 11 11:38:57 2011  libfipf.4365
      79101    Apr 04 00:34:02 2011  messages
       6670    Apr 04 00:06:01 2011  startupdebug
N5k-1# copy log:CiscoLive_debugs tftp:
Enter vrf: management
Enter hostname for the tftp server: 10.91.42.134
Trying to connect to tftp server......
Connection to Server Established.
|
TFTP put operation was successful
N5k-1# clear debug-logfile CiscoLive_debugs
-OR-
N5k-1# undebug all
NX-OS
FSM
NX-OS records the finite state machine for many important processes
Using this event-history of FSM states and triggers, debugging can be done
after a problem has occurred.
Some common processes:
ethpc: ethernet port client; responsible for talking to the MAC and PHY
ethpm: ethernet port manager; responsible for translating between
configuration and ethpc. ethpc informs ethpm that the link is up, and
ethpm then gives instructions on what the configuration is for the port
port-channel: port-channeling process; responsible for aggregating
physical links into logical channels
lacp: implements the 802.3ad standard for aggregating links
fwm: forwarding manager; responsible for programming hardware
according to the software configuration
It is important to compare timestamps and watch for inter-process
communication.
NX-OS
FSM
Sometimes it is enough to look at one process
FSM, other times you are looking for related events.
Timestamps should line up when there is causality.
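One way to look for related events is to merge the event histories of several processes into a single timeline. The Python sketch below assumes each history has already been reduced to (time, process, description) tuples sorted by time; the events themselves are hypothetical and only reuse the kind of timestamps seen in this session.

```python
import heapq

def merge_histories(*histories):
    """Merge per-process event lists (each sorted by time) into one timeline."""
    return list(heapq.merge(*histories, key=lambda event: event[0]))

# Hypothetical events. Zero-padded 24-hour strings sort correctly as text
# as long as all events fall within the same day.
ethpc_events = [("13:16:47", "ethpc", "link up detected")]
ethpm_events = [
    ("13:16:47", "ethpm", "link up received from ethpc"),
    ("13:16:49", "ethpm", "port brought up"),
]
fwm_events = [("13:16:54", "fwm", "hardware programmed")]

timeline = merge_histories(ethpc_events, ethpm_events, fwm_events)
for stamp, process, text in timeline:
    print(stamp, process, text)
```

Events from different processes that land on the same second are good candidates for cause and effect.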
NX-OS
FSM
A given fex host interface shows port cfg message
Indicates preparation to enable the interface
*e1/3 up at 13:16:49
2 13:16:54
NX-OS
MTS
NX-OS uses Message and Transaction
Service(MTS) to communicate between processes.
When Troubleshooting CPU issues, we can check
MTS for a large queue of messages.
NX-OS
MTS
persistent queue is allowed to grow old
N5k-1# show system internal mts buffers details
Node/Sap/queue  Age(ms)  SrcNode  SrcSAP  DstNode  DstSAP  OPC    MsgId       MsgSize
sup/284/pers    2387380  0x101    1231    0x101    284     86017  1301448368  868
sup/284/pers    14398    0x101    1238    0x101    284     86017  1301470493  868
sup/284/pers    3028     0x101    1897    0x101    284     86017  1301473115  868
sup/284/pers    818      0x101    1328    0x101    284     86017  1301473633  868
sup/284/pers    577      0x101    1236    0x101    284     86017  1301473693  868
sup/284/pers    42       0x101    32562   0x101    284     86017  1301473831  868
NX-OS
MTS
recv queue should not grow old
MsgId MsgSize
1221952768 192
1221953842 328
1221971222 2452
1301415915 328
1301432732 2452
1301448663 192
NX-OS
MTS
MTS messages have been addressed to SAP 0 due
to a bug.
Reload was needed to clear this scenario
N5k-1# sh system internal mts sup sap 0 description
Not implemented
N5k-1# sh system internal mts sup sap 32 description
Syslog Sup Node Cfg
N5k-1# show system internal sysmgr service name syslogd
Service "syslogd" ("syslogd", 75):
UUID = 0x21, PID = 3924, SAP = 32
State: SRV_STATE_HANDSHAKED (entered at time Sat May 15 05:01:20
2010). Restart count: 1
Time of last restart: Sat May 15 05:01:20 2010. The service never
crashed since the last reboot.
Tag = N/A
Plugin ID: 0
NX-OS
Crashes
NX-OS attempts to create a core file with information helpful to aid in finding
and fixing the problem
stack trace
memory contents
Some processes in NX-OS are able to be restarted in a stateful manner.
Nexus 5000 is a single-supervisor platform; critical processes require a
system restart upon a crash.
NX-OS
Crashes
show process log
View status of all processes, including if a core was created
N5k-1# show process log
Process           PID   Normal-exit  Stack  Core  Log-create-time
----------------  ----  -----------  -----  ----  ------------------------
eth_port_channel  2743  N            Y      N     Wed Mar 17 17:20:57 2010
eth_port_channel  2761  N            Y      N     Tue Aug 3 19:14:58 2010
fwm               2703  N            Y      N     Fri Oct 8 19:24:12 2010
...
NX-OS
Crashes
When the NX-OS system manager (sysmgr) resets the switch, a core file for
the offending process is often generated.
N5k-1# show cores
Module-num  Instance-num  Process-name  PID   Core-create-time
----------  ------------  ------------  ----  ----------------
1           1             fwm           2723  Sep 17 16:34
NX-OS
Crashes
Sometimes a core file does not exist
not enough room in the file system
kernel crashes
NX-OS
Crashes
In addition to the core file, circumstantial evidence around the time of the
crash is helpful:
Was there a configuration change?
Was there a physical topology change?
Can this be reproduced?
Was there a recent upgrade?
Cisco Public
32
CRC errors
Ethanalyzer / CPU
Queuing and forwarding
SPAN
Spanning-tree
Nexus 2000
Cisco Public
33
Hardware overview
Any discussion of forwarding errors and troubleshooting usually involves drops
We have to know the basic hardware layout in order to know where to look for
problems
The following hardware overview is a preview of
BRKARC-3452 Cisco Nexus 5000/5500 and 2000 Switch Architecture
[Figure: Nexus 5000 data plane. Several Unified Port Controllers (UPCs), each serving SFP ports, connect through the Unified Crossbar Fabric.]
[Figure: Nexus 5500 control plane. Intel Jasper Forest CPU, DDR3 DRAM, south bridge, flash, NVRAM, serial console, PEX 8525 4-port PCIe switch, dual-gig PCIe NICs, Mgmt 0, L1 and L2 ports, and Gen 2 UPCs.]
[Figure: Unified Crossbar Fabric with unicast iSLIP scheduler.]
[Figure: Unified Port Controller and Unified Port Controller 2. Each has a forwarding controller and per-port MMAC + Buffer + Forwarding blocks.]
[Figure: Nexus 5000 CPU connectivity. Intel LV Xeon 1.66 GHz CPU, south bridge, and NICs eth3 and eth4 toward the Unified Port Controller, plus mgmt0. Control traffic such as BPDU, CFS, and ICMP reaches the CPU this way.]
[Figure: Packet walk in numbered steps through the ingress UPC, the Unified Crossbar Fabric, and the egress UPC.]
Source Interface     Destination Interface  Switching Mode
10 GigabitEthernet   10 GigabitEthernet     Cut-Through
10 GigabitEthernet   1 GigabitEthernet      Cut-Through
1 GigabitEthernet    1 GigabitEthernet      Store-and-Forward
1 GigabitEthernet    10 GigabitEthernet     Store-and-Forward
FCoE                 Fibre Channel          Cut-Through
Fibre Channel        FCoE                   Store-and-Forward
Fibre Channel        Fibre Channel          Store-and-Forward
FCoE                 FCoE                   Cut-Through
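The switching-mode table can be captured as a simple lookup when deciding where stomped or corrupted frames might propagate. This Python sketch assumes each row of the table reads as source interface, destination interface, then mode; the short interface labels and the function name are illustrative.

```python
CUT_THROUGH = "Cut-Through"
STORE_AND_FORWARD = "Store-and-Forward"

# (source, destination) -> switching mode, one entry per table row.
SWITCHING_MODE = {
    ("10GE", "10GE"): CUT_THROUGH,
    ("10GE", "1GE"): CUT_THROUGH,
    ("1GE", "1GE"): STORE_AND_FORWARD,
    ("1GE", "10GE"): STORE_AND_FORWARD,
    ("FCoE", "FC"): CUT_THROUGH,
    ("FC", "FCoE"): STORE_AND_FORWARD,
    ("FC", "FC"): STORE_AND_FORWARD,
    ("FCoE", "FCoE"): CUT_THROUGH,
}

def switching_mode(source, destination):
    """Look up the switching mode for a source/destination interface pair."""
    return SWITCHING_MODE[(source, destination)]

print(switching_mode("10GE", "1GE"))
print(switching_mode("1GE", "10GE"))
```

Read this way, the mode in every row follows the source interface type: 10 Gigabit Ethernet and FCoE sources are cut-through, 1 Gigabit Ethernet and Fibre Channel sources are store-and-forward.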
[Figure: a frame being parsed in order: Ethernet header, IPv4 header, IP payload, FCS. Corruption inside the frame shows up as a bad CRC only when the FCS is reached.]
A frame arrives to be parsed but is corrupted.
Cut-through reads only the destination MAC address to
make the forwarding decision.
Here, the decision to forward is made while unaware of the corruption
to follow
It is not until the FCS field in the Ethernet trailer arrives that the CRC
value can be calculated and checked
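The ordering problem can be shown in a few lines of Python. This is a sketch, not switch code: it assumes the Ethernet FCS is the standard CRC-32 as computed by zlib.crc32 and appended in little-endian byte order, and the frame contents and function names are made up for illustration.

```python
import struct
import zlib

def add_fcs(frame_without_fcs):
    """Append a CRC-32 frame check sequence to the frame bytes."""
    fcs = zlib.crc32(frame_without_fcs) & 0xFFFFFFFF
    return frame_without_fcs + struct.pack("<I", fcs)

def fcs_ok(frame):
    """Recompute the CRC over everything before the 4-byte trailer."""
    body, trailer = frame[:-4], frame[-4:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == trailer

def cut_through_lookup_key(frame):
    """Cut-through decides from the first 6 bytes: the destination MAC."""
    return frame[:6].hex()

dst = bytes.fromhex("ffffffffffff")       # broadcast destination
src = bytes.fromhex("000decd602e4")
header = dst + src + struct.pack("!H", 0x0800)
good = add_fcs(header + bytes(46))        # 46-byte zero payload

# Corrupt one payload byte after the header, as a bad fiber might.
damaged = bytearray(good)
damaged[20] ^= 0xFF
damaged = bytes(damaged)

print(cut_through_lookup_key(good) == cut_through_lookup_key(damaged))
print(fcs_ok(good), fcs_ok(damaged))
```

Both frames produce the same forwarding decision, because the decision uses bytes that arrive long before the damage and the trailer; only the check over the complete frame tells them apart.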
IP length error
Ethernet length error
when the type/length field is <= 1500 (0x5dc) it is interpreted as a
length rather than an ethertype
Invalid Ethernet preamble
Received and originated errors will count as TX output
errors.
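The length-versus-ethertype rule can be written out explicitly. The Python sketch below follows the IEEE 802.3 convention that values up to 1500 (0x05DC) are a length and values from 1536 (0x0600) upward are an ethertype; the function name is illustrative.

```python
def classify_type_length(value):
    """Classify the 16-bit type/length field of an Ethernet frame."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("type/length field is 16 bits")
    if value <= 1500:          # 0x05DC and below: payload length
        return "length"
    if value >= 0x0600:        # 1536 and above: ethertype
        return "ethertype"
    return "undefined"         # 1501 to 1535 is neither

for value in (46, 1500, 0x05FF, 0x0800):
    print(hex(value), classify_type_length(value))
```

A frame whose field decodes as a length is then expected to carry exactly that many payload bytes, which is where an Ethernet length error can come from.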
[Diagram sequence: topology of N7k-1, N5k-1, and N5k-2 carrying VLAN 7 and VLAN 8, with ports labelled e1/11 and e1/12 toward N7k-1 and e1/7, e1/5, e1/1, e1/4, and e1/3 on the N5ks. The slides follow 6 broadcast packets through the topology, first with a bad fiber on one link and then with a 4000B frame transmitted where the class-based MTU is 1500. Front-panel to internal port mappings shown along the way: e1/1 = 7:2, e1/5 = 7:1, e1/3 = 0:2, e1/7 = 0:1.]
Nexus 5000
CRC errors
Ethanalyzer / CPU
NX-OS
High CPU
Hardware accelerated switches do not rely on the CPU for frame forwarding
and processing.
*Some L3 paths do require the CPU path: if hardware entries are missing,
packets are punted to the CPU
CPU is critical for control-plane activities:
LACP: without keeping up with LACPDUs, 802.3ad port-channels would
go down
STP and STP Bridge Assurance: a downstream switch that misses BPDUs
will move a blocked port to forwarding. If the CPU cannot keep up with
sending BPDUs, loops can form. Bridge Assurance helps in some ways:
instead of going forwarding, a BA-enabled switch will block the
interface (Bridge Assurance inconsistent state).
vPC programming: MAC addresses learned on vPC interfaces must be
installed on both switches in order to prevent flooding as well as deliver
frames to their destination
Redundancy: in the event of a switch outage, the CPU needs to
reprogram state information for all processes and configure MAC addresses
on interfaces in their respective VLANs.
Configuration and management: an unresponsive switch is not useful as
a troubleshooting tool, and you are blind without a reliable interface with
the network
NX-OS
High CPU
Hopefully you have a baseline to compare the current CPU trends with a
known nominal state
Always gather 3 commands repeating frequently
| exclude 0.0
PID    Runtime(ms)  Invoked   uSecs  1Sec   Process
-----  -----------  --------  -----  -----  -----------
4120   1137         10931494         17.5%  pfma
4204   1477         84831831         1.9%   gatosusd
Load average:   1 minute: 0.63   5 minutes: 1.35   15 minutes: 1.41
Processes
CPU states  :   1.0% user,   8.9% kernel,   90.1% idle
Memory usage:   2073408K total,   1412108K used,   661300K free
NX-OS
High CPU
Note the difference between *, maximum CPU and #, average CPU
This is a completely normal looking graph, try to focus on extended high
average CPU periods
N5k-1# show process cpu history
1
11
789509607796857706878950694778698849688895079850886958858500
753105000482598603786430941227125016911055026100692801248500
100
** *
90
** ** *
80 *** ** *
* *
* * ** * *
* * *** **** * *
* **
* * *
**
* * *
*** *
* **
70 *** ** **** * *** **** *** *** *** ****** **** *** * ** * **
60 *** ****************** *** ******* *********** ***** ** ****
50 ************************** ******* *************************
40 ************************************************************
30 ***********************************************************#
20 *##**#*******#***********#*#*#**#**##*###*###**##****#****##
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0
# = average CPU%
Cisco Public
80
NX-OS
Ethanalyzer
Display and capture control-plane frames with the built-in Ethanalyzer utility
Based on the Wireshark project, with an NX-OS command front end
Can display like tshark, or capture to a .pcap file to analyze elsewhere
Can be used on mgmt0 as well as on eth3 and eth4, the low and high priority
CPU queues
[Figure: CPU NICs eth3 (low priority) and eth4 (high priority) behind the UPC, and eth0 (MGMT0), with example control-plane traffic: CDP, ICMP, ARP, CFS, LACPDU, BPDU, DCBX.]
NX-OS
Ethanalyzer example
capture mgmt0 traffic and save to a file on bootflash
view capture files
copy off for further analysis
N5k-1# ethanalyzer local interface mgmt write bootflash:managementCAP
Program exited with status 0.
N5k-1# dir bootflash: | inc management
1224
managementCAP
NX-OS
Ethanalyzer example
capture high priority traffic with capture-filter and display to terminal
N5k-1# ethanalyzer local interface inbound-hi capture-filter "not ip"
Capturing on eth4
wireshark-broadcom-rcpu-dissector: ethertype=0xde08, devicetype=0x0
2005-02-11 20:36:50.251412 00:0d:ec:d6:02:e4 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809d
2005-02-11 20:36:50.252075 00:0d:ec:d6:02:e0 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x8099
2005-02-11 20:36:50.252204 00:0d:ec:d6:02:e1 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809a
2005-02-11 20:36:50.252317 00:0d:ec:d6:02:e9 -> 01:80:c2:00:00:00 STP Conf. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x80a2
2005-02-11 20:36:50.252426 00:0d:ec:d6:02:e8 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x80a1
2005-02-11 20:36:50.391691 00:0d:ec:d3:b5:f4 -> 01:80:c2:00:00:0e LLC U, func=UI; SNAP, OUI 0x00000C (Cisco), PID 0x0134
2005-02-11 20:36:50.803069 00:12:43:01:b0:98 -> 01:80:c2:00:00:00 STP Conf. Root = 8291/00:d0:03:62:4c:00 Cost = 0 Port = 0x8081
2005-02-11 20:36:52.251349 00:0d:ec:d6:02:e4 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809d
2005-02-11 20:36:52.251366 00:0d:ec:d6:02:e0 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x8099
2005-02-11 20:36:52.251373 00:0d:ec:d6:02:e1 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809a
NX-OS
Ethanalyzer and CPU
Using Ethanalyzer to help identify external causes of high CPU utilization
N5k-1# show system resources
Load average:   1 minute: 0.95   5 minutes: 1.54   15 minutes: 1.46
Processes
CPU states  :   26.7% user,   26.7% kernel,   46.5% idle
Memory usage:   2073408K total,   1412172K used,   661236K free
PID    Runtime(ms)  Invoked   uSecs  1Sec   Process
-----  -----------  --------  -----  -----  -----------
4230   398          5011881          22.0%  snmpd
4204   1467         84869127         20.2%  gatosusd
4226   433          5601856          5.5%   statsclient
4264   1380         391510           3.7%   ethpm
4302   254          103       2468   1.8%   netstack
NX-OS
Ethanalyzer and CPU
Baseline per second
esc-n5020-1# show process cpu history
211111111131111111111121111111131111111114111111831112111111
002244240786947901001225201001390000110010000902910013010023
100
90
80
70
60
50
40
30
##
20 #
#### ##
##
##
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0
BRKCRS-3145
Cisco Public
85
NX-OS
Ethanalyzer and CPU
Observed spike in CPU (per second)
N5k-1# show process cpu history
[Graph: per-second CPU history during the spike. Average CPU (#) climbs well above the baseline, with many samples between 60% and 100%.]
NX-OS
Ethanalyzer and CPU
Baseline per minute
N5k-1# show process cpu history
11
789509607796857706878950694778698849688895079850886958858500
753105000482598603786430941227125016911055026100692801248500
100
** *
90
** ** *
80 *** ** *
* *
* * ** * *
* * *** **** * *
* **
* * *
**
* * *
*** *
* **
70 *** ** **** * *** **** *** *** *** ****** **** *** * ** * **
60 *** ****************** *** ******* *********** ***** ** ****
50 ************************** ******* *************************
40 ************************************************************
30 ***********************************************************#
20 *##**#*******#***********#*#*#**#**##*###*###**##****#****##
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0
BRKCRS-3145
# = average CPU%
Cisco Public
87
NX-OS
Ethanalyzer and CPU
We also notice a spike in average CPU over the past 5 minutes
899074676686870687895096077968577068789506947786988496888950
189068779462040167531050004825986037864309412271250169110550
100
***
90
***
80 *****
*
*
*
** *
* * ** ** *
* * * **** ** *
* *
* * ** * *
* **
* * *** **** * *
* *
* *
* *** * **** *
70 ***** *** * *** **** ** **** * *** **** *** *** *** ****** *
60 **#** ************** ****************** *** ******* ********
50 *##**************************************** ******* ********
40 ###*#*******************************************************
30 ######******************************************************
20 #######******#****##**#*******#***********#*#*#**#**##*###*#
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0
# = average CPU%
Cisco Public
88
NX-OS
Ethanalyzer and CPU
Capturing on mgmt, we see there is an snmpwalk occurring
This should be a temporary condition and should not affect switching
performance, but you may notice latency on the terminal
It could affect other control-plane transactions like configuration backups,
collection scripts, etc.
Now you can check with your network management team to work out when
this is appropriate or if this is a mistake. A full walk is not very efficient to run
regularly.
N5k-1# ethanalyzer local interface mgmt capture-filter "not host 10.116.114.157"
Capturing on eth0
wireshark-broadcom-rcpu-dissector: ethertype=0xde08, devicetype=0x0
2005-02-11 21:25:48.452632 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.455871 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.458120 172.18.118.162 -> 172.18.118.34 SNMP get-response
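When the capture is long, counting conversations makes the heavy talker obvious. This Python sketch assumes the terminal display format above (date, time, source, "->", destination, protocol, summary) saved to text; the function name is illustrative.

```python
from collections import Counter

def top_talkers(lines):
    """Count (source, destination, protocol) triples in displayed capture lines."""
    pairs = Counter()
    for line in lines:
        fields = line.split()
        # Expected: date time source -> destination protocol summary...
        if len(fields) >= 6 and fields[3] == "->":
            pairs[(fields[2], fields[4], fields[5])] += 1
    return pairs.most_common()

capture = [
    "2005-02-11 21:25:48.452632 172.18.118.162 -> 172.18.118.34 SNMP get-response",
    "2005-02-11 21:25:48.455871 172.18.118.34 -> 172.18.118.162 SNMP get-next-request",
    "2005-02-11 21:25:48.458120 172.18.118.162 -> 172.18.118.34 SNMP get-response",
]
for (src, dst, proto), count in top_talkers(capture):
    print(f"{count:4d}  {src} -> {dst}  {proto}")
```

A steady stream of get-next-request and get-response pairs between the same two hosts is the signature of a walk.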
Nexus 5000
CRC errors
Ethanalyzer / CPU
Cisco Public
90
Cisco Public
91
BRKCRS-3145
Cisco Public
92
sched-type
oper-bandwidth
WRR
50
WRR
50
RX Queuing
qos-group 0
q-size: 243200, HW MTU: 1600 (1500 configured)
drop-type: drop, xon: 0, xoff: 1520
Statistics:
Pkts received over the port
: 100882627
: 100877529
: 0
: 786990
: 692821
: 5098
Per-priority-pause status
: Rx (Inactive), Tx (Inactive)
Cisco Public
93
[Diagram: N5k-1 and N5k-2 joined by a trunk on e1/5, with Server A and Server B attached on e1/1 and e1/3. On N5k-1 the front-panel ports map to internal ASIC ports: e1/1 = 7:2, e1/5 = 7:1. The ASIC is gatos on Nexus 5000 and carmel on Nexus 5500.]
For this example, we will use Nexus 5000 outputs, but on Nexus 5500 you can
substitute carmel for gatos in the commands, as the two are laid out in a
similar architecture.
The actual counters and errors may vary; the methodology does not
N5k-1# show platform fwm info pif e1/1 | grep stats
Eth1/1 pd: tx stats: bytes 147694477 frames 0 discard 0 drop 0
Eth1/1 pd: rx stats: bytes 26022500 frames 0 discard 0 drop 0
Eth1/1 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/1 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0
N5k-1# show platform fwm info pif e1/5 | grep stats
Eth1/5 pd: tx stats: bytes 476497477 frames 0 discard 0 drop 0
Eth1/5 pd: rx stats: bytes 232322392 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0
N5k-1# show platform fwm info pif e1/5 | grep stats
Eth1/5 pd: tx stats: bytes 332298390 frames 0 discard 0 drop 0
Eth1/5 pd: rx stats: bytes 176797274 frames 0 discard 0 drop 208
Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0
N5k-1# show platform fwm info asic-errors 7
Printing non zero Gatos error registers:
DROP_SRC_VLAN_MBR: res0 = 624 res1 = 0
DROP_SRC_VLAN_MBR is 624
This counter is 3x the number of frame drops - hardware
caveat
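The 3x relationship can be checked against the rx drop count of 208 seen on e1/5. A small Python sketch, with an illustrative function name:

```python
def frames_from_drop_src_vlan_mbr(register_value, multiplier=3):
    """Convert the DROP_SRC_VLAN_MBR register value to a frame count."""
    frames, remainder = divmod(register_value, multiplier)
    if remainder:
        # Counters sampled at different moments may not divide evenly.
        print(f"note: {register_value} is not a multiple of {multiplier}")
    return frames

print(frames_from_drop_src_vlan_mbr(624))
```

624 divided by 3 is 208, which agrees with the drop counter reported by show platform fwm info pif e1/5.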
N5k-1# show hardware internal gatos asic 7 counters interrupt
...
gat_lu_lkup1_INT_func_lo_drop_src_vlan_mbr|74
|
...
Interrupt counters will agree that a given error has fired from the
hardware, but the number is HEX and we also do not record
every interrupt due to the rate at which interrupts can hit CPU.
Generally this number will be somewhat less than the fwm pif
drop number.
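Because the interrupt counter is printed in hex, convert it before comparing. A short Python sketch using the value 74 from the output above and the fwm pif drop count of 208 from e1/5:

```python
def interrupt_count(hex_text):
    """Convert an interrupt counter printed in hex to a decimal count."""
    return int(hex_text, 16)

interrupts = interrupt_count("74")
pif_drops = 208
print(interrupts, interrupts <= pif_drops)
```

Hex 74 is 116 decimal, which is lower than the 208 drops counted on the port, as expected when not every interrupt is recorded.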
N5k-1# interface Ethernet1/5
switchport mode trunk
switchport trunk allowed vlan 100-103
N5k-1# interface Ethernet1/5
switchport mode trunk
switchport trunk allowed vlan 100-102
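Comparing the two allowed-VLAN lists shows which VLAN is dropped for source VLAN membership. This Python sketch expands lists written as comma-separated ranges; the function name is illustrative.

```python
def expand_vlans(spec):
    """Expand an allowed-VLAN string such as "100-103,200" into a set of IDs."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            vlans.update(range(int(low), int(high) + 1))
        elif part:
            vlans.add(int(part))
    return vlans

first = expand_vlans("100-103")
second = expand_vlans("100-102")
print(sorted(first - second))   # allowed only in the first configuration
print(sorted(second - first))   # allowed only in the second configuration
```

VLAN 103 is allowed in one configuration but not the other, so frames tagged with VLAN 103 are dropped on arrival at the side that does not allow it.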
Cisco Public
103
Nexus 5000
CRC errors
Ethanalyzer / CPU
Cisco Public
104
Spanning-tree
BRKCRS-3145
Cisco Public
105
Spanning-tree
BRKCRS-3145
Cisco Public
106
Spanning-tree
VLAN0001 [0000.0000.0000.0000 C 0 A
0 R
VLAN0001 [8001.000d.ecd6.02fc C 0 A
0 R
VLAN0001 [2063.00d0.0362.4c00 C 2 A
1 R
VLAN0001 [2063.00d0.0362.4c00 C 2 A
1 R
VLAN0001 [2063.00d0.0362.4c00 C 2 A
1 R
VLAN0001 [2063.00d0.0362.4c00 C 2 A
1 R
VLAN0001 [2063.00d0.0362.4c00 C 2 A
1 R
VLAN0001 [2063.00d0.0362.4c00 C 2 A
1 R
VLAN0001 [2063.00d0.0362.4c00 C 2 A
1 R
BRKCRS-3145
Cisco Public
107
Nexus 5000
Nexus 2000
Management
BRKCRS-3145
Cisco Public
108
FEX Management
FEX fabric interfaces run SDP (Satellite Discovery
Protocol)
You can view the status of a FEX and see some
logs from the N5k:
N5k-1# show fex 100
FEX: 100 Description: FEX0100
state: Online
Max-links: 1
BRKCRS-3145
Cisco Public
109
FEX Management
N5k-1# show fex 100 detail
FEX: 100 Description: FEX0100
state: Online
Cisco Public
110
FEX Management
N5k-1# show system internal fex log fport e1/3
Satmgr debug messages for If 0x1a002000:
[19952]02/02/2005 13:08:32.191646: if [0x1a002000]:Phy cleanup rcvd
[19956]02/02/2005 13:08:32.192257: fport [0x1a002000]:Log - Interface Down
[19957]02/02/2005 13:08:32.192266: fport [0x1a002000]:satmgr_fport_fsm: even:t Port Down. curr state: Discovered
[19958]02/02/2005 13:08:32.192654: fport [0x1a002000]:Log - State changed to: Created
[19962]02/02/2005 13:08:32.192853: fport [0x1a002000]:satmgr_fport_fsm: new state: Created
[19967]02/02/2005 13:08:32.193991: fport [0x1a002000]:Log - fport phy cleanup retry end: sending out resp
[19970]02/02/2005 13:08:32.206315: if [0x1a002000]:Pre Cfg rcvd
BRKCRS-3145
Cisco Public
111
Nexus 5000
Nexus 2000
Management
BRKCRS-3145
Cisco Public
112
FEX Drops
Network interface drops can be seen from the N5k with
show queuing interface, as of 5.0(3)N1(1)
Best to attach to the FEX to get detailed logs
Similar to Cat 6k or Nexus 7k linecard commands
Important to check here, as FEXes also have crash
logs, have their own CPU, and are responsible for
communicating link state and offloading some
protocols like CDP.
N5k-1# attach fex 100
Attaching to FEX 100 ...
To exit type 'exit', to abort type '$.'
fex-100#
BRKCRS-3145
Cisco Public
113
FEX Drops
The scenario we are looking for is big pipe to little
pipe, or many to one.
Know the flow of traffic! If you know the pattern,
finding where it is likely to stress the network will be
easier.
10G to 1G is especially difficult to buffer. The FEX is
the last stop where 10G traffic is buffered for your 1G
hosts, so drops are likely to occur here and not
elsewhere in your 10G network.
FEX queue-limit and buffer-threshold can be
adjusted globally, per fex-type, or per fex
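A quick calculation shows why the speed step-down is hard to buffer. This Python sketch estimates how long a sustained 10G burst can be absorbed before a buffer draining at 1G overflows; the 100,000-byte buffer is a hypothetical figure chosen for the arithmetic, not a FEX specification.

```python
def seconds_until_overflow(buffer_bytes, ingress_bps, egress_bps):
    """Time for a buffer to fill when ingress outpaces egress."""
    if ingress_bps <= egress_bps:
        return float("inf")  # the buffer never fills
    return buffer_bytes * 8 / (ingress_bps - egress_bps)

# Hypothetical 100,000-byte buffer, 10 Gb/s in, 1 Gb/s out.
fill_time = seconds_until_overflow(100_000, 10_000_000_000, 1_000_000_000)
print(f"{fill_time * 1e6:.1f} microseconds")
```

Under these assumptions the buffer fills in roughly 89 microseconds, so even short bursts toward a 1G host can produce tail drops.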
BRKCRS-3145
Cisco Public
115
FEX Drops
2148
fex-100# dbgexec rw
rw> show ints <0-6>
ASIC: 0:
+-------+--------------------------+--------------+-----------+-----------+-----------+
| ASIC  | Port                     | Count1       | Thresh1   | Count2    | Thresh2   |
+-------+--------------------------+--------------+-----------+-----------+-----------+
| 0-NI1 | not_synced_lane_3        |            1 |         0 |         0 |         1 |
| 0-NI1 | not_synced_lane_2        |            1 |         0 |         0 |         1 |
| 0-NI1 | not_synced_lane_0        |            1 |         0 |         0 |         1 |
| 0-NI1 | synced_lane_3            |            1 |         0 |         0 |         1 |
| 0-NI1 | synced_lane_2            |            1 |         0 |         0 |         1 |
| 0-NI1 | synced_lane_1            |            1 |         0 |         0 |         1 |
| 0-NI1 | synced_lane_0            |            1 |         0 |         0 |         1 |
| 0-NI1 | loc_fault                |            1 |         0 |         0 |         1 |
| 0-NI1 | not_aligned              |            1 |         0 |         0 |         1 |
| 0-NI1 | aligned                  |            1 |         0 |         0 |         1 |
+-------+--------------------------+--------------+-----------+-----------+-----------+
FEX Drops
2148
rw> drops <0-6> hi<0-8>
Dropped packet counters for 0-HI0:
red_hix_cnt_rx_allow_vntag_drop
: 0
red_hix_cnt_rx_echannel_drop
: 0
red_hix_cnt_rx_fwd_drop
: 0
red_hix_cnt_rx_mc_drop
: 0
red_hix_cnt_rx_runt_pkt_drop
: 0
red_hix_cnt_rx_src_vif_out_of_range_drop: 0
red_hix_cnt_tx_lb_drop
: 11892
Class2: 0
Class3: 0
Class2: 0
Class3: 0
Class2: 0
Class3: 0
Class2: 0
Class3: 0
Class2: 0
Class3: 0
0-SS0 ECC1: 0
ECC2: 0
0-SS0 wo_cr: 0
no cells: 0
BRKCRS-3145
mtu_vio: 0
Cisco Public
117
FEX Drops
2248
...
Rx Discard (WR_DISC):
Rx Error (WR_RCV_ERR):
...
BRKCRS-3145
Cisco Public
118
FEX Drops
2248
satctrl/qosctrl> show asic 0 0
SS Statistics:
SS
No Credit*
No Cells
MTU Error
OQ Discard
Free Cells
---+-----------+-----------+-----------+-----------+---------0
10213
10213
...
Dropped packets per CoS due to OQ head-drop, OQ is per 8 port group:
OQ
CoS 0
CoS 1
CoS 2
CoS 3
CoS 4
CoS 5
CoS 6
CoS 7
----+----------+----------+----------+----------+----------+----------+----------+----------NR0
NR1
NR2
NR3
NR4
NR5
----+----------+----------+----------+----------+----------+----------+----------+----------HR0
HR1
HR2
HR3
HR4
HR5
BRKCRS-3145
Cisco Public
119
FEX Drops
2248
fex130# dbgexec prt
prt> drops
PRT_SS_CNT_TAIL_DROP8
: 2 SS0
Diff
Current
Diff
| RX
Current
+----------------------+----------------------+-----------------+----------------------+---------------------+-----------------+
| TX_PKT_LT64
0|
0|
| TX_PKT_64
8|
0|
0|
0| RX_PKT_LT64
5|
1| RX_PKT_64
| TX_PKT_65
4073560|
|
521532|
2062219|
264039| RX_PKT_65
| TX_PKT_128
2060397|
|
263419|
2149866|
274780| RX_PKT_128
1920669|
245601| RX_PKT_256
| TX_PKT_256
...
rmon counters are similar to the counters detailed on the N5k ports;
they are helpful for error tracking and for finding packets of a certain size
They update immediately, whereas show counters on the N5k waits for the statsclient
BRKCRS-3145
Cisco Public
120
Nexus 5000
Nexus 2000
Management
BRKCRS-3145
Cisco Public
121
FEX Logs
attach fex <n>
dbgexec rw/prt (rw=2148, prt=2248)
BRKCRS-3145
Cisco Public
122
Printout note
Final presentation may not end here, look for updated content
potentially at the live presentation.
Thank you.