
Troubleshooting the Cisco Nexus 5000 / 2000 Series Switches
BRKCRS-3145

Objectives
Be able to quickly isolate problematic nodes in the datacenter
Become familiar with troubleshooting in NX-OS
Understand Nexus 5000 and Nexus 2000 platform details
Gain comfort using Nexus 5000 and Nexus 2000 day to day

BRKCRS-3145

2011 Cisco and/or its affiliates. All rights reserved.

Cisco Public

Troubleshooting Nexus 5000 / 2000


Problem Isolation
Network Diagrams
Types of logging
Outputs

When to call TAC

Platform Overview and troubleshooting


Redundancy operation and troubleshooting


Problem Isolation
A problem well stated is a problem half solved

Source: Charles F. Kettering, Engineer and Inventor



Troubleshooting Tool #1
A current, accurate diagram
Physical ports
Logical ports
Spanning-tree root and blocked ports
Helpful to use standard formats: .jpg, .bmp, .pdf
[Figure: example topology diagram showing N7k-1 and N7k-2 (RSTP Root label; vPC domain 100, peer-link Po100 on e1/2, 2/2, peer-keepalive e1/1 - e1/1), N5k-1 through N5k-5 (vPC domains 101 and 102, peer-links Po101 and Po102 on e1/1, 1/2), vPC port-channels po1 and Po2, physical port labels, and a link marked STP BLK]
If you cannot describe how your network should be operating, time may be wasted


Grab a show tech-support


Or not
Sometimes too general
Large file, time consuming
If time permits, use targeted outputs or a specific show tech
If there is no time, use tac-pac and copy off
Much quicker than transmitting to terminal
Zips entire output to file in volatile:
Copy file off of switch for analysis
N5k-1# tac-pac
N5k-1# dir volatile:
     180242    Jan 28 4:37:26 2011  show_tech_out.gz

Which show tech?


As of 5.0(3), there are 68 options
N5k-1# show tech-support ?
  aaa             Display aaa information
  aclmgr          ACL commands
  adjmgr          Display Adjmgr information
  arp             Display ARP information
  ascii-cfg       Show ascii-cfg information for technical support personnel
  assoc_mgr       Gather detailed information for assoc_mgr troubleshooting
  bcm-usd         Gather detailed information for BCM USD troubleshooting
  bootvar         Gather detailed information for bootvar troubleshooting
  brief           Display the switch summary
  btcm            Gather detailed information for BTCM component
  callhome        Callhome troubleshooting information
  cdp             Gather information for CDP trouble shooting
  ...
  session-mgr     Gather information for troubleshooting session manager
  snmp            Gather info related to snmp
  sockets         Display sockets status and configuration
  spm             Service Policy Manager
  stp             Gather detailed information for STP troubleshooting
  sysmgr          Gather detailed information for sysmgr troubleshooting
  time-optimized  Gather tech-support faster, requires more memory & disk space
  track           Show track tech-support information
  vdc             Gather detailed information for VDC troubleshooting
  vpc             Gather detailed information for VPC troubleshooting
  vtp             Gather detailed information for vtp troubleshooting
  xml             Gather information for xml trouble shooting


Log your output


Redirect and Append
N5k-1# show clock > bootflash:debug-file.txt
N5k-1# show mac address-table >> bootflash:debug-file.txt
N5k-1# show running-config | count >> bootflash:debug-file.txt
N5k-1# show file bootflash:debug-file.txt
Mon Apr 4 02:39:41 UTC 2011        <==== output from show clock
Legend:                            <==== output from show mac address-table
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC PeerLink
   VLAN     MAC Address      Type      age     Secure NTFY   Ports
---------+-----------------+--------+---------+------+---+----------
+ 99       0021.5ad8.c424    dynamic   0          F    F  Po500
* 1        0021.5ad8.c424    dynamic   250        F    F  Eth101/1/2
845                                <==== output from show running-config | count


Logging
Often overlooked, but very important
show logging logfile
Basis for tracing events chronologically
Try using start-time or last
N5k-1# show logging logfile start-time 2011 Mar 9 20:00:00
2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/1 is down (None)
2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/3 is down (None)
N5k-1# show logging last ?
<1-9999> Enter number of lines to display
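Once a logfile has been copied off the switch, the same start-time filtering can be done offline. The following is a minimal Python sketch, not a Cisco tool: the helper names `parse_syslog_time` and `lines_since` are hypothetical, the first sample line is invented for illustration, and it assumes every line begins with the "YYYY Mon D HH:MM:SS" timestamp shown above (lines with fractional seconds would need a different format string).

```python
from datetime import datetime

STAMP_FORMAT = "%Y %b %d %H:%M:%S"  # e.g. "2011 Mar 9 20:17:18"


def parse_syslog_time(line):
    """Parse the leading timestamp of one logfile line (hypothetical helper)."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, STAMP_FORMAT)


def lines_since(lines, start):
    """Keep only lines at or after `start`, like 'show logging logfile start-time'."""
    start_dt = datetime.strptime(start, STAMP_FORMAT)
    return [ln for ln in lines if parse_syslog_time(ln) >= start_dt]


log = [
    # Invented line, present only to show that older entries are filtered out.
    "2011 Mar 9 19:59:59 esc-n5548-1 %EXAMPLE-5-PLACEHOLDER: hypothetical earlier message",
    "2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/1 is down (None)",
    "2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/3 is down (None)",
]

recent = lines_since(log, "2011 Mar 9 20:00:00")
for entry in recent:
    print(entry)
```

Sorting the merged logs of several switches by the parsed timestamp is a natural next step when tracing an event across devices, provided their clocks agree.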

show accounting log


Basis for tracing configuration changes
terminal log-all to also log show commands
All commands end with (SUCCESS) or (FAILURE)

Other System Logs


show logging nvram
Persistent logging survives reloads; helpful for crash or reload issues.
esc-n5020-1# show logging nvram
2011 Jan 26 14:58:10 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 124 is online
2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-PFM_SYSTEM_RESET: Manual system restart from Command Line Interface
2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %KERN-0-SYSTEM_MSG: Shutdown Ports.. - kernel
2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %KERN-0-SYSTEM_MSG: writing reset reason 9, - kernel
2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-101 Off-line (Serial Number JAF132XXXXX)
2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 101 is offline
2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-124 Off-line (Serial Number JAF140XXXXX)
2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 124 is offline
2011 Jan 28 02:47:43 esc-n5020-1 %$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 500, VPC peer keep-alive receive has failed


When to call TAC


Most efficient if you have the following:
A description of the problem observed, with evidence / clues, along with time and scope
A current network diagram
All parties involved in the problem
show tech is not necessary, but if you must make drastic changes such as reloading or replacing hardware, grab this first
Any targeted outputs, especially around the time of the event in question
You think you have found a bug, but a quick search of defects or release notes on cisco.com may be faster

Troubleshooting Nexus 5000 / 2000


Problem Isolation
Platform Overview
NX-OS Operation
FSM
MTS
Crashes
Nexus 5000
Nexus 2000

Platform Overview and troubleshooting


Redundancy operation and troubleshooting


NX-OS
Operation Tips
Tab auto-complete works within the current context, but commands from higher-level modes will still execute if available.
N5k-3(config-if)# switch?
  switchport   Configure switchport parameters   <=== matching in config-if mode
N5k-3(config-if)# switchn?
  switchname   Configure system's host name      <=== matching in config mode

Filesystems dynamically auto-complete
N5k-3(config)# show file bootflash:s?
  bootflash:stp.log.1
N5k-3(config)# install all system bootflash:n5<tab>
  bootflash:n5000-uk9.5.0.3.N1.1.bin
  bootflash:n5000-uk9.5.0.2.N2.1.bin


NX-OS
Operation Tips
CLI list and grep

N5k-3# show cli list | grep switchport
show system default switchport san
show interface switchport
show interface <if-mr> switchport

ctrl-c terminates output
N5k-3# show tech-support
---- show tech-support ----
ctrl-c
N5k-3#


NX-OS
File Structure
Mounts can fill up. Watch /var/tmp: it is cleared by a reload or with TAC assistance.
A full /var/tmp can cause upgrade errors and unexpected logs
N5k-1# show system internal flash
Mount-on                 1K-blocks  Used    Available  Use%  Filesystem
/                        204800     111460  93340      55    /dev/root
/proc                    0          0       0          0     proc
/sys                     0          0       0          0     none
/isan                    1536000    453760  1082240    30    none
/var/tmp                 131072     108     130964     1     none
/var/sysmgr              512000     4700    507300     1     none
/var/sysmgr/ftp          204800     48604   156196     24    none
/var/sysmgr/ftp/cores    20480      0       20480      0     none
/callhome                32768      0       32768      0     none
/dev/shm                 262144     95936   166208     37    none
/volatile                61440      0       61440      0     none
/debug                   2048       4       2044       1     none
/dev/mqueue              0          0       0          0     none
/mnt/cfg/0               39257      4332    32898      12    /dev/sda5
/mnt/cfg/1               37242      4332    30987      13    /dev/sda6
/var/sysmgr/startup-cfg  102400     3112    99288      4     none
/dev/pts                 0          0       0          0     devpts
/mnt/plog                56192      1784    54408      4     /dev/mtdblock2
/mnt/pss                 39273      6058    31187      17    /dev/sda4
/bootflash               859848     768664  47504      95    /dev/sda3
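When this output is collected regularly, a short script can flag mounts that are close to full. This is a minimal Python sketch under stated assumptions: the function name `mounts_over` and the 90% threshold are illustrative choices, not anything NX-OS defines, and it assumes the six whitespace-separated columns shown above.

```python
def mounts_over(table_text, threshold_pct):
    """Return (mount, use%) pairs whose Use% is at or above threshold_pct."""
    flagged = []
    for line in table_text.strip().splitlines():
        fields = line.split()
        # Expected columns: Mount-on, 1K-blocks, Used, Available, Use%, Filesystem.
        # The header row and the CLI prompt line fail this check and are skipped.
        if len(fields) != 6 or not fields[4].isdigit():
            continue
        use_pct = int(fields[4])
        if use_pct >= threshold_pct:
            flagged.append((fields[0], use_pct))
    return flagged


# Three rows copied from the capture above.
sample = """
Mount-on    1K-blocks  Used    Available  Use%  Filesystem
/           204800     111460  93340      55    /dev/root
/var/tmp    131072     108     130964     1     none
/bootflash  859848     768664  47504      95    /dev/sda3
"""

flagged = mounts_over(sample, 90)
print(flagged)
```

In this capture only /bootflash crosses the assumed threshold; /var/tmp, the mount the slide warns about, is nearly empty.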


NX-OS
File Structure
volatile: filesystem is virtual, use as scratch if needed
Obviously volatile, will not survive a reload
log: filesystem is in root /
N5k-1# debug logfile CiscoLive_debugs
N5k-1# show debug
Output forwarded to file CiscoLive_debugs (size: 4194304 bytes)
Debug level is set to Minor(1)
N5k-1# dir log:
          0    Apr 04 01:14:01 2011  CiscoLive_debugs
         31    Mar 11 11:38:35 2011  dmesg
          0    Mar 11 11:38:57 2011  libfipf.4365
      79101    Apr 04 00:34:02 2011  messages
       6670    Apr 04 00:06:01 2011  startupdebug
N5k-1# copy log:CiscoLive_debugs tftp:
Enter vrf: management
Enter hostname for the tftp server: 10.91.42.134
Trying to connect to tftp server......
Connection to Server Established.
|
TFTP put operation was successful
N5k-1# clear debug-logfile CiscoLive_debugs
-OR-
N5k-1# undebug all


NX-OS
FSM
NX-OS records the finite state machine (FSM) for many important processes
Using this event-history of FSM states and triggers, debugging can be done after a problem has occurred.
Some common processes:
ethpc: ethernet port client; responsible for talking to the MAC and PHY
ethpm: ethernet port manager; responsible for translating between configuration and ethpc. ethpc informs ethpm that the link is up, and ethpm then gives instructions on what the configuration is for the port
port-channel: port-channeling process; responsible for aggregating physical links into logical channels
lacp: 802.3ad standard for aggregating links
fwm: forwarding manager; responsible for programming hardware according to the software configuration
Important to compare timestamps and watch for inter-process communication.


NX-OS
FSM
Sometimes it is enough to look at one process FSM; other times you are looking for related events.
Timestamps should line up when there is causality.

Example: A fex comes online after e1/3 is brought up
N5k-1# show logg
2005 Feb 2 13:16:49 esc-n5020-1 %ETHPORT-5-IF_UP: Interface Ethernet1/3 is up in mode Fex Fabric
2005 Feb 2 13:16:47 esc-n5020-1 %SYSMGR-FEX100-5-MODULE_ONLINE: System Manager has received notification of local module becoming online.
2005 Feb 2 13:16:47 esc-n5020-1 %SATCTRL-FEX100-2-SATCTRL: FEX-100 Module 1: Cold boot

NX-OS
FSM
A given fex host interface shows port cfg message
Indicates preparation to enable the interface
*e1/3 up at 13:16:49
N5k-1# show platform software ethpc event-history interface e100/1/4
1) Event IF_PCFG_RSP, len: 8, at 243054 usecs after Wed Feb 2 13:16:54 2011
Sent port cfg message response to ethpm - Id: 0x2cc1819, Status: success

port-channel history shows an IF_CREATE event near this time
This is all related to a fex coming online, while e100/1/4 is configured as a port-channel member and is coming up
N5k-1# show port-channel internal event-history interface e100/1/4
>>>>FSM: <Ethernet100/1/4> has 1 logged transitions<<<<<
1) FSM:<Ethernet100/1/4> Transition at 447889 usecs after Wed Feb 2 13:16:54 2011
    Previous state: [PCM_ETH_PORT_ST_INIT_DOWN]
    Triggered event: [PCM_PORT_EV_IF_CREATE]
    Next state: [FSM_ST_NO_CHANGE]
    Curr state: [PCM_ETH_PORT_ST_INIT_DOWN]
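Lining up event-history entries from different processes is easier when the "N usecs after <date>" stamps are converted to absolute times. The following is a minimal Python sketch, assuming the timestamp wording shown in the two entries above; the helper name `event_time` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "243054 usecs after Wed Feb 2 13:16:54 2011"
STAMP = re.compile(r"(\d+) usecs after (\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4})")


def event_time(line):
    """Convert an event-history timestamp into a datetime with microseconds."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError("no event-history timestamp in line")
    normalized = " ".join(match.group(2).split())  # collapse repeated spaces
    base = datetime.strptime(normalized, "%a %b %d %H:%M:%S %Y")
    return base + timedelta(microseconds=int(match.group(1)))


ethpc_event = "1) Event IF_PCFG_RSP, len: 8, at 243054 usecs after Wed Feb 2 13:16:54 2011"
pcm_event = "1) FSM:<Ethernet100/1/4> Transition at 447889 usecs after Wed Feb 2 13:16:54 2011"

delta = event_time(pcm_event) - event_time(ethpc_event)
print(delta)
```

A gap of a fraction of a second between the ethpc and port-channel entries supports treating them as parts of the same sequence; events that are minutes apart are less likely to be causally related.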




NX-OS
MTS
NX-OS uses Message and Transaction Service (MTS) to communicate between processes.
When troubleshooting CPU issues, we can check MTS for a large queue of messages.
When troubleshooting a specific process, we may see specific MTS messages queued.


NX-OS
MTS
NX-OS uses Message and Transaction Service (MTS) to communicate between processes.
Useful to check when troubleshooting:
high CPU
unresponsive CLI / timeout
control-plane disruption
When troubleshooting a process, we may look for specific MTS messages queued.
MTS messages may be coming in too fast, or there could be a message stuck at the top of the queue


NX-OS
MTS
The persistent queue is allowed to grow old
N5k-1# show system internal mts buffers details
Node/Sap/queue  Age(ms)  SrcNode  SrcSAP  DstNode  DstSAP  OPC    MsgId       MsgSize
sup/284/pers    2387380  0x101    1231    0x101    284     86017  1301448368  868
sup/284/pers    14398    0x101    1238    0x101    284     86017  1301470493  868
sup/284/pers    3028     0x101    1897    0x101    284     86017  1301473115  868
sup/284/pers    818      0x101    1328    0x101    284     86017  1301473633  868
sup/284/pers    577      0x101    1236    0x101    284     86017  1301473693  868
sup/284/pers    42       0x101    32562   0x101    284     86017  1301473831  868

The first entry is from dcos-xinetd (internet services), and it makes sense for it to be old, since it's a server that is always running (for fabric manager)
N5k-1# sh system internal mts sup sap 284 description
TCPUDP process client MTS queue
N5k-1# sh system internal mts sup sap 1231 description
dcos-xinetd
N5k-1# sh system internal mts opcodes | grep 86017
86017 MTS_OPC_TCP:

NX-OS
MTS
The recv queue should not grow old
SAP 0 is an invalid identifier; here it has caused 300 messages to queue, and the count is growing.
Observed impact is various show commands timing out, such as show log and show run
N5k-1# show system internal mts buffers details
Node/Sap/queue  Age(ms)    SrcNode  SrcSAP  DstNode  DstSAP  OPC   MsgId       MsgSize
sup/32/recv     319672424  0x101    25330   0x101    0       7662  1221952768  192
sup/32/recv     319669986  0x101    25336   0x101    32      188   1221953842  328
sup/32/recv     319609082  0x101    25344   0x101    0       7663  1221971222  2452
...
sup/32/recv     227324     0x101    32550   0x101    32      188   1301415915  328
sup/32/recv     165509     0x101    32560   0x101    0       7663  1301432732  2452
sup/32/recv     101893     0x101    32565   0x101    0       7662  1301448663  192
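The "recv should not grow old" check can be applied mechanically to saved output. This is a minimal Python sketch: the function name `stale_recv_messages` and the 60-second age threshold are assumptions made for illustration (the slides give no numeric limit), and the column order is taken from the header row above.

```python
def stale_recv_messages(rows, max_age_ms):
    """Return recv-queue entries older than max_age_ms as small dicts."""
    stale = []
    for row in rows:
        fields = row.split()
        # Columns: Node/Sap/queue Age(ms) SrcNode SrcSAP DstNode DstSAP OPC ...
        if len(fields) < 7 or not fields[1].isdigit():
            continue  # header rows and "..." lines are skipped here
        queue = fields[0].split("/")[-1]
        age_ms = int(fields[1])
        if queue == "recv" and age_ms > max_age_ms:
            stale.append(
                {"owner": fields[0], "age_ms": age_ms,
                 "dst_sap": int(fields[5]), "opcode": int(fields[6])}
            )
    return stale


rows = [
    "Node/Sap/queue Age(ms) SrcNode SrcSAP DstNode DstSAP OPC MsgId MsgSize",
    "sup/284/pers 2387380 0x101 1231 0x101 284 86017 1301448368 868",
    "sup/32/recv 319672424 0x101 25330 0x101 0 7662 1221952768 192",
    "sup/32/recv 101893 0x101 32565 0x101 0 7662 1301448663 192",
]

stale = stale_recv_messages(rows, max_age_ms=60_000)
for entry in stale:
    print(entry)
```

The old pers entry is ignored on purpose, matching the earlier observation that the persistent queue is allowed to age; only the recv entries are reported, and their DstSAP of 0 is the detail worth raising with TAC.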


NX-OS
MTS
MTS messages have been addressed to SAP 0 due to a bug.
Reload was needed to clear this scenario
N5k-1# sh system internal mts sup sap 0 description
Not implemented
N5k-1# sh system internal mts sup sap 32 description
Syslog Sup Node Cfg
N5k-1# show system internal sysmgr service name syslogd
Service "syslogd" ("syslogd", 75):
        UUID = 0x21, PID = 3924, SAP = 32
        State: SRV_STATE_HANDSHAKED (entered at time Sat May 15 05:01:20 2010).
        Restart count: 1
        Time of last restart: Sat May 15 05:01:20 2010.
        The service never crashed since the last reboot.
        Tag = N/A
        Plugin ID: 0



NX-OS
Crashes
NX-OS attempts to create a core file with information helpful in finding and fixing the problem:
stack trace
memory contents
Some processes in NX-OS can be restarted in a stateful manner.
Nexus 5000 is a single-supervisor platform; critical processes require a system restart upon a crash.

A syslog message is sent just before the crash and system restart
2010 Sep 10 16:19:27.411 N5k-1 %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 2723) hasn't caught signal 6 (core will be saved).


NX-OS
Crashes
show process log
View status of all processes, including if a core was created
N5k-1# show process log
Process           PID    Normal-exit  Stack  Core  Log-create-time
----------------  -----  -----------  -----  ----  ------------------------
eth_port_channel  2743   N            Y      N     Wed Mar 17 17:20:57 2010
eth_port_channel  2761   N            Y      N     Tue Aug 3 19:14:58 2010
fwm               2703   N            Y      N     Fri Oct 8 19:24:12 2010
...

N5k-1# show process log pid 2703


======================================================
Service: fwm
Description: Forwarding manager Daemon
Started at Thu Oct 7 14:51:51 2010 (151707 us)
Stopped at Fri Oct 8 19:24:12 2010 (203577 us)
Uptime: 1 days 4 hours 32 minutes 21 seconds
Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
...

NX-OS
Crashes
When the NX-OS system manager (sysmgr) resets the switch, a core file for the offending process is often generated.
N5k-1# show cores
Module-num  Instance-num  Process-name  PID   Core-create-time
----------  ------------  ------------  ---   ----------------
1           1             fwm           2723  Sep 17 16:34

Copy off core file for TAC analysis
N5k-1# copy core://1/fwm/1/ ?
  bootflash:  Select destination filesystem
  ftp:        Select destination filesystem
  scp:        Select destination filesystem
  sftp:       Select destination filesystem
  tftp:       Select destination filesystem

NX-OS
Crashes
Sometimes a core file does not exist:
not enough room in the file system
kernel crashes
third-party processes; ntpd, telnetd, others...

show logging onboard obfl-logs
show logging onboard exception log
show logging onboard kernel-trace

OBFL is used to capture information related to hardware, bootup, and environmental conditions. Onboard failure logging is non-volatile.
obfl-logs: per module; tracks environmental logs, bootup-records, uptime at bootup, version at each boot, stack trace if applicable
exception log: crash/exception history and details
kernel-trace: display stack of last kernel exception


NX-OS
Crashes
In addition to the core file, circumstantial evidence around the time of the crash is helpful:
Was there a configuration change?
Was there a physical topology change?
Can this be reproduced?
Was there a recent upgrade?
Are you using an uncommon configuration? It is less likely to have been tested or seen by other customers.
The more details pointing to a root cause, the more feasible it is to find the problem and provide a workaround and a fix.

Additional detail regarding NX-OS:
BRKARC-3471 Cisco NXOS Software - Architecture



Troubleshooting Nexus 5000 / 2000


Problem Isolation

Platform Overview and troubleshooting


NX-OS Operation
Nexus 5000

CRC errors
Ethanalyzer / CPU
Queuing and forwarding

SPAN
Spanning-tree
Nexus 2000

Redundancy operation and troubleshooting



Hardware overview
Drops are usually part of any discussion of forwarding errors and troubleshooting
We have to know the basic hardware layout in order to know where to look for problems
The following hardware overview is a preview of
BRKARC-3452 Cisco Nexus 5000/5500 and 2000 Switch Architecture


Nexus 5000 Hardware Overview


Data Plane Elements
Nexus 5000 is a distributed forwarding architecture
Unified Port Controller (UPC) ASIC interconnected by a single stage Unified Crossbar Fabric (UCF)
Unified Port Controllers provide distributed packet forwarding capabilities
All port to port traffic passes through the UCF (Fabric)
Four switch ports managed by each UPC
14 UPC in Nexus 5020
7 UPC in Nexus 5010
[Figure: groups of SFP ports, each group attached to a Unified Port Controller, with all Unified Port Controllers interconnected through the Unified Crossbar Fabric]

Nexus 5500 Hardware Overview


Data and Control Plane Elements
[Figure: Nexus 5500 block diagram. Control plane: Intel Jasper Forest CPU, DDR3 DRAM, South Bridge, Flash, Memory, NVRAM, Serial Console, PEX 8525 4 port PCIE Switch (PCIe x8 and PCIe x4 links), three PCIE Dual Gig controllers (ports 0 and 1), L1, L2, Mgmt 0. Data plane: Gen 2 UPCs and the Expansion Module connected to the Gen 2 Unified Crossbar Fabric, with links labeled 10 Gig and 12 Gig]

Nexus 5000/5500 Hardware Overview


Data Plane Elements - Unified Crossbar Fabric
Nexus 5000 (Gen-1)
58-port packet based crossbar and scheduler
Three unicast and one multicast crosspoint per egress port

Nexus 5500 (Gen-2)
100-port packet based crossbar and new schedulers
4 crosspoints per egress port, dynamically configurable between multicast and unicast traffic

Central tightly coupled scheduler
Request, propose, accept, grant, and acknowledge semantics
Packet enhanced iSLIP scheduler
Distinct unicast and multicast schedulers (see slides later for differences in Gen-1 vs. Gen-2 multicast schedulers)
Eight classes of service within the Fabric
[Figure: Unified Crossbar Fabric with its Multicast Scheduler and Unicast iSLIP Scheduler]

Nexus 5000 Hardware Overview


Unified Port Controller
Each UPC supports four ports and contains:
Multimode Media access controllers (MAC)
Support 1/10 G Ethernet and 1/2/4 G Fibre Channel
1G is available on first 8 ports of the 5010 and first 16 ports of the 5020
(2/4/8 G Fibre Channel MAC is located on the Expansion Module)
Packet buffering and queuing
480 KB of buffering per port
Forwarding controller
Ethernet and Fibre Channel Forwarding and Policy
[Figure: Unified Port Controller containing four "MMAC + Buffer + Forwarding" blocks]

Nexus 5500 Hardware Overview


Data Plane Elements - Unified Port Controller (Gen 2)
Each UPC supports eight ports and contains:
Multimode Media access controllers (MAC)
Support 1/10 G Ethernet and 1/2/4/8 G Fibre Channel
All MAC/PHY functions supported on the UPC (5548UP and 5596UP)
Packet buffering and queuing
640 KB of buffering per port
Forwarding controller
Ethernet (Layer 2 and FabricPath) and Fibre Channel Forwarding and Policy (L2/L3/L4 + all FC zoning)
[Figure: Unified Port Controller 2 containing eight "MMAC + Buffer + Forwarding" blocks]

Nexus 5000/5500 Hardware Overview


Control Plane Elements
In-band traffic is identified by the UPC and punted to the CPU via two dedicated UPC interfaces, 5/0 and 5/1, which are in turn connected to eth3 and eth4 interfaces in the CPU complex
Eth3 handles Rx and Tx of low priority control pkts
IGMP, CDP, TCP/UDP/IP/ARP (for management purpose only)
Eth4 handles Rx and Tx of high priority control pkts
STP, LACP, DCBX, FC and FCoE control frames (FC packets come to Switch CPU as FCoE packets)
There is a built-in control-plane policer to limit the amount of traffic punted to CPU
[Figure: CPU and South Bridge connected to NICs eth3, eth4 and mgmt0; eth3 and eth4 attach to the Unified Port Controller]

Nexus 5000/5500 Hardware Overview


Control Plane Elements
CPU queuing structure provides strict protection and prioritization of inbound traffic
Each of the two in-band ports has 8 queues and traffic is scheduled for those queues based on control plane priority (traffic CoS value)
Prioritization of traffic between queues on each in-band interface
CLASS 7 is configured for strict priority scheduling (e.g. BPDU)
CLASS 6 is configured for DRR scheduling with 50% weight
Default classes (0 to 5) are configured for DRR scheduling with 10% weight
Additionally each of the two in-band interfaces has a priority service order from the CPU
Eth 4 interface has high priority to service packets (no interrupt moderation)
Eth3 interface has low priority (interrupt moderation)
[Figure: CPU (Intel LV Xeon 1.66 GHz) and South Bridge connected to NICs eth3 and eth4, with example traffic labels BPDU, CFS and ICMP]
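To make the queue behavior concrete, here is a small Python simulation of strict priority combined with textbook deficit round robin (DRR). It is only an illustration of the scheduling idea, not the actual hardware or driver implementation: the function name `drain`, the byte quanta, and the packet sizes are assumptions, and only the 50:10 weight ratio and the strict-priority class 7 come from the slide.

```python
from collections import deque


def drain(queues, quanta, strict_class=7):
    """Return the send order of (class, size) packets for a static backlog.

    queues: dict of class number -> deque of packet sizes in bytes.
    quanta: dict of DRR class -> bytes of credit added per round (the weight).
    """
    sent = []
    deficit = {cls: 0 for cls in quanta}
    classes = [strict_class] + list(quanta)
    while any(queues.get(cls) for cls in classes):
        # Strict priority: the strict class is always emptied first.
        while queues.get(strict_class):
            sent.append((strict_class, queues[strict_class].popleft()))
        # One DRR round over the weighted classes, highest class first.
        for cls in sorted(quanta, reverse=True):
            queue = queues.get(cls)
            if not queue:
                deficit[cls] = 0  # idle queues do not bank credit
                continue
            deficit[cls] += quanta[cls]
            while queue and queue[0] <= deficit[cls]:
                size = queue.popleft()
                deficit[cls] -= size
                sent.append((cls, size))
            if not queue:
                deficit[cls] = 0
    return sent


queues = {
    7: deque([64]),        # e.g. one BPDU
    6: deque([100] * 5),
    0: deque([100] * 5),
}
quanta = {6: 500, 0: 100}  # 50:10 ratio, expressed as assumed byte quanta

sent = drain(queues, quanta)
print(sent)
```

With equal packet sizes, class 6 sends five packets in the first round for every one packet from class 0, after the class 7 packet has gone out ahead of both.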


Nexus 5000 Hardware Overview


Control Plane Elements
Monitoring of in-band traffic via NX-OS built-in ethanalyzer (sniffer)
Eth3 is equivalent to inbound-low
Eth4 is equivalent to inbound-hi
N5k-2# ethanalyzer local sniff-interface ?
  inbound-hi   Inbound(high priority) interface
  inbound-low  Inbound(low priority) interface
  mgmt         Management interface

CLI view of in-band control plane data
N5k-2# sh hardware internal cpu-mac inband counters
eth3  Link encap:Ethernet HWaddr 00:0D:EC:B2:0C:83
      UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST MTU:2200 Metric:1
      RX packets:3 errors:0 dropped:0 overruns:0 frame:0
      TX packets:630 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:252 (252.0 b) TX bytes:213773 (208.7 KiB)
      Base address:0x6020 Memory:fa4a0000-fa4c0000
eth4  Link encap:Ethernet HWaddr 00:0D:EC:B2:0C:84
      UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST MTU:2200 Metric:1
      RX packets:85379 errors:0 dropped:0 overruns:0 frame:0
      TX packets:92039 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:33960760 (32.3 MiB) TX bytes:25825826 (24.6 MiB)
      Base address:0x6000 Memory:fa440000-fa460000
[Figure: CPU (Intel LV Xeon 1.66 GHz) and South Bridge connected to NICs eth3 and eth4, which attach to the Unified Port Controller]

Nexus 5000 Hardware Overview


Packet Forwarding Overview
1. Ingress MAC - MAC decoding, MACSEC processing (not supported currently), synchronize bytes
2. Ingress Forwarding Logic - Parse frame and perform forwarding and filtering searches, perform learning, apply internal DCE header
3. Ingress Buffer (VoQ) - Queue frames, request service of fabric, dequeue frames to fabric and monitor queue usage to trigger congestion control
4. Cross Bar Fabric - Scheduler determines fairness of access to fabric and determines when frame is de-queued across the fabric
5. Egress Buffers - Landing spot for frames in flight when egress is paused
6. Egress Forwarding Logic - Parse, extract fields, learning and filtering searches, perform learning and finally convert to desired egress format
7. Egress MAC - MAC encoding, pack, synchronize bytes and transmit
[Figure: SFP ports feeding the Ingress UPC (steps 1-3), the Unified Crossbar Fabric, and the Egress UPC (steps 5-7) leading to SFP ports]

Nexus 5000 Forwarding


cut-through vs. store and forward
Store and forward switching is still utilized when the ingress data rate is slower than the egress data rate.
Cut-through switching is utilized to achieve low latency through the switch fabric.
Bits are serialized in from the ingress port until enough of the packet header has been received to perform a forwarding and policy lookup
Once a lookup decision has been made and the fabric has granted access to the egress port, bits are forwarded through the fabric
Egress port performs any header rewrite (e.g. CoS marking) and MAC begins serialization of bits out the egress port
A drop cannot happen on ingress due to any switching logic or even a CRC error. Only faulty hardware or connections can cause a drop on ingress.
Discards can occur on ingress due to queuing configuration and traffic patterns.

Nexus 5000 Forwarding


cut-through vs. store and forward
Source Interface     Destination Interface  Switching Mode
10 GigabitEthernet   10 GigabitEthernet     Cut-Through
10 GigabitEthernet   1 GigabitEthernet      Cut-Through
1 GigabitEthernet    1 GigabitEthernet      Store-and-Forward
1 GigabitEthernet    10 GigabitEthernet     Store-and-Forward
FCoE                 Fibre Channel          Cut-Through
Fibre Channel        FCoE                   Store-and-Forward
Fibre Channel        Fibre Channel          Store-and-Forward
FCoE                 FCoE                   Cut-Through

Simple way to remember: 10G ingress interfaces are always cut-through
Note: 10G interfaces can be configured for Ethernet or FCoE
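The table above is small enough to encode as a lookup, which also lets the "simple way to remember" rule be checked against every row. This is a minimal Python sketch; the short interface labels ("10GE", "1GE", "FC", "FCoE") and the function name `switching_mode` are naming choices made here, not NX-OS terms.

```python
# (source interface, destination interface) -> switching mode, as in the table
SWITCHING_MODE = {
    ("10GE", "10GE"): "cut-through",
    ("10GE", "1GE"): "cut-through",
    ("1GE", "1GE"): "store-and-forward",
    ("1GE", "10GE"): "store-and-forward",
    ("FCoE", "FC"): "cut-through",
    ("FC", "FCoE"): "store-and-forward",
    ("FC", "FC"): "store-and-forward",
    ("FCoE", "FCoE"): "cut-through",
}

# Ingress types carried on 10G interfaces, per the note under the table.
TEN_GIG_INGRESS = {"10GE", "FCoE"}


def switching_mode(src, dst):
    """Look up the switching mode for a source/destination interface pair."""
    return SWITCHING_MODE[(src, dst)]


def rule_of_thumb(src):
    """The slide's shortcut: 10G ingress interfaces are always cut-through."""
    return "cut-through" if src in TEN_GIG_INGRESS else "store-and-forward"


for (src, dst), mode in SWITCHING_MODE.items():
    print(f"{src:>5} -> {dst:<5} {mode}")
```

Every row agrees with the shortcut: the mode depends only on the ingress interface type, never on the egress side.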



Cut-through mode and CRC errors


Received errors
Cut-through switching changes how we troubleshoot problems in the switch.
Ethernet CRC is at the end of the frame, so even a CRC error cannot cause a drop on a cut-through port.
We are already forwarding the frame by the time the ingress MAC can read the CRC value.
[Figure: a frame (Ethernet Header, IPv4 Header, IP Payload, FCS) is parsed and forwarded; corruption in the frame is only detected at the FCS (CRC Bad)]

Cut-through mode and CRC errors


Received errors
The corrupted frame must be forwarded, but is accounted for as an output error.
N5k-1# show interface e1/1
...
TX
10157 unicast packets 105 multicast packets 52 broadcast packets
11314 output packets 5317822 bytes
0 jumbo packets
1000 output errors 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 Tx pause
0 interface resets


Animation frames for printouts

[Figure, frame 1: a frame (Ethernet Header, IPv4 Header, IP Payload, FCS) enters parsing; the frame contains corruption]
A frame arrives to be parsed but is corrupted.

[Figure, frame 2: the Ethernet header has been parsed; decision: Forward]
Cut-through switching only reads the destination MAC address to make the forwarding decision.
Here, the decision to forward is made, while unaware of the corruption to follow

[Figure, frame 3: parsing reaches the IP Payload and FCS; result: CRC Bad]
It is not until the FCS field in the Ethernet trailer that we can calculate the CRC value

Cut-through mode and CRC stomping


Originated Errors
In addition to receiving errored frames, the Nexus 5000 can generate a bad CRC for several reasons:
MTU violation
IP length error
Ethernet length error (when the EtherType/length field is <= 1500 / 0x5dc it is interpreted as a length)
Invalid Ethernet preamble
Received and originated errors will count as TX output errors.
Only received errors will count as RX CRC errors.
You are more likely to see CRC errors in a network with a cut-through switch.
The errors will pass through all cut-through switches and finally drop at the first store-and-forward buffer.
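The Ethernet length rule mentioned above can be illustrated with a small Python sketch. It follows the IEEE 802.3 convention (values up to 1500 are a length, values from 0x0600 upward are an EtherType); the function name is hypothetical and the switch's own check is not shown in the slides.

```python
def classify_type_length(value: int) -> str:
    """Interpret the 16-bit field that follows the source MAC address."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("field is 16 bits wide")
    if value <= 1500:          # 0x05DC and below: payload length in bytes
        return "length"
    if value >= 0x0600:        # 1536 and above: EtherType
        return "ethertype"
    return "undefined"         # 1501..1535 is not assigned either meaning

for sample in (0x05DC, 0x0800, 1501):
    print(hex(sample), classify_type_length(sample))
```

A frame whose field is read as a length but whose payload does not match that length is one way an Ethernet length error can arise.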

Finding the source of CRC errors


CRC errors are introduced in 3 ways:
Bad physical connection (copper, fiber, transceiver, PHY)
Stomping due to intentionally originated errors
Received bad CRC stomped from a neighboring cut-through switch

Start by finding any RX CRC counters.
If there are none, then this switch is responsible for originating the errors.
Use interrupt counters to find the reason and port, if intentional.
Log in to the next switch upstream of the CRC counters and check for RX CRC there.
Use the above logic to determine if that switch is originating any errors.
Finally, inspect optics/pluggables and fiber/cables, and troubleshoot as a Layer 1 issue. Change cable and port to find where the problem follows.
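The walk upstream described above can be sketched as a small decision routine. This is only a model of the reasoning, assuming you have already collected the RX CRC and TX output error counters by hand; the `SwitchCounters` type and `locate_crc_source` function are hypothetical names, not NX-OS features.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SwitchCounters:
    name: str
    rx_crc: int            # RX CRC count on the port facing upstream
    tx_output_errors: int  # TX output errors on the egress port

def locate_crc_source(path: List[SwitchCounters]) -> Tuple[str, str]:
    """Walk from the switch where errors were noticed toward the source.

    Returns (switch name, verdict), where the verdict is one of:
      "originated" - TX errors without RX CRC: this switch stomped the frame
      "layer1"     - RX CRC with nothing further upstream: check optics and cables
      "clean"      - no CRC activity on the first switch
    """
    if not path:
        raise ValueError("path must contain at least one switch")
    furthest_with_rx_crc = None
    for sw in path:
        if sw.rx_crc > 0:
            furthest_with_rx_crc = sw   # error was received here; keep going upstream
            continue
        if sw.tx_output_errors > 0:
            return (sw.name, "originated")
        break
    if furthest_with_rx_crc is None:
        return (path[0].name, "clean")
    return (furthest_with_rx_crc.name, "layer1")

# Counter values shaped like the two scenarios that follow in this section.
scenario_1 = [SwitchCounters("N5k-2", 1, 1), SwitchCounters("N5k-1", 1, 1)]
scenario_2 = [SwitchCounters("N7k-1", 1, 0), SwitchCounters("N5k-1", 0, 1)]
print(locate_crc_source(scenario_1))
print(locate_crc_source(scenario_2))
```

In practice the "originated" verdict is the cue to read the interrupt counters for the reason, and the "layer1" verdict is the cue to swap cables and ports.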


Finding the source of CRC errors


Observations, scenario #1

[Topology diagram: N7k-1 (e1/11, e1/12), N5k-1 (e1/1, e1/5, e1/7) and N5k-2 (e1/3, e1/4, e1/5, e1/7), with VLAN 7 and VLAN 8 marked]

N5k-1# show interface e1/1
RX
20949142 unicast packets 1147746 multicast packets 6 broadcast packets
22096894 input packets 30452432662 bytes
18967009 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer

Finding the source of CRC errors


Observations, scenario #1


N5k-1# show interface e1/5


TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision


Finding the source of CRC errors


Observations, scenario #1

N5k-2# show interface e1/5
RX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 input packets 0 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer

Finding the source of CRC errors


Observations, scenario #1


N5k-2# show interface e1/3


TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision


Finding the source of CRC errors


Scenario #1: Physical Issue

Frame enters the switch as a CRC error (bad fiber)

N5k-1# show interface e1/1
RX
20949142 unicast packets 1147746 multicast packets 6 broadcast packets
22096894 input packets 30452432662 bytes
18967009 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer

Finding the source of CRC errors


Scenario #1: Physical Issue
Look up the internal ASIC port
Front Panel e1/1 = Internal 7:2

N5k-1# show hardware internal gatos all-ports | egrep name|1/1
name  |log|gat|mac|flag|adm|opr|c:m:s:l|ipt|fab|xgat|xpt|if_index|diag
xgb1/1|0  |7  |2  |b7  |en |up |1:2:2:f|2  |6  |7   |4  |1a000000|pass
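The lookup above can be scripted when many ports need mapping. The sketch below parses the header and row exactly as printed on the slide; reading the `gat` and `mac` columns as the "7:2" internal port is an assumption taken from the slide's labels, and `parse_row` is a hypothetical helper.

```python
HEADER = "name|log|gat|mac|flag|adm|opr|c:m:s:l|ipt|fab|xgat|xpt|if_index|diag"
ROW = "xgb1/1 |0 |7 |2 |b7 |en |up |1:2:2:f|2 |6 |7 |4 |1a000000|pass"

def parse_row(header: str, row: str) -> dict:
    """Pair each pipe-delimited column name with the value in the same position."""
    keys = [k.strip() for k in header.split("|")]
    values = [v.strip() for v in row.split("|")]
    if len(keys) != len(values):
        raise ValueError("header and row have different column counts")
    return dict(zip(keys, values))

port = parse_row(HEADER, ROW)
# Assumption: the slide's "7:2" notation is <gat>:<mac>.
internal_port = f"{port['gat']}:{port['mac']}"
print(port["name"], "->", internal_port)
```

The ASIC number from this lookup is what goes into `show hardware internal gatos asic <n> counters interrupt` on the following slides.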

Finding the source of CRC errors


Scenario #1: Physical Issue
Front Panel e1/1 = Internal 7:2
Interrupt counters will increment on receipt of a bad CRC

N5k-1# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name                                 |Count   |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw2_INT_ig_pkt_err_cb_bm_eof_err           |1       |0       |1       |0
gat_fw2_INT_ig_pkt_err_eth_crc_stomp           |1       |0       |1       |0
gat_fw2_INT_ig_pkt_err_e802_3_len_err          |1       |0       |1       |0
gat_mm0_INT_rlp_rx_pkt_crc_err                 |1       |0       |1       |0
gat_mm0_INT_rlp_rx_pkt_crc_stomped             |1       |0       |1       |0

Finding the source of CRC errors


Scenario #1: Physical Issue
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1
10Gb/s interfaces will cut-through switch these bad frames and increment an output error at the egress port

N5k-1# show interface e1/5


TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision


Finding the source of CRC errors


Scenario #1: Physical Issue
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1
Interrupt counters increment upon transmit of the errored frame

N5k-1# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name                                 |Count   |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw1_INT_eg_pkt_err_cb_bm_eof_err           |1       |0       |0       |0
gat_fw1_INT_eg_pkt_err_eth_crc_stomp           |1       |0       |0       |0
gat_fw1_INT_eg_pkt_err_e802_3_len_err          |1       |0       |0       |0

Finding the source of CRC errors


Scenario #1: Physical Issue
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1
Another cut-through port receives the bad frame

N5k-2# show interface e1/5
RX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 input packets 0 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer

Finding the source of CRC errors


Scenario #1: Physical Issue
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1
Interrupt counters will increment on receipt of a bad CRC

N5k-2# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name                                 |Count   |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw1_INT_ig_pkt_err_cb_bm_eof_err           |1       |0       |1       |0
gat_fw1_INT_ig_pkt_err_eth_crc_stomp           |1       |0       |1       |0
gat_fw1_INT_ig_pkt_err_e802_3_len_err          |1       |0       |1       |0
gat_mm0_INT_rlp_rx_pkt_crc_err                 |1       |0       |1       |0
gat_mm0_INT_rlp_rx_pkt_crc_stomped             |1       |0       |1       |0

Finding the source of CRC errors


Scenario #1: Physical Issue
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1

N5k-2# show interface e1/3


TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision


Finding the source of CRC errors


Scenario #1: Physical Issue
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1, Front Panel e1/3 = Internal 0:2
Interrupt counters increment upon transmit of the errored frame

N5k-2# show hardware internal gatos asic 0 counters interrupt
Gatos 0 interrupt statistics:
Interrupt name                                 |Count   |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw2_INT_eg_pkt_err_cb_bm_eof_err           |1       |0       |0       |0
gat_fw2_INT_eg_pkt_err_eth_crc_stomp           |1       |0       |0       |0
gat_fw2_INT_eg_pkt_err_e802_3_len_err          |1       |0       |0       |0

Finding the source of CRC errors


Scenario #1: Physical Issue
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1, Front Panel e1/3 = Internal 0:2
The host will drop the bad frame in its Rx buffer

Finding the source of CRC errors


Observations, scenario #2

N5k-1# show interface e1/1
RX
20995002 unicast packets 1150262 multicast packets 6 broadcast packets
22145270 input packets 30519119563 bytes
1 jumbo packets 0 storm suppression packets

Finding the source of CRC errors


Observations, scenario #2


N5k-1# show interface e1/7


TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision


Finding the source of CRC errors


Observations, scenario #2

N7k-1# show interface e1/11
RX
4 unicast packets 0 multicast packets 0 broadcast packets
4 input packets 5672 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer
1 input error 0 short frame 0 overrun 0 underrun 0 ignored

Finding the source of CRC errors


Scenario #2: MTU Exceeded
Front Panel e1/1 = Internal 7:2
4000B frame transmitted
Jumbo packets increment whenever the Ethernet payload is greater than 1500 bytes; this is not always an error!

N5k-1# show interface e1/1
RX
20995002 unicast packets 1150262 multicast packets 6 broadcast packets
22145270 input packets 30519119563 bytes
1 jumbo packets 0 storm suppression packets

Finding the source of CRC errors


Scenario #2: MTU Exceeded
Front Panel e1/1 = Internal 7:2
4000B frame transmitted
Hardware counters keep track of size ranges.

N5k-1# show hardware internal gatos port e1/1 counters rx
RX_PKT_SIZE_IS_1519_TO_2047 | 0
RX_PKT_SIZE_IS_2048_TO_4095 | 1
RX_PKT_SIZE_IS_4095_TO_8191 | 0
RX_PKT_SIZE_IS_8192_TO_9216 | 0
RX_PKT_SIZE_GT_9216         | 0

Finding the source of CRC errors


Scenario #2: MTU Exceeded
Front Panel e1/1 = Internal 7:2
In this case, the MTU is set to the default of 1500 in class-default (the class-based MTU is 1500), so we enter an error condition.

N5k-1# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name                                 |Count   |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_bm_port2_INT_err_ig_mtu_vio                |1       |        |        |

Finding the source of CRC errors


Scenario #2: MTU Exceeded

Front Panel e1/1 = Internal 7:2

N5k-1# show policy-map type network-qos
Type network-qos policy-maps
===============================
policy-map type network-qos default-nq-policy
class type network-qos class-fcoe
pause no-drop
mtu 2158
class type network-qos class-default
mtu 1500

MTU is configured per class, under network-qos.


This allows for a separate FCoE MTU and Ethernet MTU.
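The per-class MTU idea can be sketched as a lookup. The two MTU values come from the policy-map output on this slide; the comparison itself is a simplification (the hardware limit may include extra headroom beyond the configured value), and `exceeds_mtu` is a hypothetical helper.

```python
# Values from the default network-qos policy shown on the slide.
CLASS_MTU = {
    "class-fcoe": 2158,
    "class-default": 1500,
}

def exceeds_mtu(payload_bytes: int, qos_class: str = "class-default") -> bool:
    """Return True when a payload is larger than the MTU configured for its class.

    Assumption: a plain comparison against the configured MTU; the ASIC's
    actual threshold is not modeled here.
    """
    return payload_bytes > CLASS_MTU[qos_class]

print(exceeds_mtu(4000))                 # the 4000B frame from scenario #2
print(exceeds_mtu(1500))                 # a standard-size payload
print(exceeds_mtu(2000, "class-fcoe"))   # fits within the FCoE class MTU
```

This is why the same 4000-byte frame would be an MTU violation in class-default but could be acceptable in a class configured with a jumbo MTU.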

Finding the source of CRC errors


Scenario #2: MTU Exceeded
Front Panel e1/1 = Internal 7:2, Front Panel e1/7 = Internal 0:1
Leaving the egress interface, the CRC has been stomped and other interrupts have fired.
Note that the egress interface will aggregate all frames from various source interfaces. Adding up counters can be tricky.

N5k-1# show hardware internal gatos asic 0 counters interrupt
Gatos 0 interrupt statistics:
Interrupt name                                 |Count   |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw1_INT_eg_pkt_err_cb_bm_eof_err           |1       |0       |1       |0
gat_fw1_INT_eg_pkt_err_eth_crc_stomp           |1       |0       |1       |0
gat_fw1_INT_eg_pkt_err_ip_pyld_len_err         |1       |0       |1       |0
gat_mm1_INT_rlp_tx_pkt_crc_err                 |1       |0       |1       |0

Finding the source of CRC errors


Scenario #2: MTU Exceeded
Front Panel e1/1 = Internal 7:2, Front Panel e1/7 = Internal 0:1
The store-and-forward card on the Nexus 7000 parses the entire frame and finds a bad CRC value. A drop occurs on N7k-1; the frame never makes it to N5k-2.

N7k-1# show interface e1/11
RX
4 unicast packets 0 multicast packets 0 broadcast packets
4 input packets 5672 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer
1 input error 0 short frame 0 overrun 0 underrun 0 ignored

Troubleshooting Nexus 5000 / 2000


Problem Isolation

Platform Overview and troubleshooting


NX-OS Operation
Crashes

Nexus 5000
CRC errors
Ethanalyzer / CPU

Queuing and forwarding


Spanning-tree
Nexus 2000

Redundancy operation and troubleshooting



NX-OS
High CPU
Hardware-accelerated switches do not rely on the CPU for frame forwarding and processing.
*Some L3 paths do require the CPU path if hardware entries are missing (punt)
The CPU is critical for control-plane activities:
LACP: without keeping up with LACPDUs, 802.3ad port-channels would go down
STP and STP Bridge Assurance: a downstream switch missing BPDUs will go forwarding on a blocked port. If the CPU cannot keep up with sending BPDUs, loops can form. Bridge Assurance helps in some ways: instead of going forwarding, a BA-enabled switch will block the interface.
vPC programming: MAC addresses learned on vPC interfaces must be installed on both switches in order to prevent flooding as well as deliver frames to their destination
Redundancy: in the event of a switch outage, the CPU needs to reprogram state information for all processes and configure MAC addresses on interfaces in their respective VLANs.
Configuration and management: an unresponsive switch is not useful as a troubleshooting tool, and you are blind without a reliable interface with the network

NX-OS
High CPU
Ideally you have a baseline, so that current CPU trends can be compared with a known nominal state
Always gather these 3 commands, repeating them frequently

show process cpu sort | exclude 0.0


show system resources
show process cpu history
N5k-1# show process cpu sort | exclude 0.0
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4120 1137 10931494 17.5% pfma
4204 1477 84831831 1.9% gatosusd

N5k-1# show system resources
Load average:   1 minute: 0.63   5 minutes: 1.35   15 minutes: 1.41
Processes   :   281 total, 1 running
CPU states  :   1.0% user,   8.9% kernel,   90.1% idle
Memory usage:   2073408K total,   1412108K used,   661300K free
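Since these commands are meant to be gathered repeatedly, it can help to pull the numbers out of saved output. The sketch below extracts the CPU states line from text laid out like the slide; the regular expression and function name are hypothetical and would need adjusting if your software release formats the line differently.

```python
import re

SAMPLE = """\
Load average:   1 minute: 0.63   5 minutes: 1.35   15 minutes: 1.41
Processes   :   281 total, 1 running
CPU states  :   1.0% user,   8.9% kernel,   90.1% idle
Memory usage:   2073408K total,   1412108K used,   661300K free
"""

CPU_STATES = re.compile(r"([\d.]+)% user,\s+([\d.]+)% kernel,\s+([\d.]+)% idle")

def parse_cpu_states(text: str) -> dict:
    """Return the user, kernel and idle percentages from 'show system resources' text."""
    match = CPU_STATES.search(text)
    if match is None:
        raise ValueError("no CPU states line found")
    user, kernel, idle = (float(g) for g in match.groups())
    return {"user": user, "kernel": kernel, "idle": idle}

cpu = parse_cpu_states(SAMPLE)
busy = round(100.0 - cpu["idle"], 1)  # everything that is not idle
print(cpu, busy)
```

Logging the busy percentage with a timestamp on each collection gives the baseline that the slide recommends comparing against.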


NX-OS
High CPU
Note the difference between * (maximum CPU) and # (average CPU)
This is a completely normal-looking graph; focus on extended periods of high average CPU
N5k-1# show process cpu history
1

11

789509607796857706878950694778698849688895079850886958858500
753105000482598603786430941227125016911055026100692801248500
100

** *

90

** ** *

80 *** ** *

* *

* * ** * *

* * *** **** * *

* **

* * *

**

* * *

*** *

* **

* *** * **** * ** *** * ** * **

70 *** ** **** * *** **** *** *** *** ****** **** *** * ** * **
60 *** ****************** *** ******* *********** ***** ** ****
50 ************************** ******* *************************
40 ************************************************************

30 ***********************************************************#
20 *##**#*******#***********#*#*#**#**##*###*###**##****#****##
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0

CPU% per minute (last 60 minutes)
* = maximum CPU%   # = average CPU%

NX-OS
Ethanalyzer
Display and capture control-plane frames with the built-in Ethanalyzer utility
Based on the Wireshark project, with an NX-OS command front end
Can display like tshark, or capture to a .pcap file to analyze elsewhere
Can be used on mgmt0 as well as eth3 and eth4, the low- and high-priority CPU queues
[Figure: the CPU reaches the UPC through the South Bridge and NIC interfaces eth3 (low priority) and eth4, and reaches MGMT0 through eth0. Traffic labels shown: CDP, ICMP, ARP, CFS, LACPDU, BPDU, DCBX]

NX-OS
Ethanalyzer example
capture mgmt0 traffic and save to a file on bootflash
view capture files
copy off for further analysis
N5k-1# ethanalyzer local interface mgmt write bootflash:managementCAP
Program exited with status 0.
N5k-1# dir bootflash: | inc management
1224 Apr 04 16:56:33 2011 managementCAP
N5k-1# ethanalyzer local read bootflash:managementCAP


2011-04-04 16:56:33.763150 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=68
2011-04-04 16:56:33.763527 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.763968 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.764391 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.764811 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.765230 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.765649 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.765928 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=68
Win=65535 Len=0 TSV=597611264 TSER=19040186
2011-04-04 16:56:33.765930 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=120
Win=65535 Len=0 TSV=597611264 TSER=19040186
2011-04-04 16:56:33.765932 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=172
Win=65535 Len=0 TSV=597611264 TSER=19040186


NX-OS
Ethanalyzer example
capture high priority traffic with capture-filter and display to terminal
N5k-1# ethanalyzer local interface inbound-hi capture-filter "not ip"
Capturing on eth4
wireshark-broadcom-rcpu-dissector: ethertype=0xde08, devicetype=0x0
2005-02-11 20:36:50.251412 00:0d:ec:d6:02:e4 -> 01:80:c2:00:00:00 STP RST. Root =
8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809d
2005-02-11 20:36:50.252075 00:0d:ec:d6:02:e0 -> 01:80:c2:00:00:00 STP RST. Root =
8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x8099
2005-02-11 20:36:50.252204 00:0d:ec:d6:02:e1 -> 01:80:c2:00:00:00 STP RST. Root =
8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809a
2005-02-11 20:36:50.252317 00:0d:ec:d6:02:e9 -> 01:80:c2:00:00:00 STP Conf. Root =
8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x80a2
2005-02-11 20:36:50.252426 00:0d:ec:d6:02:e8 -> 01:80:c2:00:00:00 STP RST. Root =
8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x80a1
2005-02-11 20:36:50.391691 00:0d:ec:d3:b5:f4 -> 01:80:c2:00:00:0e LLC U, func=UI; SNAP, OUI
0x00000C (Cisco), PID 0x0134
2005-02-11 20:36:50.803069 00:12:43:01:b0:98 -> 01:80:c2:00:00:00 STP Conf. Root =
8291/00:d0:03:62:4c:00 Cost = 0 Port = 0x8081
2005-02-11 20:36:52.251349 00:0d:ec:d6:02:e4 -> 01:80:c2:00:00:00 STP RST. Root =
8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809d
2005-02-11 20:36:52.251366 00:0d:ec:d6:02:e0 -> 01:80:c2:00:00:00 STP RST. Root =
8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x8099
2005-02-11 20:36:52.251373 00:0d:ec:d6:02:e1 -> 01:80:c2:00:00:00 STP RST. Root =
8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809a


NX-OS
Ethanalyzer and CPU
Using Ethanalyzer to help identify external causes of high CPU utilization
N5k-1# show system resources
Load average:   1 minute: 0.95   5 minutes: 1.54   15 minutes: 1.46
Processes   :   281 total, 4 running
CPU states  :   26.7% user,   26.7% kernel,   46.5% idle
Memory usage:   2073408K total,   1412172K used,   661236K free

N5k-1# show process cpu sort | exclude 0.0
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4230 398 5011881 22.0% snmpd
4204 1467 84869127 20.2% gatosusd
4226 433 5601856 5.5% statsclient
4264 1380 391510 3.7% ethpm
4302 254 103 2468 1.8% netstack

NX-OS
Ethanalyzer and CPU
Baseline per second
esc-n5020-1# show process cpu history

211111111131111111111121111111131111111114111111831112111111
002244240786947901001225201001390000110010000902910013010023
100
90

80

70

60

50

40

30

##

20 #

#### ##

##

##

10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0

CPU% per second (last 60 seconds)


# = average CPU%


NX-OS
Ethanalyzer and CPU
Observed spike in CPU (per second)
N5k-1# show process cpu history
1

754669098990899966777977656766876775178734455655456466545645
006186077990796258300801881187120477641015900150830621684070
100

### ### ##

90

###########

80

###########

70 #

60 #

#####################

##### ##

###

################################# ###

##

# ###

50 #################################### ### ###################


40 #################################### ### ###################
30 #################################### #######################
20 ############################################################
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0

CPU% per second (last 60 seconds)


# = average CPU%

NX-OS
Ethanalyzer and CPU
Baseline per minute
N5k-1# show process cpu history

11

789509607796857706878950694778698849688895079850886958858500
753105000482598603786430941227125016911055026100692801248500
100

** *

90

** ** *

80 *** ** *

* *

* * ** * *

* * *** **** * *

* **

* * *

**

* * *

*** *

* **

* *** * **** * ** *** * ** * **

70 *** ** **** * *** **** *** *** *** ****** **** *** * ** * **
60 *** ****************** *** ******* *********** ***** ** ****
50 ************************** ******* *************************
40 ************************************************************
30 ***********************************************************#
20 *##**#*******#***********#*#*#**#**##*###*###**##****#****##
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0

CPU% per minute (last 60 minutes)


* = maximum CPU%   # = average CPU%

NX-OS
Ethanalyzer and CPU
We also notice a spike in average CPU over the past 5 minutes

899074676686870687895096077968577068789506947786988496888950
189068779462040167531050004825986037864309412271250169110550
100

***

90

***

80 *****

*
*
*

** *

* * ** ** *

* * * **** ** *

* *

* * ** * *

* **

* * *** **** * *

* *

* *

* *** * **** *

70 ***** *** * *** **** ** **** * *** **** *** *** *** ****** *
60 **#** ************** ****************** *** ******* ********
50 *##**************************************** ******* ********
40 ###*#*******************************************************
30 ######******************************************************
20 #######******#****##**#*******#***********#*#*#**#**##*###*#
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0

CPU% per minute (last 60 minutes)


* = maximum CPU%   # = average CPU%

NX-OS
Ethanalyzer and CPU
Capturing on mgmt, we see there is an snmpwalk occurring
This should be a temporary condition and should not affect switching performance, but you may notice latency on the terminal
It could affect other control-plane transactions such as configuration backups, collection scripts, etc.
Now you can check with your network management team to work out when this is appropriate, or whether it is a mistake. A full walk is not very efficient to run regularly.
N5k-1# ethanalyzer local interface mgmt capture-filter "not host 10.116.114.157"
Capturing on eth0
wireshark-broadcom-rcpu-dissector: ethertype=0xde08, devicetype=0x0
2005-02-11 21:25:48.452632 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.455871 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.458120 172.18.118.162 -> 172.18.118.34 SNMP get-response

2005-02-11 21:25:48.459968 172.18.118.34 -> 172.18.118.162 SNMP get-next-request


2005-02-11 21:25:48.462428 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.464066 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.466903 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.468165 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.471662 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.472263 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
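When the capture is long, counting requests per source makes the talker obvious. The sketch below works on text lines shaped like the Ethanalyzer display above; the regular expression and `top_requesters` are hypothetical helpers written for this layout, not part of Ethanalyzer.

```python
import re
from collections import Counter

# date, time, source, "->", destination, protocol, rest of the summary
LINE = re.compile(r"^\S+ \S+ (\S+) -> (\S+) (\S+) (.*)$")

def top_requesters(lines, protocol="SNMP"):
    """Count request packets per source address for one protocol, busiest first."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match and match.group(3) == protocol and "request" in match.group(4):
            counts[match.group(1)] += 1
    return counts.most_common()

capture = [
    "2005-02-11 21:25:48.452632 172.18.118.162 -> 172.18.118.34 SNMP get-response",
    "2005-02-11 21:25:48.455871 172.18.118.34 -> 172.18.118.162 SNMP get-next-request",
    "2005-02-11 21:25:48.458120 172.18.118.162 -> 172.18.118.34 SNMP get-response",
    "2005-02-11 21:25:48.459968 172.18.118.34 -> 172.18.118.162 SNMP get-next-request",
]
print(top_requesters(capture))
```

A long, uninterrupted run of get-next-request packets from a single address is the signature of a walk, which is what points the conversation toward the network management team.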

Troubleshooting Nexus 5000 / 2000


Problem Isolation

Platform Overview and troubleshooting


NX-OS Operation
Crashes

Nexus 5000
CRC errors
Ethanalyzer / CPU

Queuing and forwarding


Spanning-tree
Nexus 2000

Redundancy operation and troubleshooting



Nexus 5000/5500 Queuing


Nexus 5000/5500 utilize ingress queuing
Ingress queuing is helpful for data flows where many ports talk to few; the load is spread across the sources
A simple flow-control mechanism can be implemented; end-to-end flow control is necessary for FCoE
Ingress queuing is implemented by Virtual Output Queuing (VOQ)
VOQ prevents head-of-line blocking
One egress interface can be congested, but the ingress buffer still accepts frames into other queues
8 class-based unicast VOQs per egress interface on every ingress interface
8 class-based multicast VOQs per ingress interface
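Head-of-line blocking and the way VOQ avoids it can be shown with a toy model. This is a conceptual sketch only: frames are represented by their egress port name, classes and buffer sizes are ignored, and both functions are hypothetical.

```python
from collections import deque

def drain_fifo(frames, congested):
    """Single ingress FIFO: everything waits behind the first frame for a congested port."""
    queue = deque(frames)
    sent = []
    while queue and queue[0] not in congested:
        sent.append(queue.popleft())
    return sent

def drain_voq(frames, congested):
    """One virtual output queue per egress port: only congested ports are held back."""
    voqs = {}
    for egress in frames:
        voqs.setdefault(egress, deque()).append(egress)
    sent = []
    for egress, queue in voqs.items():
        if egress not in congested:
            sent.extend(queue)
    return sent

# Egress port of each frame, in arrival order on one ingress port.
arrivals = ["e1/5", "e1/7", "e1/7", "e1/5", "e1/7"]
congested_ports = {"e1/5"}

print(drain_fifo(arrivals, congested_ports))  # blocked behind the first e1/5 frame
print(drain_voq(arrivals, congested_ports))   # e1/7 traffic still flows
```

With a single FIFO nothing is sent because the head frame is bound for the congested port; with per-egress queues the three frames for e1/7 are unaffected.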



Nexus 5000/5500 Queuing

Ingress queuing implication on troubleshooting:


Drops occur at INGRESS!
You must think about where the flow originates on the switch to
determine where you would like to look for drops.


Nexus 5000/5500 Queuing


N5k-1# show queuing interface e1/5
Ethernet1/5 queuing information:
TX Queuing
qos-group sched-type oper-bandwidth
WRR 50
WRR 50

RX Queuing
qos-group 0
q-size: 243200, HW MTU: 1600 (1500 configured)
drop-type: drop, xon: 0, xoff: 1520
Statistics:
Pkts received over the port             : 100882627
Ucast pkts sent to the cross-bar        : 100877529
Mcast pkts sent to the cross-bar        : 0
Ucast pkts received from the cross-bar  : 786990
Pkts sent to the port                   : 692821
Pkts discarded on ingress               : 5098
Per-priority-pause status               : Rx (Inactive), Tx (Inactive)

Ingress discards are present when buffering is not sufficient for the traffic flow.
For example, 2 interfaces transmitting toward 1 interface in sustained oversubscription.
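The statistics on this slide can be cross-checked with simple arithmetic. In this sample, packets received minus packets sent to the cross-bar equals the ingress discard counter; treating that as a general accounting rule is an assumption, since the slides do not state it, and the dictionary keys below are hypothetical names.

```python
# Values copied from the RX Queuing statistics for qos-group 0 on Ethernet1/5.
stats = {
    "received_over_port": 100882627,
    "ucast_to_crossbar": 100877529,
    "mcast_to_crossbar": 0,
    "discarded_on_ingress": 5098,
}

def unaccounted_on_ingress(s: dict) -> int:
    """Packets that arrived on the port but were never handed to the cross-bar."""
    return s["received_over_port"] - s["ucast_to_crossbar"] - s["mcast_to_crossbar"]

missing = unaccounted_on_ingress(stats)
print(missing, missing == stats["discarded_on_ingress"])
```

Taking two snapshots a few minutes apart and comparing the difference is a quick way to tell whether discards are still occurring or are historical.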

Nexus 5000/5500 Queuing


Scenario

[Topology diagram: Server A on N5k-1 e1/1, a trunk between N5k-1 e1/5 and N5k-2 e1/5, and Server B on N5k-2 e1/3]

Server A is sending some traffic toward Server B
Both servers have had static ARP entries applied for troubleshooting
Server B does not see traffic from Server A when sniffing locally
They are both configured to be in the same VLAN

Nexus 5000/5500 Queuing


Scenario
Start at the ingress interface on server A
ASIC names: gatos on Nexus 5000, carmel on Nexus 5500
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1

N5k-1# show hardware internal gatos port e1/1 | grep "gatos i"
gatos instance : 7
gatos iport : 2
----------------------------------------------------------------
N55k-1# show hardware internal carmel port e1/1 | grep "carmel i"
carmel instance : 0
carmel iport : 1

For this example we will use Nexus 5000 outputs, but on a Nexus 5500 you can substitute carmel for gatos in the commands, as the two are laid out in a similar architecture.
The actual counters and errors may vary; the methodology does not.

Nexus 5000/5500 Queuing


Scenario
Start at the ingress interface on server A
Front Panel e1/1 = Internal 7:2, Front Panel e1/5 = Internal 7:1

N5k-1# show platform fwm info pif e1/1 | grep stats
Eth1/1 pd: tx stats: bytes 147694477 frames 0 discard 0 drop 0
Eth1/1 pd: rx stats: bytes 26022500 frames 0 discard 0 drop 0
Eth1/1 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/1 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0

These outputs are clean

Nexus 5000/5500 Queuing


Scenario


N5k-1# show platform fwm info asic-errors 7


Printing non zero Gatos error registers:

N5k-1# show hardware internal gatos asic 7 counters interrupt


Gatos 7 interrupt statistics:
Interrupt name
|Count
|ThresRch|ThresCnt|Ivls

These outputs are also clean


Move on to the egress interface e1/5

In this case, e1/5 is on the same ASIC, so we have already
gathered the output needed

Nexus 5000/5500 Queuing


Scenario

N5k-1# show platform fwm info pif e1/5 | grep stats
Eth1/5 pd: tx stats: bytes 476497477 frames 0 discard 0 drop 0
Eth1/5 pd: rx stats: bytes 232322392 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0

These outputs are clean


Nexus 5000/5500 Queuing


Scenario

N5k-1# show platform fwm info pif e1/5 | grep stats
Eth1/5 pd: tx stats: bytes 332298390 frames 0 discard 0 drop 0
Eth1/5 pd: rx stats: bytes 176797274 frames 0 discard 0 drop 208
Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0

208 receive drops seen on port e1/5


Next we try to find the reason for these drops
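
The check for non-zero drop counters can be scripted when many ports need to be compared. The following is a minimal sketch, not a Cisco tool: it assumes the output of show platform fwm info pif has been captured as text with the interface name at the start of each stats line, and the function name find_pif_drops is hypothetical.

```python
import re

# Sample lines copied from the pif output shown on this slide.
SAMPLE = """\
Eth1/5 pd: tx stats: bytes 332298390 frames 0 discard 0 drop 0
Eth1/5 pd: rx stats: bytes 176797274 frames 0 discard 0 drop 208
Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0
"""

# Assumed line layout: "<port> pd[ fcoe]: <tx|rx> stats: bytes N frames N discard N drop N"
LINE_RE = re.compile(
    r"^(?P<port>\S+)\s+(?P<kind>pd(?: fcoe)?):\s+(?P<dir>tx|rx) stats:\s+"
    r"bytes (?P<bytes>\d+) frames (?P<frames>\d+) "
    r"discard (?P<discard>\d+) drop (?P<drop>\d+)\s*$"
)


def find_pif_drops(text):
    """Return (port, kind, direction, counter, value) for each non-zero discard or drop."""
    hits = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a stats line
        for counter in ("discard", "drop"):
            value = int(match.group(counter))
            if value:
                hits.append(
                    (match.group("port"), match.group("kind"),
                     match.group("dir"), counter, value)
                )
    return hits


if __name__ == "__main__":
    for hit in find_pif_drops(SAMPLE):
        print(hit)
```

Run against the sample, the only hit is the rx drop counter on Eth1/5, which is the counter to chase in the ASIC error registers.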


Nexus 5000/5500 Queuing


Scenario

N5k-1# show platform fwm info asic-errors 7
Printing non zero Gatos error registers:
DROP_SRC_VLAN_MBR: res0 = 624 res1 = 0

DROP_SRC_VLAN_MBR is 624
This counter is 3x the number of frame drops - hardware
caveat
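
To relate the register value back to the pif drop counter, divide by three. A small sketch of that arithmetic follows. The function name is hypothetical, and the multiplier of 3 is taken from this slide for this counter only; it is an assumption that should not be applied to other registers without confirmation.

```python
import re

# Assumed register line layout: "<NAME>: res0 = <N> res1 = <M>"
REGISTER_RE = re.compile(r"^\s*(\w+):\s*res0\s*=\s*(\d+)\s+res1\s*=\s*(\d+)\s*$")


def frames_from_asic_errors(line, multiplier=3):
    """Parse one asic-errors register line and scale res0 down to a frame count."""
    match = REGISTER_RE.match(line)
    if match is None:
        raise ValueError("not a register line: %r" % line)
    name = match.group(1)
    res0 = int(match.group(2))
    # The slide states this counter is 3x the number of dropped frames.
    return name, res0 // multiplier


if __name__ == "__main__":
    print(frames_from_asic_errors("DROP_SRC_VLAN_MBR: res0 = 624 res1 = 0"))
```

For the value on the slide, 624 / 3 = 208, which matches the rx drop count seen on e1/5.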


Nexus 5000/5500 Queuing


Scenario

N5k-1# show hardware internal gatos asic 7 counters interrupt
...
gat_lu_lkup1_INT_func_lo_drop_src_vlan_mbr|74 |
...

Interrupt counters will agree that a given error has fired from the
hardware, but the number is in hex, and not every interrupt is
recorded because of the rate at which interrupts can hit the CPU.
Generally this number will be somewhat less than the show platform
fwm info pif drop number.
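
Because the interrupt count is printed in hex, convert it before comparing it with the decimal pif counters. A minimal sketch, assuming the row is pipe-delimited with the name in the first field and the count in the second (the function name is hypothetical):

```python
def parse_interrupt_row(row):
    """Split a pipe-delimited interrupt row and convert the hex count to decimal."""
    fields = [field.strip() for field in row.split("|")]
    name, count_hex = fields[0], fields[1]
    return name, int(count_hex, 16)


if __name__ == "__main__":
    name, count = parse_interrupt_row("gat_lu_lkup1_INT_func_lo_drop_src_vlan_mbr|74 |")
    print(name, count)
```

Here 0x74 is 116 decimal, which is less than the 208 drops reported by the pif counters, as expected.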


Nexus 5000/5500 Queuing


Scenario

N5k-1# interface Ethernet1/5
  switchport mode trunk
  switchport trunk allowed vlan 100-103

N5k-1# interface Ethernet1/5
  switchport mode trunk
  switchport trunk allowed vlan 100-102

From the outputs gathered, we can say either STP is blocking or the
VLAN is not allowed
The configs confirm the VLAN is not allowed
Use this same methodology to find counters incrementing
with your dropped traffic. Where the numbers increment,
you can find a reason
Various scenarios cause drops, and the register list is not
publicly available. A TAC case should be opened for scenarios with
conflicting or confusing output.
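
Comparing the allowed VLAN lists on both ends of a trunk is easy to script. The sketch below expands the range syntax used by switchport trunk allowed vlan and reports VLANs present on one side only. The function names are hypothetical, and only comma-separated numbers and ranges are handled (keywords such as all or none are not).

```python
def expand_vlans(spec):
    """Expand a VLAN list such as '100-103,200' into a set of integers."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def trunk_mismatch(side_a, side_b):
    """Return (VLANs only on side A, VLANs only on side B), each sorted."""
    a, b = expand_vlans(side_a), expand_vlans(side_b)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    print(trunk_mismatch("100-103", "100-102"))
```

With the two configs from this scenario, VLAN 103 is allowed on one side only, which is consistent with the DROP_SRC_VLAN_MBR counter.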

Troubleshooting Nexus 5000 / 2000


Problem Isolation

Platform Overview and troubleshooting


NX-OS Operation
Crashes

Nexus 5000
CRC errors
Ethanalyzer / CPU

Queuing and forwarding


Spanning-tree
Nexus 2000

Redundancy operation and troubleshooting



Spanning-tree

NX-OS keeps a long history of STP states

Usually you can trace back the change that caused an
outage, as long as it has not wrapped in the logs.

STP logs shouldn't normally wrap without constant topology
changes.

It is also a good idea to log STP at level 6:

N5k-2(config)# logging level spanning-tree 6
N5k-2# 2011 Jan 21 01:58:23 N5k-2 %STP-6-PORT_ROLE: Port port-channel14 instance VLAN007
role changed to designated


Spanning-tree

Checking all trees

N5k-1# show spanning-tree internal event-history all

-------------------- All the active STPs -----------
VDC01 VLAN0001
0) Transition at 848207 usecs after Thu Jan 13 05:05:54 2005
   Root: 0000.0000.0000.0000 Cost: 0 Age: 0 Root Port: none Port: none [STP_TREE_EV_UP]
1) Transition at 367168 usecs after Thu Jan 13 05:05:57 2005
   Root: 8001.000d.ecd6.02fc Cost: 0 Age: 0 Root Port: none Port: Ethernet1/15
   [STP_TREE_EV_UPDATE_TOPO_RCVD_SUP_BPDU]
2) Transition at 373395 usecs after Thu Jan 13 05:05:57 2005
   Root: 2063.00d0.0362.4c00 Cost: 2 Age: 1 Root Port: Ethernet1/15 Port: none
   [STP_TREE_EV_MULTI_FLUSH_LOCAL]
3) Transition at 434563 usecs after Thu Jan 13 05:06:00 2005
   Root: 2063.00d0.0362.4c00 Cost: 2 Age: 1 Root Port: Ethernet1/15 Port: Ethernet1/15
   [STP_TREE_EV_MULTI_FLUSH_RCVD]

Spanning-tree

... or just the tree you are interested in

N5k-1# show spanning-tree internal event-history tree 1 brief
2005:01:13 05h:05m:54s:848207us T_EV_UP VLAN0001 [0000.0000.0000.0000 C 0 A 0 R none P none]
2005:01:13 05h:05m:57s:367168us T_UT_SBPDU VLAN0001 [8001.000d.ecd6.02fc C 0 A 0 R none P Eth1/15]
2005:01:13 05h:05m:57s:373395us T_EV_M_FLUSH_L VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P none]
2005:01:13 05h:06m:00s:434563us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:01s:407259us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:02s:947220us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:04s:947216us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:06s:947457us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:08s:837586us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
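
When the brief history is long, it can help to pull out only the points where the root bridge ID changed. The following is a sketch under stated assumptions: it expects one event per line in the form "timestamp event tree [root C cost A age R root-port P port]", which is how the brief output appears to be laid out on this slide, and the function name root_changes is hypothetical.

```python
import re

# Sample events taken from the brief output above, one event per line.
SAMPLE = """\
2005:01:13 05h:05m:54s:848207us T_EV_UP VLAN0001 [0000.0000.0000.0000 C 0 A 0 R none P none]
2005:01:13 05h:05m:57s:367168us T_UT_SBPDU VLAN0001 [8001.000d.ecd6.02fc C 0 A 0 R none P Eth1/15]
2005:01:13 05h:05m:57s:373395us T_EV_M_FLUSH_L VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P none]
2005:01:13 05h:06m:00s:434563us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
"""

EVENT_RE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<event>\S+) (?P<tree>\S+) "
    r"\[(?P<root>\S+) C (?P<cost>\d+) A (?P<age>\d+) "
    r"R (?P<root_port>\S+) P (?P<port>\S+)\]$"
)


def root_changes(text):
    """Return (timestamp, event, root) for each event where the root ID differs from the previous one."""
    changes = []
    last_root = None
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # ignore prompts, blank lines, or unexpected layouts
        root = match.group("root")
        if root != last_root:
            changes.append((match.group("ts"), match.group("event"), root))
            last_root = root
    return changes


if __name__ == "__main__":
    for change in root_changes(SAMPLE):
        print(change)
```

On the sample, the root moves from the all-zero ID at tree bring-up, to 8001.000d.ecd6.02fc, and then to 2063.00d0.0362.4c00, after which only flush events follow.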


Troubleshooting Nexus 5000 / 2000


Problem Isolation

Platform Overview and troubleshooting


NX-OS Operation
Crashes

Nexus 5000
Nexus 2000
Management

Queuing and forwarding


Logs


FEX Management
FEX fabric interfaces run SDP (Satellite Discovery Protocol)
You can view the status of a FEX and see some
logs from the N5k:
N5k-1# show fex 100
FEX: 100 Description: FEX0100   state: Online
  FEX version: 5.0(3)N1(1b) [Switch version: 5.0(3)N1(1b)]
  Extender Model: N2K-C2148T-1GE,  Extender Serial: JAF1326BBRC
  Part No: 73-12009-05
  pinning-mode: static    Max-links: 1
  Fabric port for control traffic: Eth1/3
  Fabric interface state:
    Eth1/3 - Interface Up. State: Active


FEX Management
N5k-1# show fex 100 detail
FEX: 100 Description: FEX0100   state: Online
  FEX version: 5.0(3)N1(1b) [Switch version: 5.0(3)N1(1b)]
  FEX Interim version: 5.0(3)N1(1b)
  Switch Interim version: 5.0(3)N1(1b)
  Extender Model: N2K-C2148T-1GE,  Extender Serial: JAF1326BBRC
  Part No: 73-12009-05
  Card Id: 70, Mac Addr: 00:0d:ec:d3:b5:c2, Num Macs: 64
  Module Sw Gen: 21  [Switch Sw Gen: 21]
  post level: complete
...
Logs:
02/02/2005 13:09:06.946120: Module register received
02/02/2005 13:09:06.947614: Image Version Mismatch
02/02/2005 13:09:06.947960: Registration response sent
02/02/2005 13:09:06.948392: Requesting satellite to download image
02/02/2005 13:14:54.149480: Image preload successful.
02/02/2005 13:14:55.375447: Deleting route to FEX
02/02/2005 13:14:55.384270: Module disconnected
02/02/2005 13:14:55.386372: Module Offline
02/02/2005 13:16:52.847574: Module register received
02/02/2005 13:16:52.849146: Registration response sent
02/02/2005 13:16:53.419079: Module Online Sequence
02/02/2005 13:17:09.507541: Module Online
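
The timestamps in the Logs section make it straightforward to measure how long the FEX was offline during the image upgrade. A minimal sketch follows; it assumes the date is in month/day/year order and that the messages read exactly "Module Offline" and "Module Online" as shown above. The function names are hypothetical.

```python
from datetime import datetime

# Sample lines copied from the show fex 100 detail log above.
SAMPLE = """\
02/02/2005 13:14:55.386372: Module Offline
02/02/2005 13:16:52.847574: Module register received
02/02/2005 13:16:53.419079: Module Online Sequence
02/02/2005 13:17:09.507541: Module Online
"""


def parse_fex_log(text):
    """Return a list of (datetime, message) tuples for lines with a parsable timestamp."""
    events = []
    for line in text.splitlines():
        stamp, sep, message = line.strip().partition(": ")
        if not sep:
            continue
        try:
            # Assumption: month/day/year ordering.
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S.%f")
        except ValueError:
            continue
        events.append((when, message))
    return events


def offline_seconds(events):
    """Seconds between the first 'Module Offline' and the next 'Module Online', or None."""
    went_offline = None
    for when, message in events:
        if message == "Module Offline" and went_offline is None:
            went_offline = when
        elif message == "Module Online" and went_offline is not None:
            return (when - went_offline).total_seconds()
    return None


if __name__ == "__main__":
    print(offline_seconds(parse_fex_log(SAMPLE)))
```

For the log above, the FEX was offline for a little over 134 seconds while it reloaded with the new image.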

FEX Management
N5k-1# show system internal fex log fport e1/3
Satmgr debug messages for If 0x1a002000:
[19952]02/02/2005 13:08:32.191646: if [0x1a002000]:Phy cleanup rcvd
[19956]02/02/2005 13:08:32.192257: fport [0x1a002000]:Log - Interface Down
[19957]02/02/2005 13:08:32.192266: fport [0x1a002000]:satmgr_fport_fsm: even:t Port Down. curr state: Discovered
[19958]02/02/2005 13:08:32.192654: fport [0x1a002000]:Log - State changed to: Created
[19962]02/02/2005 13:08:32.192853: fport [0x1a002000]:satmgr_fport_fsm: new state: Created
[19967]02/02/2005 13:08:32.193991: fport [0x1a002000]:Log - fport phy cleanup retry end: sending out resp
[19970]02/02/2005 13:08:32.206315: if [0x1a002000]:Pre Cfg rcvd
[19971]02/02/2005 13:08:32.206606: fport [0x1a002000]:Log - pre config: is not a port-channel member
[19977]02/02/2005 13:08:33.727893: fport [0x1a002000]:Log - Interface Up
[19978]02/02/2005 13:08:33.727904: fport [0x1a002000]:satmgr_fport_fsm: even:t Port Down. curr state: Created
[19982]02/02/2005 13:08:33.729944: fport [0x1a002000]:Log - Port Bringup rcvd
[19986]02/02/2005 13:08:33.731201: fport [0x1a002000]:Log - Suspending Fabric port. reason: Fex not configured
[19987]02/02/2005 13:08:33.731216: fport [0x1a002000]:Log - fport bringup retry end: sending out resp
[19997]02/02/2005 13:08:34.120031: fport [0x1a002000]:Log - Fcot message sent to Ethpm
[19998]02/02/2005 13:08:34.120092: fport [0x1a002000]:Log - Satellite discovered msg sent
[19999]02/02/2005 13:08:34.120459: fport [0x1a002000]:Log - State changed to: Discovered



FEX Drops
Network interface drops can be seen from the N5k with
show queuing interface as of 5.0(3)N1(1)
Best to attach to the FEX to get detailed logs
Similar to Cat 6k or Nexus 7k linecard commands
Important to check here, as FEXes also have crash
logs, have their own CPU, and are responsible for
communicating link state and offloading some
protocols like CDP.
N5k-1# attach fex 100
Attaching to FEX 100 ...
To exit type 'exit', to abort type '$.'
fex-100#



FEX Drops
The scenario we are looking for is big pipe to little
pipe, or many to one.
Know the flow of traffic! If you know the pattern,
finding where it is likely to stress the network will be
easier.
10G to 1G is especially difficult to buffer. The FEX is
often the last place 10G traffic can be buffered for your
1G hosts, so drops are likely to occur here and not
elsewhere in your 10G network.
FEX queue-limit and buffer-threshold can be
adjusted globally, per fex-type, or per FEX
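
A back-of-the-envelope calculation shows why 10G to 1G is hard to buffer. The sketch below estimates how long a sustained burst can be absorbed before the egress buffer is full. The 100,000-byte buffer is a purely hypothetical figure chosen for illustration, not a Nexus 2000 specification, and the function name is an assumption as well.

```python
def seconds_until_drop(buffer_bytes, ingress_bps, egress_bps):
    """Time for a buffer to fill when traffic arrives faster than it can drain."""
    if ingress_bps <= egress_bps:
        # The queue never grows, so the buffer never fills.
        return float("inf")
    fill_rate_bps = ingress_bps - egress_bps
    return buffer_bytes * 8 / fill_rate_bps


if __name__ == "__main__":
    # Hypothetical 100,000-byte buffer, 10G burst toward a 1G host port.
    t = seconds_until_drop(100_000, 10e9, 1e9)
    print("buffer full after %.1f microseconds" % (t * 1e6))
```

With these illustrative numbers the buffer fills in roughly 89 microseconds, so even very short line-rate bursts toward a 1G host can produce drops.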

FEX Drops
2148
fex-100# dbgexec rw
rw> show ints <0-6>
ASIC: 0:
+-----------+--------------------------+---------+---------+---------+---------+
| ASIC-Port | Interrupt Bit Field      | Count1  | Thresh1 | Count2  | Thresh2 |
+-----------+--------------------------+---------+---------+---------+---------+
| 0-NI1     | not_synced_lane_3        |       1 |       0 |       0 |       1 |
| 0-NI1     | not_synced_lane_2        |       1 |       0 |       0 |       1 |
| 0-NI1     | not_synced_lane_0        |       1 |       0 |       0 |       1 |
| 0-NI1     | synced_lane_3            |       1 |       0 |       0 |       1 |
| 0-NI1     | synced_lane_2            |       1 |       0 |       0 |       1 |
| 0-NI1     | synced_lane_1            |       1 |       0 |       0 |       1 |
| 0-NI1     | synced_lane_0            |       1 |       0 |       0 |       1 |
| 0-NI1     | loc_fault                |       1 |       0 |       0 |       1 |
| 0-NI1     | not_aligned              |       1 |       0 |       0 |       1 |
| 0-NI1     | aligned                  |       1 |       0 |       0 |       1 |
+-----------+--------------------------+---------+---------+---------+---------+

This output is clean, with no wo_cr counters. *The command shows non-zero counters.

wo_cr indicates the buffer is without credit



FEX Drops
2148
rw> drops <0-6> hi<0-8>
Dropped packet counters for 0-HI0:
red_hix_cnt_rx_allow_vntag_drop         : 0
red_hix_cnt_rx_echannel_drop            : 0
red_hix_cnt_rx_fwd_drop                 : 0
red_hix_cnt_rx_mc_drop                  : 0
red_hix_cnt_rx_runt_pkt_drop            : 0
red_hix_cnt_rx_src_vif_out_of_range_drop: 0
red_hix_cnt_tx_lb_drop                  : 11892
0-SS0 DDROP counters:
OQ0: Class0: 0 Class1: 0 Class2: 0 Class3: 0
OQ1: Class0: 0 Class1: 0 Class2: 0 Class3: 0
OQ2: Class0: 0 Class1: 0 Class2: 0 Class3: 0
OQ3: Class0: 0 Class1: 0 Class2: 0 Class3: 0
OQ4: Class0: 0 Class1: 0 Class2: 0 Class3: 0
0-SS0 ECC1: 0 ECC2: 0
0-SS0 wo_cr: 0 no cells: 0 mtu_vio: 0

FEX Drops
2248

This output is clean, with no wr_disc or wr_rcv_err.

N5k-1# attach fex 130
fex-130# dbgexec satctrl
satctrl/qosctrl> show port 0 0 2 <0-3>    *uplink interfaces queue on ingress
...
Rx Discard (WR_DISC):
Rx Multicast Discard (WR_DISC_MC):
Rx Error (WR_RCV_ERR):
...


FEX Drops
2248
satctrl/qosctrl> show asic 0 0
SS Statistics:
SS | No Credit* | No Cells  | MTU Error | OQ Discard | Free Cells
---+------------+-----------+-----------+------------+-----------
0    10213  10213   [column positions of these two values were lost in extraction]
...
Dropped packets per CoS due to OQ head-drop, OQ is per 8 port group:
OQ  | CoS 0    | CoS 1    | CoS 2    | CoS 3    | CoS 4    | CoS 5    | CoS 6    | CoS 7
----+----------+----------+----------+----------+----------+----------+----------+----------
NR0
NR1
NR2
NR3
NR4
NR5
----+----------+----------+----------+----------+----------+----------+----------+----------
HR0
HR1
HR2
HR3
HR4
HR5


FEX Drops
2248
fex130# dbgexec prt
prt> drops
PRT_SS_CNT_TAIL_DROP8 : 2 SS0
prt> show rmon 0 ni<0-3>
+-------------+---------+--------+-------------+---------+--------+
| TX          | Current | Diff   | RX          | Current | Diff   |
+-------------+---------+--------+-------------+---------+--------+
| TX_PKT_LT64 |       0 |      0 | RX_PKT_LT64 |       0 |      0 |
| TX_PKT_64   |       8 |      0 | RX_PKT_64   |       5 |      1 |
| TX_PKT_65   | 4073560 | 521532 | RX_PKT_65   | 2062219 | 264039 |
| TX_PKT_128  | 2060397 | 263419 | RX_PKT_128  | 2149866 | 274780 |
| TX_PKT_256  |         |        | RX_PKT_256  | 1920669 | 245601 |
...

rmon counters are similar to the counters detailed on the N5k ports,
and are helpful for error tracking and finding packets of a certain size.
They update immediately, whereas show counters on the N5k wait for the statsclient.


FEX Logs
attach fex <n>
dbgexec rw/prt (rw=2148, prt=2248)

show ctx: driver information
show oper: link states for L1 status
show elog: event log chronicling hardware and software interaction, helpful for L1 issues
show ints: interrupt counters
show bootlog: bootup messages
show log: any other logs


Printout note

The final presentation may not end here. Look for updated content,
potentially at the live presentation.


Thank you.
