
THE SAN GUY

EMC GENERAL

SAN Performance Metrics

SEPTEMBER 25, 2012 | THESANGUY | 4 COMMENTS

I often get requests from application owners to review performance stats. I thought I'd give a quick
overview of some of the things I look at, what the myriad of performance metrics in Navisphere Analyzer
and ECC Performance Manager mean, and how you might use some of them to investigate a performance
problem. Performance analysis is very much an art (not a science), and it's sometimes difficult to pinpoint
exact causes because of the mix of applications and workload on the array. To be successful you need to
take all of the metrics into account with a holistic view. I also recommend collecting data on application
workloads over time, because workload characteristics will likely vary. If you have a major problem, I
would always recommend opening an SR with EMC.

This post is just an overview of SAN performance metrics and isn't meant to dive into every possible
scenario from every angle. EMC already has excellent guides for performance best practices that you can
read here:

http://www.emc.com/collateral/hardware/white-papers/h5773-clariion-best-practices-performance-availability-wp.pdf (older version for CLARiiON)
https://community.emc.com/message/796647 (newer version for VNX, see the 2nd post in the topic)

Because we have EMC's Performance Manager tool installed in our environment, I always go to that tool
first rather than Navisphere Analyzer. Both use the same metrics, so the following information will be
useful regardless of which method you use.

The first thing I do is look at the Storage Processors. This will give you a good indication of the overall
health of the array before you dive into the specific LUN (or LUNs) used by the application.

SP Cache Dirty Pages (%). These are pages in write cache that have received new data from hosts but
have not yet been flushed to disk. You want a high percentage of dirty pages because it increases the
chance of a read coming from cache, or of additional writes to the same block of data being absorbed by
the cache. If an IO is served from cache the performance is better than if the data had to be retrieved
from disk. That's why the default watermarks are usually around 60/80% or 70/90%. You don't want
dirty pages to reach 100%; they should fluctuate between the high and low watermarks (which means
the cache is healthy). Periodic spikes or drops outside the watermarks are okay, but consistently hitting
100% indicates that the write cache is overstressed.
SP Utilization (%). Check and see if either SP is running higher than about 75%. If either is running
that high, application response time will increase. Both SPs also need to be under 50% for non-
disruptive upgrades; at one point we had to do a large-scale migration of data from one SAN to another
in order to get an NDU accomplished. You'll also want to check for proper balance. If one SP is running
much higher than the other, you should consider migrating LUNs from one SP owner to the other. I check
SP balance on all of our arrays on a daily basis.
SP Response time (ms). Make sure again that both SPs are even and that response time is acceptable. I
like to see response times under 10ms. If you see that one SP has high utilization and response time but
the other SP doesn't, look for LUNs owned by the busier SP that are using more array resources.
Looking at total IO on a per-LUN basis can help confirm this. If both SPs have relatively similar
throughput but one SP has much higher bandwidth, it could mean that there is some large block IO occurring.
SP Port Queue Full Count. This represents the number of times that a front end port issued a QFULL
response back to the hosts. If you are seeing QFULLs it could mean that the queue depth on the HBA
is too large for the LUNs being accessed. A CLARiiON/VNX front end port has a queue depth of 1600,
which is the maximum number of simultaneous IOs that port can process. Each LUN on the array has
a maximum queue depth that is calculated using a formula based on the number of data disks in the
RAID group. For example, a port with 512 queues and a typical LUN queue depth of 32 can support up
to 512 / 32 = 16 LUNs on 1 initiator (HBA), or 16 initiators (HBAs) with 1 LUN each, or any combination
not to exceed this number (the arithmetic is shown in the sketch after this list). Configurations that exceed
this number are in danger of returning QFULL conditions. A QFULL condition signals that the target/storage
port is unable to process more IO requests, so the initiator will need to throttle IO to the storage port.
As a result, application response times will increase and IO activity will decrease.
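To make that queue-depth math concrete, here's a quick sketch of the calculation in Python. The 512 and 32 are just the example numbers from above; substitute the actual port queue depth and HBA LUN queue depth for your environment.

```python
# Rough sanity check for front-end port queue fan-out (example numbers from the text above).
PORT_QUEUE_DEPTH = 512   # outstanding IOs the front-end port can accept in this example
LUN_QUEUE_DEPTH = 32     # typical per-LUN queue depth configured on the host HBA

def max_lun_initiator_pairs(port_qd: int, lun_qd: int) -> int:
    """Maximum (initiator, LUN) combinations a port can serve before risking QFULL."""
    return port_qd // lun_qd

def port_is_oversubscribed(initiators: int, luns_per_initiator: int,
                           port_qd: int = PORT_QUEUE_DEPTH,
                           lun_qd: int = LUN_QUEUE_DEPTH) -> bool:
    """True if the worst-case outstanding IO count exceeds the port queue depth."""
    worst_case_outstanding = initiators * luns_per_initiator * lun_qd
    return worst_case_outstanding > port_qd

print(max_lun_initiator_pairs(PORT_QUEUE_DEPTH, LUN_QUEUE_DEPTH))   # 16
print(port_is_oversubscribed(initiators=4, luns_per_initiator=5))   # True: 4*5*32 = 640 > 512
```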

The next thing I do is look at the specific LUNs that the application owner is asking about. The list below
includes the basic performance metrics that I most often look at when investigating a performance
problem.

Utilization (%) represents the fraction of an observation period during which a LUN has any
outstanding requests. When the LUN becomes the bottleneck, the utilization will be at or close to 100%.
However, since I/Os can get serviced by multiple disks, an increase in workload might still result in a
higher throughput. Utilization by itself is not a very good indicator of the overall performance of the
LUN; it needs to be factored in with several other things. For example, if you are writing to a LUN
(100% writes) and the data lives in a small physical space on the LUN, it may be possible to
get to 100% with write cache re-hits. This means that all writes are being serviced by the write cache,
and since you are writing data to the same locations over and over, none of the data is flushed to
the disks. This can put LUN utilization at 100% while there is actually no IO to the disks.
Utilization is very affected by caching, both read and write. The LUN can be very busy but may not
have a problem. Use utilization to assist in identifying busy LUNs, then look at queuing and response
times to see if there really is an issue.
Queue Length is the average number of requests within a polling interval that are outstanding to this
LUN. A queue length of zero indicates an idle LUN. If three requests arrive at an idle LUN at the same
time, only one of them can be served immediately; the other two must wait in the queue. That scenario
would result in a queue length of 3. My general guideline for bad performance on a LUN is a queue
length greater than 2 for a single disk drive.
Average Busy Queue Length is the average number of outstanding requests when the LUN was busy.
This does not include any idle time. This value should not exceed 2 times the number of spindles on a
LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. Since this queue length is
counted only when the LUN is not idle, the value indicates the frequency variation (burst frequency) of
incoming requests. The higher the value, the bigger the burst and the longer the average response time
at this component. In contrast to this metric, the average queue length also includes idle periods
when no requests are pending. If you have just one outstanding request 50% of the time, and the other
50% the LUN is idle, the average busy queue length will be 1. The average queue length, however, will
be 0.5.
Response Time (ms) is the average time, in milliseconds, that a request to this LUN is outstanding,
including its waiting time. The higher the queue length for a LUN, the more requests are waiting in its
queue, thus increasing the average response time of a single request. For a given workload, queue
length and response time are directly proportional. Keep in mind that cache re-hits bring down the
average response time (and service times), whether they are reads or writes. LUN response time is a
good starting point for troubleshooting. It gives a good indication of what the host system is
experiencing. Usually if your LUN response time (response time = queue length * service time) is good,
then the host performance is good. High response times don't always mean that the CLARiiON is busy;
they can also indicate that you're having issues with your host or fabric. We use the Brocade health
report on a regular basis to identify hosts that have an excessive amount of traffic, as well as running
the EMC HEAT report on hosts that have reported issues (which can identify incorrect HBA drivers, a
bad HBA, etc.). These are my general guidelines for response time:
Less than 10 ms: very good
Between 10 and 20 ms: okay
Between 20 and 50 ms: slow, needs attention
Greater than 50 ms: I/O bottleneck
Service Time (ms) represents the time, in milliseconds, a request spent being serviced by a component.
It does not include time waiting in a queue. Service time is mainly a characteristic of the system
component. However, larger I/Os take longer and therefore usually result in lower throughput (IO/s)
but better bandwidth (Mbytes/s). Put simply, service time is the time it takes to actually send the
I/O request to the storage and get an answer back. In general, I like to see service times below 20ms.
Total Throughput (IO/sec) is the average number of host requests that are passed through the LUN per
second. This includes both read and write requests. Smaller requests usually result in a higher total
throughput than larger requests. Examining total throughput (along with % utilization) is a good way
to identify the busiest LUNs on the array. In general, here are the IOPs limits by drive type (a quick way
to use these figures is sketched after the table):

RPM      Drive Type     IOPs
7,200    SATA, NL-SAS   ~80
10,000   SATA, NL-SAS   ~130
10,000   FC, SAS        ~140
15,000   FC, SAS        ~180
N/A      EFD            ~1500 (Read/Write, 60/40)
N/A      EFD            ~6000 (Read)
N/A      EFD            ~3000 (Write)
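To tie these numbers together, here's a small sketch of the kind of sanity check I do on a LUN's polled stats. It uses the response time = queue length * service time rule of thumb, my response-time bands, and the rough per-drive IOPS figures from the table; the input values below are made up for illustration, and real numbers would come from an Analyzer or Performance Manager export.

```python
# Rough per-drive IOPS figures from the table above (rounded).
DRIVE_IOPS = {"NL-SAS 7.2k": 80, "SAS 10k": 140, "SAS 15k": 180, "EFD": 1500}

def estimated_response_time_ms(avg_queue_length: float, service_time_ms: float) -> float:
    """Response time = queue length * service time, per the rule of thumb above."""
    return avg_queue_length * service_time_ms

def response_time_band(rt_ms: float) -> str:
    """Classify response time using the guidelines listed above."""
    if rt_ms < 10:
        return "very good"
    if rt_ms < 20:
        return "okay"
    if rt_ms < 50:
        return "slow, needs attention"
    return "I/O bottleneck"

def lun_iops_ceiling(spindles: int, drive_type: str) -> int:
    """Very rough back-end IOPS ceiling: spindle count * per-drive IOPS (ignores RAID overhead and cache)."""
    return spindles * DRIVE_IOPS[drive_type]

# Example: a LUN on 10 x 15k SAS spindles with a queue length of 4 and a 6 ms service time.
rt = estimated_response_time_ms(avg_queue_length=4, service_time_ms=6)
print(rt, "->", response_time_band(rt))                      # 24 ms -> slow, needs attention
print(lun_iops_ceiling(spindles=10, drive_type="SAS 15k"))   # ~1800 IOPS before the disks saturate
```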

Write Throughput (IO/sec) The average number of host write requests that are passed through the LUN
per second. Smaller requests usually result in a higher write throughput than larger requests. When
troubleshooting specific LUNs, check the write IO size and see if the size is what you would expect for
the application you are investigating. Extremely large IO sizes coupled with high IOPS may cause write
cache contention.

Read Throughput (IO/sec) The average number of host read requests that are passed through the LUN
per second. Smaller requests usually result in a higher read throughput than larger requests.
Total Bandwidth (MB/s) The average amount of host data in Mbytes that is passed through the LUN
per second. This includes both read and write requests. Larger requests usually result in a higher total
bandwidth than smaller requests.
Read Bandwidth (MB/s) The average amount of host read data in Mbytes that is passed through the
LUN per second. Larger requests usually result in a higher bandwidth than smaller requests.
Write Bandwidth (MB/s) The average amount of host write data in Mbytes that is passed through the
LUN per second. Larger requests usually result in a higher bandwidth than smaller requests. Keep in
mind that writes consume many more array resources than reads.
Read Size (KB) The average read request size in Kbytes seen by the LUN. This number indicates
whether the overall read workload is oriented more toward throughput (I/Os per second) or bandwidth
(Mbytes/second). For a finer distinction of I/O sizes, use an IO Size Distribution chart for this LUN.
Write Size (KB) The average write request size in Kbytes seen by the LUN. This number indicates
whether the overall write workload is oriented more toward throughput (I/Os per second) or
bandwidth (Mbytes/second). For a finer distinction of I/O sizes, use an IO Size Distribution chart for the
LUNs. Throughput, bandwidth, and IO size are directly related, as shown in the sketch below.
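Since bandwidth is just throughput multiplied by average IO size, it's easy to cross-check the three metrics against each other. Here's a minimal sketch (the numbers are made up for illustration):

```python
def bandwidth_mb_s(iops: float, io_size_kb: float) -> float:
    """Bandwidth (MB/s) = throughput (IO/s) * average IO size (KB), converted to MB."""
    return iops * io_size_kb / 1024

def avg_io_size_kb(bandwidth: float, iops: float) -> float:
    """Back out the average IO size from the reported bandwidth and throughput."""
    return bandwidth * 1024 / iops

# A LUN reporting 2,000 write IOPS at an average 8 KB write size moves ~15.6 MB/s;
# the same 2,000 IOPS at 256 KB would be ~500 MB/s, which is large-block IO and a
# very different load on the write cache and back end.
print(bandwidth_mb_s(2000, 8))      # ~15.6
print(bandwidth_mb_s(2000, 256))    # ~500
print(avg_io_size_kb(120, 1500))    # ~82 KB average IO size
```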

Below is an explanation of additional performance metrics that I don't use as frequently, but I'm including
them for completeness.

Forced Flushes/s Number of times per second the cache had to flush pages to disk to free up space for
incoming write requests. Forced flushes are a measure of how often write requests will have to wait for
disk I/O rather than be satisfied by an empty slot in the write cache. In most well performing systems
this should be zero most of the time.
Full Stripe Writes/s Average number of write requests per second that spanned a whole stripe (all
disks in a LUN). This metric is applicable only to LUNs that are part of a RAID5 or RAID3 group.
Used Prefetches (%) The percentage of prefetched data in the read cache that was read during the last
polling interval.
Disk Crossing (%) Percentage of host requests that require I/O to at least two disks compared to the
total number of host requests. A single disk crossing can involve more than two disk drives.
Disk Crossings/s Number of times per second that a request requires access to at least two disk drives.
A single disk crossing can involve more than two disks.
Read Cache Hits/s Average number of read requests per second that were satisfied by either read or
write cache without requiring any disk access. A read cache hit occurs when recently accessed data is
re-referenced while it is still in the cache.
Read Cache Misses/s Average number of read requests per second that did require one or more disk
accesses.
Reads From Write Cache/s Average number of read requests per second that were satisfied by write
cache only. Reads from write cache occur when recently written data is read again while it is still in the
write cache. This is a subset of read cache hits, which include requests satisfied by either the write or
the read cache.
Reads From Read Cache/s Average number of read requests per second that were satisfied by the read
cache only. Reads from read cache occur when data that has been recently read or prefetched is re-read
while it is still in the read cache. This is a subset of read cache hits, which include requests satisfied by
either the write or the read cache.
Read Cache Hit Ratio The fraction of read requests served from both read and write caches vs. the total
number of read requests. A higher ratio indicates better read performance (computing these ratios from
the counters is sketched after this list).
Write Cache Hits/s Average number of write requests per second that were satisfied by the write cache
without requiring any disk access. Write requests that are not write cache hits are referred to as write
cache misses.
Write Cache Misses/s Average number of write requests per second that did require one or multiple
disk accesses. Write requests that cause forced flushes or that bypass the write cache due to their size
are examples of write cache misses.
Write Cache Rehits/s Average number of write requests per second that were satisfied by the write
cache because they had been referenced before and not yet flushed to the disks. Write cache rehits occur
when recently accessed data is referenced again while it is still in the write cache. This is a subset of
Write Cache Hits.
Write Cache Hit Ratio The ratio of write requests that the write cache satisfied without requiring any
disk access vs. the total number of write requests to this LUN. A higher ratio indicates better write
performance.
Write Cache Rehit Ratio The ratio of write requests that the write cache satisfied because they had been
referenced before and not yet flushed to the disks vs. the total number of write requests to this LUN.
This is a measure of how often the write cache succeeded in eliminating a write operation to disk.
While improving the rehit ratio is useful, it is more beneficial to reduce the number of forced flushes.
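Because the hit and miss counters are reported as rates, the corresponding ratios are easy to derive yourself from an export. Here's a minimal sketch (the counter values are hypothetical):

```python
def hit_ratio(hits_per_s: float, misses_per_s: float) -> float:
    """Cache hit ratio = hits / (hits + misses); returns 0.0 when there was no activity."""
    total = hits_per_s + misses_per_s
    return hits_per_s / total if total else 0.0

# Hypothetical per-interval rates exported from Analyzer / Performance Manager.
read_cache_hits_s, read_cache_misses_s = 850.0, 150.0
write_cache_hits_s, write_cache_misses_s = 480.0, 20.0
forced_flushes_s = 0.0

print(f"read cache hit ratio:  {hit_ratio(read_cache_hits_s, read_cache_misses_s):.2f}")    # 0.85
print(f"write cache hit ratio: {hit_ratio(write_cache_hits_s, write_cache_misses_s):.2f}")  # 0.96
# Per the note above, a non-zero forced flush rate matters more than a mediocre rehit ratio.
if forced_flushes_s > 0:
    print("warning: forced flushes occurring - write cache may be overstressed")
```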



4 thoughts on "SAN Performance Metrics"


1. Jason DeFord says:


OCTOBER 8, 2012 AT 12:44 PM
The last EMC VNX Best Practices for Performance and Availability was for Block O.E. 31.5. (You're
linked to an obsolete version.) However, it looks like EMC has stopped publishing the very informative
best practices for the Block O.E. 32.0 version.

1. emcsan says:
OCTOBER 9, 2012 AT 9:44 AM
Thanks, I'll update the link. I found a copy of the 31.5 version on Scribd here:
http://www.scribd.com/doc/91233385/h8268-VNX-Block-Best-Practices.

2. Pingback: Gathering performance data on a virtual windows server | The SAN Guy
3. wade says:
SEPTEMBER 5, 2015 AT 8:56 AM
Recently deployed a VNX 5600 and began to migrate workload from an older VNX1 array. Reviewing
the performance metrics for the array using VNX Monitoring and Reporting, all signs indicate the array
is operating well within its capabilities: storage pools well below the total IOPS they were designed for,
and SPs operating well below 50% utilization. However, we are seeing very high Processor Dirty Pages
Utilization warnings.

Example: Alert on processor Dirty Pages Utilization (SPB has exceeded threshold, current value is
4000.3%). We see the alerts all day, however this does not seem to be translating into any performance
issues with the workload running on the VNX. The percentages seem unreal, and if they were true we
would expect performance to suffer.
