Vertical Performance Tuning-MOSC2016

Malaysia Open Source Conference 2016
25 - 27 . MAY . UKM . BANGI .
VERTICAL SCALING &
PERFORMANCE TUNING
Adzmely Mansor
ABOUT ME
CONSULTANT/FOUNDER OF
NEXOPRIMA SDN. BHD.
Adzmely Mansor
AGENDA
1. INTRO TO SCALING
2. THE BASICS
CPU
MEMORY
NETWORK
3. EXPERIENCES SHARING
INTRO TO
SCALING
INTRO TO SCALING
SCALABILITY?
WIKIPEDIA
handle a growing amount of work in a capable manner
accommodate maximum growth
PERFORMANCE TUNING?
WIKIPEDIA
tune a system to handle a higher load
INTRO TO SCALING
VERTICALLY?
WIKIPEDIA
add more resources to a single node
PERFORMANCE TUNING - VERTICALLY?

???????
INTRO TO SCALING
fully optimise all available resources for the maximum possible load
Performance Tuning - Vertically
INTRO TO SCALING
HORIZONTALLY?
when vertical scaling / performance tuning is already maximised
why?
INTRO TO SCALING
How do you determine that your servers actually require more
CPU / processing power, RAM, etc. (vertical scaling)?
INTRO TO SCALING
You Need to Know & Understand The Basics
how the various components work
priorities among processes
how IO interrupts are handled
how memory management works
how the network layer is implemented
the meaning of the information given
the basic tools
INTRO TO SCALING
In general - four subsystems need to be monitored:
CPU
Memory
IO
Network
THE
BASICS
CPU - four metrics to be monitored:
- run queue
- context switch
- cpu utilisation
- load average
the basics
THE BASICS
CPU
CPU utilisation depends on the resources being accessed
the Linux kernel has a scheduler, and the scheduler gives priorities to different resources
it schedules two kinds of resources:
interrupts
threads
THE BASICS
CPU - INTERRUPTS REQUEST (IRQ) HANDLING

an IRQ is a signal requesting immediate attention, sent from hardware to the processor
each device is assigned one or more IRQ numbers, allowing it to send unique interrupts
a processor that receives an interrupt request immediately pauses execution of the current application thread in order to address the request
THE BASICS
CPU - SCHEDULER
the smallest unit of process execution is called a thread
the system scheduler determines:
which processor runs a thread
and for how long the thread runs
however, the scheduler has priorities
THE BASICS
CPU - SCHEDULER PRIORITIES

Scheduler Priorities:
Hardware interrupts (highest priority)
raised by hardware on the system to have data processed
eg:
by a disk when it has completed an IO transaction
by a NIC when a packet has been received
THE BASICS
CPU - SCHEDULER PRIORITIES

Scheduler Priorities:
Soft interrupts (softirq) - related to maintenance of the kernel itself
Real-time threads - parallel processing / real-time programming
Kernel threads - all kernel processing
User threads - a.k.a. user space
all applications run in user space, the lowest priority of all
THE BASICS
CPU - CORES
Linux views each core of an n-way Hyper-Threaded processor as an INDEPENDENT processor
eg: dual-core processor = two independent processors
THE BASICS
CPU - CONTEXT SWITCHES

each thread is allotted a time quantum to spend on the processor
once its allotted time has passed, or it is pre-empted by something of higher priority, the thread is placed back in the queue
the higher-priority / next-in-queue thread is then placed on the processor
this switching of threads = Context Switch
THE BASICS
CPU - THE RUN QUEUE

each CPU maintains a RUN QUEUE of threads
process threads are either:
runnable (in the run queue)
in a sleep state (blocked and waiting for IO - not in the run queue)
when the CPU is heavily utilised:
the run queue gets longer
the longer the run queue, the longer it takes for process threads to execute
THE BASICS
CPU - THE RUN QUEUE

Load
describes the state of the run queue
System Load
equals the number of process threads currently executing + the number of threads in the CPU run queue
the top command reports load averages over the course of 1, 5 and 15 minutes
THE BASICS
CPU - UTILISATION
defined as the percentage of usage of a CPU
CPU utilisation mostly falls under the following categories:
User Time: percentage of time the CPU spends executing threads in user space
System Time: percentage of time the CPU spends executing kernel threads and interrupts
Wait IO: percentage of time the CPU spends idle because all process threads are blocked waiting for IO requests to complete
Idle: percentage of time the processor spends in a completely idle state
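As a rough illustration of these categories, the sketch below derives the percentages from two samples of the aggregate "cpu" line in /proc/stat. The function name and the sample numbers are assumptions made up for this example, not measurements; irq and softirq time is folded into System Time and nice time into User Time.

```python
def cpu_percentages(before: str, after: str) -> dict:
    """Percentages of CPU time per category between two 'cpu' lines
    sampled from /proc/stat at different moments."""
    def fields(line):
        # Drop the leading "cpu" label; the rest are cumulative tick counters.
        return [int(v) for v in line.split()[1:]]

    delta = [b - a for a, b in zip(fields(before), fields(after))]
    # Column order: user nice system idle iowait irq softirq steal ...
    total = sum(delta[:8]) or 1
    user, nice, system, idle, iowait, irq, softirq = delta[:7]
    return {
        "user": 100.0 * (user + nice) / total,
        "system": 100.0 * (system + irq + softirq) / total,
        "iowait": 100.0 * iowait / total,
        "idle": 100.0 * idle / total,
    }

# Hypothetical samples taken a few seconds apart.
sample_1 = "cpu  1000 0 500 8000 500 0 0 0 0 0"
sample_2 = "cpu  1650 0 800 8050 500 0 0 0 0 0"
print(cpu_percentages(sample_1, sample_2))
```

On a live Linux host the two lines could be read from the first line of /proc/stat with a short sleep in between.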
THE BASICS
CPU - TIME SLICING

a numeric value that represents how long a task can run until it is pre-empted
the scheduler policy dictates the default time slice
time slice too long = poor interactive performance
time slice too short = a significant amount of processor time is wasted on the overhead of switching from one process to another (context switching)
CPU - Performance Monitoring - a matter of interpreting the performance of:
- run queue
- utilisation
- context switching
the basics
THE BASICS
CPU - PERFORMANCE MONITORING

General Expectations:
Run Queues: a run queue should have no more than 1 - 3 threads queued per CPU
eg: a dual-processor system should ideally have no more than 2 threads in the queue, or 6 at most
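The rule of thumb above can be written as a small check. This is only a sketch: the function name and the three labels are made up for illustration, and the runnable thread count is assumed to come from something like the r column of vmstat.

```python
def run_queue_status(runnable_threads: int, cpus: int) -> str:
    """Classify run queue length using the 1 - 3 threads per CPU guideline."""
    per_cpu = runnable_threads / cpus
    if per_cpu <= 1:
        return "ideal"
    if per_cpu <= 3:
        return "acceptable"
    return "overloaded"

# Dual-processor examples.
print(run_queue_status(2, 2))  # ideal
print(run_queue_status(6, 2))  # acceptable
print(run_queue_status(7, 2))  # overloaded
```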
THE BASICS

CPU Utilisation: if a CPU is fully utilised, ideally the
following balance of utilisation should be achieved:
65% - 70%: User Time
30% - 35%: System Time
0% - 5%: Idle Time
THE BASICS

Context Switches:
a high number of context switches is acceptable if CPU utilisation stays within the previously mentioned balance
grep ctxt /proc/$pid/status
/usr/bin/time -v ls 2>&1 | grep context
vmstat (cs column)
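The per-process counters can also be pulled out programmatically. The sketch below parses text in the format of /proc/$pid/status; the function name and the sample values are invented for illustration.

```python
def context_switches(status_text: str) -> dict:
    """Extract the voluntary / nonvoluntary context switch counters
    from the contents of a /proc/<pid>/status file."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.endswith("ctxt_switches"):
            counts[key] = int(value.strip())
    return counts

# Hypothetical excerpt of a status file.
sample = (
    "Name:\tbash\n"
    "voluntary_ctxt_switches:\t150\n"
    "nonvoluntary_ctxt_switches:\t12\n"
)
print(context_switches(sample))

# On a Linux host the same function can read the live file, e.g.:
# with open("/proc/self/status") as f:
#     print(context_switches(f.read()))
```

A nonvoluntary count that grows quickly suggests the thread is being pre-empted often rather than blocking on its own.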
THE BASICS
CPU - PERFORMANCE MONITORING TOOLS

must be a low-overhead tool
still practical to keep running on a heavily loaded system
able to show the health of the system at a glance
THE BASICS
top
THE BASICS
vmstat
THE BASICS
mpstat
Memory
- Physical Memory
- Virtual Memory
the basics
THE BASICS
VIRTUAL MEMORY
Virtual Memory = swap space on disk + RAM / physical memory
virtual memory is divided into pages
on the x86 architecture a VM page = 4 KB
memory is written to disk in pages
THE BASICS
VIRTUAL MEMORY
when an application starts, it requests a Virtual Memory Size (VSZ)
the kernel either grants or denies the VSZ request
as the application uses the requested memory, that memory is mapped into physical memory
RSS (Resident Set Size) is the amount of physical memory a task is using
in most cases an application uses less memory (RSS) than it requested (VSZ)
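To see the gap between requested and resident memory for a process, the VmSize and VmRSS fields of /proc/$pid/status can be compared. A minimal sketch, with a made-up function name and sample values:

```python
def memory_sizes_kb(status_text: str) -> dict:
    """Return VmSize (VSZ) and VmRSS (RSS) in kB from /proc/<pid>/status text."""
    sizes = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            sizes[key] = int(value.split()[0])  # value looks like "  250000 kB"
    return sizes

# Hypothetical excerpt of a status file.
sample = "VmSize:\t  250000 kB\nVmRSS:\t   42000 kB\n"
sizes = memory_sizes_kb(sample)
share = 100.0 * sizes["VmRSS"] / sizes["VmSize"]
print(sizes, "resident share: %.1f%%" % share)
```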
THE BASICS
VIRTUAL MEMORY - PAGES

virtual memory is divided into pages
on the x86 architecture a VM page = 4 KB
memory is written to disk in pages
when pages in memory are modified by a running process, they become dirty
when dirty pages reach a defined percentage (vm.dirty_ratio), they are written to disk
THE BASICS
VIRTUAL MEMORY - DIRTY PAGES

vm.dirty_ratio (kernel param)
defines the maximum percentage of memory that can be filled with dirty pages before a process that is writing must flush them to disk itself
while this flush is in progress, new writes block
higher value = dirty pages remain longer in memory
better performance but higher risk
lower value = dirty pages get flushed to disk more often
slower performance but less risk
default value = 20
THE BASICS

vm.dirty_background_ratio (kernel param)
defines the maximum percentage of memory that can be filled with dirty pages before they get flushed to disk in the background by the kernel flusher threads
pdflush/flush/kdmflush
default value = 10
eg: with 64G of RAM, only 6.4G of dirty data can sit in RAM before the kernel flusher threads kick in
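The arithmetic behind that example can be sketched as follows. The function name is made up, and the calculation is a simplification: the kernel applies the ratios to available (free plus reclaimable) memory rather than to total RAM, so real thresholds are somewhat lower.

```python
def dirty_thresholds_gb(ram_gb, background_ratio=10, ratio=20):
    """Approximate dirty-page thresholds in GB for the given ratios.

    Returns (background flush threshold, blocking flush threshold)."""
    background = ram_gb * background_ratio / 100
    blocking = ram_gb * ratio / 100
    return background, blocking

print(dirty_thresholds_gb(64))  # (6.4, 12.8)
```

With the default values, background flushing starts at about 6.4G of dirty data and writers start blocking at about 12.8G.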
THE BASICS

vm.dirty_bytes & vm.dirty_background_bytes
same as dirty_ratio and dirty_background_ratio, but in bytes
setting a ratio parameter resets the corresponding bytes parameter to 0, and vice versa
THE BASICS

vm.dirty_expire_centisecs
how long something can stay cached before it needs to be written
default value = 3000 centisecs = 30 seconds
dirty pages older than this value are written asynchronously to disk
a safeguard against data loss
THE BASICS

vm.dirty_writeback_centisecs
how often the pdflush/flush/kdmflush threads wake up and check whether work needs to be done
default value = 500 centisecs = 5 seconds
THE BASICS
VIRTUAL MEMORY - SWAPPINESS

vm.swappiness
how aggressively Linux swaps pages in memory out to disk
default value = 60
a common rule of thumb: at 60, swapping starts to be considered once about 40% of memory is in use
a lower value moves that point closer to full memory and discourages Linux from swapping
60 is the value recommended for most desktop use
10 is recommended to improve performance in general
THE BASICS
VIRTUAL MEMORY - CACHES

/proc/sys/vm/drop_caches
0 - do nothing
1 - free the page cache
2 - free reclaimable slab objects
slab allocation - retains allocated memory that held a data object of a certain type, so it can be reused by a later allocation of the same type (deallocation/allocation)
3 - free both types
this is a non-destructive operation; it will not free any dirty objects
THE BASICS
MEMORY - PERFORMANCE MONITORING
vmstat
Network
- NIC Ring Buffer
- Hard IRQ
- Soft IRQ
- Networking Tools
the basics
THE BASICS
NETWORK - NIC RING BUFFER
[Diagram: App 1 ... App N and Packet to Forward -> IP Stack -> SKBs -> Queuing Disciplines -> SKBs -> Driver Queue (a.k.a. Ring Buffer) -> NIC]
THE BASICS

Ring Buffer
implemented as a First In, First Out (FIFO) ring buffer
does not contain the packet data
holds descriptors that point to other data structures called Socket Kernel Buffers (SKB)
the input source for the ring buffer is the IP stack
packets are dequeued by the hardware driver and sent to the NIC via the data bus
THE BASICS

ethtool - command to show / change the values of the ring buffer
-g : display
-G : change
with the introduction of Byte Queue Limits (BQL), a self-tuning algorithm, there is no longer any need to modify the driver queue size
THE BASICS
NETWORK - QUEUING DISCIPLINES (QDISC)

QDISC
Linux abstraction layer for traffic queues
carry out complex queue management behaviours
traffic classification
prioritisation
rate shaping
configured through traffic control - tc command
THE BASICS
NETWORK - HARD IRQ

Hardware IRQ
also known as top-half interrupts
when the NIC receives incoming data:
it copies the data into the RX ring buffers
the NIC notifies the kernel of this incoming data by raising a hardware interrupt
cat /proc/interrupts | grep em1
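To check how a NIC's interrupts are spread across CPUs, the same file can be parsed per IRQ. The sketch below assumes the usual layout (a CPU header row, then one row per IRQ with the device label in the last column); the function name and the sample rows are invented for illustration.

```python
def irq_counts(interrupts_text: str, device: str) -> dict:
    """Map IRQ number -> per-CPU interrupt counts for rows whose
    last column mentions the given device name."""
    lines = interrupts_text.splitlines()
    cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        parts = line.split()
        if parts and device in parts[-1]:
            irq = parts[0].rstrip(":")
            result[irq] = [int(n) for n in parts[1:1 + cpus]]
    return result

# Hypothetical /proc/interrupts excerpt for a two-CPU host.
sample = (
    "           CPU0       CPU1\n"
    " 24:       1200        300   PCI-MSI-edge      em1-TxRx-0\n"
    " 25:          0       4500   PCI-MSI-edge      em1-TxRx-1\n"
    " 30:         10         20   PCI-MSI-edge      ahci\n"
)
print(irq_counts(sample, "em1"))
```

Counts that pile up on a single CPU hint that one core is handling most of the NIC's hard IRQs.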
THE BASICS
NETWORK - SOFT IRQ

Software IRQ
also known as bottom-half interrupts
their purpose is to drain the NIC receive ring buffers
can be seen in process monitoring tools such as ps and top
ksoftirqd/cpu-num
THE BASICS
NETWORK - MONITORING TOOLS

ss/netstat - dump / show socket statistics
dropwatch - monitors packets freed from memory by the kernel
ip - for managing and monitoring IP routes, devices, policy routing and tunnels
ethtool - for displaying and changing NIC settings
THE BASICS
NETWORK - SOME TUNING PARAMETERS

SoftIRQ Misses
net.core.netdev_budget
if the softirq does not run long enough, the rate of incoming data can exceed the kernel's ability to drain the buffer fast enough
default value = 300
this allows the softirq to process / drain only 300 messages from the NIC before being booted off the CPU
THE BASICS

SoftIRQ Misses
when to increase the net.core.netdev_budget value?
monitor the third column of /proc/net/softnet_stat
if the third column keeps increasing - double the value
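A sketch of that check: /proc/net/softnet_stat has one row per CPU with hexadecimal columns, so the third column has to be converted before two samples are compared. The function names and sample rows here are assumptions for illustration.

```python
def third_column_per_cpu(softnet_text: str) -> list:
    """Return the third (hexadecimal) column of each CPU row as integers."""
    return [
        int(line.split()[2], 16)
        for line in softnet_text.splitlines()
        if line.strip()
    ]

def budget_exhausted(before: str, after: str) -> bool:
    """True if the third column grew on any CPU between two samples."""
    pairs = zip(third_column_per_cpu(before), third_column_per_cpu(after))
    return any(new > old for old, new in pairs)

# Hypothetical samples taken some time apart (two CPUs).
sample_1 = "0000a1b2 00000000 00000005 00000000\n00009c40 00000000 0000000a 00000000\n"
sample_2 = "0000b2c3 00000000 00000009 00000000\n0000a028 00000000 0000000a 00000000\n"
print(third_column_per_cpu(sample_1))        # [5, 10]
print(budget_exhausted(sample_1, sample_2))  # True
```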
THE BASICS

increase the maximum number of open file descriptors - default 1024
/etc/security/limits.conf
* soft nofile 100000
* hard nofile 100000
ulimit -n : view the current file descriptor limit
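From inside a process, the limits that actually apply can be read with the standard library's resource module (Unix only). This is a small sketch; the helper name is made up.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) open file descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
```

The soft value is what ulimit -n reports for the shell that started the process; a service started before limits.conf was changed keeps its old limits until it is restarted.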
THE BASICS

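reduce the time sockets linger after close by lowering tcp_fin_timeout
default 60
note: tcp_fin_timeout governs the FIN_WAIT_2 state; the TIME_WAIT duration itself is fixed in the kernel source
setting it too low can cause socket close errors on networks with lots of jitter
set tcp_tw_reuse = 1 : tells the kernel it can reuse sockets in the TIME_WAIT state for new outgoing connections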
THE BASICS
NETWORK - TCP TIMEWAIT HACK
by EA Faisal - NexoPrima
patch the kernel to introduce a tcp_timewait_len kernel parameter
new entry in the /proc FS:
/proc/sys/net/ipv4/tcp_timewait_len
configurable through sysctl:
net.ipv4.tcp_timewait_len
https://github.com/efaisal/linuxtcptw
THE BASICS

increase the port range for ephemeral outgoing ports
cat /proc/sys/net/ipv4/ip_local_port_range
default minimum port 32768, change to 10000
default maximum port 61000, change to 65000
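A quick calculation shows what the wider range buys. The helper names are made up, and the rate estimate rests on a simplifying assumption: every port stays unavailable for 60 seconds after use, and all connections go to a single destination address and port.

```python
def ephemeral_ports(low: int, high: int) -> int:
    """Number of ports in an inclusive ip_local_port_range."""
    return high - low + 1

def max_new_connections_per_second(low: int, high: int, hold_seconds: int = 60) -> int:
    """Rough ceiling on new outgoing connections per second to one destination,
    assuming each port is held for hold_seconds after use."""
    return ephemeral_ports(low, high) // hold_seconds

print(ephemeral_ports(32768, 61000))                  # 28233
print(ephemeral_ports(10000, 65000))                  # 55001
print(max_new_connections_per_second(32768, 61000))   # 470
print(max_new_connections_per_second(10000, 65000))   # 916
```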
THE BASICS
NETWORK - TCP CONGESTION WINDOW

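throughput of a connection is limited by two windows:
congestion window (CWIN) - the amount of unacknowledged data, counted in maximum segment sizes (MSS), allowed in flight on that connection
maintained by the sender
receive window (RWIN) - the maximum amount of data the receiver accepts before the sender must wait for an acknowledgement
maintained by the receiver
MSS = MTU - (IP header + TCP header) ~= 1460 for a 1500-byte MTU
[Diagram: Browser/Client and Server/Web exchange SYN; SYN, ACK; ACK; GET; the server then sends 3 data packets, which the client ACKs. Labels: RWIN 65k, CWIN = 3]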
THE BASICS

TCP uses a mechanism called slow start to grow the congestion window after a connection is initialised and after a timeout
it starts with a window of a few MSS (historically two), depending on the initial CWIN value
starting from kernel 2.6.39 the default initial CWIN value was increased to 10
for every packet acknowledged, the congestion window increases by 1 MSS
when it exceeds the ssthresh threshold, the connection enters congestion avoidance mode
the ssthresh value is updated automatically at the end of every slow start
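The effect of the initial window can be sketched with an idealised model: growing the window by 1 MSS per acknowledged packet doubles it every round trip, so the number of round trips needed for a response depends strongly on where it starts. The function below is an illustration only; its name is made up, and it assumes no packet loss, no receive window limit and an MSS of 1460 bytes.

```python
def round_trips(payload_bytes: int, initcwnd: int, mss: int = 1460) -> int:
    """Round trips needed to deliver payload_bytes under idealised slow start
    (window doubles every round trip, no loss, no receiver limit)."""
    segments_left = -(-payload_bytes // mss)  # ceiling division
    cwnd, trips = initcwnd, 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window
        cwnd *= 2              # every ACKed segment added 1 MSS to the window
        trips += 1
    return trips

# A 64 KB response with an initial window of 3 versus 10 segments.
print(round_trips(64 * 1024, 3))   # 4
print(round_trips(64 * 1024, 10))  # 3
```

On a high-latency link each saved round trip is a full RTT removed from the response time.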
THE BASICS

a server does not necessarily adhere to the client's RWIN (receiver advertised window size)
if the CWIN size is a lot smaller than the receiver's RWIN, the initial transfer might not be optimal
[Diagram: Browser/Client and Server/Web exchange SYN; SYN, ACK; ACK; GET; the server then sends 5 data packets, which the client ACKs. Labels: RWIN 65k, CWIN = 5]
THE BASICS

net.ipv4.tcp_slow_start_after_idle
controls whether the congestion window is reset to its default size for existing TCP connections that have been idle for too long
persistent connections are likely to end up in this state
default value = 1
recommended value for performance - change to 0
THE BASICS

changing CWIN and RWIN values in Linux
ip route show
ip route change default via 192.168.1.1 dev em1 proto static initcwnd 10
ip route change default via 192.168.1.1 dev em1 proto static initrwnd 10
THE BASICS

tuned : adaptive system tuning daemon
applies tuning settings that enable the most desirable performance
EXPERIENCES
SHARING
EXPERIENCES SHARING
LOAD STRESS TEST INFRA - ONLINE UNIV APPLICATION
EXPERIENCES SHARING
LOAD STRESS TEST - ONLINE UNIV APPLICATION

untuned
tuned
INTERNAL
EXPERIENCES SHARING
DB CONNECTION LATENCY
Q&A
THANK YOU
adzmely@nexoprima.com
http://blog.nexoprima.com
ANNEX
REFERENCES:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide
https://wiki.mikejung.biz/Ubuntu_Performance_Tuning
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vmdirty_ratio/
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
http://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum-performance/
https://www.wikipedia.org/ : for various related topics
others that I might have forgotten
