
Malaysia Open Source Conference 2016

25 - 27 . MAY . UKM . BANGI .

VERTICAL SCALING &

PERFORMANCE TUNING
Adzmely Mansor

ABOUT ME

CONSULTANT/FOUNDER OF
NEXOPRIMA SDN. BHD.
Adzmely Mansor

AGENDA

1. INTRO TO SCALING
2. THE BASICS
- CPU
- MEMORY
- NETWORK
3. EXPERIENCES SHARING

INTRO TO
SCALING

SCALABILITY?

WIKIPEDIA

the ability to handle a growing amount of work in a capable manner,
or to be enlarged to accommodate that growth

PERFORMANCE TUNING?

WIKIPEDIA

tune a system to handle a higher load

VERTICALLY?

WIKIPEDIA

add more resources to a single node

PERFORMANCE TUNING - VERTICALLY?

fully optimise all available resources on the node for the maximum possible load

HORIZONTALLY?
scale out only when vertical scaling / performance tuning is already maximised
why?

How do you determine that your servers actually require more
CPU / processing power, RAM, etc. (vertical scaling)?

You Need to Know & Understand The Basics

how the various components work
how processes are prioritised
how IO interrupts are handled
how memory management works
how the network layer is implemented
the meaning of the information the tools report
the basic tools

In general, four subsystems need to be monitored:
CPU
Memory
IO
Network

THE
BASICS

CPU - four metrics to be monitored:
- run queue
- context switch
- cpu utilisation
- load average

CPU
CPU utilisation depends on which resources are being accessed
the Linux kernel has a scheduler, and the scheduler gives different
priorities to different resources
it schedules two kinds of resources:
interrupts
threads

CPU - INTERRUPT REQUEST (IRQ) HANDLING

an IRQ is a signal sent from hardware to the processor requesting
immediate attention
each device is assigned one or more IRQ numbers,
allowing it to send unique interrupts
a processor that receives an interrupt request immediately
pauses execution of the current application thread
in order to address the request

CPU - SCHEDULER
the smallest unit of process execution is called a thread
the system scheduler determines:
which processor runs a thread
and for how long the thread runs
however, the scheduler has priorities

CPU - SCHEDULER PRIORITIES

Scheduler Priorities (highest to lowest):
Hardware interrupts (highest priority)
raised by hardware on the system to process data
eg:
by a disk when it has completed an IO transaction
by a NIC when a packet has been received
Soft interrupts (softirq) - related to maintenance of the kernel
itself
Real Time Threads - parallel processing / real time
programming
Kernel Threads - all kernel processing
User Threads - a.k.a. user space
all applications run in user space, at the lowest priority of all

CPU - CORES
Linux views each core of an n-way Hyper-Threaded
processor as an:
INDEPENDENT processor
eg: a dual core processor = two independent processors

CPU - CONTEXT SWITCHES

each thread is allotted a time quantum to spend on the
processor
once it has used up its allotted time, or is pre-empted by something
of higher priority, the thread:
is placed back in the queue
the higher priority / next-in-queue thread is then placed on the
processor:
this switching of threads = Context Switch

CPU - THE RUN QUEUE

each CPU maintains a RUN QUEUE of threads
process threads are either:
runnable (in the run queue)
in a sleep state (blocked and waiting for IO - not in the run queue)
when the CPU is heavily utilised:
the run queue gets longer
the longer it takes for process threads to execute

CPU - THE RUN QUEUE

Load
describes the state of the Run Queue
System Load
equal to the number of process threads currently executing
+ the number of threads in the CPU Run Queue
(Linux also counts threads in uninterruptible sleep, typically blocked on disk IO)
the top command reports load averages over the last
1, 5 and 15 minutes
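
A load average only means something relative to the number of CPUs. The sketch below normalises the three load averages per CPU. The function names and sample figures are illustrative assumptions, not part of the talk.

```python
import os


def load_per_cpu(load_averages, cpu_count):
    """Divide each load average (1, 5, 15 min) by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(round(value / cpu_count, 2) for value in load_averages)


def current_load_per_cpu():
    """Live figures on Unix-like systems, using only the standard library."""
    return load_per_cpu(os.getloadavg(), os.cpu_count() or 1)


# Made-up sample: load averages 4.0 / 2.0 / 1.0 on a dual-CPU machine.
print(load_per_cpu((4.0, 2.0, 1.0), 2))
```

A per-CPU value that stays well above 1 suggests threads are queuing rather than running.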

CPU - UTILISATION
defined as the percentage of time a CPU is in use
CPU utilisation mostly falls under the following categories:
User Time: percentage of time the CPU spends executing threads in
user space
System Time: percentage of time the CPU spends executing kernel
threads and interrupts
Wait IO: percentage of time the CPU spends idle because all process
threads are blocked waiting for IO requests to complete
Idle: percentage of time the processor spends completely idle
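
These categories can be derived from two samples of the aggregate "cpu" line in /proc/stat. This is a minimal sketch that assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal). The helper names and the two sample lines are made up for illustration.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def utilisation(before, after):
    """Percentage of time per category between two parsed samples."""
    delta = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")

    def pct(*names):
        return round(100.0 * sum(delta[n] for n in names) / total, 1)

    return {
        "user": pct("user", "nice"),
        "system": pct("system", "irq", "softirq"),
        "iowait": pct("iowait"),
        "idle": pct("idle"),
    }


# Made-up samples; on Linux, read the first line of /proc/stat twice instead.
BEFORE = parse_cpu_line("cpu  1000 0 500 8000 500 0 0 0 0 0")
AFTER = parse_cpu_line("cpu  1650 0 800 8040 510 0 0 0 0 0")
print(utilisation(BEFORE, AFTER))
```

Steal time is counted in the total but not reported as its own category here, to keep the sketch aligned with the four categories on the slide.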

CPU - TIME SLICING

a numeric value that represents how long a task can run before it is
pre-empted
the scheduler policy dictates the default time slice
time slice too long = poor interactive performance
time slice too short = a significant amount of processor
time is wasted on the overhead of switching
from one process to another (context
switching)

CPU - Performance Monitoring - a matter of
interpreting the performance of:

- run queue
- utilisation
- context switching

CPU - PERFORMANCE MONITORING

General Expectations:
Run Queues: a run queue should have no more than
1 - 3 threads queued per CPU
eg: a dual processor system should ideally have no more than 2
threads in the queue, or 6 at the most
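
The run queue length is the "r" column of vmstat. The sketch below parses vmstat output and applies the 1 - 3 threads per CPU rule of thumb from this slide. The function names, the status labels and the sample output are illustrative assumptions.

```python
def parse_vmstat(text):
    """Map vmstat column names to the integers on the last data line."""
    rows = [line.split() for line in text.strip().splitlines()]
    header = next(row for row in rows if row and row[0] == "r")
    return dict(zip(header, (int(v) for v in rows[-1])))


def run_queue_status(queued, cpus):
    """Classify a run queue length using the 1 - 3 threads per CPU guideline."""
    if queued <= cpus:
        return "ideal"
    if queued <= 3 * cpus:
        return "acceptable"
    return "overloaded"


# Made-up vmstat output for illustration.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 801236  94660 912800    0    0    12    20  150  300 66 30  3  1  0
"""

stats = parse_vmstat(SAMPLE)
print(stats["r"], run_queue_status(stats["r"], cpus=2))
```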

CPU - PERFORMANCE MONITORING

General Expectations:
CPU Utilisation: if a CPU is fully utilised, ideally the
following balance of utilisation should be achieved:
65% - 70%: User Time
30% - 35%: System Time
0% - 5%: Idle Time

CPU - PERFORMANCE MONITORING

General Expectations:
Context Switches:
a high number of context switches is acceptable if
CPU utilisation stays within the previously mentioned
balance

grep ctxt /proc/$pid/status
/usr/bin/time -v ls 2>&1 | grep context
vmstat
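
The per-process counters that grep shows can also be read programmatically. This is a small sketch that assumes the voluntary_ctxt_switches and nonvoluntary_ctxt_switches lines of /proc/<pid>/status. The function name and sample text are hypothetical.

```python
def context_switches(status_text):
    """Extract the context switch counters from /proc/<pid>/status content."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.endswith("ctxt_switches"):
            counts[key] = int(value)
    return counts


# Made-up excerpt; on Linux, pass open("/proc/self/status").read() instead.
SAMPLE_STATUS = "Name:\tnginx\nvoluntary_ctxt_switches:\t150\nnonvoluntary_ctxt_switches:\t12\n"
print(context_switches(SAMPLE_STATUS))
```

A high nonvoluntary count means the thread is often pre-empted, while a high voluntary count usually means it blocks on IO.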

CPU - PERFORMANCE MONITORING TOOLS

must be low-overhead tools
still practical to run on a heavily loaded
system
able to show the health of the system at a glance

CPU - PERFORMANCE MONITORING

top
vmstat
mpstat

Memory
- Physical Memory
- Virtual Memory


VIRTUAL MEMORY
Virtual Memory = SWAP space on disk + RAM / physical
memory
virtual memory is divided into pages
on the x86 architecture, VM pages = 4 KB
when the kernel writes memory to disk, it writes in
pages

VIRTUAL MEMORY
when an application starts, it requests a Virtual Memory Size (VSZ)
the kernel either grants or denies the VSZ request
as the application uses the requested memory, that memory is
mapped into physical memory (RSS)
RSS (resident set size) is the amount of physical memory
a task is using
in most cases an application uses less RSS than the VSZ
it requested
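
The gap between VSZ and RSS can be checked per process. The sketch assumes the VmSize and VmRSS lines of /proc/<pid>/status, which report kB. The function name and the sample figures are made up for illustration.

```python
def memory_sizes_kb(status_text):
    """Return requested (VSZ) and resident (RSS) memory in kB."""
    wanted = {"VmSize": "vsz_kb", "VmRSS": "rss_kb"}
    sizes = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            sizes[wanted[key]] = int(value.split()[0])
    return sizes


# Made-up excerpt; on Linux, pass open("/proc/self/status").read() instead.
SAMPLE_STATUS = "Name:\tjava\nVmSize:\t 4194304 kB\nVmRSS:\t  524288 kB\n"
sizes = memory_sizes_kb(SAMPLE_STATUS)
print(sizes, "resident share:", sizes["rss_kb"] / sizes["vsz_kb"])
```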


VIRTUAL MEMORY - PAGES

when pages in memory are modified by a running process,
they become dirty
when dirty pages reach a defined percentage of memory (vm.dirty_ratio),
they are written to disk

VIRTUAL MEMORY - DIRTY PAGES

vm.dirty_ratio (kernel param)
defines the maximum percentage of memory that can be filled with dirty
pages before a process that is writing must flush them to disk itself
once this limit is reached, new writes block until dirty pages have been flushed
higher value = dirty pages remain longer in memory
better performance but higher risk of data loss
lower value = flushed to disk more often
slower performance but less risk
default value = 20

VIRTUAL MEMORY - DIRTY PAGES

vm.dirty_background_ratio (kernel param)
defines the percentage of memory that can be
filled with dirty pages before the kernel flusher threads
(pdflush/flush/kdmflush) start writing them to disk in the background
default value = 10
eg: with 64G of RAM, only about 6.4G of dirty data can sit in RAM
before the kernel flusher threads kick in
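
The 64G example can be reproduced with simple arithmetic. The sketch below applies both ratios to a memory size. It is an approximation: the kernel applies the ratios to available memory (free plus reclaimable pages), not to total RAM. The function name and default arguments are illustrative.

```python
GIB = 1024 ** 3


def dirty_thresholds(memory_bytes, background_ratio=10, ratio=20):
    """Approximate byte thresholds for background and blocking writeback."""
    return {
        "background": memory_bytes * background_ratio // 100,
        "blocking": memory_bytes * ratio // 100,
    }


# 64 GiB of memory with the default ratios from the slides.
limits = dirty_thresholds(64 * GIB)
for name, value in limits.items():
    print(name, round(value / GIB, 1), "GiB")
```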

VIRTUAL MEMORY - DIRTY PAGES

vm.dirty_bytes & vm.dirty_background_bytes
same as dirty_ratio and dirty_background_ratio, but expressed in
bytes
setting a ratio value resets the corresponding bytes param to 0,
and vice versa

VIRTUAL MEMORY - DIRTY PAGES

vm.dirty_expire_centisecs
how long dirty data can stay cached before it needs to be
written
default value = 3000 centisecs = 30 seconds
dirty pages older than this value are
written asynchronously to disk
a safeguard against data loss

VIRTUAL MEMORY - DIRTY PAGES

vm.dirty_writeback_centisecs
how often the pdflush/flush/kdmflush threads wake up and
check whether work needs to be done
default value = 500 centisecs = 5 seconds

VIRTUAL MEMORY - SWAPPINESS

vm.swappiness
how aggressively Linux swaps active pages in memory out to
disk
default value = 60
rule of thumb: at 60, Linux starts to consider swapping once about 40% of memory is used
(strictly, the value is a relative weighting between swapping out process memory
and dropping page cache, not a fixed threshold)
a lower value pushes that point closer to the memory maximum - discourages Linux from
swapping
60 is recommended for most desktop use
10 is recommended to improve performance in general
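
Every kernel parameter on these slides is exposed as a file under /proc/sys, with the dots in the sysctl name mapped to directory separators. The helpers below are a hypothetical sketch of that mapping. Names containing dots inside an interface name are not handled.

```python
from pathlib import PurePosixPath


def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl name such as 'vm.swappiness' to its /proc/sys path."""
    return PurePosixPath(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the whitespace-separated values of a sysctl (Linux only)."""
    with open(str(sysctl_path(name, root))) as handle:
        return handle.read().split()


for param in ("vm.swappiness", "vm.dirty_ratio", "vm.dirty_background_ratio"):
    print(param, "->", sysctl_path(param))
```

On a Linux host, read_sysctl("vm.swappiness") returns the live value as a one-element list of strings.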

VIRTUAL MEMORY - CACHES

/proc/sys/vm/drop_caches
0 - do nothing state
1 - free page cache
2 - free reclaimable slab objects
slab allocation - retains allocated memory that contained a
data object of a certain type, so it can be reused by a later allocation of the
same type (deallocation/allocation)
3 - free both types
this is a non-destructive operation: it will not free any dirty objects


MEMORY - PERFORMANCE MONITORING

vmstat

Network
- NIC Ring Buffer
- Hard IRQ
- Soft IRQ
- Networking Tools


NETWORK - NIC RING BUFFER

[Diagram: packets from applications (App 1 ... App N) and packets to forward enter the IP stack as SKBs, pass through the queuing disciplines, and are queued as SKBs in the NIC driver queue, a.k.a. the ring buffer]

NETWORK - NIC RING BUFFER

Ring Buffer
implemented as a First In First Out (FIFO) ring buffer
does not contain the packet data itself
holds descriptors that point to other data structures called
Socket Kernel Buffers (SKB)
the input source for the ring buffer is the IP stack
packets are dequeued by the hardware driver and sent to the NIC via the data bus

NETWORK - NIC RING BUFFER

ethtool - command to show /
change the size of the ring buffer
-g : display
-G : change
with the introduction of Byte Queue
Limits (BQL), a self-tuning algorithm, there is no longer any
need to modify the driver queue
size

NETWORK - QUEUING DISCIPLINES (QDISC)

QDISC
Linux abstraction layer for traffic queues
carries out complex queue management behaviours:
traffic classification
prioritisation
rate shaping
configured through the traffic control (tc) command

NETWORK - HARD IRQ

Hardware IRQ
also known as top half interrupts
when the NIC receives incoming data:
it copies the data to the RX ring buffers
the NIC notifies the kernel of this incoming data by
raising a hardware interrupt
cat /proc/interrupts | grep em1
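
The grep above shows one counter per CPU for each IRQ of the NIC. The sketch below parses that layout to show which CPUs service the interrupts. The function name and the sample content (including the em1-TxRx queue names) are made up for illustration.

```python
def irq_counts(interrupts_text, device):
    """Per-CPU interrupt counts for every IRQ line that mentions `device`."""
    lines = interrupts_text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        if device not in line:
            continue
        fields = line.split()
        irq = fields[0].rstrip(":")
        counts = [int(v) for v in fields[1:1 + len(cpus)]]
        result[irq] = dict(zip(cpus, counts))
    return result


# Made-up sample; on Linux, pass open("/proc/interrupts").read() instead.
SAMPLE = """
           CPU0       CPU1
 16:        120         30   IO-APIC   16-fasteoi   ehci_hcd
 24:      50000        700   PCI-MSI 524288-edge    em1-TxRx-0
 25:        800      42000   PCI-MSI 524289-edge    em1-TxRx-1
"""

print(irq_counts(SAMPLE, "em1"))
```

If a single CPU carries nearly all of the counts, that CPU may become the bottleneck for incoming traffic.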

NETWORK - SOFT IRQ

Software IRQ
also known as bottom half interrupts
their purpose is to drain the NIC receive ring buffers
visible in process monitoring tools such as ps and
top as:
ksoftirqd/cpu-num

NETWORK - MONITORING TOOLS

ss/netstat - dump / show socket statistics
dropwatch - monitors packets freed from memory (dropped) by the
kernel
ip - manages and monitors routes, devices, policy
routing and tunnels
ethtool - displays and changes NIC settings

NETWORK - SOME TUNING PARAMETERS

SoftIRQ Misses
net.core.netdev_budget
if the softirq doesn't run long enough, the rate of incoming data
can exceed the kernel's capability to drain the buffer
fast enough
default value = 300
this allows the softirq to process / drain only 300
packets from the NIC before being booted off the CPU

NETWORK - SOME TUNING PARAMETERS

SoftIRQ Misses
when to increase the net.core.netdev_budget value?
monitor the third column of /proc/net/softnet_stat
if the third column keeps increasing - double the
value
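
Each row of /proc/net/softnet_stat is one CPU and every column is a hexadecimal counter. The sketch below compares the third column between two samples. The function names and the sample rows are illustrative assumptions.

```python
def softnet_time_squeeze(text):
    """Third column of softnet_stat (hex) as integers, one entry per CPU."""
    return [int(line.split()[2], 16) for line in text.strip().splitlines()]


def squeezed_cpus(before, after):
    """Indexes of CPUs whose third-column counter grew between samples."""
    return [cpu for cpu, (old, new) in enumerate(zip(before, after)) if new > old]


# Made-up samples taken some seconds apart (two CPUs, nine columns each).
BEFORE = """
0004a3f1 00000000 0000000a 00000000 00000000 00000000 00000000 00000000 00000000
0003b2c0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""
AFTER = """
0004c810 00000000 0000001f 00000000 00000000 00000000 00000000 00000000 00000000
0003b9d4 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""

growing = squeezed_cpus(softnet_time_squeeze(BEFORE), softnet_time_squeeze(AFTER))
print("CPUs running out of budget:", growing)
```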

NETWORK - SOME TUNING PARAMETERS

increase the maximum number of open file descriptors - default 1024
/etc/security/limits.conf
* soft nofile 100000
* hard nofile 100000
ulimit -n : view the current file descriptor limit
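
To confirm the new limits took effect, the configured entries and the limit the process actually sees can be compared. This is a minimal sketch: the parser only handles the simple four-field form used on this slide, and both function names are hypothetical.

```python
def parse_limits_conf(text, item="nofile"):
    """Collect '<domain> <type> <item> <value>' entries for one item."""
    limits = {}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments
        if len(fields) == 4 and fields[2] == item:
            domain, kind, _, value = fields
            limits[(domain, kind)] = int(value) if value.isdigit() else value
    return limits


def current_nofile_limit():
    """(soft, hard) limit of this process, like `ulimit -n` (Unix only)."""
    import resource  # Unix-only standard library module
    return resource.getrlimit(resource.RLIMIT_NOFILE)


CONF = """
# /etc/security/limits.conf excerpt from the slide
* soft nofile 100000
* hard nofile 100000
"""
print(parse_limits_conf(CONF))
```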

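NETWORK - SOME TUNING PARAMETERS

decrease the time sockets linger after close
by lowering tcp_fin_timeout
default 60
strictly, tcp_fin_timeout bounds the FIN_WAIT_2 state; the TIME_WAIT interval
itself is fixed at 60 seconds in the stock kernel
lowering it too far can cause socket close errors on
networks with lots of jitter
set tcp_tw_reuse = 1 : this tells the kernel it can reuse sockets in
the TIME_WAIT state for new outgoing connections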

NETWORK - TCP TIMEWAIT HACK

by EA Faisal - NexoPrima

patch the kernel to introduce a tcp_timewait_len kernel
parameter
new entry in the /proc FS:
/proc/sys/net/ipv4/tcp_timewait_len
configurable through sysctl:
net.ipv4.tcp_timewait_len
https://github.com/efaisal/linuxtcptw


NETWORK - SOME TUNING PARAMETERS


increase the port range for ephemeral outgoing ports
cat /proc/sys/net/ipv4/ip_local_port_range
default minimum port 32768
change to 10000
default maximum port 61000
change to 65000
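
The gain from widening the range is easy to quantify: each ephemeral port is one more concurrent outgoing connection per destination address and port. The sketch parses the two numbers in ip_local_port_range; the function name is hypothetical.

```python
def port_count(range_text):
    """Number of ephemeral ports in an ip_local_port_range style string."""
    low, high = (int(value) for value in range_text.split())
    if low > high:
        raise ValueError("minimum port is above maximum port")
    return high - low + 1


default_ports = port_count("32768\t61000")
tuned_ports = port_count("10000\t65000")
print(default_ports, tuned_ports)
```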

NETWORK - TCP CONGESTION WINDOW

throughput of a connection is
limited by two windows:
congestion window (CWIN) - the number of maximum
segment size (MSS) segments allowed in flight on
that connection before an acknowledgement
maintained by the sender
receive window (RWIN) - the maximum
amount of data the receiver accepts before it
must be acknowledged
maintained by the receiver
MSS = MTU - (IP header + TCP header) ~= 1460 bytes for a 1500 byte MTU

[Diagram: browser/client and web server exchange SYN, then SYN, ACK; the client advertises RWIN 65k and sends ACK + GET; the server replies with 3 data packets (CWIN = 3), which the client ACKs]
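
The interaction of the two windows can be written down as arithmetic. This is a sketch that assumes 20-byte IP and TCP headers without options. The function names are illustrative.

```python
def mss(mtu=1500, ip_header=20, tcp_header=20):
    """Maximum segment size: MTU minus IP and TCP headers (no options)."""
    return mtu - ip_header - tcp_header


def bytes_in_flight(cwnd_segments, rwnd_bytes, segment_size):
    """Data the sender may have unacknowledged: the smaller of both windows."""
    return min(cwnd_segments * segment_size, rwnd_bytes)


segment = mss()
# CWIN = 3 segments against a 65535-byte receive window:
# the congestion window is the limiting factor.
print(segment, bytes_in_flight(3, 65535, segment))
```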

NETWORK - TCP CONGESTION WINDOW

TCP uses a mechanism called slow start to grow the congestion
window after a connection is initialised and after a timeout
it starts with a window of two times the maximum segment size (MSS) or more, depending on the initial CWIN value
starting from kernel 2.6.39, the default initial CWIN value increased to 10
for every packet acknowledged, the congestion window increases by 1 MSS
when the window exceeds the ssthresh threshold, the connection enters congestion avoidance
mode
the ssthresh value is updated automatically at the end of every slow start
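
Growing by 1 MSS per acknowledged packet roughly doubles the window every round trip. The simplified simulation below shows why the initial CWIN matters for short transfers. It ignores packet loss and the receive window, and the ssthresh value of 64 segments is an arbitrary assumption.

```python
def next_cwnd(cwnd, ssthresh):
    """One round trip: double in slow start, +1 in congestion avoidance."""
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)
    return cwnd + 1


def cwnd_per_rtt(initcwnd, ssthresh, rounds):
    """Congestion window (in segments) at the start of each round trip."""
    windows, cwnd = [], initcwnd
    for _ in range(rounds):
        windows.append(cwnd)
        cwnd = next_cwnd(cwnd, ssthresh)
    return windows


def round_trips_to_send(segments, initcwnd, ssthresh=64):
    """Round trips needed to deliver `segments` with no loss."""
    if initcwnd < 1:
        raise ValueError("initcwnd must be at least 1")
    sent, rounds, cwnd = 0, 0, initcwnd
    while sent < segments:
        sent += cwnd
        rounds += 1
        cwnd = next_cwnd(cwnd, ssthresh)
    return rounds


print(cwnd_per_rtt(10, 64, 6))
# A 30-segment response (about 43 KB) with initial CWIN 3 versus 10.
print(round_trips_to_send(30, 3), round_trips_to_send(30, 10))
```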

NETWORK - TCP CONGESTION WINDOW

a server does not
necessarily send as much as the
client's RWIN (receiver
advertised window size) allows
if the CWIN is a lot
smaller than the
receiver's RWIN - the initial
transfer might not be
optimal

[Diagram: the same handshake with the client advertising RWIN 65k; after ACK + GET the server sends 5 data packets (CWIN = 5)]

NETWORK - TCP CONGESTION WINDOW

net.ipv4.tcp_slow_start_after_idle
controls whether the congestion window is reset to its default size, restarting slow start,
on existing TCP connections that have been idle for too long
persistent (keep-alive) connections are likely to end up in this
state
default value = 1
recommended value for performance - change to 0

NETWORK - TCP CONGESTION WINDOW

changing CWIN and RWIN values in Linux
ip route show
ip route change default via 192.168.1.1 dev em1 proto static initcwnd 10
ip route change default via 192.168.1.1 dev em1 proto static initrwnd 10

NETWORK - SOME TUNING PARAMETERS

tuned : adaptive system tuning daemon
applies tuning settings that give the most desirable
performance

EXPERIENCES
SHARING

LOAD STRESS TEST INFRA - ONLINE UNIV APPLICATION

LOAD STRESS TEST - ONLINE UNIV APPLICATION

[Chart: load stress test results, untuned vs tuned (internal)]

DB CONNECTION LATENCY

Q&A
THANK YOU

adzmely@nexoprima.com
http://blog.nexoprima.com

ANNEX

REFERENCES:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide
https://wiki.mikejung.biz/Ubuntu_Performance_Tuning
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vmdirty_ratio/
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
http://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum-performance/
https://www.wikipedia.org/ : for various topics related
others that I might have forgotten
