Key Takeaways
SQL Server can meet the needs of many of the most challenging OLTP scenarios
in the world.
There are a number of new challenges when designing for high end OLTP
systems.
Hardware
Setup
Database
files
Database Files
# of data files should be at least 25% of CPU cores
This alleviates PFS contention (PAGELATCH_UP)
There is no significant point of diminishing returns up to 100% of CPU cores
But manageability is an issue...
Though Windows Server 2008 R2 makes it much easier
TempDb
PFS contention is a larger problem here, as it's an instance-wide resource
Deallocations and allocations, RCSI version store, triggers, temp tables
# of files should be exactly 100% of CPU threads
Presize at 2 x physical memory
Key Takeaway: Script it! At this scale, manual work WILL drive you
insane
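At this scale the file layout should be generated, not clicked through. A minimal sketch (file count, drive letter, and presize below are placeholder assumptions for illustration, not values from the deck):

```sql
-- Hypothetical: add tempdb data files until there is one per CPU thread.
-- Drive letter T:, 8 GB presize, and 64 threads are assumptions.
DECLARE @i int = 2, @threads int = 64, @sql nvarchar(max);
WHILE @i <= @threads
BEGIN
    SET @sql = N'ALTER DATABASE tempdb ADD FILE (NAME = tempdev' + CAST(@i AS nvarchar(10))
             + N', FILENAME = ''T:\tempdb\tempdev' + CAST(@i AS nvarchar(10)) + N'.ndf'''
             + N', SIZE = 8GB, FILEGROWTH = 0)';
    EXEC sp_executesql @sql;
    SET @i += 1;
END;
```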
SQL Server generally follows suit, but for now 256 cores is the limit on R2
Example x64 machines: HP DL980 (64 Cores, 128 in
HyperThread). IBM 3950 (up to 256 Cores)
And largest IA-64 is 256 Hyperthread (at 128 Cores)
Hardware
[Diagram: a 256-thread machine as Windows sees it — four kernel groups (0–3), each containing eight NUMA nodes (NUMA 0–31); each NUMA node is a CPU socket whose cores each expose two hyperthreads (HT).]
total CPU utilization 15,000 planned for March/April 2011 with ultimate goal
of 25,000+
Workload Characteristics:
6,000-7,000 batches/sec with a Read/Write ratio of about 80/20
Highly normalized schema; lots of relatively complex queries (heavy on loop joins); heavy use of temporary objects (table-valued functions); use of BLOBs; transactional and storage-based replication
cluster instance)
scale-out)
Observation
x64 Servers provide >2x per core processing over previous IA64 CPUs
Challenge
Consideration/Workaround
Network
Concurrency
Transaction Log
Monitoring
Architecture/Hardware
NUMA latencies
Sysinternals CoreInfo
http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
Nehalem-EX
Every socket is a NUMA node
How fast is your interconnect.
Spinlocks
Lightweight synchronization primitives used to protect access to data structures
Used to protect structures in SQL Server such as lock hash tables (LOCK_HASH), security caches (SOS_CACHESTORE) and more
Used when it is expected that resources will be held for a very short duration
Why not yield?
It would be more expensive to yield and context switch than to spin to acquire the resource
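Spinlock pressure can be sampled directly; a sketch using the (undocumented) sys.dm_os_spinlock_stats DMV — take two snapshots over an interval and compare:

```sql
-- Top spinlocks by spin count. sys.dm_os_spinlock_stats is undocumented;
-- the same columns back the statistics shown in the table below.
SELECT TOP (10) name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```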
Threads accessing the same hash bucket of the table are synchronized.
[Diagram: a thread attempts to obtain a lock (row, page, database, etc.); the Lock Manager maintains the resources in a lock hash table, and access to each hash bucket is synchronized by the LOCK_HASH spinlock.]
Spinlocks Diagnosis
Name                 Collisions    Spins            Spins_Per_Collision  Backoffs
SOS_CACHESTORE       14,752,117    942,869,471,526  63,914               67,900,620
SOS_SUSPEND_QUEUE    69,267,367    473,760,338,765  6,840                2,167,281
LOCK_HASH            5,765,761     260,885,816,584  45,247               3,739,208
MUTEX                2,802,773     9,767,503,682    3,485                350,997
SOS_SCHEDULER        1,207,007     3,692,845,572    3,060                109,746
--create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (action (package0.callstack)
where
type = 144
--SOS_CACHESTORE
)
add target package0.asynchronous_bucketizer (
set filtering_event_name='sqlos.spinlock_backoff',
source_type=1, source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)
--Ensure the session was created
select * from sys.dm_xe_sessions
where name = 'spin_lock_backoff'
--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start
--wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'
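The deck stops at the measurement step; a hedged sketch of the read-out and teardown that usually follows (session and target names as created above):

```sql
--dump the bucketized callstacks from the asynchronous_bucketizer target
select xst.target_name, cast(xst.target_data as xml) as callstacks
from sys.dm_xe_session_targets xst
join sys.dm_xe_sessions xs
    on xs.address = xst.event_session_address
where xs.name = 'spin_lock_backoff'

--stop and drop the session when done
alter event session spin_lock_backoff on server state = stop
drop event session spin_lock_backoff on server
```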
A complete walkthrough
of the technique can be
found here:
http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx
Huge increase in the number of spins & backoffs associated with SOS_CACHESTORE
Approach: Use extended events to profile the code path with the spinlock contention (i.e.
where there is a high number of backoffs)
Root cause: Regeneration of security tokens exposes contention in code paths for
access permission checks
Workaround/Problem Isolation: Run with sysadmin rights
Long Term Changes Required: SQL Server fix
Sustain expected peak load of ~230 business transactions (checks) per second
Workload Characteristics:
Heavy insert into a few tables, periodic range scans of newly added data
Hardware/Deployment Configuration:
Custom test harness, 12 Load Generators, 5 Application servers
Database servers: HP DL 785
availability.
Strict uptime requirements.
SQL Server Failover Clustering for local (within datacenter) availability
Storage based replication (EMC SRDF) for disaster recovery
Quick recovery time for failover is a priority.
Observation
Initial tests showed low overall system utilization
Long duration for insert statements
High waits on buffer pages (PAGELATCH_EX/PAGELATCH_SH)
Network bottlenecks once the latch waits were resolved
Recovery times (failure to DB online) after failover under full load were between 45
[Diagram: test topology — 12 load drivers (2-proc quad-core x64, 32+ GB memory) drive 5 app servers (BL460, 2-proc quad-core, 32-bit, 32 GB memory) over network switches; the Transaction DB server (1 x DL785: 8P quad-core, 2.3 GHz, 256 GB RAM) and the Reporting DB server (1 x DL585: 4P dual-core, 2.6 GHz, 32 GB RAM) attach through Brocade 4900 SAN switches (32 ports active) to an EMC CX-960 (240 drives, 15K, 300 GB).]
Concurrency
Transaction Log
No log bottlenecks were observed. When cache on the array behaves well, log response times are very low.
Monitoring
Architecture/Hardware
Hot Latches!
We observed very high waits for PAGELATCH_EX
High = more than 1 ms; we observed greater than 20 ms
Be careful drawing conclusions just on averages
[Diagram: two concurrent statements, INSERT VALUES (298, xxxx) and INSERT VALUES (299, xxxx), each take an IX lock on the same 8K page; one holds EX_LATCH on the page while the other sits in an EX_LATCH wait.]
wait_type            % Wait Time
PAGELATCH_EX         86.4%
PAGELATCH_SH         8.2%
LATCH_SH             1.5%
LATCH_EX             1.0%
LOGMGR_QUEUE         0.9%
CHECKPOINT_QUEUE     0.8%
ASYNC_NETWORK_IO     0.8%
WRITELOG             0.4%

latch_class                        wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT   156,818
LOG_MANAGER                        103,316
select wait_type, wait_time_ms
from sys.dm_os_wait_stats
where wait_time_ms > 0
and wait_type like '%PAGELATCH%'
order by wait_time_ms desc
[Diagram: a B-tree of tree pages above leaf pages.]
Our scenario
Two tables were insert heavy, by far receiving the highest number of inserts
Mainly INSERT; however, there is a background process reading off ranges of the newly added data
And don't forget:
We have to obtain latches on the non-leaf B-tree pages as well
Page latch waits vs. tree page latch waits (sys.dm_db_index_operational_stats)
[Diagram: B-tree index — tree pages at the top, leaf-level data pages below.]
Before: with an ever-increasing key, every INSERT lands on the last data page (key range 3001–4000) — a single hot spot.
After: INSERTs are spread across the key ranges 0–1000, 1001–2000, 2001–3000 and 3001–4000.
[Diagram: before/after page-level view of the insert hot spot.]
-- Add the computed column to the existing table (this is an OFFLINE operation if done the simple way)
-- Consider using bulk loading techniques to speed it up.
ALTER TABLE [dbo].[Transaction]
ADD [HashValue] AS (CONVERT([tinyint], abs(binary_checksum([uidMessageID]) % (16)), (0)))
PERSISTED NOT NULL
Note: Requires
application
changes
CREATE UNIQUE CLUSTERED INDEX [IX_Transaction_ID]
ON [dbo].[Transaction] ([Transaction_ID], [HashValue])
ON ps_hash16 (HashValue)
Ensure
Select/Update/Delete
have appropriate partition elimination
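ps_hash16 is referenced above but never defined in the deck; a plausible definition for 16 hash partitions (an assumption — boundary values and filegroup placement are illustrative):

```sql
-- 15 boundary points => 16 partitions covering the tinyint hash values 0..15
CREATE PARTITION FUNCTION pf_hash16 (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14);

-- All partitions on PRIMARY here; spread across filegroups as needed
CREATE PARTITION SCHEME ps_hash16
AS PARTITION pf_hash16 ALL TO ([PRIMARY]);
```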
1. Before any network changes, the workload was CPU bound on CPU0.
2. After tuning RSS, disabling Base Filtering Service, and explicitly enabling TCP Chimney Offload, CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.
To DTC or not to DTC: POS System
COM+ transactional applications are still prevalent today
This results in all database calls enlisting in a DTC transaction
45% performance overhead
Scenario in the lab involved two Resource Managers, MSMQ and SQL:
wait_type                total_wait_time_ms  total_waiting_tasks_count  average_wait_ms
DTC_STATE                5,477,997,934       4,523,019                  1,211
PREEMPTIVE_TRANSIMPORT   2,852,073,282       3,672,147                  776
PREEMPTIVE_DTC_ENLIST    2,718,413,458       3,670,307                  740
Tuning approaches
Learn about SQL Server capabilities and challenges experienced by some of our
extreme OLTP customer scenarios.
Insight into diagnosing and architecting around issues with Tier-1, mission
critical workloads.
Key Takeaways
SQL Server can meet the needs of many of the most challenging OLTP scenarios
in the world.
There are a number of new challenges when designing for high end OLTP
systems.
Q & A
2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to
be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Agenda
Customer Requirements
Hardware setup
Transaction log essentials
Top statistics
Category                  Metric
Largest single database   80 TB
Largest table             20 TB
…                         2.5 PB
…                         60,000
…                         18 GB/sec (26 GB/sec)
…                         1 sec latency
…                         20 minutes
…                         12 TB
Customer Scenarios
                      Core Banking                                      Healthcare System   POS
Workload              Credit Card transactions from ATM and Branches    …                   …
Scale Requirements    10,000 Business Transactions / sec                …                   …
Technology / Server   HP Superdome                                      HP DL785G6          DL785
Network packets interrupt the CPU
These must be handled by CPU cores
Must distribute packets to cores for processing
indications:
http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
Use the affinity mask to keep SQL Server off the cores running NIC traffic
Well tuned, pure-play OLTP
No need to consider parallel plans
sp_configure 'max degree of parallelism', 1
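Both settings above can be applied with sp_configure; a sketch (the affinity mask value is a placeholder that must match your actual core layout):

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Pure-play OLTP: force serial plans
EXEC sp_configure 'max degree of parallelism', 1;

-- Hypothetical: on an 8-core box, mask 254 (0xFE) keeps SQL Server off CPU 0,
-- leaving that core for NIC interrupt processing
EXEC sp_configure 'affinity mask', 254;
RECONFIGURE;
```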
Account
INSERT .. VALUES (@amount)
INSERT .. VALUES (-1 * @amount)
[Schema: ATM (10^3 rows): ID_ATM, ID_Branch, LastTransactionDate, LastTransaction_ID. Account (10^5 rows): Account_ID, LastUpdateDate, Balance. Transaction (10^10 rows): Transaction_ID, Customer_ID, ATM_ID, Account_ID, TransactionDate, Amount.]
Summary of Concerns
[Diagram: the ATM, Account, and Transaction tables as above, with the contention points highlighted.]
Generating a Unique ID
Why won't this work?

CREATE PROCEDURE GetID
  @ID INT OUTPUT,
  @ATM_ID INT
AS
DECLARE @LastTransaction_ID INT

SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = @ATM_ID

SET @ID = @LastTransaction_ID + 1

UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = @ATM_ID
Concurrency is Fun
ATM row: ID_ATM = 13, LastTransaction_ID = 42
Session 1: SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = 13   (@LastTransaction_ID = 42)
Session 2: SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = 13   (@LastTransaction_ID = 42)
Session 1: SET @ID = @LastTransaction_ID + 1; UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = 13
Session 2: SET @ID = @LastTransaction_ID + 1; UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = 13
Both sessions hand out @ID = 43
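One standard repair for the race above (our assumption, not a fix stated in the deck) is to read and increment in a single atomic UPDATE:

```sql
-- The compound SET assigns the column and the output variable in one atomic
-- statement, so two sessions can no longer read the same LastTransaction_ID.
CREATE PROCEDURE GetID
  @ID INT OUTPUT,
  @ATM_ID INT
AS
UPDATE ATM
SET @ID = LastTransaction_ID = LastTransaction_ID + 1
WHERE ATM_ID = @ATM_ID;
```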
High waits for LCK_M_U
Diagnosed in sys.dm_os_wait_stats
Drilling down to individual locks using sys.dm_tran_locks
Inventive readers may wish to use XEvents
Event objects: sqlserver.lock_acquired and sqlos.wait_info
Bucketize them
71
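The drill-down into which resources the U-lock waiters are queued on can be sketched as:

```sql
-- Group current lock waits by resource; U-mode waits point at the hot rows/pages
SELECT resource_type, resource_database_id, resource_description,
       request_mode, COUNT(*) AS waiting_requests
FROM sys.dm_tran_locks
WHERE request_status = 'WAIT'
GROUP BY resource_type, resource_database_id, resource_description, request_mode
ORDER BY waiting_requests DESC;
```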
Spinning around
[Chart: throughput (requests/sec, 0–20,000) versus lg(Spins) (1.00E+00 to 1.00E+14) as the number of requests grows from 0 to 100,000; spins grow explosively while throughput flattens and then collapses.]
More Threads
[Diagram: many threads converge on the Lock Manager's lock hash table; the LOCK_HASH spinlock protecting the row's hash bucket serializes them while each waits for the LCK_U row lock.]
Locking at Scale
Ratio between ATM machines and transactions generated was too low.
Can only sustain a limited number of locks/unlocks per second
Depends a LOT on NUMA hardware, memory speeds and CPU caches
Each ATM was generating 200 transactions / sec in the test harness
Hot Latches!
Locks (LCK_U) guarantee row consistency; latches are internal SQL Engine primitives guaranteeing memory consistency
[Diagram: an 8K page holding several rows — PAGELATCH_EX protects the page in memory while LCK_U locks protect the individual rows.]
Row Padding
Pad each row with a CHAR(5000) so only one row fits per 8K page: 1 LCK = 1 PAGELATCH

ALTER TABLE ATM
ADD Padding CHAR(5000) NOT NULL DEFAULT ('X')

[Diagram: the padded page — one ROW plus CHAR(5000) padding; PAGELATCH_EX and LCK_U now cover the same single row.]
INSERT throughput
Transaction table is by far the most active table
Fortunately, INSERT only — no need to lock rows
But several rows must still fit in a single page
wait_type            % Wait Time
PAGELATCH_SH         86.4%
PAGELATCH_EX         8.2%
LATCH_SH             1.5%
LATCH_EX             1.0%
LOGMGR_QUEUE         0.9%
CHECKPOINT_QUEUE     0.8%
ASYNC_NETWORK_IO     0.8%
WRITELOG             0.4%

latch_class                        wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT   156,818
LOG_MANAGER                        103,316
Hash partitioning the B-tree
[Diagram: each INSERT hashes the ID into 8 partitions (hash values 0–7); partition 0 receives IDs 0, 8, 16, …, partition 1 receives 1, 9, 17, …, up to partition 7 with 7, 15, 23, …. Each partition is its own B-tree over the key ranges (0–1000, 1001–2000, 2001–3000, 3001–4000), so concurrent INSERTs spread across 8 hot pages instead of one.]
[Diagram: the hash function mapping IDs into 256 buckets (0–255).]
[Diagram: page-split latching — the ACCESS_METHODS_HOBT_VIRTUAL_ROOT latch serializes root splits; SH latches are taken down the tree and EX page latches (PAGELATCH) on the leaf and its Prev/Next neighbors while the row lock (LCK) is held.]
A page must be in memory before a CPU can work on it
On NUMA systems, going to a foreign memory node takes longer than local access
The first NUMA node to request a page will own that page
Ownership continues until the page is evicted from the buffer pool
Every other NUMA node that needs that page will have to do foreign memory access
[Diagram: app servers issue UPDATE ATM SET LastTransaction_ID = … statements that land arbitrarily on NUMA nodes 1–3, so most updates touch pages owned by a foreign node.]
[Diagram: the same UPDATE ATM SET LastTransaction_ID = … workload, but each app server connects to a dedicated port (8000–8003); port-to-NUMA-node affinity keeps each update on the node that owns the page.]
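Port-to-node affinity is configured on the TCP/IP listener (SQL Server Configuration Manager → TCP port), where a bracketed value after each port is a NUMA node affinity mask. A sketch — the ports and masks below are illustrative assumptions:

```
TCP Port: 8000[0x1],8001[0x2],8002[0x4],8003[0x8]
```

Each app server then connects only to "its" port, so its connections are serviced by a single NUMA node.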
Workload Characteristics:
Multiple systems comprise the gaming experience including payment, casino
Hardware/Deployment Configuration:
Scale-up the payment system. HP Superdome (32-socket, 2 core; 256GB).
Investigating x64
Co-operative Scale-out for actual gaming activity
http://sqlcat.com/whitepapers/archive/2010/06/07/proven-sql-server-architectures-for-high-availability-and-disaster-recovery.aspx
http://sqlcat.com/whitepapers/archive/2010/11/03/failure-is-not-an-option-zero-data-loss-and-high-availability.aspx
Observation
Large scale of users, with low latency requirements
Hot spots on heavily hit tables - page latching
Scale-out helped increase transaction volume (#/sec)
[Diagram: server landscape by function and approximate server counts — Games 12+ (plus 1x2 Games 2+), CMS 15+, Newsletter 2+, BGI/CSM 2+, Payment 20+ (with replication), ASP.NET Sessions 8+, SMS 4+, DWH Stage 50+, DWH 60+, OLAP 10+, Monitoring 10+, Administration 20+, internal Office/SharePoint 300+, and various Other groups 20–40+.]
Challenge
Consideration/Workaround
Network
CPU bottlenecks for network processing were observed and resolved via network tuning (RSS)
Dedicated networks for backup, replication, etc.
8 network cards for clients
Concurrency
Transaction Log
Monitoring
Security monitoring (PCI and intrusion detection) between 10%-25% impact/overhead when
monitoring
Architecture/Hardware
Workload Characteristics:
Send large batch containing multiple business transactions, parse and end up
Hardware/Deployment Configuration:
Load distributed based on alphabetical split.
Co-operative Scale-out. Commodity based hardware (2-socket, quad-core pre-
Observation
Extreme low latency and high throughput requirements with machine born data
Resolution: Batch data into a single large parameter (varchar(8000)) to avoid network roundtrips
Transaction Log
Monitoring
Log waits:
Resolution: Batching business transactions within a single COMMIT to avoid WRITELOG waits
Testing SSDs for the log helped with latency.
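The commit-batching resolution can be sketched as follows (the table name and batch size are illustrative assumptions, not from the deck):

```sql
-- Group many business transactions under one COMMIT so they share one log
-- flush: one WRITELOG wait for the whole batch instead of one per insert.
DECLARE @i int = 0, @BatchSize int = 100;  -- batch size is an assumption
BEGIN TRAN;
WHILE @i < @BatchSize
BEGIN
    INSERT dbo.BusinessTransaction (Payload) VALUES (N'...');  -- hypothetical table
    SET @i += 1;
END;
COMMIT;  -- single log flush for the batch
```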