
SAP HANA High Availability

Business Continuity requires that the operation of business critical systems remain highly
available at all times, even in the presence of failures. This paper discusses the functionality of
SAP HANA in support of High Availability and Disaster Recovery.

Updated for HANA 2.0 SPS00
HANA 2.0 SPS00: new operation mode logreplay_readaccess for system replication with Active/Active (read enabled) feature
__________________________
chaim.bendelac@sap.com & mechthild.bore-wuesthof@sap.com
SAP HANA Development Team
Table of Contents

Legal Disclaimer
1 Introduction
  SAP HANA
  About this Document
2 What is High Availability?
  Recovery - Key Performance Indicators
3 Eliminating Single Points of Failure
  Hardware Redundancy
  Network Redundancy
  Data Center Redundancy
4 SAP HANA High Availability Support
  Backups
  Storage Replication
  System Replication
  Service Auto-Restart
  Host Auto-Failover
5 Design for High Availability
  Planning for Failure
6 In Summary
Glossary
  Industry Terms
  SAP HANA Terms

Legal Disclaimer
THIS DOCUMENT IS PROVIDED FOR INFORMATION PURPOSES ONLY AND DOES NOT MODIFY THE TERMS OF ANY AGREEMENT.
THE CONTENT OF THIS DOCUMENT IS SUBJECT TO CHANGE AND NO THIRD PARTY MAY LAY LEGAL CLAIM TO THE CONTENT OF
THIS DOCUMENT. IT IS CLASSIFIED AS CUSTOMER AND MAY ONLY BE SHARED WITH A THIRD PARTY IN VIEW OF AN ALREADY
EXISTING OR FUTURE BUSINESS CONNECTION WITH SAP. IF THERE IS NO SUCH BUSINESS CONNECTION IN PLACE OR
INTENDED AND YOU HAVE RECEIVED THIS DOCUMENT, WE STRONGLY REQUEST THAT YOU KEEP THE CONTENTS
CONFIDENTIAL AND DELETE AND DESTROY ANY ELECTRONIC OR PAPER COPIES OF THIS DOCUMENT. THIS DOCUMENT SHALL
NOT BE FORWARDED TO ANY OTHER PARTY THAN THE ORIGINALLY PROJECTED ADDRESSEE.

This document outlines our general product direction and should not be relied on in making a purchase decision. This document is not subject
to your license agreement or any other agreement with SAP. SAP has no obligation to pursue any course of business outlined in this
presentation or to develop or release any functionality mentioned in this document. This document and SAP's strategy and possible future
developments are subject to change and may be changed by SAP at any time for any reason without notice. This document is provided
without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a
particular purpose, or non-infringement. SAP assumes no responsibility for errors or omissions in this document and shall have no liability for
damages of any kind that may result from the use of these materials, except if such damages were caused by SAP intentionally or grossly
negligent.

Copyright 2017 SAP SE. All rights reserved.


1 Introduction
SAP HANA
SAP HANA is an innovative in-memory database and data management platform, specifically developed to
take full advantage of the capabilities provided by modern hardware to increase application performance.
Keeping all relevant data in main memory significantly accelerates data processing operations.

Design for scalability is a core SAP HANA principle. SAP HANA can be distributed across multiple hosts
to achieve scalability in terms of both data volume and user concurrency. Unlike clusters, distributed HANA
systems also distribute the data efficiently, achieving high scaling without I/O locks.

The key performance indicators of SAP HANA appeal to many of our customers, and thousands of deployments
are in progress. SAP HANA has become the fastest growing product in SAP's 40+ year history.

About this Document


Loss of business critical system resources and services, like SAP HANA, translates directly into lost revenue.
The goal therefore is Business Continuity, using systems designed for continuous operation even in the
presence of inevitable failures. Mission critical systems require High Availability; this is no longer optional.

SAP HANA is fully designed for High Availability, supporting a broad range of recovery scenarios from various
faults, from simple software errors to disasters that decommission an entire site.

This paper describes SAP HANA's High Availability support for Fault and Disaster Recovery. A comprehensive
High Availability solution offers more design choices and requires the discussion of more details than
can be covered in a short paper, and may therefore require additional consultations.

2 What is High Availability?


Availability, the measure of a system's operational continuity, is expressed as the percentage of time during
which the system is operational; the remainder is downtime. For example, if a given system is designed to be
available for 99.9% of the time (sometimes called "three nines"), its downtime per year must be less than
0.1%, or about 9 hours.
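The arithmetic behind the "three nines" example can be sketched as follows. This is a minimal illustration, assuming a 365-day year; the function name is hypothetical and not part of any SAP tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99):
    print(f"{target}% available -> {max_downtime_hours(target):.2f} hours of downtime per year")
```

For 99.9% this gives roughly 8.76 hours, which the text rounds to 9 hours.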

Downtime is the consequence of outages, which may be intentional (e.g. for system upgrades) or caused by
unplanned faults. A fault can be due to equipment malfunction, software or network failures, or due to a
major disaster such as a fire, a regional power loss or a construction accident, which may decommission the
entire data center.

High Availability is a set of techniques, engineering practices and design principles for Business Continuity.
This is achieved by eliminating single points of failure (fault tolerance), and providing the ability to rapidly
resume operations after a system outage with minimal business loss (fault resilience).

Fault Recovery is the process of recovering and resuming operations after an outage due to a fault. Disaster
Recovery is the process of recovering operations after an outage due to a prolonged data center or site failure.
Preparing for disasters may require backing up data across longer distances, and may thus be more complex
and costly.

Recovery - Key Performance Indicators


Customers commonly use two key measures to specify the recovery parameters of a system following an
outage: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). The RPO and RTO of
a system are illustrated below:

Figure: RPO and RTO

The RPO is the maximal permissible period of time during which operational data may be lost without
ability to recover (time between the last backup and the crash).
The RTO is the maximal permissible time it takes to recover the system, so that its operations can
resume.
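The two measures can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the timestamps are invented for the example, and it simply compares one observed outage against given objectives.

```python
from datetime import datetime, timedelta


def check_objectives(last_backup, crash, resumed, rpo, rto):
    """Return (data_loss_window, recovery_time, rpo_met, rto_met) for one outage."""
    data_loss_window = crash - last_backup  # work done since the last backup is lost
    recovery_time = resumed - crash         # how long operations were interrupted
    return data_loss_window, recovery_time, data_loss_window <= rpo, recovery_time <= rto


loss, outage, rpo_ok, rto_ok = check_objectives(
    last_backup=datetime(2017, 1, 10, 2, 0),
    crash=datetime(2017, 1, 10, 9, 30),
    resumed=datetime(2017, 1, 10, 11, 0),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=2),
)
print(f"lost {loss} of data (RPO met: {rpo_ok}), down for {outage} (RTO met: {rto_ok})")
```

In this made-up outage the system is back within the RTO, but the last backup is too old to satisfy the RPO.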

3 Eliminating Single Points of Failure


The key to achieving fault tolerance is to eliminate single points of failure by introducing redundancy. SAP
HANA appliance vendors deliver several levels of redundancy to avoid outages due to component failure, which
are briefly discussed here. Generally speaking, these techniques are "transparent" to SAP HANA's operation,
but they form a crucial line of defense against avoidable system outages, and therefore greatly contribute to
Business Continuity [1].

Hardware Redundancy
SAP HANA appliance hardware vendors design multiple layers of redundant hardware components and
subsystems. These include redundant and hot-swappable power supply units (PSUs), fans, network interface
cards and enterprise-grade error-correcting protected memories. These subsystems are designed such that
the redundant component can sustain the operation of the system if the other component fails [2].

Particularly critical is the storage system. Enterprise-grade storage systems combine multiple physical drives
into logical units, with built-in standard (RAID) techniques for redundancy and error recovery. These include
mirroring, the writing of the same data to two different drives in parallel, and parity, extra bits written to
allow the detection and automatic correction of errors [3].

Network Redundancy
Redundant networks, network equipment and network connectivity are required to keep network failures from
affecting system availability. This is typically accomplished by deploying a completely redundant switch
topology, using the Spanning Tree Protocol to avoid loops. Routers can be configured with the Hot Standby
Router Protocol (HSRP) for automatic failover. BGP is commonly used to manage dual WAN connections.

Data Center Redundancy


Data centers that host SAP HANA solutions are equipped with Uninterruptible Power Supplies (UPS) and backup
power generators, redundant cooling systems and multi-sourced providers of network connectivity and
electricity, achieving operational availability in the presence of individual failures, and significantly reducing
the probability of a business-impacting outage.

Some enterprises operate fully duplicated data centers, providing a high level of disaster tolerance.

4 SAP HANA High Availability Support


As an in-memory database, SAP HANA must not only concern itself with maintaining the reliability of its data
in the event of failures, but also with resuming operations with most of that data loaded back in memory as
quickly as possible.

The following figure shows the phases of High Availability. The first phase is readiness, being prepared for the
inevitable fault. During this time, data is backed up and standby systems are ready to take over. A fault must
be detected, either automatically or administratively (to avoid false positives), and a recovery process is put
in action. Finally, the fault must be repaired, and the system may need to be reverted to the original
configuration (failed back), to be ready again for the next fault.

[1] The SAP HANA software itself is a single point of failure, as it can cease to operate due to software errors or extreme
out-of-memory situations. Fault Recovery support is discussed in the next section.
[2] An example of high availability hardware design can be found here: http://www.redbooks.ibm.com/redpapers/pdfs/redp4864.pdf
[3] Read further: http://download.intel.com/support/motherboards/server/sb/enterprise_class_versus_desktop_class_hard_drives_.pdf

Different RPO/RTO values can be associated with different kinds of faults. Business critical systems are
expected to operate with an RPO of zero data loss in the case of local faults, and often even in the case of a
disaster. But the challenges of disaster recovery are different from those of locally recoverable faults; to
achieve zero RPO and low RTO, data must be replicated synchronously over longer distances, which impacts
regular system performance and may require more expensive standby and failover solutions.

All of this leads to tradeoff decisions around the attributes of fault recovery functionality, cost and complexity.
SAP accordingly offers complementary design options, including three levels of Disaster Recovery support and
two automatic Fault Recovery support features, summarized in the following table and further discussed in
the sections below.

                 Option                                               Cost  RPO  RTO
DISASTER         1. Backups                                           $     >0   high
RECOVERY         2. Storage Replication                               $$    ~0   med
SUPPORT          3. System Replication                                $$$   0    low
                 4. System Replication Active/Active (read enabled)   $     0    low
                 5. System Replication w/o data preload               $     0    med
FAULT RECOVERY   1. Service Auto-Restart                              0     0    med
SUPPORT          2. Host Auto-Failover                                $$    0    med

Backups
SAP HANA uses in-memory technology, but of course, it fully persists any transaction that changes the data,
such as row insertions, deletions and updates, so it can resume from a power outage without loss of data.
SAP HANA persists two types of data to storage: transaction redo logs, and data changes in the form of
savepoints.

A transaction redo log is used to record a change. To make a transaction durable, it is not required to persist
the complete data when the transaction is committed; instead it is sufficient to persist the redo log. Upon an
outage, the most recent consistent state of the database can be restored by replaying the changes recorded
in the log, redoing completed transactions and rolling back incomplete ones.

A savepoint is a periodic point in time, when all the changed data is written to storage, in the form of pages.
One goal of performing savepoints is to speed up restart: when starting up the system, logs need not be
processed from the beginning, but only from the last savepoint position. Savepoints are coordinated across
all processes (called SAP HANA services) and instances of the database to ensure transaction consistency. By
default, savepoints are performed every five minutes, but this can be configured.
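The interplay of savepoints and redo logs during restart can be sketched with a toy model. This is not SAP HANA's implementation: the log entry format and function name are invented, and the sketch assumes the savepoint holds only committed data and that no transaction spans the savepoint.

```python
def recover(savepoint_state, savepoint_pos, redo_log):
    """Rebuild the database state from a savepoint plus the redo log written after it.

    redo_log entries are ("write", txn, key, value) or ("commit", txn).
    """
    state = dict(savepoint_state)       # start from the last savepoint
    tail = redo_log[savepoint_pos:]     # only the log after the savepoint is replayed
    committed = {entry[1] for entry in tail if entry[0] == "commit"}
    for entry in tail:
        # Redo writes of completed transactions; skipping the others rolls them back.
        if entry[0] == "write" and entry[1] in committed:
            _, _txn, key, value = entry
            state[key] = value
    return state


log = [
    ("write", "t1", "a", 1), ("commit", "t1"),   # covered by the savepoint below
    ("write", "t2", "b", 2), ("commit", "t2"),   # committed after the savepoint
    ("write", "t3", "a", 9),                     # never committed before the crash
]
print(recover({"a": 1}, savepoint_pos=2, redo_log=log))
```

The more recent the savepoint, the shorter the log tail that has to be replayed, which is why savepoints shorten restart time.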

Savepoints normally overwrite older savepoints, but it is possible to freeze a savepoint for future use; this is
called a snapshot. Snapshots can be replicated in the form of full data backups, which can be used to restore
a database to a specific point in time. This can be useful in the event of data corruption, for instance. In
addition to data backups and snapshots, smaller periodic log backups ensure the ability to recover from fatal
storage faults with minimal loss of data. While full data backups contain all current data, delta backups
(available since HANA 1.0 SPS11) contain only the data that was changed since the last data backup. Two
types of delta backups are distinguished: incremental backups contain all data changed since the last
full or delta backup, and differential backups contain all data changed since the last full backup.
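The difference between the two delta backup types shows up when deciding which backups a restore needs. The following sketch works under the definitions above; the tuple format and function name are hypothetical, and log backups are left out.

```python
def restore_chain(backups):
    """Pick the data backups needed to restore the newest state.

    backups: list of (timestamp, kind), oldest first, where kind is
    "full", "differential" or "incremental".
    """
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    after_full = backups[last_full + 1:]
    differentials = [i for i, (_, kind) in enumerate(after_full) if kind == "differential"]
    start = 0
    if differentials:
        # The newest differential already contains every change since the full backup.
        chain.append(after_full[differentials[-1]])
        start = differentials[-1] + 1
    # Incrementals only cover changes since the previous data backup, so all are needed.
    chain.extend(b for b in after_full[start:] if b[1] == "incremental")
    return chain


history = [(1, "full"), (2, "incremental"), (3, "differential"), (4, "incremental"), (5, "incremental")]
print(restore_chain(history))
```

A differential backup shortens the restore chain at the price of a larger backup; incrementals are smaller but must all be applied in order.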
Figure: Local Persistence and Backups

The above figure shows the savepoints, saved to local storage, and the additional backups, saved to backup
storage. Local recovery from the crash uses the latest savepoint, and then replays the last logs, to recover
the database without any data loss. If the local storage was corrupted by the crash, it is still possible to
recover the database from the data backup (or last snapshot), and log backups, possibly with some data loss.

Regularly shipping backups to a remote location over a network or via couriers can be a simple and relatively
inexpensive way to prepare for a disaster. Depending on the frequency and shipping method, this approach
may have an RPO of hours to days.

Figure: Backup Shipping

Storage Replication


One drawback of backups is the potential loss of data from the time of the last backup to the time of the
failure. A preferred solution therefore is to provide continuous replication of all persisted data. Several SAP
HANA hardware partners offer a storage-level replication solution, which delivers a backup of the volumes or
file system to a remote, networked storage system. In some of these vendor-specific solutions, which are
certified by SAP, the SAP HANA transaction only completes when the locally persisted transaction log has been
replicated remotely. This is called synchronous storage replication. Synchronous storage replication can be
used only where the distance between the primary and backup site is up to 100 kilometers (one or few hops,
with no more than ~5 µs latency per kilometer), allowing for sub-millisecond round-trip latencies.

Figure: Storage Replication

Due to its continuous nature, storage replication (sometimes also called remote storage mirroring) offers a
more attractive RPO than backups, but this solution of course requires a reliable, high bandwidth and low
latency connection between the primary site and the secondary site.

In the event of a disastrous failure that justifies full system failover, an administrator attaches a standby
system to the replicated storage, and then restarts the SAP HANA system. The administrator must take care
that the failed primary system can no longer write to the replicated storage (an action called fencing), or else
there is a risk of data corruption, with two systems writing to the same storage.

System Replication
System Replication is an alternative HA solution for SAP HANA, providing an extremely short RTO, and
compatible with all SAP HANA hardware partner solutions. System replication employs an "N+N" approach,
with a secondary standby SAP HANA system with the same number of active nodes as the active, primary
system. Each service and instance of the primary SAP HANA system communicates pairwise with a counterpart
in the secondary system [4].

Figure: System Replication

The secondary system can be located near the primary system to serve as a rapid failover solution for planned
downtime, or to handle storage corruption or other local faults. Alternatively, or additionally (multi-tiered or
cascaded), a secondary system can be installed in a remote site for disaster recovery. Like Storage Replication,
this Disaster Recovery option requires a reliable link between the primary and secondary sites.

The instances in the secondary system operate in live replication mode. In this mode, all secondary system
services constantly communicate with their primary counterparts, replicate and persist data and logs, and
typically load data to memory. The log and data can be compressed before shipping. The secondary system
can accept queries when system replication was set up as an Active/Active (read enabled) configuration;
otherwise it does not accept requests or queries. With the Active/Active setup the secondary system can be
used to handle reporting workload without disrupting the primary system.

In an alternative configuration, called system replication without data preload, the secondary system does
not preload data, and hence consumes very little memory. This allows the hosts of the secondary system to
serve dual purposes, for instance for development or test/QA with separate storage. Before takeover, these
activities must of course be turned off. The tradeoff is a longer RTO in case of failover.

Here is how system replication works. When the secondary system is brought up to start running in live
replication mode, each service component establishes a connection with its primary system counterpart, and
requests a snapshot of the data. From then on, all logged changes in the primary system are replicated.
Whenever logs are persisted in the primary system (i.e. written to the log volumes of each service), they are
also sent to the secondary system. A transaction in the primary system is not committed until the redo logs
are replicated, as determined by a log replication option:

Synchronous: The primary system waits to commit the transaction until it receives a reply
that the log is persisted in the secondary system. This mode guarantees immediate consistency
between both systems, at the cost of delaying the transaction by the time for data transmission and
persisting in the secondary system.

The question of what to do if replication fails (for instance due to a network fault) is governed by the
full sync configuration option. It can be set to commit the transaction, or to fail the commit on the
primary system, until replication is restored.

[4] From HANA 1.0 SPS09, SAP HANA supports multi-tenant database containers. System replication can only be set up for
the system as a whole, not per individual tenant.


Synchronous in-memory: The primary system commits the transaction after it receives a reply
that the log was received by the secondary system, but before it was persisted. The transaction delay
in the primary system is shorter, because it only includes the data transmission time.

Asynchronous: The primary system commits the transaction after sending the log without waiting
for a response. This eliminates the synchronization latency, at the risk of minor theoretical data loss
during failure. This mode is most useful when the secondary site is hundreds of kilometers away
from the primary site, or when reducing latency is critical.
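The latency tradeoff between the three log replication options can be summarized in a rough model. The sketch below is a simplification for illustration: the mode labels, parameter names and timings are assumptions, and real commit latency also depends on local log persistence and network jitter.

```python
def commit_delay_ms(mode: str, round_trip_ms: float, secondary_persist_ms: float) -> float:
    """Extra commit delay on the primary that is attributable to log replication."""
    if mode == "sync":
        # Wait for transmission plus persistence on the secondary.
        return round_trip_ms + secondary_persist_ms
    if mode == "syncmem":
        # Wait only until the secondary acknowledges receipt in memory.
        return round_trip_ms
    if mode == "async":
        # Send and continue; nothing to wait for, so recent log may be lost on failure.
        return 0.0
    raise ValueError(f"unknown replication mode: {mode}")


for mode in ("sync", "syncmem", "async"):
    print(mode, commit_delay_ms(mode, round_trip_ms=1.0, secondary_persist_ms=0.5))
```

As the round-trip time grows with distance, the synchronous options become increasingly expensive per commit, which is why the asynchronous option suits distant sites.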

If the connection to the secondary system is lost, or the secondary system crashes, the primary system (after
a brief, configurable timeout) will resume operations without the backup protection [5].

Handling of the received logs on the secondary site is done in different ways, depending on the configured
system replication operation mode:

Delta data shipping: In this operation mode the secondary system persists, but does not
immediately replay, the received logs. To avoid an ever-growing list of logs, incremental data
snapshots are transmitted asynchronously from time to time from the primary to the secondary
system. If the secondary system has to take over, only that part of the log needs to be replayed that
represents changes that were made after the most recent data snapshot. In addition to snapshots,
the primary system also transfers status information regarding which column table columns are
currently loaded into memory. The secondary system correspondingly preloads these columns.

Log replay (as of HANA 1.0 SPS11): With this operation mode configured, the received log entries
are replayed immediately in the secondary system. The takeover time is reduced because the log
does not have to be replayed anymore. Additionally, there is much less traffic on the network between
the primary and the secondary site, because no delta data shipping needs to take place.

Log replay read access (as of HANA 2.0 SPS00): In this operation mode the received log entries
are also replayed immediately in the secondary system. Additionally, the replicated data is read
accessible with a small delay compared to the primary's data. Read access is possible via direct
connection to the secondary or by providing hinted SQL statements on the primary, which are routed
to the secondary for execution. The takeover time is reduced further, not only because the log does
not have to be replayed anymore, but also because this system is even more prepared for productive
operation.

In the event of a failure that justifies full system takeover, an administrator instructs the secondary system
to switch from live replication mode to full operation. The secondary system, which already preloaded the
same column data as the primary system, and possibly is already read enabled, becomes the primary system
by replaying the last transaction logs, and then starts to accept queries.

When the original system can be restored to service, it can be configured as the new secondary system, or
reverted to the original configuration by "failing back".

HANA 1.0 SPS09 introduced a way to hook events and actions inside SAP HANA scale-out (such as Host
Auto-Failover) and system replication. An administrator can add required actions to a Python script, to be
executed before or after events (like startup, shutdown, failover, takeover, ...).

These so-called "HA/DR provider" hooks can be used to address issues that require integration attention such
as how to handle connections from database clients that were configured to reach the primary system, and
need to be "diverted" to the secondary system after a takeover. For example, a hook could be written to
remap virtual IP addresses after a takeover in SAP HANA system replication.
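The general shape of such a hook can be sketched as follows. This is a self-contained illustration, not the SAP interface: real HA/DR providers derive from a base class shipped with SAP HANA, whereas the class, method names and host names below are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


class TakeoverHook:
    """Hypothetical stand-in for an HA/DR provider hook (not the SAP base class)."""

    def __init__(self, host_pairs: Dict[str, str], remap: Callable[[str, str], None]):
        self.host_pairs = host_pairs  # primary host -> its secondary counterpart
        self.remap = remap            # action that diverts clients for one host pair
        self.events: List[str] = []

    def pre_takeover(self) -> int:
        # Nothing to prepare in this sketch; record the event for auditing.
        self.events.append("pre_takeover")
        return 0

    def post_takeover(self) -> int:
        # Divert client traffic pair-wise once the secondary has taken over.
        for primary_host, secondary_host in self.host_pairs.items():
            self.remap(primary_host, secondary_host)
        self.events.append("post_takeover")
        return 0


moved = []
hook = TakeoverHook({"hana1-a": "hana1-b"}, lambda old, new: moved.append((old, new)))
hook.pre_takeover()
hook.post_takeover()
print(moved)
```

Passing the remapping action in as a callable keeps the hook independent of the redirection technique, so the same structure could drive either an IP or a DNS change.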

IP redirection is the method of choice for end-to-end client reconnection support, as it uniformly and simply
handles the end-to-end recovery of both SQL and HTTP clients, with very short recovery times, and without
special client-side configuration. The principle of IP redirection (also known as VIP [6]) is to define an additional
"logical" host name (hana1, in the picture below) with its separate logical IP address (for example,
10.68.104.51), and then map this initially to the MAC address of the original host in the primary system (by
binding it to one of the host's interfaces). As part of the takeover procedure, a script is executed which
remaps the unchanged logical IP address to the corresponding takeover host in the secondary system. This
must be done pair-wise, for each host in the primary system. The remapping affects the L2 switching, as can
be seen in step 4 of the following diagram:

[5] As a result, the primary system and secondary system might get out of sync. Such a situation is detected by the
secondary system when it resumes, reestablishes the connection, and receives the next set of log entries. In such case, the
secondary system requests a data backup delta based on which the log replication can be restarted.
[6] E.g. see here: http://scale-out-blog.blogspot.com/2011/01/virtual-ip-addresses-and-their.html

IP redirection can be implemented using a number of techniques, for instance with the use of Linux commands
which affect the network ARP tables, by configuring L2 network switches directly, or by using cluster
management software. Following the IP redirection configuration, the ARP caches should be flushed, to provide
an almost instantaneous recovery experience to clients.
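One of the Linux-based techniques can be sketched as a pair of command lists. This is an assumption-laden illustration rather than a recommended procedure: the interface name eth0 and the /24 prefix are invented, the helper names are hypothetical, and the commands are only constructed here, not executed. It relies on the standard `ip addr` command and on the unsolicited ARP mode (`-U`) of the iputils `arping` tool.

```python
from typing import List


def vip_release_commands(vip: str, prefix_len: int, interface: str) -> List[List[str]]:
    """Commands for the old primary host: drop the virtual IP (if the host is still reachable)."""
    return [["ip", "addr", "del", f"{vip}/{prefix_len}", "dev", interface]]


def vip_takeover_commands(vip: str, prefix_len: int, interface: str) -> List[List[str]]:
    """Commands for the takeover host: bind the virtual IP, then announce it."""
    return [
        ["ip", "addr", "add", f"{vip}/{prefix_len}", "dev", interface],
        # Unsolicited (gratuitous) ARP so neighbours refresh their ARP caches.
        ["arping", "-U", "-c", "3", "-I", interface, vip],
    ]


for cmd in vip_takeover_commands("10.68.104.51", 24, "eth0"):
    print(" ".join(cmd))
```

In a real takeover these commands would run with root privileges on the respective hosts, once per host pair.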

IP redirection requires that both the primary and failover host(s) are on the same L2 network. This depends
on the customer network design, but networks are increasingly designed with L2-over-L3 (such as Ethernet
over MPLS), making this option a viable solution in many cases. If the standby system is in a completely
separate L3 network, then DNS redirection is the preferred alternative solution.

DNS is a binding from a logical domain name to an IP address. Clients contact a DNS server to obtain the IP
address of the HANA host (step 1 below) they wish to reach. Many DNS products support failover configuration
by using short (few minutes or less) TTL response fields, and can be set up with watchdog functionality and
automatically triggered switchover.

As part of the failover procedure, a script is executed that changes the DNS name-to-IP mapping from the
primary host to the corresponding host in the secondary system (pair-wise for all hosts in the system). From
that point in time, clients are redirected to the failover hosts, as in step 2 of the following diagram:

DNS and IP redirection share the advantage that there are no client-specific configurations or requirements.
Further, DNS redirection supports DR configurations where the primary and standby systems may be in two
completely different network domains (separated by routers). One drawback of this solution is that modifying
DNS mappings requires a vendor-proprietary solution. Further, due to DNS caching in nodes (both clients and
intermediate network equipment), it may take a while (up to hours) until the DNS changes are propagated,
causing clients to experience downtime despite the recovery of the system.
A special handling of v ir t ual I P addr esses is r equir ed in an A ct i v e / A ct i v e ( r e a d e n a b l e d ) sy st em r eplication
configur at ion, w her e a separ at e v ir t ual I P addr ess is needed for t he r ead access connect ions t o t he secondary
sy stem . I n t he t ak eover - case t he v ir t ual I P address for prim ar y access is rebound t o t he secondar y sy st em ,
w hile t he secondar y s v ir t ual I P addr ess st ay s act ive. Tw o v ir t ual I P addr esses are av ailable for syst em access
t o t he t hen act iv e syst em aft er t akeov er .
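The effect of a takeover on the two virtual IP addresses can be illustrated with a small model. The names `vip-primary`, `vip-read`, `site_a` and `site_b`, and the function `takeover`, are hypothetical and only stand in for whatever the cluster or network tooling manages in a real landscape.

```python
def takeover(bindings, primary_vip, secondary_system):
    """Return the virtual IP bindings after the secondary system takes over."""
    new_bindings = dict(bindings)
    # Only the primary-access VIP moves; the read-access VIP stays where it is.
    new_bindings[primary_vip] = secondary_system
    return new_bindings


# Before takeover: read/write access goes to site_a, read-only access to site_b.
before = {"vip-primary": "site_a", "vip-read": "site_b"}
after = takeover(before, "vip-primary", "site_b")
```

After the call, both addresses resolve to the same system, which matches the statement that two virtual IP addresses are available for access to the then active system.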

Service Auto-Restart
In the event of a software failure (or an intentional intervention by an administrator) that disables one of the
configured SAP HANA services (Index Server, Name Server, etc.), the service will be restarted by the SAP
HANA Service Auto-Restart watchdog function, which automatically detects the failure and restarts the
stopped service process. Upon restart, the service loads data into memory and resumes its function. While all
data remains safe (RPO=0), the service recovery takes some time.
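The general shape of a watchdog of this kind can be sketched in a few lines. This is a generic illustration, not the SAP HANA implementation: the function `watchdog_pass`, the callbacks `is_running` and `restart`, and the service names used below are assumptions made for the example.

```python
def watchdog_pass(services, is_running, restart):
    """Check every configured service once and restart those that have stopped.

    Returns the names of the services that were restarted in this pass.
    """
    restarted = []
    for name in services:
        if not is_running(name):
            restart(name)
            restarted.append(name)
    return restarted


# Simulated process table: the index server has stopped, the name server is up.
state = {"indexserver": False, "nameserver": True}


def is_running(name):
    return state[name]


def restart(name):
    # A real restart would also reload data into memory, which takes time.
    state[name] = True


restarted = watchdog_pass(list(state), is_running, restart)
```

A real watchdog would run such a pass periodically; the single pass shown here is enough to see the detect-and-restart behavior.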

Host Auto-Failover
Host Auto-Failover is a local "N+m" (m is often 1) Fault Recovery solution that can be used as a
supplemental or alternative measure to the system replication solution described earlier. One (or more)
standby hosts are added to an SAP HANA system, and configured to work in standby mode. As long as they
are in standby mode the databases on these hosts do not contain any data and do not accept requests or
queries.

Host Auto-Failover, before failure

When an active (worker) host fails, a standby host automatically takes its place. Since the standby host may
take over operation from any of the primary hosts, it needs access to all the database volumes. This can be
accomplished by a shared networked storage server, by using a distributed file system, or with vendor-specific
solutions that use an SAP HANA programmatic interface (the so-called Storage Connector API) to dynamically
detach and attach (mount) networked storage (e.g. using block storage via Fiber Channel) upon failover.

Host Auto-Failover, after recovery

A topic that requires some attention is how to recover connections from SAP HANA clients that were configured
to reach the original host, and need to be "diverted" to the standby host after host auto-failover.

One approach is a network-based (IP or DNS) approach, exactly as discussed earlier. Alternatively, SQL/MDX
database clients can be configured with the connection information of multiple hosts, optionally including the
standby host (a multi-host list is provided in the connection string). The client connection code (ODBC/JDBC)
uses a "round-robin" approach to reconnect and ensures that these clients can reach the SAP HANA database,
even after failover. To support HTTP (web) clients, which use the SAP HANA XS application services (see footnote 7), it is
recommended to install an external, itself fault protected, HTTP load balancer (HLB), such as SAP's Web
Dispatcher, or a similar product from another vendor. The HLBs are configured to monitor the web-servers on
all the hosts on both the primary and secondary sites.

Footnote 7: If the XS services are not used by any application, they can be disabled, and no HLB is required. Currently only one XS
server can run on a distributed system. In the future, multiple load-sharing XS servers will be installable on a system,
making the use of an HLB even more valuable.

The HLB (which serves as a reverse web-proxy) redirects the HTTP clients to the correct server, upon HANA
instance failure. HTTP clients are configured to use the IP address of the HLB itself (obtained via DNS), and
remain unaware of any HANA failover activity.
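The round-robin reconnect over a multi-host list, as described above for SQL/MDX clients, can be sketched as follows. This is not the ODBC/JDBC client code itself: `connect_round_robin`, the callback `try_connect`, the host names and the port number are hypothetical and serve only to show the retry order.

```python
def connect_round_robin(hosts, try_connect, start=0):
    """Try each host in turn, beginning at index `start`.

    Returns a (host, connection) pair for the first host that accepts the
    connection, or raises ConnectionError if none of them does.
    """
    failed = []
    for offset in range(len(hosts)):
        host = hosts[(start + offset) % len(hosts)]
        try:
            return host, try_connect(host)
        except ConnectionError:
            failed.append(host)
    raise ConnectionError(f"no host reachable, tried: {failed}")


def fake_connect(host):
    # Stand-in for a real driver call; here host2 is assumed to have failed.
    if host == "hana-host2:30015":
        raise ConnectionError("host down")
    return f"session@{host}"


hosts = ["hana-host1:30015", "hana-host2:30015", "hana-standby:30015"]
host, session = connect_round_robin(hosts, fake_connect, start=1)
```

Because the standby host is part of the list, a client that was last connected to the failed host reaches the database again without any change to its configuration.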

One dangerous scenario that may occur with Host Auto-Failover is referred to as split-brain. A split-brain could
accidentally happen if, for instance, host2 did not really fail, but only lost all its network connections (causing
the standby host to decide to take over). In this case both non-communicating systems assume the host2
role, and may both write to the same storage, causing data corruption. Preventing such data corruption due
to split-brain situations (fencing) must be implemented. The above-mentioned storage API supports fencing.
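The principle behind fencing can be illustrated with a toy model in which the storage accepts writes only from its most recent owner. This is a generic fencing-token sketch and makes no claim about how the storage API mentioned above implements fencing; the class `FencedStorage` and its methods are hypothetical.

```python
class FencedStorage:
    """Toy storage that accepts writes only from the current owner."""

    def __init__(self):
        self.epoch = 0
        self.data = []

    def acquire(self):
        # Taking over the storage bumps the epoch, which invalidates
        # every token handed out to an earlier owner.
        self.epoch += 1
        return self.epoch

    def write(self, epoch, record):
        if epoch != self.epoch:
            raise PermissionError("stale owner fenced off")
        self.data.append(record)


storage = FencedStorage()
host2_token = storage.acquire()        # host2 owns the volumes
storage.write(host2_token, "txn-1")

standby_token = storage.acquire()      # standby takes over the host2 role
storage.write(standby_token, "txn-2")

try:
    # host2 is still alive but isolated; its late write must be rejected.
    storage.write(host2_token, "txn-3")
    rejected = False
except PermissionError:
    rejected = True
```

The key design point is that the decision is enforced at the storage, so it does not depend on the isolated host noticing that it has been replaced.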

Once repaired, the failed host can be rejoined to the system as the new standby host, to reestablish the failure
recovery capability.

5 Design for High Availability


The following table summarizes the main advantages and limitations of the SAP HANA High Availability support
options.

Backups
  Advantages:
    Allows Disaster Recovery
    Lowest cost, simplest
    Supports point-in-time recovery
    Can also be used to "clone" or copy systems
  Limitations:
    RPO of minutes to hours, depending on frequency of backup and shipping method (synchronous shipping, using 3rd party tools, is recommended)
    In case of disaster, need to acquire and configure secondary system (hours-days)
    Cold start longer RTO (~hour)
    Extra time (up to hours) to load column data and return to full performance

Storage Replication
  Advantages:
    Allows Disaster Recovery
    RPO=0 with synchronous replication; RPO of a few seconds otherwise
    Secondary system can be used for other purposes, until needed
  Limitations:
    In case of disaster, need to possibly free up, boot up and re-configure secondary system (hours)
    Cold start longer RTO (~hour)
    Not yet offered by all SAP HANA hardware partners
    Extra time (up to hours) to return to full performance
    Requires networked storage systems and efficient inter-site link
    Synchronous replication only supports distances of up to 100 km
    Doesn't protect against storage corruption
    More bandwidth wasteful than System Replication

System Replication
  Advantages:
    Allows Disaster Recovery, and can be used as main HA failover for near-zero downtime maintenance or failures
    Active/Active (read enabled) configurations: the secondary is usable for reporting workload
    RPO=0 (synchronous)
    RTO of only a minute (continuous log replay)
    Full performance right after takeover
    Compatible with all partner solutions
    Supports single-host systems with local storage, no need for external network storage appliances
    With no data-preload configuration, secondary system(s) can be used for noncritical dual purposes
  Limitations:
    Requires dedicated live standby system and efficient inter-site link
    Requires a solution for client connection recovery upon failover (e.g. DNS or Virtual IP address based)

Host Auto-Failover
  Advantages:
    Can be used to complement System Replication or by itself
    Automatic detection and failover
  Limitations:
    Requires access to database storage by the standby host (shared network storage or other partner-specific solution)

In addition to the aforementioned SAP HANA High Availability options, one other approach deserves to be
mentioned, for analytic "data mart" applications where the data in SAP HANA is the result of using SAP
Landscape Transformation (SLT) replication from another data source. In such a situation, High Availability
through redundancy can be achieved by setting up concurrent SLT replication streams from the common data
source to two separate SAP HANA systems. Both systems can actively operate independently; in the case of
a failure or disaster, the other system remains available.

Planning for Failure


Failures are inevitable. Planning a comprehensive High Availability solution for SAP HANA requires an
evaluation of the impact of potential failures, the company's tolerance and requirements for different RPO and
RTO values in the presence of common transient local failures vs. extremely rare disasters, and an
understanding of the benefits and cost of the different alternatives offered.

To recap, here is a brief summary of main faults and how SAP HANA addresses them:

Fault                          Solution
Service down (software fault)  Service Auto-Restart. System Replication can also be used to fail over.
Power outage                   Persistence of savepoints and transaction logs guarantees recovery without data loss.
Host crash (hardware fault)    Host Auto-Failover. Alternatively, System Replication can be used to fail over.
Storage or Data Corruption     Backups and snapshots allow point-in-time recovery, applicable to all solutions.
Data center out (disaster)     System Replication supports rapid resumption of operation. Alternatively, Storage
                               Replication or Backups can be used to bring up the system in an alternate datacenter.

Besides the high-level consideration of RPO/RTO in the different scenarios, other aspects will need to be
evaluated as well: the size of the system and database, the frequency and size of the logs and data files that
need to be replicated, the bandwidth availability, reliability and latency of the links between the systems, the
nature of the landscape management and availability solutions used for other non-SAP HANA systems, and
other considerations.
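As a rough aid for the RPO/RTO part of this evaluation, the options can be screened against a stated requirement. The sketch below is a coarse reading of the advantages and limitations listed earlier, reduced to an ordinal scale; the scale, the per-option classification and the function `candidates` are assumptions for illustration and do not replace an assessment of the other aspects named above.

```python
# Ordinal scale from best to worst; a coarse simplification for this sketch.
SCALE = ["zero", "seconds", "minutes", "hours", "days"]

# Approximate classification of each option, assuming synchronous mode where
# it applies and, for Backups, that a secondary system must first be acquired.
OPTIONS = {
    "System Replication (synchronous)": {"rpo": "zero", "rto": "minutes"},
    "Storage Replication (synchronous)": {"rpo": "zero", "rto": "hours"},
    "Backups": {"rpo": "hours", "rto": "days"},
}


def candidates(max_rpo, max_rto):
    """Return the options whose RPO and RTO classes are within the given limits."""
    limit_rpo = SCALE.index(max_rpo)
    limit_rto = SCALE.index(max_rto)
    return [
        name
        for name, option in OPTIONS.items()
        if SCALE.index(option["rpo"]) <= limit_rpo
        and SCALE.index(option["rto"]) <= limit_rto
    ]


strict = candidates("zero", "minutes")
relaxed = candidates("hours", "days")
```

Such a screen only narrows the field; cost, link quality and the existing landscape still decide among the remaining options.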

Small RTO requirements lead to the preferred system replication solution, which can also be used for rapid
failover in case of planned and unplanned outages. Tradeoffs may lead to other alternatives. The following
decision tree summarizes the main design choices:

High Availability Decision Tree

Realistically, the above decision process will be further influenced by considerations like timelines, costs,
budgets and customer paradigm-preferences, which are outside the scope of this short paper.

6 In Summary

SAP HANA supports a comprehensive range of High Availability options, designed to satisfy tradeoffs between
demanding High Availability and Disaster Recovery requirements, while also considering cost and complexity.

In particular, the SAP HANA System Replication solution supports an RPO of zero seconds, and an RTO
measured in minutes, and is SAP's recommended configuration for addressing SAP HANA outage reduction
due to planned maintenance, faults and disasters.

SAP HANA High Availability documentation: SAP Note 2407186

More information about SAP HANA can be found on http://help.sap.com/.

Glossary

Industry Terms

Term                            Description
Fault                           A failure of a system or one of its components / sub-systems (hardware, network, software)
Disaster                        Major fault: the failure of an entire data center / site
Outage                          A system's inability to operate (due to failure or planned downtime)
Availability                    The measure of a system's operational continuity, expressed as a percentage of time
Downtime                        Inverse of availability: the duration of time that a system is not operational
High Availability (HA)          A framework of design principles, techniques and best practices to reduce downtime
Fault Recovery (FR)             Recovery of system operations after outage due to a local fault
Disaster Recovery (DR)          Recovery of system operations after outage due to a disaster
Failover / Takeover             Switching to a backup (standby) system/host, upon failure of the primary system/host
Failback                        Process of restoring a system to its original state
Recovery Point Objective (RPO)  The maximal permissible period of time during which operational data may be lost without
                                ability to recover (time between the last backup and the crash)
Recovery Time Objective (RTO)   The maximal permissible time it takes to recover the system, so that its operations can resume

SAP HANA Terms

Term                            Description
SAP HANA System                 A SAP HANA system is identified by a system id (SID). It is perceived as one unit from the
                                perspective of the administrator, who can install, update, start up, shut down, or backup the
                                system as a whole. A distributed SAP HANA system is a system which is installed on more than
                                one host. The collection of elements of the system on each host are referred to as an instance.
SAP HANA Service                A SAP HANA service is an independent functional component of a SAP HANA System, such as
                                the Index Server, the Name Server, etc. They appear as separate processes from an Operating
                                System perspective.

Disaster Recovery Support
Backup                          Periodic saving of database copies in safe place
Storage Replication             Continuous replication (mirroring) between primary storage and backup
                                storage over a network (may be synchronous)
System Replication              Continuous (synchronous and asynchronous) update of secondary system
                                by primary system, including in-memory table loading and continuous log
                                replay on the secondary system (if configured)

Fault Recovery Support
Service Auto-Restart            Automatic restart of stopped services on host (watchdog)
Host Auto-Failover              Automatic failover from crashed host to standby host in the same system
