SAP HANA High Availability

Business Continuity requires that the operation of business critical systems remain highly
available at all times, even in the presence of failures. This paper discusses the functionality of
SAP HANA in support of High Availability and Disaster Recovery.

Updated for HANA 2.0 SPS00
HANA 2.0 SPS00: new operation mode logreplay_readaccess for system replication with Active/Active (read enabled) feature
chaim & mechthild.bore-wuest
SAP HANA Development Team
1 Introduction
SAP HANA is an innovative in-memory database and data management platform, specifically developed to
take full advantage of the capabilities provided by modern hardware to increase application performance. By
keeping all relevant data in main memory, data processing operations are significantly accelerated.

Design for scalability is a core SAP HANA principle. SAP HANA can be distributed across multiple hosts
to achieve scalability in terms of both data volume and user concurrency. Unlike clusters, distributed HANA
systems also distribute the data efficiently, achieving high scaling without I/O locks.

The key performance indicators of SAP HANA appeal to many of our customers, and thousands of deployments
are in progress. SAP HANA has become the fastest growing product in SAP's 40+ year history.

About this Document

Loss of business critical system resources and services, like SAP HANA, translates directly into lost revenue.
The goal therefore is Business Continuity, using systems designed for continuous operation even in the
presence of inevitable failures. Mission critical systems require High Availability; this is no longer optional.

SAP HANA is fully designed for High Availability, supporting a broad range of recovery scenarios from various
faults, from simple software errors to disasters that decommission an entire site.

This paper describes SAP HANA's High Availability support for Fault and Disaster Recovery. A comprehensive
High Availability solution offers more design choices and evidently requires the discussion of more details than
can be covered in a short paper, and may therefore require additional consultations.

2 What is High Availability?

Availability, the measure of a system's operational continuity, is expressed as the percentage of time during
which the system is operational; the remainder is downtime. For example, if a given system is designed to be
available for 99.9% of the time (sometimes called "three nines"), its downtime per year must be less than 0.1%,
or about 9 hours (0.1% of 8,760 hours is 8.76 hours).
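
The relation between an availability target and the yearly downtime budget is simple arithmetic. The following
minimal sketch makes it explicit; the function name and the assumption of a 365-day year are illustrative
choices, not part of SAP HANA.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def max_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


# Print the budget for a few commonly quoted targets.
for label, percent in [("two nines", 99.0), ("three nines", 99.9), ("four nines", 99.99)]:
    print(f"{label:12s} {percent:6.2f}%  {max_downtime_hours(percent):7.2f} hours/year")
```

Each additional "nine" reduces the permissible downtime by a factor of ten, which is why higher availability
targets tend to require automated rather than manual recovery.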

Downtime is the consequence of outages, which may be intentional (e.g. for system upgrades) or caused by
unplanned faults. A fault can be due to equipment malfunction, software or network failures, or due to a
major disaster such as a fire, a regional power loss or a construction accident, which may decommission the
entire data center.

High Availability is a set of techniques, engineering practices and design principles for Business Continuity.
This is achieved by eliminating single points of failure (fault tolerance), and providing the ability to rapidly
resume operations after a system outage with minimal business loss (fault resilience).

Fault Recovery is the process of recovering and resuming operations after an outage due to a fault. Disaster
Recovery is the process of recovering operations after an outage due to a prolonged data center or site failure.
Preparing for disasters may require backing up data across longer distances, and may thus be more complex
and costly.

Recovery - Key Performance Indicators

Customers commonly use two key measures to specify the recovery parameters of a system following an
outage: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). The RPO and RTO of
a system are illustrated below:

RPO and RTO

The RPO is the maximal permissible period of time during which operational data may be lost without
ability to recover (time between the last backup and the crash).
The RTO is the maximal permissible time it takes to recover the system, so that its operations can
resume.
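
To make the two measures concrete, the following sketch computes the data-loss window and the downtime of a
single outage from three timestamps and compares them with the stated objectives. The function names and the
sample times are illustrative assumptions.

```python
from datetime import datetime, timedelta


def outage_metrics(last_recoverable: datetime, outage: datetime, resumed: datetime):
    """Return (data-loss window, downtime) for one outage."""
    if not last_recoverable <= outage <= resumed:
        raise ValueError("expected last_recoverable <= outage <= resumed")
    return outage - last_recoverable, resumed - outage


def meets_objectives(loss: timedelta, downtime: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True if the outage stayed within both the RPO and the RTO."""
    return loss <= rpo and downtime <= rto


# Hypothetical outage: last log backup at 11:45, crash at 12:00, service back at 12:40.
loss, downtime = outage_metrics(
    datetime(2024, 1, 1, 11, 45),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 40),
)
print(loss, downtime)
print(meets_objectives(loss, downtime, rpo=timedelta(minutes=15), rto=timedelta(hours=1)))
```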

3 Eliminating Single Points of Failure

The key to achieving fault tolerance is to eliminate single points of failure by introducing redundancy. SAP
HANA Appliance vendors deliver several levels of redundancy to avoid outage due to component failure, which
are briefly discussed here. Generally speaking, these techniques are "transparent" to SAP HANA's operation,
but they form a crucial line of defense against avoidable system outage, and therefore greatly contribute to
Business Continuity 1.

Hardware Redundancy
SAP HANA appliance hardware vendors design multiple layers of redundant hardware components and
subsystems. These include redundant and hot-swappable power supply units (PSUs), fans, network interface
cards and enterprise-grade error-correcting protected memories. These subsystems are designed such that
the redundant component can sustain the operation of the system if the other component fails 2.

Particularly critical is the storage system. Enterprise-grade storage systems combine multiple physical drives
into logical units, with built-in standard (RAID) techniques for redundancy and error recovery. These include
mirroring, the writing of the same data to two different drives in parallel, and parity, extra bits written to
allow the detection and automatic correction of errors 3.

Network Redundancy
Redundant networks, network equipment and network connectivity are required to prevent network failures from
affecting system availability. This is typically accomplished by deploying a completely redundant switch
topology, using the Spanning Tree Protocol to avoid loops. Routers can be configured with the Hot Standby
Router Protocol (HSRP) for automatic failover. BGP is commonly used to manage dual WAN connections.

Data Center Redundancy

Data centers that host SAP HANA solutions are equipped with Uninterruptible Power Supply (UPS) and backup
power generators, redundant cooling systems and multi-sourced providers of network connectivity and
electricity, achieving operational availability in the presence of individual failures, and significantly reducing
the probability of a business-impacting outage.

Some enterprises operate fully duplicated data centers, providing a high level of disaster tolerance.

4 SAP HANA High Availability Support

As an in-memory database, SAP HANA must not only concern itself with maintaining the reliability of its data
in the event of failures, but also with resuming operations with most of that data loaded back in memory as
quickly as possible.

The following figure shows the phases of High Availability. The first phase is readiness, being prepared for the
inevitable fault. During this time, data is backed up and standby systems are ready to take over. A fault must
be detected, either automatically or administratively (to avoid false positives), and a recovery process is put
into action. Finally, the fault must be repaired, and the system may need to be reverted to the original
configuration (failed back), to be ready again for the next fault.

Different RPO/RTO values can be associated with different kinds of faults. Business critical systems are
expected to operate with an RPO of zero data loss in the case of local faults, and often even in the case of a
disaster. But the challenges of disaster recovery are different from locally recoverable faults; to achieve zero
RPO and low RTO, data must be replicated synchronously over longer distances, which impacts regular system
performance and may require more expensive standby and failover solutions.

All of this leads to tradeoff decisions around the attributes of fault recovery functionality, cost and complexity.
SAP accordingly offers complementary design options, including three levels of Disaster Recovery support and
two automatic Fault Recovery support features, summarized in the following table and further discussed in
the sections below.

                   Solution                                             Cost   RPO   RTO
DISASTER RECOVERY  1. Backups                                           $      >0    high
SUPPORT            2. Storage Replication                               $$     ~0    med
                   3. System Replication                                $$$    0     low
                   4. System Replication Active/Active (read enabled)   $      0     low
                   5. System Replication w/o data preload               $      0     med
FAULT RECOVERY     1. Service Auto-Restart                              0      0     med
SUPPORT            2. Host Auto-Failover                                $$     0     med

SAP HANA uses in-memory technology, but of course, it fully persists any transaction that changes the data,
such as row insertions, deletions and updates, so it can resume from a power outage without loss of data.
SAP HANA persists two types of data to storage: transaction redo logs, and data changes in the form of
savepoints.

A transaction redo log is used to record a change. To make a transaction durable, it is not required to persist
the complete data when the transaction is committed; instead it is sufficient to persist the redo log. Upon an
outage, the most recent consistent state of the database can be restored by replaying the changes recorded
in the log, redoing completed transactions and rolling back incomplete ones.

A savepoint is a periodic point in time, when all the changed data is written to storage, in the form of pages.
One goal of performing savepoints is to speed up restart: when starting up the system, logs need not be
processed from the beginning, but only from the last savepoint position. Savepoints are coordinated across
all processes (called SAP HANA services) and instances of the database to ensure transaction consistency. By
default, savepoints are performed every five minutes, but this can be configured.
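
The interplay of savepoints and redo logs described above can be illustrated with a deliberately simplified
model. The record layout and the function name are assumptions made for this sketch, and it further assumes
that no transaction is still open at the savepoint position; it is not a description of SAP HANA's actual
persistence format.

```python
def recover(savepoint_image, log, savepoint_pos):
    """Rebuild the latest consistent state from a savepoint plus the log tail.

    savepoint_image: dict holding the data as of the savepoint
    log:             list of ("txn", "set", key, value) or ("txn", "commit") records
    savepoint_pos:   index of the first log record not covered by the savepoint
    """
    tail = log[savepoint_pos:]
    # Only transactions with a commit record in the log tail are redone.
    committed = {record[0] for record in tail if record[1] == "commit"}
    state = dict(savepoint_image)
    for record in tail:
        txn, operation = record[0], record[1]
        if operation == "set" and txn in committed:
            _, _, key, value = record
            state[key] = value  # redo the change of a completed transaction
        # Changes of uncommitted transactions are skipped, i.e. rolled back.
    return state


log = [
    ("t1", "set", "a", 1), ("t1", "commit"),  # already contained in the savepoint
    ("t2", "set", "b", 2), ("t2", "commit"),  # committed after the savepoint
    ("t3", "set", "a", 99),                   # still open at the time of the crash
]
print(recover({"a": 1}, log, savepoint_pos=2))
```

Only the two records of t2 are applied on restart; the work of t1 is read from the savepoint and the
unfinished change of t3 is discarded, which is why a recent savepoint shortens the restart.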

Savepoints normally overwrite older savepoints, but it is possible to freeze a savepoint for future use; this is
called a snapshot. Snapshots can be replicated in the form of full data backups, which can be used to restore
a database to a specific point in time. This can be useful in the event of data corruption, for instance. In
addition to data backups and snapshots, smaller periodic log backups ensure the ability to recover from fatal
storage faults with minimal loss of data. While full data backups contain all current data, delta backups
can also be created (since HANA 1.0 SPS11), containing all data that was changed since the last data backup. Two
types of delta backups are to be distinguished: incremental backups contain all changed data since the last
full or delta backup, and differential backups contain all changed data since the last full backup.
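
The difference between the two delta backup types determines which backups are needed for a restore. The
following sketch selects a minimal restore chain from a chronological backup catalog; the catalog layout and
the backup names are hypothetical, and log backups are left out for brevity.

```python
def restore_chain(backups):
    """Return the names of the data backups needed to restore the latest state.

    backups: chronological list of (name, kind), where kind is one of
             "full", "differential" or "incremental".
    """
    full_idx = None
    for i, (_, kind) in enumerate(backups):
        if kind == "full":
            full_idx = i  # remember the most recent full backup
    if full_idx is None:
        raise ValueError("no full data backup available")

    # The latest differential after that full backup supersedes every earlier delta.
    diff_idx = None
    for i in range(full_idx + 1, len(backups)):
        if backups[i][1] == "differential":
            diff_idx = i

    chain = [backups[full_idx][0]]
    base = full_idx
    if diff_idx is not None:
        chain.append(backups[diff_idx][0])
        base = diff_idx
    # Every incremental taken after the chosen base is needed, in order.
    chain += [name for name, kind in backups[base + 1:] if kind == "incremental"]
    return chain


catalog = [
    ("sun_full", "full"),
    ("mon_incr", "incremental"),
    ("tue_incr", "incremental"),
    ("wed_diff", "differential"),
    ("thu_incr", "incremental"),
]
print(restore_chain(catalog))
```

A differential backup shortens the chain at the price of a larger backup, whereas a series of incrementals
keeps each backup small but requires all of them at restore time.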
Local Persistence and Backups

The above figure shows the savepoints, saved to local storage, and the additional backups, saved to backup
storage. Local recovery from the crash uses the latest savepoint, and then replays the last logs, to recover
the database without any data loss. If the local storage was corrupted by the crash, it is still possible to
recover the database from the data backup (or last snapshot), and log backups, possibly with some data loss.

Regularly shipping backups to a remote location over a network or via couriers can be a simple and relatively
inexpensive way to prepare for a disaster. Depending on the frequency and shipping method, this approach
may have an RPO of hours to days.

Backup Shipping

Storage Replication

One drawback of backups is the potential loss of data from the time of the last backup to the time of the
failure. A preferred solution, therefore, is to provide continuous replication of all persisted data. Several SAP
HANA hardware partners offer a storage-level replication solution, which delivers a backup of the volumes or
file system to a remote, networked storage system. In some of these vendor-specific solutions, which are
certified by SAP, the SAP HANA transaction only completes when the locally persisted transaction log has been
replicated remotely. This is called synchronous storage replication. Synchronous storage replication can be
used only where the distance between the primary and backup site is up to 100 kilometers (one or a few hops,
with no more than ~5 µsec latency per kilometer), allowing for sub-millisecond round-trip latencies.
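
The distance limit follows from signal propagation delay. The sketch below estimates the propagation part of
the round-trip time, assuming roughly 5 microseconds per kilometer of fiber; this figure and the function
name are assumptions of the example, and delays added by switches and storage controllers are ignored.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: about 5 microseconds per kilometer of fiber


def round_trip_ms(distance_km: float, per_km_us: float = FIBER_DELAY_US_PER_KM) -> float:
    """Return the round-trip propagation delay in milliseconds."""
    if distance_km < 0:
        raise ValueError("distance must not be negative")
    return 2 * distance_km * per_km_us / 1000.0


for km in (10, 50, 100, 500):
    print(f"{km:4d} km  {round_trip_ms(km):5.2f} ms round trip")
```

Under this assumption the propagation delay alone reaches one millisecond per round trip at 100 kilometers,
and every synchronously replicated commit pays that price, which is why longer distances call for
asynchronous replication.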

Storage Replication

Due to its continuous nature, storage replication (sometimes also called remote storage mirroring) offers a
more attractive RPO than backups, but this solution of course requires a reliable, high bandwidth and low
latency connection between the primary site and the secondary site.

In the event of a disastrous failure that justifies full system failover, an administrator attaches a standby
system to the replicated storage, and then restarts the SAP HANA system. The administrator must take care
that the failed primary system can no longer write to the replicated storage (an action called fencing), or else
there is a risk of data corruption, with two systems writing to the same storage.

System Replication
System Replication is an alternative HA solution for SAP HANA, providing an extremely short RTO, and
compatible with all SAP HANA hardware partner solutions. System replication employs an "N+N" approach,
with a secondary standby SAP HANA system with the same number of active nodes as the active, primary
system. Each service and instance of the primary SAP HANA system communicates pairwise with a counterpart
in the secondary system 4.

System Replication

The secondary system can be located near the primary system to serve as a rapid failover solution for planned
downtime, or to handle storage corruption or other local faults. Alternatively, or additionally (multi-tiered or
cascaded), a secondary system can be installed in a remote site for disaster recovery. Like Storage Replication,
this Disaster Recovery option requires a reliable link between the primary and secondary sites.

The instances in the secondary system operate in live replication mode. In this mode, all secondary system
services constantly communicate with their primary counterparts, replicate and persist data and logs, and
typically load data to memory. The log and data can be compressed before shipping. The secondary system
can accept queries when system replication is set up in an Active/Active (read enabled) configuration;
otherwise it does not accept requests or queries. With the Active/Active setup the secondary system can be
used to handle reporting workload without disrupting the primary system.

In an alternative configuration, called system replication without data-preload, the secondary system does
not pre-load data, and hence consumes very little memory. This allows the hosts of the secondary system to
serve dual purposes, for instance for development or test/QA with separate storage. Before takeover, these
activities must of course be turned off. The tradeoff is a longer RTO in case of failover.

Here is how system replication works. When the secondary system is brought up to start running in live
replication mode, each service component establishes a connection with its primary system counterpart, and
requests a snapshot of the data. From then on, all logged changes in the primary system are replicated.
Whenever logs are persisted in the primary system (i.e. written to the log volumes of each service), they are
also sent to the secondary system. A transaction in the primary system is not committed until the redo logs
are replicated, as determined by a log replication option:

Synchronous: The primary system does not commit the transaction until it receives a reply
that the log is persisted in the secondary system. This mode guarantees immediate consistency
between both systems, at the cost of delaying the transaction by the time needed for data transmission and
persisting in the secondary system.

The question of what to do if replication fails (for instance due to a network fault) is governed by the
full sync configuration option. It can be set either to commit the transaction, or to fail the commit on the
primary system until replication is restored.

4 From HANA 1.0 SPS09 SAP HANA supports multi-tenant database containers. System replication can only be set up for
the system as a whole, not per individual tenant.

Synchronous in-memory: The primary system commits the transaction after it receives a reply
that the log was received by the secondary system, but before it was persisted. The transaction delay
in the primary system is shorter, because it only includes the data transmission time.

Asynchronous: The primary system commits the transaction after sending the log, without waiting
for a response. This eliminates the synchronization latency, at the risk of minor theoretical data loss
during failure. This mode is most useful when the secondary site is hundreds of kilometers away
from the primary site, or when reducing latency is critical.

If the connection to the secondary system is lost, or the secondary system crashes, the primary system (after
a brief, configurable, timeout) will resume operations without the backup protection 5.
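
The commit delay added by each of the three log replication options described above can be compared with a
small model. The mode labels, the function name and the sample timings are assumptions of this sketch; real
delays depend on the network and on the storage of the secondary system.

```python
def replication_delay_ms(mode: str, round_trip_ms: float, remote_persist_ms: float) -> float:
    """Return the extra commit delay that log replication adds on the primary."""
    if mode == "sync":
        # Wait until the secondary has received and persisted the log.
        return round_trip_ms + remote_persist_ms
    if mode == "syncmem":
        # Wait only until the secondary has received the log in memory.
        return round_trip_ms
    if mode == "async":
        # Send the log and commit without waiting for a reply.
        return 0.0
    raise ValueError(f"unknown replication mode: {mode}")


# Hypothetical link: 1 ms round trip, 0.4 ms to persist the log on the secondary.
for mode in ("sync", "syncmem", "async"):
    print(f"{mode:8s} {replication_delay_ms(mode, 1.0, 0.4):4.1f} ms added per commit")
```

The model shows the tradeoff directly: the less the primary waits, the lower the commit latency, and the
larger the window in which a committed transaction exists only on the primary.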

Handling of the received logs on the secondary site is done in different ways, depending on the configured
system replication operation mode:

Delta data shipping: In this operation mode the secondary system persists, but does not
immediately replay the received logs. To avoid an ever-growing list of logs, incremental data
snapshots are transmitted asynchronously from time to time from the primary to the secondary
system. If the secondary system has to take over, only that part of the log needs to be replayed that
represents changes that were made after the most recent data snapshot. In addition to snapshots,
the primary system also transfers status information regarding which column table columns are
currently loaded into memory. The secondary system correspondingly preloads these columns.

Log replay (as of HANA 1.0 SPS11): With this operation mode configured, the received log entries
are replayed immediately in the secondary system. The takeover time is reduced because the log
does not have to be replayed any more. Additionally, there is much less traffic on the network between
the primary and the secondary site, because no delta data shipping needs to take place.

Log replay read access (as of HANA 2.0 SPS00): In this operation mode the received log entries
are also replayed immediately in the secondary system. Additionally, the replicated data are read
accessible with a small delay compared to the primary's data. Read access is possible via direct
connection to the secondary or by providing hinted SQL statements on the primary, which are routed
to the secondary for execution. The takeover time is reduced further not only because the log does
not have to be replayed any more, but also because this system is even more prepared for productive
operation.

In the event of a failure that justifies full system takeover, an administrator instructs the secondary system
to switch from live replication mode to full operation. The secondary system, which already preloaded the
same column data as the primary system, and possibly is already read enabled, becomes the primary system
by replaying the last transaction logs, and then starts to accept queries.

When the original system can be restored to service, it can be configured as the new secondary system, or
reverted to the original configuration by "failing back".

HANA 1.0 SPS09 introduced a way to hook events and actions inside SAP HANA scale-out (such as Host
Auto-Failover) and system replication. An administrator can add required actions to a Python script, to be executed
before or after events (like startup, shutdown, failover, takeover, ...).

These so-called "HA/DR provider" hooks can be used to address issues that require integration attention such
as how to handle connections from database clients that were configured to reach the primary system, and
need to be "diverted" to the secondary system after a takeover. For example, a hook could be written to
remap virtual IP addresses after a takeover in SAP HANA system replication.
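
The general idea of running administrator-defined actions before and after an event can be sketched in a few
lines of Python. The class and method names below are invented for illustration and are not the SAP HANA
HA/DR provider interface, whose actual base class and callback names should be taken from the SAP
documentation.

```python
from collections import defaultdict


class HookRegistry:
    """Toy registry of actions to run before or after named events."""

    def __init__(self):
        self._actions = defaultdict(list)

    def register(self, when, event, action):
        if when not in ("before", "after"):
            raise ValueError("when must be 'before' or 'after'")
        self._actions[(when, event)].append(action)

    def run(self, event, operation, **context):
        """Run the 'before' actions, the operation itself, then the 'after' actions."""
        for action in self._actions[("before", event)]:
            action(**context)
        result = operation()
        for action in self._actions[("after", event)]:
            action(**context)
        return result


trace = []
hooks = HookRegistry()
hooks.register("before", "takeover",
               lambda **ctx: trace.append("stop test workload on " + ctx["host"]))
hooks.register("after", "takeover",
               lambda **ctx: trace.append("remap virtual IP to " + ctx["host"]))
hooks.run("takeover", lambda: trace.append("takeover"), host="hana2")
print(trace)
```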

IP redirection is the method of choice for end-to-end client reconnection support, as it uniformly and simply
handles the end-to-end recovery of both SQL and HTTP clients, with very short recovery times, and without
special client-side configuration. The principle of IP redirection (also known as VIP 6) is to define an additional
"logical" host name (hana1, in the picture below) with its separate logical IP address, and then map this
initially to the MAC address of the original host in the primary system (by
binding it to one of the host's interfaces). As part of the takeover procedure, a script is executed which
remaps the unchanged logical IP address to the corresponding takeover host in the secondary system. This
must be done pair-wise, for each host in the primary system. The remapping affects the L2 switching, as can
be seen in step 4 of the following diagram:

IP redirection can be implemented using a number of techniques, for instance with the use of Linux commands
which affect the network ARP tables, by configuring L2 network switches directly, or by using cluster
management software. Following the IP redirection configuration, the ARP caches should be flushed, to provide
an almost instantaneous recovery experience to clients.
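
As one possible illustration of the Linux command approach, the following sketch assembles the commands a
takeover script might run on the new host: binding the virtual IP address to an interface and sending
unsolicited ARP announcements so that neighbors refresh their caches. The address, the interface name and
the function name are assumptions; the sketch only builds and prints the commands, and it presumes that the
failed host has already released the address.

```python
def vip_takeover_commands(vip_cidr: str, interface: str, announce_count: int = 3):
    """Return the commands that bind a virtual IP locally and announce it via ARP."""
    address = vip_cidr.split("/")[0]
    return [
        # Bind the virtual IP address to the local network interface.
        ["ip", "addr", "add", vip_cidr, "dev", interface],
        # Send unsolicited ARP announcements to update neighbors' ARP caches.
        ["arping", "-U", "-I", interface, "-c", str(announce_count), address],
    ]


# Example values only: 192.0.2.10 is a documentation address, eth0 a placeholder.
for command in vip_takeover_commands("192.0.2.10/24", "eth0"):
    print(" ".join(command))
```

Returning the commands as argument lists keeps the sketch free of side effects; a real script would pass
each list to a process runner with appropriate privileges and error handling.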

IP redirection requires that both the primary and failover host(s) are on the same L2 network. This depends
on the customer network design, but networks are increasingly designed with L2-over-L3 (such as Ethernet
over MPLS), making this option a viable solution in many cases. If the standby system is in a completely
separate L3 network, then DNS redirection is the preferred alternative solution.

DNS is a binding from a logical domain name to an IP address. Clients contact a DNS server to obtain the IP
address of the HANA host (step 1 below) they wish to reach. Many DNS products support failover configuration
by using short (few minutes or less) TTL response fields, and can be set up with watchdog functionality and
automatically triggered switchover.

As part of the failover procedure, a script is executed that changes the DNS name-to-IP mapping from the
primary host to the corresponding host in the secondary system (pair-wise for all hosts in the system). From
that point in time, clients are redirected to the failover hosts, as in step 2 of the following diagram:

DNS and IP redirection share the advantage that there are no client-specific configurations or requirements.
Further, DNS redirection supports DR configurations where the primary and standby systems may be in two completely
different network domains (separated by routers). One drawback of this solution is that modifying DNS
mappings requires a vendor-proprietary solution. Further, due to DNS caching in nodes (both clients and
intermediate network equipment), it may take a while (up to hours) until the DNS changes are propagated,
causing clients to experience downtime despite the recovery of the system.
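
The effect of DNS caching on failover can be demonstrated with a toy caching resolver. The class, the host
name and the addresses are invented for this sketch; it models only a single cache with a fixed TTL, whereas
real deployments have several caching layers.

```python
class CachingResolver:
    """Toy resolver that caches answers for a fixed time-to-live (TTL)."""

    def __init__(self, zone, ttl_seconds):
        self.zone = zone          # authoritative name-to-IP mapping (shared dict)
        self.ttl = ttl_seconds
        self._cache = {}          # name -> (ip, expiry time)

    def resolve(self, name, now):
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]      # answer from cache, possibly stale
        ip = self.zone[name]
        self._cache[name] = (ip, now + self.ttl)
        return ip


zone = {"hana1.example.com": "192.0.2.10"}          # primary site address
resolver = CachingResolver(zone, ttl_seconds=300)

before = resolver.resolve("hana1.example.com", now=0)
zone["hana1.example.com"] = "198.51.100.10"         # failover script updates DNS at t=60
during = resolver.resolve("hana1.example.com", now=120)
after = resolver.resolve("hana1.example.com", now=300)
print(before, during, after)
```

Until the cached entry expires, the client keeps receiving the address of the failed primary host, so the
TTL effectively becomes a lower bound on the client-visible recovery time.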

A special handling of virtual IP addresses is required in an Active/Active (read enabled) system replication
configuration, where a separate virtual IP address is needed for the read access connections to the secondary
system. In the takeover case the virtual IP address for primary access is rebound to the secondary system,
while the secondary's virtual IP address stays active. Two virtual IP addresses are available for system access
to the then active system after takeover.

Service Auto-Restart
In the event of a software failure (or an intentional intervention by an administrator) that disables one of the
configured SAP HANA services (Index Server, Name Server, etc.), the service will be restarted by the SAP
HANA Service Auto-Restart watchdog function, which automatically detects the failure and restarts the
stopped service process. Upon restart, the service loads data into memory and resumes its function. While all
data remains safe (RPO=0), the service recovery takes some time.

Host Auto-Failover
Host Auto-Failover is a local "N+m" (m is often 1) Fault Recovery solution that can be used as a
supplemental or alternative measure to the system replication solution described earlier. One (or more)
standby hosts are added to an SAP HANA system, and configured to work in standby mode. As long as they
are in standby mode the databases on these hosts do not contain any data and do not accept requests or
queries.

Host Auto-Failover, before failure

When an active (worker) host fails, a standby host automatically takes its place. Since the standby host may
take over operation from any of the primary hosts, it needs access to all the database volumes. This can be
accomplished by a shared networked storage server, by using a distributed file system, or with vendor-specific
solutions that use an SAP HANA programmatic interface (the so-called Storage Connector API) to dynamically
detach and attach (mount) networked storage (e.g. using block storage via Fibre Channel) upon failover.

Host Auto-Failover, after recovery

A topic that requires some attention is how to recover connections from SAP HANA clients that were configured
to reach the original host, and need to be "diverted" to the standby host after host auto-failover.

One approach is a network-based (IP or DNS) approach, exactly as discussed earlier. Alternatively, SQL/MDX
database clients can be configured with the connection information of multiple hosts, optionally including the
standby host (a multi-host list is provided in the connection string). The client connection code (ODBC/JDBC)
uses a "round-robin" approach to reconnect and ensures that these clients can reach the SAP HANA database,
even after failover. To support HTTP (web) clients, which use the SAP HANA XS application services 7, it is
recommended to install an external, itself fault protected, HTTP load balancer (HLB), such as SAP's Web
Dispatcher, or a similar product from another vendor. The HLBs are configured to monitor the web servers on
all the hosts on both the primary and secondary sites.

7 If the XS services are not used by any application, they can be disabled, and no HLB is required. Currently only one XS
server can run on a distributed system. In the future, multiple load-sharing XS servers will be installable on a system,
making the use of an HLB even more valuable.

The HLB (which serves as a reverse web proxy) redirects the HTTP clients to the correct server, upon HANA
instance failure. HTTP clients are configured to use the IP address of the HLB itself (obtained via DNS), and
remain unaware of any HANA failover activity.
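
The client-side multi-host reconnect described above for SQL/MDX clients can be sketched as a loop over the
configured host list. This is not the ODBC/JDBC driver code; the function names, the host names and the port
are assumptions, and the connect function is injected so that the example runs without a database.

```python
def connect_any(hosts, connect, start=0):
    """Try each host in round-robin order and return (host, connection)."""
    if not hosts:
        raise ValueError("host list must not be empty")
    failures = []
    for offset in range(len(hosts)):
        host = hosts[(start + offset) % len(hosts)]
        try:
            return host, connect(host)
        except ConnectionError as exc:
            failures.append(f"{host}: {exc}")  # remember why this host was skipped
    raise ConnectionError("no host reachable (" + "; ".join(failures) + ")")


def fake_connect(host):
    """Stand-in for a driver call: the first host is treated as failed."""
    if host == "hana1:30015":
        raise ConnectionError("connection refused")
    return f"session@{host}"


hosts = ["hana1:30015", "hana2:30015", "hana3:30015"]
print(connect_any(hosts, fake_connect))
```

Because the list is walked in a fixed circular order, a client that starts at a failed host simply moves on
to the next entry, which is how the standby host becomes reachable without any network reconfiguration.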

One dangerous scenario that may occur with Host Auto-Failover is referred to as split-brain. A split-brain could
accidentally happen if, for instance, host2 did not really fail, but only lost all its network connections (causing
the standby host to decide to take over). In this case both non-communicating systems assume the host2
role, and may both write to the same storage, causing data corruption. Preventing such data corruption due
to split-brain situations (fencing) must be implemented. The above-mentioned storage API supports fencing.
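
One generic way to implement fencing is an ownership token that the storage checks on every write. The
sketch below illustrates that technique only; it is not the Storage Connector API, and the class and method
names are invented for this example.

```python
class FencedStorage:
    """Toy shared storage that accepts writes only from the current owner."""

    def __init__(self):
        self._epoch = 0
        self.records = []

    def acquire(self):
        """Take ownership; every previously issued token becomes stale."""
        self._epoch += 1
        return self._epoch

    def write(self, token, record):
        if token != self._epoch:
            raise PermissionError("write rejected: stale owner has been fenced off")
        self.records.append(record)


storage = FencedStorage()
host2_token = storage.acquire()      # original worker host owns the volumes
storage.write(host2_token, "txn-1")

standby_token = storage.acquire()    # standby takes over after host2 appears to fail
storage.write(standby_token, "txn-2")

try:
    storage.write(host2_token, "txn-3")  # isolated host2 is still alive and tries to write
except PermissionError as exc:
    print(exc)
print(storage.records)
```

The decisive property is that the check happens at the storage, so the isolated host is stopped even though
it cannot be told that it has been replaced.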

Once repaired, the failed host can be rejoined to the system as the new standby host, to reestablish the failure
recovery capability.

5 Design for High Availability

The following table summarizes the main advantages and limitations of the SAP HANA High Availability support
options.

Backups
  Advantages:
  - Allows Disaster Recovery
  - Lowest cost, simplest
  - Supports point-in-time recovery
  - Can also be used to "clone" or copy systems
  Limitations:
  - RPO of minutes to hours, depending on frequency of backup and shipping method (synchronous shipping,
    using 3rd party tools, is recommended)
  - In case of disaster, need to acquire and configure secondary system (hours-days)
  - Cold start: longer RTO (~hour)
  - Extra time (up to hours) to load column data and return to full performance

Storage Replication
  Advantages:
  - Allows Disaster Recovery
  - RPO=0 with synchronous replication; RPO of a few seconds otherwise
  - Secondary system can be used for other purposes, until needed
  Limitations:
  - In case of disaster, need to possibly free up, boot up and re-configure secondary system (hours)
  - Cold start: longer RTO (~hour)
  - Not yet offered by all SAP HANA hardware partners
  - Extra time (up to hours) to return to full performance
  - Requires networked storage systems and efficient inter-site link
  - Synchronous replication only supports distances of up to 100 km
  - Doesn't protect against storage corruption
  - More bandwidth wasteful than System Replication

System Replication
  Advantages:
  - Allows Disaster Recovery, and can be used as main HA failover for near-zero downtime
    maintenance or failures
  - In Active/Active (read enabled) configurations the secondary is usable for reporting workload
  - RPO=0 (synchronous)
  - RTO of only a minute (continuous log replay)
  - Full performance right after takeover
  - Compatible with all partner solutions
  - Supports single-host systems with local storage, no need for external network storage
  - With no data-preload configuration, secondary system(s) can be used for noncritical dual purposes
  Limitations:
  - Requires dedicated live standby system and efficient inter-site link
  - Requires a solution for client connection recovery upon failover (e.g. DNS or Virtual IP address based)

Host Auto-Failover
  Advantages:
  - Can be used to complement System Replication or by itself
  - Automatic detection and failover
  Limitations:
  - Requires access to database storage by the standby host (shared network storage or other
    partner-specific solution)

In addition to the aforementioned SAP HANA High Availability options, one other approach deserves to be
mentioned, for analytic "data mart" applications where the data in SAP HANA is the result of using SAP
Landscape Transformation (SLT) replication from another data source. In such a situation, High Availability
through redundancy can be achieved by setting up concurrent SLT replication streams from the common data
source to two separate SAP HANA systems. Both systems can actively operate independently; in the case of
a failure or disaster, the other system remains available.

Planning for Failure

Failures are inevitable. Planning a comprehensive High Availability solution for SAP HANA requires an
evaluation of the impact of potential failures, the company's tolerance and requirements for different RPO and
RTO values in the presence of common transient local failures vs. extremely rare disasters, and an
understanding of the benefits and cost of the different alternatives offered.

To recap, here is a brief summary of the main faults and how SAP HANA addresses them:

Fault | Solution
Service down (software fault) | Service Auto-Restart. System Replication can also be used to fail over.
Power outage | Persistence of savepoints and transaction logs guarantees recovery without data loss.
Host crash (hardware fault) | Host Auto-Failover. Alternatively, System Replication can be used to fail over.
Storage or data corruption | Backups and snapshots allow point-in-time recovery, applicable to all solutions.
Data center out (disaster) | System Replication supports rapid resumption of operation. Alternatively, Storage Replication or Backups can be used to bring up the system in an alternate data center.

Besides the high-level consideration of RPO/RTO in the different scenarios, other aspects will need to be
evaluated as well: the size of the system and database, the frequency and size of the logs and data files that
need to be replicated, the bandwidth availability, reliability and latency of the links between the systems, the
nature of the landscape management and availability solutions used for other non-SAP HANA systems, and
other considerations.

Small RTO requirements lead to the preferred system replication solution, which can also be used for rapid
failover in case of planned and unplanned outages. Tradeoffs may lead to other alternatives. The following
decision tree summarizes the main design choices:

[Figure: High Availability Decision Tree]

Realistically, the above decision process will be further influenced by considerations like timelines, costs,
budgets and customer paradigm-preferences, which are outside the scope of this short paper.
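As a rough illustration of how RPO and RTO targets narrow the choice, the following sketch encodes a simplified selection rule. It is not a transcription of the decision tree; it is an assumption derived only from the advantages and limitations listed earlier. The function name `suggest_ha_option` and both 60-unit thresholds are hypothetical and would need to be replaced by a company's own requirements.

```python
def suggest_ha_option(rto_minutes, rpo_seconds):
    """Suggest one of the HA options for given RTO/RPO targets.

    Illustrative only: the thresholds are assumptions, loosely based on
    "RTO of only a minute" for System Replication, "cold start, longer
    RTO (~hour)" for the other options, and "RPO of minutes to hours"
    for Backups.
    """
    if rto_minutes < 60:
        # Only System Replication is described as avoiding a cold start.
        return "System Replication"
    if rpo_seconds < 60:
        # RPO=0 (synchronous) or a few seconds, but with a longer RTO.
        return "Storage Replication"
    # Lowest cost; RPO depends on backup frequency and shipping method.
    return "Backups"


for rto, rpo in [(5, 0), (240, 10), (1440, 3600)]:
    print(rto, rpo, suggest_ha_option(rto, rpo))
```

System Replication would also satisfy the looser targets; the sketch simply returns the least demanding option that still meets them, and ignores cost, distance limits and Host Auto-Failover.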

6 In Summary

SAP HANA supports a comprehensive range of High Availability options, designed to satisfy tradeoffs between
demanding High Availability and Disaster Recovery requirements, while also considering cost and complexity.

In particular, the SAP HANA System Replication solution supports an RPO of zero seconds, and an RTO
measured in minutes, and is SAP's recommended configuration for addressing SAP HANA outage reduction
due to planned maintenance, faults and disasters.

SAP HANA High Availability documentation: SAP Note 2407186

More information about SAP HANA can be found on http:///.

Glossary

Industry Terms
Term | Description
Fault | A failure of a system or one of its components / sub-systems (hardware, network, software)
Disaster | Major fault: the failure of an entire data center / site
Outage | A system's inability to operate (due to failure or planned downtime)
Availability | The measure of a system's operational continuity, expressed as a percentage of time
Downtime | Inverse of availability: the duration of time that a system is not operational
High Availability (HA) | A framework of design principles, techniques and best practices to reduce downtime
Fault Recovery (FR) | Recovery of system operations after outage due to a local fault
Disaster Recovery (DR) | Recovery of system operations after outage due to a disaster
Failover / Takeover | Switching to a backup (standby) system/host, upon failure of the primary system/host
Failback | Process of restoring a system to its original state
Recovery Point Objective (RPO) | The maximal permissible period of time during which operational data may be lost without ability to recover (time between the last backup and the crash)
Recovery Time Objective (RTO) | The maximal permissible time it takes to recover the system, so that its operations can resume
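Because availability is expressed as a percentage of time, it converts directly into an expected amount of downtime. The short calculation below is a sketch of that arithmetic; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(availability_percent):
    """Convert an availability percentage into hours of downtime per year."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for availability in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(availability)
    print(f"{availability}% available: about {hours:.2f} hours of downtime per year")
```

Each additional "nine" of availability reduces the permissible downtime by a factor of ten.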


Term | Description
SAP HANA System | A SAP HANA system is identified by a system id (SID). It is perceived as one unit from the perspective of the administrator, who can install, update, start up, shut down, or back up the system as a whole. A distributed SAP HANA system is a system which is installed on more than one host. The collection of elements of the system on each host is referred to as an instance.
SAP HANA Service | A SAP HANA service is an independent functional component of a SAP HANA System, such as the Index Server, the Name Server, etc. They appear as separate processes from an Operating System perspective.

Disaster Recovery Support | Backup | Periodic saving of database copies in a safe place
Disaster Recovery Support | Storage Replication | Continuous replication (mirroring) between primary storage and backup storage over a network (may be synchronous)
Disaster Recovery Support | System Replication | Continuous (synchronous and asynchronous) update of secondary system by primary system, including in-memory table loading and continuous log replay on the secondary system (if configured)
Fault Recovery Support | Service Auto-Restart | Automatic restart of stopped services on host (watchdog)
Fault Recovery Support | Host Auto-Failover | Automatic failover from crashed host to standby host in the same system

