Business Continuity requires that the operation of business critical systems remain highly
available at all times, even in the presence of failures. This paper discusses the functionality of
SAP HANA in support of High Availability and Disaster Recovery.
Updated for HANA 2.0 SPS00
HANA 2.0 SPS00: new operation mode logreplay_readaccess for system replication with Active/Active (read enabled) feature
__________________________
chaim.bendelac@sap.com & mechthild.bore-wuesthof@sap.com
SAP HANA Development Team
Table of Contents
1 Introduction
Backups
Service Auto-Restart
6 In Summary
Glossary
This document outlines our general product direction and should not be relied on in making a purchase decision. This document is not subject
to your license agreement or any other agreement with SAP. SAP has no obligation to pursue any course of business outlined in this
presentation or to develop or release any functionality mentioned in this document. This document and SAP's strategy and possible future
developments are subject to change and may be changed by SAP at any time for any reason without notice. This document is provided
without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a
particular purpose, or non-infringement. SAP assumes no responsibility for errors or omissions in this document and shall have no liability for
damages of any kind that may result from the use of these materials, except if such damages were caused by SAP intentionally or grossly
negligent.
Design for scalability is a core SAP HANA principle. SAP HANA can be distributed across multiple hosts
to achieve scalability in terms of both data volume and user concurrency. Unlike clusters, distributed HANA
systems also distribute the data efficiently, achieving high scaling without I/O locks.
The key performance indicators of SAP HANA appeal to many of our customers, and thousands of deployments
are in progress. SAP HANA has become the fastest growing product in SAP's 40+ year history.
SAP HANA is fully designed for High Availability, supporting a broad range of recovery scenarios from various
faults, from simple software errors to disasters that decommission an entire site.
This paper describes SAP HANA's High Availability support for Fault and Disaster Recovery. A comprehensive
High Availability solution offers more design choices and evidently requires the discussion of more details than
can be covered in a short paper, and may therefore require additional consultations.
Downtime is the consequence of outages, which may be intentional (e.g. for system upgrades) or caused by
unplanned faults. A fault can be due to equipment malfunction, software or network failures, or due to a
major disaster such as a fire, a regional power loss or a construction accident, which may decommission the
entire data-center.
High Availability is a set of techniques, engineering practices and design principles for Business Continuity.
This is achieved by eliminating single points of failure (fault tolerance), and providing the ability to rapidly
resume operations after a system outage with minimal business loss (fault resilience).
Fault Recovery is the process of recovering and resuming operations after an outage due to a fault. Disaster
Recovery is the process of recovering operations after an outage due to a prolonged datacenter or site failure.
Preparing for disasters may require backing up data across longer distances, and may thus be more complex
and costly.
RPO and RTO
The RPO (Recovery Point Objective) is the maximal permissible period of time during which operational data
may be lost without the ability to recover (the time between the last backup and the crash).
The RTO (Recovery Time Objective) is the maximal permissible time it takes to recover the system, so that its
operations can resume.
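As a small illustration of the two definitions, the following sketch checks a single outage against a pair of targets. The function names and the example timestamps are assumptions made for this illustration; they are not part of SAP HANA.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, crash: datetime) -> timedelta:
    """Time between the last backup and the crash: the span the RPO bounds."""
    if crash < last_backup:
        raise ValueError("crash cannot precede the last backup")
    return crash - last_backup


def recovery_time(crash: datetime, resumed: datetime) -> timedelta:
    """Time from the crash until operations resume: the span the RTO bounds."""
    if resumed < crash:
        raise ValueError("operations cannot resume before the crash")
    return resumed - crash


def meets_objectives(last_backup: datetime, crash: datetime, resumed: datetime,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True if both the data-loss window and the recovery time are within target."""
    return (data_loss_window(last_backup, crash) <= rpo
            and recovery_time(crash, resumed) <= rto)


# Hypothetical outage: backup at 02:00, crash at 02:45, service back at 03:30.
backup = datetime(2017, 1, 1, 2, 0)
crash = datetime(2017, 1, 1, 2, 45)
resumed = datetime(2017, 1, 1, 3, 30)
within_targets = meets_objectives(backup, crash, resumed,
                                  rpo=timedelta(hours=1), rto=timedelta(hours=1))
```

With a one-hour RPO and RTO this hypothetical outage is within target; tightening the RPO to 30 minutes would make the same outage a violation.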
Hardware Redundancy
SAP HANA appliance hardware vendors design multiple layers of redundant hardware components and
subsystems. These include redundant and hot-swappable power supply units (PSUs), fans, network interface
cards and enterprise-grade error-correcting protected memories. These subsystems are designed such that
the redundant component can sustain the operation of the system if the other component fails².
Particularly critical is the storage system. Enterprise-grade storage systems combine multiple physical drives
into logical units, with built-in standard (RAID) techniques for redundancy and error recovery. These include
mirroring, the writing of the same data to two different drives in parallel, and parity, extra bits written to
allow the detection and automatic correction of errors³.
Network Redundancy
Redundant networks, network equipment and network connectivity are required to prevent network failures from
affecting system availability. This is typically accomplished by deploying a completely redundant switch
topology, using the Spanning Tree Protocol to avoid loops. Routers can be configured with the Hot Standby
Router Protocol (HSRP) for automatic failover. BGP is commonly used to manage dual WAN connections.
Some enterprises operate fully duplicated data centers, providing a high level of disaster tolerance.
The following figure shows the phases of High Availability. The first phase is readiness, being prepared for the
inevitable fault. During this time, data is backed up and standby systems are ready to take over. A fault must
be detected, either automatically or administratively (to avoid false positives), and a recovery process is put
in action. Finally, the fault must be repaired, and the system may need to be reverted to the original
configuration (failed back), to be ready again for the next fault.
1 The SAP HANA software itself is a single point of failure, as it can cease to operate due to software errors or extreme
out-of-memory situations. Fault Recovery support is discussed in the next section.
2 An example of high availability hardware design can be found here: http://www.redbooks.ibm.com/redpapers/pdfs/redp4864.pdf
All of this leads to tradeoff decisions around the attributes of fault recovery functionality, cost and complexity.
SAP accordingly offers complementary design options, including three levels of Disaster Recovery support and
two automatic Fault Recovery support features, summarized in the following table and further discussed in
the sections below.
                                                         Cost   RPO   RTO
DISASTER RECOVERY SUPPORT
  1. Backups                                             $      >0    high
  2. Storage Replication                                 $$     ~0    med
  3. System Replication                                  $$$    0     low
  4. System Replication Active/Active (read enabled)     $      0     low
  5. System Replication w/o data preload                 $      0     med
FAULT RECOVERY SUPPORT
  1. Service Auto-Restart                                0      0     med
  2. Host Auto-Failover                                  $$     0     med
Backups
SAP HANA uses in-memory technology, but of course, it fully persists any transaction that changes the data,
such as row insertions, deletions and updates, so it can resume from a power outage without loss of data.
SAP HANA persists two types of data to storage: transaction redo logs, and data changes in the form of
savepoints.
A transaction redo log is used to record a change. To make a transaction durable, it is not required to persist
the complete data when the transaction is committed; instead it is sufficient to persist the redo log. Upon an
outage, the most recent consistent state of the database can be restored by replaying the changes recorded
in the log, redoing completed transactions and rolling back incomplete ones.
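The replay idea can be sketched with a toy key-value store. This is a simplified model and not SAP HANA's log format: the record layout `(transaction_id, operation, key, value)` and the function name `replay` are assumptions for illustration. Changes of transactions without a commit record are simply never applied, which for this toy model has the same effect as rolling them back.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (transaction_id, operation, key, value).
LogRecord = Tuple[int, str, str, Optional[int]]


def replay(savepoint: Dict[str, int], log: List[LogRecord]) -> Dict[str, int]:
    """Rebuild the latest consistent state from a savepoint plus the redo log."""
    # A transaction counts as completed only if the log holds its commit record.
    committed = {txn for txn, op, _, _ in log if op == "commit"}
    state = dict(savepoint)
    for txn, op, key, value in log:
        if txn not in committed:
            continue  # incomplete transaction: its changes are not applied
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


savepoint = {"a": 1, "c": 7}
redo_log: List[LogRecord] = [
    (1, "put", "a", 2),
    (1, "delete", "c", None),
    (1, "commit", "", None),
    (2, "put", "b", 5),  # transaction 2 never committed before the outage
]
recovered = replay(savepoint, redo_log)
```

Here transaction 1 is redone on top of the savepoint, while the uncommitted change of transaction 2 leaves no trace in the recovered state.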
A savepoint is a periodic point in time, when all the changed data is written to storage, in the form of pages.
One goal of performing savepoints is to speed up restart: when starting up the system, logs need not be
processed from the beginning, but only from the last savepoint position. Savepoints are coordinated across
all processes (called SAP HANA services) and instances of the database to ensure transaction consistency. By
default, savepoints are performed every five minutes, but this can be configured.
Savepoints normally overwrite older savepoints, but it is possible to freeze a savepoint for future use; this is
called a snapshot. Snapshots can be replicated in the form of full data backups, which can be used to restore
a database to a specific point in time. This can be useful in the event of data corruption, for instance. In
addition to data backups and snapshots, smaller periodic log backups ensure the ability to recover from fatal
storage faults with minimal loss of data. While full data backups contain all current data, delta backups
can also be created (since HANA 1.0 SPS11), containing all data that was changed since the last data backup. Two
types of delta backups are to be distinguished: incremental backups contain all changed data since the last
full or delta backup, and differential backups contain all changed data since the last full backup.
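The difference between the two delta backup types shows up when deciding which backups a restore needs. The following sketch applies the definitions above to a list of backups; the tuple layout `(time, kind)` and the function name `restore_chain` are assumptions for illustration, not an SAP HANA interface.

```python
from typing import List, Tuple

# Hypothetical backup catalog entry: (time, kind), with kind one of
# "full", "differential" or "incremental".
Backup = Tuple[int, str]


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Return the data backups needed to reach the newest backup state, oldest first."""
    ordered = sorted(backups)
    full_positions = [i for i, (_, kind) in enumerate(ordered) if kind == "full"]
    if not full_positions:
        raise ValueError("a restore needs at least one full backup")
    last_full = full_positions[-1]
    chain = [ordered[last_full]]
    later = ordered[last_full + 1:]
    # A differential covers everything since the last full backup, so only the
    # newest differential is needed and earlier delta backups can be skipped.
    diff_positions = [i for i, (_, kind) in enumerate(later) if kind == "differential"]
    if diff_positions:
        chain.append(later[diff_positions[-1]])
        later = later[diff_positions[-1] + 1:]
    # Whatever remains are incrementals, each covering changes since the
    # previous full or delta backup, so all of them are needed in order.
    chain.extend(later)
    return chain


catalog = [(1, "full"), (2, "incremental"), (3, "differential"),
           (4, "incremental"), (5, "incremental")]
needed = restore_chain(catalog)
```

In this hypothetical catalog the incremental backup at time 2 is not needed, because the differential at time 3 already contains its changes.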
Figure: Local Persistence and Backups
The above figure shows the savepoints, saved to local storage, and the additional backups, saved to backup
storage. Local recovery from the crash uses the latest savepoint, and then replays the last logs, to recover
the database without any data loss. If the local storage was corrupted by the crash, it is still possible to
recover the database from the data backup (or last snapshot), and log backups, possibly with some data loss.
Regularly shipping backups to a remote location over a network or via couriers can be a simple and relatively
inexpensive way to prepare for a disaster. Depending on the frequency and shipping method, this approach
may have an RPO of hours to days.
Figure: Backup Shipping
Storage Replication
Due to its continuous nature, storage replication (sometimes also called remote storage mirroring) offers a
more attractive RPO than backups, but this solution of course requires a reliable, high-bandwidth and
low-latency connection between the primary site and the secondary site.
In the event of a disastrous failure that justifies full system failover, an administrator attaches a standby
system to the replicated storage, and then restarts the SAP HANA system. The administrator must take care
that the failed primary system can no longer write to the replicated storage (an action called fencing), or else
there is a risk of data corruption, with two systems writing to the same storage.
System Replication
System Replication is an alternative HA solution for SAP HANA, providing an extremely short RTO, and
compatible with all SAP HANA hardware partner solutions. System replication employs an "N+N" approach,
with a secondary standby SAP HANA system with the same number of active nodes as the active, primary
system. Each service and instance of the primary SAP HANA system communicates pairwise with a counterpart
in the secondary system⁴.
Figure: System Replication
The secondary system can be located near the primary system to serve as a rapid failover solution for planned
downtime, or to handle storage corruption or other local faults. Alternatively, or additionally (multi-tiered or
cascaded), a secondary system can be installed in a remote site for disaster recovery. Like Storage Replication,
this Disaster Recovery option requires a reliable link between the primary and secondary sites.
The instances in the secondary system operate in live replication mode. In this mode, all secondary system
services constantly communicate with their primary counterparts, replicate and persist data and logs, and
typically load data to memory. The log and data can be compressed before shipping. The secondary system
can accept queries when system replication was set up as an Active/Active (read enabled) configuration;
otherwise it does not accept requests or queries. With the Active/Active setup the secondary system can be
used to handle reporting workload without disrupting the primary system.
Here is how system replication works. When the secondary system is brought up to start running in live
replication mode, each service component establishes a connection with its primary system counterpart, and
requests a snapshot of the data. From then on, all logged changes in the primary system are replicated.
Whenever logs are persisted in the primary system (i.e. written to the log volumes of each service), they are
also sent to the secondary system. A transaction in the primary system is not committed until the redo logs
are replicated, as determined by a log replication option:
Synchronous: The primary system waits to commit the transaction until it receives a reply
that the log is persisted in the secondary system. This mode guarantees immediate consistency
between both systems, at the cost of delaying the transaction by the time for data transmission and
persisting in the secondary system.
The question of what to do if replication fails (for instance due to a network fault) is governed by the
full sync configuration option. It can be set to commit the transaction, or to fail the commit on the
primary system, until replication is restored.
4 From HANA 1.0 SPS09 SAP HANA supports multi-tenant database containers. System replication can be only set up for
Asynchronous: The primary system commits the transaction after sending the log without waiting
for a response. This eliminates the synchronization latency, at the risk of minor theoretical data loss
during failure. This mode is most useful when the secondary site is hundreds of kilometers away
from the primary site, or when reducing latency is critical.
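The commit behavior of the two log replication options, including the full sync setting, can be sketched as follows. This is a simplified model under stated assumptions: the function `commit`, its parameters and the `send` callback are hypothetical names for this illustration and do not correspond to SAP HANA configuration parameters.

```python
from typing import Callable, List


class CommitRejected(Exception):
    """Raised when full sync is enabled and the secondary cannot confirm the log."""


def commit(entry: str,
           committed: List[str],
           send: Callable[[str], bool],
           synchronous: bool = True,
           full_sync: bool = False) -> None:
    """Commit one transaction on the primary.

    `send` ships the redo log entry to the secondary and returns True once the
    secondary reports it as persisted, or False if replication failed.
    """
    if synchronous:
        acknowledged = send(entry)  # wait for the reply before committing
        if not acknowledged and full_sync:
            raise CommitRejected(entry)  # fail the commit until replication is restored
    else:
        send(entry)  # asynchronous: ship the log, do not wait for the result
    committed.append(entry)


committed: List[str] = []
commit("t1", committed, send=lambda e: True)                     # normal synchronous commit
commit("t2", committed, send=lambda e: False)                    # link down, full sync off
commit("t3", committed, send=lambda e: False, synchronous=False)  # asynchronous mode
try:
    commit("t4", committed, send=lambda e: False, full_sync=True)
    rejected = False
except CommitRejected:
    rejected = True
```

In this sketch, only the combination of synchronous replication and full sync refuses the commit while the link is down; that is the configuration that never lets the two systems diverge, at the price of blocking the primary.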
Handling of the received logs on the secondary site is done in different ways, depending on the configured
system replication operation mode:
Delta data shipping: In this operation mode the secondary system persists, but does not
immediately replay the received logs. To avoid an ever-growing list of logs, incremental data
snapshots are transmitted asynchronously from time to time from the primary to the secondary
system. If the secondary system has to take over, only that part of the log needs to be replayed that
represents changes that were made after the most recent data snapshot. In addition to snapshots,
the primary system also transfers status information regarding which column table columns are
currently loaded into memory. The secondary system correspondingly preloads these columns.
Log replay (as of HANA 1.0 SPS11): With this operation mode configured, the received log entries
are replayed immediately in the secondary system. The takeover time is reduced because the log
does not have to be replayed any more. Additionally, there is much less traffic on the network between
the primary and the secondary site, because no delta data shipping needs to take place.
Log replay read access (as of HANA 2.0 SPS00): In this operation mode the received log entries
are also replayed immediately in the secondary system. Additionally, the replicated data is read
accessible with a small delay compared to the primary's data. Read access is possible via direct
connection to the secondary or by providing hinted SQL statements on the primary, which are routed
to the secondary for execution. The takeover time is reduced further, not only because the log does
not have to be replayed any more, but also because this system is even more prepared for productive
operation.
In the event of a failure that justifies full system takeover, an administrator instructs the secondary system
to switch from live replication mode to full operation. The secondary system, which already preloaded the
same column data as the primary system, and possibly is already read enabled, becomes the primary system
by replaying the last transaction logs, and then starts to accept queries.
When the original system can be restored to service, it can be configured as the new secondary system, or
reverted to the original configuration by "failing back".
HANA 1.0 SPS09 introduced a way to hook events and actions inside SAP HANA scale-out (such as Host
Auto-Failover) and system replication. An administrator can add required actions to a Python script, to be executed
before or after events (like startup, shutdown, failover, takeover, ...).
These so-called "HA/DR provider" hooks can be used to address issues that require integration attention such
as how to handle connections from database clients that were configured to reach the primary system, and
need to be "diverted" to the secondary system after a takeover. For example, a hook could be written to
remap virtual IP addresses after a takeover in SAP HANA system replication.
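The before/after hook pattern can be illustrated with a small, self-contained sketch. It deliberately does not use the SAP HANA HA/DR provider interface: the class `HookRegistry`, its methods and the host name are hypothetical and only show the general mechanism of running administrator-supplied actions around an event.

```python
from typing import Callable, Dict, List

Hook = Callable[[str], None]


class HookRegistry:
    """Runs registered actions before and after a named event (hypothetical sketch)."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {}

    def register(self, event: str, phase: str, hook: Hook) -> None:
        """Attach a hook to run in the given phase ("before" or "after") of an event."""
        if phase not in ("before", "after"):
            raise ValueError("phase must be 'before' or 'after'")
        self._hooks.setdefault(f"{phase}:{event}", []).append(hook)

    def run(self, event: str, action: Callable[[], None], host: str) -> None:
        """Execute the event's action, surrounded by its registered hooks."""
        for hook in self._hooks.get(f"before:{event}", []):
            hook(host)
        action()
        for hook in self._hooks.get(f"after:{event}", []):
            hook(host)


trace: List[str] = []
registry = HookRegistry()
registry.register("takeover", "before", lambda host: trace.append(f"notify admins about {host}"))
registry.register("takeover", "after", lambda host: trace.append(f"remap virtual IP to {host}"))
registry.run("takeover", lambda: trace.append("takeover"), host="hana-secondary-1")
```

In a real deployment the "after takeover" action would call whatever mechanism the site uses to redirect clients; here it only records the step so that the ordering is visible.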
IP redirection is the method of choice for end-to-end client reconnection support, as it uniformly and simply
handles the end-to-end recovery of both SQL and HTTP clients, with very short recovery times, and without
special client-side configuration. The principle of IP redirection (also known as VIP⁶) is to define an additional
"logical" hostname (hana1, in the picture below) with its separate logical IP address (for example,
10.68.104.51), and then map this initially to the MAC address of the original host in the primary system (by
binding it to one of the host's interfaces). As part of the takeover procedure, a script is executed which
remaps the unchanged logical IP address to the corresponding takeover host in the secondary system. This
must be done pairwise, for each host in the primary system. The remapping affects the L2 switching, as can
be seen in step 4 of the following diagram:
5 As a result, the primary system and secondary system might get out of sync. Such a situation is detected by the
secondary system when it resumes, reestablishes the connection, and receives the next set of log entries. In such case, the
secondary system requests a data backup delta based on which the log replication can be restarted.
6 E.g. see here: http://scale-out-blog.blogspot.com/2011/01/virtual-ip-addresses-and-their.html
IP redirection can be implemented using a number of techniques, for instance with the use of Linux commands
which affect the network ARP tables, by configuring L2 network switches directly, or by using cluster
management software. Following the IP redirection configuration, the ARP caches should be flushed, to provide
an almost instantaneous recovery experience to clients.
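For the Linux command variant, a takeover script might assemble commands along the following lines. This is a hedged sketch: it only builds the command lines (using `ip addr` from iproute2 and `arping -U` from iputils) and does not execute them. The interface name `eth0` and the `/24` prefix length are assumptions; the address is the example used above.

```python
from typing import List


def takeover_commands(vip: str, prefix_len: int, interface: str) -> List[List[str]]:
    """Commands for the takeover host: bind the logical IP, then announce it."""
    return [
        ["ip", "addr", "add", f"{vip}/{prefix_len}", "dev", interface],
        # Unsolicited (gratuitous) ARP so that neighbours refresh their ARP caches.
        ["arping", "-U", "-c", "3", "-I", interface, vip],
    ]


def release_commands(vip: str, prefix_len: int, interface: str) -> List[List[str]]:
    """Command for the former primary host: give up the logical IP."""
    return [["ip", "addr", "del", f"{vip}/{prefix_len}", "dev", interface]]


# Assumed values for illustration only.
plan = release_commands("10.68.104.51", 24, "eth0") + takeover_commands("10.68.104.51", 24, "eth0")
printable_plan = [" ".join(command) for command in plan]
```

Building the plan separately from running it makes it easy to review or log the exact steps before a script executes them with root privileges on the respective hosts.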
IP redirection requires that both the primary and failover host(s) are on the same L2 network. This depends
on the customer network design, but networks are increasingly designed with L2-over-L3 (such as Ethernet
over MPLS), making this option a viable solution in many cases. If the standby system is in a completely
separate L3 network, then DNS redirection is the preferred alternative solution.
DNS is a binding from a logical domain name to an IP address. Clients contact a DNS server to obtain the IP
address of the HANA host (step 1 below) they wish to reach. Many DNS products support failover configuration
by using short (few minutes or less) TTL response fields, and can be set up with watchdog functionality and
automatically triggered switchover.
As part of the failover procedure, a script is executed that changes the DNS name-to-IP mapping from the
primary host to the corresponding host in the secondary system (pairwise for all hosts in the system). From
that point in time, clients are redirected to the failover hosts, as in step 2 of the following diagram:
DNS and IP redirection share the advantage that there are no client-specific configurations or requirements.
Further, DNS redirection supports DR configurations where the primary and standby systems may be in two
completely different network domains (separated by routers). One drawback of this solution is that modifying DNS
mappings requires a vendor-proprietary solution. Further, due to DNS caching in nodes (both clients and
intermediate network equipment), it may take a while (up to hours) until the DNS changes are propagated,
causing clients to experience downtime despite the recovery of the system.
Special handling of virtual IP addresses is required in an Active/Active (read enabled) system replication
configuration, where a separate virtual IP address is needed for the read access connections to the secondary
system. In the takeover case the virtual IP address for primary access is rebound to the secondary system,
while the secondary's virtual IP address stays active. Two virtual IP addresses are available for system access
to the then active system after takeover.
Service Auto-Restart
In the event of a software failure (or an intentional intervention by an administrator) that disables one of the
configured SAP HANA services (Index Server, Name Server, etc.), the service will be restarted by the SAP
HANA Service Auto-Restart watchdog function, which automatically detects the failure and restarts the
stopped service process. Upon restart, the service loads data into memory and resumes its function. While all
data remains safe (RPO=0), the service recovery takes some time.
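The general watchdog pattern behind such a function can be sketched in a few lines. This is not SAP HANA's implementation: the function `watchdog_pass`, the liveness callbacks and the lowercase service names are assumptions for this illustration.

```python
from typing import Callable, Dict, List


def watchdog_pass(services: Dict[str, Callable[[], bool]],
                  restart: Callable[[str], None]) -> List[str]:
    """Check every service once and restart the ones that are not alive."""
    restarted = []
    for name, is_alive in services.items():
        if not is_alive():
            restart(name)
            restarted.append(name)
    return restarted


# Simulated service states; a real watchdog would inspect the actual processes.
alive = {"indexserver": False, "nameserver": True}


def restart_service(name: str) -> None:
    """Stand-in for restarting a stopped service process."""
    alive[name] = True


checks = {name: (lambda n=name: alive[n]) for name in alive}
first_pass = watchdog_pass(checks, restart_service)
second_pass = watchdog_pass(checks, restart_service)
```

A real watchdog would repeat such a pass periodically; the second pass in the sketch finds nothing left to restart.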
Host Auto-Failover
Host Auto-Failover is a local "N+m" (m is often 1) Fault Recovery solution that can be used as a
supplemental or alternative measure to the system replication solution described earlier. One (or more)
standby hosts are added to an SAP HANA system, and configured to work in standby mode. As long as they
are in standby mode the databases on these hosts do not contain any data and do not accept requests or
queries.
Figure: Host Auto-Failover, before failure
When an active (worker) host fails, a standby host automatically takes its place. Since the standby host may
take over operation from any of the primary hosts, it needs access to all the database volumes. This can be
accomplished by a shared networked storage server, by using a distributed file system, or with vendor-specific
solutions that use an SAP HANA programmatic interface (the so-called Storage Connector API) to dynamically
detach and attach (mount) networked storage (e.g. using block storage via Fibre Channel) upon failover.
Figure: Host Auto-Failover, after recovery
A topic that requires some attention is how to recover connections from SAP HANA clients that were configured
to reach the original host, and need to be "diverted" to the standby host after host auto-failover.
One approach is a network-based (IP or DNS) approach, exactly as discussed earlier. Alternatively, SQL/MDX
database clients can be configured with the connection information of multiple hosts, optionally including the
standby host (a multi-host list is provided in the connection string). The client connection code (ODBC/JDBC)
uses a "round-robin" approach to reconnect and ensures that these clients can reach the SAP HANA database,
even after failover. To support HTTP (web) clients, which use the SAP HANA XS application services⁷, it is
recommended to install an external, itself fault protected, HTTP load balancer (HLB), such as SAP's Web
Dispatcher, or a similar product from another vendor. The HLBs are configured to monitor the web-servers on
all the hosts on both the primary and secondary sites.
7 If the XS services are not used by any application, they can be disabled, and no HLB is required. Currently only one XS
server can run on a distributed system. In the future, multiple load-sharing XS servers will be installable on a system,
making the use of an HLB even more valuable.
The HLB (which serves as a reverse web-proxy) redirects the HTTP clients to the correct server, upon HANA
instance failure. HTTP clients are configured to use the IP address of the HLB itself (obtained via DNS), and
remain unaware of any HANA failover activity.
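The multi-host connection list used by SQL/MDX clients, described above, amounts to trying the listed hosts in turn until one accepts the connection. The following sketch shows that round-robin idea in isolation; it is not the ODBC/JDBC driver code, and the function name, host names and port number are assumptions for illustration.

```python
from typing import Callable, Sequence, Tuple

Host = Tuple[str, int]


def connect_round_robin(hosts: Sequence[Host],
                        try_connect: Callable[[Host], bool],
                        start: int = 0,
                        max_rounds: int = 2) -> Host:
    """Try the hosts in round-robin order and return the first one that accepts."""
    if not hosts:
        raise ValueError("the host list must not be empty")
    for attempt in range(len(hosts) * max_rounds):
        host = hosts[(start + attempt) % len(hosts)]
        if try_connect(host):
            return host
    raise ConnectionError("no host in the list accepted the connection")


# Hypothetical host list, including the standby host.
host_list = [("host1", 30015), ("host2", 30015), ("standby", 30015)]
failed_hosts = {"host2"}  # host2 has failed; the standby host took over its role

# A client that was connected to host2 (index 1) reconnects.
reached = connect_round_robin(host_list,
                              try_connect=lambda h: h[0] not in failed_hosts,
                              start=1)
```

Because the standby host is part of the list, the client reaches the database again without any change to its configuration.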
One dangerous scenario that may occur with Host Auto-Failover is referred to as split-brain. A split-brain could
accidentally happen if, for instance, host2 did not really fail, but only lost all its network connections (causing
the standby host to decide to take over). In this case both non-communicating systems assume the host2
role, and may both write to the same storage, causing data corruption. Fencing, the prevention of such data
corruption due to split-brain situations, must be implemented. The above-mentioned storage API supports fencing.
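One generic way to think about fencing is a token that the storage hands out on every takeover, so that writes from a superseded owner are rejected. The sketch below illustrates that general technique only; it is an assumption for illustration and does not describe how the Storage Connector API or any vendor solution implements fencing.

```python
from typing import Dict


class FencedStorage:
    """Shared storage that accepts writes only from the most recent owner."""

    def __init__(self) -> None:
        self._current_token = 0
        self.blocks: Dict[str, str] = {}

    def attach(self) -> int:
        """Grant ownership to a new host and invalidate every earlier token."""
        self._current_token += 1
        return self._current_token

    def write(self, token: int, key: str, value: str) -> None:
        """Store a block, rejecting writers whose token has been superseded."""
        if token != self._current_token:
            raise PermissionError("writer has been fenced off")
        self.blocks[key] = value


storage = FencedStorage()
host2_token = storage.attach()       # the original host2 owns the volume
storage.write(host2_token, "page-1", "from host2")
standby_token = storage.attach()     # the standby host takes over the host2 role
storage.write(standby_token, "page-1", "from standby")
try:
    storage.write(host2_token, "page-1", "stale write")  # isolated host2 is still running
    stale_write_accepted = True
except PermissionError:
    stale_write_accepted = False
```

The isolated original host can still attempt to write, but the storage refuses it, so only one of the two hosts that both believe they hold the host2 role can change the data.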
Once repaired, the failed host can be rejoined to the system as the new standby host, to reestablish the failure
recovery capability.
Backups
  Advantages:
    Allows Disaster Recovery
    Lowest cost, simplest
    Supports point-in-time recovery
    Can also be used to "clone" or copy systems
  Limitations:
    RPO of minutes to hours, depending on frequency of backup and shipping method (synchronous shipping,
    using 3rd party tools, is recommended)
    In case of disaster, need to acquire and configure secondary system (hours-days)
    Cold start: longer RTO (~hour)
    Extra time (up to hours) to load column data and return to full performance
Storage Replication
  Advantages:
    Allows Disaster Recovery
    RPO=0 with synchronous replication; RPO of a few seconds otherwise
    Secondary system can be used for other purposes, until needed
  Limitations:
    In case of disaster, need to possibly free up, boot up and re-configure secondary system (hours)
    Cold start: longer RTO (~hour)
    Not yet offered by all SAP HANA hardware partners
    Extra time (up to hours) to return to full performance
    Requires networked storage systems and efficient inter-site link
    Synchronous replication only supports distances of up to 100 km
    Doesn't protect against storage corruption
    More bandwidth wasteful than System Replication
System Replication
  Advantages:
    Allows Disaster Recovery, and can be used as main HA failover for near-zero downtime maintenance or failures
    In Active/Active (read enabled) configurations, the secondary is usable for reporting workload
    RPO=0 (synchronous)
    RTO of only a minute (continuous log replay)
    Full performance right after takeover
    Compatible with all partner solutions
    Supports single-host systems with local storage, no need for external network storage appliances
    With no data-preload configuration, secondary system(s) can be used for noncritical dual purposes
  Limitations:
    Requires dedicated live standby system and efficient inter-site link
    Requires a solution for client connection recovery upon the failover (e.g. DNS or Virtual IP address based)
Host Auto-Failover
  Advantages:
    Can be used to complement System Replication or by itself
    Automatic detection and failover
  Limitations:
    Requires access to database storage by the standby host (shared network storage or other partner-specific
    solution)
In addition to the aforementioned SAP HANA High Availability options, one other approach deserves to be
mentioned, for analytic "data mart" applications where the data in SAP HANA is the result of using SAP
Landscape Transformation (SLT) replication from another data source. In such a situation, High Availability
through redundancy can be achieved by setting up concurrent SLT replication streams from the common data
source to two separate SAP HANA systems. Both systems can actively operate independently; in the case of
a failure or disaster, the other system remains available.
To recap, here is a brief summary of the main faults and how SAP HANA addresses them:
Fault                            Solution
Service down (software fault)    Service Auto-Restart. System Replication can also be used to fail over.
Power outage                     Persistence of savepoints and transaction logs guarantees recovery without data loss.
Host crash (hardware fault)      Host Auto-Failover. Alternatively, System Replication can be used to fail over.
Storage or data corruption       Backups and snapshots allow point-in-time recovery, applicable to all solutions.
Data center out (disaster)       System Replication supports rapid resumption of operation. Alternatively, Storage
                                 Replication or Backups can be used to bring up the system in an alternate data center.
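The fault summary above can also be expressed as a simple lookup, for instance as a starting point for a runbook. The following is only an illustrative sketch: the dictionary keys, the name FAULT_SOLUTIONS and the function solutions_for are hypothetical and are not part of any SAP HANA tool or API. The ordering within each list follows the table (primary solution first, alternatives after).

```python
from typing import Dict, List

# Hypothetical lookup table transcribed from the fault summary above.
# Keys are lower-case fault names; values list the primary solution first.
FAULT_SOLUTIONS: Dict[str, List[str]] = {
    "service down": ["Service Auto-Restart", "System Replication"],
    "power outage": ["Persistence of savepoints and transaction logs"],
    "host crash": ["Host Auto-Failover", "System Replication"],
    "storage or data corruption": ["Backups and snapshots"],
    "data center out": ["System Replication", "Storage Replication", "Backups"],
}


def solutions_for(fault: str) -> List[str]:
    """Return the solutions listed for a fault (case-insensitive lookup)."""
    key = fault.strip().lower()
    if key not in FAULT_SOLUTIONS:
        raise KeyError(f"unknown fault type: {fault!r}")
    # Return a copy so callers cannot modify the table by accident.
    return list(FAULT_SOLUTIONS[key])


if __name__ == "__main__":
    for name in FAULT_SOLUTIONS:
        print(f"{name}: {', '.join(solutions_for(name))}")
```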
Besides the high-level consideration of RPO/RTO in the different scenarios, other aspects will need to be
evaluated as well: the size of the system and database, the frequency and size of the logs and data files that
need to be replicated, the bandwidth availability, reliability and latency of the links between the systems, the
nature of the landscape management and availability solutions used for other non-SAP HANA systems, and
other considerations.
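As a rough illustration of the bandwidth consideration above, the sketch below converts an hourly log volume into the sustained link throughput needed to replicate it. This is a back-of-the-envelope estimate under stated assumptions, not an SAP sizing method: it uses decimal units (1 GB = 8000 Mbit), ignores protocol overhead, compression and data file shipping, and the headroom parameter is a hypothetical safety margin.

```python
def required_mbit_per_s(log_gb_per_hour: float) -> float:
    """Sustained throughput (Mbit/s) needed to ship the given log volume.

    Assumes decimal units: 1 GB = 8000 Mbit, 1 hour = 3600 seconds.
    """
    if log_gb_per_hour < 0:
        raise ValueError("log volume must not be negative")
    return log_gb_per_hour * 8000.0 / 3600.0


def link_is_sufficient(log_gb_per_hour: float,
                       link_mbit_per_s: float,
                       headroom: float = 0.5) -> bool:
    """Check whether the link can carry the log volume.

    headroom is the (assumed) fraction of the link that may be used for
    replication; the remainder is left for peaks and other traffic.
    """
    if not 0.0 < headroom <= 1.0:
        raise ValueError("headroom must be in the range (0, 1]")
    return required_mbit_per_s(log_gb_per_hour) <= link_mbit_per_s * headroom


if __name__ == "__main__":
    volume = 45.0  # GB of log per hour, an arbitrary example value
    print(f"{volume} GB/h needs {required_mbit_per_s(volume):.1f} Mbit/s sustained")
    print("1 Gbit/s link sufficient:", link_is_sufficient(volume, 1000.0))
```

Peak log rates, rather than hourly averages, usually determine whether synchronous replication stalls, so a real evaluation would measure the actual workload.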
Small RTO requirements lead to the preferred system replication solution, which can also be used for rapid
failover in case of planned and unplanned outages. Tradeoffs may lead to other alternatives. The following
decision tree summarizes the main design choices:
High Availability Decision Tree
Realistically, the above decision process will be further influenced by considerations like timelines, costs,
budgets and customer paradigm preferences, which are outside the scope of this short paper.
6 In Summary
SAP HANA supports a comprehensive range of High Availability options, designed to satisfy tradeoffs between
demanding High Availability and Disaster Recovery requirements, while also considering cost and complexity.
In particular, the SAP HANA System Replication solution supports an RPO of zero seconds, and an RTO
measured in minutes, and is SAP's recommended configuration for addressing SAP HANA outage reduction
due to planned maintenance, faults and disasters.
SAP HANA High Availability documentation: SAP Note 2407186
Glossary
Industry Terms
Term                              Description
Fault                             A failure of a system or one of its components / sub-systems (hardware, network, software)
Disaster                          Major fault: the failure of an entire data center / site
Outage                            A system's inability to operate (due to failure or planned downtime)
Availability                      The measure of a system's operational continuity, expressed as a percentage of time
Downtime                          Inverse of availability: the duration of time that a system is not operational
High Availability (HA)            A framework of design principles, techniques and best practices to reduce downtime
Fault Recovery (FR)               Recovery of system operations after outage due to a local fault
Disaster Recovery (DR)            Recovery of system operations after outage due to a disaster
Failover / Takeover               Switching to a backup (standby) system / host, upon failure of the primary system / host
Failback                          Process of restoring a system to its original state
Recovery Point Objective (RPO)    The maximal permissible period of time during which operational data may be lost without
                                  ability to recover (time between the last backup and the crash)
Recovery Time Objective (RTO)     The maximal permissible time it takes to recover the system, so that its operations can resume
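To make the Availability, Downtime and RPO definitions above concrete, the following sketch converts an availability percentage into the corresponding downtime per year and computes the data-loss window between a last backup and a crash. It is an illustration only: the function names are hypothetical, and a year is assumed to have 365 days (8760 hours).

```python
from datetime import datetime, timedelta

HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Downtime per year implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


def data_loss_window(last_backup: datetime, crash: datetime) -> timedelta:
    """Time between the last backup and the crash (per the RPO definition)."""
    if crash < last_backup:
        raise ValueError("crash time must not precede the last backup")
    return crash - last_backup


def meets_rpo(last_backup: datetime, crash: datetime, rpo: timedelta) -> bool:
    """True if the data-loss window stays within the permitted RPO."""
    return data_loss_window(last_backup, crash) <= rpo


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% availability allows {downtime_hours_per_year(pct):.2f} h downtime per year")
```

For example, 99.9% availability corresponds to roughly 8.76 hours of downtime per year under these assumptions.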
Term                              Description
SAP HANA System                   A SAP HANA system is identified by a system ID (SID). It is perceived as one unit from the
                                  perspective of the administrator, who can install, update, start up, shut down, or back up the
                                  system as a whole. A distributed SAP HANA system is a system which is installed on more than
                                  one host. The collection of elements of the system on each host is referred to as an instance.
SAP HANA Service                  A SAP HANA service is an independent functional component of a SAP HANA system, such as
                                  the Index Server, the Name Server, etc. Services appear as separate processes from an operating
                                  system perspective.