Copyright 2000-2006 by George Chiesa and dotNSF, Inc - ALL RIGHTS RESERVED It is kindly requested that this presentation is NOT publicly posted, see "license" slide.
[Diagram: standard replication versus cluster replication between two servers. Labels: SERVER, UPDATE, REPLICA (Push/Pull), CLREPL (replica, Push), VIEW, DATA, DATABASE.]
This Presentation was not researched nor conceived at the British Library
This is bubble-bath-ware!
Disclaimers: NO Proofs...
Ok, just one hack from a red book where I wrote something in...
Download and get this redbook: SG24-7017 Lotus Security Handbook (2004)
Hint: firefox's "modify header" plugin extension (free)
is quite useless !
50% of what you don't know about clusters
About questions...
IT IS "OK"(not impolite)
To interrupt... to ASK questions... 'ala' easyjet... "within reason" :-)
100% of what you do not understand can, and WILL probably hurt you!
We reserve the right to postpone the answers, but, when in doubt, raise hand!
So a LOT within Notes has a strong LEGACY. So, we're going to provoke your brain to think!
This time the answer is not 42 ;-) but instead: 443!
You can specify what you are "listening to".
You must understand: netstat -an | find "LISTEN"
If you bind addresses you will listen on just those, BUT you CAN specify "0.0.0.0" as a specific address! You can use this to listen on all addresses at a port.
Example: You can set a Notes server to also listen for NRPC on port 443 on 0.0.0.0. This is a useful hack when you are behind a proxy, want to access your home server, and the proxy only allows access to ports 80 and 443. On port 443, proxies use the transparent "connect method".
When visiting customers who use HTTP proxies and do not allow direct access to port 1352: if the customer agrees to let me connect to my own server while at their premises, I can do so using their proxy.
In my server's Notes.ini
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
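To double-check which address and port each Notes port is bound to, the notes.ini lines above can be read mechanically. The following is only a sketch in Python: the helper name `listen_addresses` is hypothetical, and it assumes the `<PORT>_TCPIPADDRESS=0,<address>:<port>` layout shown in the example.

```python
def listen_addresses(ini_text):
    """Map each enabled Notes port name to its (bind address, TCP port)."""
    settings = {}
    for line in ini_text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip().upper()] = value.strip()

    result = {}
    for port_name in settings.get("PORTS", "").split(","):
        port_name = port_name.strip()
        if not port_name:
            continue
        raw = settings.get(port_name.upper() + "_TCPIPADDRESS")
        if raw is None:
            continue  # no explicit binding configured for this port
        # Value looks like "0,0.0.0.0:443": keep what follows the first comma.
        address = raw.split(",", 1)[-1]
        host, _, tcp_port = address.rpartition(":")
        result[port_name] = (host, int(tcp_port))
    return result


NOTES_INI = """\
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
"""

for name, (host, tcp_port) in listen_addresses(NOTES_INI).items():
    print(f"{name}: listening on {host}:{tcp_port}")
```

The result should match what netstat -an reports as LISTEN on the server itself.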
Definition:
A Notes Client is said to be cluster-aware when it will perform custom logic to transparently and automatically fail-over from one server to another, upon server directive or LACK of reply
QUIZ:
what % of Notes Clients are CLUSTER Aware? hint: what was the first version of Cluster Aware Notes client?
Voila': I can connect using HTTP Proxy "transparent connect method" to 443
Clustering
Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
Time=03/01/2002 16:18:24 (80256B36:0059935B)
TheConifers.com
CN=dotNSF.TheConifers.com/O=TheConifers
CN=Linux.TheConifers.com/O=TheConifers
CN=WebSphere.TheConifers.com/O=TheConifers
CN=Win2k.TheConifers.com/O=TheConifers
CN=www.TheConifers.com/O=TheConifers
The key words of this slide are "PERCEIVED as"
NB: We're going to focus on MultiPlatform SOFTWARE Clustering
Perspective...
Cluster.ncf: default max 2 mates TIMES 20 clusters (LKB 185700: Cluster_Name_Cache_Size=n in notes.ini)
Cluster Mates:
"Mate" is an industry NON-PC (non politically correct!) std term
Definition:
A cluster of something is composed of mates that are logically siblings (no master)
Domino-wise, a Cluster Mate can be:
Available (normal) (SAI > SAT)
Busy (Server_Availability_Index <= Server_Availability_Threshold)
Tip: You CAN BUSY a server by setting SAT=100
clrepl pushes changes to other replicas based on information from cluster directory
(D6+ not in servertasks=, launched automatically) logs periodically into replication log (manual: tell clrepl log)
Unavailable (or unreachable/perceived as such)
Restricted (Temp=1 or Perm=2)
Invalid (never contacted)
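The mate states listed on this slide can be sketched as a small decision function. This is an illustration in Python, not Domino's actual logic: the function name, its parameters and the order in which the checks are applied are all assumptions.

```python
def mate_state(sai, sat, reachable=True, restricted=0, contacted=True):
    """Classify a cluster mate using the states named on the slide.

    sai / sat  -- Server Availability Index and Threshold (0..100)
    restricted -- 0 = not restricted, 1 = temporary, 2 = permanent
    The precedence of the checks below is an assumption for illustration.
    """
    if not contacted:
        return "INVALID"       # never contacted
    if not reachable:
        return "UNAVAILABLE"   # unreachable, or perceived as such
    if restricted in (1, 2):
        return "RESTRICTED"
    if sai <= sat:
        return "BUSY"          # index at or below threshold
    return "AVAILABLE"


print(mate_state(sai=79, sat=5))     # normal case
print(mate_state(sai=100, sat=100))  # the SAT=100 tip: always BUSY
```

Because the index can never exceed 100, setting the threshold to 100 satisfies SAI <= SAT at all times, which is why the tip works.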
Good News:
NATIVE Event/Queue Driven = CLREPL = (aka Almost Real Time) Most apps will automatically work better
ClDbDir
ClDbDir (contents)
Maintained by a server task of the same name
It's in the Enterprise Edition of Domino
Contains info about databases deployed in a cluster
Is used by Notes/Domino Cluster Aware modules to know where to push what (and what NOT to!!!) and for "failovers": a server finds the resource elsewhere!
Like CATALOG, each server updates its OWN dbs
BEWARE: 8192 is the maximum number of useful entries; you do NOT get a warning NOR an error message!
Bonus Hack: Set Config Cluster_Admin_On=1 It also works IN NON Clustered servers!
[Diagram: standard replication versus cluster replication between two servers. Labels: SERVER, UPDATE, REPLICA (Push/Pull), CLREPL (replica, Push), VIEW, DATA, DATABASE.]
Document changes are captured and trigger the Cluster Replicator via a message queue. The Cluster Replicator reads the message queue and pushes changes to all other replicas in the cluster, regardless of replication settings (aka almost "real time" replication).
CLREPL
ClRepl (cont'd)
CLREPL is a server task
It's an in-memory, QUEUE-driven event replicator (REMEMBER BATH TUB!) that SHOULD push content within 15 seconds at most, on average 7
ClRepl PUSHES content modified locally to all cluster mates containing replicas of the modified database
Tips:
It PUSHES ignoring source ACL
Check that the queue is not overfilled
Always schedule CLASS+1 of them
NB: CLREPL does NOT initialize "Replica Stubs"
It also knows what to push and what NOT to push: it skips databases marked Out Of Service (for quite obvious reasons) and also Pending Delete (cldbdir does the final push, not clrepl!)
Thus ClRepl is also sometimes called RTR or "ALMOST" REAL TIME REPLICATOR; the KEY here is in "ALMOST"
ClRepl (cont'd)
ClRepl keeps an IN-memory queue
It's a QUEUE, and can be overfilled
It's in MEMORY and is NOT disk persistent
THUS, also schedule normal replicas
Tips:
Within reason, overscheduling pull replicas is not a huge issue, because the deltas are small
i.e. an enabled Replica document from */Srv/Whatever to <each>/Srv/Whatever, PULL, every 60 mins
This will make servers catch up fast, pulling at restart time.
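Why an in-memory queue needs a scheduled safety net can be shown with a toy model. The sketch below is Python and purely illustrative: the class `ClusterPushQueue`, its `max_depth` limit and the `scheduled_pull` helper are hypothetical names, and the real CLREPL internals are not documented here.

```python
from collections import deque


class ClusterPushQueue:
    """Toy model of a bounded, non-persistent push queue."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.pending = deque()
        self.dropped = 0

    def enqueue(self, note_id):
        # A full queue cannot accept the event: the change is not queued.
        if len(self.pending) >= self.max_depth:
            self.dropped += 1
            return False
        self.pending.append(note_id)
        return True

    def push_all(self, target):
        # Normal operation: drain the queue into the mate's replica.
        while self.pending:
            target.add(self.pending.popleft())

    def restart(self):
        # The queue lives in memory only, so a restart empties it.
        lost = len(self.pending)
        self.pending.clear()
        return lost


def scheduled_pull(source, target):
    """Scheduled replication compares replicas, so it finds what the queue lost."""
    missing = source - target
    target |= missing
    return len(missing)


source, mate = set(), set()
queue = ClusterPushQueue(max_depth=3)
for note_id in range(5):
    source.add(note_id)
    queue.enqueue(note_id)

lost = queue.restart()                    # server goes down before pushing
caught_up = scheduled_pull(source, mate)  # hourly PULL repairs the gap
print(f"dropped={queue.dropped} lost={lost} caught_up={caught_up}")
```

The scheduled pull does not depend on the queue at all, which is why it catches both overflowed and restart-lost events.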
TIP: SH ST REPLICA.CLUSTER.*Q*
(Daniel to explain detail stats)
Domino server clusters have an optional workload balancing feature that lets you distribute the workload of heavily-used databases across multiple servers in a cluster. To distribute workload, you limit or restrict the work that a server can perform using the following settings in the NOTES.INI:
Server_Availability_Threshold
This setting allows you to specify the maximum availability level beyond which the server attempts to redirect user requests to other servers in the cluster. A server's availability index is recalculated each minute and compared against any threshold you set. If the index falls below the server threshold, the server becomes BUSY. The Cluster Manager redirects access requests from a BUSY server to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives access to the BUSY server. Each time a redirection occurs, Notes generates a workload balancing event in the Notes log (LOG.NSF).
Server_MaxUsers
This setting specifies the maximum number of user sessions allowed on a server. When the server reaches this limit, the server goes into a MAXUSERS state. The Cluster Manager then attempts to redirect new user requests to other servers in the cluster. To see how often requests are being redirected, check the LOG.NSF for failover events. If redirection of the user request is unsuccessful, the user receives a message, and is not allowed access to the server.
Server_Restricted
This setting enables a server to deny new open database requests and places the server in a RESTRICTED state. Users who have active connections to databases retain their connections. The Cluster Manager attempts to redirect new requests to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives a message and is not allowed access to the server. For each redirection attempt, Notes generates a failover event in the LOG.NSF. Note: You can use the Server_Restricted setting for any Domino server. This setting is not restricted to clusters.
Ensure you have full Manager access for LocalDomainServers as a Server group, or better, */Srv/Org as Manager of type Server, in all ACLs. I prefer hardcoding OUs to groups. Works always!
Make sure all applications provide roles to give access to documents with reader fields (remember computed author fields)
Give servers all rights and roles to "see" all documents
Don't use replication formulas for clustered databases
Have a scheduled replication in case some events in the CLREPL queue get lost or the server is down...
Add startup replication documents "from *" to ensure databases are up to date after server restart
Schedule replication to the name of the cluster instead of single server names (load balancing & failover)
Bad news:
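If you already have this problem, you need to delete the replication history and the CutOff Date to resolve existing replication problems
LotusScript can clear the replication history (db is an open NotesDatabase):
Dim rep As NotesReplication
Set rep = db.ReplicationInfo   ' replication settings of this replica
Call rep.ClearHistory()        ' wipe the replication history
Call rep.Save()                ' persist the change
But it cannot remove the CutOffDate (in most cases not needed)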
Changes/Recommendations
Customer was using Notes Named Networks (NNN) across WAN connection
Caused unintended traffic
Only local servers in the same NNN
Use only local directories in Directory Assistance (DA)
Used "*" to specify the local replica only (TN #1087708)
Evaluating Extended Directory Catalog for further optimization
Directory catalog could simplify working with external addresses and allow more flexibility
Fault-Recovery
Maximize server availability
Faster server restart after a crash!
Automatically collect NSDs for faster troubleshooting
LoadMon
Domino 6+ uses a new algorithm to calculate the workload of a server and the resulting AI
A number of customers reported an unpredictable, alternating AI which caused Clustering to fail.
The algorithm was enhanced in D6.0.2CF2 and additional notes.ini parameters have been introduced.
But there is another bug that is hopefully finally fixed in D6.5.6 and D7.0.2!
We traced AI at the customer site:
Live Environment
Test Environment with Server.Load
XF (Expansion Factor) is calculated based on the performance values of current transactions in relation to the minimum time for a transaction
It's the number of times the current transactions take longer than the minimum transaction time
XF values for different transactions build an overall XF
This XF is computed and converted into AI based on a Range to scale the XF (TN #1112352)
Notes.ini Server_Transinfo_Range: n is 6 by default and specifies the maximum Expansion Factor of a Domino Server. The maximum XF is calculated as 2 raised to the power n (64 by default)
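The two pieces of arithmetic on this slide can be tried out in a few lines. This is a hedged sketch in Python: the function names are hypothetical, the sample timings are illustrative, and how Domino combines the per-transaction values into one overall XF and then into an AI is not specified here, so that step is deliberately left out.

```python
def expansion_factor(running_avg_time, min_avg_time):
    """How many times slower the current transactions are than the best case."""
    if running_avg_time <= 0 or min_avg_time <= 0:
        return 1.0  # no samples yet: treat as unloaded (assumption)
    return max(1.0, running_avg_time / min_avg_time)


def max_expansion_factor(transinfo_range=6):
    """Server_Transinfo_Range n gives a ceiling of 2 raised to the power n."""
    return 2 ** transinfo_range


# Illustrative timings for one transaction type (same units for both values).
xf = expansion_factor(running_avg_time=4143, min_avg_time=429.166666666667)
ceiling = max_expansion_factor(6)
print(f"XF = {xf:.2f}, capped at {min(xf, ceiling):.2f} (ceiling {ceiling})")
```

Lowering the range lowers the ceiling, so the same slowdown consumes a larger share of the scale.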
Debugging LoadMon
se co DEBUG_LOADMON=1
Server.LoadMon.TransInfo.AI.Type = 0
Server.LoadMon.TransInfo.CurrentTransCount.CLOSE_DB = 3
Server.LoadMon.TransInfo.CurrentTransCount.DB_INFO_GET = 2
Server.LoadMon.TransInfo.CurrentTransCount.DB_READ_HIST = 0
Server.LoadMon.TransInfo.CurrentTransCount.DB_REPLINFO_GET = 5
Server.LoadMon.TransInfo.CurrentTransCount.DB_WRITE_HIST = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_NOTE_INFO = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_OBJECT_SIZE = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_SPECIAL_NOTE_ID = 0
Server.LoadMon.TransInfo.CurrentTransCount.NIF_OPEN_NOTE = 0
Server.LoadMon.TransInfo.CurrentTransCount.OPEN_DB = 3
Server.LoadMon.TransInfo.CurrentTransCount.OPEN_NOTE = 7
Server.LoadMon.TransInfo.CurrentTransCount.READ_OBJECT = 0
Server.LoadMon.TransInfo.CurrentTransCount.SERVER_AVAILABLE_LITE = 2
Server.LoadMon.TransInfo.HttpNormalize = 12000
Server.LoadMon.TransInfo.IntervalInSeconds = 15
Server.LoadMon.TransInfo.Max = 5
Server.LoadMon.TransInfo.MinAvgTransTime.CLOSE_DB = 58.1818181818182
Server.LoadMon.TransInfo.MinAvgTransTime.DB_INFO_GET = 119.875
Server.LoadMon.TransInfo.MinAvgTransTime.DB_READ_HIST = 210.666666666667
Server.LoadMon.TransInfo.MinAvgTransTime.DB_REPLINFO_GET = 88.5714285714286
Server.LoadMon.TransInfo.MinAvgTransTime.DB_WRITE_HIST = 240.2
Server.LoadMon.TransInfo.MinAvgTransTime.GET_NOTE_INFO = 110.235087719298
Server.LoadMon.TransInfo.MinAvgTransTime.GET_OBJECT_SIZE = 141.777777777778
Server.LoadMon.TransInfo.MinAvgTransTime.GET_SPECIAL_NOTE_ID = 93.333333333
Server.LoadMon.TransInfo.MinAvgTransTime.NIF_OPEN_NOTE = 1,031.4285714286
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_DB = 429.166666666667
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_NOTE = 272.987714987715
Server.LoadMon.TransInfo.MinAvgTransTime.READ_OBJECT = 134.285714285714
Server.LoadMon.TransInfo.MinAvgTransTime.SERVER_AVAILABLE_LITE = 95.3333333
Server.LoadMon.TransInfo.MinTrans = 5
Server.LoadMon.TransInfo.Normalize = 3000
Server.LoadMon.TransInfo.Range = 15
Server.LoadMon.TransInfo.RunningAvgTime.CLOSE_DB = 214.333333333333
Server.LoadMon.TransInfo.RunningAvgTime.DB_INFO_GET = 172
Server.LoadMon.TransInfo.RunningAvgTime.DB_READ_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.DB_REPLINFO_GET = 187
Server.LoadMon.TransInfo.RunningAvgTime.DB_WRITE_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_NOTE_INFO = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_OBJECT_SIZE = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_SPECIAL_NOTE_ID = 0
Server.LoadMon.TransInfo.RunningAvgTime.NIF_OPEN_NOTE = 0
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_DB = 4,143
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_NOTE = 738
Server.LoadMon.TransInfo.RunningAvgTime.READ_OBJECT = 0
Server.LoadMon.TransInfo.RunningAvgTime.SERVER_AVAILABLE_LITE = 104
46 statistics found
debug_loadmon=1
Enables LoadMon Debugging, writes additional information to server console
07.10.2003 07:08:09 Loadmon: Domino AI = 100, XF = 1
It also adds 46 additional statistics counters (server.loadmon.*)
They can be captured locally or remotely via "show server" or a statistics collection program:
nstats servername, or C-API NSFGetServerStats (...)
loadmon.ncf
loadmon.ncf in the Domino data directory stores the last information from loadmon before the server is shut down; it is loaded on server start to initialize the statistics counters
BEWARE: LARGE VALUES OVERFLOW INTO NEGATIVE VALUES
Quit, delete loadmon.ncf, restart server (do this after upgrades!)
Listen...(HACK 2)
You need to understand which fields are:
Listens (usually in specific tabs)
HostNames that are NOT Listens
For example: you can tell Domino that its HTTP hostname is the name of something else, even on a different machine; URLs will be created nicely
AI with the default interval of 15 sec and 5 sampling values does not always result in a steady AI
We needed to find values which provide:
steady values, so that cluster failover does not occur "randomly" or cause Ping-Pong effects
a reasonable time to reflect the current workload in the AI
Standard interval and sampling, 15*5, cover 75 seconds
An interval of 10 seconds with 20 sampling values covers 200 seconds
Standard Server.Load scripts do not help much, because most transactions are not used in the standard scripts
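The trade-off between interval and number of samples is simple arithmetic: the window covered is interval times samples, and a longer window damps short spikes. The Python sketch below illustrates this; the helper names are hypothetical, and whether LoadMon averages its samples exactly like this plain moving average is an assumption.

```python
def window_seconds(interval_seconds, samples):
    """Time span covered by one full set of samples."""
    return interval_seconds * samples


def moving_average(values, samples):
    """Average each value with up to `samples - 1` preceding values."""
    averaged = []
    for i in range(len(values)):
        window = values[max(0, i - samples + 1): i + 1]
        averaged.append(sum(window) / len(window))
    return averaged


print(window_seconds(15, 5))    # default settings: 75 seconds
print(window_seconds(10, 20))   # tuned settings: 200 seconds

# A flapping signal settles once it is averaged over more samples.
flapping = [100, 0, 100, 0, 100, 0]
print(moving_average(flapping, samples=1))
print(moving_average(flapping, samples=4))
```

A longer window reacts more slowly to genuine load changes, so the values are a compromise rather than "bigger is better".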
Cluster information:
Cluster name: DOMPMAC01, Server name: DOMMYP01/SRV/Customer
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 62831
Server cluster default port: *
Server availability threshold: 0
Server availability index: 28 (state: AVAILABLE)
Server availability default minimum transaction time: 3000
Cluster members (11):
Server: DOMPMA02/SRV/Customer, availability index: 79 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPMA01/SRV/Customer, availability index: 78 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPIN01/SRV/Customer, availability index: 64 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPIN02/SRV/Customer, availability index: 39 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMMYP02/OLD/SRV/Customer, availability index: 0 )) SERVER_TRANSINFO_RANGE=2 & SAT=0
Server: DOMMYP01/OLDSRV/Customer, availability index: 0 )) SERVER_TRANSINFO_RANGE=2 & SAT=0
server: DOMHEP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMVGP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMCVP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMAGP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMOEP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
Failover by replicaID uses the new servers!
CLREPL will NOT attempt to keep dead servers updated (EXTREMELY IMPORTANT!!!!!!!!)
Keep them in the cluster for a reasonably long time BUT you must check the logs and sh st replica.cluster.*q*
You can't have lost transactions, because CLDBDIR thinks the old servers are EMPTY but alive
Cluster Manager will say once a minute that they are unreachable, which is what you want for AUTOMATIC user failover... over time...
Other Caveats/Tips/Tricks:
You must make sure you edit the old servers' records in the NAB to remove mail routing
You do not want mail routing to be attempted via old dead servers
You'd better run the server decommission report BEFORE turning them off... a machine turned off produces no reports
DO NOT remove the old server from the cluster yet
Cluster Analysis
Failover
Cluster Analysis is a great feature for finding problems in your cluster
It's part of the Admin Client (Server / Analysis / Cluster ...)
Run it to find problems with ACLs, replication, missing databases, ...
Definition:
Server Initiated: due to reactive Load Balance or failures
Client Initiated: the server is dead or perceived as dead; requires the client to know how to connect to cluster mates without server assistance!
Tips: insert the address in the name:
CN=<FullyQualifiedDomainName>/Whatever
CN=194.196.39.11/Srv/LotusEmea/Net
Tips
Run it, print it and sign off all warnings you find Use FT Search to remove multiple
DEBUG_NOSTDOUT=1
DEBUG_RUN_AS_ROOT=1
it WILL allow you to run as root in UNIX/Linux it will NOT allow you later to run as non root unless you fix all the owners, permissions,etc of everything it created. (just DDT please!) Exception: Some custom restores required root
GET A NEW VERSION OF RESTORE TOOL
for performance reasons and also... for sanity of old 3rd party apps (&BACKUPs)
Replication Debugging
DEBUG_REPL=2 & DEBUG_REPL_ALL=1
sh st replica.cluster.*
(if you do not read the stats, why bother clustering?)
Replica.Cluster.Docs.Added = 26790
Replica.Cluster.Docs.Deleted = 16060
Replica.Cluster.Docs.Updated = 378378
Replica.Cluster.Failed = 30
Replica.Cluster.Files.Local = 83
Replica.Cluster.Files.Remote = 83
Replica.Cluster.Retry.Skipped = 222
Replica.Cluster.Retry.Waiting = 0
Replica.Cluster.SecondsOnQueue = 13
Replica.Cluster.SecondsOnQueue.Avg = 2
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.Servers = 1
Replica.Cluster.SessionBytes.In = 160450213
Replica.Cluster.SessionBytes.Out = 824894460
Replica.Cluster.Successful = 13484
Replica.Cluster.WorkQueueDepth = 0
Replica.Cluster.WorkQueueDepth.Avg = 0
Replica.Cluster.WorkQueueDepth.Max = 4
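Reading these statistics can be partly automated. The Python sketch below parses "name = value" lines and flags a few values worth a closer look. The function names and the 60-second threshold are assumptions chosen for illustration, not IBM guidance; only the statistic names come from the output above.

```python
def parse_stats(text):
    """Turn 'Name = value' lines into a dict of integer statistics."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            stats[name.strip()] = int(value.strip().replace(",", ""))
    return stats


def queue_warnings(stats, max_wait_seconds=60):
    """Flag values suggesting the cluster replicator fell behind (assumed limits)."""
    warnings = []
    if stats.get("Replica.Cluster.Failed", 0) > 0:
        warnings.append("some cluster replications failed")
    if stats.get("Replica.Cluster.SecondsOnQueue.Max", 0) > max_wait_seconds:
        warnings.append("an event waited a long time on the queue")
    if stats.get("Replica.Cluster.Retry.Waiting", 0) > 0:
        warnings.append("replications are waiting for retry")
    return warnings


SAMPLE = """\
Replica.Cluster.Failed = 30
Replica.Cluster.Retry.Waiting = 0
Replica.Cluster.SecondsOnQueue.Avg = 2
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.WorkQueueDepth.Max = 4
"""

for warning in queue_warnings(parse_stats(SAMPLE)):
    print("WARNING:", warning)
```

In the sample, the average wait of 2 seconds looks healthy, but the maximum of 3593 seconds shows that at least one event sat on the queue for almost an hour.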
Network_Sprayer_Address=*
Failover by Path
Useful to disable name checking after connect
I just wish it worked better (it does not always work)
DO_NOT_USE_REMEMBERED_ADDRESSES=1
Normally, you should NOT get it
What you should get are mostly failovers by RepId
It is a sign that you have multiple instances of the same replica ID on one server
You should (almost) never have duplicates
SH DIR on the server tells you about duplicates
Requested to be added to the ADMIN client next
Server_TransInfo_Normalize
Server_TransInfo_Range
default = 3000
Unit is milliseconds * 100 of a std transaction
3000 is a BAAAAAAAAD default
Fortunately Loadmon.ncf helps to save old real times for all transactions
Allegedly (rumour) it also helps NON clustered HTTP servers
Apparently some code in http checks SAI for self tuning, and a better SAI uses HW better
Useful to be able to read something if you are using a very high debug level
Remember to resume it, else you will go nuts trying to figure out what happened.
nconsole DOMPHU00 "sh ai"
1 2 48406 93 100
2 4 1380 77 93
3 8 1226 64 77
4 16 821 51 64
5 32 106 38 51
6 64 39 26 37
7 128 16 20 25
...
Current value of SERVER_TRANSINFO_RANGE is 6. <<changes suggested for SERVER_TRANSINFO_RANGE>>
nconsole DOMPHU01 "sh ai "
1 2 48826 93 100
2 4 1052 77 93
3 8 1148 64 77
4 16 711 51 64
5 32 197 38 51
6 64 40 27 38
7 128 0
8 256 4 1 5
9 512 13 0 0
10 1024 11 0 0
11 2048 1 0 0
12 4096 1 0 0
Q&A
[Diagram: standard replication versus cluster replication between two servers. Labels: SERVER, UPDATE, REPLICA (Push/Pull), CLREPL (replica, Push), VIEW, DATA, DATABASE.]
These are the support pages... Which you can get by asking for them at the back of your business card... We politely request NO REPOSTING...
The "Nines":
2 nines (99%) =circa= 88 hours/year
3 nines (99.9%) =circa= 9 hours/year
4 nines (99.99%) =circa= 52 minutes/year
5 nines (99.999%) =circa= 5 minutes/year
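The figures above follow directly from the number of hours in a year. A small Python check, assuming a 365-day year (the function name is just for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for nines, percent in [(2, 99.0), (3, 99.9), (4, 99.99), (5, 99.999)]:
    hours = downtime_hours_per_year(percent)
    print(f"{nines} nines ({percent}%): {hours:.2f} hours = {hours * 60:.1f} minutes per year")
```

Each extra nine divides the allowed downtime by ten, which is why the last nines are so expensive.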
Downtime costs per user = [ (Total hours of unscheduled downtime (25% of user population) X Hourly user salary) + (Total hours of scheduled downtime X Hourly Messaging Administrator salary) ] / Number of messaging users
NOTA BENE: R.S.E. and Change Management/Control needs
Business Users do NOT care what you do with your PLANNED down time
Business users can plan around PLANNED unavailability of mission critical systems
What Business Users can NOT usually accept is having to have both planned and unplanned downtime
YOU CAN NOT REDUCE BOTH TO ZERO on an individual component basis
Have the users KEEP a contingency "Plan B" updated for alternative/manual processing, so they realise how mission critical their system really is...
TEST their Plan B (fire drill :-)
Ask again for the "TC of not Having"
Ask again for "Not Having Aversion"
High Availability
My petty own TWO definitions
Historical = (ex-post) the FACT that a service has been available in the past
Strategic Planning:
It makes sense to Actively Plan & Design: WHAT CAN I DO TODAY to IMPROVE the probability or likelihood that a Service will be perceived as available when needed?
Apply standard tuning to OS and TCP DELETE every single other protocol you can PRINT and understand relevant KB notes Examples of TcpIp advised hacks:
EnablePMTUDiscovery=0 TcpTimedWaitDelay=30 etc
Analyze your network and Investigate and Eliminate ALL non essential traffic
[Diagrams: server disk layouts and their bottlenecks. Labels: I/O controller, controller channels, RAID5 volume, RAID(1, 5) volume, separate drive, OS kernel, page files, Notes executables, log files, Domino data (\data). Bottlenecks named: apps, Domino, I/O technology, OS technology.]
BOOT SEQUENCE: C, CD, A DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPS
MOST SW vulnerabilities are based on SW bugs
ALL software has (some) known + unknown BUGS
If a piece of software is not installed it can not run :-)
If a piece of software is not running, its bugs don't matter
UNINSTALL everything you do not absolutely need
Remove all un-needed online documentation
Win32: SPECIFICALLY UNINSTALL WORKSTATION LAN SERVICES!!!
High Availability
Something that is "likely" to be available...
Must be architected and run as such
"Architected" implies with "HEURISTICS", most of which are "difficult to quantify"
It's easier to measure Sq Feet of Grass to Mow than to quantify "Garden Landscaping Work"
WYPIWYG is actually W.Y.P.I.W.Y.G. "What You Pay Is What You Get"
I will NOT repeat here the trivial ones
Some "hidden SPOFs":
check the bill of materials for anything that has 1 mouse/keyboard/switch ==> IMPLIES SAME RACK
UPS/ISP/Site: you may have to consider multi-site/multi-homed
If you measure the wrong things... you WILL get wrong behaviours and outputs
Tips:
do NOT deploy by OS copy nor FTP, use replica
Hardcode Cluster OU in ACLs, i.e. */Srv/<whatever>
[Names]: Add to prevent pull replication issues
Credits:
Our Teachers
Lotus/IBM/Iris: too many links, thanx to all!
Our Partners: Penumbra Partnering Inc. http://www.PENUMBRA.org
Our Customers: some names in our site :-)
Average of when you can expect something to fail
Assumes everything will eventually fail - by design!
MTBF implies P(F,eventually)=1.0
Murphy's LAW ...and... Never Let a Machine Know You Need It :-)
Leverage on differences
reduce risk by using stuff that will fail eventually BUT with negative or zero correlation
Win32 code-streams have a huge in-built-correlation, so do UNIX's/Linux's Lower Correlation between Win32,Linux,etc Lower Correlation between AS400/iSeries / rest
you KNOW with a P(X fail,eventually)=1 that individual components = something = will fail (eventually) but you do not know WHEN, WHAT, HOW TRY to make cross-correlations work for you Don't forget Murphy's Law
Embedded Dis-Services
Anything having EITHER an MTBF, an SLA, or windowsupdate.com or liveupdates has "Embedded individual outages"
SLA implies Dis-Service agreement trade-offs
The Business User does NOT care for INDIVIDUAL SLAs/MTBFs
So you could, can and must Architect and Design a CLUSTERed Solution and offer a CLUSTER SLA
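Why a cluster SLA can be better than any individual SLA is easy to put in numbers. The Python sketch below rests on strong assumptions that must be stated loudly: mate failures are independent (zero correlation), failover always works, and one surviving mate is enough to deliver the service. The function name is hypothetical.

```python
def cluster_availability(mate_availabilities):
    """Probability that at least one mate is up, assuming independent failures."""
    p_all_down = 1.0
    for availability in mate_availabilities:
        p_all_down *= (1.0 - availability)
    return 1.0 - p_all_down


# Two mates at "2 nines" each give "4 nines" together, under the assumptions above.
print(f"{cluster_availability([0.99, 0.99]):.6f}")

# Three modest mates can beat one expensive box.
print(f"{cluster_availability([0.99, 0.99, 0.99]):.6f}")
```

Correlated failures (same rack, same OS code stream, same site) erode this gain, which is the reason for using components with low or negative correlation.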
Portfolio Principles
"there is nothing wrong with putting all your eggs in one basket, just watch that basket" Henry Ford
don't put all your eggs in one basket cause you can't watch it close enough don't put all your eggs in too many baskets cause you can't watch them all close enough
A Fellow Penumbra told me: You do not need a boat, you need a friend who has one and knows how to use it....
Same for a protocol analyser: you just can NOT guess the client/server dialogue (ex caching)
High Availability
The art of doing something "automagically" to improve the perceived performance of the cluster, usually by making intelligent usage of idle resources.
Proactive: Load Spreading
Reactive: performing Load "re-"Balancing by trying to fail over to less busy cluster mates