ISSUE 2
Netbackup Operational
Procedures
Veritas Netbackup Procedures
Contents
Content approval
This is Issue 2 of this document.
The information contained in this document was approved for use.
Filing
The filing reference for this document is ISL/OPS/A869.
History
Issue    Date      Author         Reason
Issue 1  23/02/05  Mick Sweeting  DRAFT 1
Issue 2  26/05/07  John Wesson
Page 2 of 61
ISL/OPS/A869, Issue 2 (Error! No text of specified style in document.)
Contents
1 Introduction
2 Contacts
6 Accessing NetBackup
6.1 NBU 4.5
7.2 Shutdown Netbackup
8.1 Introduction
8.2 File system full problems
8.3 Full List of NETBACKUP Processes
10.1 Overview
12 Restore Guidance
12.1 Background
12.2 Restore information
12.3 Media related items
12.4 Restore processes
12.5 Failover Restores
12.6 Reduction of NetBackup drive usage to allow a restore to run
12.7 Additional information
13 Backups
15.1 Exit Status 41 Network connection timed out
15.2 Exit Status 1 Backup was partially successful
15.3 Exit Status 52 Timed out waiting for Media Manager to mount volume
15.4 Exit Status 71 None of the files in the file list exist
15.5 Exit Status 219 The required storage unit is not available
15.6 Exit Status 54 Timed out connecting to client
15.7 Exit Status 84 Media write error
15.8 Exit Status 57 Client connection refused
15.9 Exit Status 96 Unable to allocate new media for backup, storage unit has none available
15.10 Exit Status 131 Client is not validated to use the server
15.11 Application Resource Alert
15.12 Slow throughput of backups
16 Netbackup Clients
16.1 Netbackup Documentation
16.2 Troubleshooting Guide
16.3 NetBackup reporting
16.4 Supportal
17 APPENDICES
1 Introduction
This document is intended for use in identifying and resolving Veritas
NetBackup problems. It is not intended to replace any software manuals and
should be used in conjunction with the current Veritas manuals. It also
assumes a certain level of Unix command experience and use of dasadmin
commands.
2 Contacts
2.1
2.2
Unix: CWHPOPSUX
Wintel: CWHPOPSNT
Netbackup: CWHPOPSSTNB
2.3
2.4
Storage management
Contact information for the storage management support group with
responsibility for Netbackup
2.5
CWBACKUP
CWBACKUP Bridge
Callout NETBACKUP
2.6
2.6.1
01344 488786
Maurice Rutherford
Office
Mobile
07710 576425
Mike Halliday
Office
Mobile
2.6.2
07850 000152
2.6.3
Hours of Cover
Full cover will be provided on a 24 hours per day, 365 days per year basis
with a 2-hour response time.
2.7
Escalation
2.7.1
HP
John Orman
(Tel 01908 656267)
Pete Meade
(Tel 0151 706 8805)
(Mobile: 07802 471232)
2.8
ADIC
Gary Page
(Tel: 0118 922 9100)
(Mobile: 07919 330945)
Maurice Rutherford
(Tel: 0118 922 9100)
(Mobile: 07710 576425)
Hardware Information
Hardware information for each site can be found in Hardware Information
under Documentation at
http://dataintegrity.intra.bt.com/
2.9
Drive Information
This can be found under Tape Library Info under Media & Library under
Drive & Media at
http://byadsm03.nat.bt.com/
2.10
2.10.1
2.10.2
Hours of Cover
Full cover will be provided on a 24 hours per day, 365 days per year basis
with a 2-hour response time.
4.1
Note: Do not leave your robtest session active any longer than you have to.
When you are in the robtest utility, communication between the tlhcd
daemon and the media servers is blocked. If a media server makes a
request to tlhcd (either a mount or a dismount request) while the robtest
utility is active, the request will be blocked. This will result in the tape
drives on that media server being switched into AVR mode until the
robtest session is terminated.
4.2
mtlib Commands
The following commands can be used to interrogate the robot using the mtlib
utility (/usr/bin/mtlib)
Command                                        Result
mtlib -l 3494c -q I
mtlib -l 3494c -q L
mtlib -l 3494c -q S                            Statistical data
mtlib -l 3494c -q K                            Count of tapes
mtlib -l 3494c -C -t <category> -V <barcode>
mtlib -l 3494c -C -t FF10 -V <barcode>
mtlib -l 3494c -D
mtlib -l 3494c -q M
For further information regarding the robtest utility, refer to ISL/OPS/B274 BRUNIX media management procedures at
_Coll=;
4.3
they contain any valid data and if so need to be put into a frozen state. If no
valid data exists then they can be deleted from the Media Manager database.
4.4
4.4.1
Caution: While robtest is running, no further library actions (i.e. mounts and
dismounts) can take place.
Select the appropriate library, which is normally option 1, and to get a list of
the possible options use the ? command.
dyadsm01 $ robtest
Configured robots with local control supporting test utilities:
TLH(0) LMCP device path = /dev/lmcp0
Robot Selection
---------------
1) TLH 0
2) none/quit
Enter choice: 1
Robot selected: TLH(0) LMCP device path = /dev/lmcp0
Invoking robotic test utility:
/usr/openv/volmgr/bin/tlhtest -r /dev/lmcp0 -d /dev/rmt2.1 003590E1A00
-d /dev/rmt1.1 003590E1A01 -d /dev/rmt3.1 003590E1A02
Opening /dev/lmcp0
Enter tlh commands (? returns help information)
?
To exit the utility, type q or Q.
audit <volser>                          - Audit library for volser
catinv <category> [<count>]             - Print library inventory by category
dm [<drive>|<IBM Device Name>]          - Dismount volser from drive
drmapclear                              - Clear drive address mapping
drmapfreeze                             - Freeze drive address mapping
drmapshow                               - Show drive address mapping
drstat [<drive>|<IBM Device Name>]      - Print drive status
eject <vol> [bulk]                      - Eject volser to standard (or bulk) output area
inv [<count>]                           - Print library inventory
libstat                                 - Print library status
m <volser> [<drive>|<IBM Device Name>]  - Mount volser
setcat <volser> <old> <new>             - Set volser category
types                                   - Print list of media types
verbose                                 - Toggle verbose mode
view <volser>                           - Print volser data
SCSI commands:
unload [<drive>|<IBM Device Name>]      - Issue SCSI unload
<drive> = d1 if drive 1, d2 if drive 2, ..., d256 if drive 256
Note: To find out the drive details on the server, issue the tpconfig -d command on
the master/media server that uses the library. The output will look similar
to that shown below.
dyadsm01 $ tpconfig -d
Index DriveName  DrivePath    Type    Shared  Status
***** *********  **********   ****    ******  ******
  0   Drive1     /dev/rmt2.1  hcart2  No      UP
        TLH(0) IBM Device Name=003590E1A00
  1   Drive2     /dev/rmt1.1  hcart2  No      UP
        TLH(0) IBM Device Name=003590E1A01
  2   Drive3     /dev/rmt3.1  hcart2  No      UP
        TLH(0) IBM Device Name=003590E1A02
Currently defined robotics are:
TLH(0) LMCP device path = /dev/lmcp0,
volume database host = dyadsm01
From this it can be seen that there are 3 drives configured to be used by
this server dyadsm01.
4.5
4.6
Drive problems
If it is a drive problem then the SCSI commands to manipulate the drive must
be issued from the master/media server. For example, to rewind a cartridge
and prepare it to be unloaded from a drive you would issue the
mt -f /dev/rmt1 rewoff command.
The most common problem is when a cartridge does not eject from a drive.
4.6.1
4.7
Library problems
If there are problems with the library, i.e. drives in AVR and cartridges not
being mounted, then attempt a manual mount of a cartridge.
For example to mount a cartridge you would run robtest (see above for
details) and then run the appropriate command.
A useful command is the libstat command. This shows the library status on
the first line, which should be:
state: Automated Operational State
Further down the output the status of the I/O station/hopper is displayed and
whether it is full and requires emptying. The status should be:
input/output status: All convenience input stations empty
                     All convenience output stations empty
Another useful command is drstat, which lists information for all the drives.
From this it is possible to tell if a cartridge is in the drive and its identity.
Drive 3 information:
drive number:      3
device name:       003590B1A02
device number:     0x203140
device class:      0x11 - 3590 Model B1A/other
device category:   0x0000
mounted volser:    <none>
mounted category:  0x0000
device states:
4.8
5.1
5.2
E.g.
dybkup01 $ dasadmin ld
listd for client: successful
drive: DRIVE1 amu drive: 01 st: UP type: N sysid: client: dycase01
volser: cleaning 0 clean_count: 10
drive: DRIVE2 amu drive: 02 st: UP type: N sysid: client: dycase01
volser: cleaning 0 clean_count: 29
drive: DRIVE3 amu drive: 03 st: UP type: N sysid: client: dycase01
volser: cleaning 0 clean_count: 28
drive: DRIVE4 amu drive: 04 st: UP type: N sysid: client: dycase01
volser: cleaning 0 clean_count: 28
drive: DRIVE5 amu drive: 05 st: UP type: N sysid: client: dyespwp1
volser: cleaning 0 clean_count: 14
drive: DRIVE6 amu drive: 06 st: UP type: N sysid: client: dyespwp1
volser: cleaning 0 clean_count: 1
drive: DRIVE7 amu drive: 07 st: UP type: N sysid: client: dyespwp1
volser: cleaning 0 clean_count: 15
drive: DRIVE8 amu drive: 08 st: UP type: N sysid: client: dyespwp1
volser: cleaning 0 clean_count: 24
drive: DRIVE9 amu drive: 09 st: UP type: N sysid: client: dyvsisb1
volser: cleaning 0 clean_count: 6
drive: DRIVE10 amu drive: 10 st: UP type: N sysid: client: dyvsisb1
volser: cleaning 0 clean_count: 3
drive: DRIVE11 amu drive: 11 st: UP type: N sysid: client: dybkup01
volser: DEF595 cleaning 0 clean_count: 17
drive: DRIVE12 amu drive: 12 st: UP type: N sysid: client: dybkup01
volser: cleaning 0 clean_count: 7
drive: DRIVE13 amu drive: 13 st: UP type: N sysid: client: dynebk01
volser: cleaning 0 clean_count: 5
drive: DRIVE14 amu drive: 14 st: UP type: N sysid: client: dynebk01
volser: cleaning 0 clean_count: 23
drive: DRIVE15 amu drive: 15 st: UP type: N sysid: client: dynebk01
volser: DEC406 cleaning 0 clean_count: 2
To display the drive list for a specific client
dasadmin ld dybkup01
dybkup01 $ dasadmin ld dybkup01
listd for client: dybkup01 successful
drive: DRIVE11 amu drive: 11 st: UP type: N sysid: client: dybkup01
volser: DEF595 cleaning 0 clean_count: 17
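When many drives are listed, it can help to filter the dasadmin ld output down to just the drives that currently hold a cartridge. The following is a minimal sketch only: the function name list_mounted_volsers is hypothetical, and it assumes the output layout matches the samples above (a "drive:" field followed by the drive name, and a "volser:" field followed either by a volser or directly by the word "cleaning" when the drive is empty).

```shell
# list_mounted_volsers (hypothetical helper, not part of NetBackup or DAS):
# reads `dasadmin ld` output on stdin and prints "<drive> <volser>" for
# every drive that has a cartridge mounted.
# Assumption: layout as in the sample output in this document.
list_mounted_volsers() {
    awk '{
        for (i = 1; i < NF; i++) {
            # "drive:" names the drive, except in the "amu drive: NN" pair
            if ($i == "drive:" && (i == 1 || $(i - 1) != "amu"))
                drive = $(i + 1)
            # an empty drive shows "volser: cleaning ...", so skip that case
            if ($i == "volser:" && $(i + 1) != "cleaning")
                print drive, $(i + 1)
        }
    }'
}

# Typical use on a server where dasadmin is installed:
#   dasadmin ld | list_mounted_volsers
```

With the full listing shown earlier, this would report DRIVE11 with DEF595 and DRIVE15 with DEC406.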
Description of Parameter
drive
Drive number
st
type
Drive type
sysid
Reserved
client
volser
cleaning
clean count
client
To see the range of tapes assigned use the dasadmin qvolsrange command
Note: This command returns a list of volsers, which are accessible to the specified
client within the requested volser range.
dasadmin qvolsrange beginvolser endvolser count (client name)
e.g. dasadmin qvolsrange DTP216 DTP220 8
Parameter     Description of Parameter
beginvolser   The beginvolser specifies the first volser in the range
endvolser
count
client name
5.3
5.4
Description of Parameter
media-type
volser
drive
5.5
Description of Parameter
media-type
volser
Description of Parameter
media-type
area
5.6
Description of Parameter
Tells DAS to remove the volser from the catalog
media-type
volser-range
area
5.7
Drive Problems
Robotic problems are more straightforward to identify than drive problems.
Determine what type of robot you are using. Predominantly we use ADIC robots at the moment (AMLs).
5.8
AMLs
The AMU is a PC that sits on the front of the robot and handles all requests
for action from all the backup servers. The tape drives sit in the robot, but
there is no direct electrical connection between the robot and the drives. The
robot arm carries out the requests given to it by the AMU. The racking is used
to store the tapes. The AMU is contacted using standard IP addressing. ADIC
provide a binary, dasadmin, that can be used to communicate with the library.
A full list of dasadmin commands and their use can be found by
typing dasadmin -?
5.9
Common Problems
The drives are in AVR mode
Check the messages/syslog for errors. Then either:
Ping the AMU from the server; its IP address can be found in /etc/hosts.
If there is no response from either server, call out ADIC. This could also be a
network problem, for example network connectivity to the box has been lost, in
which case nothing can be done until the network has been restored.
If dasadmin ld returns a list of drives, select a tape from Media Manager that
does not have a time assigned value and try to mount it in a drive using the
command:
#dasadmin mount -t 3590 volser DRIVEn
If this returns "unexpected response code received from the amu", call out
ADIC. If it works, either use reset with drive control, or issue
#mt -f /drive off
and
#dasadmin dismount -t 3590 volser
5.10
Points To Note
Some more observations on problems encountered with ADIC robots.
The tlmd daemon netbackup uses to talk with the robot, will periodically test
the state of the tape drives. If drives are in AVR mode (not DOWN-TLM)
then when the daemon gets a positive response the drives come back up
automatically.
On the syslog you may see the message "robot encountered an error handling
a volume", this could cause intermittent problems. You may be able to get
away with freezing the tape or if it is scratch, to move some from another
pool. (see exit status 96 for detailed instructions on moving media.)
If the robot door is opened, the robot will stop and ADIC will need to be
called out to restart the machine.
dasadmin ld2 may have to be used to view tape drive numbers above 15.
Use drive control to reset the drive. This effectively does mt -f /device-file off
and tells the robot to put the tape away. If this does not work, make sure backups
are not using that tape drive, and issue mt -f /device-file off. If this takes you
back to the prompt (i.e. it has worked) then instruct the robot to put the tape
away by issuing the dasadmin dismount command, then try the drive out
again. If the drive fails again, then an engineer will have to be called, because
it may be that the drive can read the header label but not position on the tape.
The sequence of events would be:
mt -f /device-file off
Run a backup, and check on job monitor to see if the tape has positioned.
If mt -f /device-file off reports an error then call an engineer. Take the
drive down in NetBackup until the engineer has dealt with it.
Firstly interrogate the robot to determine whether the robot has been putting
tapes in the drive. Use dasadmin ld to list the drives.
If there is a tape on the drive, check that the tape isn't just sitting on the lip
of the drive by instructing the robot to put the tape away. This can be done by
issuing the command
#dasadmin dismount -t 3590 volser
If the command is successful then the tape was on the lip.
If the robot reports that the drive cannot be unloaded, then issue
#mt -f /device-file off
If this takes you back to the prompt (i.e. it has worked) then instruct the
robot to put the tape away, then try the drive out again. If it doesn't work, try
#mt -f /device-file status
then call out an engineer. It is worth checking the syslog again, because the
previous actions may have forced UNIX to report an I/O error.
5.11
Library Problem
If there are problems with the library, i.e. drives in AVR and cartridges not
being mounted, check the messages log on the master server for the
following:
Feb 10 09:39:30 dybkup01 tlmd[25753]: [ID 897060 daemon.error] TLM(0)
dismount failure for volser SDY815 on drive DRIVE12, d_errno = 10, The
AMU was unable to communicate with the robot.
Feb 10 09:39:30 dybkup01 tlmd[18661]: [ID 160136 daemon.error] TLM(0)
going to DOWN state, status: Robot hardware or communication error
Feb 10 09:39:55 dybkup01 tlmd[26277]: [ID 969665 daemon.error] TLM(0)
dismount failure for volser SDY470 on drive DRIVE11, d_errno = 10, The
AMU was unable to communicate with the robot.
Feb 10 10:33:13 dybkup01 tlmd[18661]: [ID 861719 daemon.error] TLM(0)
drive DRIVE11 (device 0) is being DOWNED, status: Robotic dismount
failure
On seeing these messages, initiate a call with the vendor ADIC.
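The log check described above could be scripted along the following lines. This is a sketch under stated assumptions: the function names tlmd_errors and robot_comm_fault are hypothetical, each syslog entry is assumed to occupy a single line in the real file (the samples above are wrapped for the page), and the two phrases searched for are taken only from those samples.

```shell
# tlmd_errors (hypothetical helper): print tlmd daemon.error entries.
#   $1 = path to the messages/syslog file, e.g. /var/adm/messages
tlmd_errors() {
    grep 'tlmd\[[0-9]*]:.*daemon\.error' "$1"
}

# robot_comm_fault (hypothetical helper): exit 0 if the log contains either
# of the fault phrases seen in the sample messages in this document.
robot_comm_fault() {
    tlmd_errors "$1" |
        grep -E -q 'going to DOWN state|unable to communicate with the robot'
}

# Typical use:
#   robot_comm_fault /var/adm/messages && echo "Raise a call with ADIC"
```

The phrase list is deliberately short; other tlmd errors would still be shown by tlmd_errors and should be reviewed by hand.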
6 Accessing NetBackup
6.1
NBU 4.5
NetBackup can only be accessed in this way from a NetBackup Master server,
not from a client. Once logged on to the server, access NetBackup using one
of the following methods:
From the toolbar: select Start > Programs > Veritas NetBackup > NetBackup
Administration, and from there access the required panel.
Click your shortcut icon (if one has been set up).
You may also wish to amend your profile to include the NetBackup
directories in your PATH.
vmd
Ltid
Avrd
Tlmd/tlhd
These daemons run on the server only, and not on any of the clients. On slave
servers only the media manager daemons run.
Of these the most significant are bprd and ltid.
Bprd starts bpdbm, and ltid starts vmd, avrd, and tlmd or tlhd.
Tlmd and tlhd are robotic daemons, which are different depending upon
the type of robot used. Tlmd refers to ADIC robots, whilst tlhd refers to
IBM robots.
7.1.1
7.1.2
Daemon Descriptions
Bprd - On master servers this daemon handles requests for backups and
restores and scheduled backups.
Bpdbm - On master servers bpdbm handles all the configuration, error and
file databases.
Ltid - On master and slave servers this daemon controls the reservation and
management of volumes.
Starting NetBackup
The script bp.kill_all in /usr/openv/netbackup/bin/goodies will kill off all
netbackup daemons under normal conditions. If after running bp.kill_all
there are daemons hung (usually bpsched processes) these should be killed,
using kill or kill -9 if necessary.
Alternatively, if you experience problems shutting down Netbackup using the
above script, use the nbu_kill script, which can be found in /usr/openv/btscripts.
7.1.3
If just tlmd has not started, or has crashed, it can be started using ./tlmd
This is the recommended command when starting Netbackup.
If avrd has crashed, issue /usr/openv/volmgr/bin/stopltid to stop ltid,
and then /usr/openv/volmgr/bin/ltid to restart ltid.
7.2
Shutdown Netbackup
The script bp.kill_all in /usr/openv/netbackup/bin/goodies will kill off all
netbackup daemons under normal conditions. If after running bp.kill_all
there are daemons hung (usually bpsched processes) these should be killed,
using kill or kill -9 if necessary.
Alternatively, if you experience problems shutting down Netbackup using the
above script, use the nbu_kill script, which can be found in /usr/openv/btscripts.
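The check for daemons left behind after bp.kill_all could be sketched as follows. The function name leftover_nbu_pids is hypothetical, and the sketch assumes that every NetBackup process appears in the process listing with a command path under /usr/openv, as in the bpps output shown elsewhere in this document.

```shell
# leftover_nbu_pids (hypothetical helper): reads `ps -ef` or `bpps -a`
# style output on stdin and prints the PID (second column) of every
# process whose command line contains /usr/openv/.
leftover_nbu_pids() {
    awk '/\/usr\/openv\// { print $2 }'
}

# Typical use after running bp.kill_all (try a plain kill first and
# reserve kill -9 for processes that do not respond):
#   ps -ef | leftover_nbu_pids
```

Listing the PIDs rather than killing them automatically keeps the decision with the operator, which matches the manual procedure above.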
7.2.1
Problem resolution
It is rare that there are problems with the NetBackup daemons.
Common problems are:
The daemons weren't started after a reboot. In this case log on to the
server, su - root, then issue ./netbackup from
/usr/openv/netbackup/bin/goodies.
The hostname on the box has changed. This should be reviewed with
NetBackup Support and Service Delivery.
Tlmd can crash if there has been a severe robot problem. To start this
daemon, log on as root and issue ./tlmd from /usr/openv/volmgr/bin.
If after trying the above Netbackup does not start, then it will be necessary to
call out NetBackup support.
Introduction
This directory (/usr/openv/netbackup/logs) is where detailed activity logs will
be placed on the NetBackup client box if certain sub-directories exist. These
sub-directories should only be created if unexplained problems are occurring
with the NetBackup product and more information is required to isolate the
problem. For further information on Veritas NetBackup Logging procedures
see ISL/OPS/B194 at
_Coll=;
Warning:
Some of these logs can potentially grow very large, and should only be
enabled if unexplained problems exist.
8.2
8.3
-restores
-backups
-archives
-manual/immediate backups
bpsched
-backup scheduler
-uses information from the class & storage unit databases to determine what
clients to start, when to start them, and
bpdm
-disk manager
-during backups and restores, one of these is started (on the server with the
storage unit) for each client backup or restore
-used to display info in the Media Reports screen when you select Media List
bpbrm
-backup/restore manager
-during backups and restores, one of these is started (on the server with the
storage unit) for each client backup or restore
-responsible for managing both the client and the media manager processes.
uses error status from both to determine ultimate
bpdbm
-database manager
bpcd
-"client daemon"
bparchive
bpbackup
bpbkar
bplist
bprestore
tar
bp
8.4
8.5
8.6
Are the tape drives down and remain down when you bring them up?
If so this usually means you have a drive related problem in the first instance.
At the end of your initial diagnosis you may not be completely sure whether
the problem is robot or drive related, but if in doubt start looking at the drive
first, as this is much more common a problem.
8.6.1
Points To Note
The following points to bear in mind when dealing with drive problems come
from experience rather than methodical diagnosis, but they have
happened on more than one occasion and are not easy to spot.
Ltid, avrd, vmd, tlmd or tldd will not start after a reboot, or after you have
started netbackup. Look in the syslog first, it always tells you why they did
not start.
- If the device file is not in UNIX then ltid will not start, but this is reported
in the syslog.
After a reboot, it has been known (particularly on Compaq) for the device
files to be renamed. This has the same effect as removing the device files. It
is also a nightmare trying to map the new files back on. There is no easy way
to identify whether this has happened, but if an ls -al command is issued on the
directory it should be possible to identify if there are new device files. SCSI
extenders complicate the situation!
Many of the drive problems are caused by SCSI extenders, either failing or
having a glitch. Resetting the extenders may be enough to fix a problem,
however from our point of view we just call out an engineer; it will be the
engineer who will determine what action is required.
without requiring a reboot. Compaq also are fairly robust. HP and Sun are
questionable, although Sun should not require a reboot under normal
circumstances. If the SCSI cable is unplugged from the back of the host
machine, then most times you will have to reboot the machine. If you cannot
use a device after it has been fixed, always seek guidance from TSG before
arranging to reboot a box.
8.7
Deconfigure
Prior to deconfiguring:
Check on Device Manager for the device name by which the drive in
question is set, or use tpconfig -l, which displays the same information.
In brief the steps are:
1. Check
dumpconf | grep tc
2. Deconfigure
devctl -d tcn
devctl -d scsibusnn
3. Reconfigure
devctl -c qcicn
4. Check
dumpconf | grep tc
9.1.2
9.1.3
Reconfiguration
Prior to reconfiguring,
If you have misplaced the deconfig details, you can find them in the relevant
ktlog via /usr/adm/ktlog/yyyy/mm/dd - e.g. /usr/adm/ktlog/1999/10/04. You
should find output similar to the following:
9.1.4
# devctl -c qcic5
devctl: Found scsibus18, tc0
To check what the current settings are and confirm the above commands,
enter "dumpconf | grep tc" which will produce output similar to the
following:
tc0     tc  0       0x00000000  S  scsi  scsibus18
tc1     tc  1       0x00000000  S  scsi  scsibus22
tc2     tc  2       0x00000000  S  scsi  scsibus26
tc3     tc  3       0x00000000  S  scsi  scsibus30
tcpmux      pseudo  -
If the original settings are showing, the work has been completed
successfully.
If, on using the 'devctl -c qcicn' command to reconfigure, only one new
device shows, try a query on the found device e.g. 'devctl -c scsibus18'
which may detect the other device.
Note: If the system responses diverge in any way from the examples given
above, contact Sequent TSS Group.
Overview
There is a requirement to use SCSI extenders to connect some tape drives
because there is a physical limit of 25m on SCSI, which in most computer
halls is not enough to allow for full use to be made of expensive robotic
libraries.
The use of SCSI extenders greatly increases the number of points of failure
that are present in the configuration. It also complicates problem
resolution.
NOTE: the resetting of SCSI extenders is usually done by the Data Centre
Operations teams. However the procedure is as follows:
1. Remove the covers at the back of the robot. Identify the offending drive.
2. On top of the drive there is a small LCD panel connected via a cable.
On this panel, using the arrow keys, scroll down to "UNLOAD DRIVE" and
press "RETURN". This will eject any tape which might be mounted.
3. Next, go to the PC at the end of the silo and send a 'KEEP' signal as
follows:
from the open window, click on 'COMMANDS' then from the drop down
menu, click on 'KEEP'.
move the cursor to "SOURCE" and enter drive number e.g. D21, must be
in upper case.
Then click on "EXECUTE".
4. Go back to the LCD display panel connected to the drive and from the
keypad menu, press 'E' then '2'.
You should see 'FIBLEN' on the display. This indicates that you have
carried out a 'fibre length' check. Then press 'C' to clear.
However, if nothing happens after entering '2' then power OFF the drive.
The ON/OFF switch is at the back of the drive on the right hand side.
Then power back ON.
5. Reset both SCSI extenders.
Description of Parameter
-freeze
-unfreeze
-suspend
-unsuspend
-ev
Specify media id
-h
specify host
11.2
Tape Management
# bpadm
Reports
Media
Media Summary
11.2.1
Killing Processes
# kill -9 {PID Number}
# bpps -a
root 1206 1 0 Mar 22 ? 0:32 /usr/openv/netbackup/bin/bprd
root 1218 1 0 Mar 22 ? 1:35 /usr/openv/netbackup/bin/bpdbm
11.2.2
11.2.3
Logs
These may be in various locations dependent on the platform; however, check
in the following first.
/usr/adm/ktlog/ - Sequent
/usr/spool/adm/
/usr/adm/syslog/
/var/adm/syslog/ - HP
/var/adm/messages - Sun
errpt -a - AIX
/usr/openv/netbackup/logs/bp*
11.2.4
WHO Command
who -b (Shows when system was last re-booted)
11.2.5
12 Restore Guidance
12.1
Background
This section is intended for use by groups who may be required to perform
restores on Unix Systems using NetBackup. There are a number of different
ways to run a restore. These are:
1. From the Admin GUI.
2. From the command line interface, the bpadm or bp panels.
3. From the command line directly.
12.1.1
Introduction
The restore facility for Netbackup is extremely powerful. Its use must only be
considered with sufficient justification by the requestor. It can be very easy to
destroy a UNIX box with Netbackup, particularly if you are restoring files
in / or /usr. If you are at all unsure about what is being requested by the user,
then refuse to complete the request. In the past many unnecessary restore
requests have been executed because the customer was not completely sure
what was required to fix the problem.
In short do not be afraid to ask why they require restores.
Wherever possible, the file or files should be restored to an alternative path.
If the user wishes to restore to the original directory, check whether data
really needs to be overwritten. A restore with overwrite not allowed is much
safer and should be considered the normal method for performing restores. Always
impress on the requestor the importance of performing restores safely, even if
it means the user has to do some work.
12.1.2
NetBackup
NetBackup backs up data to cartridge. To recover this data is straightforward
as long as the correct information is used to initiate the restore process.
It must be remembered that a file name in UNIX does not always refer to a real
file, as it could be a hard or soft link to another file. If NetBackup is asked
to back up a soft link it will not follow it! So to ensure that the data is
backed up, the target of the link must be specified. This also holds true when
attempting to restore the file. For hard links the situation is slightly
different, but as hard links are not commonly used, details are not included in
this document.
If you attempt to restore a soft link then no data will be restored. Therefore
providing the name of the soft link as the file to be restored is worthless.
12.2
Restore information
There are some pieces of information that must be provided to allow a restore
to be carried out. These are as follows:
1. The host name of the box from which the data was backed up.
2. The operating system of the box: is it a UNIX or an NT client?
3. The date and time, or dates and times between which, the backup was taken.
The shorter the time between the start and end the quicker the search through
the NetBackup database will be, and hence the restore will take less time.
4. Should the data be placed in an alternate location, or can the data be
overwritten? (Insist the requestor specifies what is required).
5. Should the file, files or directories be renamed when restored?
6. The host name of the target box, if different to the source box.
Cross client restores are only possible if both the target and source box are
connected to the same NetBackup server.
Cross client restores must be enabled within NetBackup.
7. The fully qualified names of the files or directories to be restored.
8. A contact number for when the work is complete.
9. Acquiring a timeline:
In order for Bridge Operations to effectively manage high priority restores
(P1s or P2s), ensure we have a documented timeline provided by the
requester in order to be able to checkpoint our progress. The onus is on
Bridge Operations to obtain this information from the requester before we
take on the responsibility of the restore. Ensure this information is
documented in the Clarify case. This will give us and the MIT team
some visibility of expected progress and will allow us to make a
considered judgement for further escalation. This timeline can only be used
for very rough guidance. It will vary depending on the size of the database
and whether locally attached drives are used. If locally attached drives are
used then the restore should take no longer than the backup. On a shared
server this should be estimated at 20% longer than the backup.
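As a worked illustration of the rule of thumb above, the estimate can be computed from the known backup duration. This is a rough sketch only: the function name estimate_restore_minutes and the local/shared keywords are hypothetical, and the figure returned for locally attached drives is simply the upper bound stated above (the backup duration itself).

```shell
# estimate_restore_minutes BACKUP_MINUTES DRIVE_TYPE (hypothetical helper)
#   DRIVE_TYPE: "local"  = locally attached drives (no longer than the backup)
#               "shared" = shared server (backup time plus 20%, rounded up)
estimate_restore_minutes() {
    backup_minutes=$1
    case $2 in
        local)  echo "$backup_minutes" ;;
        shared) echo $(( (backup_minutes * 120 + 99) / 100 )) ;;
        *)      echo "usage: estimate_restore_minutes MINUTES local|shared" >&2
                return 1 ;;
    esac
}
```

For example, a backup that took 90 minutes gives an estimate of 108 minutes on a shared server and at most 90 minutes with locally attached drives. As stated above, such figures are for very rough guidance only.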
12.3
The class name, client name, start and end dates (and possibly times) are the minimum
requirements to get a comprehensive media list. The more information you can specify, the
more accurate the list will be.
bpimagelist -media -d 07/01/2002 -e 07/05/2002 -client tpedm01-fe -class ORACLE
This will provide a list of all the cartridges used for ORACLE class backups between the dates
specified for the TPEDM01 client.
12.4
Restore processes
If a restore is required to resolve a service affecting problem with a
production system then a problem record should be raised and can be used to
initiate and document the restore.
Note: All restores run by Operations use the overwrite = no option by
default. This ensures that customer data is not inadvertently overwritten.
Therefore, for any restore request where the data is to be restored to its
original location, the affected files will have to be deleted before the
restore can complete.
If a restore is required as part of some testing or development work then a
change record should be raised. This will allow the implementors to plan
their work effectively.
The record will be used to track the progress of the restore, to deal with any
unforeseen problems (e.g. cartridges not available), and to request the
appropriate access. Restores must be run from an account that has the correct
access level to the files and directories; usually the root account is used.
If it is for a number of files or directories then a list file should be created.
This list file should contain each fully qualified file name, one on each line.
This file can then be used as an input file for the restore and save a lot of
work for the Operations team.
Warning:
The list file must not contain any blank lines or extra blank spaces at the
end of the file names. If it does the restore process will fail with an invalid line
length message and an Exit Status 144 error code.
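A list file can be checked for these two problems before the restore is
submitted. The following is a minimal sketch; the function name
check_list_file is hypothetical and is not part of NetBackup.

```shell
# check_list_file FILE
# Hypothetical pre-flight check: reports blank lines and lines that end in
# a space, tab or carriage return. Returns non-zero if any are found.
check_list_file() {
    listfile=$1
    if [ ! -r "$listfile" ]; then
        echo "cannot read $listfile" >&2
        return 2
    fi
    # -n prefixes each offending line with its line number.
    bad=$(grep -n -e '^$' -e '[[:space:]]$' "$listfile")
    if [ -n "$bad" ]; then
        echo "Problem lines in $listfile:" >&2
        echo "$bad" >&2
        return 1
    fi
    return 0
}
```

Running check_list_file /tmp/restore.list (an example path) before invoking
the restore prints the numbers of any offending lines.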
If the files need to be renamed individually then a rename list file should be
created. This file must contain the following syntax:
change original_fully_qualified_file_name to new_fully_qualified_file_name
One line for each rename must be specified; no wildcards or substitution can
be done.
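Because every rename needs its own line, a rename list file is easier to
generate than to type. The sketch below assumes all files move from one
directory prefix to another; the function name make_rename_lines and the
example paths are hypothetical.

```shell
# make_rename_lines LISTFILE OLD_PREFIX NEW_PREFIX
# Hypothetical helper: prints one "change ... to ..." line per file in
# LISTFILE, replacing OLD_PREFIX at the start of each path with NEW_PREFIX.
make_rename_lines() {
    listfile=$1
    old_prefix=$2
    new_prefix=$3
    while IFS= read -r path; do
        case "$path" in
            "$old_prefix"*)
                # Strip the old prefix, then prepend the new one.
                suffix=${path#"$old_prefix"}
                echo "change $path to $new_prefix$suffix" ;;
            *)
                echo "skipping (prefix does not match): $path" >&2 ;;
        esac
    done < "$listfile"
}
```

For example, make_rename_lines /tmp/restore.list /data /restore/data >
/tmp/rename.list would produce a file suitable for use as the rename list.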
Warning:
12.4.1
Am I sure that the host name is correct for the restore? (This is particularly
important for E10k boxes.)
Has the user requested a restore to an alternative path, and if so, what do
you need to do about links?
Once you are sure you know the answer to each of the above, then proceed.
12.4.2
12.4.3
Logon to Root
NetBackup uses standard UNIX file permissions to control access to files so
unless you have read/write permissions to the file you will need to use root.
Remember that you may not even see the file in NetBackup if you do not
have the correct permissions. Using root will ensure that you can always see
the client.
12.4.4
Invoke bp
Use /usr/openv/netbackup/bin/bp
At this point you select the restore menu, and then you will be presented with
options to restore from different kinds of backups. If you are restoring from
normal filesystem backups, then select restore from backups. If you are
restoring from raw partitions then select restore from raw. If you are restoring
NT then select restore ms-windows.
Source client/Destination Client: both these entries must be the same; if they
differ, the file will be restored from one machine to another. Make sure that
the client name is as known to NetBackup, with all relevant suffixes (e.g. -fe).
Date Range: ensure that you have the correct date range and that it is as
narrow as possible. When Netbackup searches its database it could be
uncompressing large amounts of data about backups. Obviously the wider the
search the more time it will take to do this. You also run the risk of filling
the /usr/openv disk partition.
File Path: specify the file path you require to search on.
Set your directory depth for searching; this defaults to 1. If the depth is
set too low it may seem that the file is not backed up, because only files or
directories down to that level are shown. Setting the level to zero means you
will see everything down from the path, but be careful with its use, as it can
give you masses of data to look at.
When you are selecting files and directories you are searching the NetBackup
database so this may take some time. Also be aware you can drill through
directories using zoom in and out, this can be useful if you are searching for a
file and the user is unsure where it was.
If you need to you can build up a restore job, for instance you may select
some files for restore, change your path and select some more.
When initiating a restore read the prompts carefully. Decide where you wish
to place the progress log. The output from this log can be significant in size if
you are changing the paths for a number of files.
12.4.5
Running a restore
1) Firstly, has the job actually started or is it queued? If the job is
queued, Section 12.6 (Reduction of NetBackup drive usage to allow a restore
to run) shows you how to ensure sufficient resources are made available to
allow the restore to commence.
2) Once the restore has been initiated, start monitoring the progress log. The
progress logs are named in the format bplog.rest.xxx and, unless you specified
differently whilst setting up the restore, are created in the root directory,
e.g.
tail -f /bplog.rest.001
3) Ensure the tapes requested are successfully mounted. If the log states the
tapes are not in the library, or the restore job appears to hang, check the
physical location of the tapes using the relevant dasadmin or mtlib commands
as documented in the respective Robotic Library sections. For tapes not in the
library, inform the relevant Hardware site and ask them to locate the
requested tapes and insert them back into the library. Once the tapes have
arrived back onsite and been placed in the library, ensure they are marked as
onsite.
Note: in some circumstances when vaulting has run, the required tapes will be
physically on drives but marked as not available to the robot, which will cause
the restore to hang. In these circumstances the status of the tape will need to
be changed back and any pending requests either resubmitted or denied as
appropriate. To check for this, run vmoprcmd on the relevant master/media
server and check the output for any Pending Requests. Running
vmoprcmd -resubmit <request id> will allow the restore to continue.
4) Under Netbackup 4.5 you can check up on the restore using the activity
monitor. This will show you the throughput in KB/s, Number of files restored
and also the percentage complete. One other way to check up on what is
happening is via the URL <http://byadsm03.nat.bt.com/activity/>.
When you are looking at the restore job it may take a while before a restore
kicks in. This is because the restore is searching a large NB client database
for the files; this may take up to 20 minutes. On heavily used servers, the
restore may time out. If so just reissue the restore request.
If the file is in a large system single stream backup you may notice it takes a
long time to restore relatively small files, after the restore is positioned on the
tape. This is because the restore has to read through the backup to find the
files on the tape.
You may see a "waiting for mount of tape" message. This could be because the
drives are in use for backups, or because the tape is offsite. Check for an
offsite tape by looking at the requested tape in Media Manager.
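For step 2 above, when several bplog.rest.xxx files have accumulated, the most
recent one is usually the one to follow. A minimal sketch, assuming the logs
are in the root directory unless another directory is given; the function
name latest_restore_log is hypothetical.

```shell
# latest_restore_log [DIR]
# Hypothetical helper: prints the most recently modified bplog.rest.* file
# in DIR (default: the root directory), or nothing if none exist.
latest_restore_log() {
    dir=${1:-/}
    # ls -t sorts newest first; head keeps only the first entry.
    ls -t "$dir"/bplog.rest.* 2>/dev/null | head -n 1
}
```

The result can then be followed with tail -f "$(latest_restore_log)".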
NetBackup Administration
-----------------------
s) Storage Unit Management...
c) Class Management...
g) Global Configuration...
r) Reports...
m) Manual Backups...
x) Special Actions...
u) User Backup/Restore...
e) Media Management...
h) Help
q) Quit
ENTER CHOICE:
Select u
Master Server: tpcds1
Client: tpcds1
Main Menu
---- ----
b) Backup...
r) Restore...
h) Help
q) Quit
ENTER CHOICE:
Select r>
Master Server: tpcds1
Client: tpcds1
Restore Menu
------------
b) Restore Files and Directories from Backups...
a) Restore Files and Directories from Archives...
r) Restore From Raw Partition Backups...
f) Restore From Auspex FastBack Backups...
d) Restore From True Image Backups...
o) Restore From Oracle DB Backups...
i) Restore From Informix DB Backups...
s) Restore From Sybase DB Backups...
t) Restore From SQL-BackTrack DB Backups...
p) Restore From SAP DB Backups...
2) Restore From DB2 DB Backups...
m) Change Master Server...
h) Help
q) Quit Menu
ENTER CHOICE:
Select b>
Path: /usr/users/storage/
Start Date: 11/29/98 22:11:39 Master Server: tpcds1
End Date: 12/01/99 23:59:59 Source Client: tpcds1
Files Selected: 0
Destination Client: tpcds1
Directory Depth: 1 level
Class Type: Standard
Display Mode: Brief
Keyword Phrase:
Restore Backups
---------------
s) Select Files and Directories... p) Change Path...
e) Edit/View Selected Files... d) Change Date Range...
i) Initiate Restore
c) Change Directory Depth...
x) Change Display Mode to Verbose m) Change Master Server...
l) List Backup Images...
b) Change Source Client...
a) Specify Alternate Path...
t) Change Destination Client
q) Quit Menu
y) Change Class Type
h) Help
k) Change Keyword Phrase
ENTER CHOICE:
Then p> to change path
d> to change date range
c> directory depth 0 gives the most information
s> to select your files/directories
When you are satisfied with the above
Enter i> to initiate restore
Finishing A Restore
Send the restore log to the requestor, and ask them to verify the restore.
12.4.6
Have the directory level settings been too high or too low?
Do I need to be root?
Finally it may be that the file was not backed up, this should be checked
against the policy to see if it covers the requested files/directories.
12.5
Failover Restores
A failover restore is needed when a Media Server is unreachable and your
restore job is trying to use its tape drives. The bpmedia -movedb command is
then required to move the NetBackup catalog entries for the media you are
restoring from to an alternative Media Server.
The restore attempts failed because NetBackup was trying to mount the
relevant tape into the nfmpramb drive which was inoperable at the time. The
restore request itself was resolved by copying an identical file across from the
nfmprama server, but the need to provide a means of managing this problem
in future remained.
Following guidance re. failover restores in the manuals, C/R 5400777 was
raised to update the byadsm01 bp.conf in the hopes that this would enable the
automatic switch to alternative drive(s) specified in the bp.conf file, in the
event of similar restore failures. Unfortunately, the restore tests which
followed failed and case 140017461 has been raised with Veritas to look into
the causes.
In the meantime however, the following command has now been successfully
tested and can be used from root access once the specific tape required is
known:
"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev <media_id>
-newserver <hostname> -oldserver <hostname>"
In this example the command would read:
"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev BBL007
-newserver nfmprama -oldserver nfmpramb"
Details of the tape(s) required are always shown in the relevant
bplog.rest.00n log (usually in the root directory) from the failed restore
attempt.
Once the restore has been completed successfully, it is advisable to run the
same command in reverse, to avoid confusion at a future date should further
restores be required using the same tape. Again in this context, the command
would read:
"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev BBL007
-newserver nfmpramb -oldserver nfmprama"
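Since the move must be undone after the restore, it can help to print the
forward and reverse commands together before keying either of them in. The
sketch below only echoes the commands shown in this section and does not run
them; the function names movedb_cmd and movedb_pair are hypothetical.

```shell
BPMEDIA=/usr/openv/netbackup/bin/admincmd/bpmedia

# movedb_cmd MEDIA_ID FROM_SERVER TO_SERVER
# Prints (does not run) the bpmedia command that moves the catalog entries
# for MEDIA_ID from FROM_SERVER to TO_SERVER.
movedb_cmd() {
    echo "$BPMEDIA -movedb -ev $1 -newserver $3 -oldserver $2"
}

# movedb_pair MEDIA_ID FAILED_SERVER ALTERNATE_SERVER
# Prints the forward command first, then the reverse command to run once
# the restore has completed.
movedb_pair() {
    movedb_cmd "$1" "$2" "$3"
    movedb_cmd "$1" "$3" "$2"
}
```

For the example in this section, movedb_pair BBL007 nfmpramb nfmprama prints
the two commands given above, in the order they should be run.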
As this method only provides a very specific solution and needs to be keyed
in, efforts will continue to progress the case with Veritas, in the hope
that we will be able to configure automatic restore failover functionality.
Overview
12.5.1
Functional description.
The bplist binary will search the NetBackup master server for backups; by
default the command will search using client name and master name
specified in your bp.conf file. The bp.conf file resides in
/usr/openv/netbackup or you can specify your own bp.conf file in your home
directory that overrides the global bp.conf.
Using various parameters you will be able to get listings similar to the Unix
command ls, and where there are multiple occurrences of a file in a listing,
this will indicate that there are multiple backups.
12.5.2
Using Bplist
The command /usr/openv/netbackup/bin/bplist can be used to find out if a
file or directory was backed up on a specific date, e.g.
bplist -s mm/dd/yy -e mm/dd/yy fully_qualified_file_name
This command has to be run on the client box, using the root account or one
that has access to the files or directories being queried. The bplist
command has global execute permissions, but it is important to realise that
NetBackup security is based upon Unix file permissions. If you do not have
permission to restore a file, then you will be unable to perform a bplist for
the file. All that will happen is that you will be returned to the command
line.
The keyword function allows backups to be associated with a keyword,
which is indexed and should speed up query times. It can be specified when
the bpbackup command is issued, or by the NetBackup administrator when a
scheduled backup is defined.
Before listing backups, confirm whether the file is filesystem, or raw
partition.
Please note that by limiting the date range the database search time will be
reduced.
12.5.3
The -s option tells bplist to start searching from the specified date onwards.
There is also an -e option for the end date, and you can specify the hours and
minutes (see the manual). Using an end date will reduce the search time. The
-b option displays the backup date and time of each file.
bplist -R -b -l -s mm/dd/yy /usr/home
This will give a listing where the date and time of the backups are listed.
12.5.4
12.5.5
HA Environments
If you are running bplist in an HA set-up, be aware of how the client was
backed up: either by the shared IP address or by the physical IP address. Once
you know this, ensure that you are using the right client name, which is given
with the -C client-name option.
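The options described in this section can be combined into one query. The
following sketch assembles the command line and checks that the dates have
the mm/dd/yy shape used in this document; it prints the command rather than
running it. The function name bplist_cmd is hypothetical, and the date check
is a shape check only (it does not reject, say, 19/39/99).

```shell
BPLIST=/usr/openv/netbackup/bin/bplist

# bplist_cmd START_DATE END_DATE PATH [CLIENT]
# Prints a bplist command line using the -R, -b, -l, -s, -e and -C options
# described in this section.
bplist_cmd() {
    start=$1
    end=$2
    path=$3
    client=${4:-}
    for d in "$start" "$end"; do
        case "$d" in
            [0-1][0-9]/[0-3][0-9]/[0-9][0-9]) ;;
            *) echo "date not in mm/dd/yy form: $d" >&2; return 1 ;;
        esac
    done
    cmd="$BPLIST"
    # Only add -C when a client name was supplied (e.g. in an HA set-up).
    if [ -n "$client" ]; then
        cmd="$cmd -C $client"
    fi
    echo "$cmd -R -b -l -s $start -e $end $path"
}
```

For example, bplist_cmd 07/01/02 07/05/02 /usr/home tpfinder1-fe prints a
command that lists backups of /usr/home for that client name within a narrow
date range.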
Problems
If you do not get a result but you think you should, please do not hesitate to
contact NetBackup support.
12.5.6
Command line
The command line restore command, bprestore, is used when backed up or
archived files or directories are to be restored. If a directory is specified all
files backed up will be restored.
Caution:
Note that by default a restore will probably use the MPRN unless the client
name is the same as that used by NetBackup for the backup.
Warning: You will need to include the -t 13 parameter if the client is an NT
box. The default type is standard and suitable for UNIX clients only.
For example the syntax of the bprestore command to restore a list of files
and rename them is as follows:
bprestore -K -L log_file_name -R rename_file -C clientname -s
mm/dd/yy -e mm/dd/yy -f listfile
Where
-K means that existing files will not be overwritten.
-L is the location of the log file (these can be large and must be
managed).
-R is the file listing the renames to be done.
-C is the client server name. Please ensure you specify the same
name that NetBackup uses to backup the data, e.g. dyfin04-fe instead of
dyfin04.
-s is the start date.
-e is the end date.
-f is the list of files to be restored.
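The options listed above can be assembled by a small wrapper so that -K (the
Operations default of overwrite = no) and -t 13 (NT clients) are not
forgotten. This is a sketch that prints the command for review rather than
running it; the function name bprestore_cmd and its argument order are
hypothetical.

```shell
BPRESTORE=/usr/openv/netbackup/bin/bprestore

# bprestore_cmd CLIENT START END LISTFILE LOGFILE [CLIENT_TYPE] [OVERWRITE]
# CLIENT_TYPE is "unix" (default) or "nt"; OVERWRITE is "no" (default) or "yes".
bprestore_cmd() {
    client=$1
    start=$2
    end=$3
    listfile=$4
    logfile=$5
    client_type=${6:-unix}
    overwrite=${7:-no}
    cmd="$BPRESTORE"
    # -K leaves existing files untouched (overwrite = no).
    if [ "$overwrite" = "no" ]; then
        cmd="$cmd -K"
    fi
    # -t 13 is the class type this document gives for NT clients.
    if [ "$client_type" = "nt" ]; then
        cmd="$cmd -t 13"
    fi
    echo "$cmd -L $logfile -C $client -s $start -e $end -f $listfile"
}
```

For example, bprestore_cmd dyfin04-fe 07/01/02 07/05/02 /tmp/restore.list
/tmp/restore.log prints a command that restores the listed files without
overwriting anything already on disk.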
Another example of how to restore a specific file back to its original location
and over-write any existing file is as follows:
bprestore -L log_file_name -C clientname -s mm/dd/yy -e mm/dd/yy
fully_qualified_file_name
Other options and parameters can be specified. The complete syntax is:
/usr/openv/netbackup/bin/bprestore [-A | -B] [-K] [-l | -H | -y][-r] [-T] [-L
progress_log] [-R rename_file] [-C client] [-D client]
[-S master_server] [-t class_type] [-c class] [-s mm/dd/yy
[hh:mm:ss]] [-e mm/dd/yy [hh:mm:ss]] [-w [hh:mm:ss]] [-k
"keyword_phrase"] -f listfile | filenames
12.6
Reduction of NetBackup drive usage to allow a restore to run
12.6.1
Introduction
The NetBackup infrastructure is in constant use. This means that the
resources required for a high priority restore might all be allocated and not
immediately available. The following document outlines the procedure to
follow to ensure that sufficient resources are made available in a controlled
manner. This will cause the least impact on other customers as their backups
will not be cancelled but merely queued for the duration of the restore.
12.6.2
Identification
When a restore is submitted it has a higher priority than a backup job. It will
therefore take the first available drive and start running but due to the way
that NetBackup runs it will try to keep a cartridge loaded for as long as there
are backup jobs that can write to it. This is even in preference to other
backups that may have been queued for a long time. So a restore may be
queued because backup jobs are getting preference as it makes more efficient
use of the cartridge drives.
If there are no drives available then it is advisable to try and free one by
reducing the number of jobs running.
12.6.3
Action
Reducing the number of running jobs can be achieved by the following
method.
From the NetBackup administration GUI, open up the POLICIES icon, select
the ORACLE job class, make a note of the Max Jobs/class parameter and
reduce it to 4. This will have the effect of limiting the ORACLE backup to
one drive only. Carry out a similar process for the ORACLE_REDO policy.
Note: It may be necessary to reduce the maximum number of jobs even further if
there are limited cartridge drives available. If the restore is of a high
priority and it is deemed necessary, kill off any system/non-urgent backups
from the Activity Monitor.
This will gradually reduce the number of active jobs and release some drives.
Once the restore has completed you must re-instate the Max Jobs/class
parameters to their original values.
12.7
Additional information
12.7.1
Raw volumes
Raw volume restores are a bit more complicated. In general, specific files
cannot be restored, only the complete raw volume; for this reason such
restores are rarely done. If a restore is required then CWBACKUP should be
consulted.
13 Backups
It may be necessary on occasions to run test backups or manual backups as
part of problem resolution. Please follow the actions listed below.
Using the NetBackup GUI
On the main NetBackup screen you will see a list of policies; search for
small_test.
Highlight small_test, then pull down the Actions menu and select manual
backup, choosing the schedule and the specific client name.
Check on Job Monitor that the job becomes active.
If you receive an error when using the GUI
Use bpadm (VT100)
#bpadm
select m> manual backups
select b> browse classes forward (keep going until small_test comes up)
then enter
i> initiate backup
14.1
The NetBackup procmon file has the checkstartup command included, which,
as the name implies, checks whether a process that has died restarts
afterwards. Depending on the result, one of the following two messages will
be issued:
OSMF_procmon: Process bpdbm has not started
SMM_processes: Process back up: bpdbm, new PID 9444
These are similar faults, but the resolution normally differs. Status 41 is
often due to a failure to make a connection to a client. Status 54 usually
occurs after a connection has been made but the client fails to respond. This
document deals with these failures separately. There are a number of
considerations when dealing with this fault. For instance, is the machine
showing as unreachable on Omnibus and being dealt with via a problem or change
record? If so, the case can simply be transferred to CWBACKUP for
information. If not, a method for dealing with it follows.
Go to the Job Monitor window and check which Media Server the job was
using. Often a backup may fail when made from one server but not another
(e.g. a client may be successfully backed up from tpadsm01 but not from
tpadsm02). Before proceeding it is worth checking that other clients are not
experiencing problems. There have been a number of faults recently where
all backups to clients from the media server have failed with status 41 but
have been okay from the master. The resolution has been to reboot the media
server.
Status 41 failures are often client specific but can also be due to network
congestion. With the latter, some backups for a client might fail while others
are okay. This is often a problem at Bletchley for clients using a virtual
private network (these have names like aspasea1-vpn), where the backup is
actually going over the public network and can fail occasionally due to
congestion.
Step one is to log into the relevant server and attempt to ping the client from
that server. I will use here the example of tpfinder1 and assume that there
have been backup failures from tpadsm01, tpadsm02 and tpadsm03.
Log in to tpadsm01 and ping tpfinder1. Did it respond? Then ping tpfinder1-fe.
It is important that you ping the hostname as it is known by NetBackup, as
this tests the specific network interface. Normally our machines have a
separate network card which is connected to the dedicated NetBackup private
network. This ensures that users can connect to a machine on its public
network interface without degradation from the very large amounts of data
which backups often involve. Always use the hostname and not the IP address,
as this also tests name resolution.
When telnetting between the Master/Media Server and the client, run the
who am i command on both to ensure the hostnames reflect the /etc/hosts
and bp.conf entries.
If tpfinder1 pings okay but tpfinder1-fe did not then the problem lies
somewhere on the private network (which includes the network card and
networking software on the client).
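The two-ping check described above can be sketched as follows. The function
names are hypothetical, and the ping -c 1 form is an assumption: some UNIX
versions of ping take different arguments, so adjust ping_host to suit the
server you are logged in to.

```shell
# ping_host NAME
# Returns 0 if NAME answers a single ping. The -c flag is an assumption and
# is not accepted by every UNIX ping.
ping_host() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# diagnose_client PUBLIC_NAME BACKUP_NAME
# Pings the name NetBackup uses first, then the public name, and prints
# where the problem appears to lie.
diagnose_client() {
    public=$1
    backup=$2
    if ping_host "$backup"; then
        echo "backup interface reachable"
    elif ping_host "$public"; then
        echo "private network problem"
    else
        echo "client unreachable"
    fi
}
```

For the example in this section, diagnose_client tpfinder1 tpfinder1-fe would
report a private network problem if only tpfinder1 responds.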
Log onto the client and check that the network is up using 'ifconfig' (for
vendor specific instruction, see the OA Network Diagnostic procedures). If
the relevant interface is down then try bouncing the interface. See also
instructions for 'netstat' in the OA procedures.
More often than not, this is caused by the NetBackup server being unable to
communicate with the client.
Check that you can ping / telnet to the backup interface on the client. If it
is not contactable, forward the case to TSS for progression.
If you can telnet to the backup interface, log on and check the entries in
/etc/hosts for the NetBackup server IP addresses (including media servers)
and the client's own addresses (including the name as registered / known
within NetBackup on the server).
Ensure you are connected via the correct network (i.e. type who am i to
identify the connection you have been routed via).
If the config checks out OK, switch the multistreaming option ON within the
NetBackup class to see whether the errors are occurring under one particular
mount point. If so, this may be one of two problems. It may be that the Unix
box was connected to an NFS (Network File System) mount that has since been
removed, so the session hangs when listing the mount points. Alternatively,
there may be a problem with some corrupt data files, or it may just be the
number of files that NetBackup has been asked to back up. On some problematic
clients we are backing up in excess of 1 million files. (Note: that is files,
NOT bytes of data.)
15.2
15.3
15.4
15.5
Usually caused when there are no drives available for backups. This can
mean one of two things.
All drives are up, but in use by other backups for a long period of time.
Check the drive status (if the error occurred overnight, check the drive usage
information on the NetBackup interactive reporting page), following the
normal down drives procedure. If you find that the problem is persistently
being caused by workload, liaise with whoever is responsible for the backups
(normally Oracle DBAs) and advise them to reduce the number of concurrent
jobs.
15.6
15.7
15.8
Normally caused by the NetBackup client code being removed from the client
box. This often happens as a result of an upgrade to the Operating System
(Unix / Windows). Unless we are informed of such changes via Change
Control, these errors will persist. Locate the TSS contact for the box and
request that a change record be raised for a reinstall.
To check that the NetBackup client code is resident on a client, log on to it
and see if you can access the /usr/openv/netbackup directory. If not, the
code has been removed. This error is normally caused by inactivity of the
bpcd process on the client box. This process is required in order to allow
access from the NetBackup server to the client and must comply with the
TCP/IP protocol.
The way to check this is to follow the telnet hostname bpcd instructions
found in the supporting document.
If you find that an error is given when trying this, do the following to ensure
it is working:
On the client box:
type grep bpcd /etc/services
The following line should appear:
bpcd 13782/tcp bpcd
This indicates that port 13782 is to be used only for NetBackup connections;
the port number is generic across all NetBackup configs (server and clients).
Do the same for the request daemon:
grep bprd /etc/services
for the following entry:
bprd 13720/tcp bprd
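Both entries can be checked in one pass. The sketch below looks for a line
whose first field is the service name and whose second field is the expected
port; the function name check_service is hypothetical.

```shell
# check_service NAME PORT [FILE]
# Hypothetical helper: succeeds if FILE (default /etc/services) contains an
# entry for NAME on PORT/tcp, as shown in the grep output above.
check_service() {
    name=$1
    port=$2
    file=${3:-/etc/services}
    # awk exits 0 only if a matching line was seen.
    if awk -v n="$name" -v p="$port/tcp" \
        '$1 == n && $2 == p { found = 1 } END { exit !found }' "$file"; then
        echo "$name OK ($port/tcp)"
    else
        echo "$name missing or wrong port in $file"
        return 1
    fi
}
```

On the client box, check_service bpcd 13782 and check_service bprd 13720
should both report OK.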
15.9
NetBackup GUI; this can be set on tapes that are not time assigned (i.e.
scratch). NetBackup will ignore these until manually un-expired (unless they
have been expired for a particular reason). This method of expiration is used
to stop tapes from being used even if they are scratch. If there are expired
tapes that are available for use, simply use the GUI to un-expire them.
Another reason for these errors can be due to a tape being stuck in a drive. If
a job is allocated a drive in which to mount a tape that already has a tape
stuck in, then NetBackup will appear to attempt to mount all of the available
scratch tapes as it is not intelligent enough to be able to identify whether a
tape is stuck or not. It will send a signal to the robot regardless of what is
already in the drive. As such, you will need to identify the drive with the
stuck tape in and remove it. Check the NetBackup Activity Monitor for
details on whether the job is trying unsuccessfully to mount numerous tapes.
15.10
15.11
Application Resource Alert: NetBackup Drive Warning: <servername> Drives are in AVR mode - Investigation required <date and
time>
Application Entity Failed: NetBackup VMD Warning: <servername> Cannot connect to vmd daemon - Investigation required
<date and time>
15.12
16 Netbackup Clients
16.1
Netbackup Documentation
http://dataintegrity.intra.bt.com/
16.2
Troubleshooting Guide
Refer to NetBackup manuals
16.3
NetBackup reporting
16.4
Supportal
17 APPENDICES
END OF DOCUMENT