ISL/OPS/A869

ISSUE 2

NetBackup Operational Procedures
Veritas NetBackup Procedures

About this document...


Author
The author of this document may be contacted at:
John Wesson
Eldon House
Sheffield S1 3PL

Content approval
This is Issue 2 of this document.
The information contained in this document was approved for use.

Filing
The filing reference for this document is ISL/OPS/A869.

History
Issue    Date      Author         Reason
Issue 1  23/02/05  Mick Sweeting  DRAFT 1
Issue 2  26/05/07  John Wesson    Incorporate OPS produced docs.

Contents

1 Introduction
2 Contacts
  2.1 Customer Experience Management Centre Operations
  2.2 Customer Experience Management Centre Second Line Team
  2.3 Customer Experience Management Centre Second Line DBA
  2.4 Storage management
  2.5 OPERATIONS / MEDIA CONTACT LIST
  2.6 ADIC GRAU Contact/Info
  2.7 Escalation
  2.8 Hardware Information
  2.9 Drive Information
  2.10 IBM Hardware Contact
3 Tape Library Naming Conventions
4 IBM Library Information
  4.1 Using robtest to interrogate the robot
  4.2 mtlib Commands
  4.3 Dealing with problems when a robot goes into Pause mode
  4.4 Media movements - Inserting media
  4.5 Ejecting media from the IBM library
  4.6 Drive problems
  4.7 Library problems
  4.8 mtlib commands - Description and useful commands
5 ADIC Library Information
  5.1 Using robtest to interrogate the robot
  5.2 dasadmin (only to be used on ADIC libraries)
  5.3 Mount a Volume in a tape drive
  5.4 Dismount a Volume from a tape drive
  5.5 Media Movements - Inserting Media
  5.6 Ejecting Media from the ADIC Library
  5.7 Drive Problems
  5.8 AMLs
  5.9 Common Problems
  5.10 Points To Note
  5.11 Library Problem
6 Accessing NetBackup
  6.1 NBU 4.5
7 NetBackup Daemon Problems
  7.2 Shutdown Netbackup
8 Netbackup Activity Logs and process information
  8.1 Introduction
  8.2 File system full problems
  8.3 Full List of NETBACKUP Processes
  8.4 General Drive Testing and Fault Resolution
  8.5 Dealing With Tape Drive Problems
  8.6 Determining whether problem is Drive or Robot.
  8.7 Device Configuration Utility - Tpconfig
9 Deconfigure/Reconfigure Sequent Drives
10 Resetting the SCSI extenders
  10.1 Overview
11 Useful Media Commands
  11.1 BPMEDIA: Freeze, unfreeze, suspend or unsuspend media
  11.2 Tape Management
12 Restore Guidance
  12.1 Background
  12.2 Restore information
  12.3 Media related items
  12.4 Restore processes
  12.5 Failover Restores
  12.6 Reduction of NetBackup drive usage to allow a restore to run
  12.7 Additional information
13 Backups
14 NetBackup Logmon Error Messages
  14.1 NetBackup Processes and Procmon Error Messages
15 NetBackup STATUS Exits - the big hitters
  15.1 Exit Status 41 Network connection timed out
  15.2 Exit Status 1 Backup was partially successful
  15.3 Exit Status 52 Timed out waiting for Media Manager to mount volume
  15.4 Exit Status 71 None of the files in the file list exist
  15.5 Exit Status 219 The required storage unit is not available
  15.6 Exit Status 54 Timed out connecting to client
  15.7 Exit Status 84 Media write error
  15.8 Exit Status 57 Client connection refused
  15.9 Exit Status 96 Unable to allocate new media for backup, storage unit has none available
  15.10 Exit Status 131 Client is not validated to use the server
  15.11 Application Resource Alert
  15.12 Slow throughput of backups
16 Netbackup Clients
  16.1 Netbackup Documentation
  16.2 Troubleshooting Guide
  16.3 NetBackup reporting
  16.4 Supportal
17 APPENDICES

1 Introduction
This document is intended for use in identifying and resolving Veritas
NetBackup problems. It is not intended to replace any software manuals and
should be used in conjunction with the current Veritas manuals. It also
assumes a certain level of Unix command experience and use of dasadmin
commands.

2 Contacts
2.1 Customer Experience Management Centre Operations
The Lines of Business Operations team is based at the Sheffield Customer
Experience Management Centre and provides a first-line and monitoring role
for NetBackup alarms.
This group is available 24 hours a day, 7 days a week, and can be contacted
as follows:
Tel: 0800 216662 Option 3
Fax: 0114 277 4224
Operations Service Manager: 0800 216662 Option 6,2
Email: cclobops@bt.com
The Clarify class for the Operations Group is: CCLOBOPS

2.2 Customer Experience Management Centre Second Line Team
The BT/HP Second Line team are based at the Sheffield Customer
Experience Management Centre, and provide second line technical support
for Unix, Wintel and Netbackup.
The team is available 24 hours a day, 7 days a week, and can be contacted as
follows:
Tel: 0800 216662 Option 5,2
Fax: 0114 277 4224
Second Line team leader: 0800 216662 Option 5,4
Email: soc2line@bt.com
The Clarify classes for the Second Line team are:
Unix:      CWHPOPSUX
Wintel:    CWHPOPSNT
Netbackup: CWHPOPSSTNB

2.3 Customer Experience Management Centre Second Line DBA
A Second line DBA is available 24 hours a day, 7 days a week, and can be
contacted as follows:
Tel: 0114 277 4032
Fax: 0114 277 4224
Clarify class: DBAORA2L

2.4 Storage management
Contact information for the storage management support group with
responsibility for NetBackup:

Bridge Clarify class:  CWBACKUP
Outside office hours:  CWBACKUP Bridge
                       Callout NETBACKUP

2.5 OPERATIONS / MEDIA CONTACT LIST


A full list of site contacts can be found in Departmental Contacts under
Contacts at
_Coll=;

2.6 ADIC GRAU Contact/Info

2.6.1 Contact Number / fault logging / hotline

ADIC GRAU            01344 488786
Maurice Rutherford   Office: 0118 922 9100   Mobile: 07710 576425
Mike Halliday        Office: 0118 922 9100   Mobile: 07850 000152

2.6.2 ADIC Callout Agreement


Callout will be instigated by the Second Line team/Netbackup Support out of
hours where we consider there to be a serious degradation of the system, i.e.
over 50 % of tape drives are unavailable. Otherwise a call will be placed
during normal working hours. On site access will be arranged by the Second
Line team via the Change / Problem Management System.

2.6.3 Hours of Cover
Full cover will be provided on a 24 hours per day, 365 days per year basis
with a 2-hour response time.

2.7 Escalation

2.7.1 HP
John Orman (Tel: 01908 656267)
Pete Meade (Tel: 0151 706 8805) (Mobile: 07802 471232)

ADIC
Gary Page (Tel: 0118 922 9100) (Mobile: 07919 330945)
Maurice Rutherford (Tel: 0118 922 9100) (Mobile: 07710 576425)

2.8 Hardware Information
Hardware information for each site can be found in Hardware Information
under Documentation at
http://dataintegrity.intra.bt.com/

2.9 Drive Information
This can be found under Tape Library Info under Media & Library under
Drive & Media at
http://byadsm03.nat.bt.com/

2.10 IBM Hardware Contact

In the event of a failure of any IBM kit a call should be made to IBM via
Storagetek (01483 728101). The following information will be required in
order for Storagetek to place a call with IBM:
- The site id number.
- The 4-digit machine type number as listed beside hostname.
- The serial number of the machine.
- The project name, which in most instances is Brunix.

2.10.1 IBM Callout Agreement


Callout will be instigated by the Second Line team/Netbackup Support out of
hours where we consider there to be a serious degradation of the system, i.e.
over 50 % of tape drives are unavailable. Otherwise a call will be placed
during normal working hours. On site access will be arranged by the Second
Line team/Netbackup Support group via the Change / Problem Management
System.

2.10.2 Hours of Cover
Full cover will be provided on a 24 hours per day, 365 days per year basis
with a 2-hour response time.

3 Tape Library Naming Conventions


The NetBackup tape libraries (also known as tape robots) come from two
different manufacturers, IBM and ADIC/GRAU.
The library name encodes the site, the manufacturer and the robot number:
- The first two letters give the location, e.g. IP is Ipswich.
- The next three letters are either IBM or GRA, indicating that the manufacturer is
IBM or ADIC/GRAU respectively.
- The next two letters are LB, to indicate that this is a library rather than a normal
server.
- Finally there is a robot number. This is needed as one site may have several
IBM or ADIC/GRAU libraries.
For example, IPGRALB1 is ADIC/GRAU library number 1 at Ipswich; TPIBMLB2 is read in the same way.
An overview of our tape libraries at the major sites can be found under
the URL http://dataintegrity.intra.bt.com/ . Just click the tab in the left-hand
menu for 'Tape Drive Layout Diagrams', then click on the site you are
looking for. Alternatively, issuing tpconfig -d on any server will indicate the
library type it uses (tlh indicates an IBM 3494, and tlm indicates an
ADIC/GRAU AML library) and show the drives that server has.
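
The naming convention can be checked mechanically. The following is a minimal sketch rather than an existing tool: the function name parse_library_name and the LibraryName type are made up for illustration, and the only rules it applies are the ones listed above (two-letter site, IBM or GRA, LB, robot number).

```python
from typing import NamedTuple

# Manufacturer codes taken from the naming convention described above.
MANUFACTURERS = {"IBM": "IBM", "GRA": "ADIC/GRAU"}


class LibraryName(NamedTuple):
    site_code: str      # first two letters, e.g. "IP" for Ipswich
    manufacturer: str   # "IBM" or "ADIC/GRAU"
    robot_number: int   # distinguishes several libraries at one site


def parse_library_name(name: str) -> LibraryName:
    """Split a library name such as IPGRALB1 into its parts (hypothetical helper)."""
    name = name.strip().upper()
    site, maker, marker, number = name[:2], name[2:5], name[5:7], name[7:]
    if len(site) != 2 or not site.isalpha():
        raise ValueError(f"bad site code in {name!r}")
    if maker not in MANUFACTURERS:
        raise ValueError(f"unknown manufacturer code {maker!r} in {name!r}")
    if marker != "LB":
        raise ValueError(f"{name!r} is not a library name (no LB marker)")
    if not number.isdigit():
        raise ValueError(f"missing robot number in {name!r}")
    return LibraryName(site, MANUFACTURERS[maker], int(number))


print(parse_library_name("IPGRALB1"))
print(parse_library_name("TPIBMLB2"))
```

A name that does not follow the convention (for example an ordinary server name) raises ValueError, which is a quick way to spot that a host is not a library.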

4 IBM Library Information

Note: Most of our IBM libraries are IBM 3494s, and these have a
library manager which understands the mtlib commands explained
below. Newer IBM 3584s are now coming into service which do not
have a library manager. You cannot use mtlib commands on these, so you
will have to rely on robtest and NetBackup for library information.

4.1 Using robtest to interrogate the robot

The robtest utility in /usr/openv/volmgr/bin can be used to interrogate and
control the IBM silo from the system running the daemon, in the same way
that it can be used to control the ADIC silo from machines that drive an
ADIC silo.
Issue the following command:
#robtest
Use ? to get help in robtest.
The main commands available are drstat (to display drive status) and view
(to examine tapes). Note that unload cannot be used, as the tape drives are
not connected to the SUN systems.
For further information regarding the robtest utility, refer to ISL/OPS/B274 BRUNIX media management procedures at
_Coll=;

Note: Do not leave your robtest session active any longer than you have to.
When you are in the robtest utility, communication between the tlhcd
daemon and the media servers is blocked. If a media server makes a
request to tlhcd (either a mount or a dismount request) while the robtest
utility is active, the request will be blocked. This will result in the tape
drives on that media server being switched into AVR mode until the
robtest session is terminated.

4.2 mtlib Commands
The following commands can be used to interrogate the robot using the mtlib
utility (/usr/bin/mtlib):

Command                                        Result
mtlib -l 3494c -qI                             Query Inventory. This command produces an inventory
                                               of the silo and the category of each tape.
mtlib -l 3494c -qL                             Query Library. This produces various items of library
                                               data including:
                                               State - This should display "Automated Operational
                                               State". If the library has been PAUSED then a pause
                                               state will display.
                                               Available cells - Displays the number of empty cells.
mtlib -l 3494c -qS                             Statistical data.
mtlib -l 3494c -qK                             Count of tapes.
mtlib -l 3494c -C -t <category> -V <barcode>   This changes the category of a tape.
mtlib -l 3494c -C -t FF10 -V <barcode>         It can also be used to eject tapes from the silo.
mtlib -l 3494c -D                              This displays the silo device numbers.
mtlib -l 3494c -qM                             Displays which tapes are loaded in the devices
                                               listed above.

For further information regarding the robtest utility, refer to ISL/OPS/B274 BRUNIX media management procedures at
_Coll=;

4.3 Dealing with problems when a robot goes into Pause mode
Occasionally the IBM library may go into a paused state. This results in no
cartridge movement and probably a number of backup failures. The usual cause is
that the two I/O station problem slots have been filled with
cartridges that the robot has had problems handling.
If the robot goes into a paused state, contact Media Ops and ask
them to empty the I/O station problem slots and note the cartridge numbers.
When this has been done the library should automatically restart and work as
normal. Confirm that backups are running and that cartridges are being
loaded and unloaded. Occasionally the problem may be due to a failure in the
robotic arm, in which case all cartridge manipulation will fail. The cartridges will
then be placed in the I/O station problem slots, causing the library to go into a
paused state again. This will require a callout to the engineers to resolve the
problem.
If the library does recover then the problem cartridges can be reinserted into
the library. If this fails then they need to be checked to see if
they contain any valid data and, if so, put into a frozen state. If no
valid data exists then they can be deleted from the Media Manager database.

4.4 Media movements - Inserting media

Inserting media into the IBM library is a straightforward process. There is no
insert command to run, as the library will automatically take cartridges from
the I/O station, or hopper, and place them in available slots. The only
problems that are likely to occur are that there are no slots available or the
library is not functioning correctly.
Once cartridges have been inserted and accepted by the library they can be
added to the NetBackup media management database as normal via the GUI or
command line. This should be done from the NetBackup master server.

4.4.1 Checking available slots

There are two ways to find out how many available slots there are.
The least disruptive way is to use the following webpage _Coll=;, which
includes ADIC and IBM library information.
An alternative is to interrogate the tape library. Because the library itself is being
queried, this command must be issued from the mount server. Log onto the
mount server and run the robtest utility.

Caution: While robtest is running no further library actions, i.e. mounts
and dismounts, can take place.
Select the appropriate library, which is normally option 1, and use the ? command
to get a list of the possible options.
dyadsm01 $ robtest
Configured robots with local control supporting test utilities:
  TLH(0) LMCP device path = /dev/lmcp0

Robot Selection
---------------
  1) TLH 0
  2) none/quit
Enter choice: 1

Robot selected: TLH(0) LMCP device path = /dev/lmcp0

Invoking robotic test utility:
/usr/openv/volmgr/bin/tlhtest -r /dev/lmcp0 -d /dev/rmt2.1 003590E1A00
-d /dev/rmt1.1 003590E1A01 -d /dev/rmt3.1 003590E1A02

Opening /dev/lmcp0
Enter tlh commands (? returns help information)
?

To exit the utility, type q or Q.

audit <volser>                           - Audit library for volser
catinv <category> [<count>]              - Print library inventory by category
dm [<drive>|<IBM Device Name>]           - Dismount volser from drive
drmapclear                               - Clear drive address mapping
drmapfreeze                              - Freeze drive address mapping
drmapshow                                - Show drive address mapping
drstat [<drive>|<IBM Device Name>]       - Print drive status
eject <vol> [bulk]                       - Eject volser to standard (or bulk) output area
inv [<count>]                            - Print library inventory
libstat                                  - Print library status
m <volser> [<drive>|<IBM Device Name>]   - Mount volser
setcat <volser> <old> <new>              - Set volser category
types                                    - Print list of media types
verbose                                  - Toggle verbose mode
view <volser>                            - Print volser data

SCSI commands:
unload [<drive>|<IBM Device Name>]       - Issue SCSI unload

<drive> = d1 if drive 1, d2 if drive 2, ..., d256 if drive 256

Use the libstat option to display the status of the library.

libstat
Library information:
  state:                 Automated Operational State
  input stations:        1
  output stations:       1
  input/output status:   All convenience input stations empty
                         All convenience output stations empty
  machine type:          3494
  sequence number:       0x16552
  number of cells:       5431
  available cells:       3285
  number of subsystems:  17
  convenience capacity:  10
  accessor config:       01
  accessor 0 status:     Accessor available
                         Gripper 1 available
                         Gripper 2 not installed
                         Vision system operational
  comp avail status:     Primary library manager installed.
                         Primary library manager available.
                         Primary hard drive installed.
                         Primary hard drive available.
                         Secondary hard drive installed.
                         Secondary hard drive available.
                         Convenience input station installed.
                         Convenience input station available.
                         Convenience output station installed.
                         Convenience output station available.
  avail 3490 cln cycles: 0
  avail 3590 cln cycles: 9
QUERY LIBRARY DATA complete
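
If libstat output is captured to a file, the two values that matter for inserting media (the library state and the number of available cells) can be pulled out automatically. This is a minimal sketch under stated assumptions: the captured text is assumed to hold one "label: value" pair per line, and the helper names parse_libstat and library_health are hypothetical, not part of NetBackup.

```python
def parse_libstat(text: str) -> dict:
    """Collect 'label: value' lines from captured libstat output."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        # Keep only lines that really carry a value; first occurrence wins.
        if sep and value.strip():
            fields.setdefault(label.strip(), value.strip())
    return fields


def library_health(fields: dict) -> list:
    """Return a list of warnings; an empty list means nothing obvious is wrong."""
    warnings = []
    state = fields.get("state", "unknown")
    if state != "Automated Operational State":
        warnings.append(f"library state is {state!r}")
    cells = fields.get("available cells", "")
    if not cells.isdigit() or int(cells) == 0:
        warnings.append("no available cells reported")
    return warnings


# Sample values copied from the libstat example in this section.
SAMPLE = """\
state:                Automated Operational State
input stations:       1
number of cells:      5431
available cells:      3285
"""

fields = parse_libstat(SAMPLE)
print(fields["available cells"], library_health(fields))
```

An empty warning list only means the two checks passed; it does not replace reading the full output when the library is misbehaving.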

Note: To find out the drive details on the server, issue the tpconfig -d command on
the master/media server which uses the library. The output will look similar
to that shown below.
dyadsm01 $ tpconfig -d
Index DriveName    DrivePath      Type    Shared  Status
***** *********    **********     ****    ******  ******
  0   Drive1       /dev/rmt2.1    hcart2  No      UP
        TLH(0) IBM Device Name=003590E1A00
  1   Drive2       /dev/rmt1.1    hcart2  No      UP
        TLH(0) IBM Device Name=003590E1A01
  2   Drive3       /dev/rmt3.1    hcart2  No      UP
        TLH(0) IBM Device Name=003590E1A02
Currently defined robotics are:
  TLH(0) LMCP device path = /dev/lmcp0,
         volume database host = dyadsm01
From this it can be seen that there are 3 drives configured to be used by
this server, dyadsm01.
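
Captured tpconfig -d output can be turned into a list of drives, which makes it easy to spot drives that are not UP. The sketch below makes two assumptions that should be checked against real output: each drive row has exactly six whitespace-separated columns starting with the numeric index, and the IBM device name appears on the line after it. The function name parse_tpconfig is hypothetical, and the second drive in the sample is marked DOWN purely for illustration.

```python
def parse_tpconfig(text: str) -> list:
    """Extract drive records from captured `tpconfig -d` output."""
    drives = []
    for line in text.splitlines():
        tokens = line.split()
        # A drive row has six columns and starts with the numeric drive index.
        if len(tokens) == 6 and tokens[0].isdigit():
            index, name, path, media_type, shared, status = tokens
            drives.append({
                "index": int(index), "name": name, "path": path,
                "type": media_type, "shared": shared, "status": status,
                "ibm_device": None,
            })
        # The robot line that follows a drive row carries the IBM device name.
        elif "IBM Device Name=" in line and drives:
            drives[-1]["ibm_device"] = line.split("IBM Device Name=", 1)[1].strip()
    return drives


# Layout follows the example in this section; the DOWN status is illustrative.
SAMPLE = """\
Index DriveName  DrivePath    Type    Shared  Status
***** *********  **********   ****    ******  ******
  0   Drive1     /dev/rmt2.1  hcart2  No      UP
        TLH(0) IBM Device Name=003590E1A00
  1   Drive2     /dev/rmt1.1  hcart2  No      DOWN
        TLH(0) IBM Device Name=003590E1A01
"""

drives = parse_tpconfig(SAMPLE)
not_up = [d["name"] for d in drives if d["status"] != "UP"]
print(len(drives), "drives configured; not UP:", not_up)
```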

4.5 Ejecting media from the IBM library

To eject media from a library, issue the following command for each tape:
mtlib -l <library device name> -C -s FF00 -t FF10 -V aaannn
The library device name can be found by running tpconfig -d on the master
server.
Here -l is the library's filename, -C indicates that you want to change the
category of a volume, -s is the starting value, -t is the value you want to
change it to and -V is the volume serial number to be changed.
This will change the state of cartridge aaannn from being in the library to
ejected and move it accordingly.
For further details of the mtlib command you can issue mtlib -?, which will
produce a list of all the possible parameters and their meaning.
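
Because the eject command has to be issued once per tape, it can help to generate the command lines from a list of volsers and review them before running anything. The sketch below only builds the argument lists; it does not execute mtlib. The helper name eject_command and the six-character volser check are assumptions made for illustration, and the library name 3494c is taken from the examples in section 4.2.

```python
import re

# Assumption: volsers are six alphanumeric characters, as in aaannn or DTP216.
VOLSER_PATTERN = re.compile(r"[A-Z0-9]{6}")


def eject_command(library: str, volser: str) -> list:
    """Build the mtlib argument list that moves one volser from FF00 to FF10."""
    volser = volser.upper()
    if not VOLSER_PATTERN.fullmatch(volser):
        raise ValueError(f"unexpected volser format: {volser!r}")
    return ["mtlib", "-l", library, "-C", "-s", "FF00", "-t", "FF10", "-V", volser]


# Print one command per tape so the list can be checked before it is run.
for tape in ["DTP216", "DTP217"]:
    print(" ".join(eject_command("3494c", tape)))
```

Keeping the build step separate from execution means a mistyped volser is rejected before any category change reaches the library.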

4.6 Drive problems
If it is a drive problem then the SCSI commands to manipulate the drive must
be issued from the master/media server. For example, to rewind a cartridge
and prepare it to be unloaded from a drive you would issue the
mt -f /dev/rmt1 rewoff command.
The most common problem is when a cartridge does not eject from a drive.

4.6.1 Identifying stuck cartridges


Compare the information from the robtest drstat command (see below for details)
with that from the vmoprcmd command. If drstat lists a cartridge as
being in the drive and vmoprcmd does not, and there are no active jobs using
the cartridge, then it is safe to assume that the cartridge is stuck. To confirm
this it is also worth checking the log files for SCSI errors related to the drive.
These can indicate when the failure occurred and help the engineers identify
the cause.
If you suspect a cartridge is stuck, log onto the master/media server which
uses the drive and issue the mt -f /dev/xxxxx rewoff command from the root
account, where /dev/xxxxx is the file name for the drive. This should rewind
the cartridge and eject it ready for the robot. If this command fails it could be
because the cartridge has already been rewound or because there are problems
communicating with the drive. To determine which, log onto the
mount server and use the robtest dismount option. If this fails then the
operators will have to power-cycle the drive to force it to eject the cartridge;
this should be followed by the robtest dismount. If no progress is made after
all this then it is time to get the engineers involved.
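
The comparison between drstat and vmoprcmd can be expressed as a small rule. This is a sketch only: it assumes the mounted volsers from both commands have already been collected by hand (or by a parser) into dictionaries keyed by the same drive identifier, and the function name find_stuck_cartridges is hypothetical.

```python
def find_stuck_cartridges(drstat_mounted: dict, vmoprcmd_mounted: dict,
                          active_volsers: set) -> dict:
    """Return {drive: volser} for cartridges the library sees but NetBackup does not."""
    stuck = {}
    for drive, volser in drstat_mounted.items():
        if not volser:
            continue  # the robot reports the drive as empty
        if vmoprcmd_mounted.get(drive):
            continue  # NetBackup also knows about a tape in this drive
        if volser in active_volsers:
            continue  # a running job is still using the cartridge
        stuck[drive] = volser
    return stuck


# Illustrative data: device names and volsers reused from examples in this document.
drstat = {"003590E1A00": "DEF595", "003590E1A01": None, "003590E1A02": "DEC406"}
vmoprcmd = {"003590E1A00": "DEF595"}
print(find_stuck_cartridges(drstat, vmoprcmd, active_volsers=set()))
```

A drive that appears in the result is only a candidate; the log check for SCSI errors described above is still the confirmation step.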

4.7 Library problems
If there are problems with the library, i.e. drives in AVR and cartridges not
being mounted, then attempt a manual mount of a cartridge.
For example, to mount a cartridge you would run robtest (see above for
details) and then run the appropriate command.
A useful command is the libstat command. This shows the library status on
the first line, which should be:
  state:                Automated Operational State
Further down the output the status of the I/O station/hopper is displayed and
whether it is full and requires emptying. The status should be:
  input/output status:  All convenience input stations empty
                        All convenience output stations empty
Another useful command is drstat, which lists information for all the drives.
From this it is possible to tell if a cartridge is in the drive and its identity.
Drive 3 information:
  drive number:         3
  device name:          003590B1A02
  device number:        0x203140
  device class:         0x11 - 3590 Model B1A/other
  device category:      0x0000
  mounted volser:       <none>
  mounted category:     0x0000
  device states:        Device installed in ATL.
                        Dev is available to ATL.
                        ACL is installed.

In this example you can see that there is no cartridge in drive 3.

4.8 mtlib commands - Description and useful commands
The mtlib command allows the IBM library to be interrogated and
manipulated by a user.
To display all possible options use the mtlib -? command.
To find out what the logical device number is for a library, display the
/etc/ibmatl.conf file; the library will be at the end of the file.
mtlib -l 3494c -qV -V xxxnnn
This will display the status of a cartridge, i.e. whether it is in the library.
mtlib -l 3494c -qM
This will display all mounted cartridges and the library device number each
is in.
mtlib -l 3494c -D
This will display all the devices and their numbers.
mtlib -l 3494c -qL
This will display the status of the library.

5 ADIC Library Information


Note: Most of our ADIC/GRAU libraries are AMLs, and these have a
library manager which understands the dasadmin commands explained
below. Newer ADIC/GRAU libraries do not have a library manager. You
cannot use dasadmin commands on these, so you will have to rely on robtest
and NetBackup for library information.

5.1 Using robtest to interrogate the robot

Please refer to the instructions for the IBM robot above in section 4.1, as the
process is the same.

5.2 dasadmin (only to be used on ADIC libraries)

LISTD: List Drive Status
This command displays the drive status for all clients or a specific client.
dasadmin ld
If the robot has more than 15 drives, use dasadmin ld2.

E.g.
dybkup01 $ dasadmin ld
listd for client: successful
drive: DRIVE1 amu drive: 01 st: UP type: N sysid: client: dycase01
volser: cleaning 0 clean_count: 10
drive: DRIVE2 amu drive: 02 st: UP type: N sysid: client: dycase01
volser: cleaning 0 clean_count: 29
drive: DRIVE3 amu drive: 03 st: UP type: N sysid: client: dycase01
volser: cleaning 0 clean_count: 28
drive: DRIVE4 amu drive: 04 st: UP type: N sysid: client: dycase01
volser: cleaning 0 clean_count: 28
drive: DRIVE5 amu drive: 05 st: UP type: N sysid: client: dyespwp1
volser: cleaning 0 clean_count: 14
drive: DRIVE6 amu drive: 06 st: UP type: N sysid: client: dyespwp1
volser: cleaning 0 clean_count: 1
drive: DRIVE7 amu drive: 07 st: UP type: N sysid: client: dyespwp1
volser: cleaning 0 clean_count: 15
drive: DRIVE8 amu drive: 08 st: UP type: N sysid: client: dyespwp1
volser: cleaning 0 clean_count: 24
drive: DRIVE9 amu drive: 09 st: UP type: N sysid: client: dyvsisb1
volser: cleaning 0 clean_count: 6
drive: DRIVE10 amu drive: 10 st: UP type: N sysid: client: dyvsisb1
volser: cleaning 0 clean_count: 3
drive: DRIVE11 amu drive: 11 st: UP type: N sysid: client: dybkup01
volser: DEF595 cleaning 0 clean_count: 17
drive: DRIVE12 amu drive: 12 st: UP type: N sysid: client: dybkup01
volser: cleaning 0 clean_count: 7
drive: DRIVE13 amu drive: 13 st: UP type: N sysid: client: dynebk01
volser: cleaning 0 clean_count: 5
drive: DRIVE14 amu drive: 14 st: UP type: N sysid: client: dynebk01
volser: cleaning 0 clean_count: 23
drive: DRIVE15 amu drive: 15 st: UP type: N sysid: client: dynebk01
volser: DEC406 cleaning 0 clean_count: 2
To display the drive list for a specific client
dasadmin ld dybkup01
dybkup01 $ dasadmin ld dybkup01
listd for client: dybkup01 successful
drive: DRIVE11 amu drive: 11 st: UP type: N sysid: client: dybkup01
volser: DEF595 cleaning 0 clean_count: 17
drive: DRIVE12 amu drive: 12 st: UP type: N sysid: client: dybkup01
volser: cleaning 0 clean_count: 7
drive: DRIVE23 amu drive: 23 st: UP type: N sysid: client: dybkup01
volser: cleaning 0 clean_count: 13
drive: DRIVE24 amu drive: 24 st: UP type: N sysid: client: dybkup01
volser: DEG393 cleaning 0 clean_count: 16
das options:
Display       Description of Parameter
drive         Drive number
st            Drive status UP or DOWN
type          Drive type
sysid         Reserved
client        Client name allocated to the drive
volser        Mounted volume on the drive
cleaning      Actual cleaning activity
              0: no clean activity on the drive
              1: cleaning media mounted on the drive
clean count   Number of mounts until the next cleaning interval
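
The listd output can be parsed into one record per drive using the field names from the table above. This is a minimal sketch that assumes the output looks exactly like the examples in this section (each drive record wrapped over two lines, with an empty volser when nothing is mounted); the names DRIVE_RECORD and parse_listd are hypothetical.

```python
import re

# One record runs from "drive:" to "clean_count: N" and may wrap across lines.
DRIVE_RECORD = re.compile(
    r"drive:\s*(?P<drive>\S+)\s+amu drive:\s*(?P<amu>\S+)\s+st:\s*(?P<st>\S+)\s+"
    r"type:\s*(?P<type>\S+)\s+sysid:\s*(?P<sysid>.*?)\s*client:\s*(?P<client>\S+)\s+"
    r"volser:\s*(?P<volser>\S*?)\s*cleaning\s+(?P<cleaning>\d)\s+"
    r"clean_count:\s*(?P<clean_count>\d+)"
)


def parse_listd(text: str) -> list:
    """Turn captured `dasadmin ld` output into a list of drive dictionaries."""
    drives = []
    for match in DRIVE_RECORD.finditer(text):
        record = match.groupdict()
        record["volser"] = record["volser"] or None   # empty means no tape mounted
        record["cleaning"] = record["cleaning"] == "1"
        record["clean_count"] = int(record["clean_count"])
        drives.append(record)
    return drives


# Two records copied from the example output in this section.
SAMPLE = """\
listd for client: dybkup01 successful
drive: DRIVE11 amu drive: 11 st: UP type: N sysid: client: dybkup01
volser: DEF595 cleaning 0 clean_count: 17
drive: DRIVE12 amu drive: 12 st: UP type: N sysid: client: dybkup01
volser: cleaning 0 clean_count: 7
"""

drives = parse_listd(SAMPLE)
mounted = {d["drive"]: d["volser"] for d in drives if d["volser"]}
print(mounted)
```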

ALLOCD: Allocate drive to different client
dasadmin allocd DRIVEx UP|DOWN client

To see the range of tapes assigned, use the dasadmin qvolsrange command.
Note: This command returns a list of volsers which are accessible to the specified
client within the requested volser range.
dasadmin qvolsrange beginvolser endvolser count (client name)
e.g. dasadmin qvolsrange DTP216 DTP220 8
Parameter     Description of Parameter
beginvolser   The beginvolser specifies the first volser in the range
endvolser     The endvolser specifies the last volser in the range
count         Specifies the number of volsers to report within the range.
              (This number can be larger than the actual number required)
client name   If the client name is specified the volser range is checked
              for that client only; if none is specified then all are checked

5.3 Mount a Volume in a tape drive

This command mounts a volume on a drive from the library.
dasadmin mount -t media-type volser drive
e.g. dasadmin mount -t 3590 DTP216 DRIVE4
Parameter     Description of Parameter
media-type    Specifies the type of media you are using e.g. 3590
volser        Specifies the specific media you wish to be mounted e.g. DTP216
drive         Specifies the drive number on which the media is to be mounted e.g. DRIVE4

5.4 Dismount a Volume from a tape drive


This command dismounts a volume and returns it to the library.
dasadmin dismount -t media-type volser
e.g. dasadmin dismount -t 3590 DTP216
Parameter     Description of Parameter
media-type    Specifies the type of media you are using e.g. 3590
volser        Specifies the specific media you wish to be dismounted e.g. DTP216

5.5 Media Movements - Inserting Media


Unlike the IBM library, it is necessary to issue a command to insert media
from the input hopper into the library. This command will insert volumes
from a specific insert area into the library area.
dasadmin insert -t media-type area
e.g. dasadmin insert -t 3590 I01
Parameter     Description of Parameter
media-type    Specifies the type of media you are using e.g. 3590
area          Specifies the area where the tape(s) will be inserted e.g. I01

5.6 Ejecting Media from the ADIC Library

This command will move volume(s) to the eject area to be removed from the
library.
dasadmin eject (-c) -t media-type volser-range area
e.g. dasadmin eject -c -t 3590 DTP216,DTP220 E01
Parameter      Description of Parameter
-c             Tells DAS to remove the volser from the catalog
media-type     Specifies the type of media you are using e.g. 3590
volser-range   Specifies one or more volsers to be ejected
area           Specifies the area where the tape(s) will be ejected e.g. E01

5.7 Drive Problems
Robotic problems are more straightforward to identify than drive problems.
Determine what type of robot you are using. Predominantly we use ADIC robots at the moment (AMLs).

5.8 AMLs
The amu is a PC which sits on the front of the robot and handles all requests
for action from all the backup servers. The tape drives sit in the robot, but
there is no direct electrical connection between the robot and the drives. The
robot arm carries out requests given to it by the amu. The racking is used to store
the tapes. The amu is contacted using standard IP addressing. ADIC provide
a binary that can be used to communicate with the library, and that command
is dasadmin. A full list of dasadmin commands and their use can be found by
typing dasadmin -?

5.9 Common Problems
The drives are in AVR mode
Check the messages/syslog for errors. Then:
Ping the amu from the server; its IP address will be found in /etc/hosts.
If there is no response from either server, call out ADIC. This could also be a
network problem, such as network connectivity to the box having been lost, in
which case there is nothing to be done until the network has been restored.

If you can ping the amu, issue the command:
#dasadmin ld
which will talk to the amu and query the status of the drives. If this hangs or
returns "unexpected response code received from the amu", call out ADIC.

If dasadmin ld returns a list of drives, select a tape from Media
Manager which does not have a time assigned value
and try to mount it in a drive. Use the command:
#dasadmin mount -t 3590 volser DRIVEn
If this returns "unexpected response code received from the amu", call
out ADIC. If it works, either use reset with drive control, or issue
#mt -f /drive off
and
#dasadmin dismount -t 3590 volser
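
The AVR checks above form a simple decision sequence, sketched here as a function. It is an illustration of the order of the checks only: the function name avr_next_step is hypothetical, and the three inputs are the results of the manual steps (ping, dasadmin ld, test mount) as observed by the operator.

```python
def avr_next_step(ping_ok: bool, listd_ok: bool = False,
                  test_mount_ok: bool = False) -> str:
    """Return the next action for drives in AVR mode, following the checks in order."""
    if not ping_ok:
        return ("No response from the amu: call out ADIC, "
                "or wait if the network is known to be down.")
    if not listd_ok:
        return "dasadmin ld hung or returned an unexpected response code: call out ADIC."
    if not test_mount_ok:
        return "Test mount returned an unexpected response code: call out ADIC."
    return ("Robot is responding: reset the drive with drive control or "
            "take the tape offline with mt, then run dasadmin dismount.")


print(avr_next_step(ping_ok=True, listd_ok=True, test_mount_ok=False))
```

Each later check only makes sense once the earlier one has passed, which is why the defaults for the later results are False.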

5.10

Points To Note
Some more observations on problems encountered with ADIC robots.
Page 22

ISL/OPS/A869, Issue 2 (Error! No text of specified style in document.)

Netbackup Operational Procedures

Contents

The tlmd daemon, which NetBackup uses to talk to the robot, periodically tests
the state of the tape drives. If drives are in AVR mode (not DOWN-TLM),
then when the daemon gets a positive response the drives come back up
automatically.

In the syslog you may see the message "robot encountered an error handling
a volume". This could cause intermittent problems. You may be able to get
away with freezing the tape or, if it is a scratch tape, moving some scratch
tapes in from another pool (see exit status 96 for detailed instructions on
moving media).

Then call ADIC. The command to freeze a tape is


#bpmedia -freeze -ev volser

If the robot door is opened the robot will stop, and ADIC will need to be
called out to restart the machine.

dasadmin ld2 may have to be used to view tape drive numbers above 15.

By using vmoprcmd it is possible to determine the UNIX device files
associated with the NetBackup drive index. The output contains the drive
index number and the device file associated with it.

Check whether a tape is stuck on a drive:

Is there an RVSN on the drive index?

If so it means that NetBackup can read the tape label, and the drive is
functioning to some degree.

Use drive control to reset the drive. This effectively does
mt -f /device-file off
and tells the robot to put the tape away. If this does not work, make sure
backups are not using that tape drive, and issue
mt -f /device-file off
If this takes you back to the prompt (i.e. it has worked) then instruct the
robot to put the tape away by issuing the dasadmin dismount command, then try
the drive out again. If the drive fails again, then an engineer will have to
be called, because it may be that the drive can read the header label but
cannot position on the tape.
The sequence of events would be:

#mt -f /device-file off

#dasadmin dismount -t 3590 volser

Run a backup, and check on job monitor to see if the tape has positioned.

If mt -f /device-file off reports an error then call an engineer. Take the
drive down in NetBackup until the engineer has dealt with it.

If there is no RVSN on the drive

Firstly interrogate the robot to determine whether the robot has been putting
tapes in the drive. Use dasadmin ld to list the drives.

If there is a tape on the drive, check that the tape isn't just sitting on the
lip of the drive by instructing the robot to put the tape away. This can be
done by issuing the command
#dasadmin dismount -t 3590 volser
If the command is successful then the tape was on the lip.

If the robot reports that the drive cannot be unloaded, then issue
#mt -f /device-file off
If this takes you back to the prompt (i.e. it has worked) then instruct the
robot to put the tape away, then try the drive out again. If it doesn't work, try
#mt -f /device-file status
then call out an engineer. It is worth checking the syslog again, because the
previous actions may have forced UNIX to report an I/O error.

5.11

Library Problem
If there are problems with the library, i.e. drives in AVR and cartridges not
being mounted, check the messages log on the master server for entries such
as the following:
Feb 10 09:39:30 dybkup01 tlmd[25753]: [ID 897060 daemon.error] TLM(0) dismount failure for volser SDY815 on drive DRIVE12, d_errno = 10, The AMU was unable to communicate with the robot.
Feb 10 09:39:30 dybkup01 tlmd[18661]: [ID 160136 daemon.error] TLM(0) going to DOWN state, status: Robot hardware or communication error
Feb 10 09:39:55 dybkup01 tlmd[26277]: [ID 969665 daemon.error] TLM(0) dismount failure for volser SDY470 on drive DRIVE11, d_errno = 10, The AMU was unable to communicate with the robot.
Feb 10 10:33:13 dybkup01 tlmd[18661]: [ID 861719 daemon.error] TLM(0) drive DRIVE11 (device 0) is being DOWNED, status: Robotic dismount failure
On seeing these messages, initiate a call with the vendor, ADIC.
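Searching the messages log for these entries can be done with grep. The following is a minimal sketch, assuming each syslog entry occupies a single line in the log file itself (the entries are wrapped only on the printed page). The function names are hypothetical.

```shell
# Reads syslog lines on standard input and prints only the tlmd
# daemon.error entries of the kind shown above.
tlmd_errors() {
    grep 'tlmd\[[0-9]*\]' | grep 'daemon\.error'
}

# Counts how many of those entries report a dismount failure.
tlmd_dismount_failures() {
    tlmd_errors | grep -c 'dismount failure'
}
```

On a Sun master server this might be run as tlmd_errors < /var/adm/messages; the log location differs by platform.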

6 Accessing NetBackup
6.1

NBU 4.5
When you log on to a server, access NetBackup using one of the following
methods. This is only possible on a NetBackup Master server, not a client.

Via the vt100 panels:

#bpadm

From the toolbar: select the Start button, Programs, Veritas NetBackup,
NetBackup Administration, and from there access the required panel.

By clicking your shortcut icon (if one has been set up).

You may also wish to amend your profile to include the netbackup
directories in your PATH.

Another method of accessing the panels, without the need to amend
profiles, is to issue su storage as root, which will give you the profile
necessary to perform the above.

7 NetBackup Daemon Problems


There are six main NetBackup daemons, which run constantly.
NetBackup:

bprd
bpdbm

Media Manager:

vmd
ltid
avrd
tlmd/tlhd

These daemons run on the server only, and not on any of the clients. On slave
servers only the Media Manager daemons run.
Of these the most significant are bprd and ltid.
bprd starts bpdbm, and ltid starts vmd, avrd, and tlmd or tlhd.
tlmd and tlhd are robotic daemons; which one is used depends upon the type
of robot. tlmd is for ADIC robots, whilst tlhd is for IBM robots.

7.1.1

Daemon Descriptions

bprd - On master servers this daemon handles requests for backups and
restores, and scheduled backups.

bpdbm - On master servers bpdbm handles all the configuration, error and
file databases.

ltid - On masters and slaves this daemon controls the reservation and
management of volumes.

avrd - On slaves and masters this daemon performs automatic volume
recognition, i.e. recognising a volume that has a label on the tape.

vmd - The volume manager daemon manages the volume database containing
details about tape usage.

tlmd/tlhd - The robotic control daemons perform robot handling.

7.1.2

Starting NetBackup
The script bp.kill_all in /usr/openv/netbackup/bin/goodies will kill off all
netbackup daemons under normal conditions. If after running bp.kill_all
there are daemons hung (usually bpsched processes) these should be killed,
using kill, or kill -9 if necessary.
Alternatively, if you experience problems shutting down Netbackup using the
above script, use the nbu_kill script, which can be found in /usr/openv/btscripts.

7.1.3

Starting up individual NetBackup daemons

To start up NetBackup daemons you must always be root.
To check for NetBackup daemons, issue bpps -a, which can be found in
directory /usr/openv/netbackup/bin

The NetBackup startup script, netbackup, is in
/usr/openv/netbackup/bin/goodies.
If bprd has crashed, issuing /usr/openv/netbackup/bin/initbprd will start it
again.
If bpdbm has crashed, bprd will periodically start it up again.
If vmd has crashed then it can be started by issuing:
#vmadm
s> special actions
i> initiate Media Manager Volume Daemon

If just tlmd has not started, or has crashed, it can be started using ./tlmd
This is the recommended command when starting NetBackup.
If avrd has crashed, issue /usr/openv/volmgr/bin/stopltid to stop ltid,
and then /usr/openv/volmgr/bin/ltid to restart ltid.
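Before restarting an individual daemon it helps to know which of the six are actually missing. The following is a minimal sketch, assuming the process listing (for example the output of bpps -a) ends each line with the daemon's command name, with or without a leading path. The function name missing_daemons is hypothetical.

```shell
# Reads a process listing on standard input and prints the name of each
# expected daemon that does not appear in it.
# $1 is the robotic daemon for the site: tlmd (ADIC) or tlhd (IBM).
missing_daemons() {
    robot_daemon=${1:-tlmd}
    listing=$(cat)
    for d in bprd bpdbm vmd ltid avrd "$robot_daemon"; do
        # Match the name as a whole word: after a space, a "/" or the
        # start of the line, and before a space or the end of the line.
        if ! printf '%s\n' "$listing" | grep -E "(^|[ /])${d}( |\$)" >/dev/null; then
            echo "$d"
        fi
    done
}
```

A typical use would be bpps -a | missing_daemons tlmd on an ADIC site; no output means all six daemons were found.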

7.2

Shutdown Netbackup
The script bp.kill_all in /usr/openv/netbackup/bin/goodies will kill off all
netbackup daemons under normal conditions. If after running bp.kill_all
there are daemons hung (usually bpsched processes) these should be killed,
using kill, or kill -9 if necessary.
Alternatively, if you experience problems shutting down Netbackup using the
above script, use the nbu_kill script, which can be found in /usr/openv/btscripts.

7.2.1

Problem resolution
It is rare that there are problems with the NetBackup daemons.
Common problems are:

The daemons weren't started after a reboot. In this case log on to the
server, su - root, then issue ./netbackup from
/usr/openv/netbackup/bin/goodies.

The hostname on the box has changed. This should be reviewed with
NetBackup Support and Service Delivery.

tlmd can crash if there has been a severe robot problem. To start this
daemon, log on as root and issue ./tlmd from /usr/openv/volmgr/bin.

tlmd can crash if there is a network problem.

If after trying the above NetBackup does not start, then it will be necessary
to call out NetBackup support.

8 Netbackup Activity Logs and process information
8.1

Introduction
This directory (/usr/openv/netbackup/logs) is where detailed activity logs will
be placed on the NetBackup client box if certain sub-directories exist. These
sub-directories should only be created if unexplained problems are occurring
with the NetBackup product and more information is required to isolate the
problem. For further information on Veritas NetBackup logging procedures
see ISL/OPS/B194.

Warning:
Some of these logs can potentially grow very large, and should only be
enabled if unexplained problems exist.

8.2

File system full problems


If any of the filesystems listed below becomes full, or is reported as
over a specified percentage, address this by immediately removing the
log directories under /usr/openv/netbackup/logs, leaving only
admin and user_ops.
/usr/openv/ mountpoints at 100%
If the mountpoint for NetBackup reaches 100%, the following action can be
taken to try to reduce the utilisation.
Log on to the server, su to root and issue the following command:
#bpimage -cleanup -allclients
An alternative is to bounce the daemons, which will force the cleanup process
to start. The cleanup has been added to the crontab on all BRUNIX master
servers so that it runs regularly.
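Utilisation of the NetBackup mountpoints can be checked before the filesystem fills. The following is a minimal sketch, assuming df output in the POSIX layout (as produced by df -P -k, where available) with the percentage in the fifth column and the mountpoint in the sixth. The function name and the default threshold of 90 are assumptions, not site standards.

```shell
# Reads "df -P -k" style output on standard input and prints the
# mountpoint and percentage of every /usr/openv filesystem at or above
# the threshold given in $1 (a whole number, default 90).
openv_over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 && $6 ~ /^\/usr\/openv/ {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= limit + 0) print $6, $5
        }'
}
```

For example, df -P -k | openv_over_threshold 95 lists only the /usr/openv filesystems that are 95% full or more.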


8.3

Full List of NETBACKUP Processes


Here are descriptions of NetBackup processes:
bprd
-request daemon

-can be terminated and initiated from the admin interfaces

-responds to client and administrative requests

-restores

-backups

-archives

-"list files backed-up or archived"

-manual/immediate backups

-reread configuration database

bpsched

-backup scheduler

-started by bprd on user directed backups and archives

-started by bprd on immediate/manual backups

-started by bprd every "Wakeup Interval" for regularly scheduled incremental
and full backups

-uses information from the class & storage unit databases to determine what
clients to start, when to start them, and what storage unit to write
backups/archives to

bpdm

-disk manager

-used on storage units of type Disk

-started by bpbrm on backups and restores

-during backups and restores, one of these is started (on the server with the
storage unit) for each client backup or restore

bptm

-removable media (tape) manager

-used on storage units of type Logical Tape

-started by bpbrm on backups and restores

-during backups and restores, one of these is started (on the server with the
storage unit) for each client backup or restore

-also responsible for managing the media database

-used to display info in the Media Reports screen when you select Media List

bpbrm

-backup/restore manager

-started by bpsched on backups/archives

-started by bprd on restores

-during backups and restores, one of these is started (on the server with the
storage unit) for each client backup or restore

-responsible for managing both the client and the media manager processes;
uses error status from both to determine ultimate status of backup or restore

bpdbm

-database manager

-manages class, config/behavior, storage unit, and error DB's

-started by the inetd(1M) process

bpcd

-"client daemon"

-used on clients (and remote servers) to initiate other product programs,
without requiring /.rhosts entries for the server on each client

-started by the inetd(1M) process

bparchive

-command-line program on clients to initiate archives

-communicates with bprd on server

bpbackup

-command-line program on clients to initiate backups

-communicates with bprd on server

bpbkar

-program used on standard clients to generate backup images

-not used directly by client users

bplist

-command-line program on clients to initiate file lists

-communicates with bprd on server

bprestore

-command-line program on clients to initiate restores

-communicates with bprd on server

tar

-program used on standard clients to restore backup images

bp

-menu user interface for backups, archives, and restores

8.4

General Drive Testing and Fault Resolution
When a media hardware problem is identified by Ops Analysts they will
attempt resolution themselves, without involving NetBackup Support. This
resolution will include callout of onsite support (e.g. NACC Ops, CE), if
required, and subsequent liaison to resolve the fault. However, NetBackup
Support may be contacted for more detailed problem analysis if the problem
is not immediately identified as a tape/drive/library fault.

8.5

Dealing With Tape Drive Problems


The majority of service-affecting backup problems that occur on
NetBackup are caused by tape drive failures of one sort or another. It is
therefore important to understand the components in a drive path, and which
tools we can use to diagnose the problem.
When dealing with drive problems, always check first whether the problem is
drive or robot related, as the error codes are often the same.

8.6

Determining whether problem is Drive or Robot

It is not always easy to determine whether the problem lies with the drive
or with the robot, but for the most part you can tell very quickly
by looking at the following.

Is the robot a shared library?
If it is, are tapes being mounted on another server? If so it is unlikely that
the problem is robotic.

Can you obtain drive status?
If you cannot, you definitely have a robot problem.

Are the tape drives in AVR mode?
If so it may be a local problem with the connection to the robot.

Are the tape drives down, and do they remain down when you bring them up?
If so this usually means you have a drive related problem in the first instance.

What does the messages/syslog say?
NetBackup media management logs useful information to the syslog, and this
should always be examined.

At the end of your initial diagnosis you may not be completely sure whether
the problem is robot or drive related. If in doubt, start by looking at the
drive, as this is a much more common problem.

8.6.1

Points To Note
The following points to bear in mind when dealing with drive problems are
the result of experience rather than methodical diagnosis, but they have
happened on more than one occasion and are not easy to spot.

Everything seems ok until you take one drive down

The H/W engineers (usually Sun or Sequent) have swapped the SCSI cables
around. NetBackup will spot that the tape it wants is in a drive and use it.
However, if the tape drives are the wrong way around, then when you shut
one down the tape will be put in the down drive. This is very confusing. The
resolution is very simple: change the robot drive numbers around and run
stopltid and ltid. The NetBackup Support team should do this.

ltid, avrd, vmd, tlmd or tldd will not start after a reboot, or after you have
started NetBackup

Look in the syslog first; it always tells you why they did not start.
- If the device file is not in UNIX then ltid will not start, but this is reported
in the syslog.

After a reboot, it has been known (particularly on Compaq) for the device
files to be renamed. This has the same effect as removing the device files. It
is also a nightmare trying to map the new files back on. There is no easy way
to identify whether this has happened, but if an ls -al command is issued on
the directory it should be possible to identify whether there are new device
files.

SCSI extenders complicate the situation!

Many of the drive problems are caused by SCSI extenders either failing or
having a glitch. Resetting the extenders may be enough to fix a problem.
However, from our point of view we just call out an engineer; it will be the
engineer who determines what action is required.

Reboot or Not Reboot


It is by no means clear when a reboot is required after or during work on
a drive. Some manufacturers support peripherals better than others do. By
deconfiguring the SCSI bus on Sequent boxes we can do most operations
without requiring a reboot. Compaq boxes are also fairly robust. HP and Sun
are questionable, although Sun should not require a reboot under normal
circumstances. If the SCSI cable is unplugged from the back of the host
machine, then most times you will have to reboot the machine. If you cannot
use a device after it has been fixed, always seek guidance from TSG before
arranging to reboot a box.

8.7

Device Configuration Utility - Tpconfig


tpconfig is the NetBackup utility used to configure devices for NetBackup.
Full details about its use are covered in the Veritas NetBackup manuals.
This utility will mostly be used by Storage Administrators, but there is one
command that the OAs will find useful:
tpconfig -d
This lists all the devices configured in NetBackup.

9 Deconfigure/Reconfigure Sequent Drives


9.1.1

Deconfigure
Prior to deconfiguring:

Ensure that the drive in question is DOWN in NetBackup, as otherwise
the drive will be actively polled, preventing deconfig.

To perform this task root access is required.

Check on Device Manager for the device name by which the drive in
question is set, or use tpconfig -l, which displays the same information.

In brief the steps are:

1. Check
dumpconf | grep tc

2. Deconfigure
devctl -d tcn
devctl -d scsibusnn
3. Reconfigure
devctl -c qcicn
4. Check
dumpconf | grep tc

9.1.2

To perform the DECONFIG


List the current config using: dumpconf | grep tc

Enter "devctl -d tcn" where n = drive number, e.g. tc2
(this can be found by using tpconfig, list drive configuration).

If all is well, you will see a brief statement confirming that the drive in
question has been deconfigured:
# devctl -d tc0
devctl: deconfiguring tc0 from scsibus18
The output from the above command gives the scsibus number.
Enter "devctl -d scsibusnn" where nn = scsibus number, e.g. scsibus18
If all is well, you will see a brief statement confirming that the scsibus
in question has been deconfigured (see below):
# devctl -d scsibus18
devctl: deconfiguring scsibus18 from qcic5
Keep a note of the drive name, scsibus and qcic (or fcbr) for reconfiguring.
The engineer should now be able to carry out any work necessary.
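The drive, scsibus and qcic names needed for the later reconfigure can be picked out of the devctl messages rather than copied by hand. The following is a minimal sketch, assuming the messages have exactly the form shown in the examples above and that the session output has been saved to a file. The function name devctl_parents is hypothetical.

```shell
# Reads devctl deconfigure messages on standard input and prints one
# "device parent" pair per line, e.g. "tc0 scsibus18".
devctl_parents() {
    sed -n 's/^devctl: deconfiguring \([^ ]*\) from \([^ ]*\)$/\1 \2/p'
}
```

Run against the two example messages above, this prints "tc0 scsibus18" and "scsibus18 qcic5", which is the chain to keep a note of.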

9.1.3

Reconfiguration
Prior to reconfiguring, if you have misplaced the deconfig details, you can
find them in the relevant ktlog via /usr/adm/ktlog/yyyy/mm/dd, e.g.
/usr/adm/ktlog/1999/10/04. You should find output similar to the following:

#37f8c521 16:17:53 tolog/note p8598 devctl -dD tc1
#37f8c521 16:17:53 tolog/note p8598 NAME CFGTYPE DEVNUM UNIT FLAGS OnBUS OnDEVICE
#37f8c521 16:17:53 tolog/note p8598 deconfig: tc1 tc 1 0x00000000 S scsi scsibus22
#37f8c522 16:17:54 tolog/note p8598 devctl: deleted tc1: type: tc: devnum 0x1
#37f8c53b 16:18:19 tolog/note p8635 devctl -dD scsibus22
#37f8c53b 16:18:19 tolog/note p8635 NAME CFGTYPE DEVNUM UNIT FLAGS OnBUS OnDEVICE
#37f8c53b 16:18:19 tolog/note p8635 deconfig: scsibus22 scsibus 22 0x00000060 SM mscsi qcic6
#37f8c53c 16:18:20 tolog/note p8635 devctl: deleted scsibus22: type: scsibus: devnum 0x16
This information can also be listed using the ktmesg command.


9.1.4

To perform the RECONFIG


Enter "devctl -c qcicn" where n = qcic number, e.g. qcic13
If all is well, you will see a brief statement confirming that the devices have
been found, as in the example below.

# devctl -c qcic5
devctl: Found scsibus18, tc0
To check the current settings and confirm the above commands,
enter "dumpconf | grep tc", which will produce output similar to the
following:

tc0     tc      0       0x00000000  S  scsi  scsibus18
tc1     tc      1       0x00000000  S  scsi  scsibus22
tc2     tc      2       0x00000000  S  scsi  scsibus26
tc3     tc      3       0x00000000  S  scsi  scsibus30
tcpmux          pseudo  -

If the original settings are showing, the work has been completed
successfully.
If, on using the "devctl -c qcicn" command to reconfigure, only one new
device shows, try a query on the found device, e.g. "devctl -c scsibus18",
which may detect the other device.

Note: If the system responses diverge in any way from the examples given
above, contact Sequent TSS Group.

10 Resetting the SCSI extenders


10.1

Overview
SCSI extenders are required to connect some tape drives
because there is a physical limit of 25m on SCSI, which in most computer
halls is not enough to allow full use to be made of expensive robotic
libraries.

The use of SCSI extenders greatly increases the number of points of failure
present in the configuration. It also complicates problem
resolution.
NOTE: the resetting of SCSI extenders is usually done by the Data Centre
Operations teams. However, the procedure is as follows:

1. Remove the covers at the back of the robot. Identify the offending drive.
2. On top of the drive there is a small LCD panel connected via a cable.
On this panel, using the arrow keys, scroll down to "UNLOAD DRIVE" and
press "RETURN". This will eject any tape which might be mounted.
3. Next, go to the PC at the end of the silo and send a 'KEEP' signal as
follows:
From the open window, click on 'COMMANDS' then from the drop down
menu, click on 'KEEP'.
Move the cursor to "SOURCE" and enter the drive number, e.g. D21. This must
be in upper case.
Then click on "EXECUTE".
4. Go back to the LCD display panel connected to the drive and from the
keypad menu, press 'E' then '2'.
You should see 'FIBLEN' on the display. This indicates that you have
carried out a 'fibre length' check. Then press 'C' to clear.
However, if nothing happens after entering '2' then power OFF the drive.
The ON/OFF switch is at the back of the drive on the right hand side.
Then power back ON.
5. Reset both SCSI extenders.

11 Useful Media Commands


11.1

BPMEDIA: Freeze, unfreeze, suspend or unsuspend media
bpmedia -[parameter] -ev [media_id]
e.g.
#bpmedia -freeze -ev DTP216
where DTP216 is the volser.

Parameter     Description of Parameter
-freeze       Freeze specified media id
-unfreeze     Unfreeze specified media id
-suspend      Suspend specified media id
-unsuspend    Unsuspend specified media id
-ev           Specify media id
-h            Specify host
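When several tapes need the same treatment, the commands can be generated from a list of media ids. The following is a minimal sketch that only prints the bpmedia commands for review and does not run them. The function name bpmedia_commands is hypothetical, and it uses only the parameters listed in the table above.

```shell
# Prints (does not run) one bpmedia command per media id.
# $1 is the action: freeze, unfreeze, suspend or unsuspend.
# The remaining arguments are the media ids (volsers).
bpmedia_commands() {
    action=$1
    shift
    case $action in
        freeze|unfreeze|suspend|unsuspend) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    for id in "$@"; do
        echo "bpmedia -$action -ev $id"
    done
}
```

For example, bpmedia_commands freeze DTP216 prints the same command as the example above; the printed lines can be checked and then issued as root.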


11.2

Tape Management
# bpadm
Reports
Media
Media Summary

11.2.1

Killing Processes
# kill -9 {PID Number}
#bpps -a
root 1206 1 0 Mar 22 ? 0:32 /usr/openv/netbackup/bin/bprd
root 1218 1 0 Mar 22 ? 1:35 /usr/openv/netbackup/bin/bpdbm

11.2.2

VT100 Command Line

For producing readable lists from usually non-readable text files.
# cd /usr/openv/netbackup/bin/admincmd
# ./bpcllist ORACLE -L (or -U)

11.2.3

Logs
These may be in various locations dependent on the platform; however, check
the following first.
/usr/adm/ktlog/ - Sequent
/usr/spool/adm/
/usr/adm/syslog/
/var/adm/syslog/ - HP
/var/adm/messages - Sun
errpt -a - AIX
/usr/openv/netbackup/logs/bp*

11.2.4

WHO Command
who -b (Shows when system was last re-booted)


11.2.5

MT (SCSI Level) Commands


All mt commands require the full pathname of the drive device
file. They can normally only be issued as root, and talk to the AMU on a
SCSI level.
# mt -f /dev/rmt/tc0c status
tc0: Waiting up to 90 seconds for tape ready...
tc0: Device Not Ready (check cartridge).
/dev/rmt/tc0c: I/O error
# mt -f /dev/rmt/tc1c status
/dev/rmt/tc1c: I/O error
# mt -f /dev/rmt/tc2c status
tc2: Waiting up to 90 seconds for tape ready...
tc2: Device Not Ready (check cartridge).
/dev/rmt/tc2c: I/O error
# mt -f /dev/rmt/tc3c status
tc3: Waiting up to 90 seconds for tape ready...
tc3: Device Not Ready (check cartridge).
/dev/rmt/tc3c: I/O error
The above mt commands will work only if a tape is loaded in the drive, so
tapes should be loaded using das commands.
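When several drives are checked in one session, the device files that failed can be picked out of the saved output. The following is a minimal sketch, assuming the error line has exactly the form shown in the examples above (device file, colon, "I/O error"). The function name mt_io_errors is hypothetical.

```shell
# Reads the output of one or more "mt -f <device> status" commands on
# standard input and prints each device file that reported an I/O error.
mt_io_errors() {
    sed -n 's|^\(/dev/[^:]*\): I/O error$|\1|p'
}
```

Run against the example output above, this lists /dev/rmt/tc0c to /dev/rmt/tc3c, one per line.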

12 Restore Guidance
12.1

Background
This section is intended for use by groups who may be required to perform
restores on Unix Systems using NetBackup. There are a number of different
ways to run a restore. These are:
1. From the Admin GUI.
2. From the command line interface, the bpadm or bp panels.
3. From the command line directly.

12.1.1

Introduction
The restore facility for Netbackup is extremely powerful. Its use must only be
considered with sufficient justification by the requestor. It can be very easy to
destroy a UNIX box with Netbackup, particularly if you are restoring files
in / or /usr. If you are at all unsure about what is being requested by the user,
then refuse to complete the request. In the past many unnecessary restore
requests have been executed because the customer was not completely sure
what was required to fix the problem.
In short do not be afraid to ask why they require restores.
Wherever possible, the file or files should be restored to an alternative path.
If the user wishes to restore to the original directory, check whether you
need to overwrite data. Doing a restore with overwrite not allowed is much
safer and should be considered the normal method for performing restores.
Always impress on the requestor the importance of performing restores safely,
even if it means the user has to do some work.

12.1.2

NetBackup
NetBackup backs up data to cartridge. Recovering this data is straightforward
as long as the correct information is used to initiate the restore process.
It must be remembered that a file name in UNIX does not always refer to a
real file, as it could be a hard or soft link to another file. If NetBackup is
asked to back up a soft link it will not follow it! So to ensure that the data
is backed up, the target of the link must be specified. This also holds true
when attempting to restore the file. For hard links the situation is slightly
different, but as hard links are not commonly used, details are not included
in this document.
If you attempt to restore a soft link then no data will be restored. Therefore
providing the name of the soft link as the file to be restored is worthless.

12.2

Restore information
There are some pieces of information that must be provided to allow a restore
to be carried out. These are as follows:
1. The host name of the box from which the data was backed up.
2. The operating system of the box. Is it a UNIX or an NT client?
3. The date and time, or dates and times between which, the backup was taken.
The shorter the time between the start and end the quicker the search through
the NetBackup database will be, and hence the restore will take less time.
4. Should the data be placed in an alternate location, or can the data be
overwritten? (Insist the requestor specifies what is required).
5. Should the file, files or directories be renamed when restored?
6. The host name of the target box, if different to the source box.
Cross client restores are only possible if both the target and source box are
connected to the same NetBackup server.
Cross client restores must be enabled within NetBackup.
7. The fully qualified names of the files or directories to be restored.
8. A contact number for when the work is complete.
9. Acquiring a timeline:
For Bridge Operations to manage high priority restores (P1s or P2s)
effectively, ensure we have a documented timeline provided by the
requester so that we can checkpoint our progress. The onus is on
Bridge Operations to obtain this information from the requester before we
take on the responsibility of the restore. Ensure this information is
documented in the Clarify case. This will give ourselves and the MIT team
some visibility of expected progress and will allow us to make a
considered judgement on further escalation. This timeline can only be used
for very rough guidance. It will vary depending on the size of the database
and whether locally attached drives are used. If locally attached drives are
used then the restore should take no longer than the backup. On a shared
server the restore should be estimated at 20% longer than the backup.
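The rule of thumb above can be turned into a quick calculation for the timeline. The following is a minimal sketch, assuming the duration of the original backup is known in whole minutes; the function name restore_estimate is hypothetical, and the result is rounded up to the next minute. Like the timeline itself, it gives very rough guidance only.

```shell
# Prints a rough restore estimate in minutes.
# $1 - duration of the original backup in minutes (whole number)
# $2 - "local" for locally attached drives, "shared" for a shared server
restore_estimate() {
    backup_min=$1
    case $2 in
        # Locally attached drives: no longer than the backup itself.
        local)  echo "$backup_min" ;;
        # Shared server: 20% longer than the backup, rounded up.
        shared) echo $(( ($backup_min * 120 + 99) / 100 )) ;;
        *)      echo "usage: restore_estimate minutes local|shared" >&2
                return 1 ;;
    esac
}
```

For example, a backup that took 100 minutes gives an estimate of 100 minutes with locally attached drives and 120 minutes on a shared server.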

12.3

Media related items

Identifying the media required and its location


The quickest and easiest way to identify the media required is to run the restore request and
look in the log file.
An alternative is to use the bpimagelist command on the master server.
tpadsm01 $ bpimagelist -?
bpimagelist: unrecognized option -?
USAGE: bpimagelist [-media] [-l|-L|-U|-idonly]
[-d mm/dd/yyyy hh:mm:ss] [-e mm/dd/yyyy hh:mm:ss] [-hoursago hours]
[-keyword keyword phrase]
[-client client_name] [-server server_name]
[-backupid backup_id] [-option option_name]
[-class class_name] [-ct class_type]
[-rl retention_level]
[-sl sched_label] [-st sched_type]
[-M master_server...] [-v]

The class name, client name, and start and end dates (and possibly times) are the minimum
requirements to get a comprehensive media list. If you can specify more information the list
will be more accurate.
bpimagelist -media -d 07/01/2002 -e 07/05/2002 -client tpedm01-fe -class ORACLE
This will provide a list of all the cartridges used for ORACLE class backups between the dates
specified for the TPEDM01 client.
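To avoid mistyping the options, the command line can be assembled from the four minimum items. The following is a minimal sketch that only prints the command for review; it uses only options that appear in the usage text above, and the function name media_list_command is hypothetical.

```shell
# Prints (does not run) a bpimagelist command for a media list.
# $1 class name, $2 client name, $3 start date, $4 end date (mm/dd/yyyy)
media_list_command() {
    if [ $# -ne 4 ]; then
        echo "usage: media_list_command class client start_date end_date" >&2
        return 1
    fi
    echo "bpimagelist -media -d $3 -e $4 -client $2 -class $1"
}
```

For example, media_list_command ORACLE tpedm01-fe 07/01/2002 07/05/2002 prints the command used in the example above.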


12.4

Restore processes
If a restore is required to resolve a service-affecting problem with a
production system then a problem record should be raised and can be used to
initiate and document the restore.
Note: All restores run by Operations will use the overwrite = no option
as a default. This is to ensure that customer data is not inadvertently
overwritten. Therefore, for all restore requests where the data is to be
restored to its original location, the files affected will have to be deleted
before the restore can complete.
If a restore is required as part of some testing or development work then a
change record should be raised. This will allow the implementors to plan
their work effectively.
The record will be used to track the progress of the restore, to deal with any
unforeseen problems (e.g. cartridges not available), and to request the
appropriate access. Restores must be run from an account that has the correct
access level to the files and directories; usually the root account is used.
If the restore is for a number of files or directories then a list file should
be created. This list file should contain each fully qualified file name, one
on each line. This file can then be used as an input file for the restore and
saves a lot of work for the Operations team.

Warning:
The list file must not contain any blank lines or extra blank spaces at the
end of the file names. If it does the restore process will fail with an invalid line
length message and an Exit Status 144 error code.
If the files need to be renamed individually then a rename list file should be
created. This file must contain the following syntax:
change original_fully_qualified_file_name to new_fully_qualified_file_name
One line for each rename must be specified; no wildcards or substitution can
be done.
Warning:
The same restrictions apply to this file as to the list file.
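Rename entries are easy to mistype, so they can be generated rather than keyed by hand. This is a minimal sketch: rename_line is a hypothetical helper name and the paths shown are purely illustrative.

```shell
# rename_line ORIGINAL NEW  (hypothetical helper)
# Prints one rename entry in the "change ... to ..." syntax,
# with no trailing spaces.
rename_line() {
    printf 'change %s to %s\n' "$1" "$2"
}

# Example with illustrative paths; redirect the output to build the rename list file.
rename_line /data/app/report.dat /tmp/restore/report.dat
```

Because each call emits exactly one line, appending the output to a file gives one rename per line as required.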


The dates between which the backups were done must be supplied, if
possible. By specifying a specific date range the search time through the
NetBackup database will be reduced.

12.4.1

Preparation for Restore


Before invoking the bp utility, ask yourself the following questions.

Do I need root to perform the restore?

Am I sure that the host name is correct for the restore (particularly important
for E10k boxes)?

Do I need to be on the client to do the restore, or can I do it from the server?



Is the client in a HA configuration?

Is the restore request for a raw partition or a filesystem file?

Has the user requested a restore to an alternative path and, if so, what do you
need to do about links?

Once you are sure you know the answer to each of the above, then proceed.

12.4.2

Performing the Restore


Logon to the correct Master backup server. This assumes you are doing a
Server directed restore.

12.4.3

Logon to Root
NetBackup uses standard UNIX file permissions to control access to files so
unless you have read/write permissions to the file you will need to use root.
Remember that you may not even see the file in NetBackup if you do not
have the correct permissions. Using root will ensure that you can always see
the client.

12.4.4

Invoke bp
Use /usr/openv/netbackup/bin/bp
At this point you select the restore menu, and then you will be presented with
options to restore from different kinds of backups. If you are restoring from
normal filesystem backups, then select restore from backups. If you are
restoring from raw partitions then select restore from raw. If you are restoring
NT then select restore ms-windows.

Primary Options Menu. Initiating a restore


At this screen you will be able to make all the selections needed before
actually initiating the restore.
You can check all the options are correct before you initiate the restore.

Source client/Destination Client: both these entries must be the same; if not,
you may restore a file from one machine to another. Make sure that the
client name is as known to NetBackup, with all relevant suffixes (e.g. -fe).

Date Range: ensure that you have the correct date range and that it is as
narrow as possible. When Netbackup searches its database it could be
uncompressing large amounts of data about backups. Obviously the wider the
search the more time it will take to do this. You also run the risk of filling
the /usr/openv disk partition.

File Path: specify the file path you require to search on.

Specify Alternative Path: use this if you wish to restore to an alternative
directory. Care must be used with this option: remember that the restore
from prompt requires you to enter the pattern that will be used to match for
restores. It must be the directory you wish to restore, or a parent directory
common to all the different paths you wish to restore. E.g. suppose you wished
to restore /usr/openv/patches and /usr/openv/version. The restore from
prompt may default to /usr/openv/patches/; this would mean that you could
end up restoring patches to a different directory, while version would be
restored to the original location. The correct value for restore from
would be /usr/openv. Also, when you specify an alternative path, the restore
will create a path from that level on, so a restore using restore from as
/usr/openv and restore to as /tmp will create the /tmp/patches directory.

Set your directory depth for searching. This defaults to 1; if you have this at
too low a level it may seem that the file is not backed up, because only files
or directories down to that level are shown. Setting the level to zero means you
will see everything down from the path, but be careful with its use, as it can
give you masses of data to look at.

When you are selecting files and directories you are searching the NetBackup
database so this may take some time. Also be aware you can drill through
directories using zoom in and out, this can be useful if you are searching for a
file and the user is unsure where it was.

If you need to you can build up a restore job, for instance you may select
some files for restore, change your path and select some more.

Use edit/view to see your selections before you initiate a restore.

When initiating a restore read the prompts carefully. Decide where you wish
to place the progress log. The output from this log can be significant in size if
you are changing the paths for a number of files.

12.4.5

Running a restore

The progress of the restore must be tightly monitored in order to enable
successful completion in the minimum time possible. Be aware of the
timelines specified.

1) Firstly - has the job actually started or is it queued? If the job is queued,
Section 12.6 (Reduction of NetBackup drive usage to allow a restore to
run) shows you how to ensure sufficient resources are made available to
allow the restore to commence.

2) Once the restore has been initiated, start monitoring the progress log. The
progress logs are named in the format bplog.rest.xxx and, unless you specified
otherwise whilst setting up the restore, are created in the root directory, e.g.
tail -f /bplog.rest.001


3) Ensure the tapes requested are successfully mounted. If the log states the
tapes are not in the library, or the restore job appears to hang, then check the
physical location of the tapes using the relevant dasadmin or mtlib commands as
documented in the respective Robotic Library sections. For tapes not in the
library, inform the relevant Hardware site and ask them to locate the
requested tapes and insert them back into the library. Once the tapes have
arrived back onsite and been placed in the library, ensure the tapes are marked
as onsite.
Note: in some circumstances when vaulting has run, the required tapes will be
physically on drives but marked as not available to the robot, which will cause the
restore to hang. In these circumstances the status of the tape will need to be changed
back and any pending requests either resubmitted or denied as appropriate. To
check for this, run vmoprcmd on the relevant master/media server and
check the output for any Pending Requests. Running vmoprcmd -resubmit
<request id> will allow the restore to continue.

4) Under Netbackup 4.5 you can check up on the restore using the activity
monitor. This will show you the throughput in KB/s, Number of files restored
and also the percentage complete. One other way to check up on what is
happening is via the URL <http://byadsm03.nat.bt.com/activity/>.

5) The bp process created by the restore on the server can be checked to
ensure that its CPU time is incrementing.

6) It is also worthwhile checking that the files requested are actually being
restored, by cd'ing to the relevant directory on the client and using the ls
command to ensure progress is being made.
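For step 2, the newest progress log can be located without knowing its sequence number. This is a minimal sketch: latest_restore_log is a hypothetical helper name, and it assumes the logs follow the bplog.rest.xxx naming described above.

```shell
# latest_restore_log [DIR]  (hypothetical helper)
# Prints the most recently modified bplog.rest.* file in DIR (default: /).
# Prints nothing if no progress log exists yet.
latest_restore_log() {
    dir=${1:-/}
    # ls -t sorts newest first; errors are discarded if there is no match.
    ls -t "${dir%/}"/bplog.rest.* 2>/dev/null | head -n 1
}

# Typical use once a restore has been initiated:
#   tail -f "$(latest_restore_log)"
```

If the progress log was placed somewhere other than the root directory when the restore was set up, pass that directory as the argument.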

When a restore is running there are some points to bear in mind:-

When you are looking at the restore job it may take a while before a restore
kicks in. This is because the restore is searching a large NB client database
for the files; this may take up to 20 minutes. On heavily used servers, the
restore may time out. If so just reissue the restore request.

If the file is in a large system single stream backup you may notice it takes a
long time to restore relatively small files, after the restore is positioned on the
tape. This is because the restore has to read through the backup to find the
files on the tape.

You may see waiting for mount of tape. This could be because the drives are
in use for backups, or because the tape is offsite. You can check for an offsite
tape by looking at the requested tape in media manager.

Alternatively you may wish to use the panels


Issue bpadm
NetBackup Server: tpcds1


NetBackup Administration
-----------------------
s) Storage Unit Management...
c) Class Management...
g) Global Configuration...
r) Reports...
m) Manual Backups...
x) Special Actions...
u) User Backup/Restore...
e) Media Management...
h) Help
q) Quit

ENTER CHOICE:
Select u
Master Server: tpcds1
Client: tpcds1
Main Menu
---- ----
b) Backup...
r) Restore...
h) Help
q) Quit

ENTER CHOICE:
Select r>
Master Server: tpcds1
Client: tpcds1

Restore Menu
------------
b) Restore Files and Directories from Backups...
a) Restore Files and Directories from Archives...
r) Restore From Raw Partition Backups...
f) Restore From Auspex FastBack Backups...
d) Restore From True Image Backups...
o) Restore From Oracle DB Backups...
i) Restore From Informix DB Backups...
s) Restore From Sybase DB Backups...
t) Restore From SQL-BackTrack DB Backups...
p) Restore From SAP DB Backups...
2) Restore From DB2 DB Backups...
m) Change Master Server...
h) Help
q) Quit Menu

ENTER CHOICE:
Select b>
Path: /usr/users/storage/
Start Date: 11/29/98 22:11:39 Master Server: tpcds1
End Date: 12/01/99 23:59:59 Source Client: tpcds1
Files Selected: 0
Destination Client: tpcds1
Directory Depth: 1 level
Class Type: Standard
Display Mode: Brief
Keyword Phrase:
Restore Backups
---------------
s) Select Files and Directories... p) Change Path...
e) Edit/View Selected Files... d) Change Date Range...
i) Initiate Restore
c) Change Directory Depth...
x) Change Display Mode to Verbose m) Change Master Server...
l) List Backup Images...
b) Change Source Client...
a) Specify Alternate Path...
t) Change Destination Client
q) Quit Menu
y) Change Class Type
h) Help
k) Change Keyword Phrase

ENTER CHOICE:
Then p> to change path
d> to change date range
c> to change directory depth (0 gives the most information)
s> to select your files/directories
When you are satisfied with the above, enter i> to initiate the restore.

Finishing A Restore
Send the restore log to the requestor, and ask them to verify the restore.

12.4.6

Problems Locating a File.


If you cannot find the file you want to restore consider the following.

Are the source and destination client names correct?

Is the file in a directory which is a linked directory name?

Is the date range too narrow?

Is the path name right?

Is the client/server a HA configuration?

Is the restore a raw partition?

Has the directory depth been set too high or too low?

Do I need to be root?


Finally, it may be that the file was not backed up. This should be checked
against the policy to see if it covers the requested files/directories.

12.5

Failover Restores
A failover restore is needed when a Media Server is unreachable and
your restore job is trying to use its tape drives. The bpmedia -movedb
command is then required to move the NetBackup catalog entries for the
media you are restoring from to an alternative Media Server.
This procedure arose from an incident in which restore attempts failed because
NetBackup was trying to mount the relevant tape into the nfmpramb drive, which
was inoperable at the time. The restore request itself was resolved by copying
an identical file across from the nfmprama server, but the need to provide a
means of managing this problem in future remained.

Following guidance re. failover restores in the manuals, C/R 5400777 was
raised to update the byadsm01 bp.conf in the hopes that this would enable the
automatic switch to alternative drive(s) specified in the bp.conf file, in the
event of similar restore failures. Unfortunately, the restore tests which
followed failed and case 140017461 has been raised with Veritas to look into
the causes.
In the meantime however, the following command has now been successfully
tested and can be used from root access once the specific tape required is
known:
"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev <media_id>
-newserver <hostname> -oldserver <hostname>"
In this example the command would read:
"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev BBL007
-newserver nfmprama -oldserver nfmpramb"
Details of the tape(s) required will always be shown in the relevant
bplog.rest.00n log (usually in the root directory) from the failed restore
attempt.
Once the restore has been completed successfully, it is advisable to
run the same command in reverse to avoid confusion at a
future date should further restores be required using the same tape. Again in
this context, the command would read:
"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev BBL007
-newserver nfmpramb -oldserver nfmprama"
As this method only provides a very specific solution and needs to be keyed
in, efforts to progress the case with Veritas will continue, in the hope
that we will be able to configure automatic restore failover functionality.
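Because the command has to be keyed in twice with the servers swapped, the pair can be generated together to reduce the risk of a typing error. This is a minimal sketch: movedb_cmds is a hypothetical helper name, and it only prints the commands for review rather than running them.

```shell
BPMEDIA=/usr/openv/netbackup/bin/admincmd/bpmedia

# movedb_cmds MEDIA_ID FAILED_SERVER ALTERNATE_SERVER  (hypothetical helper)
# Prints the command to move the media to the alternate server,
# followed by the command to move it back after the restore.
movedb_cmds() {
    media_id=$1
    failed=$2
    alternate=$3
    printf '%s -movedb -ev %s -newserver %s -oldserver %s\n' \
        "$BPMEDIA" "$media_id" "$alternate" "$failed"
    printf '%s -movedb -ev %s -newserver %s -oldserver %s\n' \
        "$BPMEDIA" "$media_id" "$failed" "$alternate"
}

# The example from this section: tape BBL007, failed server nfmpramb.
movedb_cmds BBL007 nfmpramb nfmprama
```

The first line printed is run before the restore and the second once the restore has completed successfully.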

Overview

Bplist is a command supplied by Veritas to interrogate the NetBackup backup
images database. It is an extremely useful NetBackup function as it allows
users to check the existence of backups without having to use the
restore panels. This command can easily be incorporated into scripts that
are run on a regular basis. There are a large number of parameters
associated with bplist; these are documented in the NetBackup systems
administration guide.

12.5.1

Functional description.
The bplist binary will search the NetBackup master server for backups; by
default the command will search using client name and master name
specified in your bp.conf file. The bp.conf file resides in
/usr/openv/netbackup or you can specify your own bp.conf file in your home
directory that overrides the global bp.conf.

The NetBackup database of backups is really a catalogue, which is comprised of
a series of directories and flat files. On some of the high volume servers the
image information will be compressed. When a bplist command is executed,
the range of dates specified should be considered carefully, as you could end
up searching the entire set of backups for a server.

Using various parameters you will be able to get listings similar to the Unix
command ls, and where there are multiple occurrences of a file in a listing,
this will indicate that there are multiple backups.

12.5.2

Using Bplist
The command /usr/openv/netbackup/bin/bplist can be used to find out if a
file or directory was backed up on a specific date, e.g.
bplist -s mm/dd/yy -e mm/dd/yy fully_qualified_file_name
This command would have to be run on the client box, using the root account
or one that has access to the files or directories being queried. The bplist
command has global execute permissions, but it is important to realise that
NetBackup security is based upon Unix file permissions. If you do not have
permission to restore a file, then you will be unable to perform a bplist for the
file; you will simply be returned to the command line.
The keyword function allows backups to be associated with a keyword,
which is indexed and should speed up query times. It can be specified when
the bpbackup command is issued, or by the NetBackup administrator when a
scheduled backup is defined.
Before listing backups, confirm whether the file is filesystem, or raw
partition.
Please note that by limiting the date range the database search time will be
reduced.

It is also possible to do recursive searches through a directory tree structure.
Use the bplist -? command to get a full list of the available options.

12.5.3

Bplisting Filesystem Backups


The following examples all show the command being issued from
the client, to find files in a directory called /usr/home.
Listing filesystem file backups.
bplist -R -l -s mm/dd/yy /usr/home
The example above will search for a filesystem backup, recursively listing all
files and directories from /usr/home downwards. The -R option tells
bplist to display all files and directories from /usr/home downwards; a number
after -R limits the depth of the search, e.g. -R 2 will show 2
directory levels.

The -s option tells bplist to start searching from the specified date onwards.
There is also a -e option for the end date, and you can specify the hours and
minutes (see the manual). Using an end date will reduce the search time. The
-b option displays the backup date and time of each file.
bplist -R -b -l -s mm/dd/yy /usr/home

will give a listing where the date and time of the backups are listed.
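The dates given to -s and -e are in mm/dd/yy order, which is easy to get wrong when a request quotes dates as dd/mm/yy. The following is a minimal sketch of a sanity check to run before issuing bplist: is_nb_date is a hypothetical helper name, and it only checks the shape of the date, not whether the day exists in that month.

```shell
# is_nb_date DATE  (hypothetical helper)
# Returns 0 if DATE looks like mm/dd/yy or mm/dd/yyyy with a
# plausible month (01-12) and day (01-31), otherwise 1.
is_nb_date() {
    case $1 in
        [0-9][0-9]/[0-9][0-9]/[0-9][0-9]) ;;
        [0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]) ;;
        *) return 1 ;;
    esac
    mm=${1%%/*}        # text before the first slash
    rest=${1#*/}
    dd=${rest%%/*}     # text between the first and second slash
    mm=${mm#0}         # drop a leading zero before comparing
    dd=${dd#0}
    [ "$mm" -ge 1 ] && [ "$mm" -le 12 ] && [ "$dd" -ge 1 ] && [ "$dd" -le 31 ]
}

# 23/02/05 is rejected because 23 is not a valid month.
for d in 07/05/02 23/02/05; do
    if is_nb_date "$d"; then echo "$d ok"; else echo "$d rejected"; fi
done
```

A date such as 05/07/02 passes the check in either reading, so the requestor's intended order should still be confirmed.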

12.5.4

Bplisting Raw Partition backups.


To list a raw partition backup the -r parameter must be specified; this will
tell NetBackup to look for raw partition backups only.
A typical bplist command for a raw partition might be
bplist -r -s mm/dd/yy /raw/volume/name
Keyword Function
If backups are performed using a keyword within a backup class, it is even
quicker to identify backups. Backups performed using the standard orahot
and oracold scripts use this, and a backup can be identified using the keyword.
E.g. (the keyword is quoted to stop the shell expanding the asterisk)
bplist -R -keyword "dba_*" -l /

12.5.5

HA Environments
If you are running bplist in a HA set-up, be aware of how the client was backed
up: either by the shared IP address or by the physical IP
address. Once you know this, ensure that you are using the right client name,
which is specified with the -C client-name option.

Problems

If you do not get a result but you think you should, please do not hesitate to
contact NetBackup support.

12.5.6

Command line
The command line restore command, bprestore, is used when backed up or
archived files or directories are to be restored. If a directory is specified all
files backed up will be restored.

Caution:
Note that by default a restore will probably use the MPRN unless the client
name is the same as that used by NetBackup for the backup.
Warning: You will need to include the -t 13 parameter if the client is an NT box.
The default type is standard and suitable for UNIX clients only.
For example the syntax of the bprestore command to restore a list of files
and rename them is as follows:
bprestore -K -L log_file_name -R rename_file -C clientname -s
mm/dd/yy -e mm/dd/yy -f listfile
Where
-K means that existing files will not be overwritten.
-L is the location of the log file (these can be large and must be
managed).
-R is the file listing the renames to be done.
-C is the client server name. Please ensure you specify the same
name that NetBackup uses to backup the data, e.g. dyfin04-fe instead of
dyfin04.
-s is the start date.
-e is the end date.
-f is the list of files to be restored.
Another example of how to restore a specific file back to its original location
and over-write any existing file is as follows:
bprestore -L log_file_name -C clientname -s mm/dd/yy -e mm/dd/yy
fully_qualified_file_name
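The list file form of the command can be assembled from its parts so that -K (do not overwrite) is never left out. This is a minimal sketch: restore_cmd is a hypothetical helper name, the log and list file paths are illustrative, and the command is only printed for review, not run.

```shell
BPRESTORE=/usr/openv/netbackup/bin/bprestore

# restore_cmd CLIENT START END LISTFILE LOGFILE [CLASS_TYPE]  (hypothetical helper)
# Prints a bprestore command that will not overwrite existing files (-K).
restore_cmd() {
    cmd="$BPRESTORE -K -L $5 -C $1"
    if [ -n "${6:-}" ]; then
        cmd="$cmd -t $6"          # e.g. 13 for an NT client
    fi
    printf '%s -s %s -e %s -f %s\n' "$cmd" "$2" "$3" "$4"
}

# Client name as known to NetBackup, with illustrative dates and paths.
restore_cmd dyfin04-fe 07/01/02 07/05/02 /tmp/listfile /tmp/restore.log
```

Passing 13 as the optional sixth argument adds the -t 13 parameter needed for NT clients.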

Other options and parameters can be specified. The complete syntax is:
/usr/openv/netbackup/bin/bprestore [-A | -B] [-K] [-l | -H | -y][-r] [-T] [-L
progress_log] [-R rename_file] [-C client] [-D client]
[-S master_server] [-t class_type] [-c class] [-s mm/dd/yy
[hh:mm:ss]] [-e mm/dd/yy [hh:mm:ss]] [-w [hh:mm:ss]] [-k
"keyword_phrase"] -f listfile | filenames


12.6

Reduction of NetBackup drive usage to allow a restore to run

12.6.1

Introduction
The NetBackup infrastructure is in constant use. This means that the
resources required for a high priority restore might all be allocated and not
immediately available. This section outlines the procedure to
follow to ensure that sufficient resources are made available in a controlled
manner. This will cause the least impact on other customers as their backups
will not be cancelled but merely queued for the duration of the restore.

12.6.2

Identification
When a restore is submitted it has a higher priority than a backup job. It will
therefore take the first available drive and start running but due to the way
that NetBackup runs it will try to keep a cartridge loaded for as long as there
are backup jobs that can write to it. This is even in preference to other
backups that may have been queued for a long time. So a restore may be
queued because backup jobs are getting preference as it makes more efficient
use of the cartridge drives.
If there are no drives available then it is advisable to try and free one by
reducing the number of jobs running.

12.6.3

Action
Reducing the number of running jobs can be achieved by the following
method.
From the NetBackup administration GUI, open up the POLICIES icon, select
the ORACLE job class, make a note of the Max Jobs/class parameter and
reduce it to 4. This will have the effect of limiting the ORACLE backup to
one drive only. Carry out a similar process for the ORACLE_REDO policy.
Note: It may be necessary to reduce the maximum number of jobs even further if
there are limited cartridge drives available. If the restore is of a high
priority and it is deemed necessary, kill off any system/non-urgent backups from
the Activity Monitor.
This will gradually reduce the number of active jobs and release some drives.
Once the restore has completed you must re-instate the Max Jobs/class
parameters to their original values.


12.7

Additional information

12.7.1

Raw volumes
Raw volume restores are a bit more complicated. In general specific files
cannot be restored, only the complete raw volume; for this reason such restores
are rarely done. If a restore is required then CWBACKUP should be consulted.

13 Backups
It may be necessary on occasions to run test backups or manual backups as
part of problem resolution. Please follow the actions listed below.
Using the Netbackup GUI
On the main Netbackup screen you will see a list of policies; search for
small_test.
Highlight small_test, then bring down the Actions toolbar and select manual
backup, then the schedule and the specific client name.
Check on Job Monitor that the job becomes active.
If you receive an error when using the GUI, use bpadm (VT100):
#bpadm
select m> manual backups
select b> browse classes forward (keep going until small_test comes up)
then enter
i> initiate backup

14 NetBackup Logmon Error Messages


Please note: with status codes 49, 50, 51, 56, 57, 58, 59, 76, 84, 95, 164 and
205, the Omnibus Threshold Rule provides an extra filter. Such traps are
suppressed until over 20 traps have been issued in a twenty-minute period, at
which point a trap is then sent. Please see section 14.1 below for fuller
details.
The error codes are explained in more depth in section 15.


14.1

NetBackup Processes and Procmon Error Messages
There are six processes that need to be active on NetBackup servers, and
these consist of two NetBackup processes and four Media Manager processes
as follows:
NetBackup :bpdbm ; bprd
Media Manager :ltid ; avrd ; vmd ; tlmd (can be tlhcd e.g. some ICIP machines)
In the event of any of these processes dying, the following messages will be
issued using the following syntax:
process died : <processname>, formerly PID nnnn..
E.g. process died: bpdbm, formerly PID 19907
Please note that a single trap may list several processes, since the loss of ltid
will kill all other media manager processes, as in the following example:
15/05/00 12:55:04 cyvsis0a System Entity Failed: Processes died:
bpdbm, formerly PID 23097; ltid, formerly PID 23084; avrd, formerly
PID 23099; vmd, formerly PID 23090; tlmd, formerly PID 23098

The NetBackup procmon file has the checkstartup command included, which,
as its name implies, checks whether a process that has died restarts afterwards.
Depending on the result, one of the two following messages will be
issued:
OSMF_procmon: Process bpdbm has not started
SMM_processes: Process back up: bpdbm, new PID 9444

On receiving a NetBackup 'process died' or 'process not started' trap, Tinsley
Park Operations should wait for approximately five minutes to see whether a
'process back up' trap is received. Where no such trap appears, contact should
be made with CWHPOPSUX to advise them of a possible problem.
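While waiting for the 'process back up' trap, the six processes can also be checked by hand on the server. This is a minimal sketch: missing_daemons is a hypothetical helper name, and the REQUIRED list is taken from this section and should be overridden on machines where tlhcd runs in place of tlmd.

```shell
# Processes listed in this section; set REQUIRED beforehand to override.
REQUIRED=${REQUIRED:-"bpdbm bprd ltid avrd vmd tlmd"}

# missing_daemons  (hypothetical helper)
# Reads process names, one per line, on standard input and prints
# each required process that is not among them.
missing_daemons() {
    running=$(sed 's|.*/||')      # drop any leading path from each name
    for p in $REQUIRED; do
        # -x matches whole lines only, so "vmd" does not match "avrd".
        printf '%s\n' "$running" | grep -q -x "$p" || printf '%s\n' "$p"
    done
}

# Usage on the NetBackup server:
#   ps -e -o comm= | missing_daemons
```

No output means all of the listed processes were found; any names printed are the ones to quote when contacting CWHPOPSUX.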


15 NetBackup STATUS Exits - the big hitters


Here are some of the main status exits that are regularly output.

15.1 Exit Status 41 Network connection timed out


Raise a P3 incident record and progress fault resolution as follows:
This section aims to describe the process of dealing with these traps from a
Bridge perspective.
Some definitions.
Exit Status 41: "network connection timed out"
Exit Status 54: "timed out connecting to client"

These are similar faults but resolution normally differs. 41's are often due to
a failure to make a connection to a client. 54's usually occur after a
connection has been made but the client fails to respond. This document will
deal with these failures separately. There are a number of considerations
when dealing with this fault. For instance, is the machine showing as
unreachable on Omnibus and being dealt with via a problem or change
record? If so then the case can simply be transferred to CWBACKUP for
info. If not then a method for dealing with it follows.

Go to the Job Monitor window and check which Media Server the job was
using. Often a backup may fail when made from one server but not another
(e.g. a client may be successfully backed up from tpadsm01 but not from
tpadsm02). Before proceeding it is worth checking that other clients are not
experiencing problems. There have been a number of faults recently where
all backups to clients from the media server have failed with 41's but have
been okay from the master. Resolution has been to reboot the media server.

41's are often client specific problems but can also be due to network
congestion. With the latter, some of the backups for a client might fail
while others are okay. This is often a problem at Bletchley for clients using a
virtual private network (these have names like aspasea1-vpn) where the
backup is actually going over the public network and can fail occasionally
due to congestion.

Step one is to log into the relevant server and attempt to ping the client from
that server. I will use here the example of tpfinder1 and assume that there
have been backup failures from tpadsm01, tpadsm02 and tpadsm03.


The error report will indicate a failure for tpfinder1-fe.

Login to tpadsm01 and ping tpfinder1. Did it respond? Then ping tpfinder1-fe.
It is important that you ping the hostname as it is known by Netbackup as
this tests the specific network interface. Normally our machines will have a
separate network card which is connected to the dedicated Netbackup private
network. This ensures that connections can be made by users to a machine
on its public network interface without degradation due to the very large amounts
of data which backups often involve. Another issue is that one should always
use the hostname and not the IP address as this tests the name resolution.
When telnetting between the Master/Media Server and the client, run the
who am i command on both to ensure the hostnames reflect the /etc/hosts
and bp.conf entries.

If tpfinder1 pings okay but tpfinder1-fe did not then the problem lies
somewhere on the private network (which includes the network card and
networking software on the client).
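The two ping results can be turned into a next step as follows. This is a minimal sketch: diagnose_ping is a hypothetical helper name, only the first outcome is stated in this procedure, and the other two messages are assumptions about the sensible follow-up.

```shell
# diagnose_ping PUBLIC_STATUS BACKUP_STATUS  (hypothetical helper)
# Each argument is the exit status of a ping: 0 = replied, non-zero = no reply.
diagnose_ping() {
    if [ "$1" -eq 0 ] && [ "$2" -ne 0 ]; then
        echo "public name replies, backup name does not: suspect the private network"
    elif [ "$1" -eq 0 ]; then
        # Assumption: with both interfaces reachable, move on to the config checks.
        echo "both names reply: go on to check /etc/hosts and bp.conf"
    else
        # Assumption: an unreachable public name suggests a wider problem.
        echo "public name does not reply: check whether the client is up"
    fi
}

# Usage on the media server (ping options vary by platform; on systems where
# ping runs continuously, limit it to a single probe):
#   ping tpfinder1;    pub=$?
#   ping tpfinder1-fe; fe=$?
#   diagnose_ping "$pub" "$fe"
```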

Log onto the client and check that the network is up using 'ifconfig' (for
vendor specific instructions, see the OA Network Diagnostic procedures). If
the relevant interface is down then try bouncing the interface. See also
the instructions for 'netstat' in the OA procedures.
More often than not the fault is caused by the NetBackup server being unable to
communicate with the client.

Check that you can ping / telnet to the backup interface on the client. If it is
not contactable, forward the case to TSS for progression.

If you can telnet to the backup interface, log on and check the entries in
/etc/hosts for the NetBackup server IP addresses (including media servers)
and the client's own addresses (including the name as registered / known
within NetBackup on the server).

Check /usr/openv/netbackup/bp.conf to ensure the entries are correct.

Ensure you are connected via the correct network (i.e. type who am i in
order to identify the connection you have been routed via).
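The /etc/hosts check above can be scripted so that commented-out entries and partial name matches are not mistaken for valid entries. This is a minimal sketch: host_in_file is a hypothetical helper name, and the host names in the usage lines are the examples used in this section.

```shell
# host_in_file NAME FILE  (hypothetical helper)
# Returns 0 if NAME appears as a hostname or alias on a non-comment
# line of a hosts-format FILE, otherwise 1.
host_in_file() {
    awk -v name="$1" '
        { sub(/#.*/, "") }        # strip comments; fields are re-split
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }   # field 1 is the IP address
        END { exit(found ? 0 : 1) }
    ' "$2"
}

# Usage on the client:
#   host_in_file tpfinder1-fe /etc/hosts || echo "tpfinder1-fe missing from /etc/hosts"
#   host_in_file tpadsm01 /etc/hosts     || echo "tpadsm01 missing from /etc/hosts"
```

The check only confirms that a name is present; the IP address against it still has to be compared with the correct address by eye.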

If the config checks out ok, switch the multistreaming option ON within the
NetBackup class in order to see if the errors are occurring under one
particular mount point. If so, there are several possible causes. It may be
that the Unix box was connected to an NFS (Network File System) mount that has
since been removed, so that the session hangs when listing the mount points.
Alternatively, there may be a problem with some corrupt data files, or
it may just be the number of files that NetBackup has been asked to back up.
On some problematic clients, we are backing up in excess of 1 million files.
(Note: that is the number of files, NOT bytes of data.)


15.2

Exit Status 1 Backup was partially successful


No need to raise an incident record as the backup is deemed successful.
Mainly caused by files being ignored as a result of hot backups, i.e. where
databases are active during the backup and files are removed before
NetBackup backs them up. This occurs mainly on NT clients. Check the
problems report from the bpadm panels on the master server to identify the
offending files and inform TSS.

15.3

Exit Status 52 Timed out waiting for Media Manager to mount volume
Raise a P3 incident record and progress fault resolution as follows:
Caused by a library failing to satisfy a request for a tape mount before the
timeout expired. If all of the drives are down and NetBackup is still actively
processing backup requests, then eventually they will time out, hence the
status 52. Firstly, check that the library is up. It is also important to check
whether all the drives are in use by other jobs, which can also cause this
error. Resolution can be a matter of ensuring that there are not a large number of
concurrent jobs (particularly with differing retention levels). If the problem
is persistent, check with the DBAs, ASGs or TSS, depending on the jobs affected.

15.4 Exit Status 71 - None of the files in the file list exist

Raise a P3 incident record and progress fault resolution as follows:
Some or all of the mount points defined within the NetBackup backup
class do not actually exist on the client when the backup runs. A common
cause is where clients are clustered and therefore share mount points, e.g.
the OAWS (Darwin) clients that have master and slave boxes. They utilise
RAID filesystems, which reside on only one of the boxes, depending on where the
live service is. To ensure that both boxes are covered, we may need
to introduce a floating IP address method of clustering the boxes. It
would be worth contacting TSS to discuss this.

Other causes of this error could be incorrect permissions on the files or
directories being backed up.
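
As a rough illustration of the check described above, the following sketch reports which entries of a file list are missing on the host where it is run. It is not part of NetBackup; the function names are hypothetical, and the file list would have to be copied by hand from the backup class. Note that a path that cannot be examined because of permissions is also reported as missing, which matches the second cause mentioned above.

```python
import os


def missing_paths(file_list):
    """Return the entries of file_list that do not exist on this host."""
    return [path for path in file_list if not os.path.exists(path)]


def none_exist(file_list):
    """True when the list is non-empty and every entry is missing."""
    return bool(file_list) and len(missing_paths(file_list)) == len(file_list)
```

On a clustered pair, running this on both boxes with the same file list would show which box currently holds the shared mount points.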

15.5 Exit Status 219 - The required storage unit is not available

Raise a P3 incident record and progress fault resolution as follows. If the
problem is NOT resolved within 1 hour, please upgrade to P2.

This is usually caused when there are no drives available for backups. This can
mean one of two things:

* All attached drives are down.
* All drives are up, but have been in use by other backups for a long period of time.

Check the drive status (if the error occurred overnight, check the drive usage
information on the NetBackup interactive reporting page), following the
normal down drives procedure. If you find that the problem is persistently
being caused by workload, then liaise with whoever is responsible for the backups
(normally the Oracle DBAs) and advise them to reduce the number of concurrent
jobs.
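
The two cases above can be told apart with a simple rule. The sketch below assumes the drive states have already been collected by hand into a dictionary; the state labels DOWN, BUSY and IDLE and the function name are illustrative only and are not NetBackup output.

```python
def diagnose_no_drives(drive_states):
    """Classify why no drive is available.

    drive_states maps a drive name to 'DOWN', 'BUSY' or 'IDLE'
    (labels assumed for this sketch).
    """
    states = list(drive_states.values())
    if all(state == "DOWN" for state in states):
        # Also covers an empty mapping: no usable drive at all.
        return "ALL_DOWN"
    if "IDLE" not in states:
        # Every drive that is up is in use by another job.
        return "ALL_IN_USE"
    return "DRIVE_AVAILABLE"
```

ALL_DOWN points to the down drives procedure; ALL_IN_USE points to workload and a conversation with the backup owners.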

15.6 Exit Status 54 - Timed out connecting to client

Raise a P3 incident record and progress fault resolution as follows:
The problem occurs when a backup is invoked but the connection from the
master server to the client fails because it times out.
This particular error is not normally caused by network problems; it is usually
linked with NetBackup configuration anomalies. More specifically, ensure that the
entries for the client and server in /etc/hosts are correct on both boxes. This
often occurs when the backups use a media server and the host
entries for it are incorrect. Follow the Network problem resolution document
for more details on checking this.
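
A minimal sketch of the /etc/hosts comparison described above, assuming the contents of the two hosts files have been copied into strings. The function names and the example host names are hypothetical.

```python
def hosts_entries(hosts_text):
    """Map each host name or alias in /etc/hosts-style text to its IP address."""
    entries = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            entries.setdefault(name, ip)  # keep the first entry for a name
    return entries


def hosts_mismatches(client_hosts, server_hosts, names):
    """Return the names that are missing on the client or differ between boxes."""
    client = hosts_entries(client_hosts)
    server = hosts_entries(server_hosts)
    return [n for n in names
            if client.get(n) is None or client.get(n) != server.get(n)]
```

For example, calling hosts_mismatches with the names of the client, the master server and any media server involved returns the names whose entries need correcting.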

15.7 Exit Status 84 - Media write error

Raise a P3 incident record and progress fault resolution as follows:
The main cause of this error is incompatible media. Currently, we use LTO
and 3590 tapes. There are two types of 3590 tape: 3590B and 3590E.
3590B tapes are compatible with both the normal density drives and the
double density drives, whereas 3590E tapes are compatible only with the
double density (extended) drives. When a 3590E tape is mounted in a
3590B drive, the backup fails with the media write error as soon as it
starts writing data.
This can occur where different media servers use different drive /
tape types but the same volume pool. Alternatively, it is not uncommon for
someone to unknowingly assign 3590E tapes to a volume pool used by
3590B drives. If this is the case, then the assigned tapes must be re-assigned
elsewhere ASAP.
If a genuinely faulty tape causes the problem, then you must follow the
procedure for dealing with faulty tapes.
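
The compatibility rule above can be written down as a small lookup. This is only a sketch: the table covers the 3590 types described in this section (LTO is not modelled), and the function names and tape labels are hypothetical.

```python
# Drive types that can write each media type, as described in this section.
COMPATIBLE_DRIVES = {
    "3590B": {"3590B", "3590E"},  # normal media works in both drive types
    "3590E": {"3590E"},           # extended media needs an extended drive
}


def can_write(media_type, drive_type):
    """True if a tape of media_type can be written in a drive of drive_type."""
    return drive_type in COMPATIBLE_DRIVES.get(media_type, set())


def misassigned(pool_media, pool_drive_type):
    """Return the tape labels in a pool that its drives cannot write.

    pool_media maps a tape label to its media type.
    """
    return sorted(label for label, media_type in pool_media.items()
                  if not can_write(media_type, pool_drive_type))
```

Any labels returned by misassigned are the tapes that would need to be re-assigned to another volume pool.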

15.8 Exit Status 57 - Client connection refused

Raise a P3 incident record and progress fault resolution as follows:
This is normally caused by the NetBackup client code being removed from the client
box, which often happens as a result of an upgrade to the operating system
(Unix / Windows). Unless we are informed of such changes via Change
Control, these errors will persist. Locate the TSS contact for the box and
request that a change record be raised for a reinstall.
To check that the NetBackup client code is resident on a client, log on to it
and see if you can access the /usr/openv/netbackup directory. If not, then the
code has been removed.

The error can also be caused by the bpcd process not responding on the client
box. bpcd is required in order to allow access from the NetBackup server to
the client over TCP/IP.
The way to check this is to follow the telnet hostname bpcd instructions
found in the supporting document.
If an error is given when trying this, do the following on the client box to
ensure that bpcd is configured correctly.
Type:
grep bpcd /etc/services
The following line should appear:
bpcd 13782/tcp bpcd
This indicates that port 13782 is reserved for NetBackup connections. The port
is the same across all NetBackup configurations (servers and clients).
Do the same for the request daemon:
grep bprd /etc/services
which should return the following entry:
bprd 13720/tcp bprd

There must also be an entry for bpcd in /etc/inetd.conf. Type:
grep bpcd /etc/inetd.conf
which should return the following:
bpcd stream tcp nowait root /usr/openv/netbackup/bin/bpcd bpcd
On system IPL (boot), inetd reads the /etc/inetd.conf file and listens for each
service listed there on the port that /etc/services assigns to it.
On NT boxes, check for the NetBackup services under Control Panel, Services
and ensure that each is Started and set to Automatic.
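
The two grep checks on /etc/services can be combined into one pass. The sketch below assumes the contents of the file have been read into a string; the function names are hypothetical, and the expected ports are the ones quoted in this section.

```python
# Service entries this section expects to find in /etc/services.
EXPECTED_PORTS = {"bpcd": "13782/tcp", "bprd": "13720/tcp"}


def service_ports(services_text):
    """Map each service name to the set of port/protocol values defined for it."""
    ports = {}
    for line in services_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        if len(fields) >= 2:
            ports.setdefault(fields[0], set()).add(fields[1])
    return ports


def missing_services(services_text):
    """Return the expected NetBackup services that are absent or on the wrong port."""
    ports = service_ports(services_text)
    return sorted(name for name, port in EXPECTED_PORTS.items()
                  if port not in ports.get(name, set()))
```

An empty result means both entries are present; it does not prove that inetd is actually listening, so the telnet check is still needed.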

15.9 Exit Status 96 - Unable to allocate new media for backup, storage unit has none available

Raise a P3 incident record and progress fault resolution as follows. If the
problem is NOT resolved within 1 hour, please upgrade to P2.
This error has numerous causes. The obvious one is that there are no scratch
tapes in the specified volume pool; in this instance the scratch levels must be
topped up ASAP. Also be wary of the expiration date set within the
NetBackup GUI, which can be set on tapes that are not time assigned (i.e.
scratch). NetBackup will ignore these tapes until they are manually un-expired.
This method of expiration is used to stop tapes from being used even if they
are scratch, so do not un-expire tapes that have been expired for a particular
reason. If there are expired tapes that are available for use, simply use the
GUI to un-expire them.

Another cause of these errors is a tape being stuck in a drive. If
a job is allocated a drive that already has a tape stuck in it,
NetBackup will appear to attempt to mount all of the available
scratch tapes, as it cannot identify whether a tape is stuck.
It will send a mount request to the robot regardless of what is
already in the drive. You will therefore need to identify the drive with the
stuck tape and remove the tape. Check the NetBackup Activity Monitor to see
whether the job is trying unsuccessfully to mount numerous tapes.

15.10 Exit Status 131 - Client is not validated to use the server

Raise a P3 incident record and progress fault resolution as follows:
The client name, as determined from the connection to the server, did not
match any client name in the NetBackup configuration and there was no
altnames configuration for this client on the master server. A client and server
that have multiple network connections can encounter this problem if the
name by which the client is configured is not the one by which its routing
tables direct connections to the server.
Try the following.
1. Examine the NetBackup Problems report.
2. Create a debug log directory for bprd and retry the operation. Check the
resulting debug log to determine the connection and client names.
Depending on the request type (restore, backup, and so on), you may need
or want to:
* Change the client's configured name.
* Modify the routing tables on the client.
* On the master server, set up an altnames directory and file for this client
(see the NetBackup System Administrator's Guide for UNIX).
or
* On a UNIX master server, create a soft link in the NetBackup image
catalog.
3. Review Verifying Host Names and Services Entries in the Troubleshooting
Guide.
Reconfiguration
1. Ascertain an alternate connection (usually over the management LAN).
All Media servers (including the Master) should have connectivity to the
client (if this is not possible then the backups will have to be restricted to
certain NetBackup Media Servers only).
2. Make copies of the client's /usr/openv/netbackup/bp.conf and /etc/hosts
file.
3. Update the client's /usr/openv/netbackup/bp.conf so that CLIENT_NAME
reflects the 'new' name.
4. Modify the client's /etc/hosts file to reflect the client name.
5. Modify the policies in NetBackup to include the 'new' client name.
This usually includes ORACLE, ORACLE_REDO and any other
OS/Application backup that is defined.
6. Make copies of /etc/hosts on all the NetBackup Media Servers that the
client can use.
7. Ensure that the /etc/hosts file on each NetBackup Media Server reflects the
'new' client name.
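
After step 3, the CLIENT_NAME entry in bp.conf can be checked with a short script. This sketch assumes the usual KEY = value layout of bp.conf and that the file contents have been read into a string; the function names and example host names are hypothetical.

```python
def bp_conf_value(bp_conf_text, key):
    """Return the first value set for key in bp.conf-style text, or None."""
    for line in bp_conf_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


def client_name_matches(bp_conf_text, expected_name):
    """True if CLIENT_NAME in bp.conf equals the name used in the policies."""
    return bp_conf_value(bp_conf_text, "CLIENT_NAME") == expected_name
```

The expected name passed in should be the 'new' client name that is also used in /etc/hosts and in the NetBackup policies.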
Checks
1. Test that both user and scheduled backups work as expected.
2. Test that a restore works (this test is only possible if resources on the
NetBackup Master/Media are free).
3. Test to ensure /usr/openv/netbackup/bin/bplist works. This should be
tested for both 'root' user and 'oracle' user. Expected returns:
a. a file list of what has been backed up.
b. 'no entity found' - this will be returned if the user does not have the
correct userid/permissions to interrogate the files e.g. a userid of 'storage'
cannot interrogate files belonging to a userid of 'oracle'.
c. If a NetBackup error code is returned, then that error requires
investigating and resolving.
Errors Returned
Usually, the 'renaming' of a client will not cause any issues with backups,
restores or bplist commands. Errors will only occur in unusual
setups/configurations (e.g. if static routing is being used).
The most likely error is Status Code: 131 - client is not validated to use the
server (see the NetBackup Troubleshooting Guide entry above). There are two
ways of resolving this issue (the recommended resolution is detailed):
1. Run bplist stating the client name in the command, i.e. bplist -C
clientname /
2. Create an entry in the /usr/openv/netbackup/db/altnames directory (on the
NetBackup Master Server) to allow the 'old' client name to access data
backed up via the 'new' client name.
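
The incident priorities given in sections 15.2 to 15.10 can be summarised as a lookup. This is only a sketch of the rules stated in those sections; the table layout and function name are hypothetical, and "not resolved within 1 hour" is assumed to mean one hour or more since the incident was raised.

```python
# status: (initial priority, upgrade to P2 if unresolved after 1 hour)
TRIAGE = {
    1: (None, False),    # partially successful: no incident record needed
    52: ("P3", False),
    71: ("P3", False),
    219: ("P3", True),
    54: ("P3", False),
    84: ("P3", False),
    57: ("P3", False),
    96: ("P3", True),
    131: ("P3", False),
}


def incident_priority(status, hours_unresolved=0.0):
    """Return the incident priority for an exit status, or None if no record is needed.

    Raises KeyError for a status that is not covered in this section.
    """
    initial, escalates = TRIAGE[status]
    if initial is None:
        return None
    if escalates and hours_unresolved >= 1:
        return "P2"
    return initial
```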

15.11 Application Resource Alert

Traps will be issued for these alerts. The following alerts are typical:

Scratch level below minimum: 15% in Netbackup - Raise P3 I/R and pass to
CWHWOPSUX.

Application Resource Alert: Scratch level below minimum : 5% in
Netbackup - Raise I/R as a priority 2 and pass to the Bridge class as
listed below. If the problem is out of hours you will need to invoke
callout.

Duplicate Tape(s) - Raise P3 I/R and pass to CWBACKUP.
Each tape should be examined on the servers that have it defined. The server
that has most recently written to the tape should be allowed to keep the
definition. On each of the remaining servers, all backup images that reside on
that tape should be manually expired, and the tape should be completely deleted
from all NetBackup databases on that server.

Netbackup Warning: Server-name has not processed any backups for 15
minutes - Investigation Required - Raise P3 I/R and pass to
CWHPOPSUX in the first instance.
Check for current activity and identify any possible problems. If further
action is required, pass the Bridge record to CWBACKUP for further
investigation.

Application Resource Alert: NetBackup Drive Warning: server-name Drive n
Has Remained Down - Investigation Required. Raise P3 I/R and
pass to CWHPOPSUX in the first instance.
Follow the actions indicated for exit status 219 and section 8, Drive
Testing.

Application Resource Alert: NetBackup Drive Warning: <servername> Drives
are in AVR mode - Investigation required <date and time>

Application Entity Failed: NetBackup VMD Warning: <servername> Cannot
connect to vmd daemon - Investigation required <date and time>
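
The handling of the two scratch-level alerts in this section can be sketched as follows. It assumes that 15% and 5% are the levels below which each trap fires, which the alert text suggests but does not state outright; the function name is hypothetical.

```python
def scratch_alert_action(scratch_percent, out_of_hours=False):
    """Return (priority, invoke_callout) for a scratch level, or None if no alert applies.

    Assumes the traps fire below 15% (P3) and below 5% (P2).
    """
    if scratch_percent < 5:
        # Priority 2; callout is needed only when the problem is out of hours.
        return ("P2", out_of_hours)
    if scratch_percent < 15:
        return ("P3", False)
    return None
```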

15.12 Slow throughput of backups

Check the backup job throughput via the Activity Monitor on the admin GUI
or by running bpdbjobs | grep Active on the Master server. If all client
backups are slow, then this is likely to be either a network problem or a
network interface problem on the Master/Media server running the backups.
If the slow throughput is on one client only, then the problem is probably at
the client end, with the backup interface card set to half-duplex. Pass
the problem record to the relevant Service Delivery team for them to
check/amend the card settings.
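
The rule of thumb above can be sketched as a small function. It assumes the per-client throughput figures have been noted by hand from the Activity Monitor; the threshold for "slow", the units and the function name are assumptions made for this example.

```python
def diagnose_slow_backups(throughput_kb_s, slow_below_kb_s):
    """Classify slow backups as a server/network problem or a client problem.

    throughput_kb_s maps a client name to its observed throughput;
    slow_below_kb_s is a site-chosen threshold (hypothetical).
    """
    slow = sorted(client for client, rate in throughput_kb_s.items()
                  if rate < slow_below_kb_s)
    if not slow:
        return ("OK", [])
    if len(slow) == len(throughput_kb_s) and len(slow) > 1:
        # Every client is slow: suspect the network or the server interface.
        return ("NETWORK_OR_SERVER", slow)
    # Only some clients are slow: suspect the client end.
    return ("CLIENT", slow)
```

A CLIENT result lists the clients whose problem records should go to the relevant Service Delivery team.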

16 Netbackup Clients
16.1 Netbackup Documentation
http://dataintegrity.intra.bt.com/

16.2 Troubleshooting Guide
Refer to NetBackup manuals

16.3 NetBackup reporting

16.4 Supportal
_Coll=;

17 APPENDICES

END OF DOCUMENT
