2009
INDUSTRIAL FACILITY SAFETY
Notice
This report was prepared as an account of work sponsored by Risiko Technik Gruppe (RTG). Reference
herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or
otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the
Risiko Technik Gruppe, any agency thereof, or any of their contractors or subcontractors.
TERMINOLOGY
AIChE
American Institute of Chemical Engineers.
BPCS
Basic Process Control System.
CCPS
Center for Chemical Process Safety.
CDF
cumulative distribution function.
DCS
Distributed Control System.
ESD
Emergency shut-down.
ETA
Event Tree Analysis.
FME(C)A
Failure Mode Effect (and Criticality) Analysis.
FMEDA
Failure Mode Effect and Diagnostics Analysis.
FTA
Fault Tree Analysis.
Hazardous Event
Hazardous situation which results in harm.
HAZOP
Hazard and Operability study.
HFT
Hardware fault tolerance.
IEC EN 61508
Functional safety of electrical / electronic / programmable electronic safety-related systems.
IEC EN 61511
Functional safety – safety instrumented systems for the process industry sector.
IPL
Independent Protection Layer.
ISA
The Instrumentation, Systems, and Automation Society.
LOPA
Layer of Protection Analysis.
MTBF
Mean time between failures.
PDF
Probability density function.
PFD
Probability of failure on demand.
PFH
Probability of dangerous failure per hour.
PHA
Process Hazard Analysis.
PLC
Programmable Logic Controller.
SFF
Safe failure fraction.
SIF
Safety instrumented function.
SIL
Safety integrity level.
SIS
Safety instrumented system.
SLC
Safety life cycle.
Safety
The freedom from unacceptable risk of physical injury or of damage to the health of persons, either directly
or indirectly, as a result of damage to property or the environment.
Safety Function
Function to be implemented by an E/E/PE safety-related system, other technology safety-related system or
external risk reduction facilities, which is intended to achieve or maintain a safe state for the EUC, in respect
of a specific hazardous event.
CONCEPTION, DESIGN, AND IMPLEMENTATION
Tolerable Risk
Risk which is accepted in a given context based on the current values of society.
CONTENT
Preface
Safer Design and Chemical Plant Safety
Introduction to Risk Management
Risk Management
Hazard Mitigation
Inherently Safer Design and Chemical Plant Safety
Inherently Safer Design and the Chemical Industry
Control Systems Engineering Design Criteria
Codes and Standards
Control Systems Design Criteria Example
Risk Acceptance Criteria and Risk Judgment Tools
Chronology of Risk Judgment Implementation
Conclusions
References
Safety Integrity Level (SIL)
Background
What Are Safety Integrity Levels (SIL)?
Safety Life Cycle
Risks and Their Reduction
Safety Integrity Level Fundamentals
Probability of Failure
The System Structure
How to Read a Safety Integrity Level (SIL) Product Report?
Safety Integrity Level Formulae
Methods of Determining Safety Integrity Level Requirements
Definitions of Safety Integrity Levels
Risk Graph Methods
Layer of Protection Analysis (LOPA)
After-the-Event Protection
Conclusions
Safety Integrity Levels Versus Reliability
Determining Safety Integrity Level Values
Reliability Numbers: What Do They Mean?
The Cost of Reliability
References
Layer of Protection Analysis (LOPA)
Introduction
Layer of Protection Analysis (LOPA) Principles
Implementing Layer of Protection Analysis (LOPA)
Layer of Protection Analysis (LOPA) Example for Impact Event I
Layer of Protection Analysis (LOPA) Example for Impact Event II
Integrating Hazard and Operability Analysis (HAZOP), Safety Integrity Level (SIL), and Layer of Protection Analysis (LOPA)
Methodology
Safety Integrity Level (SIL) and Layer of Protection Analysis (LOPA) Assessment
The Integrated Hazard and Operability (HAZOP) and Safety Integrity Level (SIL) Process
Conclusion
Modifying Layer of Protection Analysis (LOPA) for Improved Performance
Changes to the Initiating Events
Changes to the Independent Protection Layers (IPL) Credits
Changes to the Severity
Changes to the Risk Tolerance
Changes in Instrument Assessment
References
PREFACE
This document explores some of the issues arising from the recently published international standards for
safety systems, particularly within the process industries, and their impact upon the specifications for signal
interface equipment. When considering safety in the process industries, there are a number of relevant
national, industry and company safety standards – IEC EN 61511, ISA S84.01 (USA), IEC EN 61508 (product
manufacturer) – which need to be implemented by the process owners and operators, alongside all the
relevant health, energy, waste, machinery and other directives that may apply. These standards, which
include terms and concepts that are well known to the specialists in the safety industry, may be unfamiliar to
the general user in the process industries. In order to interact with others involved in safety assessments
and to implement safety systems within the plant it is necessary to grasp the terminology of these
documents and become familiar with the concepts involved. Thus the safety life cycle, risk of accident, safe
failure fraction, probability of failure on demand, safety integrity level and other terms need to be
understood and used in their appropriate context. It is not the intention of this document to explain all the
technicalities or implications of the standards but rather to provide an overview of the issues covered therein
to assist the general understanding of those who may be:
(1) Involved in the definition or design of equipment with safety implications;
(2) Supplying equipment for use in a safety application;
(3) Just wondering what BS IEC EN 61508 is all about.
The concept of the safety life cycle introduces a structured framework for risk analysis, for the
implementation of safety systems, and for the operation of a safe process. If safety systems are employed in
order to reduce risks to a tolerable level, then these safety systems must exhibit a specified safety integrity
level. The calculation of the safety integrity level for a safety system embraces the factors “safe failure
fraction” and “failure probability of the safety function”. The total amount of risk reduction can then be
determined and the need for more risk reduction analysed. If additional risk reduction is required and if it is
to be provided in the form of a safety instrumented function (SIF), the layer of protection analysis (LOPA)
methodology allows the determination of the appropriate safety integrity level (SIL) for the safety
instrumented function.
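The link between the probability of failure on demand and the safety integrity level can be made concrete with the low-demand-mode bands of IEC EN 61508. The band table below is standard; the function name and the illustrative numbers are our own sketch, not taken from the text:

```python
def sil_for_pfd(pfd_avg):
    """Return the safety integrity level achieved by a given average
    probability of failure on demand (low-demand mode, IEC EN 61508)."""
    bands = [
        (1e-5, 1e-4, 4),   # SIL 4: 1e-5 <= PFDavg < 1e-4
        (1e-4, 1e-3, 3),   # SIL 3: 1e-4 <= PFDavg < 1e-3
        (1e-3, 1e-2, 2),   # SIL 2: 1e-3 <= PFDavg < 1e-2
        (1e-2, 1e-1, 1),   # SIL 1: 1e-2 <= PFDavg < 1e-1
    ]
    for low, high, sil in bands:
        if low <= pfd_avg < high:
            return sil
    return None  # outside the tabulated SIL bands

print(sil_for_pfd(2.5e-3))  # -> 2
print(1 / 2.5e-3)           # risk reduction factor -> 400.0
```

The risk reduction factor is simply the reciprocal of PFDavg, so a safety function with a PFDavg of 2.5 × 10⁻³ provides a risk reduction factor of 400 and falls in the SIL 2 band.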
Governing Specifications
There exist several specifications dealing with safety and reliability. Safety integrity level values are specified
in both ISA SP84.01 and IEC 61508. IEC 61511 is the process-industry-specific safety standard: based on
IEC 61508, it is titled «Functional Safety of Safety Instrumented Systems for the Process Industry Sector».
IEC 61511 Part 3 is
informative and provides guidance for the determination of safety integrity levels. Annex F illustrates the
general principles involved in the layer of protection analysis (LOPA) method and provides a number of
references to more detailed information on the methodology.
CHAPTER 1
(7) Increased team motivation, by encouraging people to think creatively about ways to work better,
simpler, faster, more effectively, etc.
(8) Improved chances of project success, because opportunities are identified and captured, producing
benefits for the project that might otherwise have been overlooked.
Having discussed what a risk is – “any uncertainty that, if it occurs, would have a positive or negative effect
on achievement of one or more objectives” – it is also important to clarify what risk is not. Effective risk
management must focus on risks and not be distracted by other related issues. A number of other elements
are often confused with risks but must be treated separately, such as:
(1) Issues – This term can be used in several different ways. Sometimes it refers to matters of concern that
are insufficiently defined or characterized to be treated as risks. In this case an issue is more vague than
a risk, and may describe an area (such as requirement volatility, or resource availability, or weather
conditions) from which specific risks might arise. The term issue is also used (particularly in the United
Kingdom) as something that has occurred but cannot be addressed by the project manager without
escalation. In this sense an issue may be the result of a risk that has happened, and is usually negative.
(2) Problems – A problem is a risk whose time has come. Unlike a risk, which is a potential future event,
there is no uncertainty about a problem: it exists now and must be addressed immediately. Problems can
be distinguished from issues because issues require escalation, whereas problems can be addressed by
the project manager within the project.
(3) Causes – Many people confuse causes of risk with risks themselves. The cause, however, describes
existing conditions that might give rise to risks. For example, there is no uncertainty about the
statement “We have never done a project like this before”, so it cannot be a risk. But this statement
could result in a number of risks that must be identified and managed.
(4) Effects – Similar confusion exists about effects, which in fact only occur as the result of risks that have
happened. To say, “The project might be late”, does not describe a risk, but what would happen if one
or more risks occurred. The effect might arise in the future, i.e. it is not a current problem, but its
existence depends on whether the related risk occurs.
RISK MANAGEMENT
The widespread occurrence of risk in life and human activities, business, and projects has encouraged
proactive attempts to manage risk and its effects. History as far back as Noah’s Ark, the pyramids of Egypt,
and the Herodian Temple shows evidence of planning techniques that include contingency for unforeseen
events. Modern concepts of probability arose in the 17th century from pioneering work by Pascal and his
contemporaries, leading to an improved understanding of the nature of risk and a more structured approach
to its management. Without covering the historical application of risk management in detail here, clearly
those responsible for major projects have always recognized the potentially disruptive influence of
uncertainty, and they have sought to minimize its effect on achievement of project objectives. Recently, risk
management has become an accepted part of project management, included as one of the key knowledge
areas in the various bodies of project management knowledge and as one of the expected competencies of
project management practitioners. Unfortunately, embedding risk management within project management
leads some to consider it as “just another project management technique”, with the implication that its use
is optional, and appropriate only for large, complex, or innovative projects. Others view risk management as
the latest transient management fad. These attitudes often result in risk management being applied without
full commitment or attention, and are at least partly responsible for the failure of risk management to deliver
the promised benefits. To be fully effective, risk management must be closely integrated into the overall
project management process. It must not be seen as optional, or applied sporadically only on particular
projects. Risk management must be built in not bolted on if it is to assist organizations in achieving their
objectives. Built-in risk management has two key characteristics:
(1) First, project and activity management decisions are made with an understanding of the risks involved.
This understanding includes the full range of management activities, such as scope definition, pricing
and budgeting, value management, scheduling, resourcing, cost estimating, quality management,
change control, post-project review, etc. These must take full account of the risks affecting the different
assets, giving the project a risk-based plan with the best likelihood of being met.
(2) Secondly, the risk management process must be integrated with other management processes. Not only
must these processes use risk data, but there should also be a seamless interface across process
boundaries. This has implications for the project toolset and infrastructure, as well as for project
procedures.
HAZARD MITIGATION
Hazard mitigation is “any action taken to reduce or eliminate the long-term risk to human life, property, and
assets from natural or non-natural hazards”. In the state of California (United States) this definition
has been expanded to include both natural and man-made hazards. We understand that hazard events will
continue to occur, and at their worst can result in death and destruction of property and infrastructure. The
work done to minimize the impact of hazard events to life and property is called hazard mitigation. Often,
these damaging events occur in the same locations over time (e.g. flooding along rivers), and cause repeated
damage. Because of this, hazard mitigation is often focused on reducing repetitive loss, thereby breaking the
disaster or hazard cycle. The essential steps of hazard mitigation are:
(1) Hazard Identification – First we must discover the location, potential extent, and expected severity of
hazards. Hazard information is often presented in the form of a map or as digital data that can be used
for further analysis. It is important to remember that many hazards are not easily identified, for
example, many earthquake faults lie hidden below the earth’s surface.
(2) Vulnerability Analysis – Once hazards have been identified, the next step is to determine who and what
would be at risk if the hazard event occurs. Natural events such as earthquakes, floods, and fires are
only called disasters when there is loss of life or destruction of property.
(3) Defining a Hazard Mitigation Strategy – Once we know where the hazards are, and who or what could
be affected by an event, we have to strategize about what to do to prevent a disaster from occurring or
to minimize the effects if it does occur. The end result should be a hazard mitigation plan that identifies
long-term strategies that may include planning, policy changes, programs, projects and other activities,
as well as how to implement them. Hazard mitigation plans should be developed at every level, including
individuals, businesses, and local, state, and federal governments.
(4) Implementation of hazard mitigation activities – Once the Hazard Mitigation plans and strategies are
developed, they must be followed for any change in the disaster cycle to occur.
The Center for Chemical Process Safety was formed by the American Institute of Chemical Engineers (AIChE)
in 1985 as the chemical engineering profession’s response to the Bhopal, India chemical release tragedy. In
the past 21 years, the Center for Chemical Process Safety (CCPS) has defined the basic practices of process
safety and supplemented this with a wide range of technologies, tools, guidelines, and informational texts
and conferences. What is inherently safer design? Inherently safer design is a philosophy for the design and
operation of chemical plants, and the philosophy is actually generally applicable to any technology.
Inherently safer design is not a specific technology or set of tools and activities at this point in its
development. It continues to evolve, and specific tools and techniques for application of inherently safer
design are in early stages of development. Current books and other literature on inherently safer design
describe a design philosophy and give examples of implementation, but do not describe a methodology.
What do we mean by inherently safer design? One dictionary definition of “inherent” which fits the concept
very well is “existing in something as a permanent and inseparable element”. This means that safety
features are built into the process, not added on. Hazards are eliminated or significantly reduced rather
than controlled and managed. The means by which the hazards are eliminated or reduced are so
fundamental to the design of the process that they cannot be changed or defeated without changing the
process. In many cases this will result in simpler and cheaper plants, because the extensive safety systems
which may be required to control all major hazards will introduce cost and complexity to a plant. The cost
includes both the initial investment for safety equipment, and also the ongoing operating cost for
maintenance and operation of safety systems through the life of the plant. Chemical process safety
strategies can be grouped in four categories:
(1) Inherent – As described in the previous paragraphs (for example, replacement of an oil based paint in a
combustible solvent with a latex paint in a water carrier).
(2) Passive – Safety features which do not require action by any device, they perform their intended
function simply because they exist (for example, a blast resistant concrete bunker for an explosives
plant).
(3) Active – Safety shutdown systems to prevent accidents (for example, a high pressure switch which shuts
down a reactor) or to mitigate the effects of accidents (for example, a sprinkler system to extinguish a
fire in a building). Active systems require detection of a hazardous condition and some kind of action to
prevent or mitigate the accident.
(4) Procedural – Operating procedures, operator response to alarms, emergency response procedures.
In general, inherent and passive strategies are the most robust and reliable, but elements of all strategies
will be required for a comprehensive process safety management program when all hazards of a process and
plant are considered. Approaches to inherently safer design fall into these categories:
(1) Minimize – Significantly reduce the quantity of hazardous material or energy in the system, or eliminate
the hazard entirely if possible.
(2) Substitute – Replace a hazardous material with a less hazardous substance, or a hazardous chemistry
with a less hazardous chemistry.
(3) Moderate – Reduce the hazards of a process by handling materials in a less hazardous form, or under
less hazardous conditions, for example at lower temperatures and pressures.
(4) Simplify – Eliminate unnecessary complexity to make plants more “user friendly” and less prone to
human error and incorrect operation.
One important issue in the development of inherently safer chemical technologies is that the property of a
material which makes it hazardous may be the same as the property which makes it useful. For example,
gasoline is flammable, a well known hazard, but that flammability is also why gasoline is useful as a
transportation fuel. Gasoline is a way to store a large amount of energy in a small quantity of material, so it
is an efficient way of storing energy to operate a vehicle. As long as we use large amounts of gasoline for
fuel, there will have to be large inventories of gasoline somewhere.
Inherently safer design of a process addresses the first bullet, but does not have any impact whatsoever on
conventional safety and security needs for the others. A company will still need to protect the site the same
way, whether it uses inherently safer processes or not. Therefore, inherently safer design will not
significantly reduce safety and security requirements for a plant. The objectives of process safety
management and security vulnerability management in a chemical plant are safety and security, not
necessarily inherent safety and inherent security. It is possible to have a safe and secure operation at a facility
with inherent hazards. In fact this is essential for a facility for which there is no technologically feasible
alternative; for example, we cannot envision any way of eliminating large inventories of flammable
transportation fuels in the foreseeable future. An example from another technology – one which many of us
frequently use – may be useful in understanding that the true objective of safety and security management
is safety and security, not inherent safety and security. Airlines are in the business of transporting people
and things from one place to another. They are not really in the business of flying airplanes – that is just the
technology they have selected to accomplish their real business purpose. Airplanes have many major
hazards associated with their operation. One of them, tragically demonstrated on September 11, is that they
can crash into buildings or people on the ground, either accidentally or from terrorist activity. In fact,
essentially the entire population of the United States, or even the world, is potentially vulnerable to this
hazard. Inherently safer technologies which completely eliminate this hazard are available – high speed rail
transport is well developed in Europe and Japan. But we do not require airline companies to adopt this
technology, or even to consider it and justify why they do not adopt it. We recognize that the true objective
is “safety” and “security” not “inherent safety” or “inherent security.” The passive, active, and procedural risk
management features of the air transport system have resulted in an enviable, if not perfect, safety record,
and nearly all of us are willing to travel in an airplane or allow them to fly over our houses. Some issues and
challenges in implementation of inherently safer design are:
(1) The chemical industry is a vast interconnected ecology of great complexity. There are dependencies
throughout the system, and any change will have cascading effects throughout the chemical ecosystem.
It is possible that making a change in technology that appears to be inherently safer locally at some
point within this complex enterprise will actually increase hazards elsewhere once the entire system
reaches a new equilibrium state. Such changes need to be carefully and thoughtfully evaluated to fully
understand all of their implications.
(2) In many cases it will not be clear which of several potential technologies is really inherently safer, and
there may be strong disagreements about this. Chemical processes and plants have multiple hazards,
and different technologies will have different inherent safety characteristics with respect to each of those
multiple hazards. Some examples of chemical substitutions which were thought to be safer when initially
made, but were later found to introduce new hazards include: (1) Chlorofluorocarbon (CFC) refrigerants –
Low acute toxicity, non-flammable, but later found to have long-term environmental impacts; (2) PCB
transformer fluids – Non-flammable, but later determined to have serious toxicity and long term
environmental impacts.
(3) Who is to determine which alternative is inherently safer, and how do they make this determination?
This decision requires consideration of the relative importance of different hazards, and there may not
be agreement on this relative importance. This is particularly a problem with requiring the
implementation of inherently safer technology – who determines what that technology is? There are tens
of thousands of chemical products manufactured, most of them by unique and specialized processes.
The real experts on these technologies, and on the hazards associated with the technology, are the
people who invent the processes and run the plants. In many cases they have spent entire careers
understanding the chemistry, hazards, and processes. They are in the best position to understand the
best choices, rather than a regulator or bureaucrat with, at best, a passing knowledge of the
technology. But, these chemists and engineers must understand the concept of inherently safer design,
and its potential benefits – we need to educate those who are in the best position to invent and promote
inherently safer alternatives.
(4) Development of new chemical technology is not easy, particularly if you want to fully understand all of
the potential implications of large scale implementation of that technology. History is full of examples of
changes that were made with good intentions that gave rise to serious issues which were not anticipated
at the time of the change, such as the use of CFCs and PCBs mentioned above. Dennis Hendershot
has published brief descriptions of an inherently safer design for a reactor in which a large
batch reactor was replaced with a much smaller continuous reactor. This is easy to describe in a few
paragraphs, but actually this change represents the results of several years of process research by a
team of several chemists and engineers, followed by another year and millions of dollars to build the
new plant, and get it to operate reliably. And, the design only applies to that particular product. Some of
the knowledge might transfer to similar products, but an extensive research effort would still be
required. Furthermore, Dennis Hendershot has also co-authored a paper which shows that the small
reactor can be considered to be less inherently safe from the viewpoint of process dynamics – how the
plant responds to changes in external conditions – for example, loss of power to a material feed pump.
The underlying point is that these are not easy decisions, and they require an intimate
knowledge of the process.
(5) Extrapolate the example in the preceding paragraph to thousands of chemical technologies, which can
be operated safely and securely using an appropriate blend of inherent, passive, active, and procedural
strategies, and ask if this is an appropriate use of our national resources. Perhaps money for investment
is a lesser concern: “Do we have enough engineers and chemists to be able to do this in any reasonable
time frame?”, “Do the inherently safer technologies for which they will be searching even exist?”.
(6) The answer to the question “which technology is inherently safer?” may not always be the same – there is
most likely not a single “best technology” for all situations. Consider this non-chemical example. Falling
down the steps is a serious hazard in a house and causes many injuries. These injuries could be avoided
by mandating inherently safer houses – we could require that all new houses be built with only one floor,
and we could even mandate replacement of all existing multi-story houses. But would this be the best
thing for everybody, even if we determined that it was worth the cost? Many people in New Orleans
survived the flooding in the wake of Hurricane Katrina by fleeing to the upper floors or attics of their
houses. Some were reportedly trapped there, but many were able to escape the flood waters in this
way. So, single story houses are inherently safer with respect to falling down the steps, but multi-story
houses may be inherently safer for flood prone regions. We need to recognize that decision makers must
be able to account for local conditions and concerns in their decision process.
(7) Some technology choices which are inherently safer locally may actually result in an increased hazard
when considered more globally. A plant can enhance the inherent safety of its operation by replacing a
large storage tank with a smaller one, but the result might be that shipments of the material need to be
received by a large number of truck shipments instead of a smaller number of rail car shipments. Has
safety really been enhanced, or has the risk been transferred from the plant site to the transportation
system, where it might even be larger?
(8) We have a fear that regulations requiring implementation of inherently safer technology will make this a
“one time and done” decision. You get through the technology selection and pick the inherently safer
option, meet the regulation, and then you do not have to think about it any more. We want engineers to
be thinking about opportunities for implementation of inherently safer designs at all times in everything
they do – it should be a way of life for those designing and operating chemical, and other, technologies.
(9) Inherently safer processes require innovation and creativity. How do you legislate a requirement to be
creative? Inherently safer alternatives cannot be invented by legislation.
What should we be doing to encourage inherently safer technology? Inherently safer design is primarily an
environmental and process safety measure, and its potential benefits and concerns are better discussed in
context of future environmental legislation, with full consideration of the concerns and issues discussed
above. While consideration of inherently safer processes does have value in some areas of chemical plant
security vulnerability – the concern about off site impact of releases of toxic materials – there are other
approaches which can also effectively address these concerns, and industry needs to be able to utilize all of
the tools in determining the appropriate security vulnerability strategy for a specific plant site. Some of the
current proposals regarding inherently safer design in safety and security regulations seem to drive plants to
create significant paperwork to justify not using inherently safer approaches, and this does not improve
safety and security. We believe that future invention and implementation of inherently safer technologies, to
address both safety and security concerns, is best promoted by enhancing awareness and understanding of
the concepts by everybody associated with the chemical enterprise. They should be applying this design
philosophy in everything they do, from basic research through process development, plant design, and plant
operation. Also, business management and corporate executives need to be aware of the philosophy, and its
potential benefits to their operations, so they will encourage their organization to look for opportunities
where implementing inherently safer technology makes sense.
Pressure Instruments
In general, pressure instruments will have linear scales with units of measurement in pounds per square inch
gauge (psig). Pressure gauges will have either a blowout disk or a blowout back and an acrylic or
shatterproof glass face. Pressure gauges on process piping will be resistant to plant atmospheres. Pressure
test points will have isolation valves and caps or plugs. Pressure devices on pulsating services will have
pulsation dampers.
Temperature Instruments
In general, temperature instruments will have scales with temperature units in degrees Celsius (ºC) or
Fahrenheit (ºF). Exceptions to this are electrical machinery resistance temperature detectors (RTD) and
transformer winding temperatures, which are in degrees Celsius (ºC). Dial thermometers will have 4.5 or 5
inch-in-diameter (minimum) dials and white faces with black scale markings and will be every-angle type
and bimetal actuated. Dial thermometers will be resistant to plant atmospheres. Temperature elements and
dial thermometers will be protected by thermowells except when measuring gas or air temperatures at
atmospheric pressure. Temperature test points will have thermowells and caps or plugs. Resistance
temperature detectors (RTDs) will be 100 ohm platinum or 10 ohm copper, ungrounded, three-wire circuits
(R100/R0 = 1.385). The element will be spring-loaded, mounted in a thermowell, and connected to a cast iron
head assembly. Thermocouples will be single-element, grounded, spring-loaded, Chromel-Constantan (ANSI
Type E) for general service. Thermocouple heads will be the cast type with an internal grounding screw.
Level Instruments
Reflex-glass or magnetic level gauges will be used. Level gauges for high-pressure service will have suitable
personnel protection. Gauge glasses used in conjunction with level instruments will cover the same range
as the instrument. Level gauges will be selected so that the normal vessel level is approximately at
gauge center.
Flow Instruments
Flow transmitters will be the differential pressure type with the range matching the primary element. In
general, linear scales and charts will be used for flow indication and recording. In general, airflow
measurements will be temperature-compensated.
Control Valves
Control valves in throttling service will generally be the globe-body cage type with body materials, pressure
rating, and valve trims suitable for the service involved. Other style valve bodies (e.g. butterfly, eccentric
disk) may also be used when suitable for the intended service. Valves will be designed to fail in a safe
position. Control valve body size will not be more than two sizes smaller than line size, unless the smaller
size is specifically reviewed for stresses in the piping. Control valves in 600-class service and below will be
flanged where economical. Where flanged valves are used, minimum flange rating will be ANSI 300 Class.
Severe service valves will be defined as valves requiring anti-cavitation trim, low noise trim, or flashing
service, with differential pressures greater than 100 pounds per square inch differential (psid). In general,
control valves will be specified for a noise level no greater than 90 A-weighted decibels (dBA) when
measured 3 feet downstream and 3 feet away from the pipe surface. Valve actuators will use positioners
and will be the highest-pressure, smallest-size actuator of the pneumatic spring-diaphragm or piston type.
Actuators will be sized to shut off against at least 110 percent of the maximum shutoff pressure and
designed to function with instrument air pressure ranging from 60 psig to 125 psig. Handwheels will be
furnished only on those valves that can be manually set and controlled during system operation (to maintain
plant operation) and do not have manual bypasses. Control valve accessories (excluding controllers) will be
mounted on the valve actuator unless severe vibration is expected. Solenoid valves supplied with control
valves will have Class H coils. The coil enclosure will normally be a minimum of NEMA 4 but will be suitable
for the area of installation. Terminations will typically be by pigtail wires. Valve position switches (with input
to the distributed control system for display) will be provided for motor operated valves (MOV) and open-
close pneumatic valves. Automatic combined recirculation flow control and check valves (provided by the
pump manufacturer) will be used for pump minimum-flow recirculation control. These valves will be the
modulating type.
Instrument tubing will be supported in both horizontal and vertical runs as necessary. Expansion loops will
be provided in tubing runs subject to high temperatures. The instrument tubing support design will allow for
movement of the main process line.
Field-Mounted Instruments
Field-mounted instruments will be of a design suitable for the area in which they are located. They will be
mounted in areas accessible for maintenance and relatively free of vibration and will not block walkways or
prevent maintenance of other equipment. Freeze protection will be provided. Field-mounted instruments will
be grouped on racks. Supports for individual instruments will be prefabricated, off-the-shelf, 2 inch pipe
stand. Instrument racks and individual supports will be mounted to concrete floors, to platforms, or on
support steel in locations not subject to excessive vibration. Individual field instrument sensing lines will be
sloped or pitched in such a manner and be of such length, routing, and configuration that signal response is
not adversely affected. Local control loops will generally use a locally-mounted indicating controller (flow,
pressure, temperature, etc.). Liquid level controllers will generally be the non-indicating, displacement type
with external cages.
president of operation. In our experience, many companies claim to hold plant managers accountable, but in
the final analysis production goals usually take precedence over safety requirements.
definitions for safety interlock levels and developed standards for the maintenance of interlocks within each
safety interlock level. Then the company developed a guideline that required the implementation of specified
safety interlock levels based solely on safety consequence levels (instead of risk levels). If a process had the
potential for an overpressure event resulting in a catastrophic release of a toxic material or a fire or
explosion (defined as a Category V consequence as listed in Table 1.01) due to a runaway chemical reaction,
then a Class A interlock (triple redundant sensors and double redundant actuator) was required by the
company for preventing the condition that could lead to the runaway. However, basing this decision solely
on the safety consequence levels did not give any credit for existing safeguards or alternate approaches to
reducing the risk of the overpressure scenario. As a result, this safety interlock level standard skewed
accident prevention toward installing and maintaining complex (albeit highly reliable) interlocks. The
technical personnel in the plants very loudly voiced their concern about this extreme “belts and suspenders”
approach.
Some companies would have called this a semi-quantitative approach, but in this company, the process
hazard analyses (PHA) teams used this matrix to “qualitatively” judge risk. Teams would vote on which
consequence and frequency categories an accident scenario belonged (considering the qualitative merits of
each existing safeguard), and they would generate recommendations for scenarios not in the tolerable risk
area. This approach worked well for most scenarios, but the company soon found considerable
inconsistencies in the application of the risk matrix in qualitative risk judgments. Also, the company observed
that too many accident scenarios were requiring resource-intensive quantitative risk assessments (QRA). It
was clear that an intermediate approach for judging the risk of moderately complex scenarios was needed.
And, the company still needed to eliminate the conflict between the risk matrix and the safety interlock level
standard.
Develop A Semiquantitative Approach (The Beginnings Of A Tiered Approach) For Risk Judgment (Step 6)
This was a very significant step for the company to take; the effort began in early 1995 and was
implemented in early 1996. Along with the inconsistencies in applying risk judgment tools, there was still
confusion among plant personnel about when and how they should use the safety interlock level standard
and the risk matrix. Both were useful tools that the company had spent considerable resources to develop
and implement. The new guidelines would need to somehow integrate the safety interlock levels and the risk
matrix categories to form a single standard for making decisions. And the plants also needed a tool (or
multiple tools), besides the extremes of pure qualitative judgment and a quantitative risk assessments
(QRA), to decide on the best alternative for controlling the risk of an identified scenario. The technical
personnel from the corporate offices and from the plants worked together to develop a semiquantitative tool
and to define the needed guidelines. One effort toward a semiquantitative tool involved defining a new term
called an independent protection layer (IPL), which would represent a single layer of safety for an accident
scenario. Defining this new term required developing examples of independent protection layers (IPL) to
which the plant personnel would be able to relate. For example, a spring-loaded relief valve is independent
from a high-pressure alarm; thus a system protected by both of these devices has two independent
protection layers (IPL). On the other hand, a system protected by a high-pressure alarm and a shutdown
interlock using the same transmitter has only one independent protection layer. Class A, Class B, and Class C
safety interlocks (which were defined previously in the safety interlock level standard) were also included as
example independent protection layers (IPL). To ensure consistent application of independent protection
layers, i.e. to account for the relative reliability and availability of various types of independent protection
layers, it was necessary to identify how much “credit” plant personnel could claim for a particular type of
independent protection layer (IPL). For example, a Class A safety interlock would deserve more credit than a
Class B interlock, and a relief valve would be given more credit than a process alarm. This need was
addressed by assigning a “maximum credit number” for each example independent protection layers (see
Table 1.03).
The credit is essentially the order of magnitude of the risk reduction anticipated by claiming the safeguard as
an independent protection layer (IPL) for the accident scenario. The company believed that when process
hazard analysis teams or designers used the independent protection layer definitions and related credit
numbers, the consistency between risk analyses at the numerous plants would improve. Another (parallel)
effort involved assigning frequency categories to typical “initiating events” for accident scenarios (see Table
1.04); these initiating events were intended to represent the types of events that could occur at any of the
various plants. The frequency categories were derived from process hazard analysis (PHA) experience within
the company and provided a consistent starting point for semiquantitative analysis. Finally, a semi-
quantitative approach for estimating risk was developed, incorporating the frequency of initiating events and
the independent protection layer (IPL) credits described previously. Although this approach used standard
equations and calculation sheets not described here, the basic approach required teams to:
(1) Identify the ultimate consequence of the accident scenario and document the scenario as clearly as
possible, stating the initiating event and any assumptions.
(2) Estimate the frequency of the initiating event (using a frequency from Table 1.04, if possible).
(3) Estimate the risk of the unmitigated event and determine from the risk matrix whether the risk is
    tolerable as is. If the risk is not tolerable, take credit for existing independent protection layers (IPL)
    until the risk reaches a tolerable level in the risk matrix (use best judgment in defining independent
    protection layers and deciding which ones to take credit for first). If the risk is still not tolerable,
    develop recommendations that will lower the risk to a tolerable level.
(4) Record the specific safety features (independent protection layers) that were used to reach a tolerable
risk level.
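The four-step screening above can be sketched in code. This is a minimal illustration only: the IPL credit numbers, the safeguard names, and the order-of-magnitude-per-credit rule are hypothetical stand-ins for the company's actual Table 1.03 and 1.04 values, which are not reproduced here.

```python
# Hypothetical sketch of the semiquantitative screening (steps 1-4).
# Each "maximum credit number" is one order of magnitude of risk
# reduction claimed for that safeguard; values below are illustrative.
IPL_CREDITS = {
    "relief_valve": 2,
    "class_a_interlock": 3,
    "class_b_interlock": 2,
    "process_alarm": 1,
}

def mitigated_frequency(initiating_freq_per_year, ipls):
    """Step 3: apply one order-of-magnitude reduction per IPL credit."""
    total_credits = sum(IPL_CREDITS[ipl] for ipl in ipls)
    return initiating_freq_per_year / (10 ** total_credits)

# Example: initiating event at 0.1/yr, protected by a relief valve
# and an independent high-pressure alarm (2 + 1 = 3 credits).
freq = mitigated_frequency(0.1, ["relief_valve", "process_alarm"])
print(freq)  # about 1e-4 events per year
```

The mitigated frequency would then be compared against the tolerable-risk region of the company's risk matrix (step 4 records which IPLs were credited).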
The company demanded “zero” tolerance for deviating from inspection, testing, or calibration of the
documented hardware independent protection layers (IPL) and enforcement of administrative independent
protection layers. Any deviation without prior approval was considered a serious deficiency on internal
audits. Other features not credited as independent protection layers could be kept if they served a quality,
productivity, or environmental protection purpose; otherwise, these items could be “run to failure” or
removed because doing so would have no effect on the risk level. This semiquantitative approach explicitly
met a need expressed in Step 3: determining which of the engineered features was critical to managing risk.
Process hazard analysis teams began applying this approach to validate their qualitative risk judgments.
However, the company still needed to (1) formalize guidelines for when to use qualitative, semi-quantitative,
and quantitative risk judgment tools and (2) standardize the use of each tool.
CONCLUSIONS
This approach helps the company manage their risk control resources wisely and helps to more defensibly
justify decisions with regulatory and legal implications. The key to the success of this program lies beyond
the mechanics of the risk-judgment approach; it lies with the care company personnel have taken to
understand and manage risk on a day-to-day basis. Company management will develop clear,
comprehensive standards, guidelines, and training to ensure the plants manage risk appropriately. This will
be reinforced by company management taking an aggressive stance on enforcing adherence by the plants to
company standards. The risk judgment standards and guidelines appear to be working to effectively reduce
risk while minimizing the cost of maintaining “critical” safeguards. This success will serve as only one
example that risk management throughout a multinational chemical company is possible, practical, and
necessary.
CONCEPTION, DESIGN, AND IMPLEMENTATION
CHAPTER 2
Implementation Measures
The second group, the implementation measures, comprises the next eight steps:
(1) Operation and maintenance planning.
(2) Validation planning.
(3) Installation and commissioning planning.
(4) Safety-related systems and E/E/PES implementation.
(5) Safety-related systems: other technology implementation.
(6) External risk reduction facilities implementation.
(7) Overall installation and commissioning.
(8) Overall safety validation.
These steps would be conducted by the end user together with chosen contractors and suppliers of
equipment. It may be readily appreciated that, whilst each of these steps has a simple title, the work
involved in carrying out the tasks can be complex and time-consuming!
The third group is essentially one of operating the process with its effective safeguards and involves the final
three steps:
(1) Overall operation and maintenance.
(2) Overall modification and retrofit.
(3) Decommissioning.
These steps are normally carried out by the plant end user and his contractors.
Following the directives given in IEC EN 61511 and implementing the steps in the safety life cycle, when the
safety assessments are carried out and E/E/PES are used to carry out safety functions, IEC EN 61508 then
identifies the aspects which need to be addressed. There are essentially two groups, or types, of subsystems
that are considered within the standard:
(1) The equipment under control (EUC) carries out the required manufacturing or process activity.
(2) The control and protection systems implement the safety functions necessary to ensure that the
equipment under control is suitably safe.
[Figure: overall safety life-cycle phases, including concept; overall scope definition; overall safety
requirements; safety requirements allocation; and decommissioning or disposal]
Fundamentally, the goal here is the achievement or maintenance of a safe state for the equipment under
control. You can think of the “control system” causing a desired equipment under control operation and the
“protection system” responding to undesired equipment under control operation. Note that, dependent upon
the risk-reduction strategies implemented, it may be that some control functions are designated as safety
functions. In other words, do not assume that all safety functions are to be performed by a separate
protection system. If you find it difficult to conceive exactly what is meant by the IEC EN 61508 reference to
equipment under control, it may be helpful to think in terms of “process”, which is the term used in IEC EN
61511. When any possible hazards are analysed and the risks arising from the equipment under control and
its control system cannot be tolerated, then a way of reducing the risks to tolerable levels must be found.
Perhaps in some cases the equipment under control or control system can be modified to achieve the
requisite risk-reduction, but in other cases protection systems will be needed. These protection systems are
designated safety-related systems, whose specific purpose is to mitigate the effects of a hazardous event or
to prevent that event from occurring.
When there is a history of plant operating data or industry-specific methods or guidelines, then the analysis
may be readily structured, but is still complex. This step of clearly identifying hazards and analysing risk is
one of the most difficult to carry out, particularly if the process being studied is new or innovative. The
standards embody the principle of balancing the risks associated with the equipment under control (i.e. the
consequences and probability of hazardous events) by relevant dependable safety functions. This balance
includes the aspect of tolerability of the risk. For example, the probable occurrence of a hazard whose
consequence is negligible could be considered tolerable, whereas even the occasional occurrence of a
catastrophe would be an intolerable risk. If, in order to achieve the required level of safety, the risks of the
equipment under control cannot be tolerated according to the criteria established, then safety functions
must be implemented to reduce the risk. The goal is to ensure that the residual risk – the probability of a
hazardous event occurring even with the safety functions in place – is less than or equal to the tolerable
risk. The diagram shows this effectively, where the risk posed by the equipment under control is reduced to
a tolerable level by a “necessary risk reduction” strategy. The reduction of risk can be achieved by a
combination of items rather than depending upon only one safety system and can comprise organisational
measures as well. The effect of these risk reduction measures and systems must be to achieve an “actual
risk reduction” that is greater than or equal to the necessary risk reduction.
Safety integrity is defined as the probability of a safety-related system satisfactorily performing the required
safety functions under all the stated conditions within a stated period of time. Thus the specification of the safety function
includes both the actions to be taken in response to the existence of particular conditions and also the time
for that response to take place. The safety integrity level is a measure of the reliability of the safety function
performing to specification.
PROBABILITY OF FAILURE
To categorise the safety integrity of a safety function the probability of failure is considered, in effect the
inverse of the safety integrity level definition, looking at failure to perform rather than success. It is easier to
identify and quantify possible conditions and causes leading to failure of a safety function than it is to
guarantee the desired action of a safety function when called upon. Two classes of safety integrity level are
identified, depending on the service provided by the safety function. For safety functions that are activated
when required (on demand mode) the probability of failure to perform correctly is given, whilst for safety
functions that are in place continuously the probability of a dangerous failure is expressed in terms of a
given period of time (per hour or in continuous mode). In summary, IEC EN 61508 requires that when safety
functions are to be performed by E/E/PES the safety integrity is specified in terms of a safety integrity level.
The probabilities of failure are related to one of four safety integrity levels, as shown in Table 2.01.
An important consideration for any safety related system or equipment is the level of certainty that the
required safe response or action will take place when it is needed. This is normally determined as the
likelihood that the safety loop will fail to act as and when it is required to and is expressed as a probability.
The standards apply both to safety systems operating on demand, such as an emergency shut-down (ESD)
system, and to systems operating “continuously” or in high demand, such as the process control system. For
a safety loop operating in the demand mode of operation the relevant factor is the probability the function
fails on demand average (PFDavg), which is the average probability of failure on demand. For a continuous or
high demand mode of operation the probability of a dangerous failure per hour (PFH) is considered rather
than probability the function fails on demand average (PFDavg). Obviously the aspect of risk that was
discussed earlier and the probability of failure on demand of a safety function are closely related. Using the
definitions, frequency of accident or event in the absence of protection functions (Fnp) and tolerable
frequency of accident or event (Ft), then the risk reduction factor (R) is defined as,
R = Fnp / Ft    [2.01]

PFDavg = 1 / R = Ft / Fnp    [2.02]
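Equations [2.01] and [2.02] can be checked with a short calculation; the frequencies below are illustrative assumptions, not figures from the standard.

```python
# Risk reduction factor and required PFDavg from eqs. [2.01] and [2.02].
Fnp = 0.5   # frequency of the event with no protection, per year (assumed)
Ft = 1e-4   # tolerable event frequency, per year (assumed)

R = Fnp / Ft            # risk reduction factor, eq. [2.01]
PFD_avg = Ft / Fnp      # = 1/R, eq. [2.02]

print(R, PFD_avg)  # a risk reduction factor of 5000 requires PFDavg of 2e-4
```

A required PFDavg of 2e-4 falls in the SIL 3 band for low demand mode (Table 2.05).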
Since the concepts are closely linked, similar methods and tools are used to evaluate risk and to assess the
probability the function fails on demand average (PFDavg). Failure modes and effects analysis (FMEA) is a
way to document the system being considered using a systematic approach to identify and evaluate the
effects of component failures and to determine what could reduce or eliminate the chance of failure. Once
the possible failures and their consequence have been evaluated, the various operational states of the
subsystem can be associated using the Markov models, for example. One other factor that needs to be
applied to the calculation is that of the interval between tests, which is known as the “proof time” or the
“proof test interval”. This is a variable that may depend not only upon the practical implementation of
testing and maintenance within the system, subsystem or component concerned, but also upon the desired
end result. By varying the proof time within the model it can result that the subsystem or safety loop may be
suitable for use with a different safety integrity level (SIL). Practical and operational considerations are often
the guide. In the related area of application that most readers may be familiar with one can consider the fire
alarm system in a commercial premises. Here, the legal or insurance driven need to frequently test the
system must be balanced with the practicality and cost to organise the tests. Maybe the insurance premiums
would be lower if the system were to be tested more frequently but the cost and disruption to organise and
implement them may not be worth it. Note also that “low demand mode” is defined as one where the
frequency of demands for operation made on a safety related system is no greater than one per year and no
greater than twice the proof test frequency. Failure rate λd is the dangerous (detected and undetected)
failure rate of a channel in a subsystem. For the probability the function fails on demand (PFD) calculation
(low demand mode) it is stated in failures per year. The target failure measure PFDavg is the average
probability of failure on demand of a safety function or subsystem. The probability of a failure is time
dependent: it is a function of the failure rate (λ) and the time (t) between proof tests. The maximum safety integrity
level (SIL) according to the failure probability requirements is then read out from Table 2.05. That means
that you cannot find out the maximum safety integrity level of your system, or subsystem, if you do not
know if a test procedure is implemented by the user and what the test intervals are! These values are
required for the whole safety function, usually including different systems or subsystems. The average
probability of failure on demand of a safety function is determined by calculating and combining the average
probability of failure on demand for all the subsystems, which together provide the safety function. If the
probabilities are small, this can be expressed by the following,

PFDsys ≈ PFDs + PFDl + PFDfe

where PFDsys is the average probability of failure on demand of the safety-related system performing the
safety function; PFDs is the average probability of failure on demand for the sensor subsystem; PFDl is the
average probability of failure on demand for the logic subsystem; and PFDfe is the average probability of
failure on demand for the final element subsystem.
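As a sketch of this combination rule, with illustrative subsystem figures (the values below are assumptions, not data from the standard):

```python
# When the individual probabilities are small, the overall average
# probability of failure on demand is approximately the sum of the
# sensor, logic-solver, and final-element subsystem contributions.
pfd_sensor = 1.2e-3         # assumed sensor subsystem PFDavg
pfd_logic = 5.0e-5          # assumed logic subsystem PFDavg
pfd_final_element = 2.5e-3  # assumed final element subsystem PFDavg

pfd_sys = pfd_sensor + pfd_logic + pfd_final_element
print(pfd_sys)  # about 3.75e-3, i.e. within the SIL 2 band of Table 2.05
```

Note that the final element (typically a valve) often dominates the total, as it does here.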
These terms are further categorised into “detected” or “undetected” to reflect the level of diagnostic ability
within the device. For example:
(1) λdd is the dangerous detected failure rate.
(2) λdu is the dangerous undetected failure rate.

SFF = 1 − λdu / λtotal    [2.06]
Table 2.02 – Hardware safety integrity. Architectural constraints on type A safety-related subsystems (IEC
EN 61508-2).
Safe Failure Fraction (SFF)    Hardware Fault Tolerance (HFT)
                               0        1        2
< 60%                          SIL 1    SIL 2    SIL 3
60% – 90%                      SIL 2    SIL 3    SIL 4
90% – 99%                      SIL 3    SIL 4    SIL 4
> 99%                          SIL 3    SIL 4    SIL 4
A type B subsystem has, by definition, at least one of the following characteristics: the failure mode of at
least one component is not well defined; or the behaviour of the subsystem under fault conditions cannot be
completely determined; or there is insufficient dependable failure data from field experience to show that
the claimed rates of detected and undetected dangerous failures are met.
Table 2.03 – Hardware safety integrity. Architectural constraints on type B safety-related subsystems (IEC
EN 61508-2).
Safe Failure Fraction (SFF)    Hardware Fault Tolerance (HFT)
                               0              1        2
< 60%                          Not allowed    SIL 1    SIL 2
60% – 90%                      SIL 1          SIL 2    SIL 3
90% – 99%                      SIL 2          SIL 3    SIL 4
> 99%                          SIL 3          SIL 4    SIL 4
These definitions, in combination with the fault tolerance of the hardware, are part of the “architectural
constraints” for the hardware safety integrity as shown in Table 2.02 and Table 2.03. In the tables above, a
hardware fault tolerance of N means that N+1 faults could cause a loss of the safety function. For example,
if a subsystem has a hardware fault tolerance of 1 then 2 faults need to occur before the safety function is
lost. We have seen that protection functions, whether performed within the control system or a separate
protection system, are referred to as safety-related systems. If, after analysis of possible hazards arising from
the equipment under control (EUC) and its control system, it is decided that there is no need to designate
any safety functions, then one of the requirements of IEC EN 61508 is that the dangerous failure rate of the
equipment under control system shall be below the level given for a SIL 1 rating. So, even when a process
may be considered benign, with no intolerable risks, the control system must be shown to have a rate no
higher than 10^-5 dangerous failures per hour.
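The architectural constraints of Tables 2.02 and 2.03 lend themselves to a simple lookup. The sketch below transcribes the two tables; the function names and the band-index representation are illustrative choices, not part of the standard.

```python
# Maximum SIL from safe failure fraction (SFF) and hardware fault
# tolerance (HFT), per Tables 2.02 (type A) and 2.03 (type B).
# Each row is an HFT value; columns are the SFF bands
# <60%, 60-90%, 90-99%, >99%. A value of 0 means "not allowed".
TYPE_A = {0: [1, 2, 3, 3], 1: [2, 3, 4, 4], 2: [3, 4, 4, 4]}
TYPE_B = {0: [0, 1, 2, 3], 1: [1, 2, 3, 4], 2: [2, 3, 4, 4]}

def sff_band(sff):
    """Index of the SFF band: <60%, 60-90%, 90-99%, >99%."""
    if sff < 0.60:
        return 0
    if sff < 0.90:
        return 1
    if sff < 0.99:
        return 2
    return 3

def max_sil(subsystem_type, sff, hft):
    table = TYPE_A if subsystem_type == "A" else TYPE_B
    return table[hft][sff_band(sff)]

# A type B subsystem with SFF = 93% and one redundant channel (HFT = 1)
print(max_sil("B", 0.93, 1))  # 3
```

The same lookup shows why a type B subsystem with SFF below 60% and no fault tolerance is not allowed at any SIL.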
There are further properties of safety functions and safety systems that need to be explained before considering an example. These are
safe failure fraction and the probability of failure.
SFF = λs / (λs + λd)    [2.07]
The probability the function fails on demand (PFD) and safe failure fraction (SFF) of this device depend on
the overall safety function and its fault reaction function. If, for example, a “fail low” failure will bring the
system into a safe state and the “fail high” failure will be detected by the logic solver input circuitry, then
these component faults are considered as safe. If, on the other hand, a “fail low” failure will bring the
system into a safe state and the “fail high” failure will not be detected and could lead to a dangerous state
of the system, then this fault is a dangerous fault.
λ = FIT / Ncomp    [2.08]
Usually, the failure rate of components and systems is high at the beginning of their life and falls rapidly
(“infant mortality”, defective components fail normally within 72 hours). Then, for a long time period the
failure rate is constant. At the end of their life, the failure rate of components and systems starts to
increase, due to wear effects. This failure distribution is also referred to as a “bathtub” curve. In the area of
electrical and electronic devices the failure rate is considered to be constant (λ = k). Since we have
considered the failure rate as being constant, in this case the failure distribution will be exponential. This
kind of probability density function (PDF) is very common in the technical field.

f(t) = λ e^(−λt)    [2.09]
where λ is the constant failure rate (failures per unit of time) and t is the time. The cumulative distribution
function (CDF, also referred to as the cumulative density function) represents the cumulated probability of a
random component failure, F(t). F(t) is also referred to as the unavailability and includes all the failure
modes. The probability of failure on demand (PFD) is given by,

F(t) = PFS + PFD    [2.10]

where PFS is the probability of safe failures, PFD is the probability of dangerous failures (λ = λdu), and F(t)
is the probability of failure on demand (PFD) when λ = λdu. For a continuous random variable,
F(t) = ∫0→t f(t) dt    [2.11]

where f(t) is the probability density function (PDF). In the case of an exponential distribution,

F(t) = 1 − e^(−λt)    [2.12]

which, for small λt, is approximately,

F(t) ≈ λt    [2.13]

and the reliability is,

R(t) = e^(−λt)    [2.14]
The reliability represents the probability that a component will operate successfully. The only parameter of
interest in industrial control systems, in this context, is the average probability of failure on demand
(PFDavg). In the case of an exponential distribution,
PFDavg = (1/T1) ∫0→T1 F(t) dt    [2.15]

PFDavg = (1/T1) ∫0→T1 λd t dt    [2.16]

where λd is the rate of dangerous failures per unit of time and T1 is the time to the next test, giving,

PFDavg = (1/2) λd T1    [2.17]
If the relationship between λdu and λdd is unknown, one usually sets the assumption λdu = λdd = (1/2) λd,
so that,

PFDavg = (1/2) λdu T1    [2.18]

and

PFDavg = (1/4) λd T1    [2.19]
where λdu are the dangerous undetected failures and λdd are the dangerous detected failures. The mean
time between failures (MTBF) is the “expected” time to a failure and not the “guaranteed minimum life
time”! For constant failure rates,

MTBF = ∫0→∞ R(t) dt    [2.20]

or

MTBF = 1 / λ    [2.21]
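A short numeric sketch of equations [2.14], [2.17], and [2.21], using an illustrative dangerous failure rate and a one-year proof-test interval (both values are assumptions, not data from the standard):

```python
import math

# Exponential-failure quantities for a single channel.
lam_d = 2e-6    # assumed dangerous failure rate, per hour
T1 = 8760.0     # proof-test interval: one year, in hours

reliability = math.exp(-lam_d * T1)   # R(t) = e^(-lambda*t), eq. [2.14]
pfd_avg = 0.5 * lam_d * T1            # PFDavg = (1/2)*lambda_d*T1, eq. [2.17]
mtbf_hours = 1.0 / lam_d              # MTBF = 1/lambda, eq. [2.21]

print(round(pfd_avg, 6))  # 0.00876, within the SIL 2 band of Table 2.05
```

Halving the proof-test interval would halve PFDavg, which is why the choice of proof time can shift a loop from one SIL band to another.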
On the other hand there are functions which are in frequent or continuous use; examples of such functions
are:
(1) Normal braking.
(2) Steering.
CONCEPTION, DESIGN, AND IMPLEMENTATION
The fundamental question is how frequently will failures of either type of function lead to accidents. The
answer is different for the two types:
(1) For functions with a low demand rate, the accident rate is a combination of two parameters. The first
parameter is the frequency of demands, and the second parameter is the probability the function fails on
demand (PFD). In this case, therefore, the appropriate measure of performance of the function is
function fails on demand, or its reciprocal, risk reduction factor (RRF).
(2) For functions which have a high demand rate or operate continuously, the accident rate is the failure
rate () which is the appropriate measure of performance. An alternative measure is mean time to
failure (MTTF) of the function. Provided failures are exponentially distributed, mean time to failure is the
reciprocal of failure rate ().
These performance measures are, of course, related. At its simplest, provided the function can be proof-
tested at a frequency which is greater than the demand rate, the relationship can be expressed as,

PFD = λt/2 = t/(2 MTTF)   [2.22]

or

RRF = 2/(λt) = 2 MTTF/t   [2.23]

where t is the proof-test interval. Note that to significantly reduce the accident rate below the failure rate
of the function, the test frequency 1/t should be at least two and preferably five times the demand
frequency. PFD and λ are, however, different quantities. The probability the function fails on demand (PFD)
is a probability (dimensionless); λ is a failure rate with dimensions of time⁻¹. The standards, however, use
the same term safety integrity level (SIL) for both these measures, with the definitions shown in
Table 2.05.
Table 2.05 – Definitions of safety integrity level (SIL) for low demand mode and high demand mode (BS EN
61508).

Low Demand Mode
SIL    PFD                   RRF
4      10⁻⁵ ≤ PFD < 10⁻⁴    100,000 ≥ RRF > 10,000
3      10⁻⁴ ≤ PFD < 10⁻³    10,000 ≥ RRF > 1,000
2      10⁻³ ≤ PFD < 10⁻²    1,000 ≥ RRF > 100
1      10⁻² ≤ PFD < 10⁻¹    100 ≥ RRF > 10

High Demand Mode / Continuous Mode
SIL    λ (hr⁻¹)              MTTF (years)
4      10⁻⁹ ≤ λ < 10⁻⁸      100,000 ≥ MTTF > 10,000
3      10⁻⁸ ≤ λ < 10⁻⁷      10,000 ≥ MTTF > 1,000
2      10⁻⁷ ≤ λ < 10⁻⁶      1,000 ≥ MTTF > 100
1      10⁻⁶ ≤ λ < 10⁻⁵      100 ≥ MTTF > 10
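The low demand mode bands of Table 2.05 are straightforward to encode. The sketch below (the function name is my own, not from the text) maps an average PFD to its SIL band:

```python
# Sketch: map a low-demand-mode average PFD to its SIL band per Table 2.05.

def sil_from_pfd(pfd):
    """Return the low demand mode SIL for an average PFD, or None outside the bands."""
    bands = [(1e-5, 1e-4, 4), (1e-4, 1e-3, 3), (1e-3, 1e-2, 2), (1e-2, 1e-1, 1)]
    for lo, hi, sil in bands:
        if lo <= pfd < hi:
            return sil
    return None  # better than SIL 4, or not sufficient for any SIL rating

print(sil_from_pfd(4e-4))  # 3
print(sil_from_pfd(0.05))  # 1
```

The RRF column of the table is simply the reciprocal of each PFD band, RRF = 1/PFD.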
In low demand mode, safety integrity level (SIL) is a proxy for probability the function fails on demand
(PFD); in high demand and continuous mode, safety integrity level is a proxy for failure rate. The boundary
between low demand mode and high demand mode is in essence set in the standards at one demand per
year. This is consistent with proof-test intervals of 3 to 6 months, which in many cases will be the shortest
feasible interval. Now consider a function which protects against two different hazards, one of which occurs
at a rate of 1 every 2 weeks, or 25 times per year, i.e. a high demand rate, and the other at a rate of 1 in 10
years, i.e. a low demand rate. If the mean time to failure (MTTF) of the function is 50 years, it would qualify
as achieving a SIL 1 rating for the high demand rate hazard. The high demands effectively proof-test the
function against the low demand rate hazard. All else being equal, the effective safety integrity level for the
second hazard is given by equation [2.22], with t = 1/25 = 0.04 years,

PFD = 0.04/(2 × 50) = 4 × 10⁻⁴, i.e. a SIL 3 rating   [2.24]
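The arithmetic in [2.24] can be verified directly: the 25 demands per year act as proof tests, so the effective test interval is 1/25 = 0.04 years. A minimal sketch:

```python
# Minimal sketch of [2.24]: frequent demands act as proof tests,
# so the effective test interval is t = 1/25 year.
t = 1.0 / 25         # years
mttf = 50.0          # years
pfd = t / (2 * mttf)     # equation [2.22]
print(pfd)               # ≈ 4e-4 -> within the SIL 3 band of Table 2.05
```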
So what is the safety integrity level achieved by the function? Clearly it is not unique, but depends on the
hazard and in particular whether the demand rate for the hazard implies low or high demand mode. In the
first case, the achievable safety integrity level is intrinsic to the equipment; in the second case, although the
intrinsic quality of the equipment is important, the achievable safety integrity level is also affected by the
testing regime. This is important in the process industry sector, where achievable safety integrity levels are
liable to be dominated by the reliability of field equipment – process measurement instruments and,
particularly, final elements such as shutdown valves – which need to be regularly tested to achieve required
safety integrity levels. The differences between these definitions may be well understood by those who are
dealing with the standards day-by-day, but are potentially confusing to those who only use them
intermittently. The standard BS EN 61508 offers three methods of determining safety integrity level
requirements:
(1) Quantitative method.
(2) Risk graph, described in the standard as a qualitative method.
(3) Hazardous event severity matrix, also described as a qualitative method.
Risk graphs and layer of protection analysis are popular methods for determining safety integrity level
requirements, particularly in the process industry sector. Their advantages and disadvantages and range of
applicability are the main topic of this chapter.
They do not require a detailed study of relatively minor hazards. They can be used to assess many hazards
relatively quickly. They are useful as screening tools to identify hazards which need more detailed
assessment, and minor hazards which do not need additional protection, so that capital and maintenance
expenditures can be targeted where they are most effective, and lifecycle costs can be optimised.
[Typical risk graph (figure): starting from the point of risk reduction estimation, the consequence
parameter (CA to CD), the exposure parameter (FA, FB) and the avoidability parameter (PA, PB) select a
row, and the demand rate column (W3, W2, W1) gives the requirement. The successive rows read, for
columns W3 / W2 / W1:  a / - / -;  1 / a / -;  2 / 1 / a;  3 / 2 / 1;  4 / 3 / 2;  b / 4 / 3.]
Legend of typical risk graph:
“-”  No safety requirements
“a”  No special requirements
“b”  A single E/E/PES is not sufficient
“1, 2, 3, ...”  Safety integrity level
Note that geometric means are used because the scales of the risk graph parameters are essentially
logarithmic. For the unprotected hazard:
(1) Worst case risk = (1 fatality × 100% × 100%) per 3 years = 1 fatality in ~ 3 years;
(2) Geometric mean risk = (0.32 fatality × 32% × 32%) per 10 years = 1 fatality in ~ 300 years;
(3) Best case risk = (0.1 fatality × 10% × 10%) per 30 years = 1 fatality in ~ 30,000 years.
Conclusion: the unprotected risk has a range of 4 orders of magnitude. With SIL 3 rating protection:
(1) Worst case residual risk = 1 fatality in (~ 3 × 1,000) years = 1 fatality in ~ 3,000 years;
(2) Geometric mean residual risk = 1 fatality in (~ 300 × 3,200) years = 1 fatality in ~ 1 million years;
(3) Best case residual risk = 1 fatality in (~ 30,000 × 10,000) years = 1 fatality in ~ 300 million years.
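The geometric mean case can be reproduced with a short sketch, assuming (as the text does) a demand rate range of 1 per 3 to 30 years and exposure, avoidability and consequence parameters each spanning an order of magnitude:

```python
# Sketch: geometric-mean risk from the risk graph parameter ranges in the text.
from math import sqrt

def gmean(lo, hi):
    """Geometric mean of a range (the parameter scales are logarithmic)."""
    return sqrt(lo * hi)

w = gmean(1/30, 1/3)   # demand rate per year  -> ~0.105 (about 1 per 10 years)
f = gmean(0.1, 1.0)    # exposure              -> ~0.32
p = gmean(0.1, 1.0)    # avoidability          -> ~0.32
c = gmean(0.1, 1.0)    # fatalities per event  -> ~0.32

mean_rate = w * f * p * c     # fatalities per year
print(round(1 / mean_rate))   # 300 -> 1 fatality in ~300 years
```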
With SIL 3 rating the residual risk with protection has a range of 5 orders of magnitude. Figure 2.03 shows
the principle, based on the mean case.
A reasonable target for this single hazard might be 1 fatality in 100,000 years. In the worst case we achieve
less risk reduction than required by a factor of 30; in the mean case we achieve more risk reduction than
required by a factor of 10; and in the best case we achieve more risk reduction than required by a factor of
3,000. In practice, of course, it is most unlikely that all the parameters will be at their extreme values, but
on average the method must yield conservative results to avoid any significant probability that the required
risk reduction is under-estimated. Ways of managing the inherent uncertainty in the range of residual risk, to
produce a conservative outcome, include:
(1) Calibrating the graph so that the mean residual risk is significantly below the target, as above.
(2) Selecting the parameter values cautiously, i.e. by tending to select the more onerous range whenever
there is any uncertainty about which value is appropriate.
(3) Restricting the use of the method to situations where the mean residual risk from any single hazard is
only a very small proportion of the overall total target risk. If there are a number of hazards protected
by different systems or functions, the total mean residual risk from these hazards should only be a
small proportion of the overall total target risk. It is then very likely that an under-estimate of the
residual risk from one hazard will still be a small fraction of the overall target risk, and will be
compensated by an over-estimate for another hazard when the risks are aggregated.
This conservatism may incur a substantial financial penalty, particularly if higher safety integrity level
requirements are assessed.
The first option was recommended in the UKOOA Guidelines (UKOOA, 1999), but cannot be justified from
failure rate data. The second option is liable to lead to an over-estimate of the required SIL, and to incur a
cost penalty, so cannot be recommended. An approach which has been found to work, and which accords
with the standards is:
(1) Derive an overall risk reduction requirement (SIL) on the basis that there is no protection, i.e. before
applying the safety instrumented function (SIF) or any mechanical protection.
(2) Take credit for the mechanical device, usually as equivalent to SIL 2 rating for a relief valve (this is
justified by available failure rate data, and is also supported by BS IEC 61511, Part 3, Annex F).
(3) The required safety integrity level (SIL) for the safety instrumented function is the safety integrity level
determined in the first step minus 2 (or the equivalent safety integrity level of the mechanical
protection).
Our overall target for individual risk might therefore be “less than 1 in 50,000 (2 × 10⁻⁵) per year” for all
hazards, so that the total risk from hazards protected by safety instrumented functions again represents 2%
of the target, which probably allows more than adequately for other hazards, and we might conclude that
the graph is also over-calibrated for average individual risk to the workers. The consequence (C) and
demand rate (W) parameter ranges are available to adjust the calibration. The exposure (F) and
avoidability (P) parameters have only two ranges each, and the FA and PA indices both imply reduction of
risk by at least a factor of 10. Typically, the ranges might be adjusted up or down by half an order of
magnitude. The plant operating organisation may, of course, have its own risk criteria, which may be more
onerous than these criteria derived from R2P2 and the major hazards of transport study.
Calibration for Process Plants Based on Individual Risk to Most Exposed Person
To calibrate a risk graph for individual risk of the most exposed person it is necessary to identify who that
person is, at least in terms of his job and role on the plant. The values of the consequence (C) parameter
must be defined in terms of consequence to the individual,
CA – Minor injury
CB – ~ 0.01 probability of death per event
CC – ~ 0.1 probability of death per event
CD – Death almost certain
The values of the exposure parameter (F) must be defined in terms of the time he spends at work.
Recognising that this person only spends ~ 20% of his life at work, he is potentially at risk from only ~ 20%
of the demands on the safety instrumented function (SIF). Thus, again using consequence index (CC),
exposure index (FB), avoidability index (PB), and demand rate index (W2):
(1) Consequence index (CC), ~ 0.1 probability of death per event;
(2) Exposure index (FB), exposed for 10% of working week or year;
(3) Avoidability index (PB), > 10% to 100% probability that the hazard cannot be avoided;
(4) Demand rate index (W2), 1 demand in > 3 to 30 years;
(5) SIL 3 rating range, 10,000 ≥ RRF > 1,000.
(3) Best case residual risk is equal to 1 in ~ 1.5 billion probability of death per year.
If we estimate that this person is exposed to 10 hazards protected by safety instrumented functions (SIF)
(i.e. to half of the total of 20 assumed above), then, based on the geometric mean residual risk, his total risk
of death from all of them is 1 in 1.5 million per year. This is 3.3% of our target of 1 in 50,000 per year
individual risk for all hazards, which probably leaves more than adequate allowance for other hazards for
which safety instrumented functions are not relevant. We might therefore conclude that this risk graph also
is overcalibrated for the risks to our hypothetical most exposed individual, but we can choose to accept this
additional element of conservatism. Note that this is not the same risk graph as the one considered above
for group risk, because, although we have retained the form, we have used a different set of definitions for
the parameters. The above definitions of the consequence (C) parameter values do not lend themselves to
adjustment, so in this case only the demand rate (W) parameter ranges can be adjusted to recalibrate the
graph. We might for example change the demand rate ranges to:
(1) W1 denotes < 1 demand in 10 years.
(2) W2 denotes 1 demand in > 1 to 10 years.
(3) W3 denotes 1 demand in ≤ 1 year.
Typical Results
As one would expect, there is wide variation from installation to installation in the numbers of functions
which are assessed as requiring safety integrity level ratings, but Table 2.07 shows figures which were
assessed for a reasonably typical offshore gas platform.
Typically, there might be a single SIL 3 rating requirement, while identification of SIL 4 rating requirements
is very rare. These figures suggest that the assumptions made above to evaluate the calibration of the risk
graphs are reasonable. The implications of the issues identified above are:
(1) Risk graphs are very useful but coarse tools for assessing safety integrity level requirements. It is
inevitable that a method with five parameters – consequence (C), exposure (F), avoidability (P), demand
rate (W) and safety integrity level (SIL) – each with a range of an order of magnitude, will produce a
result with a range of five orders of magnitude.
(2) They must be calibrated on a conservative basis to avoid the danger that they underestimate the
unprotected risk and the amount of risk reduction and protection required. Their use is most appropriate
when a number of functions protect against different hazards, which are themselves only a small
proportion of the overall total hazards, so that it is very likely that under-estimates and over-estimates of
residual risk will average out when they are aggregated. Only in these circumstances can the method be
realistically described as providing a “suitable” and “sufficient”, and therefore legal, risk assessment.
(3) Higher safety integrity level requirements (SIL 2 rating and above) incur significant capital costs (for
redundancy and rigorous engineering requirements) and operating costs (for applying rigorous
maintenance procedures to more equipment, and for proof-testing more equipment). They should
therefore be re-assessed using a more refined method.
The severity level may be expressed in semi-quantitative terms, with target frequency ranges (see Table
2.08), or it may be expressed as a specific quantitative estimate of harm, which can be referenced to F-N
curves.
Table 2.08 – Example definitions of severity levels and mitigated event target frequencies.

Severity       Consequence                                      Target Mitigated Event Likelihood
Minor          Serious injury at worst                          No specific requirement
Serious        Serious permanent injury or up to 3 fatalities   < 3 × 10⁻⁶ per year (1 in > 330,000 years)
Extensive      4 or 5 fatalities                                < 2 × 10⁻⁶ per year (1 in > 500,000 years)
Catastrophic   > 5 fatalities                                   Use F-N curve
Similarly, the initiation likelihood may be expressed semi-quantitatively (see Table 2.09), or it may be
expressed as a specific quantitative estimate.
The strength of the method is that it recognises that in the process industries there are usually several
layers of protection against an initiating cause leading to an impact event. Specifically, it identifies the
following:
(1) General Process Design – There may, for example, be aspects of the design which reduce the probability
of loss of containment, or of ignition if containment is lost, so reducing the probability of a fire or
explosion event.
(2) Basic Process Control System (BPCS) – Failure of a process control loop is likely to be one of the main
Initiating Causes. However, there may be another independent control loop which could prevent the
Impact Event, and so reduce the frequency of that event.
(3) Alarms – Provided there is an alarm which is independent of the basic process control system, sufficient
time for an operator to respond, and an effective action he can take (a “handle” he can “pull”), credit
can be taken for alarms to reduce the probability of the impact event.
(4) Additional Mitigation, Restricted Access – Even if the impact event occurs, there may be limits on the
occupation of the hazardous area (equivalent to the F parameter in the risk graph method), or effective
means of escape from the hazardous area (equivalent to the P parameter in the risk graph method),
which reduce the severity level of the event.
(5) Independent Protection Layers (IPL) – A number of criteria must be satisfied by an independent
protection layer, including risk reduction factor (RRF) equal to 100. Relief valves and bursting disks
usually qualify.
Based on the initiating likelihood (frequency) and the probability the function fails on demand (PFD) of all
the protection layers listed above, an intermediate event likelihood (frequency) for the impact event and the
initiating event can be calculated. The process must be completed for all initiating events, to determine a
total intermediate event likelihood for all initiating events. This can then be compared with the target
mitigated event likelihood (frequency). So far no credit has been taken for any safety instrumented function
(SIF). The ratio, between intermediate event likelihood (IEL) and mitigated event likelihood (MEL), gives the
required risk reduction factor (RRF) of the safety instrumented function, and can be converted to a safety
integrity level.
RRF = IEL/MEL = 1/PFD   [2.25]
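Equation [2.25] can be sketched numerically; the IEL and MEL figures below are illustrative assumptions, not values from the source:

```python
# Sketch of equation [2.25]: required risk reduction is the ratio of the
# intermediate event likelihood to the target mitigated event likelihood.
iel = 0.05    # intermediate event likelihood, events/year (after non-SIF layers)
mel = 1e-5    # target mitigated event likelihood, events/year
rrf = iel / mel    # required risk reduction factor of the SIF
pfd = 1.0 / rrf    # required PFD of the SIF
print(rrf, pfd)    # ≈ 5000 and 2e-4 -> within the SIL 3 band of Table 2.05
```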
AFTER-THE-EVENT PROTECTION
Some functions on process plants are invoked “after-the-event”, i.e. after a loss of containment, even after a
fire has started or an explosion has occurred. Fire and gas detection and emergency shutdown are the
principal examples of such functions. Assessment of the required safety integrity levels of such functions
presents specific problems:
(1) Because they operate after the event, there may already have been consequences which they can do
nothing to prevent or mitigate. The initial consequences must be separated from the later consequences.
(2) The event may develop and escalate to a number of different eventual outcomes with a range of
consequence severity, depending on a number of intermediate events.
(3) Analysis of the likelihood of each outcome is a specialist task, often based on event trees (Figure 2.05).
[Figure 2.05 – Event tree for a significant gas release: loss of containment is followed by branches for
ignition, gas detection and fire detection. Outcomes range from no ignition, through jet fire (immediate
fatalities and injuries, with possible escalation if fire detection fails), to explosion (if gas detection fails
and ignition occurs).]
The risk graph method does not lend itself at all well to this type of assessment:
(1) Demand rates would be expected to be very low, e.g. 1 in 1,000 to 10,000 years. This is off the scale of
the risk graphs presented here, i.e. it implies a range 1 to 2 orders of magnitude lower than demand
rate class W1.
(2) The range of outcomes from function to function may be very large, from a single injured person to
major loss of life. Where large scale consequences are possible, use of such a coarse tool as the risk
graph method can hardly be considered “suitable” and “sufficient”.
The layer of protection analysis method does not have these limitations, particularly if applied quantitatively.
CONCLUSIONS
To summarise, the relative advantages and disadvantages of these two methods are listed as follows.
Advantages of risk graph methods:
(1) Can be applied relatively rapidly to a large number of functions to eliminate those with little or no safety
role, and highlight those with larger safety roles.
(2) Can be performed as a team exercise involving a range of disciplines and expertise.
Both methods are useful, but care should be taken to select a method which is appropriate to the
circumstances.
Safety integrity level ratings can be equated to the probability to fail on demand (PFD) of the process in
question. The following tables give relationships based on whether the process is required “Continuously”
or “On Demand”.
where λ is the failure rate and t is the test interval. Note that,

λ = 1/MTBF   [2.27]
In the case of the simplified calculations method, the next step would be to sum the probability to fail on
demand (PFD) values for every component in the process. This summed probability to fail on demand can
then be compared with the safety integrity level rating for the process. In the case of the fault tree analysis
method, the next step would be to produce a fault tree diagram. This diagram is a listing of the various
process components involved in a hazardous event. The components are linked within the tree via Boolean
logic (logical OR gate and AND gate relationships). Once this is done, the probability to fail on demand for
each path is determined based upon the logical relationships. Finally, the probabilities to fail on demand
are summed to produce the average probability to fail on demand (PFDavg) for the process. Once again,
the average probability to fail on demand can be referenced to the proper safety integrity level. The Markov
analysis is a method where a state diagram is produced for the process. This state diagram will include all
possible states, including all “Off Line” states resulting from every failure mode of all process components.
With the defined state diagram, the probability of being in any given state, as a function of time, is
determined. This determination includes not only mean time between failure (MTBF) numbers and
probability to fail on demand (PFD) calculations, but it also includes the mean time to repair (MTTR)
numbers. This allows the Markov analysis to better predict the availability of a process. With the state
probabilities (PFDavg) determined, they can once again be summed and compared to Table 1.03 to
determine the process safety integrity level (SIL). As the brief descriptions above point out, the simplified
calculations method is the easiest to perform. It provides the most conservative result, and thus should be
used as a first approximation of safety integrity level values. If, having used the simplified calculations
method, a less conservative result is desired, then employ the fault tree analysis (FTA) method. This
method is considered by many to be the proper mix of simplicity and completeness when performing safety
integrity level calculations. For the subject expert, the Markov analysis will provide the most precise result,
but it can be very tedious and complicated to perform. A simple application can encompass upwards of 50
separate equations needing to be solved. Relying upon a Markov analysis to provide the last little bit of
precision necessary to improve a given safety integrity level is a misguided use of resources. A process
that is teetering between two safety integrity level ratings would be better served by being redesigned to
comfortably achieve the desired safety integrity level rating.
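The simplified calculations method described above can be sketched as follows; the component failure rates are illustrative assumptions, not vendor data:

```python
# Sketch: simplified calculations method. Each component contributes
# PFD = lambda * t / 2, and the component PFDs are summed across the loop.
HOURS_PER_YEAR = 8760.0

def component_pfd(lam_per_hour, test_interval_years):
    """PFD of one component, lambda * t / 2, with t in hours."""
    return lam_per_hour * test_interval_years * HOURS_PER_YEAR / 2.0

dangerous_failure_rates = {          # per hour (assumed values)
    "pressure transmitter": 1.0e-6,
    "logic solver":         2.0e-7,
    "shutdown valve":       2.0e-6,
}
t = 1.0  # annual proof test, years
loop_pfd = sum(component_pfd(lam, t) for lam in dangerous_failure_rates.values())
print(loop_pfd)  # ~0.014: SIL 1 territory, dominated by the final element
```

Note how the result is dominated by the valve, consistent with the earlier remark that field equipment and final elements tend to limit the achievable safety integrity level.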
(1) Mean Time Between Failure (MTBF) – This is usually a statistical representation of the likelihood of a
component, device, or system to fail. The value is expressed as a period of time (e.g. 14.7 years). This
value is almost always calculated from theoretical information (laboratory value). Unfortunately, this
often leads to some very unrealistic values. Occasionally, mean time between failure values will have
observed data as their basis (demonstrated value). For example, mean time between failure can be
based upon failure rates determined as a result of accelerated lifetime testing. Lastly, mean time
between failure can be based upon reported failures (reported value). Because of the difficulty in
determining demonstrated values, the unlikelihood that the true operating conditions within any given
plant are truly replicated in that determination, and the uncertainty associated with reported values, it
is recommended that laboratory values be the basis of comparison for mean time between failure.
However, mean time between failure alone is a poor statement of a device’s reliability. It should be
used primarily as a component of the probability to fail on demand calculation.
(2) Mean Time To Repair (MTTR) – Mean time to repair is the average time to repair a system, or
component, that has failed. This value is highly dependent upon the circumstances of operation for the
system. A monitoring system operating in a remote location without any spare components may have a
tremendously larger mean time to repair than the same system being operated next door to the system’s
manufacturer. So the ready availability of easily installed spares can significantly improve mean time to
repair.
(3) Probability to Fail on Demand (PFD) – The probability to fail on demand is a statistical measurement of
how likely it is that a process, system, or device will be operating and ready to serve the function for
which it is intended. Among other things, it is influenced by the reliability of the process, system, or
device, the interval at which it is tested, as well as how often it is required to function. Below are some
representative sample probability to fail on demand values. They are order of magnitude values relative
to one another.
Many end users have developed calculations to determine the economic benefit to inspections and testing
based upon some of the reliability numbers used to determine safety integrity level values. These
calculations report the return on investment for common maintenance expenditures such as visual
equipment inspections. The premise of these calculations is to reduce the number of maintenance activities
performed on systems that:
(1) Have a high degree of reliability;
(2) Those that protect processes where monetary loss from failure would not outweigh the cost of
maintenance.
RIL = MCS × (LP × MTTR × Pf)/CMA   [2.28]

where RIL is the reliability integrity level, MCS is maintenance cost savings as a percentage of total
maintenance cost, LP is dollar loss of process per unit of time, Pf is probability of failure per unit of time,
and CMA is current cost of maintenance activity per unit of time. A reliability integrity level (RIL) greater
than one indicates that a given process is reliable enough to discontinue the maintenance activity. Of
course, many times a process offers benefits that go beyond simple monetary considerations.
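Equation [2.28] can be sketched with illustrative figures; all values below are assumptions chosen only to show the mechanics, and in practice consistent time units must be used throughout:

```python
# Sketch of equation [2.28]: reliability integrity level as a screen for
# whether a maintenance activity pays for itself. All figures are assumed.
mcs = 0.10       # maintenance cost savings, fraction of total maintenance cost
lp = 50_000.0    # loss of process, dollars per day of downtime
mttr = 2.0       # mean time to repair, days
pf = 0.01        # probability of failure per year
cma = 5_000.0    # current cost of the maintenance activity, per year
ril = mcs * (lp * mttr * pf) / cma
print(ril)  # ~0.02: well below 1, so the maintenance activity is worth keeping
```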
REFERENCES
AIChE, 1993. Guidelines for Safe Automation of Chemical Processes, ISBN 0-8169-0554-1.
BSI, 2002. BS EN 61508 – Functional Safety of Electrical, Electronic, Programmable Electronic Safety-Related
Systems.
BSI, 2003. BS IEC 61511 – Functional Safety: Safety Instrumented Systems for the Process Industry Sector.
HMSO, 1991. Major Hazards Aspects of the Transport of Dangerous Substances, ISBN 0-11-885699-5.
HSE Books, 2001. Reducing Risks, Protecting People, Clause 136, ISBN 0-7176-2151-0.
UKOOA, 1999. Guidelines for Instrument-Based Protective Systems, Issue No. 2, Clause 4.4.3.
CHAPTER 3
Table 3.01 – General format of layer of protection analysis (LOPA) table headline.
The severity of the consequence is estimated using appropriate techniques, which may range from simple
“look up” tables to sophisticated consequence modeling software tools. One or more initiating events
(causes) may lead to the consequence; each cause-consequence pair is called a scenario. Layer of
protection analysis (LOPA) focuses on one scenario at a time. The frequency of the initiating event is
estimated (usually from look-up tables or historical data). Each identified safeguard is evaluated for two
key characteristics:
(1) Is the safeguard effective in preventing the scenario from reaching the consequence?
(2) And, is the safeguard independent of the initiating event and the other independent protection layers
(IPL)?
If the safeguard meets both of these tests, it is an independent protection layer (IPL). Layer of protection
analysis estimates the likelihood of the undesired consequence by multiplying the frequency of the
initiating event by the product of the probabilities of failure on demand for the applicable independent
protection layers using Equation [3.01].

fi,C = fi,0 × ∏(j = 1 to J) PFDij = fi,0 × PFDi1 × PFDi2 × ... × PFDiJ   [3.01]
Where fi,C is frequency for consequence (C) for initiating event i, fi,0 is initiating event frequency for initiating
event i, PFDij is probability of failure on demand of the jth independent protection layer (IPL) that protects
against consequence C for initiating event i. Typical initiating event frequencies, and independent protection
layers (IPL) probability of failure on demands (PFD) are given by Dowell and CCPS literature. Figure 3.01
illustrates the concept of layer of protection analysis (LOPA) – that each independent protection layers (IPL)
acts as a barrier to reduce the frequency of the consequence. Figure 3.01 also shows how layer of
protection analysis compares to event tree analysis. A layer of protection analysis describes a single path
through an event tree, as shown by the heavy line in Figure 3.01. The result of the layer of protection
analysis is a risk measure for the scenario – an estimate of the likelihood and consequence. This estimate
can be considered a “mitigated consequence frequency”, the frequency is mitigated by the independent
layers of protection. The risk estimate can be compared to company criteria for tolerable risk for that
particular consequence severity. If additional risk reduction is needed, more independent protection layers
must be added to the design. Another option might be to redesign the process; perhaps considering
inherently safer design alternatives. Frequently, the independent protection layers include safety
instrumented functions (SIF). One product of the layer of protection analysis is the required probability of
failure on demands (PFD) of the safety instrumented function, thus defining the required safety integrity
level (SIL) for that safety instrumented function. With the safety integrity level defined, ANSI/ISA 84.01-
1996, IEC 61508, and when finalized, draft IEC 61511 should be used to design, build, commission, operate,
test, maintain, and decommission the safety instrumented function (SIF).
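Equation [3.01] is straightforward to sketch; the initiating event frequency and IPL PFDs below are illustrative, look-up-table style assumptions, not values from the source:

```python
# Sketch of equation [3.01]: the initiating event frequency is multiplied by
# the PFD of each qualifying independent protection layer (IPL).
from math import prod

f_initiating = 0.1             # events/year, e.g. a BPCS loop failure
ipl_pfds = [0.1, 0.01, 0.1]    # e.g. alarm + operator response, SIF, dike
f_consequence = f_initiating * prod(ipl_pfds)
print(f_consequence)  # ~1e-5 events/year, to be compared with the target MEL
```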
[Figure: an initiating event passes through IPL 1, IPL 2 and IPL 3 in turn; success of a layer gives a safe
(or undesired but tolerable) outcome, while failure of all layers gives consequences exceeding the criteria.
The single path through the event tree in which every layer fails is the scenario analysed by LOPA.]
Figure 3.01 – Comparison between layer of protection analysis (LOPA) and event tree analysis.
The safety lifecycle defined in IEC 61511-1 requires the determination of a safety integrity level for the
design of a safety-instrumented function. The layer of protection analysis (LOPA) described here is a method
that can be applied to an existing plant by a multi-disciplined team to determine the required safety
instrumented functions and the safety integrity level for each. The team should consist of:
(1) Operator with experience operating the process under consideration.
(2) Engineer with expertise in the process.
(3) Manufacturing management.
(4) Process control engineer.
(5) Instrument and electrical maintenance person with experience in the process under consideration.
(6) Risk analysis specialist.
At least one person on the team should be trained in the layer of protection analysis (LOPA) methodology.
The information required for the layer of protection analysis is contained in the data collected and developed
in the hazard and operability analysis (HAZOP). Table 3.01 shows a typical spreadsheet that can be used for
the layer of protection analysis.
Impact Event
Each impact event (consequence) determined from the hazard and operability analysis is entered in the
spreadsheet.
Severity Level
Severity levels of Minor (M), Serious (S), or Extensive (E) are next selected for the impact event. Likelihood
values are events per year; other numerical values are average probabilities of failure on demand (PFDavg).
Initiation Likelihood
Likelihood values of the initiating causes occurring, in events per year, are entered. The experience of the
team is very important in determining the initiating cause likelihood.
Protection Layers
Each protection layer consists of a grouping of equipment and administrative controls that function in
concert with the other layers. Protection layers that perform their function with a high degree of reliability
may qualify as independent protection layer (IPL). The criteria to qualify a protection layer (PL) as an
independent protection layers are:
(1) The protection provided reduces the identified risk by a large amount, that is, a minimum of a ten-fold
reduction.
(2) The protective function is provided with a high degree of availability (90% or greater).
Only those protection layers that meet the tests of availability, specificity, independence, dependability,
and auditability are classified as independent protection layers. If a control loop in the basic process
control system (BPCS) prevents the impact event from occurring when the initiating cause occurs, credit
based on its average probability of failure on demand (PFDavg) is claimed.
Additional Mitigation
Mitigation layers are normally mechanical, structural, or procedural. Examples would be:
(1) Pressure relief devices;
(2) Dikes;
(3) Restricted access.
Mitigation layers may reduce the severity of the impact event but not prevent it from occurring. Examples
would be:
(1) Deluge systems for fire or fume release;
(2) Fume alarms;
(3) Evacuation procedures.
The average probability of failure on demand (PFDavg) that brings the mitigated event likelihood below this
number is selected as the maximum allowable PFDavg for the safety instrumented system (SIS).
Total Risk
The last step is to add up all the mitigated event likelihoods for serious and extensive impact events that
present the same hazard. For example, the mitigated event likelihoods for all serious and extensive events
that cause fire would be added and used in formulas like the following,
Risk of Fatality due to Fire = [Mitigated Event Likelihood of all flammable material release][Probability of
Ignition][Probability of a person in the area][Probability of Fatal Injury in the Fire]
Serious and extensive impact events that would cause a toxic release could use the following formula,
Risk of Fatality due to Toxic Release = [Mitigated Event Likelihood of all Toxic Releases][Probability of a
person in the area][Probability of Fatal Injury in the Release]
The expertise of the risk analyst specialist and the knowledge of the team are important in adjusting the
factors in the formulas to conditions and work practices of the plant and affected community. The total risk
to the corporation from this process can now be determined by totalling the results obtained from applying
the formulas. If this meets or is less than the corporate criteria for the population affected, the layer of
protection analysis (LOPA) is complete. However, since the affected population may be subject to risks from
other existing units or new projects, it is wise to provide additional mitigation if it can be accomplished
economically.
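The total-risk roll-up described above can be sketched in code. Every numeric value below is an illustrative assumption, not a value from the text; plant-specific factors must come from the layer of protection analysis (LOPA) team.

```python
# Sketch of the fire and toxic-release total-risk formulas above.
# All numeric inputs are illustrative assumptions.

def risk_of_fatality(mitigated_likelihoods, conditional_probabilities):
    """Sum the mitigated event likelihoods (events/year) for one hazard,
    then multiply by each conditional probability in turn."""
    risk = sum(mitigated_likelihoods)
    for p in conditional_probabilities:
        risk *= p
    return risk

# Fire: [all flammable releases] x P(ignition) x P(person present) x P(fatal injury)
fire = risk_of_fatality([1.0e-4, 5.0e-5], [0.1, 0.1, 0.5])

# Toxic release: [all toxic releases] x P(person present) x P(fatal injury)
toxic = risk_of_fatality([2.0e-5], [0.1, 0.3])

# Total risk to the corporation from this process, compared against the
# corporate criteria for the affected population.
total = fire + toxic
```

The roll-up simply totals the per-hazard results, as the text describes for multiple units or projects affecting the same population.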
(3) Understanding what constitutes an independent protection layer (IPL). An independent protection layer
is a device, system, or action that is capable of preventing a scenario from proceeding to its undesired
consequence independent of the initiating event or the action of any other layer of protection associated
with the scenario. The effectiveness and independence of an independent protection layer must be
auditable. All independent protection layers are safeguards, but not all safeguards are independent
protection layers. Each safeguard identified for a scenario must be tested for conformance with this
definition. The following keywords may be helpful in evaluating an independent protection layer (IPL).
The “three Ds” help determine if a candidate is an independent protection layer (IPL): Detect – Most
independent protection layers detect or sense a condition in the scenario; Decide – Many independent
protection layers make a decision to take action or not; Deflect – All independent protection layers
deflect the undesired consequence by preventing it. The “four Enoughs” help evaluate the effectiveness
of a candidate independent protection layer (IPL): “Big Enough?”, “Fast Enough?”, “Strong Enough?”,
“Smart Enough?”. The “Big I” – Remember that the independent protection layer (IPL) must be
independent of the initiating event and all other independent protection layers.
(4) Understanding Independence. A critical issue for layer of protection analysis (LOPA) is determining
whether independent protection layers (IPL) are independent from the initiating event and from each
other. The layer of protection analysis (LOPA) methodology is based on the assumption of
independence. If there are common mode failures among the initiating event and independent
protection layers, the layer of protection analysis will underestimate the risk for the scenario. Dowell and
CCPS discuss how to ensure independence, and provide several useful examples.
(5) Procedures and Inspections. Procedures and inspections cannot be counted as independent protection
layers (IPL). They do not have the ability to detect the initiating event, cannot make a decision to take
action, and cannot take action to prevent the consequence. Inspections and tests of an independent
protection layer do not count as another independent protection layer, but they do affect the probability
of failure on demand (PFD) of the independent protection layer (IPL).
(6) Mitigating independent protection layers (IPL). An independent protection layer may prevent the
consequence identified in the scenario, but, through its proper functioning, it may generate another less
severe, but still undesirable, consequence. A rupture disk on a vessel is an example. It prevents
overpressurization of the vessel (although not 100% of the time; the rupture disk does have a probability
of failure on demand). However, the proper operation of the rupture disk results in a loss of
containment from the vessel to the environment or to a containment or treatment system. The best way
to deal with this situation is to create another layer of protection analysis (LOPA) scenario to estimate
the frequency of the release through the rupture disk, its consequence, and then determine if it meets
the risk tolerance criteria.
(7) Beyond layer of protection analysis (LOPA). Some scenarios or groups of scenarios are too complex for
layer of protection analysis. A more detailed risk assessment tool such as event tree analysis, fault tree
analysis, or quantitative risk analysis is needed. Some examples where this might be true include: A
system that has shared components between the initiating event and candidate independent protection
layers (IPL), and no cost-effective way of providing independence. This system violates the layer of
protection analysis requirement for independence between initiating event and independent protection
layers (IPL). A large complex system with many layer of protection analysis scenarios, or a variety of
different consequences impacting different populations. This system may be more effectively analyzed
and understood using quantitative risk analysis.
(8) Risk Criteria. Implementation of layer of protection analysis (LOPA) is easier if an organization has
defined risk tolerance criteria. It is very difficult to make risk-based decisions without these criteria,
which are used to decide if the frequency of the mitigated consequence (with the independent protection
layers in place) is low enough. CCPS provides guidance and references on how to develop and use risk
criteria.
(9) Consistency. When an organization implements layer of protection analysis (LOPA), it is important to
establish tools, including aids like look-up tables for consequence severity, initiating event frequency,
and probability of failure on demand (PFD) for standard independent protection layers (IPL). The
calculation tools must be documented, and users trained. All layer of protection analysis (LOPA)
practitioners in an organization must use the same rules in the same way to ensure consistent results.
CONCEPTION, DESIGN, AND IMPLEMENTATION
Process safety engineers and safety integrity level (SIL) assignment teams from many companies have
concluded that layer of protection analysis (LOPA) is an effective tool for safety integrity level assignment.
Layer of protection analysis requires fewer resources and is faster than fault tree analysis or quantitative risk
assessment. If more detailed analysis is needed, the layer of protection analysis scenarios and candidate
IPLs provide an excellent starting point. Layer of protection analysis (LOPA) has the following advantages:
(1) Focuses on severe consequences;
(2) Considers all the identified initiating causes;
(3) Encourages system perspective;
(4) Confirms which IPLs are effective for which initiating causes;
(5) Allocates risk reduction resources efficiently;
(6) Provides clarity in the reasoning process;
(7) Documents everything that was considered;
(8) Improves consistency of SIL assignment;
(9) Offers a rational basis for managing IPLs in an operating plant.
Initiating Causes
The hazard and operability analysis (HAZOP) listed two initiating causes for high pressure: Loss of cooling
water to the condenser and failure of the reactor steam control loop.
Initiating Likelihood
Plant operations have experienced loss in cooling water once in 15 years in this area. The team selects once
every 10 years as a conservative estimate of cooling water loss. It is wise to carry this initiating cause all the
way through to conclusion before addressing the other initiating cause (failure of the reactor steam control
loop in this case).
Basic Process Control System
The basic process control system would shut off steam to the reactor jacket if the reactor temperature is
above setpoint. Since shutting off steam is sufficient to prevent high pressure, the basic process control
system is a protection layer. The basic process control system (BPCS) is a very reliable distributed control
system (DCS) and the production personnel have never experienced a failure that would disable the
temperature control loop. The layer of protection analysis (LOPA) team decides that an average probability
of failure on demand (PFDavg) of 0.1 is appropriate and enters 0.1 under basic process control system
(0.1 is the minimum allowable for the basic process control system).
Alarms
There is a transmitter on cooling water flow to the condenser, and it is wired to a different basic process
control system (BPCS) controller than the temperature control loop. Low cooling water flow to the condenser
is alarmed and relies on operator intervention to shut off the steam. The alarm can be counted as a protection
layer since it is located in a different basic process control system controller than the temperature control
loop. The layer of protection analysis (LOPA) team agrees that a 0.1 average probability of failure on
demand (PFDavg) is appropriate, since an operator is always present in the control room, and enters 0.1
under the alarms column.
Additional Mitigation
Access to the operating area is restricted during process operation. Maintenance is only performed during
periods of equipment shut down and lock out. The process safety management plan requires all non-
operating personnel to sign into the area and notify the process operator. Because of the enforced restricted
access procedures, the layer of protection analysis (LOPA) team estimates that the risk of personnel in the
area is reduced by a factor of 10. Therefore, 0.1 is entered under the additional mitigation column.
Next Event
The layer of protection analysis (LOPA) team now considers the second initiation event (failure of reactor
steam control loop). Table 3.03 is used to determine the likelihood of control valve failure, and 0.1 is entered
under the initiation likelihood column. The protection layers obtained from process design, alarms, additional
mitigation and the safety instrumented systems (SIS) still exist if a failure of the steam control loop occurs.
The only protection layer lost is the basic process control system (BPCS). The layer of protection analysis
team calculates the intermediate likelihood (1 × 10^-5) and the mitigated event likelihood (1.1 × 10^-8).
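The spreadsheet arithmetic behind these numbers can be sketched as follows. The per-layer PFDavg values below are illustrative assumptions chosen to show the multiplication, not the exact entries of the example spreadsheet.

```python
# Hedged sketch of the LOPA spreadsheet arithmetic: the initiating cause
# likelihood (events/year) is multiplied by the PFDavg of every protection
# layer that survives the initiating event. All values are illustrative.

initiating_likelihood = 0.1          # control valve failure, events/year

# Layers still available when the BPCS control loop itself has failed;
# the PFDavg values below are assumptions for illustration.
prevention_pfds = [0.1, 0.1, 0.1]    # process design, alarms, additional mitigation
sis_pfd = 1.0e-3                     # safety instrumented system

intermediate_likelihood = initiating_likelihood
for pfd in prevention_pfds:
    intermediate_likelihood *= pfd   # likelihood before the SIS acts

mitigated_likelihood = intermediate_likelihood * sis_pfd
```

The same loop is repeated for every initiating cause, and the mitigated likelihoods of causes that share a consequence are later summed.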
Table 3.03 – Typical protection layer (prevention & mitigation) probabilities of failure on demand (PFD).

Independent Protection Layer (IPL)                   Probability of Failure on Demand (PFD)
Control loop                                         1.0 × 10^-1
Relief valve                                         1.0 × 10^-2
Human performance (trained, no stress)               1.0 × 10^-2
Human performance (under stress)                     0.5 to 1.0
Operator response to alarms                          1.0 × 10^-1
Vessel pressure rating above maximum challenge
from internal and external pressure sources          1.0 × 10^-4 or better
The layer of protection analysis team would continue this analysis until all the deviations identified in the
hazard and operability analysis (HAZOP) have been addressed. The last step would be to add the mitigated
event likelihood for the serious and extensive events that present the same hazard. In this example, if only
the one impact event was identified for the total process, the number would be 1 × 10^-8. Since the probability
of ignition was accounted for under process design (0.1) and the probability of a person in the area was
accounted for under additional mitigation (0.1), the equation for risk of fatality due to fire reduces to,
Risk of Fatality Due to Fire = [Mitigated Event Likelihood of all flammable material releases][Probability of
Fatal Injury in the fire]
This number is below the corporate criteria for this hazard so the work of the layer of protection analysis
(LOPA) team is complete.
The layer of protection analysis (LOPA) team identified one process design independent protection layer (IPL) for this impact event and this
cause. The maximum allowable working pressure of the distillation column and connected equipment is
greater than the maximum pressure that can be generated by the steam reboiler during a cooling tower
water failure. Its probability of failure on demand (PFD) is 1.0 × 10^-2. The basic process control system (BPCS)
for this plant is a distributed control system (DCS). The distributed control system contains logic that trips
the steam flow valve and a steam RCV on high pressure or high temperature of the distillation column. This
logic's primary purpose is to place the control system in the shut-down condition after a trip so that the
system can be restarted in a controlled manner; it can prevent the impact event. However, no probability of
failure on demand (PFD) credit is given for this logic since the valves it uses are the same valves used by the
safety instrumented system (SIS) – the distributed control system (DCS) logic does not meet the test of
independence for an independent protection layer. High pressure and temperature alarms displayed on the
distributed control system can alert the operator to shut off the steam to the distillation column, using a
manual valve if necessary. This protection layer meets the criteria for an independent protection layer; the
sensors for these alarms are separate from the sensors used by the safety instrumented system. The
operators should be trained and drilled in the response to these alarms. Safety instrumented system logic
implemented in a PLC will trip the steam flow valve and a steam RCV on high distillation column pressure or
high temperature, using dual sensors separate from the distributed control system. The PLC has sufficient
redundancy and diagnostics that the safety instrumented system has a probability of failure on
demand of 1.0 × 10^-3, a SIL 3 rating. The distillation column has additional mitigation of a pressure relief
valve designed to maintain the distillation column pressure below the maximum allowable working pressure
when cooling tower water is lost to the condenser. Its probability of failure on demand is 1.0 × 10^-2. The
number of independent protection layers is three. The mitigated event likelihood for this cause-consequence
pair is calculated by multiplying the challenge likelihood by the independent protection layer probabilities of
failure on demand.
The value of 1.0 × 10^-9 is less than the maximum target likelihood of 1.0 × 10^-8 for extensive impact events.
Note that the relief valve protects against catastrophic rupture of the distillation column, but it introduces
another impact event, a toxic release.
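The stated result can be checked numerically. The PFD values below are the ones given in this example (process design 1.0 × 10^-2, SIS 1.0 × 10^-3, relief valve 1.0 × 10^-2) plus the operator-response-to-alarms value of 1.0 × 10^-1 from Table 3.03; the challenge likelihood of 0.1 events per year for cooling tower water failure is an assumption.

```python
# Check of the distillation-column example: challenge likelihood times the
# PFDs of the three IPLs and the relief-valve mitigation. The 0.1/yr
# challenge likelihood is an assumption; the PFDs follow the text.

challenge_likelihood = 0.1     # cooling tower water failure, events/year (assumed)
process_design_pfd = 1.0e-2    # MAWP above maximum reboiler pressure
alarm_pfd = 1.0e-1             # operator response to alarms (Table 3.03)
sis_pfd = 1.0e-3               # SIL 3 safety instrumented system
relief_valve_pfd = 1.0e-2      # additional mitigation

mitigated = (challenge_likelihood * process_design_pfd * alarm_pfd
             * sis_pfd * relief_valve_pfd)

target = 1.0e-8                # maximum target likelihood, extensive impact events
meets_target = mitigated <= target
```

With these factors the mitigated event likelihood comes out at 1.0 × 10^-9 per year, consistent with the value quoted in the text.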
Figure 3.02 – Relationship between hazard and operability (HAZOP) and layer of protection analysis (LOPA).
[The flowchart traces each consequence-cause pair from its initiating cause and cause likelihood, through
the safeguards evaluated as protection layers (process design, BPCS, alarms and procedures, SIS, and
additional mitigation, each with its IPL and PFD entries) and any recommendations, to the totalized
mitigated event likelihoods for the whole process, before continuing with the next consequence-cause pair.]
METHODOLOGY
The integrated hazard and operability (HAZOP) and safety integrity level (SIL) study is initiated by calling a
meeting (or session), usually comprising the operating company, the engineering consultancy company (if
this is a new project), and the hazard and operability and safety integrity level facilitator with his scribe (who
is usually an independent third party). The team of engineers should include chemical (or process),
instrumentation, and safety engineers. Other engineers are optional depending on the need for them during
the course of the session. The session has the following steps in the order listed below.
Table 3.05 – A typical risk matrix used in hazard and operability (HAZOP) study.

                     Frequent (more       Probable (once      Occasional (once    Remote (not in the
                     than once per year)  every four years)   every 25 years)     life of the facility)
Severity Level 1     Priority 1           Priority 1          Priority 1          Priority 2
(Critical)           (Unacceptable)       (Unacceptable)      (Unacceptable)      (High)
Severity Level 2     Priority 1           Priority 2          Priority 2          Priority 3
(High)               (Unacceptable)       (High)              (High)              (Medium)
Severity Level 3     Priority 2           Priority 3          Priority 4          Priority 4
(Moderate)           (High)               (Medium)            (Low)               (Low)
Severity Level 4     Priority 3           Priority 4          Priority 4          Priority 4
(Minor)              (Medium)             (Low)               (Low)               (Low)
The outputs from the hazard and operability (HAZOP) are the risk ranking of each identified cause of process
deviation and recommendations to lower the risk involved. These recommendations are given in the form of
safeguards.
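The risk ranking of each cause can be encoded as a simple lookup. The sketch below follows the cell values of Table 3.05; the function name and key convention are illustrative choices.

```python
# Sketch: the Table 3.05 HAZOP risk matrix encoded as a lookup from
# (severity level, frequency category) to a priority ranking.

RISK_MATRIX = {
    # Severity Level 1 (Critical)
    (1, "frequent"): "Priority 1", (1, "probable"): "Priority 1",
    (1, "occasional"): "Priority 1", (1, "remote"): "Priority 2",
    # Severity Level 2 (High)
    (2, "frequent"): "Priority 1", (2, "probable"): "Priority 2",
    (2, "occasional"): "Priority 2", (2, "remote"): "Priority 3",
    # Severity Level 3 (Moderate)
    (3, "frequent"): "Priority 2", (3, "probable"): "Priority 3",
    (3, "occasional"): "Priority 4", (3, "remote"): "Priority 4",
    # Severity Level 4 (Minor)
    (4, "frequent"): "Priority 3", (4, "probable"): "Priority 4",
    (4, "occasional"): "Priority 4", (4, "remote"): "Priority 4",
}

def risk_priority(severity_level, frequency_category):
    """Return the HAZOP risk ranking for a cause of process deviation."""
    return RISK_MATRIX[(severity_level, frequency_category.lower())]
```

For example, a Critical consequence expected only remotely maps to Priority 2, matching the table.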
SAFETY INTEGRITY LEVEL (SIL) AND LAYER OF PROTECTION ANALYSIS (LOPA) ASSESSMENT
Safety integrity level (SIL) and layer of protection analysis (LOPA) assessment study is to assess the
adequacy of the safety protection layers (SPL) or safeguards that are in place to mitigate against hazardous
events relating to major process hazards, identify those safety protection layers or safeguards that do not
meet the required risk reduction for a particular hazard, and make reasonable recommendations where a
hazard generates a residual risk that needs further risk reduction. This is done by defining the tolerable
frequency (TF). The tolerable frequency of the process deviation is a number which is derived from the level
of the risk identified from the hazard and operability (HAZOP) risk matrix. It indicates the period of
occurrence, in terms of years, of the process deviation which the operating company can tolerate. For
example, a tolerable frequency of 1 × 10^-4 indicates that the company can tolerate the occurrence of the process
deviation once in 10,000 years. The mitigation frequency (MF) is derived as a calculation from the likelihood
of each cause and the probability of failure on demand (PFD) of the safety protection layers (SPL). The
inputs to the safety integrity level (SIL) and layer of protection analysis (LOPA) assessment are the process
deviations, causes, risk levels and safeguards identified during the hazard and operability (HAZOP). The
safety integrity level (SIL) and layer of protection analysis (LOPA) assessment recommend the safety
protection layers (SPL) to be designed to meet the process hazard.
Recommendations
In the event that the mitigation frequency (MF) is not less than the tolerable frequency (TF), more safety
protection layers (SPL) are recommended; their probability of failure on demand (PFD) values are assumed
and included in the calculation of the mitigation frequency to bring it below the tolerable frequency.
These safety protection layers are recommended as safeguards to decrease the risk of the consequences
of the deviation (or cause) being analyzed. The session ends with the mitigation frequency values
of all the layer of protection analysis scenarios derived less than the tolerable frequency.
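The recommendation loop can be sketched as follows. The cause likelihood and the SPL PFD values below are assumptions for illustration; the tolerable frequency of 1 × 10^-4 follows the example given earlier.

```python
# Sketch of the SIL/LOPA assessment test: the mitigation frequency (MF) is
# the cause likelihood times the PFDs of the safety protection layers, and
# it must come in below the tolerable frequency (TF). Values are assumed.

def mitigation_frequency(cause_likelihood, spl_pfds):
    """MF = cause likelihood (events/year) x product of SPL PFDs."""
    mf = cause_likelihood
    for pfd in spl_pfds:
        mf *= pfd
    return mf

tolerable_frequency = 1.0e-4     # once in 10,000 years
cause_likelihood = 0.1           # events/year (assumed)
spl_pfds = [0.1, 0.1]            # existing safeguards (assumed)

mf = mitigation_frequency(cause_likelihood, spl_pfds)
if mf >= tolerable_frequency:
    # Recommend another SPL and assume its PFD, e.g. an instrumented
    # trip at 1.0e-2 (a hypothetical value), then recompute.
    mf = mitigation_frequency(cause_likelihood, spl_pfds + [1.0e-2])
```

The assumed PFD of any recommended layer is later validated by the reliability or safety engineer, as described below.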
Safety Integrity Level (SIL) and Layer of Protection Analysis (LOPA) Assessment Validation
This is done after the session by the reliability or safety engineer. The methodology is to calculate the
probability of failure on demand (PFD) values of the identified safety protection layers (SPL), then derive the
mitigation frequency (MF) as a calculation from the likelihood of each cause and the probability of failure on
demand of the safety protection layers. If the total mitigation frequency (MF) of all the causes is less than
the tolerable frequency (TF), which is defined as a numerical value from the hazard and operability (HAZOP)
risk matrix, the integrated study is complete. This validates the assumed probability of failure on demand
values of the safety protection layers during the session.
THE INTEGRATED HAZARD AND OPERABILITY (HAZOP) AND SAFETY INTEGRITY LEVEL (SIL)
PROCESS
The following process is used in a session for each of the identified nodes during a hazard and operability
(HAZOP) study:
(1) The process engineer describes the intention of the node.
(2) Concerns and hazards within the node are recorded under the discussed node notes.
(3) The team applies process parameter deviations to each node and identifies the associated hazards.
(4) Causes and initiating events to those hazards are identified, and recorded.
(5) The resulting consequences are identified, categorized, and recorded based on the consequence
grading in the operating company’s risk matrix.
(6) The likelihood of the initiating event is then assigned by the group and recorded based on the risk
matrix.
(7) The resulting risk score, based on the consequence and likelihood scores, is recorded without taking
credit for any of the safeguards in place, as per the risk matrix.
(8) An identification of the safeguards and an evaluation as safety protection layers (SPL) is then carried
out.
(9) The risk is re-scored taking into account the identified safeguards which are independent safety
protection layers (SPL). Usually a standard safety integrity level (SIL) value is assigned to the safety
protection layers (SPL) which are validated outside the session for accuracy.
(10) If sufficient independent layers of protection are identified to reduce the risk to the tolerable level (TF),
then no further safeguards are identified and no recommendations are required.
(11) If the risk with safeguards is still high and does not meet the tolerable frequency, then recommendations
and actions are developed with the aim of reducing the risk below the tolerable frequency (TF).
(12) The implementation of those actions and recommendations are assigned to the responsible party and
individual. The recommended safety protection layers are validated and their probability of failure on
demand (PFD) numbers are used to calculate if the mitigation frequency (MF) is less than the tolerable
frequency (TF).
(13) The process is repeated covering the applicable parameters, deviations, and nodes.
In the following example, a hazard and operability (HAZOP) related with “High Level” in a storage tank is
considered. As per the hazard and operability (HAZOP) process, all the causes have been identified,
consequences listed and risk ranking done without and with the existing safeguards (SPLs). From the hazard
and operability (HAZOP), the causes of deviation are listed as layer of protection analysis (LOPA) causes,
their likelihoods identified, and the safeguards are listed as safety protection layers (SPL). The probability of
failure on demand (PFD) value of each safety protection layer (SPL) is either manually entered or linked to a
calculated value. If the mitigation frequency (MF) is not less than the tolerable frequency (as in the case of
this example), it implies that additional safety protection layers (SPL) are required to meet the tolerable
frequency (TF).
CONCLUSION
By integrating the hazard and operability (HAZOP) and safety integrity level (SIL) processes into one session,
the time and cost to conduct these sessions are reduced; data integrity improves, since the same team
conducts both studies; and the subjectivity of a pure hazard and operability session is removed. An
integrated study is a semi-quantitative technique and applies much more rigor than a hazard and
operability study alone. It determines whether the existing safeguards are sufficient and whether proposed
safeguards are warranted, and it couples the analysis tightly to the corporation's risk tools (risk matrices,
risk graphs).
Note that the human response credits are generous in Table 3.07. Many companies reduce these numbers
by one credit each. Again, companies that choose to modify the independent protection layer (IPL) credits
table usually have a formal procedure for comment, review and acceptance. A formal, periodic review should
be made to verify that the independent protection layer (IPL) credits table is consistent with not only local
field experience but also with wider industry practice. Such verification can be done through internal incident
reviews, through industry associations, through employment of outside consultants with experience specific
to the layer of protection analysis (LOPA) procedures of your industry, or through commercial and insurance
databases that are typically available for a fee.
A multi-factor severity table defines the various categories of severity. Each consequence of interest is then
rated for severity within each category (see Figure 3.03).
Figure 3.03 – Severity levels. [The figure tabulates the severity categories against the five severity levels;
the on-site injury row reads:
1S (Negligible): potential for a single minor injury requiring first-aid treatment.
2S (Low): potential for multiple minor injuries or a single moderate injury or illness requiring medical treatment.
3S (Medium): potential for multiple moderate injuries or illnesses or a single major injury requiring physician's care.
4S (Major): potential for multiple moderate injuries or illnesses or a single life-threatening injury or irreversible illness.
5S (Catastrophic): potential for multiple moderate injuries or illnesses or a single life-threatening injury or irreversible illness.
The environmental impact row ranges from the potential for a minor environmental impact to the potential
for a major one.]
A typical consequence of an on-site chemical release might receive a severity ranking of 2-1-3-2-2 with the
five numbers corresponding to the categories of on-site injury, off-site injury, environmental consequence,
cost, and publicity, respectively. The highest of these numbers (in this case, the “3” for environmental
impact) would be the overall severity number used in the risk tolerance calculation. Using a multi-factor
severity table of this type allows insight into the process hazards analysis team’s concerns even years after
the study. By looking at the severity category rankings done by the process hazards analysis team, the
team’s actual concerns and thinking can be reconstructed by reviewing the study report documents.
Without such categorization of severity, no such reconstruction is possible. The team (even if team members
are available for interview) will have forgotten the exact scenario discussed and will be unable to reconstruct
the “worst case scenario” from memory. Severity categories of this kind are now considered standard
industry practice in the chemical manufacturing, refinery, and pipeline industries.
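The overall-severity rule described above (take the worst category ranking) is trivial to express; the category names and example rankings below follow the 2-1-3-2-2 illustration in the text.

```python
# The overall severity is the worst (highest) ranking across the five
# categories, per the 2-1-3-2-2 example in the text.
rankings = {
    "on-site injury": 2,
    "off-site injury": 1,
    "environment": 3,
    "cost": 2,
    "publicity": 2,
}
overall_severity = max(rankings.values())   # driven by the environment category
```

Recording all five numbers, not just the maximum, is what allows the team's reasoning to be reconstructed later.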
The layer of protection analysis (LOPA) table, as shown in Table 3.09, allows for slightly higher tolerance of
moderate risk events. This is done to compensate for layer of protection analysis’ “round-up” requirement
for likelihoods. These changes in risk tolerance were artifacts of the layer of protection analysis process and
should not provide significantly different risk to the company. The company used for these examples makes
another modification to their layer of protection analysis table. That modification is shown below.
Table 3.09 – A typical layer of protection analysis (LOPA) risk-tolerance table with independent protection
layer (IPL) credit numbers.

Layer of Protection Analysis (LOPA) Risk Matrix

Probability                                                          Severity
Category    Range (per year)           Level   1S            2S     3S        4S       5S
                                               (Negligible)  (Low)  (Medium)  (Major)  (Catastrophic)
Probable    1 × 10^0 < P                 5     D             C2     B3        A4       A5
High        1 × 10^-1 < P < 1 × 10^0     4     D             C1     C2        B3       A4
Note the numbers after the A, B, and C risk letters. These numbers represent the number of credits required
from the independent protection layer (IPL) credits table to reach what this company considers a minimally
acceptable risk (“D”). By placement on the layer of protection analysis (LOPA) risk matrix, it will be evident
that a specific number of credits will be required to reduce risk to an acceptable level. Since each credit in
the independent protection layer (IPL) credits table represents an order of magnitude reduction in likelihood
of the undesired event, this practice is consistent with the layer of protection analysis procedure, as defined
by the AIChE guidelines. The use of numbers in the layer of protection analysis (LOPA) risk matrix makes it
less likely that an inexperienced layer of protection analysis practitioner will err in assessing the risk
reduction required.
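Since each credit is one order of magnitude of risk reduction, a matrix cell such as "B3" translates directly into a required likelihood-reduction factor. The sketch below assumes the letter-plus-number convention described above; the function name is an illustrative choice.

```python
# Sketch: a LOPA risk-matrix cell like "B3" carries the number of IPL
# credits needed to reach minimally acceptable risk ("D"). Each credit is
# a ten-fold reduction in likelihood of the undesired event.

def required_reduction(matrix_cell):
    """Return (credits, likelihood reduction factor) for a risk-matrix cell."""
    if matrix_cell == "D":            # already minimally acceptable
        return 0, 1.0
    credits = int(matrix_cell[1:])    # e.g. "B3" -> 3 credits
    return credits, 10.0 ** (-credits)

credits, factor = required_reduction("B3")
```

A "B3" cell therefore calls for three credits from the IPL credits table, a thousand-fold reduction in event likelihood.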
While traditional instrument concerns have been over architecture and manufacturer’s recommendations, the
safety instrumented systems standards base instrument requirements on hazard analysis. LOPA is the most
commonly used tool for assessing instrument reliability requirements. The regulatory implementation of the
safety instrumented systems standards was set in motion by an industrial explosion in 2004 in which five
workers were killed. OSHA cited the employer for not documenting that the plant's programmable logic
controllers and distributed control systems installed prior to 1997 (emphasis mine) complied with recognized
generally accepted engineering practices such as ANSI/ISA 84.01. Since this citation was paid without
contest, a precedent has been set that these safety instrumented systems consensus standards are now
“generally accepted engineering practice” in the chemical manufacturing, refining, and pipeline industries.
The safety instrumented systems standards (to simplify significantly) require the company to ask the
question “If this safeguard fails to operate on demand, what will the consequences be?”. After the worst-
case severity of consequence is determined, then the likelihood of the existing control system to fail is
calculated. In calculating the likelihood of failure of an existing control system, all elements of the control
system must be assessed, including the sensor(s), the logic element(s), and the actuated element(s) or
valves. Because a failure of any of these elements will disable the entire control or trip system, the
probabilities of failure are additive: the probability of failure on demand (PFD) of the sensor(s), plus the
probability of failure on demand of the logic element(s), plus the probability of failure on demand of the
actuated element(s), equals the total probability of failure on demand. Once the system total
probability of failure on demand is determined, the severity of the consequences can be included to
determine overall risk. Most companies use a chart to equate the expected risk to a desired reliability level
for the instrumented system. If the existing system is not sufficiently reliable to provide a desired risk level,
then the reliability of the instrumented system can be improved by any combination of the following:
(1) Substituting more reliable components for the existing ones.
(2) Adding redundancy to reduce the total probability of failure on demand for the system.
(3) Increasing testing and calibration frequency to ensure desired function.
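A minimal sketch of this additive calculation, using assumed element PFDs and the low-demand SIL bands of IEC 61508/61511:

```python
# Sketch: total PFD of a trip loop is the sum of its element PFDs
# (sensor + logic solver + final element). Element values are assumptions.
sensor_pfd = 5.0e-3
logic_pfd = 1.0e-4
final_element_pfd = 1.0e-2

total_pfd = sensor_pfd + logic_pfd + final_element_pfd

def achieved_sil(pfd):
    """Low-demand-mode SIL band for a given PFDavg (IEC 61508/61511)."""
    if 1.0e-5 <= pfd < 1.0e-4:
        return 4
    if 1.0e-4 <= pfd < 1.0e-3:
        return 3
    if 1.0e-3 <= pfd < 1.0e-2:
        return 2
    if 1.0e-2 <= pfd < 1.0e-1:
        return 1
    return 0   # no SIL credit

sil = achieved_sil(total_pfd)
```

Note how the weakest element dominates the sum: with the assumed values above, the final element's 1.0 × 10^-2 limits the loop to SIL 1, which is why redundancy is usually added to the worst element first.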
The goal of safety instrumented systems (SIS) is to reduce the hazard assessment errors, design errors,
installation errors, operations errors, maintenance errors, and change-management errors that might cause
the instrument system to fail. Layers of protection analysis (LOPA) is now a firmly established, industry-wide
“generally accepted engineering practice”. Businesses affected by OSHA’s 1910.119 (Process Safety
Management of Highly Hazardous Chemicals) should already be using layers of protection analysis to verify
risk assessments. The practices illustrated in this paper are typical of current industry layers of protection
analysis practice. All industries that should be using layers of protection analysis (LOPA) should also be
starting implementation of safety instrumented systems (SIS).
INDUSTRIAL FACILITY SAFETY
REFERENCES
Bollinger et al., Inherently Safer Chemical Processes, A Life Cycle Approach, CCPS, New York, 1996.
Center for Chemical Process Safety (CCPS), Inherently Safer Chemical Processes: A Life Cycle Approach,
American Institute of Chemical Engineers, New York, NY, 1996.
Center for Chemical Process Safety (CCPS), Layer of Protection Analysis, Simplified Process Risk
Assessment, American Institute of Chemical Engineers, New York, NY, 2001.
Center for Chemical Process Safety (CCPS), Guidelines for Safe Automation of Chemical Processes, American
Institute of Chemical Engineers, New York, NY, 1993.
Dowell, A. M., III, Layer of Protection Analysis: A New PHA Tool, After HAZOP, Before Fault Tree Analysis,
Presented at Center for Chemical Process Safety International Conference and Workshop on Risk Analysis
in Process Safety, Atlanta, GA, October 21, 1997, American Institute of Chemical Engineers, New York, NY,
1997.
Dowell, A. M., III, Layer of Protection Analysis – A Worked Distillation Example, ISA Tech 1999, Philadelphia
PA, The Instrumentation, Systems, and Automation Society, Research Triangle Park, NC, 1999.
Dowell, A. M., III, Layer of Protection Analysis and Inherently Safer Processes, Process Safety Progress 18,
4, 214-220, 1999.
Dowell, A. M., III, Layer of Protection Analysis for Determining Safety Integrity Level, ISA Transactions 37,
155-165, 1998.
Dowell, A. M., III, Layer of Protection Analysis: Lessons Learned, ISA Technical Conference Series: Safety
Instrumented Systems for the Process Industry, May 14-16, 2002, Baltimore, MD.
Ewbank, R. M., and York, G. S., Rhone-Poulenc Inc. Process Hazard Analysis and Risk Assessment
Methodology, International Conference and Workshop on Risk Analysis in Process Safety, CCPS, pp. 61-74, 1997.
Huff, A. M., and Montgomery, R. L., 1997. A Risk Assessment Methodology for Evaluating the Effectiveness
of Safeguards and Determining Safety Instrumented System Requirements, International Conference and
Workshop on Risk Analysis in Process Safety, CCPS, pp 111-126.
International Electrotechnical Commission, IEC 61508. Functional Safety of Electrical, Electronic,
Programmable Electronic Safety-related Systems, Parts 1-7, Geneva, International Electrotechnical
Commission, 1998.
International Electrotechnical Commission, IEC 61511, Functional Safety Instrumented Systems for the
Process Industry Sector, Parts 1-3, International Electrotechnical Commission, Geneva, Draft in Progress.
The Instrumentation, Systems, and Automation Society (ISA), ANSI/ISA 84.01-1996. Application of Safety
Instrumented Systems to the Process Industries, The Instrumentation, Systems, and Automation Society,
Research Triangle Park, NC, 1996.
CONCEPTION, DESIGN, AND IMPLEMENTATION
CHAPTER 4
INTRODUCTION TO RELIABILITY
Reliability is an area in which there are many misconceptions due to a misunderstanding or misuse of the
basic language. It is therefore important to get an understanding of the basic concepts and terminology;
some of them are described in this chapter. What is failure rate (λ)? Every product has a failure
rate (λ), which is the number of units failing per unit time. This failure rate changes throughout the life of
the product, giving the familiar bathtub curve, which shows the failure rate per operating time for a
population of any product. It is the manufacturer’s aim to ensure that product in the “infant mortality period”
does not get to the customer. This leaves a product with a useful life period, during which failures occur
randomly, i.e. the failure rate (λ) is constant, and finally a wear-out period, usually beyond the product’s
useful life, where λ is increasing.
What is reliability? A practical definition of reliability is “the probability that a piece of equipment operating
under specified conditions shall perform satisfactorily for a given period of time”. The reliability is a number
between 0 and 1.
What is mean time between failures (MTBF), and mean time to failure (MTTF)? Strictly speaking, mean time
between failures (MTBF) applies to equipment that is going to be repaired and returned to service, and
mean time to failure (MTTF) applies to parts that will be thrown away on failing. During the useful life period
assuming a constant failure rate, mean time between failures (MTBF) is the inverse of the failure rate and
we can use the terms interchangeably,
MTBF = 1 / λ [4.01]
Many people misunderstand mean time between failures (MTBF) and wrongly assume that the mean time
between failures (MTBF) figure indicates a minimum, guaranteed, time between failures. If failures occur
randomly then they can be described by an exponential distribution,
R(t) = e^(-λt) = e^(-t/MTBF) [4.02]
After a certain time (t) which is equal to the mean time between failures (MTBF), the reliability (Equation
[4.02]) becomes,
R(MTBF) = e^(-1) ≈ 0.37 [4.03]
Now let us consider a customer who has 700 such units. Since we can expect, on average, 0.2% of units to
fail per 1,000 hours, the number of failures per year is,
700 × (0.2 / 100) × (1 / 1,000) × 24 × 365 = 12.26 [4.04]
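The per-year failure count in Equation [4.04] can be checked numerically; this is a minimal sketch using the figures from the text (700 units, 0.2% failures per 1,000 hours).

```python
# Numerical check of Equation [4.04]: expected failures per year for a
# population of 700 units with a failure rate of 0.2% per 1,000 hours.
failure_rate = (0.2 / 100) / 1000   # failures per unit per hour
units = 700
hours_per_year = 24 * 365           # 8,760 hours
failures_per_year = units * failure_rate * hours_per_year
print(round(failures_per_year, 2))  # 12.26
```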
What is service life, mission life, useful life? Note that there is no direct connection or correlation between
service life and failure rate. It is possible to design a very reliable product with a short life. A typical example
is a missile: it has to be very, very reliable (with a mean time between failures of several million
hours), but its service life is only 0.06 hours (4 minutes)! Twenty-five-year-old humans have a mean time
between failures (MTBF) of about 800 years (a failure rate of about 0.1% per year), but not many have a
comparable “service life”. Just because something has a good mean time between failures (MTBF), it does
not necessarily have a long service life as well.
What is reliability prediction? Reliability prediction describes the process used to estimate the constant failure
rate during the useful life of a product. An exact prediction is, however, not possible, because predictions assume that:
(1) The design is perfect, the stresses are known, everything is within ratings at all times, so that only random
failures occur.
(2) Every failure of every part will cause the equipment to fail.
(3) The database is valid.
These assumptions are sometimes wrong. The design can be less than perfect, not every failure of every
part will cause the equipment to fail, and the database is likely to be at least 15 years out-of-date.
However, none of this matters much, if the predictions are used to compare different topologies or
approaches rather than to establish an absolute figure for reliability. This is what predictions were originally
designed for. Some prediction manuals allow the substitution of use of vendor reliability data where such
data is known instead of the recommended database data. Such data is very dependent on the environment
under which it was measured and so, predictions based on such data could no longer be depended on for
comparison purposes. These and other issues will be covered in more detail in the following chapters.
Failure rate predictions are useful for several important activities in the design phase of electronic equipment
in addition to many other important procedures to ensure reliability. Examples of these activities are:
(1) To assess whether reliability goals can be reached;
(2) To identify potential design weaknesses;
(3) To compare alternative designs;
(4) To evaluate designs and to analyse life-cycle costs;
(5) To provide data for system reliability and availability analysis;
(6) To plan logistic support strategies;
(7) To establish objectives for reliability tests.
PREDICTION MODELS
The failure rate of the system is calculated by summing up the failure rates of each component in each
category (based on probability theory). This applies under the assumption that a failure of any component is
assumed to lead to a system failure. The following models assume that the component failure rate under
reference or operating conditions is constant. Justification for use of a constant failure rate assumption
should be given. This may take the form of analyses of likely failure mechanisms, related failure
distributions, etc.
λ_s = Σ (i = 1 … n) λ_ref,i [4.05]
where λ_ref,i is the failure rate of component i under reference conditions; n is the number of components. The reference
conditions adopted are typical for the majority of applications of components in equipment. Reference
conditions include statements about:
(1) Operating phase;
(2) Failure criterion;
(3) Operation mode (e.g. continuous, intermittent);
(4) Climatic and mechanical stresses;
(5) Electrical stresses.
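The series-system summation of Equation [4.05] can be sketched as follows; the component names and their reference failure rates (in FIT, failures per 10⁹ hours) are illustrative assumptions, not data from the text.

```python
# Minimal sketch of Equation [4.05]: under the series-system assumption,
# the system failure rate is the sum of the component reference failure
# rates. Values below are hypothetical, in FIT (failures per 1e9 hours).
components = {
    "resistor": 0.1,
    "capacitor": 0.5,
    "op-amp": 10.0,
    "microcontroller": 50.0,
}

lambda_system = sum(components.values())   # FIT
mtbf_hours = 1e9 / lambda_system           # MTBF = 1/λ, per Equation [4.01]
print(lambda_system, round(mtbf_hours))
```

Because a failure of any component is assumed to fail the system, adding components can only increase λ_s; this is why the comparison of alternative designs, rather than an absolute figure, is the prediction's most defensible use.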
It is assumed that the failure rate used under reference conditions is specific to the component, i.e. it
includes the effects of complexity, technology of the casing, different manufacturers and the manufacturing
process etc. Data sources used should be the latest available that are applicable to the product and its
specific use conditions. Ideally, as said before, failure rate data should be obtained from the field. Under
these circumstances failure rate predictions at reference conditions used at an early stage of design of
equipment should result in realistic predictions.
λ_s = Σ (i = 1 … n) (λ_ref × π_U × π_T × π_I)_i [4.06]
where λ_ref is the failure rate under reference conditions; π_U is the voltage dependence factor; π_I is the
current dependence factor; π_T is the temperature dependence factor; and n is the number of components.
In the standard IEC 61709, clause 7 gives specific stress models and values of the π-factors for component
categories; these should be used for converting reference failure rates to field operational failure rates.
The stress models are empirical and allow fitting of observed data. However, if more specific models are
applicable for particular component types then these models should be used and their usage noted.
Conversion of failure rates is only possible within the specified functional limits of the components.
The standard IEC 61709:
(1) Gives guidance on obtaining accurate failure rate data for components used in electronic equipment, so
that the reliability of systems can be predicted precisely.
(2) Specifies reference conditions for obtaining failure rate data, so that data from different sources can be
compared on a consistent basis.
(3) Describes stress models as a basis for conversion of the failure rate data from reference conditions to
the actual operating conditions.
The stated stress models contain constants that were defined according to the state of the art. These are
averages of typical component values taken from tests or specified by various manufacturers. A factor for
the effect of environmental application conditions is basically not used in IEC 61709 because the influence of
the environmental application conditions on the component depends essentially on the design of equipment.
Thus, such an effect may be considered within the reliability prediction of equipment using an overall
environmental application factor.
AF = t_f,1 / t_f,2 = e^((E/k) × (1/T1 - 1/T2)) [4.07]
where t_f,1 is the time to failure at temperature T1, t_f,2 is the time to failure at temperature T2, T1 and T2 are
temperatures in kelvin (K), E is the activation energy per molecule (eV), and k is Boltzmann’s constant (8.617
× 10^-5 eV/K).
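The acceleration factor of Equation [4.07] is easy to evaluate numerically. In this sketch the activation energy (0.7 eV) and the two temperatures (55 °C use, 125 °C accelerated test) are assumed, illustrative values.

```python
import math

# Acceleration factor per Equation [4.07] (Arrhenius model).
K_BOLTZMANN = 8.617e-5   # Boltzmann's constant, eV/K

def acceleration_factor(e_ev, t1_k, t2_k):
    """AF = t_f,1 / t_f,2 = exp[(E/k) * (1/T1 - 1/T2)]; AF > 1 when T2 > T1."""
    return math.exp((e_ev / K_BOLTZMANN) * (1.0 / t1_k - 1.0 / t2_k))

# Use at 55 C (328 K) vs. accelerated test at 125 C (398 K),
# with an assumed activation energy of 0.7 eV.
af = acceleration_factor(0.7, 328.0, 398.0)
print(round(af, 1))   # roughly a factor of 78 acceleration
```

One hour of test at the higher temperature then represents about AF hours of use at the lower one, which is how accelerated life tests trade test time for temperature stress.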
What are the conditions that have a significant effect on the reliability? Important factors affecting reliability
include:
(1) Temperature stress;
(2) Electrical and mechanical stress;
(3) End use environment;
(4) Duty cycle;
(5) Quality of components.
What is the mean time between failures (MTBF) of items? In the case of exponentially distributed lifetimes,
the mean time between failures (MTBF) is the time that approximately 37% of items will reach without random
failure. Statements about mean time between failures (MTBF) prediction should at least include the
definition of:
(1) Evaluation method (prediction and life testing);
(2) Operational and environmental conditions (e.g. temperature, current, voltage);
(3) Failure criteria;
(4) Period of validity.
What is the difference between observed, predicted and demonstrated mean time between failures (MTBF)?
Observed mean time between failures is field failure experienced; Predicted mean time between failures is
the estimated reliability based on reliability models and predefined conditions; Demonstrated mean time
between failures is statistical estimation based on life tests or accelerated reliability testing.
How many field failures can be expected during the warranty period if the mean time between failures (MTBF) is
known? If lifetimes are exponentially distributed and all devices are exposed to the same stress and
environmental conditions used in predicting the mean time between failures (MTBF), the mean number of
field failures, excluding other than random failures, can be estimated by,
n_f = n × (1 - e^(-t_w / T)) ≈ n × t_w / T [4.08]
where n is quantity of devices under operation, tw is warranty period (in years, hours etc.), T is mean time
between failures (MTBF) or mean time to failure (MTTF) in years, hours etc.
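Equation [4.08] and its linear approximation can be compared directly; the fleet size, warranty period, and MTBF used below are illustrative assumptions.

```python
import math

# Expected field failures during warranty per Equation [4.08].
def expected_failures(n, t_w, T):
    """n devices, warranty period t_w, MTBF T (same time unit as t_w)."""
    exact = n * (1.0 - math.exp(-t_w / T))
    approx = n * t_w / T   # linear approximation, valid when t_w << T
    return exact, approx

# Hypothetical example: 10,000 units, 1-year warranty, 50-year MTBF.
exact, approx = expected_failures(10_000, 1.0, 50.0)
print(round(exact, 1), round(approx, 1))  # ~198.0 vs 200.0
```

The approximation slightly overstates the count because it ignores the (small) chance that a unit fails more than once being excluded from the exposed population; for t_w much smaller than T the difference is negligible.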
REFERENCES
MIL-HDBK-217F. Military Handbook, Reliability prediction of electronic equipment (1991).
MIL-HDBK-781. A Handbook for reliability test methods, plans, and environments for engineering,
development, qualification, and production. Department of Defense (1996).