As our high-tech society becomes increasingly dependent on computers, the demand for more
dependable software will increase and likely become the norm. In the past, fault-tolerant
computing was the exclusive domain of very specialized organizations such as telecom
companies and financial institutions. With business-to-business transactions taking place over the
Internet, however, we are interested not only in making sure that things work as intended, but
also, when the inevitable failures do occur, that the damage is minimal. (None of us would be
happy to lose money because a fault occurred during the transfer of funds from one account to
another, for instance.)
Unfortunately, fault-tolerant computing is extremely hard, involving intricate algorithms for
coping with the inherent complexity of the physical world. As it turns out, that world conspires
against us and is constructed in such a way that, generally, it is simply not possible to devise
absolutely foolproof, 100% reliable software. 1 No matter how hard we try, there is always a
possibility that something can go wrong. The best we can do is to reduce the probability of failure
to an "acceptable" level. Unfortunately, the more we strive to reduce this probability, the higher
the cost.
The Concepts Behind Fault-Tolerant Computing
There is much confusion around the terminology used with fault tolerance. For example, the
terms "reliability" and "availability" are often used interchangeably, but do they always mean the
same thing? What about "faults" and "errors"? In this section, we introduce the basic concepts
behind fault tolerance. 2
Fault tolerance is the ability of a system to perform its function correctly even in the presence of
internal faults. The purpose of fault tolerance is to increase the dependability of a system. A
complementary but separate approach to increasing dependability is fault prevention. This
consists of techniques, such as inspection, whose intent is to eliminate the circumstances by
which faults arise.
Failures, Errors, and Faults
Implicit in the definition of fault tolerance is the assumption that there is a specification of what
constitutes correct behavior. A failure occurs when an actual running system deviates from this
specified behavior. The cause of a failure is called an error. An error represents an invalid system
state, one that is not allowed by the system behavior specification. The error itself is the result of
a defect in the system, or fault. In other words, a fault is the root cause of a failure. That means
that an error is merely the symptom of a fault. A fault may not necessarily result in an error, but
the same fault may result in multiple errors. Similarly, a single error may lead to multiple failures.
These basic concepts are illustrated using the Unified Modeling Language (UML) class diagram
in Figure 1.
Figure 1: Failures, Errors, and Faults
For example, in a software system, an incorrectly written instruction in a program may decrement
an internal variable instead of incrementing it. Clearly, if this statement is executed, it will result
in the incorrect value being written. If other program statements then use this value, the whole
system will deviate from its desired behavior. In this case, the erroneous statement is the fault, the
invalid value is the error, and the failure is the behavior that results from the error. Note that if the
variable is never read after being written, no failure will occur. Or, if the invalid statement is
never executed, the fault will not lead to an error. Thus, the mere presence of errors or faults does
not necessarily imply system failure.
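The chain from fault to error to failure can be sketched in a few lines of Python. The HitCounter class and its method names are hypothetical, chosen only to mirror the decrement-instead-of-increment example above.

```python
class HitCounter:
    """Hypothetical counter whose specification says record_hit() adds one."""

    def __init__(self):
        self.hits = 0

    def record_hit(self):
        # FAULT: the statement decrements instead of incrementing.
        self.hits -= 1

    def report(self):
        # A FAILURE becomes visible only when the invalid state is read
        # and delivered to a user of the system.
        return self.hits


counter = HitCounter()
# The fault is present but has not been executed: the state is still valid.
state_before = counter.hits

counter.record_hit()
# ERROR: the internal state is now -1, but the specification requires 1.
state_after = counter.hits

# FAILURE: the observable behavior deviates from the specified behavior.
observed = counter.report()
```

If report() were never called, the invalid state would stay hidden and no failure would be observed.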
As this example illustrates, the designation of what constitutes a fault -- the underlying cause of a
failure -- is relative in the sense that it is simply a point beyond which we do not choose to delve
further. After all, the incorrect statement itself is really an error that arose in the process of
writing the software, and so on.
At the heart of all fault tolerance techniques is some form of masking redundancy. This means
that components that are prone to defects are replicated in such a way that if a component fails,
one or more of the non-failed replicas will continue to provide service with no appreciable
disruption. There are many variations on this basic theme.
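One common variation on this theme is majority voting over replicas. The following is a minimal sketch, assuming three replicas of a hypothetical squaring service; the function names are illustrative and not taken from any particular system.

```python
from collections import Counter


def vote(replica_outputs):
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, in which case
    the fault cannot be masked.
    """
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count * 2 <= len(replica_outputs):
        raise ValueError("no majority among replicas")
    return value


def replicated_call(replicas, argument):
    """Run every replica on the same input and mask minority faults."""
    return vote([replica(argument) for replica in replicas])


# Three hypothetical replicas of a squaring service; one of them is faulty.
def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1


result = replicated_call([healthy, faulty, healthy], 7)
```

With three replicas, a single faulty replica is outvoted; two simultaneous faults that disagree with each other are detected but not masked.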
Fault Classifications
It is helpful to classify faults in a number of different ways, as shown by the UML class diagram
in Figure 2.
Figure 2: Different Classifications of Faults
Based on duration, faults can be classified as transient or permanent. A transient fault will
eventually disappear without any apparent intervention, whereas a permanent one will remain
unless it is removed by some external agency. While it may seem that permanent faults are more
severe, from an engineering perspective, they are much easier to diagnose and handle. A
particularly problematic type of transient fault is the intermittent fault that recurs, often
unpredictably.
A different way to classify faults is by their underlying cause. Design faults are the result of
design failures, like our coding example above. While it may appear that in a carefully designed
system all such faults should be eliminated through fault prevention, this is usually not realistic in
practice. For this reason, many fault-tolerant systems are built with the assumption that design
faults are inevitable, and that mechanisms need to be put in place to protect the system against
them. Operational faults, on the other hand, are faults that occur during the lifetime of the system
and are invariably due to physical causes, such as processor failures or disk crashes.
Finally, based on how a failed component behaves once it has failed, faults can be classified into
the following categories:
• Crash faults -- the component either completely stops operating or never returns to a
valid state;
• Omission faults -- the component completely fails to perform its service;
• Timing faults -- the component does not complete its service on time;
• Byzantine faults -- these are faults of an arbitrary nature. 3
Dependability
Dependability means that our system can be trusted to perform the service for which it has been
designed. Dependability can be decomposed into specific aspects. Reliability characterizes the
ability of a system to perform its service correctly when asked to do so. Availability means that
the system is available to perform this service when it is asked to do so. Safety is a characteristic
that quantifies the ability to avoid catastrophic failures that might involve risk to human life or
excessive costs. Finally, security is the ability of a system to prevent unauthorized access.
Technically, reliability is defined as the probability that a system will perform correctly up to a
given point in time. A common measure of reliability, therefore, is the mean time between
failures (MTBF).
Availability is defined as the probability that a system is operational at a given point in time. For
a given system, this characteristic is strongly dependent on the time it takes to restore it to service
once a failure occurs. A common way of characterizing this is mean time to repair (MTTR).
The measure associated with reliability (MTBF) and the one associated with repair (MTTR) can be
used to show the relationship between reliability and availability. It is important to distinguish
these two technical terms, since they are often used interchangeably in everyday communication,
which can lead to confusion. The availability of a system can be calculated from the two measures
according to the formula:
Availability = (MTBF) / (MTTR + MTBF)
Note that for systems that never fail, availability is equal to reliability.
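As a worked illustration of the formula, the sketch below computes availability and the implied yearly downtime. The MTBF and MTTR values are made-up numbers, not measurements of any real system.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTTR + MTBF), both in the same time unit."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mttr_hours + mtbf_hours)


# Illustrative numbers only: one failure every 1000 hours, 2 hours to repair.
a = availability(1000.0, 2.0)

# Expected downtime over a year of 365 days.
downtime_hours_per_year = (1.0 - a) * 365 * 24
```

Halving the repair time improves availability about as much as doubling the time between failures, which is why fast recovery matters as much as failure avoidance.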
Distributed Systems
We define a distributed software system (Figure 4) as: a system with two or more independent
processing sites that communicate with each other over a medium whose transmission delays
may exceed the time between successive state changes.
Figure 4: A Distributed System
From a fault-tolerance perspective, distributed systems have a major advantage: They can easily
be made redundant, which, as we have seen, is at the core of all fault-tolerance techniques.
Unfortunately, distribution also means that the imperfect and fault-prone physical world cannot
be ignored, so that as much as they help in supporting fault-tolerance, distributed systems may
also be the source of many failures. In this section we briefly review these problems.
Processing Site Failures
The fact that the processing sites of a distributed system are independent of each other means that
they are independent points of failure. While this is an advantage from the viewpoint of the user
of the system, it presents a complex problem for developers. In a centralized system, the failure of
a processing site implies the failure of all the software as well. In contrast, in a fault-tolerant
distributed system, a processing site failure means that the software on the remaining sites needs
to detect and handle that failure in some way. This may involve redistributing the functionality
from the failed site to other, operational, sites, or it may mean switching to some emergency
mode of operation.
Communication Media Failures
Another kind of failure that is inherent in most distributed systems comes from the
communication medium. The most obvious, of course, is a permanent hard failure of the entire
medium, which makes communication between processing sites impossible. In the most severe
cases, this type of failure can lead to partitioning of the system into multiple parts that are
completely isolated from each other. The danger here is that the different parts will undertake
activities that conflict with each other.
A different type of media failure is an intermittent failure. These are failures whereby messages
travelling through a communication medium are lost, reordered, or duplicated. Note that these are
not always due to hardware failures. For example, a message may be lost because the system may
have temporarily run out of memory for buffering it. Message reordering may occur due to
successive messages taking different paths through the communication medium. If the delays
incurred on these paths are different, they may overtake each other. Duplication can occur in a
number of ways. For instance, it may result from a retransmission due to an erroneous conclusion
that the original message was lost in transit.
One of the central problems with unreliable communication media is that it is not always possible
to positively ascertain that a message that was sent has actually been received by the intended
remote destination. A common technique for dealing with this is to use some type of positive
acknowledgement protocol. In such protocols, the receiver notifies the sender when it receives a
message. Of course, there is the possibility that the acknowledgement message itself will be lost,
so that such protocols are merely an optimization and not a solution.
The most common technique for detecting lost messages is based on time-outs. If we do not get a
positive acknowledgement within some reasonable time interval that our message was received,
we conclude that it was dropped somewhere along the way. The difficulty of this approach is to
distinguish situations in which a message (or its acknowledgement) is simply slow from those in
which a message has actually been lost. If we make the time-out interval too short, then we risk
duplicating messages and also reordering in some cases. If we make the interval too long, then
the system becomes unresponsive.
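The trade-off in choosing a time-out can be made concrete with a small sketch. The sample of acknowledgement times below is invented for illustration, and the classify function is a hypothetical helper rather than part of any protocol.

```python
def classify(round_trip_times, timeout):
    """Count how a fixed time-out treats a batch of sent messages.

    round_trip_times holds, per message, the time until its acknowledgement
    arrives, or None when the message (or its acknowledgement) was lost.
    """
    spurious = 0   # acknowledgement was merely slow, yet we retransmitted
    detected = 0   # real loss, noticed when the time-out expired
    for rtt in round_trip_times:
        if rtt is None:
            detected += 1
        elif rtt > timeout:
            spurious += 1
    # Every real loss costs one full time-out interval before we react.
    waiting_time = detected * timeout
    return spurious, detected, waiting_time


# Made-up sample, times in milliseconds; None marks a real loss.
sample = [20, 35, None, 120, 40, None, 300, 25]

short_timeout = classify(sample, timeout=50)
long_timeout = classify(sample, timeout=500)
```

With the short interval, two slow messages are retransmitted unnecessarily and become duplicates; with the long interval there are no duplicates, but each real loss is noticed ten times later.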
Transmission Delays
While transmission delays are not necessarily failures, they can certainly lead to failures. We've
already noted that a delay can be misconstrued as a message loss.
There are two different types of problems caused by message delays. One type results
from variable delays (jitter). That is, the time it takes for a message to reach its destination may
vary significantly. The delays depend on a number of factors, such as the route taken through the
communication medium, congestion in the medium, congestion at the processing sites (e.g., a
busy receiver), intermittent hardware failures, etc. If the transmission delay is constant, then we
can much more easily assess when a message has been lost. For this reason, some communication
networks are designed as synchronous networks, so that delay values are fixed and known in
advance.
However, even if the transmission delay is constant, there is still the problem of out-of-date
information. Since messages are used to convey information about state changes between
components of the distributed system, if the delays experienced are greater than the time required
to change from one state to the next, the information in these messages will be out of date. This
can have major repercussions that can lead to unstable systems. Just imagine trying to drive a car
if visual input to the driver were delayed by several seconds.
Transmission delays also lead to a complex situation that we will refer to as the relativistic effect.
This is a consequence of the fact that transmission delays between different processing sites in a
distributed system may be different. As a result, different sites may see the same set of messages
but in a different order. This is illustrated in Figure 5 below:
Figure 5: The Relativistic Effect
In this case, distributed sites NotifierP and NotifierQ each send out a notification about an event
to the two clients (ClientA and ClientB). Due to the different routes taken by the individual
messages and the different delays along those routes, we see that ClientB sees one sequence
(event1 followed by event2), whereas ClientA sees a different one (event2-event1). As a
consequence, the two clients may reach different conclusions about the state of the system.
Note that the mismatch here is not the result of message overtaking (although this effect is
compounded if overtaking occurs); it is merely a consequence of the different locations of the
distributed agents relative to each other.
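The effect can be reproduced with nothing more than per-route delays. In the sketch below the send times and delay values are assumptions chosen to match the scenario described; only the site names come from the figure.

```python
def arrival_order(sends, delays, receiver):
    """Return event names in the order in which `receiver` sees them.

    sends:  {sender: (send_time, event_name)}
    delays: {(sender, receiver): transmission_delay}
    """
    arrivals = [
        (send_time + delays[(sender, receiver)], event)
        for sender, (send_time, event) in sends.items()
    ]
    return [event for _, event in sorted(arrivals)]


# Hypothetical timings: both notifiers send at almost the same moment.
sends = {"NotifierP": (0, "event1"), "NotifierQ": (1, "event2")}
delays = {
    ("NotifierP", "ClientA"): 10, ("NotifierQ", "ClientA"): 2,
    ("NotifierP", "ClientB"): 2, ("NotifierQ", "ClientB"): 10,
}

seen_by_a = arrival_order(sends, delays, "ClientA")
seen_by_b = arrival_order(sends, delays, "ClientB")
```

No message overtakes another on the same route here; the two clients disagree purely because each sits closer to a different notifier.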
Distributed Agreement Problems
The various failure scenarios in distributed systems and transmission delays in particular have
instigated important work on the foundations of distributed software. 5 Much of this work has
focused on the central issue of distributed agreement. There are many variations of this problem,
including time synchronization, consistent distributed state, distributed mutual exclusion,
distributed transaction commit, distributed termination, distributed election, etc. However, all of
these reduce to the common problem of reaching agreement in a distributed environment in the
presence of failures.
• Error detection is the detection of errors caused by noise or other impairments during
transmission from the transmitter to the receiver.[1]
• Error correction is the detection of errors and reconstruction of the original, error-free
data.
There are two basic ways to design the channel code and protocol for an error-correcting system:
• Automatic Repeat reQuest (ARQ): The transmitter sends the data and also an error-
detection code, which the receiver uses to check for errors, and requests retransmission of
the data that was deemed erroneous. In many cases the request is implicit: The receiver
sends an acknowledgement (ACK) of correctly received data, and the transmitter re-sends
anything not acknowledged within a reasonable period of time.
• Forward error correction (FEC): The transmitter encodes the data with an error-
correcting code (ECC) and sends the coded message. The receiver never sends any
messages back to the transmitter. The receiver decodes what it receives into the "most
likely" data. Forward error-correction codes are designed so that it would take an
"unreasonable" amount of noise to trick the receiver into misinterpreting the data.
It is possible to combine the ARQ and FEC so that minor errors are corrected without
retransmission, and major errors are corrected via a request for retransmission. The combination
is called a hybrid automatic repeat-request.
Several schemes exist to achieve error detection. The general idea is to add some redundancy
(i.e., some extra data) to a message, which enables detection of any errors in the delivered
message. Most such error-detection schemes are systematic: The transmitter sends the original
data bits, and attaches a fixed number of check bits, which are derived from the data bits by some
deterministic algorithm. The receiver applies the same algorithm to the received data bits and
compares its output to the received check bits; if the values do not match, an error has occurred at
some point during the transmission. In a system that uses a "non-systematic" code, such as some
raptor codes, the original message is transformed into an encoded message that has at least as
many bits as the original message.
In general, any hash function may be used to compute the redundancy. However, some functions
are of particularly widespread use because of either their simplicity or their suitability for
detecting certain kinds of errors (e.g., the cyclic redundancy check's performance in detecting
burst errors).
Other mechanisms of adding redundancy are repetition schemes and error-correcting codes.
Repetition schemes are rather inefficient but very simple to implement. Error-correcting codes
can provide strict guarantees on the number of errors that can be detected.
Repetition codes
A repetition code is a coding scheme that repeats the bits across a channel to achieve error-free
communication. Given a stream of data to be transmitted, the data is divided into blocks of bits.
Each block is transmitted some predetermined number of times. For example, to send the bit
pattern "1011", the four-bit block can be repeated three times, thus producing "1011 1011 1011".
However, if this twelve-bit pattern was received as "1010 1011 1011" – where the first block is
unlike the other two – it can be determined that an error has occurred.
Repetition codes are not very efficient, and can be susceptible to problems if the error occurs in
exactly the same place for each group (e.g., "1010 1010 1010" in the previous example would be
detected as correct). The advantage of repetition codes is that they are extremely simple, and
they are in fact used in some transmissions of numbers stations.
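The "1011" example can be sketched as follows. The majority_decode function goes one step beyond the detection described above and is included as a common extension, on the assumption that an odd number of copies is sent.

```python
from collections import Counter


def encode(block, repeats=3):
    """Repeat a bit-string block `repeats` times."""
    return [block] * repeats


def has_error(copies):
    """Detect an error: the received copies are not all identical."""
    return len(set(copies)) > 1


def majority_decode(copies):
    """Correct by taking, per bit position, the most common bit."""
    return "".join(
        Counter(bits).most_common(1)[0][0] for bits in zip(*copies)
    )


sent = encode("1011")
received = ["1010", "1011", "1011"]       # first copy corrupted in transit
undetectable = ["1010", "1010", "1010"]   # same bit hit in every copy
```

Here has_error(received) flags the mismatch and majority_decode(received) recovers "1011", while the undetectable case passes the comparison unnoticed.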
Parity bits
A parity bit is a bit that is added to ensure that the number of set bits (i.e., bits with the value 1) in
a group of bits is even or odd. A parity bit can only detect an odd number of errors (i.e., one,
three, five, etc. bits that are incorrect).
There are two variants of parity bits: even parity bit and odd parity bit. When using even parity,
the parity bit is set to 1 if the number of ones in a given set of bits (not including the parity bit) is
odd, making the entire set of bits (including the parity bit) even. When using odd parity, the
parity bit is set to 1 if the number of ones in a given set of bits (not including the parity bit) is
even, making the entire set of bits (including the parity bit) odd. In other words, an even parity bit
will be set if the number of set bits plus one is even, and an odd parity bit will be set if the
number of set bits plus one is odd.
There is a limitation to parity schemes. A parity bit is only guaranteed to detect an odd number of
bit errors. If an even number of bits (i.e., two, four, six, etc.) are flipped, the parity bit will appear
to be correct even though the data is erroneous. Extensions and variations on the parity bit
mechanism are horizontal redundancy checks, vertical redundancy checks, and "double," "dual,"
or "diagonal" parity (used in RAID-DP).
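A minimal sketch of even and odd parity, and of the limitation with an even number of flipped bits, might look like this. The function names and the sample data word are illustrative.

```python
def parity_bit(bits, even=True):
    """Return the parity bit for a sequence of 0/1 integers."""
    bit = sum(bits) % 2          # 1 exactly when the count of ones is odd
    return bit if even else bit ^ 1


def check(bits_with_parity, even=True):
    """True when the total number of ones has the agreed parity."""
    total = sum(bits_with_parity) % 2
    return total == (0 if even else 1)


data = [1, 0, 1, 1, 0, 0, 1]          # four ones
word = data + [parity_bit(data)]      # even parity appended

one_flip = list(word)
one_flip[0] ^= 1                      # single-bit error

two_flips = list(one_flip)
two_flips[1] ^= 1                     # second error cancels the first
```

The single flip fails the check, while the double flip passes it even though the data is wrong.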
Checksums
A checksum of a message is a modular arithmetic sum of message code words of a fixed word
length (e.g., byte values). The sum is often negated by means of a ones'-complement operation
prior to transmission, so that errors resulting in all-zero messages can be detected.
Checksum schemes include parity bits, check digits, and longitudinal redundancy checks. Some
checksum schemes, such as the Luhn algorithm and the Verhoeff algorithm, are specifically
designed to detect errors commonly introduced by humans in writing down or remembering
identification numbers.
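As a simplified sketch of a modular-sum checksum, the code below sums byte values modulo 256 and transmits the bitwise complement of the sum. This is an assumption-laden toy: it uses plain modulo-256 addition rather than true ones'-complement arithmetic with end-around carry.

```python
def checksum(data: bytes) -> int:
    """Sum of the byte values modulo 256, then bitwise-complemented."""
    return ~sum(data) & 0xFF


def verify(data: bytes, received_checksum: int) -> bool:
    """Valid frames satisfy (sum of bytes + checksum) mod 256 == 0xFF."""
    return ((sum(data) + received_checksum) & 0xFF) == 0xFF


message = b"hello"
check_value = checksum(message)

# A frame that arrives as all zeros, checksum included, is rejected,
# because the complemented checksum of all-zero data is 0xFF, not 0.
all_zero_frame_ok = verify(bytes(5), 0)
```

Without the complement step, an all-zero frame would carry a checksum of zero and would be accepted as valid.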
Cyclic redundancy checks have favorable properties in that they are specifically suited for
detecting burst errors. CRCs are easily implemented in hardware, and are commonly used in
digital networks and storage devices such as hard disk drives.
Even parity is a special case of a cyclic redundancy check, where the single-bit CRC is generated
by the polynomial x+1.
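The relationship between parity and CRCs can be checked with a bit-level sketch of polynomial division over GF(2). The list-of-bits representation and the 3-bit generator used in the second example are illustrative choices, not a standard CRC.

```python
def crc_remainder(bits, poly):
    """Remainder of bits * x^r divided by poly over GF(2).

    bits: list of 0/1 message bits, most significant first.
    poly: list of 0/1 generator coefficients, highest degree first,
          e.g. [1, 1] for x + 1. The degree r = len(poly) - 1 must be >= 1.
    """
    r = len(poly) - 1
    work = list(bits) + [0] * r           # append r zero bits
    for i in range(len(bits)):
        if work[i]:                       # leading term present: subtract
            for j, coeff in enumerate(poly):
                work[i + j] ^= coeff
    return work[-r:]


message = [1, 0, 1, 1]

# Generator x + 1: the one-bit remainder equals the even parity bit.
parity_as_crc = crc_remainder(message, [1, 1])

# Illustrative 3-bit CRC with generator x^3 + x + 1.
check_bits = crc_remainder(message, [1, 0, 1, 1])
codeword = message + check_bits
```

A receiver can run the same division over the received codeword; an all-zero remainder means no error was detected.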
A cryptographic hash function can provide strong assurances about data integrity, provided that
changes of the data are only accidental (i.e., due to transmission errors). Any modification to the
data will likely be detected through a mismatching hash value. Furthermore, given some hash
value, it is infeasible to find some input data (other than the one given) that will yield the same
hash value. Message authentication codes, also called keyed cryptographic hash functions,
provide additional protection against intentional modification by an attacker.
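The difference between a plain hash and a message authentication code can be sketched with Python's standard hashlib and hmac modules. SHA-256 and the key below are arbitrary choices made for this example.

```python
import hashlib
import hmac


def digest(message: bytes) -> str:
    """Unkeyed hash: detects accidental changes to the message."""
    return hashlib.sha256(message).hexdigest()


def tag(key: bytes, message: bytes) -> bytes:
    """Keyed hash (HMAC): also protects against deliberate modification."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_tag(key: bytes, message: bytes, received_tag: bytes) -> bool:
    """Constant-time comparison of the expected and received tags."""
    return hmac.compare_digest(tag(key, message), received_tag)


key = b"example-shared-secret"            # hypothetical key for illustration
original = b"transfer 100 to account 42"
mac = tag(key, original)
```

An attacker who alters the message can recompute an unkeyed digest, but cannot produce a valid tag without knowing the key.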
Error-correcting codes
Any error-correcting code can be used for error detection. A code with minimum Hamming
distance, d, can detect up to d-1 errors in a code word. Using error-correcting codes for error
detection can be favorable if strict integrity guarantees are desired, and the capacity of the
transmission channel can be modeled.
Codes with minimum Hamming distance d=2 are degenerate cases of error-correcting codes, and
can be used to detect single errors. The parity bit is an example of a single-error-detecting code.
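The minimum distance of a small code can be computed by brute force. The sketch below builds the 4-bit even-parity code as an example of a d=2 code; the helper names are illustrative.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codewords):
    """Smallest distance between any two distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


# All 3-bit words with an even-parity bit appended.
parity_code = [
    format(n, "03b") + str(bin(n).count("1") % 2) for n in range(8)
]

d = minimum_distance(parity_code)
detectable_errors = d - 1
```

For comparison, the two-word repetition code ["000", "111"] has minimum distance 3 and can therefore detect up to two errors.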
The Berger code is an early example of a unidirectional error(-correcting) code that can detect
any number of errors on an asymmetric channel, provided that only transitions of cleared bits to
set bits or set bits to cleared bits can occur.
Error correction
Automatic Repeat reQuest (ARQ) is an error control method for data transmission that makes use
of error-detection codes, acknowledgment and/or negative acknowledgment messages, and
timeouts to achieve reliable data transmission. An acknowledgment is a message sent by the
receiver to indicate that it has correctly received a data frame.
Usually, when the transmitter does not receive the acknowledgment before the timeout occurs
(i.e., within a reasonable amount of time after sending the data frame), it retransmits the frame
until it is either correctly received or the error persists beyond a predetermined number of
retransmissions.
Three types of ARQ protocols are Stop-and-wait ARQ, Go-Back-N ARQ, and Selective Repeat
ARQ.
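Of the three, Stop-and-wait is the simplest to sketch. The code below is a toy model under stated assumptions: the channel is a hypothetical callable that reports whether the acknowledgement arrived before the timeout, and real timers and frame formats are left out.

```python
def stop_and_wait(frames, channel, max_retries=3):
    """Send frames one at a time, retransmitting until acknowledged.

    channel(seq_bit, payload) models one send attempt plus the wait for
    the acknowledgement; it returns True if the ACK arrived before the
    timeout and False otherwise. Returns the attempts used per frame.
    """
    attempts = []
    for seq, payload in enumerate(frames):
        for attempt in range(1, max_retries + 2):
            # A one-bit alternating sequence number lets the receiver
            # recognize a retransmission as a duplicate.
            if channel(seq % 2, payload):
                attempts.append(attempt)
                break
        else:
            raise TimeoutError(f"frame {seq} not acknowledged")
    return attempts


def scripted_channel(outcomes):
    """Hypothetical channel that replays a fixed list of ACK outcomes."""
    replay = iter(outcomes)
    return lambda seq_bit, payload: next(replay)


channel = scripted_channel([True, False, False, True, True])
attempts_per_frame = stop_and_wait(["a", "b", "c"], channel)
```

The second frame needs three attempts in this script; a frame that exhausts its retransmission budget is reported to the caller instead of being retried forever.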
ARQ is appropriate if the communication channel has varying or unknown capacity, such as is
the case on the Internet. However, ARQ requires the availability of a back channel, results in
possibly increased latency due to retransmissions, and requires the maintenance of buffers and
timers for retransmissions, which in the case of network congestion can put a strain on the server
and overall network capacity.[2]
Error-correcting code
An error-correcting code (ECC) or forward error correction (FEC) code is a system of adding
redundant data, or parity data, to a message, such that it can be recovered by a receiver even
when a number of errors (up to the capability of the code being used) were introduced, either
during the process of transmission, or on storage. Since the receiver does not have to ask the
sender for retransmission of the data, a back-channel is not required in forward error correction,
and it is therefore suitable for simplex communication such as broadcasting. Error-correcting
codes are frequently used in lower-layer communication, as well as for reliable storage in media
such as CDs, DVDs, and dynamic RAM.
Error-correcting codes are usually divided into convolutional codes and block codes:
• Convolutional codes are processed on a bit-by-bit basis. They are particularly suitable for
implementation in hardware, and the Viterbi decoder allows optimal decoding.
• Block codes are processed on a block-by-block basis. Early examples of block codes are
repetition codes, Hamming codes and multidimensional parity-check codes. They were
followed by a number of efficient codes, of which Reed-Solomon codes are the most
notable ones due to their widespread use these days. Turbo codes and low-density parity-
check codes (LDPC) are relatively new constructions that can provide almost optimal
efficiency.
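As a concrete block code, the Hamming(7,4) code can be sketched in a few lines. The bit layout below (parity bits at positions 1, 2, and 4) is the conventional textbook arrangement and is an assumption of this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4      # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position).

    error_position is 0 when no error was found, else the 1-based index
    of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4   # syndrome read as a binary number
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1                            # flip the bit at position 5
recovered, position = hamming74_decode(corrupted)
```

Three check bits protect four data bits, so the code rate is 4/7; two simultaneous bit errors exceed what this code can correct.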
Shannon's theorem is an important theorem in forward error correction, and describes the
maximum information rate at which reliable communication is possible over a channel that has a
certain error probability or signal-to-noise ratio (SNR). This strict upper limit is expressed in
terms of the channel capacity. More specifically, the theorem says that there exist codes such that
with increasing encoding length the probability of error on a discrete memoryless channel can be
made arbitrarily small, provided that the code rate is smaller than the channel capacity. The code
rate is defined as the fraction k/n of k source symbols and n encoded symbols.
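To make the capacity bound concrete, the sketch below assumes one particular discrete memoryless channel, the binary symmetric channel with bit-flip probability p, whose capacity is 1 - H(p) with H the binary entropy function. The choice of channel and the sample probability are assumptions of this example.

```python
import math


def binary_entropy(p):
    """H(p) in bits; defined as 0 at the end points p = 0 and p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(p):
    """Capacity, in bits per channel use, of a binary symmetric channel."""
    return 1.0 - binary_entropy(p)


def code_rate(k, n):
    """Fraction k/n of source symbols per encoded symbol."""
    return k / n


p = 0.11                                   # illustrative bit-flip probability
capacity = bsc_capacity(p)
rate_third_ok = code_rate(1, 3) < capacity
rate_four_sevenths_ok = code_rate(4, 7) < capacity
```

On this channel a rate of 1/3 lies below capacity, so arbitrarily reliable codes of that rate exist, whereas a rate of 4/7 lies above it.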
The actual maximum code rate allowed depends on the error-correcting code used, and may be
lower. This is because Shannon's proof was only of existential nature, and did not show how to
construct codes which are both optimal and have efficient encoding and decoding algorithms.
Hybrid schemes
Hybrid ARQ is a combination of ARQ and forward error correction. There are two basic
approaches[2]:
• Messages are always transmitted with FEC parity data (and error-detection redundancy).
A receiver decodes a message using the parity information, and requests retransmission
using ARQ only if the parity data was not sufficient for successful decoding (identified
through a failed integrity check).
• Messages are transmitted without parity data (only with error-detection information). If a
receiver detects an error, it requests FEC information from the transmitter using ARQ,
and uses it to reconstruct the original message.
The latter approach is particularly attractive on the binary erasure channel when using a rateless
erasure code.
Applications
Applications that require low latency (such as telephone conversations) cannot use Automatic
Repeat reQuest (ARQ); they must use Forward Error Correction (FEC). By the time an ARQ
system discovers an error and re-transmits it, the re-sent data will arrive too late to be any good.
Applications where the transmitter immediately forgets the information as soon as it is sent (such
as most television cameras) cannot use ARQ; they must use FEC because when an error occurs,
the original data is no longer available. (This is also why FEC is used in data storage systems
such as RAID and distributed data store).
Applications that use ARQ must have a return channel. Applications that have no return channel
cannot use ARQ.
Applications that require extremely low error rates (such as digital money transfers) must use
ARQ.
The Internet
• Each Ethernet frame carries a CRC-32 checksum. The receiver discards frames if their
checksums do not match.
• The IPv4 header contains a header checksum of the contents of the header (excluding the
checksum field). Packets with checksums that don't match may be discarded or
processed, depending on application.
• The checksum was omitted from the IPv6 header, because most current link layer
protocols have error detection.
• UDP has an optional checksum. Packets with wrong checksums may be discarded or
retained depending on application.
• TCP has a checksum covering the payload, the TCP header (excluding the checksum field), and
the source and destination addresses of the IP header. Packets found to have incorrect
checksums are discarded and eventually get retransmitted when the sender receives a
triple-ack or a timeout occurs.
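The IPv4, UDP, and TCP checksums listed above share the same 16-bit ones'-complement sum, described in RFC 1071. The following is a minimal sketch of that computation over a raw byte string; assembling real headers and the TCP or UDP pseudo-header is left out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement of the ones'-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                     # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def is_valid(data_with_checksum: bytes) -> bool:
    """A block that includes its own checksum sums to zero."""
    return internet_checksum(data_with_checksum) == 0


payload = bytes.fromhex("0001f203f4f5f6f7")
value = internet_checksum(payload)
frame = payload + value.to_bytes(2, "big")
```

Unlike a plain modular sum, carries out of the top bit are added back into the low bits, which is what makes the sum a ones'-complement one.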
Deep-space telecommunications
Development of error-correction codes was tightly coupled with the history of deep-space
missions due to the extreme dilution of signal power over interplanetary distances, and the limited
power availability aboard space probes. Whereas early missions sent their data uncoded, starting
from 1968 digital error correction was implemented in the form of (sub-optimally decoded)
convolutional codes or Reed-Muller codes. The Reed-Muller code was well suited to the noise
the spacecraft was subject to (approximately matching a bell curve), and was implemented on the
Mariner spacecraft for missions between 1969 and 1977.
The Voyager 1 and Voyager 2 missions, which started in 1977, were designed to deliver color
imaging along with scientific information from Jupiter and Saturn. This resulted in increased coding
requirements, and thus the spacecraft were supported by (optimally Viterbi-decoded)
convolutional codes that could be concatenated with an outer Golay (24,12,8) code. The Voyager
2 probe additionally supported an implementation of a Reed-Solomon code: the concatenated
Reed-Solomon-Viterbi (RSV) code allowed for very powerful error correction, and enabled the
spacecraft's extended journey to Uranus and Neptune.
The CCSDS currently recommends usage of error correction codes with performance similar to
the Voyager 2 RSV code as a minimum. Concatenated codes are increasingly falling out of favor
with space missions due to their relatively high hardware costs, and are replaced by more
powerful codes such as Turbo codes or LDPC codes.
The different kinds of deep space and orbital missions that are conducted suggest that trying to
find a "one size fits all" error correction system will be an ongoing problem for some time to
come. For missions close to Earth, the nature of the channel noise is different from that which a
spacecraft on an interplanetary mission experiences. Additionally, as a spacecraft increases its
distance from Earth, the problem of correcting for noise gets larger.
The demand for satellite transponder bandwidth continues to grow, fueled by the desire to deliver
television (including new channels and High Definition TV) and IP data. Transponder availability
and bandwidth constraints have limited this growth, because transponder capacity is determined
by the selected modulation scheme and Forward error correction (FEC) rate.
Overview
• QPSK coupled with traditional Reed-Solomon and Viterbi codes has been used for
nearly 20 years for the delivery of digital satellite TV.
• Higher order modulation schemes such as 8PSK, 16QAM and 32QAM have enabled the
satellite industry to increase transponder efficiency by several orders of magnitude.
• This increase in the information rate in a transponder comes at the expense of an increase
in the carrier power to meet the threshold requirement for existing antennas.
• Tests conducted using the latest chipsets demonstrate that the performance achieved by
using Turbo Codes may be even lower than the 0.8 dB figure assumed in early designs.
Data storage
Error detection and correction codes are often used to improve the reliability of data storage
media.
A "parity track" was present on the first magnetic tape data storage in 1951. The "Optimal
Rectangular Code" used in group code recording tapes not only detects but also corrects single-bit
errors.
Some file formats, particularly archive formats, include a checksum (most often CRC32) to detect
corruption and truncation and can employ redundancy and/or parity files to recover portions of
corrupted data.
Reed-Solomon codes are used in compact discs to correct errors caused by scratches.
Modern hard drives use CRC codes to detect and Reed-Solomon codes to correct minor errors in
sector reads, and to recover data from sectors that have "gone bad" and store that data in the spare
sectors.[3]
RAID systems use a variety of error correction techniques to correct errors when a hard drive
completely fails.
Error-correcting memory
DRAM may provide increased protection against soft errors by relying on error-correcting
codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is
particularly desirable for highly fault-tolerant applications, such as servers, as well as deep-space
applications because of increased radiation.
Error-correcting memory controllers traditionally use Hamming codes, although some use triple
modular redundancy.
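Triple modular redundancy can be illustrated with a bitwise majority vote over three stored copies of a word. This is only a behavioural sketch: the function name tmr_vote is an assumption, and a real memory controller would implement the vote in hardware.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three copies of the same word.

    Each output bit takes the value held by at least two of the three inputs,
    so a fault confined to one copy is outvoted by the other two.
    """
    return (a & b) | (a & c) | (b & c)


# Three copies of the 4-bit word 1011; the third copy has one upset bit.
good = 0b1011
upset = good ^ 0b1000
recovered = tmr_vote(good, good, upset)
print(format(recovered, "04b"))
```

The vote fails only when two copies are wrong in the same bit position, which is why the scheme stores the copies independently.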
Interleaving allows distributing the effect of a single cosmic ray potentially upsetting multiple
physically neighboring bits across multiple words by associating neighboring bits to different
words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single
error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error
correcting code), and the illusion of an error-free memory system may be maintained.[4]
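The effect of interleaving can be sketched by counting how many bits of each word a burst of physically adjacent upsets touches. The layout below (4 words of 8 bits, with physical cell p belonging to word p mod 4 when interleaved) is an assumed toy arrangement, not the layout of any specific memory device.

```python
from typing import List


def errors_per_word(burst_start: int, burst_len: int, n_words: int,
                    bits_per_word: int, interleaved: bool) -> List[int]:
    """Count upset bits per logical word for a burst of adjacent physical cells."""
    counts = [0] * n_words
    for cell in range(burst_start, burst_start + burst_len):
        if interleaved:
            word = cell % n_words          # neighbouring cells go to different words
        else:
            word = cell // bits_per_word   # neighbouring cells share one word
        counts[word] += 1
    return counts


# A single event upsetting three neighbouring cells, starting at cell 9.
plain = errors_per_word(9, 3, n_words=4, bits_per_word=8, interleaved=False)
spread = errors_per_word(9, 3, n_words=4, bits_per_word=8, interleaved=True)
print("without interleaving:", plain)
print("with interleaving:   ", spread)
```

Without interleaving, one word receives all three errors and exceeds what a single-bit error correcting code can repair. With interleaving, no word receives more than one error, so each can be corrected independently.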
• BCH code
• Constant-weight code
• Convolutional code
• Group codes
• Golay codes, of which the Binary Golay code is of practical interest
• Goppa code, used in the McEliece cryptosystem
• Hadamard code
• Hagelbarger code
• Hamming code
• Latin square based code for non-white noise (prevalent for example in broadband over
powerlines)
• Lexicographic code
• Low-density parity-check code, also known as Gallager code, as the archetype for sparse
graph codes
• LT code, which is a near-optimal rateless erasure correcting code (Fountain code)
• m of n codes
• Online code, a near-optimal rateless erasure correcting code
• Raptor code, a near-optimal rateless erasure correcting code
• Reed-Solomon code
• Reed-Muller code
• Repeat-accumulate code
• Repetition codes, such as Triple modular redundancy
• Tornado code, a near-optimal erasure correcting code, and the precursor to Fountain
codes
• Turbo code
• BCH Codes
o Berlekamp–Massey algorithm
o Peterson-Gorenstein-Zierler algorithm
o Reed Solomon error correction
• BCJR algorithm: decoding of error correcting codes defined on trellises (principally
convolutional codes)
• Hamming codes
o Hamming(7,4): a Hamming code that encodes 4 bits of data into 7 bits by adding
3 parity bits
o Hamming distance: the number of positions at which two words differ
o Hamming weight (population count): find the number of 1 bits in a binary word
• Redundancy checks
o Adler-32
o Cyclic redundancy check
o Fletcher's checksum
o Luhn algorithm: a method of validating identification numbers
o Luhn mod N algorithm: extension of Luhn to non-numeric characters
o Parity: simple/fast error detection technique
o Verhoeff algorithm
o Longitudinal redundancy check (LRC)
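Several of the entries above can be made concrete in a few lines. The sketch below implements Hamming(7,4) with the conventional layout (parity bits at positions 1, 2 and 4), together with Hamming distance and Hamming weight. The function names and the list-of-bits representation are assumptions made for illustration.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_weight(word: int) -> int:
    """Number of 1 bits in a non-negative integer (population count)."""
    return bin(word).count("1")


codeword = hamming74_encode([1, 0, 1, 1])
received = list(codeword)
received[4] ^= 1   # flip one bit in transit
print(codeword, hamming_distance(codeword, received), hamming74_decode(received))
```

Because any two distinct Hamming(7,4) codewords are at distance 3 or more, a word at distance 1 from a codeword has a unique nearest codeword, which is what the syndrome locates.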
Functional Microcontroller Design and Implementation
Abstract
In many situations, hardware description languages (HDLs) such as VHDL, Verilog or SystemC are
used to develop the functionality of a digital system, while timing and control signal
generation is neglected. The authors have used a methodology in which the hardware
structure of the digital system under consideration is first laid out conceptually. System
development started with a top-down planning approach, and the blocks were designed using a
bottom-up implementation. The programs were written, simulated and synthesized using
Electronic Design Automation (EDA) tools such as ModelSim and Leonardo Spectrum. Transfer,
arithmetic, logic, input, output and control instructions were implemented.
This approach guaranteed the integrity of the system realization, with proper timing and data flow
and without invisible ghost states. In this article, the authors present the design
methodology of such a multipurpose microcontroller and provide the functional
HDL code, simulation and synthesis results. The authors also present the sequence in
which this microcontroller can be made a general-purpose controller unit that can be used by
other system designs.
Keywords
VLSI, VHDL, Microcontroller, Simulation, Synthesis
1. Introduction
In control system applications, the microcontroller is an important module that
provides the control, timing and status signals of any complex digital system realization.
A microprocessor is usually defined as “a single chip that contains control logic
and data processing logic, so that it can execute instructions listed in a program to operate on
some data” (Bartbel, 1997). Microcontrollers are essentially microprocessors with on-chip memory.
Whether the realization is ASIC, FPGA or CPLD based, it is essential to incorporate the microcontroller
module as an integral part of the system. A functional microcontroller has been developed in
VHDL using a structural design of logic blocks, which generate the control and timing signals
used for the data processing operation.
2. Methodology
While designing a microcontroller module, it is important to determine the number of data bits
that the microcontroller can process in one cycle. In the design process, the
instruction set, instruction width, internal register set, type of control unit, flags and amount of
memory to be used have to be laid out. From a VLSI implementation point of view (Gloria, 1999), the
most relevant factor in deciding whether to implement a certain instruction is
the number of operands needed to specify the instruction. This is not an absolute factor: it
depends on the number of registers, and therefore on the number of bits required for coding a
register reference. Present-day VLSI technology has led to chips containing
millions of gates. Hardware designers create several VLSI modules for their
research and development purposes. It is often important to reuse these modules to reduce
product development time, thereby minimizing the time to market. Therefore, it is important to
design hardware in a modular fashion, so that these modules can be included in the development
of a complex system. A microcontroller designed in this modular fashion lets other designers
incorporate the module with minimal or no modifications to the hardware. The block diagram of
such a 4-bit microcontroller is shown in Figure 1.
Figure 1: 4-bit Slice Microcontroller Block Diagram
3. Module Development
In this study, a 4-bit microcontroller was designed using a modular approach. Using a top-down
approach, the elements of the microcontroller were identified as basic registers, instruction
decoders, ALU, RAM, and control and timing unit. Zabawa and Wunnava (2004) propose the design
of a microcontroller unit with the following building blocks:
• A 16-word by 4-bit port RAM.
• Four registers (RegA, RegB, RegC, RegD).
• An ALU selector, which selects two inputs. The D (Destination Input) will always be rega_data.
The S (Source Input) will be one of regb_data, regc_data or regd_data, based on the selected output
from the MUX.
• A MUX module, which uses instruction [11:8] as the select input. If select is ‘0010’, then
regb_data is assigned to signal S (Source Input). If select is ‘0100’, then regc_data is assigned
to signal S. If select is ‘1000’, then regd_data is assigned to signal S.
• A 4-bit ALU capable of performing arithmetic, logical, and bitwise functions on the selected source
and destination words.
• An instruction decoder, which decides whether to load RegA with the ALU output or with
the data read from the RAM. It also loads RegB, RegC, and RegD from memory when a
MOVB, MOVC, or MOVD operation occurs.
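The source selection described for the MUX can be modelled behaviourally as follows. This is a sketch, not the authors' VHDL: the function name select_source is hypothetical, and raising an error for any other select code is an assumption, since the text does not say how the hardware treats those codes.

```python
def select_source(instruction: int, regb_data: int, regc_data: int, regd_data: int) -> int:
    """Return the ALU source operand S chosen by bits [11:8] of a 12-bit instruction."""
    select = (instruction >> 8) & 0xF   # instruction[11:8]
    if select == 0b0010:
        return regb_data
    if select == 0b0100:
        return regc_data
    if select == 0b1000:
        return regd_data
    # Assumption: the text defines no behaviour for other select codes.
    raise ValueError(f"unsupported select code {select:04b}")


# Example register contents (4-bit values) and an instruction selecting RegC.
regb, regc, regd = 0x3, 0x9, 0xC
instruction = 0b0100_0000_0000
print(select_source(instruction, regb, regc, regd))
```

The destination operand D needs no selection logic because, as stated above, it is always rega_data.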
The implementation was done using a bottom-up approach. The basic hardware blocks such as
adders, flip-flops, shift registers, counters, and comparators were designed first. These blocks
were then used to form the ALU, RAM, and instruction decoder. Finally, in the top-level module, these
blocks were connected to form a functional microcontroller. The top-level module has the I/O
pins shown in Table 1. The instruction set has a 12-bit instruction width, consisting of three 4-bit
fields whose functions are shown in Table 2. The ALU performs arithmetic, logic and shift
operations as defined in Table 2. In all, 14 instructions can be performed by this ALU.
5. Results
The VHDL program was simulated in ModelSim using TEXTIO test benches with extensive test
vectors. The simulation of the OR and NOT instruction phases is shown in Fig. 5. The device
utilization of the target FPGA device, the A1240XLPG132 from the Actel family, is shown in Table 3.
The synthesized microcontroller module has 29 pins for address, data, I/O and other signals.
The Leonardo Spectrum synthesis report shows 443 accumulated instances, with 24 sequential and
280 combinational modules, and an operating speed of 14.47 MHz.
Clock Domains
The 8051 IP core is a fully synchronous design. A single clock signal controls the
clock input of every storage element. Clock gating is not used, and the clock signal is not fed into
any combinatorial element. The interrupt input lines are synchronized to the global clock signal
using a standard two-level synchronization stage, because
they may be driven by external circuitry that operates on a different clock. The parallel port input
signals are not synchronized in this way. If the user decides that these signals also need to be
synchronized, such a stage can easily be added.
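The behaviour of a two-level synchronization stage can be sketched as two flip-flops in series, both clocked by the global clock. The class below is a cycle-level Python model under that assumption. It is not the core's VHDL, and it cannot reproduce metastability, which is the physical effect the second stage is there to absorb.

```python
class TwoStageSynchronizer:
    """Cycle-level model of two flip-flops in series on one clock."""

    def __init__(self) -> None:
        self.stage1 = 0   # first flip-flop, samples the asynchronous input
        self.stage2 = 0   # second flip-flop, drives the synchronous logic

    def clock(self, async_in: int) -> int:
        """Apply one rising clock edge and return the synchronized output."""
        # Both flip-flops update on the same edge, so stage2 takes the old stage1.
        self.stage2 = self.stage1
        self.stage1 = async_in
        return self.stage2


# An external interrupt line that is high for three clock periods.
sync = TwoStageSynchronizer()
outputs = [sync.clock(level) for level in [1, 1, 1, 0, 0]]
print(outputs)
```

The synchronized signal lags the external input by one extra clock edge, which is the latency cost of adding the second stage.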
Memory Interfaces
Due to the optimized architecture, the signals coming from and going to the memory blocks are
not registered. During synthesis, input and output timing constraints should therefore be placed on
the corresponding ports, and synchronous memory blocks should be used for the mc8051 IP-core.
Configuring the 8051 IP Core
This section discusses the parameterization of the 8051 microcontroller IP-core design and
gives information for embedding the IP-core in larger designs.
Timer/Counter, Serial Interface, and Interrupts
The original microcontroller design offered only two timer/counter units, one serial
interface, and two external interrupt sources. Later 8051 derivatives offered more of
these resources on chip. Since this is sometimes a limiting factor, we decided to
implement some parameterization in the 8051 IP core. This 8051
microcontroller IP-core can generate up to 256 of these units simply by
changing the value of a VHDL constant. In the VHDL source file mc8051_p.vhd, the constant
C_IMPL_N_TMR can take values from 1 to 256 to control this feature. Values outside this
interval result in a non-functioning configuration of the core. Figure 3 shows the
corresponding lines of VHDL code.
-----------------------------------------------------------------------------
-- Select how many timer/counter units should be implemented
-- Default: 1
constant C_IMPL_N_TMR : integer := 1;
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Select how many serial interface units should be implemented
-- Default: C_IMPL_N_TMR ---(DO NOT CHANGE!)---
constant C_IMPL_N_SIU : integer := C_IMPL_N_TMR;
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Select how many external interrupt-inputs should be implemented
-- Default: C_IMPL_N_TMR ---(DO NOT CHANGE!)---
constant C_IMPL_N_EXT : integer := C_IMPL_N_TMR;
-----------------------------------------------------------------------------
figure 3: VHDL source code for configuring the number of timer/counter units, serial interfaces,
and external interrupts.
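The constraints stated for these constants (C_IMPL_N_TMR between 1 and 256, with the other two constants left equal to it) could be checked before synthesis with a small script such as the one below. The function check_core_config is a hypothetical helper written for illustration. It is not part of the IP core distribution, and it takes the values as arguments instead of parsing mc8051_p.vhd.

```python
from typing import List

MIN_UNITS = 1     # lower bound stated for C_IMPL_N_TMR
MAX_UNITS = 256   # upper bound stated for C_IMPL_N_TMR


def check_core_config(n_tmr: int, n_siu: int, n_ext: int) -> List[str]:
    """Return a list of problems found in the three unit-count constants."""
    problems = []
    if not MIN_UNITS <= n_tmr <= MAX_UNITS:
        problems.append(
            f"C_IMPL_N_TMR = {n_tmr} is outside {MIN_UNITS}..{MAX_UNITS}")
    # The source marks these two constants as "DO NOT CHANGE": they track C_IMPL_N_TMR.
    if n_siu != n_tmr:
        problems.append("C_IMPL_N_SIU must equal C_IMPL_N_TMR")
    if n_ext != n_tmr:
        problems.append("C_IMPL_N_EXT must equal C_IMPL_N_TMR")
    return problems


# Default configuration from figure 3, followed by an out-of-range one.
print(check_core_config(1, 1, 1))
print(check_core_config(300, 300, 300))
```

An empty list means the three values satisfy the rules given in the text. It says nothing about whether the resulting core fits the target device.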
Optional Instructions
In some cases it makes sense not to implement instructions that are not needed
and that consume considerable chip area. Such instructions are 8-bit multiplication,
8-bit division, and 8-bit decimal correction. The MUL instruction for 8-bit
multiplication can be skipped by setting the VHDL constant C_IMPL_MUL in the
mc8051_p.vhd source file to 0. Likewise, the 8-bit division DIV can be skipped
by setting the VHDL constant C_IMPL_DIV to 0, and the decimal correction
instruction can be skipped by setting the constant C_IMPL_DA to 0. The
corresponding lines of VHDL source code are shown in figure 5.
-----------------------------------------------------------------------------
-- Select whether to implement (1) or skip (0) the multiplier
-- Default: 1
constant C_IMPL_MUL : integer := 1;
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Select whether to implement (1) or skip (0) the divider
-- Default: 1
constant C_IMPL_DIV : integer := 1;
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Select whether to implement (1) or skip (0) the decimal adjustment command
-- Default: 1
constant C_IMPL_DA : integer := 1;
-----------------------------------------------------------------------------
figure 5: Code fragment showing how instructions can be skipped.
The gain in terms of chip area when not implementing all three optional instructions is
approximately 10 %.
Parallel I/O Ports
Like the original 8051 microcontroller, the mc8051 IP-core offers four bidirectional
8-bit I/O ports for conveniently exchanging data with the microcontroller’s environment.
To ease integration of the core in IC designs, the original’s multi-function ports have
not been rebuilt, and all signals (e.g. serial interface, interrupts, counter inputs, and
interface to external memory) are brought out of the core separately (see figure 1).
The basic structure of the parallel I/O ports is shown in figure 6.