
Information Theory

April 9, 2007

Data Compression
A common characteristic of signals generated by physical sources is that, in their natural form, they contain a significant amount of redundant information, the transmission of which is therefore wasteful of primary communication resources. For efficient signal transmission, the redundant information should be removed from the signal prior to transmission. This operation, performed with no loss of information, is ordinarily carried out on a signal in digital form, in which case we refer to it as data compaction or lossless data compression.

Prefix coding
Consider a discrete memoryless source with alphabet $\{s_0, s_1, \ldots, s_{K-1}\}$ and statistics $\{p_0, p_1, \ldots, p_{K-1}\}$. For a source code representing the output of this source to be of practical use, the code has to be uniquely decodable. This restriction ensures that, for each finite sequence of symbols emitted by the source, the corresponding sequence of code words is different from the sequence of code words corresponding to any other source sequence. We are specifically interested in a special class of codes satisfying a restriction known as the prefix condition: no code word is the prefix of any other code word. An example of prefix coding is shown in Figure 1. A prefix code has the important property that it is always uniquely decodable, but the converse is not necessarily true. For example, code III in Figure 1 does not satisfy the prefix condition, yet it is uniquely decodable, since the bit 0 indicates the beginning of each code word in the code.

Moreover, if a prefix code has been constructed for a discrete memoryless source with source alphabet $\{s_0, s_1, \ldots, s_{K-1}\}$ and source statistics $\{p_0, p_1, \ldots, p_{K-1}\}$, and the code word for symbol $s_k$ has length $l_k$, $k = 0, 1, \ldots, K-1$, then the code-word lengths always satisfy the Kraft-McMillan inequality:
$$\sum_{k=0}^{K-1} 2^{-l_k} \le 1$$

where the factor 2 refers to the radix (number of symbols) of the binary alphabet. It is important to note, however, that the Kraft-McMillan inequality does not tell us that a source code is a prefix code; rather, it is merely a condition on the code-word lengths of the code and not on the code words themselves.

Source symbol    Probability    Code I    Code II    Code III
s0               0.5            0         0          0
s1               0.25           1         10         01
s2               0.125          00        110        011
s3               0.125          11        111        0111

Figure 1: Illustrating the definition of a prefix code

For example, referring to Figure 1, we note the following:

1. Code I violates the Kraft-McMillan inequality; it cannot therefore be a prefix code.

2. The Kraft-McMillan inequality is satisfied by both codes II and III, but only code II is a prefix code.

Given a discrete memoryless source of entropy $H(S)$, a prefix code can be constructed with an average code-word length $\bar{L}$ that is bounded as follows:

$$H(S) \le \bar{L} < H(S) + 1$$
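To make these observations concrete, here is a minimal Python sketch (not part of the original text; the function names are our own) that checks each code of Figure 1 against the Kraft-McMillan inequality and the prefix condition, and compares the average code-word length of code II with the source entropy:

    # Check the three codes of Figure 1 (a sketch, not from the original text).
    from math import log2

    probs = [0.5, 0.25, 0.125, 0.125]  # source statistics p_k
    codes = {
        "I":   ["0", "1", "00", "11"],
        "II":  ["0", "10", "110", "111"],
        "III": ["0", "01", "011", "0111"],
    }

    def kraft_sum(code):
        # Sum of 2^(-l_k); the Kraft-McMillan inequality requires this to be <= 1.
        return sum(2 ** -len(w) for w in code)

    def is_prefix_code(code):
        # A prefix code: no code word is the prefix of any other code word.
        return not any(a != b and b.startswith(a) for a in code for b in code)

    entropy = -sum(p * log2(p) for p in probs)  # H(S) = 1.75 bits

    for name, code in codes.items():
        avg_len = sum(p * len(w) for p, w in zip(probs, code))
        print(f"code {name}: Kraft sum = {kraft_sum(code):.4f}, "
              f"prefix = {is_prefix_code(code)}, average length = {avg_len}")

Running this confirms the observations above: code I has a Kraft sum of 1.5 and so cannot be a prefix code, codes II and III both satisfy the inequality, and only code II is a prefix code. Code II also attains an average length of 1.75 bits, equal to H(S), so it meets the lower bound above with equality.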

Huffman Coding
The basic idea behind Huffman coding is to assign to each symbol of an alphabet a sequence of bits roughly equal in length to the amount of information conveyed by the symbol in question. Specifically, the Huffman encoding algorithm proceeds as follows:

1. The source symbols are listed in order of decreasing probability. The two source symbols of lowest probability are assigned a 0 and a 1. This part of the step is referred to as a splitting stage.

2. These two source symbols are regarded as being combined into a new source symbol with probability equal to the sum of the two original probabilities. The probability of the new symbol is placed in the list in accordance with its value.

3. The procedure is repeated until we are left with a final list of source statistics of only two, for which a 0 and a 1 are assigned.

The code for each source symbol is found by working backward and tracing the sequence of 0s and 1s assigned to that symbol as well as its successors. A Huffman code is shown in Figure 2.
Symbol    Probability    Code word
s0        0.4            00
s1        0.2            10
s2        0.2            11
s3        0.1            010
s4        0.1            011

Figure 2: Huffman coding
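The following Python sketch (not part of the original text) implements the combining procedure described above using a priority queue, applied to the source of Figure 2. The 0/1 assignment at each splitting stage and the placement of tied probabilities are arbitrary, so the exact code words may differ from the figure, but the code-word lengths, and hence the average length, come out the same:

    # Huffman coding via a priority queue (a sketch, not from the original text).
    import heapq
    from itertools import count

    def huffman_code(symbols, probs):
        # Build a dict mapping each symbol to its Huffman code word.
        tiebreak = count()  # keeps heap entries comparable when probabilities tie
        heap = [(p, next(tiebreak), {s: ""}) for s, p in zip(symbols, probs)]
        heapq.heapify(heap)
        while len(heap) > 1:
            # Combine the two least probable entries (step 2), prepending a 0
            # to every code word in one branch and a 1 in the other (step 1).
            p0, _, c0 = heapq.heappop(heap)
            p1, _, c1 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c0.items()}
            merged.update({s: "1" + w for s, w in c1.items()})
            heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
        return heap[0][2]

    symbols = ["s0", "s1", "s2", "s3", "s4"]
    probs = [0.4, 0.2, 0.2, 0.1, 0.1]
    code = huffman_code(symbols, probs)
    print(code)  # code words of lengths 2, 2, 2, 3, 3 (exact bits vary)
    print(sum(p * len(code[s]) for s, p in zip(symbols, probs)))  # 2.2 bits/symbol

The average code-word length of 2.2 bits per symbol matches the code of Figure 2 and lies between the source entropy H(S) of about 2.12 bits and H(S) + 1, as the bound of the previous section requires.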

Lempel-Ziv coding
The drawback of Huffman coding is that it requires knowledge of a probabilistic model of the source, but in practice source statistics are not always known a priori. To overcome this disadvantage of Huffman coding, we use an adaptive coding algorithm that is also simpler than Huffman coding. Basically, encoding in the Lempel-Ziv algorithm is accomplished by parsing the source data stream into segments that are the shortest subsequences not encountered previously. Consider the following example in Figure 3.

Let the input sequence be 000101110010100101... We assume that the bits 0 and 1 are already known and stored in the codebook:

Subsequences stored: 0, 1
Data to be parsed:   000101110010100101...

The shortest subsequence of the data stream encountered for the first time and not seen before is 00:

Subsequences stored: 0, 1, 00
Data to be parsed:   0101110010100101...

The second shortest subsequence not seen before is 01; accordingly, we go on to write:

Subsequences stored: 0, 1, 00, 01
Data to be parsed:   01110010100101...

We continue in this manner until the given data stream has been completely parsed. The resulting code book is shown below:

Numerical positions:         1   2   3     4     5     6     7     8     9
Subsequences:                0   1   00    01    011   10    010   100   101
Numerical representations:           11    12    42    21    41    61    62
Binary encoded blocks:               0010  0011  1001  0100  1000  1100  1101

Figure 3: Lempel-Ziv coding

The last symbol of each subsequence in the code book is an innovation symbol. The last bit of each uniform block of bits in the encoded representation of the data stream represents the innovation symbol for the particular subsequence under consideration. The remaining bits provide the equivalent binary representation of the pointer to the root subsequence that matches the one in question except for the innovation symbol. In contrast to Huffman coding, the Lempel-Ziv algorithm uses fixed-length codes to represent a variable number of source symbols; this feature makes the Lempel-Ziv code suitable for synchronous transmission.
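Here is a minimal Python sketch (not part of the original text; the function names and the ptr_bits parameter are our own) of the parsing and encoding steps just described. Following the example of Figure 3, the bits 0 and 1 are seeded into the codebook in advance, and a 3-bit pointer is used so that each encoded block is 4 bits long; in general the pointer width depends on the final size of the codebook:

    # Lempel-Ziv parsing and fixed-length encoding (a sketch, not from the text).

    def lz_parse(data, seeded=("0", "1")):
        # Parse the stream into the shortest subsequences not seen before.
        book = list(seeded)  # codebook; numerical positions start at 1
        phrase = ""
        for bit in data:
            phrase += bit
            if phrase not in book:  # shortest new subsequence found
                book.append(phrase)
                phrase = ""
        return book

    def lz_encode(book, seeded=2, ptr_bits=3):
        # Each new subsequence becomes one fixed-length block: a ptr_bits-bit
        # pointer to its root subsequence followed by the innovation bit.
        blocks = []
        for phrase in book[seeded:]:
            root, innovation = phrase[:-1], phrase[-1]
            pointer = book.index(root) + 1  # numerical position of the root
            blocks.append(format(pointer, f"0{ptr_bits}b") + innovation)
        return blocks

    def lz_decode(blocks, seeded=("0", "1"), ptr_bits=3):
        # Rebuild the stream: look up each root and append the innovation bit.
        book = list(seeded)
        for block in blocks:
            pointer = int(block[:ptr_bits], 2)
            book.append(book[pointer - 1] + block[ptr_bits:])
        return "".join(book[len(seeded):])

    book = lz_parse("000101110010100101")
    print(book)    # ['0', '1', '00', '01', '011', '10', '010', '100', '101']
    blocks = lz_encode(book)
    print(blocks)  # ['0010', '0011', '1001', '0100', '1000', '1100', '1101']
    print(lz_decode(blocks))  # '000101110010100101', the original stream

Both the parsed subsequences and the encoded blocks reproduce the code book of Figure 3, and decoding the blocks recovers the original data stream.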
