
Lecture 4

Introduction to Digital Signal Processors (DSPs)

Dr. Konstantinos Tatas
ACOE343 - Embedded Real-Time Processor Systems -
Frederick University
Outline/objectives
Identify the most important DSP architecture features and how they relate to DSP applications
Understand the types of code appropriate for DSP implementation
What is a DSP?
A specialized microprocessor for real-time DSP applications:
Digital filtering (FIR and IIR)
FFT
Convolution, matrix multiplication, etc.
[Figure: analog input -> ADC -> digital input -> DSP -> digital output -> DAC -> analog output]
Hardware used in DSP
                    ASIC        FPGA     GPP      DSP
Performance         Very high   High     Medium   Medium-high
Flexibility         Very low    High     High     High
Power consumption   Very low    Low      Medium   Low-medium
Development time    Long        Medium   Short    Short
Common DSP features
Harvard architecture
Dedicated single-cycle Multiply-Accumulate (MAC) instruction (hardware MAC units)
Single-Instruction Multiple Data (SIMD) / Very Long Instruction Word (VLIW) architecture
Pipelining
Saturation arithmetic
Zero overhead looping
Hardware circular addressing
Cache
DMA

Harvard Architecture
Physically separate memories and paths for instructions and data
[Figure: CPU connected by separate paths to a data memory and a program memory]
Single-Cycle MAC unit
[Figure: a multiplier takes a_i and x_i; its product feeds an adder, which adds it to the running sum held in a register]
$\sum_{i=0}^{n} a_i x_i$
Can compute a sum of n products in n cycles
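The sum of products above can be sketched in C as a loop in which each iteration is one multiply-accumulate step. This is only an illustration: the function name `mac_sum`, the 16-bit operand width and the 32-bit accumulator are assumptions, not details from the slides, and the sketch does not guard against accumulator overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products: acc = a[0]*x[0] + ... + a[n-1]*x[n-1].
 * Each loop iteration is one multiply-accumulate (MAC) step. On a DSP
 * with a single-cycle hardware MAC unit, one such step can be issued
 * per cycle, so n products take about n cycles.
 * Assumption: 16-bit operands, 32-bit accumulator, no overflow check. */
int32_t mac_sum(const int16_t *a, const int16_t *x, size_t n)
{
    int32_t acc = 0;                           /* accumulator register */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)x[i];  /* multiply, then accumulate */
    }
    return acc;
}
```

For example, with a = {1, 2, 3} and x = {4, 5, 6}, `mac_sum(a, x, 3)` returns 1*4 + 2*5 + 3*6 = 32.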
Single Instruction - Multiple Data
(SIMD)
A technique for data-level parallelism: a number of processing elements work in parallel, each applying the same instruction to a different data element
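A minimal sketch of the kind of code that suits SIMD, written in plain C. The lane count `LANES` and the function name `vec_add` are hypothetical; the inner loop only models what SIMD hardware would do in one instruction, since every lane performs the same operation on independent data.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* hypothetical number of processing elements */

/* Element-wise add: a[i] = b[i] + c[i].
 * The inner loop applies the same operation to LANES independent
 * elements. SIMD hardware could perform those LANES additions in
 * parallel; here they run sequentially as a model. */
void vec_add(const int16_t *b, const int16_t *c, int16_t *a, size_t n)
{
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            a[i + lane] = (int16_t)(b[i + lane] + c[i + lane]);
        }
    }
    for (; i < n; i++) {  /* leftover elements, one at a time */
        a[i] = (int16_t)(b[i] + c[i]);
    }
}
```

The key property is that no iteration depends on the result of another, which is what lets the processing elements work side by side.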


Very Long Instruction Word (VLIW)
A technique for instruction-level parallelism: instructions with no dependencies on one another (known at compile time) are executed in parallel
Example of a single VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
[Figure: one VLIW instruction with four slots (F=a+b, c=e/g, d=x&y, w=z*h); each slot goes to its own processing unit (PU), with inputs a,b / e,g / x,y / z,h and outputs F, c, d, w]
CISC vs. RISC vs. VLIW

Pipelining
DSPs commonly feature deep pipelines
TMS320C6x processors have 3 pipeline stages, each with a number of phases (cycles):
Fetch: Program Address Generate (PG), Program Address Send (PS), Program Ready Wait (PW), Program Receive (PR)
Decode: Dispatch (DP), Decode (DC)
Execute: 6 to 10 phases
Saturation Arithmetic
Results are confined to a fixed range for operations like addition and multiplication
A result that would overflow is clamped to the maximum allowed value, and one that would underflow is clamped to the minimum allowed value
Associativity and distributivity no longer hold
One-byte signed saturation arithmetic examples (range -128 to 127):
64 + 69 = 127
-127 - 5 = -128
(64 + 70) - 25 = 102, but 64 + (70 - 25) = 109
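One-byte signed saturation can be sketched in C by computing in a wider type and clamping the result. The helper names `sat8`, `sat_add8` and `sat_sub8` are hypothetical; a DSP would do this in hardware rather than with explicit comparisons.

```c
#include <stdint.h>

/* Clamp a wide intermediate result to the signed 8-bit range. */
static int8_t sat8(int v)
{
    if (v > INT8_MAX) return INT8_MAX;  /* overflow  -> 127  */
    if (v < INT8_MIN) return INT8_MIN;  /* underflow -> -128 */
    return (int8_t)v;
}

/* Saturating add and subtract: compute in int, then clamp. */
int8_t sat_add8(int8_t a, int8_t b) { return sat8((int)a + (int)b); }
int8_t sat_sub8(int8_t a, int8_t b) { return sat8((int)a - (int)b); }
```

With these helpers, `sat_add8(64, 69)` gives 127 and `sat_sub8(-127, 5)` gives -128. The loss of associativity shows up as `sat_sub8(sat_add8(64, 70), 25)` giving 102 (the inner sum clamps to 127 first), while `sat_add8(64, sat_sub8(70, 25))` gives 109.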
Examples
Perform the following operations using one-byte saturation arithmetic:
0x77 + 0x99 =
0x4 * 0x42 =
0x3 * 0x51 =
Zero Overhead Looping
Hardware support for loops with a
constant number of iterations using
hardware loop counters and loop buffers
No branching
No loop overhead
No pipeline stalls or branch prediction
No need for loop unrolling
Hardware Circular Addressing
Hardware support for circular buffers: a data structure implementing a fixed-length queue of fixed-size objects, where objects are added at the head of the queue and removed from the tail
Requires at least 2 pointers (head and tail)
Extensively used in digital filtering:
$y[n] = a_0 x[n] + a_1 x[n-1] + \dots + a_k x[n-k]$
[Figure: a four-entry buffer holding x[n], x[n-1], x[n-2], x[n-3], shown in two consecutive cycles (Cycle 1, Cycle 2) with the head and tail pointers marked]
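A software sketch of a circular buffer feeding the filter equation above, under some stated assumptions: the type `fir_state`, the function `fir_step` and the tap count `TAPS` are hypothetical, and because the buffer is always full only a head index is kept (the tail is implicitly the slot after the head). DSP hardware performs the wrap-around in its address generators, so the modulo operations here would cost nothing there.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 4  /* k + 1 coefficients; 4 is an arbitrary choice */

typedef struct {
    int16_t x[TAPS];  /* delay line: the TAPS most recent samples   */
    size_t  head;     /* index of the most recent sample, x[n]      */
} fir_state;          /* zero-initialise before the first call      */

/* Push one sample and return y[n] = a_0 x[n] + ... + a_k x[n-k]. */
int32_t fir_step(fir_state *s, const int16_t coeff[TAPS], int16_t sample)
{
    s->head = (s->head + 1) % TAPS;  /* advance head, wrapping around */
    s->x[s->head] = sample;          /* overwrites the oldest sample  */

    int32_t y = 0;
    size_t idx = s->head;
    for (size_t j = 0; j < TAPS; j++) {
        y += (int32_t)coeff[j] * s->x[idx];  /* a_j * x[n-j]          */
        idx = (idx + TAPS - 1) % TAPS;       /* step back, wrapping   */
    }
    return y;
}
```

No samples are ever copied or shifted: each new input overwrites the oldest slot, and only the index moves.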
Direct Memory Access (DMA)
The feature that allows peripherals to access
main memory without the intervention of the
CPU
Typically, the CPU initiates DMA transfer, does
other operations while the transfer is in
progress, and receives an interrupt from the
DMA controller once the operation is complete.
Can create cache coherency problems (the data
in the cache may be different from the data in
the external memory after DMA)
Requires a DMA controller

Cache memory
Separate instruction and data L1 caches
(Harvard architecture)
Cache coherence protocols required,
since most systems use DMA

DSP vs. Microcontroller
DSP                                     Microcontroller
Harvard architecture                    Mostly von Neumann architecture
VLIW/SIMD (parallel execution units)    Single execution unit
No bit-level operations                 Flexible bit-level operations
Hardware MACs                           No hardware MACs
DSP applications                        Control applications
Examples
Estimate how long the following code fragment will take to execute on:
A general-purpose processor with a 1 GHz operating frequency, five-stage pipelining, 5 cycles required for multiplication and 1 cycle for addition
A DSP running at 500 MHz with zero overhead looping, 6 independent ALUs and 2 independent single-cycle MAC units

for (i = 0; i < 8; i++)
{
    a[i] = 2*i + 3;
    b[i] = 3*i + 5;
}
Review Questions
Which of the following code fragments is appropriate for SIMD implementation?
Fragment 1:
a[0]=b[0]+c[0];
a[2]=b[2]+c[2];
a[4]=b[4]+c[4];
a[6]=b[6]+c[6];
Fragment 2:
a[0]=b[0]&c[0];
a[0]=b[0]%c[0];
a[0]=b[0]+c[0];
a[0]=b[0]/c[0];
Can the following instructions be merged into one VLIW instruction? If not, how many VLIW instructions are needed?
a=b+c;
d=c/e;
f=d&a;
g=b%c;
Review Questions
Which of the following is not a typical DSP
feature?
Dedicated multiplier/MAC
Von Neumann memory architecture
Pipelining
Saturation arithmetic
Which implementation would you choose for
lowest power consumption?
ASIC
FPGA
General-Purpose Processor
DSP
Examples
How many VLIW instructions does the following program fragment require if there are two independent data paths (a, b), with 3 ALUs and 1 MAC available in each, and 8 instructions per word? How many cycles will it take to execute if these are the first instructions in the program and all instructions require 1 cycle, assuming the pipeline architecture from the Pipelining slide with 6 phases of execution?
ADD a1,a2,a3 ;a3 = a1+a2
SUB b1,b3,b4 ;b4 = b1-b3
MUL a2,a3,a5 ;a5 = a2*a3
MUL b3,b4,b2 ;b2 = b3*b4
AND a7,a0,a1 ;a1 = a7 AND a0
MUL a3,a4,a5 ;a5 = a3*a4
OR a6,a3,a2 ;a2 = a6 OR a3


References
R. Chassaing, DSP Applications Using C and the TMS320C6x DSK, Wiley, 2002
Texas Instruments, TMS320C64x datasheets
Analog Devices, ADSP-21xx Processors
