Lecture # 04
<rehan.hafiz@seecs.edu.pk>
Course Information
Course Website: http://lms.nust.edu.pk/
Slides from the Advanced Digital System Design (Fall 2011) course: http://www.scribd.com/collections/3409162/Digital-System-Design-Lectures
Acknowledgement: Material from the following sources has been consulted/used in these slides:
1. [SHO] Digital Design of Signal Processing Systems, Dr Shoab A Khan
2. [SAM] Samir Palnitkar, Verilog HDL, Prentice Hall, ISBN: 0130449113, Latest Edition
3. [STV] Advanced FPGA Design, Steve Kilts
4. [PAR] VLSI Signal Processing Systems, Parhi
Material/slides from this lecture can be used with the following citing reference: Dr. Rehan Hafiz: Advanced Digital System Design 2012. Creative Commons Attribution-ShareAlike 3.0 Unported License.
Tuesday (1730-1920), Thursday (1830-1920) By appointment/Email VISpro Lab above SEECS Library
Introduction: Course Overview, Design Space Exploration, Digital Design Methodology
Understanding FPGAs (Xilinx FPGA Architecture)
Verilog Introduction: Combinational Building Blocks in Verilog
Sequential Common Structures in Verilog (LFSR/CRC + Counters + RAMs)
Synthesis of Blocking/Non-Blocking Statements
Design Partitioning & Micro Architectures
Controllers, Micro-Coded Controllers
Understanding Throughput, Latency & Timing; Architecting Speed/Area in Digital System Design
Representation of Non Recursive DFGs & Optimizations for Non Recursive DFGs
FIR Implementations + Pipelining & Parallelism in Non Recursive DFGs
Cross-Clock Domain Issues & RESET Circuits
Arithmetic Operations: Review of Fixed Point Representation
Adders & Fast Adders, Multi-Operand Addition
Multiplication, Multiplication by Constants + BOOTH Multipliers
CORDIC (sine, cosine, magnitude, division, etc.)
CORDIC Implementation in HW
DFG Representation of Recursive DSP Algorithms
Iteration Bound
Retiming, Unfolding, Look-Ahead Transformations
Hybrid Architectures / Kahn Process Networks
ASM/FSM - Review
ASM - Elements
State box: Represents a state. Equivalent to a node in a state diagram or a row in a state table. Contains the state name and register transfer actions or output signals. Moore-type outputs are listed inside the box.
Decision box: Contains a condition expression. Indicates that a given condition is to be tested and the exit path (0 = false, 1 = true) is to be chosen accordingly.
Conditional output box: Denotes output signals that are of the Mealy type. The condition that determines whether such outputs are generated is specified in the decision box.
Up/Down Counter
[CIL]
Implicit Coding: Up/Down Counter
Moore Machine
Output is a function of the present state only
May have more states
Synchronous outputs
No glitching
One cycle delay
Full cycle of stable output
Mealy Machine
Output is a function of both the present state & the input
May have fewer states
Asynchronous outputs
If the input glitches, so does the output
Output immediately available
Output may not be stable long enough to be useful
The choice between Mealy and Moore machine implementations is usually left to the designer.
When some of the inputs are expected to glitch and the outputs are required to be stable for one complete cycle, Moore is the best choice [SHO].
MOORE Equivalent
    // This module implements an FSM for the detection of four ones
    // in a serial input stream of data
    module fsm_mealy(
        input      clk,           // system clock
        input      reset,         // system reset
        input      data_in,       // 1-bit input stream
        output reg four_ones_det  // 1-bit output to indicate 4 ones are detected or not
    );

    // Internal variables
    reg [1:0] current_state,  // current state register
              next_state;     // next state

    // State tags assigned using binary encoding
    parameter STATE_0 = 2'b00,
              STATE_1 = 2'b01,
              STATE_2 = 2'b10,
              STATE_3 = 2'b11;

    // This block implements the combinational cloud of
    // next state assignment logic
    always @(*)
    begin
        case(current_state)
        STATE_0:
            begin
                if(data_in) next_state = STATE_1;
                else        next_state = STATE_0;
            end
        STATE_1:
            begin
                if(data_in) next_state = STATE_2;
                else        next_state = STATE_1;
            end
        STATE_2:
            begin
                if(data_in) next_state = STATE_3;
                else        next_state = STATE_2;
            end
        STATE_3:
            begin
                if(data_in) next_state = STATE_0;
                else        next_state = STATE_3;
            end
        endcase
    end

    // State register block (the slide text elides the two assignments;
    // reset to STATE_0 and loading next_state are filled in from context)
    always @(posedge clk)
    begin
        if(reset) current_state <= STATE_0;
        else      current_state <= next_state;
    end

    // This block implements the combinational cloud of
    // output assignment logic
    always @(*)
    begin
        case(current_state)
        STATE_0: four_ones_det = 1'b0;
        STATE_1: four_ones_det = 1'b0;
        STATE_2: four_ones_det = 1'b0;
        STATE_3:
            begin
                if(data_in) four_ones_det = 1'b1;
                else        four_ones_det = 1'b0;
            end
        endcase
    end
    endmodule
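For comparison with the Mealy detector, the following is a minimal sketch of what a Moore equivalent could look like. It is not the version from the slides (that figure did not survive); the module name fsm_moore, the state names S0 to S4 and the 3-bit encoding are assumptions. The extra state S4 exists only to hold the output high for a full cycle, which illustrates the "may have more states" and "one cycle delay" entries in the comparison.

```verilog
// Moore-style sketch of the four-ones detector (hypothetical names).
// The output depends on the current state only, so it appears one
// clock cycle after the fourth 1 has been sampled.
module fsm_moore (
    input  wire clk,
    input  wire reset,         // synchronous reset, as in the Mealy version
    input  wire data_in,       // 1-bit input stream
    output reg  four_ones_det  // high for one full cycle after the 4th one
);
    // S0..S3 count the ones seen so far; S4 is the added "detected" state.
    localparam S0 = 3'd0, S1 = 3'd1, S2 = 3'd2, S3 = 3'd3, S4 = 3'd4;

    reg [2:0] current_state, next_state;

    // Next-state logic (combinational)
    always @(*) begin
        case (current_state)
            S0:      next_state = data_in ? S1 : S0;
            S1:      next_state = data_in ? S2 : S1;
            S2:      next_state = data_in ? S3 : S2;
            S3:      next_state = data_in ? S4 : S3;
            S4:      next_state = data_in ? S1 : S0; // S4 also acts as "zero ones seen"
            default: next_state = S0;
        endcase
    end

    // State register (sequential)
    always @(posedge clk) begin
        if (reset) current_state <= S0;
        else       current_state <= next_state;
    end

    // Output logic: a function of the state only (Moore)
    always @(*) begin
        four_ones_det = (current_state == S4);
    end
endmodule
```

Because the output is decoded from the state register alone, a glitch on data_in between clock edges cannot reach four_ones_det.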
Binary coding: Binary-coded counter sequences often change multiple bits on one count transition.
One-hot coding: A sequence can be defined using a simple shift register. Although a one-hot state machine results in simple logic for state transitions, it requires N flip-flops as compared to log2(N) in a binary-coded design. The latter requires fewer flip-flops to encode states, but the logic that decodes the states and generates the next states is more complex.
Gray coding: Each state in the state machine is assigned using Gray coding, so that only one bit changes at a time.
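The three encoding schemes can be sketched for a four-state machine as follows. This is only an illustration: the module names and state tags are hypothetical, and the Gray sequence assumes the states are visited in the order S0, S1, S2, S3, S0.

```verilog
// State tags for a 4-state machine under the three encodings
// (all names here are illustrative assumptions).
module encoding_examples;
    // Binary: 2 flip-flops; the S1 -> S2 step changes two bits (01 -> 10)
    localparam [1:0] BIN_S0 = 2'b00, BIN_S1 = 2'b01,
                     BIN_S2 = 2'b10, BIN_S3 = 2'b11;

    // Gray: 2 flip-flops; exactly one bit changes on every step,
    // including the wrap from S3 (10) back to S0 (00)
    localparam [1:0] GRAY_S0 = 2'b00, GRAY_S1 = 2'b01,
                     GRAY_S2 = 2'b11, GRAY_S3 = 2'b10;

    // One-hot: 4 flip-flops, one per state
    localparam [3:0] OH_S0 = 4'b0001, OH_S1 = 4'b0010,
                     OH_S2 = 4'b0100, OH_S3 = 4'b1000;
endmodule

// A one-hot sequence generated with a simple shift register:
// the single 1 rotates left by one position on each clock.
module one_hot_sequencer (
    input  wire       clk,
    input  wire       reset,
    output reg  [3:0] state
);
    always @(posedge clk) begin
        if (reset) state <= 4'b0001;                // start in S0
        else       state <= {state[2:0], state[3]}; // rotate left
    end
endmodule
```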
GCD - Algorithm
Steps:
(s) A = 100, B = 60
(c) B != 0
(p) A = 40, B = 60
(s) A = 60, B = 40
(c) B != 0
(p) A = 20, B = 40
(s) A = 40, B = 20
(c) B != 0
(p) A = 20, B = 20
(s) A = 20, B = 20
(c) B != 0
(p) A = 0, B = 20
(s) A = 20, B = 0
(c) B != 0
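    // The slide shows only the loop body. The module header and the
    // declarations below are assumptions added to make the fragment complete.
    module gcd_behav #(parameter W = 16) (
        input      [W-1:0] in_a, in_b,
        output reg [W-1:0] Y
    );
        reg [W-1:0] A, B, swap;
        reg         done;

        always @(*) begin
            done = 0;
            A = in_a;
            B = in_b;
            while ( !done ) begin
                if ( A < B ) begin        // keep the larger value in A
                    swap = A;
                    A = B;
                    B = swap;
                end
                else if ( B != 0 )
                    A = A - B;
                else
                    done = 1;
            end
            Y = A;
        end
    endmodule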
Software is different from hardware. In software, a function is executed when it is called; hardware is always there. How do we tell the hardware that its inputs are ready, and how does it tell us that a particular function is done? In software, the program counter (PC) simply manages the control flow of the program. In hardware, control has to be enforced through control signals.
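One common way to answer both questions is a start/done handshake. The following is a minimal sketch under that assumption, not the design from the referenced MIT slides: the module name gcd_seq and the signals start, busy and done are hypothetical. Each clock cycle performs one step of the algorithm (swap, subtract, or finish).

```verilog
// Sequential GCD with a simple start/done handshake (illustrative sketch).
module gcd_seq #(parameter W = 16) (
    input  wire         clk,
    input  wire         reset,   // synchronous reset
    input  wire         start,   // pulse high for one cycle when in_a/in_b are valid
    input  wire [W-1:0] in_a,
    input  wire [W-1:0] in_b,
    output reg          done,    // high while y holds a valid result
    output reg  [W-1:0] y
);
    reg [W-1:0] a, b;
    reg         busy;            // high while a computation is in progress

    always @(posedge clk) begin
        if (reset) begin
            busy <= 1'b0;
            done <= 1'b0;
        end else if (start && !busy) begin
            // Inputs are ready: capture them and begin iterating
            a    <= in_a;
            b    <= in_b;
            busy <= 1'b1;
            done <= 1'b0;
        end else if (busy) begin
            if (a < b) begin            // swap so that a >= b
                a <= b;
                b <= a;
            end else if (b != 0) begin  // subtract
                a <= a - b;
            end else begin              // b == 0: a is the GCD
                y    <= a;
                busy <= 1'b0;
                done <= 1'b1;
            end
        end
    end
endmodule
```

The done flag stays high until the next start pulse, so the consumer does not need to catch a single-cycle strobe.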
Reference Slides
Slides from MIT Course 6.375 Complex Digital Systems http://csg.csail.mit.edu/6.375/
Slides 11-28
Summary
Define a higher-level block diagram
Define its interface
Decompose into smaller blocks if required
Decompose into Datapath & Controller
Use different modules to implement Datapath & Controller
Define their interface
Design Partitioning
Data path: The pipe that carries the data from the input of the design to the output and performs the necessary operations on the data. Contains ALUs, storage registers & logic for moving data.
Controller: Determines the sequence of operations. Configures the data path for various operations.
Data path and control blocks should be partitioned into different modules:
Allows module re-use
Allows controller updates without requiring an update to the datapath
Keeps the critical timing in the datapath
Logic systems consist of two basic elements:
Control logic consists of state machines (FSM)
Datapath logic consists of functions like counters, arithmetic, multiplexers, decoders and memory (wired, connected datapaths)
Guidelines - Summary
Datapath and control parts have different design objectives, so keep them in different blocks! The datapath is usually synthesized for better timing; the controller is synthesized to take minimum area.
FSM coding: one always block implements the sequential part that assigns the next state to the state register, a second block implements the combinational logic that computes the next state, and a further always block computes the outputs.
State Encoding
Use meaningful tags, defined with `define or parameter statements, for all possible states.
Select the best encoding scheme.
Roadmap
For each module, extract the Datapath
Draw the micro-architecture of your Datapath
Identify the control signals
Don't worry about the Controller at this stage
For the Controller: identify the states, identify what you need to do in each state, and identify when the states transition
Make different modules for Datapath & Controller and instantiate them in a single module
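The roadmap can be sketched on the GCD algorithm. This is only an illustrative partitioning under assumed names (gcd_datapath, gcd_controller, gcd_top and all port names are hypothetical): the datapath holds the registers and arithmetic, the controller is a small FSM that sees only status bits, and a top module instantiates both.

```verilog
// Datapath: registers and arithmetic only, steered by control signals.
module gcd_datapath #(parameter W = 16) (
    input  wire         clk,
    input  wire         load,    // control: capture the operands
    input  wire         swap,    // control: exchange a and b
    input  wire         sub,     // control: a <= a - b
    input  wire [W-1:0] in_a,
    input  wire [W-1:0] in_b,
    output wire         a_lt_b,  // status to the controller
    output wire         b_zero,  // status to the controller
    output wire [W-1:0] y
);
    reg [W-1:0] a, b;
    assign a_lt_b = (a < b);
    assign b_zero = (b == 0);
    assign y      = a;
    always @(posedge clk) begin
        if (load)      begin a <= in_a; b <= in_b; end
        else if (swap) begin a <= b;    b <= a;    end
        else if (sub)  a <= a - b;
    end
endmodule

// Controller: FSM coded with separate next-state and state-register blocks.
module gcd_controller (
    input  wire clk, reset, start,
    input  wire a_lt_b, b_zero,
    output wire load, swap, sub, done
);
    localparam IDLE = 2'd0, RUN = 2'd1, FINISH = 2'd2;
    reg [1:0] current_state, next_state;

    always @(*) begin
        next_state = current_state;
        case (current_state)
            IDLE:    if (start)  next_state = RUN;
            RUN:     if (b_zero) next_state = FINISH;
            FINISH:  if (start)  next_state = RUN;
            default:             next_state = IDLE;
        endcase
    end

    always @(posedge clk) begin
        if (reset) current_state <= IDLE;
        else       current_state <= next_state;
    end

    // Output logic: control signals for the datapath
    assign load = start && (current_state != RUN);
    assign swap = (current_state == RUN) && a_lt_b;
    assign sub  = (current_state == RUN) && !a_lt_b && !b_zero;
    assign done = (current_state == FINISH);
endmodule

// Top level: instantiates datapath and controller and wires them together.
module gcd_top #(parameter W = 16) (
    input  wire         clk, reset, start,
    input  wire [W-1:0] in_a, in_b,
    output wire         done,
    output wire [W-1:0] y
);
    wire load, swap, sub, a_lt_b, b_zero;
    gcd_datapath #(.W(W)) dp (
        .clk(clk), .load(load), .swap(swap), .sub(sub),
        .in_a(in_a), .in_b(in_b),
        .a_lt_b(a_lt_b), .b_zero(b_zero), .y(y));
    gcd_controller ctrl (
        .clk(clk), .reset(reset), .start(start),
        .a_lt_b(a_lt_b), .b_zero(b_zero),
        .load(load), .swap(swap), .sub(sub), .done(done));
endmodule
```

With this split, the controller could be replaced (for example by a micro-coded one) without touching gcd_datapath, as long as the control and status interface stays the same.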
Lecture Outline
Micro Programmed Mealy Machine
Micro Programmed Moore Machine
Generic Micro Coded Architecture
In hardwired state-machine-based designs, the controller is implemented as a Mealy or Moore finite state machine (FSM). This makes the design rigid:
You can never update it without re-programming the FPGA.
What else can you do? ASIPs.
How?
Idea
We do NOT implement the logic for the next state. We simply store the outputs and the next state for the current state in a memory, just like a lookup table.
The combinational logic is replaced by a sequence of control signals that are stored in program memory (PM)
The PM may be a read-only memory (ROM) or a random-access memory (RAM). The address of the contents in the memory is determined by the current state and the inputs to the FSM.
General Architecture
The designer evaluates all possible state transitions based on inputs and the current state and tabulates the outputs and next states as micro coding for PM.
These values are placed in the PM such that the inputs and the current state provide the index or address to the PM.
Example (MEALY)
Verilog Code
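The code on the original slides did not survive extraction. As a stand-in, the following sketch applies the idea to the four-ones Mealy detector from earlier in the lecture. The module name, the word layout {next_state, output} and the use of an initial block to fill the program memory are assumptions, not the original listing.

```verilog
// Micro-programmed Mealy machine: the next-state and output logic are
// replaced by a lookup in program memory (PM).
// PM address = {current_state, data_in}
// PM word    = {next_state[1:0], four_ones_det}
module fsm_mealy_microcoded (
    input  wire clk,
    input  wire reset,
    input  wire data_in,
    output wire four_ones_det
);
    reg  [1:0] current_state;
    reg  [2:0] pm [0:7];                          // 8 words of 3 bits
    wire [2:0] pm_word = pm[{current_state, data_in}];

    // Micro-code tabulated from the state transitions of the detector
    initial begin
        pm[3'b000] = 3'b00_0;  // state 0, in 0 -> state 0, out 0
        pm[3'b001] = 3'b01_0;  // state 0, in 1 -> state 1, out 0
        pm[3'b010] = 3'b01_0;  // state 1, in 0 -> state 1, out 0
        pm[3'b011] = 3'b10_0;  // state 1, in 1 -> state 2, out 0
        pm[3'b100] = 3'b10_0;  // state 2, in 0 -> state 2, out 0
        pm[3'b101] = 3'b11_0;  // state 2, in 1 -> state 3, out 0
        pm[3'b110] = 3'b11_0;  // state 3, in 0 -> state 3, out 0
        pm[3'b111] = 3'b00_1;  // state 3, in 1 -> state 0, out 1
    end

    // Mealy output: read directly from the PM word
    assign four_ones_det = pm_word[0];

    // The state register is the only sequential element
    always @(posedge clk) begin
        if (reset) current_state <= 2'b00;
        else       current_state <= pm_word[2:1];
    end
endmodule
```

Changing the behaviour of the machine now means changing the PM contents rather than the logic.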
Example
Remember the difference between a microprocessor and these micro-programmed state machines for the upcoming slides.
Now that we have counter-based micro-code, how can we incorporate jumps?
State machines also have jumps, and may also have explicit jumps decided at runtime!
The controller should be capable of jumping and starting to generate control signals from a new address in the PM.
Variations
Loadable Counter based State Machines with Conditional Branch Support
Algorithms may require conditional jump support, for example as a result of some ALU operation. Some sort of Status and Control Register (SCR) may be used.
We increase the number of load bits to have a programmable way to test various options from the available status bits.
A parity bit is sometimes added to test for the false condition as well. Again, this helps in keeping the datapath as independent as possible. It allows us to branch on both true and false states, and it is programmable.
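The next-address logic described above could be sketched as follows. This is an assumption-laden illustration, not the architecture from the slides: the field names branch_en, cond_sel, parity and target, and the choice of four status bits, are hypothetical. The fields are assumed to come from the current PM word, and the status bits from the SCR.

```verilog
// Loadable program counter with programmable conditional branch
// (illustrative sketch; all field names and widths are assumptions).
module branch_sequencer #(parameter AW = 8) (
    input  wire          clk,
    input  wire          reset,
    input  wire          branch_en, // PM field: this micro-instruction may branch
    input  wire [1:0]    cond_sel,  // PM field: which status bit to test
    input  wire          parity,    // PM field: branch when the bit equals this value
    input  wire [AW-1:0] target,    // PM field: branch target address
    input  wire [3:0]    status,    // status bits, e.g. from the SCR
    output reg  [AW-1:0] pc         // address into the program memory
);
    // Branch is taken when the selected status bit matches the parity bit,
    // so both "branch if true" and "branch if false" are programmable.
    wire take_branch = branch_en && (status[cond_sel] == parity);

    always @(posedge clk) begin
        if (reset)            pc <= 0;
        else if (take_branch) pc <= target;     // load the counter
        else                  pc <= pc + 1'b1;  // default: next micro-instruction
    end
endmodule
```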
PC ADDr
On CALL, write is enabled to save the return (RET) address. The correct LIFO address is generated based upon the MUX value (write_lifo_addr); a simple increment is fine for stack addressing.
read_lifo_addr points to the top of the stack.
write_lifo_addr points to top + 1 of the stack.
No error handling is assumed.
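A minimal sketch of such a return-address LIFO is given below. The names read_lifo_addr and write_lifo_addr follow the text; the module name, the call and ret inputs, the stack depth and the choice of PC + 1 as the return address are assumptions. As in the text, there is no overflow or underflow handling.

```verilog
// Return-address stack for subroutine CALL/RET (illustrative sketch).
module subroutine_stack #(parameter AW = 8, parameter DEPTH_LOG2 = 2) (
    input  wire          clk,
    input  wire          reset,
    input  wire          call,     // current micro-instruction is a CALL
    input  wire          ret,      // current micro-instruction is a RET
    input  wire [AW-1:0] pc,       // address of the current micro-instruction
    output wire [AW-1:0] ret_addr  // address to resume from on RET
);
    reg  [AW-1:0]         lifo [0:(1<<DEPTH_LOG2)-1];
    reg  [DEPTH_LOG2-1:0] write_lifo_addr;                        // top + 1
    wire [DEPTH_LOG2-1:0] read_lifo_addr = write_lifo_addr - 1'b1; // top

    assign ret_addr = lifo[read_lifo_addr];

    always @(posedge clk) begin
        if (reset) begin
            write_lifo_addr <= 0;
        end else if (call) begin
            lifo[write_lifo_addr] <= pc + 1'b1;        // save the return address
            write_lifo_addr <= write_lifo_addr + 1'b1; // push
        end else if (ret) begin
            write_lifo_addr <= write_lifo_addr - 1'b1; // pop
        end
    end
endmodule
```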
Complete System !
[Figure: state sequences for the example. First version: State 1 to State 7, with "Start Processing: repeat States 5 and 6, 256 times (convolve filter with data at location x; x++); End". Second version: State 1 (Reset, Wait for Data), States 2 to 7 with added intermediate States 3.5 and 6.5.]
Need nested LOOP support! Imagine doing this for a hard-wired state machine!
Consider a LOOP instruction. We need a counter now: the loop counter loads its value on the loop command.
End address of a loop instance reached: why do we need this?
To get the start-of-loop address: get the address from the LOOP ADDRESS STACK.
To proceed further after the LOOP_end address (for example, after 100 in this case): get the address from PC+1.
Loop ends when PC == Loop End Address and Loop Counter == 0 (it is the only time when you need to update the loop register).
The stacks need the same global address logic controller! Why?
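The loop mechanism can be sketched for a single loop level as follows. This is a simplification under stated assumptions: the loop start address is kept in one register instead of a loop address stack (so nesting is not supported), the loop count is assumed to be loaded as "iterations minus one", and all names are hypothetical.

```verilog
// Program counter with single-level hardware loop support
// (illustrative sketch; a nested version would replace the three loop
// registers with stacks, as described in the text).
module loop_sequencer #(parameter AW = 8, parameter CW = 8) (
    input  wire          clk,
    input  wire          reset,
    input  wire          loop_cmd,    // current micro-instruction is LOOP
    input  wire [CW-1:0] loop_count,  // number of iterations minus one
    input  wire [AW-1:0] loop_end,    // address of the last instruction in the body
    output reg  [AW-1:0] pc
);
    reg [AW-1:0] loop_start_r;  // first instruction of the loop body
    reg [AW-1:0] loop_end_r;
    reg [CW-1:0] loop_cnt;
    reg          in_loop;

    wire at_end = in_loop && (pc == loop_end_r);

    always @(posedge clk) begin
        if (reset) begin
            pc      <= 0;
            in_loop <= 1'b0;
        end else if (loop_cmd) begin
            // Loop command: load the counter and remember where the body is
            loop_start_r <= pc + 1'b1;
            loop_end_r   <= loop_end;
            loop_cnt     <= loop_count;
            in_loop      <= 1'b1;
            pc           <= pc + 1'b1;
        end else if (at_end && loop_cnt != 0) begin
            // End address reached, iterations remain: jump back to the start
            loop_cnt <= loop_cnt - 1'b1;
            pc       <= loop_start_r;
        end else begin
            // Either a normal instruction, or the loop has ended
            // (PC == loop end address and loop counter == 0): go to PC + 1
            if (at_end) in_loop <= 1'b0;
            pc <= pc + 1'b1;
        end
    end
endmodule
```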
Complete System !
LOOP & Subroutine Address Stack
Block-based exhaustive motion estimation searches for a block in the whole image and computes some similarity measure, e.g. the Sum of Absolute Differences (SAD).
Raster Scanning
[SHO]Fig 10.22
Algo:
    i = tx, j = ty
    ... j = 0 : (255 - (N-1))
        ...
    End
    If (SAD(i,j) ...
    End
    End
From where shall I start? Follow a top-down hierarchical model with iterative refinement.
1. Define the interface with the external world (other components, memory, etc.). The way your memory is arranged can be tricky, but again we identify it incrementally.
2.
3. Define major functional blocks and reiterate Steps 1-3 for each of them until you constitute your complete data path.
Specification
Block-based Raster Machine that is able to perform SAD on a block of data.
We wish to have the flexibility to change the raster scan direction!
The fun part: let's start the design right now. Divide & conquer.
[SHO] Fig 10.22
Functional Description
Describe what your block shall do:
Shall read an image and a reference block, both from memory, raster scan the target image completely, and report the x, y for the lowest computed SAD.
Any particular specification:
The customer wants it programmable (micro-coded) so that they may change the raster style/direction and starting position in future.
Study your algorithm and dig deeper into the kind of micro-coded controller required for the design:
Requires four nested loops, so you need a controller with nested loop support.
Shall require some ALU to perform the real data crunching.
Requires a register file (local storage) to store data read from the memory.
Needs a lot of address logic to generate the right address depending upon the current state.
Needs to store tx, ty.
How can you implement, for a square image:
A Row Major to Linear Address Mapper
A Linear to Row-Major Mapper
Solution: Concatenation & De-Concatenation!
C = Number of Columns. Suppose your loop is over i, j, where i is the loop index for the current row and j is the loop index for the current column.

Image (4 x 4):
    a b c d
    e f g h
    i j k l
    m n o p

Row  Col  (Row * C) + Col  Address  Data
0    0    0                [0000]   a
0    1    1                [0001]   b
0    2    2                [0010]   c
0    3    3                [0011]   d
1    0    4                [0100]   e
1    1    5                [0101]   f
1    2    6                [0110]   g
1    3    7                [0111]   h
2    0    8                [1000]   i
2    1    9                [1001]   j
2    2    10               [1010]   k
2    3    11               [1011]   l
3    0    12               [1100]   m
3    1    13               [1101]   n
3    2    14               [1110]   o
3    3    15               [1111]   p
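When the number of columns C is a power of two, the multiplication in (Row * C) + Col reduces to wiring. A minimal sketch, with hypothetical module and port names, and with the widths RW and COLW as assumed parameters (2 bits each for the 4 x 4 example, where C = 4):

```verilog
// Row-major <-> linear address mapping by concatenation.
// Valid when the number of columns is exactly 2**COLW, so that
// {row, col} equals (row * C) + col.
module addr_mapper #(parameter RW = 2, parameter COLW = 2) (
    // Row-major to linear
    input  wire [RW-1:0]      row,
    input  wire [COLW-1:0]    col,
    output wire [RW+COLW-1:0] linear,
    // Linear to row-major
    input  wire [RW+COLW-1:0] linear_in,
    output wire [RW-1:0]      row_out,
    output wire [COLW-1:0]    col_out
);
    assign linear             = {row, col};  // concatenation
    assign {row_out, col_out} = linear_in;   // de-concatenation
endmodule
```

For example, row = 2'b10 and col = 2'b01 give linear = 4'b1001 (address 9, data j in the table).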
Reading a 2 x 2 block from the image:

Row  Col  (Row * C) + Col  Data
0    0    0                a
0    1    1                b
1    0    4                e
1    1    5                f

To generate this address we can use two cascaded counters, Count_R and Count_C.
Assume we want to read a block of size N x N.
When Count_C equals a predefined number (N), Count_R is incremented.
When Count_R equals a predefined number (N), all the required data has been accessed.
The data width for R & C depends upon the complete image size. Thus, for a 256x256 image the linear addresses for this subset shall be:
{00000000,00000000}
{00000000,00000001}
{00000001,00000000}
{00000001,00000001}
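The cascaded counters and the concatenated address can be sketched together as follows. The counter names follow the text; the module name, the enable and done signals, and the parameters are assumptions. The counters here are zero-based, so they wrap at N-1 rather than at N.

```verilog
// Block address generator: two cascaded counters whose concatenation
// forms the linear address of an N x N block starting at (0, 0).
module block_addr_gen #(parameter AW = 8, parameter N = 2) (
    input  wire            clk,
    input  wire            reset,
    input  wire            enable,  // advance to the next element
    output wire [2*AW-1:0] addr,    // {Count_R, Count_C}
    output reg             done     // all N*N addresses have been generated
);
    reg [AW-1:0] count_r, count_c;

    assign addr = {count_r, count_c};

    always @(posedge clk) begin
        if (reset) begin
            count_r <= 0;
            count_c <= 0;
            done    <= 1'b0;
        end else if (enable && !done) begin
            if (count_c == N-1) begin
                count_c <= 0;                       // column counter wraps
                if (count_r == N-1) done <= 1'b1;   // last row finished
                else count_r <= count_r + 1'b1;     // cascade into the row counter
            end else begin
                count_c <= count_c + 1'b1;
            end
        end
    end
endmodule
```

With AW = 8 and N = 2 the generated sequence is {0,0}, {0,1}, {1,0}, {1,1}, matching the four addresses listed for the 256x256 image.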
Algo:
    i = tx, j = ty
    For k = 0 : (N-1)
        For l = 0 : (N-1)
            Temp = |S(k+i, l+j) - R(k, l)|
            SAD(i,j) = SAD(i,j) + Temp
        End
    End
Once the blocks are loaded it requires a simple one-to-one mapping (address generation) for ALU (SAD Block) !
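The inner computation can be sketched as a small accumulator that receives one target pixel and one reference pixel per clock. The module and port names, the 8-bit pixel width and the 16-bit accumulator are assumptions (16 bits are enough for blocks up to 16 x 16 with 8-bit pixels).

```verilog
// SAD accumulator: adds |s_pixel - r_pixel| to a running sum each
// enabled clock cycle (illustrative sketch).
module sad_accumulator #(parameter DW = 8, parameter SW = 16) (
    input  wire          clk,
    input  wire          clear,    // start a new block: SAD = 0
    input  wire          enable,   // a valid pixel pair is present
    input  wire [DW-1:0] s_pixel,  // S(k+i, l+j), target block pixel
    input  wire [DW-1:0] r_pixel,  // R(k, l), reference block pixel
    output reg  [SW-1:0] sad
);
    // Absolute difference of two unsigned values without a signed subtract
    wire [DW-1:0] abs_diff = (s_pixel > r_pixel) ? (s_pixel - r_pixel)
                                                 : (r_pixel - s_pixel);

    always @(posedge clk) begin
        if (clear)       sad <= 0;
        else if (enable) sad <= sad + abs_diff;
    end
endmodule
```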
Raster Scanning
[Figure: raster scan over an example 5 x 5 image]
    1 2 3 4 5
    5 3 7 3 1
    3 7 4 5 6
    7 5 3 2 1
    1 2 3 4 5
Rastering efficiently
We may avoid reloading the already loaded values from memory.
Controller: Micro-coded, with nested loop support (4 levels).
ALU: Gets two operands on every cycle when enabled. Keeps on comparing and recording the SAD, TX & TY values with the minimum SAD. Requires: SAD unit, accumulator unit, comparator & registers. Understands a DNE instruction to compare SAD.
Reference Register File (RRF): Stores an N x N block. During loading & processing the address is a simple counter increment. Clocked.
Target Register File (TRF): Stores an N x N target block. Once each block is processed, a single row/column is loaded based upon the RASTER direction. Has the functionality to SHIFT the N x N block of data in any direction. Clocked.
Memory Controller (MC) (simplification): Gets the address from the AG & supplies the corresponding value to the RRF & TRF.
Address Generator (AG): Generates the addresses based upon a particular state. Maintains Tx & Ty.
Load REF block (starting address 0000h assumed): needs cascaded counters for row-major to linear conversion. The counter bits to be used for addressing depend upon the image size.
Load TARGET block (starting address 0000h assumed): depends upon the value of Tx & Ty. A counter similar to the above can be used.
Generate address for the extra single row/column: depends on the RASTER direction. Needs a counter that counts from 0 to (N-1).
Ref RAM
Target RAM
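The "compare and record" part of the ALU could look like the following sketch. All names are hypothetical; the compare input stands in for whatever the controller asserts when a block's SAD is complete (the DNE instruction in the text), and the register widths are assumptions.

```verilog
// Minimum-SAD tracker: remembers the smallest SAD seen so far and the
// (tx, ty) position at which it occurred (illustrative sketch).
module min_sad_tracker #(parameter SW = 16, parameter AW = 8) (
    input  wire          clk,
    input  wire          reset,
    input  wire          compare,  // the SAD for the current block is complete
    input  wire [SW-1:0] sad,
    input  wire [AW-1:0] tx,
    input  wire [AW-1:0] ty,
    output reg  [SW-1:0] min_sad,
    output reg  [AW-1:0] best_tx,
    output reg  [AW-1:0] best_ty
);
    always @(posedge clk) begin
        if (reset) begin
            min_sad <= {SW{1'b1}};  // largest value, so any real SAD can replace it
            best_tx <= 0;
            best_ty <= 0;
        end else if (compare && (sad < min_sad)) begin
            min_sad <= sad;
            best_tx <= tx;
            best_ty <= ty;
        end
    end
endmodule
```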
ALU - In Depth
[Figure: ALU block diagram, driven by the Controller]
[Figure: Address Generator (AG) and Memory Controller (MC), containing the TxTy Module, OF_Counter and Block_Address_Generator, connected to the Controller and the Target RAM]
[Figure: Reference Register File (RRF) and Target Register File (TRF), connected to the Ref RAM, the Target RAM and the Controller]
Micro-program for the raster machine (table columns: Instruction/State, State Value, Loop Start, Loop End)
Instruction/State sequence:
Reset, Set tx, Set ty, RASTER, Lp InitBlk, Lp R, Lp C, Lc+Pr, Pr, Pr_dne, Update_ty, SHIFT, LpC_Dne, RASTER, Update_tx, Load R/C, SHIFT, RASTER, Lp C, Lc+Pr, Pr, Pr_dne, Update_ty, SHIFT, LpC_Dne, RASTER, Update_tx, Load R/C, SHIFT, RASTER, Lp R_Dne
State values, as listed: 0, 0, RIGHT, B, (256)-N, 256-N, N, B, N
Loop start/end labels, as listed: Lp InitBlk, Lp C, Lc+Pr, Lc+Pr, Pr, Lp InitBlk, LpR_Dne, LpC_Dne, Lc+Pr, Pr
Notes:
Set tx: Initialize tx (starting x co-ordinate / ROW coordinate)
Set ty: Initialize ty (starting y co-ordinate / COLUMN coordinate)
RASTER (RIGHT): Tell the processor you are traversing right initially
Lp InitBlk: Load the initial blocks for REF & TARGET. Will take clocks equal to the number of elements in B x 2
Further values and notes from the table: LEFT; Shift Left; Done with one row (over all the columns); DOWN; UP; LEFT; RIGHT; DOWN; N; UP; RIGHT; Load R/C; Load R/C
Every block knows the current state, and is enabled/disabled as decided by the controller.
Questions.
Notes