
in association with

C6000 Embedded Design Workshop


Student Guide

C6000 Embedded Design Workshop


Student Guide (includes slides & lab procedures), Rev 2.3 Dec 2016

C6000 Embedded Design Workshop - Cover 0-1


Notice

Notice
These materials (slides, labs, and solutions) are stored on a public website and are
essentially offered under a Creative Commons license. However, the current author,
Mindshare Advantage LLC, must be contacted before these materials are used in any
other form for presentations, college course material, or for any other purpose. These
materials are updated and kept current by Mindshare Advantage LLC and are
used in association with Texas Instruments, with their permission to update and
maintain them.

Mindshare Advantage reserves the right to update this Student (and Lab) Guide to
reflect the most current product information for the spectrum of users. If there are
any differences between this Guide and a technical reference manual, references
should always be made to the most current reference manual and/or datasheet.
Information contained in this publication is believed to be accurate and reliable.
However, no responsibility is assumed for its use, nor for any infringement of patents
or rights of others that may result from its use. No license is granted by implication or
otherwise under any patent or patent right of Texas Instruments or Mindshare
Advantage.

If you have any questions pertaining to this material, please contact Mindshare
Advantage at:

www.MindshareAdvantage.com

Revision History

2.00 March 2016 entire workshop updated to the latest tools (slides, code, labs, etc.)

2.1 July 2016 updated labs/solutions files, minor errata

2.3 Dec 2016 updated labs/solutions files, minor errata

0-2 C6000 Embedded Design Workshop - Cover


C6000 Introduction
Introduction
This is the first chapter that specifically addresses ONLY the C6000 architecture. All chapters
from here on assume the student has already taken the 2-day TI-RTOS Kernel workshop.

During those two days, some C6000-specific architecture items were skipped in favor of
covering all TI EP processors with the same focus. Now it is time to dive deeper into the C6000
specifics.

The first part of this chapter focuses on the C6000 family of devices. The second part dives deeper
into topics already discussed in the previous two days of the TI-RTOS Kernel workshop. In a way,
this chapter catches C6000 users up on this specific target environment.

After this chapter, we plan to dive even deeper into specific parts of the architecture like
optimizations, cache and EDMA.

Objectives

Objectives

Introduce the C6000 core and the C6748 target device
Highlight a few uncommon pieces of the architecture, e.g. the SCR and PRU
Catch up from the TI-RTOS Kernel discussions on C6000-specific topics such as
interrupts, platforms and target config files
Lab 11: Create a custom platform and create an Hwi to respond to the audio
interrupts

C6000 Embedded Design Workshop - C6000 Introduction 11 - 1


Module Topics

Module Topics
C6000 Introduction .................................................................................................................... 11-1
Module Topics ......................................................................................................................... 11-2
TI EP Product Portfolio............................................................................................................ 11-3
DSP Core ................................................................................................................................ 11-4
Devices & Documentation ....................................................................................................... 11-6
Peripherals .............................................................................................................................. 11-7
PRU ..................................................................................................................................... 11-8
SCR / EDMA3 .................................................................................................................... 11-9
Pin Muxing......................................................................................................................... 11-10
Example Device: C6748 DSP ............................................................................................... 11-11
Choosing a Device ................................................................................................................ 11-12
C6000 Arch Catchup .......................................................................................................... 11-13
C64x+ Interrupts................................................................................................................ 11-13
Event Combiner ................................................................................................................ 11-14
Target Config Files ............................................................................................................ 11-15
Creating Custom Platforms ............................................................................................... 11-16
Quiz ....................................................................................................................................... 11-19
Quiz - Answers .................................................................................................................. 11-20
Using Double Buffers ............................................................................................................ 11-21
Lab 11: An Hwi-Based Audio System ................................................................................... 11-23
Lab 11 Procedure ............................................................................................................... 11-24
Import Existing Project ...................................................................................................... 11-24
Application (FIR Audio) Overview ..................................................................................... 11-25
Source Code Overview ..................................................................................................... 11-26
Add Hwi to the Project ....................................................................................................... 11-27
Optional OMAP-L138 LCDK Users ONLY ..................................................................... 11-28
Build, Load, Run. ............................................................................................................... 11-29
Debug Interrupt Problem ................................................................................................... 11-29
Using the Profiler Clock ..................................................................................................... 11-31

11 - 2 C6000 Embedded Design Workshop - C6000 Introduction


TI EP Product Portfolio

TI EP Product Portfolio
TI's Embedded Processor Portfolio: Microcontrollers (MCU) and Application Processors (MPU)

MSP430 (16-bit, ultra low power & cost): MSP430 ULP RISC MCU; low-power modes (0.1 uA, 0.5 uA with RTC), analog I/F, RF430; up to 512K Flash / 64K FRAM; up to 25 MHz; $0.25 to $9.00; TI-RTOS (SYS/BIOS).
C2000 (32-bit, real-time MCU): C28x MCU, also ARM M3+C28x; motor control, digital power, precision timers/PWM; 512K Flash; up to 300 MHz; $1.85 to $20.00; TI-RTOS (SYS/BIOS).
Tiva-C (32-bit, all-around): ARM Cortex-M4F; 32-bit float, nested vector interrupt controller (NVIC), Ethernet (MAC+PHY); 512K Flash; up to 80 MHz; $1.00 to $8.00; TI-RTOS (SYS/BIOS).
Hercules (32-bit, safety): ARM Cortex-M3 / Cortex-R4; lock-step dual-core R4, ECC memory, SIL3 certified; 256K to 3M Flash; up to 220 MHz; $5.00 to $30.00; N/A.
Sitara (32-bit, Linux/Android): ARM Cortex-A8 / Cortex-A9; $5 Linux CPU, 3D graphics, PRU-ICSS industrial subsystem; L1: 32K x 2, L2: 256K; up to 1.35 GHz; $5.00 to $25.00; Linux, Android, SYS/BIOS.
DSP (16/32-bit, all-around DSP): C5000 (low power) and C6000 (fixed or float); L1: 32K x 2, L2: 256K; up to 800 MHz; $2.00 to $25.00; C5x: DSP/BIOS, C6x: SYS/BIOS.
Multicore (32-bit, massive performance): C66x+C66x, A15+C66x, A8+C64x, ARM9+C674x; up to 12 cores (4x A15 + 8x C66x), 32-bit fixed/float, up to 352,000 DSP MMACs; L1: 32K x 2, L2: 1M + 4M; up to 1.4 GHz; $30.00 to $225.00; Linux, SYS/BIOS.

C6000 Embedded Design Workshop - C6000 Introduction 11 - 3


DSP Core

DSP Core
What Problem Are We Trying To Solve?

x → ADC → DSP → DAC → Y

Digital sampling of an analog signal; most DSP algorithms can be expressed with a MAC (multiply-accumulate):

    Y = Σ (i = 1 to count) coeff[i] * x[i]

    for (i = 0; i < count; i++) {
        Y += coeff[i] * x[i];
    }

How is the architecture designed to maximize computations like this?


8

'C6x CPU Architecture


[Diagram: dual register files A0-A31 and B0-B31 feeding eight functional units (.D1/.D2, .S1/.S2, .M1/.M2, .L1/.L2), all connected to memory through the controller/decoder]

C6x compiler excels at natural C
Multiplier (.M) and ALU (.L) provide up to 8 MACs/cycle (8x8 or 16x16)
Specialized instructions accelerate intensive, non-MAC oriented calculations. Examples include: video compression, machine vision, Reed Solomon, ...
While MMACs speed math-intensive algorithms, the flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
C6x CPU can dispatch up to eight parallel instructions each cycle
All C6x instructions are conditional, allowing efficient hardware pipelining
Note: More details later 9

11 - 4 C6000 Embedded Design Workshop - C6000 Introduction


DSP Core

C6000 DSP Family CPU Roadmap


[Roadmap diagram]
Fixed-point line: C62x → C621x (enhanced EDMA2) → C64x (fixed point, video/imaging) → C64x+ (L1 RAM/cache, compact instructions, EDMA3)
Floating-point line: C67x (floating point) → C671x → C67x+
Both lines merge into the C674x (fixed and floating point, lower power, PRU) and then the C66x, which is available on the most recent releases.
10

C6000 DSP Family CPU Roadmap


[Roadmap diagram, with the key additions of each generation:]
C62x / C67x: original fixed-point / floating-point CPUs
C621x / C671x: L1 cache, L2 cache/RAM, EDMA, lower cost
C64x (1 GHz): 2x register set, SIMD instructions (packed data processing), EDMA (v2)
C67x+: 2x register set, FFT enhancements
C64x+: L1 RAM and/or cache, compact instructions, SPLOOP, 32x32 integer multiply, timestamp counter, exceptions, supervisor/user modes, EDMA3
C674x: combined instruction sets from C64x+/C67x+, increased floating-point MHz, lower power, EDMA3, DMAX (PRU)
C66x (1.2 GHz): enhanced instructions for FIR/FFT/complex math

C6000 Embedded Design Workshop - C6000 Introduction 11 - 5


Devices & Documentation

Devices & Documentation


DSP Generations : DSP and ARM+DSP
    Fixed-Point Core | Float-Point Core | DSP-only Devices      | ARM+DSP Devices          | DSP+DSP (Multi-core)
    C62x             | C67x             | C620x, C670x          |                          |
    C621x            | C67x             | C6211, C671x          |                          |
    C64x             |                  | C641x, DM642          |                          |
                     | C67x+            | C672x                 |                          |
    C64x+            |                  | DM643x, C645x         | DM64xx, OMAP35x, DM37x   | C647x
    C674x            |                  | C6748                 | OMAP-L138*, C6A8168      |
    C66x             |                  |                       | Future                   | C667x, C665x
13

Key C6000 Manuals


    Document                            C64x/C64x+   C674x     C66x
    CPU Instruction Set Ref Guide       SPRU732      SPRUFE8   SPRUGH7
    Megamodule/Corepac Ref Guide        SPRU871      SPRUFK5   SPRUGW0
    Peripherals Overview Ref Guide      SPRUE52      SPRUFK9   N/A
    Cache User's Guide                  SPRU862      SPRUG82   SPRUGY8
    Programmer's Guide                  SPRU198, SPRA198, SPRAB27

DSP/BIOS Real-Time Operating System:
    SPRU423 - DSP/BIOS (v5) User's Guide
    SPRU403 - DSP/BIOS (v5) C6000 API Guide
    SPRUEX3 - SYS/BIOS (v6) User's Guide
Code Generation Tools:
    SPRU186 - Assembly Language Tools User's Guide
    SPRU187 - Optimizing C Compiler User's Guide

To find a manual, go to www.ti.com and enter the document number in the Keyword field, or use
www.ti.com/lit/<litnum>
14

11 - 6 C6000 Embedded Design Workshop - C6000 Introduction


Peripherals

Peripherals

[Device block diagram: ARM, C6x DSP, graphics accelerator, video accelerator(s), the PRU (soft peripheral), the video/display subsystem and the peripheral set below]

    Serial:         McBSP, McASP, ASP, UART, SPI, I2C, CAN
    Storage:        DDR2, DDR3, SDRAM, Async, SD/MMC, ATA/CF, SATA
    Master:         PCIe, USB 2.0, EMAC, uPP, HPI, EDMA3, SCR
    Timing:         Timers, Watchdog, PWM, eCAP, RTC, GPIO
    Video/Display:  Capture, Analog Display, Digital Display, LCD Controller
    PRU:            What's next? DIY (build your own soft peripheral)

We'll just look at three of these: the PRU, the SCR/EDMA3, and pin muxing.  16

C6000 Embedded Design Workshop - C6000 Introduction 11 - 7


Peripherals

PRU
Programmable Realtime Unit (PRU)
The PRU consists of:
    Two independent, real-time RISC cores
    Access to pins (GPIO)
    Its own interrupt controller
    Access to memory (master via the SCR)
    Device power management control (ARM/DSP clock gating)
    No C compiler (ASM only)

Use it as a soft peripheral to implement additional on-chip peripherals. Example
implementations include:
    Soft UART
    Soft CAN
    Custom peripherals or non-linear DMA moves
    A smart power controller: switch off both the ARM and DSP clocks, and maximize
    power-down time by evaluating system events before waking up the DSP and/or ARM
18

PRU SubSystem : IS / IS-NOT


Is:
    A dual 32-bit RISC processor specifically designed for manipulating packed, memory-mapped
    data structures and implementing system features that have tight real-time constraints.
    A simple RISC ISA: approximately 40 instructions; logical, arithmetic, and flow-control ops
    all complete in a single cycle.
    Simple tooling: a basic command-line assembler/linker. Example code is included to
    demonstrate various features and can be used as building blocks.

Is Not:
    A H/W accelerator used to speed up algorithm computations.
    A general-purpose RISC processor: no multiply hardware/instructions, no cache or pipeline,
    no C programming.
    Integrated with CCS; it does not include advanced debug options.
    An operating system or high-level application software stack.

11 - 8 C6000 Embedded Design Workshop - C6000 Introduction


Peripherals

SCR / EDMA3
System Architecture SCR/EDMA
[Diagram: masters (ARM, C64x DSP, EDMA3 transfer controllers TC0-TC2, PCI, HPI, EMAC) connect through the Switched Central Resource to slaves (memory, DDR2 EMIF, TCP, VCP, McBSP, PCI, Utopia)]

SCR = Switched Central Resource
Masters initiate accesses to/from slaves via the SCR
Most masters (requestors) and slaves (resources) have their own port to the SCR
Lower-bandwidth masters (HPI, PCI66, etc.) share a port
There is a default priority (0 to 7) for SCR resources that can be modified
Note: this picture is the general idea. Every device has a different scheme for SCRs and
peripheral muxing; in other words, check your data sheet.
21

TMS320C6748 Interconnect Matrix

Note: not ALL connections are valid

C6000 Embedded Design Workshop - C6000 Introduction 11 - 9


Peripherals

Pin Muxing
What is Pin Multiplexing?
[Pin mux example: the HPI and uPP peripherals share the same device pins]

How many pins are on your device?
How many pins would all your peripherals require?
Pin multiplexing is the answer: only so many peripherals can be used at the same time.
In other words, to reduce cost, peripherals must share the available pins.
Which ones can you use simultaneously?
    Designers examine application use cases when deciding the best muxing layout
    Read the datasheet for the final authority on how pins are muxed
    A graphical utility can assist with figuring out pin muxing
Pin mux utility... 24

Pin Muxing Tools

Graphical Utilities For Determining which Peripherals can be Used Simultaneously


Provides Pin Mux Register Configurations. Warns user about conflicts.
ARM-based devices: www.ti.com/tool/pinmuxtool; others: see the product page
25

11 - 10 C6000 Embedded Design Workshop - C6000 Introduction


Example Device: C6748 DSP

Example Device: C6748 DSP


TMS320C6748 Architecture - Overview

[Block diagram: C674x DSP core, memories and peripherals connected through the Switched Central Resource (SCR)]

Performance & Memory:
    C674x fixed- and floating-point DSP core, up to 456 MHz (4-32x PLL)
    32KB L1P and 32KB L1D cache/SRAM, 256KB L2 cache/SRAM, 128KB on-chip RAM (L3)
    16-bit DDR2-266 / mDDR interface; 16-bit EMIF (NAND Flash)

Communications:
    64-channel EDMA3, 10/100 EMAC, USB 1.1 & 2.0, SATA, MMC/SD, HPI, uPP
    McASP, I2C, SPI, UART, LCD, PWM, eCAP, Timers

Power/Packaging:
    13x13mm nPBGA and 16x16mm PBGA packages
    Pin-to-pin compatible with the OMAP-L138 (+ARM9), 361-pin package
    Dynamic voltage/frequency scaling; total power < 420 mW
27

C6000 Embedded Design Workshop - C6000 Introduction 11 - 11


Choosing a Device

Choosing a Device
DSP & ARM MPU Selection Tool

http://focus.ti.com/en/multimedia/flash/selection_tools/dsp/dsp.html 29

11 - 12 C6000 Embedded Design Workshop - C6000 Introduction


C6000 Arch Catchup

C6000 Arch Catchup


C64x+ Interrupts
How do Interrupts Work?
1. An interrupt occurs (e.g. EDMA, McASP, Timer, external pins)
2. The Interrupt Selector routes the event (one of 128 sources) to one of the 12 CPU interrupts
3. A flag is set in the Interrupt Flag Register (IFR)
4. Is this specific interrupt enabled? (IER)
5. Are interrupts globally enabled? (GIE/NMIE)
6. CPU acknowledge: automatic hardware sequence, HWI dispatcher (vector), branch to the ISR
7. Interrupt Service Routine (ISR): context save, ISR code, context restore

The user is responsible for setting up the following:

    #2 Interrupt Selector: choose which 12 of the 128 interrupt sources to use
    #4 Interrupt Enable Register (IER): individually enable the proper interrupt sources
    #5 Global Interrupt Enable (GIE/NMIE): globally enable all interrupts

32

C64x+ Hardware Interrupts


[Diagram: 128 event inputs (0-127, e.g. MCASP0_INT) feed the Interrupt Selector, which maps them to CPU interrupts HWI4-HWI15; each interrupt then passes through the IFR, the IER and GIE on its way to the vector table]

The C6748 has 128 possible interrupt sources (but only 12 CPU interrupts).
4-Step Programming:
1. Interrupt Selector: choose which of the 128 sources are tied to the 12 CPU interrupts
2. IER: enable the individual interrupts that you want to listen to (in the BIOS .cfg)
3. GIE: enable global interrupts (turned on automatically if BIOS is used)
4. Note: the HWI Dispatcher performs a smart context save/restore (automatic for a BIOS Hwi)
Note: NMIE must also be enabled. BIOS automatically sets NMIE=1. If
BIOS is NOT used, the user must turn on both GIE and NMIE manually.
33
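For reference, the same four steps can be performed at runtime in C with the SYS/BIOS Hwi module.
The sketch below is illustrative only (the lab uses the static .cfg editor instead); the function names
myEdmaIsr and createEdmaHwi are assumptions, while event ID 8 and CPU INT5 come from the C6748
example used in Lab 11.

    #include <xdc/std.h>
    #include <xdc/runtime/Error.h>
    #include <xdc/runtime/System.h>
    #include <ti/sysbios/family/c64p/Hwi.h>

    Void myEdmaIsr(UArg arg);                  /* ISR body written elsewhere (assumed name) */

    Void createEdmaHwi(Void)
    {
        Hwi_Params hwiParams;
        Error_Block eb;

        Error_init(&eb);
        Hwi_Params_init(&hwiParams);
        hwiParams.eventId   = 8;               /* step 1: tie event 8 to the chosen CPU interrupt */
        hwiParams.enableInt = TRUE;            /* step 2: set the IER bit for this interrupt      */

        if (Hwi_create(5, myEdmaIsr, &hwiParams, &eb) == NULL) {   /* CPU INT5 */
            System_abort("Hwi create failed");
        }
        /* Step 3 (GIE) and step 4 (dispatcher) are handled automatically by SYS/BIOS. */
    }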

C6000 Embedded Design Workshop - C6000 Introduction 11 - 13


C6000 Arch Catchup

Event Combiner
Event Combiner (ECM)
Use the ECM only if you need more than 12 interrupt events.
The ECM combines multiple events (e.g. events 4-31) into one combined event (e.g. EVT0).
The EVTx ISR must parse MEVTFLAG to determine which event actually occurred.

[Diagram: for each group, EVTFLAG[n] ("did it occur?") is ANDed with EVTMASK[n] ("do I care?") to set MEVTFLAG[n]; events 4-31 combine into EVT0, 32-63 into EVT1, 64-95 into EVT2, and 96-127 into EVT3, which then pass through the 128:12 Interrupt Selector to the CPU]
35
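When the ECM is used with SYS/BIOS, the EventCombiner module does the MEVTFLAG parsing for you.
A minimal, hedged sketch follows; the handler name and event number 65 are assumptions, and the
combined event group still has to be mapped to a CPU interrupt (e.g. via EventCombiner.eventGroupHwiNum
in the .cfg).

    #include <xdc/std.h>
    #include <ti/sysbios/family/c64p/EventCombiner.h>

    Void myEvent65Isr(UArg arg);    /* called by the combiner dispatcher (assumed name) */

    Void plugCombinedEvent(Void)
    {
        /* Event 65 falls in group 2 (events 64-95 -> EVT2). Register the function the
           dispatcher should call when event 65 fires, and unmask it (EVTMASK). */
        EventCombiner_dispatchPlug(65, myEvent65Isr, 0, TRUE);
    }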

11 - 14 C6000 Embedded Design Workshop - C6000 Introduction


C6000 Arch Catchup

Target Config Files


Creating a New Target Config File (.ccxml)
A Target Configuration defines your target, i.e. the emulator/device used and any GEL
scripts (this replaces the old CCS Setup).
Create user-defined configurations (select based on your chosen board).

[Screenshot callouts: on the Advanced tab, click the CPU and specify the GEL script there]

More on GEL files... 37

What is a GEL File ?


GEL General Extension Language (not much help, but there you go)

A GEL file is basically a batch file that sets up the CCS debug
environment including:

Memory Map
Watchdog
UART
Other periphs

The board manufacturer (e.g. SD or LogicPD) supplies GEL files


with each board.
To create a stand-alone or bootable system, the user must
write code to perform these actions (optional chapter covers these details)
38

C6000 Embedded Design Workshop - C6000 Introduction 11 - 15


C6000 Arch Catchup

Creating Custom Platforms


Creating Custom Platforms - Procedure
Most C6000 users will want to create their own
custom platform package
Here is the process:

1. Create a new platform package


2. Select repository, add to project path, select device
3. Import the existing seed platform
4. Modify settings
5. [Save] creates a custom platform pkg
6. Build Options select new custom platform

40

Creating Custom Platforms - Procedure


1 Create New Platform (via DEBUG perspective)

2 Configure New Platform


Platform Package Name

Custom Repository vs. XDC default location

Add Repository to Path adds platform path to project path

41

11 - 16 C6000 Embedded Design Workshop - C6000 Introduction


C6000 Arch Catchup

Creating Custom Platforms - Procedure


3 New Device Page Click Import (copy seed platform)
4 Customize Settings

42

Creating Custom Platforms - Procedure


5 [SAVE] New Platform (creates custom platform package)
6 Select New Platform in Build Options (RTSC tab)

Custom Repository vs. XDC default location

With path added, the tools find new platform

43

C6000 Embedded Design Workshop - C6000 Introduction 11 - 17


C6000 Arch Catchup

*** this page is blank for absolutely no reason ***

11 - 18 C6000 Embedded Design Workshop - C6000 Introduction


Quiz

Quiz
Chapter Quiz
1. How many functional units does the C6000 CPU have?

2. What is the size of a C6000 instruction word?

3. What is the name of the main bus arbiter in the architecture?

4. What is the main difference between a bus master and slave?

5. Fill in the names of the following blocks of memory and bus:

[Diagram to label: the CPU connected to unnamed memory blocks by a 256-bit bus and a 128-bit bus]

C6000 Embedded Design Workshop - C6000 Introduction 11 - 19


Quiz

Quiz - Answers

Chapter Quiz
1. How many functional units does the C6000 CPU have?
8 functional units or execution units
2. What is the size of a C6000 instruction word?
256 bits (8 units x 32-bit instructions per unit)
3. What is the name of the main bus arbiter in the architecture?
Switched Central Resource (SCR)
4. What is the main difference between a bus master and slave?
Masters can initiate a memory transfer (e.g. EDMA, CPU)
5. Fill in the names of the following blocks of memory and bus:
[Answer diagram: the SCR connects to L2; L2 feeds the L1P and L1D memories; the CPU fetches 256 bits of instructions from L1P and 128 bits of data from L1D]
46

11 - 20 C6000 Embedded Design Workshop - C6000 Introduction


Using Double Buffers

Using Double Buffers


Single vs Double Buffer Systems
Single-buffer system: collect data or process data, not both!

    [Diagram: the Hwi fills BUF, then the Swi/Task processes the same BUF]

    There is nowhere to store new data while the prior data is being processed.

Double-buffer system: process and collect data at the same time; real-time compliant!

    [Diagram: the Hwi fills one buffer (x) while the Swi/Task processes the other (y); the buffers swap roles each frame]

    One buffer can be processed while another is being collected.
    When the Swi/Task finishes a buffer, it is returned to the Hwi.
    The Task is now caught up and meeting real-time expectations.
    The Hwi must have priority over the Swi/Task so new data can be collected while prior
    data is being processed; this is standard in SYS/BIOS.
48
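A minimal sketch of the double-buffer handshake described above, using a SYS/BIOS Semaphore. All
names and sizes here are illustrative, not taken from the lab code.

    #include <xdc/std.h>
    #include <ti/sysbios/BIOS.h>
    #include <ti/sysbios/knl/Semaphore.h>

    #define BLOCK_SIZE 256
    static Int16 bufPing[BLOCK_SIZE], bufPong[BLOCK_SIZE];
    static volatile UInt fillPing = 1;           /* which buffer the Hwi is currently filling */
    extern Semaphore_Handle semBufReady;         /* created in main() or the .cfg (assumed)   */

    Void audioHwi(UArg arg)                      /* Hwi: collect one block of samples */
    {
        Int16 *dst = fillPing ? bufPing : bufPong;
        /* ... copy BLOCK_SIZE samples from the peripheral into dst ... */
        fillPing ^= 1;                           /* swap: the next block goes to the other buffer */
        Semaphore_post(semBufReady);             /* a full buffer is ready for processing */
    }

    Void processTaskFxn(UArg a0, UArg a1)        /* Swi/Task: process the buffer NOT being filled */
    {
        while (TRUE) {
            Int16 *src;
            Semaphore_pend(semBufReady, BIOS_WAIT_FOREVER);
            src = fillPing ? bufPong : bufPing;
            /* ... filter/copy src while the Hwi fills the other buffer ... */
        }
    }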

C6000 Embedded Design Workshop - C6000 Introduction 11 - 21


Using Double Buffers

*** HTTP ERROR 404 PAGE NOT FOUND ***

11 - 22 C6000 Embedded Design Workshop - C6000 Introduction


Lab 11: An Hwi-Based Audio System

Lab 11: An Hwi-Based Audio System


In this lab, we will use an Hwi to respond to McASP interrupts. The McASP/AIC3106 init code has
already been written for you. The McASP causes an EDMA interrupt which has already been
enabled. However, it is your challenge to create an Hwi and ensure all the necessary conditions
to respond to the interrupt are set up properly.

This lab also employs triple buffers (another version of ping/pong, with an extra pang). Both the
RCV and XMT sides have triple buffers. The concept here is that when you are processing one,
the NEXT buffer (in line) is being filled. This lab is based on the C6748 StarterWare audio
application from TI that was converted to the latest TI-RTOS and FIR filter.

Application: Audio pass-thru using Hwi, McASP/AIC3106 and EDMA3


Key Ideas: Hwi creation, Hwi conditions to trigger an interrupt, Buffer
management
Pseudo Code:
hardwareInitTaskFxn() - inits the codec, I2C, McASP, EDMA3 and LED.

EDMA3CCComplIsr() - responds to the EDMA3 interrupt, determines Rx or Tx, and calls the
appropriate handler. The Rx handler unblocks the Task to filter the incoming audio data.

CopyBufRxToTxTaskFxn() - unblocked when the Rx handler fires; modifies the PSETs
for the EDMA transfers, filters the data (FIR), and performs interleave/de-interleave to
construct/deconstruct the 32-bit zero-padded data into 16-bit-only data (L and R channels).
This could have been done by the EDMA, but the StarterWare app didn't work that way.
Then this function starts the next EDMA transfers (get new Rx data and send filtered Tx data
from the buffer to the McASP serializer). Repeat.

Lab 11 Hwi Audio


[Lab block diagram: audio input (48 kHz) → AIC3106 ADC → McASP (XBUF14, RxCh) → EDMA3 → triple Rx buffers (rxBuf0-2). CopyBufRxToTxTaskFxn(): SEM_pend(Rx), modify PSETs, copy history, de-interleave 32→16, FIR filter, interleave 16→32, start the next EDMA3 transfers. Triple Tx buffers (txBuf0-2) → EDMA3 → McASP (XBUF13, TxCh) → AIC3106 DAC → audio output (48 kHz). EDMA3CCComplIsr() holds the Rx & Tx handlers and posts the Rx semaphore; a 500 ms Clock tick (Clk1) blinks the LED. Source files: mcaspPlayBk_MA_TIRTOS.c, aic31_MA_TIRTOS.c, codecif_MA_TIRTOS.c. Based on the C6748 StarterWare audio app.]

Procedure:
1. Import the existing project (Lab11)
2. Code review
3. Create your own CUSTOM PLATFORM
4. Configure an Hwi to respond to the EDMA3 interrupt
5. Debug interrupt problems

Time = 45 min    49

C6000 Embedded Design Workshop - C6000 Introduction 11 - 23


Lab 11 Procedure

Lab 11 Procedure
If you can't remember how to perform some of these steps, please refer back to the previous labs
for help. Or, if you really get stuck, ask your neighbor. If you AND your neighbor get stuck, then
ask the instructor (who is probably doing absolutely NOTHING important) for help.

Import Existing Project


1. Close ALL open projects and files and then open CCS.

2. Import Lab11 project.


As before, import the archived starter project from:
C:\TI-RTOS\C6000\Labs\Lab_11\

This starter file contains all the starting source files for the audio project, including the setup
code for the A/D and D/A on the C6748 LCDK (or OMAP-L138 LCDK). It also has UIA
activated, but this won't be used until the next lab.

3. Check the Properties to ensure you are using the latest XDC, BIOS and UIA.
For every imported project in this workshop, ALWAYS check to make sure the latest tools
(XDC, BIOS and UIA) are being used. The author created these projects at time x and you
may have updated the tools on your student PC at x+1 some time later. The author used
the tools available at time x to create the starter projects and solutions which may or may
not match YOUR current set of tools.
Therefore, you may be importing a project that is NOT using the latest versions of the tools
(XDC, BIOS, UIA) or the compiler.
Check ALL settings for the Properties of the project (XDC, BIOS, UIA) and the compiler
and update the imported project to the latest tools before moving on and save all settings.

11 - 24 C6000 Embedded Design Workshop - C6000 Introduction


Lab 11 Procedure

Application (FIR Audio) Overview


4. Let's review what this audio pass-thru code is doing.
As discussed in the lab description, this application is based on the C6748 StarterWare audio
app which can be downloaded from:
http://www.ti.com/tool/starterware-c6dsp
Once downloaded, the original application can be obtained by importing the project located
at:
[STARTERWARE_INSTALL_PATH]\build\c674x\cgt_ccs\c6748\lcdkC6748\mcasp
The author then modified the code to make it TI-RTOS compliant and added a FIR filter of the
audio data that will be optimized in a future lab.
The best way to understand the process is via I-P-O:
Input (RCV): each analog audio sample from the audio INPUT port of the LCDK (top
stereo audio jack) is converted by the A/D and sent to the McASP port on the C6748. For
each sample, the McASP generates an EDMA3 event which fills up the rxBuf[0-2] buffer.
When the rxBuf is full, the EDMA generates an interrupt to the CPU. In the ISR, Rx and
Tx are handled separately. Assuming the Rx and Tx are running at the same frequency,
only Rx unblocks the processing Task to filter the audio data.
Process: assuming at this point that the next Rx buffer is full and the next Tx buffer is
empty, it is time to process the data. A simple pass-thru would just copy Rx to Tx, which is
what the original StarterWare application did. However, the author modified that code to
perform an FIR filter. The original code performed 32-bit reads and writes with the upper
16 bits of each sample padded with zeroes. An off-the-shelf FIR filter can't process this
type of data: it needs to be channelized (L and R), converted to 16 bits, with the zeroes
removed. So, a de-interleave routine is done first to create a new local Rx buffer (a minimal
sketch of this step follows this list). That buffer is then filtered to create a new local Tx
buffer. When that process is complete, an interleave routine is done to put the data back
into a zero-padded 32-bit format which the EDMA copies to the McASP (Tx) serializer.
Output (XMT): when the EDMA Tx transfer starts, it copies one sample at a time to the
McASP Tx serializer. When a buffer full of samples has been transferred, the Tx EDMA
channel interrupts the CPU to say "done". Again, assuming that the Rx and Tx transfers
are synced, the Rx interrupt would then unblock the processing Task above.
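A minimal sketch of the de-interleave step mentioned above. The buffer names, sample layout and
function name are assumptions used for illustration; the lab's actual routines live in
mcaspPlayBk_MA_TIRTOS.c.

    #include <stdint.h>

    /* Each 32-bit word in the EDMA receive buffer is assumed to hold one 16-bit sample
       in its low half (upper 16 bits are zero padding), with L and R samples alternating. */
    void deInterleave(const int32_t *rxBuf, int16_t *left, int16_t *right, int nSamples)
    {
        int i;
        for (i = 0; i < nSamples; i++) {
            left[i]  = (int16_t)rxBuf[2*i];        /* drop the zero pad, keep the L sample */
            right[i] = (int16_t)rxBuf[2*i + 1];    /* drop the zero pad, keep the R sample */
        }
        /* The FIR filters left[] and right[] separately; an interleave routine reverses
           this packing before the EDMA sends the data to the McASP. */
    }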
Several source files are needed to create this application. Let's explore those briefly...

C6000 Embedded Design Workshop - C6000 Introduction 11 - 25


Lab 11 Procedure

Source Code Overview


5. Inspect the source code.
Following is a brief description of the source files:
Feel free to open any of these files and inspect them as you read; each has been
commented heavily to explain each piece thoroughly.
aic31_MA_TIRTOS.c AIC3106 setup code. The actual sampling frequency is set in
the file system_MA_TIRTOS.h.

codecif_MA_TIRTOS.c I2C setup code.

coeffs_MA_TIRTOS.c contains the FIR filter coefficients low pass, high pass and
all pass. To change the values, simply comment out one set and uncomment another set.
ALL_PASS is set by default.

Led_MA_TIRTOS.c contains the Task that toggles the LED on the LCDK. The code
uses the StarterWare library calls.

mcaspPlayBk_MA_TIRTOS.c this is the MAIN code for this application. It contains


all of the init routines and the key Task CopyBufRxToTxTaskFxn() which is where all of
the magic happens. This file also contains main() which is simply BIOS_start().

RxTxBuf_MA_TIRTOS.cmd The buffers are all allocated in


mcaspPlayBk_MA_TIRTOS.c and use user-defined sections so that they can easily be
relocated in L2 vs DDR2 memory areas. The default location is L2. However, for the later
cache lab, in order to test cache performance, those buffers will be allocated in DDR2
with the cache off and then on.

system_MA_TIRTOS.h this is the main header file used by all other files. It contains
the #define statements that control almost everything in the code along with the function
prototypes.

audio_app.cfg - the TI-RTOS configuration file. All TI-RTOS modules are


configured and set via this file.

11 - 26 C6000 Embedded Design Workshop - C6000 Introduction


Lab 11 Procedure

Add Hwi to the Project


6. Use Hwi module and configure the hardware interrupt for the EDMA3.
Ok, FINALLY, we get to do some real work to get our code running. For most targets, an
interrupt source (e.g. EDMA3) will have an interrupt EVENT ID (specified in the datasheet).
This event ID needs to be tied to a specific CPU interrupt. The details change based on the
target device. For the C6748, the EVENT ID is #8 and the CPU interrupt we're using is INT5
(there are 16 CPU interrupts on the C6748; again, this is target specific).
So, we need to do two things: (1) tell the tools we want to USE the Hwi module; (2) configure
a specific interrupt to point to our ISR routine (EDMA3CCComplIsr).
During the 2-day TI-RTOS Kernel Workshop, you performed these actions, so this should
be review. But that's ok. Review is good.
First, make sure you are viewing the audio_app.cfg file.
Right-click on Hwi in the Outline View and select New Hwi.
Click on the new Hwi (e.g. hwi2):

Then fill in the following dialogue boxes to match what is shown below:

Make sure "Enable at startup" is NOT checked (this checkbox controls the IER bit
on the C6748). Leaving it unchecked will give us something to debug later.

C6000 Embedded Design Workshop - C6000 Introduction 11 - 27


Lab 11 Procedure

Optional OMAP-L138 LCDK Users ONLY


C6748 LCDK users know how to build, load and run these labs; they did this multiple times
during the TI-RTOS workshop. However, because this workshop supports either board, there are
a few steps that an OMAP-L138 user needs to perform that are DIFFERENT than for a C6748 user.

The following information will help users of the OMAP-L138 LCDK get these labs to work
properly.

First, the devices are very similar. For the build, you can target either the C6748 LCDK or the OMAP-
L138 LCDK in (right-click on the project) Properties → General. The author chose to simplify things
and just keep the C6748 LCDK as the target. That works fine.

Second, the OMAP device contains an ARM9 CPU that must be powered up FIRST before the
DSP (C674x). So, OMAP-L138 users must do two things in addition to the C6748 LCDK users:

Use a different target configuration file


Connect to the ARM9 first, followed by the DSP and then load the .out file to the DSP.

Using a different target configuration file


The name of the target config file to be used is:

Which is located in C:\TI_RTOS\Workshop_Admin\Target_Config_Files

Refer back to the early steps of Lab 1 (TI-RTOS workshop) if you don't remember how to import
target config files. Also, if you use a different emulator than the Spectrum Digital XDS510, you will
need to update your .ccxml file to reflect that. The board should be set to LCDKOMAPL138.
When you launch a debug session, use this file instead of the C6748 version. It uses a different
GEL file that configures both the ARM9 and the DSP which is also located in the TI_RTOS.zip file
as long as you placed that folder at the root: C:\TI_RTOS. If not, you will have to change the
location of the GEL file used by the target config file.

Powering up each CPU


After you launch the above target config file, you will see multiple CPUs in the Debug window:

First, click on the ARM9_0 CPU and select Run → Connect Target, or just click the Connect button:

Then connect to the C674x_0 CPU the same way. Then load the .out file to the C674x CPU. You
should see the GEL file output to the Console window as it runs.

11 - 28 C6000 Embedded Design Workshop - C6000 Introduction


Lab 11 Procedure

Build, Load, Run.


7. Build, load and run the audio FIR filter application.
Before you click Run, make sure audio is playing into the board and your headphones are set
up so you can hear the audio. The TOP jack is INPUT (from your PC) and the BOTTOM jack
is OUTPUT (to your headphones).
Also, make sure that Windows Media Player is set to REPEAT forever. If the music stops
(the input is air) and you click Run, you might think there is a problem with your code. Nope,
there is just no music playing.
Build and fix any errors. After a successful build, load the .out file to your target board.
OMAP-L138 LCDK USERS ONLY: Remember, OMAP-L138 LCDK users need to connect to
the ARM9 first, then connect to the C674x CPU second. Then, make sure the C674x CPU is
highlighted and LOAD the .out file to that CPU.
Once the program is loaded, click Run.
Do you hear audio? If not, it's debug time; it SHOULD NOT be working (by design). One
quick tip for debug is to place a breakpoint in the EDMA3CCComplIsr() routine (located in
mcaspPlayBk_MA_TIRTOS.c) and see if the program stops there. If not, no interrupt is
being generated. Move on to the next steps to debug the problem...

Hint: The StarterWare application has a unique "send zeroes if the McASP Xmt underruns"
feature. Normally, the McASP on the C6748 cannot be restarted after a halt, i.e. you
can't just hit Halt, then Run. However, in this application, if a halt occurs and underruns
the XMT side of the McASP, the application continues to send ZEROES to the output to
keep it alive vs simply dying. This is a nice feature. You may hear static when you halt,
but you can simply click Play again to keep running.

Debug Interrupt Problem


As we already know, we decided early on to NOT enable the IER bit in the static configuration of
the Hwi. Ok. But debugging interrupt problems is a crucial skill. The next few steps walk you
through HOW to do this. You may not know WHERE your interrupt problem occurred, so using
these brief debug skills may help in the future.

8. Pause for a moment to reflect on the dominos in the interrupt game:


An interrupt must occur (the McASP event to the EDMA3 must be set up, and the EDMA3
options register for each channel must be configured to interrupt the CPU). Our init code
does this already.
The individual interrupt must be enabled (IER, BITx), which is NOT enabled right now.
Global interrupts must be turned on (GIE = 1, handled by TI-RTOS).
The HWI Dispatcher must be used to provide the proper context save/restore (automatic when
using TI-RTOS to manage interrupts).
Keep this all in mind as you do the following steps

C6000 Embedded Design Workshop - C6000 Introduction 11 - 29


Lab 11 Procedure

9. EDMA3CC interrupt firing - is the IFR bit set?

If your application is still running, halt it.
The EDMA3 Channel Controller (CC) interrupt is set up to fire properly, but is it setting the IFR
bit? You configured HWI_INT5, so that would be a 1 in bit 5 (IF5) of the IFR.
Go there now (View → Registers → Core Registers).
Look down the list to find the IFR and IER, the two of most interest at the moment.
(Author note: could it have been set, then auto-cleared already? If the IER bit is set and the
interrupt fires, the IFR bit is cleared automatically by hardware. So leaving the IER bit disabled,
as it already is in the CFG file, then building/running and THEN looking at the IFR is a nice
trick: it lets us see the IFR bit being set.)
Expand IFR and look for IF5 (because the EDMA3 interrupt is tied to CPU INT #5).
Write your debug checkmarks here:

IFR bit set? Yes No


10. Is the IER bit set?
Interrupts must be individually enabled. When you look at IER bit 5, is it set to 1? Probably
NOT, because we didn't check that "Enable at startup" checkbox.
Open up the config for HWI_INT5 and check the proper checkbox. Then hit build, and your
code will build and load automatically regardless of which perspective you are in. Run. Do
you hear audio playing now? The all-pass filter coefficients are set by default, so your music
should sound normal.
Halt the CPU. Is the IER bit (IE05) set?

IER bit set? Yes No


So let's check one more thing...

11. Is GIE set?


The Global Interrupt Enable (GIE) bit is located in the CPU's CSR register. TI-RTOS turns this
on automatically and then manages it as part of the O/S. So, there is no need to check on this,
but do FIND this bit in the CSR register so you know where it is...

GIE bit set? Yes No


Hint: If you create a project that does NOT use SYS/BIOS (or TI-RTOS), it is the responsibility
of the user to turn on not only GIE (in the CSR register), but also NMIE (in the IER register).
Otherwise, NO interrupts will be recognized. Ever. Did I say ever?
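For such a bare-metal project, here is a minimal sketch of what that hand-written setup looks like
with the TI compiler; c6x.h exposes the CPU control registers, and the bit positions below follow the
C64x+ CPU guide, so verify them against your device's documentation.

    #include <c6x.h>     /* TI compiler header: declares CSR, IER, etc. as control registers */

    void enableAudioInterrupt(void)
    {
        IER |= (1 << 1);     /* NMIE: non-maskable interrupt enable (IER bit 1)    */
        IER |= (1 << 5);     /* enable CPU INT5, where the EDMA3 event was mapped  */
        CSR |= 0x1;          /* GIE: global interrupt enable (CSR bit 0)           */
    }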

11 - 30 C6000 Embedded Design Workshop - C6000 Introduction


Lab 11 Procedure

Using the Profiler Clock


12. Turn on the Profiler Clock and perform a benchmark.
Set two breakpoints anywhere you like (double-click in the left pane of the code editor): one at the
start point and another at the end point that you want to benchmark.
Turn on the Profiler clock by selecting: Run → Clock → Enable

In the bottom right-hand part of the screen, you should see a little CLK symbol that looks like
this:

Run to the first breakpoint, then double-click on the clock symbol to zero it. Run again and
the number of CPU cycles will be displayed.
One place to set breakpoints is just before the FIR filter starts and just after it ends, basically
benchmarking how long the FIR filter takes to run. Like this:

FYI - the author registered 642K cycles:

C6000 Embedded Design Workshop - C6000 Introduction 11 - 31


Lab 11 Procedure

Thats It. Youre Done!!


13. Note about benchmarks, UIA and Logs in this lab.
There is really no extra work we have to do in terms of UIA and Logs; these services will be
used in all future labs. If you have time and want to add a Log or a benchmark using
Timestamp to the code, go ahead (a minimal sketch follows).
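A minimal sketch of a Timestamp-based benchmark around the FIR call; the function and variable
names here are illustrative, not from the lab code.

    #include <xdc/std.h>
    #include <xdc/runtime/Timestamp.h>
    #include <xdc/runtime/Log.h>

    Void benchmarkFir(Void)
    {
        UInt32 t0, t1;

        t0 = Timestamp_get32();
        /* ... run the FIR filter on one block of samples here ... */
        t1 = Timestamp_get32();

        Log_info1("FIR block took %u timestamp ticks", t1 - t0);
    }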
In reality, the number of LED blinks has been logged in the system log. If you remember
how to find those logs based on your previous lab experience, go look for them. If not, we will do
more UIA in future labs.
You spent the past two days in the Kernel workshop playing with these tools. The point of this
lab was to get you up to speed on Platforms and to focus more on the C6000 as the specific
target. In future labs, though, you'll have more chances to use UIA and Logs to test
the compiler, the optimizer and the cache settings.

14. Close the project and delete it from the workspace.


Terminate the debug session and close CCS. Power cycle the board.

RAISE YOUR HAND and get the instructor's attention when you
have completed this lab.

11 - 32 C6000 Embedded Design Workshop - C6000 Introduction


C6000 CPU Architecture
Introduction
In this chapter, we will take a deeper look at the C64x+ architecture and assembly code. The
point here is not to cover HOW to write assembly; it is just a convenient way to understand the
architecture better.

Objectives
Objectives

Provide a detailed overview of the C64x+/C674x CPU architecture
Describe the basic ASM language and h/w needed to solve a sum of products (SOP)
Analyze how the hardware pipeline works
Learn the basics of software pipelining
Note: This chapter and the next chapter shape our knowledge of how the compiler/optimizer work

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 1


Module Topics

Module Topics
C6000 CPU Architecture ........................................................................................................... 12-1
Module Topics ......................................................................................................................... 12-2
What Does A DSP Do? ........................................................................................................... 12-3
CPU From the Inside Out .............................................................................................. 12-4
Instruction Sets ..................................................................................................................... 12-10
MAC Instructions ................................................................................................................ 12-12
C66x MAC Instructions .................................................................................................... 12-14
Hardware Pipeline ................................................................................................................. 12-15
Software Pipelining ............................................................................................................... 12-16
Chapter Quiz ......................................................................................................................... 12-19
Quiz - Answers .................................................................................................................. 12-20

12 - 2 C6000 Embedded Design Workshop - C6000 CPU Architecture


What Does A DSP Do?

What Does A DSP Do?

What Problem Are We Trying To Solve?

x → ADC → DSP → DAC → Y

Digital sampling of an analog signal; most DSP algorithms can be expressed with a MAC (multiply-accumulate):

    Y = Σ (i = 1 to count) coeff[i] * x[i]

    for (i = 0; i < count; i++) {
        Y += coeff[i] * x[i];
    }

How is the architecture designed to maximize computations like this?


5

'C6x CPU Architecture


[Diagram: dual register files A0-A31 and B0-B31 feeding eight functional units (.D1/.D2, .S1/.S2, .M1/.M2, .L1/.L2), all connected to memory through the controller/decoder]

C6x compiler excels at natural C
Multiplier (.M) and ALU (.L) provide up to 8 MACs/cycle (8x8 or 16x16)
Specialized instructions accelerate intensive, non-MAC oriented calculations. Examples include: video compression, machine vision, Reed Solomon, ...
While MMACs speed math-intensive algorithms, the flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
C6x CPU can dispatch up to eight parallel instructions each cycle
All C6x instructions are conditional, allowing efficient hardware pipelining
Note: More details later

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 3


CPU From the Inside Out

CPU From the Inside Out


The Core of DSP : Sum of Products

    y = Σ (n = 1 to 40) c[n] * x[n]

The C6000 is designed to handle a DSP's math-intensive calculations:

    MPY  .M  c, x, prod       ; multiplier (.M unit)
    ADD  .L  y, prod, y       ; ALU (.L unit)

Note: you don't have to specify the functional units (.M or .L); they are shown here for clarity.

Where are the variables stored?


8

Working Variables : The Register File

[Diagram: Register File A (16 or 32 registers, 32 bits wide) holds the working variables c, x, prod and y next to the .M and .L units]

    MPY  .M  c, x, prod
    ADD  .L  y, prod, y

How can we loop our MAC?


9

12 - 4 C6000 Embedded Design Workshop - C6000 CPU Architecture


CPU From the Inside Out

Making Loops
1. Program flow: the branch instruction
B loop

2. Initialization: setting the loop count


MVK 40, cnt

3. Decrement: subtract 1 from the loop counter


SUB cnt, 1, cnt

10

.S Unit: Branch and Shift Instructions

[Diagram: Register File A now also holds cnt; the .S unit handles branches, shifts and constants]

            MVK  .S  40, cnt
    loop:   MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
            B    .S  loop

How is the loop terminated?


11

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 5


CPU From the Inside Out

Conditional Instruction Execution

To minimize branching, all instructions are conditional

[condition] B loop

Execution based on [zero/non-zero] value of specified variable

    Code Syntax     Execute if:
    [ cnt ]         cnt != 0
    [ !cnt ]        cnt == 0

Note: If the condition is false, execution is essentially replaced with a NOP.


12

Loop Control via Conditional Branch

[Diagram: Register File A with c, x, cnt, prod and y]

            MVK  .S  40, cnt
    loop:   MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
    [cnt]   B    .S  loop

How are the c and x array values brought in from memory?


13

12 - 6 C6000 Embedded Design Workshop - C6000 CPU Architecture


CPU From the Inside Out

Memory Access via .D Unit

[Diagram: the .D unit moves data between the register file and data memory via the pointers *cp, *xp and *yp]

            MVK  .S  40, cnt
    loop:   LDH  .D  *cp, c
            LDH  .D  *xp, x
            MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
    [cnt]   B    .S  loop

Data memory: x(40), a(40), y
Note: there are no restrictions on which registers can be used for addresses or data!
What does the "H" in LDH signify?    14

Memory Access via .D Unit

    Instr.    Description         C Type    Size
    LDB       load byte           char      8 bits
    LDH       load half-word      short     16 bits
    LDW       load word           int       32 bits
    LDDW*     load double-word    double    64 bits
    * Except the C62x generation

            MVK  .S  40, cnt
    loop:   LDH  .D  *cp, c
            LDH  .D  *xp, x
            MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
    [cnt]   B    .S  loop

Data memory: x(40), a(40), y
How do we increment through the arrays?
15

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 7


CPU From the Inside Out

Auto-Increment of Pointers

[Diagram: Register File A; the pointers *cp and *xp auto-increment through data memory]

            MVK  .S  40, cnt
    loop:   LDH  .D  *cp++, c
            LDH  .D  *xp++, x
            MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
    [cnt]   B    .S  loop

Data memory: x(40), a(40), y
How do we store results back to memory?
16

Storing Results Back to Memory

[Diagram: Register File A; the .D unit stores y back to data memory via *yp]

            MVK  .S  40, cnt
    loop:   LDH  .D  *cp++, c
            LDH  .D  *xp++, x
            MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
    [cnt]   B    .S  loop
            STW  .D  y, *yp

Data memory: x(40), a(40), y
But wait - that's only half the story...
17

12 - 8 C6000 Embedded Design Workshop - C6000 CPU Architecture


CPU From the Inside Out

Dual Resources : Twice as Nice

[Diagram: two complete register files (A0-A15/A31 and B0-B15/B31, 32 bits wide) and two complete sets of functional units (.S1/.S2, .M1/.M2, .L1/.L2, .D1/.D2); the variables cn, xn, cnt, prd, sum and the pointers *c, *x, *y live in Register File A]

Our final view of the sum-of-products example...
18

Optional - Resource Specific Coding


    y = Σ (n = 1 to 40) c[n] * x[n]

[Diagram: Register File A with A0=cn, A1=xn, A2=cnt, A3=prd, A4=sum, A5=*c, A6=*x, A7=*y]

            MVK  .S1  40, A2
    loop:   LDH  .D1  *A5++, A0
            LDH  .D1  *A6++, A1
            MPY  .M1  A0, A1, A3
            ADD  .L1  A4, A3, A4
            SUB  .S1  A2, 1, A2
    [A2]    B    .S1  loop
            STW  .D1  A4, *A7

It's easier to use symbols rather than register names, but you can use either method.

19

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 9


Instruction Sets

Instruction Sets
C62x RISC-like instruction set
    .S Unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH,
             NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
    .L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM,
             NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
    .M Unit: MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
    .D Unit: ADD, ADDAB (B/H/W), LDB (B/H/W), MV, NEG, STB (B/H/W),
             SUB, SUBAB (B/H/W), ZERO
    No unit used: NOP, IDLE

21

C67x: Superset of Fixed-Point


In addition to the C62x instruction set, the C67x adds floating-point instructions:

    .S Unit adds: ABSSP, ABSDP, CMPEQSP, CMPGTSP, CMPLTSP, CMPEQDP, CMPGTDP,
                  CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
    .L Unit adds: ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT,
                  SPTRUNC, DPTRUNC, DPSP
    .M Unit adds: MPYSP, MPYDP, MPYI, MPYID
    .D Unit adds: LDDW
    No unit required: NOP, IDLE

22

12 - 10 C6000 Embedded Design Workshop - C6000 CPU Architecture


Instruction Sets

'C64x: Superset of C62x Instruction Set


The C64x adds (grouped by category as on the slide):
    Dual/quad arithmetic: ADD2, ADD4, SADD2, SADDUS2, SADD4, SUB2, SUB4, SUBABS4, MAX, MIN, ABS2
    Data pack/unpack: PACK2, PACKH2, PACKLH2, PACKHL2, PACKH4, PACKL4, UNPKHU4, UNPKLU4, SPACK2, SPACKU4, SWAP2/4
    Compares: CMPEQ2, CMPEQ4, CMPGT2, CMPGT4
    Branches/PC: BDEC, BPOS, BNOP, ADDKPC
    Bitwise logical: ANDN
    Shifts & merge: SHR2, SHRU2, SHLMB, SHRMB, ROTL, SSHVL, SSHVR
    Memory access (.D): LDDW, LDNW, LDNDW, STDW, STNW, STNDW; address calc: ADDAD; load constant: MVK (5-bit)
    Multiplies (.M): MPY2, SMPY2, MPYHI, MPYLI, MPYHIR, MPYLIR, DOTP2, DOTPN2, DOTPRSU2, DOTPNRSU2, DOTPU4, DOTPSU4, GMPY4
    Averages: AVG2, AVG4
    Bit operations: BITC4, BITR, DEAL, SHFL, XPND2/4
    Moves: MVD
    23

C64x+ Additions
The C64x+ adds:
    .S Unit: CALLP, DMV, RPACK2
    .L Unit: ADDSUB, ADDSUB2, DPACK2, DPACKX2, SADDSUB, SADDSUB2, SHFL3, SSUB2
    .M Unit: CMPY, CMPYR, CMPYR1, DDOTP4, DDOTPH2, DDOTPH2R, DDOTPL2, DDOTPL2R, GMPY,
             MPY2IR, MPY32 (32-bit result), MPY32 (64-bit result), MPY32SU, MPY32U, MPY32US,
             SMPY32, XORMPY
    .D Unit: none
    No unit: DINT, RINT, SPLOOP, SPLOOPD, SPLOOPW, SPKERNEL, SPKERNELR, SPMASK, SPMASKR, SWE, SWENR
24

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 11


MAC Instructions

MAC Instructions
DOTP2 with LDDW
    LDDW   .D1  *A4++, A1:A0     ; a3 a2 : a1 a0
 || LDDW   .D2  *B4++, B1:B0     ; x3 x2 : x1 x0

    DOTP2  A0, B0, A2            ; A2 = a1*x1 + a0*x0
 || DOTP2  A1, B1, B2            ; B2 = a3*x3 + a2*x2

    ADD    A2, A3, A3            ; intermediate sums
 || ADD    B2, B3, B3

    ADD    A3, B3, A4            ; final sum

26
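From C, the compiler can be steered toward DOTP2 with the _dotp2() intrinsic. A hedged sketch
follows; the function name, the even value of n and 32-bit alignment of the arrays are assumptions
made for illustration.

    #include <stdint.h>
    #include <c6x.h>     /* TI compiler intrinsics */

    /* _dotp2(a, b) multiplies the two packed 16-bit halves of a and b and adds
       the products: one call = two MACs.                                        */
    int32_t dotProduct(const int16_t *c, const int16_t *x, int n)   /* n assumed even */
    {
        const int32_t *cp = (const int32_t *)c;    /* two coefficients per 32-bit word */
        const int32_t *xp = (const int32_t *)x;    /* two samples per 32-bit word      */
        int32_t sum = 0;
        int i;

        for (i = 0; i < n / 2; i++) {
            sum += _dotp2(cp[i], xp[i]);           /* c[2i]*x[2i] + c[2i+1]*x[2i+1] */
        }
        return sum;
    }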

Block Real FIR Example (DDOTPL2 )


    for (i = 0; i < ndata; i++) {
        sum = 0;
        for (j = 0; j < ncoef; j++) {
            sum = sum + (d[i+j] * c[j]);
        }
        y[i] = sum;
    }

    DDOTPL2  d3d2:d1d0, c1c0, sum1:sum0

[Table: the partial products for two consecutive output samples, i=0 (d0*c0, d1*c1, d2*c2, d3*c3, ...) and i=1 (d1*c0, d2*c1, d3*c2, ...), are computed together]

Four 16x16 multiplies in each .M unit every cycle adds up to 8 MACs/cycle, or 8000 MMACS.
Bottom line: two loop iterations for the price of one.
27

12 - 12 C6000 Embedded Design Workshop - C6000 CPU Architecture


MAC Instructions

Complex Multiply (CMPY)


    A0 = r1 : i1      (packed 16-bit complex number)
    A1 = r2 : i2

    CMPY A0, A1, A3:A2
        A3 = r1*r2 - i1*i2     (32 bits)
        A2 = i1*r2 + r1*i2     (32 bits)

Four 16x16 multiplies per .M unit; using two CMPYs, a total of eight 16x16 multiplies per cycle.
The floating-point version (CMPYSP) uses 64-bit inputs (register pair) and 128-bit packed products
(register quad); you then need to add/subtract the products to get the final result.
28

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 13


C66x MAC Instructions

C66x MAC Instructions


C66x: QMPY32 (fixed), QMPYSP (float)
    A3:A2:A1:A0   =  c3 : c2 : c1 : c0
    A7:A6:A5:A4   =  x3 : x2 : x1 : x0

    QMPY32 (fixed) or QMPYSP (float):
    A11:A10:A9:A8 =  c3*x3 : c2*x2 : c1*x1 : c0*x0     (four 32-bit results, single .M unit)

Four 32x32 multiplies per .M unit, for a total of eight 32x32 multiplies per cycle.
Fixed- and floating-point versions; the output is a 128-bit packed result (register quad).

30

C66x: Complex Matrix Multiply (CMAXMULT)


    [ M9  M8 ]  =  [ M7  M6 ]  *  [ M3  M2 ]
                                  [ M1  M0 ]

    M9 = M7*M3 + M6*M1        where each Mx represents a packed
    M8 = M7*M2 + M6*M0        16-bit complex number

A single .M unit implements the complex matrix multiply using 16 MACs (all in 1 cycle);
using both .M units achieves 32 16x16 multiplies per cycle.

    src1 = r1 i1 : r2 i2
    src2 = ra ia : rb ib : rc ic : rd id
    dest = (r1*ra - i1*ia) + (r2*rc - i2*ic)  :  (r1*ia + i1*ra) + (r2*ic + i2*rc)  :
           (r1*rb - i1*ib) + (r2*rd - i2*id)  :  (r1*ib + i1*rb) + (r2*id + i2*rd)
    (four 32-bit results from a single .M unit)
31

12 - 14 C6000 Embedded Design Workshop - C6000 CPU Architecture


Hardware Pipeline

Hardware Pipeline
Pipeline Phases
The pipeline phases are Program Fetch (PG, PS, PW, PR), Decode (DP, DC) and Execute (E1, ...).

[Diagram: successive instructions enter the pipeline one phase apart; once the pipeline is full, an instruction (or execute packet) completes every cycle]

34

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 15


Software Pipelining

Software Pipelining
Instruction Delays
All 'C64x instructions require only one cycle to execute, but some results are delayed...

    Description     Instructions        # Delay cycles
    Single cycle    All, except...      0
    Multiply        MPY, SMPY           1
    Load            LDB, LDH, LDW       4
    Branch          B                   5

36

Would This Code Work As Is ??


    y = Σ (n = 1 to 40) c[n] * x[n]

[Diagram: Register File A with A0=cn, A1=xn, A2=cnt, A3=prd, A4=sum, A5=*c, A6=*x, A7=*y]

            MVK  .S1  40, A2
    loop:   LDH  .D1  *A5++, A0
            LDH  .D1  *A6++, A1
            MPY  .M1  A0, A1, A3
            ADD  .L1  A4, A3, A4
            SUB  .S1  A2, 1, A2
    [A2]    B    .S1  loop
            STW  .D1  A4, *A7

We need to add NOPs to get this code to work properly (NOP = Not Optimized Properly).
How many instructions can this CPU execute every cycle?
37

12 - 16 C6000 Embedded Design Workshop - C6000 CPU Architecture


Software Pipelining

Software Pipelined Algorithm


[Chart: each row is a functional unit (.L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2) and each column a cycle (0-7). During the PROLOG, the loads (ldw), multiplies (mpy/mpyh), sub, branch and adds are started in successive cycles, respecting the delay slots, until all eight units are busy; the LOOP then runs with every unit active each cycle]

38

Software Pipelined C6x Code


    c0:     ldw  .D1  *A4++,A5
         || ldw  .D2  *B4++,B5

    c1:     ldw  .D1  *A4++,A5
         || ldw  .D2  *B4++,B5
         || [B0] sub  .S2  B0,1,B0

    c2_3_4: ldw  .D1  *A4++,A5
         || ldw  .D2  *B4++,B5
         || [B0] sub  .S2  B0,1,B0
         || [B0] B    .S1  loop

    c5_6:   ldw  .D1  *A4++,A5
         || ldw  .D2  *B4++,B5
         || [B0] sub  .S2  B0,1,B0
         || [B0] B    .S1  loop
         || mpy  .M1x A5,B5,A6
         || mpyh .M2x A5,B5,B6

    *** Single-Cycle Loop ***
    loop:   ldw  .D1  *A4++,A5
         || ldw  .D2  *B4++,B5
         || [B0] sub  .S2  B0,1,B0
         || [B0] B    .S1  loop
         || mpy  .M1x A5,B5,A6
         || mpyh .M2x A5,B5,B6
         || add  .L1  A7,A6,A7
         || add  .L2  B7,B6,B7

39
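You never write this prolog or the parallel loop by hand; the compiler/optimizer generates it from
plain C when it can software-pipeline the loop. A minimal sketch of the kind of C source (and the
trip-count hint covered in the next chapter) that lets it do so; the function name and fixed count of
40 are illustrative.

    #include <stdint.h>

    int32_t sumOfProducts(const int16_t *restrict c, const int16_t *restrict x)
    {
        int32_t y = 0;
        int i;

        /* Tell the compiler the loop runs exactly 40 times (a multiple of 2) so it
           can software-pipeline it; build with -o2 or -o3. */
        #pragma MUST_ITERATE(40, 40, 2)
        for (i = 0; i < 40; i++) {
            y += c[i] * x[i];
        }
        return y;
    }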

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 17


Software Pipelining

*** this page contains no useful information ***

12 - 18 C6000 Embedded Design Workshop - C6000 CPU Architecture


Chapter Quiz

Chapter Quiz
Chapter Quiz
1. Name the four functional units and types of instructions they execute:

2. How many 16x16 MACs can a C674x CPU perform in 1 cycle? C66x ?

3. Where are CPU operands stored and how do they get there?

4. What is the purpose of a hardware pipeline?

5. What is the purpose of s/w pipelining, which tool does this for you?

C6000 Embedded Design Workshop - C6000 CPU Architecture 12 - 19


Chapter Quiz

Quiz - Answers
Chapter Quiz
1. Name the four functional units and types of instructions they execute:
M unit Multiplies (fixed, float)
L unit ALU arithmetic and logical operations
S unit Branches and shifts
D unit Data loads and stores
2. How many 16x16 MACs can a C674x CPU perform in 1 cycle? C66x ?
C674x 8 MACs/cycle, C66x 32 MACs/cycle

3. Where are CPU operands stored and how do they get there?
Register Files (A and B), Load (LDx) data from memory

4. What is the purpose of a hardware pipeline?

To break up instruction execution into phases small enough to reach the minimum cycle time,
thereby allowing single-cycle execution when the pipeline is FULL.

5. What is the purpose of s/w pipelining, and which tool does this for you?
Maximize performance: use as many functional units as possible in
every cycle; the COMPILER/OPTIMIZER performs the SW pipelining.
42

12 - 20 C6000 Embedded Design Workshop - C6000 CPU Architecture


C and System Optimizations
Introduction
In this chapter, we will cover the basics of optimizing C code and some useful tips on system
optimization. Also included here are some other system-wide optimizations you can take
advantage of in your own application.

Outline

Objectives

Describe how to configure and use the various compiler/optimizer options
Discuss the key techniques to increase performance or reduce code size
Demonstrate how to use optimized libraries
Overview key system optimizations
Lab 13: Use the FIR algorithm on audio data, optimize it using the compiler, and benchmark it

C6000 Embedded Design Workshop - C and System Optimizations 13 - 1


Module Topics

Module Topics
C and System Optimizations ................................................................................................... 13-1
Module Topics ......................................................................................................................... 13-2
Introduction Optimal and Optimization ............................................................................ 13-3
C Compiler and Optimizer ....................................................................................................... 13-5
Debug vs. Optimized ...................................................................................................... 13-5
Levels of Optimization ......................................................................................................... 13-6
Build Configurations ............................................................................................................ 13-7
Code Space Optimization (ms) ......................................................................................... 13-8
File and Function Specific Options ..................................................................................... 13-9
Coding Guidelines ............................................................................................................. 13-10
Data Types and Alignment .................................................................................................... 13-11
Data Types ........................................................................................................................ 13-11
Data Alignment .................................................................................................................. 13-12
Forcing Data Alignment..................................................................................................... 13-13
Restricting Memory Dependencies (Aliasing) ....................................................................... 13-14
Access Hardware Features Using Intrinsics ...................................................................... 13-16
Give Compiler MORE Information ........................................................................................ 13-17
Pragma Unroll() .............................................................................................................. 13-17
Pragma MUST_ITERATE() ............................................................................................ 13-18
Keyword - Volatile ............................................................................................................. 13-18
Setting MAX interrupt Latency (-mi option) ....................................................................... 13-19
Compiler Directive - _nassert() ......................................................................................... 13-20
Using Optimized Libraries ..................................................................................................... 13-21
Libraries Download and Support .................................................................................... 13-23
System Optimizations ........................................................................................................... 13-24
Custom Sections ............................................................................................................... 13-24
Use EDMA......................................................................................................................... 13-25
Use Cache......................................................................................................................... 13-26
System Architecture SCR .............................................................................................. 13-26
Chapter Quiz ......................................................................................................................... 13-27
Quiz - Answers .................................................................................................................. 13-28
Lab 13 C Optimizations ...................................................................................................... 13-29
Lab 13 C Optimizations Procedure ................................................................................. 13-30
PART A Goals and Using Compiler Options.................................................................. 13-30
Determine Goals and CPU Min ..................................................................................... 13-30
Using Release Configuration (o2, g) ......................................................................... 13-33
Using Opt Configuration ............................................................................................. 13-36
Part B Code Tuning ........................................................................................................ 13-39
Part C Minimizing Code Size (ms) ............................................................................... 13-40
Part D Using DSPLib ...................................................................................................... 13-41
Conclusion......................................................................................................................... 13-42

13 - 2 C6000 Embedded Design Workshop - C and System Optimizations


Introduction Optimal and Optimization

Introduction Optimal and Optimization


What Does Optimal Mean ?
Every user will have a different definition of optimal:

When my processing keeps up with my I/O (real-time)

When my algo achieves theoretical minimum

When I've worked on it for 2 weeks straight, it is FAST ENOUGH

When my boss says "GOOD ENOUGH"

After I have applied all known (by me) optimization


techniques, I guess this is as good as it gets

What is implied by that last statement?


5

Know Your Goal and Your Limits


Y = sum of coeff[i] * x[i], for i = 1 to count:

    for (i = 0; i < count; i++) {
        Y += coeff[i] * x[i]; }

Goals:
A typical goal of any system's algo is to meet real-time
You might also want to approach or achieve CPU Min in
order to maximize #channels processed

CPU Min (the limit):


The minimum # cycles the algo takes based on architectural
limits (e.g. data size, #loads, math operations required)
Real-time vs. CPU Min
Often, meeting real-time only requires setting a few compiler options (easy)
However, achieving CPU Min often requires extensive knowledge
of the architecture (harder, requires more time)
6

C6000 Embedded Design Workshop - C and System Optimizations 13 - 3


Introduction Optimal and Optimization

Optimization Intro
Optimization is:
Continuous process of refinement in which code being optimized executes faster
and takes fewer cycles, until a specific objective is achieved (real-time execution).

When is it fast enough? Depends on the user's definition.

Compiler's personality? Paranoid. It will ALWAYS make decisions


to give you the RIGHT answer vs. the best optimization (unless told otherwise)

Bottom Line:
Learn as many optimization techniques as possible try them all (if necessary)
This is the GOAL of this chapter

Keep in mind: mileage may vary (highly system/arch dependent)

So, let's jump right in...


7

13 - 4 C6000 Embedded Design Workshop - C and System Optimizations


C Compiler and Optimizer

C Compiler and Optimizer


Debug vs. Optimized
Debug vs. Optimized Benchmarks
FIR
for (j = 0; j < nr; j++) { Dot Product
sum = 0;
for (i = 0; i < nh; i++) for (i = 0; i < count; i++){
sum += x[i + j] * h[i]; Y += coeff[i] * x[i]; }
r[j] = sum >> 15;
}

Benchmarks:
Algo                     FIR (256, 64)    DOTP (256-term)
Debug (no opt, -g)            817K             4109
Opt (-o3, no -g)               18K               42
Add'l pragmas                   7K               42
(DSPLib)                        7K               42
CPU Min                       4096               42

Debug: get your code LOGICALLY correct first (no optimization)
Opt: increase performance using compiler options (easier)
CPU Min: it depends. Could require extensive time
10

Debug vs. Optimized Environments


Debug (-g, NO opt): Get Code Logically Correct
Provides the best debug environment with full symbolic
support, no code motion, easy to single step
Code is NOT optimized i.e. very poor performance
Create test vectors on FUNCTION boundaries (use same
vectors as Opt Env)

Opt (-o3, -g): Increase Performance


Higher levels of opt result in code motion; functions
become black boxes (hence the use of FXN vectors)
Optimizer can find errors in your code (use volatile)
Highly optimized code (can reach CPU Min w/ some algos)
Each level of optimization increases the optimizer's scope
11

C6000 Embedded Design Workshop - C and System Optimizations 13 - 5


C Compiler and Optimizer

Levels of Optimization
Levels of Optimization
Scope grows with the optimization level (applies across FILE1.C, FILE2.C, ...):

   -o0, -o1    LOCAL       single block
   -o2         FUNCTION    across blocks
   -o3         FILE        across functions
   -pm -o3     PROGRAM     across files

Increasing levels of opt increase:
scope, code motion
build times
visibility

12

Program Level Optimization (-pm)


Using -pm
Right-click on your Project
and select:
Build Options

Throttling -pm with -opn

-pm is critical in compiling for maximum performance (requires use of -o3)


-pm creates a temp.c file which includes all C source files, thus giving the
optimizer a program-level optimization context
-opn describes a program's external references (-op2 means NO ext'l refs)
(-op is what throttles -pm)
Be careful with -op2 (no ext'l refs). BIOS scheduler calls are external to C
13

13 - 6 C6000 Embedded Design Workshop - C and System Optimizations


C Compiler and Optimizer

Build Configurations
Two Default Configurations
For new projects, CCS always creates two default build configurations:

Debug Options (OK for Debug Environment)

Release Options (Ok for first step optimization)

Note: these are simply sets or containers for build options. If you set a path in one,
it does NOT copy itself to the other (e.g. includes). Also, you can make your own! 15

C6000 Embedded Design Workshop - C and System Optimizations 13 - 7


C Compiler and Optimizer

Code Space Optimization (ms)


Minimizing Space Option (-ms)
The table shows the basic strategy employed by
compiler and Asm-Opt when using the -ms options.
% denotes how much you care about each:

-ms level Performance Code Size


none 100% 0
-ms0 90 10
-ms1 60 40
-ms2 20 80
-ms3 0 100%

Any -ms level will invoke compressed opcodes (16-bit)


User must use the optimizer (-o) with -ms for the
greatest effect. Suggestion: use on init code.
17

Additional Code Space Options


Use program level optimization (-pm)
Try -mh to reduce prolog/epilog code
Use -oi0 to disable auto-inlining
Inlining inserts a copy of a function into a C file
rather than calling (i.e. branching) to it
Auto-inlining is a compiler feature whereby small
functions are automatically inlined
Auto-inlining is enabled for small functions by -o3
The -oi<size> option sets the size of functions to be
automatically inlined
size = function size * # of times inlined
Use -on1 or -on2 to report size
Force function inlining with the inline keyword
inline void func(void);
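As a quick sketch of forced inlining (the helper and its caller below are hypothetical, not from the labs), a small function expanded inline leaves no call inside the loop, so the loop can still be software pipelined:

    /* Hypothetical example: a small Q15 scaling helper forced inline so the
       loop that calls it contains no function call.                         */
    static inline short scale_q15(short x, short gain)
    {
        return (short)(((int)x * gain) >> 15);
    }

    void apply_gain(short *buf, short gain, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            buf[i] = scale_q15(buf[i], gain);
    }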

13 - 8 C6000 Embedded Design Workshop - C and System Optimizations


C Compiler and Optimizer

File and Function Specific Options


File Specific Options

Right-click on file and


select Build Options
Apply settings and
click OK.
A little triangle on the file denotes that
file-specific options are applied.
Can also use FUNCTION-specific
options via a pragma:
#pragma FUNCTION_OPTIONS( );      Note: most used are -o, -ms

20
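A minimal sketch of the pragma in use follows; the function name and option string are hypothetical, and you should check the Optimizing C Compiler User's Guide for the exact syntax your tools version expects:

    /* Hypothetical example: build just init_peripherals() for code size
       (-ms3) while the rest of the file keeps the project-level options. */
    #pragma FUNCTION_OPTIONS(init_peripherals, "-ms3")
    void init_peripherals(void)
    {
        /* slow, one-time setup code: size matters more than speed here */
    }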

C6000 Embedded Design Workshop - C and System Optimizations 13 - 9


C Compiler and Optimizer

Coding Guidelines
Programming the C6000
Source              Tool                    Efficiency*    Effort
C / C++             Compiler + Optimizer     80 - 100%     Low
Linear Assembly     ASM Optimizer            95 - 100%     Med
ASM                 Hand Optimize              100%        High

22

Basic C Coding Guidelines


In order for the compiler to create the most efficient code, it is
best to follow these guidelines:
1. Use Minimum Complexity Code
If a human can't understand and read it easily, neither can the compiler
Break up larger logic into smaller loops/pieces

2. No function calls in tight loops


The compiler cannot create a pipelined loop with fxn calls present

3. Keep loops relatively small


Helps compiler generate tighter, more efficient pipelined loops

4. Create test vectors at FUNCTION boundaries


When optimization is turned on, it is nearly impossible to single-step inside fxns

5. Look at the assembly file: SPLOOP?


If curious, look at the disassembly. Was SPLOOP/LDDW used or not? Why?
The assembly optimizer generates comments as to what happened in the loop and why
Use -mw (verbose pipeline info), -os (interlist), -k (keep .asm file) to see all info
23

13 - 10 C6000 Embedded Design Workshop - C and System Optimizations


Data Types and Alignment

Data Types and Alignment


Data Types
C6000 C Data Types
Type Bits Representation
char 8 ASCII
short 16 Binary, 2's complement
int 32 Binary, 2's complement
long 40* Binary, 2's complement
long long 64 Binary, 2's complement
float 32 IEEE 32-bit
double 64 IEEE 64-bit
long double 64 IEEE 64-bit
pointers 32 Binary
* long type is 32-bit for EABI (ELF)

Device ALWAYS accesses data on aligned boundaries


26

C6000 Embedded Design Workshop - C and System Optimizations 13 - 11


Data Types and Alignment

Data Alignment
Data Alignment in Memory
DataType.C:

    char z = 1;
    short x = 7;
    int y;
    double w;

    void main (void)
    {
        y = child(x, 5);
    }

(Memory diagram: byte (LDB) boundaries 0 through 9, showing where z, x, y and w land.)

Hint: all single data items are aligned on type boundaries

Alignment of Structures

Structures are aligned to the largest
type they contain
For data space efficiency, start with
larger types first to minimize holes
Arrays within structures are only
aligned to their type size

Aligning arrays within structs...
33

13 - 12 C6000 Embedded Design Workshop - C and System Optimizations


Data Types and Alignment

Forcing Data Alignment


Forcing Alignment within Structures
While arrays are aligned to 32 or 64-bit boundaries, arrays within
structures are not, which might affect optimization.
Here are a couple ideas to force arrays to 8-byte alignment:
1. Use dummy variable to force alignment
typedef struct ex1_t{
short b;
long long dummy1;
short a[40];
} ex1;

2. Use unions
typedef union {
    short     a2[80];
    long long a8[10];
} algn_t;

typedef struct ex2_t {
    short  b;
    algn_t a3;
} ex2;

How can we force alignments of scalars or structs?


34

Forcing Alignment
#pragma DATA_ALIGN(x, 4)
short z;
short x;

The DATA_ALIGN pragma can align to any 2^n boundary
(Memory diagram: z lands at byte 0; x would have been placed right after it,
but the pragma forces x to the next 4-byte (int) boundary at address 4.)

35

C6000 Embedded Design Workshop - C and System Optimizations 13 - 13


Restricting Memory Dependencies (Aliasing)

Restricting Memory Dependencies (Aliasing)


What is Aliasing?
int x;
int *p;

main()
{
    p = &x;
    x = 5;
    *p = 8;
}

One memory location, two ways to access it: x and *p

Note: This is a very simple alias example. The compiler doesn't have any
problem disambiguating an alias condition like this.

43

Aliasing?
void fcn(*in, *out)
{
    LDW  *in++, A0
    ADD  A0, 4, A1
    STW  A1, *out++
}

(Diagram: in walks an input array a, b, c, d, e, ...; out walks a separate output
array out0, out1, out2, ...)

Intent: no aliasing (ASM code?)
*in and *out point to different memory locations
Reads are not the problem, WRITES are. *out COULD point anywhere
Compiler is paranoid: it assumes aliasing unless told otherwise.
ASM code is the key (pipelining)
Use restrict keyword (more soon)

44

13 - 14 C6000 Embedded Design Workshop - C and System Optimizations


Restricting Memory Dependencies (Aliasing)

Aliasing?
What happens if the function is called like this?
    fcn(*myVector, *myVector+1)

void fcn(*in, *out)
{
    LDW  *in++, A0
    ADD  A0, 4, A1
    STW  A1, *out++
}

Definitely Aliased pointers:
*in and *out could point to the same address
But how does the compiler know?
If you tell the compiler there is no aliasing, this code will break (LDs
in software pipelined loop)
One solution is to restrict the writes - *out (see next slide)
45

Alias Solutions
1. Compiler solves most aliasing on its own.
   If in doubt, the result will be correct
   even if the most optimal method won't be used

2. Program Level Optimization (-pm -o3)

   Provide compiler visibility to the entire program

3. No Bad Aliasing Option (-mt)

   Tell the compiler that no bad aliases exist in the entire project
   See the Compiler User's Guide for the definition of "bad"
   Previous weighted vector summation example
   performance was increased by 5x (by using -mt)

4. Restrict Keyword (ANSI C)

   Similar to -mt, but on an array-level basis
   void fcn(short * in, short * restrict out)

Along with these suggestions, we highly recommend you check out:

   TMS320C6000 Programmer's Guide
   TMS320C6000 Optimizing C Compiler User's Guide
46
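To make the idea concrete, here is a minimal sketch (a simple weighted vector sum invented for illustration, not the workshop's example) where restrict-qualifying the output pointer tells the compiler the writes cannot alias the reads:

    /* Sketch: 'restrict' promises that 'out' is the only way this function
       writes that memory, so loads from in1/in2 may be scheduled ahead of
       the stores when the loop is software pipelined.                      */
    void wvs(const short *in1, const short *in2, short * restrict out, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            out[i] = (short)((in1[i] * 3 + in2[i]) >> 2);  /* arbitrary weighting */
    }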

C6000 Embedded Design Workshop - C and System Optimizations 13 - 15


Access Hardware Features Using Intrinsics

Access Hardware Features Using Intrinsics


Comparing the Coding Methods
C Code                      y = a * b;

C Code Using Intrinsics     y = _mpyh(a, b);

In-Line Assembly            asm(" MPYH A0, A1, A2");

Assembly Code               MPYH A0, A1, A2

Intrinsics...
Can use C variable names instead of register names
Are compatible with the C environment
Adhere to C's function call syntax
Do NOT use in-line assembly !

48

Intrinsics - Examples
Think of intrinsic functions as a specialized function library written by TI
#include <c6x.h> has prototypes for all the intrinsic functions
Intrinsics are great for accessing the hardware functionality which is
unsupported by the C language
To run your C code on another compiler, download intrinsic C-source: spra616.zip

Intrinsics (partial list):
_add2()    _sadd()    _clr()     _set()
_ext/u()   _smpy()    _lmbd()    _smpyh()
_mpy()     _sshl()    _mpyh()    _ssub()
_mpylh()   _subc()    _mpyhl()   _sub2()
_nassert() _sat()     _norm()

Example:
    int x, y, z;
    z = _lmbd(x, y);

Refer to the C Compiler User's Guide for more information
49
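For example, here is a hedged sketch of a dot product using the _mpy/_mpyh intrinsics to get two 16x16 multiplies per 32-bit word; it assumes the arrays are word-aligned and that count is even:

    #include <c6x.h>   /* intrinsic prototypes */

    /* Each int holds two packed 16-bit samples; _mpy multiplies the lower
       halves, _mpyh the upper halves: two MACs per loop iteration.        */
    int dotp_packed(const int *a, const int *b, int count)
    {
        int i, sum = 0;
        for (i = 0; i < count/2; i++) {
            sum += _mpy (a[i], b[i]);   /* lower 16 x lower 16 */
            sum += _mpyh(a[i], b[i]);   /* upper 16 x upper 16 */
        }
        return sum;
    }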

13 - 16 C6000 Embedded Design Workshop - C and System Optimizations


Give Compiler MORE Information

Give Compiler MORE Information


Provide Compiler with More Insight
1. Program Level Optimization: -pm -op2 -o3

2. #pragma DATA_ALIGN (var, byte align)


3. #pragma UNROLL(# of times to unroll);
4. #pragma MUST_ITERATE(min, max, %factor);

5. Use volatile keyword


6. Set MAX interrupt threshold
7. Use _nassert() to tell optimizer about pointer alignment

Like -pm, #pragmas are an easy way to pass more


information to the compiler
The compiler uses this information to create better code
#pragmas are ignored by other C compilers if they are
not supported
51

Pragma Unroll()
3. UNROLL(# of times to unroll)
#pragma UNROLL(2);
for(i = 0; i < count ; i++) {
sum += a[i] * x[i];
}

Tells the compiler to unroll the for() loop twice


The compiler will generate extra code to handle
the case that count is odd
The #pragma must come right before the for() loop
UNROLL(1) tells the compiler not to unroll a loop

52

C6000 Embedded Design Workshop - C and System Optimizations 13 - 17


Give Compiler MORE Information

Pragma MUST_ITERATE()
4. MUST_ITERATE(min, max, %factor)
#pragma UNROLL(2);
#pragma MUST_ITERATE(10, 100, 2);
for(i = 0; i < count ; i++) {
sum += a[i] * x[i];
}

Gives the compiler information about the trip (loop) count


In the code above, we are promising that:
count >= 10, count <= 100, and count % 2 == 0
If you break your promise, you might break your code
MIN helps with code size and software pipelining
MULT allows for efficient loop unrolling (and odd cases)
The #pragma must come right before the for() loop
53

Keyword - Volatile
5. Use Volatile Keyword
If a variable changes OUTSIDE the optimizer's scope, the optimizer will
remove/delete the variable and any associated code.
For example, let's say *ctrl points to an EMIF address:

int *ctrl;

while (*ctrl == 0);

Use volatile keyword to tell compiler to leave it alone:


volatile int *ctrl;

while (*ctrl == 0);


54

13 - 18 C6000 Embedded Design Workshop - C and System Optimizations


Give Compiler MORE Information

Setting MAX interrupt Latency (-mi option)


6. Set MAX Interrupt Threshold

Loops using SPLOOP buffer are interruptible. However, loops that do


not meet the criteria for SPLOOP are NOT generally interruptible
Use the -mi option to set the MAX # of cycles that interrupts are disabled
(n = 1000 is a good starting number)
This option does NOT comprehend slow memory cycles or stalls
#pragma FUNC_INTERRUPT_THRESHOLD(func, threshold);
55

-mi Details
-mi0
Compiler's code is not interruptible
User must guarantee no interrupts will occur
-mi1
Compiler uses single assignment and never produces a loop less
than 6 cycles
-mi1000 (or any number > 1)
Tells the compiler your system must be able to see interrupts every
1000 cycles
When not using -mi (the compiler's default)
Compiler will software pipeline (when using -o2 or -o3)
Interrupts are disabled for s/w pipelined loops

Notes:
Be aware that the compiler is unaware of issues such as memory
wait-states, etc.
Using -mi, the compiler only counts instruction cycles
56
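As a sketch (the function name and threshold value below are hypothetical), the same limit can be applied to a single function with the pragma shown on the previous page instead of project-wide with -mi:

    /* Hypothetical example: allow fir_block()'s pipelined loops to disable
       interrupts for at most 100 instruction cycles.                       */
    #pragma FUNC_INTERRUPT_THRESHOLD(fir_block, 100)
    void fir_block(const short *x, const short *h, short *r, int nh, int nr);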

C6000 Embedded Design Workshop - C and System Optimizations 13 - 19


Give Compiler MORE Information

Compiler Directive - _nassert()


7. _nassert()
_nassert((ptr & 0x7) == 0 );

Generates no code, evaluated at compile time


Tells the optimizer that the expression declared with
the assert function is true
Above example declares that ptr is aligned on an 8-byte
boundary (i.e. the lowest 3-bits of the address in ptr are
000b)
In the next lab, _nassert() is used to tell the compiler
that the history pointer is aligned on an 8-byte boundary

58
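Putting these hints together, here is a hedged sketch of an FIR function modeled on the generic FIR shown earlier in this chapter (not the lab's exact cfir() source): _nassert() declares the history pointer 8-byte aligned and restrict is applied to the output array:

    #include <c6x.h>

    void fir(const short *x, const short *h, short * restrict r, int nh, int nr)
    {
        int i, j;
        _nassert(((unsigned)x & 0x7) == 0);   /* x (history) is 8-byte aligned */
        for (j = 0; j < nr; j++) {
            int sum = 0;
            for (i = 0; i < nh; i++)
                sum += x[i + j] * h[i];
            r[j] = (short)(sum >> 15);
        }
    }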

13 - 20 C6000 Embedded Design Workshop - C and System Optimizations


Using Optimized Libraries

Using Optimized Libraries


DSPLIB
Optimized DSP Function Library for C programmers using C62x/C67x and C64x devices
These routines are typically used in computationally intensive real-time
applications where optimal execution speed is critical.
By using these routines, you can achieve execution speeds considerably
faster than equivalent code written in standard ANSI C language. And these
ready-to-use functions can significantly shorten your development time.

The DSP library features:
C-callable
Hand-coded assembly-optimized
Tested against C model and existing run-time-support functions

Function categories (partial list):
Adaptive filtering:     DSP_firlms2
Correlation:            DSP_autocor
FFT:                    DSP_bitrev_cplx, DSP_radix2, DSP_r4fft, DSP_fft,
                        DSP_fft16x16r, DSP_fft16x16t, DSP_fft16x32, DSP_fft32x32,
                        DSP_fft32x32s, DSP_ifft16x32, DSP_ifft32x32
Filters & convolution:  DSP_fir_cplx, DSP_fir_gen, DSP_fir_r4, DSP_fir_r8,
                        DSP_fir_sym, DSP_iir
Math:                   DSP_dotp_sqr, DSP_dotprod, DSP_maxval, DSP_maxidx,
                        DSP_minval, DSP_mul32, DSP_neg32, DSP_recip16,
                        DSP_vecsumsq, DSP_w_vec
Matrix:                 DSP_mat_mul, DSP_mat_trans
Miscellaneous:          DSP_bexp, DSP_blk_eswap16, DSP_blk_eswap32,
                        DSP_blk_eswap64, DSP_blk_move, DSP_fltoq15,
                        DSP_minerror, DSP_q15tofl
60
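As a quick sketch of calling one of these routines (the buffer sizes are arbitrary and the header name varies by DSPLIB release, so treat the #include as an assumption):

    #include "dsplib.h"   /* DSPLIB prototypes; exact header name depends on the release */

    #define NH   64       /* number of filter coefficients    */
    #define NR  256       /* number of output samples         */

    short x[NR + NH];     /* input samples, including history */
    short h[NH];          /* Q15 filter coefficients          */
    short r[NR];          /* filtered output                  */

    void run_fir(void)
    {
        /* Generic FIR: r[j] = (sum of x[i+j]*h[i] for i = 0..NH-1) >> 15 */
        DSP_fir_gen(x, h, r, NH, NR);
    }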

IMGLIB
Optimized Image Function Library for C programmers using C62x/C67x
and C64x devices

The Image library features:
C-callable
C and linear assembly src code
Tested against C model

Function categories (partial list):
Compression / Decompression:            IMG_fdct_8x8, IMG_idct_8x8,
    IMG_idct_8x8_12q4, IMG_mad_8x8, IMG_mad_16x16, IMG_mpeg2_vld_intra,
    IMG_mpeg2_vld_inter, IMG_quantize, IMG_sad_8x8, IMG_sad_16x16,
    IMG_wave_horz, IMG_wave_vert
Picture Filtering / Format Conversions: IMG_conv_3x3, IMG_corr_3x3,
    IMG_corr_gen, IMG_errdif_bin, IMG_median_3x3, IMG_pix_expand, IMG_pix_sat,
    IMG_yc_demux_be16, IMG_yc_demux_le16, IMG_ycbcr422_rgb565
Image Analysis:                         IMG_boundary, IMG_dilate_bin,
    IMG_erode_bin, IMG_histogram, IMG_perimeter, IMG_sobel, IMG_thr_gt2max,
    IMG_thr_gt2thr, IMG_thr_le2min, IMG_thr_le2thr

61

C6000 Embedded Design Workshop - C and System Optimizations 13 - 21


Using Optimized Libraries

FastRTS (C67x)
Optimized floating-point math function library for C programmers using
TMS320C67x devices
Includes all floating-point math routines currently in existing C6000 run-
time-support libraries

The FastRTS library features:
C-callable
Hand-coded assembly-optimized
Tested against C model and existing run-time-support functions

FastRTS must be installed per directions in its User's Guide (SPRU100a.PDF)

Routines (single / double precision):
atanf/atan, atan2f/atan2, cosf/cos, expf/exp, exp2f/exp2, exp10f/exp10,
logf/log, log2f/log2, log10f/log10, powf/pow, recipf/recip, rsqrtf/rsqrt,
sinf/sin
62

FastRTS (C62x/C64x)
Optimized floating-point math function library for C programmers;
enhances floating-point performance on C62x and C64x fixed-point devices

The FastRTS library features:
C-callable
Hand-coded assembly-optimized
Tested against C model and existing run-time-support functions

FastRTS must be installed per directions in its User's Guide (SPRU653.PDF)

Routines (single / double precision):
_addf/_addd, _divf/_divd, _fixfi/_fixdi, _fixfli/_fixdli, _fixfu/_fixdu,
_fixful/_fixdul, _fltif/_fltid, _fltlif/_fltlid, _fltuf/_fltud,
_fltulf/_fltuld, _mpyf/_mpyd, recipf/recip, _subf/_subd
Others (conversions): _cvtdf, _cvtfd

63

13 - 22 C6000 Embedded Design Workshop - C and System Optimizations


Using Optimized Libraries

Libraries Download and Support


Download and Support
Download via TI Wiki
Source code available
Includes doc folders which
contain useful API guides
Other docs:
SPRU565 DSP API User Guide
SPRU023 Imaging API UG
SPRU100 FastRTS Math API UG
SPRA885 DSPLIB app note
SPRA886 IMGLIB app note

64

C6000 Embedded Design Workshop - C and System Optimizations 13 - 23


System Optimizations

System Optimizations
Custom Sections
Custom Placement of Data and Code
Problem #1: You have three arrays in the .far section. Two (rcvPing, rcvPong)
have to be linked into L1D and one (SlowBuf) can be linked to DDR2.
How do you split the .far section?

Problem #2: You have two functions in the .text section. One (filter) has to be
linked into L1P and the other (SlowCode) can be linked to DDR2.
How do you split the .text section?

72

Making Custom Sections


Create custom data section using:
    #pragma DATA_SECTION (rcvPing, ".far:rcvBuff");
    int rcvPing[32];
    #pragma DATA_SECTION (rcvPong, ".far:rcvBuff");
    int rcvPong[32];

rcvPing is the name of the buffer


".far:rcvBuff" is the name of the custom section

Create custom code section using:

    #pragma CODE_SECTION(filter, ".text:_filter");


    void filter(*rcvPing, *coeffs, ...) {

How do we link these custom sections?


73

13 - 24 C6000 Embedded Design Workshop - C and System Optimizations


System Optimizations

Linking Custom Sections


Build flow: app.cfg generates audio_appcfg.cmd (MEMORY { } and SECTIONS { });
together with userlinker.cmd it is fed to the Linker, which produces app.out
and the .map file.

userlinker.cmd:
    SECTIONS
    {   .far:rcvBuff:  > FAST_RAM
        .text:_filter: > FAST_RAM
    }

Create your own linker.cmd file for custom sections

CCS projects can have multiple linker CMD files
Results of the linker are written to the .map file
.far: used in case linker.cmd forgets to link the custom section
-mo creates a subsection for every fxn (great for libraries)
-w warns if an unexpected section is encountered
74

Use EDMA
Using EDMA
(Diagram: the CPU and the EDMA both connect through the EMIF to external
memory at 0x8000, which holds func1, func2 and func3; internal RAM holds the
program being executed.)

Program the EDMA to automatically transfer
data/code from one location to another.
Operation is performed WITHOUT CPU intervention
All details covered in a later chapter

76

C6000 Embedded Design Workshop - C and System Optimizations 13 - 25


System Optimizations

Use Cache
Using Cache Memory
(Diagram: the CPU talks to cache hardware, which talks to mDDR/DDR2; the
program's func1, func2 and func3 live in DDR2 at 0x8000.)

Cache hardware automatically transfers
code/data to internal memory, as needed
Addresses in the Memory Map are associated
with locations in cache
Cache locations do not have their own addresses
Note: we have an entire chapter dedicated to cache later on            80

System Architecture SCR


System Architecture: SCR
SCR = Switched Central Resource
Masters initiate accesses to/from slaves via the SCR
Most Masters (requestors) and Slaves (resources) have their own port to the SCR
Lower bandwidth masters (HPI, PCIe, etc.) share a port
There is a default priority (0 to 7) for SCR resources that can be modified:
    SRIO, HOST (PCI/HPI), EMAC
    TC0, TC1, TC2, TC3
    CPU accesses (cache misses)
    Priority Register: MSTPRI

(Diagram: Masters -- SRIO, the C64 CPU, TC0-TC3 (via the CC), PCI66, PCIe, HPI
and EMAC -- connect through the SCR to Slaves -- internal memory, DDR2, EMIF64,
TCP, VCP, McBSP and Utopia.)

Note: refer to your specific datasheet for register names            82

13 - 26 C6000 Embedded Design Workshop - C and System Optimizations


Chapter Quiz

Chapter Quiz
Chapter Quiz
1. How do you turn ON the optimizer ?

2. Why is there such a performance delta between Debug and Opt ?

3. Name 4 compiler techniques to increase performance besides -o?

4. Why is data alignment important?

5. What is the purpose of the -mi option?

6. What is the BEST feedback mechanism to test the compiler's efficiency?

C6000 Embedded Design Workshop - C and System Optimizations 13 - 27


Chapter Quiz

Quiz - Answers

Chapter Quiz
1. How do you turn ON the optimizer ?
Project -> Properties, use -o2 or -o3 for best performance

2. Why is there such a performance delta between Debug and Opt ?


Debug allows for single-step (NOPs), Opt fills delay slots optimally

3. Name 4 compiler techniques to increase performance besides -o?


Data alignment, MUST_ITERATE, restrict, -mi, intrinsics, _nassert()

4. Why is data alignment important?


Performance. The CPU can only perform 1 non-aligned LD per cycle

5. What is the purpose of the -mi option?


To specify the max # of cycles a loop may go "dark" (not responding to INTs)

6. What is the BEST feedback mechanism to test the compiler's efficiency?


Benchmarks, then LOOK AT THE ASSEMBLY FILE. Look for LDDW & SPLOOP

85

13 - 28 C6000 Embedded Design Workshop - C and System Optimizations


Lab 13 C Optimizations

Lab 13 C Optimizations
In the following lab, you will gain some experience benchmarking the use of optimizations using
the C optimizer switches. While your own mileage may vary greatly, you will gain an
understanding of how the optimizer works, where the switches are located and their possible
effects on speed and size.

Lab 13: Optimizations Galore

Source files: aic31_MA_TIRTOS.c, mcaspPlayBk_MA_TIRTOS.c, codecif_MA_TIRTOS.c
(based on the C6748 StarterWare Audio App)

(Block diagram: Audio Input (48 KHz) -> AIC3106 ADC -> McASP RxCh (XBUF14) ->
EDMA3 triple buffers (rxBuf0, ...) -> CopyBufRxToTxTaskFxn(): when Rx is full
and Tx is empty, SEM_pend(Rx), copy/de-interleave 32->16 (with HIST), FIR
filter, interleave 16->32, modify PSETs, start EDMA3 XFRs -> txBuf0, ... ->
McASP TxCh (XBUF13) -> AIC3106 DAC -> Audio Output (48 KHz).
EDMA3CCComplIsr() contains the Rx & Tx handlers and posts the Rx SEM;
Clk1 ticks every 500 ms.)

Procedure:
1. Import existing project (Lab13)
2. Part A: Determine goal & CPU Min; apply compiler options
3. Part B: Code tuning (using pragmas)
4. Part C: Optimize for Space (-ms)
5. Part D: Use DSPLIB FIR filter

Time = 75 min
49

C6000 Embedded Design Workshop - C and System Optimizations 13 - 29


Lab 13 C Optimizations Procedure

Lab 13 C Optimizations Procedure


PART A Goals and Using Compiler Options
Determine Goals and CPU Min
1. Determine Real-Time Goal
Because we are running audio, our real-time goal is for the processing (using low-pass FIR
filter) to keep up with the I/O which is sampling at 48KHz. So, if we were doing a single
sample FIR, our processing time would have to be less than 1/48K = 20.8uS. However, we
are using double buffers, so our time requirement is relaxed to 20.8uS * BUFFSIZE = 20.8 *
256 = 5.33ms. Alright, any DSP worth its salt should be able to do this work inside 5ms.
Right? Hmmm
Real-time goal: music sounds fine.
2. Determine CPU Min.
What is the theoretical minimum based on the C674x architecture? This is based on several
factors: data type (16-bit), # of loads required and the type of mathematical operations involved.
What kind of algorithm are we using? FIR. So, let's figure this out:
256 data samples * 64 coeffs = 16384 cycles. This assumes 1 MAC/cycle
Data type = 16-bit data
# loads possible = 8 16-bit values (aligned). Two LDDW (load double words).
Mathematical operation DDOTP (cross multiply/accumulate) = 8 per cycle
So, the CPU Min = 16384/8 = ~2048 cycles + overhead.
If you look at the inner loop (which is a simple dot product), it will take 64/8 = 8 cycles
per inner loop. Add 8 cycles of overhead for the prologue and epilogue (pre-loop and post-loop
code), so the inner loop is 16 cycles. Multiply that by the buffer size (256), so the
approximate CPU Min = 16*256 = 4096.
CPU Min = 4096 cycles.
3. Import Lab 13 Project.
Import Lab 13 Project from \Labs\Lab13 folder. The file name is:
LAB_13_C6000_STARTER_OPT.zip
Change the build properties to use the latest TI-RTOS tools.
4. Make a few changes.
Open ceoffs_MA_TIRTOS.c. Comment out the ALL PASS coefficient table and
uncomment the LOW PASS filter coefficients. We are going low in this lab. ;-) When done,
save the file. This will help you know a new set of coeffs are being used and it adds a bit of
spice to what we are doing.
Open mcaspPlayBk_MA_TIRTOS.c. Scroll down to the function
CopyBufRxToTxTaskFxn() around line 634. This is where the cfir() function is called for both
the left and right samples. We will be benchmarking this set of calls throughout the lab.
Notice the built-in logging info in the code for cfir(). As you add each optimization, you will
be able to look at the logging data in UIA to see the differences as each option (or piece of
code) is added.
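The logging/benchmarking pattern looks roughly like the sketch below; it is a simplified stand-in for the code already in the project, using the TI-RTOS Timestamp and Log APIs, and the wrapper function and buffer names here are illustrative only:

    #include <xdc/std.h>
    #include <xdc/runtime/Timestamp.h>
    #include <xdc/runtime/Log.h>

    extern void cfir(short *x, const short *h, short *r, int nh, int nr);

    /* Illustrative wrapper: time the two cfir() calls and log the cycle count */
    void benchmark_cfir(short *histL, short *histR, const short *coeffs,
                        short *outL, short *outR, int order, int nsamps)
    {
        UInt32 t0, t1;

        t0 = Timestamp_get32();
        cfir(histL, coeffs, outL, order, nsamps);
        cfir(histR, coeffs, outR, order, nsamps);
        t1 = Timestamp_get32();

        Log_info1("FIR processing took %u CPU cycles", t1 - t0);
    }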

13 - 30 C6000 Embedded Design Workshop - C and System Optimizations


Lab 13 C Optimizations Procedure

5. Using the Debug Build Configuration, build and play.


Make sure your project is using the build configuration named Debug and build your code.
After a successful build, load it and then run it for about 5-10 seconds and then HALT. The
audio should sound fine. What is your cfir() benchmark?
Select the RTOS Analyzer, enable CPU Load and click Start.
Filter the Live Session system log to look at only those logs that contain FIR.
Click on the Live Session tab. And then click on the filter button:

When the dialogue comes up, fill in the following:

Then click the Filter button above and view the Live Session tab. It should look something
like this:

Write down your actual benchmark below. The author, at the time of this writing, calculated
about 639K cycles as you can see.

Debug (-g, no opt) benchmark for cfir()? _________________ cycles


Did we meet our real-time goal (music sounding fine?): ____________

6. Compare to Lab 11 profiler clock results.


In Lab 11, we used the profiler clock to benchmark the cfir() calls. Copy your benchmark to
here and compare:
Profiler clock (Lab 11) ___________ Lab 13 benchmark ___________
The author compared these two numbers: Lab 11 642K vs Lab 13 639K cycles. Close.

C6000 Embedded Design Workshop - C and System Optimizations 13 - 31


Lab 13 C Optimizations Procedure

7. Check CPU load.


Click on the CPU Load tab. What do you see?

The author saw about 83%. Goodness: a high-powered DSP performing a simple FIR filter
and it is almost out of steam. Whoa. Maybe we need to OPTIMIZE this thing.
What were your results? Write them down below:

Debug (-g, no opt) CPU Load for the application? ________ %


Did we meet our real-time goal (music sounding fine?): ____________

Yes, we met the real-time goal because the audio sounds fine.
But hey, it's using the Debug Configuration. And if we wanted to single-step our code, we
can. It is a very nice debug-friendly environment, although the performance is abysmal. This
is to be expected.

8. View Debug Build Configuration compiler options.


FYI: if you looked at the options for the Debug configuration, you'd see the following:

Full symbolic debug is turned on.

13 - 32 C6000 Embedded Design Workshop - C and System Optimizations


Lab 13 C Optimizations Procedure

What about the optimizer? Is it turned on? Go look:

Nope. It is off. So this is the standard Debug configuration. OK, a nice fluffy debug environment
to make sure we're getting the right answers, but not very high performance. Let's kick it up
a notch...

Using Release Configuration (-o2, -g)


9. Change the build configuration from Debug to Release.
Next, we'll use the Release build configuration.
In the project view, right-click on the project and choose Build Configurations and Set
Active the Release configuration as shown:

Normally, the author NEVER uses the Release build configuration at all. Why? Because it
doesn't contain all of the build paths that now work perfectly in the Debug configuration. Yes,
we could simply copy each one over manually, but that is a pain. The author uses Debug first,
gets the code logically correct and then creates a new configuration (OPT), copies over the
Debug settings (paths) and then begins adding optimizations one by one.
However, in this lab, we just want to test what Release does, and the author already copied
over the settings for you to the Release configuration to make it easy on you.

C6000 Embedded Design Workshop - C and System Optimizations 13 - 33


Lab 13 C Optimizations Procedure

10. Rebuild and Play.


Build, Load and Run.
By the way, if you are NOT in this habit already, you need to be now. When you click the
Load button (down arrow):

And then select Load Program, the following dialogue pops up:

The .out file shown above was the LAST file that was loaded. Now that we have switched
configurations (or maybe even switched projects), if we just select OK, we will get the
WRONG file loaded. Always, always, ALWAYS click the Browse project button and
specifically choose the file you want:

Once built and loaded, your audio should sound fine; that is, if you like to hear music
with no treble. Remember, just run it for 5-10 seconds.

11. Benchmark cfir() release mode.


Using the same method as before, observe the benchmark for cfir().

Release (-o2, -g) benchmark for cfir()? __________ cycles

Here's our picture:

OK, now we're talking: it went from 639K to 27K just by switching to the Release
configuration. So, the bottom line is TURN ON THE OPTIMIZER !!

13 - 34 C6000 Embedded Design Workshop - C and System Optimizations


Lab 13 C Optimizations Procedure

12. Did the CPU Load change?


Check out your CPU Load. Here is ours:

Wow: from 83% down to about 5%. What a difference, and we aren't even CLOSE to the
best benchmark yet. This begins to show you the real difference between NO optimization
and simply using -o2.
13. Study release configuration build properties.
Find these locations in Properties:

The biggie is that -o2 is selected. But we still have -g turned on, which is fine.
Can we improve on this benchmark a little? Maybe...

C6000 Embedded Design Workshop - C and System Optimizations 13 - 35


Lab 13 C Optimizations Procedure

Using Opt Configuration


14. Create a NEW build configuration named Opt.
Really? Yep. And it's easy to do.
Using the Release configuration, right-click on the project and select Properties (where
you've been many times already).
Click on C6000 Compiler > Optimization and notice the optimization level is -o2.
Look up a few inches and you'll see the Configuration: drop-down dialogue.
Click on the down arrow and you'll see Debug and Release.
Click on the Manage Configurations button:

Click New and, when the following dialogue pops up, name your new configuration Opt,
change the Copy settings from option to use Existing Configuration and make sure you
choose the Release configuration as shown:

Click OK a few times...

Change the Active Configuration to Opt.

13 - 36 C6000 Embedded Design Workshop - C and System Optimizations


Lab 13 C Optimizations Procedure

15. Change the Opt build properties to use -o3 and NO -g.


We need to change TWO options at this point: the optimization level (change to -o3) and
turn off symbolic debug (-g).
Modify the optimization level to -o3.
Modify the Debugging Model to use:

Rebuild your code and benchmark as before. Also look at the CPU Load.

Opt (-o3, no -g) benchmark for cfir()? ___________ cycles

CPU Load = _______ %

The author's benchmark was as follows, and the CPU Load was about 3%:

5263 cycles. Is that incredible or what? Just about 4 years ago (in 2012), at this point in the
lab, the benchmark was 18K cycles. This means that the compiler team continues to work
hard on interpreting your code and finding ways to cut cycles. My hat is tipped to the TI
compiler team.
So, in just 30 minutes of work, we have reduced our benchmark from 639K cycles to just
about 5K cycles.
The downside is that there isn't much else we can do to optimize our code. We WILL do
some more optimizations in the next part, but they won't have much effect. But remember,
everyone's mileage will vary, so that is why we go through each step anyway. You will need
all the tools possible for your own application.
Just for kicks and grins, try single-stepping your code and/or adding breakpoints in the
middle of a function (like cfir). Is this more difficult with -g turned OFF and -o3 applied? Yep.

Note: With -g turned OFF, you still get symbol capability, i.e. you can enter symbol
names into the watch and memory windows. However, it is nearly impossible to single-
step C code, hence the suggestion to create test vectors at function boundaries to
check the LOGICAL part of your code when you build with the Debug Configuration.
When you turn off -g, you need to look at the answers on function boundaries to make
sure it is working properly.

C6000 Embedded Design Workshop - C and System Optimizations 13 - 37


Lab 13 C Optimizations Procedure

16. Turn on verbose and interlist and then see what the .asm file looks like for fir.asm.
As noted in the discussion material, to see it all, you need to turn on three switches. Turn
them on now, then build, then peruse the fir.asm file. You will see some interesting
information about software pipelining for the loops in mcaspPlayBk_MA_TIRTOS.c.
Turn on:
Runtime Model Options > Verbose pipeline info (-mw)

Advanced Optimizations > Interlist (-os)

Assembler Options > Keep .ASM file (-k)

This is the information you will need in order to check to see if SPLOOP was disqualified and
why. If SPLOOP is being used, you know that the loops are small enough to fit in the buffer
and that you are getting maximum performance.
You can re-check the ASM files as you do each step in the next part

13 - 38 C6000 Embedded Design Workshop - C and System Optimizations


Lab 13 C Optimizations Procedure

Part B Code Tuning


Just a note of caution here as we begin this section. Due to the already incredible benchmark we
have thus far, do not expect much improvement in the following steps. The goal here is to learn
these steps in case they make a drastic improvement in YOUR application.
17. Use #pragma MUST_ITERATE in cfir() function.
Locate the cfir() function in mcaspPlayBk_MA_TIRTOS.c. It is near line 696.
Uncomment the #pragmas for MUST_ITERATE on the two for loops. This pragma gives
the compiler some information about the loops and how to unroll them efficiently. As
always, the more info you can provide to the compiler, the better.
Use the Opt build configuration. Build, load and run your code. Then benchmark the cfir()
function as before

Opt + MUST_ITERATE (-o3, no g) cfir()? __________ cycles

KEEP this benchmark in mind as you do the next cache lab. We will compare the results.
The authors results were:

OK, so the benchmark is similar if not identical. That's OK. Your mileage may vary in terms of
your own system. Also, if you were paying attention to the generated ASM files, after using
MUST_ITERATE, the tools only created ONE loop instead of two because we told it what the
min/max trip counts were. We helped the compiler become even more efficient.
18. Use restrict keyword on the results array.
You actually have a few options to tell the compiler there is NO ALIASING. The first method
is to tell the compiler that your entire project contains no aliasing (using the -mt compiler
option). However, it is best to narrow the scope and simply tell the compiler that the results
array has no aliasing (because the WRITES are destructive, we RESTRICT the output array).
Comment out the old cfir() declaration and uncomment the new one that contains the
restrict keyword as shown below:

Build, then run again. Now benchmark your code again. Did it improve?
Opt + MUST_ITERATE + restrict (-o3, no g) cfir()? __________ cycles

Because aliasing was already figured out by the tools earlier, there was not much
improvement. The author saw 5720 cycles (a slight increase).

C6000 Embedded Design Workshop - C and System Optimizations 13 - 39


Lab 13 C Optimizations Procedure

Part C Minimizing Code Size (ms)


19. Determine current cfir benchmark and .text size.
Select the Opt configuration and also make sure MUST_ITERATE and restrict are used
in your code (this is the same setting as the previous lab step).
Rebuild and Run.
Write down your fastest benchmark for cfir:

Opt (-o3, NO -g, NO -ms3) cfir: ____________ cycles


.text (NO -ms) = ___________ h

Open the .map file generated by the linker. Hmmm. Where is it located?
Try to find it yourself without asking anyone else. Hint: which build config did you use
when you hit build ?
20. Add -ms3 to the Opt Config.
Open the build properties and add -ms3 to the compiler options (under Optimization). We
will just put the pedal to the metal for code size optimizations and go all the way to -ms3
first. Note here that we also have -o3 set (which is required for the -ms option).
In this scenario, the compiler may choose to keep the slow version of the redundant loops
(fast or slow) due to the presence of -ms.
Rebuild and run.

Opt + -ms (-o3, NO -g, -ms3) cfir: ____________ cycles


.text (-ms3) = ___________ h

Did your benchmark get worse with -ms3? How much code size did you save? What
conclusions would you draw from this?
____________________________________________________________________
____________________________________________________________________

Keep in mind that you can also apply -ms3 (or most of the basic options) to a specific
function using #pragma FUNCTION_OPTIONS( ).
FYI: the author saved about 3.3K bytes total out of the .text section and the benchmark was
about 24K.
Also remember that you can apply -ms3 on a FILE-BY-FILE basis. So, a smart way to apply
this is to use it on init routines and keep it far away from your algos that require the best
performance.

13 - 40 C6000 Embedded Design Workshop - C and System Optimizations


Lab 13 C Optimizations Procedure

Part D Using DSPLib


It will be interesting to test the C compiler's best performance vs. the DSPLib version of the FIR
filter, which was hand-optimized in assembly. Of course, the assembly can't be re-optimized to
take advantage of the latest compiler upgrades. At one point in 2012, the compiler and the
DSPLib benchmarks were almost identical. Let's see how they compare today...
21. Download and install the appropriate DSP Library.
This, fortunately for you, has already been done for you. This directory is located at:
C:\TI_RTOS\C6000\dsplib64x+\lib\dsplib64plus_elf.lib
22. Link the appropriate library to your project.
Find the lib file in the above folder and link it to your project (ELF version).
Also, add the include path for this library to your build properties.
23. Add #include to the system_MA_TIRTOS.h file.
Add the proper #include for the header file for this library to system_MA_TIRTOS.h.
24. Replace the calls to the fir function in fir.c.
Replace:
cfir(rxBufFirL.hist, COEFFS, txBufFirL, ORDER, NUM_SAMPLES_PER_AUDIO_BUF/2);
cfir(rxBufFirR.hist, COEFFS, txBufFirR, ORDER, NUM_SAMPLES_PER_AUDIO_BUF/2);
with
DSP_fir_gen(rxBufFirL.hist, COEFFS, txBufFirL, ORDER,
NUM_SAMPLES_PER_AUDIO_BUF/2);
DSP_fir_gen(rxBufFirR.hist, COEFFS, txBufFirR, ORDER,
NUM_SAMPLES_PER_AUDIO_BUF/2);

25. Build, load, verify and BENCHMARK the new FIR routine in DSPLib.
26. What are the best-case benchmarks?

Yours (compiler/optimizer):___________ DSPLib: ___________

The author's results are here:

Wow, for what we wanted in THIS system (a fast, simple FIR routine), we would have been
better off just using DSPLib. Yep. But, in the process, you've learned a great deal about
optimization techniques across the board that may or may not help your specific system.
Remember, your mileage may vary.

C6000 Embedded Design Workshop - C and System Optimizations 13 - 41


Lab 13 C Optimizations Procedure

Conclusion
Hopefully this exercise gave you a feel for how to use some of the basic compiler/optimizer
switches for your own application. Everyone's mileage may vary and there just might be a
magic switch that helps your code and doesn't help someone else's. That's the beauty of trial
and error.
Conclusion? TURN ON THE OPTIMIZER! Was that loud enough?
Here's what the author came up with; how did your results compare?

Optimizations Benchmark
Debug Bld Config No opt 639K
Release (-o2, -g) 27K
Opt (-o3, no -g) 5260
Opt + MUST_ITERATE 5260
Opt + MUST_ITERATE + restrict 5720 (slight increase)
DSPLib (FIR) 4384

Regarding -ms3, use it wisely. It is more useful to add this option to functions that are large
but not time-critical, like IDLE functions, init code and maintenance-type items. You can save
some code space (important) and lose some performance (probably a don't care). For your
time-critical functions, do not use -ms ANYTHING. This is just a suggestion; again, your
mileage may vary.
CPU Min was 4K cycles. We got close, but didn't quite reach it. The author believes that it is
possible to get closer to the 4K benchmark by using intrinsics and the DDOTP instruction.
However, the DSPLIB function did quite a nice job.
Keep in mind that these benchmarks are not exactly perfect. Why? Because we never
subtracted out the number of cycles to perform a Timestamp_get32(). The author thinks that
would lower the benchmarks by ~100-150 more cycles. But relative to each other is what
you were keeping track of.
The biggest limiting factor in optimizing the cfir routine is the sliding window. The processor
is only allowed ONE non-aligned load each cycle. This would happen 75% of the time. So,
the compiler is already playing some games and optimizing extremely well given the
circumstances. It would require hand-tweaking via intrinsics and intimate knowledge of the
architecture to achieve much better.

27. Terminate the Debug session, close the project and close CCS. Power-cycle the board.

Throw something at the instructor to let him know that you're done with the
lab. Hard, sharp objects are most welcome...

13 - 42 C6000 Embedded Design Workshop - C and System Optimizations


Cache & Internal Memory
Introduction
In this chapter the memory options of the C6000 will be considered. By far, the easiest and
highest performance option is to place everything in on-chip memory. In systems where this is
possible, it is the best choice. To place code and initialize data in internal RAM in a production
system, refer to the chapters on booting and DMA usage.

Most systems will have more code and data than the internal memory can hold. As such, placing
everything off-chip is another option, and can be implemented easily, but most users will find the
performance degradation to be significant. As such, the ability to enable caching to accelerate the
use of off-chip resources will be desirable.

For optimal performance, some systems may benefit from a mix of on-chip memory and cache.
Fine-tuning of code for use with the cache can also improve performance, and assure reliability in
complex systems. Each of these constructs will be considered in this chapter.

Objectives

Objectives

Compare/contrast different uses of


memory (internal, external, cache)
Define cache terms and definitions
Describe C6000 cache architecture
Demonstrate how to configure and use
cache optimally
Lab 14: modify an existing system to
use cache; benchmark solutions

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 1
Module Topics

Module Topics
Cache & Internal Memory ......................................................................................................... 14-1
Module Topics ......................................................................................................................... 14-2
Why Cache? ............................................................................................................................ 14-3
Cache Basics Terminology .................................................................................................. 14-4
Cache Example ....................................................................................................................... 14-7
L1P Program Cache........................................................................................................... 14-10
L1D Data Cache................................................................................................................. 14-14
L2 RAM or Cache ?............................................................................................................ 14-16
Cache Coherency (or Incoherency?) .................................................................................... 14-18
Coherency Example .......................................................................................................... 14-18
Cache Functions Summary ............................................................................................ 14-22
Coherency Summary ..................................................................................................... 14-23
Cache Alignment ............................................................................................................... 14-23
MAR Bits Turn On/Off Cacheability ................................................................................... 14-24
Additional Topics ................................................................................................................... 14-26
Chapter Quiz ......................................................................................................................... 14-29
Quiz Answers ................................................................................................................. 14-30
Lab 14 Using Cache........................................................................................................... 14-31
Lab 14 Using Cache Procedure ...................................................................................... 14-32
A. Run System From Internal RAM .................................................................................. 14-32
B. Run System From External DDR2 (no cache)............................................................. 14-34
C. Run System From DDR2 (cache ON) .......................................................................... 14-37
Notes ..................................................................................................................................... 14-40

14 - 2 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Why Cache?

Why Cache?
Parking Dilemma

Close Parking:         0 minute walk, 10 spaces, $100/space (right next to the Sports Arena)
Distant Parking-Ramp:  10 minute walk, 1000 spaces, $5/space

Parking Choices:
0 minute walk @ $100 for close-in parking
10 minute walk @ $5 for distant parking
or ...

Why Cache?

Cache Memory: fast, small          Bulk Memory: slower, larger, cheaper
Together they work like a big, fast memory.

Memory Choices:
Small, fast memory
Large, slow memory
or Use Cache:
Combines advantages of both
Like valet, data movement is automatic

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 3
Cache Basics Terminology

Cache Basics Terminology


Using Cache Memory
(Diagram: the CPU talks to cache hardware, which talks through the EMIF to
external memory; the program's func1, func2 and func3 live in external memory
at 0x8000.)

Cache hardware automatically transfers
code/data to internal memory, as needed
Addresses in the Memory Map are
associated with locations in cache
Cache locations do not have their own addresses

Let's start with Basic Concepts of a Cache


10

Cache: Block, Line, Index
(Diagram: a cache with 16 lines, indexes 0 through 0xF, shown next to external
memory blocks starting at 0x8000, 0x8010 and 0x8020.)

Conceptually, a cache divides the entire
memory into blocks equal to its size
A cache is divided into smaller storage
locations called lines
The term Index or Line-Number is used to
specify a specific cache line

How do we know which block is cached?


11

14 - 4 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Cache Basics Terminology

Cache Tags
(Diagram: each cache line now carries a Tag field, e.g. tag 800 on index 0 and
tag 801 on index 1, next to the same external memory blocks.)

A Tag value keeps track of which memory block is
associated with a cache line

Each line has its own tag; thus,
the whole cache won't be erased when
lines from different memory blocks need to be
cached simultaneously

How do we know a cache line is valid (or not)?


12

Valid Bits
(Diagram: each cache line now also has a Valid bit ahead of its Tag; lines
holding tags 800 and 801 are marked valid (1), unused lines are 0.)

A Valid bit keeps track of which lines
contain real information

They are set by the cache hardware
whenever new code or data is stored

This type of cache is called ...


13

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 5
Cache Basics Terminology

Direct-Mapped Cache
(Diagram: the 16-line cache next to external memory blocks at 0x8000, 0x8010
and 0x8020; every block maps onto the same 16 lines.)

Direct-Mapped Cache associates an address
within each block with one cache line

Thus there will be only one unique cache
index for any address in the memory-map
Only one block can have information in a
cache line at any given time

Let's look at an example ...


14

14 - 6 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Cache Example

Cache Example
Direct-Mapped Cache Example
[Figure: a 16-line cache (indexes 0 to 0xF) with Valid and Tag fields, next to external memory blocks at 0x8000, 0x8010, 0x8020, 0x8030]

Let's examine an arbitrary direct-mapped cache example:
A 16-line, direct-mapped cache requires a 4-bit index
If our example µP used 16-bit addresses, this leaves us with a 12-bit tag

Address split:  bits 15-4 = Tag,  bits 3-0 = Index

16

Arbitrary Direct-Mapped Cache Example
The following example uses:
A 16-line cache
16-bit addresses, and
One 32-bit instruction stored per line
C6000 caches have different cache and line sizes than this example
It is only intended as a simple example to reinforce cache concepts

17

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 7
Cache Example

Conceptual Example Code


Address Code
0003h L1 LDH
0004h MPY
0005h ADD
0006h B L2

0026h L2 ADD
0027h SUB cnt
0028h [!cnt] B L1

Address split:  bits 15-4 = Tag,  bits 3-0 = Index
18
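To make the address split concrete, here is a small C sketch (assuming the 16-line, 16-bit-address
example above -- not the real C6000 cache or line sizes) that extracts the index and tag the cache
would use:

    /* Sketch for the 16-line example above: a 16-bit address splits into a
       4-bit index (bits 3-0) and a 12-bit tag (bits 15-4). These masks are
       illustrative only -- real C6000 caches use different sizes.           */
    #include <stdint.h>

    static inline uint16_t cache_index(uint16_t addr) { return addr & 0x000F; } /* bits 3-0  */
    static inline uint16_t cache_tag(uint16_t addr)   { return addr >> 4;     } /* bits 15-4 */

    /* Example: address 0x0026 -> index 6, tag 0x002 -- the same line (index 6)
       that address 0x0006 (tag 0x000) maps to.                              */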

Direct-Mapped Cache Example

After running the example code, the cache contains:
Index 3:  Tag 000, LDH         (from address 0003h)
Index 4:  Tag 000, MPY         (0004h)
Index 5:  Tag 000, ADD         (0005h)
Index 6:  Tag 000/002, B / ADD (0006h and 0026h share index 6 and evict each other)
Index 7:  Tag 002, SUB         (0027h)
Index 8:  Tag 002, B           (0028h)

Address    Code
0003h      L1 LDH
...
0026h      L2 ADD
0027h      SUB cnt
0028h      [!cnt] B L1
33

14 - 8 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Cache Example

Direct-Mapped Cache Example (cont.)
[Figure: the same cache table as above -- index 3: LDH, 4: MPY, 5: ADD, 6: B/ADD (thrashing), 7: SUB, 8: B]

Notes:
This example was contrived to show how cache lines can thrash
Code thrashing is minimized on the C6000 due to relatively large cache sizes
Keeping code in contiguous sections also helps to minimize thrashing

Let's review the two types of misses that we encountered ...
34

Types of Misses
Compulsory
Miss when first accessing a new address
Conflict
Line is evicted upon access of an address whose index is already cached
Solutions:
Change the memory layout
Allow more lines for each index
Capacity (we didn't see this in our example)
Line is evicted before it can be re-used because the capacity of the cache is exhausted
Solution: Increase cache size

35

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 9
L1P Program Cache

L1P Program Cache

L1P Cache
[Figure: CPU with Program Cache (L1P), L2, EMIF and external DDR2]

    for( i = 0; i < 10; i++ ) {
        sum += x[i] * y[i];
    }

Zero-waitstate Program Memory
Direct-Mapped Cache
Works exceptionally well for DSP code (which tends to have many loops)
Code can be placed to minimize thrashing

How big is the cache?

38

L1P Size

Device              Scheme          Size        Linesize
C62x/C67x           Direct Mapped   4K bytes    64 bytes (16 instr)
C64x                Direct Mapped   16K bytes   32 bytes (8 instr)
C64x+/C674x/C66x    Direct Mapped   32K bytes   32 bytes (8 instr)

All L1P memories provide zero waitstate access

What does Linesize mean?

39

14 - 10 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
L1P Program Cache

New Term: Linesize
[Figure: the 16-line cache (indexes 0 to 0xF) next to DDR2 blocks at 0x8000, 0x8010, 0x8020]

In our earlier cache example, the size was:
Size: 16 bytes
Linesize: 1 byte
# of indexes: 16

How else could it be configured?

40

New Term: Linesize
[Figure: the same 16-byte cache redrawn with 8 indexes (0 to 0x7), each line now holding two bytes]

In our earlier cache example, the size was:
Size: 16 bytes
Linesize: 1 byte
# of indexes: 16

If the line size increases to TWO bytes, then:
Size: 16 bytes
Linesize: 2 bytes
# of indexes: 8

What's the advantage of a greater line size?
Speed! When the cache retrieves one item, it gets another at the same time.

New C64x+ L1P features...

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 11
L1P Program Cache

L1P Cache Comparison

Device              Scheme          Size        Linesize              New Features
C62x/C67x           Direct Mapped   4K bytes    64 bytes (16 instr)   N/A
C64x                Direct Mapped   16K bytes   32 bytes (8 instr)    N/A
C64x+/C674x/C66x    Direct Mapped   32K bytes   32 bytes (8 instr)    Cache/RAM, Cache Freeze, Memory Protection

All L1P memories provide zero waitstate access

The next two slides discuss the Cache/RAM and Freeze features.
Memory Protection is not discussed in this workshop.

Cache/RAM...
42

C64x+ L1P Cache vs. Addressable RAM
[Figure: the 32K L1P split between addressable RAM and cache, with cache sizes of 4K, 8K, 16K or 32K]

Can be configured as Cache or Addressable RAM
Five cache sizes are available: 0K, 4K, 8K, 16K, 32K
Allows critical loops to be put into L1P RAM, while still affording room for cache memory

Cache Freeze...
43

14 - 12 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
L1P Program Cache

Cache Freeze (C64x+)

Freezing the cache prevents data that is currently cached from being evicted
Cache Freeze:
Responds to read and write hits normally
No updating of cache on miss
Freeze supported on C64x+ L2/L1P/L1D
Commonly used with Interrupt Service Routines so that one-use code does not replace realtime algo code
Other cache modes: Normal, Bypass
Cache_xyz: BIOS Cache management module

Cache Mode Management
Mode = Cache_getMode(level)            returns the state of the specified cache
oldMode = Cache_setMode(level, mode)   sets the state of the specified cache

    typedef enum {                typedef enum {
        CACHE_L1D,                    CACHE_NORMAL,
        CACHE_L1P,                    CACHE_FREEZE,
        CACHE_L2                      CACHE_BYPASS
    } CACHE_Level;                } CACHE_Mode;
44
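As a hedged illustration of the freeze mode described above, a routine might freeze L1P around
one-shot code as sketched below. The identifiers follow the slide's CACHE_Level/CACHE_Mode names;
newer SYS/BIOS releases spell these Cache_Type_L1P and Cache_Mode_FREEZE, so check the Cache
module header for your tools version. one_shot_init() is a hypothetical function name.

    /* Hedged sketch: freeze L1P so rarely-used code does not evict the
       realtime loop code already sitting in the cache.                 */
    void run_one_shot_code(void)
    {
        CACHE_Mode oldMode;

        oldMode = Cache_setMode(CACHE_L1P, CACHE_FREEZE); /* hits still served, no new allocations */
        one_shot_init();                                  /* one-use code runs without polluting L1P */
        Cache_setMode(CACHE_L1P, oldMode);                /* restore the previous mode */
    }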

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 13
L1D Data Cache

L1D Data Cache


Caching Data
[Figure: a direct-mapped data cache next to DDR2, with arrays x and y sitting 32K apart]

One instruction may access multiple data elements:

    for( i = 0; i < 4; i++ ) {
        sum += x[i] * y[i];
    }

What would happen if x and y ended up at the following addresses?
x = 0x0000
y = 0x8000
They would end up overwriting each other in the cache --- called thrashing
Increasing the associativity of the cache will reduce this problem

How do you increase associativity?
46

Increased Associativity
[Figure: the cache split into Way 0 and Way 1 (16K each), with DDR2 blocks at 0x00000, 0x08000, 0x10000, 0x18000]

Split a Direct-Mapped Cache in half
Each half is called a cache way
Multiple ways make data caches more efficient

What is a set?
47

14 - 14 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
L1D Data Cache

What is a Set?
[Figure: two-way data cache with Set 0 and Set 1 highlighted, next to DDR2 addresses 0x8000, 0x8008, 0x8010, 0x8018]

The lines from each way that map to the same index form a set
(e.g. the set of index zeroes is Set 0)
The number of lines per set defines the cache as an N-way set-associative cache
With 2 ways, there are now 2 unique cache locations for each memory address
How do you determine WHICH line gets replaced? (LRU algorithm)

L1D Summary...
49

L1D Summary

Device              Scheme              Size                       Linesize   New Features
C62x/C67x           2-Way Set Assoc.    4K bytes                   32 bytes   N/A
C64x                2-Way Set Assoc.    16K bytes                  64 bytes   N/A
C64x+/C674x/C66x    2-Way Set Assoc.    C6455: 32K, DM64xx: 80K    64 bytes   Cache/RAM, Cache Freeze, Memory Protection

All L1D memories provide zero waitstate access
Cache/RAM configuration and Cache Freeze work similarly to L1P
L1 caches are Read Allocate, thus only updated on memory read misses

50

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 15
L2 RAM or Cache ?

L2 RAM or Cache ?
Internal Memory (L2)
[Figure: CPU with L1 Program (L1P), L1 Data (L1D) and a unified L2 for program & data]

Device    Size          L2 Features
C671x     64KB - 128K   Unified (code or data); config as Cache or RAM; none, or 1 to 4 way cache
C64x      64KB - 1MB    Unified (code or data); config as Cache or RAM; cache is always 4-way
C64x+     64KB - 2MB    Unified (code or data); config as Cache or RAM; cache is always 4-way; Cache Freeze; Memory Protection

L2 linesize for all devices is 128 bytes
L2 caches are Read/Write Allocate memories

L2 Cache Configuration...
52

C64x+/C674x L2 Memory Configuration

Configuration
Up to 2MB on the C6455
When enabled, the L2 cache is always 4-Way (same as C64x)
Configurable in size: 0, 32K, 64K, 128K, 256K

Linesize
Linesize = 128 bytes
Same linesize as C671x & C64x

Performance
L2 -> L1P: 1-8 cycles
L2 -> L1D: L2 SRAM hit: 12.5 cycles
           L2 Cache hit: 14.5 cycles
           Pipelined: 4 cycles
When required, minimize latency by using L1D RAM

Using the Config Tool...
53

14 - 16 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
L2 RAM or Cache ?

Setting Cache Sizes

The default settings for TI-RTOS (via the platform file) are:
L1D : 32K
L1P : 32K
L2  : 0K

If you want to change these settings at runtime, use Cache_setSize():

    typedef struct Cache_Size {      /* structure (see Cache.h) */
        l1pSize;
        l1dSize;
        l2Size;
    } Cache_Size;

    #include <ti/sysbios/family/c64p/Cache.h>

    Cache_Size cache_size_struct;
    cache_size_struct.l1pSize = Cache_L1Size_32K;
    cache_size_struct.l1dSize = Cache_L1Size_32K;
    cache_size_struct.l2Size  = Cache_L2Size_0K;
    Cache_setSize(&cache_size_struct);

Or set them via the BIOS .CFG file (which is what we will do in the lab)

54

Cache Performance Summary

Device             L1P                    L1D                    L2 Performance
C62x/C67x          Zero Waitstate Cache   Zero Waitstate Cache   L2 -> L1P: 16 instr in 5 cycles
                                                                 L2 -> L1D: 32 bytes in 4 cycles
C64x               Zero Waitstate Cache   Zero Waitstate Cache   L2 -> L1P: 8 instr in 1-8 cycles
                                                                 L2 -> L1D: 64 bytes in: L2 SRAM 6 cycles, L2 Cache 8 cycles, Pipelined 2 cycles
C64x+/C674x/C66x   Zero Waitstate         Zero Waitstate         L2 -> L1P: 8 instr in 1-8 cycles
                   Cache/RAM              Cache/RAM              L2 -> L1D: 64 bytes in: L2 SRAM 12.5 cycles, L2 Cache 14.5 cycles, Pipelined 4 cycles

55

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 17
Cache Coherency (or Incoherency?)

Cache Coherency (or Incoherency?)


Coherency Example
Coherency Example: Description
[Figure: CPU with L1D and L2 caches; the EDMA moves RcvBuf and XmtBuf to/from DDR2]

For this example, L2 is set up as cache
Example's data flow:
1. EDMA fills RcvBuf in DDR
2. CPU reads RcvBuf, processes data, and writes to XmtBuf
3. EDMA moves data from XmtBuf (e.g. to a serial port transmitter)
57

CPU Reading Buffers - RCV
[Figure: RcvBuf now present in L1D, L2 and DDR2]

The CPU reads the buffer for processing
This read causes a cache miss in L1D and L2
RcvBuf is added to both caches (space is allocated in each cache)
L2 is R/W allocate, L1 is read-allocate only

What happens on the WRITE?
60

14 - 18 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Cache Coherency (or Incoherency?)

Where Does the CPU Write To?
[Figure: XmtBuf present in L2 only; RcvBuf still in DDR2]

After processing, the CPU writes to XmtBuf
Write misses to L1D are written directly to the next level of memory (L2)
Thus, the write does not go directly to external memory
Cache line allocated: L1D on Read only; L2 on Read or Write
61

Coherency Issue: Write
[Figure: XmtBuf sits in L2 cache, while the EDMA reads the stale XmtBuf location in DDR2]

The EDMA is set up to transfer the buffer from external memory
The buffer resides in cache, not in external memory
So, the EDMA transfers whatever is in external memory -- probably not what you wanted

What is the solution?
62

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 19
Cache Coherency (or Incoherency?)

Coherency Solution: Write (Flush/Writeback)
[Figure: a cache writeback copies XmtBuf from L2 out to DDR2 for the EDMA to read]

When the CPU is finished with the data (and has written it to XmtBuf in L2), it can
be sent to external memory with a cache writeback
A writeback is a copy operation from cache to memory, writing back the modified
(i.e. dirty) memory locations -- all writebacks operate on full cache lines
Use the BIOS Cache APIs to force a writeback:
    Cache_wb (XmtBuf, BUFFSIZE, L2, CACHE_NOWAIT);

What happens with the "next" RCV buffer? 63

Coherency Issue: Read
[Figure: the EDMA writes a new RcvBuf to DDR2, but stale copies remain in L1D and L2]

The EDMA writes a new RcvBuf buffer to external memory
When the CPU reads RcvBuf, a cache hit occurs since the buffer (with old, stale data) is still valid in cache
Thus, the CPU reads the old data instead of the new

Solution?
64

14 - 20 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Cache Coherency (or Incoherency?)

Coherency Solution: Read
[Figure: invalidating the cached RcvBuf forces the CPU's next read to fetch the new data from DDR2]

To get the new data, you must first invalidate the old data before trying to read
the new data (this clears the cache lines' valid bits)
Again, cache operations (writeback, invalidate) operate on cache lines
BIOS provides an invalidate option:
    Cache_inv (RcvBuf, BUFFSIZE, L2, CACHE_WAIT);

65

Another Solution: Place Buffers in L2
[Figure: RcvBuf and XmtBuf located in L2 RAM; the EDMA reads/writes them there directly]

Configure some of L2 as RAM
Locate the buffers in this RAM space
Coherency issues do not exist between L1D and L2

To summarize Cache Coherency...

67

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 21
Cache Coherency (or Incoherency?)

Cache Functions Summary

BIOS Cache Functions Summary
Cache Invalidate:         Cache_inv(blockPtr, byteCnt, type, wait)
                          Cache_invL1pAll()
Cache Writeback:          Cache_wb(blockPtr, byteCnt, type, wait)
                          Cache_wbAll()
Invalidate & Writeback:   Cache_wbInv(blockPtr, byteCnt, type, wait)
                          Cache_wbInvAll()
Sync (wait for cache):    Cache_wait()

blockPtr : start address of the range to be invalidated/written back
byteCnt  : number of bytes to be invalidated/written back
type     : type of cache (L1, L2)
wait     : 1 = wait until the operation is completed

What if the EDMA is reading/writing INTERNAL memory (L2)?

14 - 22 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Cache Coherency (or Incoherency?)

Coherency Summary
Internal (L1/L2) Cache Coherency is Maintained
Coherence between L1D and L2 is maintained by the cache controller
No Cache_fxn operations are needed for data stored in L1D or L2 RAM
L2 coherence operations implicitly operate upon L1 as well

Simple Rules for Error-Free Cache (for DDR, L3) -- see the sketch below
TAKING OWNERSHIP: Before the DSP begins reading a shared
external INPUT buffer, it should first BLOCK INVALIDATE the buffer
GIVING OWNERSHIP: After the DSP finishes writing to a shared
external OUTPUT buffer, it should initiate an L2 BLOCK WRITEBACK

DEBUG NOTE: An easy way to identify cache coherency problems is to allocate your
buffers in L2. Problem goes away? It's probably a cache coherency issue.
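A minimal sketch of the two ownership rules, assuming buffers shared with the EDMA in DDR.
The buffer names, BUFFSIZE and process() are illustrative only; the Cache_inv/Cache_wb argument
style follows the form used in Lab 14.

    /* Hedged sketch of the TAKE/GIVE ownership rules for EDMA-shared DDR buffers. */
    void process_one_block(void)
    {
        /* TAKE ownership: invalidate before reading the EDMA-filled input buffer */
        Cache_inv(RcvBuf, BUFFSIZE, Cache_Type_L2, CACHE_WAIT);

        process(RcvBuf, XmtBuf, BUFFSIZE);   /* CPU reads RcvBuf, writes XmtBuf    */

        /* GIVE ownership: write back the output buffer before the EDMA sends it  */
        Cache_wb(XmtBuf, BUFFSIZE, Cache_Type_L2, CACHE_NOWAIT);
    }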
What about "cache alignment" ?
68

Cache Alignment
Cache Alignment
[Figure: a buffer that starts and ends mid-way through cache lines; the neighboring data sharing those lines are "false addresses"]

Problem: How can I invalidate (or writeback) just the buffer?
In this case, you can't
Definition: False Addresses are neighbor data in the cache line, but outside the buffer range
Why bad: Writing data to the buffer marks the line dirty, which will cause the entire line to
be written to external memory; thus external neighbor memory could be overwritten with old data

Avoid False Address problems by aligning buffers to cache lines (and filling entire lines):
Align memory to 128-byte boundaries
Allocate memory in multiples of 128 bytes

    #define BUF 128
    #pragma DATA_ALIGN (in, BUF)
    short in[256];

69
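Putting both rules into one declaration, here is a hedged sketch of an EDMA-shared buffer that can
be invalidated or written back on its own. NUM_SAMPLES and the buffer name are illustrative; 128
bytes is the L2 linesize discussed earlier.

    /* Hedged sketch: align the buffer to the 128-byte L2 linesize and pad its
       size up to a whole number of lines so no "false addresses" share a line. */
    #define L2_LINESIZE   128
    #define NUM_SAMPLES   1000                            /* illustrative        */
    #define RCV_BYTES     (((NUM_SAMPLES * sizeof(short)) + L2_LINESIZE - 1) \
                            / L2_LINESIZE * L2_LINESIZE)  /* round up to lines   */

    #pragma DATA_ALIGN(rcvBuf, 128)                       /* start on a boundary */
    short rcvBuf[RCV_BYTES / sizeof(short)];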

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 23
MAR Bits Turn On/Off Cacheability

MAR Bits Turn On/Off Cacheability

"Turn Off" the DATA Cache (MAR)
[Figure: with caching disabled for the DDR2 range, CPU accesses to RcvBuf/XmtBuf bypass L1D and L2]

Memory Attribute Registers (MARs) enable/disable DATA caching for memory ranges
Don't use MAR to solve basic cache coherency -- performance will be too slow
Use MAR when you must always read the latest value of a memory location,
such as a status register in an FPGA, or switches on a board
MAR is like volatile. You must use both to always read a memory location: MAR
for the cache; volatile for the compiler

Looking more closely at the MAR registers ...
71

Memory Attribute Regs (MAR) -- DATA

Use MAR registers to enable/disable caching of external DATA ranges
Useful when external data is modified outside the scope of the CPU
You can specify MAR values in the Config Tool
[Figure: example MAR bits for the CS2 space -- MAR4 = 0, MAR5 = 1, MAR6 = 1, MAR7 = 1; 0 = Not cached, 1 = Cached]

C671x:
16 MARs, 4 per CE space
Each handles 16MB
C64x/C64x+/C674x:
256/224 MARs, 16 per space (on current C64x, some are reserved)
Each handles 16MB

Settings for the C6748 LCDK...
72

14 - 24 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
MAR Bits Turn On/Off Cacheability

Memory Attribute Registers : MARs

256 MAR bits define the cache-ability of 4GB of addresses as 16MB groups
Many 16MB areas are not used or present on a given board
Example: usable C6748 EMIF addresses:

Start Address   End Address    Size    Space
0x6000 0000     0x60FF FFFF    16MB    CS2_
0x6200 0000     0x62FF FFFF    16MB    CS3_
0x6400 0000     0x64FF FFFF    16MB    CS4_
0x6600 0000     0x66FF FFFF    16MB    CS5_
0xC000 0000     0xDFFF FFFF    512MB   DDR2

EVM6748 memory is:
128MB of DDR2 starting at 0xC000 0000
FLASH, NAND Flash, or SRAM in the CS2_ space at 0x6000 0000
Note: with the C64x+, program memory is always cached regardless of MAR settings

MAR    MAR Address    EMIF Address Range
192    0x0184 8200    C000 0000 - C0FF FFFF
193    0x0184 8204    C100 0000 - C1FF FFFF
194    0x0184 8208    C200 0000 - C2FF FFFF
195    0x0184 820C    C300 0000 - C3FF FFFF
196    0x0184 8210    C400 0000 - C4FF FFFF
197    0x0184 8214    C500 0000 - C5FF FFFF
...
223                   DF00 0000 - DFFF FFFF
Using .cfg file to specify MAR bits 73

Configure MAR via GCONF (C6748)
[Screenshot: adding the Cache module to the .CFG file and editing its MAR settings]

Add the Cache module to your .CFG file
Then, modify the MAR settings

Example: C6748 EVM
MAR 192-199 (DDR2) turned on (starting at address 0xC000_0000)

74

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 25
Additional Topics

Additional Topics
L1D: DATA_MEM_BANK Example
Only one L1D access per bank per cycle
Use the DATA_MEM_BANK pragma to begin paired arrays in different banks
Note: sequential data are not down a bank; instead they run along a horizontal
line across the banks, then onto the next horizontal line
Only even banks (0, 2, 4, 6) can be specified
[Figure: L1D organized as 8 x 512x32 sub-banks (Bank 0, 2, 4, 6), with byte addresses interleaved horizontally across the banks]

    #pragma DATA_MEM_BANK(a, 4);
    short a[256];
    #pragma DATA_MEM_BANK(x, 0);
    short x[256];

    for(i = 0; i < count; i++)
        sum += a[i] * x[i];

Optimizing the cache...
76

Cache Optimization
Optimize for Level 1
Multiple ways and wider lines maximize efficiency -- TI did this for you!
Main goal: maximize line reuse before eviction
Algorithms can be optimized for cache
Touch loops can help with compulsory misses (see the sketch below):
Run once through the loop in init code
Touch buffers to pre-load the data cache
Up to 4 write misses can happen sequentially, but the next read or write will stall
The bus has a 4-deep buffer between CPU/L1 and beyond
Be smart about data output by one function then read by another (touch it first)
When data is output by the first function, where does it go?
If you touch the output buffer first, then where will the output data go?
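Here is a minimal sketch of the touch-loop idea mentioned above (the buffer, its size and the
64-byte L1D linesize assumption are illustrative):

    /* Hedged sketch of a "touch" loop: read one byte per L1D line so the whole
       buffer is pre-loaded into the data cache before the processing loop runs. */
    #define L1D_LINESIZE 64

    void touch(const volatile unsigned char *buf, int numBytes)
    {
        volatile unsigned char sink;
        int i;

        for (i = 0; i < numBytes; i += L1D_LINESIZE) {
            sink = buf[i];          /* each read allocates one full cache line */
        }
    }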
Docs...
77

14 - 26 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Additional Topics

Updated Cache Documentation

Cache Reference (more comprehensive description of the C6000 cache;
revised terminology for cache coherence operations):
SPRU609: C621x/C671x
SPRU610: C64x
SPRU871: C64x+/C674x
SPRUGW0: C66x

Cache User's Guide (cache basics, using the C6000 cache, optimization for cache performance):
SPRU656: C62x/C64x/C67x
SPRU862: C64x+/C674x
SPRUGY8: C66x

Summary...
78

Cache General Terminology

Associativity: the # of places a piece of data can map to inside the cache.
Coherence: assuring that the most recent data gets written back from a cache
when there is different data in the levels of memory.
Dirty: when an allocated cache line gets changed/updated by the CPU.

Read-allocate cache: only allocates space in the cache during a read miss.
C64x+ L1 cache is read-allocate only.
Write-allocate cache: only allocates space in the cache during a write miss.
Read-write-allocate cache: allocates space in the cache for a read miss or a
write miss. C64x+ L2 cache is read-write allocate.

Write-through cache: updates to cache lines go to ALL levels of memory
such that a line is never dirty (less efficient than a write-back cache -- more DDR transfers).
Write-back cache: updates occur only in the cache. The line is marked as
dirty and, if it is evicted, the updates are pushed out to lower levels of memory.
All C64x+ cache is write-back.

80

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 27
Additional Topics

*** this page is not blank ***

14 - 28 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Chapter Quiz

Chapter Quiz
Chapter Quiz
1. How do you turn ON the cache ?

2. Name the three types of caches & their associated memories:

3. All cache operations affect an aligned cache line. How big is a line?

4. Which bit(s) turn on/off cacheability and where do you set these?

5. How do you fix coherency when two bus masters access extl mem?

6. If a dirty (newly written) cache line needs to be evicted, how does


that dirty line get written out to external memory?

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 29
Chapter Quiz

Quiz Answers
Chapter Quiz
1. How do you turn ON the cache ?
Set size > 0 in platform package (or via Cache_setSize() during runtime)

2. Name the three types of caches & their associated memories:


Direct Mapped (L1P), 2-way (L1D), 4-way (L2)

3. All cache operations affect an aligned cache line. How big is a line?
L1P 32 bytes (256 bits), L1D 64 bytes, L2 128 bytes

4. Which bit(s) turn on/off cacheability and where do you set these?
MAR (Mem Attribute Register), affects 16MB Extl data space, .cfg

5. How do you fix coherency when two bus masters access extl mem?
Invalidate before a read, writeback after a write (or use L2 mem)

6. If a dirty (newly written) cache line needs to be evicted, how does


that dirty line get written out to external memory?
Cache controller takes care of this
83

14 - 30 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Lab 14 Using Cache

Lab 14 Using Cache


In the following lab, you will gain some experience benchmarking the use of cache in the system.
In Part A, we'll run the code with the buffers residing in L2 RAM with cache on -- just like the
previous lab. The benchmark should be the same. In Part B, the buffers will reside off chip with
NO cache and we'll check the performance (which should be abysmal). In Part C, we will leave
the buffers off chip and turn the cache ON, with no invalidates/writebacks at first. Our benchmark
should be ok, but the audio will sound bad. Then, we will add invalidate/writeback commands to
get the audio cleaned up.

This will provide a decent understanding of what you can expect when using cache in your own
application.

Lab 14 -- Using Cache
[Figure: audio pass-through system based on the C6748 StarterWare audio app -- the AIC3106 codec
(48 KHz ADC/DAC) and McASP feed triple rx/tx buffers via EDMA3; CopyBufRxToTxTaskFxn() pends on a
semaphore posted by EDMA3CCComplIsr(), de-interleaves the Rx data, runs the FIR filter, re-interleaves
the Tx data and starts the next EDMA3 transfers. Source files: aic31_MA_TIRTOS.c,
mcaspPlayBk_MA_TIRTOS.c, codecif_MA_TIRTOS.c. A 500ms clock (Clk1) provides a tick.]

Procedure
1. Import the existing project (Lab14)
Benchmark the following:
2. Part A -- Buffers in L2, cache ON
3. Part B -- Buffers external, cache OFF
4. Part C -- Buffers external, cache ON

Time = 30min

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 31
Lab 14 Using Cache Procedure

Lab 14 Using Cache Procedure


A. Run System From Internal RAM
1. Close all previous projects and import Lab14.
This project is actually the solution for Lab 13 (OPT) with all optimizations in place (NOT
the DSPLIB solution or using restrict keyword). It was the solution as of the first change in
Part B (using MUST_ITERATE). So go back and see what that benchmark was and write it
down:

Lab 13 (Part B, MUST_ITERATE) benchmark: __________ cycles

As we do this lab, we want to compare our results to the previous benchmarks.

Note: For all benchmarks throughout this lab, use the Opt build configuration when you build.
Do NOT use the Debug or Release config.

2. Find out where the buffers are mapped to in memory.


This application uses user-defined section names for the transmit and receive buffers.
Open mcaspPlayBk_MA_TIRTOS.c and scroll down to about line 109-145.
Notice how the buffers are allocated. They are a multiple of an L2 cache line and aligned on
cache line boundaries. Good stuff and absolutely necessary for flawless cache
performance.
Next, look at the name of the section they are allocated into, for example .far:txBufs.
When you create a user-defined section name, you must also create a user .CMD file to
allocate these sections into memory areas.
If you open the .cmd file created by TI-RTOS, you would see the memory regions it
defines (the file is located in your project under Opt > configPkg > linker.cmd).

See the names of the regions? IRAM and DDR are the ones we will use in this lab. IRAM points
to the L2 memory region.
Open the file RxTxBuf_MA_TIRTOS.cmd.
This is where the user-defined section names are allocated to the memory areas. Notice that
the buffers are allocated in L2. This is exactly where we want them for this part of the lab.
Later, we will change the region to DDR to move the buffers off chip in order to test the
cache performance.

14 - 32 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Lab 14 Using Cache Procedure

3. Which cache areas are turned on/off (circle your answer)?


L1P OFF/ON
L1D OFF/ON
L2 OFF/ON
This is actually a trick question. As we stated before, TI-RTOS automatically sets the cache
sizes for L1P and L1D to 32K (turned on) and the L2 cache to zero (off). It is actually the
platform file that specifies these numbers. However, you can override these with code or via
the .cfg file, which is what we will do later.

4. Build, load, Run.


First clean the project. Then build, load and run the application. Make sure you have audio
running first. Run the code for about 5 seconds. We are just testing to see if the code works
properly before moving on and double-checking the benchmark for cfir().
Write down below the benchmarks for cfir():

Buffers in L2 (L1P/D cache ON): __________ cycles

The benchmark from the Log_info should be around 5260 cycles. If not, clean the project,
delete the Opt folder and rebuild/load/run/benchmark.
We'll compare this "buffers in L2, cache ON" benchmark to the "all external" and "all external with
cache ON" numbers as we proceed through the lab. You just might be surprised ...

5. Check the size of the caches that TI-RTOS set for you using ROV.
As your code is halted at the moment, open up ROV and locate the cache sizes via Cache:

So, the sizes are exactly what we predicted. These were set by the platform file evmc6748.

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 33
Lab 14 Using Cache Procedure

B. Run System From External DDR2 (no cache)

6. Place the buffers in external DDR2 memory and turn OFF the cache.
So you have a choice -- you can either write code (which is commented out at the bottom of
the hardwareInitTaskFxn() routine) OR you can use the .cfg file. The author recommends
you use the .cfg file for two reasons: (1) you don't have to write code -- simply work with a
GUI; (2) the tools will take the sizes into consideration as they create the .cmd file, versus
you having to do this yourself.
So how do you use the .cfg file to specify sizes? Oh, and where are the MAR bits set? Or are
they set at all? Thankfully, because the tools know (via the platform file) that the evmC6748 is
being used, the MAR bits are set automatically. But, in your own application, you'll need to
know how to modify the MAR bits to match your application.
Let's go see the magical place where the cache sizes and MAR bits are set in the .cfg file ...
Open the .cfg file so you can see the Outline view and Available Products.
There are multiple ways to view the same thing, so the author has chosen the most direct
path to the information.
In the Available Products window, drag Cache over to the Outline view.

This is using the target specific (C6748) cache settings and placing them into the Outline
view so you can edit them.

14 - 34 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Lab 14 Using Cache Procedure

Once you have this module in your Outline view, click on it to configure it. Notice the cache
size settings below:

These settings MATCH the platform file defaults -- L1D/P is maxed at 32K and L2 cache is off.
We need to turn L1D/P OFF, so change the top two settings to L1D/P = 0K:

If we want all cache turned OFF, these are the proper settings. So just leave them this way.
We will come back to this later to turn ON the caches. Note that the L1P setting will have an
effect on performance because program memory (.text) is allocated in DDR according to the
linker.cmd file. We don't care about this for the moment. We just want to break the whole
thing and then turn on all the caches in the next section. The key performance problem will
be the buffers in DDR with no cache on.
Let's look at the MAR bits:

The bits in question are MAR 192-223, if you remember from the discussion material. MAR
bits 192-199 are set to 1, which covers the 128MB of external DDR memory starting at
address C000_0000h. Great. We don't have to touch those, but now you know where they
are located so you can change them for your own application.
Save your .cfg file. You will get a warning that says the cache settings override the platform
settings. That's great. Just ignore it.

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 35
Lab 14 Using Cache Procedure

Now that the cache is off, we need to allocate the buffers into the DDR memory area using
our user linker command file
Open RxTxBuf_MA_TIRTOS.cmd. Change IRAM to DDR:

Save the .cmd file.


In this scenario, the audio data buffers are all external. Cache is not turned on. This is the
worst case situation.
Do you expect the audio to sound ok? ____________________

7. Clean project, build, load, run using the Opt Configuration.


Right-click on your project and select Clean Project to clean your project.
Then Build and load your code.
Run your code.
Listen to the audio -- how does it sound?
Would you buy this MP3 player? I wouldn't. It sounds terrible. And that is what we predicted.
Write down your FIR benchmark below ...

Buffers in DDR (Cache OFF): ___________ cycles

If you look at the CPU load, it shows nothing. Why? The CPU is loaded more than 100%, so
the IDLE thread never runs to report the CPU Load. Ok, this sounds reasonable.
The author saw the following cycle count for cfir():

Almost like the old Debug build configuration cycles. Our application is NOT meeting real
time, but that is to be expected. If you have important stuff in DDR2 memory and you don't
turn the cache on, you're in trouble.

Let's go fix this and do it properly.

14 - 36 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Lab 14 Using Cache Procedure

C. Run System From DDR2 (cache ON)


8. Turn on the cache (L1P/D, L2) and re-run the application.
Via the .cfg file, turn ON all of the caches:

So L1D/P are maxed at 32K and L2 is set for 64K.


Save the .cfg file.
Given the size of our buffers and code, this is a good setting. In fact, if you have ANYTHING
in DDR2, this is the recommended starting point for cache settings. Sometimes people want
to determine the perfect settings from the start. Forget that. Set the cache settings at these
levels, then change only L2 (making it larger) and take some benchmarks. That will tell you
the perfect settings for your application.
The system we now have is identical to one of the slides in the discussion material.
Before you run the new code, what is your prediction as to the number of cycles for cfir()?
The best case was 5260 cycles running from L2 SRAM with L1P/D cache turned ON. Write
down your guess for the benchmark below running from DDR2 with all caches turned on:

CPU CYCLE GUESS ___________ cycles

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 37
Lab 14 Using Cache Procedure

9. Build, load, run (you will hear buzzing)


Build, load and run the application for 5 seconds. How does the audio sound?
It should sound fine. No? Wait. What happened? Oh, that darn cache coherency bit us in the
you-know-what. We have two masters -- the CPU and EDMA -- in action, and we forgot to
invalidate and writeback properly. Darn.
But what was your benchmark?

Buffers in DDR (L1 P/D, L2 cache ON): _________ cycles

The author's benchmark was:

Ok -- about the same. And this is to be expected. The cfir() routine is actually reading/writing
internal SRAM because the cache is on. The read buffers are cached in L1 (ONCE) and the
transmit buffers are written to L2 (ONCE) because the invalidate/writeback commands have
not been added. So this is really not a fair benchmark. In real systems, you need to add the
invalidate/writeback commands so they force the CPU to read from DDR vs. internal SRAM.

10. Fix the cache coherency problem.


Do you remember the commands to use from the discussion material? Sure, invalidate and
writeback.
So, we must INVALIDATE the Rx buffers BEFORE we read them. Let's do that piece first ...
Add the following line of code as indicated (around line 625, before the de-interleave):

Fill in the call to Cache_inv() as follows:


Cache_inv (pRxBufLocal, AUDIO_BUF_SIZE, Cache_Type_L2, CACHE_WAIT);

That takes care of the READ, now lets take care of the WRITE

14 - 38 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Lab 14 Using Cache Procedure

Add the following line of code as indicated (around line 652 just AFTER the interleave of
the Tx data):

Fill in the Cache_wb() call with the following code:


Cache_wb (pTxBufLocal, AUDIO_BUF_SIZE, Cache_Type_L2, CACHE_NOWAIT);
11. Build, load, run again.
Now that the proper invalidate and writeback commands are there, let's hope the audio
sounds better.
Build, load and run the application for 5 seconds or so.
How does the audio sound? What is your final benchmark?

Buffers in DDR (L1 P/D, L2 cache ON): _________ cycles

The author's audio sounded fine now, and the benchmarks were:

Not consistent, but this has to do with cache snooping (L1 to L2), pipelined reads and re-use
of data. The average went up -- also expected, due to having to access external DDR memory
for the first read, followed by re-use in the cache. But the cache performance is extraordinary
given the fact that these buffers are in external memory. So if you need to use DDR, do so,
but TURN THE CACHE ON.
Again, your mileage may vary, but now you know the ins, outs and dollar signs associated
with cache.

You're finished with this lab. Congrats. This is the last lab in the workshop.

C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory 14 - 39
Notes

Notes

14 - 40 C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Using EDMA3
Introduction
In this chapter, you will learn the basics of the EDMA3 peripheral. This transfer engine in the
C64x+ architecture can perform a wide variety of tasks within your system -- from memory-to-
memory transfers to event synchronization with a peripheral and auto-sorting data into separate
channels or buffers in memory. No programming is covered. For programming concepts, see
ACPY3/DMAN3, the LLD (Low Level Driver, covered in the Appendix) or the CSL (Chip Support
Library). Heck, you could even program it in assembly, but don't call ME for help.

Objectives
Objectives

Provide an overview of the key


capabilities of EDMA3
Go through MANY examples to learn
how EDMA3 works
Define EDMA terms and definitions
Describe, in detail, capabilities like
syncing, indexing, linking, chaining,
channel sorting
Describe how EDMA interrupts work

C6000 Embedded Design Workshop - Using EDMA3 15 - 1


Module Topics

Module Topics
Using EDMA3 ............................................................................................................................. 15-1
Module Topics ......................................................................................................................... 15-2
Overview ................................................................................................................................. 15-3
What is a DMA ? ............................................................................................................... 15-3
Multiple DMAs ................................................................................................................... 15-4
EDMA3 in C64x+ Device .................................................................................................... 15-5
Terminology ............................................................................................................................ 15-6
Overview ............................................................................................................................. 15-6
Element, Frame, Block ACNT, BCNT, CCNT .................................................................. 15-7
Simple Example .................................................................................................................. 15-7
Channels and PARAM Sets ................................................................................................ 15-8
Examples ................................................................................................................................ 15-9
Synchronization ..................................................................................................................... 15-12
Indexing ................................................................................................................................. 15-13
Events Transfers Actions ................................................................................................ 15-15
Overview ........................................................................................................................... 15-15
Triggers ............................................................................................................................. 15-16
Actions Transfer Complete Code ................................................................................... 15-16
EDMA Interrupt Generation .................................................................................................. 15-17
Linking ................................................................................................................................... 15-18
Chaining ................................................................................................................................ 15-19
Channel Sorting .................................................................................................................... 15-21
Architecture & Optimization .................................................................................................. 15-22
Programming EDMA3 Using Low Level Driver (LLD) ........................................................ 15-23
Chapter Quiz ......................................................................................................................... 15-25
Quiz Answers ................................................................................................................. 15-26
Additional Information ........................................................................................................... 15-27
Notes ..................................................................................................................................... 15-28

15 - 2 C6000 Embedded Design Workshop - Using EDMA3


Overview

Overview
What is a DMA ?

What is DMA ?
When we say DMA, what do we mean? Well, there are MANY
forms of DMA (Direct Memory Access) on this device:

EDMA3 -- Enhanced DMA: handles 64 DMA CHs and 8 QDMA CHs
DMA: 64 channels that can be triggered manually or by events/chaining
QDMA: 8 channels of Quick DMA triggered by writing to a trigger word
[Figure: EDMA3 block diagram -- event (EVTx), chain, manual and trigger-word sources feed the DMA
and QDMA channels, which map onto queues Q0-Q3 and transfer controllers TC0-TC3 connected to the
Switched Central Resource (SCR)]

IDMA -- 2 CHs of Internal DMA (Ch0: peripheral config, Ch1: transfers between L1 and L2)

Peripheral DMAs -- Each master device hooked to the Switched
Central Resource (SCR) has its own DMA (e.g. SRIO, EMAC, etc.)
4

C6000 Embedded Design Workshop - Using EDMA3 15 - 3


Overview

Multiple DMAs
Multiple DMAs : EDMA3 and QDMA
[Figure: EDMA3 (System DMA) with its DMA (sync) and QDMA (async) channels sitting between the
master peripherals (e.g. VPSS) and the C64x+ DSP's L1P/L1D/L2 memories]

DMA
Enhanced DMA (version 3)
DMA to/from peripherals
Can be sync'd to peripheral events
Handles up to 64 events

QDMA
Quick DMA
DMA between memory
Async -- must be started by the CPU
4-16 channels available (number depends upon the specific device)

Both share:
128-256 Parameter RAM sets (PARAMs)
64 transfer complete flags
2-4 pending transfer queues
5

Multiple DMAs : Master Periphs & C64x+ IDMA
[Figure: VPSS front end (capture) and back end (display), plus USB, ATA, Ethernet and VLYNQ master
peripherals sharing the bus with the EDMA3 (System DMA) and the C64x+ DSP's IDMA and L1P/L1D/L2]

Master Peripherals
VPSS (and other master periphs) include their own DMA functionality
USB, ATA, Ethernet, VLYNQ share bus access to the SCR

IDMA
Built into all C64x+ DSPs
Performs moves between internal memory blocks and/or the config bus
Don't confuse it with the iDMA API

Notes: Both ARM and DSP can access the EDMA3
Only the DSP can access the hardware IDMA
6

15 - 4 C6000 Embedded Design Workshop - Using EDMA3


Overview

EDMA3 in C64x+ Device

SCR & EDMA3
[Figure: system interconnect diagram -- the EDMA3 CC and its TCs, the C64x+ MegaModule (L1P, L1D,
L2, IDMA, CPU), the ARM, the DDR2/3 EMIF, external memory and the peripherals, all connected as
masters (M) and slaves (S) on the DATA SCR and the CFG SCR. SCR = Switched Central Resource]

EDMA3 is a master on the DATA SCR -- it can initiate data transfers
EDMA3's configuration registers are accessed via the CFG SCR (by the CPU)
Each TC has its own connection (and priority) to the DATA SCR. Refer to the connection matrix to determine valid connections
7

C6000 Embedded Design Workshop - Using EDMA3 15 - 5


Terminology

Terminology
Overview

DMA : Direct Memory Access
Goal: copy from memory to memory -- a HARDWARE memcpy(dst, src, len)
Faster than CPU LD/ST. One interrupt per block vs. one interrupt per sample
[Figure: the DMA copies an original data block to a copied data block]

Examples: Import raw data from off-chip to on-chip before processing
Export results from on-chip to off-chip afterward

Controlled by: Transfer Configuration (i.e. Parameter Set -- aka PaRAM or PSET)
The transfer configuration primarily includes 8 control registers:
Options, Source, Transfer Count (BCNT/ACNT), Destination, Indexes, Count Reload/Link, C Count
15 - 6 C6000 Embedded Design Workshop - Using EDMA3


Terminology

Element, Frame, Block -- ACNT, BCNT, CCNT

How Much to Move?
A Count = element size (# of contiguous bytes)
B Count = # of elements per frame
C Count = # of frames per block
[Figure: a block is made of Frame 1 .. Frame M; each frame is made of Elem 1 .. Elem N]

Transfer Configuration (PSET registers)
Options
Source
Transfer Count:  B Count (# Elements, bits 31-16) | A Count (Element Size, bits 15-0)
Destination
Index
Cnt Reload | Link Addr
Index | Index
Rsvd | C Count (# Frames, bits 15-0)

Let's look at a simple example... 10

Simple Example
Example -- How do you VIEW the transfer?
Let's start with a simple example ... or is it simple?
We need to transfer 12 bytes from here to there.
Note: these are contiguous 8-bit memory locations.

What is ACNT, BCNT and CCNT? Hmmm.
You can view the transfer several ways:

    ACNT = 1        ACNT = 2        ACNT = 12
    BCNT = 4        BCNT = 2        BCNT = 1
    CCNT = 3        CCNT = 3        CCNT = 1
    (1*4*3 = 12)    (2*2*3 = 12)    (12*1*1 = 12)

Which view is the best? Well, that depends on what your system
needs and the type of sync and indexing (covered later)
11

C6000 Embedded Design Workshop - Using EDMA3 15 - 7


Terminology

Channels and PARAM Sets

C6748 EDMA Channel/Parameter RAM Sets

EDMA3 has 128-256 Parameter RAM sets (PSETs) that contain configuration information about a transfer
64 DMA CHs and 8 QDMA CHs can be mapped to any one of the 256 PSETs and then triggered to run (by various methods)
[Figure: the 64 DMA channels and 8 QDMA channels map onto PaRAM Sets 0-255; each PSET holds
Options, Source, BCNT/ACNT, Destination, DSTBIDX/SRCBIDX, BCNTRLD/LINK, DSTCIDX/SRCCIDX and RSVD/CCNT]

Each PSET contains 12 registers:
Options (interrupt, chaining, sync mode, etc.)
SRC/DST addresses
ACNT/BCNT/CCNT (size of transfer)
4 SRC/DST indexes (bump the address after each transfer)
BCNTRLD (BCNT reload for 3D transfers)
LINK (pointer to another PSET)

Note: PSETs are dedicated EDMA RAM (not part of IRAM)
12
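To make the register list above concrete, here is a small, purely illustrative C model of one PSET
filled in for the 12-byte example from the previous page, viewed as a single 12-byte element. This
struct is for explanation only -- it is not the real PaRAM layout or an EDMA3 LLD type, and the
buffer names are stand-ins.

    /* Illustrative model of one PaRAM set -- NOT the real register layout or a
       driver type. Field names mirror the 12 registers listed above.          */
    typedef struct {
        unsigned int   opt;              /* interrupt, chaining, sync mode ...      */
        const void    *src;              /* source address                          */
        unsigned short bcnt, acnt;       /* # arrays per frame, array size in bytes */
        void          *dst;              /* destination address                     */
        short          dstBidx, srcBidx; /* byte index between ACNT arrays          */
        unsigned short bcntRld, link;    /* BCNT reload, link to another PSET       */
        short          dstCidx, srcCidx; /* byte index between frames               */
        unsigned short ccnt;             /* # frames                                */
    } PsetModel;

    unsigned char srcBuf[12], dstBuf[12];   /* stand-ins for loc_8 / myDest below   */

    /* The 12-byte transfer viewed as one 12-byte element: ACNT=12, BCNT=1, CCNT=1 */
    PsetModel simpleCopy = {
        0,                  /* opt                                  */
        srcBuf, 1, 12,      /* src, bcnt, acnt                      */
        dstBuf, 0, 0,       /* dst, dstBidx, srcBidx (none needed)  */
        0, 0xFFFF,          /* bcntRld, link (0xFFFF = NULL link)   */
        0, 0, 1             /* dstCidx, srcCidx, ccnt               */
    };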

15 - 8 C6000 Embedded Design Workshop - Using EDMA3


Examples

Examples
EDMA Example : Simple (Horizontal Line)
Goal: transfer 4 elements from loc_8 to myDest
[Figure: loc_8 is a 6-column array of bytes numbered 1-30; the four bytes starting at loc_8
(values 8, 9, 10, 11 -- one horizontal line) are copied to myDest]

The DMA always increments across ACNT fields
B and C counts must be 1 (or more) for any actions to occur
Any indexing needed?

    Source = &loc_8            ACNT = 4
    Destination = &myDest      BCNT = 1
                               CCNT = 1

Is there another way to set this up?
14

EDMA Example : Simple (Horizontal Line)
Goal: transfer the same 4 elements from loc_8 to myDest
[Figure: the same picture -- bytes 8, 9, 10, 11 from loc_8 copied to myDest]

Here, ACNT was defined as the element size: 1 byte
Therefore, BCNT will now be the framesize: 4
B indexing (after each ACNT array is transferred) must now be specified as well
BIDX often = ACNT for contiguous operations

    Source = &loc_8            ACNT = 1
    Destination = &myDest      BCNT = 4
    SRCBIDX = 1                DSTBIDX = 1
    SRCCIDX = 0                DSTCIDX = 0
                               CCNT = 1

Why is this a less efficient version?
15

C6000 Embedded Design Workshop - Using EDMA3 15 - 9


Examples

EDMA Example : Indexing (Vertical Line)
Goal: transfer 4 vertical elements from loc_8 to a port
[Figure: the vertical elements 8, 14, 20, 26 -- one per row of the 6-byte-wide source array -- land in myDest]

ACNT is again defined as the element size: 1 byte
Therefore, BCNT is still the framesize: 4
SRCBIDX now will be 6 -- skipping down to the next row
DSTBIDX now will be 2

    Source = &loc_8            ACNT = 1
    Destination = &myDest      BCNT = 4
    SRCBIDX = 6                DSTBIDX = 2
    SRCCIDX = 0                DSTCIDX = 0
                               CCNT = 1
16

EDMA Example : Block Transfer (less efficient)
Goal: transfer a 5x4 subset of 16-bit pixels from loc_8 to myDest
[Figure: a 6-wide array of 16-bit pixels numbered 1-36; a 5-row by 4-column block starting at pixel 8
is copied to myDest (8, 9, 10, 11, 14, 15, ...)]

ACNT is defined here as the short element size: 2 bytes
BCNT is again the framesize: 4 elements
CCNT now will be 5, as there are 5 frames
SRCCIDX skips to the next frame

    Source = &loc_8                                   ACNT = 2
    Destination = &myDest                             BCNT = 4
    SRCBIDX = 2 (2 bytes, going from pixel 8 to 9)    DSTBIDX = 2
    SRCCIDX = 6 (3 elements, from pixel 11 to 14)     DSTCIDX = 2
                                                      CCNT = 5
17

15 - 10 C6000 Embedded Design Workshop - Using EDMA3


Examples

EDMA Example : Block Transfer (more efficient)
Goal: transfer the same 5x4 subset of 16-bit pixels from loc_8 to myDest
[Figure: the same 5-row block, but now each 4-pixel row is treated as one 8-byte element (Elem 1-5)]

ACNT is defined here as the entire frame: 4 * 2 bytes = 8
BCNT is the number of frames: 5
CCNT now will be 1
SRCBIDX skips to the next frame

    Source = &loc_8                                   ACNT = 8
    Destination = &myDest                             BCNT = 5
    SRCBIDX = 12 (6*2, from pixel 8 to 14)            DSTBIDX = 8 (4*2)
    SRCCIDX = 0                                       DSTCIDX = 0
                                                      CCNT = 1
18

C6000 Embedded Design Workshop - Using EDMA3 15 - 11


Synchronization

Synchronization
A Synchronization
An event (like the McBSP receive register full) triggers
the transfer of exactly 1 array of ACNT bytes (e.g. 2 bytes)

Example: a McBSP tied to a codec (you want to sync each transfer
of a 16-bit word to the receive buffer being full
or the transmit buffer being empty).

[Figure: with A-sync, each EVTx moves one array -- Frame 1: Array1, Array2, ... Array BCNT;
Frame 2: ...; up to Frame CCNT -- one event per array]
20

AB Synchronization
An event triggers a two-dimensional transfer of BCNT arrays
of ACNT bytes (A*B)

Example: a line of video pixels (each line has BCNT pixels
consisting of 3 bytes each -- Y, Cb, Cr)

[Figure: with AB-sync, one EVTx moves an entire frame (Array1 .. Array BCNT); CCNT events move the whole block]
21

15 - 12 C6000 Embedded Design Workshop - Using EDMA3


Indexing

Indexing
Indexing -- BIDX, CIDX
EDMA3 has two types of indexing: BIDX and CIDX
Each index can be set separately for SRC and DST (next slide)
BIDX = index in bytes between ACNT arrays (same for A-sync and AB-sync)
CIDX = index in bytes between BCNT frames (different for A-sync vs. AB-sync)
BIDX/CIDX: signed 16-bit, -32768 to +32767
[Figure: for A-sync, CIDX is measured from the start of the last array of a frame; for AB-sync, CIDX is
measured from the start of the previous frame]

The CIDX distance is calculated from the starting address of the previously
transferred block (array for A-sync, frame for AB-sync) to the next frame to
be transferred.
23

Indexed Transfers
EDMA3 has 4 indexes, allowing higher flexibility for complex transfers:
SRCBIDX = # bytes between arrays (Ex: SRCBIDX = 2)
SRCCIDX = # bytes between frames (Ex: SRCCIDXA = 2, SRCCIDXAB = 4)
Note: CIDX depends on the synchronization used -- A or AB
DSTBIDX = # bytes between arrays (Ex: DSTBIDX = 3)
DSTCIDX = # bytes between frames (Ex: DSTCIDXA = 5, DSTCIDXAB = 8)
[Figure: a contiguous 8-bit SRC and a contiguous 8-bit DST, annotated with the SRCBIDX/SRCCIDX and
DSTBIDX/DSTCIDX hops between the selected bytes]

Note: ACNT = 1, BCNT = 2, CCNT = ____
24

C6000 Embedded Design Workshop - Using EDMA3 15 - 13


Indexing

Example Using Indexing

Remember this example? Ok, so for each view, fill in the proper SOURCE index values:
Note: these are contiguous 8-bit memory locations.

    ACNT = 1        ACNT = 2        ACNT = 12
    BCNT = 4        BCNT = 2        BCNT = 1
    CCNT = 3        CCNT = 3        CCNT = 1
    BIDX = 1        BIDX = 2        BIDX = N/A
    CIDXA = 1       CIDXA = 2       CIDXA = N/A
    CIDXAB = 4      CIDXAB = 4      CIDXAB = N/A

Which view is the best? Well, that depends on what you
are transferring from/to and which sync mode is used.
25
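To see how these index values play out, here is a hedged software model (not driver code -- the
function and its arguments are purely illustrative) of how the SOURCE address advances for an
A-synchronized transfer. Running it with the first column above (ACNT=1, BCNT=4, CCNT=3,
BIDX=1, CIDXA=1) walks straight through the 12 contiguous bytes.

    /* Hedged model of A-sync source addressing: BIDX between ACNT arrays,
       CIDX measured from the START of the last array of a frame.         */
    #include <string.h>

    void model_a_sync_src(unsigned char *dst, const unsigned char *src,
                          int acnt, int bcnt, int ccnt, int bidx, int cidx)
    {
        const unsigned char *frame = src;
        int b, c;

        for (c = 0; c < ccnt; c++) {
            const unsigned char *elem = frame;
            for (b = 0; b < bcnt; b++) {
                memcpy(dst, elem, acnt);    /* one event moves ACNT bytes   */
                dst  += acnt;               /* destination kept contiguous  */
                elem += bidx;               /* BIDX between ACNT arrays     */
            }
            frame = elem - bidx + cidx;     /* A-sync CIDX: from last array */
        }
    }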

15 - 14 C6000 Embedded Design Workshop - Using EDMA3


Events Transfers Actions

Events Transfers Actions


Overview

EDMA3 Basics Review
Count -- how many items to move (A, B, and C counts)
Addresses -- the source & destination addresses
Index -- how far to increment the src/dst after each transfer
[Figure: Event (E) -> Transfer (T, the xfer config PSET: Options, Source, B/A Count, Destination,
Indexes, Cnt Reload/Link Addr, C Count) -> Done (action)]

Event -- triggers the transfer to begin
Transfer -- the transfer config describes the transfers to be executed when triggered
Resulting Action -- what do you want to happen after the transfer is complete?

Let's look at triggers (events) and actions in more detail...
27

C6000 Embedded Design Workshop - Using EDMA3 15 - 15


Events Transfers Actions

Triggers
How to TRIGGER a Transfer
There are 3 ways to trigger an EDMA transfer:

1. Event -- sync from a peripheral
   ER = Event Register (flag), EER = Event Enable Register (user)
   [Figure: McASP0 RRDY/XRDY event -> ER/EER -> start channel transfer]

2. Manually -- trigger the channel to run
   ESR = Event Set Register (user)
   [Figure: application sets Ch #y in the ESR -> start channel transfer]

3. Chain -- event from another channel (more details later)
   TCCHEN = TC Chain Enable (in OPT)
   [Figure: Channel x completes with TCC = Ch y and TCCHEN enabled -> CER -> start Channel y transfer]

28

Actions -- Transfer Complete Code

Transfer Complete Code (TCC)
[Figure: Options register fields -- TCC (bits 17-12, Ch 0-63) and TCCMODE (bit 11: NORMAL or EARLY)]

TCC is generated when a transfer completes. This is referred to as the Final TCC.
TCC can be used to trigger an EDMA interrupt and/or another transfer (chaining).
Each TR below is a transfer request, which can be either ACNT bytes
(A-sync) or ACNT * BCNT bytes (AB-sync). The Final TCC only occurs
after the LAST TR.
[Figure: a series of EVTx events each submit a TR to the TC; the TCC is returned after the last TR is acknowledged]
29

15 - 16 C6000 Embedded Design Workshop - Using EDMA3


EDMA Interrupt Generation

EDMA Interrupt Generation


Generate EDMA Interrupt (Setting the IER bit)

Channel #   Options       TCC       IPR   IER
0           TCINTEN=0     TCC=0     0     IER0 = 0
1           TCINTEN=0     TCC=1     0     IER1 = 0
...         TCINTEN=1     TCC=14    1     IER14 = 1   -> EDMA3CC_INT
63          TCINTEN=0     TCC=63    0     IER63 = 0

Options register: TCINTEN (bit 20), TCC (bits 17-12)
IER -- EDMA Interrupt Enable Register (NOT the CPU IER)
IPR -- EDMA Interrupt Pending Register (set by TCC)

Use the EDMA3 Low-Level Driver (LLD) to program the EDMA's IER bits

64 channels and ONE interrupt? How do you determine WHICH channel completed?
31

EDMA Interrupt Dispatcher

Here's the interrupt chain from beginning to end:
1. An interrupt occurs -- EDMA3CC_INT (#24)
2. The Interrupt Selector maps it to a CPU interrupt (e.g. HWI_INT5)
3. The HWI_INT5 properties point at the dispatcher
4. EDMA Interrupt Dispatcher:
   Reads the IPR bits
   Determines which one is set
   Calls the corresponding handler (ISR) in the Fxn Table
5. ISR (interrupt handler):

       void edma_rcv_isr (void)
       {
           SEM_post (&semaphore);
       }

How does the ISR Fxn Table (in #4 above) get loaded with the proper handler Fxn names?
Use the EDMA3 LLD to program the proper callback fxn for this HWI.
32
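The sketch below shows the dispatcher idea in code form: one CPU interrupt, and the handler walks
the pending bits and calls the registered callback for each completed TCC. The names (readIPR,
clearIPR, callbackTable) are illustrative only -- the EDMA3 LLD provides this dispatcher for you.

    /* Hedged sketch of the dispatcher concept -- not the LLD implementation. */
    typedef void (*EdmaCallback)(unsigned int tcc);
    extern EdmaCallback callbackTable[64];
    extern unsigned long long readIPR(void);          /* 64 pending bits      */
    extern void clearIPR(unsigned int tcc);

    void edmaDispatcher(void)
    {
        unsigned long long pending = readIPR();
        unsigned int tcc;

        for (tcc = 0; tcc < 64; tcc++) {
            if (pending & (1ULL << tcc)) {
                clearIPR(tcc);                        /* acknowledge this TCC */
                if (callbackTable[tcc] != 0) {
                    callbackTable[tcc](tcc);          /* e.g. posts a semaphore */
                }
            }
        }
    }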

C6000 Embedded Design Workshop - Using EDMA3 15 - 17


Linking

Linking
Linking -- Action Overview (Alias: Re-load, Auto-init)
[Figure: the event/transfer/action picture again, with the "done" action reloading the xfer config]

Need: auto-reload a channel with a new config
Ex1: do the same transfer again
Ex2: a ping/pong system (covered later)
Solution: use linking to reload the channel config

Concept:
Linking two or more channels together allows the EDMA to auto-reload a new configuration
when the current transfer is complete.
Linking still requires a trigger to start the transfer (manual, chain, event).
You can link as many PSETs as you like -- it is only limited by the # of PSETs on a device.

How does linking work?
The user must specify the LINK field in the config to link to another PSET.
When the current xfr (Config 0) is complete, the EDMA auto-reloads the new
config (Config 1) from the linked PSET.
[Figure: Config 0's LINK points at Config 1; Config 1's LINK is NULL; completion of Config 0 reloads Config 1]
Note: linking does NOT start the transfer !!
34
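As a conceptual illustration of the ping/pong use case mentioned above, the sketch below links two
parameter sets to each other. PsetModel is the illustrative struct from earlier in this chapter (not
the real PaRAM layout), and PING_PSET/PONG_PSET are hypothetical set numbers. Linking only
reloads the configuration -- an event, manual or chain trigger is still required to start each transfer.

    #define PING_PSET 64
    #define PONG_PSET 65

    extern PsetModel paramModel[256];       /* illustrative copy of the PaRAM sets */

    static void config_capture(PsetModel *p, const void *port, void *buf,
                               unsigned short bytes, unsigned short linkTo)
    {
        p->src  = port;
        p->dst  = buf;
        p->acnt = bytes;        /* one ACNT-sized burst per trigger            */
        p->bcnt = 1;
        p->ccnt = 1;
        p->link = linkTo;       /* PSET to reload when this transfer is done   */
    }

    void setup_ping_pong(const void *port, void *pingBuf, void *pongBuf, unsigned short bytes)
    {
        config_capture(&paramModel[PING_PSET], port, pingBuf, bytes, PONG_PSET);
        config_capture(&paramModel[PONG_PSET], port, pongBuf, bytes, PING_PSET);
        /* The classic double-buffer pattern: ping reloads pong and vice versa. */
    }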

15 - 18 C6000 Embedded Design Workshop - Using EDMA3


Chaining

Chaining
Reminder -- Triggering Transfers
There are 3 ways to trigger an EDMA transfer:

1. Event -- sync from a peripheral
   ER = Event Register (flag), EER = Event Enable Register (user)
   [Figure: McASP0 RRDY/XRDY event -> ER/EER -> start channel transfer]

2. Manually -- trigger the channel to run
   ESR = Event Set Register (user)
   [Figure: application sets Ch #y in the ESR -> start channel transfer]

3. Chain -- event from another channel (next example)
   TCCHEN = TC Chain Enable (in OPT)
   [Figure: Channel x completes with TCC = Ch y and TCCHEN enabled -> CER -> start Channel y transfer]

Let's do a simple example on chaining ... 36

Chaining -- Action & Event Overview
[Figure: the event/transfer/action picture again -- the completion (action) of one transfer becomes the event for another]

Need: when one transfer completes, trigger another transfer to run
Ex: ChX completes, kicks off ChY
Solution: use chaining to kick off the next transfer

Concept:
Chaining actually refers to both an action and an event -- the completed action from the
1st channel is the event for the next channel
You can chain as many channels as you like -- it is only limited by the # of channels on a device
Chaining does NOT reload the current channel config -- that can only be accomplished by linking.
It simply triggers another channel to run.

How does chaining work?
Set the TCC field to match the next (i.e. chained) channel #
Turn ON chaining
When the current xfr (X) is complete, it triggers the next Ch (Y) to run
[Figure: Ch X's TCC = Y with Chain enabled; when X is done, Ch Y runs]
37

C6000 Embedded Design Workshop - Using EDMA3 15 - 19


Chaining

Example #1 -- Simple Chaining
[Figure: the EDMA chain/event registers (ESR, CER), the channel options (OPT.TCINTEN, OPT.TCC,
OPT.TCCHEN) and the interrupt generation registers (IPR, IER) for channels 5, 6, 7 and 55]

Channel #5
Triggered manually by the ESR
Chains to Ch #7 (Ch #5's TCC = 7, TCCHEN enabled)

Channel #7
Triggered by chaining from Ch #5
Interrupts the CPU when finished (TCINTEN enabled, TCC = 6)
The ISR checks the IPR (bit 6, since TCC = 6) to determine which channel generated the interrupt

Legend:
CER = Chain Event Register           ESR = Event Set Register
TCINTEN = Final TCC will interrupt the CPU
TCCHEN  = Final TCC will chain to the next channel

Notes:
Any Ch can chain to any other Ch by enabling OPT.TCCHEN and specifying the next TCC
Any Ch can interrupt the CPU by enabling its OPT.TCINTEN option (and specifying the TCC)
Which IPR bit gets set depends on the previous Ch's TCC setting

15 - 20 C6000 Embedded Design Workshop - Using EDMA3


Channel Sorting

Channel Sorting
Channel Sort -- Transfer Config Overview
[Figure: the event/transfer/action picture again, with the xfer config's indexes doing the sorting]

Need: de-interleave (sort) two (or more) channels
Ex: stereo audio (LRLR...) into separate L & R buffers
Solution: use DMA indexing to perform the sorting automatically

Concept:
In many applications, data comes from the peripheral as interleaved data (LRLR, etc.)
Most algos that run on the data require these channels to be de-interleaved
Indexing, built into the EDMA3, can auto-sort these channels with no time penalty

How does channel sorting work?
The user can specify the BIDX and CIDX values to accomplish auto-sorting
[Figure: interleaved samples L0 R0 L1 R1 L2 R2 from the peripheral land in memory as L0 L1 L2 ...
followed by R0 R1 R2 ..., steered by BIDX and CIDX]
40

C6000 Embedded Design Workshop - Using EDMA3 15 - 21


Architecture & Optimization

Architecture & Optimization


EDMA Architecture

[Block diagram: peripheral events E0..E63 feed the Event Register (ER) / Event Enable Register
 (EER); the Event Set Register (ESR) and Chain Event Register (CER) provide the manual and
 chained triggers. Events are placed on queues Q0..Q3 in the Channel Controller (CC), which
 holds PSET 0..255 and submits transfer requests (TRs) to Transfer Controllers TC0..TC3 on the
 data SCR. Completion detection (early or normal TCC) sets the Interrupt Pending Register (IPR)
 and, with the Interrupt Enable Register (IER) set, asserts EDMAINT to the CPU.
 SCR = Switched Central Resource]

  The EDMA consists of two parts: the Channel Controller (CC) and the Transfer Controllers (TCs)
  An event (from a peripheral via ER/EER, manual via ESR, or via chaining via CER) sends the
  transfer to 1 of 4 queues (Q0 is mapped to TC0, Q1 to TC1, etc. Note: McBSP can use TC1 only)
  The transfer is mapped to 1 of 256 PSETs and submitted to the TC (one transfer request (TR)
  per ACNT bytes, or per A*B CNT bytes, depending on the sync type). Note: the destination FIFO
  allows writes to be buffered while more reads occur.
  The TC performs the transfer (read/write) and then sends back a transfer completion code (TCC)
  The EDMA can then interrupt the CPU and/or trigger another transfer (chaining, covered earlier
  in this chapter)

EDMA Performance Tips, References


Spread Out the Transfers Among All Qs
  Don't use the same Q for too many transfers (it causes congestion)
  Break long, non-realtime transfers into smaller transfers using self-chaining

Manage Priorities
  You can adjust the priority of TC0-3 on the SCR (MSTPRI register)
  In general, place small transfers at higher priorities

Tune Transfer Size to FIFO Length and Bus Width
  Place large transfers on TCs with larger FIFOs (typically TC2/3)
  Place smaller, real-time transfers on TC0/1
  Match transfer sizes (A, A*B) to the bus width (16 bytes)
  Align src/dst buffers on 16-byte boundaries (see the sketch below)
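A quick way to follow the last two tips with the TI C6000 compiler is shown below; DATA_ALIGN
is a standard TI compiler pragma, while the buffer names and the block size are just example
values.

#include <stdint.h>

#define BLOCK_BYTES  512u   /* example size: a multiple of the 16-byte bus width */

/* Align both ends of the transfer on 16-byte boundaries. */
#pragma DATA_ALIGN(srcBuf, 16)
uint8_t srcBuf[BLOCK_BYTES];

#pragma DATA_ALIGN(dstBuf, 16)
uint8_t dstBuf[BLOCK_BYTES];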

References:
  Programming EDMA3 Using the Low Level Driver (LLD) - wiki app note + examples (see next slide)
  TC Optimization Rules (SPRUE23)
  EDMA3 User Guide (SPRU966)
  EDMA3 Controller (SPRU234)
  EDMA3 Migration Guide (SPRAAB9)
  EDMA Performance (SPRAAG8)



Programming EDMA3 Using Low Level Driver (LLD)

Programming EDMA3 Using Low Level Driver (LLD)

EDMA3 LLD Wiki


Download the detailed app note
Use the examples to learn the APIs
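As a flavor of what those packaged examples look like, here is a minimal sketch of a one-shot,
manually triggered memory-to-memory transfer using the LLD driver calls. The function names and
enums are taken from the EDMA3 LLD driver API, but header paths and argument details can vary
between LLD releases, so treat this as a sketch to check against edma3_drv.h and the shipped
examples; hEdma is assumed to come from the LLD's sample initialization code (edma3init).

#include <stddef.h>
#include <stdint.h>
#include <ti/sdo/edma3/drv/edma3_drv.h>   /* path may differ in your LLD release */

/* Copy nBytes from src to dst with a single, manually triggered transfer. */
void edma3_copy_once(EDMA3_DRV_Handle hEdma, uint32_t src, uint32_t dst, uint32_t nBytes)
{
    uint32_t ch  = EDMA3_DRV_DMA_CHANNEL_ANY;   /* let the driver pick a channel */
    uint32_t tcc = EDMA3_DRV_TCC_ANY;

    EDMA3_DRV_requestChannel(hEdma, &ch, &tcc, (EDMA3_RM_EventQueue)0, NULL, NULL);

    EDMA3_DRV_setSrcParams (hEdma, ch, src, EDMA3_DRV_ADDR_MODE_INCR, EDMA3_DRV_W8BIT);
    EDMA3_DRV_setDestParams(hEdma, ch, dst, EDMA3_DRV_ADDR_MODE_INCR, EDMA3_DRV_W8BIT);
    EDMA3_DRV_setSrcIndex  (hEdma, ch, 0, 0);
    EDMA3_DRV_setDestIndex (hEdma, ch, 0, 0);

    /* One array of nBytes, one frame, A-synchronized */
    EDMA3_DRV_setTransferParams(hEdma, ch, nBytes, 1u, 1u, 0u, EDMA3_DRV_SYNC_A);

    /* Manual trigger (the driver writes the ESR bit for us) */
    EDMA3_DRV_enableTransfer(hEdma, ch, EDMA3_DRV_TRIG_MODE_MANUAL);
}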




Programming EDMA3 Using Low Level Driver (LLD)

*** this page used to have very valuable information on it ***



Chapter Quiz

Chapter Quiz
1. Name the 4 ways to trigger a transfer.

2. Compare/contrast linking and chaining.

3. Fill out the following values for this channel sorting example (5 min):

   16-bit stereo audio (interleaved); use the EDMA to auto channel sort it to memory:

       PERIPH                            MEM
       L0 R0 L1 R1 L2 R2 L3 R3   -->     L0 L1 L2 L3   (BUFSIZE)
                                         R0 R1 R2 R3

   ACNT: _____
   BCNT: _____
   CCNT: _____
   BIDX: _____
   CIDX: _____

   Could you calculate these?



Chapter Quiz

Quiz Answers
1. Name the 4 ways to trigger a transfer.
   Manual start (ESR), event sync from a peripheral, chaining, and the QDMA trigger word

2. Compare/contrast linking and chaining.
   Linking:  copies a new configuration from an existing PARAM set (link field); it does not
             start a transfer
   Chaining: completion of one channel (its TCC) triggers another channel to start

3. Fill out the following values for this channel sorting example:

       PERIPH                            MEM
       L0 R0 L1 R1 L2 R2 L3 R3   -->     L0 L1 L2 L3   (BUFSIZE = 4 samples x 2 bytes = 8)
                                         R0 R1 R2 R3

   ACNT: 2     (one 16-bit sample)
   BCNT: 2     (one L + one R per frame)
   CCNT: 4     (four L/R pairs)
   BIDX: 8     (the destination jumps from L[] to R[] = L[] + BUFSIZE)
   CIDX: -6    (A-sync: applied from the last array of the frame, so ACNT - BUFSIZE = 2 - 8 = -6)



Additional Information

Additional Information



Notes

Notes

