USING THE IDT7052/7054
FOURPORT™ SRAMs
IN DSP AND MATRIX
PROCESSING APPLICATIONS
By Tao Lin, Julie Lin, and Yupling Chung
APPLICATION
NOTE
AN-42
Introduction
Most digital signal processing (DSP) algorithms have inherent par-
allelism and may be pipelined. Usually, these algorithms are computa-
tion intensive. In real-time applications, multiprocessor or parallel dis-
tributed processor systems are commonly used to implement these
DSP algorithms. In these types of systems it is necessary for different
processors to randomly and independently access different locations at
the same time in the same memory space. The IDT7052 (2Kx8) and
IDT7054 (4Kx8) FourPort RAMs are powerful devices to efficiently
and compactly implement the memory space in these applications. More-
over, the IDT7052 and IDT7054 can increase the speed of these
types of systems since the FourPort SRAMs are as fast as conven-
tional SRAMs and eliminate the complex external logic which intro-
duces extra delay in these systems. In this application note, we will
demonstrate some examples of using the IDT7052 to implement a high
performance FFT processor and a matrix multiplication engine.
C
G
G = C + e jΩ • D
H = C - e jΩ • D
H
Figure 1. The signal flow graph of the butterfly
2684 drw 01
e jΩ
D
Using the IDT7052 in an FFT
Processor
The IDT7052 FourPort SRAM can dramatically simplify the design
of a high-speed pipelined FFT processor. The basic operation of any
FFT algorithm is the butterfly computation:
G = C + e
jW
H=C-e
x(0)
W
0
x(1)
x(2)
W
0
x(3)
x(4)
W
0
x(5)
x(6)
W
0
x(7)
Stage 1
W
2
W
2
W
0
W
0
jW
• D
(1-1)
• D
where C, D, G, and H are complex numbers. Figure 1 shows the
signal flow graph of the butterfly with one complex multiplication and
two complex additions. Given N = 2
L
input data samples x(0), x(1).....,
x(N-1), the FFT algorithm performs the Discrete Fourier Transform on
the input data to obtain the output data X(0), X(1)....., X(N-1) in L
stages of computation. Each stage consists of N/2 butterfly operations.
There are two basic versions of the FFT algorithm: decimation-in-time
(DIT) and decimation-in-frequency (DIF). Each version of the algo-
rithm can be implemented using two schemes: not-in-place computation
and in-place computation. A detailed discussion of the FFT algorithm
and its implementations is given in Reference (1).
Figure 2 shows the signal flow graph of the not-in-place computa-
tion of the DIT FFT algorithm for N = 8(L=3). A close look at Figure 2
will reveal the major strength of the not-in-place scheme. The signal
X(0)
W
0
X(4)
X(2)
W
1
X(6)
X(1)
W
2
X(5)
X(3)
W
3
X(7)
Stage 2
W
k
Stage 3
2684 drw 02
=
-j2πk/N
Figure 2. Signal Flow Graph of Not-In-Place Decimation-In-Time FFT for N=8
MARCH 2000
6.01
1
©2000 Integrated Device Technology, Inc.
2684/2
Using The IDT7052/7054 FourPort™ SRAMs
in DSP and Matrix Processing Applications
Application Note AN-42
paths from the initial inputs to the first intermediary step are repeated
between the first and second intermediary steps, and again between
the second and third. This means that three stages have identical data
access sequences. Therefore, the address generator can be very
easily implemented using the IDT7381/83, as compared with the in-
place scheme where more complex logic is required to generate the
addresses. On the other hand, from Figure 2 it is obvious that in each
stage of computation the output data is not in the same order as the
input data. For example, in the first stage the first and second inputs
x(0) and x(1) will go to the first and fifth locations after the butterfly
operation. Therefore, two separate buffers are needed to temporarily
store the input and output data in each stage computation.
A conventional implementation of the input and output buffers uses
two sets of dual-port SRAMs as illustrated in Figure 3. Suppose the
input data is already loaded into Buffer 1. Then, in the first stage of
computation the butterfly unit takes data from Buffer 1 and then loads
the results into Buffer 2. In the second stage of computation the butterfly
unit takes data from Buffer 2 and then loads the results into Buffer 1,
and so on. To switch between these two buffers, external logic such as
multiplexers and tri-state buffers are necessary as shown in Figure 3.
These devices not only occupy board space but also introduce extra
delay in the data path thus, decreasing the system performance. It
must be noted that C, D, G, H, and e
jW
in Figure 3 are all complex
numbers. Therefore, physically two groups of memories and buses
Address
Generator
IDT7381
Address
Generator
IDT7381
Address
Generator
IDT7381
Address
Generator
IDT7381
Dual-Port
SRAM
IDT7132/42
Addr
Dual-Port Addr
Buffer 1
SRAM
IDT7132/42 Data
Data
D
C
G
H
MUX
C
e jΩ
C
Addr
Addr
Buffer 2
Data
Data
D
MUX
D
G
H
Butterfly Unit
IDT7381
IDT7216
G
H
2684 drw 03
Figure 3. I/O Buffers Implemented by Two Sets of Dual-Port SRAM
2
6.01
Using The IDT7052/7054 FourPort™ SRAMs
in DSP and Matrix Processing Applications
Application Note AN-42
Address
Generator
IDT7381
Addr
Address
Generator
IDT7381
Addr
Address
Generator
IDT7381
Addr
Address
Generator
IDT7381
Addr
The IDT7052 FourPort SRAM
Data
C
Data
D
Data
H
Data
G
C
e
jΩ
D
Butterfly Unit
IDT7381
IDT7216
G
H
2684 drw 04
Figure 4. I/O Buffer Implemented By The IDT7052 FourPort SRAM
are needed to store and transmit the real part and the imaginary part
separately.
The IDT7052 FourPort SRAM provides a much simpler and more
efficient way to implement the input and output buffers as shown in
Figure 4. In this implementation, the input buffer and output buffer are
merged into a single memory space. Since each of the four ports can
access the whole memory space, two of them can be dedicated to
sending the data C and D to the butterfly unit and the
other two can be dedicated to receiving the results G and H from the
butterfly unit. In this way, all external logic can be eliminated and the
system performance is greatly improved.
Using the IDT7052 in a Matrix
Multiplication Engine for
Graphics and DSP
Matrix multiplication is one of the most often used operations in DSP
algorithms. In addition, matrix multiplication is the basic operation at the
heart of computer graphics. For example, changing the position, orien-
tation, and size of objects in a drawing requires a geoetrical transfor-
mation
M
which is generally represented by a series of matrix multipli-
cations.
M = M
1
• M
2
• M
3
•......•Mn
(2-1)
3
6.01
Using The IDT7052/7054 FourPort™ SRAMs
in DSP and Matrix Processing Applications
Application Note AN-42
Initial Address and Controls
CLK
Pipeline Reg
IDT
7382/81
Address
Generator
1
Pipeline Reg
Pipeline Reg
Address
Generator
2
Pipeline Reg
Pipeline Reg
Address
Generator
3
Pipeline Reg
Main
Processor
Addr
IDT6116 SRAM
Matrix A
I/O
Addr
IDT6116 SRAM
Matrix B
I/O
Addr
IDT6116 SRAM
Matrix C
I/O
Data
Data
Data
Main
Memory
and
Peripherals
Pipeline Reg
IDT7210
Pipeline Reg
Multiplier/Accumulator
(MAC)
Pipeline Reg
System
Bus
2684 drw 05
Figure 5. Implementation of Matrix Multiplication Engine Using Standard SRAMs
where
M
1
is a scaling, translation, or rotation matrix.
In high performance systems, a matrix multiplication engine (MME)
is necessary to facilitate the operation. A typical pipelined MME has the
architecture shown in Figure 5 [2]. Since the MME operates in a
pipelined manner, three standard SRAMs (IDT6116 2Kx8 SRAMs)
are needed to store the multiplicand matrix
A,
multiplier matrix
B,
and
the product matrix
C = A•B.
The matrices
A
and
B
are preloaded into
the two SRAMs from the main memory or a peripheral. The MME then
performs the matrix multiplica-tion and loads the product matrix
C
into
the third SRAM. Finally, the multiplication result is sent back to the main
memory or the peripheral. This implementation has two drawbacks:
1. Three separate sets of SRAMs are needed. This results in a high
chip count and a complicated interface to the system bus.
2. The arithmetic unit (IDT7210) of the MME is sitting idle when the
data is transferred between the memory buffers and the system
main memory. This dramatically decreases the system performance
especially when the MME executes a series of matrix multiplications
as given in (2-1).
Now, with the advent of the IDT7052, system designers can con-
siderably improve the performance of the MME by using the FourPort
4
6.01
Using The IDT7052/7054 FourPort™ SRAMs
in DSP and Matrix Processing Applications
Application Note AN-42
Initial Address and Controls
CLK
Pipeline Reg
Address
Generator
1
Pipeline Reg
Pipeline Reg
Address
Generator
2
Pipeline Reg
Pipeline Reg
Address
Generator
3
Pipeline Reg
Pipeline Reg
Address
Generator
3
Pipeline Reg
Main
Processor
IDT
7383/81
A-P1
I/O-P1
A-P2
A-P3
FourPort™ SRAM IDT7052
I/O-P2
I/O-P3
A-P4
I/O-P4
Data
Main
Memory
and
Peripherals
Pipeline Reg
IDT7210
Pipeline Reg
Multiplier/Accumulator
(MAC)
Pipeline Reg
System
Bus
2684 drw 06
Figure 6. New Implementation of Matrix Multiplication Engine Using The IDT7052 FourPort SRAM
single-chip SRAM instead of the standard SRAM. As shown in Figure
6, the new implementation reduces the chip count and simplifies the
interface between the MME and the other part of the system. More-
over, when executing a series of matrix multiplications as given in (2-
1), the MME is able to perform the arithmetic operation and the data
transfer in parallel, as illustrated in Figure 7. First, the matrices
M
1
and
M
2
are loaded into the FourPort SRAM. Then, while the arithmetic unit
performs the operation
M
1
•M
2
, a new matrix
M
3
can be loaded into an
unused area of the FourPort SRAM through the 4-th I/O port. Then,
the MME will perform the multiplication
M•M
3
and the result will be
stored in the location originally occupied by
M
1
. At the same time a
new matrix
M
4
can be loaded into the FourPort SRAM to replace
M
2
and so on. The operation sequence of the two implementations is
shown in Figure 8, where t
L
is the time to load a matrix into the
IDT7052, t
E
is the time for the arithmetic unit to perform a matrix multipli-
cation, and t
M
is the maximum of t
L
and t
E
. It can be readily seen from
Figure 8, where the total time to execute the operation given in (2-1) is
t
L
+(n-1)•(t
L
+t
E
) when conventional SRAMs are used. On the other
hand, the total time is 2t
L
+(n-1)•t
M
when the IDT7052 FourPort SRAM
is used. If we make t
L
and t
E
almost equal to each other then we can
almost double the system performance.
M
1
M
2
Unused
M
1
M
1
M
2
M
3
M
4
M
3
M
2684 drw 07
Unused
Figure 7. Using FourPort SRAMs, the MME Can Perform Arithmetic Operation and Data Transfer in Parallel
5
6.01