Unit 4 | unit 4 design of arithmetic building blocks and subsystem

Introduction to VLSI Design

Unit - 4

DESIGN OF ARITHMETIC BUILDING BLOCKS AND SUBSYSTEM

4.1 Arithmetic Building Blocks: Data Paths

Consider the example of an adder, that can add 4-values. Assume, we want to add, a, b, c, d. If the output is s, the code can be written as,

s <= ((a + b) + c) + d;

However, this results in a large circuit because the circuit will have 3 adders as shown in the figure,

Figure1. Adders

Use one adder and use it sequentially to add up all the four values. This is when we will need FSM/a control path. Let’s take a look how the circuit might look like,

Diagram, schematic

Description automatically generated

Figure2. Control path circuit

To define the data path, it works on 2-bit input sel and 1-bit inputs load and clear from the control path.

-- datapath

Library ieee;

Use ieee.std_logic_1164.all;

Use ieee.std_logic_arith.all;

Use work.averager_types.all;

Entity datapath is

Port (

a, b, c, d: in num;

Sum: out num;

Sel: in std_logic_vector (1 downto 0);

Load, clear, clk: in std_logic

);

End datapath;

Architecture rtl of datapath is

Signal mux_out, sum_reg, next_sum_reg:num;

Constant sum_zero:num:=

Conv_unsigned(0, next_sum_reg’length) ;

Begin

-- mux to select input to add

With sel select mux_out<=

a when "00",

b when "01",

c when "10",

d when others;

-- mux to select register input

Next_sum_reg<=

Sum_reg + mux_out when load = ’1’ else

Sum_zero when clear = ’1’ else

Sum_reg;

-- register sum

Process(clk)

Begin

If clk’event and clk = ’1’ then

Sum_reg<= next_sum_reg;

End if;

End process;

-- entity output is register output

Sum <= sum_reg;

End rtl;

The next state is the controller which controls the datapath.

-- controller

Library ieee;

Use ieee.std_logic_1164.all;

Use work.averager_types.all;

Entity controller is

Port (

Update in std_logic ;

Sel: out std_logic_vector (1 downto 0);

Load, clear: out std_logic;

Clk: in std_logic

);

End controller;

Architecture rtl of controller is

Signal s, holdns, ns: states ;

Signal tmp :std_logic_vector (3 downto 0) ;

Begin

-- select next state

With s select ns <=

Add_a when clr,

Add_b when add_a,

Add_c when add_b,

Add_d when add_c,

Hold when add_d,

Holdns when others; -- hold

-- next state if in hold state

Holdns<=

Clr when update = ’1’ else

Hold;

-- state register

Process(clk)

Begin

If clk’event and clk = ’1’ then

s <= ns;

End if;

End process;

-- controller outputs

With s select sel<=

"00" when add_a,

"01" when add_b,

"10" when add_c,

"11" when others;

Load <= ’0’ when s = clr or s = hold else ’1’;

Clear <= ’1’ when s = clr else ’0’;

End rtl;

4.2 Adders

Basic 4-bit adder

Simplest design: cascade full adders

- Critical path goes from Cin to Cout

- Design full adder to have fast carry delay

Critical path passes through majority gate

- Built from minority + inverter

- Eliminate inverter and use inverting full adder

Carry Look ahead adder

To improve the speed of addition operation, carry propagation delay of adders is important.

Carry look ahead adder reduces circuit delay in ripple carry adder by calculation carry generation Cg, carry propagation, Cp.
Carry-look ahead adder computes Gi:0 for many bits in parallel.
Hence, addition can be reduced to a three-step process: - Computing bitwise generate and propagate signals using Gi=Ai.Bi, Pi=Ai  Bi
Combining PG signals to determine group generates Gi–1:0 for all N > i >1 using Gi=Ci+PiGi-1
Calculating the sums using Si=Pi  Gi–1:0
Some of the hardware can be shared in the bitwise PG logic

Fig 3 Look ahead carry

Uses higher-valency cells with more than two inputs

In general, a CLA using k groups of n bits each has a delay of tCLA = tpg +tpg(n)+[(n-1) + (k-1)] tAO + txor

Carry Bypass Adder

Carry-ripple is slow through all N stages
Carry-skip allows carry to skip over groups of n bits
Decision based on n-bit propagate signal

It involves the initial PG logic producing a carry out of bit 1, three AND-OR gates rippling it to bit 4, three multiplexers bypassing it to C12, 3 AND-OR gates rippling through bit 15, and a final XOR to produce S16.
In general, an N-bit carry-skip adder using k n-bit groups (N = n × k) has a delay of Tskip = tpg +2[(n-1) + (k-1)] tAO + txor

Fig 4 Carry By-pass adder

Carry select adder

Trick for critical paths dependent on late input X - Precompute two possible outputs for X = 0, 1 - Select proper output when X arrives
Carry-select adder precomputes n-bit sums - For both possible carries into n-bit group
One adder calculates the sums assuming a carry-in of 0 while the other calculates the sums assuming a carry-in of 1.
The actual carry triggers a multiplexer that chooses the appropriate sum. The critical path delay is Tselect = tpg +[n+(k-2)] tAO + tmux

Fig 5 Carry select adder

4.3 Multipliers

Booth Multiplier

Booth encoding was originally proposed to accelerate serial multiplication.
Modified Booth encoding allows higher radix parallel operation without generating the hard 3Y multiple by instead using negative partial products.

Inputs			Partial Product	Boolean series
X2l+1	X2l	X2l-1	PP1	SINGLEL	DOUBLEl	NEGl
0	0	0	0	0	0	0
0	0	1	Y	1	0	0
0	1	0	Y	1	0	0
0	1	1	2Y	0	1	0
1	0	0	-2Y	0	1	1
1	0	1	-Y	1	0	1
1	1	0	-Y	1	0	1
1	1	1	-0 (=0)	0	0	1

Booth Hardware

Booth encoder generates control lines for each PP
Booth selectors choose PP bits

Sign Extension

Partial products can be negative
Require sign extension, which is cumbersome
High fanout on most significant bit

Simplified Sign Ext

Sign bits are either all 0’s or all 1’s
Note that all 0’s is all 1’s + 1 in proper column
Use this to reduce loading on MSB

4x4 array multiplier

Fast multipliers use carry-save adders to sum the partial products.
Each cell contains a 2-input AND gate that forms a partial product and a full adder (CSA) to add the partial product into the running sum.
The first row converts the first partial product into carry-save redundant form. Each later row uses the CSA to add the corresponding partial product to the carry-save redundant result of the previous row and generate a carry-save redundant result.
The least significant N output bits are available as sum outputs directly from CSAs.
The array is regular in structure and uses a single type of cell, so it is easy to design and layout.

Fig 6 Array Multiplier

Total Number of logic units in n-bit × m bit Array Multiplier • n × m two-input ANDs and (m – 1) units of n-bit adders.

Rectangular Array

Squash array to fit rectangular floorplan

Delay calculation

Total Delay in n-bit × m bit Array Multiplier
Delay due to ANDs in partial products at all level is just one unit AND gate delay.
But delay at levels 1 to (m – 1) units of n-bit adders = (m – 1) × delay of one unit 16-bit adder
The delay in adders is very large if ripple carry adders are used.
The delay in adders reduced by using carry-look ahead adders.

Multiplier for 5 bits by 4 bits

5 X 4 array multiplier

Each cell contains a 2-input AND gate that forms a partial product and a full adder (CSA) to add the partial product into the running sum.
The least significant N output bits are available as sum outputs directly from CSAs.
The array is regular in structure and uses a single type of cell, so it is easy to design and layout.

It requires 2 Half adders, 5 Full adders.
So, it needs more area. Difficult to design.

Wallace multiplier

A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers.
The Wallace tree has three steps:
Multiply each bit of one of the arguments, by each bit of the other, yielding n2 results.
Reduce the number of partial products to two by layers of full and half adders.
Group the wires in two numbers, and add them with a conventional adder.
The second phase works as follows. As long as there are three or more wires with the same weight add a following layer
Take any three wires with the same weights and input them into a full adder. The result will be an output wire of the same weight and an output wire with a higher weight for each three input wires.
If there are two wires of the same weight left, input them into a half adder.
If there is just one wire left, connect it to the next layer.
The conventional Wallace tree algorithm reduces the propagation by incorporating 3:2 compressors.
However, Wallace tree algorithm can also reduce the propagation using higher order compressor. The below fig explains the steps of Wallace tree multiplier.

Fig 7 Wallace Multiplier

Key takeaway

A multiplication method in which an array of identical cells generates new partial product and accumulation of it at the same time. We can use pipelines at each level. The delay is logarithmically proportional to the bit size of multiplicand and multiplier if we use the high-speed array multiplier circuit.

4.4 Shifters

Logical Shift: – Shifts number left or right and fills with 0’s

• 1011 LSR 1 = 0101 1011 LSL1 = 0110

Arithmetic Shift: – Shifts number left or right. Rt shift sign extends

• 1011 ASR1 = 1101 1011 ASL1 = 0110

Rotate: – Shifts number left or right and fills with lost bits

• 1011 ROR1 = 1101 1011 ROL1 = 01

Funnel Shifter

A funnel shifter can do all six types of shifts
Selects N-bit field Y from 2N–1-bit input
Shift by k bits (0 ≤ k < N)
Logically involves N N:1 multiplexer.

Shift Type	Z2N-2:N	ZN-1	ZN-2:0	Offset
Rotate Right	AN-2:0	AN-1	AN-2:0	K
Logical Right	0	AN-1	AN-2:0	K
Arithmetic Right	Sign	AN-1	AN-2:0	K
Rotate Left	AN-1:1	A0	AN-1:1
Logical/ Arithmetic Left	AN-1:1	A0	0

Array Funnel Shifter

N N-input multiplexers
Use 1-of-N hot select signals for shift amount
nMOS pass transistor design (Vt drops!)

Fig 8 Array Funnel shifter

Logarithmic Funnel Shifter

Log N stages of 2-input muxes
No select decoding needed

Fig 9 Log Funnel Shifter

Barrel Shifter

Barrel shifters perform right rotations using wraparound wires.
Left rotations are right rotations by N – k = k + 1 bits.
Shifts are rotations with the end bits masked off.

Fig 10 Barrel shifter

32-bit Logarithmic Barrel

Datapath never wider than 32 bits. First stage pre-shifts by 1 to handle left shifts.

Fig 11 32-bit Log barrel

4.5 ALUs

ALU’s comprise the combinational logic that implements logic operations such as AND, OR, NOT gate and arithmetic operations, such as Adder, Subtractor.

Functionally, the operation of typical ALU is represented as shown in diagram below.

Figure 12. Functional Description of 4-bit Arithmetic Logic Unit

Controlled by the three function select inputs (sel 2 to 0), ALU can perform all the 8 possible logic operations

VHDL Code for 4-bit ALU

Library IEEE;

Use IEEE.STD_LOGIC_1164.ALL;

Use IEEE.NUMERIC_STD.ALL;

Entity alu is

Port ( inp_a : in signed(3 downto 0);

Inp_b: in signed(3 downto 0);

Sel: in STD_LOGIC_VECTOR (2 downto 0);

Out_alu: out signed(3 downto 0));

End alu;

Architecture Behavioral of alu is

Begin

Process(inp_a, inp_b, sel)

Begin

Case sel is

When "000" =>

Out_alu<= inp_a + inp_b; – addition

When "001" =>

Out_alu<= inp_a - inp_b; – subtraction

When "010" =>

Out_alu<= inp_a - 1; – sub 1

When "011" =>

Out_alu<= inp_a + 1; – add 1

When "100" =>

Out_alu<= inp_a and inp_b; – AND gate

When "101" =>

Out_alu<= inp_a or inp_b; – OR gate

When "110" =>

Out_alu<= not inp_a; – NOT gate

When "111" =>

Out_alu<= inp_a xor inp_b; – XOR gate

When others =>

NULL;

End case;

End process;

End Behavioral;

Testbench VHDL Code for 4-Bit ALU

LIBRARY ieee;

USE ieee.std_logic_1164.ALL;

USE ieee.numeric_std.ALL;

ENTITY Tb_alu IS

END Tb_alu;

ARCHITECTURE behavior OF Tb_alu IS

– Component Declaration for the Unit Under Test (UUT)

COMPONENT alu

PORT(

Inp_a: IN signed(3 downto 0);

Inp_b: IN signed(3 downto 0);

Sel: IN std_logic_vector(2 downto 0);

Out_alu: OUT signed(3 downto 0)

);

END COMPONENT;

– Inputs

Signal inp_a: signed(3 downto 0) := (others => '0');

Signal inp_b: signed(3 downto 0) := (others => '0');

Signal sel: std_logic_vector(2 downto 0) := (others => '0');

– Outputs

Signal out_alu: signed(3 downto 0);

BEGIN

– Instantiate the Unit Under Test (UUT)

Uut: alu PORT MAP (

Inp_a =>inp_a,

Inp_b =>inp_b,

Sel =>sel,

Out_alu =>out_alu

);

– Stimulus process

Stim_proc: process

Begin

– hold reset state for 100 ns.

Wait for 100 ns;

– insert stimulus here

Inp_a<= "1001";

Inp_b<= "1111";

Sel<= "000";

Wait for 100 ns;

Sel<= "001";

Wait for 100 ns;

Sel<= "010";

Wait for 100 ns;

Sel<= "011";

Wait for 100 ns;

Sel<= "100";

Wait for 100 ns;

Sel<= "101";

Wait for 100 ns;

Sel<= "110";

Wait for 100 ns;

Sel<= "111";

End process;


END;

Simulation Result for 4-bit ALU

Figure 13. ALU

4.6 Power and speed tradeoffs

Speed, area and power can be trade off through the choice of the supply voltages, transistor threshold and device sizes.
Some design techniques are implemented at design time.
Transistor width s and lengths can be fixed at the time of design.
A reduction in supply voltage results in power savings and thus is the most attractive approach.
Reduced supply evenly lowers the power dissipation of all the logic gates.
In this approach, non –critical path having timing slack is supplied with low voltage without affecting the system performance.
Important design concepts:

- To select right structure before starting a circuit optimization.

- Determine the critical timing path through the circuit.

- Circuit size is not only determined by the number and size of the transistors.

- An obscure optimization can sometimes help to get a better result.

- Power and speed can be traded off through a choice of circuit sizing, supply voltages and transistor threshold.

4.7 Case Study: Design as a trade-off

In order to fully evaluate the benefit of a serial link with respect to the parallel topology we should consider both Ag and Ec. Since the proposed scheme guarantees no performance loss, the benefits of serialization for a specific design can be easily evaluated without advanced architectural studies. Given the specifications for the power consumption, we can estimate the maximum energy per bit allowed for the intra-chip communication. Hence, for each design, the best serialization level can be chosen such that the power consumption of the vertical link does not exceed the one required by the design specifications.

Fig 14: Ag and Ec for (a) 5µm TSVs (b) 10µm TSVs and (c) 40µm TSVs

This study can be considered as a reference for 3D -ICs architects. The graphs in Figure a, b and c depicts the area-energy trends for each TSV technology in a 2-layer system. For large TSVs the energy cost due to the serialization is almost negligible with respect to the area saved by reducing the number of TSVs. Reducing the TSV diameter we can see that Ag lowers while Ec increases. As discussed in the previous section, despite the power overhead, the energy per bit still remains low enough to consider the link as low-power. We define the energy gain as

EG = 1/Ec

The best trade-off between area and energy can be evaluated by finding the serialization level that gives the most area saved at the minimum energy cost. To the end, the product Ag xEg plotted in Figure below.

Fig 15 Area gain - energy gain product.

The best area-energy trade-off corresponds to the maximum of the Ag xEg plot. It is interesting to notice that for the TSV technologies considered, this is independent from the TSV size, with 8-bit serialization level being the optimum point.

4.8 Designing Memory and Array structures: Memory Architectures and Building Blocks

The memory array is classified into 3 types - Random Access memory (RAM), Serial access memory and content addressable memory (CAM). We will discuss each type in detail.

Read only memory (ROM)

The basic idea of the memory that can only be read and never altered is called Read only memories. There are vast and variety of potential applications for these kinds of memories. Programs for processors with fixed applications such as washing machines, calculators and game machines, once developed and debugged, need only reading. Fixing the contents at manufacturing time leads to small and fast implementation. There are different ways to implement the logic of ROM cells, the fact that the contents of a ROM cell are permanently fixed considerably simplifies its design. The cell should be designed so that a „0‟ or „1‟ is presented to the bit line upon activation of its” wordline. The different approaches for implementing the ROM cells are Diode ROM, MOS ROM 1 and MOS ROM 2. These are the main approaches for designing a larger density ROMs.

Programmable ROM (PROM):

The technology that offers its users to program the memory one time is called Programmable ROM. It is also called as WRITE ONCE device. This is most often accomplished by introducing fuses (implemented in nichrome, polysilicon, or other conductors) in the memory cell. During the programming phase, some of these fuses are blown by applying a high current, which disables the connected transistor. While PROMs have the advantage of being “customer programmable,” the single write phase makes them unattractive. For instance, a single error in the programming process or application makes the device unstable. This explains the current preference for devices that can be programmed several times. The Floating-Gate transistor is the device at the heart of the majority of reprogrammable memories. Various attempts have made to create a device with electrically alterable characteristics and enough reliability to support a multitude of write cycles. The floating gate structure is similar to a traditional MOS device, except that an extra polysilicon strip is inserted between the gate and channel.

This strip is not connected to anything and is called a floating gate. The most obvious impact of inserting this extra gate is to double the gate oxide thickness tox, which results in a reduced device transconductance as well as an increased threshold voltage. Though these properties are not desirable but from other point of view this device acts as a normal transistor. The most important property of this device is that the threshold voltage of this device is programmable. By applying a high voltage (above 10V) between the source and the gate-drain terminals creates a high electric field and causes avalanche injection to occur.

Electrons acquire sufficient energy to become “hot” and traverse through the first oxide insulator, so that they get trapped on the floating gate. In reference to the programming mechanism, the floating-gate transistor is often called a floating-gate avalanche- injection MOS. The trapping of electrons on the floating gate effectively drops the voltage on the gate. This process is self-limiting – the negative charge accumulated on the floating gate reduces the electrical field over the oxide so that ultimately it becomes incapable of accelerating any more hot electrons. Virtually all non-volatile memories are currently based on the floating-gate mechanism. Different classes can be identified, based on the erasure mechanism.

Erasable-programmable Read-Only Memory (EPROM):

The erasure mechanism in EPROM is based on the shining ultraviolet light on the cells through a transparent window in the package. The UV radiation renders the oxide to conduct by the direct generation of electron-hole pairs in the material. The erasure process is slow depending on the UV source, it can take from seconds to several minutes. The programming takes several µs/word. Alternatively, there is another problem which exists is the limited endurance - the number of erase/program cycles is limited to a maximum of one thousand mainly as a result of UV erasing procedure. The device thresholds might vary with repeated programming cycles. The on-chip circuitry is designed in such a way that it also controls the value of the thresholds to within a specified range during programming.

The injection of large channel current of 0.5 mA at a control gate voltage of 12.5V causes high power dissipation during programming. On the other hand, EPROM is extremely simple and dense, making it possible to fabricate large memories at a low cost. Therefore, EPROMs were attractive in applications that do not require reprogramming. The major disadvantage of the EPROM is that the erasure procedure has to occur “off system”. This means the memory must be removed from the board and placed in an EPROM programmer for programming.

Electrically Erasable Programmable Read-Only Memory EEPROM)

The disadvantage of the EPROM [16] is solved by using a method to inject or remove charges from a floating-gate namely – tunneling. A modified floating-gate device called FLOTOX (floating-gate tunneling oxide) transistor is used as programmable device that supports an electrical-erasure procedure. It resembles FAMOS (floating-gate avalanche MOS) device, except that a portion of the dielectric separating the floating gate from the channel and drain is reduced in thickness to about 10 nm or less.

The main advantage of this programming approach is that it is reversible; that is, erasing is simply achieved by reversing the voltage applied during the writing process. The electrons injection on floating-gate raises the threshold, while the reverse operation lowers the VT. When a voltage of approximately 10V (equivalent to 109 V/m) is applied over the thin insulator, electrons travel to and from the floating gate through a mechanism called Fowler – Nordheim tunneling.

Flash Electrically Erasable Programmable ROM (Flash):

The concept of Flash EEPROMs is a combination of density of EPROM with versatility of EEPROM structures, with cost and functionality ranging from somewhere between two. Most Flash EEPROM devices use the avalanche hot-electron-injection approach to program the device. Erasure is performed using Fowler – Nordheim tunneling, as from EEPROM cells. The main difference is that erasure procedure is performed in bulk for a complete chip or for the subsection of the memory. Erasing complete memory core at once makes it possible to carefully monitor of the device characteristics during erasure.

The monitoring control hardware on the memory chip regularly checks the value of the threshold during erasure, and adjusts the erasure time dynamically. This approach is only practical when erasing large chunks of memory at a time; hence the flash concept. One of the many existing alternatives for Flash EEPROMs memories are ETOX devices. It resembles a FAMOS gate except that a very thin tunneling gate oxide is utilized (10 nm). Different areas of the gate oxide are used for programming and erasure. Programming is performed by applying a high voltage (12V) on the gate and drain terminals for a grounded source, while erasure occurs with the gate rounded and the source at 12V.

The Programming cycle starts with an erase operation. In erase operation, A 0V gate voltage is applied and a 12V supply is given at source. Electrons, if any, are ejected to the source by tunneling. All cells are erased simultaneously. The variations caused in the threshold voltage at the end of erase operation are due to different initial values of cell threshold voltage and variations in oxide thickness. This can be solved in two methods: 1.

The array cells are programmed before applying the erase pulse so that the entire threshold starts at approximately same time. 2. An erase pulse of controlled width is applied. Subsequently the whole array is read to ensure that all the cells are erased. If not, another erase pulse is applied followed by the read cycle. For write (programming) operation, a high voltage is applied to the gate of the selected device. If a „1‟ is applied to the drain at that time, hot electrons are generated and injected onto the floating gate, raising the threshold. Read operation corresponds as the wordline is raised to 5V; it causes a conditional discharge of bitline.

Random Access memory (RAM):

Random access memory is a type of computer data storage. It is made of integrated circuits that allow the stored data to be accessed in any order i.e., at random and without the physical movement of storage medium or a physical reading head. RAM is a volatile memory as the information or the instructions stored in the memory will be lost if the power is switched off. The word “random” refers to the fact that any piece of data can be returned at a constant time regardless of its physical location and whether or not it is related to the previous piece of data.

This contrasts with the physical movement devices such as tapes, magnetic disks and optical disks, which rely on physical movement of the recording medium or reading head. In these devices, the retrieval time varies with the physical location and the movement time takes longer than the data transfer. The main advantages of RAM over types of storage which require physical movement is that retrieval times are short and consistent. Short because no physical movement is necessary and consistent the time taken to retrieve the data does not depend on the current distance from a physical head. The access time for retrieving any piece of data in RAM chip is same.

The disadvantages are its cost compared to the physical moving media and loss of data when power is turned off RAM is used as 'main memory' or primary storage because of its speed and consistency. The working area used for loading, displaying and manipulating applications and data. In most personal computers, the RAM is not an integral part of the motherboard or CPU. It comes in the easily upgraded form of modules called memory sticks. These can quickly be removed and replaced when they are damaged or when the system needs up gradation of memory depending on current purposes. A smaller amount of random- access memory is also integrated with the CPU, but this is usually referred to as "cache" memory, rather than RAM. Modern RAM generally stores a bit of data as either a charge in a capacitor, as in dynamic RAM, or the state of a flip-flop, as in static RAM.

Static Random Access Memory (SRAM)

The typical SRAM design is shown in figure the memory array contains the memory cells which are readable and writable. The Row decoder selects from 1 out of n = 2k rows, while the column decoder selects l = 2i out of m = 2j columns. The addresses are not multiplexed as it in the DRAM. Sense amplifier detects small voltage variations on the memory complimentary bitline which reduces the reading time. The conditioning circuit is used to pre-charge the bitlines.

Fig 14 SRAM Architecture

In a read operation, the bitlines are precharged to some reference voltage usually close to the supply voltage. When word line turns high, the access transistor connected to the node storing „0‟ starts discharging the bitline while the complementary bitline remains in its precharged state, resulting in a differential voltage between the bitline pair. Since the SRAM has an optimized area results in a small cell current and slow bitline discharge rate. In order to speed up the RAM access, sense amplifiers are used which amplify the small bitline signal and eventually drive it to the external world.

The word “static” means that the memory retains its contents as long as the power is turned on. Random access means that locations in the memory can be written to or read from in any order, regardless of the memory location that was last accessed. Each bit in an SRAM is stored on four transistors that form two cross-coupled inverters. This storage cell has two stable states which are used to denote „0‟ and „1‟.

The access transistors are used to access the stored bits in the SRAM during read or write mode. It thus typically takes six MOSFETs to store one memory bit. Access to the cell is enabled by the word line WL which controls the two access transistors N1 and N2 which, in turn, control whether the cell should be connected to the bitlines BL and /BL. They are used to transfer data for both read and write operations.

Fig 15 A logic diagram of a CMOS static memory cell

Fig 16 A schematic of a CMOS static memory cell

Principles of operations

In order to consider operation of the static read/write memory we have to take into account: Relatively large parasitic column capacitances, CC and Cc

Column pull-up pMOS transistors, as shown in Figure

A CMOS static memory cell with column pull-up transistors and parasitic column capacitances

When none of the word lines is selected, that is, all S signals are ‘0’, the pass transistors n3, n4 are turned off and the data is retained in all memory cells. The column capacitances are charged by the drain currents of the pull-up pMOS transistors, p3, p4. The column voltages VC and V c¯ both reach the level just below VDD − VT p, say 3.5V for VDD = 5V and the threshold voltage VT p = 1V.

For the read or write operations we select the cell asserting the word line signal S= ‘1’. For the write operation we apply a low voltage to one of the bit line, holding the other one high. To write ‘0’ in the cell, the column voltage VC is forced to low (C = 0). This low voltage acts through a related pass transistor (n3) on the gates of the corresponding inverter (n2, p2) so that its input goes high. This sets the signal at the other inverter Q = 0. Similarly, to write ‘1’ in the cell, the opposite column voltage VC¯ is forced to low (C¯ = 0) which sets the signal Q = 1.

During the read ‘1’ operation, when the stored bit is Q = 1, transistors n3, p1 and n4, n2 are turned on. This maintains the column voltage VC at its steady state high level (say 3.5V) while the opposite column voltage VC¯ is being pulled down discharging the column capacitance CC¯ through transistors n4, n2 so that VC > VC¯. Similarly, during the read ‘0’ operation we have VC < VC¯. The difference between the column voltages is small, say 0.5V, and must be detected by the sense amplifiers from data-read circuitry.

Key takeaway

EEPROM The main advantage of this programming approach is that it is reversible; that is, erasing is simply achieved by reversing the voltage applied during the writing process. The electrons injection on floating-gate raises the threshold, while the reverse operation lowers the VT.

4.9 Memory Core

SRAM Read Circuitry

The structure of the read circuitry is shown in Figure

Fig 17 The structure of the write circuitry associated with one column of the memory cells. During the read operation the voltage level on one of the bit lines drops slightly after the pass transistors in the memory cell are turned on.

The read circuitry consists of two-level sense amplifiers:

• One simple cross-coupled sense amplifier per column of memory cells,

• One current-mirror differential sense amplifier per the memory chip.

The cross-coupled sense amplifier works as a latch. Assume that the voltage on the bit line C start to drop slightly when the memory access pass transistors are activated by the word line signal S, and that the clk signal is high so that the transistor M3 is turned on. Now, higher voltage on the gate of M1 transistor than on the gate of M2 starts the latching operation which pulls the VC voltage further down switching the transistor M2 off.

As a result, the parasitic capacitance, CC is discharged through M1 and M3. In this way a small difference between column voltages is amplified. The amplified (discriminated) column voltages are passed through transistors M4 and M5 to the main sense amplifier. The schematic of a typical differential current-mirror sense amplifier is shown in Figure

Fig 18 A CMOS differential current-mirror sense amplifier

Dynamic Read-Write Memory (DRAM)

In the static CMOS read-write memory data is stored in six-transistor cells. Such a memory is fast and consumed small amount of static power. The only problem is that a SRAM cell occupies a significant amount of silicon space. This problem is addressed in the dynamic read-write memory (DRAM). In a dynamic RAM binary data is stored as charge in a capacitor. The memory cell consists of a storage capacitor and an access transistor as shown in Figure

Fig 19 A one-transistor DRAM memory cell

Data stored as charge in a capacitor can be retained only for a limited time due to the leakage current which eventually removes or modifies the charge. Therefore, all dynamic memory cells require a periodic refreshing of the stored data before unwanted stored charge modifications occur. Typical storage capacitance has a value of 20 to 50 Ff. Assuming that the voltage on the fully charged storage capacitor is V = 2.5V, and that the leakage current is I = 40pA

The DRAM cell occupies much smaller silicon area than the SRAM cell. The size of a DRAM cell is in the order of 8F2, where F is the smallest feature size in a given technology. For F = 0.2μm the size is 0.32μm2

• No static power is dissipated for storing charge in a capacitance. The storage capacitance CS, which is connected between the drain of the access transistor (the storage node) and the ground, is formed as a trench or stacked Capacitor. The stacked capacitor is created between a second polysilicon layer and a metal plate covering the whole array area. The plate is effectively connected to the ground terminal. To consider read/write operations we have to take into account a significant parasitic capacitance CC associated with each column, as shown in Figure

Fig 19 A single SRAM cells with a column capacitance shown

Typically, before any operation is performed each column capacitance b is precharged high. The cell is selected for a read/write operation by asserting its word line high (S = 1). This connects the storage capacitance to the bit line. The write operation is performed by applying either high or low voltage to the bit line thus charging (write ‘1’) or discharging (write ‘0’) the storage capacitance through the access transistor. During read operation there is a flow of charges between the storage capacitance C1 and the column capacitance, CC.

As a result, the column voltage either increases (read ‘1’) or decreases (read ‘0’) slightly. This difference can then be amplified by the sense amplifier. Note that the read operation destroys the charge stored on the storage capacitance C1 (“destructive readout”). Therefore, the data must be restored (refreshed) each time the read operation is performed.

Fig 20 Three Transistor Dynamic RAM

The write operation performed is shown for three transistor Dynamic RAM (Figure 1.2) as the appropriate data value is written on BL1 and asserting the write-wordline (WWL). The data is retained as charge on capacitance Cs once WWL is lowered. When reading the cell, the readwordline (RWL) is raised. The storage transistor M2 is either on or off depending upon the stored value. The bitline BL2 is precharged to VDD before performing read operation. The series connection of M2 and M3 pulls BL2 low when a “1” is stored. BL2 remains high in the opposite case. The cell is inverting; that is, the inverse value of the stored signal is sensed on the bitline.

DRAM Timing:

 DRAM module is asynchronous, Timing depends on how long it takes to respond to each operation.

Fig 21 Timing Diagram

DRAM cannot be read as fast (or as easy) as SRAM.

4.10 Memory Peripheral Circuitry

The peripheral memory circuitry consists of

Row and column decoders
Read bit line precharge logic
Sense amplifiers
Timing and control

Row Decoders

Collection of 2M complex logic gates organized in a regular, dense fashion

(N)AND decoder for 8 address bits

WL (0) =! A7 &! A6 &! A5 &! A4 &! A3 &! A2 &! A1 &! A0

WL (255) = A7 & A6 & A5 & A4 & A3 & A2 & A1 & A0

NOR decoder for 8 address bits

WL (0) =! (A7 | A6 | A5 | A4 | A3 | A2 | A1 | A0)

WL (255) =! (! A7 |! A6 |! A5 |! A4 |! A3 |! A2 |! A1 |! A0)

Implementing a Wide NOR Function

Single stage 8x256 bit decode

One 8 input NOR gate per row x 256 rows = 256 x (8+8) = 4,096

Pitch match and speed/power issues

Decompose logic into multiple levels

! WL (0) =! (! (A 7 | A 6) &! (A 5 | A 4) &! (A 3 | A 2) &! (A1 | A 0))

First level is the predecoder (for each pair of address bits, form Ai|Ai-1, Ai|! Ai-1, Ai|Ai-1, and! Ai|! Ai-1) Sp11 CMPEN 411 L24 S.8 i i-1 i i-1 i i-1 i i-1

Second level is the word line driver

Predecoders reduce the number of transistors required Four sets of four 2-bit NOR predecoders = 4 x 4 x (2+2) = 64

256-word line drivers, each a four input NAND – 256 x (4+4) = 2,048 - 4,096 vs 2,112 = almost a 50% savings

Number of inputs to the gates driving the WLs is halved, so the propagation delay is reduced by a factor of ~4

Fig NAND decoder using 2-input pre input pre-decoders

Pass Transistor Based Column Decoder

Read: connect BLs to the Sense Amps (SA)
Writes: drive one of the BLs low to write a 0 into the cell
Fast since there is only one transistor in the signal path. However, there is a large transistor count ((K+1)2 K + 2 x 2 K)
For K = 2 → 3 x 2 2 (decoder) + 2 x 22 (PTs) = 12 + 8 = 20

Tree Based Column Decoder

Number of transistors reduced to (2 x 2 x (2 K -1))
For K = 2 → 2 x 2 x (2 2 – 1) = 4 x 3 = 12
Delay increases quadratically with the number of sections (K) (so prohibitive for large decoders)
Can fix with buffers, progressive sizing, combination of tree and pass transistor approaches.

Bit Line Precharge Logic

First step of a Read cycle is to precharge (PC) the bit lines to VDD every differential signal in the memory must be equalized to the same voltage level before Read
Turn off PC and enable WL: The grounded PMOS load limits the bit line swing (speeding up the next precharge cycle)

Sense Amplifiers

Amplification – resolves data with small bit line swings (in some DRAMs required for proper functionality)

Delay reduction – compensates for the limited drive capability of the memory cell to accelerate BL transition

Power reduction – eliminates a large part of the power dissipation due to charging and discharging bit lines

Signal restoration – for DRAMs, need to drive the bit lines full swing after sensing (read) to do data refresh

Approaches to Memory Timing

Key takeaway

Decoder complexity comparison

Conf.	Data/Row	Row Decoder	Column Decoder
1D	8b	10b = a 10x210 decoder single stage = 20,480 Two stage = 10.320
2D	32b (32x256 core)	8b=8x28 decoder Single stage = 4,096 T Two stage = 2,112T	2b = 2x22 decoder PT=76 T Tree=96 T
2D	64b (64x128 core)	7b=7x27 decoder Single stage=1,792 T Two Stage = 1,072 T	3b = 3x23 decoder PT = 160 T Tree = 224 T
2D	128b (128x64 core)	6b = 6x26 decoder Single Stage = 768T Two stage = 432 T	4b = 4x24 decoder PT = 336 T Tree = 480 T

References:

1. D.A. Pucknell, K. Eshraghian, ‘Basic VLSI Design’, 3rd Edition, Prentice Hall of India, New Delhi,

2003.

2. Debprasad Das, VLSI Design, Oxford University Press, 2010.

3. Eugene D. Fabricius, ‘Introduction to VLSI Design’, Tata McGraw Hill, 1990.

Sign Up

Index

Notes

Highlighted

Underlined

Browse by Topics

Notes

Highlighted

Underlined