

#### **INTEGRA:**

Fast Multi-Bit Flip-Flop Clustering for Clock Power Saving Based on Interval Graphs



Iris Hui-Ru Jiang, Chih-Long Chang, Yu-Ming Yang, Evan Yu-Wen Tsai, Lancer Sheng-Fong Chen

Nat'l Chiao Tung Univ. / Faraday Tech Corp.

## Outline

- Introduction
- Problem & properties
- Algorithm: INTEGRA
- Experimental results
- Conclusion

## **Clock Power Dominates!**

- Power has become one bottleneck for circuit implementation
- Clock power is the major dynamic power source
  - The clock signal toggles in each cycle ⇒ High switching activity
- Clock power model: dynamic power
  - $P_{clk} = C_{clk} V_{dd}^2 f_{clk}$





Power breakdown of an ASIC

Chen *et al*. Using multi-bit flip-flop for clock power saving by DesignCompiler. *SNUG*, 2010.

## Multi-Bit Flip-Flops

- A multi-bit flip-flop (MBFF)
  - Cluster several single-bit flip-flops (share the drive strength)




## Clock Power Saving using MBFFs (1/2)

#### Reduce switching capacitance charged/discharged by clock

| Switching capacitance                | Clock power saving                                          | Other benefits                         |
|--------------------------------------|-------------------------------------------------------------|----------------------------------------|
| Clock sinks (flip-flops)             | Small FF capacitance: share C into FF clock pins            | Small area: share the inverter chain   |
| Clock network (wires, clock buffers) | Small wire/buffer capacitance: #leaf ↓ ⇒ depth ↓, #buffer ↓ | Regular topology and easy skew control |



## Clock Power Saving using MBFFs (2/2)

- Clock power reduction can be significant
  - FF clock pins, clock buffers/inverters, wires in clock network
- Wire power overhead on data pins is small
  - Wirelength on data pins << total wirelength



## Prior Works on MBFF Clustering


- Logic synthesis
  - [Chen et al., SNUG-10]
- Early physical synthesis
  - [Hou et al., ISQED-09]
- Post-placement: timing and routing
  - [Yan and Chen, ICGCS-10]
    - Minimum clique partitioning
    - Greedy clustering
    - Contiguous and infinite MBFF library
  - [Chang et al., ICCAD-10]
    - Window-based clustering
    - Maximum independent set
    - Discrete and finite MBFF library



#### **INTEGRA**

- Since post-placement MBFF clustering is NP-hard, our goal is to solve it effectively and efficiently instead of optimally.
  - Do not enumerate all possible combinations (maximal cliques)
  - Do not depend on the number of layout grids/bins
  - Do not operate on a general graph

#### Features:

- Efficient representation: a pair of linear-size sequences
- Fast operations: coordinate transformation
- Few decision points: #decision points << #flip-flops
  - We cluster flip-flops only at decision points, which leads to an efficient clustering scheme.
- Global relationships among flip-flops: cross bin boundaries


# The Multi-Bit Flip-Flop Clustering Problem

- Clock power saving using multi-bit flip-flops
- Given
  - MBFF library
  - Netlist & placement
  - Timing slack constraints (in terms of wirelength)
  - Placement density constraint
- Find
  - MBFF clustering to
  - Minimize
    - Clock dynamic power
    - Wirelength
  - Subject to
    - Timing slack constraints (in terms of wirelength)
    - Placement density constraints



## MBFF Library

#### MBFF library

Lexicographical order: <1,100,100>, <2,172,192>, <4,312,285>

| Bit number | Power | Area | Normalized power per bit | Normalized area per bit |
|------------|-------|------|--------------------------|-------------------------|
| 1          | 100   | 100  | 1.00                     | 1.00                    |
| 2          | 172   | 192  | 0.86                     | 0.96                    |
| 4          | 312   | 285  | 0.78                     | 0.71                    |
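The per-bit columns of the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming that "normalized" means relative to the single-bit cell; the `MbffCell` type and `per_bit` helper are illustrative names, not part of the original work.

```python
from typing import NamedTuple


class MbffCell(NamedTuple):
    bits: int
    power: int
    area: int


# Library cells from the table, deliberately given out of order.
library = [MbffCell(4, 312, 285), MbffCell(1, 100, 100), MbffCell(2, 172, 192)]

# Tuples compare field by field, so a plain sort is lexicographic
# on (bit number, power, area).
library.sort()

# Assumption: the single-bit cell is the normalization reference.
base = library[0]


def per_bit(cell: MbffCell) -> tuple[float, float]:
    """Return (normalized power per bit, normalized area per bit)."""
    power = cell.power / (cell.bits * base.power)
    area = cell.area / (cell.bits * base.area)
    return round(power, 2), round(area, 2)


for cell in library:
    print(cell.bits, *per_bit(cell))
```

Larger cells are cheaper per bit in both power and area, which is what makes clustering worthwhile.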

## **Placement**


- Chip area = $W_c H_c$ bins = $WH$ grids
- Flip-flops should be placed on grid (left-bottom corner)
- Placement density constraint for bin $b$, expressed in terms of:
  - $A_{fb}$: FF area
  - $A_{cb}$: combinational logic area
  - $A_{pb}$: macro area
  - $A_g$: grid area
  - $T_b$: target density



## Timing Slack and Feasible Region

#### **Input slack**

#### Slack ⇔ wirelength



Feasible region

## Coordinate Transformation (1/3)

- It is hard to determine whether a grid point lies inside or outside the feasible region
- Rotate 45° clockwise; we have rectangles instead
  - Easy checking!

(Figure: feasible region of flip-flop $i$, with slacks $S_{fi}(i)$ and $S_{fo}(i)$ to its fanin and fanout gates. After rotation the region is a rectangle bounded by $x' = s_{x'}(i)$ and $x' = e_{x'}(i)$, with the corresponding bounds in $y'$ such as $y' = e_{y'}(i)$.)

INTEGRA - ISPD'11

## Coordinate Transformation (2/3)

Coordinate transformation is done by integer operations
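A minimal sketch of how such an integer-only transformation could look. It assumes the mapping $x' = x + y$, $y' = y - x$, which is a 45° clockwise rotation scaled by $\sqrt{2}$ so that grid points stay on integers; the slides do not spell out the exact formula. Under this mapping, the diamond of Manhattan radius $r$ around a pin becomes an axis-aligned square, and the feasible region becomes an intersection of squares. All function names are illustrative.

```python
def to_rotated(x: int, y: int) -> tuple[int, int]:
    """Assumed transform: rotate 45 degrees clockwise, scaled by sqrt(2)."""
    return x + y, y - x


def diamond_to_rect(px: int, py: int, radius: int) -> tuple[int, int, int, int]:
    """Map {(x, y): |x-px| + |y-py| <= radius} to (sx, ex, sy, ey) in x'/y'."""
    cx, cy = to_rotated(px, py)
    return cx - radius, cx + radius, cy - radius, cy + radius


def intersect(rects):
    """Intersect rotated rectangles; return None if the region is empty."""
    sx = max(r[0] for r in rects)
    ex = min(r[1] for r in rects)
    sy = max(r[2] for r in rects)
    ey = min(r[3] for r in rects)
    return (sx, ex, sy, ey) if sx <= ex and sy <= ey else None


def inside(rect, x: int, y: int) -> bool:
    """Check a grid point against a rotated rectangle with four comparisons."""
    xr, yr = to_rotated(x, y)
    sx, ex, sy, ey = rect
    return sx <= xr <= ex and sy <= yr <= ey


# Hypothetical flip-flop: fanin pin at (0, 0) and fanout pin at (6, 0),
# each allowing a Manhattan wirelength of 4.
region = intersect([diamond_to_rect(0, 0, 4), diamond_to_rect(6, 0, 4)])
print(region, inside(region, 3, 1), inside(region, 0, 0))
```

The check works because $|a| + |b| = \max(|a + b|, |a - b|)$, so a Manhattan-distance bound turns into two independent interval bounds after the transform.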



## Coordinate Transformation (3/3)




## Overview of INTEGRA



- 1. Analyzes the design intent
- 2. Finds a decision point in X' and extracts the essential flip-flops and their related flip-flops
- 3. Finds the maximal clique in the partial Y' for each essential flip-flop
- 4. Clusters each essential flip-flop
- 5. Places the clustered flip-flop at a legal location with routing cost and density consideration
- 6. Repeats steps 2–5 until all flip-flops are investigated

## Example (1/5)

#### Initial

#### **Transformed**



## Example (2/5): Representation


## Decision Points and Essential Flip-Flops

- **Definition:** If there exist two consecutive points $x_k'$ and $x_{k+1}'$ in X', where $x_k' = s_{x'}(i)$, $x_{k+1}' = e_{x'}(j)$, $1 \le i, j \le n$, then the coordinate of $x_{k+1}'$, i.e., $e_{x'}(j)$, is a decision point.
- **Definition:** The essential flip-flops with respect to a decision point are the flip-flops whose end points are ordered in X' from this decision point to the next decision point, or to the end of X' for the last decision point.
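The two definitions can be illustrated with a short scan over the sorted sequence. This is a sketch under stated assumptions: X' is stored as a list of `(coordinate, kind, flip-flop)` tuples, a start sorts before an end at the same coordinate, and the end point that opens the next decision point belongs to that next decision point. The intervals are made up for illustration.

```python
START, END = 0, 1


def build_sequence(intervals):
    """intervals: {ff: (start, end)} -> sorted list of (coord, kind, ff)."""
    events = []
    for ff, (s, e) in intervals.items():
        events.append((s, START, ff))
        events.append((e, END, ff))
    events.sort()  # assumption: starts precede ends at equal coordinates
    return events


def decision_points(events):
    """Return [(decision coordinate, essential flip-flops)] in scan order."""
    # A decision point is an end point immediately preceded by a start point.
    idx = [k + 1 for k in range(len(events) - 1)
           if events[k][1] == START and events[k + 1][1] == END]
    result = []
    for n, k in enumerate(idx):
        stop = idx[n + 1] if n + 1 < len(idx) else len(events)
        essential = [ff for _, kind, ff in events[k:stop] if kind == END]
        result.append((events[k][0], essential))
    return result


# Hypothetical x' intervals of five flip-flops.
intervals = {"a": (0, 5), "b": (1, 3), "c": (2, 6), "d": (4, 8), "e": (7, 9)}
events = build_sequence(intervals)
points = decision_points(events)
print(points)
```

In this example five flip-flops yield three decision points, and every flip-flop is essential at exactly one of them.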






- Theorem: Consider X', a decision point, and the corresponding essential flip-flops. The maximal clique containing the essential flip-flops in the x' interval graph can be found at this decision point.
- Corollary: A decision point corresponds to at least one essential flip-flop. Hence, the number of decision points is less than or equal to the number of flip-flops.





## Example (3/5): Flip-Flop Clustering

#### X': Find candidates






## Example (4/5): Flip-Flop Clustering

#### Initial



#### **MBFFs & their feasible regions**



## Runtime Decision Points Are Few!


- Corollary: A decision point corresponds to at least one essential flip-flop. Hence, the number of decision points is less than or equal to the number of flip-flops.
- Runtime decision points ≤ initial decision points
  - Runtime decision points are shifted because of removed flip-flops.




## **Legal Grid Points**

- Place MBFFs at legal grid points.
- A legal grid point satisfies the following conditions:
  - It is a grid point.
  - It is not occupied by other gates or flip-flops.
  - It is density-safe.

# Flip-Flop Placement

- Goal: Find a legal placement with wirelength consideration
  - Optimal location: Within the bounding box of median coordinates of fanin and fanout gates
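The median rule can be sketched as follows. The sketch assumes the wirelength of a candidate location is the sum of Manhattan distances to the connected fanin and fanout pins, treated as points in the original (unrotated) coordinates; the pin list and helper names are hypothetical.

```python
def median_interval(values):
    """Return the (low, high) median range of a list of coordinates."""
    v = sorted(values)
    n = len(v)
    # For odd n both indices coincide; for even n they are the two middle values.
    return v[(n - 1) // 2], v[n // 2]


def optimal_box(pins):
    """Bounding box (x_lo, x_hi, y_lo, y_hi) of the median coordinates."""
    x_lo, x_hi = median_interval([x for x, _ in pins])
    y_lo, y_hi = median_interval([y for _, y in pins])
    return x_lo, x_hi, y_lo, y_hi


def wirelength(point, pins):
    """Sum of Manhattan distances from a candidate location to all pins."""
    px, py = point
    return sum(abs(px - x) + abs(py - y) for x, y in pins)


# Hypothetical fanin/fanout pin locations of one clustered flip-flop.
pins = [(0, 0), (4, 2), (10, 6), (2, 8)]
box = optimal_box(pins)
print(box, wirelength((3, 4), pins), wirelength((0, 0), pins))
```

Every point inside the box gives the same minimum wirelength under this model, so the placer is free to pick whichever one is legal and closest to the feasible region.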



## Example (5/5): Flip-Flop Placement

#### Initial



#### **Placed MBFFs**



## Procedure of INTEGRA

```
Algorithm INTEGRA
// Initialization
1.  lexicographically sort the MBFF library
2.  collapse MBFFs
3.  X' ← sort {s_x'(i), e_x'(i): i = 1..n}, j ← 1, Q ← ∅
// Main body
4.  while (X' is not empty) do
5.      find a decision point in X'
6.      Q ← Q + essential flip-flops and related flip-flops
7.      Y' ← sort {s_y'(i), e_y'(i): i ∈ Q}
8.      foreach essential flip-flop k do
            // Flip-flop clustering
9.          K_max ← max_clique(Y', k)
10.         find the appropriate MBFF cell of bit number B for |K_max|
11.         K_max ← sort {e_x'(i): i ∈ K_max − {k}}
12.         K_j ← flip-flop k and the first (B−1) flip-flops in K_max
            // Flip-flop placement
13.         find bounding box B_b for K_j
14.         project B_b's corner and center points to F_r(K_j)
15.         find the projected point with min distance between B_b and F_r(K_j)
16.         legalize this point and assign it to MBFF K_j
17.         if legalization fails then go to line 9
18.         Q ← Q − K_j, X' ← X' − K_j
19.         j++
```
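Lines 10 to 12 of the procedure can be sketched in Python. The slides do not define what the "appropriate" cell is, so this sketch assumes it is the largest library bit number that does not exceed the clique size; that choice, along with every name below, is an assumption for illustration only.

```python
def pick_bit_number(library_bits, clique_size):
    """Assumed rule: largest available bit number B with B <= clique size.

    Requires a 1-bit cell in the library so that a choice always exists.
    """
    return max(b for b in library_bits if b <= clique_size)


def form_cluster(k, k_max, end_x, library_bits):
    """Return flip-flop k plus the first B-1 clique members by x' end point."""
    b = pick_bit_number(library_bits, len(k_max))
    others = sorted((ff for ff in k_max if ff != k), key=lambda ff: end_x[ff])
    return [k] + others[:b - 1]


# Hypothetical clique around essential flip-flop "b" and its x' end points.
library_bits = [1, 2, 4]
end_x = {"a": 5, "b": 3, "c": 6, "d": 8}
print(form_cluster("b", {"a", "b", "c"}, end_x, library_bits))
print(form_cluster("b", {"a", "b", "c", "d"}, end_x, library_bits))
```

With a clique of three flip-flops and no 3-bit cell, the sketch falls back to a 2-bit cell and leaves the remaining flip-flop for a later decision point.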


## Comparison: Post-Placement MBFF Clustering

| Circuit | #FFs    | Chip size (#grids) | Initial power | Initial wirelength |
|---------|---------|--------------------|---------------|--------------------|
| C1      | 120     | 600×600            | 11,384        | 89,425             |
| C2      | 480     | 1,200×1,200        | 46,404        | 348,920            |
| C3      | 1,920   | 2,400×2,400        | 185,616       | 1,395,680          |
| C4      | 5,880   | 4,200×4,200        | 566,972       | 4,290,655          |
| C5      | 12,000  | 6,000×6,000        | 1,160,100     | 8,723,000          |
| C6      | 192,000 | 24,000×24,000      | 18,561,600    | 139,568,000        |



FF library cells (Bit-number, power, area): (1,100,100), (2,172,192), (4,312,285)

| Circuit    | Lower bound: power ratio | Lower bound: WL ratio | Modified Yan & Chen: power ratio | Modified Yan & Chen: WL ratio | Modified Yan & Chen: time (s) | Chang *et al.*: power ratio | Chang *et al.*: WL ratio | Chang *et al.*: time (s) | INTEGRA: power ratio | INTEGRA: WL ratio | INTEGRA: #Dec | INTEGRA: time (s) |
|------------|--------|-------|--------|--------|---------|--------|-------|--------|--------|--------|-------|--------|
| C1         | 82.2%  | 48.7% | 82.8%  | 123.0% | 0.03    | 85.2%  | 91.7% | < 0.01 | 82.8%  | 96.4%  | 28    | < 0.01 |
| C2         | 80.7%  | 49.9% | 81.2%  | 124.8% | 0.11    | 83.1%  | 94.7% | 0.02   | 80.9%  | 102.0% | 90    | < 0.01 |
| C3         | 80.7%  | 49.9% | 81.3%  | 125.2% | 0.53    | 82.9%  | 94.8% | 0.07   | 80.8%  | 103.6% | 229   | < 0.01 |
| C4         | 80.9%  | 49.7% | 81.5%  | 124.7% | 2.55    | 83.2%  | 94.5% | 0.23   | 81.0%  | 104.1% | 458   | 0.02   |
| C5         | 80.7%  | 49.9% | 81.3%  | 124.2% | 8.01    | 82.9%  | 94.9% | 0.52   | 80.7%  | 104.8% | 690   | 0.05   |
| C6         | 80.7%  | 49.9% | 81.3%  | 124.4% | 1994.61 | 82.8%  | 94.9% | 76.94  | 80.7%  | 105.3% | 3,007 | 1.11   |
| Avg. ratio | +0.00% |       | +0.60% |        | 358.61  | +2.36% |       | 16.87  | +0.17% |        | 12%   | 1.00   |

Chang *et al*. Post-placement power optimization with multi-bit flip-flops. *ICCAD*, 2010.

Yan and Chen. Construction of constrained multi-bit flip-flops for clock power reduction. *ICGCS*, 2010.

## Comparison: MBFF Clustering at Logic Synthesis



Chen *et al.* Using multi-bit flip-flop for clock power saving by DesignCompiler. *SNUG*, 2010.


| RISC32 CPU                                     | Chen et al. | Ours   |
|------------------------------------------------|-------------|--------|
| # Single-bit FFs                               | 3,689       | 75     |
| # Dual-bit FFs                                 | 2,155       | 3,962  |
| FF replacement rate                            | 53.88%      | 99.06% |
| # Clock tree leaves                            | 5,844       | 4,037  |
| Clock tree synthesis report                    |             |        |
| Normalized dynamic power for combinational ckt | 1.000       | 1.009  |
| Normalized dynamic power for clock buffers     | 1.000       | 0.789  |
| Normalized dynamic power for FFs               | 1.000       | 0.933  |
| # Clock subtrees                               | 157         | 150    |
| # Clock buffers                                | 165         | 110    |
| Depth of clock tree                            | 5           | 5      |

- 1. RISC32 CPU: gate count 120k, 7999 flip-flops.
- 2. 55nm process; power supply voltage is 0.9 V; the target clock skew is 300 ps.
- 3. MBFF library: 1-bit FF, 2-bit FF
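The table entries can be cross-checked against the 7,999 flip-flops of note 1. This is a short sketch, assuming the replacement rate is the share of flip-flop bits that end up in dual-bit cells (an assumption that reproduces the reported percentages) and reading the dual-bit count for our flow as 3,962. The `summarize` helper is illustrative.

```python
def summarize(single_bit: int, dual_bit: int):
    """Return (total flip-flop bits, clock tree leaves, replacement rate in %)."""
    total_bits = single_bit + 2 * dual_bit
    leaves = single_bit + dual_bit  # one clock sink per cell
    rate = 100 * (2 * dual_bit) / total_bits
    return total_bits, leaves, round(rate, 2)


print("Chen et al.:", summarize(3689, 2155))
print("Ours:       ", summarize(75, 3962))
```

Both flows account for the same 7,999 flip-flop bits, while the higher replacement rate is what lowers the clock tree leaf count.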

## Conclusion

- INTEGRA is a fast post-placement multi-bit flip-flop clustering algorithm for clock power saving.
  - Based on coordinate transformation and interval graphs, we adopt a pair of linear-size sequences as the representation.
  - The concept of decision points significantly reduces the number of clustering operations.
- Compared with prior work that applies MBFF clustering at the post-placement and early design stages, our results show the superior efficiency and effectiveness of our algorithm.


## Thank You!

Contact info: Iris Hui-Ru Jiang huiru.jiang@gmail.com



# Backup Slides

## Timing Issue

#### Timing slack setting:

- Timing budgeting avoids dynamic interference among multi-bit flip-flops.
- Update the feasible regions of timing-related flip-flops once an MBFF is formed
  - Scanning sequence X' from left to right

#### Timing safety

- STA approval.
- For the Synopsys Liberty library, the delay of a gate, lumped with its output wire delay, is dominated by its output loading.

$$C(i) = C_W(i) + C_O(i) + \sum_{g_j \in FO(g_i)} C_I(j).$$

Since the placement of combinational elements is unchanged during post-placement MBFF clustering, the timing slack between a flip-flop and its fanin/fanout gate depends on only the wire loading, i.e., the Manhattan distance between them.

## Placement Issue


- Placement density constraint
  - MBFFs consume less area
  - Density constraint becomes looser and looser during MBFF clustering
- Legalization?
  - Easy and doable

- Find maximal cliques in some region in Y'
  - Find decision points
  - Compare their cardinalities
- Scan Y' from the starting point of the essential flip-flop found in X' to its end point.
- Count the size
  - s: +1
  - e: -1
  - Largest partial sum
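The counting scan above can be sketched as a sweep over the sorted sequence. The sketch assumes Y' is a list of `(coordinate, kind, flip-flop)` tuples with starts sorted before ends at equal coordinates, and it tracks the set of open intervals (not only the count) so that the clique members can be returned. The intervals and names are hypothetical.

```python
START, END = 0, 1


def max_clique(events, k):
    """Largest set of mutually overlapping intervals that contains k."""
    active = set()   # intervals currently open (the running partial sum)
    best = set()
    inside_k = False
    for _, kind, ff in events:
        if kind == START:            # s: +1
            active.add(ff)
            if ff == k:
                inside_k = True
            # The partial sum can only peak right after a start point.
            if inside_k and len(active) > len(best):
                best = set(active)
        else:                        # e: -1
            if ff == k:
                break                # reached the end point of k
            active.discard(ff)
    return best


# Hypothetical y' intervals.
intervals = {"a": (0, 5), "b": (1, 3), "c": (2, 6), "d": (4, 8)}
events = sorted(
    [(s, START, ff) for ff, (s, _) in intervals.items()]
    + [(e, END, ff) for ff, (_, e) in intervals.items()]
)
print(sorted(max_clique(events, "b")), sorted(max_clique(events, "d")))
```

Because the intervals are already sorted, the scan is linear in the length of the sequence and never builds an explicit graph.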

