

# Reducing Switching Noise Effects by Advanced Clock Management

M. Krstic, X. Fan, M. Babic, E. Grass, T. Bjerregaard, A. Yakovlev



innovations for high performance microelectronics



July 2017 EMC Compo 2017 Saint Petersburg






| 1 | State of the art methods for reducing switching noise |
|---|-------------------------------------------------------|
| 2 | GALS paradigm and GALS for low-noise SoC Integration  |
| 3 | Evaluation of GALS Concept                            |
| 4 | GALS Automation                                       |
| 5 | Conclusions and future prospects                      |

# **Motivation**



- Simultaneous switching noise in digital and mixed signal design:
  - Digital design: Power supply noise leads to variations on gate and interconnect delays
  - Mixed-signal design: Substrate noise impacts the performance of RF frontend circuits.
- Existing solutions such as inserting <u>decoupling capacitors</u> on chip are very costly:
  - 20% of silicon area and 10-20% of power dissipation can be consumed
  - Process downscaling increases the leakage power

- **On-chip clock scheduling** presents a low-cost design solution for noise optimization:
  - It spreads the switching activity of a circuit over each clock cycle
  - Peak-to-peak and RMS noise voltages can be efficiently reduced as a result
- Mixed-signal designs need <u>frequency-specific noise optimization</u>:
  - The RF frontend circuits are sensitive to the in-band noise components
  - Little work has been done on frequency-domain noise optimization by clock scheduling

#### **Noise Reduction by Switching Desynchronization**



- Switching desynchronization could affect noise generated by simultaneous switching of digital components
- System analysis, based on the lumped RLC model of the power supply networks



Average switching current (α is switching ratio)

$$i_{avg\_gates}(\alpha) = \frac{I_{\sum gates}}{2(N_s + N_q)} = \frac{1}{2}\alpha\beta(V_{GS} - V_t)$$

# **Modelling Switching Current**



- The switching current spectrum in digital circuits can be separated into two components:
  - discrete peaks
  - continuous noise floor
- Discrete peaks correspond to the switching current waveform averaged over the clock periods
  - Most power is concentrated there
- The noise floor corresponds to the fluctuations of the waveform between clock periods
- Simplified triangular representation of the current pulse
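
The two spectral components can be illustrated with a minimal NumPy sketch. It assumes one triangular current pulse per clock period whose amplitude fluctuates randomly from cycle to cycle; the pulse width, the 10% fluctuation, and the sample counts are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Illustrative assumptions (not from the slides).
N_PERIODS = 64            # clock periods in the analysis window
SAMPLES_PER_PERIOD = 64   # time resolution within one period
REL_WIDTH = 0.25          # pulse width relative to the clock period

def triangular_pulse(samples, rel_width):
    """One clock period containing a triangular current pulse with peak 1.0."""
    phase = np.arange(samples) / samples
    half = rel_width / 2.0
    return np.clip(1.0 - np.abs(phase - half) / half, 0.0, None)

rng = np.random.default_rng(0)
pulse = triangular_pulse(SAMPLES_PER_PERIOD, REL_WIDTH)

# Cycle-to-cycle fluctuation of the switching activity (assumed 10% sigma).
amplitudes = 1.0 + 0.1 * rng.standard_normal(N_PERIODS)
current = np.concatenate([a * pulse for a in amplitudes])

spectrum = np.abs(np.fft.rfft(current)) / len(current)

# With N_PERIODS periods in the window, clock harmonic h sits at bin h * N_PERIODS.
fundamental = spectrum[N_PERIODS]
# Bins strictly between the 1st and 2nd harmonic hold only the noise floor.
noise_floor = spectrum[N_PERIODS + 1 : 2 * N_PERIODS].max()

print(f"fundamental / noise floor = {fundamental / noise_floor:.1f}")
```

The averaged waveform produces the line at the clock harmonic, while the random amplitude variation fills the bins in between.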



# Synchronous Methods for Switching Noise Reduction in Frequency Domain



- Clock Latency Scheduling (Synchronous Current Shaping)
  - Partitioning of the synchronous design into subdomains
  - Each subdomain with its own phase shift introduced by the clock
  - Overall current shape without strong peaks
  - Method compliant with the regular design flow (tool: FloorDirector)
- Spread Spectrum Clocking (SSC)
  - Adding jitter to the clock source
  - Method compliant with the regular design flow
  - Overhead in critical path



# **Clock Latency Scheduling**

- Partition a circuit into a number of sections
- Schedule clock latency within each section
- Shapes dynamic power signature



# **Clock Latency Scheduling**



- Desynchronization of the switching activity in individual blocks over time spreads the current pulse within each clock cycle
  - This leads to a lower pulse peak and to longer rise and fall times
  - Usually applied for optimization in the time domain
  - Also addresses the resonant frequency; with careful optimization it can also address a target frequency
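
The time-domain effect can be sketched as follows. This is a simplified illustration, assuming four equally sized subdomains with identical triangular pulses and evenly spaced clock latencies; none of these parameters come from the slides, and a real schedule would be produced by the CTS tool.

```python
import numpy as np

# Illustrative assumptions (not from the slides).
SAMPLES = 1000      # samples per clock period
REL_WIDTH = 0.2     # each pulse spans 20% of the period
N_DOMAINS = 4       # number of clock subdomains
SHIFT_STEP = 200    # extra clock latency per subdomain, in samples

def triangular_pulse(samples, rel_width):
    """One clock period containing a triangular current pulse with peak 1.0."""
    phase = np.arange(samples) / samples
    half = rel_width / 2.0
    return np.clip(1.0 - np.abs(phase - half) / half, 0.0, None)

pulse = triangular_pulse(SAMPLES, REL_WIDTH)

# Synchronous case: all subdomains switch at the same instant.
sync_current = sum(pulse / N_DOMAINS for _ in range(N_DOMAINS))

# Scheduled case: subdomain k is clocked k * SHIFT_STEP samples later.
scheduled_current = sum(
    np.roll(pulse, k * SHIFT_STEP) / N_DOMAINS for k in range(N_DOMAINS)
)

print("peak, synchronous:", sync_current.max())
print("peak, scheduled:  ", scheduled_current.max())
print("charge per cycle equal:", np.isclose(sync_current.sum(), scheduled_current.sum()))
```

The charge drawn per cycle is unchanged; only its distribution over the clock period differs, which is what lowers the peak.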



# **Clock Modulation (Spread Spectrum Clocking)**



- Attenuation of the spectral peaks can be obtained by modulating the clock frequency.
- The power at each clock harmonic is spread over a wide bandwidth, leading to lower spectral peaks
- It is particularly effective for higher harmonic frequencies
- Clock modulation can be implemented using programmable delay lines.
  - Disadvantage: it additionally lengthens the critical path in the system



# **Principle of GALS**





# **Standard Pausible Clocking GALS Approach**





Reference: J. Muttersbach, Globally-Asynchronous Locally-Synchronous Architectures for VLSI Systems, PhD Thesis, ETHZ 2001



GALS makes data transfer between the blocks very easy

- Design problems such as timing closure or clock-tree generation are limited to the level of much smaller local blocks
  - Modularity of IPs is simplified
- Power saving is automatically integrated in asynchronous wrapper
- Clock tree complexity is smaller, leading to power reduction
- Low noise and low power features of this technique improve mixed-signal integration, power integrity and <u>switching noise features</u>

### **An Application Field – NoC Interconnect**



- Networks on Chip (NoCs) are an alternative to bus interconnect topologies
- GALS is an effective technique for the implementation of NoC interconnect



**Figure:** T. Bjerregaard, S. Mahadevan, "A Survey of Research and Practices of Network-on-Chip", ACM Comput. Surv., vol. 38, Mar. 2006



A. Ghiribaldi, D. Bertozzi, S. Nowick, A Transition-signaling Bundled Data NOC Switch Architecture for Cost-effective GALS Multicore Systems, DATE 2013

# **GALS Approach targeting switching noise reduction**



### Main target mixed-signal circuits

- Example: IHP BiCMOS technology is optimal for high performance RF applications
- Main issues
  - Switching Noise
    - EMI
    - Ground bounce
    - Substrate noise
  - Additional features:
    - Power Consumption
    - System Integration
  - Target GALS technique
    - Pausible clocking
    - Adaptive change of frequency
    - Suitable for applications such as DSP processing

# Main Task - GALS Design for Switching Noise Attenuation

### Plesiochronous GALS design for attenuation of low-frequency components

- Why plesiochronous clocking?
  - It maintains the processing capability of each locally synchronous module
- Three degrees of freedom for attenuation analysis:
  - M: the number of GALS blocks
  - P<sub>m</sub>: the power breakdown of the system
  - λ<sub>m</sub>: the width of the current pulse in each block

# For <u>power balanced partitioning</u>, GALS design contributes to an attenuation of around *20logM*

- Constant at the lower clock harmonics
- Robust to the variations of current shape
- The equation was recently confirmed for higher harmonics as well
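
The 20logM figure can be checked numerically with a short sketch. It assumes an idealized case: M blocks that each draw exactly 1/M of the switching current, identical triangular pulse shapes, and clock frequencies offset by 1% steps. All parameter values are illustrative assumptions.

```python
import numpy as np

# Illustrative assumptions (not from the slides).
N_SAMPLES = 4096     # samples in the analysis window
BASE_CYCLES = 100    # clock periods of the synchronous design in the window
REL_WIDTH = 0.2      # relative pulse width, identical for all blocks
M = 5                # number of GALS blocks

def triangular_pulse_train(cycles, n_samples, rel_width):
    """Triangular current pulses, one per clock period, each with peak 1.0."""
    phase = (cycles * np.arange(n_samples) / n_samples) % 1.0
    half = rel_width / 2.0
    return np.clip(1.0 - np.abs(phase - half) / half, 0.0, None)

def peak_db(current, lo_bin, hi_bin):
    """Largest spectral line (in dB) between two FFT bins, inclusive."""
    spectrum = np.abs(np.fft.rfft(current)) / len(current)
    return 20.0 * np.log10(spectrum[lo_bin:hi_bin + 1].max())

# Synchronous design: the whole switching current is drawn at one frequency.
sync_current = triangular_pulse_train(BASE_CYCLES, N_SAMPLES, REL_WIDTH)

# Power-balanced plesiochronous GALS: M blocks, each drawing 1/M of the
# current at a slightly different clock frequency (100, 101, ... cycles).
gals_current = sum(
    triangular_pulse_train(BASE_CYCLES + m, N_SAMPLES, REL_WIDTH) / M
    for m in range(M)
)

# Compare the strongest line around the fundamental clock frequency.
attenuation_db = peak_db(sync_current, 90, 115) - peak_db(gals_current, 90, 115)
print(f"measured: {attenuation_db:.2f} dB, 20*log10(M): {20 * np.log10(M):.2f} dB")
```

Because each block contributes a separate spectral line of 1/M the amplitude, the strongest line drops by a factor of M, which is where 20logM comes from in this idealized setting.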





# Utilizing GALS for reduction of switching noise at higher harmonics



#### Proposed solution: Harmonic balanced plesiochronous GALS design

- Partitioning into locally synchronous modules (LSMs) based on balancing the in-band harmonics of the switching current
- The LSMs have different but close clock frequencies

- Enables frequency-selective substrate noise reduction, also on higher frequencies
- Recently theoretically confirmed achievable noise peak attenuation:  $20 \log(M)$  for M LSMs
- Strongly dependent on the current shape in GALS blocks
  - Optimized partitioning required



Milan Babic, Steffen Zeidler, Milos Krstic, "GALS Partitioning Methodology for Substrate Noise Reduction in Mixed-Signal Integrated Circuits," Proc. of 22nd IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pp. 67-74, Porto Alegre, Brazil 2016.



| Method | Maximum attenuation, low harmonics | Maximum attenuation, high harmonics |
|---|---|---|
| SCS | marginal, if any at all | 40log(1/λ<sub>0</sub>) |
| SSC | 10log(2(nβ+1))<br>low due to low nβ | 10log(2(nβ+1)), n < n<sub>overlap</sub> |
| Power-balanced<br>plesiochronous GALS | 20logM | could be addressed with harmonic-balanced partitioning |

- SCS: synchronous current shaping; SSC: spread spectrum clocking
- β is the modulation index and n is the harmonic number
- λ<sub>0</sub> is the relative width of the triangular pulse for the system before current shaping
- M is the number of GALS partitions
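
To get a feeling for the magnitudes, the three expressions from the table can be evaluated directly. The sketch below assumes that "log" means the base-10 logarithm (results in dB); the function names and the example parameter values are hypothetical.

```python
import math

def scs_max_attenuation_db(lambda0):
    """Synchronous current shaping, high harmonics: 40*log10(1/lambda0)."""
    return 40.0 * math.log10(1.0 / lambda0)

def ssc_max_attenuation_db(n, beta):
    """Spread spectrum clocking at harmonic n: 10*log10(2*(n*beta + 1))."""
    return 10.0 * math.log10(2.0 * (n * beta + 1.0))

def gals_max_attenuation_db(m):
    """Power-balanced plesiochronous GALS with m partitions: 20*log10(m)."""
    return 20.0 * math.log10(m)

# Example parameter values are assumptions chosen for illustration only.
print(f"SCS,  lambda0 = 0.1:   {scs_max_attenuation_db(0.1):.1f} dB")
print(f"SSC,  n = 1, beta = 4: {ssc_max_attenuation_db(1, 4):.1f} dB")
print(f"GALS, M = 5:           {gals_max_attenuation_db(5):.1f} dB")
```

These are upper bounds from the table, so measured attenuation on silicon can be lower.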



- Moonrake Chip (2010)
  - Complex GALS DSP implementation in 40 nm CMOS
  - Target: noise reduction in scaled CMOS process, power reduction
- Lighthouse Chip (2012)
  - Mixed signal radar chip in 130 nm
  - Target: substrate noise reduction
- Screamer Chip (2014)
  - Trusted Sensor Node in 130 nm
  - Target: noise reduction with synchronous methods

# **Moonrake Chip - Concept**



- Implementing GALS and synchronous versions of a gigabit OFDM transmitter design for the 60 GHz band
  - Using pausible clocking point-to-point GALS interfaces
- Provision of flexible and cost-effective NoC interfaces



## **Complex GALS ASIC - Moonrake Architecture**





- The OFDM transmitter is a complex structure that includes, among other blocks, a 256-point FFT and 6 interleaver units.
  - Originally implemented in a Xilinx FPGA, with a complexity of 7.8 M gates.

# Partitioning Methodology



#### Area/power equalization between the GALS blocks

- Compact blocks can be optimized much more easily by the CAD tools
- Clock trees of significantly reduced complexity

#### Reduce the number of communication links between the blocks

Using improved pausible clocking controllers

- The most complex block is only <u>15%</u> larger than the average
- The most power-hungry block consumes only <u>11%</u> more than the average
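
One simple way to approach power equalization is a greedy heuristic that always places the largest remaining block into the currently lightest partition. The sketch below is a generic illustration and not the partitioning method used for Moonrake; the block names and power figures are hypothetical.

```python
def balance_partitions(block_power_mw, n_partitions):
    """Greedy heuristic: assign blocks, largest first, to the lightest partition."""
    loads = [0.0] * n_partitions
    members = [[] for _ in range(n_partitions)]
    ordered = sorted(block_power_mw.items(), key=lambda kv: kv[1], reverse=True)
    for name, power in ordered:
        # Index of the partition with the smallest accumulated power so far.
        target = min(range(n_partitions), key=loads.__getitem__)
        loads[target] += power
        members[target].append(name)
    return loads, members

# Hypothetical block powers in mW (not measured values).
example_blocks = {
    "fft": 30, "interleaver": 25, "mapper": 20, "scrambler": 15,
    "encoder": 10, "framer": 10, "pilot": 5, "guard": 5,
}

loads, members = balance_partitions(example_blocks, 3)
average = sum(loads) / len(loads)
worst_over_average = max(loads) / average - 1.0

for load, names in zip(loads, members):
    print(f"{load:5.1f} mW: {names}")
print(f"largest partition exceeds the average by {worst_over_average:.0%}")
```

A real flow would also have to weigh the number of communication links that a given split cuts, which this sketch ignores.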



# **GALS Technology - Improved GALS Interfaces**

- Local clock generator
  - Request acknowledge latency
- Two cascaded programmable delay lines with a feedback loop for latency minimization.
- Input data latching
  - Synchronization failure due to clock tree insertion delays
- Mutually exclusive double latching mechanism for safe region maximization.

Xin Fan, Milos Krstic, and Eckhard Grass, "Analysis and optimization of pausible clocking based GALS design," in *IEEE Intl. Conf. on Computer Design (ICCD)*, Lake Tahoe, USA, 2009, **Best Paper Award**.





# **Design Flow of Moonrake SYNC/GALS ASIC**

Design flow of pausible clocking based GALS design

- ASIP: Integrated simulation tool for Sync-Async mixed design at behaviour level
- Hard macros: Asynchronous port controllers are implemented as specific hard macros
- STA, constraining and optimization:
  - Utilization of commercial tools







### Moonrake Chip - GALS and synchronous OFDM gigabit transmitter for 60 GHz band

- 16M equivalent gates, 30% core logic;
- 218 memory: 8 FIFOs (64Kb), 86 SROMs (192Kb), 134 SRAMs (400Kb);



TSMC 40-nm CMOS process; 4000 µm × 2250 µm = 9 mm<sup>2</sup>; LBGA-345 package;







#### simplified clock trees with better timing balance

### **Results - Area**





#### smaller area due to the simplified clock network

# **Measurements of Moonrake Chip**

### GALS has shown:

- much better EMI profile
  - 26dB att. on chip (A)
  - 19dB att. on board (B)
- improved power consumption and
- reduced area!

|            | Area (mm<sup>2</sup>) | Power dissipation (mW) | Core VDD spectral amplitude, 1<sup>st</sup> peak (dBm) | 2<sup>nd</sup> peak (dBm) | 3<sup>rd</sup> peak (dBm) |
|------------|-----------------------|------------------------|--------------------------------------------------------|---------------------------|---------------------------|
| SYNC TX    | 2.33 (43.2%)          | 258                    | -15                                                    | -32                       | -23                       |
| GALS TX    | 2.22 (41.0%)          | 237                    | -41                                                    | -48                       | -53                       |
| Difference | +4.7%                 | +8.2%                  | 26 dB                                                  | 16 dB                     | 30 dB                     |





Amplitude of on-chip core VDD from GALS TX



# Mixed Signal GALS ASIC Lighthouse – 120 GHz Radar Transceiver Chip





- Highlights
  - AFE and digital baseband processor integrated on a single die
  - 130 nm IHP BiCMOS, chip area = 17.1 mm<sup>2</sup>, number of pads = 149
  - FMCW and CW radar modes supported
  - FMCW: synchronous and GALS versions
  - GALS part with 5 blocks
  - Power: AFE 450 mW, BB 80 mW



# **GALS FMCW Coprocessor**

# Why the FMCW Coprocessor?

- The most power-hungry component is the FMCW coprocessor (i.e. 4096-point FFT)
- About 80% of the complete BB processor

# GALS partitioning of FMCW Coprocessor

- To <u>balance the power consumption</u> of each GALS clock domain





# **Lighthouse – Measurements**



### 12.3 dB (RBW = 1 kHz) peak reduction in GALS case!



(a) GALS working mode

(b) Synchronous working mode

#### VCO spectrum / FMCW GALS in parallel





First measurement of the benefits of GALS methods for mixed-signal circuits!


# **Noise Optimization Based on Clock Scheduling**

### **Clock latency manipulation**

- Partition a circuit into a number of sections
- Manipulate clock latency within each section
- Targets spectral noise reduction in a user-defined application-specific frequency band
- Takes both the on-chip supply current model and the PDN model into account for the power noise estimation

### Two-phase clocking

- Reduce the switching noise at the fundamental clock frequency, which often dominates in noise power
- Partition a design into two clock domains: one triggered by the rising clock edge, the other by the falling clock edge
- Bundled-data communication for clock domain crossing
- Clock latency scheduling can be further applied within each clock domain
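
Why two-phase clocking targets the fundamental can be seen in a small idealized sketch. It assumes the two clock domains draw exactly equal, identically shaped current pulses half a period apart; the pulse shape and sample count are illustrative assumptions, and on silicon the balance is never perfect, so the fundamental is reduced rather than removed.

```python
import numpy as np

# Illustrative assumptions (not from the slides).
SAMPLES = 1000     # samples per clock period
REL_WIDTH = 0.2    # relative width of the triangular current pulse

def triangular_pulse(samples, rel_width):
    """One clock period containing a triangular current pulse with peak 1.0."""
    phase = np.arange(samples) / samples
    half = rel_width / 2.0
    return np.clip(1.0 - np.abs(phase - half) / half, 0.0, None)

def harmonic_amplitudes(current, count):
    """Magnitudes of clock harmonics 1..count for one period of current."""
    return np.abs(np.fft.rfft(current))[1:count + 1] / len(current)

pulse = triangular_pulse(SAMPLES, REL_WIDTH)

# Single-phase: all flip-flops switch on the rising edge.
single_phase = pulse
# Two-phase: half of the current is drawn half a period later (falling edge).
two_phase = 0.5 * pulse + 0.5 * np.roll(pulse, SAMPLES // 2)

single = harmonic_amplitudes(single_phase, 4)
double = harmonic_amplitudes(two_phase, 4)
for n, (a, b) in enumerate(zip(single, double), start=1):
    print(f"harmonic {n}: single-phase {a:.4f}, two-phase {b:.4f}")
```

In this ideal case the odd harmonics cancel while the even harmonics are unchanged, which matches the intent of reducing noise at the fundamental clock frequency.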






#### **Design automation by FloorDirector**





# SCREAMER Chip – TSN DBB Macro

# Trusted Sensor Node Digital Baseband Processor (TSN DBB):

- 32-bit LEON2 μProcessor + 3 crypto-cores: ECC, AES, SHA1
- Interconnection via the AMBA AHB/APB bus system
- Process: IHP 130nm 7-metal CMOS process
- Frequency: 50 MHz, Power: < 50 mW
- Two noise probes on chip:
  - Substrate noise by a p-tap probe pad
  - Conductive EMI noise on a core-VDD pad



Substrate noise probe pad



#### **Floorplan view**





# **SCREAMER Chip – Parallel Integration of TSN DBB Macros**

# Chip implementation

- Macro0 (BASELINE)
  - Baseline design with standard CTS by EDI
- Macro1 (Conservative FD-OPT)
  - CTS with FloorDirector for clock scheduling
  - Only allows for zero hold-time slack
- Macro2 (Aggressive FD-OPT)
  - CTS with FloorDirector for clock scheduling
  - Allows for negative hold-time slacks of 1 ns
- Macro3 (Conservative FD-OPT + TPC)
  - CTS with *FloorDirector* for clock scheduling based on two-phase clocking

### Noise Optimization at:

- GSM-850 band (790-910MHz)
- 50-MHz clock frequency



# **Chip Specification**

- Size: 4.582 x 4.582 mm<sup>2</sup>;
- TQFP-176 package;
- Tape-out: Q3 2014.

# **Standard ASIC Design Flow Integration**



- Model on-chip supply current and PDN dynamic voltage response
- Target spectral noise reduction in user specified frequency band
- Uses standard formats and works in all major vendor P&R flows







# **Functional Verification and Noise Measurement**

### Test Equipment

- Advantest V93000 VLSI tester for functional test on chip
- Adapter board with SMA connectors for noise probing

### Measurement

- Functional validation:
  - All macros are fully functional on chip
- Noise measurement:
  - A trigger signal starts the measurement at exactly the same processing step on each macro
  - Switching noise measured
    - in frequency domain by spectrum analyzer
    - in time domain by oscilloscope









# **Screamer Chip - Measurement Results**

### Substrate noise reduction:

- Macro1 to Macro3 obtain dramatic noise reduction specifically in the 850MHz-band
- Macro2 with aggressive FD-OPT gives an 11.1 dB noise drop at 850 MHz
- Macro3 significantly lowers the noise peak at 50MHz by 9.6dB

|         | Macro0 | Macro1 | Macro2 | Macro3 |
|---------|--------|--------|--------|--------|
| 50 MHz  | -32.8  | -33.5  | -33.3  | -42.4  |
| 800 MHz | -63.1  | -67.1  | -65.2  | -64.5  |
| 850 MHz | -51.3  | -57.2  | -62.4  | -55.0  |
| 900 MHz | -58.1  | -65.3  | -66.1  | -66.7  |
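
The reductions quoted in the bullets follow directly from the table by subtracting each macro's level from the Macro0 baseline. A short sketch of that arithmetic, with the table values copied as-is (the helper name is hypothetical):

```python
# Substrate noise levels copied from the table above.
substrate_noise = {
    "50 MHz":  {"Macro0": -32.8, "Macro1": -33.5, "Macro2": -33.3, "Macro3": -42.4},
    "800 MHz": {"Macro0": -63.1, "Macro1": -67.1, "Macro2": -65.2, "Macro3": -64.5},
    "850 MHz": {"Macro0": -51.3, "Macro1": -57.2, "Macro2": -62.4, "Macro3": -55.0},
    "900 MHz": {"Macro0": -58.1, "Macro1": -65.3, "Macro2": -66.1, "Macro3": -66.7},
}

def reduction_vs_baseline(levels, baseline="Macro0"):
    """Noise reduction in dB of every macro relative to the baseline macro."""
    return {
        band: {
            macro: round(row[baseline] - value, 1)
            for macro, value in row.items()
            if macro != baseline
        }
        for band, row in levels.items()
    }

reductions = reduction_vs_baseline(substrate_noise)
for band, row in reductions.items():
    print(band, row)
```

A positive value means the macro is quieter than the baseline in that band.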





### Substrate noise reduction in time domain

- All Macros incorporating noise optimization techniques dramatically reduce noise
- Macro3 also introduces two distinct pulses of substrate noise with each clock cycle
- This reduces the P2P noise by 34% and the RMS noise by 8% on Macro3



|                       | Macro0 | Macro1 | Macro2 | Macro3 |
|-----------------------|--------|--------|--------|--------|
| Peak-to-<br>peak (mV) | 31.3   | 27.5   | 25.0   | 20.6   |
| RMS (mV)              | 9.39   | 9.37   | 8.9    | 8.6    |

# **Screamer Chip - Measurement Results**

### Conductive EMI noise reduction:

- Conductive EMI noise was measured via an impedance matching network on board
- Macro1 to Macro3 all obtain dramatic noise reduction specifically in the 850MHz-band
- Macro2 with aggressive FD-OPT gives a 12.9dB EMI noise drop at 850MHz
- Macro3 significantly lowers the EMI noise peak at 50MHz by 6.1dB

|         | Macro0 | Macro1 | Macro2 | Macro3 |
|---------|--------|--------|--------|--------|
| 50 MHz  | -59.8  | -56.8  | -56.3  | -65.9  |
| 800 MHz | -79.1  | -81.7  | -86.7  | -86.1  |
| 850 MHz | -62.3  | -69.1  | -75.2  | -70.5  |
| 900 MHz | -68.5  | -78.1  | -84.8  | -84.9  |







# **Conclusions and future prospects**

- An overview of different methods for switching noise reduction has been provided.
  - Methodologies based on clock activity management can efficiently reduce the generation of switching noise in both the time domain and the frequency domain.
- Clock scheduling techniques have been successfully demonstrated in the SCREAMER chip
- The GALS technique has been introduced
  - GALS is an effective means for robust SoC integration in mixed-signal ASICs
  - More than 20 dB EMI reduction has been demonstrated in a state-of-the-art 40 nm demonstrator
  - Moreover, its potential has been shown with respect to power, area, and clock tree complexity reduction
- However, many issues remain open
  - Substrate noise reduction
    - Ongoing DFG Project GASEBO
  - Explore the EDA aspects for EMI, as well as power supply modulation
    - In many mixed-signal applications, future designs will involve both data processing and power management electronics

# **GALAXY Project**







# **Project ID card**



ALMA MATER STUDIORUM UNIVERSITÀ DI BOLOGNA

Funded under: FP7

MANCHESTER 1824

EU contribution: 2.9 million euro

Execution: from 2007 to 2010

Project status: Completed











# **SUCCESS Project**







Silicon-based Ultra Compact Cost-Efficient System Design for mmWave-Sensors



# **IC-NAO Project**





# **Teklatech**<sup>®</sup>















# Thank you for your attention!

Milos Krstic

IHP – Leibniz-Institut für innovative Mikroelektronik Im Technologiepark 25 15236 Frankfurt (Oder) Tel.: +49 (0) 335 5625 729 Fax: +49 (0) 335 5625 671 E-Mail: krstic@ihp-microelectronics.com

www.ihp-microelectronics.com






# **Ground bounce in Frequency Domain**



- Ground bounce is the product of the switching current spectrum and the power network transfer function
  - Significant effect of the package impedance and the decoupling capacitance
  - Numerical analysis of the ground bounce spectrum in MATLAB
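
The product of current spectrum and network response can be sketched in a few lines. The original analysis was done in MATLAB; the Python sketch below is an independent illustration that assumes a lumped model with a series R-L supply path in parallel with an on-chip decoupling capacitance, and a triangular current pulse train. All component values are assumptions, not values from the slides.

```python
import numpy as np

# Illustrative lumped values (assumptions, not taken from the slides).
R_PKG = 0.1        # ohm, series resistance of the supply path
L_PKG = 1e-9       # henry, package and bond-wire inductance
C_DECAP = 1e-9     # farad, on-chip decoupling capacitance
F_CLK = 100e6      # hertz, clock frequency of the synchronous system
REL_WIDTH = 0.2    # relative width of the triangular current pulse
I_PEAK = 1.0       # ampere, peak of the current pulse
N_HARMONICS = 8    # number of clock harmonics to evaluate

def pdn_impedance(freq_hz):
    """Impedance seen by the on-chip current source: (R + jwL) parallel to C."""
    w = 2.0 * np.pi * np.asarray(freq_hz, dtype=float)
    series = R_PKG + 1j * w * L_PKG
    return series / (1.0 + 1j * w * C_DECAP * series)

def current_harmonics(count):
    """Amplitudes of the first `count` harmonics of a triangular pulse train."""
    n = np.arange(1, count + 1)
    # Fourier series of a triangle of relative width REL_WIDTH (np.sinc is
    # the normalized sinc, sin(pi x) / (pi x)).
    return 2.0 * I_PEAK * (REL_WIDTH / 2.0) * np.sinc(n * REL_WIDTH / 2.0) ** 2

harmonic_freqs = F_CLK * np.arange(1, N_HARMONICS + 1)
bounce = np.abs(pdn_impedance(harmonic_freqs)) * current_harmonics(N_HARMONICS)

f_res = 1.0 / (2.0 * np.pi * np.sqrt(L_PKG * C_DECAP))
print(f"PDN resonance near {f_res / 1e6:.0f} MHz")
for f, v in zip(harmonic_freqs, bounce):
    print(f"{f / 1e6:6.0f} MHz: {20 * np.log10(v):7.1f} dBV")
```

With these assumed values the resonance lies between the first and second clock harmonic, so the second harmonic produces more ground bounce than the fundamental even though it carries less current.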

#### Synchronous system, 100 MHz clock frequency



GALS system, 5 LSMs with plesiochronous clocking around 100 MHz frequency





### Comparison in power consumption:

- Compared to Macro0, Macro1 to Macro3 all incur a power penalty
- On Macro1, more clock buffers are inserted on chip for the clock latency manipulation
- On Macro2 and Macro3, negative hold-time slack in CTS leads to buffers being inserted on data paths for timing closure
- On Macro3, a child tree dedicated to negative-edge-triggered flip-flops is synthesized, thus complicating the clock tree

|               | Macro0 | Macro1 | Macro2 | Macro3 |
|---------------|--------|--------|--------|--------|
| Power<br>(mW) | 34.81  | 37.08  | 38.72  | 44.20  |
| Overhead      |        | 6.52%  | 11.23% | 26.98% |

|              | Macro0      | Macro1      | Macro2      | Macro3      |
|--------------|-------------|-------------|-------------|-------------|
| # Levels     | 5           | 43          | 54          | 43          |
| # Buffers    | 240         | 255         | 210         | 271         |
| Latency (ns) | 1.26 - 1.42 | 3.96 - 4.44 | 4.48 - 5.08 | 3.14 - 3.55 |
| Skew (ps)    | 156.4       | 471.6       | 595.1       | 414.9       |

# **Methods for Automated partitioning - EMIAS Tool**





### Frequency-optimized automated Partitioning (DFG Project GASEBO)

- Analysis of the impact of different GALS partitioning and floorplanning of the blocks
- Automation of GALS partitioning for targeted spectral domain suppression
  - Noise driven partitioning
  - Aiming at a 20logM reduction
- Methods combined with simplified modelling of the propagation of substrate noise
  - Regular positioning, substrate direct coupling





