# 18-447 Lecture 12: Energy and Power

James C. Hoe Department of ECE Carnegie Mellon University

18-447-S24-L12-S1, James C. Hoe, CMU/ECE/CALCM, ©2024

# Housekeeping

- Your goal today
  - a working understanding of energy and power
  - appreciate their significance in comp arch today
- Notices
  - Lab 2, due this week
  - Lab 3 posted but starts after break
  - HW 3, due **Wed** 3/13 (Handout #8)
  - Midterm 1, Wed 3/13, covers up to Lec 12
- Readings
  - Design challenges of technology scaling, Borkar, 1999.
  - Synthesis Lectures (advanced optional): Power-Efficient Comp Arch: Recent Advances, 2014

## **First some intuitions**

## **Energy and Power**

- CMOS logic transitions involve charging and discharging of parasitic capacitances
- Energy (Joule) dissipated as resistive heat when "charges" flow from VDD to GND
  - take a certain <u>amount</u> of energy per operation
     (e.g., addition, reg read/write, (dis)charge a node)
  - to the first order, energy $\propto$ amount of compute
- Power (Watt=Joule/s) is <u>rate</u> of energy dissipation
  - more ops/sec means more Joules/sec
  - to the first order, power $\propto$ performance

Power concerns usually more about heat removal

## **Heat and Thermal Resistance**

- Resistive heat in the circuit must be removed in steady state (www.youtube.com/watch?v=BSGcnRanYMM)
- Can summarize everything between circuit and ambient by characteristic R<sub>thermal</sub>=K/W
  - convey power W in heat across temperature difference K=T<sub>circuit</sub>–T<sub>ambient</sub>
- To dissipate more power in circuit
  - 1. let T<sub>circuit</sub> get hotter (to a point)
  - 2. turn-up AC  $\Rightarrow$  lower T<sub>ambient</sub>
  - 3. better cooling  $\Rightarrow$  lower R
- Economics/market driven choices
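
A quick numeric sketch of the three options above, using the steady-state relation P = (T<sub>circuit</sub> - T<sub>ambient</sub>) / R<sub>thermal</sub>. All temperatures and thermal resistances below are hypothetical values picked only for illustration; the function name is likewise made up for this sketch.

```python
def max_power_watts(t_circuit_c: float, t_ambient_c: float, r_thermal: float) -> float:
    """Power (W) that can be removed in steady state across a thermal
    path of resistance r_thermal (K/W)."""
    return (t_circuit_c - t_ambient_c) / r_thermal


# Hypothetical baseline: 85 C circuit limit, 25 C ambient, 0.5 K/W cooler.
baseline = max_power_watts(85.0, 25.0, 0.5)

# Option 1: let the circuit run hotter (to a point).
hotter_circuit = max_power_watts(100.0, 25.0, 0.5)

# Option 2: lower the ambient temperature.
cooler_ambient = max_power_watts(85.0, 15.0, 0.5)

# Option 3: better cooling, i.e., lower thermal resistance.
better_cooler = max_power_watts(85.0, 25.0, 0.25)

for name, watts in [
    ("baseline", baseline),
    ("hotter circuit", hotter_circuit),
    ("cooler ambient", cooler_ambient),
    ("better cooler", better_cooler),
]:
    print(f"{name:>15}: {watts:.1f} W")
```

Each option raises the power budget; which one is chosen is the economics/market question.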



## Work and Perf. from Joules and Watt

- Fastest without energy/power awareness won't be fastest once constrained
  - power bounds performance directly
  - energy bounds work directly; the need for lower J/op bounds performance indirectly (recall that power $\propto$ perf<sup>($\alpha > 1$)</sup>)
- Consider in context
  - mobile device: limited energy source, hard-to-cool form factor
  - desktop: cooler size, noise, complexity, cost
  - data-center: electric bill, cooling capacity and cost

Ultimately driven by desirability and economics

#### **Cooler transistors also faster transistors**



[image from Wikipedia, "Overclocking"]

## Hot Transistors Leak More, Get Hotter

- Beyond a threshold, stopping the clock cannot arrest positive-feedback runaway
- Modern processors have temp sensors to slow the clock before entering runaway



Thermal runaway in integrated circuits, [Vassighi and Sachdev, 2006]

## Some (first-order) nitty-gritty

## Work and Runtime

- Work
  - scalar quantity for "amount of work" associated with a task
  - e.g., number of instructions to compute a SHA256 hash
- *T* = *Work* / *k*<sub>perf</sub>
  - runtime to perform a task
  - k<sub>perf</sub> is a scalar constant for the rate at which work is performed, e.g., "instructions per second"

#### **Energy and Power**

- $E_{switch} = k_{switch} \cdot Work$ 
  - "switching" energy associated with task
  - k<sub>switch</sub> is a scalar constant for "energy per unit work"
- $E_{static} = k_{static} \cdot T = k_{static} \cdot Work / k_{perf}$ 
  - "leakage" energy just to keep the chip powered on
  - k<sub>static</sub> is the so-called "leakage power"

Faster execution means lower leakage energy???

- $E_{total} = E_{switch} + E_{static} = (k_{switch} + k_{static}/k_{perf}) \cdot Work$
- $P_{total} = E_{total} / T = k_{switch} \cdot k_{perf} + k_{static}$
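
The first-order model above can be written out as a minimal sketch. The function names are made up for illustration; the constants are the Intel P4 660 FFT<sub>64K</sub> values quoted later in this lecture, with one FFT<sub>64K</sub> as the unit of work.

```python
def runtime(work: float, k_perf: float) -> float:
    """T = Work / k_perf."""
    return work / k_perf


def total_energy(work: float, k_perf: float, k_switch: float, k_static: float) -> float:
    """E_total = (k_switch + k_static / k_perf) * Work."""
    return (k_switch + k_static / k_perf) * work


def total_power(k_perf: float, k_switch: float, k_static: float) -> float:
    """P_total = k_switch * k_perf + k_static (independent of Work)."""
    return k_switch * k_perf + k_static


# P4 660 constants from this lecture (work unit = one FFT64K).
K_PERF, K_SWITCH, K_STATIC = 145.0, 0.24, 49.4

work = 145.0  # one second's worth of FFTs at full speed
t = runtime(work, K_PERF)
e = total_energy(work, K_PERF, K_SWITCH, K_STATIC)
p = total_power(K_PERF, K_SWITCH, K_STATIC)
print(f"T = {t:.2f} s, E = {e:.1f} J, P = {p:.1f} W")

# "Faster execution means lower leakage energy???"
# Within this model, yes, but only if k_switch and k_static stay fixed
# while k_perf doubles, which is an idealization.
e_fast = total_energy(work, 2 * K_PERF, K_SWITCH, K_STATIC)
print(f"E at 2x k_perf (same k_switch, k_static) = {e_fast:.1f} J")
```

Note that E / T reproduces P, as the last equation above requires.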

## In Short

- *T* = *Work* / *k*<sub>perf</sub>
  - less work finishes faster
- $E = E_{switch} + E_{static} = (k_{switch} + k_{static} / k_{perf}) \cdot Work$
  - less work uses less energy
- $P = P_{switch} + P_{static} = k_{switch} \cdot k_{perf} + k_{static}$
  - power is independent of the amount of work
- Reality check
  - Work not a simple scalar, inst mix, dependencies ...
  - k's are neither scalar nor constant

k<sub>perf</sub>: inst/sec; k<sub>switch</sub>: J/inst; k<sub>static</sub>: J/sec



# Why so important now?

# Ideal Technology Scaling

- Planned scaling occurs in discrete "nodes" where each is ~0.7x of the previous in linear dimension
- Taking the same design and reducing linear dimensions by 0.7x (aka "gate shrink") **ideally** leads to
  - die area = 0.5x
  - delay = 0.7x; frequency=1.43x
  - capacitance = 0.7x
  - Vdd = 0.7x (constant field) or 1x (constant voltage)
  - power = 0.5x (const. field) or 1x (const. voltage)
- Take the same area, then
  - transistor count = 2x, transistor speed=1.43x
  - power = 1x (const field) or 2x (const voltage)
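
The power entries above can be checked with a few lines of arithmetic. This sketch assumes switching power scales as C·V²·f (the first-order model used later in this lecture) and ignores leakage; the helper name is made up for illustration.

```python
S = 0.7  # linear shrink factor per technology node


def power_scale(c_scale: float, v_scale: float, f_scale: float) -> float:
    """Relative switching power, assuming P is proportional to C * V^2 * f."""
    return c_scale * v_scale**2 * f_scale


area_scale = S * S   # ~0.5x die area for the same design
freq_scale = 1 / S   # ~1.43x frequency (delay = 0.7x)

# Same design ("gate shrink"): capacitance 0.7x, frequency 1.43x.
const_field_power = power_scale(S, S, freq_scale)      # Vdd = 0.7x
const_voltage_power = power_scale(S, 1.0, freq_scale)  # Vdd = 1x

# Same area: ~2x as many transistors, each scaled as above.
count_scale = 1 / area_scale
same_area_const_field = count_scale * const_field_power
same_area_const_voltage = count_scale * const_voltage_power

print(f"gate shrink: {const_field_power:.2f}x (const field), "
      f"{const_voltage_power:.2f}x (const voltage)")
print(f"same area:   {same_area_const_field:.2f}x (const field), "
      f"{same_area_const_voltage:.2f}x (const voltage)")
```

The results round to the 0.5x/1x and 1x/2x figures in the list.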


Why so far off?

# Moore's Law $\rightarrow$ Performance

According to scaling theory

- @constant complexity ("gate-shrink"):
  - $\Rightarrow$ 1.43x performance at 0.5x power
- @max complexity ("reticle limited"):
  - 2x transistors at 1.43x frequency
  - $\Rightarrow$ 2.8x performance at constant power (if perf $\propto$ t-cnt × freq)

Historical (until 2005'ish), for high-perf CPUs

- ~2x transistors (expected)
- ~2x frequency (higher; note: faster than scaling predicts)
- all together, ~2x performance (lower) at ~2x power (higher)

#### **The Other Moore's Law**



# **Performance (In)efficiency**

- To hit "expected" performance target
  - push frequency harder by deepening pipelines
  - used the 2x transistors to build more complicated microarchitectures so fast/deep pipelines don't stall (i.e., caches, BP, superscalar, out-of-order)
- The consequence of performance inefficiency is



## **Moore's Law without Dennard Scaling**



#### Under fixed power ceiling, more ops/second only achievable if less Joules/op?

#### What Moore's Law has come down to





[Wikipedia, MOSFET]

[IEEE Spectrum, "The Nanosheet Transistor is the Next (and Maybe Last) Step in Moore's Law"]

# Frequency and Voltage Scaling: run <u>slower</u> at lower energy-per-op

## **Frequency and Voltage Scaling**

- Switching energy per transition is $\frac{1}{2}CV^2$ (modeling parasitic capacitance)
- Switching power at **f** transitions-per-sec is $\frac{1}{2}CV^2f$

- To reduce power, slow down the clock
- If clock is slower (f'), reduce supply voltage (V') too since transistors don't need to be as fast
  - reduced switching energy,  $\frac{1}{2}CV^2 \rightarrow \frac{1}{2}CV'^2$
  - lower V' also reduces leakage current/power
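
A small sketch of the ½CV² and ½CV²f relations above. The capacitance, voltage, and frequency values are hypothetical and chosen only to show the trend; the function names are made up for this example.

```python
def switching_energy(c_farads: float, v_volts: float) -> float:
    """Energy per transition: 1/2 * C * V^2 (Joules)."""
    return 0.5 * c_farads * v_volts**2


def switching_power(c_farads: float, v_volts: float, f_hz: float) -> float:
    """Power at f transitions per second: 1/2 * C * V^2 * f (Watts)."""
    return switching_energy(c_farads, v_volts) * f_hz


# Hypothetical operating points (not measurements of any real chip).
C = 1e-9           # aggregate switched capacitance, Farads
V, F = 1.2, 3.0e9  # nominal supply voltage and transition rate
V2, F2 = 0.9, 2.25e9  # both scaled to 0.75x

energy_ratio = switching_energy(C, V2) / switching_energy(C, V)
power_ratio = switching_power(C, V2, F2) / switching_power(C, V, F)

print(f"energy per transition: {energy_ratio:.3f}x")  # scales as 0.75^2
print(f"switching power:       {power_ratio:.3f}x")   # scales as 0.75^3
```

Slowing the clock alone only lowers the f term; lowering V along with it is what reduces the energy per transition.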

## **Frequency Scaling (by itself)**

- If Work / k<sub>perf</sub> < T<sub>bound</sub>, we can derate performance by frequency scaling by a factor s<sub>freq</sub>, where (Work/k<sub>perf</sub>)/T<sub>bound</sub> < s<sub>freq</sub> < 1, s.t. k<sub>perf</sub>' = k<sub>perf</sub> s<sub>freq</sub>
- $T' = Work / (k_{perf} \cdot s_{freq})$
  - 1/s<sub>freq</sub> longer runtime
- $P' = k_{switch} \cdot k_{perf} \cdot s_{freq} + k_{static}$
  - lower (switching) power due to longer runtime
- $E' = (k_{switch} + k_{static} / (k_{perf} \cdot s_{freq})) \cdot Work$
  - higher (leakage) energy due to longer runtime
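
To see the trade-off numerically, here is a minimal sketch that plugs the Intel P4 660 FFT<sub>64K</sub> constants quoted in this lecture into the T', P', and E' expressions above. The function name and the chosen s<sub>freq</sub> values are illustrative assumptions.

```python
# P4 660 constants from this lecture (work unit = one FFT64K).
K_PERF, K_SWITCH, K_STATIC = 145.0, 0.24, 49.4


def frequency_scaled(s_freq: float, work: float = 1.0):
    """Return (T', P', E') for frequency scaling by s_freq alone."""
    k_perf = K_PERF * s_freq
    t = work / k_perf
    p = K_SWITCH * k_perf + K_STATIC
    e = (K_SWITCH + K_STATIC / k_perf) * work
    return t, p, e


for s_freq in (1.0, 0.75, 0.5):
    t, p, e = frequency_scaled(s_freq)
    print(f"s_freq={s_freq:.2f}: T'={t * 1e3:.2f} ms, P'={p:.1f} W, E'={e:.3f} J per FFT")
```

Power falls as s<sub>freq</sub> shrinks, but energy per FFT rises because the fixed leakage power is paid over a longer runtime.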

#### Not such a good idea

## **Intel P4 660 Frequency Scaling: FFT<sub>64K</sub>**

circa 2005, 90nm



 $k_{perf}$ =145 FFT64K/sec;  $k_{switching}$ =0.24 J/FFT64K;  $k_{static}$ =49.4J/sec


## Frequency + Voltage Scaling

- Frequency scaling by s<sub>freq</sub> allows supply voltage to be scaled by a corresponding factor s<sub>voltage</sub>
- $E \propto V^2$ thus $k_{switch}'' = k_{switch} \cdot s_{voltage}^2$
- $k_{static}'' = k_{static} \cdot s_{voltage}^3$ (a very gross approximation of something complicated)
- $T'' = Work / (k_{perf} \cdot s_{freq})$
  - $1/s_{freq}$ longer runtime
- $E'' = (k_{switch} \cdot s_{voltage}^2 + k_{static} \cdot s_{voltage}^3 / (k_{perf} \cdot s_{freq})) \cdot Work$
- $P'' = k_{switch} \cdot s_{voltage}^2 \cdot k_{perf} \cdot s_{freq} + k_{static} \cdot s_{voltage}^3$
  - superlinear reduction in power and energy relative to performance degradation

## Intel P4 660 F+V Scaling: FFT<sub>64K</sub>




circa 2005, 90nm



## **Parallelization:**

#### run <u>faster</u> at lower energy-per-op by running <u>slower</u> at lower energy-per-op

#### **Cost of Performance in Power**



Microprocessors, Grochowski et al., 2006]

power $\propto$ perf<sup>($\alpha > 1$)</sup>

## Parallelization

- Ideal parallelization over **N** CPUs (to go fast)
  - $T = Work / (k_{perf} \cdot N)$
  - $E = (k_{switch} + k_{static} / k_{perf}) \cdot Work$
    - **N**-times static power, but **N**-times faster runtime
  - $P = N (k_{switch} \cdot k_{perf} + k_{static})$
- Alternatively, forfeit speedup for power and energy reduction by s<sub>freq</sub>=1/N (assume s<sub>voltage</sub> ≈ s<sub>freq</sub> below, which is not true!)
  - $T'' = Work / k_{perf}$
  - $E'' = (k_{switch} / N^2 + k_{static} / (k_{perf} \cdot N)) \cdot Work$
  - $P'' = k_{switch} \cdot k_{perf} / N^2 + k_{static} / N$
- Also works with using **N** slower-simpler CPUs
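
The two options above can be compared numerically. This sketch evaluates the formulas exactly as written on the slide, using the Intel P4 660 FFT<sub>64K</sub> constants quoted in this lecture; the function names and N = 4 are illustrative assumptions. One caveat: the static terms as written (k<sub>static</sub>/N) match a quadratic voltage dependence for leakage. Under the cubic approximation used on the F+V slide, they would instead fall as 1/N².

```python
# P4 660 constants from this lecture (work unit = one FFT64K).
K_PERF, K_SWITCH, K_STATIC = 145.0, 0.24, 49.4


def ideal_parallel(n: int, work: float = 1.0):
    """N full-speed CPUs: return (T, E, P)."""
    t = work / (K_PERF * n)
    e = (K_SWITCH + K_STATIC / K_PERF) * work
    p = n * (K_SWITCH * K_PERF + K_STATIC)
    return t, e, p


def parallel_scaled_down(n: int, work: float = 1.0):
    """N CPUs, each with s_freq = s_voltage = 1/N: return (T'', E'', P'').

    Formulas follow the slide as written; s_voltage = s_freq is the
    slide's own simplifying assumption, flagged there as not true.
    """
    t = work / K_PERF
    e = (K_SWITCH / n**2 + K_STATIC / (K_PERF * n)) * work
    p = K_SWITCH * K_PERF / n**2 + K_STATIC / n
    return t, e, p


N = 4
t_fast, e_fast, p_fast = ideal_parallel(N)
t_slow, e_slow, p_slow = parallel_scaled_down(N)
print(f"go fast  (N={N}): T={t_fast * 1e3:.2f} ms, E={e_fast:.3f} J, P={p_fast:.1f} W")
print(f"go frugal(N={N}): T={t_slow * 1e3:.2f} ms, E={e_slow:.3f} J, P={p_slow:.2f} W")
```

The first option buys an N-times speedup at N-times the power and the same energy; the second keeps single-CPU runtime while cutting both power and energy.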

## So what is the problem?

- "Easy" to pack more cores on a die to stay on Moore's law for "aggregate" or "throughput" performance
- How to use them?
  - life is good if your N units of work are N independent programs ⇒ just run them
  - what if your N units of work are N operations of the same program? ⇒ rewrite as parallel program
  - what if your *N* units of work are *N* sequentially dependent operations of the same program? ⇒ ??
     How many cores can you use up meaningfully?

#### **Moore's Law Scaling with Cores**



# Remember: it is all about Perf/Watt and Ops/Joule



#### **Heterogeneous System-on-Chip**



[raw M1 die photo from apple.com]
