Extracting Parallelism is key to performance.

Key goal of hardware, systems, and for more than a decade. The only way to get performance.

## Old Slides from ~2017. But these are main ideas, we'll see them at multiple scales.

Biased by my own work because I have slides and am lazy... not because I think it's best.

## Message

## **Statistical algorithms** have **relaxed** notions of correctness leads to **new opportunities** for:

- Algorithms,
- Systems, and
- Hardware.

## **Key Issue:** Balance <u>Statistical</u> versus <u>Hardware</u> Efficiency.

- Statistical efficiency how many steps you take
- Hardware efficiency how efficiently you take each of those steps

**Ce Zhang, CR** DimmWitted: A Study of Main-Memory Statistical Analytics. VLDB14.

## Three driving trends in hardware

(1) Lots of smaller cores,
(2) Non-Uniform Memory (NUMA), and
(3) Single-Instruction Multiple Data
(SIMD) (and SIMT)

Approximation allows major performance improvements.

## Trend 1: Many different Cores



Steve Wright



Ben Recht



Feng Niu

## Single Cores are not getting faster.



Chips now contain many cores, so throughput is increasing... but need to rewrite algos!

## Statistical Analytics Crash Course

Staggering amount of machine learning/stats can be written as:



N (number of y<sub>i</sub>s, data) typically in the billions Ex: Classification, Recommendation, Deep Learning.



## Multicore: Independent Case





Jobs with little communication, 2 cores executes twice as faster!

## Multicore: Dependent Case



Protocol for "whose turn," called **locking**, takes 100 cycles.

345085-0024 BKDH5

CPU

93765103754H-ALLP4

345085-0024 BKDH5

CPU

93765103754H-ALLP4

ML-0092847

0092847

0703846725-30

0703846725-30



Communication scales quadratically

## Suppose it takes 1 second to communicate with 2 cores.

4 cores takes 4 second



The key algorithm in machine learning

## The core algorithm of modern learning called is **Stochastic Gradient Descent (SGD)**



## SGD consists of **BILLIONS** of tiny jobs!

Implemented in a classical way (locking) SGD actually gets *slower* with more cores

So what can we do?

## Multicore: Hogwild! Case



Job 4



## Is it my turn? Yes!

#### Ignore the locks!

### How do we run SGD in Parallel?

## Just ignore the locking protocol... As we say, go **Hogwild**!

## This is computer science heresy!

**Theorem (roughly, NIPS11):** If we do **no locking**, SGD converges to correct answer—at essentially the same rate!

Hogwild! [Niu, Recht, Ré, Wright NIPS11] AsySCD [Liu, Wright et al. ICML14, JMLR14] Buckwild! [De Sa, Olukotun, Ré NIPS15]

## Cortana: Microsoft's Digital Assistant WIRED

AI breakthrough: Microsoft's 'Project Adam' identifies dog breeds, points to future of machine learning



All web companies have similar: image rec, voice, mobile, search, etc.

"...using a technology called, of all things, **Hogwild!"** 

http://www.wired.com/2014/07/microsoft-adam/ http://www.geekwire.com/2014/artificial-intelligence-breakthrough-microsofts-project-adam-identifies-dog-breeds/

## A larger trend?



## Relaxing **consistency** to be **architecturally aware** can be a big performance win.



#### Asynchrony in Deep Learning







Heard at **#icml2016**: "SGD is so robust that your bugs in your SGD implementation will work as regularizers."



A regularizer is a (sane) statistical penalty... Bugs in your implementation are not helpful

## Trend 2: NUMA Non-Uniform Memory Access



Steve Wright



Ji Liu Z





Ce Zhang

Krishna Sridhar

### A view inside a box... (more later



Modern version: Thousands of cores with close by memory (Called high-bandwidth memory, called HBM)

### One Example: Quadratic Programming with Orthant Constraints (on cpu, same tradeoffs)



## One Example.



## What about multiple sockets?



## Model Replication

#### PerMachine (Hogwild!)





## Model Replication



#### PerNode



## Statistical versus Hardware Efficiency



Relaxing consistency results in new tradeoffs.

1. Access methods

- {Row, Column, Row-col} 2. Model Replication
  - {Core, Node, Machine}
- 3. Data Replication
  - {Full, Importance, Shard}

Can be 100x faster than classical choices

DimmWitted: A Study of Main-Memory Statistical Analytics. VLDB14.

## Trend 3: Single Instruction Multiple Data (SIMD)



Chris De Sa Kunle Olukotun

Modern processors offer fine-grained parallelism. [NIPS15]

## SIMD Processing: Fine-grained parallelism

#### Single instruction multiple data (SIMD)



## Same operation on *multiple data points* in parallel

## SIMD: Doubling again!

#### **Intel® Advanced Vector Extensions**

#### VX-512: 512-bit vectors 8X peak FLOPs over 4 generations 32 registers, Masking Knights Landing /Future Xeon AVX2: FMA (2x peak flops) er SIMD. "Gather" Instructions Haswell (22 nm Tock) Ivybridge (22nm Tick) AVX: 2X flops: 256-bit wide floating-point vectors Sandy Bridge Since 2001: (32 nm Tock) 128-bit Vectors Image courtesy of Intel Corporation 2012 2010 2011 2013

Good old days of Moore's Law! ... If we can take advantage of fine-grained parallelism

#### SIMD bandwidth has **doubled each** of the last four generations.

## Precision vs. Parallelism



#### Tradeoff between precision & parallelism

# A hardware model for precision [ISCA17]





Chris De Sa Kunle Olukotun

## Four Classes of Numbers

#### Dataset numbers

used to store the immutable input data

### Model numbers

used to represent the vector we are updating

### **G**radient numbers

used as intermediates in gradient computations

### Communication numbers

used to communicate among parallel workers

## Quantize classes independently

- Using low-precision for different number classes has different effects on performance.
  - e.g. quantizing the gradient numbers improves compute throughput, but has little effect on memory

Existing work often quantizes some classes, but doesn't consider the others.

## The DMGC Model

Idea: associate each implementation with a DMGC signature that displays its precision for all four number classes
 Lets us classify previous work and future systems

 $D^8 M^{16} G^{32f} C^{16}$ It communicates The algorithm It uses 16-bit It computes among workers uses 8-bit numbers for gradients as with 16-bit numbers to store the model. 32-bit floats. numbers. the dataset.

## Be warned: Your learning **parameters** depend on the hardware and those numbers. (e.g. momentum and delay are connected)





Ioannis Mitliagkas

Jian Zhang

#### What shook my belief in progress through optimization...

- Turns out Optimization is a **leaky** abstraction for deep learning.
  - There are approaches that cause the loss to go down more slowly (worse optimization) but generalizing better (better test performance).
- Happy to give examples if you ask, so many out there it's bizarre....
- This is so much more interesting than it should be!