👀 AOT and JIT compilation

Just-in-time:

PyPy: compiles the most frequently executed code at run time
Numba: translates selected functions to LLVM IR, then compiles them

Ahead-of-time:

Python C-API: write code in C/C++/Rust and expose an interface that CPython understands
Cython (uses the C-API of the Python interpreter)
Numba can also compile ahead of time

📗 Back to the notebook…

🏁 Any questions?

[1]: %timeit -n100 X.dot(Y)

148 µs ± 6.04 µs per loop

🐍 Python: multiprocessing/multithreading

Process

A process is a running instance of a program
Every process has its own isolated resources:
- virtual memory space
- instruction pointer (the current point of execution)
- call stack
- system resources, e.g. file descriptors
A lighter-weight alternative: a thread
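
A minimal sketch of the isolation described above, assuming a standard CPython install where `sys.executable` points at the interpreter. Launching a program creates a new process with its own PID and its own memory, so a variable set in the parent does not exist in the child. The helper name `run_child` is hypothetical.

```python
import os
import subprocess
import sys

secret = "only in the parent"  # lives in the parent's memory only


def run_child(code):
    """Launch a new Python interpreter (a new process) and return its stdout."""
    done = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


# The child reports its own PID, which differs from the parent's.
child_pid = int(run_child("import os; print(os.getpid())"))
print(os.getpid(), child_pid)

# The child starts with a fresh memory space: `secret` is not defined there.
print(run_child("print('secret' in globals())"))
```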

Thread

Threads are executed independently of each other
Threads run inside a process and share its memory space and system resources
Managed by the OS
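
A small sketch of the sharing described above, using the standard `threading` module. The names `shared` and `worker` are chosen for illustration only. Every thread appends to the same list and reports the same PID, because all of them live inside one process.

```python
import os
import threading

shared = []  # one list, visible to every thread in this process


def worker(tag):
    # Each thread appends to the same list and records the PID it runs in.
    shared.append((tag, os.getpid()))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All entries carry the PID of the current process.
print(sorted(shared))
```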

Concurrency vs. Parallelism

Regular stuff

>>> import math
>>> def integrate(f, a, b, *, n_iter=1000):
...     acc = 0
...     step = (b - a) / n_iter
...     for i in range(n_iter):
...         acc += f(a + i * step) * step
...     return acc
...
>>> integrate(math.cos, 0, math.pi / 2)
1.0007851925466296
>>> integrate(math.sin, 0, math.pi)
1.9999983550656637
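
For reference, `integrate` computes a left Riemann sum. Written out, with $n$ standing for `n_iter` and $h$ for `step`:

```latex
\int_a^b f(x)\,dx \;\approx\; \sum_{i=0}^{n-1} f(a + i h)\, h,
\qquad h = \frac{b - a}{n}
```

For a monotonic $f$ the error of this sum is at most $h\,|f(b) - f(a)|$. With $n = 1000$ on $[0, \pi/2]$ that bound is about $0.0016$ for $\cos$, which is consistent with the result of roughly $1.0008$ instead of $1$.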

Multithreading!

from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import partial

def integrate_faster(f, a, b, *, n_jobs, n_iter=1000):
    executor = ThreadPoolExecutor(max_workers=n_jobs)
    # Each job integrates one sub-interval with its share of the iterations.
    spawn = partial(executor.submit, integrate, f,
                    n_iter=n_iter // n_jobs)
    step = (b - a) / n_jobs
    fs = [spawn(a + i * step, a + (i + 1) * step)
          for i in range(n_jobs)]
    return sum(fut.result() for fut in as_completed(fs))

>>> integrate_faster(math.cos, 0, math.pi / 2, n_jobs=2)
1.0003926476775074
>>> integrate_faster(math.sin, 0, math.pi, n_jobs=2)
1.9999995887664657
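
A variant sketch, not the version from the slide: the same interval-splitting idea written with `executor.map` inside a `with` block, so the pool is shut down once the work is done. `integrate` is repeated here only to keep the block self-contained. The names `integrate_map` and `piece` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def integrate(f, a, b, *, n_iter=1000):
    acc = 0
    step = (b - a) / n_iter
    for i in range(n_iter):
        acc += f(a + i * step) * step
    return acc


def integrate_map(f, a, b, *, n_jobs, n_iter=1000):
    step = (b - a) / n_jobs

    def piece(i):
        # Integrate the i-th sub-interval with its share of the iterations.
        return integrate(f, a + i * step, a + (i + 1) * step,
                         n_iter=n_iter // n_jobs)

    # Leaving the with block waits for all tasks and shuts the pool down.
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        return sum(executor.map(piece, range(n_jobs)))


print(integrate_map(math.cos, 0, math.pi / 2, n_jobs=2))
```

Using the executor as a context manager avoids leaving idle worker threads behind after the function returns.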

Benchmark

In [1]: %%timeit -n100
   ...: integrate(math.cos, 0, math.pi / 2,
   ...:           n_iter=10**6)
   ...:
100 loops, best of 3: 279 ms per loop
In [2]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6,
   ...:                 n_jobs=2)
100 loops, best of 3: 283 ms per loop
In [3]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6,
   ...:                 n_jobs=4)
100 loops, best of 3: 275 ms per loop
# ???

GIL

GIL (global interpreter lock): a mutex that guarantees that only one thread at a time has access to the interpreter state
(the GIL can be released by native code, e.g. in C extensions)
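
A rough way to observe the effect, sketched with the standard library only. The names `burn`, `timed`, `sequential` and `threaded` are hypothetical. On a standard GIL-enabled CPython build the two timings are expected to be similar, since only one thread runs Python bytecode at a time. Exact numbers depend on the machine, and a free-threaded build would behave differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor

N = 2_000_000


def burn(n):
    # Pure-Python CPU-bound loop; it needs the GIL to execute bytecode.
    while n > 0:
        n -= 1


def timed(fn):
    # Return the wall-clock time of one call to fn().
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def sequential():
    burn(N)
    burn(N)


def threaded():
    # The same amount of work, split across two threads.
    with ThreadPoolExecutor(max_workers=2) as executor:
        list(executor.map(burn, [N, N]))


t_seq = timed(sequential)
t_thr = timed(threaded)
print(f"sequential: {t_seq:.3f}s, two threads: {t_thr:.3f}s")
```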

e.g. Cython

In [2]: %%cython
   ...: from libc.math cimport cos
   ...: def integrate(f, double a, double b, long n_iter):
   ...:     # f is unused: cos from libc is called directly
   ...:     cdef double acc = 0
   ...:     cdef double step = (b - a) / n_iter
   ...:     cdef long i
   ...:     with nogil:
   ...:         for i in range(n_iter):
   ...:             acc += cos(a + i * step) * step
   ...:     return acc
In [3]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6, n_jobs=2)
100 loops, best of 3: 9.58 ms per loop
In [4]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6, n_jobs=4)
100 loops, best of 3: 7.95 ms per loop

>>> import multiprocessing as mp
>>> p = mp.Process(target=countdown, args=(5, ))
>>> p.start()
>>> 4 left
3 left
2 left
1 left
0 left
>>> p.name, p.pid
('Process-2', 65130)
>>> p.daemon
False
>>> p.join()
>>> p.exitcode
0
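
`countdown` is not defined in this section. Below is a hypothetical definition that would produce the output shown above. The process is started under a `__main__` guard so the sketch also works when run as a script with the `spawn` start method.

```python
import multiprocessing as mp
import time


def countdown(n):
    # Hypothetical helper: the slides do not show its definition.
    while n > 0:
        n -= 1
        print(f"{n} left")
        time.sleep(0.1)


if __name__ == "__main__":
    p = mp.Process(target=countdown, args=(5,))
    p.start()   # the child runs countdown(5) in its own process
    p.join()    # wait for the child to finish
    print(p.exitcode)
```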

>>> def ponger(conn):
...     conn.send("pong")
...
>>> parent_conn, child_conn = mp.Pipe()
>>> p = mp.Process(target=ponger,
...                args=(child_conn, ))
>>> p.start()
>>> parent_conn.recv()
'pong'
>>> p.join()
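
`mp.Pipe()` is duplex by default, so each end can both send and receive. A sketch of a two-way exchange follows. The function name `replier` is hypothetical and not part of the slides.

```python
import multiprocessing as mp


def replier(conn):
    # Wait for one message, answer it, then close this end of the pipe.
    msg = conn.recv()
    conn.send(f"got {msg!r}")
    conn.close()


if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()  # duplex=True by default
    p = mp.Process(target=replier, args=(child_conn,))
    p.start()
    parent_conn.send("ping")     # parent -> child
    print(parent_conn.recv())    # child -> parent
    p.join()
```

Objects sent through the connection are pickled, so only picklable values can cross the process boundary this way.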

joblib

from joblib import Parallel, delayed

def integrate_faster(f, a, b, *, n_jobs, n_iter=1000, backend=None):
    step = (b - a) / n_jobs
    with Parallel(n_jobs=n_jobs,
                  backend=backend) as parallel:
        # Pass f through to integrate; one delayed call per sub-interval.
        fs = (delayed(integrate)(f, a + i * step,
                                 a + (i + 1) * step,
                                 n_iter=n_iter // n_jobs)
              for i in range(n_jobs))
        return sum(parallel(fs))

🏁 Any questions?

Stuff to discuss

how to stay sane
practical deep learning
course work
exam