👀 AOT and JIT compilation
Just-in-time:
- PyPy: compiles the most frequently executed code
- Numba: translates selected functions to LLVM IR, then compiles them
Ahead-of-time:
- Python C API: write code in C/C++/Rust and expose an interface that CPython understands
- Cython (uses the C API of the Python interpreter)
- Numba can also do AOT compilation
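A minimal sketch of the JIT idea with Numba, assuming Numba is installed. The function name integrate_cos is made up for this illustration, and if Numba is missing the decorator falls back to a no-op so the code still runs as plain Python.

```python
import math

try:
    from numba import njit  # assumption: Numba is available
except ImportError:
    def njit(func):
        # Fallback for illustration only: no compilation, plain Python.
        return func

@njit
def integrate_cos(a, b, n_iter):
    # Left Riemann sum of cos over [a, b]; compiled on first call under Numba.
    acc = 0.0
    step = (b - a) / n_iter
    for i in range(n_iter):
        acc += math.cos(a + i * step) * step
    return acc

result = integrate_cos(0.0, math.pi / 2, 1000)
print(result)
```

With Numba the first call includes compilation time, so a benchmark should time a later call.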
📗 Back to the notebook…
🏁 Any questions?
[1]: %timeit -n100 X.dot(Y)
148 µs ± 6.04 µs per loop
🐍 Python: multiprocessing/multithreading
Process
- A process is a running instance of a program
- Every process has its own isolated resources:
  - virtual memory space
  - instruction pointer (current execution position)
  - call stack
  - system resources, e.g. file descriptors
- Alternative: a thread
Thread
- Threads execute independently of each other
- Threads run inside a process and share its memory space and system resources
- Scheduled by the OS
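A small sketch of the shared memory space: several threads append to the same list object without any copying. The names shared and worker are chosen for this illustration only.

```python
import threading

shared = []  # lives in the process's memory, visible to every thread

def worker(tag):
    # list.append on a shared list is safe to call from several threads
    shared.append(tag)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every thread has finished

result = sorted(shared)  # arrival order is not guaranteed, so sort
print(result)
```

A separate process would get its own memory space instead, so data would have to be sent to it explicitly.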
Concurrency vs. Parallelism

Regular stuff
>>> import math
>>> def integrate(f, a, b, *, n_iter=1000):
...     acc = 0
...     step = (b - a) / n_iter
...     for i in range(n_iter):
...         acc += f(a + i * step) * step
...     return acc
...
>>> integrate(math.cos, 0, math.pi / 2)
1.0007851925466296
>>> integrate(math.sin, 0, math.pi)
1.9999983550656637
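The results above are close to the exact values (1 and 2) but not equal, because integrate is a left Riemann sum. A quick sketch of how the error shrinks as n_iter grows, using the same function and the known exact value 1 for cos on [0, pi/2]:

```python
import math

def integrate(f, a, b, *, n_iter=1000):
    acc = 0
    step = (b - a) / n_iter
    for i in range(n_iter):
        acc += f(a + i * step) * step
    return acc

sizes = (10, 100, 1000)
# The exact integral of cos over [0, pi/2] is 1.
errors = [abs(integrate(math.cos, 0, math.pi / 2, n_iter=n) - 1.0)
          for n in sizes]
for n, err in zip(sizes, errors):
    print(f"n_iter={n:>5}: error={err:.6f}")
```

More iterations buy accuracy at the price of run time, which is why the benchmarks use a large n_iter.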
Multithreading!
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import partial
def integrate_faster(f, a, b, *, n_jobs, n_iter=1000):
    # Split [a, b] into n_jobs equal chunks and submit one task per chunk.
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        spawn = partial(executor.submit, integrate, f,
                        n_iter=n_iter // n_jobs)
        step = (b - a) / n_jobs
        fs = [spawn(a + i * step, a + (i + 1) * step)
              for i in range(n_jobs)]
        # Sum the partial integrals as the tasks finish.
        return sum(fut.result() for fut in as_completed(fs))
>>> integrate_faster(math.cos, 0, math.pi / 2, n_jobs=2)
1.0003926476775074
>>> integrate_faster(math.sin, 0, math.pi, n_jobs=2)
1.9999995887664657
Benchmark
In [1]: %%timeit -n100
   ...: integrate(math.cos, 0, math.pi / 2,
   ...:           n_iter=10**6)
   ...:
100 loops, best of 3: 279 ms per loop
In [2]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6,
   ...:                 n_jobs=2)
100 loops, best of 3: 283 ms per loop
In [3]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6,
   ...:                 n_jobs=4)
100 loops, best of 3: 275 ms per loop
# ???
GIL
- GIL (global interpreter lock): a mutex that guarantees that only one thread at a time has access to the interpreter state
- (The GIL can be released by extension code that does not touch Python objects)
gc
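A sketch of the other side of the GIL: threads do help when the work spends its time in calls that release the GIL. Here time.sleep stands in for blocking I/O, and blocking_task is a name invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_task(seconds):
    time.sleep(seconds)  # sleeping releases the GIL, other threads can run
    return seconds

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    # Four 0.2 s waits overlap instead of running back to back.
    results = list(executor.map(blocking_task, [0.2] * 4))
elapsed = time.perf_counter() - start
print(f"4 x 0.2 s sleeps took {elapsed:.2f} s")
```

Run sequentially the four waits would take about 0.8 s. Pure-Python arithmetic, as in integrate, holds the GIL and gets no such overlap.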
e.g. Cython
In [2]: %%cython
   ...: from libc.math cimport cos
   ...: def integrate(f, double a, double b, long n_iter):
   ...:     # f is ignored here: the C-level cos is called directly
   ...:     cdef double acc = 0
   ...:     cdef double step = (b - a) / n_iter
   ...:     cdef long i
   ...:     with nogil:
   ...:         for i in range(n_iter):
   ...:             acc += cos(a + i * step) * step
   ...:     return acc
In [3]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6, n_jobs=2)
100 loops, best of 3: 9.58 ms per loop
In [4]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6, n_jobs=4)
100 loops, best of 3: 7.95 ms per loop
>>> import multiprocessing as mp
>>> p = mp.Process(target=countdown, args=(5, ))
>>> p.start()
>>> 4 left
3 left
2 left
1 left
0 left
>>> p.name, p.pid
('Process-2', 65130)
>>> p.daemon
False
>>> p.join()
>>> p.exitcode
0
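The session above uses a countdown function that is not shown on the slide. A possible definition, written as a script, might look like this; countdown_lines and main are hypothetical helpers added for the illustration, and the output format is inferred from the session.

```python
import multiprocessing as mp

def countdown_lines(n):
    # Hypothetical helper: the lines a countdown from n would print.
    return [f"{i} left" for i in reversed(range(n))]

def countdown(n):
    # Assumed shape of the slide's countdown: print n-1, ..., 0.
    for line in countdown_lines(n):
        print(line, flush=True)

def main():
    p = mp.Process(target=countdown, args=(3,))
    p.start()   # the child process runs countdown(3)
    p.join()    # wait for the child to exit
    return p.exitcode

if __name__ == "__main__":
    # The guard keeps child processes from re-running main on import.
    print("exit code:", main())
```

The target function has to be defined at module level so that the child process can find it.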
>>> def ponger(conn):
...     conn.send("pong")
...
>>> parent_conn, child_conn = mp.Pipe()
>>> p = mp.Process(target=ponger,
...                args=(child_conn, ))
>>> p.start()
>>> parent_conn.recv()
'pong'
>>> p.join()
joblib
from joblib import Parallel, delayed
def integrate_faster(f, a, b, *, n_jobs, n_iter=1000, backend=None):
    step = (b - a) / n_jobs
    with Parallel(n_jobs=n_jobs,
                  backend=backend) as parallel:
        # One delayed call per chunk; f must be passed through to integrate.
        fs = (delayed(integrate)(f, a + i * step,
                                 a + (i + 1) * step,
                                 n_iter=n_iter // n_jobs)
              for i in range(n_jobs))
        # Run the tasks while the worker pool is still open.
        return sum(parallel(fs))
🏁 Any questions?
Stuff to discuss
- how to stay sane
- practical deep learning
- course work
- exam