👀 AOT and JIT compilation
Just-in-time:
- PyPy: compiles and optimizes the most frequently executed code
- Numba: translates selected functions to LLVM IR, then compiles them
Ahead-of-time:
- Python C API: write code in C/C++/Rust and expose an interface that CPython understands
- Cython (uses the C API of the Python interpreter)
- Numba can also compile ahead of time
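A minimal sketch of the JIT path with Numba: `@njit` compiles the function body to machine code (via LLVM) the first time it is called. The function name here is illustrative, and the snippet guards the import so it still runs as plain Python where `numba` is not installed:

```python
import math

try:
    from numba import njit  # JIT: translate to LLVM IR, then compile
except ImportError:
    def njit(func):          # fallback: run as ordinary Python
        return func

@njit
def integrate_cos(a, b, n_iter):
    # Rectangle-rule integral of cos on [a, b]; a tight numeric loop
    # is exactly the kind of code Numba speeds up.
    acc = 0.0
    step = (b - a) / n_iter
    for i in range(n_iter):
        acc += math.cos(a + i * step) * step
    return acc

print(integrate_cos(0.0, math.pi / 2, 10**6))  # ≈ 1.0
```

The first call pays the compilation cost; subsequent calls run at native speed.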
📗 Back to the notebook…
🏁 Any questions?
In [1]: %timeit -n100 X.dot(Y)
148 µs ± 6.04 µs per loop
🐍 Python: multiprocessing/multithreading
Process
- A process is a running instance of a program
- Every process has isolated resources:
  - virtual memory space
  - instruction pointer (current execution position)
  - call stack
  - system resources, e.g. file descriptors
- Alternative: a thread
Thread
- Threads are scheduled and executed independently
- Threads run inside a process and share its memory space and system resources
- Threads are managed by the OS
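The shared memory space is easy to see directly (a minimal stdlib sketch; the names are illustrative): two threads append to the same list object with no copying or message passing.

```python
import threading

results = []          # one object, shared by every thread in the process

def worker(tag, n):
    # Each thread mutates the same list; list.append is atomic in CPython.
    for i in range(n):
        results.append((tag, i))

threads = [threading.Thread(target=worker, args=(t, 3)) for t in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 6: both threads wrote into the same list
```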
Concurrency vs. Parallelism
Regular stuff
>>> import math
>>> def integrate(f, a, b, *, n_iter=1000):
...     acc = 0
...     step = (b - a) / n_iter
...     for i in range(n_iter):
...         acc += f(a + i * step) * step
...     return acc
...
>>> integrate(math.cos, 0, math.pi / 2)
1.0007851925466296
>>> integrate(math.sin, 0, math.pi)
1.9999983550656637
Multithreading!
from functools import partial
from concurrent.futures import ThreadPoolExecutor, as_completed

def integrate_faster(f, a, b, *, n_jobs, n_iter=1000):
    executor = ThreadPoolExecutor(max_workers=n_jobs)
    spawn = partial(executor.submit, integrate, f,
                    n_iter=n_iter // n_jobs)
    step = (b - a) / n_jobs
    fs = [spawn(a + i * step, a + (i + 1) * step)
          for i in range(n_jobs)]
    return sum(fut.result() for fut in as_completed(fs))

>>> integrate_faster(math.cos, 0, math.pi / 2, n_jobs=2)
1.0003926476775074
>>> integrate_faster(math.sin, 0, math.pi, n_jobs=2)
1.9999995887664657
Benchmark
In [1]: %%timeit -n100
   ...: integrate(math.cos, 0, math.pi / 2,
   ...:           n_iter=10**6)
   ...:
100 loops, best of 3: 279 ms per loop
In [2]: %%timeit -n100
...: integrate_faster(math.cos, 0, math.pi / 2,
...: n_iter=10**6,
...: n_jobs=2)
100 loops, best of 3: 283 ms per loop
In [3]: %%timeit -n100
...: integrate_faster(math.cos, 0, math.pi / 2,
...: n_iter=10**6,
...: n_jobs=4)
100 loops, best of 3: 275 ms per loop
# ???
GIL
- GIL (global interpreter lock): a mutex that guarantees that only one thread at a time executes Python bytecode and touches the interpreter state
- The GIL can be released, e.g. during blocking I/O or inside native code (and can be disabled entirely in free-threaded CPython builds)
- CPython's reference counting and gc rely on the GIL for thread safety
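Because the GIL is released while a thread waits on I/O, threads do speed up I/O-bound work even though they did not help the CPU-bound benchmark above. A minimal sketch, using `time.sleep` as a stand-in for a blocking read (function names are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(seconds):
    # time.sleep releases the GIL, just like a blocking socket/file read.
    time.sleep(seconds)
    return seconds

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(fake_io, [0.2] * 4))
elapsed = time.perf_counter() - start

# Four 0.2 s waits overlap: wall time is ~0.2 s, not 0.8 s.
print(f"{elapsed:.2f} s")
```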
e.g. Cython
In [2]: %%cython
   ...: from libc.math cimport cos
   ...: def integrate(f, double a, double b, long n_iter):
   ...:     #            ^ statically typed arguments
   ...:     cdef double acc = 0
   ...:     cdef double step = (b - a) / n_iter
   ...:     cdef long i
   ...:     with nogil:
   ...:         for i in range(n_iter):
   ...:             acc += cos(a + i * step) * step
   ...:     return acc
In [3]: %%timeit -n100
...: integrate_faster(math.cos, 0, math.pi / 2,
...: n_iter=10**6, n_jobs=2)
100 loops, best of 3: 9.58 ms per loop
In [4]: %%timeit -n100
...: integrate_faster(math.cos, 0, math.pi / 2,
...: n_iter=10**6, n_jobs=4)
100 loops, best of 3: 7.95 ms per loop
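For pure-Python CPU-bound code, the other way around the GIL is to use separate processes, each with its own interpreter and its own GIL. A hedged sketch with `concurrent.futures.ProcessPoolExecutor` (the function names are illustrative; workers must be picklable, i.e. defined at module top level):

```python
import math
from concurrent.futures import ProcessPoolExecutor

def integrate(f, a, b, *, n_iter=1000):
    # Same rectangle rule as above, now run inside a worker process.
    acc = 0.0
    step = (b - a) / n_iter
    for i in range(n_iter):
        acc += f(a + i * step) * step
    return acc

def chunk(args):
    a, b, n_iter = args
    return integrate(math.cos, a, b, n_iter=n_iter)

def integrate_processes(a, b, *, n_jobs, n_iter=10**6):
    step = (b - a) / n_jobs
    jobs = [(a + i * step, a + (i + 1) * step, n_iter // n_jobs)
            for i in range(n_jobs)]
    with ProcessPoolExecutor(max_workers=n_jobs) as ex:
        # Each chunk runs in its own process: real parallelism,
        # at the cost of pickling arguments and results.
        return sum(ex.map(chunk, jobs))

if __name__ == "__main__":
    print(integrate_processes(0, math.pi / 2, n_jobs=2))  # ≈ 1.0
```

Unlike the thread version, this one actually scales with cores, but only when each chunk is large enough to amortize the process and serialization overhead.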
>>> import multiprocessing as mp
>>> def countdown(n):
...     for i in reversed(range(n)):
...         print(i, "left")
...
>>> p = mp.Process(target=countdown, args=(5, ))
>>> p.start()
>>> 4 left
3 left
2 left
1 left
0 left
>>> p.name, p.pid
('Process-2', 65130)
>>> p.daemon
False
>>> p.join()
>>> p.exitcode
0
>>> def ponger(conn):
... conn.send("pong")
...
>>> parent_conn, child_conn = mp.Pipe()
>>> p = mp.Process(target=ponger,
...                args=(child_conn, ))
>>> p.start()
>>> parent_conn.recv()
'pong'
>>> p.join()
joblib
from joblib import Parallel, delayed

def integrate_faster(f, a, b, *, n_jobs, n_iter=1000, backend=None):
    step = (b - a) / n_jobs
    with Parallel(n_jobs=n_jobs,
                  backend=backend) as parallel:
        fs = (delayed(integrate)(f, a + i * step,
                                 a + (i + 1) * step,
                                 n_iter=n_iter // n_jobs)
              for i in range(n_jobs))
        return sum(parallel(fs))
🏁 Any questions?
Stuff to discuss
- how to stay sane
- practical deep learning
- course work
- exam