7 · Last, but not least · performance · hacks · multiprocessing

By on

👀 AOT and JIT compilation


Just-in-time:

  • PyPy: compile the most frequent code
  • Translate some code to LLVM (Numba). then compile it

Ahead-of-time:

  • Python C-API: Writing code in C/C++/Rust and making an interface understandable for CPython
  • Cython (uses C-API of Python interpreter)
  • Numba can do AOT

📗 Back to the notebook…



🏁 Any questions?

[1]: %timeit -n100 X.dot(Y)

148 µs ± 6.04 µs per loop


🐍 Python: multiprocessing/multithreading


🐍 Python: threadingmultiprocessing/multi


🐍 thonPy: multi/multithreading/processing


Process

  • Process is a launched program
  • Every process has isolated resources:
    • virtual memory space
    • pointer to execution place
    • call stack
    • system resources, e.g. file descriptors
  • Alternative: a thread

Thread

  • Threads are executed independently
  • Threads are executed inside some process and can share memory space and system resources
  • Managed by OS

Concurrency vs. Parallelism


dEnqiA2.png


Regular stuff

import math
def integrate(f, a, b, *, n_iter=1000):
... acc=0
... step=(b-a)/n_iter
... for i in range(n_iter):
...     acc += f(a + i * step) * step
... return acc
...
>>> integrate(math.cos, 0, math.pi / 2) 1.0007851925466296
>>> integrate(math.sin, 0, math.pi) 1.9999983550656637

Multithreading!

from functools import partial
def integrate_faster(f, a, b, *, n_jobs, n_iter=1000):
    executor = ThreadPoolExecutor(max_workers=n_jobs)
    spawn = partial(executor.submit, integrate, f, 
                    n_iter=n_iter // n_jobs)
    step=(b-a)/n_jobs 
    fs=[spawn(a+i*step,a+(i+1)*step)
        for i in range(n_jobs)]
    return sum(f.result() for f in as_completed(fs))

>>> integrate_faster(math.cos, 0, math.pi / 2, n_jobs=2) 1.0003926476775074
>>> integrate_faster(math.sin, 0, math.pi, n_jobs=2) 1.9999995887664657

Benchmark

In [1]: %%timeit -n100
   ...: integrate(math.cos, 0, math.pi / 2, ...: n_iter=10**6)
   ...:
100 loops, best of 3: 279 ms per loop
In [2]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6,
   ...:                 n_jobs=2)
100 loops, best of 3: 283 ms per loop
In [3]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6,
   ...:                 n_jobs=4)
100 loops, best of 3: 275 ms per loop
# ???

GIL


  • GIL – global interpreter lock – mutex that guarantees that only one thread at a time has access to the interpreter state
  • (GIL can be disabled)

gc


e.g. Cython

In [2]: %%cython
...: from libc.math cimport cos
...: def integrate(f, double a, double b, long n_iter):
...: #             ^
   ...:     cdef double
   ...:     cdef double
   ...:     cdef long i
   ...:     with nogil:
   ...:         for i in range(n_iter):
   ...:             acc += cos(a + i * step) * step
   ...:     return acc
In [3]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6, n_jobs=2)
100 loops, best of 3: 9.58 ms per loop
In [4]: %%timeit -n100
   ...: integrate_faster(math.cos, 0, math.pi / 2,
   ...:                 n_iter=10**6, n_jobs=4)
100 loops, best of 3: 7.95 ms per loop

>>> import multiprocessing as mp
>>> p = mp.Process(target=countdown, args=(5, )) >>> p.start()
>>> 4 left
3 left
2 left
1 left
0 left
>>> p.name, p.pid
('Process-2', 65130)
>>> p.daemon
False
>>> p.join()
>>> p.exitcode
0

>>> def ponger(conn):
...     conn.send("pong")
...
>>> parent_conn, child_conn = mp.Pipe() >>> p = mp.Process(target=ponger,
... args=(child_conn, )) >>> p.start()
>>> parent_conn.recv()
'pong'
>>> p.join()

joblib

from joblib import Parallel, delayed
def integrate_faster(f, a, b, *, n_jobs, n_iter=1000, backend=None):
    step = (b - a) / n_jobs
    with Parallel(n_jobs=n_jobs,
                  backend=backend) as parallel: 
        fs = (delayed(integrate)(a + i * step,
                  a + (i + 1) * step,
                  n_iter=n_iter // n_jobs) 
              for i in range(n_jobs))
    return sum(parallel(fs))

🏁 Any questions?


Stuff to discuss

  • how to stay sane
  • practical deep learning
  • course work
  • exam