Tasks for those who "feel like a pro"

TASK 1

Write the code to enumerate items in the list:

  • items are not ordered
  • items are not unique
  • don't use loops
  • try to be as short as possible (not considering import statements)

Example:

Input

items = ['foo', 'bar', 'baz', 'foo', 'baz', 'bar']

Output

#something like:
[0, 1, 2, 0, 2, 1]

TASK 2

For each element in a list [0, 1, 2, ..., N] build all possible pairs with other elements of that list.

  • exclude "self-pairing" (e.g. 0-0, 1-1, 2-2)
  • don't use loops
  • try to be as short as possible (not considering import statements)

Example:

Input:

[0, 1, 2, 3] or just 4

Output:

0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3

1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2

Learning Resources

Online

Learning by doing!

Reading (in the future)

  • Al Sweigart, "Automate the Boring Stuff with Python", https://automatetheboringstuff.com
  • Mark Lutz, "Python Pocket Reference" (250 pages)
  • Mark Lutz, "Learning Python" (1600 pages!)

Google Foobar

Programming in python

Writing code

Some anti-patterns

Python basics

Verify your python version by running

python --version

This notebook is written in Python 3.

Basic types

variables

a = b = 3

c, d = 4, 5

c, d = d, c

strings

In [ ]:
greeting = 'Hello'
guest = "John"
my_string = 'Hello "John"'
named_greeting = 'Hello, {name}'.format(name=guest)

named_greeting2 = '{}, {}'.format(greeting, guest)

print(named_greeting)
print(named_greeting2)

data containers

  • list
  • tuple
  • set
  • dictionary

lists

In [ ]:
fruit_list = ['apple', 'orange', 'peach', 'mango', 'bananas', 'pineapple']

name_length = [len(fruit) for fruit in fruit_list]
print(name_length)
In [ ]:
name_with_p = [fruit for fruit in fruit_list if fruit[0]=='p']  #even better: fruit.startswith('p')
In [ ]:
numbered_fruits = []
In [ ]:
for i, fruit in enumerate(fruit_list):
    numbered_fruits.append('{}.{}'.format(i, fruit))
    
numbered_fruits

Indexing starts with zero.

General indexing rule (mind the brackets): [start:stop:step]

In [ ]:
numbered_fruits[0] = None
In [ ]:
numbered_fruits[1:4]
In [ ]:
numbered_fruits[1:-1:2]
In [ ]:
numbered_fruits[::-1]

tuples

immutable type!

In [ ]:
p_fruits = (name_with_p[1], name_with_p[0])
p_fruits[1] = 'mango'
In [ ]:
single_number_tuple = 3,
single_number_tuple
In [ ]:
single_number_tuple + (2,) + (1, 0)

sets

Immutable type. Stores only unique elements.

In [ ]:
my_set = set([0, 1, 2, 1, 1, 1, 3])
my_set
In [ ]:
my_set.remove(2)
my_set
In [ ]:
my_set.add("word")
my_set

dictionaries

In [ ]:
fruit_list = ['apple', 'orange', 'mango', 'banana', 'pineapple']
quantities = [3, 5, 2, 3, 4]

order_fruits = {fruit: num \
                for fruit, num in zip(fruit_list, quantities)}
order_fruits
In [ ]:
order_fruits['pineapple'] = 2
order_fruits
In [ ]:
print(order_fruits.keys())
print(order_fruits.values())
In [ ]:
for fruit, amount in order_fruits.items():
    print('Buy {num} {entity}s'.format(num=amount, entity=fruit))

Functions

general patterns

In [ ]:
def my_func(var1, var2, default_var1=0, default_var2 = False):
    """
    This is a generic example of python a function.
    You can see this string when do call: my_func?
    """
    #do something with vars
    if not default_var2:
        result = var1
    elif default_var1 == 0:
        result = var1
    else:
        result = var1 + var2
    return result

function is just another object (like almost everything in python)

In [ ]:
print('Function {} has the following docstring:\n{}'\
        .format(my_func.__name__, my_func.__doc__))

functions as arguments

In [ ]:
def function_over_function(func, *args, **kwargs):
    function_result = func(*args, **kwargs)
    return function_result
In [ ]:
function_over_function(my_func, 3, 5, default_var1=1, default_var2=True)

lambda evaluation

In [ ]:
function_over_function(lambda x, y, factor=10: (x+y)*factor, 1, 2, 5)

Don't assign lambda expressions to variables. If you need named instance - create standard function with def

In [ ]:
my_simple_func = lambda x: x+1

vs

In [ ]:
def my_simple_func(x):
    return x + 1

Numpy - scientific computing

Building matrices and vectors

In [ ]:
import numpy as np
In [ ]:
matrix_from_list = np.array([[1, 3, 4],
                             [2, 0, 5],
                             [4, 4, 1],
                             [0, 1, 0]])

vector_from_list = np.array([2, 1, 3])

print('The matrix is\n{matrix}\n\nthe vector is\n{vector}'\
        .format(vector=vector_from_list, matrix=matrix_from_list))

Basic manipulations

matvec

In [ ]:
matrix_from_list.dot(vector_from_list)
In [ ]:
matrix_from_list @ vector_from_list

broadcasting

In [ ]:
matrix_from_list + vector_from_list
In [ ]:
i = np.arange(4)
j = np.arange(4)
mat = i[:, None] - j[None, :]
mat

forcing dtype

In [ ]:
single_precision_vector = np.array([1, 3, 5, 2], dtype=np.float32)
single_precision_vector.dtype

converting dtypes

In [ ]:
vector_from_list.dtype
In [ ]:
vector_from_list.astype(np.int16)

shapes (singletons)

mind dimensionality!

In [ ]:
row_vector = np.array([[1,2,3]])

print('New vector {} has dimensionality {}'\
        .format(row_vector, row_vector.shape))

print('The dot-product is: ', matrix_from_list.dot(row_vector))
In [ ]:
singleton_vector = row_vector.squeeze()
print('Squeezed vector {} has shape {}'.format(singleton_vector, singleton_vector.shape))
In [ ]:
matrix_from_list.dot(singleton_vector)

adding new dimension

In [ ]:
print(singleton_vector[:, np.newaxis])
In [ ]:
print(singleton_vector[:, None])

Indexing, slicing

In [ ]:
vector12 = np.arange(12)
vector12

Guess what is the output:

vector12[:3]
vector12[-1]
vector12[:-2]
vector12[3:7]
vector12[::2]
vector12[::-1]
In [ ]:
matrix43 = vector12.reshape(4, 3)
matrix43

Guess what is the output:

matrix43[:, 0]
matrix43[-1, :]
matrix43[::2, :]
matrix43[:3, :-1]
matrix43[3:, 1]

Unlike Matlab, numpy arrays are column-major (or C-major) by default, not row-major (or F-major).

View vs Copy

Working with views is more efficient and is a preferred way.

view is returned whenever basic slicing is used

more details at http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

making copy is simple:

In [ ]:
matrix43_copy = matrix43[:]

Reshaping

In [ ]:
matrix_to_reshape = np.random.randint(10, 99, size=(6, 4))
matrix_to_reshape
In [ ]:
reshaped_matrix = matrix_to_reshape.reshape(8, 3)
reshaped_matrix

reshape always returns view!

Boolean indexing

In [ ]:
idx = matrix43 > 4
matrix43[idx]

Useful numpy functions

eye, ones, zeros, diag

Example: Build three-diagonal matrix with -2's on main diagonal and 1's and subdiagonals

Is this code valid?

In [ ]:
def three_diagonal(N):
    A = np.zeros((N, N), dtype=np.int)
    for i in range(N):
        A[i, i] = -2
        if i > 0:
            A[i, i-1] = 1
        if i < N-1:
            A[i, i+1] = 1
    return A

print(three_diagonal(5))

Let's rewrite it in numpy!

In [ ]:
def numpy_three_diagonal(N):
    main_diagonal = -2 * np.eye(N)
    
    suddiag_value = np.ones(N-1,)
    lower_subdiag = np.diag(suddiag_value, k=-1)
    upper_subdiag = np.diag(suddiag_value, k=1)
    
    result = main_diagonal + lower_subdiag + upper_subdiag
    return result.astype(np.int)

numpy_three_diagonal(5)

reducers: sum, mean, max, min, all, any

In [ ]:
A = numpy_three_diagonal(5)
A[0, -1] = 5
A[-1, 0] = 3

print(A)
print(A.sum())
print(A.min())
print(A.max(axis=0))
print(A.sum(axis=0))
print(A.mean(axis=1))
print((A > 4).any(axis=1))

numpy math functions

In [ ]:
print(np.pi)
In [ ]:
args = np.arange(0, 2.5*np.pi, 0.5*np.pi)
In [ ]:
print(np.sin(args))
In [ ]:
print(np.round(np.sin(args), decimals=2))

Meshes

linspace, meshgrid

Let's produce a function $$ f(x, y) = \sin(x+y) $$ on some mesh.

In [ ]:
linear_index = np.linspace(0, np.pi, 10, endpoint=True)
mesh_x, mesh_y = np.meshgrid(linear_index, linear_index)

values_3D = np.sin(mesh_x + mesh_y)
In [ ]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

fig = plt.figure(figsize=(10,6))
ax = fig.gca(projection='3d')

ax.plot_wireframe(mesh_x, mesh_y, values_3D)
ax.view_init(azim=-45, elev=30)

plt.title('The plot of $f(x, y) = sin(x+y)$')

Scipy - scientific computing 2

Building sparse matrix

In [ ]:
import scipy.sparse as sp
In [ ]:
def scipy_three_diagonal(N):
    main_diagonal = -2 * np.ones(N, )
    suddiag_values = np.ones(N-1,)
    
    diagonals = [main_diagonal, suddiag_values, suddiag_values]
    # Another option: use sp.eye(N) and add subdiagonals
    offsets = [0, 1, -1]
    
    result = sp.diags(diagonals, offsets, shape=(N, N), format='coo')
    return result

my_sparse_matrix = scipy_three_diagonal(5)

How does scipy represent sparse matrix?

In [ ]:
my_sparse_matrix

Sparse matrix stores only non-zero elements (and their indices)

In [ ]:
print(my_sparse_matrix)

Restoring full matrix

In [ ]:
my_sparse_matrix.toarray()
In [ ]:
my_sparse_matrix.A
In [ ]:
from scipy.linalg import toeplitz, hankel
In [ ]:
hankel(range(4), [-1, -2, -3, -4])
In [ ]:
toeplitz(range(4))

Timing - measuring performance

Simplest way to measure time

In [ ]:
N = 100
%timeit three_diagonal(N)
%timeit numpy_three_diagonal(N)
%timeit scipy_three_diagonal(N)

You can also use %%timeit magic to measure run time of the whole cell

In [ ]:
%%timeit
N = 100
calc = three_diagonal(N)
calc = scipy_three_diagonal(N)
del calc

Storing timings in a separate variable

Avoid using time.time() or time.clock() directly as their behaviour's different depending on platform; default_timer makes the best choice for you. It measures wall time though, e.g. not very precise.

In [ ]:
from timeit import default_timer as timer
In [ ]:
dims = [300, 1000, 2000, 3000]
bench_names = ['loop', 'numpy', 'scipy']
timings = {bench:[] for bench in bench_names}

for n in dims:
    start_time = timer()
    calc = three_diagonal(n)
    time_delta = timer() - start_time
    timings['loop'].append(time_delta)
    
    start_time = timer()
    calc = numpy_three_diagonal(n)
    time_delta = timer() - start_time
    timings['numpy'].append(time_delta)
    
    start_time = timer()
    calc = scipy_three_diagonal(n)
    time_delta = timer() - start_time
    timings['scipy'].append(time_delta)

Let's make the code less redundant

In [ ]:
dims = [300, 1000, 2000, 3000]
bench_names = ['loop', 'numpy', 'scipy']
timings = {bench_name: [] for bench_name in bench_names}

def timing_machine(func, *args, **kwargs):
    start_time = timer()
    result = func(*args, **kwargs)
    time_delta = timer() - start_time
    return time_delta

for n in dims:
    timings['loop'].append(timing_machine(three_diagonal, n))
    timings['numpy'].append(timing_machine(numpy_three_diagonal, n))
    timings['scipy'].append(timing_machine(scipy_three_diagonal, n))

Matplotlib - plotting in python

Configuring matplotlib

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline 

%matplotlib inline ensures all graphs are plotted inside your notebook

Global controls

In [ ]:
# plt.rcParams.update({'axes.labelsize': 'large'})
plt.rcParams.update({'font.size': 14})

Combined plot

In [ ]:
plt.figure(figsize=(10,6))

for bench_name, values in timings.items():
    plt.semilogy(dims, values, label=bench_name)
    
plt.legend(loc='best')
plt.title('Benchmarking results', y=1.03)
plt.xlabel('Matrix dimension size')
plt.ylabel('Time, s')

Think, why:

  • "loop" was faster then "numpy"
  • "scipy" is almost constant
  • results for default_timer and "best of timeit" are different

You might want to read the docs:

Remark: starting from python 3.3 it's recommended to use time.perf_counter() and time.process_time() https://docs.python.org/3/library/time.html#time.perf_counter

Also note, that for advanced benchmarking it's better to use profiling tools.

Combined plot "one-liner"

In [ ]:
k = len(timings)
iter_xyf = [item for sublist in zip([dims]*k,
                                    timings.values(),
                                    list('rgb'))\
                                for item in sublist]

plt.figure(figsize=(10, 6))
plt.semilogy(*iter_xyf)

plt.legend(timings.keys(), loc=2, frameon=False)
plt.title('Benchmarking results - "one-liner"', y=1.03)
plt.xlabel('Matrix dimension size')
plt.ylabel('Time, s')

Even simpler way - also gives you granular control on plot objects

In [ ]:
plt.figure(figsize=(10, 4))

figs = [plt.semilogy(dims, values, label=bench_name)\
        for bench_name, values in timings.items()];

ax0, = figs[0]
ax0.set_dashes([5, 10, 20, 10, 5, 10])

ax1, = figs[1]
ax1.set_marker('s')
ax1.set_markerfacecolor('r')

ax2, = figs[2]
ax2.set_linewidth(6)
ax2.set_alpha(0.3)
ax2.set_color('m')

Plot formatting

matplotlib has a number of different options for styling your plot

In [ ]:
all_markers = [
'.', # point
',', # pixel
'o', # circle
'v', # triangle down
'^', # triangle up
'<', # triangle_left
'>', # triangle_right
'1', # tri_down
'2', # tri_up
'3', # tri_left
'4', # tri_right
'8', # octagon
's', # square
'p', # pentagon
'*', # star
'h', # hexagon1
'H', # hexagon2
'+', # plus
'x', # x
'D', # diamond
'd', # thin_diamond
'|', # vline
]

all_linestyles = [
'-',  # solid line style
'--', # dashed line style
'-.', # dash-dot line style
':',  # dotted line style
'None'# no line
]

all_colors = [
'b', # blue
'g', # green
'r', # red
'c', # cyan
'm', # magenta
'y', # yellow
'k', # black
'w', # white
]

Subplots

Iterating over subplots

In [ ]:
n = len(timings)
experiment_names = list(timings.keys())

fig, axes = plt.subplots(1, n, sharey=True, figsize=(16,4))

colors = np.random.choice(list('rgbcmyk'), n, replace=False)
markers = np.random.choice(all_markers, n, replace=False)
lines = np.random.choice(all_linestyles, n, replace=False)

for ax_num, ax in enumerate(axes):
    key = experiment_names[ax_num]
    ax.semilogy(dims, timings[key], label=key,
            color=colors[ax_num],
            marker=markers[ax_num],
            markersize=8,
            linestyle=lines[ax_num],
            lw=3)
    ax.set_xlabel('matrix dimension')
    ax.set_title(key)

axes[0].set_ylabel('Time, s')
plt.suptitle('Benchmarking results', fontsize=16,  y=1.03)

Solutions

Task 1

In [ ]:
items = ['foo', 'bar', 'baz', 'foo', 'baz', 'bar']
In [ ]:
from collections import defaultdict

item_ids = defaultdict(lambda: len(item_ids))
list(map(item_ids.__getitem__, items))
In [ ]:
import pandas as pd

pd.DataFrame({'items': items}).groupby('items', sort=False).grouper.group_info[0]
In [ ]:
import numpy as np

np.unique(items, return_inverse=True)[1]

Task 2

In [ ]:
N = 1000
In [ ]:
from itertools import permutations

%timeit list(permutations(range(N), 2))

Hankel matrix: $a_{ij} = a_{i-1, j+1}$

In [ ]:
import numpy as np
from scipy.linalg import hankel

def pairs_idx(n):
    return np.vstack((np.repeat(range(n), n-1), hankel(range(1, n), range(-1, n-1)).ravel()))
In [ ]:
%timeit pairs_idx(N)