- In the lecture we used the `system.time()` function to analyse function performance
- While conveniently built-in, its main drawback is that it's rather coarse
- It is useful for detecting large performance gaps, but it often doesn't capture more subtle differences
- The reason is that it runs the code only once and reports seconds as its unit of measurement
- Here we will use the `microbenchmark` package and its identically named function to time function calls
- Remember to print out the results of `microbenchmark()`, otherwise the times of the individual runs are returned

In [2]:

```
library("microbenchmark")
```

- Consider the data frame with 50 different variables below
- We want to know the mean of each variable in the data frame
- There are 2 principal ways of calculating them
- One is using the `apply()` function (as in Week 9)
- The other is using the built-in `colMeans()` function
- Apply each of those functions to calculate the means
- Benchmark the time they took to run using `system.time()` and `microbenchmark()`
- What do you find?

In [3]:

```
set.seed(2021)
# Here we create a data frame of 1000 observations of 50 variables
# where each variable is a random draw from a normal distribution with mean
# drawn from a uniform distribution between 0 and 10 and standard deviation 1
dat <- data.frame(mapply(
function(x) cbind(rnorm(n = 1000, mean = x, sd = 1)),
runif(n = 50, min = 0, max = 10)
))
```

In [4]:

```
dim(dat)
```

[1] 1000 50

- It is possible to measure the timing of operations in Python with the built-in `time` module
- But that would require recording the time before and after a call and then taking the difference
- Python's built-in `timeit` module provides a better alternative, as it does this automatically and more
- It behaves similarly to `microbenchmark` in R in that it averages over many runs
- It is also available in IPython (and, as a result, in Jupyter) as a magic command that can be called with `%timeit`
- In order to continue with the Python part of the exercises you can switch your kernel
- Go to `Kernel`, `Change kernel` and pick Python from the drop-down menu
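As an illustration of what the `%timeit` magic automates, the `timeit` module can also be called directly from plain Python (a minimal sketch; the statement being timed here is arbitrary):

```python
import timeit

# timeit.timeit() runs the statement repeatedly and returns
# the *total* elapsed time in seconds across all runs
total = timeit.timeit("sum(range(100))", number=10_000)

# Divide by the number of runs to get the per-loop time
print(f"{total / 10_000:.3e} s per loop")
```

Unlike recording `time.time()` before and after a single call, repeating the statement many times averages out random fluctuations between runs.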

In [1]:

```
import random
import numpy as np
import pandas as pd
```

In [2]:

```
# Random numbers in Python can be generated either with
# the built-in `random` module or with the external `numpy`
# module (which underlies many `pandas` operations)
random.gauss(mu = 0, sigma = 1)
```

Out[2]:

-0.7261368325293743

In [3]:

```
# Unlike `random.gauss()`, `numpy` returns an array rather than a plain float
np.random.randn(1)
```

Out[3]:

array([-0.88354514])

In [4]:

```
# Let's start our benchmarking experiments from looking
# at random number generation in Python.
# First let's draw a sample of 1M using both built-in `random` module
# And `numpy`'s methods
```

In [5]:

```
N = 1000000
```

In [6]:

```
# We can use the `for _` expression to indicate that the returned value is being discarded
%timeit [random.gauss(mu = 0, sigma = 1) for _ in range(N)]
```

358 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]:

```
# `numpy` is an order of magnitude faster than the built-in module
%timeit np.random.normal(size = N)
```

18.2 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

- Now let's replicate the calculation of some summary statistics in a `pandas` DataFrame
- As in the case of R, there are 2 principal ways of doing this
- The first is iterating over the columns of a data set with a list comprehension
- And applying some function to each of the columns (e.g. `mean()` from the `statistics` module)
- Alternatively, you can apply one of the built-in statistical summary methods (check Week 5 for a list)
- Apply each of those approaches to the data frame below
- How do these two approaches compare?

In [8]:

```
from statistics import mean
```

In [9]:

```
# Here we are, essentially, replicating the process of data frame creation as in R above:
# each variable is a random draw from a normal distribution with mean
# drawn from a uniform distribution between 0 and 10 and standard deviation 1
dat2 = pd.DataFrame(np.concatenate([
np.random.normal(loc = x, scale = 1, size = (1000, 1))
for x
in np.random.uniform(low = 0, high = 10, size = 50)
], axis = 1))
```

In [10]:

```
dat2.shape
```

Out[10]:

(1000, 50)
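The two approaches can be sketched as follows on a small illustrative frame (a hedged sketch, not a solution to the benchmark itself; the column names and sizes here are arbitrary). To compare their speed, wrap each line in `%timeit`:

```python
import numpy as np
import pandas as pd
from statistics import mean

# A small example frame: 100 rows, 3 normally distributed columns
rng = np.random.default_rng(2021)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# Approach 1: list comprehension over columns,
# applying `statistics.mean` to each one
means_loop = [mean(df[col]) for col in df.columns]

# Approach 2: the built-in vectorised summary method
means_builtin = df.mean()

# Both give the same values (up to floating point)
print(means_loop)
print(means_builtin)
```

The first approach calls a pure-Python function once per column and iterates over each column's values in Python, while the second delegates the whole computation to compiled `numpy` code.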

- Practice order-of-growth calculations, benchmarking and data wrangling
- Due at 11:00 on Monday, 29th November