Nikita Kozodoi's blog on AI, ML and other cool acronyms (https://kozodoi.me)

# Implementing PCA from Scratch

Published: 2023-03-26. https://kozodoi.me/blog/20230326/pca-from-scratch

Last update: 26.03.2023. All opinions are my own.

# 1. Overview

This blog post provides a tutorial on implementing the Principal Component Analysis algorithm using Python and NumPy. We will set up a simple class object, implement relevant methods to perform the decomposition, and illustrate how it works on a toy dataset.

Why are we implementing PCA from scratch if the algorithm is already available in `scikit-learn`? First, coding something from scratch is the best way to understand it. You may know many ML algorithms, but being able to write it down indicates that you have really mastered it. Second, implementing algorithms from scratch is a common task in ML interviews in tech companies, which makes it a useful skill that a job candidate should practice. Last but not least, it's a fun exercise, right? :)

This post is part of "ML from Scratch" series, where we implement established ML algorithms in Python. Check out other posts to see further implementations.

# 2. How PCA works

Before jumping to implementation, let's quickly refresh our minds. How does PCA work?

PCA is a popular unsupervised algorithm used for dimensionality reduction. In a nutshell, PCA helps you to reduce the number of features in your dataset by combining them without losing too much information. More specifically, PCA finds a linear transformation that projects the data into a new coordinate system with fewer dimensions. To capture the most variation in the original data, this projection is done by finding the so-called principal components - eigenvectors of the data's covariance matrix - and multiplying the centered data matrix with a subset of the components. This procedure is what we are going to implement.

P.S. If you need a more detailed summary of how PCA works, check out this Wiki page.

# 3. Implementing PCA

Let's start the implementation! The only library we need to import is `numpy`:

```
import numpy as np
```

In line with object-oriented programming practices, we will implement PCA as a class with a set of methods. We will need the following three:

1. `__init__()`: initialize the class object.
2. `fit()`: center the data and identify principal components.
3. `transform()`: transform new data into the identified components.

Let's sketch a class object template. Since we implement functions as class methods, we include `self` argument for each method:

```
class PCA:

    def __init__(self):
        pass

    def fit(self):
        """
        Find principal components
        """
        pass

    def transform(self):
        """
        Transform new data
        """
        pass
```

Now let's go through each method one by one.

The `__init__()` method is run once, when we initialize the PCA class object.

One thing we need to do on the initialization step is to store the meta-parameters of our algorithm. For PCA, there is only one meta-parameter we will specify: the number of components. We will save it as `self.num_components`.

Apart from the meta-parameters, we will create three placeholders that we will use to store important class attributes:

• `self.components`: array with the principal component weights
• `self.mean`: mean variable values observed in the training data
• `self.variance_share`: proportion of variance explained by principal components

```
def __init__(self, num_components):
    self.num_components = num_components
    self.components     = None
    self.mean           = None
    self.variance_share = None
```

Next, let's implement the `fit()` method - the heart of our PCA class. This method will be applied to a provided dataset to identify and memorize principal components.

We will do the following steps:

1. Center the data by subtracting the mean values for each variable. Centering is important to make sure that each variable's contribution to the data variation does not depend on that variable's offset. We will also memorize the mean values as `self.mean`, as we will need them later for the data transformation.
2. Calculate eigenvectors of the covariance matrix. First, we will use `np.cov()` to get the covariance matrix of the data. Next, we will leverage `np.linalg.eig()` to do the eigenvalue decomposition and obtain both eigenvalues and eigenvectors.
3. Sort eigenvalues and eigenvectors in the decreasing order. Since we will use a smaller number of components compared to the number of variables in the original data, we would like to focus on components that reflect more data variation. In our case, eigenvectors that correspond to larger eigenvalues capture more variation.
4. Store an array with the top `num_components` components as `self.components`.

Finally, we will calculate and memorize the data variation explained by the selected components as `self.variance_share`. This can be computed as a cumulative sum of the corresponding eigenvalues divided by the total sum of eigenvalues.
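For intuition, here is a quick numeric sketch of the variance-share formula with made-up eigenvalues (the values below are purely illustrative, not from our dataset):

```python
import numpy as np

# hypothetical eigenvalues, already sorted in decreasing order
values = np.array([5.0, 3.0, 1.0, 1.0])

# share of variance captured by the top 2 components
num_components = 2
share = values[:num_components].sum() / values.sum()
print(share)  # -> 0.8, i.e. (5 + 3) / 10
```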

```
def fit(self, X):
    """
    Find principal components
    """

    # data centering (copy to avoid mutating the caller's array)
    self.mean = np.mean(X, axis = 0)
    X         = X - self.mean

    # calculate eigenvalues & vectors
    cov_matrix      = np.cov(X.T)
    values, vectors = np.linalg.eig(cov_matrix)

    # sort eigenvalues & vectors
    sort_idx = np.argsort(values)[::-1]
    values   = values[sort_idx]
    vectors  = vectors[:, sort_idx]

    # store principal components (eigenvectors are columns, so slice columns
    # and transpose to keep components as rows) & explained variance share
    self.components     = vectors[:, :self.num_components].T
    self.variance_share = np.sum(values[:self.num_components]) / np.sum(values)
```

The most difficult part is over! Last but not least, we will implement a method to perform the data transformation.

This will be run after calling the `fit()` method on the training data, so we only need to implement two steps:

1. Center the new data using the same mean values that we used on the fitting stage.
2. Multiply the data matrix with the matrix of the selected components. Note that we will need to transpose the components matrix to ensure the right dimensionality.

```
def transform(self, X):
    """
    Transform data
    """

    # data centering (using the mean memorized during fit)
    X = X - self.mean

    # decomposition
    return np.dot(X, self.components.T)
```

Putting everything together, this is what our implementation looks like:

```
class PCA:

    def __init__(self, num_components):
        self.num_components = num_components
        self.components     = None
        self.mean           = None
        self.variance_share = None

    def fit(self, X):
        """
        Find principal components
        """

        # data centering
        self.mean = np.mean(X, axis = 0)
        X         = X - self.mean

        # calculate eigenvalues & vectors
        cov_matrix      = np.cov(X.T)
        values, vectors = np.linalg.eig(cov_matrix)

        # sort eigenvalues & vectors
        sort_idx = np.argsort(values)[::-1]
        values   = values[sort_idx]
        vectors  = vectors[:, sort_idx]

        # store principal components (as rows) & explained variance share
        self.components     = vectors[:, :self.num_components].T
        self.variance_share = np.sum(values[:self.num_components]) / np.sum(values)

    def transform(self, X):
        """
        Transform data
        """

        # data centering
        X = X - self.mean

        # decomposition
        return np.dot(X, self.components.T)
```

# 4. Testing the implementation

Now that we have our implementation, let's check whether it actually works. We will generate two toy data samples with 10 features using the `np.random` module to draw feature values from a random Normal distribution:

```
X_old = np.random.normal(loc = 0, scale = 1, size = (1000, 10))
X_new = np.random.normal(loc = 0, scale = 1, size = (500, 10))

print(X_old.shape, X_new.shape)
```

```
(1000, 10) (500, 10)
```

Now, let's instantiate our PCA class, fit it on the old data and transform both datasets!


```
# initialize PCA object
pca = PCA(num_components = 8)

# fit PCA on old data
pca.fit(X_old)

# check explained variance
print(f"Explained variance: {pca.variance_share:.4f}")
```

```
Explained variance: 0.8325
```

Eight components explain more than 83% of the data variation. Not bad! Let's transform the data:

```
# transform datasets
X_old = pca.transform(X_old)
X_new = pca.transform(X_new)

print(X_old.shape, X_new.shape)
```

```
(1000, 8) (500, 8)
```

Yay! Everything works as expected. The new datasets have eight features instead of the original ten features.
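As an extra sanity check, we can verify two properties of the decomposition directly with NumPy: the eigenvectors of the covariance matrix are orthonormal, and the per-component variance shares sum to one. The sketch below is standalone and repeats the same steps as the `fit()` method on synthetic data; it uses `np.linalg.eigh` instead of `eig` since the covariance matrix is symmetric, which is numerically more stable:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size = (1000, 10))

# same steps as fit(): center, decompose, sort in decreasing order
Xc = X - X.mean(axis = 0)
values, vectors = np.linalg.eigh(np.cov(Xc.T))  # eigh: for symmetric matrices
sort_idx = np.argsort(values)[::-1]
values, vectors = values[sort_idx], vectors[:, sort_idx]

# eigenvectors are orthonormal: V.T @ V is the identity matrix
assert np.allclose(vectors.T @ vectors, np.eye(10))

# variance shares of all components sum to one
assert np.isclose((values / values.sum()).sum(), 1.0)
```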

# 5. Closing words

This is it! I hope this tutorial helps you to refresh your memory on how PCA works and gives you a good idea on how to implement it yourself. You are now well-equipped to do this exercise on your own!

If you liked this tutorial, feel free to share it on social media and buy me a coffee :) Don't forget to check out other posts in the "ML from Scratch" series. Happy learning!

# Implementing KNN from Scratch

Published: 2023-03-19. https://kozodoi.me/blog/20230319/knn-from-scratch

Last update: 26.03.2023. All opinions are my own.

# 1. Overview

This blog post provides a tutorial on implementing the K Nearest Neighbors algorithm using Python and NumPy. We will set up a simple class object, implement relevant methods to perform the prediction, and illustrate how it works on a toy dataset.

Why are we implementing KNN from scratch if the algorithm is already available in `scikit-learn`? First, coding something from scratch is the best way to understand it. You may know many ML algorithms, but being able to write it down indicates that you have really mastered it. Second, implementing algorithms from scratch is a common task in ML interviews in tech companies, which makes it a useful skill that a job candidate should practice. Last but not least, it's a fun exercise, right? :)

This post is part of "ML from Scratch" series, where we implement established ML algorithms in Python. Check out other posts to see further implementations.

# 2. How KNN works

Before we jump to the implementation, let's quickly refresh our minds. How does KNN work?

KNN is one of the so-called lazy algorithms, which means that there is no actual training step. Instead, KNN memorizes the training data by storing the feature values of training examples. Given a new example to be predicted, KNN calculates distances between the new example and each of the examples in the training set. The prediction returned by the KNN algorithm is simply the average value of the target variable across the K nearest neighbors of the new example.
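The whole prediction idea fits in a few lines of NumPy. This is a hypothetical one-dimensional illustration for intuition, not the implementation we build below:

```python
import numpy as np

# toy 1-D training data: three points near 0 labeled 0, two near 10 labeled 1
X_train = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
y_train = np.array([0, 0, 0, 1, 1])

# predict a new point as the mean label of its K = 3 nearest neighbors
x_new = 1.5
nearest = np.argsort(np.abs(X_train - x_new))[:3]
prediction = y_train[nearest].mean()  # -> 0.0, all 3 neighbors have label 0
```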

P.S. If you need a more detailed summary of how KNN works, check out this Wiki page.

# 3. Implementing KNN

Let's start the implementation! The only library we need to import is `numpy`:

```
import numpy as np
```

In line with object-oriented programming practices, we will implement KNN as a class with a set of methods. We will need the following five:

1. `__init__()`: initialize the class object.
2. `fit()`: memorize the training data and store it as a class variable.
3. `predict()`: predict label for a new example.
4. `get_distance()`: helper function to calculate distance between two examples.
5. `get_neighbors()`: helper function to find and rank neighbors by distance.

The last two functions are optional: we can implement the logic inside the `predict()` method, but it will be easier to split the steps.

Let's sketch a class object template. Since we implement functions as class methods, we include `self` argument for each method:

```
class KNN:

    def __init__(self):
        pass

    def fit(self):
        """
        Memorize training data
        """
        pass

    def predict(self):
        """
        Predict labels
        """
        pass

    def get_distance(self):
        """
        Calculate distance between two examples
        """
        pass

    def get_neighbors(self):
        """
        Find nearest neighbors
        """
        pass
```

Now let's go through each method one by one.

The `__init__()` method is run once, when we initialize the KNN class object. The only thing we need to do on the initialization step is to store the meta-parameters of our algorithm. For KNN, there is only one key meta-parameter we specify: the number of neighbors. We will save it as `self.num_neighbors`:

```
def __init__(self, num_neighbors: int = 5):
    self.num_neighbors = num_neighbors
```

Next, let's implement the `fit()` method. As we mentioned above, on the training stage, KNN needs to memorize the training data. To simplify further calculations, we will provide the input data as two `numpy` arrays: features saved as `self.X` and labels saved as `self.y`:

```
def fit(self, X: np.array, y: np.array):
    """
    Memorize training data
    """
    self.X = X
    self.y = y
```

Now, let's write down a helper function to calculate the distance between two examples, which are two `numpy` arrays with feature values. For simplicity, we will assume that all features are numeric. One of the most commonly used distance metrics is the Euclidean distance, which is calculated as the square root of the sum of the squared differences between feature values. If the last sentence sounds complicated, here is how simple it looks in Python:

```
def get_distance(self, a: np.array, b: np.array):
    """
    Calculate Euclidean distance between two examples
    """
    return np.sum((a - b) ** 2) ** 0.5
```
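As a quick standalone sanity check (not part of the class), this formula agrees with NumPy's built-in Euclidean norm:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# square root of the sum of squared differences: sqrt(9 + 16 + 0) = 5
dist = np.sum((a - b) ** 2) ** 0.5
assert np.isclose(dist, np.linalg.norm(a - b))  # both give 5.0
```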

Now we are getting to the most difficult part of the KNN implementation! Below, we will write a helper function that finds nearest neighbors for a given example. For that, we will do several steps:

1. Calculate distance between the provided example and each example in the memorized dataset `self.X`.
2. Sort examples in `self.X` by their distances to the provided example.
3. Return indices of the nearest neighbors based on the `self.num_neighbors` meta-parameter.

For step 1, we will leverage the `get_distance()` function defined above. The trick to implementing step 2 is to save a tuple (example ID, distance) when going through the training data. This will allow us to sort the examples by distance and return the relevant IDs at the same time:

```
def get_neighbors(self, example: np.array):
    """
    Find and rank nearest neighbors of example
    """

    # placeholder
    distances = []

    # calculate distances as tuples (id, distance)
    for i in range(len(self.X)):
        distances.append((i, self.get_distance(self.X[i], example)))

    # sort by distance
    distances.sort(key = lambda x: x[1])

    # return IDs and distances of top neighbors
    return distances[:self.num_neighbors]
```

The final step is to do the prediction! For this purpose, we implement the `predict()` method that expects a new dataset as a `numpy` array and provides an array with predictions. For each example in the new dataset, the method will go through its nearest neighbors identified using the `get_neighbors()` helper, and average labels across the neighbors. That's it!

```
def predict(self, X: np.array):
    """
    Predict labels
    """

    # placeholder
    predictions = []

    # go through examples
    for idx in range(len(X)):
        example     = X[idx]
        k_neighbors = self.get_neighbors(example)
        k_y_values  = [self.y[i] for i, _ in k_neighbors]
        prediction  = sum(k_y_values) / self.num_neighbors
        predictions.append(prediction)

    # return predictions
    return np.array(predictions)
```

Putting everything together, this is what our implementation looks like:

```
### END-TO-END KNN CLASS

class KNN:

    def __init__(self, num_neighbors: int = 5):
        self.num_neighbors = num_neighbors

    def fit(self, X: np.array, y: np.array):
        """
        Memorize training data
        """
        self.X = X
        self.y = y

    def get_distance(self, a: np.array, b: np.array):
        """
        Calculate Euclidean distance between two examples
        """
        return np.sum((a - b) ** 2) ** 0.5

    def get_neighbors(self, example: np.array):
        """
        Find and rank nearest neighbors of example
        """

        # placeholder
        distances = []

        # calculate distances as tuples (id, distance)
        for i in range(len(self.X)):
            distances.append((i, self.get_distance(self.X[i], example)))

        # sort by distance
        distances.sort(key = lambda x: x[1])

        # return IDs and distances of top neighbors
        return distances[:self.num_neighbors]

    def predict(self, X: np.array):
        """
        Predict labels
        """

        # placeholder
        predictions = []

        # go through examples
        for idx in range(len(X)):
            example     = X[idx]
            k_neighbors = self.get_neighbors(example)
            k_y_values  = [self.y[i] for i, _ in k_neighbors]
            prediction  = sum(k_y_values) / self.num_neighbors
            predictions.append(prediction)

        # return predictions
        return np.array(predictions)
```

# 4. Testing the implementation

Now that we have our implementation, let's check whether it actually works. We will generate toy data using `numpy`. The `gen_data()` function below uses the `np.random` module to draw feature values from a random Normal distribution and assign a 0/1 label.

```
### HELPER FUNCTION

def gen_data(
    mu: float = 0,
    sigma: float = 1,
    y: int = 0,
    size: tuple = (1000, 10),
):
    """
    Generate random data
    """
    X = np.random.normal(loc = mu, scale = sigma, size = size)
    y = np.repeat(y, repeats = size[0])

    return X, y
```

To simulate a simple ML problem, we will generate a dataset consisting of two samples:

1. 30 examples with a mean feature value of 1 and a label of 0.
2. 20 examples with a mean feature value of 5 and a label of 1.

```
### TOY DATA GENERATION

X0, y0 = gen_data(mu = 1, sigma = 3, y = 0, size = (30, 10))
X1, y1 = gen_data(mu = 5, sigma = 3, y = 1, size = (20, 10))
X = np.concatenate((X0, X1), axis = 0)
y = np.concatenate((y0, y1), axis = 0)
```

Now, let's instantiate our KNN class, fit it on the training data and provide predictions for some new examples!

To see if the algorithm works properly, we will generate four new examples as `X_new`, gradually increasing the feature values from 1 to 5. We expect the label predicted by KNN to increase from 0 to 1, since we are getting closer to examples in `X1`. Let's check!

```
### PREDICTION

# fit KNN
clf = KNN(num_neighbors = 5)
clf.fit(X, y)

# generate new examples
X_new = np.stack((
    np.repeat(1, 10),
    np.repeat(2, 10),
    np.repeat(4, 10),
    np.repeat(5, 10),
))

# predict new examples
clf.predict(X_new)
```

`array([0. , 0.2, 0.8, 1. ])`

Yay! Everything works as expected. Our KNN algorithm provides four predictions for the new examples, and the prediction goes up with the increase in feature values. Our job is done!
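A side note on performance: the Python loop inside `get_neighbors()` is easy to follow, but for larger datasets the pairwise distances can be computed in a single vectorized broadcasting step. A minimal standalone sketch, assuming numeric features as above:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size = (50, 10))
x_new   = rng.normal(size = (10,))

# distances from x_new to every training example, without a Python loop
dists = np.sqrt(((X_train - x_new) ** 2).sum(axis = 1))

# indices of the 5 nearest neighbors
nearest = np.argsort(dists)[:5]
assert dists[nearest[0]] == dists.min()
```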

# 5. Closing words

This is it! I hope this tutorial helps you to refresh your memory on how KNN works and gives you a good idea on how to implement it yourself. You are now well-equipped to do this exercise on your own!

If you liked this tutorial, feel free to share it on social media and buy me a coffee :) Don't forget to check out other posts in the "ML from Scratch" series. Happy learning!

# Layer-Wise Learning Rate in PyTorch

Published: 2022-03-29. https://kozodoi.me/blog/20220329/discriminative-lr

Last update: 29.03.2022. All opinions are my own.

# 1. Overview

In deep learning tasks, we often use transfer learning to take advantage of the available pre-trained models. Fine-tuning such models is a careful process. On the one hand, we want to adjust the model to the new data set. On the other hand, we also want to retain and leverage as much knowledge learned during pre-training as possible.

Discriminative learning rate is one of the tricks that can help us guide fine-tuning. By using lower learning rates on deeper layers of the network, we make sure we are not tampering too much with the model blocks that have already learned general patterns, and concentrate fine-tuning on further layers.

This blog post provides a tutorial on implementing discriminative layer-wise learning rates in `PyTorch`. We will see how to specify individual learning rates for each of the model parameter blocks and set up the training process.

# 2. Implementation

The implementation of layer-wise learning rates is rather straightforward. It consists of three simple steps:

1. Identifying a list of trainable layers in the neural net.
2. Setting up a list of model parameter blocks together with the corresponding learning rates.
3. Supplying the list with this information to the model optimizer.

Let's go through each of these steps one by one and see how it works!

## 2.1. Identifying network layers

The first step in our journey is to instantiate a model and retrieve the list of its layers. This step is essential to figure out how exactly to adjust the learning rate as we go through different parts of the network.

As an example, we will load one of the CNNs from the `timm` library and print out its parameter groups by iterating through `model.named_parameters()` and saving their names in a list called `layer_names`. Note that the framework discussed in this post is model-agnostic. It will work with any architecture, including CNNs, RNNs and transformers.

```
# instantiate model
import timm
model = timm.create_model('resnet18', num_classes = 2)

# save layer names
layer_names = []
for idx, (name, param) in enumerate(model.named_parameters()):
    layer_names.append(name)
    print(f'{idx}: {name}')
```

```
0: conv1.weight
1: bn1.weight
2: bn1.bias
3: layer1.0.conv1.weight
4: layer1.0.bn1.weight
5: layer1.0.bn1.bias
...
58: layer4.1.bn2.weight
59: layer4.1.bn2.bias
60: fc.weight
61: fc.bias
```

As the output suggests, our model has 62 parameter groups. When doing a forward pass, an image is fed to the first convolutional layer named `conv1`, whose parameters are stored as `conv1.weight`. Next, the output travels through the batch normalization layer `bn1`, which has weights and biases stored as `bn1.weight` and `bn1.bias`. From that point, the output goes through the network blocks grouped into four big chunks labeled as `layer1`, ..., `layer4`. Finally, extracted features are fed into the fully connected part of the network denoted as `fc`.

In the cell below, we reverse the list of parameter group names to have the deepest layer in the end of the list. This will be useful on the next step.

```
# reverse layers
layer_names.reverse()
layer_names[0:5]
```

```
['fc.bias',
 'fc.weight',
 'layer4.1.bn2.bias',
 'layer4.1.bn2.weight',
 'layer4.1.conv2.weight']
```

## 2.2. Specifying learning rates

Knowing the architecture of our network, we can reason about the appropriate learning rates.

There is some flexibility in how to approach this step. The key idea is to gradually reduce the learning rate when going deeper into the network. The first layers should already have a pretty good understanding of general domain-agnostic patterns after pre-training. In a computer vision setting, the first layers may have learned to distinguish simple shapes and edges; in natural language processing, the first layers may be responsible for general word relationships. We don't want to update parameters on the first layers too much, so it makes sense to reduce the corresponding learning rates. In contrast, we would like to set a higher learning rate for the final layers, especially for the fully-connected classifier part of the network. Those layers usually focus on domain-specific information and need to be trained on new data.

The easiest approach to incorporate this logic is to incrementally reduce the learning rate when going deeper into the network. Let's simply multiply it by a certain coefficient between 0 and 1 after each parameter group. In our example, this gives us 62 gradually diminishing learning rate values for the 62 model blocks.

Let's implement it in code! Below, we set up a list of dictionaries called `parameters` that stores model parameters and learning rates. We will simply go through all parameter blocks and iteratively reduce and assign the appropriate learning rate. In our example, we start with `lr = 0.01` and multiply it by `0.9` at each step. Each item in `parameters` becomes a dictionary with two elements:

• `params`: tensor with the model parameters
• `lr`: corresponding learning rate

```
# learning rate
lr      = 1e-2
lr_mult = 0.9

# placeholder
parameters = []

# store params & learning rates
for idx, name in enumerate(layer_names):

    # display info
    print(f'{idx}: lr = {lr:.6f}, {name}')

    # append layer parameters
    parameters += [{'params': [p for n, p in model.named_parameters() if n == name and p.requires_grad],
                    'lr':     lr}]

    # update learning rate
    lr *= lr_mult
```

```
0: lr = 0.010000, fc.bias
1: lr = 0.009000, fc.weight
2: lr = 0.008100, layer4.1.bn2.bias
3: lr = 0.007290, layer4.1.bn2.weight
4: lr = 0.006561, layer4.1.conv2.weight
5: lr = 0.005905, layer4.1.bn1.bias
...
58: lr = 0.000022, layer1.0.conv1.weight
59: lr = 0.000020, bn1.bias
60: lr = 0.000018, bn1.weight
61: lr = 0.000016, conv1.weight
```

As you can see, we gradually reduce the learning rate from `0.01` for the bias on the classification layer down to `0.000016` on the first convolutional layer. Looks good, right?!

Well, if you look closely, you will notice that we are setting different learning rates for parameter groups from the same layer. For example, having different learning rates for `fc.bias` and `fc.weight` does not make much sense. To address that, we can decay the learning rate only when moving from one group of layers to another. The cell below provides an improved implementation.

```
#collapse-hide

# learning rate
lr      = 1e-2
lr_mult = 0.9

# placeholders
parameters      = []
prev_group_name = layer_names[0].split('.')[0]

# store params & learning rates
for idx, name in enumerate(layer_names):

    # parameter group name (top-level block, e.g. 'fc' or 'layer4')
    cur_group_name = name.split('.')[0]

    # update learning rate when moving to a new group
    if cur_group_name != prev_group_name:
        lr *= lr_mult
        prev_group_name = cur_group_name

    # display info
    print(f'{idx}: lr = {lr:.6f}, {name}')

    # append layer parameters
    parameters += [{'params': [p for n, p in model.named_parameters() if n == name and p.requires_grad],
                    'lr':     lr}]
```

```
0: lr = 0.010000, fc.bias
1: lr = 0.010000, fc.weight
2: lr = 0.009000, layer4.1.bn2.bias
3: lr = 0.009000, layer4.1.bn2.weight
4: lr = 0.009000, layer4.1.conv2.weight
5: lr = 0.009000, layer4.1.bn1.bias
...
58: lr = 0.006561, layer1.0.conv1.weight
59: lr = 0.005905, bn1.bias
60: lr = 0.005905, bn1.weight
61: lr = 0.005314, conv1.weight
```

This looks more interesting!

Note that we can become very creative in customizing the learning rates and the decay speed. There is no fixed rule that always works well. In my experience, a simple multiplicative decay with a factor between 0.9 and 1 is a good starting point. Still, the framework provides a lot of space for experimentation, so feel free to test out your ideas and see what works best on your data!
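To make the group-wise decay logic easy to test without instantiating a model, here is a standalone sketch of the same schedule on a hypothetical, shortened list of parameter names (already reversed, so the classifier comes first):

```python
# hypothetical parameter names, classifier first (for illustration only)
names = ['fc.bias', 'fc.weight', 'layer4.0.conv1.weight',
         'layer1.0.conv1.weight', 'bn1.weight', 'conv1.weight']

lr, lr_mult = 1e-2, 0.9
lrs, prev_group = [], names[0].split('.')[0]

for name in names:
    group = name.split('.')[0]     # top-level block, e.g. 'fc' or 'layer4'
    if group != prev_group:        # decay only when the block changes
        lr *= lr_mult
        prev_group = group
    lrs.append(lr)

print([round(x, 6) for x in lrs])
# -> [0.01, 0.01, 0.009, 0.0081, 0.00729, 0.006561]
```

Note how `fc.bias` and `fc.weight` share a rate, and each subsequent block gets a smaller one.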

## 2.3. Setting up the optimizer

We are almost done. The last and the easiest step is to supply our list of model parameters together with the selected learning rates to the optimizer. In the cell below, we provide `parameters` to the Adam optimizer, which is one of the most frequently used ones in the field.

Note that we don't need to supply the learning rate to `Adam()` as we have already done it in our `parameters` object. As long as individual learning rates are available, `optimizer` will prioritize them over the single learning rate supplied to the `Adam()` call.

```
# set up optimizer
import torch.optim as optim
optimizer = optim.Adam(parameters)
```

This is it! Now we can proceed to training our model as usual. When calling `optimizer.step()` inside the training loop, the optimizer will update model parameters by subtracting the gradient multiplied by the corresponding group-wise learning rates. This implies that there is no need to adjust the training loop, which usually looks something like this:

```
#collapse-hide

# loop through batches
for inputs, labels in dataloader:

    # extract inputs and labels
    inputs = inputs.to(device)
    labels = labels.to(device)

    # clear gradients from the previous step
    optimizer.zero_grad()

    # forward pass
    preds = model(inputs)
    loss  = criterion(preds, labels)

    # backward pass
    loss.backward()

    # weights update
    optimizer.step()
```

# 3. Closing words

In this post, we went through the steps of implementing a layer-wise discriminative learning rate in `PyTorch`. I hope this brief tutorial will help you set up your transfer learning pipeline and squeeze out the maximum of your pre-trained model. If you are interested, check out my other blog posts on tips on deep learning and `PyTorch`. Happy learning!


Last update: 21.11.2021. All opinions are my own.

# 1. Overview

Estimating text complexity and readability is a crucial task for teachers. Offering students text passages at the right level of challenge is important for facilitating a fast development of reading skills. The existing tools to estimate readability rely on weak proxies and heuristics. Deep learning may help to improve the accuracy of the used text complexity scores.

This blog post overviews an interactive web app that estimates reading complexity of a custom text with deep learning. The app relies on transformer models that are part of my top-9% solution to the CommonLit Readability Prize Kaggle competition. The app is built in Python and deployed in Streamlit. The blog post provides a demo of the app and includes a summary of the modeling pipeline and the app implementation.

# 2. App demo

You can open the app by clicking on this link. Alternatively, just scroll down to see the app embedded in this blog post. If the embedded version does not load, please open the app in a new tab. Feel free to play around with the app by typing or pasting custom texts and estimating their complexity with different models! Scroll further down to read some details on the app and the underlying models.

# 3. Implementation

## 3.1. Modeling pipeline

The app is developed in the scope of the CommonLit Readability Prize Kaggle competition on text complexity prediction. My solution is an ensemble of eight transformer models, including variants of BERT, RoBERTa and other architectures. All transformers are implemented in `PyTorch` and feature a custom regression head that uses a concatenated output of multiple hidden layers.

The project uses pre-trained transformer weights published on the HuggingFace model hub. Each model is then fine-tuned on a data set with 2834 text snippets, where readability of each snippet was evaluated by human experts. To avoid overfitting, fine-tuning relies on text augmentations such as sentence order shuffle, backtranslation and injecting target noise in the readability scores.

Each transformer model is fine-tuned using five-fold cross-validation repeated three times with different random splits. This GitHub repo provides the source code and documentation for the modeling pipeline. The table below summarizes the main architecture and training parameters. The ensemble of eight transformer models places in the top-9% of the Kaggle competition leaderboard. The web app only includes two lightweight models from a single fold to ensure fast inference on CPU: DistilBERT and DistilRoBERTa.

## 3.2. App implementation

The app is built in Python using the Streamlit library. Streamlit allows implementing a web app in a single Python code file and deploying the app to a cloud server so that anyone with Internet access can check it out.

The app code is provided in `web_app.py` located in the root folder of the project GitHub repo. The app is hosted on a virtual machine provided by Streamlit, which includes the list of dependencies specified in `requirements.txt`. It also imports some helper functions used within the modeling pipeline for text preprocessing and model initialization.

The app works by downloading weights of the selected transformer model to the virtual machine after a user selects which model to use for text readability prediction. The weights of each model are made available as release files on GitHub. After downloading the weights, the app transforms the text entered by a user into the token sequence with the tokenizer that uses text processing settings specified in the model configuration file. Next, the app runs a single forward pass through the initialized transformer network and displays the output prediction.

The snippet below provides the app source code. The code imports relevant Python modules and configures the app page. Next, it provides functionality for entering the custom text and selecting the NLP model. Finally, the code includes the inference function and some further documentation.

```#collapse-show

##### PREPARATIONS

# libraries
import gc
import os
import pickle
import sys
import urllib.request
import requests
import numpy as np
import pandas as pd
import streamlit as st
from PIL import Image

# custom libraries
sys.path.append('code')
from model import get_model
from tokenizer import get_tokenizer

# progress bar for the model weight download
mybar = None

def show_progress(block_num, block_size, total_size):
    global mybar
    if mybar is None:
        mybar = st.progress(0.0)
    downloaded = block_num * block_size / total_size
    mybar.progress(min(downloaded, 1.0))

# page config
st.set_page_config(page_icon             = ':books:',
                   layout                = 'centered',
                   initial_sidebar_state = 'collapsed')

# title

# image cover
image = Image.open(requests.get('https://i.postimg.cc/hv6yfMYz/cover-books.jpg', stream = True).raw)
st.image(image)

# description
st.write('This app uses deep learning to estimate the reading complexity of a custom text. Enter your text below, and we will run it through one of the two transformer models and display the result.')

##### PARAMETERS

# title

# model selection
model_name = st.selectbox(
    'Which model would you like to use?',
    ['DistilBERT', 'DistilRoBERTa'])

# input text
input_text = st.text_area('Which text would you like to rate?', 'Please enter the text in this field.')

##### MODELING

# specify paths to model weights
if model_name == 'DistilBERT':
    folder_path = 'output/v59/'
elif model_name == 'DistilRoBERTa':
    folder_path = 'output/v47/'

# download the weights if they are not cached yet
# (weight_path holds the URL of the GitHub release file with the selected model's weights; its definition is omitted here)
if not os.path.isfile(folder_path + 'pytorch_model.bin'):
    urllib.request.urlretrieve(weight_path, folder_path + 'pytorch_model.bin', show_progress)

# compute predictions
with st.spinner('Computing prediction...'):

    # clear memory
    gc.collect()

    # load model configuration
    config = pickle.load(open(folder_path + 'configuration.pkl', 'rb'))
    config['backbone'] = folder_path

    # initialize model
    model = get_model(config, name = model_name.lower(), pretrained = folder_path + 'pytorch_model.bin')
    model.eval()

    # initialize tokenizer
    tokenizer = get_tokenizer(config)

    # tokenize text
    text = tokenizer(text                  = input_text,
                     truncation            = True,
                     max_length            = config['max_len'],
                     return_token_type_ids = True,
                     return_tensors        = 'pt')

    # compute prediction
    if input_text != '':
        prediction = model(**text)  # forward pass (assuming the model accepts tokenizer outputs as keyword arguments)
        prediction = prediction['logits'].detach().numpy()
        prediction = 100 * (prediction + 4) / 6  # scale to [0, 100]

    # clear memory
    del tokenizer, text, config
    gc.collect()

# print output
st.write('**Note:** readability scores are scaled to [0, 100%]. A higher score means that the text is easier to read.')
st.success('Success! Thanks for scoring your text :)')

##### DOCUMENTATION

# example texts
with st.expander('Show example texts'):
    st.table(pd.DataFrame({
        'Text':  ['A dog sits on the floor. A cat sleeps on the sofa.', 'This app does text readability prediction. How cool is that?', 'Training of deep bidirectional transformers for language understanding.'],
        'Score': [1.5571, -0.0100, -2.4025],
    }))

# models
st.write("Both transformer models are part of my top-9% solution to the CommonLit Readability Kaggle competition. The pre-trained language models are fine-tuned on 2834 text snippets. [Click here](https://github.com/kozodoi/Kaggle_Readability) to see the source code and read more about the training pipeline.")

# metric
st.write("The readability metric is calculated on the basis of a Bradley-Terry analysis of more than 111,000 pairwise comparisons between excerpts. Teachers spanning grades 3-12 (a majority teaching between grades 6-10) served as the raters for these comparisons. More details on the used reading complexity metric are available [here](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240886).")
```

# 4. Closing words

This blog post provided a demo of an interactive web app that uses deep learning to estimate text reading complexity. I hope you found the app interesting and enjoyed playing with it!

If you have any data science projects in your portfolio, I highly encourage you to try developing a similar app yourself. There are many things you could demonstrate, ranging from interactive EDA dashboards to inference calls to custom ML models. Streamlit makes this process very simple and allows hosting the app in the cloud. Happy learning!

]]>
Nikita Kozodoi
Test-Time Augmentation for Tabular Data2021-09-08T00:00:00-05:002021-09-08T00:00:00-05:00https://kozodoi.me/blog/20210908/tta-tabular

Last update: 08.09.2021. All opinions are my own.

# 1. Overview

Test time augmentation (TTA) is a popular technique in computer vision. TTA aims at boosting the model accuracy by using data augmentation at the inference stage. The idea behind TTA is simple: for each test image, we create multiple versions that are a little different from the original (e.g., cropped or flipped). Next, we predict labels for the original test images and their augmented copies and average the model predictions over the multiple versions of each image. This usually helps to improve the accuracy irrespective of the underlying model.
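As a toy numeric illustration of this averaging step (the probabilities below are made up):

```python
import numpy as np

# made-up class probabilities for the original image and two augmented copies
p_original = np.array([0.70, 0.30])
p_flip     = np.array([0.64, 0.36])
p_crop     = np.array([0.76, 0.24])

# the TTA prediction is the average over all versions of the image
p_tta = np.mean([p_original, p_flip, p_crop], axis = 0)
print(p_tta)  # [0.7 0.3]
```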

In many business settings, data comes in a tabular format. Can we use TTA with tabular data to enhance the accuracy of ML models in a way similar to computer vision models? How to define suitable transformations of test cases that do not affect the label? This blog post explores the opportunities for using TTA in tabular data environments. We will implement TTA for `scikit-learn` classifiers and test its performance on multiple credit scoring data sets. The preliminary results indicate that TTA might be a tiny bit helpful in some settings.

Note: the results presented in this blog post are currently being extended within the scope of a working paper. The post will be updated once the paper is available on arXiv.

# 2. Adapting TTA to tabular data

TTA has been originally developed for deep learning applications in computer vision. In contrast to image data, tabular data poses a more challenging environment for using TTA. We will discuss two main challenges that we need to solve to apply TTA to structured data:

• how to define transformations?
• how to treat categorical features?

## 2.1. How to define transformations?

When working with image data, light transformations such as rotation, brightness adjustment, saturation and many others modify the underlying pixel values but do not affect the ground truth. That is, a rotated cat is still a cat. We can easily verify this by visually checking the transformed images and limiting the magnitude of transformations to make sure the cat is still recognizable. This is different for tabular data, where the underlying features represent different characteristics of the observed subjects. Let's consider a credit scoring example. In finance, banks use ML models to support loan allocation decisions. Consider a binary classification problem, where we predict whether the applicant will pay back the loan. The underlying features may describe the applicant's attributes (age, gender), loan parameters (amount, duration), macroeconomic indicators (inflation, growth). How to do transformations on these features? While there is no such thing as rotating a loan applicant (at least not within the scope of machine learning), we could do a somewhat similar exercise: create copies of each loan applicant and slightly modify feature values for each copy. A good starting point would be to add some random noise to each of the features.

This procedure raises a question: how can we be sure that transformations do not alter the label? Would increasing the applicant's age by 10 years affect her repayment ability? Arguably, yes. What about increasing the age by 1 year? Or 1 day? These are challenging questions that we can not answer without more information. This implies that the magnitude of the added noise has to be carefully tuned. We need to take into account the variance of each specific feature as well as the overall data set variability. Adding too little noise will create synthetic cases that are too similar to the original applications, which is not very useful. On the other hand, adding too much noise risks changing the label of the corresponding application, which would harm the model accuracy. The trade-off between these two extremes is what can potentially bring us closer to discovering an accuracy boost.

## 2.2. How to treat categorical features?

It is rather straightforward to add noise to continuous features such as age or income. However, tabular data frequently contains special gifts: categorical features. From gender to zip code, these features present another challenge for the application of TTA. Adding noise to the zip code appears non-trivial and requires some further thinking. Ignoring categorical features and only altering the continuous ones sounds like an easy solution, but this might not work well on data sets that contain a lot of information in the form of categorical data.

In this blog post, we will try a rather naive approach to deal with categorical features. Every categorical feature can be encoded as a set of dummy variables. Next, considering each dummy feature separately, we can occasionally flip the value, switching the person's gender, country of origin or education level with one click. This would introduce some variance in the categorical features and provide TTA with more diverse synthetic applications. This approach is imperfect and can be improved on, but we have to start somewhere, right?
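A rough sketch of such random dummy flips (toy data; `beta` is exaggerated here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dummy column for 10 applicants (e.g., a one-hot encoded gender feature)
dummy = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# flip each value independently with probability beta
beta  = 0.3  # exaggerated for illustration; in practice much smaller
flips = rng.binomial(1, beta, size = len(dummy))
dummy_new = np.where(flips == 1, 1 - dummy, dummy)
```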

Now that we have some ideas about how TTA should work and what are the main challenges, let's actually try to implement it!

# 3. Implementing TTA

This section implements a helper function `predict_proba_with_tta()` to extend the standard `predict_proba()` method in `scikit-learn` such that predictions take advantage of the TTA procedure. We focus on a binary classification task, but one could easily extend this framework to regression tasks as well.

The function `predict_proba_with_tta()` requires specifying the underlying `scikit-learn` model and the test set with observations to be predicted. The function operates in four simple steps:

1. Creating `num_tta` copies of the test set.
2. Implementing random transformations of the synthetic copies.
3. Predicting labels for the real and synthetic observations.
4. Aggregating the predictions.

Considering the challenges discussed in the previous section, we implement the following transformations for the continuous features:

• compute STD of each continuous feature denoted as `std`
• generate a random vector `n` using the standard normal distribution
• add `alpha * n * std` to each feature, where `alpha` is a meta-parameter.

And for the categorical features:

• convert categorical features into a set of dummies
• flip each dummy variable with a probability `beta`, where `beta` is a meta-parameter.

By varying `alpha` and `beta`, we control the transformation magnitude, adjusting the noise scale in the synthetic copies. Higher values imply stronger transformations. The suitable values can be identified through some meta-parameter tuning.
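As a toy illustration of the continuous-feature transformation `x + alpha * n * std`:

```python
import numpy as np

rng = np.random.default_rng(42)

# toy continuous feature: applicant age
age   = np.array([25., 40., 33., 58., 47.])
alpha = 0.1  # noise scale (meta-parameter)

# add Gaussian noise proportional to the feature's standard deviation
noise   = rng.standard_normal(len(age))
age_tta = age + alpha * noise * age.std()
```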

```#collapse-show

def predict_proba_with_tta(data,
                           model,
                           dummies = None,
                           num_tta = 4,
                           alpha   = 0.01,
                           beta    = 0.01,
                           seed    = 0):
    '''
    Predicts class probabilities using TTA.

    Arguments:
    - data (pandas DataFrame): data set with the feature values
    - model (sklearn model): machine learning model
    - dummies (list): list of column names of dummy features
    - num_tta (integer): number of test-time augmentations
    - alpha (float): noise parameter for continuous features
    - beta (float): noise parameter for dummy features
    - seed (integer): random seed

    Returns:
    - array of predicted probabilities
    '''

    # set random seed
    np.random.seed(seed = seed)

    # original prediction
    preds = model.predict_proba(data) / (num_tta + 1)

    # select numeric features
    num_vars = [var for var in data.columns if data[var].dtype != 'object']

    # exclude dummies from numeric features
    if dummies is not None:
        num_vars = list(set(num_vars) - set(dummies))

    # synthetic predictions
    for i in range(num_tta):

        # copy data
        data_new = data.copy()

        # introduce noise to numeric vars
        for var in num_vars:
            data_new[var] = data_new[var] + alpha * np.random.normal(0, 1, size = len(data_new)) * data_new[var].std()

        # introduce noise to dummies
        if dummies is not None:
            for var in dummies:
                probs = np.random.binomial(1, (1 - beta), size = len(data_new))
                data_new.loc[probs == 0, var] = 1 - data_new.loc[probs == 0, var]

        # predict probs
        preds_new = model.predict_proba(data_new)
        preds    += preds_new / (num_tta + 1)

    # return probs
    return preds
```

# 4. Empirical benchmark

Let's test our TTA function! This section performs an empirical experiment on multiple data sets to check whether TTA can improve model performance. First, we import relevant modules and load the list of prepared data sets. All data sets come from a credit scoring environment, which represents a binary classification setup. Some of the data sets are publicly available, whereas the others are subject to NDA. The public data sets include australian, german, pakdd, gmsc, homecredit and lendingclub. The sample sizes and the number of features vary greatly across the data sets, which allows us to test the TTA framework in different conditions.

```#collapse-hide

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

import os
import time
```

```#collapse-show
datasets = os.listdir('../data')
datasets
```

```
['thomas.csv',
 'german.csv',
 'hmeq.csv',
 'bene2.csv',
 'lendingclub.csv',
 'bene1.csv',
 'cashbus.csv',
 'uk.csv',
 'australian.csv',
 'pakdd.csv',
 'gmsc.csv',
 'paipaidai.csv']
```

Apart from the data sets, TTA needs an underlying ML model. In our experiment, we will use a Random Forest classifier with 500 trees on each data set, which offers a good trade-off between predictive performance and computational cost. We will not go deep into tuning the classifier and will keep its parameters fixed for all data sets. We will then use stratified 5-fold cross-validation to train and test models with and without TTA.

```#collapse-show

# classifier
clf = RandomForestClassifier(n_estimators = 500, random_state = 1, n_jobs = 4)

# settings
folds = StratifiedKFold(n_splits     = 5,
                        shuffle      = True,
                        random_state = 23)
```

The cell below implements the following experiment:

1. We loop through the datasets and perform cross-validation, training Random Forest on each fold combination.
2. Next, we predict labels of the validation cases and calculate the AUC of the model predictions. This is our benchmark.
3. We predict labels of the validation cases with the same model but now implement TTA to adjust the predictions.
4. By comparing the average AUC difference before and after TTA, we can judge whether TTA actually helps to boost the predictive performance.

```#collapse-show

# placeholders
auc_change = []

# timer
start = time.time()

# modeling loop
for data in datasets:

    ##### DATA PREPARATION

    # import data

    # convert target to integer

    # extract X and y

    # create dummies
    X = pd.get_dummies(X, prefix_sep = '_dummy_')

    # data information
    print('-------------------------------------')
    print('Dataset:', data, X.shape)
    print('-------------------------------------')

    ##### CROSS-VALIDATION

    # create objects
    oof_preds_raw = np.zeros((len(X), y.nunique()))
    oof_preds_tta = np.zeros((len(X), y.nunique()))

    # modeling loop
    for fold_, (trn_, val_) in enumerate(folds.split(y, y)):

        # data partitioning
        trn_x, trn_y = X.iloc[trn_], y.iloc[trn_]
        val_x, val_y = X.iloc[val_], y.iloc[val_]

        # train the model
        clf.fit(trn_x, trn_y)

        # identify dummies
        dummies = list(X.filter(like = '_dummy_').columns)

        # predictions
        oof_preds_raw[val_, :] = clf.predict_proba(val_x)
        oof_preds_tta[val_, :] = predict_proba_with_tta(data    = val_x,
                                                        model   = clf,
                                                        dummies = dummies,
                                                        num_tta = 5,
                                                        alpha   = np.sqrt(len(trn_x)) / 3000,
                                                        beta    = np.sqrt(len(trn_x)) / 30000,
                                                        seed    = 1)

    # print performance
    print('- AUC before TTA = %.6f ' % roc_auc_score(y, oof_preds_raw[:, 1]))
    print('- AUC with TTA   = %.6f ' % roc_auc_score(y, oof_preds_tta[:, 1]))
    print('-------------------------------------')
    print('')

    # save the AUC delta
    delta = roc_auc_score(y, oof_preds_tta[:, 1]) - roc_auc_score(y, oof_preds_raw[:, 1])
    auc_change.append(delta)

# display results
print('-------------------------------------')
print('Finished in %.1f minutes' % ((time.time() - start) / 60))
print('-------------------------------------')
print('TTA improves AUC in %.0f/%.0f cases' % (np.sum(np.array(auc_change) > 0), len(datasets)))
print('Mean AUC change = %.6f' % np.mean(auc_change))
print('-------------------------------------')
```

```-------------------------------------
Dataset: thomas.csv (1225, 28)
-------------------------------------
- AUC before TTA = 0.612322
- AUC with TTA   = 0.613617
-------------------------------------

-------------------------------------
Dataset: german.csv (1000, 61)
-------------------------------------
- AUC before TTA = 0.796233
- AUC with TTA   = 0.796300
-------------------------------------

-------------------------------------
Dataset: hmeq.csv (5960, 20)
-------------------------------------
- AUC before TTA = 0.975995
- AUC with TTA   = 0.976805
-------------------------------------

-------------------------------------
Dataset: bene2.csv (7190, 28)
-------------------------------------
- AUC before TTA = 0.801193
- AUC with TTA   = 0.799387
-------------------------------------

-------------------------------------
Dataset: lendingclub.csv (43344, 114)
-------------------------------------
- AUC before TTA = 0.625029
- AUC with TTA   = 0.628207
-------------------------------------

-------------------------------------
Dataset: bene1.csv (3123, 84)
-------------------------------------
- AUC before TTA = 0.788607
- AUC with TTA   = 0.789447
-------------------------------------

-------------------------------------
Dataset: cashbus.csv (15000, 642)
-------------------------------------
- AUC before TTA = 0.629648
- AUC with TTA   = 0.624874
-------------------------------------

-------------------------------------
Dataset: uk.csv (30000, 51)
-------------------------------------
- AUC before TTA = 0.712042
- AUC with TTA   = 0.723359
-------------------------------------

-------------------------------------
Dataset: australian.csv (690, 42)
-------------------------------------
- AUC before TTA = 0.931787
- AUC with TTA   = 0.931958
-------------------------------------

-------------------------------------
Dataset: pakdd.csv (50000, 373)
-------------------------------------
- AUC before TTA = 0.620081
- AUC with TTA   = 0.623080
-------------------------------------

-------------------------------------
Dataset: gmsc.csv (150000, 68)
-------------------------------------
- AUC before TTA = 0.846187
- AUC with TTA   = 0.855176
-------------------------------------

-------------------------------------
Dataset: paipaidai.csv (60000, 1934)
-------------------------------------
- AUC before TTA = 0.716398
- AUC with TTA   = 0.721679
-------------------------------------

-------------------------------------
Finished in 206.1 minutes
-------------------------------------
TTA improves AUC in 10/12 cases
Mean AUC change = 0.002364
-------------------------------------
```

Looks like TTA is working! Overall, TTA improves the AUC in 10 out of 12 data sets. The observed performance gains are rather small: on average, TTA improves AUC by `0.00236`. The results are visualized in the barplot below:

```#collapse-hide

# sort AUC gains across data sets (auc_change is collected in the loop above)
objects = list(range(len(datasets)))
y_pos   = np.arange(len(objects))
perf    = np.sort(auc_change)

# barplot of AUC gains
plt.figure(figsize = (6, 8))
plt.barh(y_pos, perf, align = 'center', color = 'blue', alpha = 0.66)
plt.ylabel('Dataset')
plt.yticks(y_pos, objects)
plt.xlabel('AUC Gain')
plt.plot([0, 0], [-1, len(objects)], 'k--')
plt.tight_layout()
```

We should bear in mind that the performance gains, although appearing rather small, come almost "for free". We don't need to train a new model and only require a relatively small amount of extra resources to create synthetic copies of the loan applications. Sounds good!

It is possible that further fine-tuning of the TTA meta-parameters can uncover larger performance gains. Furthermore, a considerable variance of the average gains from TTA across the data sets indicates that TTA can be more helpful in specific settings. The important factors influencing the TTA performance may relate to both the data and the classifier used to produce predictions. More research is needed to identify and analyze such factors.

# 5. Closing words

The purpose of this tutorial was to explore TTA applications for tabular data. We have discussed the corresponding challenges, developed a TTA wrapper function for `scikit-learn` and demonstrated that it could indeed be helpful on multiple credit scoring data sets. I hope you found this post interesting.

The project described in this blog post is a work in progress. I will update the post once the working paper on the usage of TTA for tabular data is available. Stay tuned and happy learning!

]]>
Nikita Kozodoi
Extracting Intermediate Layer Outputs in PyTorch2021-05-27T00:00:00-05:002021-05-27T00:00:00-05:00https://kozodoi.me/blog/20210527/extracting-features

Last update: 23.10.2021. All opinions are my own.

# 1. Overview

In deep learning tasks, we usually work with predictions outputted by the final layer of a neural network. In some cases, we might also be interested in the outputs of intermediate layers. Whether we want to extract data embeddings or inspect what is learned by earlier layers, it may not be straightforward how to extract the intermediate features from the network.

This blog post provides a quick tutorial on the extraction of intermediate activations from any layer of a deep learning model in `PyTorch` using the forward hook functionality. The important advantage of this method is its simplicity and ability to extract features without having to run the inference twice, only requiring a single forward pass through the model to save multiple outputs.

# 2. Why do we need intermediate features?

Extracting intermediate activations (also called features) can be useful in many applications. In computer vision problems, outputs of intermediate CNN layers are frequently used to visualize the learning process and illustrate visual features distinguished by the model on different layers. Another popular use case is extracting intermediate outputs to create image or text embeddings, which can be used to detect duplicate items, included as input features in a classical ML model, visualize data clusters and much more. When working with Encoder-Decoder architectures, outputs of intermediate layers can also be used to compress the data into a smaller-sized vector containing the data representation. There are many further use cases in which intermediate activations can be useful. So, let's discuss how to get them!
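As a small illustration of the duplicate-detection use case, embeddings can be compared with cosine similarity (the vectors below are made up):

```python
import numpy as np

# hypothetical low-dimensional embeddings extracted from an intermediate layer
emb_a = np.array([0.9, 0.1, 0.0, 0.4])
emb_b = np.array([0.8, 0.2, 0.1, 0.5])  # near-duplicate of emb_a
emb_c = np.array([0.0, 0.9, 0.7, 0.1])  # unrelated item

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb_a, emb_b) > cosine(emb_a, emb_c))  # True
```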

# 3. How to extract activations?

To extract activations from intermediate layers, we will need to register a so-called forward hook for the layers of interest in our neural network and perform inference to store the relevant outputs.

For the purpose of this tutorial, I will use image data from a Cassava Leaf Disease Classification Kaggle competition. In the next few cells, we will import relevant libraries and set up a Dataloader object. Feel free to skip them if you are familiar with standard `PyTorch` data loading practices and go directly to the feature extraction part.

## Preparations

```#collapse-hide

##### PACKAGES

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

!pip install timm
import timm

import albumentations as A
from albumentations.pytorch import ToTensorV2

import cv2
import os

device = torch.device('cuda')
```

```#collapse-hide

##### DATASET

class ImageData(Dataset):

    # init
    def __init__(self,
                 data,
                 directory,
                 transform):
        self.data      = data
        self.directory = directory
        self.transform = transform

    # length
    def __len__(self):
        return len(self.data)

    # get item
    def __getitem__(self, idx):

        # import image (assuming an image_id column in the data)
        path  = os.path.join(self.directory, self.data.iloc[idx]['image_id'])
        image = cv2.imread(path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # augmentations
        image = self.transform(image = image)['image']

        return image
```

We will use a standard `PyTorch` dataloader to load the data in batches of 32 images.

```#collapse-show

# import data

# augmentations
transforms = A.Compose([A.Resize(height = 128, width = 128),
                        A.Normalize(),
                        ToTensorV2()])

# dataset
data_set = ImageData(data      = df,
                     directory = '../input/cassava-leaf-disease-classification/train_images/',
                     transform = transforms)

# dataloader
data_loader = DataLoader(data_set,
                         batch_size  = 32,
                         shuffle     = False,
                         num_workers = 2)
```

```
         image_id  label
0  1000015157.jpg      0
1  1000201771.jpg      3
2   100042118.jpg      1
3  1000723321.jpg      1
4  1000812911.jpg      3
```

## Model

To extract anything from a neural net, we first need to set up this net, right? In the cell below, we define a simple `resnet18` model with a two-node output layer. We use `timm` library to instantiate the model, but feature extraction will also work with any neural network written in `PyTorch`.

We also print out the architecture of our network. As you can see, there are many intermediate layers through which our image travels during a forward pass before turning into a two-number output. We should note the names of the layers because we will need to provide them to a feature extraction function.

```##### DEFINE MODEL

model    = timm.create_model(model_name = 'resnet18', pretrained = True)
model.fc = nn.Linear(512, 2)
model.to(device)
```
```
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act1): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act1): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act2): ReLU(inplace=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act1): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act2): ReLU(inplace=True)
    )
  )

  ...

  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act1): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act2): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act1): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act2): ReLU(inplace=True)
    )
  )
  (fc): Linear(in_features=512, out_features=2, bias=True)
)
```

## Feature extraction

The implementation of feature extraction requires two simple steps:

1. Registering a forward hook on a certain layer of the network.
2. Performing standard inference to extract features of that layer.

First, we need to define a helper function that will introduce a so-called hook. A hook is simply a command that is executed when a forward or backward call to a certain layer is performed. If you want to know more about hooks, you can check out this link.

In our setup, we are interested in a forward hook that simply copies the layer outputs, sends them to the CPU and saves them to a dictionary object we call `features`.

The hook is defined in a cell below. The `name` argument in `get_features()` specifies the dictionary key under which we will store our intermediate activations.

```##### HELPER FUNCTION FOR FEATURE EXTRACTION

def get_features(name):
    def hook(model, input, output):
        features[name] = output.detach()
    return hook
```

After the helper function is defined, we can register a hook using `.register_forward_hook()` method. The hook can be applied to any layer of the neural network.

Since we work with a CNN, extracting features from the last convolutional layer might be useful to get image embeddings. Therefore, we are registering a hook for the outputs of the `global_pool` layer. To extract features from an earlier layer, we could also access it with, e.g., `model.layer1.act2` and save it under a different name in the `features` dictionary. With this method, we can actually register multiple hooks (one for every layer of interest), but we will only keep one for the purpose of this example.
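The hook mechanics can also be sketched on a tiny toy network rather than `resnet18` (the layer names `hidden` and `activated` below are made up for this illustration):

```python
import torch
import torch.nn as nn

# toy network standing in for the real model
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

features = {}

def get_features(name):
    def hook(model, input, output):
        features[name] = output.detach()
    return hook

# one hook per layer of interest
handles = [net[0].register_forward_hook(get_features('hidden')),
           net[1].register_forward_hook(get_features('activated'))]

# a single forward pass fills the dictionary with both layers' outputs
out = net(torch.randn(3, 4))
print(features['hidden'].shape)  # torch.Size([3, 8])

# hooks can be removed via their handles when no longer needed
for h in handles:
    h.remove()
```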

```##### REGISTER HOOK

model.global_pool.register_forward_hook(get_features('feats'))
```
`<torch.utils.hooks.RemovableHandle at 0x7f2540254290>`

Now we are ready to extract features! The nice thing about hooks is that we can now perform inference as we usually would and get multiple outputs at the same time:

• outputs of the final layer
• outputs of every layer with a registered hook

The feature extraction happens automatically during the forward pass whenever we run `model(inputs)`. To store intermediate features and concatenate them over batches, we just need to include the following in our inference loop:

1. Create placeholder list `FEATS = []`. This list will store intermediate outputs from all batches.
2. Create placeholder dict `features = {}`. We will use this dictionary for storing intermediate outputs from each batch.
3. Iteratively extract batch features to `features`, send them to CPU and append to the list `FEATS`.
```##### FEATURE EXTRACTION LOOP

# placeholders
PREDS = []
FEATS = []

# placeholder for batch features
features = {}

# loop through batches
for idx, inputs in enumerate(data_loader):

    # move to device
    inputs = inputs.to(device)

    # forward pass [with feature extraction]
    preds = model(inputs)

    # add feats and preds to lists
    PREDS.append(preds.detach().cpu().numpy())
    FEATS.append(features['feats'].cpu().numpy())

    # early stop
    if idx == 9:
        break
```

This is it! Looking at the shapes of resulting arrays, you can see that the code worked well: we extracted both final layer outputs as `PREDS` and intermediate activations as `FEATS`. We can now save these features and work with them further.

```##### INSPECT FEATURES

PREDS = np.concatenate(PREDS)
FEATS = np.concatenate(FEATS)

print('- preds shape:', PREDS.shape)
print('- feats shape:', FEATS.shape)
```
```- preds shape: (320, 2)
- feats shape: (320, 512)
```

# 4. Closing words

The purpose of this tutorial was to teach you how to extract intermediate outputs from the most interesting layers of your neural networks. With hooks, you can do all feature extraction in a single inference run and avoid complex modifications of your model. I hope you found this post helpful.

If you are interested, check out my other blog posts to see more tips on deep learning and `PyTorch`. Happy learning!

]]>
Nikita Kozodoi
Tracking ML Experiments with Neptune.ai2021-04-30T00:00:00-05:002021-04-30T00:00:00-05:00https://kozodoi.me/blog/20210430/neptune

This post is also published on the Neptune.ai blog. All opinions are my own.

# 1. Introduction

Many ML projects, including Kaggle competitions, have a similar workflow. You start with a simple pipeline with a benchmark model. Next, you begin incorporating improvements: adding features, augmenting the data, tuning the model... On each iteration, you evaluate your solution and keep changes that improve the target metric. The figure illustrates the iterative improvement process in ML projects.

This workflow involves running a lot of experiments. As time goes by, it becomes difficult to keep track of the progress and positive changes. Instead of working on new ideas, you spend time thinking:

• “have I already tried this thing?”,
• “what was that hyperparameter value that worked so well last week?”

You end up running the same stuff multiple times. If you are not tracking your experiments yet, I highly recommend you to start! In my previous Kaggle projects, I used to rely on spreadsheets for tracking. It worked very well in the beginning, but soon I realized that setting up and managing spreadsheets with experiment meta-data requires loads of additional work. I got tired of manually filling in model parameters and performance values after each experiment and really wanted to switch to an automated solution.

This is when I discovered Neptune.ai. This tool allowed me to save a lot of time and focus on modeling decisions, which helped me to earn three medals in Kaggle competitions.

In this post, I will share my story of switching from spreadsheets to Neptune for experiment tracking. I will describe a few disadvantages of spreadsheets, explain how Neptune helps to address them, and give a couple of tips on using Neptune for Kaggle.

# 2. What is wrong with spreadsheets for experiment tracking?

Spreadsheets are great for many purposes. To track experiments, you can simply set up a spreadsheet with different columns containing the relevant parameters and performance of your pipeline. It is also easy to share this spreadsheet with teammates.

Sounds great, right?

Unfortunately, there are a few problems with this. The figure illustrates ML experiment tracking with spreadsheets.

## Manual work

After doing it for a while, you will notice that maintaining a spreadsheet starts eating too much time. You need to manually fill in a row with meta-data for each new experiment and add a column for each new parameter. This will get out of control once your pipeline becomes more sophisticated.

It is also very easy to make a typo, which can lead to bad decisions.

When working on one deep learning competition, I incorrectly entered a learning rate in one of my experiments. Looking at the spreadsheet, I concluded that a high learning rate decreases the accuracy and went on working on other things. It was only a few days later that I realized there was a typo and that the poor performance actually came from a low learning rate. This cost me two days of work invested in the wrong direction based on a false conclusion.

## No live tracking

With spreadsheets, you need to wait until an experiment is completed in order to record the performance.

Apart from being frustrated to do it manually every time, this also does not allow you to compare intermediate results across the experiments, which is helpful to see if a new run looks promising.

Of course, you can log model performance after every epoch, but doing it manually for each experiment requires even more time and effort. I never had enough diligence to do it regularly and ended up using some of my computing resources suboptimally.

## Attachment limitations

Another issue with spreadsheets is that they only support textual meta-data that can be entered in a cell.

What if you want to attach other meta-data like:

• model weights,
• source code,
• plots with model predictions,
• input data version?

You need to manually store this stuff in your project folders outside of the spreadsheet.

In practice, it gets complicated to organize and sync experiment outputs between local machines, Google Colab, Kaggle Notebooks, and other environments your teammates might use. Having such meta-data attached to a tracking spreadsheet seems useful, but it is very difficult to do it.

# 3. Switching from spreadsheets to Neptune

A few months ago, our team was working on a Cassava Leaf Disease competition and used Google spreadsheets for experiment tracking. One month into the challenge, our spreadsheet was already cluttered:

• Some runs were missing performance because one of us forgot to log it and no longer had the results.
• PDFs with loss curves were scattered over Google Drive and Kaggle Notebooks.
• Some parameters might have been entered incorrectly, but it was too time-consuming to restore and double-check older script versions.

It was difficult to make good data-driven decisions based on our spreadsheet.

Even though there were only four weeks left, we decided to switch to Neptune. I was surprised to see how little effort it actually took us to set it up. In brief, there are three main steps:

• create a Neptune account and set up a project,
• install the neptune package in your environment,
• include several lines in the pipeline to enable logging of relevant meta-data.

You can read more about the exact steps to start using Neptune here. Of course, going through the documentation and getting familiar with the platform may take you a few hours. But remember that this is only a one-time investment. After learning the tool once, I was able to automate much of the tracking and rely on Neptune in the next Kaggle competitions with very little extra effort.

# 4. What is good about Neptune?

The figure illustrates ML experiment tracking with Neptune.

## Less manual work

One of the key advantages of Neptune over spreadsheets is that it saves you a lot of manual work. With Neptune, you use the API within the pipeline to automatically upload and store meta-data while the code is running.

```import neptune.new as neptune

run = neptune.init(project = '#', api_token = '#') # your credentials

# Track relevant parameters
config = {
    'batch_size': 64,
    'learning_rate': 0.001,
}
run['parameters'] = config

# Track the training process by logging your training metrics
for epoch in range(100):
    run['train/accuracy'].log(epoch * 0.6)

# Log the final results
run['f1_score'] = 0.66
```

You don’t have to manually put it in the results table, and you also save yourself from making a typo. Since the meta-data is sent to Neptune directly from the code, you will get all numbers right no matter how many digits they have.

It may sound like a small thing, but the time saved on logging each experiment accumulates very quickly and leads to tangible gains by the end of the project. This gives you an opportunity to not think too much about the actual tracking process and to focus better on the modeling decisions. In a way, it is like hiring an assistant to take care of some boring (but very useful) logging tasks so that you can focus on the creative work.

## Live tracking

What I like a lot about Neptune is that it allows you to do live tracking. If you work with models like neural networks or gradient boosting that require a lot of iterations before convergence, you know it is quite useful to look at the loss dynamics early to detect issues and compare models.

Tracking intermediate results in a spreadsheet is too frustrating. The Neptune API can log performance after every epoch or even every batch so that you can start comparing the learning curves while your experiment is still running. This proves to be very helpful. As you might expect, many ML experiments have negative results (sorry, but that great idea you were working on for a few days actually decreases the accuracy).

This is completely fine because this is how ML works.

What is not fine is that you may need to wait a long time until getting that negative signal from your pipeline. Using the Neptune dashboard to compare the intermediate plots with the first few performance values may be enough to realize that you need to stop the experiment and change something.

## Attaching outputs

Another advantage of Neptune is the ability to attach pretty much anything to every experiment run. This really helps to keep important outputs such as model weights and predictions in one place and easily access them from your experiments table.

This is particularly helpful if you and your colleagues work in different environments and have to manually upload the outputs to sync the files.

I also like the ability to attach the source code to each run to make sure you have the notebook version that produced the corresponding result. This can be very useful in case you want to revert some changes that did not improve the performance and would like to go back to the previous best version.

# 5. Tips to improve Kaggle performance with Neptune

When working on Kaggle competitions, there are a few tips I can give you to further improve your tracking experience.

## Using Neptune in Kaggle Notebooks or Google Colab

First, Neptune is very helpful when working in Kaggle Notebooks or Google Colab, which have session time limits when using GPU/TPU. I cannot count how many times I lost all experiment outputs due to a notebook crash when training took just a few minutes longer than the allowed 9-hour limit!

To avoid that, I would highly recommend setting up Neptune such that model weights and loss metrics are stored after each epoch. That way, you will always have a checkpoint uploaded to Neptune servers to resume your training even if your Kaggle notebook times out. You will also have an opportunity to compare your intermediate results before the session crash with other experiments to judge their potential.

## Updating runs with the Kaggle leaderboard score

Second, an important metric to track in Kaggle projects is the leaderboard score. With Neptune, you can track your cross-validation score automatically, but getting the leaderboard score inside the code is not possible since it requires submitting predictions via the Kaggle website.

The most convenient way to add the leaderboard score of your experiment to the Neptune tracking table is to use the "resume run" functionality. It allows you to update any finished experiment with a new metric with a couple of lines of code. This feature is also helpful to resume tracking crashed sessions, which we discussed in the previous paragraph.

```import neptune.new as neptune

run = neptune.init(project = 'Your-Kaggle-Project', run = 'SUN-123')

run['LB_score'] = 0.5

# Continue working
```

Finally, I know that many Kagglers like to perform complex analyses of their submissions, like estimating the correlation between CV and LB scores or plotting the best score dynamics with respect to time.

While it is not yet feasible to do such things on the website, Neptune allows you to download meta-data from all experiments directly into your notebook using a single API call. It makes it easy to take a deeper dive into the results or export the meta-data table and share it externally with people who use a different tracking tool or don’t rely on any experiment tracking.

```import neptune.new as neptune

my_project = neptune.get_project('Your-Workspace/Your-Kaggle-Project')

# Get dashboard with runs contributed by 'sophia'
sophia_df = my_project.fetch_runs_table(owner = 'sophia').to_pandas()
```

# 6. Final thoughts

In this post, I shared my story of switching from spreadsheets to Neptune for tracking ML experiments and highlighted some advantages of Neptune. I would like to stress once again that investing time in infrastructure tools - be it experiment tracking, code versioning, or anything else - is always a good decision and will likely pay off through increased productivity. Tracking experiment meta-data with spreadsheets is much better than not doing any tracking: it helps you to see your progress, understand which modifications improve your solution, and make better modeling decisions. But doing it with spreadsheets will also cost you some additional time and effort. Tools like Neptune take experiment tracking to the next level, allowing you to automate the meta-data logging and focus on the modeling decisions.

I hope you find my story useful. Good luck with your future ML projects!

]]>
Nikita Kozodoi
Computing Mean &amp; STD in Image Dataset2021-03-08T00:00:00-06:002021-03-08T00:00:00-06:00https://kozodoi.me/blog/20210308/image-mean-std

Last update: 16.10.2021. All opinions are my own.

# 1. Overview

In computer vision, it is recommended to normalize image pixel values relative to the dataset mean and standard deviation. This helps to get consistent results when applying a model to new images and can also be useful for transfer learning. In practice, computing these statistics can be a little non-trivial since we usually can't load the whole dataset in memory and have to loop through it in batches.

This blog post provides a quick tutorial on computing dataset mean and std within RGB channels using a regular `PyTorch` dataloader. While computing mean is easy (we can simply average means over batches), standard deviation is a bit more tricky: averaging STDs across batches is not the same as the overall STD. Let's see how to do it properly!
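To see why per-batch STDs cannot simply be averaged, here is a standalone `NumPy` sketch (the toy data and variable names are my own, not part of the tutorial pipeline). It accumulates a sum and a squared sum over batches, which is the same trick the tutorial uses:

```python
import numpy as np

# toy data: 100 "pixel" values split into 10 batches of 10
data    = np.arange(100, dtype = float)
batches = data.reshape(10, 10)

# naive approach: average the per-batch standard deviations
naive_std = np.mean([b.std() for b in batches])

# correct approach: accumulate sum and squared sum over batches
psum    = sum(b.sum() for b in batches)
psum_sq = sum((b ** 2).sum() for b in batches)
count   = data.size

mean = psum / count
std  = np.sqrt(psum_sq / count - mean ** 2)

print(round(naive_std, 2))  # 2.87  - only captures variation within each batch
print(round(std, 2))        # 28.87 - matches data.std()
```

The naive average is an order of magnitude off because it ignores how the batch means differ from each other, while the sum-of-squares approach recovers the exact dataset-level STD.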

# 2. Preparations

To demonstrate how to compute image stats, we will use data from Cassava Leaf Disease Classification Kaggle competition with about 21,000 plant images. Feel free to scroll down to Section 3 to jump directly to calculations.

First, we will import the usual libraries and specify relevant parameters. No need to use GPU because there is no modeling involved.

```#collapse-hide

####### PACKAGES

import os

import numpy as np
import pandas as pd

import torch
from torch.utils.data import Dataset, DataLoader
import torchvision

import albumentations as A
from albumentations.pytorch import ToTensorV2

import cv2

from tqdm import tqdm

import matplotlib.pyplot as plt
%matplotlib inline

####### PARAMS

device      = torch.device('cpu')
num_workers = 4
image_size  = 512
batch_size  = 8
data_path   = '/kaggle/input/cassava-leaf-disease-classification/'
```

Now, let's import a dataframe with image paths and create a `Dataset` class that will read images and supply them to the dataloader.

```#collapse-show

df = pd.read_csv(data_path + 'train.csv')
df.head()
```

```         image_id  label
0  1000015157.jpg      0
1  1000201771.jpg      3
2   100042118.jpg      1
3  1000723321.jpg      1
4  1000812911.jpg      3
```

```#collapse-show

class LeafData(Dataset):

    def __init__(self,
                 data,
                 directory,
                 transform = None):
        self.data      = data
        self.directory = directory
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):

        # import
        path  = os.path.join(self.directory, self.data.iloc[idx]['image_id'])
        image = cv2.imread(path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # augmentations
        if self.transform is not None:
            image = self.transform(image = image)['image']

        return image
```

We want to compute stats for raw images, so our data augmentation pipeline should be minimal and not include any heavy transformations we might use during training. Below, we use `A.Normalize()` with mean = 0 and std = 1 to scale pixel values from `[0, 255]` to `[0, 1]` and `ToTensorV2()` to convert numpy arrays into torch tensors.

```#collapse-show

augs = A.Compose([A.Resize(height = image_size,
                           width  = image_size),
                  A.Normalize(mean = (0, 0, 0),
                              std  = (1, 1, 1)),
                  ToTensorV2()])
```

Let's check if our code works correctly. We define a `DataLoader` to load images in batches from `LeafData` and plot the first batch.

```####### EXAMINE SAMPLE BATCH

# dataset
image_dataset = LeafData(data      = df,
                         directory = data_path + 'train_images/',
                         transform = augs)

# dataloader
image_loader = DataLoader(image_dataset,
                          batch_size  = batch_size,
                          shuffle     = False,
                          num_workers = num_workers,
                          pin_memory  = True)

# display images
for inputs in image_loader:
    fig = plt.figure(figsize = (14, 7))
    for i in range(8):
        ax = fig.add_subplot(2, 4, i + 1, xticks = [], yticks = [])
        plt.imshow(inputs[i].numpy().transpose(1, 2, 0))
    break
```

Looks like everything is working correctly! Now we can use our `image_loader` to compute image stats.

# 3. Computing image stats

The computation is done in three steps:

1. Define placeholders to store two batch-level stats: sum and squared sum of pixel values. The first will be used to compute means, and the latter will be needed for standard deviation calculations.
2. Loop through the batches and add up channel-specific sum and squared sum values.
3. Perform final calculations to obtain data-level mean and standard deviation.

The first two steps are done in the snippet below. Note that we set `axis = [0, 2, 3]` to aggregate over all dimensions except axis 1, the channel dimension. The shape of `inputs` is `[batch_size x 3 x image_size x image_size]`, so this sums the values for each RGB channel separately.

```####### COMPUTE MEAN / STD

# placeholders
psum    = torch.tensor([0.0, 0.0, 0.0])
psum_sq = torch.tensor([0.0, 0.0, 0.0])

# loop through images
for inputs in tqdm(image_loader):
    psum    += inputs.sum(axis        = [0, 2, 3])
    psum_sq += (inputs ** 2).sum(axis = [0, 2, 3])
```
```100%|██████████| 2675/2675 [04:21<00:00, 10.23it/s]
```

Finally, we make some further calculations:

• mean: simply divide the sum of pixel values by the total `count` - number of pixels in the dataset computed as `len(df) * image_size * image_size`
• standard deviation: use the following equation: `total_std = sqrt(psum_sq / count - total_mean ** 2)`

Why do we use such a weird formula for STD? Because this is how the variance equation can be simplified to make use of the sum of squares when the raw data is no longer available. If you are not sure about this, expand the cell below to see a calculation example or read this for some details.

```#collapse-hide

# Consider three vectors:
A = [1, 1]
B = [2, 2]
C = [1, 1, 2, 2]

# Let's compute SDs in a classical way:
1. Mean(A) = 1; Mean(B) = 2; Mean(C) = 1.5
2. SD(A) = SD(B) = 0  # because there is no variation around the means
3. SD(C) = sqrt(1/4 * ((1 - 1.5)**2 + (1 - 1.5)**2 + (2 - 1.5)**2 + (2 - 1.5)**2)) = 1/2

# Note that SD(C) is clearly not equal to SD(A) + SD(B), which is zero.

# Instead, we could compute SD(C) in three steps using the equation above:
1. psum    = 1 + 1 + 2 + 2 = 6
2. psum_sq = (1**2 + 1**2 + 2**2 + 2**2) = 10
3. SD(C)   = sqrt((psum_sq - 1/N * psum**2) / N) = sqrt((10 - 36 / 4) / 4) = sqrt(1/4) = 1/2

# We get the same result as in the classical way!
```

```####### FINAL CALCULATIONS

# pixel count
count = len(df) * image_size * image_size

# mean and std
total_mean = psum / count
total_var  = (psum_sq / count) - (total_mean ** 2)
total_std  = torch.sqrt(total_var)

# output
print('mean: '  + str(total_mean))
print('std:  '  + str(total_std))
```
```mean: tensor([0.4417, 0.5110, 0.3178])
std:  tensor([0.2330, 0.2358, 0.2247])
```

This is it! Now you can plug in the mean and std values to `A.Normalize()` in your data augmentation pipeline to make sure your dataset is normalized :)

# 4. Closing words

I hope this tutorial was helpful for those looking for a quick guide on computing the image dataset stats. From my experience, normalizing images with respect to the data-level mean and std does not always help to improve the performance, but it is one of the things I always try first. Happy learning and stay tuned for the next posts!

]]>
Nikita Kozodoi

Last update: 15.10.2021. All opinions are my own.

# 1. Overview

Deep learning models are getting bigger and bigger. It becomes difficult to fit such networks in the GPU memory. This is especially relevant in computer vision applications where we need to reserve some memory for high-resolution images as well. As a result, we are sometimes forced to use small batches during training, which may lead to a slower convergence and lower accuracy.

This blog post provides a quick tutorial on how to increase the effective batch size by using a trick called gradient accumulation. Simply speaking, gradient accumulation means that we will use a small batch size but save the gradients and update network weights once every couple of batches. Automated solutions for this exist in higher-level frameworks such as `fast.ai` or `lightning`, but those who love using `PyTorch` might find this tutorial useful.

# 2. What is gradient accumulation

When training a neural network, we usually divide our data in mini-batches and go through them one by one. The network predicts batch labels, which are used to compute the loss with respect to the actual targets. Next, we perform backward pass to compute gradients and update model weights in the direction of those gradients.

Gradient accumulation modifies the last step of the training process. Instead of updating the network weights on every batch, we can save gradient values, proceed to the next batch and add up the new gradients. The weight update is then done only after several batches have been processed by the model.
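This update scheme can be sketched in a few lines of `PyTorch`. The snippet below is a minimal standalone illustration with a toy linear model and random data; names like `accum_steps` are my own and not from this post:

```python
import torch

# toy setup: a linear model and random data, just for illustration
model     = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)
criterion = torch.nn.MSELoss()

xs, ys      = torch.randn(32, 10), torch.randn(32, 1)
batch_size  = 8
accum_steps = 4   # update weights once every 4 batches

optimizer.zero_grad()
for idx in range(0, 32, batch_size):
    inputs  = xs[idx:idx + batch_size]
    targets = ys[idx:idx + batch_size]

    # scale the loss so the accumulated gradient averages over accum_steps batches
    loss = criterion(model(inputs), targets) / accum_steps
    loss.backward()   # gradients are added to .grad, not overwritten

    # update weights only after accum_steps backward passes
    if (idx // batch_size + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Because `backward()` adds to the stored gradients instead of replacing them, the single `optimizer.step()` here behaves (up to the loss scaling) like one update computed on the full batch of 32 samples.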

Gradient accumulation helps to imitate a larger batch size. Imagine you want to use 32 images in one batch, but your hardware crashes once you go beyond 8. In that case, you can use batches of 8 images and update weights once every 4 batches. If you accumulate gradients from every batch in between, the results will be