Super-convergence in deep learning is a term coined by researcher Leslie N. Smith to describe a phenomenon where deep neural networks are trained an order of magnitude faster than when using traditional techniques. The technique has led to some phenomenal results in the DAWNBench project, producing the cheapest and fastest models at the time.

The basic idea of super-convergence is to make use of a much higher learning rate while still ensuring the network weights converge.

This is achieved through use of the 1Cycle learning rate policy. The 1Cycle policy is a specific schedule for adapting the learning rate and, if the optimizer supports it, the momentum parameters during training.

The policy can be described as follows:

- Choose a high maximum learning rate and a maximum and minimum momentum.
- In phase 1, starting from a much lower learning rate (`lr_max / div_factor`, where `div_factor` is e.g. `25.`), gradually increase the learning rate to the maximum while gradually decreasing the momentum to the minimum.
- In phase 2, reverse the process: decrease the learning rate back to the learning rate minimum while increasing the momentum to the maximum momentum.
- In the final phase, decrease the learning rate further (e.g. to `lr_max / (div_factor * 100)`), while keeping momentum at the maximum.
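The three phases above can be sketched as a plain Python function. This is only an illustration, not the implementation from the paper or from any library: the momentum bounds (`0.85` and `0.95`), the share of training given to the final phase (`final_frac`) and the use of linear ramps are all assumptions made for the example.

```python
def one_cycle_linear(step, total_steps, lr_max, div_factor=25.0,
                     mom_min=0.85, mom_max=0.95, final_frac=0.1):
    """Return (learning_rate, momentum) for a 0-based training step.

    Assumptions (not from the post): phases 1 and 2 are equally long,
    the final phase takes `final_frac` of training, all ramps are linear.
    """
    lr_min = lr_max / div_factor
    lr_final = lr_max / (div_factor * 100)
    final_steps = int(total_steps * final_frac)
    phase_steps = (total_steps - final_steps) // 2

    def lerp(start, end, pct):
        # Linear interpolation: pct=0 gives start, pct=1 gives end.
        return start + (end - start) * pct

    if step < phase_steps:
        # Phase 1: learning rate up, momentum down.
        pct = step / phase_steps
        return lerp(lr_min, lr_max, pct), lerp(mom_max, mom_min, pct)
    if step < 2 * phase_steps:
        # Phase 2: learning rate back down, momentum back up.
        pct = (step - phase_steps) / phase_steps
        return lerp(lr_max, lr_min, pct), lerp(mom_min, mom_max, pct)
    # Final phase: decay the learning rate further, hold momentum at its maximum.
    remaining = total_steps - 2 * phase_steps
    pct = (step - 2 * phase_steps) / max(1, remaining - 1)
    return lerp(lr_min, lr_final, pct), mom_max
```

Calling `one_cycle_linear(step, total_steps, lr_max)` once per batch gives the values an optimizer would be updated with at that step.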

Work from the FastAI team has shown that the policy can be improved by using just two phases:

- Phase 1 is the same, except that cosine annealing is used to increase the learning rate and decrease the momentum.
- In phase 2, the learning rate is decreased again using cosine annealing, to a value of approximately 0, while the momentum increases to the maximum momentum.
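As a rough sketch of the two-phase variant, the function below computes the learning rate and momentum for a given step using cosine annealing. It is not the FastAI implementation: the fraction of training spent in phase 1 (`pct_start=0.3`) and the momentum bounds are assumptions chosen for the example.

```python
import math


def cosine_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle_cosine(step, total_steps, lr_max, div_factor=25.0,
                     pct_start=0.3, mom_min=0.85, mom_max=0.95):
    """Return (learning_rate, momentum) for a 0-based training step."""
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        # Phase 1: anneal the learning rate up and the momentum down.
        pct = step / warmup_steps
        return (cosine_anneal(lr_max / div_factor, lr_max, pct),
                cosine_anneal(mom_max, mom_min, pct))
    # Phase 2: anneal the learning rate to 0 and the momentum back up.
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    return (cosine_anneal(lr_max, 0.0, pct),
            cosine_anneal(mom_min, mom_max, pct))
```

Evaluating the function for every step of a run and plotting the two resulting series gives curves of the kind this policy is known for: a smooth rise and fall of the learning rate, mirrored by the momentum.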

Over the course of training this leads to the following learning rate and momentum schedules:

For a more in depth analysis of the 1Cycle policy see Sylvain Gugger's post on the topic.

The policy is straightforward to implement in TensorFlow 2. The implementation given below is based on the FastAI library implementation.

Applying the 1Cycle callback is straightforward: simply add it as a callback when calling `model.fit(...)`:

```
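import numpy as np
import tensorflow as tf

# OneCycleScheduler, build_model, x_train, batch_size and train_ds are defined elsewhere.
epochs = 3
lr = 5e-3  # maximum learning rate for the 1Cycle schedule
steps = np.ceil(len(x_train) / batch_size) * epochs  # total number of training batches
lr_schedule = OneCycleScheduler(lr, steps)
model = build_model()
optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr)  # `lr` is a deprecated alias
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_ds, epochs=epochs, callbacks=[lr_schedule])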
```

For a complete example of how the 1Cycle policy is applied, including how to find an appropriate maximum learning rate, to two CNN based learning tasks, a Kaggle notebook has been made available.

- Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates, Leslie N. Smith, Nicholay Topin
- The 1cycle policy, Sylvain Gugger
- FastAI callbacks.one_cycle
- https://www.kaggle.com/avanwyk/tf2-super-convergence-with-the-1cycle-policy

Choosing a good learning rate is the most important hyper-parameter choice when training a deep neural network (assuming a gradient based optimization algorithm is used).

Choosing a learning rate that's too small leads to extremely long training times, whereas a learning rate that's too large might miss the optimum and lead to training divergence.

Fortunately there is a simple way to estimate a good learning rate. It was first described by Leslie Smith in Cyclical Learning Rates for Training Neural Networks, and then popularized by the FastAI library, which has a first class implementation of a learning rate finder.

The technique can be described as follows:

- Start with a very low learning rate, e.g. 1e-7.
- After each batch, increase the learning rate and record the loss and learning rate.
- Stop when a very high learning rate (10+) is reached, or the loss value explodes.
- Plot the recorded losses and learning rates against each other and choose a learning rate where the loss is strictly decreasing at a rapid rate.

For a more thorough explanation of the technique see Sylvain Gugger's post.

Implementing the technique in TensorFlow 2 is straightforward when it is implemented as a Keras Callback. A TensorFlow 2 compatible implementation is given below and is also available on Github.

The implementation uses an exponentially increasing learning rate, which means smaller learning rate regions will be explored more thoroughly than larger learning rate regions.

The losses are also smoothed using a smoothing factor to prevent sudden or erratic changes in the loss (due to the stochastic nature of the training) from stopping the search process prematurely.
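The two ideas above, an exponentially increasing learning rate and a smoothed loss, can be sketched in plain Python. This is a minimal illustration rather than the LRFinder callback itself: the smoothing factor `beta=0.98`, the bias correction and the stopping threshold (`factor=4.0`) are assumptions made for the example.

```python
def exponential_lrs(start_lr, end_lr, num_steps):
    """Learning rates that grow by a constant factor at every step."""
    factor = (end_lr / start_lr) ** (1 / (num_steps - 1))
    return [start_lr * factor ** i for i in range(num_steps)]


def smooth(losses, beta=0.98):
    """Exponentially weighted moving average of the losses, bias corrected."""
    avg, smoothed = 0.0, []
    for i, loss in enumerate(losses, start=1):
        avg = beta * avg + (1 - beta) * loss
        # Divide by (1 - beta^i) so early values are not biased towards 0.
        smoothed.append(avg / (1 - beta ** i))
    return smoothed


def should_stop(smoothed_loss, best_loss, factor=4.0):
    """Treat the loss as diverged once it exceeds `factor` times the best value."""
    return smoothed_loss > factor * best_loss
```

Because each step multiplies the learning rate by the same factor, every decade (for example 1e-7 to 1e-6, or 1 to 10) receives the same number of batches, which is why the small learning rates are sampled far more densely in absolute terms.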

To use the LRFinder, instantiate and compile a model, then pass the LRFinder as a callback when fitting the model as usual. The callback will record the losses and learning rates and stop training when the loss value diverges or the maximum learning rate is reached.

```
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout


def build_model():
    return tf.keras.models.Sequential([
        Conv2D(32, 3, activation='relu'),
        MaxPool2D(),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.1),
        Dense(10, activation='softmax')
    ])


# LRFinder is the callback described above; train_ds is the training dataset.
lr_finder = LRFinder()
model = build_model()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
_ = model.fit(train_ds, epochs=5, callbacks=[lr_finder], verbose=False)
lr_finder.plot()
```

The plot method will produce a graph of the results, allowing you to choose a learning rate visually:

A value should be chosen in a region where the loss is rapidly, but strictly decreasing. Examples of such graphs and how they are interpreted are also available in previous posts.

It is important to rebuild and recompile the model after the LRFinder is used in order to reset the weights that were updated during the mock training run.

A complete example of how the LRFinder is applied is available in this Jupyter notebook.

- Cyclical Learning Rates for Training Neural Networks, Leslie N. Smith
- https://docs.fast.ai/callbacks.lr_finder.html
- How Do You Find a Good Learning Rate, Sylvain Gugger

(Note: this post was updated on 2019-05-19 for clarity.)

In this post we will look at an end-to-end case study of how to create and clean your own small image dataset from scratch and then train a ResNet convolutional neural network to classify the images using the FastAI library.

Besides gathering the data, we will also illustrate how to perform model assisted data cleaning to partially automate the cleaning of the data itself.

- Creating an Image Dataset
  i. Downloading the Data
  ii. Cleaning the Data
- Training the Model
  i. Building the Dataset
  ii. Creating the Model
  iii. Fitting the Model
  iv. Initial Results
- Model Assisted Data Cleaning
- Full Model Training
  i. Results
- Conclusion

The classification problem we will be solving is *the classification of major species of African antelope in the wild.* The dataset we will create will consist of 13 African antelope species. As we will see this is an interesting challenge, as orientation, colour and very specific features of the antelope (e.g. the horns) are often necessary to distinguish each species. We will also see that it's not always necessary to have a very large dataset in order to use deep learning.

A Jupyter notebook and Python script with the complete code for the example are available on Github. In order to run the notebook or script, ensure you have a FastAI environment set up.

We will start by downloading the images for our dataset. When creating your own dataset, carefully think of the use case you are building it for, and think of the type of images that are representative of the actual problem you are trying to solve.

In the case of antelope, there are a few things to consider:

- *Male* and *female* variants of the species have significant differences.
- We are interested in pictures of the animals in the *wild* as opposed to captivity.
- The *young* of each species could be very different from the adult.
- The *colour* of a species could be a distinguishing factor. For example, photos taken at dawn or dusk might not be appropriate.

In general try and think of any *biases* or specific *contexts* present in your subject matter that might not be applicable to the problem being solved.

In order to download the actual images, we will use google-images-download, an open source tool that can download images from Google Images based on keyword search.

The code to download the images is as follows:

```
from pathlib import Path

from google_images_download import google_images_download


# ANTELOPE is the list of species names, defined elsewhere.
def download_antelope_images(output_path: Path, limit: int = 50) -> None:
    """Download images for each of the antelope to the output path.

    Each species is put in a separate sub-directory under output_path.
    """
    response = google_images_download.googleimagesdownload()
    for antelope in ANTELOPE:
        for gender in ['male', 'female']:
            output_directory = str(output_path/antelope).replace(' ', '_')
            arguments = {
                'keywords': f'wild {antelope} {gender} -hunting -stock',
                'output_directory': output_directory,
                'usage_rights': 'labeled-for-nocommercial-reuse',
                'no_directory': True,
                'size': 'medium',
                'limit': limit
            }
            response.download(arguments)
```

The code above searches for images of each antelope species in the `ANTELOPE` list. For every species, we perform two searches: one for male examples and one for female examples. We add the keyword `wild` to look for examples of the antelope in the wild, while excluding the keywords `hunting` and `stock` to limit the search to images applicable to our use case. Also be sure to search for images with the appropriate usage rights.

The images are downloaded with each species placed in a separate folder named for the species, thereby building an '*Imagenet-style*' dataset. This is compatible with FastAI's `ImageDataBunch.from_folder` helper, which we will use to load the dataset for training.

The download was limited to 50 examples each for the male and female of each species.

Even though Google does a very good job of finding the correct images for keyword searches, we still have to make sure the images are appropriate for our use case.

Unfortunately this is a time consuming process which is hard to automate (more on that later). Some checks can be automated, for instance, removing duplicates based on MD5 sums, or using the file names to check for labelling errors (as I do in the Python script). However, I still had to manually inspect the images, removing examples I considered inappropriate. These images mostly involved photos of multiple species in a single example, images of predators hunting or feasting on the antelope, man made illustrations of the antelope or antelopes in captivity.
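As an illustration of the MD5-based duplicate check mentioned above, here is a minimal sketch using only the standard library. It is not the code from the accompanying Python script, and the function name `remove_duplicate_files` is hypothetical.

```python
import hashlib
from pathlib import Path


def remove_duplicate_files(directory: Path) -> list:
    """Delete files whose content hash matches an earlier file.

    Returns the list of paths that were removed.
    """
    seen = {}
    removed = []
    # Sort so the file that is kept is deterministic.
    for path in sorted(directory.rglob('*')):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()
            removed.append(path)
        else:
            seen[digest] = path
    return removed
```

Note that a content hash only catches byte-identical files. The same photo saved at a different size or compression level will have a different MD5 sum, so it still has to be found by manual inspection.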

After the data cleaning I was left with between 60 and 100 images (with an average of 85) per species. This is not a large dataset - we will however see that the deep learning model still performs very well.

With the data prepared we can now build the training and validation datasets and train our model. We will be using transfer learning to train a ResNet model that is pre-trained on the ImageNet dataset.

FastAI makes use of `DataBunch` objects to group the training, validation and test datasets. The `DataBunch` object also makes sure the Pytorch `DataLoader` loads to the correct device (GPU/CPU) and supports applying image transforms for data augmentation. Further, the DataBunch normalizes the data using the ImageNet statistics, which is necessary, as the model is pre-trained on the ImageNet data. The `ImageDataBunch` can be created with:

```
from fastai.vision import *  # FastAI v1: ImageDataBunch, get_transforms, imagenet_stats

image_data = (ImageDataBunch.from_folder(DATA_PATH, valid_pct=VALID_PCT,
                                         ds_tfms=get_transforms(),
                                         size=IMAGE_SIZE,
                                         bs=BATCH_SIZE)
              .normalize(imagenet_stats))
```

We specify the percentage of the data to use for the validation set with `VALID_PCT` (`0.2` or 20% in this case), the `IMAGE_SIZE` (224 for ImageNet trained models) and a `BATCH_SIZE` (32 for this example, but you can use a smaller or larger batch size, depending on how much VRAM your GPU has).

Creating the ResNet model is very straightforward with FastAI. We use the `cnn_learner` helper method, specifying our `ImageDataBunch` and chosen ResNet architecture:

`learn = cnn_learner(image_data, models.resnet50, metrics=[error_rate, accuracy])`

Here we use a pre-trained `resnet50` model from the Pytorch Torchvision library. If you have a smaller GPU, a pre-trained `resnet34` works equally well.

We also specify the `error_rate` as a metric that will be calculated during training.

We are now ready to fit the model to our data. The initial training will only fine-tune the top fully-connected layers of the model; the other layer weights being frozen.

Before starting the training, we have to choose an appropriate learning rate, which is perhaps the single most important choice for effective training. FastAI provides the supremely useful `lr_find` method for this purpose, which is based on the technique discussed in Cyclical Learning Rates for Training Neural Networks.

```
learn.lr_find()
learn.recorder.plot()
```

We then simply choose a learning rate (or range) where the loss is strictly decreasing. It's beneficial to choose the largest learning rate that has a decreasing loss, as this will speed up training.

Having chosen a learning rate range (`[1e-3, 1e-2]`), we perform 5 training epochs using the 1cycle learning policy.

```
learn.fit_one_cycle(5, max_lr=slice(1e-3, 1e-2))
learn.save('stage-1')
```

```
epoch train_loss valid_loss error_rate time
0 1.352547 0.909331 0.281369 00:14
1 1.032153 0.774388 0.205323 00:13
2 0.737094 0.570336 0.178707 00:13
3 0.476649 0.451232 0.129278 00:13
```


After the initial training we reach a validation accuracy of `87.07%`. We can use FastAI's `ClassificationInterpretation` to further interpret the model's performance:

```
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)
```

The interpreter also has a very useful feature that allows us to easily plot the examples that had the largest loss values.

`interp.plot_top_losses(9, figsize=(12,12))`

One issue seems to be images of close-up views of the antelope's face or photos where the antelope is not presented in the typical broadside view. In both these cases, *distinguishing features* such as patterns on the animal's coat or its horns might be missing from the image. This highlights a potential **flaw in how we gather the data**: having many examples of one perspective of the subject matter, but neglecting other, valid perspectives.

The FastAI library provides an extremely useful Jupyter Notebook widget that aids in automating data clean-up by using the trained model itself: the ImageCleaner.

The dataset is first loaded as an ImageDataBunch and its images are indexed by the loss they produce under the trained model, highest first. The `ImageCleaner` is then instantiated from the dataset and indices:

```
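from fastai.widgets import *

# Build a databunch over every image, without a validation split.
# custom_transforms is defined elsewhere.
images = (ImageList.from_folder(DATA_PATH)
          .split_none()
          .label_from_folder()
          .transform(custom_transforms(), size=224)
          .databunch())
# Index the images by the loss the trained model assigns to them.
ds, idxs = DatasetFormatter().from_toplosses(learn)
ImageCleaner(ds, idxs, DATA_PATH)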
```

The widget then allows you to remove images from the dataset or re-label images that are incorrectly labelled.

Importantly, the `ImageCleaner` widget does not modify the data itself but instead creates a `.csv` file that contains the paths and labels of the cleaned data. We then need to construct an `ImageDataBunch` from the `.csv` file:

```
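import pandas as pd

# cleaned.csv is the file written by ImageCleaner (paths and labels of the cleaned data).
df = pd.read_csv(DATA_PATH/'cleaned.csv', header='infer')
image_data = (ImageDataBunch.from_df(DATA_PATH, df,
                                     valid_pct=VALID_PCT,
                                     ds_tfms=custom_transforms(),
                                     size=IMAGE_SIZE,
                                     bs=BATCH_SIZE)
              .normalize(imagenet_stats))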
```

Next we can look at training all the layers of the model instead of just the last, fully-connected layers. This is done by 'unfreezing' the other layers of the model before training.

We also have to find a new learning rate as the optimisation landscape has now completely changed:

```
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
```

Finally, we fit the model again with the 1cycle policy for 20 epochs using a small learning rate:

`learn.fit_one_cycle(20, max_lr=7e-5)`

```
epoch train_loss valid_loss error_rate time
0 0.154993 0.311427 0.127962 00:14
1 0.185968 0.302535 0.127962 00:14
2 0.167942 0.291734 0.109005 00:14
3 0.181434 0.298713 0.094787 00:14
4 0.190612 0.400196 0.118483 00:14
5 0.209943 0.414060 0.118483 00:14
6 0.226450 0.462790 0.132701 00:14
7 0.248497 0.382834 0.113744 00:14
8 0.189046 0.343103 0.113744 00:14
9 0.141687 0.378920 0.132701 00:14
10 0.133787 0.400326 0.099526 00:14
11 0.136122 0.366274 0.109005 00:14
12 0.114380 0.343331 0.094787 00:14
13 0.091698 0.364937 0.109005 00:14
14 0.083694 0.331757 0.113744 00:14
15 0.069167 0.309694 0.104265 00:14
16 0.064571 0.312528 0.094787 00:14
17 0.057514 0.316830 0.085308 00:14
18 0.060952 0.323746 0.104265 00:14
19 0.057364 0.298466 0.085308 00:14
```

Fitting all the layers of the neural network improves our training loss to `0.057` and our validation loss to `0.298` and increases our validation accuracy to `91.4692%`.

Similar to earlier, we can create an interpreter and visualise our top losses:

```
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(12,12), heatmap=True)
```

Here we pass the parameter `heatmap=True` to the `plot_top_losses` method, which will produce Grad-CAM (Gradient-weighted Class Activation Mapping) heatmaps for the images. Grad-CAM visualisations highlight the important regions in the image used for the prediction.

The Grad-CAM visualisations show that the model does, however, correctly identify the regions containing the antelope, and confirm that it tends to focus on regions containing the body and horns of the antelope.

Finally we can calculate our final F1 score, also making use of TTA (Test Time Augmentation). TTA applies the same augmenting transforms we used during training when making a prediction. The actual prediction is then the average of the predictions over the transformations of an example, increasing the chance the model makes the correct prediction.

```
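import numpy as np
from sklearn.metrics import f1_score

preds, targets = learn.TTA()  # predictions using test time augmentation
predicted_classes = np.argmax(preds, axis=1)
f1_score(targets, predicted_classes, average='micro')
# Output: 0.9004739336492891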
```

We end up with a final F1 score of `0.9`.
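To make the averaging step behind TTA concrete, the sketch below averages the class probabilities predicted for several augmented views of one image and picks the most likely class. It is a simplified illustration with made-up probabilities: it uses a plain average, and the library's own `TTA` method may weight the views differently.

```python
def average_predictions(per_view_probs):
    """Average class probabilities over several augmented views of one image."""
    n_views = len(per_view_probs)
    n_classes = len(per_view_probs[0])
    return [sum(view[c] for view in per_view_probs) / n_views
            for c in range(n_classes)]


def predict_class(per_view_probs):
    """Index of the class with the highest averaged probability."""
    avg = average_predictions(per_view_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Made-up probabilities for three augmented views of the same image.
views = [
    [0.6, 0.4, 0.0],  # this view alone would favour class 0
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
print(predict_class(views[:1]), predict_class(views))
```

In this example the first view on its own favours class 0, while the average over all three views favours class 1, which is the effect TTA relies on: a single unlucky crop or flip has less influence on the final prediction.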

There are a number of things we can investigate to further improve the model performance:

- More data could be gathered, especially of specific edge cases the model is struggling with: front and rear views of the animals and close-ups of antelope faces.
- Validate the transformations used to augment the dataset, especially colour distortion and image rotation/cropping. Some very specific features are sometimes required to distinguish one species from another and as such we have to ensure the transformations we use don't discard this information.
- Alternative architectures should be investigated that might perform better with the specific use case.

In this post we covered an end-to-end example of creating our own image dataset and using transfer learning to create an accurate deep learning image classifier for African antelope species. Our ResNet50 model reached an F1 score of `0.9` after only 24 epochs of training on roughly 880 examples spread over the 13 classes.

Unsurprisingly, the hardest and most time consuming part of the deep learning exercise was not training the model; indeed, the FastAI code to do so is only 5 lines long:

```
image_data = ImageDataBunch.from_folder(DATA_PATH, valid_pct=VALID_PCT,
                                        ds_tfms=get_transforms(),
                                        size=IMAGE_SIZE,
                                        bs=BATCH_SIZE).normalize(imagenet_stats)
learner = cnn_learner(image_data, architecture, metrics=error_rate)
learner.fit_one_cycle(5, max_lr=slice(1e-3, 1e-2))
learner.unfreeze()
learner.fit_one_cycle(5, 1e-4)
```

Instead, the most difficult part is gathering and cleaning the data. Manual inspection of the data is tedious and time consuming, and still resulted in some problems slipping through.

However, we also demonstrated how to use the model itself to aid in cleaning the dataset using the `ImageCleaner` widget from the FastAI library.

Furthermore, we found that the dataset is not fully representative of the problem we are trying to solve, as the dataset is missing examples of some valid perspectives we might encounter in the real world.

There is no simple solution to creating a high quality and error free dataset (which is why open data initiatives are so valuable). However, an alternative to creating your own dataset is to find a dataset similar to the one you need to solve your problem and then modify it. In this case, we could have started with a dataset such as the Snapshot Serengeti dataset and used only images of antelope contained therein. An exercise left for next time.

**The most up to date installation instructions are available on Github and the docs site, I would recommend starting there. A list of common troubleshooting issues is also available.**

I list the steps I followed for personal reference, which includes solving some minor issues I encountered in setting up a full DL environment on a GPU equipped laptop running Ubuntu 18.04.

If you are installing FastAI to do one of the deep learning courses, I recommend one of the various cloud solutions available instead of setting up a CUDA/Anaconda environment as below.

The instructions below install FastAI v1 within a freshly created Anaconda virtual environment. They assume you have Anaconda and NVIDIA CUDA (along with an appropriate NVIDIA driver) installed.

First, ensure conda is up to date, otherwise conda might complain about `PackagesNotFoundError`s.

```
conda update conda
```

I recommend installing into a virtual environment, to prevent interference from other libraries and system packages. You can create a Python 3.7 virtual environment to install FastAI in as follows:

```
conda create -n fastai python=3.7 mypy pylint jupyter scikit-learn pandas
source activate fastai
```

Next, if you are planning on installing the GPU version, verify which CUDA you have installed:

```
nvcc --version # Cuda compilation tools, release 10.0, V10.0.130
```

You can find the corresponding conda package using:

```
conda search "cuda*" -c pytorch  # quoted so the shell does not expand the wildcard
```

Look for the `cudaXX` package that matches your CUDA version as reported by `nvcc`.

You can now install `pytorch` and `fastai` using conda.

```
conda install cudatoolkit=10.0 -c pytorch -c fastai fastai
```

**A note on CUDA versions**: I recommend installing the latest CUDA version supported by Pytorch if possible (10.0 at the time of writing), however, to avoid potential issues, stick with the same CUDA version you have a driver installed for.

You can verify that the CUDA installation went smoothly and that Pytorch is using your GPU using the following command:

```
python -c "import torch; print(torch.cuda.get_device_name(torch.cuda.current_device()))"
```

It should print the name of the device (GPU) you have attached to the machine.

**Note for NLP (using FastAI v1 for text):** if you plan on using FastAI for NLP, I recommend also downloading the relevant language packages for `spacy`, otherwise you might hit some obscure errors when attempting to parse textual data.

```
python -m spacy download en
```

A number of Cloud services have first class support for FastAI. I've personally used https://www.paperspace.com/ a lot and can recommend it. There are a number of alternative options. If you are looking for a VM based option (which gives you a little more control over your environment), I recommend Google Cloud Platform or Microsoft Azure.

The FastAI v1 docs are really great, you can find them here: http://docs.fast.ai.

The first major version of the FastAI deep learning library, FastAI v1, was recently released. For those unfamiliar with the FastAI library, it's built on top of Pytorch and aims to provide a consistent API for the major deep learning application areas: vision, text and tabular data. The library also focuses on making state of the art deep learning techniques available seamlessly to its users.

This post will cover getting started with FastAI v1, using tabular data as an example. It is aimed at people who are at least somewhat familiar with deep learning, but not necessarily with using the FastAI v1 library. For more technical details on the deep learning techniques used, I recommend this post by Rachel of FastAI.

For a guide on installing FastAI v1 on your own machine, or cloud environments you may use, see this post.

Tabular data (referred to as structured data in the library before v1) refers to data that typically occurs in rows and columns, such as SQL tables and CSV files. Tabular data is extremely common in the industry, and is the most common type of data used in Kaggle competitions, but is somewhat neglected in other deep learning libraries. FastAI in turn provides first class API support for tabular data, as shown below.

In the example below we attempt to predict mortality using CDC Mortality data from Kaggle. The complete notebook, which includes pre-processing of the data, is available here: https://github.com/avanwyk/fastai-projects/blob/master/cdc-mortality-tabular-prediction/cdc-mortality.ipynb.

The FastAI v1 tabular data API revolves around three types of variables in the dataset: categorical variables, continuous variables and the *dependent* variable.

```
dep_var = 'age'
categorical_names = ['education', 'sex', 'marital_status']
```

Any variable that is not specified as a categorical variable, will be assumed to be a continuous variable.

For tabular data, FastAI provides a special TabularDataset. The simplest way to construct a `TabularDataset` is using the `tabular_data_from_df` helper. The helper also supports specifying a number of transforms that are applied to the dataframe before building the dataset.

```
tfms = [FillMissing, Categorify]
tabular_data = tabular_data_from_df('output', train_df, valid_df, dep_var, tfms=tfms, cat_names=categorical_names)
```

The `FillMissing` transform will fill in missing values for continuous variables *but not the categorical or dependent variables.* By default it uses the median, but this can be changed to use either a constant value or the most common value.

The `Categorify` transform will change the variables in the dataframe to Pandas category variables for you.

The transforms are applied to the dataframe before being passed to the dataset object.

The `TabularDataset` then does some more pre-processing for you. It automatically converts category variables (which might be text) to sequential, numeric IDs starting at 1 (0 is reserved for NaN values). Further, it automatically normalizes the continuous variables using standardization. You can also pass in statistics for each variable to overwrite the mean and standard deviation used for the normalization, otherwise they will automatically be calculated from the training set.
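The two pre-processing steps described above can be illustrated in plain Python. This is a conceptual sketch and not the `TabularDataset` code: the helper names are hypothetical, `None` stands in for NaN, unseen categories are mapped to 0, and the population standard deviation is used, all of which are assumptions made for the example.

```python
import math


def encode_categories(train_values):
    """Build an encoder mapping categories to IDs starting at 1 (0 = missing)."""
    categories = sorted({v for v in train_values if v is not None})
    mapping = {cat: i for i, cat in enumerate(categories, start=1)}

    def encode(value):
        # Missing values (and, by assumption, unseen categories) map to 0.
        return mapping.get(value, 0)

    return encode


def standardizer(train_values):
    """Build a scaler using the mean and standard deviation of the training set."""
    mean = sum(train_values) / len(train_values)
    var = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = math.sqrt(var)

    def scale(value):
        return (value - mean) / std

    return scale
```

Both helpers are fitted on the training values only and then applied unchanged to the validation data, which mirrors the point above that the statistics are calculated from the training set.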

With the data ready to be used by a deep learning algorithm, we can create a Learner:

```
learn = get_tabular_learner(tabular_data,
                            layers=[100, 50, 1],
                            emb_szs={'education': 6,
                                     'sex': 5,
                                     'marital_status': 8})
learn.loss_fn = F.mse_loss
```

We use a helper function `get_tabular_learner` to set up the tabular data learner for us. We also have to specify an MSE loss function since we are performing a regression task.

A FastAI Learner combines a model with data, a loss function and an optimizer. It also does some other work, such as encapsulating the metric recorder, and has an API for saving and loading the model.

In our case, the helper function will build a TabularModel. The model will consist of an Embedding Layer for each categorical variable (with optional sizes specified), with each layer having its own Dropout and Batchnormalization. Those results are concatenated with the continuous input variables which is then followed by Linear and ReLU layers of the specified sizes. Batchnormalization is added between each layer pair and the last layer pair only includes the Linear layer.

By default, an Adam optimizer will be used.

You can print a summary of the model using:

```
learn.model
```

Before we can start training the model, we have to choose a learning rate (LR). This is where one of the FastAI library's more useful and powerful tools comes in. The FastAI library has first class support for a technique to find an appropriate learning rate with `lr_find`.

```
learn.lr_find()
learn.recorder.plot()
```

Doing the above will (after some training), produce a graph such as this:

Another example:

An appropriate LR can then be selected *by choosing a value that is an order of magnitude lower than the learning rate at which the loss reaches its minimum.* This learning rate will still be aggressive enough to ensure quick training, but is reasonably safe from exploding. For more details on the technique, see here and here.
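The selection rule can be written down in a couple of lines. The sketch below is only an illustration of the rule of thumb, with a hypothetical helper name and made-up loss values; it is not part of the FastAI API.

```python
def suggest_lr(lrs, losses):
    """Learning rate at the lowest recorded loss, divided by 10."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best_index] / 10


# Made-up recordings from a learning rate search, for illustration only.
lrs = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.10, 1.20, 0.90, 3.50]
suggested = suggest_lr(lrs, losses)
```

In practice the recorded losses are noisy, so it is still worth looking at the plot rather than relying on the raw minimum alone.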

We are now ready to train the model:

```
lr = 1e-1
learn.fit_one_cycle(1, lr)
```

The `fit_one_cycle` call fits the model for the specified number of epochs using the OneCycleScheduler callback. The callback automatically applies a two phase learning rate schedule, first increasing the learning rate to `lr_max` (which is the learning rate we specify) and then annealing to 0 in the second phase.

Loss and metrics are recorded by the Recorder callback and are accessible through `learn.recorder`. For example, to plot the training loss you can use:

```
learn.recorder.plot_losses()
```

The FastAI v1 experience has so far been really great. The pre-v1 releases were usable, but definitely lacked some polish (particularly the documentation). The new documentation site is great, and thoroughly explains a lot of the API.

The API itself is incredibly terse and you can do a lot with very few lines of code. I look forward to diving deeper into the API and exploring its flexibility. Another great thing about the API is the consistent use of Python Type Hints which makes it much easier to deduce what the API expects or does while working in notebook environments, in addition to catching obvious errors.

The documentation that was released with FastAI v1 is really great, you can check it out here: http://docs.fast.ai/

Then I also have to mention the really great FastAI forums; they are quite possibly the best deep learning forums in existence.

Lastly, if you haven't done so already, the FastAI course is strongly recommended. A new version of the course based on v1 of the library will launch in early 2019.

A few years ago I came across a method for reading academic papers which I've kept coming back to as a reliable systematic approach to efficiently read important papers of varying complexity.

The method itself comes from a paper by Prof. Srinivasan Keshav, an ACM Fellow and researcher at the University of Waterloo. I recommend reading his paper, but I summarise the system here.

The system uses a top down three pass approach with each pass delving deeper into the details of the paper. Each pass has a specific goal. Depending on what you need to obtain from the paper, completing all three passes may not be necessary.

The goal of the first pass is to get a high level overview of the paper:

- Read the **title**, **abstract**, **introduction**, **section and subsection headings** and the **conclusion**.
- **Glance** at the **references**, noting whether you might have read any of them.

After the first pass you should be able to *categorize* the paper, understand the paper's *context*, validate the basic assumptions for *correctness*, note the main *contributions* and be able to determine the paper's *clarity*.

The first pass is sufficient to determine whether you are interested in the paper, whether it is relevant to your research area and whether there are any questionable assumptions made which may deter your interest.

Also note, if you are writing a paper, a first pass is perhaps all a reviewer will give you. Pay special attention to the parts mentioned above. Strive to be clear and concise in your headings, introduction, conclusion and abstract.

With the second pass the goal is to understand the content of the paper to the point where you could explain it to someone else:

- **Carefully read the paper**, but ignore details such as proofs or very technical details.
- **Make comments and notes** on important points.
- **Study any figures or graphs**, note details such as the axes, labeled points and whether statistical variance is indicated etc.
- Note all **unread references** for further reading.

Doing a second pass is appropriate for papers that you are interested in, but aren't necessarily directly related to your work. After the second pass you may or may not understand the paper. If it is critical to understand the work, or you are reviewing the paper, move on to the third pass.

The idea of the third pass is to understand the paper in such detail that you could *re-implement* it.

- Read the paper with **great attention to detail**, identifying and **challenging every assumption**.
- Given the same assumptions, think about how you would **reproduce and present the result**.
- If novel techniques or methods are used, make sure you understand them to the **degree where you could use them yourself**.

Comparing your idea of implementing the paper with the actual paper will highlight areas where the paper excels or falls short. After the final pass you should be able to reconstruct the structure of the paper from memory, be familiar with the techniques used and identify implicit assumptions and missing references.

For more detail on the system and its motivations and related work, please read Prof. Keshav's paper. It also includes a step based approach for doing a literature survey.

- Keshav, S., 2007. How to read a paper. ACM SIGCOMM Computer Communication Review, 37(3), pp.83-84.

- LightGBM Introduction
- Gradient Boosting
  - Algorithm
- LightGBM API
  - Plotting
  - Saving the model
- LightGBM Parameters
  - Tree parameters
  - Tuning for imbalanced data
  - Tuning for overfitting
  - Tuning for accuracy
- Resources

Although maybe not as fashionable as deep learning algorithms in 2018, the effectiveness of tree and tree ensemble based learning methods certainly cannot be questioned. Across a variety of domains (restaurant visitor forecasting, music recommendation, safe driver prediction, and many more), ensemble tree models - specifically gradient boosted trees - are widely used on Kaggle, often as part of the winning solution.

Decision trees also have certain advantages over deep learning methods: decision trees are more readily interpreted than deep neural networks, naturally better at learning from imbalanced data, often much faster to train, and can work with categorical feature data without requiring it to be one-hot encoded.

This post gives an overview of LightGBM and aims to serve as a practical reference. A brief introduction to gradient boosting is given, followed by a look at the LightGBM API and algorithm parameters. The examples given in this post are taken from an end-to-end practical example of applying LightGBM to the problem of credit card fraud detection: https://www.kaggle.com/avanwyk/a-lightgbm-overview.

LightGBM is an open-source framework for gradient boosted machines. By default LightGBM will train a Gradient Boosted Decision Tree (GBDT), but it also supports random forests, Dropouts meet Multiple Additive Regression Trees (DART), and Gradient-based One-Side Sampling (GOSS).

The framework is fast and was designed for distributed training. It supports large-scale datasets and training on the GPU. In many cases LightGBM has been found to be more accurate and faster than XGBoost, though this is problem dependent.

Both LightGBM and XGBoost are widely used and provide highly optimized, scalable and fast implementations of gradient boosted machines (GBMs). I have previously used XGBoost for a number of applications, but have yet to take an in depth look at LightGBM.

The section below gives some theoretical background on gradient boosting. The LightGBM API section continues with the practicalities of using LightGBM.

When considering ensemble learning, there are two primary methods: *bagging* and *boosting*. Bagging involves the training of many independent models and combines their predictions through some form of aggregation (averaging, voting etc.). An example of a bagging ensemble is a Random Forest.

Boosting instead trains models *sequentially*, where each model learns from the errors of the previous model. Starting with a weak base model, models are trained iteratively, each adding to the prediction of the previous model to produce a strong overall prediction.

In the case of gradient boosted decision trees, successive models are found by applying gradient descent in the direction of the average gradient, calculated with respect to the error residuals of the loss function, of the leaf nodes of previous models.

An excellent explanation of gradient boosting is given by Ben Gorman over on the Kaggle Blog and I strongly advise reading the post if you would like to understand gradient boosting. A summary is given here.

Considering decision trees, we proceed as follows. We start with an initial fit, \(F_0\), of our data: a constant value that minimizes our loss function \(L\):

$$ F_0(x) = \underset{\gamma}{arg\ \min} \sum^{n}_{i=1} L(y_i, \gamma) $$

In the case of optimizing the mean squared error, we can take the mean of the target values:

$$ F_0(x) = \frac{1}{n} \sum^{n}_{i=1} y_i $$
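To see why the mean is the minimizer, here is a short derivation that assumes the squared error loss \(L(y_i, \gamma) = (y_i - \gamma)^2\). Setting the derivative with respect to \(\gamma\) to zero gives:

$$ \frac{\partial}{\partial \gamma} \sum^{n}_{i=1} (y_i - \gamma)^2 = -2 \sum^{n}_{i=1} (y_i - \gamma) = 0 \quad \Rightarrow \quad \gamma = \frac{1}{n} \sum^{n}_{i=1} y_i $$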

With our initial guess of \(F_0\), we can now calculate the negative gradient, or *pseudo* residuals, of \(L\) with respect to \(F_0\):

$$ r_{i1} = -\frac{\partial L(y_i, F_{0}(x_i))}{\partial F_{0}(x_i)} $$
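As a worked example, assume the squared error loss \(L(y_i, F) = \frac{1}{2}(y_i - F)^2\) (the factor \(\frac{1}{2}\) is a convenience assumption, not something the post specifies). The pseudo residuals then reduce to the ordinary residuals, which is where the name comes from:

$$ r_{i1} = -\frac{\partial}{\partial F_{0}(x_i)} \frac{1}{2} \left( y_i - F_{0}(x_i) \right)^2 = y_i - F_{0}(x_i) $$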

We now fit a decision tree \(h_{1}(x)\), to the residuals. Using a regression tree, this will yield the **average gradient** for each of the leaf nodes.

Now we can apply gradient descent to minimize the loss for each leaf by stepping in the direction of the **average gradient** for the leaf nodes as contained in our decision tree \(h_{1}(x)\). The step size is determined by a multiplier \(\gamma_{1}\) which can be optimized by performing a line search. The step size is further shrunk using a learning rate \(\lambda_{1}\), thus yielding a new boosted fit of the data:

$$ F_{1}(x) = F_{0}(x) + \lambda_1 \gamma_1 h_1(x) $$

Putting it all together, we have the following algorithm. For a number of boosting rounds \(M\) and a differentiable loss function \(L\):

Let \( F_0(x) = \underset{\gamma}{arg\ \min} \sum^{n}_{i=1} L(y_i, \gamma) \)

For m = 1 to M:

- Calculate the *pseudo* residuals \( r_{im} = -\frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} \)
- Fit decision tree \( h_m(x) \) to \( r_{im} \)
- Compute the step multiplier \( \gamma_m \) for each leaf of \( h_m(x) \)
- Let \( F_m(x) = F_{m-1}(x) + \lambda_m \gamma_m h_m(x) \), where \( \lambda_m \) is the learning rate for iteration \(m\)
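The algorithm above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how LightGBM implements it: it assumes squared error loss (so the pseudo residuals are simply `y - F(x)` and the leaf mean is already the optimal step, i.e. \(\gamma_m = 1\)), a single numeric feature, depth-1 trees (stumps), and a constant learning rate. The helper names `fit_stump`, `boost` and `predict` are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree by picking the split with the lowest squared error.

    Each leaf value is the mean residual (the average gradient) of its samples.
    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, rounds, learning_rate):
    """Return (F_0, list of stumps) after the given number of boosting rounds."""
    f0 = mean(ys)  # F_0: the constant that minimizes squared error
    predictions = [f0] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Pseudo residuals for squared error loss are the plain residuals.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # F_m(x) = F_{m-1}(x) + learning_rate * h_m(x)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return f0, stumps


def predict(model, x, learning_rate):
    f0, stumps = model
    return f0 + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys, learning_rate):
    return mean([(y - predict(model, x, learning_rate)) ** 2 for x, y in zip(xs, ys)])


xs = list(range(10))
ys = [x * x for x in xs]
lr = 0.1
baseline = boost(xs, ys, rounds=0, learning_rate=lr)   # just the constant F_0
boosted = boost(xs, ys, rounds=50, learning_rate=lr)
```

Comparing `mse(baseline, xs, ys, lr)` with `mse(boosted, xs, ys, lr)` shows the training error shrinking as more trees are added.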

One caveat of the above explanation is that it neglects to incorporate a regularization term in the loss function. An overview of the gradient boosting as given in the XGBoost documentation pays special attention to the regularization term while deriving the objective function.

In terms of LightGBM specifically, a detailed overview of the LightGBM algorithm and its innovations is given in the NIPS paper.

Fortunately the details of the gradient boosting algorithm are well abstracted by LightGBM, and using the library is very straightforward.

LightGBM requires you to wrap datasets in a LightGBM Dataset object:

```
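import lightgbm as lgb  # X_train, y_train, X_val and y_val are assumed to exist
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
# reference= lets the validation set reuse the training set's feature bins.
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train, free_raw_data=False)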
```

The parameter `free_raw_data` controls whether the input data is freed after constructing the inner datasets.

LightGBM supports many parameters that control various aspects of the algorithm (more on that below). Some core parameters that should be defined are:

```
core_params = {
    'boosting_type': 'gbdt',  # rf, dart, goss
    'objective': 'binary',  # regression, multiclass, binary
    'learning_rate': 0.05,
    'num_leaves': 31,
    'nthread': 4,
    'metric': 'auc'  # binary_logloss, mse, mae
}
```

We can then call the training API to train a model, specifying the number of boosting rounds and early stopping rounds as needed:

```
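# init_gbm, boost_rounds and early_stopping_rounds are assumed to be defined.
# Note: from LightGBM 4.0 the early_stopping_rounds, evals_result and
# verbose_eval arguments were replaced by callbacks (lgb.early_stopping,
# lgb.record_evaluation and lgb.log_evaluation).
evals_result = {}
gbm = lgb.train(core_params,  # parameter dict to use
                lgb_train,  # training Dataset
                init_model=init_gbm,  # enables continuous training.
                num_boost_round=boost_rounds,  # number of boosting rounds.
                early_stopping_rounds=early_stopping_rounds,
                valid_sets=lgb_val,  # validation Dataset
                evals_result=evals_result,  # stores validation results.
                verbose_eval=False)  # set to True to print evaluations during training.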
```

Early stopping occurs when there is no improvement in either the objective evaluations or the metrics we defined as calculated on the validation data.
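The stopping rule can be illustrated with a small, library-independent sketch. It is not LightGBM's internal implementation; the function name `find_stopping_round` and the example scores are hypothetical, and the score list is assumed to be non-empty.

```python
def find_stopping_round(scores, patience, higher_is_better=True):
    """Return (round at which training stops, best round so far).

    Training stops once `patience` rounds have passed without the
    validation score improving on the best score seen so far.
    """
    best_round, best_score = 0, scores[0]
    for current_round, score in enumerate(scores):
        improved = score > best_score if higher_is_better else score < best_score
        if improved:
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            return current_round, best_round
    # Never ran out of patience: training used every round.
    return len(scores) - 1, best_round


# Hypothetical validation AUC per boosting round.
validation_auc = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.79]
stopped_at, best = find_stopping_round(validation_auc, patience=3)
```

With a patience of 3 the loop stops at round 5 and reports round 2 as the best, so the later improvement at round 6 is never reached. This is the trade-off to keep in mind when choosing the number of early stopping rounds.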

LightGBM also supports continuous training of a model through the `init_model` parameter, which can accept an already trained model.

A detailed overview of the Python API is available here.

LightGBM has a built in plotting API which is useful for quickly plotting validation results and tree related figures.

Given the `evals_result` dictionary from training, we can easily plot validation metrics:

```
_ = lgb.plot_metric(evals_result)
```

Another very useful feature that contributes to the explainability of the model is relative feature importance:

```
_ = lgb.plot_importance(gbm)
```

It is also possible to visualize individual trees:

```
_ = lgb.plot_tree(gbm, figsize=(20, 20))
```

Models can easily be saved to a file or JSON:

```
gbm.save_model('cc_fraud_model.txt')
loaded_model = lgb.Booster(model_file='cc_fraud_model.txt')
# Output to JSON
model_json = gbm.dump_model()
```

A list of more advanced parameters for controlling the training of a GBDT is given below with a brief explanation of their effect on the algorithm.

```
advanced_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.01,
    'num_leaves': 41,  # more increases accuracy, but may lead to overfitting.
    'max_depth': 5,  # shallower trees reduce overfitting.
    'min_split_gain': 0,  # minimal loss gain to perform a split.
    'min_child_samples': 21,  # specifies the minimum samples per leaf node.
    'min_child_weight': 5,  # minimal sum hessian in one leaf.
    'lambda_l1': 0.5,  # L1 regularization.
    'lambda_l2': 0.5,  # L2 regularization.
    # LightGBM can subsample the data for training (improves speed):
    'feature_fraction': 0.5,  # randomly select a fraction of the features.
    'bagging_fraction': 0.5,  # randomly bag or subsample training data.
    'bagging_freq': 0,  # perform bagging every Kth iteration, disabled if 0.
    'scale_pos_weight': 99,  # add a weight to the positive class examples.
                             # this can account for highly skewed data.
    'subsample_for_bin': 200000,  # sample size to determine histogram bins.
    'max_bin': 1000,  # maximum number of bins to bucket feature values in.
    'nthread': 4,  # best set to number of actual cores.
}
```

LightGBM builds its trees leaf-wise. XGBoost grows trees level-wise by default, but also supports leaf-wise growth through its `lossguide` grow policy.

Building the tree leaf-wise results in faster convergence, but may lead to overfitting if the parameters are not tuned accordingly. Important parameters for controlling the tree building are:

- `num_leaves`: the number of leaf nodes to use. Having a large number of leaves will improve accuracy, but will also lead to overfitting.
- `min_child_samples`: the minimum number of samples (data) to group into a leaf. The parameter can greatly assist with overfitting: larger sample sizes per leaf will reduce overfitting (but may lead to under-fitting).
- `max_depth`: controls the depth of the tree explicitly. Shallower trees reduce overfitting.
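These parameters interact: a binary tree of depth `max_depth` can have at most `2 ** max_depth` leaves, so a larger `num_leaves` cannot be reached when the depth is capped. The sketch below checks a parameter combination for this. The helper names are hypothetical, and it relies on LightGBM's documented convention that `max_depth <= 0` means no depth limit.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves in a binary tree of the given depth."""
    return 2 ** max_depth


def num_leaves_is_reachable(num_leaves, max_depth):
    """Return True if a tree limited to max_depth can actually grow num_leaves leaves."""
    if max_depth <= 0:  # LightGBM convention: no depth limit
        return True
    return num_leaves <= max_leaves_for_depth(max_depth)


depth_cap = max_leaves_for_depth(5)
reachable_31 = num_leaves_is_reachable(31, 5)
reachable_41 = num_leaves_is_reachable(41, 5)
```

With `max_depth=5` the cap is 32 leaves, so `num_leaves=31` is reachable while `num_leaves=41` is effectively limited by the depth.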

The simplest way to account for imbalanced or skewed data is to add a weight to the positive class examples:

- `scale_pos_weight`: the weight can be calculated based on the number of negative and positive examples: `scale_pos_weight = number of negative samples / number of positive samples`.
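As a minimal sketch, the ratio can be computed directly from the training labels. The helper name `compute_scale_pos_weight` and the example label counts are hypothetical, and binary labels encoded as 0 (negative) and 1 (positive) are assumed.

```python
from collections import Counter


def compute_scale_pos_weight(labels):
    """Ratio of negative to positive examples for binary 0/1 labels."""
    counts = Counter(labels)
    negatives, positives = counts[0], counts[1]
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Hypothetical, highly skewed label set: 1% positive examples.
labels = [0] * 990 + [1] * 10
weight = compute_scale_pos_weight(labels)
```

For this label set the weight is 99, so each positive example counts as much as 99 negative examples during training.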

In addition to the parameters mentioned above the following parameters can be used to control overfitting:

- `max_bin`: the maximum number of bins that feature values are bucketed in. A smaller `max_bin` reduces overfitting.
- `min_child_weight`: the minimum sum hessian for a leaf. In conjunction with `min_child_samples`, larger values reduce overfitting.
- `bagging_fraction` and `bagging_freq`: enable bagging (subsampling) of the training data. Both values need to be set for bagging to be used. The frequency controls how often (in iterations) bagging is used. Smaller fractions and frequencies reduce overfitting.
- `feature_fraction`: controls the subsampling of features used for training (as opposed to subsampling the actual training data in the case of bagging). Smaller fractions reduce overfitting.
- `lambda_l1` and `lambda_l2`: control L1 and L2 regularization.

Accuracy may be improved by tuning the following parameters:

- `max_bin`: a larger `max_bin` increases accuracy.
- `learning_rate`: using a smaller learning rate and increasing the number of iterations may improve accuracy.
- `num_leaves`: increasing the number of leaves increases accuracy with a high risk of overfitting.

A great overview of both XGBoost and LightGBM parameters, their effect on various aspects of the algorithms and how they relate to each other is available here.

- LightGBM project: https://github.com/Microsoft/LightGBM
- LightGBM paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf
- Documentation: https://lightgbm.readthedocs.io/en/latest/index.html
- Parameters: https://lightgbm.readthedocs.io/en/latest/Parameters.html
- Parameter explorer: https://sites.google.com/view/lauraepp/parameters

A key concern when dealing with cyclical features is how we can encode the values such that it is clear to the deep learning algorithm that the features occur in cycles. This is of particular concern in deep learning applications as it may have a significant effect on the convergence rate of the algorithm.

This post looks at a strategy to encode cyclical features in order to clearly express their cyclical nature.

A complete example of using the encoding on weather data, which includes illustrating the effect on a three layer deep neural network, is available as a Kaggle Kernel.

The data used below is hourly weather data for the city of Montreal. A complete description of the data is available here. We will be looking at the `hour` attribute of the datetime feature to illustrate the problem with cyclical features.

```
data['hour'] = data.datetime.dt.hour
sample = data[:168] # the first week of the data
ax = sample['hour'].plot()
```

Here we can see exactly what we would expect from an hour value for a week: a cycle between 0 and 23, repeating 7 times.

This graph illustrates the problem with presenting cyclical data to a deep learning algorithm: there are jump discontinuities in the graph at the end of each day when the hour value overflows to 0.

From 22:00 to 23:00 one hour has passed, which is adequately represented by the current unencoded values: the absolute difference between 22 and 23 is 1. However, when considering 23:00 and 00:00, the jump discontinuity occurs, and even though the difference is one hour, with the unencoded feature, the absolute difference in the feature is of course 23.

The same will occur for seconds at the end of each minute, for days at the end of each year and so forth.

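One method for encoding a cyclical feature is to perform a sine and cosine transformation of the feature, scaled by the period \(T\) of the cycle, i.e. the number of distinct values in one cycle (24 for the hour feature, which takes the values 0 to 23):

$$x_{sin} = \sin{(\frac{2 * \pi * x}{T})}$$

$$x_{cos} = \cos{(\frac{2 * \pi * x}{T})}$$

```
# Divide by the period (24), not the maximum value (23): dividing by 23 would
# map 23:00 and 00:00 to the same encoded point.
data['hour_sin'] = np.sin(2 * np.pi * data['hour']/24.0)
data['hour_cos'] = np.cos(2 * np.pi * data['hour']/24.0)
```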

Plotting this feature we now end up with a new feature that is cyclical, based on the sine graph:

```
sample = data[:168]
ax = sample['hour_sin'].plot()
```

If we only use the sine encoding we would still have an issue, as two separate timestamps will have the same sine encoding within one cycle (24 hours in our case), as the graph is symmetrical around the turning points. This is why we also perform the cosine transformation, which is phase offset from sine, and leads to unique values within a cycle in two dimensions.
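Both points can be checked with a small standard-library sketch. It assumes a period of 24, i.e. the number of distinct hour values in a cycle; the helper names `encode_hour` and `encoded_distance` are hypothetical.

```python
import math

PERIOD = 24  # assumption: 24 distinct hour values (0..23) per cycle


def encode_hour(hour, period=PERIOD):
    """Map an hour to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


def encoded_distance(hour_a, hour_b):
    """Euclidean distance between two hours in the encoded 2D space."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))


# Raw values: 23:00 and 00:00 look 23 apart, although only one hour passes.
raw_gap = abs(23 - 0)

# Encoded values: the 23:00 -> 00:00 step is as large as any other hourly step.
step_22_23 = encoded_distance(22, 23)
step_23_00 = encoded_distance(23, 0)

# Sine alone is ambiguous: 01:00 and 11:00 share the same sine value ...
same_sine = math.isclose(encode_hour(1)[0], encode_hour(11)[0])
# ... but the (sin, cos) pair is unique for every hour in the cycle.
unique_points = {tuple(round(v, 9) for v in encode_hour(h)) for h in range(PERIOD)}
```

Here `step_22_23` and `step_23_00` are equal up to floating point error, and `unique_points` contains 24 distinct pairs.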

Indeed, if we plot the feature in two dimensions, we end up with a perfect circle:

```
ax = sample.plot.scatter('hour_sin', 'hour_cos')
ax.set_aspect('equal')  # set_aspect returns None, so call it separately to keep the axes
```

The features can now be used by our deep learning algorithm. As an added benefit, they are also scaled to the range [-1, 1], which will also aid our neural network. A comparison of the effect of the encoding on a simple deep learning model is given in the Kaggle Kernel.

Other machine learning algorithms might be more robust towards raw cyclical features, particularly tree-based approaches. However, deep neural networks stand to benefit from the sine and cosine transformation of such features, particularly in terms of aiding the convergence speed of the network.