<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[avanwyk]]></title><description><![CDATA[Deep Learning, Software Engineering, Machine Learning, Data Science]]></description><link>https://www.avanwyk.com/</link><image><url>https://www.avanwyk.com/favicon.png</url><title>avanwyk</title><link>https://www.avanwyk.com/</link></image><generator>Ghost 4.7</generator><lastBuildDate>Fri, 23 Feb 2024 08:06:35 GMT</lastBuildDate><atom:link href="https://www.avanwyk.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Revisiting Java in 2021 - II]]></title><description><![CDATA[A look at how Java 17 stacks up against Kotlin and Scala for development teams in 2021, and an overview of popular JVM technologies.]]></description><link>https://www.avanwyk.com/revisiting-java-in-2021-ii/</link><guid isPermaLink="false">61276cc90c9ab8157bcf270d</guid><category><![CDATA[software engineering]]></category><category><![CDATA[programming]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Sun, 19 Sep 2021 14:14:09 GMT</pubDate><media:content url="https://www.avanwyk.com/content/images/2021/09/charles-forerunner-gapYVvUg1M8-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.avanwyk.com/content/images/2021/09/charles-forerunner-gapYVvUg1M8-unsplash.jpg" alt="Revisiting Java in 2021 - II"><p><em>Cover image by <a href="https://unsplash.com/@charles_forerunner?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Charles Forerunner</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</em></p><p><a href="https://www.avanwyk.com/revisiting-java-in-2021-i/">In Part I</a>, I gave an overview of the major 
language features introduced between Java 11 and Java 17 - showing us what Java looks like in 2021.</p><p>I also argued that even if Java doesn&apos;t have feature parity with some of the more sophisticated JVM languages (and it definitely doesn&apos;t), Java is very deliberately moving forward and definitely in the right direction.</p><p>But that still leaves the question, where does Java fit in the modern JVM landscape? Are the newer, more feature-rich languages not the obvious choice for all development teams?</p><p>Below, I attempt to address these questions. I also discuss the JVM and give an overview of popular JVM technologies in a variety of contexts.</p><h2 id="goliath-vs-the-davids">Goliath vs the Davids</h2><p>I might make it sound like Kotlin or Scala are nipping at Java&apos;s heels, but assuredly, that is not the case. The <a href="https://insights.stackoverflow.com/survey/2021#most-popular-technologies-language">2021 Stack Overflow Survey results</a> again showed the continued popularity of Java: for every 1 reported Kotlin developer, there are almost 4 Java developers. It&apos;s much worse for Scala, with the ratio being 11:1 in favour of Java (Clojure has it the hardest, but <a href="https://insights.stackoverflow.com/survey/2021#technology-top-paying-technologies">at least Clojure developers are paid well for their efforts</a>).</p><p>Java similarly dominates the TIOBE index, <a href="https://www.tiobe.com/tiobe-index/">sitting pretty at third place</a>, an enviable spot compared to Scala&apos;s 32nd place and Kotlin&apos;s 37th.</p><p>What does this mean? Besides fuel for your favourite language flame-war, not much. 
There are, however, advantages to being popular.</p><p>Building large teams (think hundreds or thousands of developers) requires talent, and talent is easier to find if more people know the language (although, if you want the absolute best people, you might be better off choosing a more niche language).</p><p>Similarly, with so many people working in the language, the amount and quality of related resources are much higher. There are many excellent Java libraries, many well-written books on how to program it effectively and many high profile architects, thought leaders and evangelists that continue to refine and redefine how Java should be programmed.</p><p>Java also arguably has the best tooling in the business. I&apos;d argue that IntelliJ IDEA (and family) is the best IDE out there, more so if you are coding Java. Even if you prefer VS Code, <a href="https://code.visualstudio.com/docs/languages/java">Java is well supported</a>. Build tools such as Maven and Gradle are fast, stable and effective. There is also the JVM platform itself to consider, and more recently <a href="https://www.graalvm.org/docs/getting-started/">GraalVM</a>, but more on that later.</p><h2 id="is-less-more">Is Less More?</h2><p>When it comes to programming languages, is less, more? Go(lang) certainly thinks so. Go famously rejects complexity to serve its <a href="https://talks.golang.org/2012/splash.article">design goal</a> of <a href="https://golang.org/doc/faq#Why_doesnt_Go_have_feature_X">keeping things simple and fast</a>. 
The Go language designers deem this simplicity necessary to achieve scale, both in terms of program size and execution, and to <a href="https://talks.golang.org/2012/splash.article/#TOC_6.">enable large teams of programmers</a>.</p><p>So in terms of Kotlin and Scala, is Java perhaps a better choice precisely because it&apos;s a more straightforward language?</p><p>Scala specifically has so many features it&apos;s considered somewhat of an &apos;everything and the kitchen sink&apos; language, including <a href="https://dzone.com/articles/scala-ad-hoc-polymorphism-explained">ad-hoc polymorphism</a>, <a href="https://docs.scala-lang.org/overviews/macros/overview.html">macros</a>, <a href="https://nrinaudo.github.io/scala-best-practices/definitions/adt.html">ADTs</a>, a <a href="https://github.com/scalaz/scalaz">purely functional programming model</a> and much more.</p><p>Scala&apos;s additional complexity indeed enables powerful and elegant new solutions to programming problems. <a href="https://scalac.io/blog/typeclasses-in-scala/">Ad-hoc polymorphism and type classes</a> lead to beautifully flexible code. The benefits of functional programming have been widely evangelised and form the core of the <a href="https://blog.danlew.net/2017/07/27/an-introduction-to-functional-reactive-programming/">reactive programming model</a>. All of this is why Scala remains one of my favourite languages.</p><p>However, powerful new tools aren&apos;t free, and with the additional complexity, several ancillary issues are introduced downstream, such as <a href="https://stackoverflow.com/questions/30005124/why-is-compilation-very-slow-for-scala-programs">slow compilation</a>, <a href="https://stackoverflow.com/questions/39483325/why-scalas-sbt-is-too-slow">sluggish tooling</a> and a <a href="https://www.reddit.com/r/scala/comments/kjcwgf/why_is_scala_considered_hard/">steep learning curve</a>. 
These issues have nothing to do with the language features or how effectively and elegantly you can solve problems with Scala. Instead, they mar the developer experience, which, at least in my opinion, is perhaps the primary reason Scala has not achieved dominance on the JVM. <a href="https://docs.scala-lang.org/scala3/new-in-scala3.html">Scala 3 overhauls the language</a>, and I am curious to see how that pans out.</p><p>Like Scala, Kotlin also gives us new ways of solving problems that commonly occur in Java codebases or require third-party libraries to solve: compared to Java, Kotlin has better support for <a href="https://kotlinlang.org/docs/lambdas.html#instantiating-a-function-type">functional programming</a>, <a href="https://kotlinlang.org/docs/null-safety.html">null-safety</a>, an <a href="https://kotlinlang.org/api/latest/jvm/stdlib/">expanded standard library</a>, and has <a href="https://kotlinlang.org/docs/coroutines-overview.html">lightweight concurrency constructs</a> similar to Go.</p><p>However, although many see Kotlin as nearly a drop-in replacement for Java (which, at a technical level, it can be), it too has its <a href="https://medium.com/pinterest-engineering/the-case-against-kotlin-2c574cb87953">own learning curve</a>, a fact that is oft poorly acknowledged in my experience.</p><p>This is especially true when we move beyond basic knowledge of Kotlin (how to program in Kotlin) and start considering <em>idiomatic</em> and <em>effective</em> use of Kotlin (how you <em>should </em>program in Kotlin). In the absence of a Kotlin expert on your team, or standard, <a href="https://www.oreilly.com/library/view/effective-java/9780134686097/">widely accepted guidelines</a>, effective use of a language takes experience and is achieved through trial and error. Even experts will need to make choices specific to their team and team size.</p><figure class="kg-card kg-code-card"><pre><code class="language-kotlin">fun usingScoped() {
    val numVowels = getDTO()?.let { dto -&gt;
        dto.string?.let {
            countVowels(it)
        }
    } ?: 0
}

fun usingIf() {
    val dto = getDTO()
    val numVowels = if (dto?.string != null) countVowels(dto.string) else 0
}

fun usingWhenAndScoped() {
    val numVowels = when (val dto = getDTO()) {
        null -&gt; 0
        else -&gt; dto.string?.let { countVowels(it) } ?: 0
    }
}</code></pre><figcaption>Which of these is the best approach to deal with the nullability of a returned value and its properties? It could depend on your team.</figcaption></figure><p>Certainly, it&apos;s here that Java holds some advantage over the more powerful languages. Java, of course, has a learning curve, but it is flatter than either Scala or Kotlin since it&apos;s a simpler language. There are also many resources on how to program it effectively. It might not be exciting, but Java is a known quantity.</p><h2 id="the-jvm-platform">The JVM Platform</h2><p>Of course, Java is both a language and a platform through the JVM. James Ward recently gave <a href="https://jamesward.com/2021/03/16/the-modern-java-platform-2021-edition/">an excellent overview of the modern Java platform</a>, and I encourage you to check it out; I&apos;ll highlight some of it below.</p><p>As we saw above, three of the top 20 programming languages are JVM based. It doesn&apos;t really matter what your <a href="https://clojure.org/about/lisp">preferred style of programming is</a>; the JVM has you covered.</p><p>The ecosystem is also rich with frameworks and libraries to develop just about anything. Frameworks such as <a href="https://spring.io/projects/spring-boot">Spring Boot</a>, <a href="https://micronaut.io/">Micronaut</a> and <a href="https://quarkus.io/">Quarkus</a> are de facto standards for creating web applications, especially backends and microservices. More Rails- or Django-like, UI-rich full-stack options are available in the <a href="https://www.playframework.com/">Play Framework</a>, <a href="https://vaadin.com/">Vaadin</a> (with Boot) and Spring Boot itself.</p><p>Another area where the JVM has a significant presence is in Big Data and data streaming communities. 
Many of the dominant frameworks in this area are written in Java or Scala, including <a href="https://kafka.apache.org/">Apache Kafka</a>, <a href="https://hadoop.apache.org/">Hadoop</a>, <a href="https://spark.apache.org/">Spark</a>, <a href="https://beam.apache.org/">Beam</a>, <a href="https://flink.apache.org/">Flink</a>, and <a href="https://nifi.apache.org/">NiFi</a>. Of course, the implementation language might not be the language used to interface with the framework, but Java is always supported, and in many cases, the default API.</p><p>A lot of progress has also been made in the Deep Learning space on the JVM over the last couple of years. There are two major Java-based Deep Learning frameworks: <a href="https://djl.ai/">DJL (Deep Java Library)</a> from Amazon and <a href="https://deeplearning4j.org/">DL4J (Deep Learning for Java)</a>, now with the Eclipse foundation. The libraries differ somewhat in their approach. DJL provides interfaces to <a href="https://djl.ai/docs/engine.html">&apos;Engines&apos;, which provide the lower level n-dimensional arrays and automatic differentiation functionality</a>. Supported engines include MXNet, Pytorch and Tensorflow.<br>DL4J&apos;s approach is more fundamental <a href="https://deeplearning4j.konduit.ai/">with its own implementation of n-dimensional arrays</a> (<a href="https://github.com/deeplearning4j/nd4j">ND4J</a>) and related functionality.</p><p><a href="https://www.reactivemanifesto.org/">Reactive programming</a> is one of the hallmarks of modern systems. 
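</p><p>Notably, the Reactive Streams interfaces themselves have shipped with the JDK since Java 9, as <code>java.util.concurrent.Flow</code>. As a rough sketch of the publish/subscribe model using only the standard library (the class and variable names below are illustrative; real systems would typically use a framework&apos;s reactive types instead):</p><figure class="kg-card kg-code-card"><pre><code class="language-java">import java.util.List;
import java.util.concurrent.SubmissionPublisher;

public class FlowExample {
  public static void main(String[] args) {
    var publisher = new SubmissionPublisher&lt;String&gt;();
    // consume() subscribes a Consumer and returns a CompletableFuture that
    // completes once the publisher is closed and all items are delivered.
    var done = publisher.consume(event -&gt; System.out.println(&quot;Received: &quot; + event));
    List.of(&quot;tick&quot;, &quot;tock&quot;).forEach(publisher::submit);
    publisher.close();
    done.join(); // wait for all submitted items to be consumed
  }
}</code></pre><figcaption>A minimal publish/subscribe sketch using the JDK&apos;s built-in Flow API (<code>java.util.concurrent.Flow</code>, since Java 9). Names are illustrative.</figcaption></figure><p>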
Most major JVM frameworks <a href="https://spring.io/reactive">support</a> <a href="https://quarkus.io/guides/getting-started-reactive">reactive</a> coding, with full end-to-end reactive stacks now being possible, including the database layer via <a href="https://r2dbc.io/">R2DBC</a>.<br>It&apos;s also worth mentioning <a href="https://akka.io/">Akka</a>, an implementation of the Actor Model on the JVM, which is extremely mature and supports writing highly reactive, distributed systems.</p><p>Within the Cloud and Cloud-Native contexts, the JVM also excels. JVM based applications are well-supported by all <a href="https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/create_deploy_Java.html">major</a> <a href="https://cloud.google.com/appengine/docs/standard/java11/testing-and-deploying-your-app">cloud</a> <a href="https://azure.microsoft.com/en-us/develop/java/">providers</a>. The JVM is also easy to host in a <a href="https://cloud.google.com/blog/topics/developers-practitioners/comparing-containerization-methods-buildpacks-jib-and-dockerfile">container</a>, and containers are well <a href="https://docs.spring.io/spring-boot/docs/current/reference/html/features.html#features.container-images">supported by major frameworks</a> such as Spring. Applications run well regardless of whether they are developed in Java, Kotlin or Scala.</p><p>However, when running a Cloud-Native or Serverless Java application, there is some concern regarding JVM overhead. Quarkus addresses this<a href="https://quarkus.io/vision/container-first"> directly</a>; however, another major piece of the puzzle here is the <a href="https://www.graalvm.org/">GraalVM</a>.</p><h2 id="graalvm">GraalVM</h2><p>For those unfamiliar with the GraalVM project, it&apos;s a significant and impressive piece of engineering with several cutting edge features (such as <a href="https://www.graalvm.org/reference-manual/polyglot-programming/">polyglot programming</a>). 
Pertinently, it also includes <a href="https://www.graalvm.org/reference-manual/native-image/">GraalVM Native Image</a>, which enables the ahead-of-time compilation of Java applications to native binaries.</p><p>Having native binaries is, of course, game-changing in the Serverless and Cloud-Native contexts, with start-up time seeing reported improvements of <a href="https://medium.com/graalvm/lightweight-cloud-native-java-applications-35d56bc45673">50x</a>, and memory footprints decreasing by as much as <a href="https://medium.com/graalvm/lightweight-cloud-native-java-applications-35d56bc45673">5x</a>. I&apos;ve used Quarkus before, and though I didn&apos;t run benchmarks, I can report startup time was on the order of microseconds.</p><p>This does come with some <a href="https://www.graalvm.org/reference-manual/native-image/Limitations/">restrictions and caveats</a>; for example, it&apos;s not as straightforward to use Reflection in your Java code.</p><p>It may also <a href="https://www.youtube.com/watch?v=k2X1Rk1jk-E">not yet be that straightforward</a> to get your <a href="https://quarkus.io/guides/building-native-image">application working on the GraalVM</a> alongside your favourite framework. But progress is being <a href="https://spring.io/blog/2021/03/11/announcing-spring-native-beta">made rapidly</a>.</p><h2 id="the-future-of-java">The Future of Java</h2><p>There are several high profile Java projects in the pipeline aiming to further modernize the language. 
<a href="https://wiki.openjdk.java.net/display/loom">Project Loom</a> aims to introduce new lightweight concurrency constructs &#xE0; la coroutines.</p><p><a href="https://wiki.openjdk.java.net/display/valhalla">Valhalla</a> is introducing new inline types by updating the memory model, allowing better utilization of modern hardware architectures.</p><p>Finally, Project <a href="https://mail.openjdk.java.net/pipermail/discuss/2020-April/005429.html">Leyden</a> aims to address startup time and time to peak performance for Java.</p><p>Beyond Java, the issues mentioned above regarding Kotlin will be addressed over time, and the language itself <a href="https://kotlinlang.org/docs/roadmap.html">continues to evolve</a>. I also mentioned <a href="https://www.scala-lang.org/blog/2021/05/14/scala3-is-here.html">Scala 3</a>, which has a lot of potential.</p><h2 id="conclusion">Conclusion</h2><p>Whether Java is the right choice for you, I believe, depends on context.</p><p><a href="https://www.avanwyk.com/revisiting-java-in-2021-i/">Considered, predictable changes</a> are a significant advantage, if not an outright necessity, in specific contexts. Huge teams (thousands of developers) running projects for decades, or projects that have upgrade cycles measured in years, or codebases that work in an environment that necessitates safety and predictability (e.g. finance or medicine) thrive on this kind of roadmap.</p><p>Further, as detailed above, Java, as a platform, is certainly not lacking in capability. Regardless of your chosen JVM language, you&apos;re unlikely to find yourself in a corner where there isn&apos;t a framework, library, or tool to support and accelerate your development.</p><p>Am I advocating for Java? Maybe a little. Would I start a new project in Java? Probably not; I&apos;d prefer Kotlin. However, does it make sense if other teams choose Java instead in 2021? 
Absolutely.</p><p></p><p>Follow me on <a href="https://twitter.com/avanwykai">Twitter</a>.<br>Or email me at: <code>interesting</code> at <code>avanwyk.com</code></p>]]></content:encoded></item><item><title><![CDATA[Revisiting Java in 2021 - I]]></title><description><![CDATA[What does Java look like in 2021? An overview of Java 17, the latest LTS Java release, including Records, Sealed Classes, and Pattern Matching.]]></description><link>https://www.avanwyk.com/revisiting-java-in-2021-i/</link><guid isPermaLink="false">611f7a4f0c9ab8157bcf24a4</guid><category><![CDATA[software engineering]]></category><category><![CDATA[programming]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Wed, 01 Sep 2021 21:09:17 GMT</pubDate><media:content url="https://www.avanwyk.com/content/images/2021/09/michael-dziedzic-qDG7XKJLKbs-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.avanwyk.com/content/images/2021/09/michael-dziedzic-qDG7XKJLKbs-unsplash.jpg" alt="Revisiting Java in 2021 - I"><p><em>Cover image by <a href="https://unsplash.com/@lazycreekimages">Michael Dziedzic</a> on <a href="https://unsplash.com">Unsplash</a>.</em></p><h2 id="introduction">Introduction</h2><p>September marks the release of Java 17, the latest LTS Java release. Java 17 is also the culmination of many language and platform improvements that have steadfastly been introduced with every Java release since Java 11.</p><p>I&apos;ve been working on the JVM platform for well over a decade at this stage, and, as I am sure is the case for many others, this does not necessarily mean I have been programming Java.</p><p>I fully embraced Scala, an excellent language with a flawed developer experience, when it came along and drastically improved my functional and reactive programming skills in the process.</p><p>Recently, however, Kotlin has been my go-to language on the JVM. 
Although <a href="https://techcrunch.com/2019/05/07/kotlin-is-now-googles-preferred-language-for-android-app-development/">synonymous with Android development</a>, Kotlin has also been embraced by the JVM backend community and works well with several popular frameworks such as <a href="https://spring.io/guides/tutorials/spring-boot-kotlin/">Spring Boot</a>, <a href="https://quarkus.io/guides/kotlin">Quarkus</a>, and <a href="https://github.com/micronaut-projects/micronaut-kotlin">Micronaut</a>.</p><p>With many excellent languages to choose from on the JVM (shoutout to Clojure), where does Java itself fit in? If you haven&apos;t been keeping an eye on Java, what have you missed? Is Java still a viable, modern option when programming on the JVM?</p><h2 id="the-java-17-language">The Java 17 language</h2><p>To many (if not most) developers, <a href="https://snyk.io/jvm-ecosystem-report-2021/">Java still refers to Java 8</a>. So let&apos;s quickly recap the major features added in Java 9, 10, and 11 (the previous LTS release) for those that may not be familiar.</p><h3 id="catching-up-java-8-to-11">Catching-up: Java 8 to 11</h3><ul><li>Java 9 brought modularity to the Java platform. The Modules system (aka Project Jigsaw) was a significant change and affected much of the underlying systems and middleware running Java applications. I encourage you to <a href="https://openjdk.java.net/projects/jigsaw">dive deeper</a> for <a href="https://openjdk.java.net/projects/jigsaw/quick-start">yourself</a>.</li><li>Java 10 brought local type inference, allowing the use of <code>var</code> in declaring local variables. 
Examples are shown in the code below.</li><li>Java 11 didn&apos;t introduce any major language features but came along with the notorious <a href="https://www.oracle.com/za/java/technologies/javase/jdk-faqs.html">Oracle Java license change</a>.</li></ul><p>The above list is by no means comprehensive, and a plethora of JDK improvements, including new GCs and smaller language changes, were also introduced. <a href="https://www.baeldung.com/new-java-9">Complete</a> <a href="https://www.baeldung.com/java-10-overview">lists</a> are widely <a href="https://www.baeldung.com/java-11-new-features">available</a>.</p><h3 id="java-17">Java 17</h3><p>Back to Java 17 then: as mentioned above, it brings several related major features together. These are:</p><ul><li>Switch expressions (in preview since 12, standardized in 14)</li><li>Switch with yield (since 13)</li><li>Text blocks (since 12, standardized in 15)</li><li>Records (since 14, standardized in 16)</li><li>Sealed classes (since 15, standardized in 17)</li><li>Pattern matching using <code>instanceof</code> (since 14, standardized in 16)</li><li>Pattern matching in switch statements (still in preview in 17).</li></ul><p>Again, the list is not comprehensive; notably, the ZGC and Shenandoah garbage collectors have also since been standardized (since 15), but the features listed above have a significant impact on how we will program Java going forward, which is why I want to focus on them. Let&apos;s discuss each of these in more detail.</p><p>The code examples shown below are also <a href="https://github.com/avanwyk/java-17-examples">available on Github</a>.</p><p><strong>Switch Expression</strong><br>The <code>switch</code> can now be used as both a statement and an expression (meaning it returns a value). 
&#xA0;<code>case</code> labels are now also available in two forms: <code>case VALUE: ...</code> which works the same as always (allowing fall through) and the new arrow syntax: <code>case VALUE -&gt; ...</code> which does not fall through.<br><code>yield</code> has also been introduced as a new keyword. <code>yield</code> allows you to specify the &apos;return&apos; value of a case block but is not required if the case is a single expression. The code sample below illustrates the new <code>switch</code> expression.</p><figure class="kg-card kg-code-card"><pre><code class="language-java">public static String switchExpression(String day) {
  var dayType = switch (day) {
    case &quot;MON&quot;, &quot;TUE&quot;, &quot;WED&quot;, &quot;THUR&quot;, &quot;FRI&quot; -&gt; { // the arrow means we need no break, it will only match this case.
      System.out.println(&quot;Checking Week Day&quot;);
      yield &quot;Work day&quot;;  // if the case is a block, we can use yield to supply the value.
    }
    case &quot;SAT&quot;, &quot;SUN&quot; -&gt; &quot;Weekend day&quot;; // yield is not required for a single expression.
    default -&gt; throw new IllegalArgumentException(&quot;Unknown day&quot;);
  };
  return dayType;
}</code></pre><figcaption>A Java Switch Expression, yielding a value. Also shown is local type inference (the <code>var</code> keyword) introduced in Java 10.</figcaption></figure><p><strong>Text Blocks</strong><br>Text Blocks are a straightforward but exceedingly helpful feature. Text Blocks allow you to specify <code>Strings</code> over multiple lines, avoiding the need to awkwardly concatenate single-line strings and improving readability.</p><figure class="kg-card kg-code-card"><pre><code class="language-java">var json = &quot;&quot;&quot;
    {
      &quot;name&quot;: &quot;ext&quot;,
      &quot;systemKey&quot;: &quot;1234568&quot;,
      &quot;owner&quot;: {
        &quot;name&quot;: &quot;admin&quot;,
        &quot;adminCredentials&quot;: &quot;abcdef&quot;
      }
    }
    &quot;&quot;&quot;;</code></pre><figcaption>A Java Text Block. The JSON is now easy to read and update. Note, there is no need to escape quotation marks within the string.</figcaption></figure><p><strong>Records</strong><br>Records are immutable data classes. Using the new <code>record</code> keyword, only the data type and name of fields are specified, and Java generates a constructor, accessors, equals/hashcode, and toString methods. This should be familiar to you if you know <a href="https://kotlinlang.org/docs/data-classes.html">Kotlin</a> or <a href="https://docs.python.org/3/library/dataclasses.html">Python</a> data classes, Scala <a href="https://docs.scala-lang.org/tour/case-classes.html">case classes</a>, or use <a href="https://projectlombok.org/features/Data">Lombok&apos;s @Data annotation</a>.</p><p>Records remove the need to code boilerplate, verbose immutable data classes by hand, are especially well suited as DTOs and, work well to <a href="https://github.com/FasterXML/jackson-databind/issues/2709">represent JSON objects</a>.</p><figure class="kg-card kg-code-card"><pre><code class="language-java">record Admin(String name, String adminCredentials) implements Principal { }

record ExternalSystem(String name, String systemKey, Admin owner) implements Principal { }

static ExternalSystem readJSON() throws JsonProcessingException {
  var objectMapper = new ObjectMapper();
  var json = &quot;&quot;&quot;
      {
        &quot;name&quot;: &quot;ext&quot;,
        &quot;systemKey&quot;: &quot;1234568&quot;,
        &quot;owner&quot;: {
          &quot;name&quot;: &quot;admin&quot;,
          &quot;adminCredentials&quot;: &quot;abcdef&quot;
        }
      }
      &quot;&quot;&quot;;
  return objectMapper.readValue(json, ExternalSystem.class); // Record support needs Jackson 2.12+
}

// ...

final var externalSystem = readJSON();
System.out.println(&quot;System: &quot; + externalSystem.name()); // System: ext
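// Record equality is structural: the generated equals() and hashCode() compare component values.
System.out.println(externalSystem.owner().equals(new Admin(&quot;admin&quot;, &quot;abcdef&quot;))); // true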
System.out.println(&quot;Owner: &quot; + externalSystem.owner()); // Owner: Admin[name=admin, adminCredentials=abcdef]</code></pre><figcaption>Java Records implementing an interface <code>Principal</code> . Constructors, accessors, equals, hashcode and toString are generated by the compiler. Records make it trivial to write JSON DTOs.</figcaption></figure><p><strong>Sealed Classes and Interfaces</strong><br>Sealed Classes introduce the concept of fixed-sized class hierarchies to Java. Previously, similar functionality could have been achieved using Enum classes and package-private class hierarchies, both with their own caveats. Sealed Interfaces and Classes specify permitted sub-classes using the new <code>permits</code> keyword. <a href="https://openjdk.java.net/jeps/409">Three constraints are imposed on Sealed Classes</a>:</p><!--kg-card-begin: markdown--><ol>
<li>The sealed class and permitted sub-classes must be in the same module or, if declared in an unnamed module, in the same package.</li>
<li>Permitted sub-classes must directly extend the sealed class.</li>
<li>Every permitted sub-class must include a modifier that specifies how the seal is propagated:
<ul>
<li><code>final</code> to terminate the hierarchy.</li>
<li><code>sealed</code> thereby permitting further sub-classes.</li>
<li><code>non-sealed</code> thereby again opening up the sub-hierarchy for extension by unknown classes; the superclass cannot prevent this.</li>
</ul>
</li>
</ol>
<!--kg-card-end: markdown--><p>Records also work well as leaf nodes (values) in sealed hierarchies as they are implicitly final, terminating the hierarchy. An example is given below.</p><figure class="kg-card kg-code-card"><pre><code class="language-java">public sealed interface Principal permits User, Admin, ExternalSystem { // Sealed interface
  String name();
}

record Admin(String name, String adminCredentials) implements Principal { } // Records are implicitly final (required to implement sealed interface)

record ExternalSystem(String name, String systemKey, Admin owner) implements Principal { } // Implicitly implements name() from interface

abstract sealed class User implements Principal permits AnonymousUser, RegisteredUser {

  private final String username;

  protected User(String username) {
    this.username = username;
  }

  @Override
  public String name() {
    return username;
  }
}

final class AnonymousUser extends User {

  public static final String ANONYMOUS = &quot;ANONYMOUS&quot;;

  AnonymousUser() {
    super(ANONYMOUS);
  }
}</code></pre><figcaption>A sealed Java interface and class hierarchy. Sealed interfaces and classes limit the possible sub-classes: the permitted sub-classes are explicitly listed (`RegisteredUser` is omitted for brevity).</figcaption></figure><p><strong>Pattern Matching for <code>instanceof</code></strong><br>Java 16 introduces pattern matching via <code>instanceof</code>. In its current state, pattern matching is convenient but basic. However, it lays the foundation for more sophisticated pattern matching in future releases. With <code>instanceof</code>, Pattern Matching tests a type pattern against an object and automatically casts to the type while introducing a local variable. The code below illustrates this and highlights the convenience. I&apos;ll discuss Pattern Matching further in the context of <code>switch</code>.</p><figure class="kg-card kg-code-card"><pre><code class="language-java">if (p instanceof Admin a) { // matches type pattern (Admin a), and casts with a local variable
  return checkCredentials(a);
}

// ...
@Override
public boolean equals(Object o) {
  return (o instanceof RegisteredUser r) &amp;&amp;
      name().equals(r.name()) &amp;&amp;
      password().equals(r.password());
}</code></pre><figcaption>Pattern Matching for <code>instanceof</code>. The second example illustrates the convenience of the cast and local variable that is automatically created.</figcaption></figure><p><strong>Pattern Matching for <code>switch</code> expressions</strong><br>Finally, we have pattern matching for <code>switch</code> expressions. Pattern matching for <code>switch</code> allows checking expressions against multiple patterns similar to those seen with <code>instanceof</code>. Two examples are given below, one of which uses the sealed class hierarchy defined above to illustrate that the default case is not required since the hierarchy can be listed exhaustively. &#xA0;The feature is still in preview in Java 17 and is also limited in scope; future work includes the possibility of introducing destructuring in patterns and case guards akin to what&apos;s possible with <a href="https://docs.scala-lang.org/tour/pattern-matching.html">Scala&apos;s <code>match</code> based pattern matching</a>.</p><figure class="kg-card kg-code-card"><pre><code class="language-java">public static String switchPatterns(Object o) {  // matching type patterns
  return switch (o) {
    case Integer i -&gt; &quot;Integer type: &quot; + i;
    case Boolean b -&gt; &quot;Boolean type: &quot; + b;
    case String s -&gt; &quot;String type: &quot; + s;
    default -&gt; &quot;Unknown type&quot;;
  };
}

static boolean authenticate(Principal p) {
  return switch (p) { // pattern matching switch
    case AnonymousUser u -&gt; true;
    case RegisteredUser r -&gt; checkCredentials(r); // no casting required
    case Admin a -&gt; checkCredentials(a);
    case ExternalSystem s -&gt; checkCredentials(s);
  }; // compiler knows cases are exhaustive due to sealed interface/class
}</code></pre><figcaption>Pattern Matching in Java <code>switch</code> expressions.</figcaption></figure><h2 id="how-far-have-we-come-with-java-17">How far have we come with Java 17?</h2><p>Clearly, Java has made significant progress on the road to Java 17. The features shown above are certainly useful, well-engineered additions to the language. Records alone will save thousands of lines of code, and the enhancements to the <code>switch</code> statement are immediately useful and clearly lay the foundation for further, powerful data-driven programming via pattern matching.</p><p>However, the natural question is whether it has been enough to allow Java to &quot;catch up&quot; to its more feature-rich JVM competitors, aka Kotlin and Scala.<br>That answer would clearly be no in terms of language features, but are we asking the right question?</p><p>Instead, I would like to note two things that are clear to me with the changes between 11 and 17.</p><p>First, Java is absolutely leveraging its last-mover&apos;s advantage: the features shown above are well understood in other languages and have a proven track record of being useful. Developers already know how to use them effectively, with minimal foot-gunning risks. This keeps the learning curve low and gives Java projects an easy upgrade path. Java projects have powerful new tools without new threats to productivity.</p><p>Second, if Java seemed stagnant, it should be clear that the language is indeed moving forward, but in a predictable, well-considered way.
The 6-monthly release cycle is working and is certainly a vast improvement on the multi-year releases of years past.</p><p>In Part II (coming soon), I take a harder look at where Java fits in relative to Scala and Kotlin, delve into the other half of the equation: the JDK platform, and attempt to summarize where I see Java in 2021.</p>]]></content:encoded></item><item><title><![CDATA[Super-convergence in Tensorflow 2 with the 1Cycle Policy]]></title><description><![CDATA[Implementing super-convergence for deep neural network training in Tensorflow 2 with the 1Cycle learning rate policy.]]></description><link>https://www.avanwyk.com/tensorflow-2-super-convergence-with-the-1cycle-policy/</link><guid isPermaLink="false">5d6e36601ead39070efa3842</guid><category><![CDATA[deep learning]]></category><category><![CDATA[tensorflow]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Mon, 02 Sep 2019 21:45:00 GMT</pubDate><media:content url="https://www.avanwyk.com/content/images/2019/09/florian-olivo-V9lgdQX_K4I-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.avanwyk.com/content/images/2019/09/florian-olivo-V9lgdQX_K4I-unsplash.jpg" alt="Super-convergence in Tensorflow 2 with the 1Cycle Policy"><p><a href="https://arxiv.org/abs/1708.07120">Super-convergence in deep learning is a term coined by researcher Leslie N. Smith</a> in describing a phenomenon where deep neural networks are trained an order of magnitude faster than when using traditional techniques. The technique has led to some <a href="https://www.fast.ai/2018/04/30/dawnbench-fastai/">phenomenal results</a> in the <a href="https://dawn.cs.stanford.edu/benchmark/">Dawnbench project</a>, leading to the cheapest and fastest models at the time.</p><p>The basic idea of super-convergence is to make use of a much higher learning rate while still ensuring the network weights converge.</p><p>This is achieved through the use of the 1Cycle learning rate policy.
The 1Cycle policy is a specific schedule for adapting the learning rate and, if the optimizer supports it, the momentum parameters during training.</p><p>The policy can be described as follows:</p><ol><li><a href="https://www.avanwyk.com/finding-a-learning-rate-in-tensorflow-2/">Choose a high maximum learning rate</a> and a maximum and minimum momentum.</li><li>In phase 1, starting from a much lower learning rate (<code>lr_max / div_factor</code>, where <code>div_factor</code> is e.g. <code>25.</code>), gradually increase the learning rate to the maximum while gradually decreasing the momentum to the minimum.</li><li>In phase 2, reverse the process: decrease the learning rate back to the learning rate minimum while increasing the momentum to the maximum momentum.</li><li>In the final phase, decrease the learning rate further (e.g. <code>lr_max / (div_factor * 100)</code>), while keeping momentum at the maximum.</li></ol><p>Work from the FastAI team has shown that the policy can be improved by using just two phases:</p><ol><li>The same phase 1, however cosine annealing is used to increase the learning rate and decrease the momentum.</li><li>Similarly, the learning rate is decreased again using cosine annealing, to a value of approx. 0, while the momentum increases to the maximum momentum.</li></ol><p>Over the course of training, this leads to the following learning rate and momentum schedules:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="/content/images/2019/09/one_policy_lr_mom.png" class="kg-image" alt="Super-convergence in Tensorflow 2 with the 1Cycle Policy" loading="lazy"><figcaption>1Cycle learning rate and momentum schedules.</figcaption></figure><p>For a more in-depth analysis of the <a href="https://sgugger.github.io/the-1cycle-policy.html">1Cycle policy, see Sylvain Gugger&apos;s post on the topic</a>.</p><h3 id="tensorflow-2-implementation">Tensorflow 2 implementation</h3><p>The policy is straightforward to implement in Tensorflow 2.
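</p><p>Before looking at the implementation, the two cosine-annealed phases can be sketched as follows. This is a minimal illustration only: the phase split <code>phase_1_pct</code> and the momentum bounds are illustrative defaults, not necessarily the exact values the FastAI implementation uses.</p>

```python
import math

def one_cycle(step, total_steps, lr_max, div_factor=25.0,
              mom_min=0.85, mom_max=0.95, phase_1_pct=0.3):
    """Sketch of a two-phase 1Cycle schedule using cosine annealing."""
    def cos_anneal(start, end, pct):
        # interpolate from `start` (pct=0) to `end` (pct=1) along a half cosine
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

    phase_1_steps = total_steps * phase_1_pct
    if step < phase_1_steps:  # phase 1: learning rate up, momentum down
        pct = step / phase_1_steps
        return (cos_anneal(lr_max / div_factor, lr_max, pct),
                cos_anneal(mom_max, mom_min, pct))
    # phase 2: learning rate down towards 0, momentum back up
    pct = (step - phase_1_steps) / (total_steps - phase_1_steps)
    return (cos_anneal(lr_max, lr_max / (div_factor * 100), pct),
            cos_anneal(mom_min, mom_max, pct))
```

<p>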
The implementation given below is <a href="https://docs.fast.ai/callbacks.one_cycle.html">based on the FastAI library implementation</a>.</p><!--kg-card-begin: html--><script src="https://gist.github.com/avanwyk/57724eb3cfff60a1451e4b422c73bfb7.js"></script><!--kg-card-end: html--><h3 id="application">Application</h3><p>Applying the 1Cycle callback is straightforward: simply add it as a callback when calling <code>model.fit(...)</code>:</p><pre><code class="language-python">epochs = 3
lr = 5e-3
steps = np.ceil(len(x_train) / batch_size) * epochs
lr_schedule = OneCycleScheduler(lr, steps)

model = build_model()
optimizer = tf.keras.optimizers.RMSprop(lr=lr)
model.compile(optimizer=optimizer, loss=&apos;sparse_categorical_crossentropy&apos;, metrics=[&apos;accuracy&apos;])

model.fit(train_ds, epochs=epochs, callbacks=[lr_schedule])</code></pre><h3 id="results">Results</h3><p>For a complete example of how the 1Cycle policy is applied to two CNN-based learning tasks, <a href="https://www.avanwyk.com/finding-a-learning-rate-in-tensorflow-2/">including how to find an appropriate maximum learning rate</a>, a <a href="https://www.kaggle.com/avanwyk/tf2-super-convergence-with-the-1cycle-policy">Kaggle notebook has been made available</a>.</p><h3 id="references">References</h3><ol><li><a href="https://arxiv.org/abs/1708.07120">Super-Convergence: Very Fast Training of Neural Networks Using <br>Large Learning Rates, Leslie N. Smith, Nicholay Topin</a></li><li><a href="https://sgugger.github.io/the-1cycle-policy.html">The 1cycle policy, Sylvain Gugger</a></li><li><a href="https://docs.fast.ai/callbacks.one_cycle.html">FastAI callbacks.one_cycle</a></li><li><a href="https://www.kaggle.com/avanwyk/tf2-super-convergence-with-the-1cycle-policy">https://www.kaggle.com/avanwyk/tf2-super-convergence-with-the-1cycle-policy</a></li></ol>]]></content:encoded></item><item><title><![CDATA[Finding a Learning Rate with Tensorflow 2]]></title><description><![CDATA[Implementing the technique in Tensorflow 2 is straightforward. Start from a low learning rate, increase the learning rate and record the loss. Stop when a very high learning rate is reached.
Plot the losses and learning rates, choosing a learning rate where the loss is decreasing at a rapid rate.]]></description><link>https://www.avanwyk.com/finding-a-learning-rate-in-tensorflow-2/</link><guid isPermaLink="false">5d3b0ebfc2081e0712959bbf</guid><category><![CDATA[deep learning]]></category><category><![CDATA[tensorflow]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Sun, 28 Jul 2019 11:05:37 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1543286386-2e659306cd6c?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1543286386-2e659306cd6c?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Finding a Learning Rate with Tensorflow 2"><p>Choosing a good learning rate is <a href="https://arxiv.org/pdf/1506.01186.pdf">the most important hyper-parameter</a> choice when training a deep neural network (assuming a gradient based optimization algorithm is used).</p><p>Choosing a learning rate that&apos;s too small leads to extremely long training times, whereas a learning rate that&apos;s too large might miss the optimum and lead to training divergence.</p><p>Fortunately, there is a simple way to estimate a good learning rate. The technique was first described by Leslie Smith in <a href="https://arxiv.org/abs/1506.01186">Cyclical Learning Rates for Training Neural Networks</a> and then popularized by the <a href="https://docs.fast.ai/callbacks.lr_finder.html">FastAI</a> library, which has a <a href="https://docs.fast.ai/callbacks.lr_finder.html">first-class implementation of a learning rate finder</a>.</p><p>The technique can be described as follows:</p><!--kg-card-begin: markdown--><ol>
<li>Start with a very low learning rate, e.g. 1e-7.</li>
<li>After each batch, increase the learning rate and record the loss and learning rate.</li>
<li>Stop when a very high learning rate (10+) is reached, or the loss value explodes.</li>
<li>Plot the recorded losses and learning rates against each other and choose a learning rate where the loss is strictly decreasing at a rapid rate.</li>
</ol>
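<p>As a rough sketch of steps 1 to 3, an exponential sweep multiplies the learning rate by a constant factor after each batch; the values below are illustrative, not prescribed by the technique:</p>

```python
# Sweep from a very low to a very high learning rate in a fixed number of batches.
start_lr, end_lr, num_steps = 1e-7, 10.0, 300  # illustrative values
mult = (end_lr / start_lr) ** (1.0 / num_steps)  # constant per-batch multiplier
lrs = [start_lr * mult ** i for i in range(num_steps + 1)]
# lrs[0] is 1e-7 and lrs[-1] is, up to rounding, 10.0
```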
<!--kg-card-end: markdown--><p>For a more thorough explanation of the technique see <a href="https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html">Sylvain Gugger&apos;s post</a>.</p><h3 id="implementation">Implementation</h3><p>Implementing the technique in <a href="https://www.tensorflow.org/beta/">Tensorflow 2</a> is straightforward when implemented as a <a href="https://www.tensorflow.org/beta/guide/keras/custom_callback">Keras Callback</a>. A Tensorflow 2 compatible implementation is given below and is also available on <a href="https://github.com/avanwyk/tensorflow-projects/blob/master/lr-finder/lr_finder.py">Github</a>.</p><!--kg-card-begin: html--><script src="https://gist.github.com/avanwyk/f0f031c537d098e8fe721e952e6823d3.js"></script><!--kg-card-end: html--><p>The implementation uses an exponentially increasing learning rate, which means smaller learning rate regions will be explored more thoroughly than larger learning rate regions.</p><p>The losses are also smoothed using a smoothing factor to prevent sudden or erratic changes in the loss (due to the stochastic nature of the training) from stopping the search process prematurely.</p><h3 id="application">Application</h3><p>To use the LRFinder, instantiate and compile a model, adding the LRFinder as a callback. The model can then be fit as usual. The callback will record the losses and learning rates and stop training when the loss value diverges or the maximum learning rate is reached.</p><pre><code class="language-python">import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout

def build_model():
    return tf.keras.models.Sequential([
        Conv2D(32, 3, activation=&apos;relu&apos;),
        MaxPool2D(),
        Flatten(),
        Dense(128, activation=&apos;relu&apos;),
        Dropout(0.1),
        Dense(10, activation=&apos;softmax&apos;)
    ])

lr_finder = LRFinder()
model = build_model()
model.compile(optimizer=&apos;adam&apos;, loss=&apos;sparse_categorical_crossentropy&apos;)
_ = model.fit(train_ds, epochs=5, callbacks=[lr_finder], verbose=False)

lr_finder.plot()

</code></pre><p>The plot method will produce a graph of the results, allowing one to visually choose a learning rate:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.avanwyk.com/content/images/2019/07/lr_finder.png" class="kg-image" alt="Finding a Learning Rate with Tensorflow 2" loading="lazy"><figcaption>The results of the LRFinder. The losses are plotted against the log scaled learning rates. A good learning rate would be in the range where the loss is strictly decreasing at a rapid rate: [1e-3, 1e-2].</figcaption></figure><p>A value should be chosen in a region where the loss is rapidly, but strictly, decreasing. Examples of such graphs and how they are interpreted are also available in <a href="https://www.avanwyk.com/african-antelope-fastai-image-classifier/">previous</a> <a href="https://www.avanwyk.com/cdc-mortality-fastai-tabular/">posts</a>.</p><p>It is important to rebuild and recompile the model after the LRFinder is used in order to reset the weights that were updated during the mock training run.</p><p>A complete example of how the LRFinder is applied is available in <a href="https://github.com/avanwyk/tensorflow-projects/blob/master/lr-finder/tensorflow2_lr_finder.ipynb">this Jupyter notebook</a>.</p><h3 id="references">References</h3><ol><li><a href="https://arxiv.org/abs/1506.01186">Cyclical Learning Rates for Training Neural Networks, Leslie N.
Smith</a></li><li><a href="https://docs.fast.ai/callbacks.lr_finder.html">https://docs.fast.ai/callbacks.lr_finder.html</a></li><li><a href="https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html">How Do You Find a Good Learning Rate, Sylvain Gugger</a></li></ol>]]></content:encoded></item><item><title><![CDATA[African Antelope: A Case Study of Creating an Image Dataset with FastAI]]></title><description><![CDATA[An end-to-end example of how to create your own image dataset from scratch and train a ResNet50 convolutional neural network for image classification using the FastAI library.]]></description><link>https://www.avanwyk.com/african-antelope-fastai-image-classifier/</link><guid isPermaLink="false">5cb1adfb2f87c3098b88341a</guid><category><![CDATA[fastai]]></category><category><![CDATA[deep learning]]></category><category><![CDATA[data science]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Sun, 14 Apr 2019 11:19:17 GMT</pubDate><media:content url="https://www.avanwyk.com/content/images/2019/04/unsplash-antelope.jpeg" medium="image"/><content:encoded><![CDATA[<img src="https://www.avanwyk.com/content/images/2019/04/unsplash-antelope.jpeg" alt="African Antelope: A Case Study of Creating an Image Dataset with FastAI"><p>(Note: this post was updated on 2019-05-19 for clarity.)</p><p>In this post we will look at an end-to-end case study of how to create and clean your own small image dataset from scratch and then train a ResNet convolutional neural network to classify the images using the FastAI library.</p><p>Besides gathering the data, we will also illustrate how to perform model-assisted data cleaning to partially automate the cleaning of the data itself.</p><h3 id="contents">Contents</h3><ol><li><a href="#creating-an-image-dataset">Creating an Image Dataset</a><br>i. &#xA0; <a href="#downloading-the-data">Downloading the Data</a><br>ii.
&#xA0;<a href="#cleaning-the-data">Cleaning the Data</a></li><li><a href="#training-the-model">Training the Model</a><br>i. &#xA0; <a href="#building-the-dataset">Building the Dataset</a><br>ii. &#xA0;<a href="#creating-the-model">Creating the Model</a><br>iii. <a href="#fitting-the-model">Fitting the Model</a><br>iv. <a href="#initial-results">Initial Results</a></li><li><a href="#model-assisted-data-cleaning">Model Assisted Data Cleaning</a></li><li><a href="#full-model-training">Full Model Training</a><br>i. &#xA0; <a href="#results">Results</a></li><li><a href="#conclusion">Conclusion</a></li></ol><p>The classification problem we will be solving is <em>the classification of major species of African antelope in the wild.</em> The dataset we will create will consist of 13 African antelope species. As we will see this is an interesting challenge, as orientation, colour and very specific features of the antelope (e.g. the horns) are often necessary to distinguish each species. We will also see that it&apos;s not always necessary to have a very large dataset in order to use deep learning.</p><p>A <a href="https://github.com/avanwyk/fastai-projects/blob/master/antelope-image-classification/african_antelope_classification.ipynb">Jupyter notebook</a> and <a href="https://github.com/avanwyk/fastai-projects/blob/master/antelope-image-classification/antelope_classification.py">Python script</a> with the complete code for the example is available on <a href="https://github.com/avanwyk/fastai-projects/tree/master/antelope-image-classification">Github</a>. In order to run the notebook or script, <a href="https://www.avanwyk.com/fastai-installation/">ensure you have a FastAI environment setup.</a></p><h2 id="creating-an-image-dataset">Creating an Image Dataset</h2><p>We will start by downloading the images for our dataset. 
When creating your own dataset, carefully think of the use case you are building it for, and of the type of images that are representative of the actual problem you are trying to solve.</p><p>In the case of antelope, there are a few things to consider:</p><ul><li><em>Male </em>and <em>female </em>variants of the species have significant differences.</li><li>We are interested in pictures of the animals in the <em>wild</em> as opposed to captivity.</li><li>The <em>young</em> of each species could be very different from the adult.</li><li>The <em>colour</em> of a species could be a distinguishing factor. For example, photos taken at dawn or dusk might not be appropriate.</li></ul><p>In general, try to think of any <em>biases</em> or specific <em>contexts</em> present in your subject matter that might not be applicable to the problem being solved.</p><h3 id="downloading-the-data">Downloading the Data</h3><p>In order to download the actual images, we will use <a href="https://github.com/hardikvasa/google-images-download">google-images-download</a>, an open source tool that can download images from Google Images based on keyword search.</p><p>The code to download the images is as follows:</p><pre><code class="language-python">def download_antelope_images(output_path: Path, limit: int = 50) -&gt; None:
    &quot;&quot;&quot;Download images for each of the antelope to the output path.
    
    Each species is put in a separate sub-directory under output_path.
    &quot;&quot;&quot;
    response = google_images_download.googleimagesdownload()

    for antelope in ANTELOPE:
        for gender in [&apos;male&apos;, &apos;female&apos;]:
            output_directory = str(output_path/antelope).replace(&apos; &apos;, &apos;_&apos;)

            arguments = {
                &apos;keywords&apos;: f&apos;wild {antelope} {gender} -hunting -stock&apos;,
                &apos;output_directory&apos;: output_directory,
                &apos;usage_rights&apos;: &apos;labeled-for-nocommercial-reuse&apos;,
                &apos;no_directory&apos;: True,
                &apos;size&apos;: &apos;medium&apos;,
                &apos;limit&apos;: limit
            }
            response.download(arguments)</code></pre><p>The code above searches for images of each antelope species in the <code>ANTELOPE</code> list. For every species, we perform two searches: one for male examples and one for female examples. We add the keyword <code>wild</code> to look for examples of the antelope in the wild, while excluding the keywords <code>hunting</code> and <code>stock</code> to limit the search to images applicable to our use case. Also be sure to search for images with the appropriate <a href="https://support.google.com/websearch/answer/29508?hl=en">usage rights</a>.</p><p>The images are downloaded, putting each species in a separate folder named for the species, thereby building an &apos;<em>Imagenet-style&apos;</em> dataset. This is compatible with <a href="https://docs.fast.ai/vision.data.html#ImageDataBunch.from_folder">FastAI&apos;s <code>ImageDataBunch.from_folder</code> helper</a>, which we will use to load the dataset for training.</p><p>The download was limited to 50 examples each for the male and female of each species.</p><h3 id="cleaning-the-data">Cleaning the Data</h3><p>Even though Google does a very good job of finding the correct images for keyword searches, we still have to make sure the images are appropriate for our use case.</p><p>Unfortunately, this is a time-consuming process that is hard to automate (more on that later). Some checks can be automated, for instance, removing duplicates based on MD5 sums, or using the file names to check for labelling errors (as I do in the <a href="https://github.com/avanwyk/fastai-projects/blob/master/antelope-image-classification/antelope_classification.py">Python script</a>). However, I still had to manually inspect the images, removing examples I considered inappropriate.
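</p><p>The duplicate check mentioned above can be sketched as follows. This is a minimal sketch rather than the script&apos;s actual code; it assumes <code>.jpg</code> files and deletes duplicates in place:</p>

```python
import hashlib
from pathlib import Path

def remove_duplicate_images(data_path: Path) -> int:
    """Delete files whose MD5 digest was already seen; return the number removed."""
    seen, removed = set(), 0
    for image in sorted(data_path.rglob('*.jpg')):
        digest = hashlib.md5(image.read_bytes()).hexdigest()
        if digest in seen:  # exact byte-for-byte duplicate of an earlier file
            image.unlink()
            removed += 1
        else:
            seen.add(digest)
    return removed
```

<p>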
These images mostly involved photos of multiple species in a single example, images of predators hunting or feasting on the antelope, man-made illustrations of the antelope, or antelope in captivity.</p><p>After the data cleaning, I was left with between 60 and 100 images (with an average of 85) per species. This is not a large dataset; we will, however, see that the deep learning model still performs very well.</p><h2 id="training-the-model">Training the Model</h2><p>With the data prepared we can now build the training and validation datasets and train our model. We will be using transfer learning to train a <a href="https://arxiv.org/abs/1512.03385">ResNet</a> model that is pre-trained on the ImageNet dataset.</p><h3 id="building-the-dataset">Building the Dataset</h3><p>FastAI makes use of <a href="https://docs.fast.ai/basic_data.html#DataBunch"><code>DataBunch</code> objects</a> to group the training, validation and test datasets. The <code>DataBunch</code> object also makes sure the <a href="https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader">Pytorch <code>DataLoader</code></a> loads to the correct device (GPU/CPU) and supports applying <a href="https://docs.fast.ai/vision.transform.html">image transforms</a> for <a href="https://arxiv.org/pdf/1712.04621.pdf">data augmentation</a>. Further, the DataBunch normalizes the data using the ImageNet statistics, which is necessary, as the model is pre-trained on the ImageNet data. The <code>ImageDataBunch</code> can be created with:</p><pre><code class="language-python">image_data = ImageDataBunch.from_folder(DATA_PATH, valid_pct=VALID_PCT,\
                                            ds_tfms=get_transforms(),\
                                            size=IMAGE_SIZE,\
                                            bs=BATCH_SIZE)\
                                            .normalize(imagenet_stats)</code></pre><p>We specify the percentage of the data to use for the validation set with <code>VALID_PCT</code> ( <code>0.2</code> or 20% in this case), the <code>IMAGE_SIZE</code> (224 for ImageNet trained models) and a <code>BATCH_SIZE</code> (32 for this example, but you can use a smaller or larger batch size, depending on how much VRAM your GPU has).</p><h3 id="creating-the-model">Creating the Model</h3><p>Creating the ResNet model is very straightforward with FastAI. We use the <a href="https://docs.fast.ai/vision.learner.html#cnn_learner"><code>cnn_learner</code> helper method</a>, specifying our <code>ImageDataBunch</code> and chosen ResNet architecture:</p><pre><code class="language-python">learn = cnn_learner(image_data, models.resnet50, metrics=[error_rate, accuracy])</code></pre><p>Here we use a <a href="https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.resnet50">pre-trained <code>resnet50</code> model from the Pytorch Torchvision library.</a> If you have a smaller GPU, a pre-trained <code>resnet34</code> works equally well.</p><p>We also specify the <code>error_rate</code> as a metric that will be calculated during training.</p><h3 id="fitting-the-model">Fitting the Model</h3><p>We are now ready to fit the model to our data. The initial training will only fine-tune the top fully-connected layers of the model; the other layer weights being frozen.</p><p>Before starting the training, we have to choose an appropriate learning rate, which is perhaps the single most important choice for effective training. FastAI provides the supremely useful <a href="https://docs.fast.ai/basic_train.html#lr_find"><code>lr_find</code> method</a> for this purpose, which is based on the technique discussed in <a href="https://arxiv.org/abs/1506.01186">Cyclical Learning Rates for Training Neural Networks</a>.</p><pre><code class="language-python">learn.lr_find()
learn.recorder.plot()</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.avanwyk.com/content/images/2019/05/fastai-stage1-lr_find-graph.png" class="kg-image" alt="African Antelope: A Case Study of Creating an Image Dataset with FastAI" loading="lazy"><figcaption>Loss vs Learning Rate graph produced by learn.recorder.plot() after lr_find().</figcaption></figure><p>We then simply choose a learning rate (or range) where the loss is strictly decreasing. It&apos;s beneficial to choose the largest learning rate that has a decreasing loss, as this will speed up training.</p><p>Having chosen a learning rate range (<code>[1e-3, 1e-2]</code>), we perform 5 training epochs using the <a href="https://sgugger.github.io/the-1cycle-policy.html">1cycle learning policy</a>.</p><pre><code class="language-python">learn.fit_one_cycle(5, max_lr=slice(1e-3, 1e-2))
learn.save(&apos;stage-1&apos;)</code></pre><pre><code class="language-csv">epoch	train_loss	valid_loss	error_rate	time
0	1.352547	0.909331	0.281369	00:14
1	1.032153	0.774388	0.205323	00:13
2	0.737094	0.570336	0.178707	00:13
3	0.476649	0.451232	0.129278	00:13</code></pre><h3 id="initial-results">Initial Results</h3><p>After the initial training we reach a validation accuracy of <code>87.07%</code>. We can use <a href="https://docs.fast.ai/vision.learner.html#ClassificationInterpretation">FastAI&apos;s <code>ClassificationInterpretation</code></a> to further interpret the model&apos;s performance:</p><pre><code class="language-python">interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.avanwyk.com/content/images/2019/05/fastai-stage1-confusion-matrix.png" class="kg-image" alt="African Antelope: A Case Study of Creating an Image Dataset with FastAI" loading="lazy"><figcaption>Confusion matrix produced after initial training of the model. Notably, the model struggles with distinguishing a Lichtenstein&apos;s Hartebeest and a Tsessebe, two antelope that are similar in appearance.</figcaption></figure><p>The interpreter also has a very useful feature that allows us to easily plot the examples that had the largest loss values.</p><pre><code class="language-python">interp.plot_top_losses(9, figsize=(12,12))</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.avanwyk.com/content/images/2019/05/fastai-stage1-top-loss.png" class="kg-image" alt="African Antelope: A Case Study of Creating an Image Dataset with FastAI" loading="lazy"><figcaption>Top losses after the initial training.</figcaption></figure><p>One issue seems to be images of close-up views of the antelope&apos;s face or photos where the antelope is not presented in the typical broadside view. In both these cases, <em>distinguishing features</em> such as patterns on the animal&apos;s coat or its horns might be missing from the image.
This highlights a potential <strong>flaw in how we gather the data</strong>: having many examples of one perspective of the subject matter, but neglecting other, valid perspectives.</p><h2 id="model-assisted-data-cleaning">Model Assisted Data Cleaning</h2><p>The FastAI library provides an extremely useful Jupyter Notebook widget that aids in automating data clean-up by using the trained model itself: the <a href="https://docs.fast.ai/widgets.image_cleaner.html">ImageCleaner</a>.</p><p>Using an ImageDataBunch and the trained model, the dataset is indexed by which images lead to the highest losses. The <code>ImageCleaner</code> is then instantiated from the dataset and indices:</p><pre><code class="language-python">from fastai.widgets import *

images = (ImageList.from_folder(DATA_PATH)
                   .split_none()
                   .label_from_folder()
                   .transform(custom_transforms(), size=224)
                   .databunch())

ds, idxs = DatasetFormatter().from_toplosses(learn)

ImageCleaner(ds, idxs, DATA_PATH)</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.avanwyk.com/content/images/2019/05/image-cleaner-example.png" class="kg-image" alt="African Antelope: A Case Study of Creating an Image Dataset with FastAI" loading="lazy"><figcaption>The FastAI ImageCleaner Jupyter notebook widget allowing re-labelling and removal of images.</figcaption></figure><p>The widget then allows you to remove images from the dataset or re-label any images that are incorrectly labelled.</p><p>Importantly, the <code>ImageCleaner</code> widget does not modify the data itself but instead creates a <code>.csv</code> file that contains the paths and labels of the cleaned data. We then need to construct an <code>ImageDataBunch</code> from the <code>.csv</code> file:</p><pre><code class="language-python">df = pd.read_csv(DATA_PATH/&apos;cleaned.csv&apos;, header=&apos;infer&apos;)
image_data = (ImageDataBunch.from_df(DATA_PATH, df,
                                     valid_pct=VALID_PCT,
                                     ds_tfms=custom_transforms(),
                                     size=IMAGE_SIZE,
                                     bs=BATCH_SIZE)
              .normalize(imagenet_stats))</code></pre><h2 id="full-model-training">Full Model Training</h2><p>Next we can look at training all the layers of the model instead of just the last, fully-connected layers. This is done by &apos;unfreezing&apos; the other layers of the model before training.</p><p>We also have to find a new learning rate as the optimisation landscape has now completely changed:</p><pre><code class="language-python">learn.unfreeze()

learn.lr_find()
learn.recorder.plot()</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.avanwyk.com/content/images/2019/05/fastai-stage2-lr_find-graph-1.png" class="kg-image" alt="African Antelope: A Case Study of Creating an Image Dataset with FastAI" loading="lazy"><figcaption>Loss vs Learning Rate graph produced by learn.recorder.plot() after lr_find() for full model training.</figcaption></figure><p>Finally, we fit the model again with the 1cycle policy for 20 epochs using a small learning rate:</p><pre><code class="language-python">learn.fit_one_cycle(20, max_lr=7e-5)</code></pre><pre><code class="language-csv">epoch	train_loss	valid_loss	error_rate	time
0	0.154993	0.311427	0.127962	00:14
1	0.185968	0.302535	0.127962	00:14
2	0.167942	0.291734	0.109005	00:14
3	0.181434	0.298713	0.094787	00:14
4	0.190612	0.400196	0.118483	00:14
5	0.209943	0.414060	0.118483	00:14
6	0.226450	0.462790	0.132701	00:14
7	0.248497	0.382834	0.113744	00:14
8	0.189046	0.343103	0.113744	00:14
9	0.141687	0.378920	0.132701	00:14
10	0.133787	0.400326	0.099526	00:14
11	0.136122	0.366274	0.109005	00:14
12	0.114380	0.343331	0.094787	00:14
13	0.091698	0.364937	0.109005	00:14
14	0.083694	0.331757	0.113744	00:14
15	0.069167	0.309694	0.104265	00:14
16	0.064571	0.312528	0.094787	00:14
17	0.057514	0.316830	0.085308	00:14
18	0.060952	0.323746	0.104265	00:14
19	0.057364	0.298466	0.085308	00:14</code></pre><h3 id="results">Results</h3><p>Fitting all the layers of the neural network improves our training loss to <code>0.057</code>, our validation loss to <code>0.298</code>, and our validation accuracy to <code>91.4692%</code>.</p><p>As before, we can create an interpreter and visualise our top losses:</p><pre><code class="language-python">interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(12,12), heatmap=True)</code></pre><p>Here we pass the parameter <code>heatmap=True</code> to the <code>plot_top_losses</code> method, which will produce <a href="http://openaccess.thecvf.com/content_ICCV_2017/papers/Selvaraju_Grad-CAM_Visual_Explanations_ICCV_2017_paper.pdf">Grad-CAM (Gradient-weighted Class Activation Mapping)</a> heatmaps for the images. Grad-CAM visualisations highlight the regions of the image that were most important to the prediction.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.avanwyk.com/content/images/2019/05/fastai-stage2-top-loss.png" class="kg-image" alt="African Antelope: A Case Study of Creating an Image Dataset with FastAI" loading="lazy"><figcaption>Top losses after the initial training. Heatmaps show Grad-CAM visualisations of predictions.</figcaption></figure><p>The Grad-CAM visualisations show that the model does correctly identify the regions containing the antelope, and confirm that it tends to focus on the body and horns of the animals.</p><p>Finally, we can calculate our final <a href="https://en.wikipedia.org/wiki/F1_score">F1 score</a>, also making use of TTA (Test Time Augmentation). TTA applies the same augmenting transforms we used during training when making a prediction. The actual prediction is then the average of the predictions over the transformations of an example, increasing the chance that the model makes the correct prediction.</p><pre><code class="language-python">preds, targets = learn.TTA()
predicted_classes = np.argmax(preds, axis=1)

f1_score(targets, predicted_classes, average=&apos;micro&apos;)

0.9004739336492891</code></pre><p>We end up with a final F1 score of <code>0.9</code>.</p><h3 id="further-improvements">Further Improvements</h3><p>There are a number of things we can investigate to further improve the model performance:</p><ul><li>More data could be gathered, especially of specific edge cases the model is struggling with: front and rear views of the animals and close-ups of antelope faces.</li><li>Validate the transformations used to augment the dataset, especially colour distortion and image rotation/cropping. Some very specific features are sometimes required to distinguish one species from another, and as such we have to ensure the transformations we use don&apos;t discard this information.</li><li>Alternative architectures should be investigated that might perform better on this specific use case.</li></ul><h2 id="conclusion">Conclusion</h2><p>In this post we covered an end-to-end example of creating our own image dataset and using transfer learning to create an accurate deep learning image classifier for African antelope species. Our ResNet50 model reached an F1 score of <code>0.9</code> after only 24 epochs of training on roughly 880 examples spread over the 13 classes.</p><p>Unsurprisingly, the <a href="https://www.kdnuggets.com/2015/11/hardest-parts-data-science.html">hardest and most time-consuming part of the deep learning exercise</a> was not training the model; indeed, the FastAI code to do so is only 5 lines long:</p><pre><code class="language-python">image_data = ImageDataBunch.from_folder(DATA_PATH, valid_pct=VALID_PCT,\
                                            ds_tfms=get_transforms(),\
                                            size=IMAGE_SIZE,\
                                            bs=BATCH_SIZE)\
                                            .normalize(imagenet_stats)

learner = cnn_learner(image_data, architecture, metrics=error_rate)
learner.fit_one_cycle(5, max_lr=slice(1e-3, 1e-2))
learner.unfreeze()
learner.fit_one_cycle(5, 1e-4)</code></pre><p>Instead, the most difficult part is gathering and cleaning the data. Manual inspection of the data is tedious and time-consuming, and still resulted in some problems slipping through.</p><p>However, we also demonstrated how to use the model itself to aid in cleaning the dataset using the <code>ImageCleaner</code> widget from the FastAI library.</p><p>Furthermore, we found that the dataset is not fully representative of the problem we are trying to solve, as it is missing examples of some valid perspectives we might encounter in the real world.</p><p>There is no simple solution to creating a high-quality, error-free dataset (which is why open data initiatives are so valuable). However, an alternative to creating your own dataset from scratch is to find an existing dataset similar to the one you need and then modify it. In this case, we could have started with a dataset such as the <a href="https://www.nature.com/articles/sdata201526">Snapshot Serengeti dataset</a> and used only the images of antelope contained therein. An exercise left for next time.</p>]]></content:encoded></item><item><title><![CDATA[FastAI Installation and Setup]]></title><description><![CDATA[Instructions for installing FastAI v1 within a freshly created Anaconda virtual environment.]]></description><link>https://www.avanwyk.com/fastai-installation/</link><guid isPermaLink="false">5bc9f48707c73307530437c8</guid><category><![CDATA[fastai]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Mon, 08 Apr 2019 20:55:00 GMT</pubDate><content:encoded><![CDATA[<p>(Updated April 2019)</p><p><strong>The most up-to-date installation instructions are available on <a href="https://github.com/fastai/fastai">Github</a> and the <a href="https://docs.fast.ai/install.html">docs</a> site; I would recommend starting there. 
A list of <a href="https://docs.fast.ai/troubleshoot.html">common troubleshooting issues</a> is also available.</strong></p><p>I list the steps I followed for personal reference, which include solving some minor issues I encountered in setting up a full DL environment on a GPU-equipped laptop running Ubuntu 18.04.</p><p>If you are installing FastAI to do one of the deep learning courses, I recommend one of the various <a href="https://course.fast.ai/index.html#using-a-gpu">cloud solutions available</a> instead of setting up a CUDA/Anaconda environment as below.</p><p>The instructions listed below install FastAI v1 within a freshly created Anaconda virtual environment. They assume you have <a href="https://www.anaconda.com/download/">Anaconda</a> and <a href="https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html">NVIDIA</a> <a href="https://gist.github.com/zhanwenchen/e520767a409325d9961072f666815bb8">CUDA</a> (along with an appropriate NVIDIA driver) installed.</p><p>First, ensure conda is up to date; otherwise conda might complain about <code>PackagesNotFoundError</code>s.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">conda update conda
</code></pre>
<!--kg-card-end: markdown--><p>I recommend installing into a virtual environment to prevent interference from other libraries and system packages. You can create a Python 3.7 virtual environment to install FastAI in as follows:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">conda create -n fastai python=3.7 mypy pylint jupyter scikit-learn pandas
source activate fastai
</code></pre>
<!--kg-card-end: markdown--><p>Next, if you are planning on installing the GPU version, verify which CUDA you have installed:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">nvcc --version # Cuda compilation tools, release 10.0, V10.0.130
</code></pre>
<!--kg-card-end: markdown--><p>You can find the corresponding conda package using:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">conda search cuda* -c pytorch
</code></pre>
<!--kg-card-end: markdown--><p>Look for the <code>cudaXX</code> packages that match your CUDA version as reported by <code>nvcc</code>.</p><p>You can now install <code>pytorch</code> and <code>fastai</code> using conda.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">conda install cudatoolkit=10.0 -c pytorch -c fastai fastai
</code></pre>
<!--kg-card-end: markdown--><p><strong>A note on CUDA versions</strong>: I recommend installing the latest CUDA version supported by Pytorch if possible (10.0 at the time of writing); however, to avoid potential issues, stick with the same CUDA version you have a driver installed for.</p><p>You can verify that the CUDA installation went smoothly and that Pytorch is using your GPU using the following command:</p><!--kg-card-begin: markdown--><pre><code class="language-bash">python -c &quot;import torch; print(torch.cuda.get_device_name(torch.cuda.current_device()))&quot;
</code></pre>
<!--kg-card-end: markdown--><p>It should print the name of the device (GPU) you have attached to the machine.</p><p><strong>Note for NLP (using FastAI v1 for text):</strong> if you plan on using FastAI for NLP, I recommend also downloading the relevant language packages for <code>spacy</code>, otherwise you might hit some obscure errors when attempting to parse textual data.</p><!--kg-card-begin: markdown--><pre><code class="language-bash">python -m spacy download en
</code></pre>
<!--kg-card-end: markdown--><h2 id="cloud-environments">Cloud Environments</h2><p>A number of cloud services have first-class support for FastAI. I&apos;ve personally used <a href="https://www.paperspace.com/">https://www.paperspace.com/</a> a lot and can recommend it. There <a href="https://course.fast.ai/index.html#using-a-gpu">are a number of alternative options</a>. If you are looking for a VM-based option (which gives you a little more control over your environment), I recommend <a href="https://course.fast.ai/start_gcp.html">Google Cloud Platform</a> or <a href="https://course.fast.ai/start_azure.html">Microsoft Azure</a>.</p><h2 id="documentation">Documentation</h2><p>The FastAI v1 docs are really great; you can find them here: <a href="http://docs.fast.ai">http://docs.fast.ai</a>.</p>]]></content:encoded></item><item><title><![CDATA[CDC Mortality Prediction with FastAI for Tabular Data]]></title><description><![CDATA[This post will cover getting started with FastAI v1 using tabular data. It is aimed at people who are at least somewhat familiar with deep learning, but not necessarily with using the FastAI v1 library.]]></description><link>https://www.avanwyk.com/cdc-mortality-fastai-tabular/</link><guid isPermaLink="false">5bc88b8007c733075304361d</guid><category><![CDATA[fastai]]></category><category><![CDATA[deep learning]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Fri, 19 Oct 2018 20:57:06 GMT</pubDate><media:content url="https://www.avanwyk.com/content/images/2018/10/mika-baumeister-703680-unsplash-tabular.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.avanwyk.com/content/images/2018/10/mika-baumeister-703680-unsplash-tabular.jpg" alt="CDC Mortality Prediction with FastAI for Tabular Data"><p>The first major version of the FastAI deep learning library, <a href="http://www.fast.ai/2018/10/02/fastai-ai/">FastAI v1</a>, was recently released. 
For those unfamiliar with the FastAI library, it&apos;s built on top of Pytorch and aims to provide a consistent API for the major deep learning application areas: vision, text and tabular data. The library also focuses on making state-of-the-art deep learning techniques seamlessly available to its users.</p><p>This post will cover getting started with FastAI v1 using tabular data. It is aimed at people who are at least somewhat familiar with deep learning, but not necessarily with using the FastAI v1 library. For more technical details on the deep learning techniques used, I recommend <a href="http://www.fast.ai/2018/04/29/categorical-embeddings/">this post</a> by Rachel of FastAI.</p><p>For a guide on installing FastAI v1 on your own machine, or cloud environments you may use, see this <a href="https://www.avanwyk.com/getting-started-with-fastai-v1-installation/">post</a>.</p><h2 id="training-a-model-on-tabular-data">Training a model on Tabular Data</h2><p>Tabular data (referred to as structured data in the library before v1) refers to data that typically occurs in rows and columns, such as SQL tables and CSV files. Tabular data is extremely common in industry, and is the most common type of data used in <a href="https://www.kaggle.com/">Kaggle competitions</a>, but is somewhat neglected in other deep learning libraries. FastAI, in turn, provides first-class API support for tabular data, as shown below.</p><p>In the example below, we attempt to predict mortality using <a href="https://www.kaggle.com/cdc/mortality">CDC Mortality data</a> from Kaggle. 
The complete notebook, which includes the data pre-processing, is available here: <a href="https://github.com/avanwyk/fastai-projects/blob/master/cdc-mortality-tabular-prediction/cdc-mortality.ipynb">https://github.com/avanwyk/fastai-projects/blob/master/cdc-mortality-tabular-prediction/cdc-mortality.ipynb</a>.</p><h3 id="data-loading">Data loading</h3><p>The FastAI v1 tabular data API revolves around three types of variables in the dataset: categorical variables, continuous variables and the <em>dependent </em>variable.</p><!--kg-card-begin: markdown--><pre><code class="language-python">dep_var = &apos;age&apos;
categorical_names = [&apos;education&apos;, &apos;sex&apos;, &apos;marital_status&apos;]
</code></pre>
<!--kg-card-end: markdown--><p>Any variable that is not specified as a categorical variable will be assumed to be a continuous variable.</p><p>For tabular data, FastAI provides a special <a href="http://docs.fast.ai/tabular.data.html#TabularDataset">TabularDataset</a>. The simplest way to construct a <code>TabularDataset</code> is using the <code>tabular_data_from_df</code> helper. The helper also supports specifying a number of transforms that are applied to the dataframe before building the dataset.</p><!--kg-card-begin: markdown--><pre><code class="language-python">tfms = [FillMissing, Categorify]

tabular_data = tabular_data_from_df(&apos;output&apos;, train_df, valid_df, dep_var, tfms=tfms, cat_names=categorical_names)
</code></pre>
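<p>To make concrete what these transforms will do, here is a rough plain-pandas equivalent. This is only a sketch with made-up column names, not the FastAI implementation or the actual CDC schema: fill continuous gaps with the median, encode categories as sequential numeric IDs starting at 1 (with 0 reserved for missing values), and standardize the continuous variables.</p>

```python
import numpy as np
import pandas as pd

# Toy frame: a categorical column, a continuous column with a gap, and
# the dependent variable. Column names are illustrative only.
df = pd.DataFrame({
    "education": ["primary", "tertiary", "primary", None],
    "income": [100.0, np.nan, 300.0, 200.0],
    "age": [34, 58, 41, 47],
})

# FillMissing-style: fill gaps in continuous variables with the median.
df["income"] = df["income"].fillna(df["income"].median())

# Categorify-style: pandas category dtype, then 1-based numeric IDs
# (cat.codes yields -1 for missing values, so +1 maps missing to 0).
df["education"] = df["education"].astype("category").cat.codes + 1

# Standardization of continuous variables (the mean/std would normally
# be computed on the training set and reused for validation data).
df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()
```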
<!--kg-card-end: markdown--><p>The <code>FillMissing</code> transform will fill in missing values for continuous variables <em>but not the categorical or dependent variables. </em>By default it uses the median, but this can be changed to use either a constant value or the most common value.</p><p>The <code>Categorify</code> transform will change the variables in the dataframe to <a href="https://pandas.pydata.org/pandas-docs/stable/categorical.html">Pandas category variables</a> for you.</p><p>The transforms are applied to the dataframe before being passed to the dataset object.</p><p>The <code>TabularDataset</code> then does some more pre-processing for you. It automatically converts category variables (which might be text) to sequential, numeric IDs starting at 1 (0 is reserved for NaN values). Further, it automatically normalizes the continuous variables using <a href="https://en.wikipedia.org/wiki/Feature_scaling#Standardization">standardization</a>. You can also pass in statistics for each variable to override the mean and standard deviation used for the normalization; otherwise, they will automatically be calculated from the training set.</p><h3 id="learner-and-model">Learner and model</h3><p>With the data ready to be used by a deep learning algorithm, we can create a <a href="http://docs.fast.ai/basic_train.html#Learner">Learner</a>:</p><!--kg-card-begin: markdown--><pre><code class="language-python">learn = get_tabular_learner(tabular_data,
                            layers=[100,50,1],
                            emb_szs={&apos;education&apos;: 6,
                                     &apos;sex&apos;: 5,
                                     &apos;marital_status&apos;: 8})
learn.loss_fn = F.mse_loss
</code></pre>
<!--kg-card-end: markdown--><p>We use a helper function <code>get_tabular_learner</code> to set up the tabular data learner for us. We also have to specify an MSE loss function since we are performing a regression task.</p><p>A FastAI Learner combines a model with data, a loss function and an optimizer. It also encapsulates the metric recorder and provides an API for saving and loading the model.</p><p>In our case, the helper function will build a <a href="http://docs.fast.ai/tabular.models.html#class-tabularmodel">TabularModel</a>. The model will consist of an Embedding Layer for each categorical variable (with optional sizes specified), with each layer having its own Dropout and Batchnormalization. These results are concatenated with the continuous input variables, followed by Linear and ReLU layers of the specified sizes. Batchnormalization is added between each layer pair, and the last layer pair only includes the Linear layer.</p><p>By default, an Adam optimizer will be used.</p><p>You can print a summary of the model using:</p><!--kg-card-begin: markdown--><pre><code class="language-python">learn.model
</code></pre>
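<p>To illustrate the data flow described above, here is a minimal NumPy sketch of the forward pass: one embedding (lookup) table per categorical variable, concatenated with the continuous inputs, followed by Linear/ReLU layers. All sizes are illustrative, dropout and batch normalization are omitted, and nothing is trained; this is not FastAI&apos;s <code>TabularModel</code>, just the shape of the computation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: three categorical variables with chosen embedding
# sizes (mirroring the emb_szs above) plus ten continuous features.
cardinalities = {"education": 17, "sex": 3, "marital_status": 7}
emb_sizes = {"education": 6, "sex": 5, "marital_status": 8}
n_continuous, batch_size = 10, 4

# One embedding (lookup) table per categorical variable.
embeddings = {name: rng.normal(size=(card, emb_sizes[name]))
              for name, card in cardinalities.items()}

# A batch: integer category IDs plus standardized continuous values.
cat_batch = {name: rng.integers(0, card, size=batch_size)
             for name, card in cardinalities.items()}
cont_batch = rng.normal(size=(batch_size, n_continuous))

# Look up each embedding and concatenate with the continuous inputs.
emb_out = np.concatenate([embeddings[n][cat_batch[n]] for n in cardinalities], axis=1)
x = np.concatenate([emb_out, cont_batch], axis=1)  # (4, 6 + 5 + 8 + 10) = (4, 29)

def linear_relu(inp, n_out, relu=True):
    """An untrained Linear layer, optionally followed by ReLU."""
    w = rng.normal(size=(inp.shape[1], n_out)) * 0.01
    out = inp @ w
    return np.maximum(out, 0) if relu else out

# Layers of the specified sizes [100, 50, 1]; the last layer is Linear only.
h = linear_relu(x, 100)
h = linear_relu(h, 50)
y = linear_relu(h, 1, relu=False)  # regression output
```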
<!--kg-card-end: markdown--><h3 id="learning-rate">Learning rate</h3><p>Before we can start training the model, we have to choose a learning rate (LR). This is where one of the FastAI library&apos;s more useful and powerful tools comes in. The FastAI library has first-class support for a <a href="https://arxiv.org/abs/1506.01186">technique</a> to find an appropriate learning rate with <code><a href="http://docs.fast.ai/callbacks.lr_finder.html">lr_find</a></code>.</p><!--kg-card-begin: markdown--><pre><code class="language-python">learn.lr_find()
learn.recorder.plot()
</code></pre>
<!--kg-card-end: markdown--><p>Doing the above will (after some training) produce a graph such as this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="/content/images/2018/10/lr_finder.png" class="kg-image" alt="CDC Mortality Prediction with FastAI for Tabular Data" loading="lazy"><figcaption>Result of running lr_find()</figcaption></figure><p>Another example:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="/content/images/2018/10/lr_finder_ex.png" class="kg-image" alt="CDC Mortality Prediction with FastAI for Tabular Data" loading="lazy"><figcaption>Another example of plotting the loss from lr_find()</figcaption></figure><p>An appropriate LR can then be selected <em>by choosing a value that is an order of magnitude lower than the minimum. </em>This learning rate will still be aggressive enough to ensure quick training, but is reasonably safe from exploding. For more details on the technique, see <a href="https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html">here</a> and <a href="http://docs.fast.ai/callbacks.lr_finder.html">here</a>.</p><h3 id="training">Training</h3><p>We are now ready to train the model:</p><!--kg-card-begin: markdown--><pre><code class="language-python">lr = 1e-1
learn.fit_one_cycle(1, lr)
</code></pre>
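<p>For intuition, the LR range test behind <code>lr_find</code> can be sketched without any library support: grow the learning rate exponentially over mini-batches, recording the loss, and stop once training diverges. The toy least-squares problem below only illustrates the technique; it is not FastAI&apos;s implementation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
y = X @ rng.normal(size=5)  # noise-free linear target for simplicity

w = np.zeros(5)
lr, lr_mult = 1e-5, 1.5  # start tiny, multiply each mini-batch
lrs, losses = [], []
for step in range(80):
    i = (step * 32) % 256  # cycle through mini-batches
    xb, yb = X[i:i + 32], y[i:i + 32]
    pred = xb @ w
    loss = float(np.mean((pred - yb) ** 2))
    lrs.append(lr)
    losses.append(loss)
    if not np.isfinite(loss) or loss > 1e6 * (losses[0] + 1):
        break  # the loss has exploded; end the sweep
    w -= lr * (2 / len(xb)) * (xb.T @ (pred - yb))  # SGD step
    lr *= lr_mult
# One would then plot loss against learning rate on a log scale and pick
# a rate roughly an order of magnitude below the loss minimum.
```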
<!--kg-card-end: markdown--><p>The <code>fit_one_cycle</code> call fits the model for the specified number of epochs using the <a href="http://docs.fast.ai/callbacks.one_cycle.html#OneCycleScheduler">OneCycleScheduler</a> callback. The callback automatically applies a two phase learning rate schedule, first increasing the learning rate to <code>lr_max</code> (which is the learning rate we specify) and then annealing to 0 in the second phase.</p><p>Loss and metrics are recorded by the <a href="http://docs.fast.ai/basic_train.html#class-recorder">Recorder</a> callback and are accessible through <code>learn.recorder</code>. For example, to plot the training loss you can use:</p><!--kg-card-begin: markdown--><pre><code class="language-python">learn.recorder.plot_losses()
</code></pre>
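<p>For intuition, a heavily simplified version of such a two-phase schedule can be computed in a few lines. FastAI&apos;s actual scheduler also cycles momentum and anneals more smoothly; this linear sketch only shows the overall shape.</p>

```python
def one_cycle_lr(step, total_steps, lr_max, pct_warmup=0.3, div=25.0):
    """Learning rate at a given step of a simplified one-cycle schedule.

    Phase 1: linear increase from lr_max/div up to lr_max.
    Phase 2: linear annealing from lr_max down towards zero.
    """
    warmup = int(total_steps * pct_warmup)
    lr_min = lr_max / div
    if step < warmup:  # phase 1: increase towards lr_max
        return lr_min + (lr_max - lr_min) * step / warmup
    t = (step - warmup) / (total_steps - warmup)
    return lr_max * (1.0 - t)  # phase 2: anneal towards zero

schedule = [one_cycle_lr(s, 100, lr_max=1e-1) for s in range(100)]
```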
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="/content/images/2018/10/training_loss.png" class="kg-image" alt="CDC Mortality Prediction with FastAI for Tabular Data" loading="lazy"><figcaption>Training Loss</figcaption></figure><h2 id="initial-thoughts-on-fastai-v1">Initial thoughts on FastAI v1</h2><p>The FastAI v1 experience has so far been really great. The pre-v1 releases were usable, but definitely lacked some polish (particularly the documentation). The new <a href="http://docs.fast.ai">documentation site</a> is great, and thoroughly explains a lot of the API.</p><p>The API itself is incredibly terse and you can do a lot with very few lines of code. I look forward to diving deeper into the API and exploring its flexibility. Another great thing about the API is the consistent use of Python <a href="https://www.python.org/dev/peps/pep-0484/">Type Hints</a>, which makes it much easier to deduce what the API expects or does while working in notebook environments, in addition to catching obvious errors.</p><h2 id="references">References</h2><p>The documentation that was released with FastAI v1 is really great; you can check it out here: <a href="http://docs.fast.ai/">http://docs.fast.ai/</a></p><p>Then I also have to mention the really great <a href="https://forums.fast.ai/">FastAI forums</a>; it&apos;s very possibly the best deep learning forum in existence.</p><p>Lastly, if you haven&apos;t done so already, the <a href="http://course.fast.ai/">FastAI course</a> is strongly recommended. A new version of the course based on v1 of the library will launch in early 2019.</p>]]></content:encoded></item><item><title><![CDATA[How to Read a Paper]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>A few years ago I came across a method for reading academic papers which I&apos;ve kept coming back to as a reliable systematic approach to efficiently read important papers of varying complexity.</p>
<p>The method itself comes from a <a href="http://ccr.sigcomm.org/online/files/p83-keshavA.pdf">paper</a> by Prof. Srinivasan Keshav, an ACM Fellow and researcher</p>]]></description><link>https://www.avanwyk.com/how-to-read-a-paper/</link><guid isPermaLink="false">5b3cec22aeb9e4517f1355fc</guid><category><![CDATA[research]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Thu, 05 Jul 2018 17:59:15 GMT</pubDate><media:content url="https://www.avanwyk.com/content/images/2018/08/james-sutton-201910-unsplash_tiny.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.avanwyk.com/content/images/2018/08/james-sutton-201910-unsplash_tiny.jpg" alt="How to Read a Paper"><p>A few years ago I came across a method for reading academic papers which I&apos;ve kept coming back to as a reliable systematic approach to efficiently read important papers of varying complexity.</p>
<p>The method itself comes from a <a href="http://ccr.sigcomm.org/online/files/p83-keshavA.pdf">paper</a> by Prof. Srinivasan Keshav, an ACM Fellow and researcher at the University of Waterloo. I recommend reading his paper, but I summarise the system here.</p>
<h2 id="thethreepasssystem">The Three Pass System</h2>
<p>The system uses a top-down, three-pass approach, with each pass delving deeper into the details of the paper. Each pass has a specific goal. Depending on what you need to obtain from the paper, completing all three passes may not be necessary.</p>
<h3 id="thefirstpass">The First Pass</h3>
<p>The goal of the first pass is to get a high level overview of the paper:</p>
<ul>
<li>Read the <strong>title</strong>, <strong>abstract</strong>, <strong>introduction</strong>, <strong>section and subsection <em>headings</em></strong> and the <strong>conclusion</strong>.</li>
<li><strong>Glance</strong> at the <strong>references</strong>, noting whether you might have read any of them.</li>
</ul>
<p>After the first pass you should be able to <em>categorize</em> the paper, understand the paper&apos;s <em>context</em>, validate the basic assumptions for <em>correctness</em>, note the main <em>contributions</em> and be able to determine the paper&apos;s <em>clarity</em>.</p>
<p>The first pass is sufficient to determine whether you are interested in the paper, whether it is relevant to your research area and whether there are any questionable assumptions made which may deter your interest.</p>
<p>Also note, if you are writing a paper, a first pass is perhaps all a reviewer will give you. Pay special attention to the parts mentioned above. Strive to be clear and concise in your headings, introduction, conclusion and abstract.</p>
<h3 id="thesecondpass">The Second Pass</h3>
<p>With the second pass the goal is to understand the content of the paper to the point where you could explain it to someone else:</p>
<ul>
<li><strong>Carefully read the paper</strong>, but ignore details such as proofs or very technical details.</li>
<li><strong>Make comments and notes</strong> on important points.</li>
<li><strong>Study any figures or graphs</strong>, note details such as the axes, labeled points and whether statistical variance is indicated etc.</li>
<li>Note all <strong>unread references</strong> for further reading.</li>
</ul>
<p>Doing a second pass is appropriate for papers that you are interested in, but aren&apos;t necessarily directly related to your work. After the second pass you may or may not understand the paper. If it is critical to understand the work, or you are reviewing the paper, move on to the third pass.</p>
<h3 id="thethirdpass">The Third Pass</h3>
<p>The idea of the third pass is to understand the paper in such detail that you could <em>re-implement</em> it.</p>
<ul>
<li>Read the paper with <strong>great attention to detail</strong>, identifying and <strong>challenging every assumption</strong>.</li>
<li>Given the same assumptions, think about how you would <strong>reproduce and present the result</strong>.</li>
<li>If novel techniques or methods are used, make sure you understand them to the <strong>degree where you could use them yourself</strong>.</li>
</ul>
<p>Comparing your idea of implementing the paper with the actual paper will highlight areas where the paper excels or falls short. After the final pass you should be able to reconstruct the structure of the paper from memory, be familiar with the techniques used and identify implicit assumptions and missing references.</p>
<h2 id="conclusion">Conclusion</h2>
<p>For more detail on the system and its motivations and related work, please read Prof. Keshav&apos;s <a href="http://ccr.sigcomm.org/online/files/p83-keshavA.pdf">paper</a>. It also includes a step based approach for doing a literature survey.</p>
<h3 id="references">References</h3>
<ol>
<li>Keshav, S., 2007. How to read a paper. ACM SIGCOMM Computer Communication Review, 37(3), pp.83-84.</li>
</ol>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[An Overview of LightGBM]]></title><description><![CDATA[This post gives an overview of LightGBM and aims to serve as a practical reference. A brief introduction to gradient boosting is given, followed by a look at the LightGBM API and algorithm parameters.]]></description><link>https://www.avanwyk.com/an-overview-of-lightgbm/</link><guid isPermaLink="false">5aeee670c7e1df0928681e91</guid><category><![CDATA[data science]]></category><category><![CDATA[machine learning]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Wed, 16 May 2018 21:40:00 GMT</pubDate><media:content url="https://www.avanwyk.com/content/images/2018/08/alberto-tondo-328006-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="contents">Contents</h2>
<ol>
<li><a href="#lightgbm">LightGBM Introduction</a></li>
<li><a href="#gradientboosting">Gradient Boosting</a><br>
i. <a href="#algorithm">Algorithm</a></li>
<li><a href="#lightgbmapi">LightGBM API</a><br>
i. <a href="#plotting">Plotting</a><br>
ii. <a href="#savingthemodel">Saving the model</a></li>
<li><a href="#lightgbmparameters">LightGBM Parameters</a><br>
i. <a href="#treeparameters">Tree parameters</a><br>
ii. <a href="#tuningforimbalanceddata">Tuning for imbalanced data</a><br>
iii. <a href="#tuningforoverfitting">Tuning for overfitting</a><br>
iv. <a href="#tuningforaccuracy">Tuning for accuracy</a></li>
<li><a href="#resources">Resources</a></li>
</ol>
<img src="https://www.avanwyk.com/content/images/2018/08/alberto-tondo-328006-unsplash.jpg" alt="An Overview of LightGBM"><p>Although perhaps not as fashionable as deep learning algorithms in 2018, tree-based and tree-ensemble learning methods remain remarkably effective. Across a variety of domains (<a href="https://www.kaggle.com/pureheart/1st-place-lgb-model-public-0-470-private-0-502?scriptVersionId=2372967/comments">restaurant visitor forecasting</a>, <a href="https://www.kaggle.com/c/kkbox-music-recommendation-challenge/discussion/45942">music recommendation</a>, <a href="https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629">safe driver prediction</a>, and <a href="https://github.com/dmlc/xgboost/blob/master/demo/README.md#machine-learning-challenge-winning-solutions">many more</a>), ensemble tree models - specifically gradient boosted trees - are widely used on Kaggle, often as part of the winning solution.</p>
<p>Decision trees also have certain advantages over deep learning methods: decision trees are more readily interpreted than deep neural networks, naturally better at learning from imbalanced data, often much faster to train, and able to work directly with un-encoded feature data (such as text).</p>
<p>This post gives an overview of LightGBM and aims to serve as a practical reference. A brief introduction to gradient boosting is given, followed by a look at the LightGBM API and algorithm parameters. The examples given in this post are taken from an end-to-end practical example of applying LightGBM to the problem of <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud">credit card fraud detection</a>: <a href="https://www.kaggle.com/avanwyk/a-lightgbm-overview">https://www.kaggle.com/avanwyk/a-lightgbm-overview</a>.</p>
<h2 id="lightgbm">LightGBM</h2>
<p><a href="https://github.com/Microsoft/LightGBM">LightGBM</a> is an open-source framework for gradient boosted machines. By default, LightGBM will train a Gradient Boosted Decision Tree (GBDT), but it also supports random forests, <a href="https://arxiv.org/abs/1505.01866">Dropouts meet Multiple Additive Regression Trees (DART)</a>, and <a href="https://papers.nips.cc/paper/6579-gradient-based-sampling-an-adaptive-importance-sampling-for-least-squares.pdf">Gradient-based One-Side Sampling (GOSS)</a>.</p>
<p>The framework is fast and was designed for distributed training. It supports large-scale datasets and training on the GPU. In many cases LightGBM has been found to be more accurate and faster than XGBoost, though this is <a href="https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db">problem dependent</a>.</p>
<p>Both LightGBM and <a href="https://github.com/dmlc/xgboost">XGBoost</a> are widely used and provide highly optimized, scalable and fast implementations of gradient boosted machines (GBMs). I have previously used XGBoost for a number of applications, but have yet to take an in-depth look at LightGBM.</p>
<p>The section below gives some theoretical background on gradient boosting. The section <a href="#LightGBM-API">LightGBM API</a> continues with the practicalities of using LightGBM.</p>
<h2 id="gradientboosting">Gradient Boosting</h2>
<p>When considering ensemble learning, there are two primary methods: <em>bagging</em> and <em>boosting</em>. Bagging involves the training of many independent models and combines their predictions through some form of aggregation (averaging, voting etc.). An example of a bagging ensemble is a <a href="https://en.wikipedia.org/wiki/Random_forest">Random Forest</a>.</p>
<p><a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)">Boosting</a> instead trains models <em>sequentially</em>, where each model learns from the errors of the previous model. Starting with a weak base model, models are trained iteratively, each adding to the prediction of the previous model to produce a strong overall prediction.</p>
<p>In the case of gradient boosted decision trees, each successive model is a decision tree fit to the <em>pseudo-residuals</em>: the negative gradient of the loss function with respect to the previous model&apos;s predictions. Gradient descent is then applied by stepping in the direction of the average gradient at each leaf node of the fitted tree.</p>
<p>An <a href="http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/">excellent explanation</a> of gradient boosting is given by <a href="https://www.linkedin.com/in/ben-gorman-4a6b3650">Ben Gorman</a> over on the Kaggle Blog and I strongly advise reading the <a href="http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/">post</a> if you would like to understand gradient boosting. A summary is given here.</p>
<p>Considering decision trees, we proceed as follows. We start with an initial fit, \(F_0\), of our data: a constant value that minimizes our loss function \(L\):<br>
$$ F_0(x) = \underset{\gamma}{arg\ \min} \sum^{n}_{i=1} L(y_i, \gamma) $$<br>
in the case of optimizing the mean square error, we can take the mean of the target values:<br>
$$ F_0(x) = \frac{1}{n} \sum^{n}_{i=1} y_i $$</p>
<p>With our initial guess of \(F_0\), we can now calculate the gradient, or <em>pseudo</em> residuals, of \(L\) with respect to \(F_0\):<br>
$$ r_{i1} = -\frac{\partial L(y_i, F_{0}(x_i))}{\partial F_{0}(x_i)} $$</p>
<p>We now fit a decision tree \(h_{1}(x)\), to the residuals. Using a regression tree, this will yield the <strong>average gradient</strong> for each of the leaf nodes.</p>
<p>Now we can apply gradient descent to minimize the loss for each leaf by stepping in the direction of the <strong>average gradient</strong> for the leaf nodes as contained in our decision tree \(h_{1}(x)\). The step size is determined by a multiplier \(\gamma_{1}\) which can be optimized by performing a <a href="https://en.wikipedia.org/wiki/Line_search">line search</a>. The step size is further shrunk by a learning rate \(\lambda_{1}\), thus yielding a new boosted fit of the data:<br>
$$ F_{1}(x) = F_{0}(x) + \lambda_1 \gamma_1 h_1(x) $$</p>
<h3 id="algorithm">Algorithm</h3>
<p>Putting it all together, we have the following algorithm. For a number of boosting rounds \(M\) and a differentiable loss function \(L\):</p>
<p>Let \( F_0(x) = \underset{\gamma}{arg\ \min} \sum^{n}_{i=1} L(y_i, \gamma) \)<br>
For m = 1 to M:</p>
<ol>
<li>Calculate the <em>pseudo</em> residuals \( r_{im} = -\frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} \)</li>
<li>Fit decision tree \( h_m(x) \) to \( r_{im} \)</li>
<li>Compute the step multiplier \( \gamma_m \) for each leaf of \( h_m(x) \)</li>
<li>Let \( F_m(x) = F_{m-1}(x) + \lambda_m \gamma_m h_m(x) \), where \( \lambda_m \) is the learning rate for iteration \(m\)</li>
</ol>
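<p>As an illustrative sketch of the algorithm above (not LightGBM&apos;s actual implementation), here is a from-scratch version for the mean squared error loss, using scikit-learn&apos;s <code>DecisionTreeRegressor</code> as the base learner. For MSE the pseudo-residuals are simply \( y - F_{m-1}(x) \), and the leaf means of the fitted regression tree already provide the optimal step multipliers:</p>
<pre><code class="language-python">import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    f0 = y.mean()  # F_0: the constant fit minimizing MSE is the target mean
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred  # pseudo-residuals for MSE loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)  # h_m; the leaf means act as gamma_m
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred

# Toy regression problem: y = x^2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.01, size=200)
f0, trees = gbm_fit(X, y)
mse = np.mean((gbm_predict(X, f0, trees) - y) ** 2)
</code></pre>
<p>The training error drops well below that of the constant fit \(F_0\) after only a few dozen rounds, which is the essence of boosting.</p>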
<p>One caveat of the above explanation is that it neglects to incorporate a regularization term in the loss function. An overview of gradient boosting <a href="http://xgboost.readthedocs.io/en/latest/model.html">as given in the XGBoost documentation</a> pays special attention to the regularization term while deriving the objective function.</p>
<p>In terms of LightGBM specifically, a detailed overview of the LightGBM algorithm and its innovations is given in the NIPS <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf">paper</a>.</p>
<h2 id="lightgbmapi">LightGBM API</h2>
<p>Fortunately the details of the gradient boosting algorithm are well abstracted by LightGBM, and using the library is very straightforward.</p>
<p>LightGBM requires you to wrap datasets in a LightGBM <a href="https://lightgbm.readthedocs.io/en/latest/Python-API.html#lightgbm.Dataset">Dataset</a> object:</p>
<pre><code class="language-python">import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train, free_raw_data=False)
</code></pre>
<p>The parameter <code>free_raw_data</code> controls whether the input data is freed after constructing the inner datasets.</p>
<p>LightGBM supports many parameters that control various aspects of the algorithm (more on that below). Some core parameters that should be defined are:</p>
<pre><code class="language-python">core_params = {
    &apos;boosting_type&apos;: &apos;gbdt&apos;, # rf, dart, goss
    &apos;objective&apos;: &apos;binary&apos;, # regression, multiclass, binary
    &apos;learning_rate&apos;: 0.05,
    &apos;num_leaves&apos;: 31,
    &apos;nthread&apos;: 4,
    &apos;metric&apos;: &apos;auc&apos; # binary_logloss, mse, mae
}
</code></pre>
<p>We can then call the <a href="https://lightgbm.readthedocs.io/en/latest/Python-API.html#lightgbm.train">training API</a> to train a model, specifying the number of boosting rounds and early stopping rounds as needed:</p>
<pre><code class="language-python">evals_result = {}
gbm = lgb.train(core_params, # parameter dict to use
                training_set,
                init_model=init_gbm, # enables continuous training.
                num_boost_round=boost_rounds, # number of boosting rounds.
                early_stopping_rounds=early_stopping_rounds,
                valid_sets=validation_set,
                evals_result=evals_result, # stores validation results.
                verbose_eval=False) # whether to print evaluations during training.
</code></pre>
<p>Early stopping occurs when neither the objective nor the metrics we defined, as calculated on the validation data, improve for the given number of rounds.</p>
<p>LightGBM also supports continuous training of a model through the <code>init_model</code> parameter, which can accept an already trained model.</p>
<p>A detailed overview of the Python API is available <a href="https://lightgbm.readthedocs.io/en/latest/Python-API.html">here</a>.</p>
<h3 id="plotting">Plotting</h3>
<p>LightGBM has a built-in plotting API which is useful for quickly plotting validation results and tree-related figures.</p>
<p>Given the <code>evals_result</code> dictionary from training, we can easily plot validation metrics:</p>
<pre><code class="language-python">_ = lgb.plot_metric(evals_result)
</code></pre>
<p><img src="/content/images/2018/05/training-1.png" alt="An Overview of LightGBM" loading="lazy"></p>
<p>Another very useful feature that contributes to the explainability of the model is relative feature importance:</p>
<pre><code class="language-python">_ = lgb.plot_importance(model)
</code></pre>
<p><img src="/content/images/2018/05/feature-1.png" alt="An Overview of LightGBM" loading="lazy"></p>
<p>It is also possible to visualize individual trees:</p>
<pre><code class="language-python">_ = lgb.plot_tree(model, figsize=(20, 20))
</code></pre>
<p><img src="/content/images/2018/05/tree-1.png" alt="An Overview of LightGBM" loading="lazy"></p>
<h3 id="savingthemodel">Saving the model</h3>
<p>Models can easily be saved to a file or JSON:</p>
<pre><code class="language-python">gbm.save_model(&apos;cc_fraud_model.txt&apos;)

loaded_model = lgb.Booster(model_file=&apos;cc_fraud_model.txt&apos;)

# Output to JSON
model_json = gbm.dump_model()
</code></pre>
<h2 id="lightgbmparameters">LightGBM Parameters</h2>
<p>A list of more advanced parameters for controlling the training of a GBDT is given below with a brief explanation of their effect on the algorithm.</p>
<pre><code class="language-python">advanced_params = {
    &apos;boosting_type&apos;: &apos;gbdt&apos;,
    &apos;objective&apos;: &apos;binary&apos;,
    &apos;metric&apos;: &apos;auc&apos;,
    
    &apos;learning_rate&apos;: 0.01,
    &apos;num_leaves&apos;: 41, # more increases accuracy, but may lead to overfitting.
    
    &apos;max_depth&apos;: 5, # shallower trees reduce overfitting.
    &apos;min_split_gain&apos;: 0, # minimal loss gain to perform a split.
    &apos;min_child_samples&apos;: 21, # specifies the minimum samples per leaf node.
    &apos;min_child_weight&apos;: 5, # minimal sum hessian in one leaf.
    
    &apos;lambda_l1&apos;: 0.5, # L1 regularization.
    &apos;lambda_l2&apos;: 0.5, # L2 regularization.
    
    # LightGBM can subsample the data for training (improves speed):
    &apos;feature_fraction&apos;: 0.5, # randomly select a fraction of the features.
    &apos;bagging_fraction&apos;: 0.5, # randomly bag or subsample training data.
    &apos;bagging_freq&apos;: 0, # perform bagging every Kth iteration, disabled if 0.
    
    &apos;scale_pos_weight&apos;: 99, # add a weight to the positive class examples.
    # this can account for highly skewed data.
    
    &apos;subsample_for_bin&apos;: 200000, # sample size to determine histogram bins.
    &apos;max_bin&apos;: 1000, # maximum number of bins to bucket feature values in.
    
    &apos;nthread&apos;: 4, # best set to number of actual cores.
}
</code></pre>
<h3 id="treeparameters">Tree parameters</h3>
<p>LightGBM builds its trees <a href="https://github.com/Microsoft/LightGBM/blob/master/docs/Features.rst#leaf-wise-best-first-tree-growth">leaf-wise</a> (best-first), in contrast to the level-wise growth used by most GBM implementations, including XGBoost by default.<br>
<img src="/content/images/2018/05/leaf-wise-1.png" alt="An Overview of LightGBM" loading="lazy"></p>
<p>Building the tree leaf-wise results in faster convergence, but may lead to overfitting if the parameters are not tuned accordingly. Important parameters for controlling the tree building are:</p>
<ul>
<li><code>num_leaves</code>: the number of leaf nodes to use. Having a large number of leaves will improve accuracy, but will also lead to overfitting.</li>
<li><code>min_child_samples</code>: the minimum number of samples (data) to group into a leaf. The parameter can greatly assist with overfitting: larger sample sizes per leaf will reduce overfitting (but may lead to under-fitting).</li>
<li><code>max_depth</code>: controls the depth of the tree explicitly. Shallower trees reduce overfitting.</li>
</ul>
<h3 id="tuningforimbalanceddata">Tuning for imbalanced data</h3>
<p>The simplest way to account for imbalanced or skewed data is to add a weight to the positive class examples:</p>
<ul>
<li><code>scale_pos_weight</code>: the weight can be calculated based on the number of negative and positive examples: <code>scale_pos_weight = number of negative samples / number of positive samples</code>.</li>
</ul>
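<p>As a minimal sketch, assuming a NumPy array of hypothetical binary labels with a 1% positive class, the weight can be computed directly from the class counts:</p>
<pre><code class="language-python">import numpy as np

# Hypothetical, highly imbalanced labels: 990 negatives, 10 positives.
y = np.array([0] * 990 + [1] * 10)

neg, pos = np.bincount(y)
scale_pos_weight = neg / pos  # 990 / 10 = 99.0
</code></pre>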
<h3 id="tuningforoverfitting">Tuning for overfitting</h3>
<p>In addition to the parameters mentioned above the following parameters can be used to control overfitting:</p>
<ul>
<li><code>max_bin</code>: the maximum number of bins that feature values are bucketed into. A smaller <code>max_bin</code> reduces overfitting.</li>
<li><code>min_child_weight</code>: the minimum sum hessian for a leaf. In conjunction with <code>min_child_samples</code>, larger values reduce overfitting.</li>
<li><code>bagging_fraction</code> and <code>bagging_freq</code>: enables bagging (subsampling) of the training data. Both values need to be set for bagging to be used. The frequency controls how often (iteration) bagging is used. Smaller fractions and frequencies reduce overfitting.</li>
<li><code>feature_fraction</code>: controls the subsampling of features used for training (as opposed to subsampling the actual training data in the case of bagging). Smaller fractions reduce overfitting.</li>
<li><code>lambda_l1</code> and <code>lambda_l2</code>: control L1 and L2 regularization.</li>
</ul>
<h3 id="tuningforaccuracy">Tuning for accuracy</h3>
<p>Accuracy may be improved by tuning the following parameters:</p>
<ul>
<li><code>max_bin</code>: a larger <code>max_bin</code> increases accuracy.</li>
<li><code>learning_rate</code>: using a smaller learning rate and increasing the number of iterations may improve accuracy.</li>
<li><code>num_leaves</code>: increasing the number of leaves increases accuracy with a high risk of overfitting.</li>
</ul>
<p>A great overview of both XGBoost and LightGBM parameters, their effect on various aspects of the algorithms and how they relate to each other is available <a href="https://sites.google.com/view/lauraepp/parameters">here</a>.</p>
<h2 id="resources">Resources</h2>
<ol>
<li>LightGBM project: <a href="https://github.com/Microsoft/LightGBM">https://github.com/Microsoft/LightGBM</a></li>
<li>LightGBM paper: <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf">https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf</a></li>
<li>Documentation: <a href="https://lightgbm.readthedocs.io/en/latest/index.html">https://lightgbm.readthedocs.io/en/latest/index.html</a></li>
<li>Parameters: <a href="https://lightgbm.readthedocs.io/en/latest/Parameters.html">https://lightgbm.readthedocs.io/en/latest/Parameters.html</a></li>
<li>Parameter explorer: <a href="https://sites.google.com/view/lauraepp/parameters">https://sites.google.com/view/lauraepp/parameters</a></li>
</ol>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Encoding Cyclical Features for Deep Learning]]></title><description><![CDATA[A key concern when dealing with cyclical features is how we can encode the values such that it is clear to the deep learning algorithm that the features occur in cycles.

This post looks at a strategy to encode cyclical features in order to clearly express their cyclical nature.]]></description><link>https://www.avanwyk.com/encoding-cyclical-features-for-deep-learning/</link><guid isPermaLink="false">5ad0d1b950937206f0fc69bd</guid><category><![CDATA[deep learning]]></category><category><![CDATA[data science]]></category><category><![CDATA[feature engineering]]></category><dc:creator><![CDATA[Andrich van Wyk]]></dc:creator><pubDate>Fri, 13 Apr 2018 15:50:24 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h4 id="anupdatedversionofthisarticleisavailableathttpsmlengineersubstackcompencodingcyclicalfeaturesfordeeplearning"><em>An updated version of this article is available at <a href="https://mlengineer.substack.com/p/encoding-cyclical-features-for-deep-learning">https://mlengineer.substack.com/p/encoding-cyclical-features-for-deep-learning</a></em></h4>
<p>Many features commonly found in datasets are cyclical in nature. The most common of which are time attributes: months, days, weekdays, hours, minutes and seconds all occur in specific cycles. Other examples might include features such as seasonal, tidal or astrological data.</p>
<p>A key concern when dealing with cyclical features is how we can encode the values such that it is clear to the deep learning algorithm that the features occur in cycles. This is of particular concern in deep learning applications as it may have a significant effect on the convergence rate of the algorithm.</p>
<p>This post looks at a strategy to encode cyclical features in order to clearly express their cyclical nature.</p>
<p>A complete example of using the encoding on weather data, which includes illustrating the effect on a three layer deep neural network, is available as a <a href="https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning">Kaggle Kernel</a>.</p>
<h2 id="theproblemwithcyclicaldata">The Problem with Cyclical Data</h2>
<p>The data used below is hourly weather data for the city of Montreal. A complete description of the data is available <a href="https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning/data">here</a>. We will be looking at the <code>hour</code> attribute of the datetime feature to illustrate the problem with cyclical features.</p>
<pre><code data-language="python"># `data` is a pandas DataFrame with a parsed datetime column
data[&apos;hour&apos;] = data.datetime.dt.hour
sample = data[:168] # the first week of hourly data
ax = sample[&apos;hour&apos;].plot()
</code></pre>
<p><img src="https://www.avanwyk.com/content/images/2018/04/hour-unencoded.png" alt="hour-unencoded" loading="lazy"></p>
<p>Here we can see exactly what we would expect from an hour value for a week: a cycle from 0 to 23, repeating 7 times.</p>
<p>This graph illustrates the problem with presenting cyclical data to a deep learning algorithm: there are jump discontinuities in the graph at the end of each day when the hour value overflows to 0.</p>
<p>From 22:00 to 23:00 one hour has passed, which is adequately represented by the unencoded values: the absolute difference between 22 and 23 is 1. Between 23:00 and 00:00, however, the jump discontinuity occurs: even though only one hour has passed, the absolute difference in the unencoded feature is 23.</p>
<p>The same will occur for seconds at the end of each minute, for days at the end of each year and so forth.</p>
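<p>The discontinuity is easy to see numerically. A small sketch using two days of hourly values:</p>
<pre><code data-language="python">import numpy as np

hours = np.arange(48) % 24  # two days of hourly timestamps
diffs = np.abs(np.diff(hours))
# Consecutive hours are 1 apart, except at the midnight rollover,
# where the difference in the unencoded feature jumps to 23.
print(diffs.max())
</code></pre>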
<h2 id="encodingcyclicalfeatures">Encoding Cyclical Features</h2>
<p>One method for encoding a cyclical feature is to perform a sine and cosine transformation of the feature:<br>
$$x_{sin} = \sin{(\frac{2 * \pi * x}{\max(x)})}$$<br>
$$x_{cos} = \cos{(\frac{2 * \pi * x}{\max(x)})}$$</p>
<pre><code data-language="python">import numpy as np

data[&apos;hour_sin&apos;] = np.sin(2 * np.pi * data[&apos;hour&apos;]/24.0)
data[&apos;hour_cos&apos;] = np.cos(2 * np.pi * data[&apos;hour&apos;]/24.0)
</code></pre>
<p>Plotting this feature we now end up with a new feature that is cyclical, based on the sine graph:</p>
<pre><code data-language="python">
sample = data[:168]
ax = sample[&apos;hour_sin&apos;].plot()
</code></pre>
<p><img src="https://www.avanwyk.com/content/images/2018/04/hour-encoded-sin.png" alt="hour-encoded-sin" loading="lazy"></p>
<p>If we only use the sine encoding we would still have an issue, as two separate timestamps will have the same sine encoding within one cycle (24 hours in our case), as the graph is symmetrical around the turning points. This is why we also perform the cosine transformation, which is phase offset from sine, and leads to unique values within a cycle in two dimensions.</p>
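<p>To illustrate, take 03:00 and 09:00 (a pair chosen for their symmetry around 06:00): their sine encodings coincide, but their cosine encodings differ:</p>
<pre><code data-language="python">import numpy as np

def encode_hour(hour):
    angle = 2 * np.pi * hour / 24.0
    return np.sin(angle), np.cos(angle)

sin3, cos3 = encode_hour(3)
sin9, cos9 = encode_hour(9)

print(np.isclose(sin3, sin9))  # True: the sine values coincide
print(np.isclose(cos3, cos9))  # False: cosine disambiguates the pair
</code></pre>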
<p>Indeed, if we plot the two features against each other, we end up with a perfect circle:</p>
<pre><code data-language="python">ax = sample.plot.scatter(&apos;hour_sin&apos;, &apos;hour_cos&apos;)
ax.set_aspect(&apos;equal&apos;)
</code></pre>
<p><img src="https://www.avanwyk.com/content/images/2018/04/hour-encoded-two-dims.png" alt="hour-encoded-two-dims" loading="lazy"><br>
The features can now be used by our deep learning algorithm. As an added benefit, they are scaled to the range [-1, 1], which further aids the convergence of the network. A comparison of the effect of the encoding on a simple deep learning model is given in the <a href="https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning#Learning-from-Encoded-Data">Kaggle Kernel</a>.</p>
<h2 id="summary">Summary</h2>
<p>Other machine learning algorithms might be more robust towards raw cyclical features, particularly tree-based approaches. However, deep neural networks stand to benefit from the sine and cosine transformation of such features, particularly in terms of aiding the convergence speed of the network.</p>
<h2 id="furtherreading">Further Reading</h2>
<ol>
<li><a href="https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time">https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time</a></li>
<li><a href="https://datascience.stackexchange.com/questions/5990/what-is-a-good-way-to-transform-cyclic-ordinal-attributes">https://datascience.stackexchange.com/questions/5990/what-is-a-good-way-to-transform-cyclic-ordinal-attributes</a></li>
<li><a href="https://stats.stackexchange.com/questions/126230/optimal-construction-of-day-feature-in-neural-networks">https://stats.stackexchange.com/questions/126230/optimal-construction-of-day-feature-in-neural-networks</a></li>
</ol>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>