Cassava Leaf Disease Classification — EDA and Baseline Model

Yangyin Ke
Feb 23, 2021


Work conducted by: Enmin Zhou, Yangyin Ke, Huaqi Nie

Introduction

This blog documents our team’s work in the Kaggle competition Cassava Leaf Disease Classification. We are building Convolutional Neural Networks (CNNs) to diagnose leaf diseases from their images. These diseases include Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), and Cassava Mosaic Disease (CMD). Because cassava plays a significant role in African agriculture, we hope to help African farmers diagnose these diseases effectively, prevent their further spread, and improve their profits.

Cassava leaf image samples from the Kaggle dataset

Exploratory Data Analysis

Label Distribution

In this project, we are working with more than 20,000 images. There are five labels in total: four disease labels (CBB, CBSD, CGM, and CMD) and one healthy label. Our exploratory data analysis investigates the distribution of each class. From the graph below, we can see that the training labels are quite imbalanced, with more than 12,000 CMD images but fewer than 2,000 CBB images. In other words, more than 60% of the training labels are CMD, while only around 5% are CBB. At this point we are unsure how the class imbalance will influence the accuracy of our model, so we plan to keep the dataset as-is and train our baseline model on it. We will modify the dataset if we find it necessary in a later phase.

Label distribution histogram
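The distribution above can be reproduced with a few lines of pandas. This is a minimal sketch assuming the competition’s train.csv, whose numeric labels follow label_num_to_disease_map.json:

```python
import pandas as pd

# Label ids follow the competition's label_num_to_disease_map.json.
label_names = {0: "CBB", 1: "CBSD", 2: "CGM", 3: "CMD", 4: "Healthy"}

train = pd.read_csv("train.csv")
counts = train["label"].map(label_names).value_counts()

print(counts)                     # absolute counts per class
print(counts / len(train) * 100)  # share of each class in percent
```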

Image Resize

The images downloaded from Kaggle are 800×600 pixels. To train CNN models more efficiently, we downsample the images and resize them to 512×512. Even though the resolution is reduced, we do not expect a large impact on model accuracy, because these diseases usually come with obvious color discrepancies that can still be detected at a relatively low resolution. More complicated image preprocessing will be done in the next step if needed.

Raw images (top) and resized images (bottom)
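For reference, the downsampling step can be written as a short TensorFlow function. This is a sketch assuming JPEG inputs and the default bilinear interpolation of tf.image.resize:

```python
import tensorflow as tf

def decode_and_resize(image_bytes, target_size=(512, 512)):
    """Decode a JPEG and downsample it to the target resolution."""
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.image.resize(image, target_size)  # bilinear by default
    return image / 255.0                         # scale pixels to [0, 1]
```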

Baseline Model

Data Loader

Our data loader is based on Amy MiHyun Jang’s notebook, which gives clear instructions on how to train CNN models from .tfrec files. It includes the key functions for loading the dataset and reading TFRecords, and the dataset generated from the .tfrec files can be used directly in model training.
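Below is a condensed sketch of that pipeline. The feature keys ("image" and "target") follow the competition’s TFRecord schema as used in the notebook, and decode_and_resize is the resize helper sketched earlier:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def read_tfrecord(example):
    # Feature keys ("image", "target") follow the competition's TFRecords.
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(example, features)
    image = decode_and_resize(parsed["image"])  # helper from the resize sketch
    return image, parsed["target"]

def load_dataset(filenames, batch_size=32):
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(read_tfrecord, num_parallel_calls=AUTOTUNE)
    return dataset.shuffle(2048).batch(batch_size).prefetch(AUTOTUNE)
```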

Model Architecture

The baseline model is built by transfer learning from VGG16 with ImageNet pretrained weights. The outputs of VGG16 are flattened before being fed into three fully connected layers with 512, 256, and 128 nodes. All of them use ReLU as the activation function to avoid vanishing gradients, and each is followed by dropout with a rate of 0.25 to reduce overfitting. A final dense layer, whose number of nodes equals the number of classes, is added at the end. Since there are 5 possible output classes, softmax is used as the final activation function and categorical cross entropy is chosen as the loss function. Additionally, we use Adam as the optimizer and accuracy as the evaluation metric.

Baseline model architecture
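In Keras, the architecture described above looks roughly like the following. Freezing the convolutional base is our assumption here, and with integer labels sparse categorical cross entropy would replace the one-hot variant:

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

# VGG16 backbone with ImageNet weights; include_top=False drops its
# original classifier head. Freezing the base is an assumption.
base = VGG16(weights="imagenet", include_top=False, input_shape=(512, 512, 3))
base.trainable = False

x = layers.Flatten()(base.output)
for units in (512, 256, 128):
    x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(0.25)(x)
outputs = layers.Dense(5, activation="softmax")(x)  # one node per class

model = Model(base.input, outputs)
model.compile(optimizer="adam",
              # expects one-hot labels; use sparse_categorical_crossentropy
              # if the loader yields integer labels
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```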

Training History

From the training history plot below, we can see that the performance of our baseline model still needs improvement. As more epochs are run, the training accuracy increases and the training loss decreases, but the validation accuracy stays almost flat and the validation loss increases significantly. In other words, our baseline model suffers from overfitting, which we plan to address in the next phase by adjusting hyper-parameters and applying an early-stopping strategy.

Model training history from Tensorboard.dev
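As a preview of the early-stopping strategy we plan to apply, here is a minimal Keras sketch; the patience value and the train_dataset/valid_dataset names are illustrative assumptions:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss stops improving; patience=3 is illustrative.
early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)

# train_dataset / valid_dataset would come from load_dataset above.
history = model.fit(train_dataset,
                    validation_data=valid_dataset,
                    epochs=20,
                    callbacks=[early_stop])
```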

Kaggle Entry

Baseline Kaggle entry

Next Steps

1. Advanced image preprocessing

Our current image preprocessing only resizes images from 800×600 to 512×512. However, feeding these images into the model as-is can exhaust RAM when the number of epochs is large. We also have not yet accounted for image issues such as opacity and optical noise. We plan to spend some time on this area to see how it improves our model performance.

2. Handle dataset imbalance

The labels in the raw dataset are quite imbalanced, which may hurt the performance of our model. We are going to investigate further and figure out a proper way to handle this imbalance; a minimal class-weighting sketch follows this list.

3. Adjust neural network architecture

The neural network we have now is relatively basic. We are thinking about adjusting certain hyper-parameters and training strategies. On top of that, we will try combining other models besides VGG16 through transfer learning in the next phase.
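As mentioned in step 2, one simple option for the imbalance is class weighting. The sketch below weights each class inversely to its frequency; it is an assumption about one approach we may try, not a final design:

```python
import numpy as np
import pandas as pd

# Weight each class inversely to its frequency, so rare classes (e.g. CBB)
# contribute more to the loss than the dominant CMD class.
labels = pd.read_csv("train.csv")["label"].values
classes, counts = np.unique(labels, return_counts=True)
class_weight = {int(c): len(labels) / (len(classes) * n)
                for c, n in zip(classes, counts)}

# Passed to Keras training as: model.fit(..., class_weight=class_weight)
```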

References

  1. https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/keras_recipes/ipynb/tfrecord.ipynb#scrollTo=2zC_rEbiOJZD
  2. https://pub.towardsai.net/supervised-contrastive-learning-for-cassava-leaf-disease-classification-9dd47779a966
  3. https://medium.com/analytics-vidhya/bangla-character-recognition-system-the-deep-learning-way-1-n-8671a33a7860
  4. https://towardsdatascience.com/bengali-ai-handwritten-grapheme-classification-adaa73762754
  5. https://neurohive.io/en/popular-networks/vgg16/
