HW1: Transfer Learning for the Birds
Goals
We spent Week 2 learning about Transfer Learning, including both its positive aspects (the "Astounding Baseline" paper) and its limitations ("Does ImageNet transfer to real-world data?").
In this HW1, you'll apply Transfer Learning to a real dataset, and wrestle with several questions:
Problem 1: For a specific target classification task of interest, would we rather have a source model trained on a "generic" dataset like ImageNet1k, or a smaller dataset related to our target task?
Problem 2: What are the tradeoffs between fine-tuning just the last layer (aka "linear probing") and fine-tuning a few more layers? Can we compose these to do better?
Starter Code and Provided Data
You can find the starter code and our provided "BirdSnap-10" dataset in the course GitHub repository here:
Two ways to run the experiments
Option 1: Google Colab cloud-based notebook environment
To get started quickly on Colab:
- Follow this link:
- Use "Ctrl-S" or similar to copy that notebook into your Google Drive to save progress
Here's a quick video demo of how to set up Colab.
Option 2: Your own environment
You can find .yml files specifying the required Python packages in the repo.
You'll be responsible for getting things working yourself.
Background
A wildlife conservation organization has reached out for help to develop an automated bird species classifier. Their goal is to classify 10 specific bird species critical to biodiversity conservation. Unfortunately, large datasets for these birds are difficult to acquire.
You have been provided:
- a `train` dataset, to be used for all model development (training and validation)
- a `test` dataset, to be used only to evaluate model generalization
You can obtain these images by unpacking `birdsnap10_224x224only.zip`.
These images come from the BirdSnap dataset, a public dataset released by Berg et al. in 2014 (see the paper release page).
You will explore 4 possible pretrained models provided by pytorchcv, which differ across two key axes:
- Model architecture
- ResNet-10, with ~5M parameters
- ResNet-26, with ~17M parameters
- Source Dataset
- ImageNet-1k, a large and diverse dataset containing over 1 million images from 1000 classes
- CUB-200-2011, specialized to birds, containing about 11,000 images of 200 bird species
We have tried our best to construct this task so that your target BirdSnap-10 dataset has no class overlap with the species in CUB-200-2011, and also no overlap with classes in ImageNet-1k (which does contain several bird classes).
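For orientation, here is a minimal sketch of how pretrained backbones can be loaded via pytorchcv's get_model helper. The starter code already handles this for you, and the CUB-pretrained model names shown in the comments are assumptions you should verify against pytorchcv's model list.

```python
# Minimal sketch of loading pretrained backbones with pytorchcv.
# The starter code handles this already; this is only for orientation.
from pytorchcv.model_provider import get_model as ptcv_get_model

# ImageNet-1k pretrained ResNet backbones
resnet10_in1k = ptcv_get_model("resnet10", pretrained=True)
resnet26_in1k = ptcv_get_model("resnet26", pretrained=True)

# Hypothetical names for the CUB-pretrained variants (verify against pytorchcv's model list):
# resnet10_cub = ptcv_get_model("resnet10_cub", pretrained=True)
# resnet26_cub = ptcv_get_model("resnet26_cub", pretrained=True)
```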
Problem 1: Should Source Models Specialize or Generalize?
Your goal in Problem 1 is to compare the 4 possible source models (each combination of 2 source datasets x 2 architectures) using last-layer fine-tuning.
Tasks for Code Implementation
Open models.py and examine the PTNetForBirdSnap10 class, which defines a PyTorch neural net module that combines a pretrained backbone with a simple linear classification head for the 10-class BirdSnap data.
CODE 1(i): Edit the setup_trainable_params method so that, given a desired number of layers n as an int, the last n layers are set to trainable (accepting gradient updates in the PyTorch computation graph) and all other parameters are frozen (not accepting gradient updates). Hint: you'll need to edit the boolean property requires_grad of each parameter tensor.
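As a rough illustration of the requires_grad pattern (not the exact structure of PTNetForBirdSnap10, whose layer grouping may differ), a hedged sketch might look like this:

```python
# Hedged sketch of freezing/unfreezing parameters via requires_grad.
# Assumption: the model's top-level child modules appear in depth order,
# so "last n layers" means the last n entries of model.children().
import torch.nn as nn

def setup_trainable_params(model: nn.Module, n_trainable_layers: int) -> None:
    # Freeze everything first: no gradients will be computed for these parameters.
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze only the parameters belonging to the last n layers.
    layers = list(model.children())
    for layer in layers[-n_trainable_layers:]:
        for param in layer.parameters():
            param.requires_grad = True
```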
Now open train.py, which defines a function for performing training on our target BirdSnap data. This function works whether we are updating just the last layer or several layers.
CODE 1(ii): Edit the train_model method to compute the cross-entropy loss in two places: first, for the current batch of training data, inside the tr_loader loop; second, for the current batch of validation data, inside the va_loader loop.
The way you take averages differs a bit between the two (see the sketch after this list):
- Train: you want a per-example average over the current batch, as a fast yet unbiased estimator for computing the loss/gradients.
- Valid: you want the true per-example average over the full validation dataset.
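Here is a minimal sketch of the two averaging conventions, written as standalone helpers; the model and data loaders are assumed to be the ones provided by the starter code, and the helper names are just illustrative.

```python
# Hedged sketch of the two cross-entropy averaging conventions (helper names are illustrative).
import torch
import torch.nn.functional as F

def batch_train_loss(model, x_batch, y_batch):
    """Per-example mean over the CURRENT batch: a fast, unbiased estimate used for gradients."""
    logits = model(x_batch)
    return F.cross_entropy(logits, y_batch, reduction='mean')

def full_valid_loss(model, va_loader):
    """True per-example mean over the FULL validation set."""
    total_loss, total_examples = 0.0, 0
    model.eval()
    with torch.no_grad():
        for x_batch, y_batch in va_loader:
            logits = model(x_batch)
            # Accumulate a per-batch SUM so unequal batch sizes are weighted correctly.
            total_loss += F.cross_entropy(logits, y_batch, reduction='sum').item()
            total_examples += y_batch.shape[0]
    return total_loss / total_examples
```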
Next, implement strategies to avoid overfitting, like L2 penalization of weight magnitudes for the last layer.
CODE 1(iii): Edit train_model to add an L2 penalty on the weights only (not the biases) of the last layer (the "classification head"). For our provided model, you can use the dict model.trainable_params to access a tensor by its name as the key. The last-layer weights are named 'output.weight'. Hint: to compute the L2 magnitude, think sum of squares.
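A minimal sketch of the penalty term, assuming model.trainable_params behaves as described above (the helper name and hyperparameter name are illustrative):

```python
# Hedged sketch of an L2 penalty on the classification-head weights only.
import torch

def l2_penalty_on_head(model, l2pen):
    """Sum-of-squares penalty on the last-layer weights (biases are excluded)."""
    # Assumption: model.trainable_params maps parameter names to tensors.
    head_weights = model.trainable_params['output.weight']
    return l2pen * torch.sum(head_weights ** 2)

# Inside the training loop, the total training loss would then be roughly:
#   loss = batch_cross_entropy + l2_penalty_on_head(model, l2pen)
```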
CODE 1(iv): Edit train_model to add early-stopping functionality. We track the number of consecutive epochs that the validation cross-entropy loss gets worse. Once this count exceeds a threshold, we revert the model to its previous state (the one that gave the best validation-set cross-entropy) and return that model.
Finally, write code to evaluate test set accuracy.
CODE 1(v): Edit eval_acc (defined in the body of hw1.ipynb) to measure a model's accuracy (defined as the fraction of examples that are correctly predicted, from 0.0 to 1.0, higher is better) on the provided test set.
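A minimal sketch of such an accuracy function, assuming a loader that yields (images, labels) batches:

```python
# Hedged sketch of test-set accuracy evaluation.
import torch

def eval_acc(model, data_loader):
    """Fraction of correctly predicted examples, between 0.0 and 1.0 (higher is better)."""
    model.eval()
    n_correct, n_total = 0, 0
    with torch.no_grad():
        for x_batch, y_batch in data_loader:
            logits = model(x_batch)
            preds = torch.argmax(logits, dim=1)     # predicted class = index of the largest logit
            n_correct += (preds == y_batch).sum().item()
            n_total += y_batch.shape[0]
    return n_correct / n_total
```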
Tasks for Experiment Execution
Now, step through the provided notebook hw1.ipynb to complete the following:
EXPERIMENT 1(i): First, do last-layer fine-tuning of ResNet10 using the ImageNet1k pretrained model, on the available training/validation sets. Use the provided train/valid data loaders (don't change batch_size or other settings). Monitor train and validation metrics, and make your own plots of these metrics as needed. Find a suitable learning rate and L2-regularization strength to minimize over/under-fitting.
EXPERIMENT 1(ii): Repeat step 1(i) above for the other combinations of architecture and source dataset. Using the code's train/validation metrics tracked over epochs, find a setting of hyperparameters (n_epochs, lr, l2 penalty, seed) that seems to deliver reasonable heldout performance without too much over/under-fitting.
We recommend saving intermediate results to a .pkl file or similar (see hw1.ipynb), so it is easy to plot later without redoing experiments.
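For example, a minimal save/load pattern with pickle might look like the sketch below (the file name and dictionary contents are just placeholders):

```python
# Minimal sketch of caching intermediate results to disk so plots can be remade cheaply.
import pickle

results = {'arch': 'resnet10', 'src': 'imagenet1k', 'va_loss_per_epoch': [2.1, 1.4, 0.9]}

with open('problem1_results.pkl', 'wb') as f:   # placeholder file name
    pickle.dump(results, f)

with open('problem1_results.pkl', 'rb') as f:
    results = pickle.load(f)
```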
Tasks for Report Writing
In your submitted report, include the following:
FIGURE 1(a): Plot loss-vs-epoch for your best runs of (ResNet10, ImageNet1k) and (ResNet10, CUB). Use the style provided in the Figure1a block in hw1.ipynb. Include this figure in your report, with a caption that summarizes the major takeaway messages (did you see major overfitting? was strong L2-penalization needed? was early stopping beneficial?).
FIGURE 1(b): Make a target-task-accuracy vs. source-task-accuracy plot, like the main figure in the Fang et al. paper from day04. Use the style provided in the Figure1b block in hw1.ipynb. Include this figure in your report, with a caption that summarizes the major takeaway messages of your results (which source dataset is better for our target task? which architecture is better?). Try to reason about why, given your knowledge from the readings.
Problem 2: LP then FT
To keep things simple, we'll fix ('ResNet10', 'ImageNet1k') for the arch and source dataset throughout Problem 2. Be sure you're only using this configuration.
Your goal in Problem 2 is to implement the LP-then-FT method of Kumar et al. (from our day04 readings). That is, you'll do:
First stage: LP (call train.train_model with n_trainable_layers=1).
- You can reuse the best hyperparameters from Problem 1 above.
Second stage: FT (call train.train_model with n_trainable_layers=3). Be sure to initialize from the model produced by stage one. You'll have 3 trainable layers (not just 1), so much more flexibility but also more potential to overfit.
- You'll need to tune lr / l2penalty / n_epochs to be sure you are fitting reasonably.
Tasks for Code Implementation
CODE 2(i): Edit your hw1.ipynb notebook to implement two-phase training. In the first phase, again set n_trainable_layers=1 and use exactly the lr/l2penalty/seed combinations that worked well in Problem 1. In the second phase, you'll want to consider different settings of lr/l2penalty.
To make things quick, you can run the first phase just once, yielding a good LPmodel, and then tune the hyperparameters of the second phase using copy.deepcopy(LPmodel) to get a copy of the model to train for each hyperparameter config, while leaving the original LPmodel unchanged.
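A hedged sketch of this two-phase pattern is below; the train.train_model keyword arguments, return values, and candidate hyperparameter values shown here are assumptions based on the text above, not the exact starter-code signature.

```python
# Hedged sketch of LP-then-FT with hyperparameter tuning in the second phase.
# Assumption: train.train_model accepts these keyword args and returns (model, best_va_loss).
import copy
import train  # course-provided module

# Phase 1: linear probing (LP), run once with the hyperparameters found in Problem 1.
LPmodel, _ = train.train_model(model, n_trainable_layers=1, lr=0.01, l2pen=0.001, n_epochs=50)

# Phase 2: fine-tuning (FT) the last 3 layers, trying several candidate settings.
best_va_loss, best_model = float('inf'), None
for lr in [1e-3, 1e-4]:
    for l2pen in [0.0, 1e-3, 1e-2]:
        candidate = copy.deepcopy(LPmodel)          # fresh copy; LPmodel itself stays unchanged
        candidate, va_loss = train.train_model(candidate, n_trainable_layers=3,
                                               lr=lr, l2pen=l2pen, n_epochs=30)
        if va_loss < best_va_loss:
            best_va_loss, best_model = va_loss, candidate
```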
Tasks for Experiment Execution
EXPERIMENT 2(i): Run a small-scale hyperparameter search (either manually or systematically), aiming to find a configuration of lr/l2penalty for the second phase that delivers the best possible validation performance. Don't spend more than about an hour.
EXPERIMENT 2(ii): Compute the test-set accuracy for both the phase 1 and phase 2 "best" models, using eval_acc.
Tasks for Report Writing
In your submitted report, include the following:
FIGURE 2(a): Plot loss/error-vs-epoch curves in two panels (left = LP phase, right = FT phase), using the style provided in the Figure2a block in hw1.ipynb. Aim to show the best run from experiment 2(i), where ideally there is obvious continuity between the LP and FT phases when reading across the plot (e.g., validation loss doesn't immediately jump away from the values seen at the end of the LP phase). Include this figure in your report, with a caption that summarizes the major takeaway messages: was your implementation successful?
SHORT ANSWER 2(b): Report the final test-set accuracy for both LP and LP-then-FT. Reflect on any differences.
Problem 3: Conceptual Questions
Short Answer 3a
Provide a math formula for computing the complete loss used to train models here, including the cross-entropy and the L2-penalty terms.
Notation:
- B : total number of examples in the current batch, indexed by i
- C : total number of classes for the target task
- y_i : int indicator of the class label for example i
- z_i : vector of logits for example i
- W : matrix of weight parameters of the last layer
- b : vector of bias parameters of the last layer
You can only use basic math functions (log, sum, exp). Be sure to clearly define the assumed size of each vector/matrix, using the actual values in the code you implemented in Problem 1.