In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


<center style="font-size:18px;"><h1><strong>Introduction:</strong> <font color="red"><strong><em>Early Stopping</em></strong></font> in regularized iterative machine learning algorithms using Gradient Descent.</h1></center>

<div style="font-size:16px; border:1px solid black; padding:10px">
    <center><h1>Machine Learning Training Issues.</h1></center>
<ul>
    <li>Too little training leaves the model underfit, so it performs poorly on both the training and test datasets.</li><br>
    <li>Too much training has the opposite effect: the model overfits the training dataset, which results in poor performance on the test set.</li><br>
    <li>A compromise is to stop the training early, when the performance on a validation dataset starts to degrade.</li><br>
    <li>In other words, early stopping is a method to stop training your model once its performance on the validation data stops improving.</li><br>
    <li>A common metric used to evaluate the performance of a model is the Root Mean Square Error (RMSE).</li><br>
    <li>This approach is commonly used with complex machine learning models, such as Neural Networks.</li><br>
    <li>The benefit of early stopping is that it helps prevent overfitting and improves your model's ability to generalize to new data.</li><br>
</ul>
</div>
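
The RMSE mentioned above can be computed in a few lines. This is a minimal sketch, not part of the original notebook: the helper name `rmse` and the sample values are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical targets and predictions, for illustration only
example_rmse = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
perfect_rmse = rmse([1.0, 2.0], [1.0, 2.0])
print(example_rmse, perfect_rmse)
```

Because the errors are squared before averaging, large mistakes weigh more heavily than small ones, and the result is in the same units as the target.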

<div style="font-size:16px; border:1px solid black; padding:10px">
    <center><h1>Principles of Early Stopping.</h1></center>
<ul>
    <li>Early stopping evaluates the model on the validation set repeatedly during training, typically after every epoch, and then selects the set of parameter values that performed best on the validation set (lowest RMSE).</li><br>
    <li>An <strong>Epoch</strong> is one complete pass over the training dataset; the model's parameter values are updated during each epoch, so every epoch ends with a different set of parameter values.</li><br>
    <li>Early stopping is like a for-loop over a number of epochs, where each epoch iterates over the batches of training samples and updates the model.</li><br>
    <li>The maximum number of epochs is often set large, allowing the learning algorithm to run until the model's error has been sufficiently minimized.</li><br>
    <li>Learning curves are often used to monitor this process, with the number of epochs on the x-axis (as time) and the error (for example, RMSE) on the y-axis.</li><br>
    <li>Two curves are plotted: one for the training set and one for the validation set.</li><br>
    <li>The training set curve shows you how well the model fits the training set, in terms of error, as the number of epochs increases.</li><br>
    <li>The validation set curve shows how well the model generalizes to new data as the number of epochs increases.</li><br>
    <li>These plots help diagnose whether the model has overfit, underfit, or is suitably fit to the training dataset.</li><br>
    <li>This process is illustrated in the learning curve plotted below.</li><br>
    <li>Early stopping can be described in three steps:
        <ol>
            <li>Monitoring model performance</li>
            <li>Trigger to stop training</li>
            <li>Model Selection</li>
        </ol>
    </li><br>
    <li>A downside of early stopping is the added computational cost, especially for large models trained on large datasets: the model must be evaluated on the validation set after every epoch, and a copy of the best model so far has to be kept while later candidates are trained and discarded.</li><br>
</ul>
</div>
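
The three steps listed above (monitoring, trigger, model selection) can be sketched on a plain list of validation errors. This is only an illustration under stated assumptions: the function name `early_stopping_scan`, the `patience` and `min_delta` parameters, and the toy error values are hypothetical and do not appear in the original notebook, which instead trains for a fixed number of epochs and keeps the best one.

```python
def early_stopping_scan(val_errors, patience=5, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    stop_epoch is None when the trigger never fires.
    """
    best_error = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    stop_epoch = None

    for epoch, error in enumerate(val_errors):
        # Step 1: monitor the validation error after every epoch
        if error < best_error - min_delta:
            # Step 3: model selection, remember the best epoch seen so far
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1

        # Step 2: trigger, stop after `patience` epochs without improvement
        if epochs_without_improvement >= patience:
            stop_epoch = epoch
            break

    return best_epoch, stop_epoch


# Toy validation curve: improves until epoch 4, then degrades
toy_val_errors = [3.0, 2.0, 1.5, 1.2, 1.1, 1.15, 1.2, 1.3, 1.4, 1.6]
best, stopped = early_stopping_scan(toy_val_errors, patience=3)
print(best, stopped)
```

A larger `patience` tolerates noisy validation curves at the cost of extra epochs; in a real training loop the model weights would be saved whenever `best_epoch` is updated.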

<hr style="border-top: 3px solid Black;">

<div style="font-size:16px; border:1px solid black; padding:10px">
    <center><h1>Post Goal</h1>
     </center><br>
    <center style="font-size:20px;">Demonstrate how to carry out early stopping techniques using Stochastic Gradient Descent (scikit-learn's <code>SGDRegressor</code>) as an example.</center>
</div>

<hr style="border-top: 5px solid RED;">

<h1>Import Dependencies</h1>

In [2]:
%matplotlib inline
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# machine learning
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.base import clone

# Math and dataframe modules
import numpy as np

# to make this notebook's output stable across runs
np.random.seed(42)

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Formats plots and uses seaborn theme
sns.set_theme()  # the 'seaborn' style name was deprecated in Matplotlib 3.6 and later removed
plt.rc('font', size=14)
plt.rc('figure', titlesize=18)
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=18)

<hr style="border-top: 3px solid Black;">
<h1>Data</h1>
<ul>
    <li>Randomly generated using numpy</li>
    <li><code>m</code> is the number of samples</li>
    <li><code>np.random.rand(m, 1)</code> is a 2D array of shape (m, 1) with values drawn uniformly from [0, 1)</li>
    <li><code><strong>X</strong> = 6 * np.random.rand(m, 1) - 3</code>: each value is multiplied by 6 and then 3 is subtracted, so the domain of X is [-3, 3)</li>
    <li><strong>y</strong> is a quadratic (second-degree polynomial) function of X with added noise, <code>y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)</code>, where:
        <ul>
            <li><code>np.random.randn(m, 1)</code> is a 2D array of Gaussian noise (mean 0, standard deviation 1)</li>
            <li><code>0.5 * X**2</code> is half the square of each X value (between 0 and 4.5)</li>
            <li><code>X</code> is the linear term (between -3 and 3)</li>
            <li>A constant of 2 is added.</li>
        </ul>
    </li>
</ul>    
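
As a quick sanity check of the ranges described above, the sketch below regenerates data with the same formulas and inspects it. The variable names (`rng`, `X_demo`, `y_demo`, `noise`) are illustrative, and a local `RandomState` is used here so the global seed of the notebook is not affected.

```python
import numpy as np

rng = np.random.RandomState(42)  # local generator, independent of np.random.seed
m = 100

X_demo = 6 * rng.rand(m, 1) - 3          # uniform values in [-3, 3)
noise = rng.randn(m, 1)                  # Gaussian noise, mean 0 and std 1
y_demo = 2 + X_demo + 0.5 * X_demo**2 + noise

print("X range:", X_demo.min(), X_demo.max())
print("shapes:", X_demo.shape, y_demo.shape)

# Without noise the parabola 2 + x + 0.5 * x**2 has its minimum, 1.5, at x = -1
noise_free = y_demo - noise
print("noise-free minimum:", noise_free.min())
```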

<h2>Generate data</h2>

In [3]:
np.random.seed(42)
m = 100 # size of array
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

<h2>Split data into Equal Sized Training and Validation Sets</h2>

In [4]:
X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

<h1>Carry out Early Stopping Methods using Polynomial Regression and Gradient Descent</h1>

In [None]:
poly_scaler = Pipeline([
        ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
        ("std_scaler", StandardScaler())
    ])

# transform data
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

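# instantiate gradient descent
# tol=None disables the stopping criterion, so each fit() call runs exactly one epoch
# (np.infty was removed in NumPy 2.0, and recent scikit-learn releases reject a negative tol)
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)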

minimum_val_error = float("inf")
best_epoch = None
best_model = None

from copy import deepcopy  # clone() copies only hyperparameters, not the learned weights

# loop over a range of epochs
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off because warm_start=True
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)

    # keep the model with the lowest validation error seen so far
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)  # snapshot of the fitted model at this epoch

<div style="font-size:16px; border:1px solid black; padding:10px">
    <center><strong>Comments:</strong></center>
<ul> 
    <li>When the fit method is called on the SGDRegressor with <code>warm_start=True</code>, then the method continues to train where it left off instead of restarting from scratch.</li><br>
</ul>
</div>
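
The effect of <code>warm_start=True</code> can be checked on a small synthetic problem. The sketch below is illustrative only: the data, the helper `make_sgd`, and the hyperparameter values are assumptions chosen for the demonstration. It also shows why a fitted snapshot should be taken with `copy.deepcopy`, since `sklearn.base.clone` returns a new, unfitted estimator with the same hyperparameters.

```python
from copy import deepcopy

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDRegressor

# Small synthetic regression problem (illustrative data only)
rng = np.random.RandomState(0)
X_demo = rng.rand(40, 1)
y_demo = 3.0 * X_demo.ravel() + 1.0


def make_sgd(warm_start):
    # One epoch per fit() call; tol=None disables the stopping criterion
    return SGDRegressor(max_iter=1, tol=None, warm_start=warm_start,
                        learning_rate="constant", eta0=0.01, random_state=42)


# warm_start=True: the second fit() continues from the first fit's weights
warm = make_sgd(warm_start=True)
warm.fit(X_demo, y_demo)
warm_first = warm.coef_.copy()
warm.fit(X_demo, y_demo)
warm_second = warm.coef_.copy()

# warm_start=False: every fit() restarts from scratch and repeats the same epoch
cold = make_sgd(warm_start=False)
cold.fit(X_demo, y_demo)
cold_first = cold.coef_.copy()
cold.fit(X_demo, y_demo)
cold_second = cold.coef_.copy()

print("warm start changed weights:", not np.allclose(warm_first, warm_second))
print("cold start repeated weights:", np.allclose(cold_first, cold_second))

# Saving the model: deepcopy keeps the learned weights, clone does not
snapshot = deepcopy(warm)
unfitted = clone(warm)
print(hasattr(snapshot, "coef_"), hasattr(unfitted, "coef_"))
```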

<h1>Plot Learning Curves to Diagnose Early Stopping</h1>

In [None]:
# tol=None disables the stopping criterion, so each fit() call runs exactly one epoch
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)

n_epochs = 500  # number of epochs
train_errors, val_errors = [], []  # lists that collect the MSE after each epoch
for epoch in range(n_epochs):  # train one epoch at a time, recording the errors after each
    sgd_reg.fit(X_train_poly_scaled, y_train)  # one more epoch on the scaled polynomial training features
    y_train_predict = sgd_reg.predict(X_train_poly_scaled)  # predictions on the training set
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)  # predictions on the validation set
    train_errors.append(mean_squared_error(y_train, y_train_predict))  # store the training MSE
    val_errors.append(mean_squared_error(y_val, y_val_predict))  # store the validation MSE

best_epoch = np.argmin(val_errors)  # index of the epoch with the lowest validation MSE
best_val_rmse = np.sqrt(val_errors[best_epoch])  # validation RMSE at the best epoch
# this value will be used to plot the point at which you have the best model performance

# annotates the plot with an arrow to indicate the best performance on validation data.
plt.annotate(f'Best model\n at epoch = {best_epoch}',
             xy=(best_epoch, best_val_rmse),
             xytext=(best_epoch, best_val_rmse + 1),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.05),
             fontsize=16,
            )

# this will plot a horizontal line to help improve the graph
best_val_rmse -= 0.03  # just to make the graph look better
plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)

# this will plot the validation and training set
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")

# this will plot the legend, and axis labels
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("RMSE", fontsize=14)
import os
os.makedirs("images", exist_ok=True)  # make sure the output folder exists before saving
plt.savefig("images/earlystopping.png", bbox_inches='tight')
plt.show()

<div style="font-size:16px; border:1px solid black; padding:10px">
    <center><strong>Comments:</strong></center>
<ul> 
    <li>Epoch between 0 - 250
        <ul>
            <li><strong>The model's performance improves as the number of epochs increases</strong></li>
            <li>The slopes of both the Validation and Training curves are negative, and the error (RMSE) decreases from 3 to less than 1 for the Training set, and to between 1 and 1.5 for the Validation set.</li>
            <li>Recall that the RMSE is the square root of the mean squared difference between the actual values and the predicted values.</li>
        </ul>
    </li><br>
    <li>Epoch between 200 - 300
        <ul>
            <li><font color="red"><strong>The model's performance on the validation set is optimal within this range</strong></font></li> 
            <li>The error (RMSE) reaches its minimum between 200 - 250 epochs. The optimal model should be selected here. After 239 epochs, the error for the Validation Set begins to rise and the performance degrades.</li>
            <li>This means that the ability to 'predict' begins to decline for new data.</li>
            <li>The performance on the training set continues to improve as the RMSE continues to drop.</li>
            <li>This means that the model is starting to overfit: it is learning 'everything' about the training data and will not generalize well to new data.</li>
        </ul>
    </li><br>    
    <li>Epoch between 300 - 500
        <ul>
            <li><strong>The model's performance on the validation set is no longer optimal in this range</strong></li> 
            <li>The performance on the validation data degrades as the RMSE increases.</li>
            <li>The performance on the training set continues to improve as the RMSE decreases, which also means the model is overfitting.</li>
            <li>The model no longer generalizes as well in this range.</li>            
        </ul>
    </li><br>    
    <li>This learning curve serves as an example of how to identify the best model.</li><br>
</ul>
</div>

<hr style="border-top: 5px solid red;">

<div style="font-size:16px; border:1px solid black; padding:10px">
    <center><h1>Final thoughts</h1></center>
<ul>
    <li>In this post, I discussed a method for stopping the training of a model early, before it overfits the training dataset, in order to improve generalization.</li><br>
    <li>In this example I used Stochastic Gradient Descent, but the same process can be used for other iteratively trained models, such as neural networks.</li><br>
    <li>Learning curves are a great visualization technique to monitor training and select the best model.</li><br> 
    <li>The use of early stopping requires the selection of a performance measure to monitor, a trigger for stopping training, and a selection of the model weights to use.</li><br>
</ul>
</div>
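
The three choices named above (a performance measure, a trigger, and the weights to keep) also appear in scikit-learn's built-in early stopping for <code>SGDRegressor</code>. The sketch below is a variation on the manual loop used in this post, not the method used above: the data and the parameter values are illustrative assumptions. With `early_stopping=True`, the estimator sets aside `validation_fraction` of the training data and stops when the validation score has not improved by at least `tol` for `n_iter_no_change` consecutive epochs.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data generated with the same formulas as in this post (larger sample, for illustration)
rng = np.random.RandomState(42)
X_demo = 6 * rng.rand(200, 1) - 3
y_demo = (2 + X_demo + 0.5 * X_demo**2 + rng.randn(200, 1)).ravel()

max_epochs = 1000
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(
        max_iter=max_epochs,       # upper bound on the number of epochs
        tol=1e-3,                  # minimum improvement that counts as progress
        early_stopping=True,       # hold out part of the training data for validation
        validation_fraction=0.2,   # size of the held-out validation split
        n_iter_no_change=10,       # trigger: epochs without improvement before stopping
        random_state=42,
    ),
)
model.fit(X_demo, y_demo)

epochs_run = model.named_steps["sgdregressor"].n_iter_
print("epochs actually run:", epochs_run)
```

The manual loop remains useful when you want to plot the full learning curves or control exactly which snapshot of the model is kept.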

<hr style="border-top: 3px solid Black;">