In run.ipynb, we consider a binary classification task: predicting a person's orientation while exercising. Given the locations of 17 key-points of the body (nose, shoulders, wrists, hips, ankles, etc.), the model predicts whether the person is in a horizontal (see image 1 below) or a vertical (see image 2 below) position.

Input: 34-dimensional vector that contains the x and y positions of the 17 key-points.
Output: Orientation (horizontal or vertical)
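To make the input format concrete, here is a minimal sketch of how 17 (x, y) key-points could be flattened into the 34-dimensional vector described above. The helper name `flatten_keypoints` and the interleaved `[x0, y0, x1, y1, ...]` layout are assumptions for illustration; the notebook may order the values differently.

```python
NUM_KEYPOINTS = 17  # number of body key-points stated in the text


def flatten_keypoints(keypoints):
    """Turn 17 (x, y) pairs into one 34-element feature vector."""
    if len(keypoints) != NUM_KEYPOINTS:
        raise ValueError(f"expected {NUM_KEYPOINTS} key-points, got {len(keypoints)}")
    # Assumed layout: [x0, y0, x1, y1, ...]
    return [float(coord) for point in keypoints for coord in point]


# Made-up coordinates, for illustration only
example = [(i * 10, i * 5) for i in range(NUM_KEYPOINTS)]
features = flatten_keypoints(example)
print(len(features))
```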

Image 1: horizontal class. Image 2: vertical class.

In this example, we will see how to use the UpTrain package to identify data drift and out-of-distribution cases in real-world data.

Data Type Structure

Let’s look at the training data features and visualise some of the training samples. Here, id is the training sample id, gt is the corresponding ground truth, and the rest of the features are the corresponding locations of the key-points of a human body.


Visualizing some training samples for classifying human orientation


This example walks through the following steps for monitoring and retraining your model:

Step 1: Train the Deep Neural Network model

We define a simple neural network consisting of a fully-connected layer with ReLU activation, followed by a second fully-connected layer that maps the latent features to the model outputs. We compute the binary cross-entropy loss and train the model with the Adam optimiser.

Note: We use PyTorch in this example, but in other examples, such as edge-case detection, we have also run UpTrain with Scikit-learn and TensorFlow.
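The architecture and loss described above can be sketched in plain Python, without PyTorch, to show what one forward pass computes. This is an illustration only and not the notebook's code: the hidden width of 8, the random weights, and treating label 1.0 as one of the two classes are all assumptions.

```python
import math
import random


def linear(weights, bias, v):
    # weights holds one row per output unit
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(p, y, eps=1e-12):
    # Clamp the probability so that log() never sees 0
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def forward(params, x):
    # fully-connected -> ReLU -> fully-connected -> sigmoid
    hidden = relu(linear(params["w1"], params["b1"], x))
    logit = linear(params["w2"], params["b2"], hidden)[0]
    return sigmoid(logit)


random.seed(0)
IN_DIM, HIDDEN_DIM = 34, 8  # 34 inputs from the text; hidden width is assumed
params = {
    "w1": [[random.uniform(-0.1, 0.1) for _ in range(IN_DIM)] for _ in range(HIDDEN_DIM)],
    "b1": [0.0] * HIDDEN_DIM,
    "w2": [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN_DIM)]],
    "b2": [0.0],
}
x = [random.random() for _ in range(IN_DIM)]  # made-up key-point vector
prob = forward(params, x)
loss = binary_cross_entropy(prob, 1.0)
print(prob, loss)
```

During training, an optimiser such as Adam would adjust the weights to reduce this loss over the training set.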


With the first version of this model, we observe an accuracy of 90.9% on the golden testing dataset. We will now see how to use the UpTrain package to identify data distribution shifts, collect edge cases and retrain the model to improve its accuracy.

Step 2: Define the list of checks to perform on model

In this example, we define a simple data drift check to identify any distribution shift between the real-world test set and the reference dataset (the training dataset in this case). To achieve this, we set ‘kps’ (keypoints) as the input variable. The framework then clusters the training dataset and checks whether the real-world test set follows a similar distribution.

checks = [{
    'type': uptrain.Monitor.DATA_DRIFT,
    'reference_dataset': orig_training_file,
    'is_embedding': True,
    "measurable_args": {
        'type': uptrain.MeasurableType.INPUT_FEATURE,
        'feature_name': 'kps'  # keypoints
    }
}]

Here, type refers to the anomaly type, which is data drift in this case. The reference_dataset is the training dataset, while is_embedding indicates whether the data on which drift is measured is in vector/embedding form. Finally, measurable_args defines the input features (or any function of them) on which the drift is measured.
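One way to picture a clustering-based drift check is to compare how reference and production points distribute over the reference clusters. The sketch below does that with a total-variation distance on 2-D toy points. It is not UpTrain's internal algorithm: the function names, the distance measure and the threshold are all assumptions for illustration.

```python
import math


def nearest_centroid(point, centroids):
    # Index of the centroid closest to the point (Euclidean distance)
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def cluster_histogram(points, centroids):
    # Fraction of points assigned to each cluster
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest_centroid(p, centroids)] += 1
    return [c / len(points) for c in counts]


def total_variation(p, q):
    # 0.0 means identical distributions, 1.0 means disjoint ones
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy 2-D data; the real inputs are 34-dimensional key-point vectors
centroids = [(0.0, 0.0), (10.0, 10.0)]
reference = [(0.1, 0.2), (0.3, -0.1), (-0.2, 0.0), (9.8, 10.1)]
production = [(0.0, 0.1), (10.2, 9.9), (9.7, 10.3), (10.1, 10.0)]

ref_hist = cluster_histogram(reference, centroids)
prod_hist = cluster_histogram(production, centroids)
drift_score = total_variation(ref_hist, prod_hist)

DRIFT_THRESHOLD = 0.3  # hypothetical cut-off
drift_detected = drift_score > DRIFT_THRESHOLD
print(ref_hist, prod_hist, drift_score, drift_detected)
```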

Step 3: Define the training and evaluation arguments

We now attach the model training and evaluation pipelines so that the UpTrain framework can automatically retrain the model when it detects significant data drift.

# Define the training pipeline to annotate collected edge cases and retrain the model automatically
training_args = {
    "annotation_method": {"method": uptrain.AnnotationMethod.MASTER_FILE, "args": annotation_args},
    "training_func": train_model_torch,
    "orig_training_file": orig_training_file,
}

# Define evaluation pipeline to test retrained model against original model
evaluation_args = {
    "inference_func": get_accuracy_torch,
    "golden_testing_dataset": golden_testing_file,
}
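To illustrate what testing a retrained model against the original one involves, here is a minimal sketch of an accuracy comparison on a shared golden dataset. The signature of `get_accuracy_torch` is not shown in this example, so the `accuracy` helper, the toy models and the data below are hypothetical stand-ins.

```python
def accuracy(predict, samples):
    """Fraction of samples whose predicted label matches the ground truth."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


def old_model(features):
    # Toy stand-in for the original model
    return int(features[0] > 1.5)


def new_model(features):
    # Toy stand-in for the retrained model
    return int(features[0] > 0.5)


# Made-up golden dataset of (features, ground-truth label) pairs
golden = [([0.0], 0), ([1.0], 1), ([2.0], 1), ([3.0], 1)]

old_acc = accuracy(old_model, golden)
new_acc = accuracy(new_model, golden)
retrained_is_better = new_acc > old_acc
print(old_acc, new_acc, retrained_is_better)
```

Scoring both versions on the same held-out data keeps the comparison fair, since neither model is evaluated on the edge cases it was retrained on.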

Step 4: Define the UpTrain Config

We are now ready to define the UpTrain config as follows:

cfg = {
    "checks": checks,
    "training_args": training_args,
    "evaluation_args": evaluation_args,

    # Retrain when 200 datapoints are collected in the retraining dataset
    "retrain_after": 200,
    # A local folder to store the retraining dataset
    "retraining_folder": "uptrain_smart_data__data_drift",
    # A function to visualize clusters in the data
    "cluster_visualize_func": plot_all_cluster,
}
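The effect of a `retrain_after` threshold can be pictured as a buffer that accumulates flagged data-points and signals once enough have been collected. The `RetrainBuffer` class below is a hypothetical sketch, not UpTrain's implementation, and whether the real comparison is `>=` or `>` is an assumption.

```python
class RetrainBuffer:
    """Collects flagged data-points and reports when retraining is due."""

    def __init__(self, retrain_after):
        self.retrain_after = retrain_after
        self.collected = []

    def add(self, batch):
        # Store the new points; return True once the threshold is reached
        self.collected.extend(batch)
        return len(self.collected) >= self.retrain_after


buffer = RetrainBuffer(retrain_after=200)
# Made-up batch sizes of flagged points arriving over time
signals = [buffer.add(["edge_case"] * n) for n in (51, 49, 50, 51)]
print(signals, len(buffer.collected))
```

Because points arrive in batches, the buffer can overshoot the threshold slightly before retraining starts.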

Step 5: Deploy the model in production

Ship the model to production worry-free: UpTrain will identify any data drift, collect interesting data points and automatically retrain the model on them. To mimic deployment behavior, we run the model on a ‘real-world test set’ and log the model inputs with the UpTrain framework. The following is the pseudo-code.

# Load the trained model

for i, x_test in enumerate(real_world_dataset):
    # Do model prediction
    preds = model(x_test)

    # Log model inputs and outputs to the uptrain Framework to monitor data drift
    idens = framework.log(inputs=x_test, outputs=preds)

Automated model retraining performance

After UpTrain launched an automated retraining of the model on the points that caused the data drift, the error rate dropped roughly 19x, from about 9.1% to about 0.5%.

Old model accuracy:  90.9%
Retrained model accuracy (ie 201 smartly collected data-points added):  99.5%

This is what the sample logs look like:

    51  edge cases identified out of  11840  total samples
    100  edge cases identified out of  13360  total samples
    150  edge cases identified out of  14864  total samples
    201  edge cases identified out of  21632  total samples
    Kicking off re-training
    Creating retraining dataset: uptrain_smart_data/1/training_dataset.json  by merging  data/training_data.json  and collected edge cases.
    Model retraining done...
    Generating comparison report...
    Evaluating model: version_0  on  15731  data-points
    Evaluating model: version_1  on  15731  data-points
    Old model accuracy:  0.9092873943169538
    Retrained model accuracy (ie 201 smartly collected data-points added):  0.9952323437797979
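As a quick check of the improvement, the error-rate reduction can be computed directly from the two accuracies printed in the logs. The variable names are illustrative.

```python
# Accuracies copied from the log output
old_accuracy = 0.9092873943169538
new_accuracy = 0.9952323437797979

old_error = 1.0 - old_accuracy  # about 9.1%
new_error = 1.0 - new_accuracy  # about 0.5%
reduction_factor = old_error / new_error  # roughly 19

print(f"error rate: {old_error:.4f} -> {new_error:.4f} ({reduction_factor:.1f}x lower)")
```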

Hurray! Our model after retraining performs significantly better.

Let’s try to understand how UpTrain helped to improve our classification model.

Training data clusters

While initializing, the UpTrain framework clusters the reference dataset (i.e. the training dataset in our case). We plot the centroids and support (i.e. the number of data-points belonging to each cluster) of all 20 clusters in our training dataset.

Figure: training_data_clusters
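To illustrate what centroids and support mean, here is a small k-means sketch on 2-D toy points. The clustering algorithm UpTrain uses is not specified in this example, so k-means and the deterministic farthest-point seeding are assumptions chosen for clarity. The toy data uses 2 clusters rather than 20.

```python
import math


def farthest_point_init(points, k):
    # Deterministic seeding: start at the first point, then repeatedly
    # add the point farthest from the centroids chosen so far
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def assign(points, centroids):
    # Group each point with its nearest centroid
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[idx].append(p)
    return clusters


def kmeans(points, k, iterations=10):
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its members (keep it if empty)
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    # Support = number of points assigned to each final centroid
    support = [len(members) for members in assign(points, centroids)]
    return centroids, support


# Made-up data: a dense group near the origin and a sparse one far away
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (0.3, -0.2), (0.1, 0.2), (-0.2, -0.1),
          (10.0, 10.0), (10.2, 9.8)]
centroids, support = kmeans(points, k=2)
print(centroids, support)
```

A cluster with low support marks a region that is sparsely covered by the training data, which is where out-of-distribution production points are most likely to land.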

Edge cases clusters

As the plot shows, the UpTrain framework identifies out-of-distribution data-points and collects the edge cases that are sparsely represented in the training dataset.

Figure: edge_case_clusters

From the above plot, generated while monitoring the model in production, we see that data drift occurs in many cases where the person is in a horizontal position. Specifically, cases where the person is in a push-up position are very sparse in our training dataset, causing the model's predictions to go wrong for them. In the edge-case detection example, we will see how to use this insight to define a “Pushup” signal, collect all push-up related data-points and specifically retrain on them.

Do more with UpTrain

Apart from data drift, UpTrain has many other features, such as:

  1. Checking for edge cases and collecting them for automated retraining,
  2. Verifying data integrity,
  3. Monitoring model performance and accuracy of predictions with standard statistical tools,
  4. Writing your own custom monitors specific to your use-case, etc.

To dig deeper, we recommend checking out the other examples in the folder “deepdive_examples”.