Overview: This example shows how to use UpTrain to verify that your LLM responses are adequate before you use them in downstream tasks. Validation is performed by a list of defined checks. If the LLM’s response fails a check, UpTrain retries until the model returns a valid one. We use a Q&A task to illustrate.

Why is validation needed: LLMs are powerful, but they are not 100% reliable. Downstream tasks often require the LLM response in a particular structure, and the response sometimes deviates from that format, which breaks whatever consumes it. LLMs can also hallucinate, and we don’t want to show those results to our users. Hence, we run validation checks on the LLM responses, catch the ones that fail, and retry the LLM. This process repeats until the LLM output passes all the validation checks.

Problem: The workflow of our hypothetical Q&A application is as follows:

  • User enters a question.
  • The query is converted to an embedding, and relevant sections from the documentation are retrieved using nearest-neighbour search.
  • The original query and the retrieved sections are passed to a language model (LM), along with a custom prompt to generate a response.
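
The three steps above can be sketched in a few lines. This is a toy illustration, not the application's actual code: the word-count "embedding", the two sample documents, and the helper names (`embed`, `retrieve`, `build_prompt`) are all assumptions made for the example, and a real application would use an embedding model and a vector index instead.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, sections: Dict[str, str], k: int = 1) -> List[str]:
    """Return the titles of the k sections most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        sections,
        key=lambda title: cosine_similarity(query_vec, embed(sections[title])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, title: str, text: str) -> str:
    """Combine the query and the retrieved section into one prompt."""
    return f"Document: {title}\n{text}\n\nQuestion: {query}\nAnswer:"


# Made-up documentation sections, for illustration only.
docs = {
    "Caching": "use the cache decorator to cache expensive function calls",
    "Layout": "use columns and containers to arrange widgets on the page",
}
query = "how do I cache a function"
best = retrieve(query, docs)[0]
prompt = build_prompt(query, best, docs[best])
print(best)  # Caching
```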

Solution: We will illustrate how to use the UpTrain Validation Framework to validate the performance of the chatbot. We will use a dataset built from logs generated by a chatbot made to answer questions from the Streamlit user documentation.

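Validation Logic: We check whether the LLM response for a given query is empty, that is, whether the model answered with the "<EMPTY MESSAGE>" marker that the prompt asks for when no section is relevant. If it is, we return a default message instead of the LLM response.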

Install UpTrain with all dependencies

pip install uptrain
uptrain-add --feature full

Make sure to define openai_api_key

import os
import openai
import polars as pl
import json

This notebook uses the OpenAI API to generate text for prompts. Make sure the environment variable is populated with your API key.

os.environ["OPENAI_API_KEY"] = "..."

Let’s first define our prompt and model

We have designed a prompt template to take in a question and a document and extract the relevant sections from it.

prompt_template = """
    You are a developer assistant that can only quote text from documents. 
    You will be given a section of technical documentation titled {document_title}.
    The input is: '{question}?'. 

    Your task is to answer the question by quoting exactly all sections of the document that are relevant to any topics of the input. 
    Copy the text exactly as found in the original document. 
    Okay, here is the document:
    --- START: Document ---
    {document_text}
    -- END: Document ---
    Now do the task. If there are no relevant sections, just respond with \"<EMPTY MESSAGE>\".
    Here is the answer:
"""

Let’s now load our dataset and see how that looks

url = "https://oodles-dev-training-data.s3.us-west-1.amazonaws.com/qna-streamlit-docs.jsonl"
dataset_path = os.path.join("datasets", "qna-notebook-data.jsonl")

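if not os.path.exists(dataset_path):
    import httpx

    # Download the dataset once and cache it locally.
    os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
    r = httpx.get(url)
    r.raise_for_status()
    with open(dataset_path, "wb") as f:
        f.write(r.content)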

dataset = pl.read_ndjson(dataset_path).select(
    pl.col(["question", "document_title", "document_text"])
)
print("Number of test cases: ", len(dataset))
print("Couple of samples: ", dataset[0:2])
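
For readers unfamiliar with the JSON Lines format used by the dataset file, here is a small sketch using only the standard library. The two records are made up for illustration. Only the three field names come from the code above.

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"question": "How do I cache data?", "document_title": "Caching", "document_text": "..."}\n'
    '{"question": "How do I add a sidebar?", "document_title": "Layout", "document_text": "..."}\n'
)

wanted = ("question", "document_title", "document_text")
rows = []
for line in sample_jsonl:
    if not line.strip():
        continue  # skip blank lines
    record = json.loads(line)
    # Keep only the columns the notebook selects.
    rows.append({key: record[key] for key in wanted})

print(len(rows), rows[0]["document_title"])  # 2 Caching
```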


Let’s now get responses from our LLM by defining our completion function. We are using GPT-3.5-Turbo for the same.

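def get_model_response(input_dict):
    # Fill the template with the row's question, document_title and document_text.
    prompt = [{"role": "system", "content": prompt_template.format(**input_dict)}]
    # openai.ChatCompletion is the pre-1.0 interface of the openai package.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=prompt, temperature=0.1
    )
    message = response.choices[0]["message"]["content"]
    return message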
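
The `openai.ChatCompletion` interface used here was removed in version 1.0 of the `openai` package. If you are on a newer version, the same request might look like the sketch below. The function name `get_model_response_v1` and the choice to pass the client and template as arguments are assumptions made for this example, not part of the original notebook.

```python
def get_model_response_v1(client, prompt_template: str, input_dict: dict) -> str:
    """Same request as get_model_response, written for the openai>=1.0 client."""
    messages = [{"role": "system", "content": prompt_template.format(**input_dict)}]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0.1
    )
    return response.choices[0].message.content


# Usage (needs openai>=1.0 installed and OPENAI_API_KEY set):
# from openai import OpenAI
# answer = get_model_response_v1(OpenAI(), prompt_template, dataset.to_dicts()[0])
```

Passing the client in keeps the function easy to exercise with a stub object, without making network calls.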

Now that we have completed the setup, let’s try out a few examples to see how they look.

samples = dataset.to_dicts()
# Collect the question and the model's answer for rows 0, 1 and 5.
responses = [
    {
        "input_question": samples[i]["question"],
        "llm_response": get_model_response(samples[i]),
    }
    for i in (0, 1, 5)
]
print(json.dumps(responses, indent=2))


As we can see, our model returns empty responses in certain cases. Let’s see how we can use the UpTrain Validation Framework to detect them and retry the LLM whenever that happens.

Using Validation Framework to check for empty responses

Defining the Validation Checks

Let’s define a Check to evaluate whether the model response is empty. We use the pre-built TextComparison operator for this. Running it on our input data creates a new variable called ‘is_empty_response’.

from uptrain.framework import Check
from uptrain.operators import TextComparison

check = Check(
            reference_texts="<EMPTY MESSAGE>",

Defining the passing condition

Our pass condition is “any response that is not empty”. UpTrain provides a wrapper called Signal that lets us express the pass condition using operators such as ~, &, | and +.

from uptrain.framework import Signal

pass_condition = ~Signal("is_empty_response")
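
To make the operator syntax less mysterious, here is one way such a wrapper could be built with Python operator overloading. This is a hypothetical sketch and not UpTrain's implementation: `MiniSignal` and the `is_too_short` column are invented for the example, and only `~`, `&` and `|` are covered.

```python
from typing import Any, Callable, Mapping

Row = Mapping[str, Any]


class MiniSignal:
    """Hypothetical stand-in for a signal: a predicate over a row of check outputs."""

    def __init__(self, evaluate: Callable[[Row], bool]):
        self._evaluate = evaluate

    @classmethod
    def column(cls, name: str) -> "MiniSignal":
        # Read a boolean column from the row.
        return cls(lambda row: bool(row[name]))

    def __invert__(self) -> "MiniSignal":  # ~signal
        return MiniSignal(lambda row: not self._evaluate(row))

    def __and__(self, other: "MiniSignal") -> "MiniSignal":  # a & b
        return MiniSignal(lambda row: self._evaluate(row) and other._evaluate(row))

    def __or__(self, other: "MiniSignal") -> "MiniSignal":  # a | b
        return MiniSignal(lambda row: self._evaluate(row) or other._evaluate(row))

    def __call__(self, row: Row) -> bool:
        return self._evaluate(row)


is_empty = MiniSignal.column("is_empty_response")
passes = ~is_empty

print(passes({"is_empty_response": False}))  # True
print(passes({"is_empty_response": True}))   # False

# Signals compose: retry when the response is empty OR (hypothetically) too short.
needs_retry = is_empty | MiniSignal.column("is_too_short")
print(needs_retry({"is_empty_response": False, "is_too_short": True}))  # True
```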

Defining the retry logic

Let’s define the retry logic, which dictates how to generate LLM responses when validation fails. It can be any Python function, for example one that modifies the prompt or the temperature, triggers a tool, or returns a default response.

def model_response_when_empty(input_dict):
    return f"We couldn't find a good enough answer for the given question: {input_dict['question']}. Please try asking a different question"

# Call 'model_response_when_empty' when response is empty
        "name": "default_output_when_response_is_empty",
        "signal": Signal("is_empty_response"),
        "completion_function": model_response_when_empty

Tying everything together

UpTrain provides a ValidationManager class to which we pass the Check, the completion_function and the pass_condition. Instead of calling the completion_function directly, we call validation_manager. Under the hood, it computes the check and evaluates the pass condition; if the condition is not met, it retries until the LLM response passes.

from validation_wrapper import ValidationManager

validation_manager = ValidationManager(

Let’s run our example

Finally, let’s run it on a few rows from our input dataset.

for inputs in dataset.to_dicts()[:20]:
    validated_response = validation_manager.run(inputs)
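
The behaviour described for the validation manager (compute the check, test the pass condition, fall back to the retry function) can be approximated in plain Python. This is a rough sketch under assumptions and not the actual ValidationManager: `run_with_validation`, `is_empty_response` and `fake_completion` are hypothetical names, and the stand-in model simply returns canned text.

```python
from typing import Callable

EMPTY_SENTINEL = "<EMPTY MESSAGE>"


def is_empty_response(response: str) -> bool:
    """Check: did the model answer with the empty-message marker?"""
    return response.strip() == EMPTY_SENTINEL


def model_response_when_empty(input_dict: dict) -> str:
    return (
        "We couldn't find a good enough answer for the given question: "
        f"{input_dict['question']}. Please try asking a different question"
    )


def run_with_validation(
    inputs: dict,
    completion_function: Callable[[dict], str],
    fallback_function: Callable[[dict], str],
) -> str:
    """Return the model response if it passes the check, else the fallback."""
    response = completion_function(inputs)
    if not is_empty_response(response):  # pass condition: response is not empty
        return response
    return fallback_function(inputs)


def fake_completion(inputs: dict) -> str:
    # Stand-in for the LLM: only "knows" about caching.
    if "cache" in inputs["question"].lower():
        return "Quoted section about caching."
    return EMPTY_SENTINEL


for question in ("How do I cache data?", "What is the weather?"):
    print(run_with_validation({"question": question}, fake_completion, model_response_when_empty))
```

The first question returns the model's text unchanged, while the second triggers the default message.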