Langfuse is an open-source LLM engineering platform that helps teams collaboratively debug, analyze, and iterate on their LLM applications. Its core features are observability (tracing), prompt management (versioning), evaluations (scores), and datasets (testing).
Langfuse allows users to score individual executions or traces. Users can customize the scores and scales they use. As such, UpTrain’s evaluations can easily be integrated into Langfuse.
Scores can be used in a variety of ways in Langfuse:
- Data: Attach scores to executions and traces and view them in the Langfuse UI
- Filter: Group executions or traces by score, e.g. to filter for traces with a low-quality score
- Fine Tuning: Filter and export by score as CSV or JSONL for fine-tuning
- Analytics: Detailed score reporting and dashboards with drill downs into use cases and user segments
In this guide, we will walk you through using Langfuse to create traces and evaluate them using UpTrain.
How to integrate?
Setup
Enter your Langfuse API keys and OpenAI API key
You need to sign up for Langfuse and fetch your Langfuse API keys from your project’s settings. You also need an OpenAI API key.
%pip install langfuse datasets uptrain litellm openai --upgrade
import os
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["OPENAI_API_KEY"] = ""
Let’s create some sample data
data = [
    {
        "question": "What are the symptoms of a heart attack?",
        "context": "A heart attack, or myocardial infarction, occurs when the blood supply to the heart muscle is blocked. Chest pain is a good symptom of heart attack, though there are many others.",
        "response": "Symptoms of a heart attack may include chest pain or discomfort, shortness of breath, nausea, lightheadedness, and pain or discomfort in one or both arms, the jaw, neck, or back."
    },
    {
        "question": "Can stress cause physical health problems?",
        "context": "Stress is the body's response to challenges or threats. Yes, chronic stress can contribute to various physical health problems, including cardiovascular issues.",
        "response": "Yes, chronic stress can contribute to various physical health problems, including cardiovascular issues, and a weakened immune system."
    },
    {
        "question": "What are the symptoms of a heart attack?",
        "context": "A heart attack, or myocardial infarction, occurs when the blood supply to the heart muscle is blocked. Symptoms of a heart attack may include chest pain or discomfort, shortness of breath and nausea.",
        "response": "Heart attack symptoms are usually just indigestion and can be relieved with antacids."
    },
    {
        "question": "Can stress cause physical health problems?",
        "context": "Stress is the body's response to challenges or threats. Yes, chronic stress can contribute to various physical health problems, including cardiovascular issues.",
        "response": "Stress is not real, it is just imaginary!"
    }
]
Run Evaluations using UpTrain Open-Source Software (OSS)
We use the following three metrics from UpTrain’s library:
- Context Relevance: Evaluates how relevant the retrieved context is to the specified question.
- Factual Accuracy: Evaluates whether the generated response is factually correct and grounded in the provided context.
- Response Completeness: Evaluates whether the response answers all aspects of the specified question.
You can look at the complete list of UpTrain’s supported metrics here.
from uptrain import EvalLLM, Evals
import json
import pandas as pd
eval_llm = EvalLLM(openai_api_key=os.environ["OPENAI_API_KEY"])
res = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS]
)
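Each entry in `res` is a dict carrying the original fields plus one `score_*` key per metric — these are the keys read when reporting scores to Langfuse below. A minimal sketch of that shape, using a mocked row rather than real UpTrain output:

```python
import json

# Mocked result row -- real values come from eval_llm.evaluate above.
mock_row = {
    "question": "What are the symptoms of a heart attack?",
    "score_context_relevance": 1.0,
    "score_factual_accuracy": 1.0,
    "score_response_completeness": 1.0,
}

# Pull out just the metric scores for a quick look.
scores = {k: v for k, v in mock_row.items() if k.startswith("score_")}
print(json.dumps(scores, indent=2))
```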
Using Langfuse
You can use Langfuse in two ways:
- Score each Trace: Run the evaluations on every trace. This gives you a much better idea of how each call in your pipeline is performing, but can be expensive.
- Score as Batch: Take a random sample of traces on a periodic basis and score them. This brings down cost and gives you a rough estimate of your app’s performance, but can miss important samples.
Method 1: Score with Trace
Now let’s initialize the Langfuse client SDK to instrument your app
from langfuse import Langfuse
langfuse = Langfuse()
langfuse.auth_check()
Let’s create a trace for the first item of the dataset
question = data[0]['question']
trace = langfuse.trace(name = "uptrain trace")
context = data[0]['context']
trace.span(
    name="retrieval", input={"question": question}, output={"context": context}
)
response = data[0]['response']
trace.span(
    name="generation", input={"question": question, "context": context}, output={"response": response}
)
Let’s add the scores to the trace in Langfuse
trace.score(name='context_relevance', value=res[0]['score_context_relevance'])
trace.score(name='factual_accuracy', value=res[0]['score_factual_accuracy'])
trace.score(name='response_completeness', value=res[0]['score_response_completeness'])
Method 2: Scoring as batch
Let’s create traces for our original dataset
for interaction in data:
    trace = langfuse.trace(name="uptrain batch")
    trace.span(
        name="retrieval",
        input={"question": interaction["question"]},
        output={"context": interaction["context"]}
    )
    trace.span(
        name="generation",
        input={"question": interaction["question"], "context": interaction["context"]},
        output={"response": interaction["response"]}
    )
langfuse.flush()
Retrieve the uploaded dataset
def get_traces(name=None, limit=10000, user_id=None):
    all_data = []
    page = 1
    while True:
        response = langfuse.client.trace.list(
            name=name, page=page, user_id=user_id, order_by=None
        )
        if not response.data:
            break
        page += 1
        all_data.extend(response.data)
        if len(all_data) > limit:
            break
    return all_data[:limit]
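The pagination pattern in `get_traces` — fetch pages until the endpoint runs dry or a limit is hit — can be exercised against a stubbed list endpoint (a hypothetical stand-in for `langfuse.client.trace.list`):

```python
def paginate(fetch_page, limit=10000):
    """Collect items page by page until the endpoint is exhausted or `limit` is hit."""
    all_data, page = [], 1
    while True:
        data = fetch_page(page)
        if not data:
            break
        page += 1
        all_data.extend(data)
        if len(all_data) > limit:
            break
    return all_data[:limit]

# Stub endpoint: 25 items served in pages of 10.
items = list(range(25))
fetch = lambda page: items[(page - 1) * 10 : page * 10]
print(paginate(fetch, limit=12))  # [0, 1, ..., 11]
```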
Now let’s take a random sample of the traces and score it using UpTrain.
from random import sample
NUM_TRACES_TO_SAMPLE = 4
traces = get_traces(name="uptrain batch")
traces_sample = sample(traces, NUM_TRACES_TO_SAMPLE)
evaluation_batch = {
    "question": [],
    "context": [],
    "response": [],
    "trace_id": [],
}
for t in traces_sample:
    # Reassemble question, context, and response from each trace's spans
    observations = [langfuse.client.observations.get(o) for o in t.observations]
    for o in observations:
        if o.name == "retrieval":
            question = o.input["question"]
            context = o.output["context"]
        if o.name == "generation":
            response = o.output["response"]
    evaluation_batch["question"].append(question)
    evaluation_batch["context"].append(context)
    evaluation_batch["response"].append(response)
    evaluation_batch["trace_id"].append(t.id)
data = [dict(zip(evaluation_batch,t)) for t in zip(*evaluation_batch.values())]
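The one-liner above converts the dict-of-lists `evaluation_batch` into the list-of-dicts format that `evaluate` expects. The same idiom on a toy batch:

```python
# Dict-of-lists -> list-of-dicts, as done for evaluation_batch above.
batch = {"question": ["q1", "q2"], "response": ["r1", "r2"]}

# zip(batch, t) pairs each key with one value from each row tuple.
rows = [dict(zip(batch, t)) for t in zip(*batch.values())]
print(rows)  # [{'question': 'q1', 'response': 'r1'}, {'question': 'q2', 'response': 'r2'}]
```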
res = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS]
)
df = pd.DataFrame(res)
df["trace_id"] = [d['trace_id'] for d in data]
for _, row in df.iterrows():
    for metric_name in ["context_relevance", "factual_accuracy", "response_completeness"]:
        langfuse.score(
            name=metric_name,
            value=row["score_" + metric_name],
            trace_id=row["trace_id"]
        )
df.head()
You can visualize these results in the Langfuse UI.