Langfuse
Langfuse is an open source LLM engineering platform which helps teams collaboratively debug, analyze and iterate on their LLM applications. Its core features are observability (tracing), prompt management (versioning), evaluations (scores) and datasets (testing).
Langfuse allows users to score individual executions or traces. Users can customize the scores and scales they use. As such, UpTrain’s evaluations can easily be integrated into Langfuse.
Scores can be used in a variety of ways in Langfuse:
- Data: Attach scores to executions and traces and view them in the Langfuse UI
- Filter: Group executions or traces by scores, e.g. to filter for traces with a low-quality score
- Fine Tuning: Filter and export by scores as CSV or JSONL for fine-tuning
- Analytics: Detailed score reporting and dashboards with drill downs into use cases and user segments
In this guide, we will walk you through using Langfuse to create traces and evaluate them using UpTrain.
How to integrate?
Setup
Enter your Langfuse API keys and OpenAI API key
You need to sign up for Langfuse and fetch your Langfuse API keys from your project's settings. You also need an OpenAI API key.
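For example, the keys can be supplied as environment variables. The variable names below are the ones the Langfuse SDK reads by default; the values are placeholders to replace with your own keys:

```python
import os

# Placeholders only -- replace with your own keys.
# The Langfuse keys are in your Langfuse project's settings page.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # EU region

# OpenAI key, used by UpTrain's LLM-based evaluators.
os.environ["OPENAI_API_KEY"] = "sk-..."
```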
Let’s create some sample data
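A minimal hypothetical dataset in the shape UpTrain expects, where each row holds the question, the retrieved context, and the model's response:

```python
# Hypothetical RAG-style sample data: question, retrieved context, response.
data = [
    {
        "question": "What is the primary function of mitochondria?",
        "context": "Mitochondria are organelles that generate most of the "
                   "cell's supply of ATP, used as a source of chemical energy.",
        "response": "Mitochondria produce ATP, the cell's main energy currency.",
    },
    {
        "question": "Who wrote the play Hamlet?",
        "context": "Hamlet is a tragedy written by William Shakespeare "
                   "sometime between 1599 and 1601.",
        "response": "Hamlet was written by William Shakespeare.",
    },
]
```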
Run Evaluations using UpTrain Open-Source Software (OSS)
We have used the following 3 metrics from UpTrain’s library:
- Context Relevance: Evaluates how relevant the retrieved context is to the specified question.
- Factual Accuracy: Evaluates whether the generated response is factually correct and grounded in the provided context.
- Response Completeness: Evaluates whether the response answers all aspects of the specified question.
You can look at the complete list of UpTrain’s supported metrics here.
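A sketch of running these three checks with UpTrain's `EvalLLM` evaluator. The `data` row here is hypothetical, and the call is guarded because the evaluators invoke OpenAI under the hood:

```python
import os

# The three checks used in this guide, by their UpTrain enum names.
metrics = ["CONTEXT_RELEVANCE", "FACTUAL_ACCURACY", "RESPONSE_COMPLETENESS"]

# Hypothetical single-row dataset.
data = [
    {
        "question": "Who wrote the play Hamlet?",
        "context": "Hamlet is a tragedy written by William Shakespeare.",
        "response": "Hamlet was written by William Shakespeare.",
    }
]

# Only import and run when an OpenAI key is configured.
if os.getenv("OPENAI_API_KEY"):
    from uptrain import EvalLLM, Evals

    eval_llm = EvalLLM(openai_api_key=os.environ["OPENAI_API_KEY"])
    results = eval_llm.evaluate(
        data=data,
        checks=[getattr(Evals, name) for name in metrics],
    )
    # Each result row gains keys such as "score_context_relevance".
    print(results)
```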
Using Langfuse
You can use Langfuse in 2 ways:
- Score each Trace: This means you will run the evaluations for each trace item. This gives you a much better idea of how each call to your UpTrain pipelines is performing, but can be expensive.
- Score as Batch: In this method we take a random sample of traces on a periodic basis and score them. This brings down cost and gives you a rough estimate of the performance of your app, but can miss important samples.
Method 1: Score each Trace
Now let’s initialize the Langfuse client SDK to instrument your app
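With the environment variables from the setup step in place, the client can be initialized without arguments. The sketch is guarded so it is a no-op when credentials are missing:

```python
import os

# The Langfuse client reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
# and LANGFUSE_HOST from the environment.
configured = bool(
    os.getenv("LANGFUSE_PUBLIC_KEY") and os.getenv("LANGFUSE_SECRET_KEY")
)

if configured:
    from langfuse import Langfuse

    langfuse = Langfuse()
    # Optional: verify that the credentials are valid.
    langfuse.auth_check()
```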
Let’s create a trace for the dataset
Let’s add the scores to the trace in Langfuse
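Both steps (one trace per dataset row, then one Langfuse score per UpTrain metric) can be sketched as follows. The row, the trace name, and the score values are hypothetical; in UpTrain's output each metric is prefixed with `score_`:

```python
import os

# Hypothetical dataset row and the UpTrain scores computed for it.
row = {
    "question": "Who wrote Hamlet?",
    "response": "Hamlet was written by William Shakespeare.",
}
scores = {
    "context_relevance": 1.0,
    "factual_accuracy": 1.0,
    "response_completeness": 0.8,
}

if os.getenv("LANGFUSE_SECRET_KEY"):
    from langfuse import Langfuse

    langfuse = Langfuse()

    # Step 1: create one trace per dataset row.
    trace = langfuse.trace(
        name="uptrain-evaluation",  # hypothetical trace name
        input=row["question"],
        output=row["response"],
    )

    # Step 2: attach each UpTrain metric to the trace as a score.
    for name, value in scores.items():
        trace.score(name=name, value=value)

    # Events are sent asynchronously; flush before the script exits.
    langfuse.flush()
```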
Method 2: Score as Batch
Let’s create traces with our original dataset
Retrieve the uploaded dataset
Now let’s make a batch and score it using UpTrain.
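A sketch of the batch flow, assuming the traces were logged with the question as `input` and the response as `output`. `fetch_traces` is the v2 Python SDK's paginated trace-listing call; the sample size, the reshaping of traces, and the choice of check are assumptions for illustration:

```python
import os
import random

SAMPLE_SIZE = 10  # hypothetical: how many traces to score per run

if os.getenv("LANGFUSE_SECRET_KEY") and os.getenv("OPENAI_API_KEY"):
    from langfuse import Langfuse
    from uptrain import EvalLLM, Evals

    langfuse = Langfuse()

    # Retrieve recent traces and draw a random sample from them.
    traces = langfuse.fetch_traces(limit=100).data
    batch = random.sample(traces, min(SAMPLE_SIZE, len(traces)))

    # Reshape the sampled traces into UpTrain's expected input.
    data = [{"question": t.input, "response": t.output} for t in batch]

    eval_llm = EvalLLM(openai_api_key=os.environ["OPENAI_API_KEY"])
    results = eval_llm.evaluate(
        data=data,
        checks=[Evals.RESPONSE_COMPLETENESS],
    )

    # Write each score back to its trace by id.
    for trace, result in zip(batch, results):
        langfuse.score(
            trace_id=trace.id,
            name="response_completeness",
            value=result["score_response_completeness"],
        )
    langfuse.flush()
```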
You can visualize these results on Langfuse: