Evaluations
Evaluations are the core building block of Orchestrated. They define what to test, how to test it, and how to measure success.
What is an Evaluation?
An evaluation is a structured test that:
- Loads data - Test cases from static arrays, data sources, or custom functions
- (Optional) Runs a task - Transforms inputs into outputs using your LLM or system
- Scores results - Evaluates outputs using built-in or custom scorers
- Reports findings - Displays results in the terminal, exports them to databases, or uploads them to the cloud
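Putting those four steps together, a minimal evaluation might look like the sketch below. The expected field on each test case, the custom scorer's argument shape, and its 0-to-1 return value are assumptions for illustration rather than the library's documented contract.
await Eval("Capital Cities", {
  // 1. Load data - a static array of test cases
  data: [
    { input: "What is the capital of France?", expected: "Paris" },
    { input: "What is the capital of Japan?", expected: "Tokyo" },
  ],
  // 2. Run a task - turn each input into an output
  task: async (input) => {
    return await callLLM(input) // your model or system call
  },
  // 3. Score results - a simple custom scorer (assumed signature)
  scores: [
    ({ output, expected }) => (output && output.includes(expected) ? 1 : 0),
  ],
})
// 4. Report findings - with no exporters or reporters, results are shown in the terminal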
Evaluations help you answer questions like:
- "Is my new prompt better than the old one?"
- "How well does my model follow guardrails?"
- "What's the quality of my production outputs?"
Evaluation Structure
Every evaluation is defined using the Eval() function:
await Eval("Evaluation Name", {
ctx: {}, // Shared context (optional)
data: [], // Test cases or data source
task: () => {}, // Transform function (optional)
scores: [], // Scorer functions
}, {
exporters: [], // Export results (optional)
reporters: [], // Custom reporters (optional)
})
Required Fields
- Name - Human-readable identifier for the evaluation
- data - Test cases (array) or data source function
- scores - Array of scorer functions or scorer names
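Since task is optional, the smallest useful evaluation pairs just these required pieces, for example when the test cases already carry outputs you want to score (such as logged production responses). In this sketch, the output field on each test case and the inline scorer's argument shape are assumptions for illustration:
await Eval("Logged Responses", {
  data: [
    { input: "Reset my password", output: "You can reset it from the account settings page." },
    { input: "Cancel my order", output: "Your order has been cancelled." },
  ],
  scores: [
    // Assumed scorer shape: receives the output and returns a 0-1 score
    ({ output }) => (output && output.length > 0 ? 1 : 0),
  ],
})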
Optional Fields
- ctx - Shared context passed to tasks and scorers
- task - Function to generate outputs from inputs
- exporters - Storage backends (SQLite, cloud, etc.)
- reporters - Custom result formatters
Context Flow
Context flows through your evaluation, allowing you to pass configuration, API keys, and state:
await Eval("My Eval", {
ctx: {
apiKey: "sk-...",
model: "gpt-4",
},
data: [
{ input: "test", ctx: { temperature: 0.7 } } // Override per test case
],
task: async (input, ctx) => {
// ctx includes: base ctx + test case ctx + state
return await callLLM(input, ctx.apiKey, ctx.temperature)
},
scores: [/* ... */]
})
Context Merge Order
- Base context - From the Eval() config
- Data context - Per-test-case overrides
- Task context - The task can return [output, ctxOverride]
- State - Auto-injected as ctx.state
Scorers receive the final merged context along with input, output, and expected values.
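As a sketch of that merge order (the scorer's argument shape is an assumption here), a per-test-case override and a task override layer on top of the base context like this:
await Eval("Merge Order Demo", {
  ctx: { model: "gpt-4", temperature: 0.2 },        // base context
  data: [
    { input: "hello", ctx: { temperature: 0.9 } },  // per-test-case override
  ],
  task: async (input, ctx) => {
    // ctx.temperature is 0.9 here: the test case overrides the base value
    const output = await callLLM(input, ctx.model, ctx.temperature)
    return [output, { promptVersion: "v2" }]        // task context override
  },
  scores: [
    ({ output, ctx }) => {
      // Scorers see the merged result: model, temperature: 0.9,
      // promptVersion, plus the auto-injected ctx.state
      return output ? 1 : 0
    },
  ],
})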
Execution Modes
Fire-and-Forget
Run an evaluation without waiting for results:
Eval("Background Eval", {
data: [/* ... */],
scores: [/* ... */]
})
// Script continues immediately
Results stream to the console as they complete.
Awaitable
Wait for the evaluation to complete:
const summary = await Eval("Awaitable Eval", {
  data: [/* ... */],
  scores: [/* ... */],
})
console.log(summary.mean) // Access aggregated results
Returns an EvalSummary with statistics for all scorers.
With Exporters
Export results to storage backends:
await Eval("Tracked Eval", {
data: [/* ... */],
scores: [/* ... */]
}, {
exporters: [
new SqliteExporter({ dbPath: "./results.db" })
]
})
Results are saved to SQLite for historical tracking and analysis.
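Since the results live in an ordinary SQLite file, any SQLite client can read them back later. Here is a minimal sketch using the better-sqlite3 package (not part of Orchestrated), which lists the tables the exporter created without assuming a particular schema:
import Database from "better-sqlite3"

// Open the exported results file read-only and list its tables.
const db = new Database("./results.db", { readonly: true })
const tables = db
  .prepare("SELECT name FROM sqlite_master WHERE type = 'table'")
  .all()
console.log(tables)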
Best Practices
Keep Evaluations Focused
Each evaluation should test one thing:
// Good - focused evaluation
await Eval("Prompt A vs Prompt B", { /* ... */ })
// Less good - testing multiple unrelated things
await Eval("Everything Test", { /* ... */ })
Use Descriptive Names
Names should clearly indicate what's being tested:
await Eval("GPT-4 Effectiveness - Production Data - Jan 2025", {
// ...
})
Leverage Context for Configuration
Pass configuration through context instead of hardcoding:
await Eval("Configurable Eval", {
ctx: {
model: process.env.MODEL_NAME,
temperature: 0.7,
},
task: (input, ctx) => callLLM(input, ctx.model, ctx.temperature),
scores: [/* ... */]
})
Export Important Results
Use exporters for evaluations you'll want to track over time:
await Eval("Production Quality", {
data: interactions(),
scores: [Effectiveness],
}, {
exporters: [new SqliteExporter({ dbPath: "./prod-results.db" })]
})