Evaluations

Evaluations are the core building block of Orchestrated. They define what to test, how to test it, and how to measure success.


What is an Evaluation?

An evaluation is a structured test that:

  1. Loads data - Test cases from static arrays, data sources, or custom functions
  2. (Optional) Runs a task - Transforms inputs into outputs using your LLM or system
  3. Scores results - Evaluates outputs using built-in or custom scorers
  4. Reports findings - Displays results in the terminal, exports them to databases, or uploads them to the cloud
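
A minimal evaluation exercises all four steps. The sketch below uses an inline data array, a toy task, and a custom scorer function; the import path and the scorer's argument shape (an object carrying input, output, and expected) are assumptions for illustration, not the exact Orchestrated API.

import { Eval } from "orchestrated"  // assumed import path

await Eval("Uppercase Sanity Check", {
  // 1. Load data: two inline test cases
  data: [
    { input: "hello", expected: "HELLO" },
    { input: "world", expected: "WORLD" },
  ],
  // 2. Run a task: transform each input into an output
  task: async (input) => input.toUpperCase(),
  // 3. Score results: custom scorer (assumed signature) returning 1 for a match, 0 otherwise
  scores: [
    ({ output, expected }) => (output === expected ? 1 : 0),
  ],
})
// 4. Report findings: results are printed to the terminal by default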

Evaluations help you answer questions like:

  • "Is my new prompt better than the old one?"
  • "How well does my model follow guardrails?"
  • "What's the quality of my production outputs?"

Evaluation Structure

Every evaluation is defined using the Eval() function:

await Eval("Evaluation Name", {
  ctx: {},              // Shared context (optional)
  data: [],             // Test cases or data source
  task: () => {},       // Transform function (optional)
  scores: [],           // Scorer functions
}, {
  exporters: [],        // Export results (optional)
  reporters: [],        // Custom reporters (optional)
})

Required Fields

  • Name - Human-readable identifier for the evaluation, passed as the first argument to Eval()
  • data - Test cases (array) or data source function
  • scores - Array of scorer functions or scorer names

Optional Fields

  • ctx - Shared context passed to tasks and scorers
  • task - Function to generate outputs from inputs
  • exporters - Storage backends (SQLite, cloud, etc.)
  • reporters - Custom result formatters

Context Flow

Context flows through your evaluation, allowing you to pass configuration, API keys, and state:

await Eval("My Eval", {
  ctx: {
    apiKey: "sk-...",
    model: "gpt-4",
  },
  data: [
    { input: "test", ctx: { temperature: 0.7 } }  // Override per test case
  ],
  task: async (input, ctx) => {
    // ctx includes: base ctx + test case ctx + state
    return await callLLM(input, ctx.apiKey, ctx.temperature)
  },
  scores: [/* ... */]
})

Context Merge Order

  1. Base context - From Eval() config
  2. Data context - Per test case overrides
  3. Task context - Task can return [output, ctxOverride]
  4. State - Auto-injected as ctx.state

Scorers receive the final merged context along with input, output, and expected values.
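
The sketch below walks through that order: a base model and temperature from the Eval() config, a per-test-case temperature override, and a task that returns [output, ctxOverride] to add a promptVersion key. The scorer's argument shape is an assumption for illustration, and callLLM is a placeholder.

await Eval("Context Merge Demo", {
  ctx: { model: "gpt-4", temperature: 0.2 },      // 1. base context
  data: [
    { input: "hi", ctx: { temperature: 0.9 } },   // 2. per-test-case override wins
  ],
  task: async (input, ctx) => {
    const output = await callLLM(input, ctx.model, ctx.temperature)
    return [output, { promptVersion: "v2" }]      // 3. task-level context override
  },
  scores: [
    // 4. the scorer sees the merged context, with state auto-injected as ctx.state
    ({ output, ctx }) => {
      // here ctx.model === "gpt-4", ctx.temperature === 0.9, ctx.promptVersion === "v2"
      return output.length > 0 ? 1 : 0
    },
  ],
})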


Execution Modes

Fire-and-Forget

Run an evaluation without waiting for results:

Eval("Background Eval", {
  data: [/* ... */],
  scores: [/* ... */]
})
// Script continues immediately

Results stream to the console as they complete.

Awaitable

Wait for the evaluation to complete:

const summary = await Eval("Awaitable Eval", {
  data: [/* ... */],
  scores: [/* ... */]
})

console.log(summary.mean) // Access aggregated results

Returns an EvalSummary with statistics for all scorers.
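
Because the summary is a plain value, it can gate other steps in a script or CI job. Here is a small sketch that fails the pipeline when the aggregate score drops below a threshold; the 0.8 cutoff is arbitrary, loadTestCases() is a placeholder, and only summary.mean from the example above is relied on.

const summary = await Eval("Regression Gate", {
  data: loadTestCases(),   // placeholder data source
  scores: [Effectiveness],
})

if (summary.mean < 0.8) {
  console.error(`Quality gate failed: mean score ${summary.mean}`)
  process.exit(1)          // non-zero exit fails the CI job
}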

With Exporters

Export results to storage backends:

await Eval("Tracked Eval", {
  data: [/* ... */],
  scores: [/* ... */]
}, {
  exporters: [
    new SqliteExporter({ dbPath: "./results.db" })
  ]
})

Results are saved to SQLite for historical tracking and analysis.


Best Practices

Keep Evaluations Focused

Each evaluation should test one thing:

// Good - focused evaluation
await Eval("Prompt A vs Prompt B", { /* ... */ })

// Less good - testing multiple unrelated things
await Eval("Everything Test", { /* ... */ })

Use Descriptive Names

Names should clearly indicate what's being tested:

await Eval("GPT-4 Effectiveness - Production Data - Jan 2025", {
  // ...
})

Leverage Context for Configuration

Pass configuration through context instead of hardcoding:

await Eval("Configurable Eval", {
  ctx: {
    model: process.env.MODEL_NAME,
    temperature: 0.7,
  },
  task: (input, ctx) => callLLM(input, ctx.model, ctx.temperature),
  scores: [/* ... */]
})

Export Important Results

Use exporters for evaluations you'll want to track over time:

await Eval("Production Quality", {
  data: interactions(),
  scores: [Effectiveness],
}, {
  exporters: [new SqliteExporter({ dbPath: "./prod-results.db" })]
})
