Evaluations

Evaluations are the core building block of Orchestrated. They define what to test, how to test it, and how to measure success.


What is an Evaluation?

An evaluation is a structured test that:

  1. Loads data - Test cases from static arrays, data sources, or custom functions
  2. (Optional) Runs a task - Transforms inputs into outputs using your LLM or system
  3. Scores results - Evaluates outputs using built-in or custom scorers
  4. Reports findings - Displays results in the terminal, exports them to databases, or uploads them to the cloud
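
A minimal evaluation exercises all four steps. The sketch below uses an inline data array, a toy task, and a custom scorer function; the import path and the scorer's argument shape (an object carrying input, output, and expected) are assumptions for illustration, not the exact Orchestrated API.

import { Eval } from "orchestrated"  // assumed import path

await Eval("Uppercase Sanity Check", {
  // 1. Load data: two inline test cases
  data: [
    { input: "hello", expected: "HELLO" },
    { input: "world", expected: "WORLD" },
  ],
  // 2. Run a task: transform each input into an output
  task: async (input) => input.toUpperCase(),
  // 3. Score results: custom scorer (assumed signature) returning 1 for a match, 0 otherwise
  scores: [
    ({ output, expected }) => (output === expected ? 1 : 0),
  ],
})
// 4. Report findings: results are printed to the terminal by default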

Evaluations help you answer questions like:

  • "Is my new prompt better than the old one?"
  • "How well does my model follow guardrails?"
  • "What's the quality of my production outputs?"

Evaluation Structure

Every evaluation is defined using the Eval() function:

await Eval("Evaluation Name", {
  ctx: {},              // Shared context (optional)
  data: [],             // Test cases or data source
  task: () => {},       // Transform function (optional)
  scores: [],           // Scorer functions
}, {
  exporters: [],        // Export results (optional)
  reporters: [],        // Custom reporters (optional)
})

Required Fields

  • Name - Human-readable identifier for the evaluation, passed as the first argument to Eval()
  • data - Test cases (array) or data source function
  • scores - Array of scorer functions or scorer names

Optional Fields

  • ctx - Shared context passed to tasks and scorers
  • task - Function to generate outputs from inputs
  • exporters - Storage backends (SQLite, cloud, etc.)
  • reporters - Custom result formatters

Context Flow

Context flows through your evaluation, allowing you to pass configuration, API keys, and state:

await Eval("My Eval", {
  ctx: {
    apiKey: "sk-...",
    model: "gpt-4",
  },
  data: [
    { input: "test", ctx: { temperature: 0.7 } }  // Override per test case
  ],
  task: async (input, ctx) => {
    // ctx includes: base ctx + test case ctx + state
    return await callLLM(input, ctx.apiKey, ctx.temperature)
  },
  scores: [/* ... */]
})

Context Merge Order

  1. Base context - From Eval() config
  2. Data context - Per test case overrides
  3. Task context - Task can return [output, ctxOverride]
  4. State - Auto-injected as ctx.state

Scorers receive the final merged context along with input, output, and expected values.
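
The sketch below walks through that order: a base model and temperature from the Eval() config, a per-test-case temperature override, and a task that returns [output, ctxOverride] to add a promptVersion key. The scorer's argument shape is an assumption for illustration, and callLLM is a placeholder.

await Eval("Context Merge Demo", {
  ctx: { model: "gpt-4", temperature: 0.2 },      // 1. base context
  data: [
    { input: "hi", ctx: { temperature: 0.9 } },   // 2. per-test-case override wins
  ],
  task: async (input, ctx) => {
    const output = await callLLM(input, ctx.model, ctx.temperature)
    return [output, { promptVersion: "v2" }]      // 3. task-level context override
  },
  scores: [
    // 4. the scorer sees the merged context, with state auto-injected as ctx.state
    ({ output, ctx }) => {
      // here ctx.model === "gpt-4", ctx.temperature === 0.9, ctx.promptVersion === "v2"
      return output.length > 0 ? 1 : 0
    },
  ],
})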


Execution Modes

Fire-and-Forget

Run an evaluation without waiting for results:

Eval("Background Eval", {
  data: [/* ... */],
  scores: [/* ... */]
})
// Script continues immediately

Results stream to the console as they complete.

Awaitable

Wait for the evaluation to complete:

const summary = await Eval("Awaitable Eval", {
  data: [/* ... */],
  scores: [/* ... */]
})

console.log(summary.mean) // Access aggregated results

Returns an EvalSummary with statistics for all scorers.
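
Because the summary is a plain value, it can gate other steps in a script or CI job. Here is a small sketch that fails the pipeline when the aggregate score drops below a threshold; the 0.8 cutoff is arbitrary, loadTestCases() is a placeholder, and only summary.mean from the example above is relied on.

const summary = await Eval("Regression Gate", {
  data: loadTestCases(),   // placeholder data source
  scores: [Effectiveness],
})

if (summary.mean < 0.8) {
  console.error(`Quality gate failed: mean score ${summary.mean}`)
  process.exit(1)          // non-zero exit fails the CI job
}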

With Exporters

Export results to storage backends:

await Eval("Tracked Eval", {
  data: [/* ... */],
  scores: [/* ... */]
}, {
  exporters: [
    new SqliteExporter({ dbPath: "./results.db" })
  ]
})

Results are saved to SQLite for historical tracking and analysis.


Best Practices

Keep Evaluations Focused

Each evaluation should test one thing:

// Good - focused evaluation
await Eval("Prompt A vs Prompt B", { /* ... */ })

// Less good - testing multiple unrelated things
await Eval("Everything Test", { /* ... */ })

Use Descriptive Names

Names should clearly indicate what's being tested:

await Eval("GPT-4 Effectiveness - Production Data - Jan 2025", {
  // ...
})

Leverage Context for Configuration

Pass configuration through context instead of hardcoding:

await Eval("Configurable Eval", {
  ctx: {
    model: process.env.MODEL_NAME,
    temperature: 0.7,
  },
  task: (input, ctx) => callLLM(input, ctx.model, ctx.temperature),
  scores: [/* ... */]
})

Export Important Results

Use exporters for evaluations you'll want to track over time:

await Eval("Production Quality", {
  data: interactions(),
  scores: [Effectiveness],
}, {
  exporters: [new SqliteExporter({ dbPath: "./prod-results.db" })]
})
