Results

Understand evaluation results, access summary statistics, and export data for analysis.


Result Structure

Every evaluation produces two types of results:

Individual Results

Per-test-case results have the shape:

{
  input: any,           // Test case input
  output: any,          // Generated or provided output
  expected?: any,       // Expected value (if provided)
  scores: {             // Score for each scorer
    [scorerName]: {
      name: string,
      score: number,    // 0-1
      metadata: object,
    }
  },
  error?: string,       // If test case failed
  tags?: string[],      // Test case tags
}
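
For example, a scored test case might look like this (values are illustrative):

{
  input: "What is the capital of France?",
  output: "Paris is the capital of France.",
  expected: "Paris",
  scores: {
    Factuality: {
      name: "Factuality",
      score: 1,
      metadata: { rationale: "Output matches the expected answer." },
    },
  },
  tags: ["geography"],
}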

Evaluation Summary

Aggregated statistics across all test cases:

{
  name: string,           // Evaluation name
  testCaseCount: number,  // Total test cases
  scoreSummaries: {       // Per-scorer statistics
    [scorerName]: {
      name: string,
      mean: number,
      median: number,
      min: number,
      max: number,
      p10: number,      // 10th percentile
      p25: number,      // 25th percentile
      p75: number,      // 75th percentile
      p90: number,      // 90th percentile
      stddev: number,   // Standard deviation
    }
  },
  hasPendingBatch: boolean,  // True if batch processing is still pending
}

Summary Statistics

The Eval call resolves to the evaluation summary once all test cases complete:

const summary = await Eval("My Eval", {
  data: [/* ... */],
  scores: [Effectiveness, Factuality],
})

console.log(`Test Cases: ${summary.testCaseCount}`)
console.log(`Effectiveness Mean: ${summary.scoreSummaries.Effectiveness.mean}`)
console.log(`Effectiveness Median: ${summary.scoreSummaries.Effectiveness.median}`)
console.log(`Effectiveness P90: ${summary.scoreSummaries.Effectiveness.p90}`)

Per-Scorer Statistics

const effectivenessStats = summary.scoreSummaries.Effectiveness

console.log(`Mean: ${effectivenessStats.mean}`)
console.log(`Median: ${effectivenessStats.median}`)
console.log(`Range: ${effectivenessStats.min} - ${effectivenessStats.max}`)
console.log(`P25-P75: ${effectivenessStats.p25} - ${effectivenessStats.p75}`)
console.log(`Std Dev: ${effectivenessStats.stddev}`)
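
The summary also works as a quality gate in CI. A minimal sketch, assuming you want the process to fail when the mean Effectiveness score drops below a threshold:

const threshold = 0.8  // hypothetical minimum acceptable mean

if (summary.scoreSummaries.Effectiveness.mean < threshold) {
  console.error(`Effectiveness mean ${summary.scoreSummaries.Effectiveness.mean} is below ${threshold}`)
  process.exit(1)  // non-zero exit fails the CI job
}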

Terminal Output

By default, results stream to your terminal with a progress bar:

⠋ Running My Eval... (42/100)

Effectiveness: 0.85 (mean)
Factuality:    0.78 (mean)

Verbose Mode

Enable verbose output for detailed per-test-case results:

ORCHA_VERBOSE=true bun run my-eval.ts

JSONL Mode

Output results as JSONL for parsing:

ORCHA_JSONL=true bun run my-eval.ts > results.jsonl

Each line is a JSON object:

{"type":"result","data":{"input":"...","output":"...","scores":{}}}
{"type":"summary","data":{"name":"...","testCaseCount":100,"scoreSummaries":{}}}

Exporting Results

Export results to storage backends for persistence and analysis:

SQLite Exporter

Store results in a local SQLite database:

import { SqliteExporter } from 'orchestrated'

await Eval("Tracked Eval", {
  data: [/* ... */],
  scores: [Effectiveness],
}, {
  exporters: [
    new SqliteExporter({ dbPath: "./results.db" })
  ]
})

Database Schema

The SQLite exporter creates three tables:

evaluations - Eval metadata

CREATE TABLE evaluations (
  id INTEGER PRIMARY KEY,
  name TEXT,
  checksum TEXT,
  created_at TIMESTAMP
)

results - Individual test case results

CREATE TABLE results (
  id INTEGER PRIMARY KEY,
  evaluation_id INTEGER,
  input TEXT,
  output TEXT,
  expected TEXT,
  scores TEXT,  -- JSON
  tags TEXT,    -- JSON
  FOREIGN KEY (evaluation_id) REFERENCES evaluations(id)
)

batches - Batch processing status

CREATE TABLE batches (
  id INTEGER PRIMARY KEY,
  evaluation_id INTEGER,
  status TEXT,  -- pending, completed, failed
  batch_id TEXT,
  created_at TIMESTAMP
)
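
With this schema you can check which evaluations still have pending batches. A sketch using better-sqlite3 against the tables above:

import Database from 'better-sqlite3'

const db = new Database('./results.db')

// List evaluations whose batch processing has not finished yet
const pending = db.prepare(`
  SELECT e.name, b.batch_id, b.created_at
  FROM batches b
  JOIN evaluations e ON e.id = b.evaluation_id
  WHERE b.status = 'pending'
  ORDER BY b.created_at DESC
`).all()

console.log(pending)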

Historical Comparison

Compare current results to historical runs:

import { SqliteExporter, createSqliteReporter } from 'orchestrated'

await Eval("Historical Comparison", {
  data: [/* ... */],
  scores: [Effectiveness, Factuality],
}, {
  exporters: [
    new SqliteExporter({ dbPath: "./results.db" })
  ],
  reporters: [
    createSqliteReporter({ dbPath: "./results.db" })
  ]
})

Terminal output includes comparison to previous runs:

✓ Historical Comparison completed (100 test cases)

Effectiveness: 0.85 (↑ +0.05 from previous run)
Factuality:    0.78 (↓ -0.02 from previous run)

Accessing Historical Data

Query the database directly:

import Database from 'better-sqlite3'

const db = new Database('./results.db')

const previousRuns = db.prepare(`
  SELECT e.name, e.created_at, r.scores
  FROM evaluations e
  JOIN results r ON r.evaluation_id = e.id
  WHERE e.name = ?
  ORDER BY e.created_at DESC
  LIMIT 10
`).all('My Eval')

console.log(previousRuns)
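
The scores column is stored as JSON text, so parse it before computing trends. A sketch assuming the score shape described under Individual Results:

// Pull the Effectiveness score out of each row's JSON scores column
const effectivenessHistory = previousRuns.map((row: any) => ({
  createdAt: row.created_at,
  score: JSON.parse(row.scores).Effectiveness?.score,
}))

console.log(effectivenessHistory)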
