Results
Understand evaluation results, access summary statistics, and export data for analysis.
Result Structure
Every evaluation produces two types of results:
Individual Results
Per-test-case results with the shape:
{
  input: any,          // Test case input
  output: any,         // Generated or provided output
  expected?: any,      // Expected value (if provided)
  scores: {            // Score for each scorer
    [scorerName]: {
      name: string,
      score: number,   // 0-1
      metadata: object,
    }
  },
  error?: string,      // If test case failed
  tags?: string[],     // Test case tags
}
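For example, a populated result might look like this (the values are purely illustrative, not real output):
{
  input: "Summarize the release notes",
  output: "The release adds batch scoring and a SQLite exporter.",
  expected: "A two-sentence summary of the release notes",
  scores: {
    Effectiveness: { name: "Effectiveness", score: 0.85, metadata: {} },
  },
  tags: ["summarization"],
}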
Evaluation Summary
Aggregated statistics across all test cases:
{
  name: string,             // Evaluation name
  testCaseCount: number,    // Total test cases
  scoreSummaries: {         // Per-scorer statistics
    [scorerName]: {
      name: string,
      mean: number,
      median: number,
      min: number,
      max: number,
      p10: number,          // 10th percentile
      p25: number,          // 25th percentile
      p75: number,          // 75th percentile
      p90: number,          // 90th percentile
      stddev: number,       // Standard deviation
    }
  },
  hasPendingBatch: boolean, // Whether batch processing is still pending
}
Summary Statistics
The Eval call resolves to the summary, so statistics are available as soon as the evaluation completes:
const summary = await Eval("My Eval", {
  data: [/* ... */],
  scores: [Effectiveness, Factuality],
})

console.log(`Test Cases: ${summary.testCaseCount}`)
console.log(`Effectiveness Mean: ${summary.scoreSummaries.Effectiveness.mean}`)
console.log(`Effectiveness Median: ${summary.scoreSummaries.Effectiveness.median}`)
console.log(`Effectiveness P90: ${summary.scoreSummaries.Effectiveness.p90}`)
Per-Scorer Statistics
const effectivenessStats = summary.scoreSummaries.Effectiveness
console.log(`Mean: ${effectivenessStats.mean}`)
console.log(`Median: ${effectivenessStats.median}`)
console.log(`Range: ${effectivenessStats.min} - ${effectivenessStats.max}`)
console.log(`P25-P75: ${effectivenessStats.p25} - ${effectivenessStats.p75}`)
console.log(`Std Dev: ${effectivenessStats.stddev}`)
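To report every scorer at once instead of addressing each by name, iterate over scoreSummaries. A minimal sketch, assuming the summary shape documented above:
for (const [scorerName, stats] of Object.entries(summary.scoreSummaries)) {
  // One compact line per scorer
  console.log(
    `${scorerName}: mean=${stats.mean.toFixed(2)} ` +
      `median=${stats.median.toFixed(2)} ` +
      `p25-p75=${stats.p25.toFixed(2)}-${stats.p75.toFixed(2)}`
  )
}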
Terminal Output
By default, results stream to your terminal with a progress bar:
⠋ Running My Eval... (42/100)
Effectiveness: 0.85 (mean)
Factuality: 0.78 (mean)
Verbose Mode
Enable verbose output for detailed per-test-case results:
ORCHA_VERBOSE=true bun run my-eval.ts
JSONL Mode
Output results as JSONL for parsing:
ORCHA_JSONL=true bun run my-eval.ts > results.jsonl
Each line is a JSON object:
{"type":"result","data":{"input":"...","output":"...","scores":{}}}
{"type":"summary","data":{"name":"...","testCaseCount":100,"scoreSummaries":{}}}
Exporting Results
Export results to storage backends for persistence and analysis:
SQLite Exporter
Store results in a local SQLite database:
import { SqliteExporter } from 'orchestrated'

await Eval("Tracked Eval", {
  data: [/* ... */],
  scores: [Effectiveness],
}, {
  exporters: [
    new SqliteExporter({ dbPath: "./results.db" })
  ]
})
Database Schema
The SQLite exporter creates three tables:
evaluations - Eval metadata
CREATE TABLE evaluations (
  id INTEGER PRIMARY KEY,
  name TEXT,
  checksum TEXT,
  created_at TIMESTAMP
)
results - Individual test case results
CREATE TABLE results (
  id INTEGER PRIMARY KEY,
  evaluation_id INTEGER,
  input TEXT,
  output TEXT,
  expected TEXT,
  scores TEXT, -- JSON
  tags TEXT,   -- JSON
  FOREIGN KEY (evaluation_id) REFERENCES evaluations(id)
)
batches - Batch processing status
CREATE TABLE batches (
  id INTEGER PRIMARY KEY,
  evaluation_id INTEGER,
  status TEXT, -- pending, completed, failed
  batch_id TEXT,
  created_at TIMESTAMP
)
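When a summary reports hasPendingBatch, you can check batch status yourself with the same better-sqlite3 driver used later on this page. A sketch, assuming the schema shown above:
import Database from 'better-sqlite3'

const db = new Database('./results.db')

// List batches that have not finished processing
const pending = db.prepare(`
  SELECT b.batch_id, b.status, e.name
  FROM batches b
  JOIN evaluations e ON e.id = b.evaluation_id
  WHERE b.status = 'pending'
`).all()

console.log(pending)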
Historical Comparison
Compare current results to historical runs:
import { SqliteExporter, createSqliteReporter } from 'orchestrated'

await Eval("Historical Comparison", {
  data: [/* ... */],
  scores: [Effectiveness],
}, {
  exporters: [
    new SqliteExporter({ dbPath: "./results.db" })
  ],
  reporters: [
    createSqliteReporter({ dbPath: "./results.db" })
  ]
})
Terminal output then includes a comparison to previous runs:
✓ Historical Comparison completed (100 test cases)
Effectiveness: 0.85 (↑ +0.05 from previous run)
Factuality: 0.78 (↓ -0.02 from previous run)
Accessing Historical Data
Query the database directly:
import Database from 'better-sqlite3'

const db = new Database('./results.db')

const previousRuns = db.prepare(`
  SELECT e.name, e.created_at, r.scores
  FROM evaluations e
  JOIN results r ON r.evaluation_id = e.id
  WHERE e.name = ?
  ORDER BY e.created_at DESC
  LIMIT 10
`).all('My Eval')

console.log(previousRuns)
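Each scores column stores one test case's per-scorer scores as a JSON string, so parse it before aggregating. A sketch that averages Effectiveness over the fetched rows (the scorer name is assumed from the examples above):
// Parse the JSON `scores` column and pull out one scorer's values
const effectivenessScores = previousRuns
  .map((row: any) => JSON.parse(row.scores))
  .map((scores) => scores.Effectiveness?.score)
  .filter((score): score is number => typeof score === 'number')

const mean =
  effectivenessScores.reduce((sum, s) => sum + s, 0) / effectivenessScores.length
console.log(`Mean Effectiveness across fetched rows: ${mean.toFixed(2)}`)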