Working with Data Sources
Choose the right data source strategy for your evaluation needs, from simple static arrays to production data.
Static vs Dynamic Data
Understanding when to use each type of data source is crucial for effective evaluations.
Static Data
Best for:
- Development and testing
- Regression tests with known inputs/outputs
- Reproducing specific issues
- Baseline benchmarks
Pros:
- Simple and fast
- Fully reproducible
- Easy to version control
- No external dependencies
Cons:
- Limited coverage
- May not reflect production reality
- Requires manual updates
await Eval("Static Data Eval", {
data: [
{ input: "test 1", output: "result 1", expected: "result 1" },
{ input: "test 2", output: "result 2", expected: "result 2" },
],
scores: [Effectiveness],
})
Dynamic Data
Best for:
- Production quality monitoring
- Real-world validation
- Discovering edge cases
- Continuous evaluation
Pros:
- Reflects actual usage
- Automatically updated
- Comprehensive coverage
- Reveals unexpected patterns
Cons:
- Less reproducible
- Requires infrastructure
- May be slower to load
- Needs authentication
import { interactions } from 'orchestrated'
await Eval("Dynamic Data Eval", {
data: interactions({ limit: 100 }),
scores: [Effectiveness],
})
Hybrid Approach
Combine both for comprehensive testing:
// Static baseline
await Eval("Baseline Eval", {
data: [
{ input: "What is 2+2?", expected: "4" },
{ input: "What is the capital of France?", expected: "Paris" },
],
task: callMyLLM,
scores: [Effectiveness],
})
// Production sampling
await Eval("Production Eval", {
data: interactions({ limit: 50 }),
scores: [Effectiveness],
})
Using interactions()
The interactions() data source loads real user interactions from your production system.
Basic Usage
import { interactions } from 'orchestrated'
await Eval("Interactions Eval", {
data: interactions(),
scores: [Effectiveness, GuardrailAdherence],
})
When called with no arguments, interactions() uses the defaults from your state configuration (tenantId, serviceName, environment).
With Parameters
Customize what data to load:
await Eval("Filtered Interactions", {
data: interactions({
tenantId: "acme",
serviceName: "customer-support",
environment: "production",
limit: 200,
startDate: "2025-01-01",
endDate: "2025-01-31",
}),
scores: [Effectiveness],
})
Parameter Reference
{
tenantId?: string, // Organization identifier (default: from state)
serviceName?: string, // Service to evaluate (default: from state)
environment?: string, // Environment filter (default: from state)
limit?: number, // Max number of interactions (default: 100)
startDate?: string, // Start date (ISO format)
endDate?: string, // End date (ISO format)
tags?: string[], // Filter by tags
userId?: string, // Filter by user
}
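The "default: from state" behavior amounts to a shallow merge: explicit parameters win, and anything omitted falls back to the configured defaults. A minimal sketch of that idea (the `stateDefaults` object and `resolveParams` helper are illustrative, not part of the library's API):

```typescript
// Hypothetical illustration of how explicit parameters override state defaults.
interface InteractionParams {
  tenantId?: string
  serviceName?: string
  environment?: string
  limit?: number
}

const stateDefaults: InteractionParams = {
  tenantId: "acme",
  serviceName: "customer-support",
  environment: "production",
  limit: 100,
}

function resolveParams(overrides: InteractionParams = {}): InteractionParams {
  // Later spreads win, so any key present in `overrides` replaces the default.
  return { ...stateDefaults, ...overrides }
}
```

So `interactions({ limit: 200 })` would load from the default tenant, service, and environment, but with the larger limit.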
Time-Based Sampling
Evaluate specific time periods:
// Last 24 hours
await Eval("Recent Interactions", {
data: interactions({
startDate: new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString(),
limit: 50,
}),
scores: [Effectiveness],
})
// Specific date range
await Eval("January 2025", {
data: interactions({
startDate: "2025-01-01",
endDate: "2025-01-31",
}),
scores: [Effectiveness],
})
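The rolling-window arithmetic above is easy to get wrong inline; it can be factored into a small helper (a sketch — `lastNDays` is an illustrative name, not a library export):

```typescript
// Returns an ISO-8601 start date for a rolling window of `days` days,
// measured back from `now` (defaults to the current time).
function lastNDays(days: number, now: Date = new Date()): string {
  return new Date(now.getTime() - days * 24 * 60 * 60 * 1000).toISOString()
}
```

This keeps call sites readable, e.g. `interactions({ startDate: lastNDays(7), limit: 100 })`.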
Tag-Based Filtering
Focus on specific interaction types:
await Eval("Error Cases", {
data: interactions({
tags: ["error", "failure"],
limit: 100,
}),
scores: [Execution, Effectiveness],
})
await Eval("High Priority", {
data: interactions({
tags: ["high-priority", "enterprise"],
}),
scores: [Effectiveness, GuardrailAdherence],
})
User-Specific Evaluation
Evaluate interactions from a specific user:
await Eval("VIP User Experience", {
data: interactions({
userId: "vip-user-123",
limit: 50,
}),
scores: [Effectiveness, ToneScorer],
})
Custom Data Sources
Create custom data sources for unique requirements.
Simple Function
async function myDataSource() {
return [
{ input: "test 1", output: "result 1" },
{ input: "test 2", output: "result 2" },
]
}
await Eval("Custom Source Eval", {
data: myDataSource,
scores: [Effectiveness],
})
Database Source
Load from your database:
import { Database } from 'bun:sqlite'
async function loadFromDatabase() {
const db = new Database('./data.db')
const rows = db.query('SELECT * FROM test_cases LIMIT 100').all()
return rows.map(row => ({
input: row.input,
output: row.output,
expected: row.expected,
tags: row.tags?.split(','),
}))
}
await Eval("Database Eval", {
data: loadFromDatabase,
scores: [Effectiveness],
})
API Source
Fetch test cases from an external API:
async function loadFromAPI() {
const response = await fetch('https://api.example.com/test-cases')
const data = await response.json()
return data.testCases.map(tc => ({
input: tc.question,
output: tc.answer,
expected: tc.groundTruth,
tags: tc.categories,
}))
}
await Eval("API Eval", {
data: loadFromAPI,
scores: [Effectiveness],
})
File-Based Source
Load from CSV, JSON, or other files:
import { readFile } from 'node:fs/promises'
async function loadFromCSV() {
const content = await readFile('./test-cases.csv', 'utf-8')
const lines = content.split('\n').slice(1).filter(line => line.trim()) // Skip header and blank lines
return lines.map(line => {
const [input, output, expected] = line.split(',')
return { input, output, expected }
})
}
async function loadFromJSON() {
const content = await readFile('./test-cases.json', 'utf-8')
const data = JSON.parse(content)
return data.testCases
}
await Eval("File-Based Eval", {
data: loadFromCSV,
scores: [Effectiveness],
})
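Note that the naive `split(',')` above breaks on fields that themselves contain commas. If your CSV uses double-quoted fields, a small parser handles them (a sketch, not a full RFC 4180 implementation — embedded newlines inside quoted fields are not handled):

```typescript
// Splits one CSV line into fields, honoring double-quoted fields
// and "" as an escaped quote inside them.
function parseCsvLine(line: string): string[] {
  const fields: string[] = []
  let current = ""
  let inQuotes = false
  for (let i = 0; i < line.length; i++) {
    const ch = line[i]
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"' // Escaped quote
        i++
      } else if (ch === '"') {
        inQuotes = false // Closing quote
      } else {
        current += ch
      }
    } else if (ch === '"') {
      inQuotes = true // Opening quote
    } else if (ch === ",") {
      fields.push(current) // Field boundary
      current = ""
    } else {
      current += ch
    }
  }
  fields.push(current)
  return fields
}
```

For anything more complex than this, a dedicated CSV library is the safer choice.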
Parameterized Source
Create reusable sources with parameters:
function createDatabaseSource(options: {
table: string
limit: number
where?: string
}) {
return async () => {
const db = new Database('./data.db')
    // Note: options are interpolated directly into the SQL string,
    // so only pass trusted, non-user-supplied values here.
    const whereClause = options.where ? `WHERE ${options.where}` : ''
    const query = `SELECT * FROM ${options.table} ${whereClause} LIMIT ${options.limit}`
const rows = db.query(query).all()
return rows.map(row => ({
input: row.input,
output: row.output,
expected: row.expected,
}))
}
}
// Use with different parameters
await Eval("Recent Errors", {
data: createDatabaseSource({
table: 'interactions',
limit: 100,
where: "status = 'error' AND created_at > datetime('now', '-7 days')",
}),
scores: [Execution],
})
Cached Source
Implement caching for expensive data loads:
let cachedData: any[] | null = null
let cacheTimestamp = 0
const CACHE_TTL = 5 * 60 * 1000 // 5 minutes
async function cachedDataSource() {
const now = Date.now()
if (cachedData && (now - cacheTimestamp) < CACHE_TTL) {
console.log('Using cached data')
return cachedData
}
console.log('Fetching fresh data')
const data = await expensiveDataFetch()
cachedData = data
cacheTimestamp = now
return data
}
await Eval("Cached Eval", {
data: cachedDataSource,
scores: [Effectiveness],
})
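If several evaluations need this pattern, the module-level cache above can be generalized into a reusable wrapper (a sketch — `withTTLCache` is an illustrative helper, not a library export; the injectable `clock` exists purely to make the TTL testable):

```typescript
// Wraps an async loader so repeated calls within `ttlMs` reuse the last result.
function withTTLCache<T>(
  loader: () => Promise<T>,
  ttlMs: number,
  clock: () => number = Date.now,
): () => Promise<T> {
  let cached: T | null = null
  let stamp = 0
  return async () => {
    const now = clock()
    if (cached !== null && now - stamp < ttlMs) {
      return cached // Still fresh
    }
    cached = await loader() // Expired or first call: fetch and remember
    stamp = now
    return cached
  }
}
```

Usage mirrors the example above: `data: withTTLCache(expensiveDataFetch, 5 * 60 * 1000)`.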
Data Transformation
Transform data as it flows through your evaluation.
In Data Source
Transform during data loading:
async function transformedSource() {
const rawData = await fetch('https://api.example.com/data').then(r => r.json())
return rawData.map(item => ({
input: item.user_message,
output: item.assistant_response,
expected: item.ground_truth,
tags: [item.category, item.priority],
ctx: {
userId: item.user_id,
timestamp: item.created_at,
},
}))
}
With Task Function
Transform inputs before scoring:
await Eval("Transform in Task", {
data: [
{ input: "What is 2+2?" },
{ input: "What is the capital of France?" },
],
task: async (input) => {
// Call LLM and transform output
const rawOutput = await callLLM(input)
const cleanedOutput = rawOutput.trim().replace(/\n+/g, ' ')
return cleanedOutput
},
scores: [Effectiveness],
})
Context Enrichment
Add context during data loading:
async function enrichedSource() {
const testCases = await loadTestCases()
return testCases.map(tc => ({
...tc,
ctx: {
// Add metadata as context
complexity: calculateComplexity(tc.input),
expectedDuration: estimateDuration(tc.input),
category: categorize(tc.input),
},
}))
}
await Eval("Enriched Eval", {
data: enrichedSource,
scores: [Effectiveness],
})
Best Practices
Start Small, Scale Up
Begin with a small dataset to validate your evaluation:
// Development - small static dataset
await Eval("Dev Test", {
data: [
{ input: "test 1", output: "result 1" },
{ input: "test 2", output: "result 2" },
],
scores: [Effectiveness],
})
// Staging - moderate production sample
await Eval("Staging Test", {
data: interactions({ limit: 20 }),
scores: [Effectiveness],
})
// Production - full evaluation
await Eval("Production Test", {
data: interactions({ limit: 200 }),
scores: [Effectiveness, GuardrailAdherence, Factuality],
})
Use Representative Samples
Ensure your data represents real usage:
await Eval("Representative Sample", {
data: interactions({
// Mix of success and failure
tags: ["success", "error"],
// Recent data
startDate: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000).toISOString(),
// Reasonable size
limit: 100,
}),
scores: [Effectiveness],
})
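For custom sources, representativeness can also be enforced directly, for example by sampling each tag group in proportion to its size so that rare failure cases are not drowned out. A sketch (the `stratifiedSample` helper is illustrative; the rounding means group shares are approximate):

```typescript
// Takes up to `limit` items, drawing from each group roughly in
// proportion to that group's share of the full dataset.
function stratifiedSample<T>(
  items: T[],
  groupOf: (item: T) => string,
  limit: number,
): T[] {
  // Bucket items by group key.
  const groups = new Map<string, T[]>()
  for (const item of items) {
    const key = groupOf(item)
    if (!groups.has(key)) groups.set(key, [])
    groups.get(key)!.push(item)
  }
  // Take a proportional share from each group (at least one item).
  const result: T[] = []
  for (const members of groups.values()) {
    const share = Math.max(1, Math.round((members.length / items.length) * limit))
    result.push(...members.slice(0, share))
  }
  return result.slice(0, limit)
}
```

Such a helper would slot into a custom data source, e.g. `data: async () => stratifiedSample(await loadData(), item => item.tag, 100)`.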
Handle Missing Data
Gracefully handle incomplete data:
async function robustSource() {
const rawData = await loadData()
return rawData
.filter(item => item.input && item.output) // Remove incomplete
.map(item => ({
input: item.input,
output: item.output,
expected: item.expected || null, // Allow missing expected
tags: item.tags || [],
}))
}
Version Your Data
Track which data version was used:
async function versionedSource() {
const version = "v2.1.0"
const data = await loadData(version)
return data.map(item => ({
...item,
tags: [...(item.tags || []), `data-version:${version}`],
}))
}
Document Your Sources
Add clear documentation:
/**
* Loads customer support interactions from production database.
*
* Filters:
* - Only resolved tickets
* - Last 30 days
* - English language only
*
* Returns up to 500 interactions with:
* - input: Customer question
* - output: Agent response
* - expected: Quality rating (1-5)
*/
async function customerSupportSource() {
// Implementation...
}