Working with Data Sources

Choose the right data source strategy for your evaluation needs, from simple static arrays to production data.


Static vs Dynamic Data

Understanding when to use each type of data source is crucial for effective evaluations.

Static Data

Best for:

  • Development and testing
  • Regression tests with known inputs/outputs
  • Reproducing specific issues
  • Baseline benchmarks

Pros:

  • Simple and fast
  • Fully reproducible
  • Easy to version control
  • No external dependencies

Cons:

  • Limited coverage
  • May not reflect production reality
  • Requires manual updates

await Eval("Static Data Eval", {
  data: [
    { input: "test 1", output: "result 1", expected: "result 1" },
    { input: "test 2", output: "result 2", expected: "result 2" },
  ],
  scores: [Effectiveness],
})

Dynamic Data

Best for:

  • Production quality monitoring
  • Real-world validation
  • Discovering edge cases
  • Continuous evaluation

Pros:

  • Reflects actual usage
  • Automatically updated
  • Comprehensive coverage
  • Reveals unexpected patterns

Cons:

  • Less reproducible
  • Requires infrastructure
  • May be slower to load
  • Needs authentication

import { interactions } from 'orchestrated'

await Eval("Dynamic Data Eval", {
  data: interactions({ limit: 100 }),
  scores: [Effectiveness],
})

Hybrid Approach

Combine both for comprehensive testing:

// Static baseline
await Eval("Baseline Eval", {
  data: [
    { input: "What is 2+2?", expected: "4" },
    { input: "What is the capital of France?", expected: "Paris" },
  ],
  task: callMyLLM,
  scores: [Effectiveness],
})

// Production sampling
await Eval("Production Eval", {
  data: interactions({ limit: 50 }),
  scores: [Effectiveness],
})

Using interactions()

The interactions() data source loads real user interactions from your production system.

Basic Usage

import { interactions } from 'orchestrated'

await Eval("Interactions Eval", {
  data: interactions(),
  scores: [Effectiveness, GuardrailAdherence],
})

Called with no arguments, interactions() uses the defaults from your state configuration (tenantId, serviceName, environment).

With Parameters

Customize what data to load:

await Eval("Filtered Interactions", {
  data: interactions({
    tenantId: "acme",
    serviceName: "customer-support",
    environment: "production",
    limit: 200,
    startDate: "2025-01-01",
    endDate: "2025-01-31",
  }),
  scores: [Effectiveness],
})

Parameter Reference

{
  tenantId?: string,       // Organization identifier (default: from state)
  serviceName?: string,    // Service to evaluate (default: from state)
  environment?: string,    // Environment filter (default: from state)
  limit?: number,          // Max number of interactions (default: 100)
  startDate?: string,      // Start date (ISO format)
  endDate?: string,        // End date (ISO format)
  tags?: string[],         // Filter by tags
  userId?: string,         // Filter by user
}

Time-Based Sampling

Evaluate specific time periods:

// Last 24 hours
await Eval("Recent Interactions", {
  data: interactions({
    startDate: new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString(),
    limit: 50,
  }),
  scores: [Effectiveness],
})

// Specific date range
await Eval("January 2025", {
  data: interactions({
    startDate: "2025-01-01",
    endDate: "2025-01-31",
  }),
  scores: [Effectiveness],
})

Tag-Based Filtering

Focus on specific interaction types:

await Eval("Error Cases", {
  data: interactions({
    tags: ["error", "failure"],
    limit: 100,
  }),
  scores: [Execution, Effectiveness],
})

await Eval("High Priority", {
  data: interactions({
    tags: ["high-priority", "enterprise"],
  }),
  scores: [Effectiveness, GuardrailAdherence],
})

User-Specific Evaluation

Test for specific users:

await Eval("VIP User Experience", {
  data: interactions({
    userId: "vip-user-123",
    limit: 50,
  }),
  scores: [Effectiveness, ToneScorer],
})

Custom Data Sources

Create custom data sources when the built-in ones don't fit: any function that returns an array of evaluation cases can be used as a data source.

Simple Function

async function myDataSource() {
  return [
    { input: "test 1", output: "result 1" },
    { input: "test 2", output: "result 2" },
  ]
}

await Eval("Custom Source Eval", {
  data: myDataSource,
  scores: [Effectiveness],
})

Database Source

Load from your database:

import { Database } from 'bun:sqlite'

async function loadFromDatabase() {
  const db = new Database('./data.db')
  const rows = db.query('SELECT * FROM test_cases LIMIT 100').all()

  return rows.map(row => ({
    input: row.input,
    output: row.output,
    expected: row.expected,
    tags: row.tags?.split(','),
  }))
}

await Eval("Database Eval", {
  data: loadFromDatabase,
  scores: [Effectiveness],
})

API Source

Fetch from external API:

async function loadFromAPI() {
  const response = await fetch('https://api.example.com/test-cases')
  if (!response.ok) {
    throw new Error(`Failed to load test cases: ${response.status}`)
  }
  const data = await response.json()

  return data.testCases.map(tc => ({
    input: tc.question,
    output: tc.answer,
    expected: tc.groundTruth,
    tags: tc.categories,
  }))
}

await Eval("API Eval", {
  data: loadFromAPI,
  scores: [Effectiveness],
})

File-Based Source

Load from CSV, JSON, or other files:

import { readFile } from 'node:fs/promises'

async function loadFromCSV() {
  const content = await readFile('./test-cases.csv', 'utf-8')
  const lines = content.split('\n').slice(1) // Skip header

  return lines
    .filter(line => line.trim()) // Skip blank lines (e.g. a trailing newline)
    .map(line => {
      const [input, output, expected] = line.split(',')
      return { input, output, expected }
    })
}

async function loadFromJSON() {
  const content = await readFile('./test-cases.json', 'utf-8')
  const data = JSON.parse(content)
  return data.testCases
}

await Eval("File-Based Eval", {
  data: loadFromCSV,
  scores: [Effectiveness],
})
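
The simple line.split(',') above breaks if a field itself contains a comma. A minimal quote-aware parser is sketched below; this is illustrative only, and for messy production CSV a dedicated CSV library is the safer choice:

```typescript
// Parses one CSV line, honoring double-quoted fields and "" escapes.
function parseCSVLine(line: string): string[] {
  const fields: string[] = []
  let current = ''
  let inQuotes = false

  for (let i = 0; i < line.length; i++) {
    const ch = line[i]
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"' // Escaped quote inside a quoted field
        i++
      } else if (ch === '"') {
        inQuotes = false // Closing quote
      } else {
        current += ch
      }
    } else if (ch === '"') {
      inQuotes = true // Opening quote
    } else if (ch === ',') {
      fields.push(current) // Field boundary
      current = ''
    } else {
      current += ch
    }
  }

  fields.push(current)
  return fields
}
```

Swap parseCSVLine in for line.split(',') when your test cases may contain commas inside quoted fields.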

Parameterized Source

Create reusable sources with parameters:

function createDatabaseSource(options: {
  table: string
  limit: number
  where?: string
}) {
  return async () => {
    const db = new Database('./data.db')
    // Note: options are interpolated directly into the SQL string.
    // Only pass trusted, developer-controlled values here, never user input.
    const whereClause = options.where ? `WHERE ${options.where}` : ''
    const query = `SELECT * FROM ${options.table} ${whereClause} LIMIT ${options.limit}`
    const rows = db.query(query).all()

    return rows.map(row => ({
      input: row.input,
      output: row.output,
      expected: row.expected,
    }))
  }
}

// Use with different parameters
await Eval("Recent Errors", {
  data: createDatabaseSource({
    table: 'interactions',
    limit: 100,
    where: "status = 'error' AND created_at > datetime('now', '-7 days')",
  }),
  scores: [Execution],
})

Cached Source

Implement caching for expensive data loads:

let cachedData: any[] | null = null
let cacheTimestamp = 0
const CACHE_TTL = 5 * 60 * 1000 // 5 minutes

async function cachedDataSource() {
  const now = Date.now()

  if (cachedData && (now - cacheTimestamp) < CACHE_TTL) {
    console.log('Using cached data')
    return cachedData
  }

  console.log('Fetching fresh data')
  const data = await expensiveDataFetch()
  cachedData = data
  cacheTimestamp = now

  return data
}

await Eval("Cached Eval", {
  data: cachedDataSource,
  scores: [Effectiveness],
})
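
The module-level variables above work for a single source; if several sources need the same caching behavior, the pattern can be wrapped in a small higher-order function. This is a sketch, not part of the framework API:

```typescript
// Wraps any async data source with a time-based (TTL) cache.
// Each wrapped source keeps its own cache and timestamp.
function withCache<T>(source: () => Promise<T[]>, ttlMs: number): () => Promise<T[]> {
  let cached: T[] | null = null
  let fetchedAt = 0

  return async () => {
    const now = Date.now()
    if (cached && now - fetchedAt < ttlMs) {
      return cached // Still fresh: skip the expensive fetch
    }
    cached = await source()
    fetchedAt = now
    return cached
  }
}
```

Usage mirrors the example above (expensiveDataFetch is the same hypothetical loader): data: withCache(() => expensiveDataFetch(), 5 * 60 * 1000).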

Data Transformation

Transform data as it flows through your evaluation.

In Data Source

Transform during data loading:

async function transformedSource() {
  const rawData = await fetch('https://api.example.com/data').then(r => r.json())

  return rawData.map(item => ({
    input: item.user_message,
    output: item.assistant_response,
    expected: item.ground_truth,
    tags: [item.category, item.priority],
    ctx: {
      userId: item.user_id,
      timestamp: item.created_at,
    },
  }))
}

With Task Function

Transform inputs before scoring:

await Eval("Transform in Task", {
  data: [
    { input: "What is 2+2?" },
    { input: "What is the capital of France?" },
  ],
  task: async (input) => {
    // Call LLM and transform output
    const rawOutput = await callLLM(input)
    const cleanedOutput = rawOutput.trim().replace(/\n+/g, ' ')
    return cleanedOutput
  },
  scores: [Effectiveness],
})

Context Enrichment

Add context during data loading:

async function enrichedSource() {
  const testCases = await loadTestCases()

  return testCases.map(tc => ({
    ...tc,
    ctx: {
      // Add metadata as context
      complexity: calculateComplexity(tc.input),
      expectedDuration: estimateDuration(tc.input),
      category: categorize(tc.input),
    },
  }))
}

await Eval("Enriched Eval", {
  data: enrichedSource,
  scores: [Effectiveness],
})
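
calculateComplexity, estimateDuration, and categorize above are placeholders for your own logic. As a purely illustrative example, a trivial length-and-question-count heuristic for calculateComplexity might look like:

```typescript
// A toy complexity heuristic: longer, multi-question inputs score higher.
// Thresholds are arbitrary and would be tuned for your own workload.
function calculateComplexity(input: string): 'low' | 'medium' | 'high' {
  const words = input.trim().split(/\s+/).length
  const questions = (input.match(/\?/g) ?? []).length
  const score = words + questions * 5

  if (score > 50) return 'high'
  if (score > 15) return 'medium'
  return 'low'
}
```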

Best Practices

Start Small, Scale Up

Begin with a small dataset to validate your evaluation:

// Development - small static dataset
await Eval("Dev Test", {
  data: [
    { input: "test 1", output: "result 1" },
    { input: "test 2", output: "result 2" },
  ],
  scores: [Effectiveness],
})

// Staging - moderate production sample
await Eval("Staging Test", {
  data: interactions({ limit: 20 }),
  scores: [Effectiveness],
})

// Production - full evaluation
await Eval("Production Test", {
  data: interactions({ limit: 200 }),
  scores: [Effectiveness, GuardrailAdherence, Factuality],
})

Use Representative Samples

Ensure your data represents real usage:

await Eval("Representative Sample", {
  data: interactions({
    // Mix of success and failure
    tags: ["success", "error"],
    // Recent data
    startDate: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000).toISOString(),
    // Reasonable size
    limit: 100,
  }),
  scores: [Effectiveness],
})

Handle Missing Data

Gracefully handle incomplete data:

async function robustSource() {
  const rawData = await loadData()

  return rawData
    .filter(item => item.input && item.output) // Remove incomplete
    .map(item => ({
      input: item.input,
      output: item.output,
      expected: item.expected || null, // Allow missing expected
      tags: item.tags || [],
    }))
}
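
It also helps to log how much data the filter discards, so a silently degrading upstream source gets noticed. A small extension of the filter step (filterComplete is a hypothetical helper, same shape as robustSource):

```typescript
type RawItem = { input?: string; output?: string; expected?: string; tags?: string[] }

// Filters out incomplete rows and reports how many were dropped.
function filterComplete(rawData: RawItem[]): { data: RawItem[]; dropped: number } {
  const data = rawData.filter(item => item.input && item.output)
  const dropped = rawData.length - data.length

  if (dropped > 0) {
    console.warn(`Dropped ${dropped} of ${rawData.length} rows missing input/output`)
  }

  return { data, dropped }
}
```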

Version Your Data

Track which data version was used:

async function versionedSource() {
  const version = "v2.1.0"
  const data = await loadData(version)

  return data.map(item => ({
    ...item,
    tags: [...(item.tags || []), `data-version:${version}`],
  }))
}

Document Your Sources

Add clear documentation:

/**
 * Loads customer support interactions from production database.
 *
 * Filters:
 * - Only resolved tickets
 * - Last 30 days
 * - English language only
 *
 * Returns up to 500 interactions with:
 * - input: Customer question
 * - output: Agent response
 * - expected: Quality rating (1-5)
 */
async function customerSupportSource() {
  // Implementation...
}
