Tasks

Tasks transform inputs into outputs in your evaluations. They're optional, but they're what let you test a live system instead of scoring pre-generated outputs.


What are Tasks?

Tasks are functions that generate outputs from inputs. Use tasks when you want to:

  • Test live systems - Call your LLM, API, or application with test inputs
  • Compare approaches - Evaluate different prompts, models, or configurations
  • Generate outputs - Create outputs to score when you only have inputs

Without a task, your data must include pre-generated output fields. With a task, Orchestrated generates outputs for you.
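
For contrast, here's a minimal sketch of the task-free form. It assumes the pre-generated field is named output; match whatever field name your scorers expect:

await Eval("Pre-generated Outputs Eval", {
  data: [
    { input: "What is 2+2?", output: "4", expected: "4" },
    { input: "Capital of France?", output: "Paris", expected: "Paris" },
  ],
  // No task: the eval scores the supplied outputs directly
  scores: [Factuality],
})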


Basic Tasks

Simple Task

The most basic task takes an input and returns an output:

await Eval("Basic Task Eval", {
  data: [
    { input: "What is 2+2?" },
    { input: "What is the capital of France?" },
  ],
  task: async (input) => {
    return await callLLM(input)
  },
  scores: [Effectiveness],
})
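
The callLLM helper used throughout this page is a placeholder for your own model call. A minimal sketch of one possible implementation, assuming the official openai Node SDK and an OPENAI_API_KEY in the environment:

import OpenAI from "openai"

// Hypothetical helper; swap in whatever client your application uses.
async function callLLM(
  input: string,
  options: { apiKey?: string; model?: string; temperature?: number } = {},
) {
  const openai = new OpenAI({ apiKey: options.apiKey })  // falls back to OPENAI_API_KEY
  const response = await openai.chat.completions.create({
    model: options.model ?? "gpt-4o-mini",
    temperature: options.temperature,
    messages: [{ role: "user", content: input }],
  })
  return response.choices[0].message.content ?? ""
}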

With Expected Values

Include expected values in your data for comparison:

await Eval("Expected Values Eval", {
  data: [
    { input: "What is 2+2?", expected: "4" },
    { input: "Capital of France?", expected: "Paris" },
  ],
  task: async (input) => {
    return await callLLM(input)
  },
  scores: [Factuality],  // Compares output to expected
})

Context Access

Tasks receive the merged context as their second parameter:

await Eval("Context Task Eval", {
  ctx: {
    apiKey: process.env.OPENAI_API_KEY,
    model: "gpt-4",
    temperature: 0.7,
  },
  data: [{ input: "test" }],
  task: async (input, ctx) => {
    // Access context values
    return await callLLM(input, {
      apiKey: ctx.apiKey,
      model: ctx.model,
      temperature: ctx.temperature,
    })
  },
  scores: [Effectiveness],
})

State Access

Global state is auto-injected as ctx.state:

task: async (input, ctx) => {
  console.log(ctx.state.tenantId)
  console.log(ctx.state.environment)
  console.log(ctx.state.loggedInUser)

  return await callLLM(input)
}

Context Overrides

Tasks can return context overrides for downstream scorers:

await Eval("Override Context Eval", {
  data: [{ input: "test" }],
  task: async (input, ctx) => {
    const startTime = Date.now()
    const output = await callLLM(input)
    const latency = Date.now() - startTime

    // Return [output, contextOverride]
    return [output, { latency }]
  },
  scores: [
    createCustomScorer({
      name: "LatencyChecker",
      schema: z.object({ output: z.string() }),
      handler: async (args, ctx) => ({
        name: "LatencyChecker",
        score: ctx.latency < 1000 ? 1 : 0,
        metadata: { latency: ctx.latency },
      }),
    }),
  ],
})

Scorers receive the merged context, including any overrides returned by the task.


Error Handling

Graceful Failures

If a task throws an error, the evaluation continues with other test cases:

task: async (input) => {
  try {
    return await callLLM(input)
  } catch (error) {
    console.error(`Task failed for input: ${input}`, error)
    throw error  // Evaluation continues, but this test case fails
  }
}

Failed test cases are marked as errors in the results summary.

Retry Logic

Implement retry logic for transient failures:

task: async (input) => {
  let retries = 3
  while (retries > 0) {
    try {
      return await callLLM(input)
    } catch (error) {
      retries--
      if (retries === 0) throw error
      await new Promise(r => setTimeout(r, 1000))
    }
  }
}
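
For rate-limited APIs, an exponential backoff between attempts is usually a better fit than a fixed one-second delay. A sketch of the same loop with backoff and jitter (the attempt count and delays are illustrative, not library defaults):

task: async (input) => {
  const maxAttempts = 3
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callLLM(input)
    } catch (error) {
      if (attempt === maxAttempts) throw error
      // 1s, 2s, ... plus up to 250ms of jitter
      const delay = 1000 * 2 ** (attempt - 1) + Math.random() * 250
      await new Promise(r => setTimeout(r, delay))
    }
  }
}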

Timeouts

Add timeouts to prevent hanging tasks:

task: async (input) => {
  const timeout = 30000  // 30 seconds

  const result = await Promise.race([
    callLLM(input),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), timeout)
    ),
  ])

  return result
}
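
Note that Promise.race only stops waiting: the losing request keeps running and the timer stays armed. If your callLLM can forward an AbortSignal to its HTTP client (an assumption about your implementation), a sketch that cancels the request and cleans up the timer:

task: async (input) => {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), 30_000)  // 30 seconds
  try {
    // Assumes callLLM passes options.signal through to fetch or the SDK
    return await callLLM(input, { signal: controller.signal })
  } finally {
    clearTimeout(timer)
  }
}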
