Scorers

Scorers measure the quality of outputs in your evaluations. Use built-in scorers or define custom logic.


What are Scorers?

Scorers evaluate outputs and return scores (0-1) plus optional metadata. They answer questions like:

  • "How effective is this response?"
  • "Does it follow guardrails?"
  • "Is it factually correct?"
  • "Does it execute successfully?"

Scorers can be:

  • Deterministic - Rule-based logic (length checks, format validation)
  • LLM-based - Use AI to judge quality (effectiveness, tone, relevance)
  • Hybrid - Combine both approaches
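
Whichever approach you use, every scorer resolves to the same result shape. Here is a minimal sketch of that shape, inferred from the handler return values shown later on this page (the exact exported type name is an assumption):

// Assumed result shape, based on the handler examples below
type ScorerResult = {
  name: string                         // which scorer produced the result
  score: number                        // normalized to the 0-1 range
  metadata?: Record<string, unknown>   // optional diagnostic details
}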

Built-in Scorers

Orchestrated includes several pre-built scorers:

Effectiveness

Measures how well responses satisfy user requests (1-5 scale, normalized to 0-1):

import { Effectiveness } from 'orchestrated'

await Eval("Effectiveness Eval", {
  data: [/* ... */],
  scores: [Effectiveness],
})
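
The 1-5 rating is mapped linearly onto 0-1. This sketch assumes the same choice-to-score scheme used by the choiceScores example later on this page; the scorer applies it internally, so you never do this yourself:

// Assumed linear mapping from a 1-5 rating to a 0-1 score
const normalize = (rating: number) => (rating - 1) / 4
// normalize(1) === 0, normalize(3) === 0.5, normalize(5) === 1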

GuardrailAdherence

Evaluates adherence to system prompt guardrails:

import { GuardrailAdherence } from 'orchestrated'

await Eval("Guardrail Eval", {
  data: [/* ... */],
  scores: [GuardrailAdherence],
})

Execution

A binary scorer that checks whether execution succeeded, returning 0 or 1:

import { Execution } from 'orchestrated'

await Eval("Execution Eval", {
  data: [
    { input: "test", output: "result" },  // score: 1
    { input: "test", output: null },      // score: 0
  ],
  scores: [Execution],
})

From autoevals

Scorers from the autoevals library are re-exported by orchestrated, so you can import them directly:

import { Factuality, AnswerRelevancy } from 'orchestrated'

await Eval("Autoevals Eval", {
  data: [/* ... */],
  scores: [Factuality, AnswerRelevancy],
})

Custom Scorers

Define your own scoring logic with createCustomScorer():

import { createCustomScorer } from 'orchestrated'
import { z } from 'zod'

const LengthChecker = createCustomScorer({
  name: "LengthChecker",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => ({
    name: "LengthChecker",
    score: args.output.length <= 100 ? 1 : 0,
    metadata: { length: args.output.length },
  }),
})

await Eval("Custom Scorer Eval", {
  data: [/* ... */],
  scores: [LengthChecker],
})

With Context

Scorer handlers receive a context object as their second parameter:

const ContextScorer = createCustomScorer({
  name: "ContextScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  handler: async (args, ctx) => {
    // Access context values
    const maxLength = ctx.maxLength || 100

    return {
      name: "ContextScorer",
      score: args.output.length <= maxLength ? 1 : 0,
      metadata: { maxLength },
    }
  },
})

Complex Logic

Implement any scoring logic you need:

const ComplexScorer = createCustomScorer({
  name: "ComplexScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
    expected: z.string().optional(),
  }),
  handler: async (args) => {
    // Multi-factor scoring
    const hasKeywords = /important|critical/.test(args.output)
    const matchesExpected = args.expected
      ? args.output.includes(args.expected)
      : true
    const properLength = args.output.length > 10 && args.output.length < 500

    const score = [hasKeywords, matchesExpected, properLength]
      .filter(Boolean).length / 3

    return {
      name: "ComplexScorer",
      score,
      metadata: { hasKeywords, matchesExpected, properLength },
    }
  },
})

LLM-as-Judge Scorers

Use LLMs to evaluate quality with createTypedScorer():

import { createTypedScorer } from 'orchestrated'
import { z } from 'zod'

const ToneScorer = createTypedScorer({
  name: "ToneScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  promptTemplate: `
    Evaluate the tone of this response on a scale of 1-5:
    1 = Very negative
    3 = Neutral
    5 = Very positive

    Input: {{input}}
    Output: {{output}}

    Rate the tone (1-5):
  `,
  choiceScores: {
    "1": 0.0,
    "2": 0.25,
    "3": 0.5,
    "4": 0.75,
    "5": 1.0,
  },
})

await Eval("Tone Eval", {
  data: [/* ... */],
  scores: [ToneScorer],
})

LLM-based scorers are automatically batched for cost efficiency.


Scorer Best Practices

Return Metadata

Include diagnostic information in metadata:

handler: async (args) => ({
  name: "MyScorer",
  score: calculateScore(args),
  metadata: {
    reason: "Output too short",
    threshold: 100,
    actual: args.output.length,
  },
})

Normalize Scores

Always return scores between 0 and 1:

// Good
const score = actualValue / maxValue

// Less good
const score = actualValue  // Could be > 1
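
If a raw metric can fall outside the range, clamp it before returning. A small helper like this (a generic sketch, not part of the orchestrated API) keeps every scorer within bounds:

// Clamp an arbitrary raw value into the 0-1 range
const clamp01 = (value: number) => Math.min(1, Math.max(0, value))

const score = clamp01(actualValue / maxValue)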

Handle Missing Data

Gracefully handle missing or null fields:

handler: async (args) => {
  if (!args.output) {
    return {
      name: "MyScorer",
      score: 0,
      metadata: { error: "No output provided" },
    }
  }
  // ... scoring logic
}

Keep Scorers Focused

Each scorer should measure one thing:

// Good - focused scorer
const LengthScorer = createCustomScorer(/* ... */)
const FormatScorer = createCustomScorer(/* ... */)

// Less good - multi-purpose scorer
const EverythingScorer = createCustomScorer(/* ... */)
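
As a concrete example, a focused format scorer checks exactly one property of the output and leaves length, tone, and everything else to other scorers. This is a sketch built on the createCustomScorer pattern shown above; the JSON check is illustrative:

// A focused scorer: checks only that the output parses as JSON
const JsonFormatScorer = createCustomScorer({
  name: "JsonFormatScorer",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => {
    let valid = true
    try {
      JSON.parse(args.output)
    } catch {
      valid = false
    }
    return {
      name: "JsonFormatScorer",
      score: valid ? 1 : 0,
      metadata: { valid },
    }
  },
})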
