Defining Custom Scorers

Build custom scorers to measure domain-specific quality metrics that matter to your use case.


When to Create Custom Scorers

Create custom scorers when built-in scorers don't meet your needs:

Common Use Cases

  • Domain-specific validation - Check medical terminology, legal compliance, or technical accuracy
  • Format requirements - Validate JSON structure, markdown formatting, or specific output patterns
  • Business rules - Enforce company policies, brand voice, or regulatory requirements
  • Performance metrics - Measure response time, token usage, or cost efficiency (see the sketch after this list)
  • Multi-factor quality - Combine multiple criteria into a single composite score
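
For example, a performance scorer only needs fields your eval data already carries. A minimal sketch using createCustomScorer() (introduced below), assuming your framework passes extra data fields, like a hypothetical durationMs, through to the scorer:

// Sketch only: assumes eval rows carry a durationMs field
const LatencyScorer = createCustomScorer({
  name: "LatencyScorer",
  schema: z.object({
    durationMs: z.number(),
  }),
  handler: async (args) => ({
    name: "LatencyScorer",
    // Full credit under 500 ms, decaying linearly to 0 at 2000 ms
    score: Math.max(0, Math.min(1, (2000 - args.durationMs) / 1500)),
    metadata: { durationMs: args.durationMs },
  }),
})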

Decision Guide

// Use built-in scorers when possible
scores: [Effectiveness, Factuality, Execution]

// Create custom scorers for unique requirements
scores: [
  Effectiveness,
  JSONFormatValidator,      // Custom
  MedicalTerminologyChecker, // Custom
  BrandVoiceCompliance,     // Custom
]

Simple Custom Scorers

Use createCustomScorer() for deterministic, rule-based scoring:

Basic Structure

import { createCustomScorer } from 'orchestrated'
import { z } from 'zod'

const MyScorer = createCustomScorer({
  name: "MyScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  handler: async (args) => ({
    name: "MyScorer",
    score: 0.0,  // 0-1 range
    metadata: {}, // Optional diagnostic info
  }),
})

Length Validator

Check output length constraints:

const LengthValidator = createCustomScorer({
  name: "LengthValidator",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => {
    const length = args.output.length
    const minLength = 10
    const maxLength = 500
    const isValid = length >= minLength && length <= maxLength

    return {
      name: "LengthValidator",
      score: isValid ? 1 : 0,
      metadata: {
        length,
        minLength,
        maxLength,
        isValid,
      },
    }
  },
})
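
Binary scores aren't required; anything in the 0-1 range works. A sketch of a graded variant (same limits as above) that awards partial credit as output approaches the valid range:

const GradedLengthValidator = createCustomScorer({
  name: "GradedLengthValidator",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => {
    const length = args.output.length
    const minLength = 10
    const maxLength = 500

    // Full credit inside the range, linear falloff outside it
    let score = 1
    if (length < minLength) score = length / minLength
    if (length > maxLength) score = Math.max(0, 1 - (length - maxLength) / maxLength)

    return {
      name: "GradedLengthValidator",
      score,
      metadata: { length, minLength, maxLength },
    }
  },
})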

Format Validator

Validate specific output formats:

const JSONFormatValidator = createCustomScorer({
  name: "JSONFormatValidator",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => {
    let isValid = false
    let parsedData: Record<string, unknown> | null = null
    let errorMessage: string | null = null

    try {
      parsedData = JSON.parse(args.output)
      isValid = true
    } catch (error) {
      // Caught values are not guaranteed to be Error instances
      errorMessage = error instanceof Error ? error.message : String(error)
    }

    return {
      name: "JSONFormatValidator",
      score: isValid ? 1 : 0,
      metadata: {
        isValid,
        errorMessage,
        // 'id' and 'name' are example required fields; adjust for your schema
        hasRequiredFields: parsedData !== null && 'id' in parsedData && 'name' in parsedData,
      },
    }
  },
})
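
For stricter structural checks than hand-written field tests, you can lean on the zod import already in scope. A sketch that validates the parsed JSON against a schema (the id/name shape is illustrative, not required by the library):

// The expected shape is an example; substitute your own structure
const ExpectedShape = z.object({
  id: z.string(),
  name: z.string(),
})

const JSONSchemaValidator = createCustomScorer({
  name: "JSONSchemaValidator",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => {
    let score = 0
    let issues: string[] = ['output is not valid JSON']

    try {
      // safeParse reports schema violations without throwing
      const result = ExpectedShape.safeParse(JSON.parse(args.output))
      score = result.success ? 1 : 0
      issues = result.success ? [] : result.error.issues.map(issue => issue.message)
    } catch {
      // JSON.parse threw; keep the default score and issue message
    }

    return {
      name: "JSONSchemaValidator",
      score,
      metadata: { issues },
    }
  },
})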

Keyword Checker

Verify presence of required keywords:

const KeywordChecker = createCustomScorer({
  name: "KeywordChecker",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => {
    const requiredKeywords = ['important', 'verified', 'approved']
    const foundKeywords = requiredKeywords.filter(keyword =>
      args.output.toLowerCase().includes(keyword.toLowerCase())
    )
    const score = foundKeywords.length / requiredKeywords.length

    return {
      name: "KeywordChecker",
      score,
      metadata: {
        requiredKeywords,
        foundKeywords,
        missingKeywords: requiredKeywords.filter(k => !foundKeywords.includes(k)),
      },
    }
  },
})

Using Context

Access context values in your scorer:

const ContextualValidator = createCustomScorer({
  name: "ContextualValidator",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args, ctx) => {
    // Access context values, with defaults when the context omits them
    const maxLength = ctx.maxLength ?? 100
    const requiredTone = ctx.requiredTone ?? 'professional'

    const lengthValid = args.output.length <= maxLength
    // checkTone is a user-defined helper; a sketch follows this example
    const toneValid = checkTone(args.output, requiredTone)

    return {
      name: "ContextualValidator",
      score: (lengthValid && toneValid) ? 1 : 0,
      metadata: {
        maxLength,
        requiredTone,
        lengthValid,
        toneValid,
      },
    }
  },
})

// Use with context
await Eval("Contextual Eval", {
  ctx: { maxLength: 200, requiredTone: 'friendly' },
  data: [/* ... */],
  scores: [ContextualValidator],
})
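
checkTone above is not part of the library. A minimal, keyword-based sketch of how you might implement it (the tone markers are assumptions):

// Hypothetical helper: naive keyword-based tone check
function checkTone(output: string, requiredTone: string): boolean {
  const toneMarkers: Record<string, string[]> = {
    professional: ['regards', 'sincerely', 'please'],
    friendly: ['thanks', 'happy', 'glad'],
  }
  const markers = toneMarkers[requiredTone] ?? []
  const text = output.toLowerCase()
  // Treat the tone as matched if any marker appears in the output
  return markers.some(marker => text.includes(marker))
}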

LLM-as-Judge Scorers

Use createTypedScorer() to leverage LLMs for subjective quality judgments:

Basic LLM Scorer

import { createTypedScorer } from 'orchestrated'

const ToneScorer = createTypedScorer({
  name: "ToneScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  promptTemplate: `
    Evaluate the tone of this response on a scale of 1-5:
    1 = Very negative/hostile
    2 = Somewhat negative
    3 = Neutral
    4 = Somewhat positive
    5 = Very positive/friendly

    User Input: {{input}}
    Assistant Output: {{output}}

    Rate the tone (respond with just the number 1-5):
  `,
  choiceScores: {
    "1": 0.0,
    "2": 0.25,
    "3": 0.5,
    "4": 0.75,
    "5": 1.0,
  },
})
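
The choiceScores map converts the model's discrete answer into a numeric score. Running the scorer follows the same Eval pattern shown earlier:

await Eval("Tone Eval", {
  data: [
    { input: "My order arrived late.", output: "Sorry about the delay! We'll make it right." },
  ],
  scores: [ToneScorer],
})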

Multi-Criteria LLM Scorer

Evaluate multiple aspects with a single LLM call:

const ComprehensiveQuality = createTypedScorer({
  name: "ComprehensiveQuality",
  schema: z.object({
    input: z.string(),
    output: z.string(),
    expected: z.string().optional(),
  }),
  promptTemplate: `
    Evaluate this response across multiple dimensions:

    User Question: {{input}}
    Assistant Response: {{output}}
    {{#expected}}Expected Answer: {{expected}}{{/expected}}

    Rate the overall quality on a scale of 1-5:
    1 = Poor (incorrect, unhelpful, or off-topic)
    2 = Below average (partially correct but missing key points)
    3 = Average (correct but could be more complete)
    4 = Good (accurate and helpful)
    5 = Excellent (accurate, complete, and well-explained)

    Consider:
    - Accuracy of information
    - Completeness of answer
    - Clarity of explanation
    - Relevance to question

    Provide your rating (1-5):
  `,
  choiceScores: {
    "1": 0.0,
    "2": 0.25,
    "3": 0.5,
    "4": 0.75,
    "5": 1.0,
  },
})

Domain-Specific LLM Scorer

Create scorers for specific domains:

const MedicalAccuracy = createTypedScorer({
  name: "MedicalAccuracy",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  promptTemplate: `
    As a medical professional, evaluate the accuracy and safety of this health information:

    Patient Question: {{input}}
    Assistant Response: {{output}}

    Rate on a scale of 1-5:
    1 = Dangerous or highly inaccurate medical information
    2 = Misleading or incomplete information
    3 = Generally accurate but lacks important context
    4 = Accurate with minor omissions
    5 = Highly accurate, safe, and comprehensive

    Rating (1-5):
  `,
  choiceScores: {
    "1": 0.0,
    "2": 0.25,
    "3": 0.5,
    "4": 0.75,
    "5": 1.0,
  },
})

Combining LLM and Rule-Based Scoring

Use both approaches together:

const HybridScorer = createCustomScorer({
  name: "HybridScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  handler: async (args, ctx) => {
    // Rule-based checks: minimum length, starts with a capital, ends with punctuation
    const hasMinLength = args.output.length >= 50
    const hasProperFormat = /^[A-Z][\s\S]*[.!?]$/.test(args.output)
    const ruleScore = (hasMinLength && hasProperFormat) ? 1 : 0

    // LLM check; evaluateWithLLM is a user-defined helper (sketched below)
    const llmScore = await evaluateWithLLM(args.input, args.output)

    // Equal-weight average of the rule-based and LLM components
    const finalScore = (ruleScore + llmScore) / 2

    return {
      name: "HybridScorer",
      score: finalScore,
      metadata: {
        ruleScore,
        llmScore,
        hasMinLength,
        hasProperFormat,
      },
    }
  },
})
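
evaluateWithLLM is not provided by the library. One possible sketch using the OpenAI SDK (the package, model name, and 1-5 rubric are all assumptions; any LLM client works):

import OpenAI from 'openai'

const client = new OpenAI() // reads OPENAI_API_KEY from the environment

// Hypothetical helper: ask an LLM for a 1-5 rating and map it onto 0-1
async function evaluateWithLLM(input: string, output: string): Promise<number> {
  const response = await client.chat.completions.create({
    model: 'gpt-4o-mini', // assumption: use whichever model fits your setup
    messages: [
      {
        role: 'user',
        content: `Rate this response from 1 (poor) to 5 (excellent). Reply with just the number.\n\nQuestion: ${input}\nResponse: ${output}`,
      },
    ],
  })
  const rating = parseInt(response.choices[0]?.message?.content ?? '', 10)
  // Map 1-5 onto 0-1, clamping unparseable or out-of-range replies
  if (Number.isNaN(rating)) return 0
  return Math.max(0, Math.min(1, (rating - 1) / 4))
}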

Testing Scorers

Always test your scorers before using them in production evaluations.

Unit Testing

import { describe, it, expect } from 'bun:test'

describe('LengthValidator', () => {
  it('should score 1 for valid length', async () => {
    const result = await LengthValidator.handler({
      output: 'This is a valid length response.',
    }, {})

    expect(result.score).toBe(1)
    expect(result.metadata.isValid).toBe(true)
  })

  it('should score 0 for too short', async () => {
    const result = await LengthValidator.handler({
      output: 'Short',
    }, {})

    expect(result.score).toBe(0)
    expect(result.metadata.isValid).toBe(false)
  })

  it('should score 0 for too long', async () => {
    const result = await LengthValidator.handler({
      output: 'x'.repeat(1000),
    }, {})

    expect(result.score).toBe(0)
  })
})

Testing with Real Data

Run a small evaluation to test your scorer:

await Eval("Scorer Test", {
  data: [
    { input: "test 1", output: "Valid response here" },
    { input: "test 2", output: "X" }, // Too short
    { input: "test 3", output: "X".repeat(1000) }, // Too long
  ],
  scores: [LengthValidator],
})

// Review results to ensure scorer behaves as expected

Debugging Scorers

Add verbose logging for debugging:

const DebugScorer = createCustomScorer({
  name: "DebugScorer",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args, ctx) => {
    console.log('Scorer Input:', args)
    console.log('Context:', ctx)

    // calculateScore stands in for your actual scoring logic
    const score = calculateScore(args)
    console.log('Calculated Score:', score)

    return {
      name: "DebugScorer",
      score,
      metadata: {
        debug: true,
        inputLength: args.output.length,
      },
    }
  },
})

Uploading Scorers

Upload custom scorers to the cloud for use in the web console and API.

Create a Project File

Define your scorers in a project file:

// project.ts
import { createCustomScorer, createTypedScorer } from 'orchestrated'
import { z } from 'zod'

export const LengthValidator = createCustomScorer({
  name: "LengthValidator",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => ({
    name: "LengthValidator",
    score: args.output.length >= 10 && args.output.length <= 500 ? 1 : 0,
    metadata: { length: args.output.length },
  }),
})

export const ToneScorer = createTypedScorer({
  name: "ToneScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  promptTemplate: `Evaluate tone (1-5): {{input}} -> {{output}}`,
  choiceScores: {
    "1": 0.0,
    "2": 0.25,
    "3": 0.5,
    "4": 0.75,
    "5": 1.0,
  },
})

Upload to Cloud

orcha upload project.ts

This command:

  1. Analyzes your scorers and extracts their definitions
  2. Bundles the handler functions
  3. Uploads the bundle to S3
  4. Generates a definitions.json file with scorer metadata

Verify Upload

Check the output for confirmation:

✓ Serialized 2 scorers
✓ Uploaded bundle to S3
✓ Generated definitions.json

Scorers:
  - LengthValidator (custom_scorer)
  - ToneScorer (typed_scorer)

Bundle: s3://bucket/tenant/service/abc123/handlers.bundle.js
Fingerprint: abc123

Use Uploaded Scorers

After uploading, scorers are available in:

  1. Web Console - Select from dropdown in evaluation builder
  2. API - Reference by slug in API calls
  3. CLI - Use definition objects in eval files

// Using uploaded scorer definitions
const evalDef = {
  scorers: [
    { type: "custom_scorer", slug: "length-validator", fingerprint: "abc123" },
    { type: "typed_scorer", slug: "tone-scorer", fingerprint: "abc123" },
  ],
}

await Eval("Cloud Scorers Eval", {
  data: [/* ... */],
  scores: evalDef.scorers,
})
