Defining Custom Scorers
Build custom scorers to measure domain-specific quality metrics that matter to your use case.
When to Create Custom Scorers
Create custom scorers when built-in scorers don't meet your needs:
Common Use Cases
- Domain-specific validation - Check medical terminology, legal compliance, or technical accuracy
- Format requirements - Validate JSON structure, markdown formatting, or specific output patterns
- Business rules - Enforce company policies, brand voice, or regulatory requirements
- Performance metrics - Measure response time, token usage, or cost efficiency (see the Performance Scorer example below)
- Multi-factor quality - Combine multiple criteria into a single composite score
Decision Guide
// Use built-in scorers when possible
scores: [Effectiveness, Factuality, Execution]
// Create custom scorers for unique requirements
scores: [
Effectiveness,
JSONFormatValidator, // Custom
MedicalTerminologyChecker, // Custom
BrandVoiceCompliance, // Custom
]
Simple Custom Scorers
Use createCustomScorer() for deterministic, rule-based scoring:
Basic Structure
import { createCustomScorer } from 'orchestrated'
import { z } from 'zod'
const MyScorer = createCustomScorer({
name: "MyScorer",
schema: z.object({
input: z.string(),
output: z.string(),
}),
handler: async (args) => ({
name: "MyScorer",
score: 0.0, // 0-1 range
metadata: {}, // Optional diagnostic info
}),
})
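Once defined, a scorer is passed to an evaluation like any built-in scorer. A minimal sketch, following the Eval calls shown later on this page (the data row is illustrative):
await Eval("My Scorer Eval", {
  data: [
    { input: "What is 2 + 2?", output: "4" },
  ],
  scores: [MyScorer],
})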
Length Validator
Check output length constraints:
const LengthValidator = createCustomScorer({
name: "LengthValidator",
schema: z.object({
output: z.string(),
}),
handler: async (args) => {
const length = args.output.length
const minLength = 10
const maxLength = 500
const isValid = length >= minLength && length <= maxLength
return {
name: "LengthValidator",
score: isValid ? 1 : 0,
metadata: {
length,
minLength,
maxLength,
isValid,
},
}
},
})
Format Validator
Validate specific output formats:
const JSONFormatValidator = createCustomScorer({
name: "JSONFormatValidator",
schema: z.object({
output: z.string(),
}),
handler: async (args) => {
let isValid = false
let parsedData = null
let errorMessage = null
try {
parsedData = JSON.parse(args.output)
isValid = true
} catch (error) {
errorMessage = error instanceof Error ? error.message : String(error)
}
return {
name: "JSONFormatValidator",
score: isValid ? 1 : 0,
metadata: {
isValid,
errorMessage,
hasRequiredFields: parsedData !== null && typeof parsedData === 'object' && 'id' in parsedData && 'name' in parsedData,
},
}
},
})
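If the output must match a specific shape, one option is to validate the parsed value against a zod schema instead of checking fields by hand. A sketch, assuming an example shape with id and name fields:
const ExpectedShape = z.object({ id: z.string(), name: z.string() })
const JSONSchemaValidator = createCustomScorer({
  name: "JSONSchemaValidator",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => {
    let parsed: unknown
    try {
      parsed = JSON.parse(args.output)
    } catch {
      return { name: "JSONSchemaValidator", score: 0, metadata: { error: 'invalid JSON' } }
    }
    const result = ExpectedShape.safeParse(parsed)
    return {
      name: "JSONSchemaValidator",
      score: result.success ? 1 : 0,
      metadata: {
        // zod reports each structural problem as an issue with a human-readable message
        issues: result.success ? [] : result.error.issues.map(issue => issue.message),
      },
    }
  },
})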
Keyword Checker
Verify presence of required keywords:
const KeywordChecker = createCustomScorer({
name: "KeywordChecker",
schema: z.object({
output: z.string(),
}),
handler: async (args) => {
const requiredKeywords = ['important', 'verified', 'approved']
const foundKeywords = requiredKeywords.filter(keyword =>
args.output.toLowerCase().includes(keyword.toLowerCase())
)
const score = foundKeywords.length / requiredKeywords.length
return {
name: "KeywordChecker",
score,
metadata: {
requiredKeywords,
foundKeywords,
missingKeywords: requiredKeywords.filter(k => !foundKeywords.includes(k)),
},
}
},
})
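Performance Scorer
Score operational metrics such as latency or token usage. A sketch, assuming your eval rows include latencyMs and tokenCount fields (they are not produced automatically):
const PerformanceBudget = createCustomScorer({
  name: "PerformanceBudget",
  schema: z.object({
    latencyMs: z.number(),
    tokenCount: z.number(),
  }),
  handler: async (args) => {
    const latencyBudgetMs = 2000
    const tokenBudget = 1000
    // Full credit at or under budget, linearly less credit up to 2x budget, zero beyond that
    const latencyScore = Math.max(0, Math.min(1, 2 - args.latencyMs / latencyBudgetMs))
    const tokenScore = Math.max(0, Math.min(1, 2 - args.tokenCount / tokenBudget))
    return {
      name: "PerformanceBudget",
      score: (latencyScore + tokenScore) / 2,
      metadata: { latencyMs: args.latencyMs, tokenCount: args.tokenCount, latencyScore, tokenScore },
    }
  },
})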
Using Context
Access context values in your scorer:
const ContextualValidator = createCustomScorer({
name: "ContextualValidator",
schema: z.object({
output: z.string(),
}),
handler: async (args, ctx) => {
// Access context values
const maxLength = ctx.maxLength || 100
const requiredTone = ctx.requiredTone || 'professional'
const lengthValid = args.output.length <= maxLength
const toneValid = checkTone(args.output, requiredTone) // your own helper; a sketch follows after this example
return {
name: "ContextualValidator",
score: (lengthValid && toneValid) ? 1 : 0,
metadata: {
maxLength,
requiredTone,
lengthValid,
toneValid,
},
}
},
})
// Use with context
await Eval("Contextual Eval", {
ctx: { maxLength: 200, requiredTone: 'friendly' },
data: [/* ... */],
scores: [ContextualValidator],
})
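The checkTone helper used above is not part of the library. A minimal rule-based sketch is shown below; a real implementation would more likely use an LLM judge, as in the next section:
function checkTone(output: string, requiredTone: string): boolean {
  // Rough heuristic: flag obviously informal language when a professional tone is required
  const informalMarkers = ['lol', 'gonna', 'wanna', '!!!']
  const isInformal = informalMarkers.some(marker => output.toLowerCase().includes(marker))
  if (requiredTone === 'professional') return !isInformal
  // Other tones are accepted by default in this sketch
  return true
}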
LLM-as-Judge Scorers
Use createTypedScorer() to leverage LLMs for subjective quality judgments:
Basic LLM Scorer
import { createTypedScorer } from 'orchestrated'
const ToneScorer = createTypedScorer({
name: "ToneScorer",
schema: z.object({
input: z.string(),
output: z.string(),
}),
promptTemplate: `
Evaluate the tone of this response on a scale of 1-5:
1 = Very negative/hostile
2 = Somewhat negative
3 = Neutral
4 = Somewhat positive
5 = Very positive/friendly
User Input: {{input}}
Assistant Output: {{output}}
Rate the tone (respond with just the number 1-5):
`,
choiceScores: {
"1": 0.0,
"2": 0.25,
"3": 0.5,
"4": 0.75,
"5": 1.0,
},
})
Multi-Criteria LLM Scorer
Evaluate multiple aspects with a single LLM call:
const ComprehensiveQuality = createTypedScorer({
name: "ComprehensiveQuality",
schema: z.object({
input: z.string(),
output: z.string(),
expected: z.string().optional(),
}),
promptTemplate: `
Evaluate this response across multiple dimensions:
User Question: {{input}}
Assistant Response: {{output}}
{{#expected}}Expected Answer: {{expected}}{{/expected}}
Rate the overall quality on a scale of 1-5:
1 = Poor (incorrect, unhelpful, or off-topic)
2 = Below average (partially correct but missing key points)
3 = Average (correct but could be more complete)
4 = Good (accurate and helpful)
5 = Excellent (accurate, complete, and well-explained)
Consider:
- Accuracy of information
- Completeness of answer
- Clarity of explanation
- Relevance to question
Provide your rating (1-5):
`,
choiceScores: {
"1": 0.0,
"2": 0.25,
"3": 0.5,
"4": 0.75,
"5": 1.0,
},
})
Domain-Specific LLM Scorer
Create scorers for specific domains:
const MedicalAccuracy = createTypedScorer({
name: "MedicalAccuracy",
schema: z.object({
input: z.string(),
output: z.string(),
}),
promptTemplate: `
As a medical professional, evaluate the accuracy and safety of this health information:
Patient Question: {{input}}
Assistant Response: {{output}}
Rate on a scale of 1-5:
1 = Dangerous or highly inaccurate medical information
2 = Misleading or incomplete information
3 = Generally accurate but lacks important context
4 = Accurate with minor omissions
5 = Highly accurate, safe, and comprehensive
Rating (1-5):
`,
choiceScores: {
"1": 0.0,
"2": 0.25,
"3": 0.5,
"4": 0.75,
"5": 1.0,
},
})
Combining LLM and Rule-Based Scoring
Use both approaches together:
const HybridScorer = createCustomScorer({
name: "HybridScorer",
schema: z.object({
input: z.string(),
output: z.string(),
}),
handler: async (args, ctx) => {
// Rule-based checks
const hasMinLength = args.output.length >= 50
const hasProperFormat = /^[A-Z].*[.!?]$/.test(args.output)
const ruleScore = (hasMinLength && hasProperFormat) ? 0.5 : 0
// LLM check via a helper function (one possible implementation is sketched after this scorer)
const llmScore = await evaluateWithLLM(args.input, args.output)
// Combine scores
const finalScore = (ruleScore + llmScore) / 2
return {
name: "HybridScorer",
score: finalScore,
metadata: {
ruleScore,
llmScore,
hasMinLength,
hasProperFormat,
},
}
},
})
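One possible implementation of the evaluateWithLLM helper, using the OpenAI SDK directly (an assumption; any LLM client, or a typed scorer, would work just as well):
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

async function evaluateWithLLM(input: string, output: string): Promise<number> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'user',
        content: `Rate how well this response answers the question on a scale of 1-5.\nQuestion: ${input}\nResponse: ${output}\nRespond with just the number.`,
      },
    ],
  })
  const rating = parseInt(response.choices[0]?.message?.content ?? '1', 10)
  // Map the 1-5 rating onto the 0-1 range used by scorers
  return Number.isNaN(rating) ? 0 : (Math.min(Math.max(rating, 1), 5) - 1) / 4
}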
Testing Scorers
Always test your scorers before using them in production evaluations.
Unit Testing
import { describe, it, expect } from 'bun:test'
import { LengthValidator } from './scorers' // adjust the path to wherever your scorer is defined
describe('LengthValidator', () => {
it('should score 1 for valid length', async () => {
const result = await LengthValidator.handler({
output: 'This is a valid length response.',
}, {})
expect(result.score).toBe(1)
expect(result.metadata.isValid).toBe(true)
})
it('should score 0 for too short', async () => {
const result = await LengthValidator.handler({
output: 'Short',
}, {})
expect(result.score).toBe(0)
expect(result.metadata.isValid).toBe(false)
})
it('should score 0 for too long', async () => {
const result = await LengthValidator.handler({
output: 'x'.repeat(1000),
}, {})
expect(result.score).toBe(0)
})
})
Testing with Real Data
Run a small evaluation to test your scorer:
await Eval("Scorer Test", {
data: [
{ input: "test 1", output: "Valid response here" },
{ input: "test 2", output: "X" }, // Too short
{ input: "test 3", output: "X".repeat(1000) }, // Too long
],
scores: [LengthValidator],
})
// Review results to ensure scorer behaves as expected
Debugging Scorers
Add verbose logging for debugging:
const DebugScorer = createCustomScorer({
name: "DebugScorer",
schema: z.object({
output: z.string(),
}),
handler: async (args, ctx) => {
console.log('Scorer Input:', args)
console.log('Context:', ctx)
// Replace with your real scoring logic; a trivial placeholder keeps the example runnable
const score = args.output.length > 0 ? 1 : 0
console.log('Calculated Score:', score)
return {
name: "DebugScorer",
score,
metadata: {
debug: true,
inputLength: args.output.length,
},
}
},
})
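To reuse the same logging across several scorers, one option is a small wrapper around the handler before it is passed to createCustomScorer. A sketch only: withLogging is a local helper, not part of the library, and it assumes nothing beyond the (args, ctx) handler signature shown throughout this page:
const withLogging = (label: string, handler: (args: any, ctx: any) => Promise<any>) =>
  async (args: any, ctx: any) => {
    console.log(`[${label}] args:`, args)
    console.log(`[${label}] ctx:`, ctx)
    const result = await handler(args, ctx)
    console.log(`[${label}] result:`, result)
    return result
  }

const LoggedLengthValidator = createCustomScorer({
  name: "LoggedLengthValidator",
  schema: z.object({
    output: z.string(),
  }),
  handler: withLogging("LoggedLengthValidator", async (args) => ({
    name: "LoggedLengthValidator",
    score: args.output.length >= 10 ? 1 : 0,
    metadata: { length: args.output.length },
  })),
})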
Uploading Scorers
Upload custom scorers to the cloud for use in the web console and API.
Create a Project File
Define your scorers in a project file:
// project.ts
import { createCustomScorer, createTypedScorer } from 'orchestrated'
import { z } from 'zod'
export const LengthValidator = createCustomScorer({
name: "LengthValidator",
schema: z.object({
output: z.string(),
}),
handler: async (args) => ({
name: "LengthValidator",
score: args.output.length >= 10 && args.output.length <= 500 ? 1 : 0,
metadata: { length: args.output.length },
}),
})
export const ToneScorer = createTypedScorer({
name: "ToneScorer",
schema: z.object({
input: z.string(),
output: z.string(),
}),
promptTemplate: `Evaluate tone (1-5): {{input}} -> {{output}}`,
choiceScores: {
"1": 0.0,
"2": 0.25,
"3": 0.5,
"4": 0.75,
"5": 1.0,
},
})
Upload to Cloud
orcha upload project.ts
This command:
- Analyzes your scorers and extracts definitions
- Bundles handler functions
- Uploads bundle to S3
- Generates definitions.json with metadata
Verify Upload
Check the output for confirmation:
✓ Serialized 2 scorers
✓ Uploaded bundle to S3
✓ Generated definitions.json
Scorers:
- LengthValidator (custom_scorer)
- ToneScorer (typed_scorer)
Bundle: s3://bucket/tenant/service/abc123/handlers.bundle.js
Fingerprint: abc123
Use Uploaded Scorers
After uploading, scorers are available in:
- Web Console - Select from dropdown in evaluation builder
- API - Reference by slug in API calls
- CLI - Use definition objects in eval files
// Using uploaded scorer definitions
const evalDef = {
scorers: [
{ type: "custom_scorer", slug: "length-validator", fingerprint: "abc123" },
{ type: "typed_scorer", slug: "tone-scorer", fingerprint: "abc123" },
],
}
await Eval("Cloud Scorers Eval", {
data: [/* ... */],
scores: evalDef.scorers,
})