Scorers
Scorers measure the quality of outputs in your evaluations. Use built-in scorers or define custom logic.
What are Scorers?
Scorers evaluate outputs and return scores (0-1) plus optional metadata. They answer questions like:
- "How effective is this response?"
- "Does it follow guardrails?"
- "Is it factually correct?"
- "Does it execute successfully?"
Scorers can be:
- Deterministic - Rule-based logic (length checks, format validation)
- LLM-based - Use AI to judge quality (effectiveness, tone, relevance)
- Hybrid - Combine both approaches
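Whatever the approach, every scorer resolves to the same result shape: a name, a score normalized to 0-1, and optional metadata. A minimal sketch of that shape (the type name here is illustrative, not an exported type):
// Illustrative only - not an exported type from orchestrated
type ScorerResult = {
  name: string                        // which scorer produced this result
  score: number                       // normalized to the 0-1 range
  metadata?: Record<string, unknown>  // optional diagnostic details
}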
Built-in Scorers
Orchestrated includes several pre-built scorers:
Effectiveness
Measures how well responses satisfy user requests (1-5 scale, normalized to 0-1):
import { Effectiveness } from 'orchestrated'
await Eval("Effectiveness Eval", {
data: [/* ... */],
scores: [Effectiveness],
})
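The exact normalization is not spelled out here; assuming the same linear mapping used by the choiceScores table in the LLM-as-judge example below, a 1-5 rating r maps to (r - 1) / 4:
// Hypothetical linear normalization of a 1-5 rating into the 0-1 range
const normalizeRating = (rating: number) => (rating - 1) / 4
normalizeRating(1) // 0.0
normalizeRating(3) // 0.5
normalizeRating(5) // 1.0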
GuardrailAdherence
Evaluates adherence to system prompt guardrails:
import { GuardrailAdherence } from 'orchestrated'
await Eval("Guardrail Eval", {
data: [/* ... */],
scores: [GuardrailAdherence],
})
Execution
A binary scorer that checks for successful execution, scoring 1 when output is present and 0 when it is not:
import { Execution } from 'orchestrated'
await Eval("Execution Eval", {
data: [
{ input: "test", output: "result" }, // score: 1
{ input: "test", output: null }, // score: 0
],
scores: [Execution],
})
From autoevals
Use scorers from the autoevals library:
import { Factuality, AnswerRelevancy } from 'autoevals'
await Eval("Autoevals Eval", {
  data: [/* ... */],
  scores: [Factuality, AnswerRelevancy],
})
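Most autoevals scorers need more than input and output on each row; Factuality, for instance, judges the output against a reference answer. A minimal sketch (the example row is illustrative, and exact field requirements vary by scorer):
await Eval("Factuality Eval", {
  data: [
    {
      input: "What is the capital of France?",
      output: "Paris is the capital of France.",
      expected: "Paris", // reference answer Factuality compares against
    },
  ],
  scores: [Factuality],
})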
Custom Scorers
Define your own scoring logic with createCustomScorer():
import { createCustomScorer } from 'orchestrated'
import { z } from 'zod'
const LengthChecker = createCustomScorer({
  name: "LengthChecker",
  schema: z.object({
    output: z.string(),
  }),
  handler: async (args) => ({
    name: "LengthChecker",
    score: args.output.length <= 100 ? 1 : 0,
    metadata: { length: args.output.length },
  }),
})
await Eval("Custom Scorer Eval", {
  data: [/* ... */],
  scores: [LengthChecker],
})
With Context
Scorer handlers receive a context object as their second parameter:
const ContextScorer = createCustomScorer({
  name: "ContextScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  handler: async (args, ctx) => {
    // Access context values
    const maxLength = ctx.maxLength || 100
    return {
      name: "ContextScorer",
      score: args.output.length <= maxLength ? 1 : 0,
      metadata: { maxLength },
    }
  },
})
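How the context object gets populated is not shown here; if your Eval setup exposes a context option (the context key below is a hypothetical name, not a documented parameter), usage might look like:
await Eval("Context Scorer Eval", {
  data: [/* ... */],
  scores: [ContextScorer],
  // Hypothetical: the exact option name for passing scorer context
  // depends on your Eval configuration
  context: { maxLength: 200 },
})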
Complex Logic
Implement any scoring logic you need:
const ComplexScorer = createCustomScorer({
  name: "ComplexScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
    expected: z.string().optional(),
  }),
  handler: async (args) => {
    // Multi-factor scoring
    const hasKeywords = /important|critical/.test(args.output)
    const matchesExpected = args.expected
      ? args.output.includes(args.expected)
      : true
    const properLength = args.output.length > 10 && args.output.length < 500
    const score = [hasKeywords, matchesExpected, properLength]
      .filter(Boolean).length / 3
    return {
      name: "ComplexScorer",
      score,
      metadata: { hasKeywords, matchesExpected, properLength },
    }
  },
})
LLM-as-Judge Scorers
Use LLMs to evaluate quality with createTypedScorer():
import { createTypedScorer } from 'orchestrated'
const ToneScorer = createTypedScorer({
  name: "ToneScorer",
  schema: z.object({
    input: z.string(),
    output: z.string(),
  }),
  promptTemplate: `
    Evaluate the tone of this response on a scale of 1-5:
    1 = Very negative
    3 = Neutral
    5 = Very positive
    Input: {{input}}
    Output: {{output}}
    Rate the tone (1-5):
  `,
  choiceScores: {
    "1": 0.0,
    "2": 0.25,
    "3": 0.5,
    "4": 0.75,
    "5": 1.0,
  },
})
await Eval("Tone Eval", {
  data: [/* ... */],
  scores: [ToneScorer],
})
LLM-based scorers are automatically batched for cost efficiency.
Scorer Best Practices
Return Metadata
Include diagnostic information in metadata:
handler: async (args) => ({
  name: "MyScorer",
  score: calculateScore(args),
  metadata: {
    reason: "Output too short",
    threshold: 100,
    actual: args.output.length,
  },
})
Normalize Scores
Always return scores between 0 and 1:
// Good
const score = actualValue / maxValue
// Less good
const score = actualValue // Could be > 1
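If a raw value can exceed the maximum (or go negative), clamp it before returning; a small helper sketch, not part of any orchestrated API:
// Clamp an arbitrary ratio into the valid 0-1 range
const clamp01 = (value: number) => Math.min(1, Math.max(0, value))
const score = clamp01(actualValue / maxValue)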
Handle Missing Data
Gracefully handle missing inputs:
handler: async (args) => {
  if (!args.output) {
    return {
      name: "MyScorer",
      score: 0,
      metadata: { error: "No output provided" },
    }
  }
  // ... scoring logic
}
Keep Scorers Focused
Each scorer should measure one thing:
// Good - focused scorer
const LengthScorer = createCustomScorer(/* ... */)
const FormatScorer = createCustomScorer(/* ... */)
// Less good - multi-purpose scorer
const EverythingScorer = createCustomScorer(/* ... */)
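Focused scorers compose cleanly: pass several of them to a single Eval and each reports its own score and metadata. A sketch reusing the focused scorers above:
await Eval("Focused Scorers Eval", {
  data: [/* ... */],
  scores: [LengthScorer, FormatScorer], // each dimension scored independently
})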