Error Calculation For Standard SLA
Updated on September 9th, 2025
Service Level Agreements (SLAs) define the minimum acceptable quality thresholds to ensure consistency and accountability across all project types. This document explains the standard approach for calculating annotation accuracy against SLA targets, using the error count and opportunity count method. SLA error calculation applies to sequences, objects, data categorization, taxonomy, and GenAI responses (with a different error metric for word-level accuracy).
Understanding the Error Calculation for Standard SLA
The Standard SLA uses the error count and opportunity count method to measure accuracy across all project types. Each evaluated unit, such as a shape, object, non-annotation output, or response, is reviewed against defined quality requirements. If a unit meets all requirements, it is recorded as a perfect unit; if it does not, each error it contains is counted as a separate error opportunity.
Error opportunities
An error opportunity is any unit under evaluation, such as a shape, object, non-annotation output, or response. Every unit reviewed is considered an opportunity to make an error. If the unit is correct, it is recorded as such; if it contains errors, it is marked incorrect.
💡 Important
Multiple errors within the same unit are recorded individually so that all issues are tracked. Each error counts toward the total error count; the unit itself is not “rolled up” into a single pass/fail result. Instead, the final score reflects the total number of errors.
As defined in the quality framework:
Opportunity Count = Total Perfect Units + Error Count
This method means the reported opportunity count may be slightly higher than the actual number of units reviewed. This is intentional, since it allows the calculation to capture more than one error type per unit. For example, a shape with multiple attributes may contain several independent errors. Each of these errors is treated as a separate opportunity, ensuring that all potential mistakes are included in the calculation.
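As an illustration of how these counts accumulate, here is a minimal sketch (the data layout is assumed for this example, not a production schema) that tallies perfect units and error records for a small batch:

```python
# Minimal sketch: building Error Count and Opportunity Count from reviewed units.
# Each unit is represented by the list of errors found in it (empty list = perfect).
# The data layout is an assumption for illustration, not a production schema.

units = [
    [],                                            # perfect shape
    [],                                            # perfect shape
    ["Incorrect Attribute", "Inaccurate Shape",
     "Missing Attribute"],                         # one shape with three independent errors
]

perfect_units = sum(1 for errors in units if not errors)
error_count = sum(len(errors) for errors in units)
opportunity_count = perfect_units + error_count    # can exceed the number of units reviewed

print(perfect_units, error_count, opportunity_count)  # 2 3 5
```

Three units were reviewed, yet the opportunity count is five, because the third shape contributes one record per error rather than a single “incorrect unit” record.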
Scoring a Batch
Batches are evaluated in a consistent way across all project types: as tables of feedback records. Each reviewed unit, such as a shape, non-annotation output, or response, produces feedback records. Perfect units appear as empty records, while erroneous units generate one record per error. When aggregated, the batch is treated as a conglomerate of these records rather than as individually evaluated units. This means that a single unit with multiple errors can heavily impact the overall score; for example, one shape with five errors would require five perfect shapes to balance its effect.
For projects using the opportunities system, accuracy is calculated as:
Score = 1 - (Error Count ÷ Opportunity Count)
Where:
- Error Count = Total number of feedback records (errors) identified.
- Opportunity Count = Total perfect units (error-free) + error count.
The resulting score is compared against the SLA threshold defined for the project. If the score falls below the threshold, the batch fails and may require rework.
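As a minimal sketch of this scoring step (the function names and the fractional threshold format are assumptions for illustration):

```python
def sla_score(perfect_units: int, error_count: int) -> float:
    """Score = 1 - (Error Count / Opportunity Count)."""
    opportunity_count = perfect_units + error_count
    return 1 - error_count / opportunity_count

def meets_sla(perfect_units: int, error_count: int, threshold: float) -> bool:
    """True if the batch score meets or exceeds the SLA threshold (e.g., 0.99 for 99%)."""
    return sla_score(perfect_units, error_count) >= threshold

print(round(sla_score(293, 7), 4))   # 0.9767
print(meets_sla(293, 7, 0.99))       # False: below the 99% threshold, so the batch would require rework
```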
Try It Yourself: SLA score calculator
Use the interactive table below to calculate your SLA score based on the number of Perfect Units and Errors in your batch.
Enter the count of Perfect Units, the Error Count, and the SLA Threshold for your project. The table will automatically calculate the Opportunity Count (Perfect Units + Errors) and show whether the batch passes or fails.
Quick instructions:
- Enter Perfect Units and Errors for each batch.
- Enter your project’s SLA Threshold (number or %).
- The table automatically calculates Opportunities and shows the Result (Pass/Fail with score).
- Empty fields stay quiet; warnings only appear if an input is invalid.
Example SLA Calculation
| Batch | Perfect Units | Error Count | Opportunity Count | SLA Threshold | Result |
|---|---|---|---|---|---|
| Batch 1 | 293 | 7 | 300 | 99% | Fail (97.67%) |
| Batch 2 | 495 | 5 | 500 | 99% | Pass (99.00%) |
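The rows above can be reproduced with a short loop; this is a sketch under the same assumptions as the earlier snippet (field names are illustrative, thresholds expressed as fractions):

```python
# Sketch reproducing the example table; the record fields and the fractional
# threshold format (0.99 = 99%) are assumptions for illustration.
batches = [
    {"name": "Batch 1", "perfect_units": 293, "error_count": 7, "threshold": 0.99},
    {"name": "Batch 2", "perfect_units": 495, "error_count": 5, "threshold": 0.99},
]

for b in batches:
    opportunity_count = b["perfect_units"] + b["error_count"]
    score = 1 - b["error_count"] / opportunity_count
    result = "Pass" if score >= b["threshold"] else "Fail"
    print(f"{b['name']}: {opportunity_count} opportunities, {result} ({score:.2%})")
# Batch 1: 300 opportunities, Fail (97.67%)
# Batch 2: 500 opportunities, Pass (99.00%)
```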
Quality Metric Definitions
To ensure clarity and consistency when calculating SLA accuracy, the following metrics are used across all project types. These definitions apply whether the unit of evaluation is a shape, object, output, or response.
| Metric | Definition | Formula |
|---|---|---|
| Error Count | Total number of feedback records (errors) identified across the evaluated units (from automated checks, internal QA, or client feedback). | Error Count = Total Feedback Records |
| Opportunity Count | Total number of evaluated units, computed as perfect (error-free) units plus error records. | Opportunity Count = Perfect Units + Error Count |
| Error Ratio | Proportion of errors relative to total opportunities. | Error Ratio = Error Count ÷ Opportunity Count |
| Score | Final quality score (proportion of correct work), compared against the SLA threshold. | Score = 1 - (Error Count ÷ Opportunity Count) |
SLA Calculation Across Project Types
The formula for scoring remains consistent: Score = 1 - (Error Count ÷ Opportunity Count)
The difference lies in what counts as an opportunity and how errors are identified for each type of project:
| Project Type | Unit of Evaluation | Opportunity Definition | Common Error Types |
|---|---|---|---|
| Computer Vision (CV) with Shapes | Shape (polygon, cuboid, etc.) linked to an object | One opportunity per shape; errors within a shape are recorded individually and summed in scoring | Missing Shape, Extra Shape, Incorrect Label, Inaccurate Shape, Incorrect Attribute |
| Data Categorization | Non-annotation output (e.g., text classification or metadata) | One opportunity per output; each error recorded separately and summed in scoring | Incorrect Metadata, Missing Metadata, Extra Metadata |
| Complex Categorization with Attributes (Object-centric) | Object with attributes (simple or complex taxonomy) | One opportunity per object; attribute-level errors are recorded individually and summed in scoring. An object is perfect only if all required attributes are correct. | Incorrect Attribute Value, Missing Attribute, Extra Attribute, Incorrect Object Class (if applicable) |
| GenAI Evaluation | Response object (model output) | External reporting: one opportunity per response, errors recorded separately and summed in scoring. Internal analysis (optional): may compute word-level error ratio for diagnostics. | Instruction Gap, Factuality Error, Grammar Issue |
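To make the “same formula, different unit” point concrete, the sketch below reduces each project type to its perfect-unit count and error records before applying the shared formula. The error labels come from the table above; the counts are invented example values:

```python
# Illustrative sketch only: every project type reduces to (perfect units, error records)
# before the shared SLA formula is applied. Counts are made-up example values.

batches_by_project_type = {
    "CV with Shapes":      {"perfect_units": 180, "errors": ["Missing Shape", "Incorrect Label"]},
    "Data Categorization": {"perfect_units": 95,  "errors": ["Incorrect Metadata"]},
    "GenAI Evaluation":    {"perfect_units": 48,  "errors": ["Factuality Error", "Grammar Issue"]},
}

for project_type, batch in batches_by_project_type.items():
    error_count = len(batch["errors"])
    opportunity_count = batch["perfect_units"] + error_count
    score = 1 - error_count / opportunity_count
    print(f"{project_type}: {score:.2%}")
# CV with Shapes: 98.90%
# Data Categorization: 98.96%
# GenAI Evaluation: 96.00%
```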
Sources of Error Tags
Automated error tags

| Source | Description | Context Code | Example |
|---|---|---|---|
| System comparison (submission → final answer) | Calculated automatically by comparing Production submissions to the reviewed/final answer. Differences are auto-tagged as errors. | qa_step | A polygon in Production differs from the reviewed polygon (geometry/label mismatch), generating an automatic error tag. |

Manual error tags (added through the sampling portal)

| Source | Description | Context Code | Example |
|---|---|---|---|
| Internal (Sama) | Added by internal users (e.g., Super QA) to apply extra layers of quality checks beyond automated comparison. | internal_qa, internal_spot_check | Super QA flags a missing attribute on an otherwise correct object to capture a specific issue. |
| External (Client) | Added by customers when work is delivered; captures client-found issues during spot checks or acceptance review. | external_qa, external_spot_check | Client flags a mislabeled product attribute during their spot check of delivered outputs. |
Why this matters for SLA
The SLA formula is the same regardless of the source: Score = 1 - (Error Count ÷ Opportunity Count). The context code attached to each error, however, helps teams (as sketched after this list):
- Trace issues back to their origin.
- Separate internal QA findings from client-reported issues.
- Identify patterns for training, tooling improvements, or process changes.
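A brief sketch of that grouping (the record structure and the perfect-unit count are assumptions; the context codes are taken from the tables above):

```python
from collections import Counter

# Illustrative sketch: each error record carries a context code identifying its source.
# The record structure and the perfect-unit count are assumptions for this example.
error_records = [
    {"context": "qa_step",             "tag": "Inaccurate Shape"},
    {"context": "internal_spot_check", "tag": "Missing Attribute"},
    {"context": "external_qa",         "tag": "Incorrect Attribute"},
    {"context": "qa_step",             "tag": "Incorrect Label"},
]

# Group errors by source to trace origin and separate internal vs client findings.
by_source = Counter(record["context"] for record in error_records)
print(by_source)  # Counter({'qa_step': 2, 'internal_spot_check': 1, 'external_qa': 1})

# The SLA score itself still uses the combined total, regardless of source.
perfect_units = 196
error_count = len(error_records)
score = 1 - error_count / (perfect_units + error_count)
print(f"{score:.2%}")  # 98.00%
```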
Object Scoring Methodology
How Object Scoring Works
Object Scoring replaces the “opportunities” concept with a simpler, object-first approach:
- Per-object score: Each object starts at 100%. We subtract penalties for each error based on type and severity.
- Final object score: Object Score = 100% − Σ(penalties), clamped to 0–100%. An object is “perfect” only if all required attributes and labels are correct (see the calculation sketch after this list).
- Batch score (default): Average(Object Score) across all objects in the batch/asset.
- Batch score (advanced): Optionally, a weighted average may be used (e.g., weight by object importance/volume). See ADR for examples and guidance.
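A minimal sketch of this calculation (the helper name and the flat −10% penalty values are assumptions chosen to mirror the simple example further below):

```python
def object_score(penalties: list[float]) -> float:
    """Object Score = 100% - sum(penalties), clamped to 0-100%."""
    return min(100.0, max(0.0, 100.0 - sum(penalties)))

# Penalties per object, e.g. 10% per attribute/geometry error (penalty values are illustrative).
objects = [[], [], [10.0, 10.0], [10.0], []]

scores = [object_score(p) for p in objects]
batch_score = sum(scores) / len(scores)     # default batch score: simple average
print(scores, batch_score)                  # [100.0, 100.0, 80.0, 90.0, 100.0] 94.0
```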
Key Benefits
- Easier for clients to understand scoring.
- Better SLA calculations and reporting consistency.
- Unified reporting across projects.
- Supports customizable penalties by error type/severity.
Comparison: Opportunities vs Object Scoring
The table below compares how the same set of errors would be calculated under the old Opportunities method and the new Object Scoring methodology.
| Scenario | Opportunities Scoring | Object Scoring |
|---|---|---|
| Calculation Basis | Score = 1 − (Error Count ÷ Opportunity Count) | Score = Average(Object Score), where Object Score = 100% − Σ(penalties) |
| Example Data | 20 Opportunities, 3 Errors | 5 Objects; Object Scores: 100%, 100%, 80%, 90%, 100% |
| Score | 1 − (3 ÷ 20) = 85% | Average = (100 + 100 + 80 + 90 + 100) ÷ 5 = 94% |
| Pros | Clear for shape-level tasks, widely used | Works for all workflows, simpler client explanations, customizable penalties |
| Cons | Complex for clients to understand, tied to the “opportunities” concept | Requires migration for legacy projects |
Example – Object Scoring (Simple)
| Object ID | Base Score | Errors | Penalties Applied | Final Object Score |
|---|---|---|---|---|
| OBJ-001 | 100% | — | — | 100% |
| OBJ-002 | 100% | — | — | 100% |
| OBJ-003 | 100% | Missing Attribute; Inaccurate Geometry | −10%; −10% | 80% |
| OBJ-004 | 100% | Incorrect Attribute Value | −10% | 90% |
| OBJ-005 | 100% | — | — | 100% |
| Average Object Score | | | | 94% |
Example – Weighted Average (Advanced)
In some cases (see ADR), objects can be weighted (e.g., by importance or volume). The batch score is then a weighted average of object scores.
| Object ID | Final Object Score | Weight |
|---|---|---|
| OBJ-A | 100% | 3 |
| OBJ-B | 80% | 1 |
| OBJ-C | 90% | 1 |
| Weighted Average | (100×3 + 80×1 + 90×1) ÷ (3+1+1) = 94% | |
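As a sketch of the weighted variant, using the scores and weights from the table above (the data shape is an assumption for illustration):

```python
# Weighted batch score: each object's score is weighted, e.g. by importance or volume.
weighted_objects = [(100.0, 3), (80.0, 1), (90.0, 1)]   # (final object score, weight)

weighted_average = (
    sum(score * weight for score, weight in weighted_objects)
    / sum(weight for _, weight in weighted_objects)
)
print(weighted_average)  # 94.0
```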
💡 Instead of counting every possible opportunity for an error, we now focus on the objects themselves. Each object starts at 100%, and we deduct points depending on the severity and type of errors found. This makes it much easier for clients to understand why a score is what it is, without having to dive into a complicated opportunity count.