What Is Inter-Annotator Agreement?

If two trained people label the same data and disagree half the time, you do not have a labeling problem. You have a definition problem, and your model is about to learn the confusion. Inter-annotator agreement, usually shortened to IAA, is how data teams catch that before it ships. It is one of the most useful and most misunderstood quality metrics in machine learning, so here is what it actually measures and what score you should aim for.

What inter-annotator agreement measures

IAA measures how often independent annotators assign the same label to the same item. The logic is simple: if your guidelines are clear and your task is well defined, different careful people should mostly arrive at the same answer.

High agreement means the task is well specified and people apply it consistently. Low agreement points to one of three things: ambiguous guidelines, a genuinely subjective task, or undertrained annotators. In every case, low agreement is a signal that the labels feeding your model are noisy.

Why a raw agreement percentage lies to you

The obvious move is to compute the percentage of items two people labeled the same way. The problem is that raw agreement is easily fooled by imbalance.

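Suppose the task is "is this email spam, yes or no," and 95 percent of emails are not spam. Two annotators who lazily mark everything "not spam" will agree with each other 100 percent of the time, and match the true label 95 percent of the time, while doing no real work at all. Even two annotators who guess independently, each saying "not spam" 95 percent of the time, will agree on about 90 percent of emails (0.95 x 0.95 + 0.05 x 0.05 = 0.905) by luck alone. Raw percent agreement rewards them anyway. To measure real agreement, you need to subtract the agreement you would expect from chance. That is exactly what kappa does.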
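To make the imbalance problem concrete, here is a minimal sketch that compares observed agreement with the agreement expected by chance. The function names and the 100-email dataset are hypothetical, chosen only for illustration.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def chance_agreement(labels_a, labels_b):
    """Agreement expected if each annotator kept their own label
    frequencies but assigned labels independently of the item."""
    n = len(labels_a)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    return sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)


# Hypothetical data: 100 emails. Each annotator flags 5 as spam,
# but never the same 5, so they agree on no spam email at all.
annotator_a = ["spam"] * 5 + ["not spam"] * 95
annotator_b = ["not spam"] * 5 + ["spam"] * 5 + ["not spam"] * 90

observed = percent_agreement(annotator_a, annotator_b)
expected = chance_agreement(annotator_a, annotator_b)
print(f"observed agreement: {observed:.3f}")
print(f"chance agreement:   {expected:.3f}")
```

In this made-up batch the annotators agree on 90 percent of emails, which sounds healthy, yet chance alone would produce about 90.5 percent. They never agree on a single spam email, and the raw percentage hides that completely.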

Cohen's kappa, explained simply

Cohen's kappa adjusts agreement for chance, where chance agreement is estimated from how often each annotator uses each label. The formula is:

kappa = (observed agreement minus chance agreement) divided by (1 minus chance agreement)

You do not need to compute it by hand, but the intuition matters:

1.0 means perfect agreement beyond chance.
0 means the annotators agreed no more than random guessing would predict.
A negative value means they agreed less than chance, which is a serious red flag about the task or the rubric.

In plain terms, kappa asks: how much better than luck did these annotators actually do?
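For readers who want to see the arithmetic, the formula can be sketched in a few lines of plain Python. This is an illustrative implementation, not a replacement for a tested library; the function name cohen_kappa and the ten-item dataset are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")

    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, multiply the two annotators'
    # usage rates for that label, then sum over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label for everything,
        # so the denominator is zero and kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten emails.
annotator_a = ["spam"] * 4 + ["not spam"] * 6
annotator_b = ["spam"] * 3 + ["not spam", "spam"] + ["not spam"] * 5

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Here the annotators match on 8 of 10 items (0.80 observed), chance agreement is 0.52, and kappa works out to 0.28 / 0.48, roughly 0.58. Note the edge case: if both annotators give every item the same single label, chance agreement is 1 and kappa cannot be computed at all.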

A note on variants: Cohen's kappa is for exactly two annotators. For more than two, teams use Fleiss' kappa. For ordered labels, like a 1 to 5 rating, weighted kappa or Krippendorff's alpha is more appropriate because being one point off should count differently from being four points off.
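To illustrate why ordered labels need a different treatment, here is a minimal sketch of weighted kappa in plain Python. It assumes integer ratings and uses the distance between ratings (raised to a power) as the disagreement weight; the function name, the power parameter, and the ratings are all hypothetical.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, power=1):
    """Weighted kappa for two annotators giving integer ratings.

    power=1 gives linear weights, power=2 gives quadratic weights.
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("need two non-empty rating lists of equal length")

    # Observed disagreement: average distance between paired ratings.
    observed = sum(abs(a - b) ** power
                   for a, b in zip(ratings_a, ratings_b)) / n

    # Expected disagreement if ratings were assigned independently,
    # keeping each annotator's own rating frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[i] * freq_b[j] * abs(i - j) ** power
                   for i in freq_a for j in freq_b) / (n * n)

    if expected == 0:
        raise ValueError("undefined when both annotators use one rating")
    return 1 - observed / expected


# Hypothetical 1-to-5 ratings. Both comparison annotators differ from
# the reference on the same two items, by one point or by four points.
reference = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
off_by_one = [2, 2, 3, 4, 5, 1, 2, 3, 4, 4]
off_by_four = [5, 2, 3, 4, 5, 1, 2, 3, 4, 1]

near = weighted_kappa(reference, off_by_one)
far = weighted_kappa(reference, off_by_four)
print(f"off by one:  {near:.3f}")
print(f"off by four: {far:.3f}")
```

Both comparison annotators disagree with the reference on exactly two items, so raw agreement is 80 percent in both cases. The weighted score separates them: small misses are penalized lightly, large misses heavily.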

How to read a kappa score

A widely used interpretation (from Landis and Koch) reads roughly like this:

below 0.00: poor
0.00 to 0.20: slight
0.21 to 0.40: fair
0.41 to 0.60: moderate
0.61 to 0.80: substantial
0.81 to 1.00: almost perfect

Treat these as rules of thumb, not laws. Context decides what is acceptable. For a safety-critical medical label, "substantial" might still be too low to ship. For messy, subjective sentiment work, "moderate" may be a realistic and honest ceiling. Always read the number against the stakes of the task.
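One way to keep both the rule of thumb and the stakes in view is to report the band and check the score against a bar you set per task. The sketch below is illustrative: the "poor" label for negative values comes from the original Landis and Koch table, while the function names and the example thresholds are hypothetical.

```python
def landis_koch_band(kappa):
    """Map a kappa score to its Landis and Koch description."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "poor"
    # Upper bounds are treated as inclusive, so a value that falls
    # between two published bands (e.g. 0.205) goes to the higher one.
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


def meets_bar(kappa, required_kappa):
    """True if the score reaches the bar chosen for this task."""
    return kappa >= required_kappa


# Hypothetical bars: stricter for a medical label than for sentiment.
score = 0.72
print(landis_koch_band(score))                  # substantial
print(meets_bar(score, required_kappa=0.85))    # False
print(meets_bar(score, required_kappa=0.50))    # True
```

The same 0.72 is "substantial" in both cases, but it only passes the looser bar. The band describes the score; the threshold is your decision.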

What to do when agreement is low

Low IAA is fixable, and the fix is usually faster than people expect:

Read the disagreements. They point straight at the ambiguous rule. The items people split on are telling you exactly which definition is unclear.
Fix the guidelines. Add explicit examples for the contested cases and tighten the definitions that caused the splits.
Retrain annotators on those specific examples.
Re-measure and repeat.

A ten-minute guideline fix often moves kappa more than any amount of telling people to "be more careful." The problem is usually the instructions, not the people.
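Reading the disagreements is easier with a tally of which label pairs get confused most often. The following is a minimal sketch under assumed inputs: the item IDs, the sentiment labels, and the function name disagreement_report are hypothetical.

```python
from collections import Counter


def disagreement_report(item_ids, labels_a, labels_b):
    """Return (confused label pairs by frequency, contested items)."""
    confusions = Counter()
    contested = []
    for item_id, a, b in zip(item_ids, labels_a, labels_b):
        if a != b:
            # Sort the pair so (pos, neutral) and (neutral, pos)
            # are counted as the same confusion.
            confusions[tuple(sorted((a, b)))] += 1
            contested.append((item_id, a, b))
    return confusions.most_common(), contested


# Hypothetical sentiment labels for six items.
item_ids = ["t1", "t2", "t3", "t4", "t5", "t6"]
annotator_a = ["pos", "neutral", "neg", "neutral", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "pos", "neutral", "neutral"]

pairs, contested = disagreement_report(item_ids, annotator_a, annotator_b)
for pair, count in pairs:
    print(pair, count)
for item_id, a, b in contested:
    print(item_id, a, b)
```

In this made-up batch, three of the four disagreements are between "neutral" and "pos", which suggests the boundary between those two labels is the rule to rewrite first. The contested item list gives you the concrete examples to add to the guidelines.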

Where IAA fits in a real quality process

During design. Pilot a small batch, measure IAA, and fix the guidelines before you scale to thousands of items. This is the cheapest place to catch problems.
During production. Overlap a portion of items, say 5 to 10 percent, across multiple annotators and track IAA over time. A sudden drop is an early warning that something (a new project rule, a new annotator, a misread update) has gone sideways.
Alongside gold standards. IAA tells you whether people agree with each other. Gold-standard items, which have a known correct answer, tell you whether they agree with the truth. You want both, because annotators can be consistently wrong together.
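The difference between agreeing with each other and agreeing with the truth can be shown with a small sketch. The labels and the function name match_rate are hypothetical, and plain match rate is used for brevity; in practice the annotator-to-annotator comparison would use a chance-adjusted measure.

```python
def match_rate(labels_x, labels_y):
    """Fraction of items on which two label lists are identical."""
    if len(labels_x) != len(labels_y) or not labels_x:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(labels_x, labels_y)) / len(labels_x)


# Hypothetical gold-standard items with known correct answers.
gold = ["cat", "dog", "dog", "cat", "dog"]

# Both annotators misread the same rule, so they share the same errors.
annotator_a = ["cat", "cat", "cat", "cat", "dog"]
annotator_b = ["cat", "cat", "cat", "cat", "dog"]

between_annotators = match_rate(annotator_a, annotator_b)
a_vs_gold = match_rate(annotator_a, gold)
b_vs_gold = match_rate(annotator_b, gold)

print(f"annotator A vs annotator B: {between_annotators:.2f}")
print(f"annotator A vs gold:        {a_vs_gold:.2f}")
print(f"annotator B vs gold:        {b_vs_gold:.2f}")
```

The two annotators agree with each other on every item but match the gold answers on only 3 of 5. Agreement alone would have passed this batch; the gold check is what exposes the shared mistake.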

Bottom line

Inter-annotator agreement is how you know your labels are consistent before they ever reach a model. Raw percentages flatter you; chance-adjusted measures like Cohen's kappa tell the truth. Aim for the level of agreement your stakes demand, read your disagreements as feedback on the guidelines, and pair IAA with gold standards so you are measuring both consistency and correctness. This is the discipline behind genuinely high-quality training data, and it is exactly the kind of agreement checking trAIn builds into human-reviewed work.