How AI Detectors Work: Perplexity, Burstiness, and What the Scores Mean
How AI detectors work: perplexity scoring, burstiness analysis, neural classifiers, and watermarking. The algorithms, their limits, and what the scores mean.
Key takeaways
- Perplexity measures how predictable each word is given the words before it. AI-generated text tends to score low because LLMs pick statistically likely tokens. Human writing scores higher because people choose surprising, personal, or stylistically unconventional words.
- Burstiness measures how much perplexity varies across a document. Humans spike and dip unpredictably. AI models maintain a steady, flat predictability level throughout.
- Neural classifiers are fine-tuned models trained on labeled AI vs. human text. They can hit 95%+ accuracy on text from the model they were trained on, but often fail badly on newer or different models. (Hugging Face, arxiv.org, 2406.11073)
- Watermarking embeds a statistical signal at generation time by biasing token selection. It is the most reliable method but requires the generating model to support it.
- Every method has a meaningful false positive problem. A 2023 peer-reviewed study found a 61.3% average false positive rate when detectors evaluated TOEFL essays written by non-native English speakers. (Liang et al., Patterns 2023)
Fixes when it breaks. Workflows when it doesn't.
OpenClaw guides, configs, and troubleshooting notes. Every two weeks.
What AI detectors are actually measuring
AI detectors don't read writing the way humans do. They run statistical calculations on token probability distributions, then compare those numbers against patterns learned from known AI and human text. The goal is to catch a statistical fingerprint that language models leave behind.
Large language models generate text one token at a time. At each step, the model calculates a probability distribution over its entire vocabulary, then selects from the high-probability end of that distribution. The result is text that is statistically coherent, internally consistent, and tends toward the predictable.
Humans write differently. We backtrack, digress, make unexpected word choices, vary our sentence length, and occasionally write something that would genuinely surprise a language model. That difference, in aggregate over hundreds or thousands of words, produces measurable statistical patterns.
Detectors look for those patterns. They are not "reading" for meaning. They are running math on token sequences, and the math tells them whether the distribution looks like what a language model would produce.
Perplexity: the probability score behind "AI-sounding" text
Perplexity in AI detection is a measure of how surprising each word in a piece of text is, calculated from the perspective of a reference language model.
To calculate it, the detector runs the text through a language model and checks the probability assigned to each token given its context. If the model assigns a high probability to the word that actually appeared, perplexity is low. If the model is "surprised" by the word, perplexity is high.
Take two sentences:
- "For lunch today, I ate a bowl of soup." (low perplexity)
- "For lunch today, I ate a bowl of spiders." (high perplexity)

The second sentence surprises the model because very few training examples describe eating spiders for lunch. That surprise registers as high perplexity. (Pangram Labs, vendor blog)
AI-generated text tends to score low on perplexity because LLMs are literally optimized to produce high-probability continuations. GPTZero, one of the early AI detectors, set its threshold at a perplexity of 85, treating scores below that as likely AI-generated. (GPTZero)
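The calculation itself is simple once you have per-token probabilities from a reference model. A minimal sketch, using hypothetical probability values rather than a real model's output:

```python
import math

def perplexity(token_probs):
    # Perplexity is the exponential of the average negative log
    # probability the reference model assigns to each token.
    neg_log_likelihoods = [-math.log(p) for p in token_probs]
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# Hypothetical per-token probabilities a reference model might
# assign to the final words of each example sentence:
soup_probs    = [0.9, 0.8, 0.85, 0.9, 0.6]    # "...a bowl of soup"
spiders_probs = [0.9, 0.8, 0.85, 0.9, 0.0005] # "...a bowl of spiders"

print(perplexity(soup_probs))     # low: every token was expected
print(perplexity(spiders_probs))  # high: one big surprise dominates
```

Note how a single improbable token ("spiders") pulls the whole score up: the geometric-mean structure of perplexity makes it sensitive to surprises anywhere in the sequence.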
DetectGPT: a more sophisticated take on perplexity
The academic paper DetectGPT, published at ICML 2023, pushed the perplexity idea further with a technique called probability curvature analysis. (Mitchell et al., 2023)
The insight: text generated by an LLM tends to sit in a region where the model's log probability function has negative curvature. In plain terms, if you take an AI-generated passage and generate small variations of it using a separate model (like T5), the original tends to have a higher log probability than those variations. Human text doesn't show this property to the same degree.
DetectGPT uses this to detect AI text without training a separate classifier or needing a labeled dataset. It samples perturbations, compares log probabilities, and checks whether the original sits at a local maximum. The method improved AUROC from 0.81 to 0.95 for detecting GPT-NeoX-generated fake news articles, compared to the strongest prior zero-shot baseline. (ar5iv.labs.arxiv.org)
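The core comparison can be sketched in a few lines. This is a simplified illustration of DetectGPT's normalized perturbation discrepancy, with hypothetical log-probability values standing in for real model and perturbation outputs:

```python
import statistics

def perturbation_discrepancy(original_logprob, perturbed_logprobs):
    # If the original passage sits at a local maximum of the model's
    # log probability surface, it scores above its perturbed variants.
    # Normalizing by the perturbations' spread gives a z-like score.
    mu = statistics.mean(perturbed_logprobs)
    sigma = statistics.stdev(perturbed_logprobs)
    return (original_logprob - mu) / sigma

# Hypothetical log probabilities: the original passage vs. three
# T5-style paraphrased perturbations of it.
score = perturbation_discrepancy(-120.0, [-131.0, -129.5, -133.5])
print(score)  # a large positive score suggests machine-generated text
```

In the real method, dozens of perturbations are sampled per passage and the score is thresholded; human text tends to score near zero because its log probability is not systematically higher than its paraphrases'.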
Burstiness: why sentence variation is the giveaway
Burstiness measures how much perplexity varies across a document. It is, in effect, the variance of the perplexity signal over time.
Human writers are inconsistent. One paragraph uses simple, plain sentences (low perplexity). The next uses an unusual metaphor, an unexpected word, or a stylistically distinct phrase (high perplexity). Short-term memory makes us avoid repetition. Personal voice makes us unpredictable. This variation shows up as irregular spikes in the perplexity timeline, which detectors call high burstiness.
AI models are consistent. The same token selection process runs at every position. There is no short-term memory driving the model away from the words it just used. The result is a flat, uniform perplexity signal across the entire document, which detectors call low burstiness.
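One common way to operationalize this is to take perplexity per sentence and measure its spread. A minimal sketch, with hypothetical per-sentence perplexity values:

```python
import statistics

def burstiness(sentence_perplexities):
    # Burstiness as the standard deviation of per-sentence perplexity:
    # how much the "surprise level" swings across the document.
    return statistics.stdev(sentence_perplexities)

# Hypothetical per-sentence perplexity timelines:
human_like = [40, 180, 55, 220, 65, 30]  # irregular spikes
ai_like    = [48, 52, 50, 47, 53, 49]    # flat and uniform

print(burstiness(human_like) > burstiness(ai_like))  # True
```

Real detectors use more elaborate measures than a plain standard deviation, but the intuition is the same: it's the variance of the signal, not its level, that burstiness captures.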

Pangram Labs demonstrated this with a visualization: AI-generated text produces a deep, uniform blue when color-coded by perplexity, while human text produces clear red spikes at unexpected phrases. That visual pattern is what burstiness captures mathematically. (Pangram Labs, vendor blog; note that Pangram Labs is a vendor in the AI detection space.)
GPTZero's original detection model was built on exactly this principle. Edward Tian, its creator, describes burstiness as "a key factor unique to GPTZero detector, allowing our models to evaluate long-term-context, and perform better with additional inputs." (GPTZero)
Neural classifiers: the fine-tuned model approach
Perplexity and burstiness are statistical heuristics. Neural classifiers take a different approach: train a model to directly distinguish AI text from human text.
OpenAI took this approach when it released a classifier based on RoBERTa, a transformer model. The classifier was trained on examples of GPT-2-generated text alongside human writing. According to OpenAI's own model card, it achieved approximately 95% accuracy on GPT-2-generated text. (Hugging Face)
OpenAI retired this detector in 2023, citing low accuracy on newer models. (Hugging Face) That retirement is the key limitation of the neural classifier approach: the model learns to recognize the specific statistical patterns of the AI it trained on. When a new, more capable model arrives with different patterns, accuracy drops.
GPTZero has since expanded to a seven-component model. It layers statistical methods, deep learning, and interpretability techniques to produce sentence-level classifications with explanations. (GPTZero) This multilayer approach helps mitigate some of the generalization problems of pure classifiers, but doesn't fully solve them.
A 2024 academic study found that RoBERTa classifiers specifically struggle with human text that is simple and easy to predict, often labeling plain human writing as machine-generated. (arxiv.org, 2406.11073) The classifier gets confused by human text that happens to be statistically unremarkable.
Tokenization patterns and what they reveal
Token-level analysis goes one layer deeper than word-level perplexity.
LLMs don't work with words. They work with tokens, which can be whole words, word fragments, or individual characters depending on the tokenizer. When a model generates text, it's selecting from a probability distribution over its token vocabulary at each position.
Certain token sequences are statistically over-represented in AI output compared to human writing. This shows up in several ways:
- Sentence length tends toward uniformity. LLMs using sampling strategies like top-k or nucleus sampling still produce sequences where sentence lengths cluster in a narrow range.
- Transitional phrases repeat at higher-than-human rates. Words like "however," "furthermore," and "also" appear in AI output more frequently and more predictably than in human writing.
- Vocabulary breadth, measured as the ratio of unique tokens to total tokens (type-token ratio), tends to be higher in AI text but less contextualized. The model uses a wide vocabulary but distributes it in ways that reflect training patterns, not personal voice.
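The type-token ratio mentioned above is straightforward to compute. A toy sketch, using a whitespace split as a stand-in for a real subword tokenizer:

```python
def type_token_ratio(tokens):
    # Vocabulary breadth: distinct tokens divided by total tokens.
    return len(set(tokens)) / len(tokens)

# Made-up example sentences for illustration:
human = "the soup was fine I guess fine enough for a Tuesday".split()
ai    = "the soup was delicious flavorful satisfying and nourishing".split()

print(type_token_ratio(human))  # repetition ("fine") lowers the ratio
print(type_token_ratio(ai))     # a wide, non-repeating vocabulary raises it
```

On real documents this is computed over model tokens rather than words, and compared against distributions learned from known human and AI corpora rather than read off in isolation.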
These token-level patterns contribute to both perplexity-based and classifier-based detection. A 2025 academic survey of failure modes in AI-generated text detectors covers multiple approaches using these patterns. One of the earliest was GLTR (the Giant Language model Test Room), which visualized token probability distributions in written text. (MDPI Mathematics, 2025)
Statistical watermarking: embedding the signal at generation time
All the methods above are retrofitted: they analyze text after it's been generated. Watermarking takes a different approach. It embeds a detectable signal during generation.
A text watermark works by biasing the model's token selection with a secret key. At each generation step, the model's output probabilities (logits) are adjusted using a pseudorandom function derived from the key. The adjustments steer the model toward a "green list" of tokens and away from a "red list." The signal is invisible to a human reader because both green and red tokens produce fluent text, but a detector with the key can verify the pattern's presence statistically.
Google's SynthID Text is the most publicly documented implementation. Technically, it is a logits processor applied after Top-K and Top-P sampling, using a pseudorandom g-function to encode watermarking information across generated tokens. (Google AI for Developers) A formal survey of watermarking techniques describes three core components: an encoding function that modifies generated tokens, a decoding function that extracts and verifies the watermark, and a verification function that checks for valid watermark presence. (arxiv.org, 2504.03765)
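A minimal sketch of the green-list scheme described above (closer to the academic green/red-list formulation than to SynthID's exact g-function; the key string, list fraction, and logit-boost framing are illustrative assumptions):

```python
import hashlib
import math
import random

def green_list(prev_token_id, vocab_size, key, fraction=0.5):
    # Seed a PRNG from the secret key and the previous token, then
    # partition the vocabulary. At generation time, green tokens get
    # a small logit boost; anyone holding the key can recompute the
    # same partition at each position during verification.
    seed = int.from_bytes(
        hashlib.sha256(f"{key}:{prev_token_id}".encode()).digest()[:8], "big")
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    return set(ids[: int(vocab_size * fraction)])

def detection_z_score(green_hits, total_tokens, fraction=0.5):
    # One-proportion z-test: without the watermark, roughly `fraction`
    # of tokens should land on the green list by chance. A large z
    # means far more green tokens than chance predicts.
    expected = fraction * total_tokens
    std = math.sqrt(total_tokens * fraction * (1 - fraction))
    return (green_hits - expected) / std

# The partition is deterministic given the key and context:
assert green_list(42, 1000, "secret-key") == green_list(42, 1000, "secret-key")
print(detection_z_score(380, 500))  # 380/500 green tokens: z ≈ 11.6
```

This is why both the encoding and decoding sides need the key: the watermark is invisible in any single token choice and only emerges as a statistical excess of green tokens over a long enough span.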
The Brookings Institution describes watermarking as more accurate and more reliable against erasure and forgery than other approaches. It also acknowledges that a motivated actor can degrade watermarks and that open-source models present a hard limit on adoption. (Brookings)
The key practical limitation: watermarking only works if the model that generated the text supported it. ChatGPT, Claude, and Gemini could implement it. Open-source models running locally won't, because anyone with the weights can simply run generation without the watermarking step. GPTZero's own summary notes that watermarks can be removed, edited, or forged fairly easily through paraphrasing or translation. (GPTZero)
Where every AI detection approach breaks down
Every detection method has failure modes. Here's where each breaks down in practice.
Perplexity fails on memorized text
LLMs are trained to minimize perplexity on their training data. That means famous, widely-reproduced text, the kind that appears in millions of training examples, gets "memorized" and scores as low perplexity, even though it's clearly human-written.
Pangram Labs demonstrated this problem directly: perplexity-based detectors flag the Declaration of Independence as AI-generated. The document is old, short, frequently reproduced, and well within LLM training data. The model assigned high probability to every word. Perplexity came out low. The detector called it AI. (Pangram Labs, vendor blog; note that Pangram Labs is a vendor in the AI detection space.)
Classifiers fail on simple, plain human writing
A 2024 academic study found RoBERTa classifiers consistently misclassify human text that happens to be simple and statistically unremarkable. The classifier doesn't understand "simple writing." It just sees "low probability variance" and labels it machine-generated. (arxiv.org, 2406.11073)
All methods show bias against non-native English speakers
This is the most documented and most consequential failure. A 2023 peer-reviewed study in the journal Patterns evaluated seven widely-used GPT detectors against 91 TOEFL essays written by non-native English speakers and 88 US eighth-grade essays. (Liang et al., Patterns 2023)
Results: the detectors accurately classified the US student essays. They incorrectly labeled more than half of the TOEFL essays as AI-generated. Average false positive rate: 61.3%. All seven detectors unanimously flagged 19.8% of the TOEFL essays as AI. At least one detector flagged 97.8% of them. (Liang et al., Patterns 2023)
The mechanism explains why. Non-native English writing tends to use simpler vocabulary, shorter sentences, and more common grammatical structures. These traits score as low perplexity, the same statistical signature that detectors associate with AI output.
The study's authors verified the mechanism directly: when they improved the vocabulary in non-native English essays, the false positive rate dropped. When they simplified native English essays, the false positive rate rose. The detectors are measuring linguistic sophistication, not AI authorship.
A University of San Diego legal research guide additionally notes that neurodivergent students (autism, ADHD, dyslexia) are flagged at elevated rates for similar reasons. (USD Legal Research Guide)
Watermarking fails outside the generating model's control
Watermarks are removable. Paraphrasing, translating, or heavily editing the text disrupts the token-level statistical pattern the watermark relies on. The Brookings Institution notes this as the primary limitation: it raises the barrier to evasion but doesn't eliminate it, and only works for popular, watermarking-enabled models. (Brookings)
What AI detection scores actually mean for writers
A detection score tells you how closely a piece of text matches statistical patterns associated with AI output. That's all it tells you.
A high AI probability score means the text is predictable. It might be AI-generated. It might also be the writing style of someone who writes clearly and directly, someone who is not a native English speaker, someone who writes to a professional style guide, or text that has been copy-edited into uniformity.
A passing score doesn't mean a different detector won't flag the same text. Detectors disagree with each other regularly. They use different reference models, different training sets, and different thresholds.
The tools exist for a reason. They can flag obvious, low-effort AI output with reasonable reliability. But their error rates, their documented biases against non-native speakers, and their tendency to misclassify simple human writing mean they should never be the only basis for an important decision.
If you're evaluated on AI detection scores, it helps to know what's being measured. Longer sentences with unexpected word choices push perplexity up. Variation in paragraph length and rhythm pushes burstiness up. Neither of those things requires "fooling" anything. They're just what writing that has a voice sounds like.
For a closer look at how AI writing tools fit into a content workflow, see AI Tools I Actually Pay For (And Which Ones I Canceled).
Key terms
Perplexity: A measure of how predictable each token in a text is, given the tokens before it, according to a reference language model. Low perplexity = high predictability. High perplexity = surprising word choices.
Burstiness: The variance in perplexity across a document. High burstiness means the text alternates between predictable and surprising passages, a pattern associated with human writing.
Token: The unit a language model operates on. Tokens can be whole words, word parts, or punctuation characters, depending on the model's tokenizer.
Log probability: The logarithm of the probability a model assigns to a given token. Detectors use log probability sums and comparisons to calculate perplexity and detect patterns.
Neural classifier: A fine-tuned language model trained to output a binary label (AI vs. human) for input text. Accurate within its training distribution; degrades on out-of-domain models.
Watermarking: A method that embeds a statistical signal in generated text at the point of creation by biasing token selection probabilities with a pseudorandom function.
False positive rate: The proportion of human-written texts incorrectly labeled as AI-generated by a detector.
FAQ
What does a high perplexity score mean in AI detection?
A high perplexity score in AI detection means the text contains unexpected or unusual word choices that a language model would not have predicted. Detectors like GPTZero treat text with a perplexity score above 85 as more likely to have been written by a human. High perplexity is associated with human writing because people choose words that reflect personal voice, context, and creativity, rather than the statistically most probable option.
Why do AI detectors flag non-native English writers as AI?
AI detectors flag non-native English writers at elevated rates because these writers tend to use simpler vocabulary and shorter, more uniform sentences. Those traits produce low perplexity and low burstiness, the same statistical signatures detectors associate with AI output. A 2023 peer-reviewed study in the journal Patterns found an average 61.3% false positive rate when seven widely-used AI detectors evaluated TOEFL essays written by non-native English speakers, compared to near-perfect accuracy on US student essays. (Liang et al., 2023)
Can AI watermarking be removed?
AI text watermarks can be weakened or removed through paraphrasing, translation, heavy editing, or substituting synonyms. These operations disrupt the token-level statistical pattern the watermark relies on. The Brookings Institution notes that watermarking raises the barrier to evasion for most users but does not prevent a motivated actor from defeating it. (Brookings) Watermarks are also inherently limited to text from models that support the feature at generation time.
How accurate are AI detectors at detecting ChatGPT text?
Accuracy varies significantly by detector and by how the text was generated. OpenAI's own RoBERTa-based classifier achieved approximately 95% accuracy on GPT-2-generated text before the company retired it, citing low accuracy on newer models. (Hugging Face) Modern detectors like GPTZero use multilayer models that combine statistical analysis, deep learning, and interpretability signals, but no current detector is reliably accurate across all AI models and writing styles.
What is the difference between perplexity and burstiness in AI detection?
Perplexity measures how predictable individual words or tokens are in a passage. Burstiness measures how much that predictability varies across the full document. AI-generated text tends to be low in both: each word is predictable, and that predictability stays consistent from start to finish. Human writing tends to show higher perplexity overall and higher burstiness, with the predictability spiking up and down as the writer makes unexpected choices throughout. Both measures were central to GPTZero's original detection model. Burstiness specifically was introduced as a way to evaluate the document holistically rather than sentence by sentence.
Evidence & Methodology
This article cites peer-reviewed academic research, official vendor documentation, and primary sources.
Key sources:
- Liang, Weixin et al. "GPT Detectors Are Biased Against Non-Native English Writers." Patterns, Cell Press, July 2023. PMC10382961
- Mitchell, Eric et al. "DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature." ICML 2023. arxiv.org/pdf/2301.11305
- Tian, Edward. "What is perplexity & burstiness for AI detection?" GPTZero Blog. gptzero.me
- Srinivasan, Siddarth. "Detecting AI Fingerprints: A Guide to Watermarking and Beyond." Brookings Institution, January 2024. brookings.edu
- "Watermarking for AI Content Detection: A Review." arxiv, April 2025. arxiv.org/html/2504.03765v1
- OpenAI RoBERTa Detector model card. Hugging Face. huggingface.co
- "Exploring the Limitations of Detecting Machine-Generated Text." arxiv, June 2024. arxiv.org/html/2406.11073v1
- Pangram Labs. "Why Perplexity and Burstiness Fail to Detect AI." pangram.com (Note: Pangram Labs is a vendor in the AI detection space. Claims about perplexity/burstiness limits are corroborated by independent academic sources above.)
All external links verified as live at time of publication. No competitor blog links included.
Related resources
- AI Tools I Actually Pay For (And Which Ones I Canceled)
- Stop Your AI Resume Writer From Making Stuff Up
- GPT-5.3 Instant vs Gemini 3.1 Flash-Lite: Pricing, Speed, and When to Use Each
Changelog
| Date | Change |
|---|---|
| 2026-03-21 | Initial publish |