White-paper

Evidence-based prompt engineering: what the research actually says

Promptivo Research · Published June 30, 2026

Most prompt-engineering advice is folklore. “Always give the model a persona,” “always tell it to think step by step,” “force JSON output” — these rules circulate as universal truths, yet the peer-reviewed record shows each one helps on some tasks and measurably hurts on others, often by double-digit accuracy points. This paper reviews that evidence and shows, mechanism by mechanism, how Promptivo compiles it into the prompts it builds for you.

The short version

A structured, six-section prompt is a sound scaffold — but the popular “always-on” rules are not. Expert personas don’t improve factual accuracy. “Think step by step” can hurt modern reasoning models. Forcing JSON can collapse a model’s reasoning. The fix isn’t to abandon these techniques — it’s to apply each one only where the evidence says it helps, conditioned on the task and the target model. That conditioning is what Promptivo automates.

The problem: prompting rules are conditional, not universal

The popular playbook is a list of imperatives: assign a role, demand chain-of-thought, constrain the output to JSON, give examples. Each has a kernel of truth from an influential early paper. But the same techniques have since been tested at scale across many models and task types, and the finding is consistent: the effect of a technique depends on the task and the model. Applied blindly, a “best practice” becomes a regression. A human expert navigates these trade-offs case by case; Promptivo’s thesis is that the navigation can be compiled into deterministic rules.

How we did the research

We did not rely on blog posts. We ran a structured review of the academic literature with an explicit anti-confirmation-bias step. The process: decompose the question into independent angles; fetch primary sources, prioritizing peer-reviewed venues (EMNLP, NeurIPS, ICML, TACL) and work from major labs and universities (Anthropic, Google DeepMind, Princeton, USC, and Wharton); extract falsifiable claims with supporting quotes; then adversarially verify each claim with a panel instructed to refute it. A claim reached our engine only if it survived an attempt to destroy it. Of the claims tested this way, 20 survived and 5 were killed, and the killed set included several plausible-sounding claims we would otherwise have acted on.

The findings, and how Promptivo harnesses each

Every prompt Promptivo builds has six labeled sections — Role, Context, Task, How-to-approach, Constraints, and Output format. The sections stay constant; what changes is what goes in them, decided by the task type and the target model’s capability.
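To make the fixed scaffold concrete, here is a minimal sketch of six labeled sections rendered in a constant order. The names `SECTION_LABELS` and `render_prompt` and the "Label:" layout are assumptions for illustration, not Promptivo's actual code.

```python
# Hypothetical sketch: the six section labels come from the text above;
# the keys, function name, and "Label:\nbody" layout are assumptions.
SECTION_LABELS = [
    ("role", "Role"),
    ("context", "Context"),
    ("task", "Task"),
    ("how_to_approach", "How-to-approach"),
    ("constraints", "Constraints"),
    ("output_format", "Output format"),
]


def render_prompt(sections: dict) -> str:
    """Render the provided sections under their labels, always in the same order."""
    unknown = set(sections) - {key for key, _ in SECTION_LABELS}
    if unknown:
        raise KeyError(f"unknown sections: {sorted(unknown)}")
    blocks = []
    for key, label in SECTION_LABELS:
        body = sections.get(key, "").strip()
        if body:  # an empty section is omitted rather than printed blank
            blocks.append(f"{label}:\n{body}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(render_prompt({
        "output_format": "A one-page memo.",
        "task": "Summarize the lease.",
    }))
```

The order of the input dictionary does not matter: the scaffold decides the position of each section, and only the contents vary.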

Telling the AI it’s an expert (“You are a senior…”)

Contradicted as a default

An expert persona does not improve factual or analytical accuracy, and often lowers it by 3–5 points. It does help alignment-dependent work — creative writing, voice, support, and safety.

The evidence

Zheng et al. (Findings of EMNLP 2024) tested 162 personas across 4 model families on 2,410 factual questions and found no gain over a no-persona control. A Wharton replication (Mollick et al., 2025) found zero significant gains and nine significant losses across six models.

In Promptivo

Promptivo emits an identity persona only for creative/voice tasks (and code, which is generative). Factual and analytical goals get a task-relevant behavioral framing instead — the literature’s explicit recommendation.
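The persona rule can be sketched as a single conditional. The task-type names, the function `role_section`, and both template sentences are hypothetical; only the rule itself (identity persona for creative, voice, and code tasks, and behavioral framing otherwise) comes from the text.

```python
# Hypothetical sketch of the persona rule described above.
# Task-type names and the wording of both templates are assumptions.
PERSONA_TASK_TYPES = {"creative", "voice", "code"}


def role_section(task_type: str, domain: str) -> str:
    """Return an identity persona only for generative task types."""
    if task_type in PERSONA_TASK_TYPES:
        return f"You are an experienced {domain} practitioner."
    # Factual and analytical goals get behavioral framing, not an identity.
    return (
        f"Answer questions about {domain} accurately. "
        "State uncertainty instead of guessing."
    )


if __name__ == "__main__":
    print(role_section("creative", "copywriting"))
    print(role_section("factual", "tax law"))
```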

“Think step by step” (chain-of-thought)

Mixed / conditional

Strongly validated for non-reasoning models on hard reasoning tasks — but it can reduce accuracy on intuition tasks, and on modern reasoning models that already think before answering, instructing more explicit reasoning can hurt.

The evidence

Wei et al. (NeurIPS 2022) and Kojima et al. (NeurIPS 2022) showed large gains for older models. Liu et al. (Princeton, ICML 2025) showed drops of up to 36 points on tasks where deliberation hurts; Anthropic’s inverse-scaling study (2025) showed extended reasoning degrading accuracy on native-reasoning models.

In Promptivo

Promptivo injects an explicit step-by-step scaffold only when the task is reasoning-heavy and the target model is not a native reasoner. For Claude, GPT-5-class, Gemini, and other native reasoners it removes the instruction and lets the model reason as trained.
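The gating condition above is a conjunction of two flags. A minimal sketch, assuming a hypothetical function name and an illustrative scaffold sentence:

```python
# Hypothetical sketch of the chain-of-thought gate described above.
# The scaffold wording is an assumption; the condition is from the text.
STEP_BY_STEP = "Work through the problem step by step before giving the final answer."


def approach_section(reasoning_heavy: bool, native_reasoner: bool) -> str:
    """Inject the scaffold only for reasoning-heavy tasks on non-native reasoners."""
    if reasoning_heavy and not native_reasoner:
        return STEP_BY_STEP
    return ""  # native reasoners and intuition-style tasks get no forced scaffold


if __name__ == "__main__":
    for heavy in (True, False):
        for native in (True, False):
            injected = bool(approach_section(heavy, native))
            print(f"reasoning_heavy={heavy} native_reasoner={native} -> scaffold={injected}")
```

Only one of the four combinations produces the scaffold, which is the point: the instruction is the exception, not the default.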

Forcing rigid output (“Return ONLY JSON”)

Contradicted as a default

Strict format restrictions degrade reasoning, and stricter constraints cause greater degradation. A fixed schema can make the model answer before it reasons, collapsing chain-of-thought.

The evidence

Tam et al. (EMNLP 2024): under JSON-with-schema, a smaller model’s grade-school-math accuracy fell from 87.0% to 23.4%, recovering fully when the schema was removed.

In Promptivo

For reasoning tasks that still need machine-readable output, Promptivo asks the model to reason in plain text first and serialize to JSON last — structure without the reasoning tax. Strict JSON is reserved for genuine data-extraction tasks.
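One way to get "reason first, serialize last" is to ask for free-text reasoning followed by a marked final line, then parse only that line. This is a sketch under assumptions: the `FINAL_JSON:` marker, the function names, and the instruction wording are all hypothetical, and the parser expects nothing after the JSON.

```python
import json

# Hypothetical marker that separates free-text reasoning from the final JSON.
MARKER = "FINAL_JSON:"


def output_format_section(task_kind: str, schema_hint: str) -> str:
    """Strict JSON only for extraction; otherwise reason first, serialize last."""
    if task_kind == "extraction":
        return f"Return only JSON matching: {schema_hint}"
    return (
        "First explain your reasoning in plain text. "
        f"Then, on the last line, write {MARKER} followed by JSON matching: {schema_hint}"
    )


def parse_final_json(response: str):
    """Parse the JSON that follows the last marker in a model response."""
    if MARKER not in response:
        raise ValueError("no final JSON marker found")
    _, tail = response.rsplit(MARKER, 1)
    return json.loads(tail.strip())


if __name__ == "__main__":
    reply = '17 + 25 = 42, so the total is 42.\nFINAL_JSON: {"answer": 42}'
    print(parse_final_json(reply))
```

The caller still receives machine-readable output, but the model is free to reason before it commits to the schema.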

Where you place instructions in the prompt

Validated

Models attend best to information at the beginning and end of a prompt, and worst to the middle — a U-shaped “lost in the middle” curve, with primacy and recency both exploitable.

The evidence

Liu et al. (TACL 2024) demonstrated mid-context accuracy drops of 15–25 points across six model families, holding content constant and varying only position.

In Promptivo

Promptivo’s six sections place the Task near the top and the Output format last — both strong positions — and restate a single dominant requirement at the very bottom as a recency anchor.
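The recency anchor amounts to appending one restated requirement after everything else. The sketch below assumes a hypothetical "Most important:" prefix and adds a small helper, `relative_position`, for checking whether a key instruction has drifted toward the middle of a prompt; neither name comes from Promptivo.

```python
# Hypothetical sketch: append a recency anchor and inspect where text sits.
def add_recency_anchor(prompt: str, dominant_requirement: str) -> str:
    """Restate the single dominant requirement as the final line of the prompt."""
    requirement = dominant_requirement.strip()
    if not requirement:
        return prompt
    return f"{prompt.rstrip()}\n\nMost important: {requirement}"


def relative_position(prompt: str, snippet: str) -> float:
    """Return where snippet starts, from 0.0 (very top) to 1.0 (very bottom)."""
    index = prompt.find(snippet)
    if index < 0:
        raise ValueError("snippet not found in prompt")
    span = max(len(prompt) - len(snippet), 1)
    return index / span


if __name__ == "__main__":
    body = "Task:\nSummarize the lease.\n\nOutput format:\nA one-page memo."
    anchored = add_recency_anchor(body, "Cite every figure you use.")
    print(anchored)
    print(relative_position(anchored, "Most important:"))
```

A value near 0.5 from `relative_position` would flag an instruction sitting in the weakest part of the curve described above.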

Tuning the prompt to the specific AI model

Mixed / conditional

Justified — the same instruction has opposite effects across model classes — but only by capability, not by vendor stereotype. The claim that failure modes are cleanly vendor-specific did not survive verification.

The evidence

Format-restriction and reasoning effects diverge sharply between strong and weak models, and between reasoning and non-reasoning models (Tam 2024; Liu 2025; Anthropic 2025). The vendor-specific-folklore claim was refuted 3 votes to 0.

In Promptivo

Promptivo keys its per-model tuning on capability class (whether the model is a native reasoner, and its capacity tier), not on vendor stereotypes, so the rules stay correct as model versions change.
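Keying on capability rather than vendor can be sketched as a lookup key that omits the vendor field entirely. The `ModelProfile` fields, the tier names, and the vendor names below are placeholders chosen for illustration; the source does not specify how the capacity tier is measured.

```python
from dataclasses import dataclass


# Hypothetical sketch: field names, tier names, and vendors are placeholders.
@dataclass(frozen=True)
class ModelProfile:
    vendor: str            # kept for display only; never read by tuning rules
    native_reasoner: bool
    capacity_tier: str     # assumed tiers, e.g. "small" or "large"


def tuning_key(profile: ModelProfile) -> tuple:
    """Build the key that tuning rules are looked up by: capability only."""
    return (profile.native_reasoner, profile.capacity_tier)


if __name__ == "__main__":
    a = ModelProfile(vendor="vendor-a", native_reasoner=True, capacity_tier="large")
    b = ModelProfile(vendor="vendor-b", native_reasoner=True, capacity_tier="large")
    # Two models from different vendors share one key, so they share one rule set.
    print(tuning_key(a) == tuning_key(b))
```

Because the vendor never enters the key, a new model version only needs its capability flags updated, not a new set of rules.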

A worked example

A user asks Promptivo to help underwrite a property acquisition and selects Claude. The engine classifies the goal as reasoning-class and the model as a native reasoner, then compiles a prompt that differs from the old one-size-fits-all default in three evidence-driven ways:

Section	Old default	Evidence-based
Role	“You are a seasoned financial analyst…”	Behavioral framing — persona dropped on a factual task
Approach	“Plan in <thinking> tags first…”	Native reasoning respected — forced CoT removed
Output	Risk of a rigid container	Reason first, then structure the memo

Every difference traces to a specific peer-reviewed result. The user never sees the machinery — they see a cleaner, better-calibrated prompt.
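The three rows of the table can be reproduced by a small decision function. This is an illustrative sketch, not the engine itself: the goal-class names, the function `compile_decisions`, and the returned descriptions are assumptions.

```python
# Hypothetical sketch of the three decisions in the worked example.
# Goal-class names and the returned descriptions are assumptions.
def compile_decisions(goal_class: str, native_reasoner: bool) -> dict:
    """Return one decision per table row: Role, Approach, Output."""
    persona = goal_class in {"creative", "voice", "code"}
    forced_cot = goal_class == "reasoning" and not native_reasoner
    reason_first = goal_class == "reasoning"
    return {
        "Role": "identity persona" if persona else "behavioral framing",
        "Approach": "step-by-step scaffold" if forced_cot else "no forced chain-of-thought",
        "Output": "reason first, then structure" if reason_first else "task-specific format",
    }


if __name__ == "__main__":
    # The underwriting case: reasoning-class goal, native-reasoner target.
    for section, decision in compile_decisions("reasoning", True).items():
        print(f"{section}: {decision}")
```

Changing only the second argument to `False` flips the Approach row to the step-by-step scaffold while the other two rows stay the same, which is the conditional behavior the table summarizes.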

What Promptivo deliberately doesn’t do yet

Intellectual honesty is part of the method. Several common techniques were researched but did not yet clear the evidence bar, so the engine leaves them alone pending dedicated study: negative vs positive instruction phrasing; few-shot example scaffolds (the canonical “label correctness doesn’t matter” result was refuted in our run); automatic prompt optimization; and politeness/emotional-stakes prompting, whose replication is contested. A second research pass on these questions is underway.

Get a research-backed prompt in 30 seconds

Promptivo applies every finding above automatically — tuned to your task and your model.

Sources

Zheng et al., When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of LLMs. Findings of EMNLP 2024 · arxiv.org/abs/2311.10054
Mollick et al. (Wharton), Prompting Science Report 4: Playing Pretend. 2025 · arxiv.org/abs/2512.05858
Hu, Rostami & Thomason (USC), PRISM. 2026 · arxiv.org/abs/2603.18507
Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022 · arxiv.org/abs/2201.11903
Kojima et al., Large Language Models are Zero-Shot Reasoners. NeurIPS 2022 · arxiv.org/abs/2205.11916
Liu et al. (Princeton), Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance. ICML 2025 · arxiv.org/abs/2410.21333
Gema, Hägele, Perez et al. (Anthropic), Inverse Scaling in Test-Time Compute. 2025 · arxiv.org/abs/2507.14417
Tam et al., Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of LLMs. EMNLP 2024 · arxiv.org/abs/2408.02442
Liu et al., Lost in the Middle: How Language Models Use Long Contexts. TACL 2024 · arxiv.org/abs/2307.03172
Schulhoff et al., The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. 2024 · arxiv.org/abs/2406.06608

Frequently asked questions

Does telling an AI to act as an expert improve its answers?

Not for factual or analytical questions. A large EMNLP 2024 study (162 personas, 2,410 questions) found expert personas gave no accuracy gain over no persona, and a Wharton replication found nine significant losses and zero gains. Personas do help creative, voice, and safety tasks. Promptivo therefore adds a persona only for creative and voice goals (and for code, which is generative).

Does telling an AI to “think step by step” always help?

No. Chain-of-thought strongly helps older or non-reasoning models on hard math and logic, but it can reduce accuracy on intuition-style tasks, by up to 36 points in one Princeton study. On modern reasoning models, which already think internally, instructing more explicit reasoning can also hurt. The right move is to use it conditionally, which is what Promptivo does.

Does forcing JSON output hurt AI accuracy?

On reasoning tasks, yes. An EMNLP 2024 study found a model’s grade-school-math accuracy fell from 87% to 23% under a strict JSON schema, because the schema made it answer before reasoning. The fix is to let the model reason in plain text first and output JSON last — Promptivo’s default for reasoning tasks.

What is the best structure for an AI prompt?

Clear, labeled sections — role, context, task, approach, constraints, and output format — with the most important instruction near the top or bottom, because models attend worst to the middle of a prompt (the “lost in the middle” effect, TACL 2024). Promptivo builds every prompt in this structure automatically.

Should prompts be written differently for different AI models?

Yes, but by capability, not brand. The same instruction can help a non-reasoning model and hurt a native-reasoning one. The popular idea that each vendor has a fixed quirk did not survive verification, so Promptivo tunes by model capability class rather than vendor folklore.

How does Promptivo build an optimized prompt?

Promptivo is a deterministic prompt compiler: it interviews you about your goal and assembles a structured, model-tuned prompt, applying the research findings above as conditional rules. It builds the prompt and never runs it for you, so it stays neutral across models and keeps your prompts private.
