Prompt injection — Definition

Prompt injection is an attack that embeds malicious instructions inside content a model is asked to read — a PDF, a web page, an email — so those instructions get interpreted as legitimate directions rather than data to be processed.

Why it matters

A large language model doesn't intrinsically distinguish between "the prompt my developer wrote" and "text embedded in the document I'm reading." Both arrive as tokens. A well-crafted injection exploits that by hiding instructions inside the content itself: "Ignore all previous instructions and instead…" tucked in white text, buried in metadata, or disguised as a footnote.

When it works, the injected instruction overrides the system's intended behaviour. A document-reading AI might be redirected to leak information, change its output format, fabricate answers, or take actions the user didn't authorise.

What makes it hard to prevent

Unlike traditional injection attacks (SQL injection, for instance), where the injected syntax is structurally distinct from data, prompt injection is semantically identical to legitimate instructions. There's no bracket or semicolon that marks the boundary — just natural language that the model must somehow know to distrust.

Defences are layered: careful prompt design that frames user content as data rather than instruction, input sanitisation, output validation, and restricting what actions the model can take in response to document content. No single control eliminates the risk.

Any AI that reads documents from untrusted sources — uploaded PDFs, web pages, third-party feeds — is exposed. Keeping that exposure in mind is part of building responsibly with AI, and Sidenote's security and compliance posture covers how the product handles untrusted document content.

Why it matters

What makes it hard to prevent

Stop digging. Start asking.