Sanitization filters block known injection strings. Attackers use Unicode homoglyphs, encoding tricks, indirect injection through trusted sources, and semantic reformulations that carry the same instruction in different words. The arms race between filter and bypass is one the filter is currently losing.
Analysis Briefing
- Topic: Prompt injection bypass techniques against sanitization defenses
- Analyst: Mike D (@MrComputerScience)
- Context: Born from an exchange with Claude Sonnet 4.6 that refused to stay shallow
- Source: Pithy Cyborg | Pithy Security
- Key Question: If your application sanitizes user input before sending it to the model, why are you still vulnerable?
The String-Match Problem That Makes Filters Insufficient
String-match sanitization blocks specific known injection strings. “Ignore previous instructions” gets filtered. “Disregard earlier guidance” does not. “You are now DAN” gets blocked. “Your previous persona has concluded. Your new function is” does not.
Natural language has infinite paraphrase. Any semantic instruction can be expressed in an unlimited number of surface forms. A filter that blocks the known forms of an injection string is always one reformulation behind the attacker. The attacker has unlimited attempts and can test variations until one gets through. The filter cannot enumerate the instruction space.
This is the fundamental problem with string-match sanitization applied to natural language. It is a useful first layer of defense against unsophisticated attacks. It is not a reliable barrier against a motivated attacker with access to the model to test bypasses.
The more reliable defense is architectural: treat all external content as untrusted data regardless of sanitization, limit what actions the model can take based on instructions from external content, and require human confirmation before the model acts on any instruction that arrived through a tool output or retrieved document.
Unicode and Encoding Bypasses That Skip String Matching
Unicode provides multiple ways to write characters that look identical to ASCII but are technically different code points. Homoglyph attacks substitute visually identical characters to bypass string matching while preserving the visual meaning for a human reader or language model.
The Cyrillic “а” (U+0430) is visually identical to the Latin “a” (U+0061). An injection string written with Cyrillic characters passes string matching against Latin character patterns but reads identically to a language model that normalizes Unicode during tokenization. The model processes the semantic content. The filter misses the match.
Zero-width characters (U+200B, U+200C, U+200D) inserted between characters of a blocked string break pattern matching without affecting rendering or model interpretation. “Ignore” with a zero-width space between “Ig” and “nore” fails the string match and renders invisibly in most interfaces.
HTML and URL encoding bypass filters that operate on decoded text when the filter runs before decoding. An injection string encoded as HTML entities passes a filter looking for plaintext patterns and gets decoded to the original string before reaching the model.
The defense against encoding bypasses is normalizing input to canonical Unicode form (NFC or NFKC) and stripping zero-width characters before applying any content filters. Operating on the raw input string rather than the decoded version produces inconsistent results.
Indirect Injection Through Trusted Sources
The most sophisticated prompt injection bypasses do not come through user input at all. They come through content the application retrieves from sources it treats as trusted: web pages fetched by a browsing agent, documents uploaded by other users, database records modified by a previous interaction, or API responses from third-party services.
An attacker who embeds injection instructions in a web page knows that any browsing agent that fetches the page will process those instructions in the same context as legitimate content. The page content arrives through what the system treats as a trusted retrieval path, not through the user input path where sanitization filters are applied.
The prompt injection defenses for local LLM agents that work for user input fail for indirect injection because the architectural assumption is wrong. Sanitizing user input does not protect against injections that arrive through retrieval.
The defense is treating retrieved content differently from the system prompt, regardless of the source. Retrieved content should be processed as data, not as instructions. Architectural approaches that separate the instruction context from the data context, rather than filtering content that crosses between them, are more reliable than sanitization.
What This Means For You
- Normalize Unicode input to NFC before applying any content filters. Strip zero-width characters explicitly. A sanitization filter operating on raw Unicode input misses homoglyph and zero-width bypasses that a normalized input check catches.
- Treat indirect injection as a separate threat model from direct injection. Sanitizing user input does not protect against injection through retrieved documents, web pages, or third-party API responses. Each retrieval source needs its own trust boundary and content handling policy.
- Prefer architectural defenses over content filters for consequential actions. Limiting what the model can do based on instructions from external content is more reliable than filtering the instructions themselves. A model that cannot send email based on instructions in a retrieved document is not vulnerable to email exfiltration injection regardless of what the document says.
- Test your sanitization with encoding variations, not just known injection strings. If your test suite only checks for “ignore previous instructions” and its direct synonyms, you are not testing against the bypasses that motivated attackers actually use.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
- Pithy Security → Stay ahead of cybersecurity threats.
