It's happening more and more: projects pass every known unit test, get reviewed by human programmers with human eyes, and seem to compile and work. Then, shortly after release, they suddenly stop working or get compromised, and no one on earth seems able to fix them or figure out why they failed. I think I know why...
In short, it's due to coders using AI writing assistants.
Large Language Models (LLMs, AKA "AI") have invisible characters in their training data. This is true of all AI coding assistants, including Copilot. It's not something that can be fully filtered out.
Even when training data is carefully curated, it can still influence LLMs to generate bad and/or invisible characters in their output. What constitutes a "bad character" depends on the situation and can't be easily defined. A character that is "bad" in one file may be necessary in another.
People are discovering that the culprit isn't logic, syntax, or even visible code. Rather, it's invisible characters that can't be seen in your editor, code base, or rule file. And your compiler can't help you, and few are talking about it.
LLMs, Unicode, and Machine-Code Vulnerabilities = A Perfect Storm
LLMs (like ChatGPT) insert invisible Unicode characters into the code they generate. These may appear as infrequently as once in every 10 million characters. Most are harmless spaces, but some are less familiar Unicode characters like Narrow No-Break Space (U+202F) and No-Break Space (U+00A0) that can:
- Break search functionality across your entire codebase
- Corrupt copy-paste operations
- Cause mysterious compilation failures
- Create security vulnerabilities that bypass traditional scanning tools and compilers
- And most importantly, hide prompt injections!
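To make the list above concrete, here is a minimal Python sketch of what looking for these characters can involve. The watch list `SUSPICIOUS` and the function name `find_invisible` are made up for this illustration, and which characters count as "bad" is an assumption that depends on the project.

```python
import unicodedata

# Hypothetical watch list for illustration; a real policy depends on the project.
SUSPICIOUS = {
    "\u00a0",  # NO-BREAK SPACE
    "\u202f",  # NARROW NO-BREAK SPACE
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}

def find_invisible(text):
    """Return (line, column, code point, name) for each watched character
    and for every character in Unicode category Cf (format controls)."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPICIOUS or unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits

sample = "x =\u00a01\nprint(x)\u200b\n"
for hit in find_invisible(sample):
    print(hit)
# (1, 4, 'U+00A0', 'NO-BREAK SPACE')
# (2, 9, 'U+200B', 'ZERO WIDTH SPACE')

# The same sample shows why search breaks: the text looks like "x = 1",
# but a plain substring search for "x = 1" does not match it.
print("x = 1" in sample)  # False
```

Note that a no-break space inside a string literal or a translated message can be entirely legitimate, so a check like this reports candidates and leaves the decision to a person.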
The Trojan Source Attack
"Your eyes can deceive you. Don't trust them" -Obi-Wan Kenobi in Star Wars: Episode IV – A New Hope
One of the most dangerous Unicode-based attacks is called "Trojan Source". This vulnerability exploits Unicode's bidirectional text capabilities to display code one way while the compiler executes it completely differently.
Here's the terrifying part: you can do a traditional code review of code that looks perfectly safe, yet when compiled, it executes malicious logic that was hidden in plain sight through clever Unicode manipulation. That malicious logic can be anything. Heck, it could be something an AI came up with itself for its own purposes. We can't assume all hackers are human anymore.
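The bidirectional controls involved are a small, fixed set of code points defined in the Unicode Bidirectional Algorithm (UAX #9), so flagging them is straightforward. Below is a rough Python sketch. The function name `find_bidi_controls` and the sample line are made up for illustration, and the sample only imitates the general shape of the attack.

```python
# Bidirectional control characters from the Unicode Bidirectional Algorithm.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
    "\u200e": "LRM", "\u200f": "RLM", "\u061c": "ALM",
}

def find_bidi_controls(source):
    """Return (line, column, abbreviation) for every bidi control in source."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits

# Illustrative line: the controls sit inside a string literal, so an editor
# that honors them may display the characters in a different order than
# the one the compiler reads.
suspect = 'if access != "user\u202e \u2066// check admin\u2069 \u2066":\n    pass\n'
print([name for _, _, name in find_bidi_controls(suspect)])
# ['RLO', 'LRI', 'PDI', 'LRI']
```

Right-to-left text in comments or strings can legitimately contain some of these marks, so a hit is a reason to look at the raw bytes and does not prove an attack.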
Traditional security tools can't catch this because they operate at the wrong layer; they analyze what "the parser" sees, not what humans see, and it gets worse…
Beyond intentional attacks, there's another category of threats that's even more insidious: malformed UTF-8 encoding corruption. When multi-byte Unicode sequences are partially corrupted or deleted and the compiler does not pick it up, the results can be catastrophic.
Many people think it's not possible for this to happen. How can bits of machine code or "partial sectors" tag along with my C++, Python, or Java scripts? How can they survive the compiler? I'll tell ya...
This issue of LLMs injecting invisible or incorrect characters into codebases is still new. The core architecture of compilers dates back to long before the age of LLMs, and no one anticipated this kind of problem.
Here's how it happens: when an IDE (like VS Code) tries to delete code containing invisible characters it can't fully recognize, it fails to remove those invisible characters completely. The result? Corrupted bits of machine code (some detected, some undetected) become an invisible part of your codebase headed for the compiler. Many compilers can filter out most of this corruption, but not all of them can. Each compiler and language handles it differently, and the problem isn't fully understood yet.
To illustrate:
Malformed Closing Tags:
Invalid UTF-8 start bytes can truncate data in ways that remove critical security boundaries. For example, malformed encoding might strip closing tags in HTML, allowing attackers to inject code that bypasses validation mechanisms.
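To show the mechanism, here is a deliberately flawed toy decoder in Python. `naive_decode` is a made-up function for this illustration and is not taken from any real library. It trusts the length announced by a UTF-8 lead byte and skips that many bytes even when they are not continuation bytes, which is the kind of shortcut that can swallow a closing quote and bracket.

```python
def naive_decode(data):
    """Deliberately flawed: trusts the length declared by the lead byte
    and skips that many bytes even when they fail to decode."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length = 1          # ASCII
        elif lead >> 5 == 0b110:
            length = 2          # lead byte announces a 2-byte sequence
        elif lead >> 4 == 0b1110:
            length = 3          # lead byte announces a 3-byte sequence
        elif lead >> 3 == 0b11110:
            length = 4          # lead byte announces a 4-byte sequence
        else:
            length = 1          # stray continuation byte or invalid lead
        chunk = data[i:i + length]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            out.append("\ufffd")  # the whole chunk is replaced, ASCII included
        i += length
    return "".join(out)

# 0xE2 announces a 3-byte sequence, but the next two bytes are '"' and '>'.
data = b'<b title="' + b"\xe2" + b'">text</b>'

print(naive_decode(data))                      # <b title="\ufffdtext</b>  (quote and '>' are gone)
print(data.decode("utf-8", errors="replace"))  # <b title="\ufffd">text</b>  (only the bad byte is replaced)
```

Python's built-in decoder replaces only the invalid byte, and with the default strict error handling it raises `UnicodeDecodeError` on this input, which is the safer behavior when the bytes are untrusted.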
Parser Confusion Attacks:
Carefully crafted "invalid" UTF-8 can cause parsers to hallucinate ASCII characters. This is super bad as it can create NUL bytes, slashes, or quotes that weren't in the original text. These phantom characters can break string boundaries, escape sandboxes, or manipulate program logic.
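One well-documented form of this is the overlong encoding, where an ASCII character is written using more bytes than necessary. The Python sketch below is an illustration only, and `lenient_two_byte` and `is_strict_utf8` are hypothetical helper names. It combines the payload bits of a 2-byte sequence without checking the minimum value, so phantom ASCII characters appear, while a strict decoder rejects the same bytes.

```python
def lenient_two_byte(lead, cont):
    """Combine the payload bits of a 2-byte sequence WITHOUT rejecting
    overlong forms (the flaw being illustrated)."""
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))

def is_strict_utf8(data):
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Overlong 2-byte forms of characters that belong in a single byte.
print(repr(lenient_two_byte(0xC0, 0xAF)))  # '/'
print(repr(lenient_two_byte(0xC0, 0xA2)))  # '"'
print(repr(lenient_two_byte(0xC0, 0x80)))  # '\x00'

# A strict decoder refuses the same bytes, so the phantom slash never appears.
print(is_strict_utf8(b"\xc0\xaf"))          # False
print(is_strict_utf8("é".encode("utf-8")))  # True
```

The risk comes from validating input with one decoder and consuming it with another: a filter that looks for "/" in the raw bytes finds nothing, while a lenient decoder further down the line produces one.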
Compiler Original Sin: Why This Problem Multiplies
Compilers are complex beasts with "inherent sin" (watch the LaurieWired video below to learn more), and a complex relationship with Unicode to machine code translation. This complexity creates the perfect environment for invisible threats to slip through undetected.
Understanding Compiler Original Sin is crucial to grasp why these two issues come together to form a catastrophic problem.
--> Security researcher LaurieWired provides an excellent technical breakdown of Ken Thompson's "Reflections on Trusting Trust" and how compiler backdoors parallel invisible character threats. Watch her video on Ken Thompson's theory of compiler backdoors and the XcodeGhost attack. Her analysis reveals that self-replicating code and compiler trust issues generate the same fundamental security challenges we currently face with invisible Unicode characters. While they are not the same conceptually, in the real world compiler-original-sin massively exacerbates the invisible character situation.
Thompson's compiler backdoor hid in the compilation process. Invisible Unicode hides one layer earlier in the encoding itself, before compilation even starts. The XcodeGhost attack in 2015 proved this stuff actually happens: a trojanized compiler infected hundreds of iOS apps without developers having a clue.
However, the real issue arises when the two subtle issues (Compiler Sin and Bad Characters) collide. Compilers are being built all the time...
Most compiler developers are using AI writing assistance. Bad characters in code could be entering new compilers, hiding malicious logic that would only trigger under unknown circumstances.
A lot of codebase scanning tools scan already-parsed tokens, not the raw byte streams where corruption lives. By the time they scan your code, the damage is baked in.
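As a rough sketch of what checking the raw byte stream can look like, the Python function below works on bytes before any decoding or tokenizing and reports the offset of every span the strict UTF-8 decoder rejects. The name `invalid_utf8_offsets` is made up for this example, and the approach is a simple illustration, not a description of how any particular scanner works.

```python
def invalid_utf8_offsets(raw):
    """Return (byte offset, hex of rejected bytes) for every span that
    Python's strict UTF-8 decoder refuses, scanning the raw bytes."""
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as err:
            start = pos + err.start  # err.start/err.end are relative to the slice
            end = pos + err.end
            problems.append((start, raw[start:end].hex()))
            pos = end                # resume right after the rejected span
    return problems

raw = b"ok = 1\n\xff\nname = '\xe2'\n"
print(invalid_utf8_offsets(raw))
# [(7, 'ff'), (17, 'e2')]
```

For files on disk, the bytes would come from opening the file in binary mode (for example `open(path, "rb").read()`), so nothing is silently replaced or dropped before the check runs.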
Now to blatantly glaze BCS (this part is an Ad)…
This is where Bad Character Scanner™ represents a shift in code security. Unlike traditional tools, BCS operates at the machine code level, scanning the raw binary representation of your files before any parsing or interpretation occurs. It uses sophisticated heuristics to identify suspicious patterns, encoding anomalies, and potential attack vectors that haven't even been documented yet. The engine learns from the structure of malicious encoding patterns, making it adaptive to emerging threats.
Complete Visibility: By scanning at the binary level, BCS sees every single byte in your codebase, including:
- Invisible Unicode characters
- Malformed multi-byte sequences
- Bidirectional override markers
- Homoglyphs that visually mimic legitimate characters
- Zero-width characters used for fingerprinting or watermarking
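For a sense of what homoglyph detection involves, here is a small Python sketch of one common heuristic: flagging words that mix scripts. This is not BCS's implementation, which the article does not describe. The function names are hypothetical, and taking the first word of a character's Unicode name as its "script" is a rough approximation.

```python
import re
import unicodedata

def scripts_in(word):
    """Approximate the scripts used by a word's letters, using the first
    word of each character's Unicode name (e.g. LATIN, CYRILLIC, GREEK)."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

def mixed_script_words(text):
    """Return the words in text whose letters come from more than one script."""
    return [w for w in re.findall(r"\w+", text) if len(scripts_in(w)) > 1]

# The "а" in the function name is U+0430 CYRILLIC SMALL LETTER A, not Latin "a".
code = "if is_\u0430dmin(user): grant(user)"
print([w.encode("unicode_escape").decode("ascii") for w in mixed_script_words(code)])
# ['is_\\u0430dmin']
```

A word written entirely in one non-Latin script is not flagged by this check. Only mixtures are, and those are the typical sign of a lookalike substitution.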
Pre-Compilation Detection: BCS catches problems before they ever reach your compiler or interpreter. This means:
- You fix issues before they cause mysterious build failures
- Security vulnerabilities are identified before code review
- Corrupted code stops living in your repository
- Development velocity increases
Whole-Codebase Analysis: BCS analyzes your entire codebase systematically, not just files you're actively editing. This catches:
- Legacy code that was contaminated months or years ago
- Third-party libraries with embedded Unicode attacks
- Code copied from untrusted sources
- Gradual corruption that accumulates over time
Unicode vulnerabilities represent a class of security threats that exist below the layer where most developers and tools operate. They're invisible, dangerous, and increasingly common as AI code generation becomes standard practice.
Bad Character Scanner is the only solution that operates at the right layer to catch these threats. By performing machine-level scanning with heuristic intelligence, BCS provides protection that simply isn't possible with traditional approaches.
In an era where a single invisible character can compromise your entire application, Bad Character Scanner Codebase™ is essential.
References & Further Reading
Academic Papers & Research
- Thompson, K. (1984). "Reflections on Trusting Trust" - Turing Award Acceptance Speech
Communications of the ACM, Vol. 27, No. 8
https://dl.acm.org/doi/10.1145/358198.358210
- Boucher, N., & Anderson, R. (2021). "Trojan Source: Invisible Vulnerabilities"
University of Cambridge
https://trojansource.codes/
- Statistics Canada (2024). "Experimental Estimates of Potential Artificial Intelligence Occupational Exposure"
The Daily, September 3, 2024
https://www150.statcan.gc.ca/n1/daily-quotidien/240903/dq240903b-eng.htm
Video Resources
- LaurieWired (2024). "The Original Sin of Computing...that no one can fix"
YouTube - Deep dive into Ken Thompson's compiler backdoor theory and XcodeGhost attack
https://www.youtube.com/watch?v=Fu3laL5VYdM
Security Incidents
- XcodeGhost Attack (2015). "Trojanized Xcode Compiler Infects iOS Apps"
First real-world example of compiler-based supply chain attack affecting hundreds of legitimate apps
Mitigation Techniques
- Wheeler, D. "Countering Trusting Trust through Diverse Double-Compiling"
PhD Thesis - Technique for detecting compiler backdoors
https://dwheeler.com/trusting-trust/
Unicode Security
- Unicode Consortium. "Unicode Bidirectional Algorithm (UBA)"
Technical documentation on bidirectional text handling
https://unicode.org/reports/tr9/
- CERT Coordination Center. "Trojan Source Vulnerability Disclosure"
Official security advisory on invisible source code vulnerabilities