So... Silent Data Corruption (SDC) is data corruption that goes undetected by the system. Historically, the term mostly referred to hardware faults, but with the rise of LLMs (and the ever-present "AI" buzzword), a new kind of software-induced SDC is rearing its ugly head.
To be clear, SDC is a big-tent term; it can come from many sources, including hardware failures (such as maliciously poisoned chips) and software-level issues.
Large Language Models (LLMs) can inadvertently introduce SDC into code and text. This can happen in many ways:
- A programmer could copy code containing invisible characters from an e-mail or online forum (I'm not saying which one!) and paste it straight into the codebase.
- An LLM trained on bad coding examples, messy formatting, or deliberately poisoned data can learn to emit invisible characters in its output, a problem that has been well documented in recent years.
- LLMs can insert invisible Unicode characters that alter the logic of code or text, leading to unexpected behaviour. The range of offenders is wide: zero-width spaces, joiners, and other format-control characters.
- Homoglyph attacks: using characters that look identical to the human eye but come from different Unicode blocks (e.g., Cyrillic ‘а’ and Latin ‘a’). A minimal scanner covering both of these tricks is sketched right after this list.
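To make both of those concrete, here is a minimal sketch of a character scanner in Python. Everything in it is illustrative rather than any particular tool's API: the homoglyph table is a tiny hand-picked sample (a real scanner would use the full Unicode confusables data from UTS #39), while the invisible-character check leans on Unicode's "Cf" (format-control) category.

```python
import unicodedata

# Characters in Unicode category "Cf" (format controls) are invisible:
# zero-width spaces, joiners, bidirectional overrides, and similar.
INVISIBLE_CATEGORIES = {"Cf"}

# A tiny illustrative homoglyph table; a real tool would use the full
# Unicode confusables list (UTS #39) instead of this hand-picked sample.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}

def scan(text: str):
    """Yield (index, codepoint, reason) for each suspicious character."""
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            yield i, f"U+{ord(ch):04X}", "invisible format-control character"
        elif ch in HOMOGLYPHS:
            yield i, f"U+{ord(ch):04X}", f"homoglyph of '{HOMOGLYPHS[ch]}'"

# The zero-width space hides inside what looks like one identifier,
# and 'аdmin' starts with a Cyrillic 'а', not a Latin 'a'.
sample = "user\u200bname = аdmin"
for index, codepoint, reason in scan(sample):
    print(f"index {index}: {codepoint} ({reason})")
```

Both characters in the sample render as perfectly ordinary code in most editors, which is exactly why this class of SDC stays silent.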
The Future of SDC: Self-Replicating Risks
Now let's get wild: where is SDC heading, and what could turn it into a genuinely huge issue? These risks are largely unaddressed today, and they may only get harder to address.
There is a low-level "self-replicating" risk, and this is where the relationship between SDC and AI gets a bit eerie. I was recently reading a paper that looks at how random noise (which is essentially what SDC is) can lead to a kind of "pseudo-life": not 'intelligence', but self-reproducing complexity.
In a paper titled "Computational Life: How Well-formed, Self-replicating Programs Emerge from Simple Interaction," researchers from Google and the University of Chicago found something surprising. They showed that when you place random, non-self-replicating programs in an environment without any specific goals (no "fitness landscape"), self-replicators tend to arise naturally.
"We show that when random, non self-replicating programs are placed in an environment lacking any explicit fitness landscape, self-replicators tend to arise. We demonstrate how this occurs due to random interactions and self-modification, and can happen with and without background random mutations."
What does this have to do with SDC? Well, SDC is essentially a "background random mutation." If an LLM introduces random corruption into a system, we aren't just risking a crash: if Agüera y Arcas et al. are right, we are also risking the accidental creation of self-modifying code that emerges from the sheer chaos of interaction.
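To see why a single "background mutation" is so insidious, here is a minimal sketch (the `flip_bit` helper is hypothetical, written for this post) that simulates the classic hardware SDC event: one flipped bit in an 8-byte double. The value silently halves, and nothing downstream raises an error.

```python
import struct

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted (a simulated SDC event)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

packed = struct.pack("<d", 1.0)
# Bit 52 is the lowest exponent bit of an IEEE 754 double.
corrupted_value, = struct.unpack("<d", flip_bit(packed, 52))
print(corrupted_value)  # 0.5 -- no exception, the computation just keeps going
```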
You can check out the full paper here: https://arxiv.org/pdf/2406.19108
Citations
The "Software-Induced" SDC & Invisible Character Attacks
Traditional SDC & AI Hardware Failures
Understanding Silent Data Corruption in LLM Training (Ma et al., 2025)
A recent paper analyzing how hardware SDC (bit flips) actually affects the training of Large Language Models.
https://arxiv.org/pdf/2502.12340
Silent Data Corruptions at Scale (Dixit et al., Facebook)
The well-known paper that showed SDC is far more common in data centers than previously thought.
https://arxiv.org/pdf/2102.11245
This Part is an Ad:
Bad Character Scanner™ (BCS) is a tool designed to detect and prevent software-induced SDC in AI-generated code and text. BCS can scan for most of the character-level problems described above before they have a chance to turn into unexpected behaviours (or self-replicating headaches).