So... Silent Data Corruption (SDC) is data corruption that goes undetected by the system. Historically, the term mostly referred to hardware faults, but with the rise of LLMs (and the ever-present "AI" buzzword), a new kind of software-induced SDC is rearing its ugly head.
To be clear, SDC is a big-tent term; it can come from many sources, including hardware failures (such as maliciously poisoned chips) and software-level issues.
Large Language Models (LLMs) can inadvertently introduce SDC into code and text. This can happen in many ways:
- A programmer could copy code containing invisible characters from an e-mail or online forum (I'm not saying which one!) and paste it straight into the codebase.
- An LLM trained on bad coding examples, messy formatting, or deliberately poisoned data can learn to emit invisible characters in its output, a problem that has been well documented in recent years.
- LLMs can insert invisible Unicode characters that alter the logic of code or text, leading to unexpected behaviour. The range of offenders is wide: zero-width spaces, joiners, and other format-control characters.
- Homoglyph attacks: using characters that look identical to the human eye but come from different Unicode blocks (e.g., Cyrillic ‘а’ and Latin ‘a’). A minimal scanner covering both of these tricks is sketched right after this list.
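To make both of those concrete, here is a minimal sketch of a character scanner in Python. Everything in it is illustrative rather than any particular tool's API: the homoglyph table is a tiny hand-picked sample (a real scanner would use the full Unicode confusables data from UTS #39), while the invisible-character check leans on Unicode's "Cf" (format-control) category.

```python
import unicodedata

# Characters in Unicode category "Cf" (format controls) are invisible:
# zero-width spaces, joiners, bidirectional overrides, and similar.
INVISIBLE_CATEGORIES = {"Cf"}

# A tiny illustrative homoglyph table; a real tool would use the full
# Unicode confusables list (UTS #39) instead of this hand-picked sample.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}

def scan(text: str):
    """Yield (index, codepoint, reason) for each suspicious character."""
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            yield i, f"U+{ord(ch):04X}", "invisible format-control character"
        elif ch in HOMOGLYPHS:
            yield i, f"U+{ord(ch):04X}", f"homoglyph of '{HOMOGLYPHS[ch]}'"

# The zero-width space hides inside what looks like one identifier,
# and 'аdmin' starts with a Cyrillic 'а', not a Latin 'a'.
sample = "user\u200bname = аdmin"
for index, codepoint, reason in scan(sample):
    print(f"index {index}: {codepoint} ({reason})")
```

Both characters in the sample render as perfectly ordinary code in most editors, which is exactly why this class of SDC stays silent.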
The Future of SDC: Self-Replicating Risks
Now let's get wild: where is SDC heading, and what could turn it into a genuinely huge issue? These risks are largely unaddressed today, and they may only get harder to address.
There is a low-level "self-replicating" risk, and this is where the relationship between SDC and AI gets a bit eerie. I was recently reading a paper that looks at how random noise (which is essentially what SDC is) can lead to a kind of "pseudo-life": not 'intelligence', but self-reproducing complexity.
In a paper titled "Computational Life: How Well-formed, Self-replicating Programs Emerge from Simple Interaction," researchers from Google and the University of Chicago found something surprising. They showed that when you place random, non-self-replicating programs in an environment without any specific goals (no "fitness landscape"), self-replicators tend to arise naturally.
"We show that when random, non self-replicating programs are placed in an environment lacking any explicit fitness landscape, self-replicators tend to arise. We demonstrate how this occurs due to random interactions and self-modification, and can happen with and without background random mutations."
What does this have to do with SDC? Well, SDC is essentially a "background random mutation." If an LLM introduces random corruption into a system, we aren't just risking a crash: if Agüera y Arcas et al. are right, we are also risking the accidental creation of self-modifying code that emerges from the sheer chaos of interaction.
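To see why a single "background mutation" is so insidious, here is a minimal sketch (the `flip_bit` helper is hypothetical, written for this post) that simulates the classic hardware SDC event: one flipped bit in an 8-byte double. The value silently halves, and nothing downstream raises an error.

```python
import struct

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted (a simulated SDC event)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

packed = struct.pack("<d", 1.0)
# Bit 52 is the lowest exponent bit of an IEEE 754 double.
corrupted_value, = struct.unpack("<d", flip_bit(packed, 52))
print(corrupted_value)  # 0.5 -- no exception, the computation just keeps going
```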
You can check out the full paper here: https://arxiv.org/pdf/2406.19108
Citations
The "Software-Induced" SDC & Invisible Character Attacks
Traditional SDC & AI Hardware Failures
Understanding Silent Data Corruption in LLM Training (Ma et al., 2025)
A recent paper analyzing how hardware SDC (bit flips) actually affects the training of Large Language Models.
https://arxiv.org/pdf/2502.12340
Silent Data Corruptions at Scale (Dixit et al., Facebook)
The well-known paper that showed SDC is far more common in data centers than previously thought.
https://arxiv.org/pdf/2102.11245
This Part is an Ad:
Bad Character Scanner™ (BCS) is a tool designed to detect and prevent software-induced SDC in AI-generated code and text. BCS can scan for most of the character-level problems described above before they have a chance to turn into unexpected behaviours (or self-replicating headaches).