Okay folks, this blog post is going to be quite different from my past ones, which were research-based and, essentially, 'dry'. This one is more of a comment, a personal opinion from somebody researching specific aspects of how LLMs interact with low-level code: machine code, hexadecimal, Unicode, ASCII and reduced ASCII, and how all of that then interacts with languages like C, JS and so forth.
And what I'm finding is either a very big deal that could be really risky for us, or it might turn out to be a simple fix, but it is something. In essence, there's a very bizarre thing happening when LLMs write code, and it happens at the lowest level of interpretation.
Now to really get into this we have to talk about how computer science is taught.
In short, a computer science student generally starts off by learning that zeros and ones form bits and bytes, and how those are written as hexadecimal. Then we learn how machine code is built out of those bytes. From there, we learn that there are different character encoding standards, the various flavors of Unicode and ASCII. It's a very basic foundation that underpins computing and computer programming. With these simple encoding standards, such as ASCII, we can write computer programs in Rust and COBOL.
[To be clear, Rust and COBOL are programming languages whose source code is usually stored as ASCII- or Unicode-encoded text. So it's not so much that ASCII is a layer below Rust, but that ASCII is the encoding standard used to represent the source code of Rust programs as text files. Not that different, but it's important to be clear on this.]
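If you want to see what I mean by these layers, here's a tiny Python sketch (the Rust one-liner is just an illustration): one line of source text, the ASCII bytes that encode it, and the hex and binary views of those same bytes.

```python
# A line of Rust source code is just text...
src = 'fn main() { println!("hi"); }'

# ...which ASCII (or UTF-8) encodes into bytes...
raw = src.encode("ascii")

# ...which can equally be viewed as hexadecimal or as raw bits.
print(raw.hex(" "))                           # 66 6e 20 6d 61 69 6e ...
print(" ".join(f"{b:08b}" for b in raw[:4]))  # 01100110 01101110 00100000 01101101
```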
And those layers generally get read by parsers and look-up tables, and that system has worked very well for 50-plus years of computing.
Generally, if there's an issue with software, it doesn't come down to those layers. It mostly happens within the programming language itself, the compiler, the runtime engine, and so on. For many generations, it hasn't been important to teach the complete picture of computing: how a parser understands zeroes and ones, how a look-up table maps hexadecimal values, how a look-up table maps Unicode code points, and how a parser uses a look-up table in the first place. What's important is that, just as with sorting algorithms, all of these things, parsers, look-up tables, compilers, were heavily optimized at one point or another.
These low-level things, like ASCII and Unicode, are not simple devices. That's what I'm trying to convey here. A lot of optimization and abstraction has happened at every layer, starting all the way down at the machine code level. These optimizations are asymmetric, unique, and one-of-a-kind. However, precisely because these optimizations are unique and effective, they always introduce a level of risk and exposure. That's true of any optimization; there's always a trade-off. A good optimization is usually a smart trade, but it's always a trade-off.
So, what am I talking about here? Basically, LLMs do things when interacting with IDEs that are very unexpected and not something humans would ever do, even by accident. (I'm sure these are things humans have done on purpose...) One issue we've identified, and keep coming back to, relates to deleting sectors of code.
This is just one example of many, but as you can see, we have this issue. We've discussed it at length in other posts: the IDE doesn't recognize some invisible characters that have been copied and pasted into a file. For example, say you have a C file and you copy and paste some LLM-generated code into it. Later, you delete part of that code. There's a chance you've only partially deleted invisible characters that were never detected. Because the IDE can't detect some invisible characters, it often can't delete them properly either. This is one of the core issues we're seeing, and it comes down to parser optimization and core architectural decisions in many IDEs. I don't know of any IDE that has fully solved this problem, not even NeoVim, which handles it much better. To be clear, NeoVim is the best I've seen so far, but it still has issues.
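To make this concrete, here's a tiny Python demonstration (the zero-width space is just one example of an invisible character; real incidents involve a whole family of them): two lines that render identically in most editors but are not the same data.

```python
clean  = 'if user == "admin":'
pasted = 'if user == "ad\u200bmin":'  # same-looking line with a zero-width space (U+200B) hidden inside

print(clean == pasted)          # False: they are different byte sequences
print(len(clean), len(pasted))  # 19 20
print(ascii(pasted))            # the escaped repr finally exposes the hidden \u200b
```

If that second line had been pasted into a real file, a later edit could delete the visible characters around the hidden one and leave it behind, which is exactly the partial-deletion scenario I'm describing.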
So these corrupted bits of data, which never actually get detected, can reform into readable characters later on, and the process by which data becomes corrupted and then 'uncorrupted' like this is a huge security risk. It comes down to the fact that parsers are not infallible at detecting corrupted data or the types of corruption involved. These kinds of corruption are not always detectable; detection is never 100%.
So far we've mostly seen this happen by accident, but you can imagine how it could be used by a bad LLM, or a bad person, to do bad things.
I don't want to delve too deeply into this because I don't have much solid to say yet, but I've been studying it for some time, and I have only become more concerned by my own, and most people's, lack of full understanding. And I truly don't think there's any definitive work on this exact issue. However, there's a lot of work on undetectable code corruption, also known as silent data corruption (SDC). There isn't much research, though, on this new form of hidden corrupt data sectors caused by LLMs. One reason for this, I think, is that this kind of issue is often benign.
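For what it's worth, here's the kind of minimal check I'm talking about, a sketch in Python, not a complete tool: the list of suspect code points is hand-picked and certainly not exhaustive, which is exactly the 'detection is not 100%' problem.

```python
"""invisible_scan.py — a minimal sketch, not a complete tool."""
import sys
import unicodedata

# Hand-picked code points that commonly hide in pasted text; certainly not exhaustive.
SUSPECT = {
    0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,   # zero-width chars, word joiner, BOM
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,   # bidi embedding/override controls
    0x2066, 0x2067, 0x2068, 0x2069,           # bidi isolate controls
}

def scan(path):
    """Print every character that is in SUSPECT or in Unicode category Cf (format)."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            for col, ch in enumerate(line, 1):
                if ord(ch) in SUSPECT or unicodedata.category(ch) == "Cf":
                    name = unicodedata.name(ch, "UNNAMED")
                    print(f"{path}:{lineno}:{col}: U+{ord(ch):04X} {name}")

if __name__ == "__main__":
    for p in sys.argv[1:]:
        scan(p)
```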
It can exist in the final product, be released, and not interact with anything or cause problems. Therefore, it's possible to get a program that passes every test and is greenlit. There may be some of this in a file somewhere, but it doesn't affect operation or get picked up, so it's not really an issue. However, that's not satisfactory. I don't think anyone closely examining this issue would find it satisfactory because it leaves the possibility open for a new kind of problem. This one problem has also caused me to look deeper and eventually discover many more problems.
The Ocean of Problems.
So when you really start looking at this, most developers start off thinking, 'okay, we could solve these problems by just using reduced ASCII'; it's a common refrain from software engineers. But if you look a lot closer, you realize that's not practical, and there's no real proof it will fully solve the problem. In my personal testing it seems to mostly work, but it doesn't fully solve things; some issues can get past even a reduced ASCII set, and those issues are often really bad…
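For completeness, here's what the reduced-ASCII idea looks like as a whitelist check, a minimal sketch assuming 'reduced ASCII' means printable ASCII plus tab, newline and carriage return. In my testing this kind of thing catches a lot, but as I said, it isn't proof of safety, and it does nothing about problems that live below the text layer.

```python
"""reduced_ascii_check.py — a sketch of the whitelist approach, assuming
'reduced ASCII' means printable ASCII (0x20-0x7E) plus tab, LF and CR."""
import sys

ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

def offenders(path):
    """Yield (offset, byte) pairs for every byte outside the whitelist."""
    with open(path, "rb") as f:
        data = f.read()
    for offset, b in enumerate(data):
        if b not in ALLOWED:
            yield offset, b

if __name__ == "__main__":
    for p in sys.argv[1:]:
        for offset, b in offenders(p):
            print(f"{p}: byte 0x{b:02X} at offset {offset} is outside reduced ASCII")
```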
So, in short, what I'm really trying to say here is that I found an ocean of problems. The deleted-sectors issue is one of them, but there are many like it and just as bad. I'm calling it "The Bit Echo LLM Error Ocean", and I think this is going to become a big realm of study fairly soon. I feel like I'm at the vanguard of something pretty big, because I've realized there's a kind of upper and lower boundlessness to these interactions that no one I meet seems to suspect.
However, there is precedent: I think there's actually a lot of work that's very close and related to this issue, not specifically about it, but about this kind of interaction. Obviously there's a huge amount of chip architecture involved here, so it goes all the way down to chip architecture...
Yes, let's go even deeper, let's talk about chip architecture.
First, I doubt you, the reader, unless you work at TSMC, know the architecture of the chips in the computer or cell phone you're using to read this blog. I suspect this because the architecture is often secret; in fact, some parts are "hyper secret." The architectural knowledge needed to understand the low-level vulnerabilities that affect these chips is highly privileged information.
High-level vulnerabilities, on the other hand, tend to live in the realm of programming languages and frameworks. When we're talking about machine code vulnerabilities, the architecture of the chip plays a huge role.
Industry Awareness of Hardware-Level Vulnerabilities: It's worth noting that the industry has been aware of certain relevant low-level hardware vulnerabilities for over a decade. A prime example is the "row hammer" problem in DRAM, which was documented as early as 2012 through a series of patent applications filed by Intel. As detailed in Kim et al.'s seminal 2014 paper "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors" (Carnegie Mellon University and Intel Labs), researchers demonstrated that simply by reading from the same address in DRAM repeatedly, it becomes possible to corrupt data in nearby memory addresses, a fundamental violation of memory isolation. This type of hardware-level vulnerability, where the physical proximity of memory cells creates exploitable coupling effects, exemplifies how chip architecture directly impacts security at the lowest levels of computing. https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf
So what I'm trying to say here is that the actual extent of this problem, the 'Bit Echo LLM Error Ocean' problem, is almost unknowable, because we can't fully know the actual architecture of the chips. Parts of the chips are secret (the secret part of Intel chips that's most often referred to is the "Management Engine", or ME), and those parts are actually quite important when it comes to these kinds of low-level issues.
Also, as an aside, deep inside some Intel chips, there is an embedded operating system called MINIX, which runs at a very privileged level below the main operating system.
How these low level systems and chip architectures are going to interact with this ocean of problems, this ocean of LLM issues, is just anyone's guess at this point.
So this is less a blog post disclosing information and more a desperate cry for help into the ether, for other researchers to join me in this burgeoning new field of impossibly complicated issues. If you have insights, please contact me, and if you have papers you think I should read, please send them. Thank you so much.
I also hope to start conducting closed-door web symposia soon to discuss this issue. If you're interested, please contact me.
Related Research:
While specific research on LLM-induced low-level corruption is emerging, there is established work on related vulnerabilities:
- Bit Flips and Hardware Errors: Research on Rowhammer attacks (Kim et al., "Flipping Bits in Memory Without Accessing Them", ISCA 2014) demonstrates how unintended bit flips can be exploited for security vulnerabilities. https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf
- Silent Data Corruption: Studies on silent data corruption in memory systems (Schroeder et al., "DRAM Errors in the Wild", SIGMETRICS 2009) show that undetected bit errors occur more frequently than previously assumed. https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
- Parser Vulnerabilities: The "Weird Machine" concept (Bratus et al., "Exploit Programming", Communications of the ACM 2011) explores how parsers can be exploited through carefully crafted malformed inputs. https://doi.org/10.1145/1941487.1941508
- Unicode Security Issues: The Unicode Technical Report #36 on Unicode Security Considerations documents numerous security risks related to character encoding and invisible characters. https://www.unicode.org/reports/tr36/
- Trojan Source Attacks: Recent work by Boucher et al. ("Trojan Source: Invisible Vulnerabilities", 2021) demonstrates how Unicode bidirectional text can create invisible vulnerabilities in source code. https://trojansource.codes/trojan-source.pdf
Related BCS Blog Posts:
By J. Shoy - independent volunteer correspondent. All views expressed are his and his alone and are not related to Bad Character Scanner (BCS) or its affiliates.
Last updated: Nov 9th, 2025