AI-generated code comes with security risks
More and more students are using AI-generated code in their studies, often without understanding the security risks that this entails. These risks are particularly relevant to users who are still learning how to code in languages such as R.
How AI-generated code happens
Generative AI services such as ChatGPT use Large Language Models to generate computer code. These models are ‘trained’ on datasets of publicly available code.
Many users of generative AI do not seem fully aware of what ‘publicly available code’ might contain, and are therefore not fully aware of the security risks that come with executing AI-generated code.
Using generative AI services in a learning environment such as academia raises many concerns. Security is only one of them.
Where the security risk lies
Programming languages like R are not sandboxed. This means that code written in them can carry out malicious instructions such as ‘erase every image on the hard drive of that laptop,’ or ‘replace all occurrences of “Jewish” in that text with “kike”.’
Sandboxing the execution of R code is possible, but this is not how R runs by default.
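To make concrete what this implies, here is a minimal, harmless sketch: an R session has exactly the same access to the machine as the user who launched it, and the same mechanisms could just as easily be used to delete or exfiltrate files.

```r
# R runs with the full privileges of the user who started the session.
# Nothing in base R restricts what a script may read, write or delete.

list.files("~")        # read the contents of the home directory
Sys.getenv("HOME")     # read environment variables, which may contain secrets
system("whoami")       # run arbitrary shell commands on the host machine

# file.remove() and unlink() would, in the same way, delete any file the
# user is allowed to delete, with no confirmation prompt and no sandbox.
```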
The risk is real and already active
Just as human languages have already been ‘poisoned’ in various ways, some of the computer code that makes up the public codebase on which Large Language Models are trained has also been ‘poisoned’.
One of the ways that this has happened is through software packages. It is very easy to bundle harmful or malicious code into a software package, and then give it a name that closely resembles that of a legitimate package, a technique known as ‘typosquatting’.
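As a sketch of how this can look in practice (the lookalike name below is hypothetical, not a real package), a single character is often all that separates a legitimate installation call from a dangerous one:

```r
# Legitimate package from CRAN:
install.packages("data.table")

# Hypothetical typosquatted lookalike: note the digit '1' in place of 'l'.
# A package published under such a name could run arbitrary code when it is
# installed or loaded with library():
install.packages("data.tab1e")   # hypothetical name, shown for illustration only
```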
Executing R code that installs or loads such a package poses a security threat to the user, comparable to opening emails or attachments sent by unknown sources. The consequences can be relatively innocuous, or extremely serious.
Both AI code generators and inattentive users can be misled into including these harmful or malicious software packages in their own code. The vulnerability is triggered when that code is executed.
This scenario is not a vision of the future. It is already happening.
A real-world example of the threat
Even a user like me, who learnt how to code in R for research purposes, could very easily write a malicious software package.
Such a package might, for example, do the following:
- Scan all text files on disk for credit card information
- Encode that information in a website address
- Automatically open a Web browser and point it to that address
- Collect the credit card information server-side
- Delete as many files as possible on the hard drive
The steps above can run without the user noticing anything at all, or can complete in part or in full before the user is able to stop them.
Privacy and security breaches of this sort are very easy to implement, and have been implemented in virtually every programming language.
The risk is of course not limited to AI-generated code. Executing computer code from any untrusted source can lead to the same issues.
How to minimise the risk
R users should always check where their packages come from.
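One way to do this, sketched below using standard functions from base R and the utils package, is to look at which repositories are configured, confirm that a package name really exists on CRAN before installing it, and inspect the recorded origin of packages that are already installed:

```r
# Which repositories will install.packages() draw from?
getOption("repos")

# Does the package name actually exist on CRAN?
# (a typosquatted lookalike will usually not appear in this index)
cran <- available.packages(repos = "https://cloud.r-project.org")
"data.table" %in% rownames(cran)

# Where did an already-installed package come from, and who maintains it?
packageDescription("data.table")[c("Repository", "URL", "Maintainer")]
```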
R users who rely on AI-generated code should be even more careful, and should also warn other users that their code was at least in part AI-generated.
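A lightweight way to give that warning is to state it at the top of the script itself; the wording below is only a suggestion:

```r
# Provenance note (suggested wording):
# Parts of this script were generated with ChatGPT and reviewed by hand.
# All packages used below were checked against CRAN before installation.
```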
It goes without saying that I have never, and will never, design the kind of attack described in this note.