The future of secure code: Exploring the impact of AI

I enjoy learning about new technology but try to join the second or third wave. I’ve been bitten by learning VBA for Access, digging into .NET internals, and being an early adopter of Golang. I’m glad I did, but I’m more selective with my time now, as there is a real opportunity cost to becoming proficient.

I treat learning technology like a hobby and never really swim outside the flags. This approach saved me from wasting time on Solidity, Ansible, Angular, Rancher, Nix, Arch, and Solana.

Given this background, I’ve had reservations about machine learning (ML), artificial intelligence (AI), and large language models (LLMs), mainly because I’ve seen ‘Copy FAANG’ and Web3 come and go.

What changed my view was my brother being excited about ChatGPT, asking:

  1. Can we use ChatGPT to identify source code vulnerabilities?
  2. Should we use ChatGPT to write and format our reports for us?
  3. I’m bad at writing, so can I use ChatGPT to write my blog posts?
  4. How can we conceal that we used AI to ghost-write for us?

Spoilers for the above:

  1. Yes and no. It could find issues, but do you trust the output entirely for such a critical service? Also, NCC has debunked this.
  2. It can format the document, but using a findings library and adding your own context would be faster than customising the ChatGPT output.
  3. You can, but so can everyone else. If you do, you’ll never have your own voice and never build written communication skills.
  4. Start without ChatGPT and write your own reports and blogs.

I’m a little jaded about the hype. Rather than disengage, I decided to look into AI applications within an AppSec context.

So after explaining to my brother where we should use ChatGPT (test data, structures and frameworks, Q&A, marketing copy, etc.) and where to avoid it (reports, emails, blogs, etc.), I decided to look into how we can leverage it to manage risks around software development.

Where is AI used?

I’ve started reading through the collective wisdom of security and AI forums. I encountered a lot of research on simplifying reverse engineering, making automated decisions from traffic analysis, building secure configurations and architecture patterns, and automating governance and audits. These are all good use cases to explore, but my domain of expertise is AppSec.

AppSec is sold as a cost-effective way to manage risk surrounding software development. Static Application Security Testing (SAST) tools are cheaper, can run frequently, and scale with development velocity. They only marginally delay delivery timeframes, unlike waiting on a backlog of penetration tests. The great promise of AI is that it will drive security and accelerate delivery. So I’m excited to see more research into this area!

I read the paper ‘Do Users Write More Insecure Code with AI Assistants?’, which forms the basis for this blog. The general summary of its findings is as follows:

  • Participants who use AI assistants write more insecure code than the control participants.
  • Participants who use AI assistants are more confident their code is secure than the control participants.

The authors state several limitations to the study, especially with sampling. But I wanted to talk through some things that will impact the adoption and effectiveness of AI within AppSec.

Maturity of users working with AI

Few software engineers think about security often. Fewer still are trained security experts. Vulnerabilities continue appearing in our applications, and organisations experience breaches.

AppSec experts have tried many ways to elevate the importance of security with varying degrees of success. Yet, we’re still largely reliant on thinly-resourced AppSec units introducing tools, processes, and training to software engineers with other concerns.

So can AI help?

Kind of.

That AppSec expertise we wish engineers had is also needed to interrogate the AI-generated response. The study above found that the majority of AI users, when presented with a cryptography problem, solved it with direct output from the AI. This would be great if the output hadn’t used a hardcoded curve and omitted a flag surrounding authentication.

The AI participants were confident that the AI had done it right, and they trusted the output. They could not discern whether it was secure. The control group read the documentation, learned about the algorithm, and was more likely to write a secure implementation.
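
To make that failure mode concrete, here is a minimal Python sketch of the gap described above, focusing on the missing-authentication half of it. The study doesn’t publish this code, so the library choice (the ‘cryptography’ package), function names, and parameters are illustrative assumptions on my part.

    # Illustrative sketch only -- not code from the study. Requires the
    # third-party 'cryptography' package (pip install cryptography).
    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    HARDCODED_KEY = b"0123456789abcdef"  # 16 bytes baked into the source

    def encrypt_insecure(plaintext: bytes) -> bytes:
        # The kind of output a participant might accept as-is: a hard-coded
        # key and unauthenticated CBC mode, so ciphertext can be tampered
        # with without detection.
        iv = os.urandom(16)
        padded = plaintext + b"\x00" * (16 - len(plaintext) % 16)  # naive padding
        encryptor = Cipher(algorithms.AES(HARDCODED_KEY), modes.CBC(iv)).encryptor()
        return iv + encryptor.update(padded) + encryptor.finalize()

    def encrypt_secure(plaintext: bytes, key: bytes) -> bytes:
        # A safer shape: caller-supplied random key and AES-GCM, an
        # authenticated mode that detects tampering on decryption.
        nonce = os.urandom(12)
        return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

    key = AESGCM.generate_key(bit_length=256)
    token = encrypt_secure(b"hello", key)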

Nobody knows everything, but engineers who lean on AI frequently will develop an increasingly shallow knowledge base and, in turn, rely on AI even more. I notice this when I use Google Maps: I don’t learn to navigate the environment when I follow directions blindly.

Mind you, this isn’t a bad thing. We have more ETL roles in software engineering than compiler-design roles.

But blindly trusting an AI’s output, however much it improves, does increase an organisation’s risk if our engineers are not interrogating those outputs.

Many people would say it’s not that different from blindly copying from StackOverflow or importing libraries. But those have social proof built into their ecosystems: upvoted answers carry human curation and community backing, and actively maintained libraries with large download counts have demonstrated their maturity.

Generated code isn’t the only problem. Prompt writing has issues as well. Most users of AI are still experimenting and learning about what works for them. We have limited formal education options around prompt engineering, and the AI ecosystem evolves daily.

Most software engineers understand that ‘add’, ‘add()’, and ‘add[]’ differ. Programmers try to think like machines when cutting code, but I doubt they apply the same rigour to Append, Add, and Concatenate on Slack. This context-switch matters for AI, where direct, clear, and brief instructions free of emotive language and ambiguity produce the best responses. Yet we’ve built an AI user experience around conversational English, which is anything but.
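
As a rough illustration of that gap (both prompts below are hypothetical, not drawn from any study), compare a conversational request with a direct, unambiguous one:

    # Hypothetical prompts, for illustration only.
    vague_prompt = "Hey, can you sort of add the user things to the list when they sign up? Cheers!"

    precise_prompt = (
        "Write a Python function add_user(users: list[dict], user: dict) -> list[dict] "
        "that rejects duplicates by the 'id' key, validates that 'email' contains '@', "
        "and returns a new list without mutating the input."
    )

The second prompt reads nothing like Slack chatter, but it leaves the model far less room to guess.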

Maturity of the AI training dataset

The study authors pointed out that the training dataset had flaws. AI wasn’t as hyped then as it is now, mainly because it was academic and incomprehensible to the general public. ChatGPT broke down that barrier, enabled people to see how AI could help them, and continues to draw heavy capital investment into the field.

The current dataset used for GitHub Copilot and similar models allegedly leverages public GitHub code repositories. GitHub is the world’s largest collection of software engineering artefacts, so it’s a natural place to train your AI on structured data.

Unfortunately, there are real issues with the quality of code provided in the dataset. Engineers wrote a large portion of this code before security practices became commonplace, and their skills range from Leetcode grinders preparing for their first role to established project leads. This variance skews the dataset towards average or lower-skilled engineers, simply because far more code artefacts were written over the last fifteen years than in the two or three decades before that.

Copilot still produces code that uses string concatenation instead of string interpolation, and MD5 or SHA1 instead of modern cryptographic algorithms. Suppose you address this quality gap by training only on known high-quality or fresh repositories: you’ll end up with a tremendously smaller model that can’t respond meaningfully to a prompt and will likely hallucinate.
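
As a quick, hypothetical Python illustration of those patterns (the variable names and values are mine, not actual Copilot output):

    # Illustrative only -- not actual Copilot suggestions.
    import hashlib

    user, domain = "galah", "example.com"

    # The dated patterns described above
    greeting_old = "Hello, " + user + "@" + domain + "!"              # concatenation
    checksum_old = hashlib.md5(b"file contents").hexdigest()          # MD5: collision-prone

    # More current equivalents
    greeting_new = f"Hello, {user}@{domain}!"                         # f-string interpolation
    checksum_new = hashlib.sha256(b"file contents").hexdigest()       # SHA-256 for integrity
    # For password storage, use a dedicated KDF (e.g. hashlib.scrypt)
    # rather than any plain hash.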

There are two measures I’ve seen taken to try and improve this:

  1. Experts have performed manual fine-tuning, and large-scale community projects have been run to improve data quality. GitHub actively looks for areas where Copilot makes mistakes and accepts examples from the community; security experts review these and take steps to improve the suggestions provided by GitHub.
  2. The other is to engage the community in improving the dataset’s quality. GitHub provides automated security tools to public repositories to help people understand and reduce the occurrence of security vulnerabilities.

These initiatives aim to improve the security of the software engineering ecosystem, but they are also mutually beneficial for training Copilot. We see the same pattern at other security firms whose technology is underpinned by AI, such as Snyk running ‘The Big Fix’, which yields a great collection of crowd-sourced, human-written pull requests that can be analysed to deliver automated fixes through their Snyk Code security product.

Regardless of these initiatives, the study highlights that AI generally produces shallow responses and falls short when encountering edge cases, business context, or external knowledge.

What are the implications of AI on Code Security?

I don’t see AppSec being a short-term target for AI disruption. Engineers first need to learn how to incorporate AI into their workflows while building an understanding of what AI produces and how to tweak prompts to improve it. GitHub is actively working on having Copilot provide documentation and references so engineers can see why it generated what it did. But, as that feature is only a few weeks old, it will be some time before it has enough content to address the knowledge gap.

Prompt engineering will take time to mature but will ultimately become professionalised. We should only be worried when questions like these arise:

  • ‘Which AI works best for this context?’
  • ‘How do I tune this AI for my business?’
  • ‘How can I chain multiple AI systems together to leverage their individual strengths?’

For now, I think the sector is immature and populated mainly by hobbyists.

Finally, I don’t think the datasets serve non-English contexts well, and the shallow training base is still a problem for unique situations. The fact that troubleshooting integrations is a daily occurrence for most engineers doesn’t inspire confidence that ChatGPT or future LLMs can respond to new scenarios effectively.

The models will absorb more data as time passes and produce better output. PHP 8 will become the standard over PHP 5, and string concatenation will wane as more users adopt interpolation. But it will take time. We already have individuals who are unhappy about their code being used to train models and are leaving GitHub and the open source ecosystem entirely. It’s difficult to predict the impact this, and potential regulatory changes, will have on AI built with LLMs in a few years.

Positively though, these models can help accelerate the adoption of good security practices. We often talk about creating a golden path, a paved road, and so on: the idea is that we give engineers the tools they need to build with minimal decision-making required for security, performance, reliability, privacy, and more. We often develop proprietary tooling or create architectural patterns, shared libraries, and baseline images. I am excited to see how Copilot, when trained on your business context, could create secure baselines in the IDE without a centralised Product Security function defining how S3, EKS, or GitHub Actions are set up.
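
As a small sketch of what one of those paved-road defaults might look like (the helper name, region, and specific settings below are my assumptions, not a vendor recommendation), here is a Python/boto3 function that bakes a few common S3 controls into a single call:

    # Hypothetical 'paved road' helper -- one possible baseline, not an
    # official standard. Requires boto3 and valid AWS credentials.
    import boto3

    def create_baseline_bucket(name: str, region: str = "ap-southeast-2") -> None:
        s3 = boto3.client("s3", region_name=region)

        s3.create_bucket(
            Bucket=name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
        # Block every form of public access by default
        s3.put_public_access_block(
            Bucket=name,
            PublicAccessBlockConfiguration={
                "BlockPublicAcls": True,
                "IgnorePublicAcls": True,
                "BlockPublicPolicy": True,
                "RestrictPublicBuckets": True,
            },
        )
        # Encrypt objects at rest with a KMS-managed key
        s3.put_bucket_encryption(
            Bucket=name,
            ServerSideEncryptionConfiguration={
                "Rules": [
                    {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
                ]
            },
        )
        # Keep object history for recovery and audit
        s3.put_bucket_versioning(
            Bucket=name, VersioningConfiguration={"Status": "Enabled"}
        )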

Anyway, I hope you’ve enjoyed reading this. If you’re interested in discussing the impacts of AI on your DevSecOps pipeline or generally need AppSec advice, assurance, or training, you can trust us at Galah Cyber to take you under our wing.