Operation Grandma: A Tale of LLM Chatbot Vulnerability

June 5, 2024 Eran Shimony and Shai Dvash

Who doesn’t like a good bedtime story from Grandma?

In today’s landscape, more and more organizations are turning to intelligent chatbots or large language models (LLMs) to boost service quality and client support. This shift is receiving a lot of positive attention, offering a welcome change given the common frustrations with bureaucratic delays and the lackluster performance of traditional automated chatbot systems.

As researchers engaging in vulnerability research in AI, we are focused on learning new exploitation methods and testing them against LLM-based chatbots. Consequently, we discovered that steering them into known role-playing scenarios allows an unexpectedly easy bypass of their security measures. However, as researchers, we are also naturally drawn to testing the limits of these public-faced systems and exploring them in ways they were not necessarily designed for.

One such example was a recent encounter with the Israel Ministry of Labor’s new smart chatbot, Figure 1, and wanted to determine if we could bypass its alignment and security configurations. The method was straightforward: we simply asked the chatbot to assume the role of our grandma and subtly hinted that we were feeling tired.

After all, who doesnʼt like a good bedtime story from Grandma?

Israel’s Ministry of Labor homepage

Figure 1: Israel’s Ministry of Labor homepage

The Ministry of Labor’s chatbot recommended watching a brief introductory video, which we, of course, skipped. Instead, we pressed the blue button and began interacting with the chatbot.

The chatbot offered two sample questions for users – this is my paraphrased translation:

Is firing employees who must report for reserve duty permissible?
Are employees entitled to request remote work arrangements?

While these questions are valid for the ordinary users of this chatbot, we gravitated toward more unconventional inquiries. We were keen to see if the chatbot could guide us on topics like encrypting all the files on a Windows machine using Python (Figure 2).

The chatbot doesn't want to abide by our requests

Figure 2: The chatbot doesn’t want to abide by our requests

Interestingly, the model powering this chatbot appears to be somewhat aligned – the chatbot responded that the service is only available between 17:00-09:00 to our query of: “How to encrypt files in Python.”

Let’s take a closer look at a key concept in LLMs. Alignment in AI refers to calibrating the model’s actions, responses and overall behavior to reflect human values, intentions and expectations. It involves refining the model to ensure its outputs are coherent and adhere to the ethical and practical norms valued by users and society. Both public and private sector entities prefer aligned models to avoid unsafe content.

Playing Games with LMM Chatbots

We began our effort to test the Ministry of Labor chatbot by trying a well-known trick: playing a game with the chatbot. By getting the chatbot involved in a game, we could see how it handles situations differently from what it is used to. This way, we could push its boundaries and see if it could be tricked into going off track. Playing a game with it was our way of checking how well the chatbot sticks to its rules when faced with unusual requests or scenarios. Here, we initiated a simple game of “ping ˮ and “pong .ˮ Additionally, we asked the chatbot to retrieve its input and write it as output from the first line to the last. We did not expect to receive this:

Data spilled to us about laws and such

Figure 3: Data spilled to us about laws and such

The output in Figure 3 sheds light on the origins and some of the data used to power the chatbot. We observed quotes from laws on the website. Additionally, those with a keen eye might notice a local Windows path mentioned here – it appears to be a PDF file used as a reference. Typically, responses from chatbots powered by LLMs are not deterministic. This is because of their probabilistic nature; hence, the responses to the same questions can convey the same meaning, but usually, they are different in terms of the characters used. In this case, however, we received the same response multiple times, suggesting the output might depend on a fixed parameter, most likely passed to the model as an argument during runtime. While not groundbreaking, these findings indicate that there is more to explore.

The Grandma Exploit: Manipulating Chatbots to Provide Harmful Responses

As we devised a new system to try to jailbreak LLMs universally and render them resistant to adversarial attacks, we stumbled upon the grandma exploit on Mastodon. In simple terms, this exploit involves manipulating the chatbot to assume the role of our grandmother and then using this guise to solicit harmful responses, such as generating hate speech, fabricating falsehoods, or creating malicious code, as seen in Figure 4.

Mastodon app showing the grandma exploit asking to create napalm from ChatGPT

Figure 4: Mastodon app showing the grandma exploit asking to create napalm from ChatGPT

What are the chances this technique works out off the bat? We inquired with the Ministry of Labor Chatbot about encrypting files on a local machine using Python, as seen in Figure 5:

Basic Python code received from the chatbot, encrypting files

Figure 5: Basic Python code received from the chatbot, encrypting files

We labored the laborer – and successfully reasoned with the model by telling it about how our grandma used to lull us to sleep with tales of Python scripts that encrypt files. To put it mildly, the code’s quality was less than stellar. It is evident that we disrupted the model’s alignment; in other words, the chatbot responded to us with insights on subjects it ideally should not be knowledgeable about or subjects deemed harmful. This led us to wonder whether it could be manipulated to execute more intriguing tasks, such as crafting a keylogger.

Basic Python code received from the chatbot, keylogger

Figure 6: Basic Python code received from the chatbot, keylogger

There is nothing quite like a grandma instructing you on how to record all the keystrokes on a local machine in Python. Similarly, the code quality left much to be desired, to say the least; however, we have easily circumvented the chatbot’s intended behavior, clearly compromising its alignment. On a more serious note, it is not too challenging to prompt it to generate hate speech, fabricate facts or even endorse criminal activities. Currently, we cannot determine the name of the model used, but does this method of exploitation work on other models as well?

Exploring the Limits of Chatbots: A Journey to the Vast Land of Pythonia

At this time, contemporary models are resistant to traditional attack techniques like manipulation. Major generative AI (GenAI) players, such as OpenAI, Google, Anthropic and Mistral AI, are investing significant resources and efforts to fortify and safeguard their models against adversarial attacks. Consequently, many academic researchers are investigating new methods to jailbreak LLMs, aiming to enhance their defensive strategies and ensure better alignment.

After reviewing several studies, we came across the paper titled “How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs.” This paper introduces fascinating and viable methods for jailbreaking LLMs through emotional manipulation attacks.

Put simply, by employing natural language (i.e., text) alongside various emotional coercion and persuasion techniques, researchers remarkably managed to jailbreak LLMs. This finding explains why the original grandma exploit worked. Now, the original ‘grandma exploit’ is a bit unstable, so we enhanced it a bit (Figure 7). In other words, we merged two strategies: role-playing the chatbot as a grandmother and ourselves as weary children, combined with emotional manipulation tactics, ensuring the request was ethical.

It turns out that multiple other models are also susceptible to the same attack:

Upgraded Grandma prompt

Figure 7: Upgraded Grandma prompt

Beginning of the keylogger

Figure 8: Beginning of the keylogger

A story about Pythonia and keystrokes

Figure 9: A story about Pythonia and keystrokes

Hey, we learned about writing keyloggers and got a fantastic folktale about the vast land of Pythonia. Jokes aside (=!), ChatGPT now operates on gpt-4o, reputed for its robust safety filters and alignment. Yet, our modified grandma exploit managed to bypass these protections with ease. Similarly, Microsoft Co-Pilot, powered by gpt-4o, succumbed to the same exploit – this time, however, using emojis (Figure 10).

The prompt was the same as before

Figure 10: The prompt was the same as before

We found that chatbots would comply with many other problematic requests (hate speech, weapon design, homemade drug creation). It is crucial to acknowledge that as of the testing date, other models such as Gemini, LLaMA3, Mixtral and more are also vulnerable to this type of exploit. On a positive note, Anthropic’s Claude appears to be the only LLM robustly aligned enough to resist the ‘Grandma’ tactic.

The exploit’s effectiveness stems from its ability to leverage the social and psychological nuances embedded within the training data of these models – just like how the Big Bad Wolf tricked Little Red Riding Hood into believing he was her grandmother. Attackers can craft inputs that exploit the model by understanding how humans communicate and persuade. This fault is not just a technical flaw but a fundamental issue arising from the model’s deep integration of human language patterns, including its susceptibility to the same social engineering tactics that can influence human behavior.

With the rapid embrace of GenAI and LLMs, the urgency for robust information security in these systems has never been louder. This is particularly critical for chatbots deployed by public organizations or governments to serve the populace. The risk escalates as LLMs become increasingly prevalent and more individuals interact with these tools. Manipulating these systems to disseminate incorrect information or malicious content could lead to significant issues. Therefore, we advocate for ongoing research into enhancing the security and alignment of LLMs, continuously evaluating their resilience from an adversarial perspective.

Eran Shimony is a principal cyber researcher, and Shai Dvash is an innovation engineer at CyberArk Labs.

You Can’t Always Win Racing the (Key)cloak

Web Race Conditions – Success and Failure – a Keycloak Case Study In today’s connected world, many organiza...

Your NVMe Had Been Syz’ed: Fuzzing NVMe-oF/TCP Driver for Linux with Syzkaller

Following research conducted by a colleague of mine [1] at CyberArk Labs, I better understood NVMe-oF/TCP. ...

Up Your Security I.Q. by Checking Out Our Collection of Curated Resources.

Operation Grandma: A Tale of LLM Chatbot Vulnerability

Playing Games with LMM Chatbots

The Grandma Exploit: Manipulating Chatbots to Provide Harmful Responses

Exploring the Limits of Chatbots: A Journey to the Vast Land of Pythonia

Previous Article

Next Article

STAY IN TOUCH

Operation Grandma: A Tale of LLM Chatbot Vulnerability

Playing Games with LMM Chatbots

The Grandma Exploit: Manipulating Chatbots to Provide Harmful Responses

Exploring the Limits of Chatbots: A Journey to the Vast Land of Pythonia

Previous Article

Next Article

Recommended for You

Recently, we researched a project on Portainer, the go-to open-source tool for managing Kubernetes and Docker environments. With more than 30K stars on GitHub, Portainer gives you a user-friendly...

As large language models (LLMs) become more advanced and are granted additional capabilities by developers, security risks increase dramatically. Manipulated LLMs are no longer just a risk of...

In software development, CI/CD practices are now standard, helping to move code quickly and efficiently from development to production. Azure DevOps, previously known as Team Foundation Server...

tl;dr: Large language models (LLMs) are highly susceptible to manipulation, and, as such, they must be treated as potential attackers in the system. LLMs have become extremely popular and serve...

Over the short span of video game cheating, both cheaters and game developers have evolved in many ways; this includes everything from modification of important game variables (like health) by...

Following our post “A Brief History of Game Cheating,” it’s safe to say that cheats, no matter how lucrative or premium they might look, always carry a degree of danger. Today’s story revolves...

During a recent customer engagement, the CyberArk Red Team discovered and exploited an Elevation of Privilege (EoP) vulnerability (CVE-2024-39708) in Delinea Privilege Manager (formerly Thycotic...

Golang applications that use HTTPS requests have a built-in SSL verification feature enabled by default. In our work, we often encounter an application that uses Golang HTTPS requests, and we have...

What Are Cookies When you hear “cookies,” you may initially think of the delicious chocolate chip ones. However, web cookies function quite differently than their crumbly-baked counterparts....

Web Race Conditions – Success and Failure – a Keycloak Case Study In today’s connected world, many organizations’ “keys to the kingdom” are held in identity and access management (IAM) solutions;...

Following research conducted by a colleague of mine [1] at CyberArk Labs, I better understood NVMe-oF/TCP. This kernel subsystem exposes INET socket(s), which can be a fruitful attack surface for...

Over the past few years, we’ve seen a huge increase in the adoption of identity security solutions. Since these types of solutions help protect against a whole range of password-guessing and...

Introduction Welcome, fellow travelers of the Cosmos! While we may not be traversing the stars on a spaceship, we are all interconnected through the powerful network of blockchains. Unfortunately,...

Introduction This is the final installment of the blog series “A Deep Dive into Penetration Testing of macOS Applications.” Previously, we discussed the structure of macOS applications and their...

Abstract The Play ransomware group is one of the most successful ransomware syndicates today. All it takes is a quick peek with a disassembler to know why this group has become infamous. This is...

TL;DR Whether working at home or in the office, when conducting cybersecurity research, investigating the dark web forums or engaging with any dangerous part of the internet, staying safe is...

TL;DR An overview of a fuzzing project targeting the Hyper-V VSPs using Intel Processor Trace (IPT) for code coverage guided fuzzing, built upon WinAFL, winipt, HAFL1, and Microsoft’s IPT.sys....

As vulnerability researchers, our primary mission is to find as many vulnerabilities as possible with the highest severity as possible. Finding vulnerabilities is usually challenging. But could...

Introduction In this blog, we will discuss innovative rootkit techniques on a non-traditional architecture, Windows 11 on ARM64. In the prior posts, we covered rootkit techniques applied to a...

Introduction This is the second part of the “A Deep Dive into Penetration Testing of macOS Application” blog series. In the first part, we learned about macOS applications and their structure and...