On Thursday, OpenAI published the “system card” for ChatGPT’s new GPT-4o AI model, which details the model’s limitations and safety testing procedures. Among other examples, the document reveals that in rare instances during testing, the model’s Advanced Voice Mode unintentionally imitated users’ voices without permission. OpenAI now has safeguards in place that prevent this from happening, but the example reflects the growing complexity of safely architecting an AI chatbot that can potentially imitate any voice from a short clip.
Advanced Voice Mode is a feature of ChatGPT that allows users to have voice conversations with the AI assistant.
In a section of the GPT-4o system card titled “Unauthorized Voice Generation,” OpenAI details an episode in which a noisy input somehow prompted the model to suddenly imitate the user’s voice. “Voice generation can also occur in non-adversarial situations, such as our use of this capability to generate voices for ChatGPT’s advanced voice mode,” OpenAI writes. “During testing, we also observed rare instances where the model unintentionally generated output that mimicked the user’s voice.”
In this example of unintentional voice generation provided by OpenAI, the AI model shouts “No!” and continues the sentence in a voice that sounds like the “red teamer” heard at the beginning of the clip. (A red teamer is a person hired by a company to perform adversarial testing.)
It would certainly be creepy to talk to a machine and have it suddenly start speaking to you in your own voice. OpenAI normally has safeguards in place to prevent this, which is why the company says the behavior was rare even before it developed ways to prevent it entirely. But the example prompted BuzzFeed data scientist Max Woolf to tweet, “OpenAI just leaked the plot of the next season of Black Mirror.”
Audio prompt injections
How could voice imitation happen with OpenAI’s new model? The main clue lies elsewhere in the GPT-4o system card. To create voices, GPT-4o can apparently synthesize almost any type of sound found in its training data, including sound effects and music (though OpenAI discourages this behavior with special instructions).
As stated in the system card, the model can essentially imitate any voice from a short audio clip. OpenAI steers this capability safely by providing a licensed voice sample (from a hired voice actor) that the model is tasked with imitating. The sample is supplied in the AI model’s system prompt (what OpenAI calls the “system message”) at the beginning of a conversation. “We monitor ideal completions using the voice sample in the system message as the base voice,” OpenAI writes.
In text-only LLMs, the system message is a hidden set of textual instructions that guides the chatbot’s behavior and is silently added to the conversation history just before the chat session begins. Each subsequent interaction is appended to the same chat history, and the entire combined context (often called the “context window”) is sent back to the AI model every time the user provides new input.
(The diagram below, created in early 2023, is probably due for an update, but it shows how the system prompt works in an AI chat. Just imagine the first prompt being a system message that says things like “You are a helpful chatbot. You do not talk about violent acts, etc.”)
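For readers who want that loop spelled out, here is a minimal sketch in Python. The role/content message format mirrors OpenAI’s Chat Completions API, but `send_to_model` is a hypothetical stand-in for a real API call, not an actual OpenAI function.

```python
# Minimal sketch of a chat loop with a system message.
# The {"role": ..., "content": ...} format mirrors OpenAI's Chat
# Completions API; send_to_model() is a hypothetical stand-in.

def send_to_model(messages: list[dict]) -> str:
    """Hypothetical placeholder for a real chat-completion API call."""
    return "(model reply)"

# The system message is silently prepended before the session begins.
conversation = [
    {"role": "system",
     "content": "You are a helpful chatbot. You do not talk about violent acts."},
]

def chat(user_input: str) -> str:
    # Each user turn is appended to the same history...
    conversation.append({"role": "user", "content": user_input})
    # ...and the entire context window is sent to the model every time.
    reply = send_to_model(conversation)
    conversation.append({"role": "assistant", "content": reply})
    return reply

print(chat("Hello!"))
```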
Since GPT-4o is multimodal and can process tokenized audio, OpenAI can also use audio inputs as part of the model’s system prompt, and that is how it supplies the licensed voice sample the model is meant to imitate. The company also uses a separate system to detect whether the model is generating unauthorized audio. “We only allow the model to use certain pre-selected voices,” OpenAI writes, “and use an output classifier to detect if the model deviates from that.”
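OpenAI hasn’t published how that output classifier works, but the safeguard it describes amounts to a gate on generated audio: compare the speaker characteristics of each output clip against the pre-selected voice and block anything that drifts. The sketch below is purely illustrative; the toy embeddings, the `is_authorized_voice` function, and the similarity threshold are all assumptions, not details from the system card.

```python
import numpy as np

# Purely illustrative sketch of an output classifier gate. The
# embeddings, function names, and threshold are assumptions; OpenAI
# has not published its classifier's design.

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_authorized_voice(output_embedding: np.ndarray,
                        approved_embedding: np.ndarray,
                        threshold: float = 0.8) -> bool:
    """Does the generated clip still sound like the licensed base voice?"""
    return cosine_similarity(output_embedding, approved_embedding) >= threshold

# Toy speaker "embeddings" (made up, 4-dimensional for readability):
approved   = np.array([0.9, 0.1, 0.3, 0.2])      # licensed voice actor
generated  = np.array([0.85, 0.15, 0.35, 0.18])  # close to the base voice
user_clone = np.array([-0.2, 0.9, 0.1, 0.6])     # deviates: mimics the user

print(is_authorized_voice(generated, approved))   # True  -> allow playback
print(is_authorized_voice(user_clone, approved))  # False -> block the output
```

In practice such a check would presumably run on the model’s audio outputs as they are produced, but the system card only confirms that a classifier watches for deviation from the pre-selected voices.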