In October 2023, an AI-synthesized impersonation of the voice of an opposition leader helped swing Slovakia’s election to a pro-Russia candidate. Another AI audio fake was layered onto a real video clip of a candidate in Pakistan, supposedly calling on voters to boycott the general election in February 2024. Ahead of Bangladesh’s elections in January, several fakes created with inexpensive, commercial AI generators gained traction among voters by smearing rivals of the incumbent prime minister. And, in the US, an audio clip masquerading as the voice of President Joe Biden urged voters not to vote in one key state’s primary election.
Experts agree that the historic election year of 2024 is set to be the year of AI-driven deepfakes, with potentially disastrous consequences for at-risk democracies. Recent research suggests that, in general, about half of the public can’t tell the difference between real and AI-generated imagery, and that voters cannot reliably detect speech deepfakes; the technology has only improved since those studies were conducted. Deepfakes range from subtle image changes using synthetic media and voice cloning of digital recordings to hired digital avatars and sophisticated “face-swaps” that use customized tools. (The overwhelming majority of deepfake traffic on the internet is driven by misogyny and personal vindictiveness: to humiliate individual women with fake sexualized imagery, a tactic that is also increasingly being used to attack women journalists.)
Why AI Audio Fakes Could Pose the Chief Threat This Election Cycle
Media manipulation investigators told GIJN that fake AI-generated audio simulations, in which a real voice is cloned by a machine learning tool to state a fake message, could emerge as an even bigger threat to elections in 2024 and 2025 than fabricated videos. One reason is that, like so-called cheapfakes, audio deepfakes are easier and cheaper to produce. (Cheapfakes have already been widely used in election disinformation; they include video purportedly from one place that was actually shot in another, short audio clips crudely spliced into videos, and blatantly edited closed captions.) Another advantage for bad actors is that audio fakes can be used in automated robocalls to target voters with misinformation, especially older, highly active voters. And tracing the origin of robocalls remains a global blind spot for investigative reporters.
“AI audio fakes can pose a significant threat,” emphasizes Olga Yurkova, journalism trainer and cofounder of StopFake.org, an independent Ukrainian fact-check organization. “They are easier and cheaper to create than deepfake videos, and there are fewer contextual clues to detect with the naked eye. Also, they have a greater potential to spread, for example, in WhatsApp chats.”
She adds: “Analysis is more complex, and voice generation tools are more advanced than video generation tools. Even with voice samples and spectral analysis skills, it takes time, and there is no guarantee that the result will be accurate. In addition, there are many opportunities to fake audio without resorting to deepfake technology.”
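For reporters who want to attempt the kind of spectral analysis Yurkova mentions, the sketch below shows one very basic starting point: comparing the frequency content of a suspicious clip against a verified recording of the same speaker. This is a minimal illustration, not a StopFake workflow; the filenames, the 8 kHz band, and the scipy-based approach are assumptions made for the example, and any result is only a lead for deeper forensic analysis.

```python
# A minimal sketch of a spectral comparison between a suspicious clip and a
# verified recording of the same speaker. File names are placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def log_spectrogram(path):
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:              # mix stereo down to mono
        samples = samples.mean(axis=1)
    freqs, times, power = spectrogram(samples, fs=rate, nperseg=1024)
    return freqs, times, 10 * np.log10(power + 1e-10)   # decibel scale

freqs, _, suspect_db = log_spectrogram("suspect.wav")
_, _, authentic_db = log_spectrogram("authentic.wav")

# Crude comparison: average energy above 8 kHz. Synthetic speech sometimes
# shows unusually smooth or band-limited high-frequency energy, but this is
# only a lead for further analysis, never proof of manipulation.
print("Suspect high-band energy:  ", suspect_db[freqs > 8000].mean())
print("Authentic high-band energy:", authentic_db[freqs > 8000].mean())
```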
Data journalism trainer Samantha Sunne says newsrooms need constant vigilance in elections — both for the sudden threat of comparatively under-researched AI audio fakes, and because “deepfake technology is changing quickly and so are the detection and monitoring tools.”
Fact check organizations and some pro-democracy NGOs have mobilized to help citizens groups and newsrooms analyze suspicious viral election content. For instance, a human rights empowerment nonprofit called WITNESS conducted a pilot Deepfakes Rapid Response project in the past year, using a network of about 40 research and commercial experts to analyze dozens of suspicious clips. In an interview with GIJN, the manager of the Rapid Response project, Shirin Anlen, said AI audio fakes appear to be both the easiest to make and the hardest to detect — and that they seem tailor-made for election mischief.
“As a community, we found that we are not as prepared for audio as we were for video — that’s the gap we see right now,” says Anlen, who added that researchers were “surprised” by the high proportion of impactful AI audio fakes in 2023. Of the six high-impact cases involving elections or human rights that the rapid response team chose to investigate in depth, four were audio fakes.
“Audio does seem to be used more in elections and areas of crisis — it's easier to create and distribute, through various platforms or robocalls,” Anlen explains. “It’s also very personalized — you often really need to know the person, the way they talk, to detect manipulation. Then you have double-audio and background noise, music, or cross-talking — all these make detection more complex, unlike video, where you can see manipulation, maybe with a glitch in the face.”
But Anlen warns that “video detection is also lagging behind the generative techniques,” and that the release of the new text-to-video OpenAI tool Sora illustrates a trend toward almost seamless simulations. She adds that a lack of media literacy among older voters amplifies the threat of audio fakes and AI-driven robocalls even further — “because people not used to, say, X [Twitter] or TikTok may have less ability to filter out audio fakes.”
Where and How Speech Deepfakes Are Used
The Financial Times reported that voice-cloning tools have also targeted elections in countries such as India, the UK, Nigeria, Sudan, and Ethiopia. The FT investigation alleged that AI audio fakes were suddenly popular among propagandists due to the new, easy availability of inexpensive and powerful AI tools “from start-ups such as ElevenLabs, Resemble AI, Respeecher and Replica Studios.” Note that several text-to-speech AI tools are designed for pranks, commercial ads, or even fun gifts, but experts warn they can be repurposed for political propaganda or even incitement. The report showed that basic tools can be used from as little as US$1 per month, and advanced tools for US$330 per month — a tiny fraction of political campaign budgets.
To date, the most convincing audio fakes have cloned voices with the largest volume of recorded speech online, which, of course, often means well-known public figures, including politicians. One of the most eerily accurate examples targeted British actor and intellectual Stephen Fry: an AI program exploited Fry’s extensive online narration of seven Harry Potter novels to create a fake narration about Nazi resistance, complete with German and Dutch names and words, perfectly modulated to Fry’s accent and intonation, that the actor himself had never said. The AI program had uncannily predicted how Fry would pronounce those foreign words. (See Fry’s explainer clip from the 12:30 to 15:30 mark in the video below to gain a sense of the alarming sophistication of advanced speech deepfakes.)
However, Hany Farid, a computer science professor and media forensics expert at the University of California, Berkeley, told Scientific American magazine that a single minute’s recording of someone’s voice can now be enough to fabricate a new, convincing audio deepfake from generative AI tools that costs just US$5 a month. This poses a new impersonation threat to mid-level election-related officials — bureaucrats whose public utterances are normally limited to short announcements. Farid explained the two primary ways that audio fakes are made: either text-to-speech — where a scammer uploads real audio and then types what they’d like the voice to “say” — or speech-to-speech, where the scammer records a statement in their own voice, and then has the tool convert it. He described the effort involved in creating a convincing fake of even a non-public figure as “trivial.”
A new hybrid fake model is provided by the digital avatar industry, where some AI startups offer a selection of digitally fabricated actors that can be made to “say” longer messages that sync to their lips better than fake messages superimposed on real people in video clips. According to The New York Times, researchers at social media analysis company Graphika traced avatar-driven news broadcasts to services offered by “an AI company based above a clothing shop in London’s Oxford Circus,” which offers scores of digital characters and languages to choose from.
Tips for Responding to Fake Audio Threats
While expert analysis and new detection tools are required for advanced speech deepfakes that even friends of the speaker can’t distinguish, journalists can often recognize obvious manipulation in audio clips right away, based on their knowledge of a candidate, the low quality of a recording, context, or just plain common sense. But experts warn that gut-level suspicion is just a small part of the detection process. A speedy, evidence-based response, highlighting real audio in a story “truth sandwich,” and tracing the source of the scam are all equally important.
Here’s a step-by-step process for analyzing potential audio deepfakes.
- First, you need to flag suspicious clips, and do so early. Editors report that publishing an audience tip line for suspicious audio and robocalls (using a dedicated WhatsApp number, for example) provides an effective early warning system from voters themselves. Brazil’s classic Comprova Project, a collaborative election disinformation investigation involving 24 media organizations, also showed the wisdom of rival newsrooms publishing the same WhatsApp number and sharing the results, to take full advantage of the crowdsourcing power and collective knowledge of voters. Some services, such as Reality Defender, can send instant deepfake alerts to your email. Collaboration can also help identify coordinated campaigns using different audio fakes. Traditional social media monitoring and open communication channels with fact-checking organizations and political journalist chat groups are also helpful.
- Newsrooms need a second, early warning system for fake audio that is going viral, to know which clips to prioritize, prominently debunk, and back-trace. The sudden appearance of an audio clip across multiple social media platforms is an early indication, analytics tools like BuzzSumo can give you a sense of how quickly a clip is being shared (see the illustrative sketch after this list), and amplification by activist sources on partisan media can be either a reflection or a cause of harmful virality.
- Remember that suspicious clips with glitches and voice inconsistencies could still be real. As experts note in this Wired piece, an “unnatural-sounding voice could be a result of reading a script under extreme pressure” — a familiar phenomenon from “real” statements made by hostages. Meanwhile, suspicious glitches in low-quality video could “as likely be artifacts from compression as evidence of deepfakery.”
- Journalists and fact checkers do need evidence-based data to effectively counter widely shared AI audio fakes, even when they appear to be obvious — and this is where fact check sources, native language experts, deepfake rapid response teams, and detection tools are most important (more on those below). Media technologists, such as Shirin Anlen, also stress that journalists should start with traditional verification methods — such as reverse image search, source interviews, and the many tools in Craig Silverman’s Verification Handbook — to bolster their reaction.
- Since newsrooms can’t “prove a negative” — that a candidate never said the faked statement — reporters are forced to focus on the provenance of the clip itself, and on its creation and dissemination. However, experts say reporters can, and should, identify and highlight a verified clip of what a candidate has said on the same issue discussed in the fake — and that this truthful content should dominate the top of the story, and even the headline if possible.
- Most important: experts emphasize that media trust is the single most crucial element in countering AI audio deepfakes. News outlets need to be so rigorous and evidence-based in their prior campaign coverage and elections investigations that their deepfake investigations on the eve of an election will be believed.
- Find impact by seeking the comment and policies of regulators. The difficulty of tracing individual fakes — as well as robocalls with fake caller IDs — to scammers means that investigative stories could find greater impact in pressuring legislators and government regulators to limit or ban the dissemination of AI-generated audio spam. In February, the US Federal Communications Commission banned the use of AI tools in robocalls, in a direct response to the election disinformation threat they pose.
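As a purely illustrative follow-up to the tip above on spotting fake audio that is going viral, here is one way a newsroom could turn periodic share counts for a clip into a simple early-warning flag. The data, the one-hour window, and the doubling threshold are hypothetical choices made for this sketch, not an established newsroom metric.

```python
# Illustrative sketch: flag a clip as "going viral" when its share count
# at least doubles within an hour. Observations are hypothetical and could
# come from an analytics tool or periodic manual checks.
from datetime import datetime, timedelta

# (timestamp, cumulative shares) observations for one suspicious clip
observations = [
    (datetime(2024, 2, 1, 9, 0), 120),
    (datetime(2024, 2, 1, 10, 0), 310),
    (datetime(2024, 2, 1, 11, 0), 680),
]

def is_going_viral(points, window=timedelta(hours=1), doubling_factor=2.0):
    """Return True if shares at least doubled over any window-sized interval."""
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if t1 - t0 <= window and s0 > 0 and s1 / s0 >= doubling_factor:
            return True
    return False

if is_going_viral(observations):
    print("Prioritize this clip for verification and debunking.")
```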
The Added Threat Deepfakes Pose on Election Eve
The increase in deepfakes also poses a maddening threat to investigative stories themselves. Politicians or partisan officials revealed as making outrageous statements or abusing people’s rights in real video or audio clips obtained by journalists may well claim that this legitimate evidence is simply an advanced AI deepfake, a convenient denial that could be difficult to rebut. This has already happened with politicians in countries such as India and Ethiopia, and the new onus on journalists to prove that a properly sourced, verified recording is indeed real is a deep concern for experts such as Sam Gregory, executive director of WITNESS. This problem is known as “the liar’s dividend,” and its ultimate solution involves media trust: newsrooms must relentlessly ensure that all their other stories and sources on elections are also solid. (Watch Gregory discuss the threat of deepfakes in his TED Talk below.)
The Slovakia case is especially concerning for watchdog reporters, for two reasons. First, because the fake two-minute audio clip, which focused on election rigging, also fabricated the voice of an investigative journalist, Monika Tódová, supposedly in conversation with the opposition leader. In an investigative story on the incident by The Dial, Tódová revealed that she initially dismissed the viral clip as not believable. “[But] I had friends writing me that their college-educated coworkers had listened to it and believed it,” she recalled. “And they were sharing it on social media. I found myself in the midst of a totally new reality.”
And, second: the timing of the Slovakian audio deepfake bore the hallmarks of foreign state operatives. The Dial investigation found that the clip was released just prior to Slovakia’s legislated two-day “silence” period for all campaigning prior to election day. This tactic both maximized impact and gave journalists little recourse to rebut it, because the country’s media was legally limited in debunking the disinformation. (This case precisely vindicates a prediction that ProPublica’s Craig Silverman made to GIJN in 2022, that “elections are likely most vulnerable to deepfakes in the 48 hours prior to election days, as campaigns or journalists would have little time to vet or refute.”)
The fake Biden robocall that circulated right before the New Hampshire primary election is also noteworthy. NBC News ultimately tracked down the source of that deepfake audio, a magician who claimed he was paid by a consultant from a rival Democratic presidential campaign. According to the report, the man acknowledged that "creating the fake audio took less than 20 minutes and cost only US$1." He came forward about his role in the disinformation campaign after regretting his involvement. "It's so scary that it's this easy to do," he told NBC News. "People aren't ready for it."
Detection Tips and Tools for Advanced Deepfakes
Ukraine’s StopFake.org recently debunked and traced a deepfake video purporting to show a top general denouncing President Volodymyr Zelensky. Using the Deepware Scanner tool and consistency analysis, the team found that the scammer had used a machine learning technique called GAN (generative adversarial network) to superimpose fake imagery and audio onto a real video of the Ukrainian general taken the year before. Other analysts found that the deepfake was first posted by a Telegram channel that claims to share “humorous content.”
StopFake’s Yurkova says the organization has used detection tools in combination with standard reverse image search tools to investigate suspicious multimedia content, but warns that “unfortunately, it doesn't always work.”
“We have little experience with pure audio fakes,” she explains. “We often distinguish such fakes by ordinary listening, but this works mostly for low-quality ones.”
It’s important to note that deepfake detection is an emerging technology, and both open source and commercial tools are frequently inaccurate or case-limited — and journalists need to alert audiences to their limits. Indeed, WITNESS’s Anlen warns that “from our experience, we have yet to find [a tool] that didn't fail our tests and that provided transparent and accessible results.” Nonetheless, they can be helpful as leads or as supporting evidence.
Here are more technical tips for dealing with suspicious audio.
- Cross-check audio with native speakers. Yurkova cited a cheap audio fake that crudely impersonated the voice of US President Joe Biden in 2023, supposedly admitting to Vladimir Putin’s invincibility and widely amplified by Russian state media and some Telegram channels. StopFake simply shared the recording with several native speakers of American English, who immediately noted that certain words were clearly faked — especially the use of a “soft ‘i’” sound in the word “patriot,” which Americans pronounce almost as three full syllables.
- Try tool-specific detection portals. Yurkova says good audio deepfake detection tools include ElevenLabs’ AI speech classifier — but warns that this can only detect clips created with ElevenLabs tools. “You need to upload the audio file to the site to check the audio,” she adds. Of note: researchers used ElevenLabs’ detection systems on the audio deepfake of Biden and found a very high likelihood that it was created using that company's own AI tools. The fake audio's creator later confirmed this to NBC News, demonstrating that back-tracing a deepfake source can be quite accurate.
- Develop expert industry sources. Reporters can enlist the help of expert sources at forensic organizations such as Reality Defender, Loccus.AI, Respeecher, DeepMedia, and university digital forensic labs and information technology departments. Create a database of experts already quoted in AI audio content stories by other outlets, and see if they’ll help with forensic work on your suspicious clip.
- Try the PlayHT Classifier tool to flag general signs of AI manipulation in audio. “This is to check whether the audio track was made with the help of AI or whether it is an original recording; again, you need to upload the audio file,” said Yurkova, referring to a tool from text-to-speech startup PlayHT. She also suggests the AI or Not tool as a fully cost-free option for searching for fake imagery in clips. Samantha Sunne suggests that reporters check out alternate tools such as sensity.ai.
- Consider paid-for detectors that work with multiple languages. In addition to automated audio verification, AI Voice Detector offers features such as filters to remove background music and the ability to search without leaving a record. “The program does not store personal audio files and has a wide selection of languages,” Yurkova explains. However, she also notes that, after you create an account, it pushes a subscription for almost US$20 per month and does not offer a trial period. Yurkova also suggests the any-language synthetic sound detector DuckDuckGoose (with a claimed accuracy rate of 93%) and the paid-for real-time audio checker Resemble Detect, which requires registration.
- Monitor for jarring or unlikely word choices. Last October, Israel’s government released an audio recording purporting to show Hamas radio chatter after the al-Ahli Hospital explosion in Gaza, which officials claimed as proof of Hamas’ culpability. However, Arab journalists cast doubt on the authenticity of the dialect, syntax, and accents of the voices, while another Channel 4 report also dismissed it as likely fake dialogue stitched together from two separate recordings.
- Probe the metadata and domain history of origin sites. “Reporters can use online tools such as WHOIS to trace the fakes to the original social media account or poster,” Yurkova explains. (A minimal WHOIS lookup sketch follows this list.) Tools like RiskIQ and CrowdTangle can also help trace clip origins. “However, tracing them to the original scammers or funders can be more challenging and may require assistance from law enforcement or cybersecurity experts,” she warns.
- Look for things that seem “off” in videos with frame-by-frame analysis. (A frame-extraction sketch also follows this list.) “Visual inconsistencies can become visible with frame-by-frame viewing,” says Yurkova. “We pay attention to whether the facial expression corresponds to the expected emotions of a person during the words he utters. A mismatch between verbal and non-verbal signals may mean that words and facial expressions have different origins.”
- Analyze accompanying text and captions for tell-tale wording and typos. In addition to clear falsehoods, profanity, and incitement to violence, reporters should also check for political slogans or campaign narratives hidden in captions and accompanying text that could point to manipulation.
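As promised in the tip on probing domain history, here is a minimal sketch of what a first WHOIS check could look like in practice, using the third-party python-whois package. The domain name is a placeholder, and the fields returned vary by registry and may be empty, so treat any output as a lead rather than proof.

```python
# A minimal sketch of a WHOIS domain-history check, using the third-party
# python-whois package (pip install python-whois). The domain below is a
# placeholder for whichever site first hosted the suspicious clip.
import whois

record = whois.whois("example-clip-host.com")  # hypothetical domain

# Registration details can hint at hastily built disinformation
# infrastructure: very recent creation dates, privacy-masked registrants,
# or registrars popular with throwaway domains. Fields vary by registry
# and may be None.
print("Registrar:   ", record.registrar)
print("Created:     ", record.creation_date)
print("Last updated:", record.updated_date)
print("Name servers:", record.name_servers)
```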
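And as a companion to the frame-by-frame tip, this sketch pulls roughly one still per second out of a suspicious video for manual review, using the OpenCV library. The filename and the one-frame-per-second sampling rate are illustrative choices, not part of any standard verification workflow.

```python
# A minimal sketch for extracting stills from a suspicious video for manual
# frame-by-frame review, using OpenCV (pip install opencv-python).
# "suspect_clip.mp4" is a placeholder filename.
import os
import cv2

os.makedirs("frames", exist_ok=True)
capture = cv2.VideoCapture("suspect_clip.mp4")
fps = capture.get(cv2.CAP_PROP_FPS) or 25  # fall back if FPS is unreported

frame_index = 0
while True:
    ok, frame = capture.read()
    if not ok:
        break
    # Save roughly one frame per second for inspection of mouth movement,
    # lighting, and mismatches between words and facial expressions.
    if frame_index % int(fps) == 0:
        cv2.imwrite(f"frames/frame_{frame_index:06d}.png", frame)
    frame_index += 1

capture.release()
```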
In time-pressed cases, newsrooms can apply to human rights tech NGOs to help analyze suspicious election content. For instance, using this form, under-resourced newsrooms can apply for intensive analysis on “high-impact” clips from the experts at the Deepfakes Rapid Response project. (Bear in mind that this quick-response project has limited capacity.)
“We mostly collaborate with fact checkers or local journalists who have limited access to detection tools,” explains WITNESS’s Anlen, who added that researchers had already engaged with newsrooms on elections in Indonesia, Pakistan, and India. “Therefore, we are less likely to work with, for example, The New York Times or the Guardian for analyzing requests because they have great investigative resources. We have 15 teams — about 40 experts — with different expertise: video or image or audio-specific; local context. We try to pass on as much analysis information as possible, and journalists can do whatever they wish with that data.”
The mantra for dealing with deepfakes among researchers at WITNESS is “Prepare, don’t panic.”
In his seminal blog post on the challenge, Sam Gregory wrote: “We need a commitment by funders, journalism educators, and the social media platforms to deepen the media forensics capacity and expertise in using detection tools of journalists and others globally who are at the forefront of protecting truth and challenging lies.”
Rowan Philp is GIJN’s senior reporter. He was formerly chief reporter for South Africa’s Sunday Times. As a foreign correspondent, he has reported on news, politics, corruption, and conflict from more than two dozen countries around the world.
This article first appeared on Global Investigative Journalism Network and is republished here under a Creative Commons license.