It's easy for someone with malicious intent to record your voice: you leave traces of it simply by talking out in public, during mobile phone calls, in videos posted on social networking sites, or even when sending a recorded voice greeting card. Yet your voice is considered unique enough to serve as an authenticator of your identity.
But after studying the implications of commonplace voice leakage and developing voice impersonation attacks, researchers from the University of Alabama at Birmingham warned that an attacker who possesses only a very limited number of your voice samples, "just a few minutes' worth of audio of a victim's voice," can clone your voice and compromise your security, safety, and privacy.
Nitesh Saxena, Ph.D., the director of the Security and Privacy In Emerging computing and networking Systems (SPIES) lab and an associate professor of computer and information sciences at UAB, kindly supplied me with a copy of his report, "All Your Voices Are Belong to Us: Stealing Voices to Fool Humans and Machines." The research was presented at the European Symposium on Research in Computer Security.
"Because people rely on the use of their voices all the time, it becomes a comfortable practice," explained Saxena. "What they may not realize is that level of comfort lends itself to making the voice a vulnerable commodity. People often leave traces of their voices in many different scenarios. They may talk out loud while socializing in restaurants, giving public presentations or making phone calls, or leave voice samples online."
Using an off-the-shelf voice morphing tool, researchers Dibya Mukhopadhyay, Maliheh Shirvanian, and Saxena discovered that an "attacker can build a very close model of a victim's voice after learning only a very limited number of samples in the victim's voice (e.g., mined through the Internet, or recorded via physical proximity). Specifically, the attacker uses voice morphing techniques to transform its voice – speaking any arbitrary message – into the victim's voice."
"As a result, just a few minutes' worth of audio in a victim's voice would lead to the cloning of the victim's voice itself," Saxena said. "The consequences of such a clone can be grave. Because voice is a characteristic unique to each person, it forms the basis of the authentication of the person, giving the attacker the keys to that person's privacy."
The attack has three phases. First, the attacker obtains a brief sample of the victim's speech, a mere "50 to 100" five-second sentences. Second, the attacker repeats those same short sentences and feeds both recordings into a morphing engine. "At this point, the attacker has at its disposal essentially the voice of the victim," the report reads. In the third phase, the attacker uses that voice imitation capability: he can say anything at all and make it sound like the victim's voice. The imitated voice can "compromise any application or context that utilizes the victim's voice."
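To see why so little audio suffices, the three phases can be sketched in miniature. This is a toy illustration, not the Festvox engine the researchers used: real voice conversion learns a mapping over full spectral envelopes, while here a single made-up scalar "pitch" feature stands in, and all the numbers are hypothetical.

```python
# Toy sketch of the morphing idea. Phases 1-2: from parallel recordings
# (attacker and victim speaking the same short training sentences), fit a
# simple least-squares affine map from attacker features to victim features.
# Phase 3: apply that map to anything new the attacker says.

def fit_affine(attacker, victim):
    """Least-squares fit of victim ~ a * attacker + b over parallel samples."""
    n = len(attacker)
    mean_a = sum(attacker) / n
    mean_v = sum(victim) / n
    cov = sum((x - mean_a) * (y - mean_v) for x, y in zip(attacker, victim))
    var = sum((x - mean_a) ** 2 for x in attacker)
    a = cov / var
    b = mean_v - a * mean_a
    return a, b

def convert(features, a, b):
    """Phase 3: transform arbitrary attacker speech toward the victim's voice."""
    return [a * x + b for x in features]

# Hypothetical per-sentence pitch values (Hz) for the parallel training set.
attacker_train = [120.0, 125.0, 118.0, 130.0, 122.0]
victim_train   = [210.0, 218.0, 206.0, 226.0, 213.0]

a, b = fit_affine(attacker_train, victim_train)
# Any new sentence the attacker speaks now lands in the victim's range.
morphed = convert([124.0, 128.0], a, b)
```

The key point the sketch captures is that the training sentences and the attack sentences need not match: once the mapping is fit, arbitrary new speech can be converted.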
So what could an attacker do? Think authentication, not only fooling humans but also tricking "state-of-the-art automated speaker verification algorithms." Some people use voice prints instead of a PIN lock on their smartphones, others use Windows Hello, and voice biometrics are even used "in many government organizations for building access control."
Saxena said, "Many banks and credit card companies are striving for giving their users a hassle-free experience in using their services in terms of accessing their accounts using voice biometrics." Such systems might claim biometric voice prints are secure, but the researchers found "automated systems are largely ineffective to our attacks. The average rates for rejecting fake voices were under 10-20% for most victims."
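Why were the automated systems so easy to fool? At its core, a speaker-verification decision reduces an utterance to a feature vector, scores it against the enrolled voiceprint, and accepts when the score clears a threshold; a morphed sample that lands close to the enrolled print sails through. The sketch below illustrates that decision rule with cosine similarity. It is not the Bob Spear system from the study, and the vectors and threshold are invented for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def verify(enrolled, sample, threshold=0.95):
    # Accept the sample as the enrolled speaker if the score clears the bar.
    return cosine(enrolled, sample) >= threshold

enrolled_victim = [0.90, 0.20, 0.40]   # victim's stored voiceprint (made up)
raw_attacker    = [0.10, 0.80, 0.50]   # attacker's own voice: far from it
morphed_attack  = [0.88, 0.22, 0.41]   # after conversion: close enough

print(verify(enrolled_victim, raw_attacker))    # prints False
print(verify(enrolled_victim, morphed_attack))  # prints True
```

The researchers' finding that fake voices were rejected "under 10-20%" of the time amounts to saying the morphed samples scored above such a threshold for most victims.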
Although humans fared better against such attacks, they still failed to recognize morphed or imitated voices as fake about half the time. "Based on two online studies with about 100 users," the researchers found "that only about an average 50% of the times people rejected the morphed voice samples of two celebrities as well as briefly familiar users."
UAB pointed out:
If an attacker can imitate a victim's voice, the security of remote conversations could be compromised. The attacker could make the morphing system speak literally anything that the attacker wants to, in the victim's tone and style of speaking, and can launch an attack that can harm a victim's reputation, his or her security, and the safety of people around the victim.
"For instance, the attacker could post the morphed voice samples on the Internet, leave fake voice messages to the victim's contacts, potentially create fake audio evidence in the court and even impersonate the victim in real-time phone conversations with someone the victim knows," Saxena said. "The possibilities are endless."
The research team used the Festvox Voice Conversion System to morph the voices, testing machine-based attacks against the Bob Spear Speaker Verification System using the MOBIO and VoxForge datasets. The attacks included a "different speaker attack," essentially fooling a machine into believing the attacker's voice belongs to the victim, and a "conversion attack," which could replace the victim's voice with the attacker's; this could potentially lock a victim out of a "speaker-verification system that gives a random challenge each time a victim user tries to login or authenticate to the system."
The human-based attacks, "familiar speaker" and "famous speaker verification," were conducted with the help of Amazon Mechanical Turk online workers. After collecting audio samples of Oprah Winfrey and Morgan Freeman from the Internet, the researchers had the M-Turk workers repeat and record the clips; the team then converted the audio.
The researchers noted:
Our work highlights a real threat of practical significance, because obtaining audio samples can be very easy both in the physical and digital worlds, and the implications of our attacks are very serious. While it may seem very difficult to prevent "voice hacking," our work may help raise people's awareness of these attacks and motivate them to be careful while sharing and posting their audio-visuals online.
Future research will explore additional attacks, but for now, the researchers concluded:
We showed that voice conversion poses a serious threat and our attacks can be successful for a majority of cases. Worryingly, the attacks against human-based speaker verification may become more effective in the future because voice conversion/synthesis quality will continue to improve, while it can be safely said that human ability will likely not.