Photoshop trolls can manipulate photos, but in the future we may have a new type of troll: one who can easily manipulate spoken words just by typing text into an audio editing program.
Last week at the Adobe Max Creativity Conference, Adobe developer Zeyu Jin mentioned that people “have been making weird stuff online” with photo editing software, before adding that now, “let’s do something to human speech.”
Jin gave a sneak peek of Project VoCo, for voice conversion, software that works like Photoshop for audio. He demonstrated it by altering a voice clip of comedian Keegan-Michael Key, in which Key describes what he did after being nominated for an award.
Key had said, “I jumped on the bed, and I, uh, kissed my dogs and my wife – in that order.” Jin homed in on the “kissed my dogs and my wife” portion.
Jin showed that changing what Key said is as simple as typing in new text. He changed the order of Key’s kissing statement to, “kissed my wife and my wife” and then to “kissed my wife and my dogs.”
To show that the voice conversion also works with words Key had not spoken, Jin inserted a reference to Max Creativity Conference host Jordan Peele, who is also Key’s comedic partner. Jin edited the audio so Key said “kissed Jordan and my dogs” and “kissed Jordan three times.” You can hear that around 3:55 in the video below.
Adobe issued a statement explaining the purpose of VoCo:
When recording voiceovers, dialog, and narration, people would often like to change or insert a word or a few words due to either a mistake they made or simply because they would like to change part of the narrative. We have developed a technology called Project VoCo in which you can simply type in the word or words that you would like to change or insert into the voiceover. The algorithm does the rest and makes it sound like the original speaker said those words.
TechCrunch explained how VoCo works:
Project VoCo needs about 20 minutes of voice samples from a given speaker. It then analyzes the speech, breaks it down into phonemes, transcribes it and creates the voice model. If you listen closely, you can hear when a word has changed, but it’s probably only a matter of time before you won’t be able to distinguish the actual recording and the edited (or completely fake) one.
It’s also likely the amount of voice data needed will decrease. At this stage, Adobe considers VoCo to be “experimental technology.”
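For readers curious what "breaks it down into phonemes" and "creates the voice model" could mean in practice, here is a minimal toy sketch in Python. It is not Adobe's algorithm. It assumes a recording has already been transcribed and aligned into phoneme segments, and it simply stitches stored segments together to form a word the speaker never said. The names `LEXICON`, `build_voice_model` and `synthesize`, the phoneme spellings and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical pronunciation dictionary (toy phoneme spellings, not a real lexicon).
LEXICON = {
    "wife": ["W", "AY", "F"],
    "dogs": ["D", "AO", "G", "Z"],
    "fog": ["F", "AO", "G"],  # a word the speaker never said in the recording
}


def build_voice_model(samples, alignment):
    """Index slices of a recording by phoneme.

    samples: sequence of audio sample values.
    alignment: list of (phoneme, start, end) tuples, assumed to come from
    a separate transcription and alignment step that is not shown here.
    """
    model = defaultdict(list)
    for phoneme, start, end in alignment:
        model[phoneme].append(list(samples[start:end]))
    return model


def synthesize(word, model, lexicon=LEXICON):
    """Concatenate one stored segment per phoneme of the typed word."""
    out = []
    for phoneme in lexicon[word]:
        units = model.get(phoneme)
        if not units:
            raise KeyError(f"no recorded example of phoneme {phoneme!r}")
        out.extend(units[0])  # naive choice: always the first recorded example
    return out


# Stand-in "recording" of the words "wife dogs": integers instead of real audio.
samples = list(range(12))
alignment = [
    ("W", 0, 2), ("AY", 2, 4), ("F", 4, 6),
    ("D", 6, 8), ("AO", 8, 9), ("G", 9, 11), ("Z", 11, 12),
]
model = build_voice_model(samples, alignment)
new_word = synthesize("fog", model)  # built from pieces of "wife" and "dogs"
```

The sketch also hints at why a system like this needs a sizable amount of speech: every phoneme in the typed word has to appear somewhere in the recordings. A real system would additionally have to choose segments that match their neighbors and smooth the joins, which this toy version skips entirely.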
Toward the end of the presentation, Jin noted that Adobe has already considered how VoCo could be abused. The company has “researched how to prevent forgery. Think about watermarking detection. As we’re getting the results much better, making it so people can’t distinguish between the fake and the real one, we’re working harder trying to make it detectable.”
Adobe hasn’t said when VoCo will be made available as part of Creative Cloud, although the Adobe blog pointed out that many sneak peeks “from previous years have later been incorporated into our products.”
If interested, you can find out more by reading the Princeton University and Adobe Research paper (pdf) released in March at the IEEE International Conference on Acoustics, Speech and Signal Processing.