AI Can Clone Your Voice in 3 Seconds. The First Synthetic Voice Took a Year of Operator Training, in 1939.
At the 1939 World's Fair, Homer Dudley demonstrated the Voder, a machine that synthesized human speech in real time, played by a trained operator at a keyboard. Today's AI needs three seconds of audio.
Key Takeaways
- The Voder at the 1939 World's Fair was the first machine to synthesize human speech from scratch
- Only about 30 people ever learned to operate it — each requiring a full year of training
- The Vocoder became the core of SIGSALY, a WWII encrypted phone system used by Churchill and Roosevelt
- ElevenLabs and VALL-E can now clone any voice from just 3 seconds of audio
- Deepfake voice scams caused over $25 million in documented losses in 2025
Root Connection
Homer Dudley's Voder at the 1939 World's Fair → ElevenLabs and OpenAI voice cloning in 2024
ROOT
In the spring of 1939, visitors to the New York World's Fair lined up to witness something that had never existed before: a machine that could talk. Not a recording. Not a phonograph. A device that synthesized human speech from scratch, in real time, controlled by a human operator sitting at a console that looked like a cross between a pipe organ and a telephone switchboard.
The machine was called the Voder — short for Voice Operation DEmonstratoR — and it was the brainchild of Homer Dudley, a physicist and engineer at Bell Telephone Laboratories. Dudley had spent years studying how the human vocal tract produces speech, breaking the process down into its fundamental components: a sound source (the vocal cords), a noise source (the breath), and a series of resonant filters (the throat, mouth, and nasal cavities). The Voder recreated this chain electronically.
The operator sat at a keyboard with 10 keys, each controlling a different frequency band filter. A wrist bar toggled between voiced sounds (like vowels, produced by a buzz oscillator simulating vocal cords) and unvoiced sounds (like "s" and "f," produced by a hiss generator simulating breath). A foot pedal controlled pitch inflection — the rise and fall that makes speech sound natural rather than robotic. To produce a single word, the operator had to coordinate all three systems simultaneously, in real time, with millisecond precision.
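That whole chain fits in a few lines of modern code. Here is a minimal sketch of the source-filter idea in Python with numpy and scipy; the band edges, gains, pitch, and function names are illustrative assumptions, not the Voder's actual circuit values.

```python
# A minimal sketch of the Voder's source-filter idea (not Bell Labs' circuitry):
# a buzz source, a hiss source, and band-pass "keys" with operator-set gains.
import numpy as np
from scipy.signal import butter, lfilter

SR = 16000  # sample rate in Hz (illustrative)

def buzz(duration, pitch_hz):
    """Voiced source: a few harmonics summed into a glottal-like buzz."""
    t = np.arange(int(duration * SR)) / SR
    return sum(np.sin(2 * np.pi * pitch_hz * k * t) / k for k in range(1, 20))

def hiss(duration):
    """Unvoiced source: white noise standing in for breath."""
    return np.random.randn(int(duration * SR))

def key(signal, lo_hz, hi_hz, gain):
    """One 'key': a band-pass filter whose gain the operator rides."""
    b, a = butter(2, [lo_hz / (SR / 2), hi_hz / (SR / 2)], btype="band")
    return gain * lfilter(b, a, signal)

# A crude /a/-like vowel: buzz shaped by bands near the first two formants
src = buzz(0.5, pitch_hz=120)
vowel = key(src, 600, 900, 1.0) + key(src, 1000, 1400, 0.5)

# A crude /s/-like fricative: hiss shaped by a single high-frequency band
fricative = key(hiss(0.3), 4000, 7000, 1.0)
```

The operator's skill amounted to riding those gains, the buzz/hiss switch, and the pitch continuously, by hand and foot, fast enough to keep up with speech.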
It took approximately one year of full-time training to learn to play the Voder competently. Bell Labs trained a small group of women — telephone operators selected for their dexterity and musical ear — and only about 30 people ever mastered the instrument. The demonstrations at the World's Fair were impressive but imperfect. The Voder could say phrases like "she saw me" and count to ten, but the speech was slurred, mechanical, and often required the audience to be told what was being said before they could understand it.
What most fairgoers didn't know was that Dudley had already built the Voder's predecessor: the Vocoder (Voice Coder), completed in 1938. Where the Voder synthesized speech from manual input, the Vocoder analyzed incoming speech into frequency bands and then resynthesized it — compressing and reconstructing the human voice electronically. This analysis-synthesis approach caught the attention of the U.S. military, which saw an immediate application: encrypted voice communications.
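The analysis-synthesis loop is just as compact to sketch. Below is a toy channel vocoder in the spirit of Dudley's design; the ten-band split, envelope smoothing, and fixed-pitch carrier are assumptions chosen for illustration, not the 1938 hardware.

```python
# A toy channel vocoder: analyze speech into band envelopes, then re-impose
# those envelopes on a synthetic carrier. All parameters are illustrative.
import numpy as np
from scipy.signal import butter, lfilter

SR = 16000  # sample rate in Hz

def bandpass(x, lo_hz, hi_hz):
    b, a = butter(2, [lo_hz / (SR / 2), hi_hz / (SR / 2)], btype="band")
    return lfilter(b, a, x)

def envelope(x, cutoff_hz=30):
    """Analysis: the slow-varying energy of one band (rectify, then low-pass)."""
    b, a = butter(2, cutoff_hz / (SR / 2), btype="low")
    return lfilter(b, a, np.abs(x))

def vocode(speech, n_bands=10, pitch_hz=110):
    """Split speech into bands, keep only each band's envelope,
    then reconstruct by shaping a buzz carrier with those envelopes."""
    t = np.arange(len(speech)) / SR
    carrier = sum(np.sin(2 * np.pi * pitch_hz * k * t) / k for k in range(1, 20))
    edges = np.logspace(np.log10(200), np.log10(6000), n_bands + 1)
    out = np.zeros(len(speech))
    for lo, hi in zip(edges[:-1], edges[1:]):
        out += envelope(bandpass(speech, lo, hi)) * bandpass(carrier, lo, hi)
    return out
```

The key property for the military was that the band envelopes are a compact description of the speech, small enough to encrypt and transmit, from which an intelligible voice can be rebuilt at the other end.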
During World War II, the Vocoder became the core of Project X, also known as SIGSALY — a top-secret encrypted telephone system that allowed Winston Churchill and Franklin Roosevelt to hold secure transatlantic conversations. SIGSALY weighed 55 tons, filled an entire room, and used vinyl records of random noise as one-time encryption keys. It was never broken. The system that started as a parlor trick at a World's Fair became one of the most important cryptographic tools of the war — and the conceptual ancestor of every digital voice system that followed.
TODAY
The distance between Homer Dudley's keyboard-operated speech machine and today's voice AI is measured not just in decades but in conceptual leaps. In 2023, Microsoft Research published a paper on VALL-E, a neural codec language model that could clone any human voice from just 3 seconds of audio. The same year, ElevenLabs launched a commercial voice cloning platform that lets anyone upload a short audio sample and generate unlimited synthetic speech in that voice — with emotion, pacing, and intonation that is, in many cases, indistinguishable from the original speaker.
OpenAI's Voice Engine, demonstrated in early 2024, achieved similar results and sparked immediate debate about safety. The company deliberately limited its release, acknowledging the potential for misuse. That caution was well-founded: by 2025, the FBI and FTC reported that deepfake voice scams had caused losses exceeding $25 million in documented cases, with the actual figure likely far higher. The most common attack vector was CEO impersonation fraud — a criminal clones a company executive's voice from a public earnings call or podcast appearance, then calls the CFO requesting an urgent wire transfer.
The voice acting industry has been upended. SAG-AFTRA went on strike in 2023 partly over AI voice rights. Audiobook narrators discovered their voices being cloned without consent by platforms that had licensed their recordings for "text-to-speech research." Podcasters found synthetic versions of their shows appearing on competing platforms. The fundamental question Dudley never had to consider — who owns a voice? — became one of the defining legal battles of the decade.
Underneath the headlines, the technical lineage is direct. Dudley's frequency-band approach to speech — breaking the voice into component frequencies and manipulating them independently — is the conceptual foundation of every speech synthesis system that followed. Parametric synthesis (1960s-1980s) formalized Dudley's approach mathematically. Concatenative synthesis (1990s-2000s) stitched together recorded speech fragments. Neural TTS (2016-present), beginning with DeepMind's WaveNet, used deep learning to generate speech waveforms directly — but still operates on the same principle of modeling speech as a combination of frequency components shaped over time.
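That shared principle is easy to see in code: every system in this lineage operates on some version of a time-frequency picture of speech. A minimal sketch using scipy's short-time Fourier transform (the window size and the stand-in signal are arbitrary choices here):

```python
# Speech as frequency components shaped over time: Dudley's bands, formalized.
import numpy as np
from scipy.signal import stft

SR = 16000
speech = np.random.randn(SR)  # stand-in for one second of recorded speech

# Each column of |Z| is a snapshot of how energy is distributed across
# frequency bands at one moment in time.
freqs, times, Z = stft(speech, fs=SR, nperseg=512)
print(Z.shape)  # (257 frequency bins, ~64 time frames)
```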
Today, AI-generated voices narrate YouTube videos, staff customer service lines, read audiobooks, and provide accessibility tools for people who have lost the ability to speak. Apple's Personal Voice feature lets ALS patients bank their voice before they lose it, then speak in that voice through their iPhone. A craft that once demanded a year of training from each of its roughly 30 operators now runs on a phone in your pocket.