If you had a Nokia in the mid-2000s, you might remember her.
IVONA. Polish text-to-speech. That voice that read your SMS messages with an accent that made everything sound slightly ominous. Later she showed up in those YouTube videos—the ones where someone would paste the most unhinged Reddit comments into a TTS engine and let it rip. "Subscribe to my channel" never sounded the same after Nina said it.
Amazon bought IVONA in 2013 for $26 million. Turned it into Polly. Corporate-sanitized. The original voices—the weird ones, the ones with character—got shelved somewhere in AWS's basement.
I still had an old APK. Android app from 2012. Inside: Nina's voice. 186 megabytes of... something.
No docs. No source. Just a binary blob that started with "DDXV" and went nowhere on Google.
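(For the curious: checking that kind of magic-and-header situation takes a dozen lines. A minimal sketch; the filename is invented, and beyond the four magic bytes nothing here is confirmed format knowledge.)

```python
# Peek at the header of the voice database.
# "DDXV" is the real magic; the filename is hypothetical.
with open("voice_nina.ddxv", "rb") as f:
    header = f.read(1024)

assert header[:4] == b"DDXV", f"unexpected magic: {header[:4]!r}"

# Classic hex dump of the first 128 bytes, 16 per row.
for off in range(0, 128, 16):
    row = header[off:off + 16]
    hexpart = " ".join(f"{b:02x}" for b in row)
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
    print(f"{off:08x}  {hexpart:<47}  {text}")
```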
I've been messing with Claude for coding stuff. Autocomplete on steroids, whatever. But I wanted to see what happens when you point it at something genuinely unknown and just... let it drive.
So I dumped the first 1000 bytes of the voice file into the chat and said: "figure this out."
And then I mostly watched.
Claude wrote a Python script. Ran it. Found a section table at offset 0x68. Eleven sections. 30 megs of index data. Wrote another script to parse each section. Found 487,000 audio units. Wrote another script to analyze their structure. Found they averaged 45 bytes each—way too small for actual audio. Hypothesized parametric synthesis. Wrote a script to test that. Found monotonically increasing sequences. Revised the hypothesis.
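To give a flavor of those scripts, here's a condensed sketch of the section-table pass. The 0x68 offset and the count of eleven come from the session; the entry layout—little-endian u32 offset/size pairs—is my guess at how such tables usually look, not documented fact:

```python
import struct

SECTION_TABLE_OFFSET = 0x68  # found during the session
NUM_SECTIONS = 11            # ditto

def read_section_table(path):
    """Parse the section table. The entry layout (offset, size as
    little-endian uint32 pairs) is an assumption, not a confirmed spec."""
    with open(path, "rb") as f:
        f.seek(SECTION_TABLE_OFFSET)
        raw = f.read(NUM_SECTIONS * 8)
    sections = []
    for i in range(NUM_SECTIONS):
        offset, size = struct.unpack_from("<II", raw, i * 8)
        sections.append((i, offset, size))
    return sections

for idx, off, size in read_section_table("voice_nina.ddxv"):
    print(f"section {idx}: offset=0x{off:08x} size={size:,} bytes")
```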
I'm sitting there drinking coffee. Occasionally saying "yeah, try that" or "what's in section 5?"
Twenty minutes in, it had mapped more of this format than I would've found in a week of hex editing.
The moment that got me was the Festival thing.
We hit these feature names in the database:
```
=R:SylStructure.parent.stress
=p.ph_ctype
```
Gibberish. I was about to move on.
Claude goes: "These are Festival TTS prosodic features."
I'd never heard of Festival TTS. It's this old speech synthesis framework from Edinburgh. Claude immediately connected the dots—the phoneme naming convention (ARPABET), the diphone structure (1,604 pairs), the prosody encoding. All matching Festival's architecture.
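(How do strings like that surface from a binary at all? The classic printable-ASCII scan. A sketch; the regex and the four-character minimum are arbitrary choices, and the filename is still invented:)

```python
import re

# Pull printable ASCII runs out of the blob -- the `strings` trick.
# In our data, anything starting with '=' looked like a feature name.
with open("voice_nina.ddxv", "rb") as f:
    data = f.read()

for match in re.finditer(rb"[ -~]{4,}", data):
    s = match.group().decode("ascii")
    if s.startswith("="):  # e.g. "=R:SylStructure.parent.stress"
        print(f"0x{match.start():08x}  {s}")
```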
It knew because somewhere in its training data is probably every speech synthesis paper ever written. And it pattern-matched across all of them in like two seconds.
That's not autocomplete. That's something else.
We hit the wall around hour three.
We'd found the units. We knew they were parametric—not stored waveforms. 48:1 compression ratio. The structure looked like: frame count, cumulative spectral values, some tail parameters for pitch and duration.
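Here's our working decode of a unit, as a sketch. Every field width and position in it is a hypothesis from hex-staring, not a confirmed spec:

```python
import struct

def parse_unit(blob):
    """Decode one ~45-byte unit under the working hypothesis:
    a frame count, then cumulative 16-bit values (spectral?), then a
    short tail of pitch/duration parameters. All of this is guesswork."""
    frame_count = blob[0]
    # Assume the cumulative values are little-endian uint16, one per frame.
    vals = struct.unpack_from(f"<{frame_count}H", blob, 1)
    tail = blob[1 + 2 * frame_count:]  # leftover bytes: pitch/duration (?)
    return frame_count, list(vals), tail

# Synthetic example unit (not real data): 3 frames, values 145/290/435, 2 tail bytes.
demo = bytes([3]) + struct.pack("<3H", 145, 290, 435) + b"\x10\x20"
print(parse_unit(demo))  # (3, [145, 290, 435], b'\x10 ')
```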
But we couldn't crack the actual synthesis algorithm.
The cumulative values increase by ~145 per step. What does that represent? LSP frequencies? Pitch marks? Something proprietary?
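The test we kept rerunning was dead simple: take a unit's cumulative values and look at the step sizes.

```python
def step_sizes(vals):
    """Differences between consecutive cumulative values."""
    return [b - a for a, b in zip(vals, vals[1:])]

# With the synthetic values from the sketch above:
print(step_sizes([145, 290, 435]))  # [145, 145] -- the ~145-per-step pattern
```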
Claude kept generating hypotheses. I kept running tests. Nothing clicked.
And that's when I realized the limit: you can't reverse engineer a synthesis algorithm by staring at the data it consumes. You need to run the original code. Watch the data transform. Set breakpoints in the actual DSP loop.
Static analysis only gets you so far. Claude was hitting the same wall any human would—just faster.
So what did three hours of AI-assisted binary archaeology actually produce?
A complete map of the DDXV format. Eleven sections documented. 487,000 units indexed. Full phoneme inventory (44 ARPABET symbols plus some Polish-specific variants). Festival compatibility confirmed. Parametric encoding confirmed.
Three things still unknown: the synthesis algorithm, what Section 5 actually does (16.7MB of mystery data), and how the language files are encrypted.
But here's the thing: before this session, I had an opaque binary blob. After, I have a solvable engineering problem. The next step is obvious—run the ARM32 binary on a Raspberry Pi, attach a debugger, trace the synthesis function.
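(What tracing it might look like: gdb embeds Python, so the whole trace can be a short script. The symbol and file names below are invented; the real ones have to come out of the binary:)

```python
# Run inside gdb on the Pi: gdb -x trace_synth.py ./ivona_engine
# Both the script name and the binary name are placeholders.
import gdb  # only exists inside gdb's embedded Python

gdb.Breakpoint("synthesize_unit")  # hypothetical symbol name
gdb.execute("run")
try:
    while True:
        r0 = gdb.parse_and_eval("$r0")  # first argument register on ARM32
        print(f"synthesize_unit hit, r0=0x{int(r0):08x}")
        gdb.execute("continue")
except gdb.error:
    pass  # inferior exited
```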
I couldn't see that path before. Now I can.
The weird part was how little I did.
Claude wrote maybe 15 analysis scripts. Each one would've taken me 30-60 minutes. It generated documentation as we went. It caught patterns I would've missed. It pulled in knowledge from domains I've never touched.
My job was: pick which hypothesis to test next. Recognize when we were going in circles. Know when to stop.
That's it. Direction, not implementation.
Somewhere in Amazon's servers, Nina is still frozen in her DDXV coffin. But I know what's in there now. I know exactly how she's encoded. And when I have a free weekend and a Raspberry Pi, I'm going to finish what Claude started.
That voice deserves better than a corporate archive.
Tools: Claude, Python, rizin, patience.