Hacker News
An IBM computer learned to sing in 1961 (tedgioia.substack.com)
81 points by isomorph on May 12, 2023 | 39 comments



One of the things I get a kick out of in the clip of HAL singing Daisy is just how much the physical modules being removed look like hard drives being pulled out of a NAS. I could easily imagine a much larger version of my Synology NAS looking just like that.

BUT: When I first watched 2001 sometime in the early 1990s, I had only seen a 5.25" hard drive, and not on a slider / rails. I thought the inside of HAL was just tacky scifi from the 1960s.

It's only later, as I've seen the predictions come true, that I've realized just how forward-looking 2001 is. Like the scene with watching news reports on the tablets at breakfast. It wasn't until I watched a video on my phone in the late 2010s that I realized that prediction in the movie was 100% spot-on.

BTW, the 4k Ultra-HD bluray of 2001 is awesome.


They even nailed the headrest screens on the space "plane"! The lunar landing scenes are also mind blowing when you realize the film was released before the actual moon landing!


And more generally, a good predictor of how the free market would eventually take over the industry. All the branding, the food, etc. really does come across as well considered.


And the quiet, zero-point style propulsion on the carrier ship. No rocket boosters! That always sticks with me.


We built tablets based on the influence from 2001. You have it backwards.


Are there any detailed descriptions available of how the music and voice were synthesized?

Based on the recording, the information I could find, and imagining how I'd try to do the same thing using the technology of the era, I assume the melody is based on single-cycle samples of a piano and Max Matthews playing the violin. The vocals sound like formant synthesis like the Votrax SC-01 or TI LPC series, although of course those chips didn't exist until 15+ years after the work at IBM. But I'm very curious about the details. Did the team develop a general-purpose sequencer for the melody and/or speech, or were all of the notes, slides, etc. hardcoded? Did the computer actually output all 3+ parts together, or were they separate elements mixed after the fact? I assume the output was not realtime, but it would be a neat surprise if they achieved that in the 60s. Was it all handled digitally in the computer, or was the computer controlling some add-on hardware, maybe with analogue filters? Etc.


The singing synthesizer used a surprisingly sophisticated physical model of the human voice [1].

The music was most likely created using some variant of MUSIC-N [2], the first computer music language. The syntax and design of Csound [3] were based on MUSIC-N, and I believe the older Csound opcodes are either ported from or based on those found there.

Apparently the sources for MUSIC-V (the last major iteration of the MUSIC language) can be found on github [4], though I haven't tried to run it yet.

1: https://ccrma.stanford.edu/~jos/pasp/Singing_Kelly_Lochbaum_...

2: https://en.wikipedia.org/wiki/MUSIC-N

3: https://en.wikipedia.org/wiki/Csound

4: https://github.com/vlazzarini/MUSICV


I guess they built upon the Voder [1] (Homer Dudley, Bell Labs, 1939) as well. But that was played manually. An amazing ‘instrument’!

1: https://youtu.be/5hyI_dM5cGo


Sort of. Both use articulatory synthesis, which attempts to model speech by breaking it up into components and using some coordinated multi-dimensional continuous control to perform phonemes (the articulation aspect). The voder uses analog electronics, while Daisy does it digitally (and without a human performer).

The underlying signal processing used for both is different, but both use a source-filter mechanism.


The synthetic voder output sounds more or less exactly like the output of a vocoder where the input is a human voice and the carrier is a sawtooth. Not surprising, given that the voder was made by the same people.

But I'm still unsure why those two things sound so similar to each other, and formant/LPC chips sound so similar to each other, but the two groups of things sound so dissimilar (at least, IMO).

I have a background in electronic music, so I'm pretty familiar with additive, subtractive, and other types of synthesis.

I'm especially surprised about the physical modelling sounding more like a formant chip, because a guitar "talk box" gives a sound exactly like a vocoder, and that should be almost the same thing, just with a real human mouth instead of a model.


The vo(co)der uses banks of fixed filters to apply the broad shape of a spectrum to an input signal. It's basically an automated graphic EQ. The level of each fixed band in the modulator is copied to the equivalent band in the carrier.

The bandpass filters have a steeper cutoff than usual and are flatter at the top of the passband than usual. And the centre frequencies aren't linearly spaced. But otherwise - it's just a fancy graphic EQ.
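
If it helps to see that "automated graphic EQ" idea concretely, here's a rough numpy sketch of a channel vocoder. The band count, band edges, and frame size are just placeholder values I picked, not anything standardized:

    import numpy as np
    from scipy.signal import butter, sosfilt

    def vocode(modulator, carrier, sr, n_bands=16, frame=512):
        n = min(len(modulator), len(carrier))
        modulator, carrier = modulator[:n], carrier[:n]
        # Non-linearly (log) spaced band edges, as described above.
        edges = np.geomspace(80, 0.45 * sr, n_bands + 1)
        out = np.zeros(n)
        for lo, hi in zip(edges[:-1], edges[1:]):
            sos = butter(4, [lo, hi], btype="band", fs=sr, output="sos")
            m_band = sosfilt(sos, modulator)
            c_band = sosfilt(sos, carrier)
            # Envelope follower: per-frame RMS of the modulator band,
            # held constant over each frame.
            env = np.repeat(
                [np.sqrt(np.mean(m_band[i:i + frame] ** 2))
                 for i in range(0, n, frame)], frame)[:n]
            out += c_band * env   # copy the band level onto the carrier
        return out / (np.max(np.abs(out)) + 1e-9)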

The formant approach uses dynamic filters. It's more like an automated parametric EQ. Each formant is modelled with a variable BPF with its own time-varying level, frequency, and possibly Q. You apply that to a simple buzzy waveform and get speech-like sounds out. If you vary the pitch of the buzz you can make the output "sing."
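
And a correspondingly rough sketch of the formant approach: a pulse-train "buzz" pushed through a couple of two-pole resonators whose centre frequencies glide over time. The formant trajectories and bandwidths are made-up illustrative numbers, not measured ones:

    import numpy as np

    def resonator(x, freqs, bw, sr):
        # Two-pole resonator; `freqs` is a per-sample centre-frequency array,
        # `bw` (Hz) sets the pole radius and hence the formant bandwidth.
        r = np.exp(-np.pi * bw / sr)
        y = np.zeros(len(x))
        for n in range(2, len(x)):
            c = 2 * r * np.cos(2 * np.pi * freqs[n] / sr)
            y[n] = (1 - r * r) * x[n] + c * y[n - 1] - r * r * y[n - 2]
        return y

    sr = 16000
    t = np.arange(sr) / sr                     # one second
    f0 = 110.0
    # "Buzz": an impulse train at the fundamental.
    buzz = (np.floor(t * f0) != np.floor((t - 1 / sr) * f0)).astype(float)
    # Glide the first two formants from an "ah"-ish to an "ee"-ish position.
    f1 = np.linspace(700, 300, len(t))
    f2 = np.linspace(1200, 2300, len(t))
    voice = resonator(buzz, f1, 90, sr) + 0.5 * resonator(buzz, f2, 110, sr)
    voice /= np.max(np.abs(voice))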

LPC uses a similar model but it applies data compression to estimate future changes for each formant band. So instead of having to control all the parameters at or near audio rate, you can drop the control rate right down and still get something that can be understood.
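
For comparison, a minimal sketch of the usual frame-based LPC flow: fit an all-pole filter to one windowed frame with the autocorrelation method and Levinson-Durbin, then drive it with a pulse train at whatever pitch you like. `frame` and the pitch are assumed inputs here, and the recursion is written for clarity rather than robustness:

    import numpy as np
    from scipy.signal import lfilter

    def lpc(frame, order=12):
        # Autocorrelation method + Levinson-Durbin recursion.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a, err = [1.0], r[0]
        for i in range(1, order + 1):
            acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
            k = -acc / err
            a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
            err *= 1.0 - k * k
        return np.array(a), err

    # `frame`: one short chunk of voiced speech (e.g. 25 ms at 16 kHz),
    # assumed to already exist as a 1-D numpy array.
    sr = 16000
    coeffs, gain = lpc(frame * np.hamming(len(frame)))
    # Resynthesize the frame from a 100 Hz pulse train through the all-pole filter.
    excitation = np.zeros(len(frame))
    excitation[:: sr // 100] = 1.0
    resynth = lfilter([np.sqrt(gain)], coeffs, excitation)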

There are more modern systems. FOF and FOG use granular synthesis to create formant sounds directly. Controlling the frequency and envelope of the grains is equivalent to filtering a raw sound, but is more efficient.
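
A toy FOF-style grain generator, in case it's useful: each grain is a short decaying sine at a formant frequency, retriggered at the fundamental rate. The vowel formants and bandwidths below are rough textbook-style values I plugged in, nothing authoritative:

    import numpy as np

    def fof(f0, formants, bandwidths, sr=16000, dur=1.0):
        n = int(sr * dur)
        out = np.zeros(n)
        grain_len = int(0.02 * sr)                   # 20 ms grains
        t = np.arange(grain_len) / sr
        for fc, bw in zip(formants, bandwidths):
            # A decaying sine is one resonance in the time domain;
            # the decay rate plays the role of the formant bandwidth.
            grain = np.sin(2 * np.pi * fc * t) * np.exp(-np.pi * bw * t)
            grain[:32] *= np.linspace(0.0, 1.0, 32)  # short attack to avoid clicks
            for start in range(0, n - grain_len, int(sr / f0)):
                out[start:start + grain_len] += grain  # retrigger at the pitch rate
        return out / np.max(np.abs(out))

    vowel_ah = fof(f0=110, formants=[700, 1220, 2600], bandwidths=[130, 70, 160])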

FOF and FOG evolved into PSOLA which is basically real-time granulated formant synthesis and pitch shifting.


Many of the simpler vocal tract physical models are very similar to the ladder/lattice filter topologies found in LPC speech synthesizers.

In general, tract physical models have never sounded all that realistic. The one big thing they have going for them is control. Compared to other speech synthesis techniques, they can be quite malleable. Pink Trombone [1] uses a physical model under the hood. While it's not realistic sounding, the interface is quite compelling.

1: https://dood.al/pinktrombone/


Thank you! Seems like that project was incredibly far ahead of its time.

The physical-modelling aspect is super interesting. Does that mean that the similarity in sound to formant-based speech synthesis is because they're both using a sawtooth wave, noise, or other relatively simple sound as the raw input? I always imagined that a physical-modelling speech synthesizer fed by a sawtooth wave would sound more like a vocoder than Votrax or TI LPC output does, but I guess not.


> Does that mean that the similarity in sound to formant-based speech synthesis is because they're both using a sawtooth wave, noise, or other relatively simple sound as the raw input?

Essentially, yes. Both are known as "source-filter" models. A sawtooth, narrow pulse, or impulse wave is a good approximation of glottal excitation for the source signal, though many articulatory speech models use a more specialized source model that's analytically derived from real waveforms produced by the glottis. The Liljencrants-Fant derivative glottal waveform model is the most common, but a few others exist.
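
The full LF model takes some care to implement, so as a simpler stand-in, here's a Rosenberg-style glottal pulse: a smooth opening phase, a quicker closing phase, and silence for the rest of each pitch period. The open/close fractions are just illustrative:

    import numpy as np

    def rosenberg_pulse(f0=110, sr=16000, dur=1.0, open_frac=0.4, close_frac=0.16):
        period = int(sr / f0)
        n_open = int(open_frac * period)             # opening phase length
        n_close = int(close_frac * period)           # closing phase length
        pulse = np.zeros(period)                     # glottis closed for the rest
        pulse[:n_open] = 0.5 * (1 - np.cos(np.pi * np.arange(n_open) / n_open))
        pulse[n_open:n_open + n_close] = np.cos(
            0.5 * np.pi * np.arange(n_close) / n_close)
        reps = int(dur * sr / period) + 1
        return np.tile(pulse, reps)[: int(dur * sr)]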

In formant synthesis, the formant frequencies are known ahead of time and are explicitly added to the spectrum using some kind of peak filter. With waveguides, those formants are implicitly created based on the shape of the vocal tract (the vocal tract here is approximated as a series of cylindrical tubes with varying diameters).
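
A quick back-of-the-envelope check of the "formants fall out of the tube geometry" point: a uniform tube closed at the glottis and open at the lips resonates at odd quarter-wavelength multiples, which for a nominal 17.5 cm tract lands near the classic neutral-vowel formants. The length and speed of sound here are nominal values:

    # Odd quarter-wave resonances of a uniform closed-open tube.
    c = 343.0      # speed of sound, m/s (nominal)
    L = 0.175      # vocal tract length, m (nominal)
    formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
    print([round(f) for f in formants])    # -> [490, 1470, 2450]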


Human speech production/perception works by articulation changing the shape, and hence the resonant frequencies (formants), of the vocal tract, with our ear/auditory cortex then picking up these changing formants. We're especially attuned to changes in the formants, since those correspond to changes in articulation. The specific resonant frequency values of the formants vary from individual to individual and aren't so important.

Similarly, the sound source (aka the voice) for human speech can vary a lot from individual to individual, so it serves more to communicate age/sex, emotion, identity, etc., than actual speech content (formant changes).

The reason articulatory synthesis (whether based on a physical model of the vocal tract or a software simulation of one) and formant synthesis sound so similar is that both are designed to emphasize the formants (resonant frequencies) in a somewhat overly precise way, and neither typically does a good job of accurately modelling the voice source and the other factors that would make it sound more natural. The ultimate form of formant synthesis just uses sine waves (not a source + filter model) to model the changing formant frequencies, and is still quite intelligible.
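
For anyone who hasn't heard sine-wave speech, the whole trick really is just a handful of oscillators tracking the formant frequencies. A toy sketch (the tracks below are fabricated; real sine-wave speech derives them from analysis of an actual recording):

    import numpy as np

    def sinewave_speech(tracks, sr=16000):
        # `tracks` is a list of (freq, amp) arrays, one pair per formant,
        # sampled at the audio rate.
        out = np.zeros(len(tracks[0][0]))
        for freq, amp in tracks:
            phase = 2 * np.pi * np.cumsum(freq) / sr   # integrate frequency -> phase
            out += amp * np.sin(phase)
        return out / np.max(np.abs(out))

    n = 16000                                          # one second at 16 kHz
    f1 = np.linspace(700, 300, n)                      # pretend /a/ -> /i/ glide
    f2 = np.linspace(1200, 2300, n)
    audio = sinewave_speech([(f1, np.full(n, 1.0)), (f2, np.full(n, 0.5))])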

The "Daisy" song somehow became a staple for computer speech, and can be heard here in the 1984 DECtalk formant-synthesizer version. You can still pick up DECtalks on eBay - an impressive large VCR-sized box with a 3" 68000 processor inside.

https://en.wikipedia.org/wiki/Daisy_Bell



It didn’t “learn” how to sing, of course. The voice was built bottom-up, phoneme by phoneme, pitch by pitch.


The crowdsourced version of the song is quite chilling: https://youtu.be/Gz4OTFeE5JY


I don't know if paying for samples via Mechanical Turk counts as "crowdsourced," but you were 100% right about it being chilling. Is there an audio uncanny valley to go with the visual one we know so well? Like when something sounds close to, but not exactly like, something else we know? That was so weird to listen to.


If you listen to the individual tracks, it just sounds like people "normally" making noises: https://www.bicyclebuiltfortwothousand.com/

Maybe intent plays a part in the perceived musicality of the result, considering that even 4chan and other forums can make up more coherent virtual choirs, under equally poor recording conditions... https://www.youtube.com/watch?v=uK_SRSB9pdA&list=PLlsIiu-R8a...

Edit: or perhaps the Mechanical Turk performance is more like a haka than the original song in effect - https://www.youtube.com/watch?v=BI851yJUQQw


It does feel like they wanted the choir to stay true to the speech synthesizer's sounds, not the song itself. So (rendition of (rendition of Bicycle Built for Two)).


I think it fits the definition if the invitation to work on a piece of the task is open for a significant chunk of the population. It doesn't have to be unpaid.


Nice. But why did the video creator feel the need to put in the fake film projector effects? The urge people have to add "oldness" where it is already present - though not in the form they imagine - is interesting by itself.


A few years ago I picked up "Music by Computers" [1] from a used book store, and it's fascinating.

Published in 1969, it's a collection of papers from the 60s about music and sound processing on the machines back then, and it goes into a lot more detail, if anybody is interested and can find a copy. It even came with recorded music on 5 paper-thin flexi-discs [2] that I've never been able to play.

1: https://www.amazon.com/Music-Computers-Heinz-von-Foerster/dp...

2: https://en.wikipedia.org/wiki/Flexi_disc


Library of Congress essay by Cary O'Dell on "Daisy Bell (Bicycle Built For Two)" from song origins to Bell Labs recording:

https://www.loc.gov/static/programs/national-recording-prese...


We had a floppy vinyl 45 of that when I was a kid in the 1960s (my mother was a high school science teacher and we often had cool stuff like that around the house).


A recording of it was included in some electronics hobbyist magazines of that time on shiny black flexible vinyl for playing on a phonograph at 45rpm. I seem to recall that Bell Labs was credited on the label but IBM was not.

EDIT: dbarlett just posted an image of the recording's label elsewhere in this thread


The neat thing about this particular singing synthesizer is that it used a surprisingly sophisticated (especially for the 60s) physical model of the human vocal tract [1], and was perhaps the first use of physical modeling sound synthesis. Vowel shapes were obtained through physical measurements of an actual vocal tract via x-rays. In this case, they were Russian vowels, but were close enough for English.
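
For anyone who wants to poke at the idea, here's a stripped-down Kelly-Lochbaum-style scattering loop: the tract is a chain of tube sections, each adjacent pair of cross-sectional areas gives a reflection coefficient, and waves bounce between the glottis and lip boundaries. The area function, end reflections, and excitation below are illustrative choices, not the measured vowel data from the original work:

    import numpy as np

    def kelly_lochbaum(excitation, areas, glottal_refl=0.75, lip_refl=-0.85):
        n_sec = len(areas)
        # Reflection coefficient at each junction between adjacent sections.
        k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
        right = np.zeros(n_sec)   # pressure wave travelling toward the lips
        left = np.zeros(n_sec)    # pressure wave travelling toward the glottis
        out = np.zeros(len(excitation))
        for t, x in enumerate(excitation):
            new_right = np.empty(n_sec)
            new_left = np.empty(n_sec)
            new_right[0] = x + glottal_refl * left[0]   # glottis end
            new_left[-1] = lip_refl * right[-1]         # lip end (pressure release)
            for i in range(n_sec - 1):                  # one-multiply scattering junctions
                w = k[i] * (right[i] - left[i + 1])
                new_right[i + 1] = right[i] + w
                new_left[i] = left[i + 1] + w
            right, left = new_right, new_left
            out[t] = (1 + lip_refl) * right[-1]         # pressure radiated at the lips
        return out

    # Uniform tube (neutral vowel): with 22 sections at 44.1 kHz the first
    # resonance sits near sr / (4 * 22) ~= 500 Hz.
    sr = 44100
    excite = np.zeros(sr)
    excite[:: sr // 110] = 1.0                          # 110 Hz pulse train
    audio = kelly_lochbaum(excite, areas=np.ones(22))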

While this particular kind of speech synthesis [2] isn't really used anymore, it's still fun to play around with. Pink Trombone [3] is a good example of a fun toy that uses a waveguide physical model, similar to the Kelly-Lochbaum model above. I've adapted some of the DSP in Pink Trombone a few times [4][5][6], and used it in some music [7] and projects [8] of mine.

For more in-depth information about doing singing synthesis specifically (as opposed to general speech synthesis) using waveguide physical models, Perry Cook's dissertation [9] is still considered a seminal work. In the early 2000s, a handful of follow-ups on physically based singing synthesis were done at CCRMA. Hui-Ling Lu's dissertation [10] on glottal source modelling for singing purposes comes to mind.

1: https://ccrma.stanford.edu/~jos/pasp/Singing_Kelly_Lochbaum_...

2: https://en.wikipedia.org/wiki/Articulatory_synthesis

3: https://dood.al/pinktrombone/

4: https://pbat.ch/proj/voc/

5: https://pbat.ch/sndkit/tract/

6: https://pbat.ch/sndkit/glottis/

7: https://soundcloud.com/patchlore/sets/looptober-2021

8: https://pbat.ch/wiki/vocshape/

9: https://www.cs.princeton.edu/~prc/SingingSynth.html

10: https://web.archive.org/web/20080725195347/http://ccrma-www....


Another excellent, but quite dense, resource I've found helpful for implementing my own waveguide models is Physical Audio Signal Processing, a book available as a hard copy and online [1]. There is also an absolute ton of research on these topics that has never been summarized anywhere or cited outside the small circle of researchers, so a lot of institutional knowledge about physical modeling is locked up in academic papers that aren't very accessible.

1: https://ccrma.stanford.edu/~jos/pasp/


I've been fascinated by the simplicity of this since I ran into SAM (Software Automatic Mouth) on the C64, but never really taken the time to delve into it. Your links are an amazing resource...


Eerie: this performance reminded me of the ending song of Portal.


Does anyone have any information / background on the programming behind this project?

It seems incredible for so long ago and I can't quite conceive of how they were able to do it.


See my other comments here for more info about the underlying technology.

It is pretty incredible that sophisticated digital physical models of the human vocal tract were being built in the early 60s. This was possible largely thanks to the deep pockets of Bell Labs, which put a lot of R&D into the voice and voice transmission.


Reminds me of Kapp'n in Animal Crossing.


The thing I most like in the film 2010 is that HAL gets put back together and saves Helen Mirren's human crew.


“Learned” or “was instructed”?


Daisy, Daisy, Daisy.


HAL did it better 40 years later in 2001: https://www.google.com/search?q=HAL+singing+D+a+bicycle+buil...


The clip is embedded at the bottom of TFA.



