For the last few months, we at Standard Intelligence have been researching scalable cross-modality learning. We're excited to announce that we're open-sourcing current checkpoints of our full-duplex, audio-only transformer base model, hertz-dev, with a total of 8.5 billion parameters.
- hertz-codec: a convolutional audio autoencoder that takes mono, 16kHz speech and transforms it into an 8 Hz latent representation at a bitrate of about 1kbps. The codec at 1kbps outperforms SoundStream and EnCodec at 6kbps and is on par with DAC at 8kbps in subjective evaluations, while emitting fewer tokens per second than any popular tokenizer, which is critical for language modeling (the rate arithmetic is sketched after this list). The codec has 5 million encoder parameters and 95 million decoder parameters.
- hertz-vae: a 1.8 billion parameter transformer decoder. The model has a context of 8192 sampled latent representations (17 minutes) and predicts the next encoded audio frame. During inference, 15 bits of quantized information from the output of hertz-dev act as semantic scaffolding that steers the generation.
- hertz-dev: a 6.6 billion parameter transformer stack. The primary checkpoint is partially initialized from the weights of a pre-trained language model and then trained for a single epoch on 20 million hours of audio with a 2048-token (4-minute) context length. We've also released an ablation checkpoint with no text initialization at all.
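The quoted rates and context lengths are mutually consistent. The snippet below reproduces the arithmetic; it is purely illustrative and uses no code from the release.

```python
# Illustrative arithmetic only, derived from the figures quoted above,
# not from the released code.

LATENT_RATE_HZ = 8        # hertz-codec emits 8 latent frames per second
BITRATE_BPS = 1_000       # ~1kbps quoted bitrate

# Bits available to encode each latent frame.
print(BITRATE_BPS / LATENT_RATE_HZ)       # 125.0 bits per frame

# hertz-vae context: 8192 latent frames.
print(8192 / LATENT_RATE_HZ / 60)         # ~17.1 minutes

# hertz-dev context: 2048 tokens at the same 8 Hz rate.
print(2048 / LATENT_RATE_HZ / 60)         # ~4.3 minutes
```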
Hertz-dev is the first publicly released audio base model of its kind. Base models are uniquely valuable as a research product because they accurately model the distribution of the data they were trained on, unlike models that have undergone substantial RL tuning, which collapses their generation distributions. This makes base models the best starting point for fine-tuning on a large number of different tasks.
Hertz-dev has a theoretical latency of 65ms and a real-world average latency of 120ms on an RTX 4090. This is about 2x lower latency than any public model in the world, a prerequisite for a model that interacts with you in human-like ways rather than what feels like a delayed, choppy phone call. We're currently training a larger, more advanced version of Hertz, which will use a scaled base model recipe and RL tuning to substantially improve the raw capabilities and final coherence of the model. Hertz-dev is a glimpse at the future of real-time voice interaction, and is the easiest conversational audio model in the world for researchers to fine-tune and build on top of.
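One way to read these numbers, under our own assumption that the theoretical floor is the half-frame wait imposed by the 8 Hz latent rate (the release does not publish this breakdown):

```python
# A back-of-the-envelope latency model. This decomposition is our
# assumption, not an official breakdown from the hertz-dev release.

FRAME_RATE_HZ = 8
frame_ms = 1000 / FRAME_RATE_HZ     # 125 ms of audio per latent frame

# In a full-duplex loop the model reacts at frame boundaries, so a user's
# utterance waits half a frame on average before it can affect the output.
theoretical_floor_ms = frame_ms / 2      # 62.5 ms, near the quoted 65 ms

# Whatever remains of the measured 120 ms average is per-frame compute
# (transformer forward pass plus codec decode) on the RTX 4090.
print(120 - theoretical_floor_ms)        # ~57.5 ms implied compute budget
```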
Sample Generations
To demonstrate the audio modeling capabilities of hertz-dev, we present both one-channel and two-channel generations, as well as a live conversation between the model and a human.
One-channel
Two-channel
Interactive
9 seconds of prompt included.
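For a concrete picture of how generations like these are produced, here is a minimal sketch of an autoregressive sampling loop at the 8 Hz latent rate. Every name in it (StubCodec, StubModel, encode, decode, step, LATENT_DIM) is a hypothetical placeholder; the real checkpoints ship their own loaders and sampling code, and the stubs only make the shape bookkeeping explicit.

```python
# Hypothetical sketch of a full-duplex sampling loop. The classes below
# are stand-ins, not the released hertz-dev API.

import torch

FRAME_RATE_HZ = 8          # latent frames per second (from hertz-codec)
SAMPLE_RATE = 16_000       # mono input sample rate
LATENT_DIM = 32            # hypothetical latent width

class StubCodec:
    """Placeholder for hertz-codec: waveform <-> 8 Hz latents."""
    def encode(self, audio):                     # (B, T) -> (B, T*8//SR, D)
        n = audio.shape[1] * FRAME_RATE_HZ // SAMPLE_RATE
        return torch.randn(audio.shape[0], n, LATENT_DIM)
    def decode(self, latents):                   # (B, N, D) -> (B, N*SR//8)
        n = latents.shape[1] * SAMPLE_RATE // FRAME_RATE_HZ
        return torch.zeros(latents.shape[0], n)

class StubModel:
    """Placeholder for hertz-dev: autoregressive next-frame prediction."""
    def step(self, context):                     # (B, N, D) -> (B, 1, D)
        return torch.randn(context.shape[0], 1, LATENT_DIM)

codec, model = StubCodec(), StubModel()

# Prime with 9 s of (silent) prompt audio, as in the interactive sample.
context = codec.encode(torch.zeros(1, 9 * SAMPLE_RATE))   # 72 frames

frames = []
for _ in range(FRAME_RATE_HZ * 30):              # 30 s of generation
    nxt = model.step(context)                    # predict next latent frame
    context = torch.cat([context, nxt], dim=1)
    frames.append(nxt)

audio = codec.decode(torch.cat(frames, dim=1))   # back to a 16kHz waveform
```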
At SI, we're doing fundamental research with the goal of building aligned general intelligence, and we view this as just the first step on that journey. We're starting at a unique time when a tiny team can do massively outsized work.
We're currently a team of 4 in San Francisco. If your life goal is to build AGI in a way that benefits all humanity, we might want to hire you—reach out at [email protected]. If you're interested in investing, please reach out at [email protected].