Introducing hertz-dev, the first open-source base model for conversational audio generation

Github | Download checkpoints | Twitter

For the last few months, the team at Standard Intelligence has been doing research on cross-modality learning. We're excited to announce that we're open-sourcing an early product of this research: hertz-dev, an 8.5B-parameter, full-duplex, audio-only base model.

The audio modality is essential for creating interactive agents that feel natural. Generative audio models today are either diffusion-based or autoregressive. Diffusion-based audio models have proven good at music generation and short samples, but truly interactive audio generation needs to be autoregressive.

The largest problems in this field are 1) getting audio generation that sounds human (i.e. non-synthetic, and able to handle interruptions well) and 2) handling real-time generation with two live channels that are both producing information, as in regular human dialogue.

Our model is at the frontier of both, natively fitting the two-speaker format with faster-than-human reaction times and the full ability to parse and generate overlapping two-speaker audio. We do this by operating in latent space as well as using quantized phonetic bits, allowing an 80ms theoretical average latency with only a single sampled latent at each timestep. Currently, we benchmark at 120ms real-world latency on a single RTX 4090, 2x lower than the previous state of the art.

Overview

Figure 1: hertz-codec architecture diagram for our VAE. The input is 6s of 16kHz mono audio and the output is a 32-dim latent.
Figure 2: hertz-ar architecture diagram for the autoregressive section of our model. (2a) is mono-channel autoregressive latent prediction and (2b) is duplex autoregressive latent prediction.

hertz-dev is made of two parts: the hertz-codec, which produces audio latents, and the hertz-ar, which predicts future latents conditioned on past latents. The audio latents are an extremely rich prior that could be used for many downstream tasks.
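
To make the two-stage structure concrete, here is a minimal sketch of the pipeline in Python. The function, the encode/predict/decode method names, and the tensor shapes are illustrative assumptions based on the figure captions above, not the released API; see the repository for the actual interfaces.

```python
import torch

def generate_continuation(codec, ar, audio: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of the hertz-dev pipeline: encode audio to latents,
    predict the next latent autoregressively, and decode back to a waveform.

    `codec` and `ar` stand in for the released hertz-codec and hertz-ar
    modules; the encode/predict/decode method names here are assumptions,
    not the actual API.
    """
    latents = codec.encode(audio)                      # (batch, frames, 32)
    next_latent = ar.predict(latents)                  # (batch, 1, 32)
    return codec.decode(torch.cat([latents, next_latent], dim=1))

# Example call with 6s of 16kHz mono audio, the codec's input format per Figure 1:
# audio = torch.randn(1, 6 * 16_000)
# continuation = generate_continuation(codec, ar, audio)
```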

Hertz-dev is the first publicly released base model for conversational audio. Base models accurately predict the distribution of the data they were trained on, as opposed to models that have had substantial RL tuning applied to collapse their generation distributions. This makes them the best starting point for downstream fine-tuning across a large number of different tasks. We're currently training a larger, more advanced version of Hertz, which will use a scaled base model recipe and RL tuning to substantially improve the raw capabilities and final coherence of the model. Hertz-dev is a glimpse of the future of real-time voice interaction, and is the easiest conversational audio model in the world for researchers to fine-tune and build on top of.

Sample Generations

To demonstrate the audio modeling capabilities of hertz-dev, we provide samples of one-channel and two-channel generation, as well as a live conversation between the model and a human.

One-channel

Two-channel

Interactive

9 seconds of prompt included.

Training Choices

Performance

During live inference, the model runs at 8 forward passes per second, generating autoregressively in a constant stream. It takes two separate channels as input, but in conversation returns only one. At each step, it receives the human's audio, encodes it into a latent, combines this with the model's last generated latent, and feeds both into hertz-ar.
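
A rough sketch of that loop, in Python, looks like the following. The helper names (read_mic, play) and the encode/step/decode methods are assumed for illustration; the released inference code defines its own interfaces.

```python
import torch

FRAME_RATE_HZ = 8                            # one forward pass every 125ms
SAMPLES_PER_FRAME = 16_000 // FRAME_RATE_HZ  # 2,000 samples of 16kHz audio

def duplex_loop(codec, ar, read_mic, play):
    """Rough sketch of live duplex inference; all names are illustrative."""
    model_latent = torch.zeros(1, 1, 32)     # model's previous latent (32-dim assumed)
    while True:
        # 1. Grab the latest 125ms of the human's channel from the microphone.
        human_audio = read_mic(SAMPLES_PER_FRAME)       # (1, 2000)
        # 2. Encode it into a single latent frame with hertz-codec.
        human_latent = codec.encode(human_audio)        # (1, 1, 32)
        # 3. Condition hertz-ar on both channels and sample the model's next latent.
        model_latent = ar.step(human_latent, model_latent)
        # 4. Decode the model's latent back to audio and play it on the model's channel.
        play(codec.decode(model_latent))
```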

This allows the latency, measured as the average time between user utterance and model response, to be 62.5ms (the average time between any given utterance and the end of the current token) + the time for a forward pass + round-trip internet delay. Running on local 4090s, we usually see a real-world average latency of 120ms. This is 2x lower than any other audio model, which is necessary for a model that can interact with you in human-like ways instead of what feels like a delayed, choppy phone call.
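
As a quick sanity check on that budget, the pieces add up as follows; the forward-pass time here is an assumed figure chosen to illustrate how the total reaches roughly 120ms.

```python
frame_ms = 1000 / 8          # 125ms of audio generated per autoregressive step
avg_wait_ms = frame_ms / 2   # 62.5ms: an utterance lands mid-frame on average
forward_pass_ms = 57.5       # assumed single-4090 forward-pass time, for illustration
network_ms = 0.0             # running locally, so no round-trip internet delay

print(avg_wait_ms + forward_pass_ms + network_ms, "ms")   # 120.0 ms
```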

At SI, we're doing fundamental research with the goal of building aligned general intelligence, and we view this release as just the first step on that journey. We're starting at a unique time when a tiny team can do massively outsized work, and we're currently a team of four in San Francisco.

If your life goal is to build AGI in a way that benefits all humanity, reach out at [email protected]. If you're interested in investing, please reach out at [email protected].