blog

a collection of our writing

Building the heap: racking 30 petabytes of hard drives for pretraining

We built a storage cluster in downtown SF to hold 90 million hours of video data. Why? We’re pretraining models to solve computer use. Compared to a text LLM like LLaMa-405B, which needs ~60 TB of text to train, video is heavy enough that we need roughly 500 times more storage. Instead of paying the ~$12 million per year it would cost to keep all of this on AWS, we rented space in a San Francisco colocation center and brought that cost down ~40x, to $354k per year including depreciation.
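As a rough sanity check, the storage math works out as in the sketch below; the average compressed bitrate is an assumed value chosen for illustration, not a measured figure from our pipeline:

```python
# Back-of-envelope storage math. The bitrate is an assumption for
# illustration, not a figure from the post.
HOURS = 90e6                 # 90 million hours of video
SECONDS = HOURS * 3600
BITRATE_BPS = 0.75e6         # assumed ~0.75 Mbit/s average after compression

total_bytes = SECONDS * BITRATE_BPS / 8
print(f"{total_bytes / 1e15:.0f} PB")                       # ~30 PB
print(f"{total_bytes / 60e12:.0f}x a ~60 TB text corpus")   # ~500x
```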

Introducing hertz-dev, the first open-source base model for conversational audio generation

For the last few months, the team at Standard Intelligence has been doing research on cross-modality learning. We’re excited to announce that we’re open-sourcing an early product of this research: hertz-dev, an 8.5B-parameter, full-duplex, audio-only base model. The audio modality is essential for building interactive agents that feel natural. Today, generative audio models fall into two camps: diffusion-based and autoregressive. While diffusion-based models excel at music generation and short clips, truly interactive audio generation needs to be autoregressive.
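The reason is latency: an autoregressive model can emit audio token by token as the conversation unfolds, while a diffusion model must denoise an entire segment before any of it is playable. Here is a minimal sketch of that autoregressive loop; the model interface is hypothetical, not the hertz-dev API:

```python
import torch

def stream_audio_tokens(model, context: torch.Tensor, n_steps: int):
    """Yield one audio token at a time, conditioning on everything so far.

    Assumes a hypothetical `model` mapping token ids (B, T) to logits
    (B, T, V); this is an illustration, not the hertz-dev interface.
    """
    tokens = context
    for _ in range(n_steps):
        logits = model(tokens)[:, -1, :]                  # predict next token only
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        yield next_tok                                    # playable immediately
        tokens = torch.cat([tokens, next_tok], dim=1)     # grow the context
```

Because each step conditions only on tokens already generated, the model can keep listening and speaking at the same time, which is what full-duplex conversation requires.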