Building the heap: racking 30 petabytes of hard drives for pretraining

We built a storage cluster in downtown SF to hold 90 million hours' worth of video data. Why? We're pretraining models to solve computer use. Compared to text LLMs like LLaMa-405B, which train on roughly 60 TB of text, our video dataset demands about 500 times more storage. Instead of paying the ~$12 million per year it would cost to keep all of this on AWS, we rented space from a colocation center in San Francisco, bringing the cost down roughly 40x to $354k per year, including depreciation.
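
As a rough sanity check on the scale (a back-of-envelope sketch using only the figures quoted above, not our actual accounting):

```python
# Back-of-envelope check of the storage figures quoted in this post.
# All inputs are the numbers stated above; nothing here is exact accounting.

TEXT_DATASET_TB = 60          # approx. text data for an LLM like LLaMa-405B
SCALE_FACTOR = 500            # how much more storage video demands
VIDEO_HOURS = 90_000_000      # hours of video we want to store

video_dataset_pb = TEXT_DATASET_TB * SCALE_FACTOR / 1000      # TB -> PB
print(f"Video dataset: ~{video_dataset_pb:.0f} PB")           # ~30 PB

# Implied average footprint per hour of (compressed) video.
bytes_per_hour = video_dataset_pb * 1e15 / VIDEO_HOURS
print(f"~{bytes_per_hour / 1e6:.0f} MB per hour of video")    # ~333 MB/hour
```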