MIT baby-talk project spawns massive IP SAN

MIT Media Lab builds a 1.4-petabyte SAN to study how babies learn to talk.

Imagine a storage array with capacity that's equivalent to a stack of iPods three times the height of the Empire State Building but that can be managed with common Ethernet networking tools, and you'll get what a group of MIT scientists and four storage vendors are in the process of building.

The storage array will support an MIT Media Lab project called the Human Speechome Project that is studying how babies develop the ability to talk. The project began three months ago when MIT associate professor Deb Roy began recording his baby's everyday life through the use of 14 fish-eye lens cameras set up throughout his house, giving researchers a bird's-eye view of every room.

To store and process the video and audio data, the researchers needed a massive storage-area network (SAN) that can archive and search what is expected to be 1.4 petabytes (1,400TB) of data over the span of the three-year project.

The SAN is being built from commodity hardware and uses a 10 Gigabit Ethernet IP network for data transfer between the backend SAN and hundreds of servers.

"I think here what we're seeing is what the future of storage is going to be like. This is a great marriage between industry and the academic world," said Frank Moss, director of the Media Lab and a former CEO of Tivoli Systems, a maker of storage management software now owned by IBM.

Moss spoke at a press conference held Monday at MIT's Media Lab in Cambridge, Mass.

The Human Speechome Project computing infrastructure is expected to be composed of more than 300 Hammer Z-Rack storage enclosures from Bell Microproducts, about 3,000 SATA (Serial Advanced Technology Attachment) hard disk drives from Seagate Technology and more than 100 10 Gigabit Ethernet switches and 400 blade processors from Marvell Technology Group Ltd.

The high-throughput switches are needed for the storage I/O anticipated by researchers who believe they'll be processing 700TB of data during every 12-hour analytical run. To achieve the desired performance requirements, 150-drive stripes (aggregated virtual volumes) will be created using the native virtualization capabilities of Bell's Z-SAN. Protection against data loss will be delivered through RAID 10 mirrors (duplicate copies) of the raw video data, transform data, and metadata files.
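To put the 700TB-per-run figure in perspective, the following back-of-envelope sketch estimates the average throughput it implies. It assumes decimal units (1TB = 10^12 bytes), a perfectly even load over the full 12 hours, and traffic spread evenly across all of the roughly 3,000 drives; none of these assumptions comes from the researchers, and real traffic would be burstier.

```python
import math

TB = 10**12  # assumption: decimal terabytes

def sustained_rate(total_bytes: int, hours: float) -> float:
    """Average rate in bytes per second over the given duration."""
    return total_bytes / (hours * 3600)

rate = sustained_rate(700 * TB, 12)        # 700TB per 12-hour analytical run

gbytes_per_s = rate / 10**9                # aggregate gigabytes per second
gbits_per_s = rate * 8 / 10**9             # aggregate gigabits per second
min_10gbe_links = math.ceil(gbits_per_s / 10)  # fully saturated 10GbE links
mb_per_s_per_drive = rate / 3000 / 10**6   # assumption: even spread over 3,000 drives

print(f"{gbytes_per_s:.1f} GB/s aggregate")
print(f"{gbits_per_s:.1f} Gbit/s aggregate")
print(f"at least {min_10gbe_links} saturated 10GbE links")
print(f"{mb_per_s_per_drive:.1f} MB/s per drive")
```

Under these assumptions the run averages about 16GB per second in aggregate, which is more than a dozen fully saturated 10 Gigabit Ethernet links, while each individual drive would need to sustain only a few megabytes per second.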

"Our approach allows us to eliminate a lot of cost by using high volume, commonly available systems," said Jeff Greenberg, senior director of product marketing at Zetera, the vendor designing the SAN.

The project has been amassing several terabytes of audio and video data per week on early childhood learning and socialization in order to model human language acquisition.

"If you take all parallel tracks of data over three years you'll have 400,000 hours of video and audio data," Roy said.
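The figures quoted in the article can be cross-checked against one another with some rough arithmetic. The sketch below assumes decimal units, a 156-week project, and that the full 1.4 petabytes is spread evenly across all 400,000 recorded hours; in practice the total also includes transform data and metadata, so the per-hour figure is only an upper-bound illustration.

```python
TB = 10**12  # assumption: decimal terabytes

total_bytes = 1400 * TB       # 1.4 petabytes expected over the project
recorded_hours = 400_000      # Roy's estimate across all parallel tracks
project_weeks = 3 * 52        # assumption: three years of 52 weeks

gb_per_hour = total_bytes / recorded_hours / 10**9
mbit_per_s_per_track = total_bytes * 8 / (recorded_hours * 3600) / 10**6
tb_per_week = total_bytes / project_weeks / TB

print(f"{gb_per_hour:.1f} GB per recorded hour")
print(f"{mbit_per_s_per_track:.2f} Mbit/s per track on average")
print(f"{tb_per_week:.1f} TB per week on average")
```

The weekly average of roughly 9TB is in line with the "several terabytes" per week the project reports collecting.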

Roy said an application the university built allows researchers to quickly home in on video and audio streams that involve his child's development while skipping video playback of empty rooms or footage of mundane tasks, such as getting a drink of water or making coffee.

This story, "MIT baby-talk project spawns massive IP SAN" was originally published by Computerworld.
