600 TB of data a day
Yes, that’s what Facebook’s storage servers grapple with on a daily basis as we go about our lives with our status updates, shares and likes, all at the click of a button.
In a word, it’s mind-boggling.
Imagine just one social network accounting for so much data created every day, and still being able to run its servers with essentially no downtime. The most widely used system at Facebook for large-scale data transformations on raw logs is Hive, a query engine based on Corona Map-Reduce used for processing and creating large tables.
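To give a flavour of what working with Hive looks like, here is a minimal sketch of running a HiveQL aggregation from Python, assuming the PyHive client. The gateway host, table, column names and partition date are hypothetical stand-ins, not anything from Facebook’s actual warehouse:

    # A minimal sketch of querying Hive from Python with the PyHive client.
    # The gateway host, table, and partition below are hypothetical.
    from pyhive import hive

    conn = hive.connect(host="hive-gateway.example.com", port=10000)
    cursor = conn.cursor()

    # Hive compiles this SQL-like query into Map-Reduce jobs that scan
    # the raw logs and produce an aggregated result table.
    cursor.execute("""
        SELECT action, COUNT(*) AS events
        FROM raw_action_logs
        WHERE ds = '2014-04-10'
        GROUP BY action
    """)

    for action, events in cursor.fetchall():
        print(action, events)

    cursor.close()
    conn.close()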
Facebook has managed the entire thing by adding a distinctive touch of their own: what they call the Facebook ORCFileWriter, which allows for 3 times more compression than the stock ORCFile writer.
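For a rough intuition on why a columnar format like ORCFile compresses so well, here is a small self-contained Python sketch, purely an illustration and not Facebook’s actual writer, comparing zlib compression of the same made-up records laid out row-wise versus column-wise:

    import json
    import zlib

    # Made-up log records. In a columnar layout, similar values (repeated
    # countries, narrow id ranges) end up adjacent, which generic
    # compressors like zlib exploit far better than a row-wise layout.
    records = [
        {"user_id": 1000 + i, "country": "US" if i % 3 else "IN", "action": "like"}
        for i in range(10000)
    ]

    # Row layout: one serialized record after another.
    row_bytes = "\n".join(json.dumps(r) for r in records).encode()

    # Column layout: all values of each field stored contiguously.
    columns = {key: [r[key] for r in records] for key in records[0]}
    col_bytes = "\n".join(json.dumps(vals) for vals in columns.values()).encode()

    print("row-wise:   ", len(zlib.compress(row_bytes)), "bytes")
    print("column-wise:", len(zlib.compress(col_bytes)), "bytes")

The column-wise layout compresses noticeably better on data like this, and ORCFile pushes the same idea much further with type-aware encodings such as dictionary and run-length encoding.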
And true to their ideals, they have opened the entire storage format up on GitHub and are working with the open source community to incorporate these improvements back into the Apache Hive project.
This is on top of the way they custom-built their server racks in the first place.
As John Gruber says,
“There can’t be that many entities dealing with this scale of data storage, and the others likely aren’t sharing what they’ve learned. This is the cutting edge of computer science.”
PS: If you can, do read the Facebook engineering blog at http://code.facebook.com. Quite a few nuggets of interesting reading out there.