When we talk about Big Data, we often focus on the exciting potential business applications, things like real-time analytics and machine learning. But what happens to all that data once it’s been used? Though storage prices on the whole have plummeted, they still make up a considerable part of the overhead for companies that work with massive sets of unstructured data (we’re talking here about data in the tera- and petabytes). As organizations rely on and generate ever-increasing amounts of data, it’s become critical to find economical, long-term storage solutions. For many, the answer is cold storage.
What Is Cold Data?
Information has its own lifecycle. For businesses that rely on streaming analytics (for instance, sensor readings from an automated factory or real-time stock market movements), the value of their data is specifically tied to its immediacy. Data scientists call data that’s immediately useful and valuable hot data. As that data becomes less valuable, it’s said to cool down until it eventually becomes cold data.
Of course, just because data no longer has an immediate business application doesn’t mean it can just be discarded. Cold data can still be invaluable when it comes time to make long-term, strategic decisions. For instance, the same sensor data that’s used to monitor for real-time malfunctions on an assembly line can be used much later to identify opportunities to streamline or improve processes. Additionally, regulatory or compliance mandates often require organizations to hang on to log files, customer records, and backups long after that data has ceased to have an immediate business use.
But where does this cold data go? In the past, many companies would move their archival data to magnetic tape or commodity hard disk drives, but these formats become impractical and inconvenient when most or all of an organization’s data is stored in the cloud. At the same time, keeping tera- and petabytes of information on the same servers that you use for your hot data creates an unnecessary expense and can lead to suboptimal performance. In other words, having too much cold data around can actually make your hot data less valuable. This is where cold storage comes in.
How Does Cold Storage Work?
At their most basic, cold storage solutions are hard drives that are meant to be accessed less frequently than other types of storage. Like other IaaS solutions and unlike magnetic tape storage, cloud-based cold storage is highly scalable and allows for rapid provisioning of new resources based on organizational needs. Where response times for most cloud storage systems are measured in milliseconds, response times for cold storage systems can range from seconds to hours depending on the service.
This compromise on accessibility leads to significant cost savings. Depending on the service and the details of the setup, cold storage can cost between a half and a third as much per gigabyte as hot storage options. For companies that store hundreds or thousands of terabytes of data, this can represent substantial savings.
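To make the half-to-a-third range concrete, here is a minimal sketch of the arithmetic. The function names and the hot-storage price are hypothetical placeholders chosen for illustration, not figures from any provider's price list.

```python
def monthly_storage_cost(terabytes: float, price_per_gb: float) -> float:
    """Monthly storage cost in dollars, using decimal terabytes (1 TB = 1000 GB)."""
    return terabytes * 1000 * price_per_gb


def monthly_savings(terabytes: float, hot_price: float, cold_price: float) -> float:
    """Dollars saved per month by keeping `terabytes` in cold rather than hot storage."""
    return (monthly_storage_cost(terabytes, hot_price)
            - monthly_storage_cost(terabytes, cold_price))


if __name__ == "__main__":
    HOT_PRICE_PER_GB = 0.03  # hypothetical price in dollars per GB per month
    for label, divisor in (("half", 2), ("a third", 3)):
        cold_price = HOT_PRICE_PER_GB / divisor
        saved = monthly_savings(500, HOT_PRICE_PER_GB, cold_price)
        print(f"Cold at {label} the hot price: ${saved:,.0f} saved per month on 500 TB")
```

Under these assumed prices, 500 TB would save somewhere between $7,500 and $10,000 a month on storage alone. Retrieval and access fees are not modeled here.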
What Should You Look For?
While all cold storage systems sacrifice availability for cost savings, they vary tremendously in storage method, price, retrieval time, and other features. The differences between the major cold storage options have less to do with the underlying technology than with pricing and access models. Therefore, the best option will ultimately come down to what makes sense for a particular organization’s business needs. Besides the obvious question of how much storage you need, there are some other things you should consider.
- What kinds of data are you storing? Different types of data have different requirements. Financial records and logs might only need to be accessed in the (hopefully) rare event of an audit, but user data that is useful on an intermittent basis may need to be retrieved more often. These considerations will help determine how you balance access speed and cost.
- Do you need to move data manually or automatically? Managing massive amounts of data can be a labor- and resource-intensive job. Some services come with automated tiering, which allows organizations to specify when and how different types of data are moved between the “hot” and “cold” tiers.
- Do you need to integrate with your hot data operation? Another big operational cost in any data-heavy operation is getting the different parts to integrate smoothly. Unsurprisingly, the cold storage solutions from Google, Amazon, and Microsoft are designed to integrate with their respective cloud storage systems, but they also include APIs that should provide some level of interoperability.
- Do you need to comply with regulatory mandates? These can be important considerations for organizations in regulated industries like finance and healthcare. Policy-based archiving automatically places specified types of documents in easy-to-find archives, while audit logging ensures that there’s a trail showing exactly when and how data has been accessed.
- How secure does your data need to be? Cold storage services are typically highly durable, storing data redundantly across multiple machines and locations. Depending on your business needs, you may also need specific measures to protect your data, including 256-bit encryption and hashing.
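The automated tiering mentioned in the list above can be pictured as a simple rule based on how long an object has sat idle. The sketch below is only an illustration: `StoredObject`, `choose_tier`, and the 90-day threshold are hypothetical and do not correspond to any provider's API or default policy.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class StoredObject:
    key: str             # object name or path
    last_accessed: date  # most recent read of the object


def choose_tier(obj: StoredObject, today: date, cold_after_days: int = 90) -> str:
    """Return "cold" if the object has been idle for at least `cold_after_days`, else "hot"."""
    idle_days = (today - obj.last_accessed).days
    return "cold" if idle_days >= cold_after_days else "hot"


if __name__ == "__main__":
    today = date(2024, 6, 1)
    objects = [
        StoredObject("logs/january.gz", date(2024, 1, 1)),
        StoredObject("reports/current.csv", date(2024, 5, 15)),
    ]
    for obj in objects:
        print(obj.key, "->", choose_tier(obj, today))
```

A real policy would likely also consider data type and compliance requirements, but the core decision is usually this kind of threshold applied on a schedule.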
To illustrate how these considerations play out, let’s briefly compare a few of the most widely used cold storage options from Amazon, Google, and Microsoft.
Amazon Glacier
Amazon’s Glacier product is the industry leader in cold storage. The way it works is simple: In exchange for retrieval times of 3-5 hours, Glacier will store your data for a fraction of the cost of storing it on S3. Another advantage of Glacier is a robust set of additional features, including automated tiering and the ability to create custom access policies that only permit users to access a certain amount of data per day in order to avoid access charges. Of the major cold storage options, Amazon Glacier’s pricing and storage options may make it most appealing to small and mid-sized organizations that need an economical cold storage option but aren’t necessarily working with petabytes of data.
Google Nearline
Google Nearline offers slightly lower availability and higher latency than standard cloud storage in exchange for significant cost savings. When you start pulling data from Google Nearline, your download will begin in 2-5 seconds, much quicker than with Amazon Glacier. However, the speed at which your data is pulled depends on how much storage you’ve bought with Nearline. The result is that for customers with only a few terabytes of data stored on Nearline, the total time required to actually retrieve their data can end up being dozens of hours, compared with 3-5 hours for Amazon Glacier. This can make Nearline a more appealing option for enterprise-level customers and data-heavy operations that will benefit from the increases in access speed that come from storing truly massive amounts of data on the service.
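The relationship between stored volume and retrieval time can be sketched with some simple arithmetic. The model below assumes that read throughput scales linearly with the amount of data stored. The function name and the rate of 4 MB/s per stored terabyte are illustrative assumptions, not figures taken from this article or from current pricing documentation.

```python
def estimated_retrieval_hours(retrieve_tb: float, stored_tb: float,
                              mb_per_sec_per_tb: float,
                              first_byte_seconds: float = 0.0) -> float:
    """Estimate hours needed to pull `retrieve_tb` when throughput scales with `stored_tb`."""
    if stored_tb <= 0 or mb_per_sec_per_tb <= 0:
        raise ValueError("stored_tb and mb_per_sec_per_tb must be positive")
    throughput_mb_per_sec = stored_tb * mb_per_sec_per_tb
    retrieve_mb = retrieve_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = first_byte_seconds + retrieve_mb / throughput_mb_per_sec
    return seconds / 3600


if __name__ == "__main__":
    ASSUMED_RATE = 4.0  # hypothetical MB/s of throughput per TB stored
    for stored in (3, 300):
        hours = estimated_retrieval_hours(1, stored, ASSUMED_RATE, first_byte_seconds=5)
        print(f"1 TB out of {stored} TB stored: about {hours:.2f} hours")
```

Under these assumptions, pulling 1 TB from a 3 TB archive takes roughly 23 hours, while pulling the same 1 TB from a 300 TB archive takes about a quarter of an hour. This is why the first-byte latency matters much less than the throughput for small customers.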
Microsoft Cool Blob Storage
Microsoft’s entry in the cold storage department is really more of a “cool” storage option compared to Glacier and Nearline. Its access speeds are more akin to a hot storage option, coming in at mere milliseconds. The difference here has to do with access costs: While Cool Blob Storage is less than half the price of Microsoft’s hot data storage, it comes with higher access costs, meaning it’s truly optimized for less frequently accessed data.
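The trade-off between cheaper storage and pricier access comes down to a break-even point: the monthly read volume at which the extra access charges cancel out the storage savings. The sketch below shows that calculation. The function name and all prices are hypothetical placeholders, not Microsoft's actual rates.

```python
def break_even_read_gb(stored_gb: float,
                       hot_storage_price: float, cool_storage_price: float,
                       hot_read_price: float, cool_read_price: float) -> float:
    """Monthly GB read at which the cool tier costs the same as the hot tier.

    Storage prices are per GB stored per month; read prices are per GB read.
    """
    extra_read_cost = cool_read_price - hot_read_price
    if extra_read_cost <= 0:
        raise ValueError("cool tier must have a higher per-GB read price than hot")
    storage_savings = stored_gb * (hot_storage_price - cool_storage_price)
    return storage_savings / extra_read_cost


if __name__ == "__main__":
    # All prices below are hypothetical, in dollars.
    threshold = break_even_read_gb(
        stored_gb=10_000,
        hot_storage_price=0.02, cool_storage_price=0.01,
        hot_read_price=0.0, cool_read_price=0.01,
    )
    print(f"Cool storage is cheaper while monthly reads stay below {threshold:,.0f} GB")
```

If expected reads sit well below the threshold, the cool tier is the cheaper choice. If they approach or exceed it, the data is probably still warm enough to stay in hot storage.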
No matter what option you go with, know that a successful cold storage solution should be integrated with an overall data strategy. To learn more about building a comprehensive data solution, check out our guide to moving to the cloud.