Building a properly configured data warehouse is critical to a successful analytics operation. A well-designed data warehouse can allow data analysts (or anyone else with basic SQL skills) to execute lightning fast queries on data that’s being processed at near real-time speeds. By contrast, a poorly configured or inadequate data warehouse can be a source of endless problems, potentially putting your organization at a competitive disadvantage.
As more and more infrastructure moves to the cloud, the technology used to power data warehouses has changed. Cloud-based warehouses solve many of the problems of physical infrastructure, but they present their own tradeoffs when it comes to performance, cost, and ease of use. So how are you supposed to know which one is right for your organization?
Today, we’re going to look at the role of data warehouses in a data-driven organization, what sets cloud-based warehouses apart, and a few of the major options.
Data Warehouse Basics
A data warehouse is different from an operational database in that it’s built to facilitate the analysis of historical data as opposed to handling transactions. This means that data warehouses tend to be orders of magnitude larger than their corresponding operational databases. That also means that certain critical aspects of operational databases–ACID compliance and fast write times, for example–may not be as relevant. Instead, data warehouses have their own requirements. For our purposes, the most important ones are:
- How does it ingest data? Getting data into the warehouse and properly formatted is often one of the biggest challenges with data warehouses. Traditionally, data warehouses have relied on batch-processing Extract-Transform-Load jobs. ETL jobs are still important, but now there are also streaming data ingestion options and even services that allow you to run SQL queries against unstructured data that’s not even in the warehouse.
- How does it handle ad hoc queries? Hadoop’s MapReduce paradigm is a powerful tool for processing huge amounts of data, but the batch method makes it impractical for ad hoc queries. Many data warehouses turn to Massive Parallel Processing (MPP), a MapReduce-like architecture that spreads out queries across multiple high-end processors. Other alternatives include Apache Spark, which uses in-memory processing and a novel data structure to reach processing speeds that are orders of magnitude faster than MapReduce. A final consideration is what language it uses to execute queries. Some support standard SQL, while others rely on dialects that may support custom commands and disallow others.
- How does it scale? All data warehouses are meant to handle large volumes of data, but different warehouses are often optimized for clusters of different sizes. In cloud-based warehouses, a major consideration is how quickly and easy it is to provision new resources, and whether computing and storage resources can be provisioned separately.
- How much maintenance does it need? Whether they’re on premises or in the cloud, data warehouses are complex systems that often require a fair amount of regular maintenance that only an experienced database administrator can provide. Different data warehouses require different levels of attention–data may need to be cleaned and resorted, compression settings may need to be adjusted, data may need to be moved between disks. Depending on how these tasks are managed it could also lead to significant downtime.
With those characteristics in mind, let’s take a look at some popular data warehouse options, from the most customizable to the most managed.
Redshift is truly a virtual version of a traditional data warehouse. Database administrators provision resources just like they would in an on-premise warehouse–they choose the type of node, the number of nodes per cluster, the locations of those clusters, and the type of network connection. For teams that want to optimize their data warehouse to get the maximum performance possible, this level of customizability can be very attractive. It also integrates easily with other AWS components, making it extra attractive to teams already using S3 or EC2.
- Loading Data: As you might expect, Redshift can load structured data from many AWS sources (S3, DynamoDB, etc.) as well as an SSH-enabled host on premises. Redshift can also ingest streaming data via Firehose. A downside? Redshift only supports structured data.
- Query Processing: In many head-to-head tests, Redshift has been shown to outperform other cloud-based options, though achieving the highest possible speeds may involve a fair amount of tweaking and customization. Another cool feature of Redshift is Spectrum, which allows you to run queries against data stored in S3 without performing any ETL operations. One downside: Redshift queries are written in a dialect of PostgreSQL that doesn’t support a number of data types and functions.
- Scalability: Redshift clusters need to be manually spun up and shut down via the AWS Management Console. It’s important to make sure you shut down clusters that you’re not actively using to avoid extra charges.
- Support and Maintenance: Redshift monitors all system components for failures and recovers them automatically. Other maintenance tasks fall on the user.
BigQuery is at the opposite end of the customizability/ease-of-use spectrum. The heart of BigQuery is a novel query engine built on Google’s Dremel project. Like Hive, Dremel sits on a file management system and translates SQL queries into low-level instructions. Dremel’s approach to queries allows it to pore through billions of rows of data in just a few seconds, making it especially adept for ad hoc queries. The other thing that sets BigQuery apart is that it totally abstracts away the provisioning, assigning, and maintaining of resources, all of which is taken care of automatically by Google. This makes BigQuery ideal for small organizations or teams that prioritize ease of use over maximum performance.
- Loading Data: Data can be loaded from Google Cloud Storage, readable data sources, or streamed in. Google handles the formatting and processing of data–you just connect the source. Even cooler, BigQuery allows you to query external data sources including Google Cloud Storage and Google Drive. (You can even query shared and public datasets without loading them at all.)
- Query Processing: Running queries in BigQuery is pretty straightforward. BigQuery supports standard SQL, as well as some semi-structured data types like JSON and CSV. One consideration is that BigQuery limits the number of queries you can run concurrently–fortunately, it also includes features that allow you to prioritize some queries over others.
- Scalability: Scalability isn’t really a concern with BigQuery. All storage is handled by Google’s Colossus file management system, while Google’s Borg cluster-management system handles computing, assigning jobs automatically to Google’s store of machines, consisting of hundreds of thousands of CPUs.
- Support and Maintenance: Because Google abstracts away all aspects of resource provisioning and data management, there’s no need for periodic maintenance. In most cases, this is an advantage, since you won’t need to perform maintenance tasks in order to maintain high performance. On the other hand, it also means you easily amend or delete old data–you’ll have to truncate and recreate the whole table with new data.
In between the customizability of Redshift and the ease of BigQuery there’s Snowflake. In essence, Snowflake is a custom query engine and data storage format built on top of AWS architecture: Storage is handled by S3, while computing is taken care of by EC2. Once you load data into Snowflake, that’s it–they handle organization, compression, metadata, encryption, etc. The only way you can access your data after that is through queries. Snowflake processes queries across “virtual warehouses,” which are clusters of massively parallel processing (MPP) nodes provisioned from EC2. Each query is assigned to its own warehouse, meaning that complex or computationally expensive queries running in one warehouse won’t affect the performance of your other warehouses. Even better, these warehouses are highly elastic, meaning they can provision additional computing resources automatically.
- Loading Data: A key feature of Snowflake is its native support for both structured and semi-structured data, which it can ingest without a complex ETL pipeline or preprocessing. The advantage here is that it preserves the flexibility of semi-structured data without sacrificing performance when it comes to queries. Though Snowflake doesn’t come with support for streaming data out of the box, it does allow you to connect to Spark to ingest streaming data.
- Query Processing: Like BigQuery, Snowflake uses standard SQL, but with some added extensions that allow you to query both structured and semi-structured data at the same time. Some tests and anecdotal evidence have shown that Snowflake performs best with smaller datasets (think gigabytes vs petabytes), which could be a consideration for organizations that don’t need to crunch through the biggest datasets in the world.
- Scalability: Similar to BigQuery, Snowflake touts its ability to scale up and down seamlessly. Computing and storage are de-coupled, allowing these resources to be distributed more efficiently. That said, as in Redshift, you can still manually provision larger or smaller virtual warehouses depending on the complexity of your queries.
- Support and Maintenance: Snowflake handles maintenance, updates, and security. There’s no need to maintain indexes, clean files, or update metadata.
Questions to Ask When Choosing a Data Warehouse
While all of the data warehouses we’ve looked at today are powerful tools used by an array of large and small organizations, the best one for your team will depend on your specific needs. Here are some brass tacks considerations to take into account:
- Are you already committed to a cloud ecosystem? If you’re already a strong user of AWS or Google Cloud Services, it’s probably easiest to just go with the data warehouse that natively integrates with your system. Remember, Snowflake is built on top of AWS infrastructure so it may be a good alternative for AWS users who want a more managed experience than Redshift offers.
- How many queries do you need to be able to run at a time? Does your team need to be able to run just a few queries an hour? If so, a query-based model like BigQuery might be a good fit. If you need 24/7 availability, on the other hand, Redshift might make more sense.
- How large are your data needs? If your databases are in the gigabyte range, Snowflake might be an attractive option. For petabyte scale operations, though, you’ll probably want to go with BigQuery or Redshift.
- What kind of talent do you need? All data warehouses need skilled analysts with strong SQL skills, but if you’re running Redshift you will likely also need an experienced database administrator to fine tune and maintain your operation for maximum performance.