Before your data can be analyzed, you have to get it into your system in a form that your analytics software can understand. That process is typically called Exchange-Transform-Load (ETL). ETL tools have been a key part of data pipelines for decades, but recent advances in data analytics and storage mean that ETL isn’t the only game in town anymore. In this article, we’ll break down the ETL process and explain how cloud services are changing the way teams ingest and process analytics data at scale.
First Things First: What Is ETL?
Before we get into the technologies needed to build an ETL pipeline, we should clarify what we talk about when we talk about ETL. ETL is less a technology or discrete set of processes than a set of related interrelated problems related to preparing data for analysis. Getting your data in one place may seem like an obvious and straightforward problem, but once you get into the details of implementation, it can become very complex very quickly. To give you a taste of some of the considerations involved, let’s quickly run through the basics of the ETL process.
First, data has to be extracted from a number of (typically) heterogenous data sources. These sources can include business systems, APIs, sensor data, marketing tools, and transaction databases, among others. As you can see, some of these data types are likely to be the structured outputs of widely used systems, others may be semi-structured JSON server logs. Furthermore, different sources are likely to produce vastly different amounts and rates of data, all of which will have to be managed by your system. It may also have to validate incoming data and reject anything that doesn’t match expected values.
Additionally, the data needs to be transformed into a format that can be read by analytics tools (like SQL, for example). As we’ve seen, the types of data can vary widely, and the degree and kind of transformations necessary will depend greatly on the kinds of data you’re working with as well as your particular analytics needs. Some common transformations include reformatting and recoding data from different systems (for instance, making sure information like dates are in a consistent format), splitting or joining different tables, calculating new values from the ingested data, and assigning key-values so that your analytics software can quickly locate relevant columns.
Finally, the transformed data must be loaded into the data warehouse where it will eventually be queried. As with the other parts of the process, when and how this loading is performed will depend on your business needs. In some cases loading is done in a batch fashion, other times it’s loaded in as it’s ready. There are also ETL setups that load data into the target warehouse before performing any transformations in order to take advantage of in-memory processing.
Keep in mind also that this is just the core of ETL. In practice, there a number of smaller steps that are also important to the process, including building reference data, staging data in special staging tables, generating audit reports, etc. On top of managing the above processes, scalability is a major concern, especially when dealing with data at the tera- and petabyte scale.
Batches, Streams, and Hadoop
As Hadoop has become synonymous with big data, a number of Hadoop-based tools have been developed to handle various aspects of ETL. Depending on whether you’re dealing with batches or streams of data, you’ll either be looking at a combination of Sqoop and MapReduce, or Flume and Spark.
Sqoop is the closest Hadoop tool to a traditional ETL solution. It uses a system of connectors to move large batches of structured data between relational databases (like MySQL, Postgres, Oracle, etc.) and Hadoop. In practice, that means Sqoop can extract data from its source, transform it using MapReduce, and then load it into the target data warehouse. By leaning on MapReduce, Sqoop provides a high degree of parallelism and fault tolerance, which can have major advantages when you’re working with particularly large or sensitive sets of data.
But what if batch processing doesn’t really work for your needs? If you need to be able to process multiple fast-moving data streams (e.g. sensor data, application logs, real-time transaction data), you’ll probably need to turn to a combination of Flume and Spark. Where Sqoop is oriented around batches of data, Flume is event-driven, meaning data is processed as it’s created. Flume is based around a system of “agents” configured to watch for certain events, which they can then capture and direct into the relevant channels so that they can be written to their final destinations. Think of Flume as a combination of box office and ushers: It makes sure every piece of data has a ticket and gets to its seat. In this case, they’ll be directed into Spark for near real-time processing.
ETL and the Cloud
As more and more data operations move to the cloud, the whole concept of data ingestion and processing is starting to change. With nearly endless virtual storage and computing resources, cloud service providers can abstract away a lot of the technical aspects of data management. In some cases, that means they can automate some or even all of the ETL process so that you can focus on querying data rather than optimizing your pipeline. For instance, Google BigQuery can automatically convert entities from Cloud Datastore into recognized data types.
Even more exciting, both Amazon Redshift and BigQuery actually allow you to query data outside the warehouse without worrying about ETL at all. In the case of Redshift, you can query data directly from S3, while BigQuery lets you directly query data in Google Cloud Storage and even shared or public datasets. Where traditional data pipelines require data to be transformed before it can be queried, Amazon and Google are taking a different approach: optimizing queries and leveraging their massive computing resources to make sure those queries are executed as fast as possible–even when you’re using SQL to query semi- or unstructured data.
Want to learn more about BigQuery and Redshift? Check out our article comparing some top cloud-based data warehouses.