When attempting to understand the concept of Big Data, the words MapReduce and Hadoop cannot be avoided.
Although you might know the basic idea of each, digging a little deeper might help create a better understanding of how each works in relation to Big Data. Here, we will review MapReduce.
MapReduce is a programming model for processing data, which consists of two parts: mapping and reducing. When I explain MapReduce, I paint a visualization of a funnel (others use an organizational chart, but to me, a funnel is a better representation).
Think of the top of the funnel as a place to collect data [the input files]. The data is stored in the funnel until a query is submitted.
When the query is submitted, map tasks are created. Map tasks review and sort through the input files to locate the requested information. This is considered the first part – mapping.
The results from the mapping process are sent through the funnel to the reduce part, where the information is aggregated and the requested information is written out as the output.
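To make the funnel a little more concrete, here is a minimal sketch of the map-then-reduce flow in plain Python. It runs in a single process and only imitates the idea; it is not how Hadoop itself is implemented. The names run_mapreduce, map_words and reduce_sum, and the tiny word-counting input, are made up for illustration.

```python
from collections import defaultdict


def run_mapreduce(input_files, map_fn, reduce_fn):
    """Single-process imitation of the map -> group -> reduce flow."""
    grouped = defaultdict(list)
    # Map phase: each input file is handled by its own "map task",
    # which emits (key, value) pairs.
    for records in input_files:
        for key, value in map_fn(records):
            # Grouping step: collect all values that share a key.
            grouped[key].append(value)
    # Reduce phase: aggregate the values collected for each key.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def map_words(records):
    """Hypothetical map function: emit (word, 1) for every word seen."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def reduce_sum(key, values):
    """Hypothetical reduce function: add up the values for one key."""
    return sum(values)


# Two small "input files" sitting at the top of the funnel.
input_files = [
    ["big data big funnel"],
    ["data in the funnel"],
]

result = run_mapreduce(input_files, map_words, reduce_sum)
print(result)
```

In a real cluster the map tasks would run in parallel on different machines; the loop here only stands in for that.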
Let’s review a simple example:
A consumer retail brand is looking to identify the most frequently purchased products (the top three) from a cross-section of customers as part of a market research initiative focused on merchandising. Let’s say they are looking for data on women within a specific geographic area, which is information provided in each customer profile stored in their CRM database. There might be 2,000 women meeting the identified qualifications and therefore, this big data set needs to be sorted.
The input data for this query would be the profiles of the individual customers within the specifications. After the query is created and sent, the map tasks would each sort through a portion of the profiles, tally how often each product was purchased, and send those tallies to the reducer. The reducer would compare and aggregate the data generated from each map task and return an output file featuring the top three most frequently purchased products from the cross-section.
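The retail example can be sketched the same way. The code below is an illustration under assumptions, not the brand's actual system: the profile fields (gender, region, purchases), the region code "NE", the product names and the two chunks of profiles are all invented for the sketch.

```python
from collections import Counter


def map_profiles(profiles, region):
    """Map task: tally purchases for matching profiles in one chunk."""
    counts = Counter()
    for profile in profiles:
        # Keep only women in the requested geographic area.
        if profile["gender"] == "F" and profile["region"] == region:
            counts.update(profile["purchases"])
    return counts


def reduce_top_products(partial_counts, n=3):
    """Reducer: merge the tallies from every map task, return the top n."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total.most_common(n)


# Hypothetical CRM profiles, split into two chunks (one per map task).
chunk_a = [
    {"gender": "F", "region": "NE", "purchases": ["jeans", "scarf", "jeans"]},
    {"gender": "M", "region": "NE", "purchases": ["boots", "boots", "boots"]},
]
chunk_b = [
    {"gender": "F", "region": "NE", "purchases": ["jeans", "scarf", "boots", "hat"]},
    {"gender": "F", "region": "SW", "purchases": ["hat", "hat"]},
    {"gender": "F", "region": "NE", "purchases": ["jeans", "scarf", "boots"]},
]

partials = [map_profiles(chunk, "NE") for chunk in (chunk_a, chunk_b)]
top_three = reduce_top_products(partials)
print(top_three)
```

One design choice worth noting: in this sketch each map task passes its complete tally to the reducer rather than only its own top three. A product that sits in fourth place in every chunk could still make the overall top three once the chunks are combined, so trimming early could give the wrong answer.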
The MapReduce process is key in sorting through the big data that might be available when submitting a query. The goal is to create the most accurate output in the shortest amount of time.
…and there it is; the basics of MapReduce explained as simply as possible.