In Part I of this blog series, we left off wondering what was going to happen to our CDM-MDM system once we start flowing Big Data – billions of records and petabytes of data – through it. Our matching routines are already computationally intensive – is this going to put them over the edge and grind the system to a halt?
The key here is data classification. In the last post, we classified “attributive” or “profile” data separately from “behavioral” data and asserted that the Big Data almost always falls into the behavioral bucket. Let’s take a closer look at these two classifications.
Profile Data
- Entities & Attributes (e.g. customers, households, business locations)
- Personally Identifying Information (PII)
- Contains “Source Natural Key” (e.g. cookie-based visitor ID, cell phone #, device ID, account #)
- Structured Only
Behavioral Data
- Transactions & Interactions (e.g. web clicks, page visits, ad impressions, mobile calls & data usage)
- No Personally Identifying Information (PII)
- Contains “Source Natural Key” (same as above)
- Structured and Unstructured
Importantly, the data in the behavioral bucket requires different processing than that in the profile bucket. Since the processing is different, the two streams can be separated just after ingestion, like a fork in the road, with the profile data going one way and the behavioral data going the other. This is the key to integrating Big Data into your CDM-MDM system without grinding it to a halt. Let’s take a closer look at the types of processing appropriate for these two classifications of data.
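To make the fork in the road concrete, here is a minimal sketch in Python. It assumes each ingested record carries a `record_type` field. That field and the type names below are hypothetical, chosen only for illustration, and a real system would classify records however its sources allow.

```python
# Hypothetical record types for each bucket (assumptions, not a standard).
PROFILE_TYPES = {"customer", "household", "business_location"}
BEHAVIORAL_TYPES = {"web_click", "page_visit", "ad_impression", "mobile_usage"}


def fork(records):
    """Split ingested records into a profile stream and a behavioral stream."""
    profile, behavioral = [], []
    for record in records:
        kind = record["record_type"]
        if kind in PROFILE_TYPES:
            profile.append(record)      # goes on to CDM-MDM matching
        elif kind in BEHAVIORAL_TYPES:
            behavioral.append(record)   # goes on to Big Data processing
        else:
            raise ValueError(f"unclassified record type: {kind!r}")
    return profile, behavioral


ingested = [
    {"record_type": "customer", "natural_key": "acct-1001", "name": "Pat Doe"},
    {"record_type": "web_click", "natural_key": "visitor-123", "url": "/home"},
    {"record_type": "web_click", "natural_key": "visitor-123", "url": "/plans"},
]
profile_stream, behavioral_stream = fork(ingested)
```

Only the records in `profile_stream` would ever reach the expensive matching routines. The far larger `behavioral_stream` bypasses them entirely.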
Now, to be fair, the two streams aren’t completely independent. The behavioral stream will typically require two things from the profile stream, both of which can be considered “reference” data.
Dimension Tables
For example, the “subscriber” dimension table may be required in the Big Data world so that it can be joined to the “web clicks” table in order to aggregate web clicks by subscriber gender, which exists only in the subscriber table.
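The subscriber example can be sketched in a few lines of Python. The table contents and the field names (`subscriber_id`, `gender`, `url`) are made up for illustration. At real volumes this join would run in whatever engine hosts the behavioral data, not in in-memory lists.

```python
from collections import Counter

# Hypothetical "subscriber" dimension table, shadowed from the profile side.
subscribers = [
    {"subscriber_id": "S1", "gender": "F"},
    {"subscriber_id": "S2", "gender": "M"},
]

# Hypothetical "web clicks" fact table from the behavioral side.
web_clicks = [
    {"subscriber_id": "S1", "url": "/home"},
    {"subscriber_id": "S1", "url": "/plans"},
    {"subscriber_id": "S2", "url": "/home"},
    {"subscriber_id": "S9", "url": "/home"},  # no matching subscriber row
]


def clicks_by_gender(clicks, subscriber_rows):
    """Join clicks to the subscriber dimension and count clicks per gender."""
    gender_by_id = {row["subscriber_id"]: row["gender"] for row in subscriber_rows}
    counts = Counter()
    for click in clicks:
        # Clicks with no subscriber row are counted under "unknown".
        counts[gender_by_id.get(click["subscriber_id"], "unknown")] += 1
    return dict(counts)


result = clicks_by_gender(web_clicks, subscribers)
```

The click records never carry gender themselves. It arrives only through the join, which is why the dimension table has to be available on the Big Data side.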
Master ID-to-Natural Key Cross-References
Master IDs are created and managed in the CDM-MDM world, but they are often needed for linkage and aggregation in the Big Data world. Shadowing cross-references into the Big Data world solves this problem: each cross-reference maps a master ID, such as the master individual ID, to a “source natural key” (more on this below).
So the two classifications of data are separated into two streams and processed (mostly) independently. How do they come back together? One assumption behind this architecture is that both streams, profile and behavioral, contain a “source natural key.” This is a unique identifier that relates the two streams. For example, web clickstream data typically has an IP address or a web application-managed, cookie-based “visitor ID.” Transactional data typically has some sort of account number. Mobile data will have a phone number or device ID. These identifiers don’t have to mean anything, per se, but are critical for stitching the two streams back together.
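As a rough illustration of the stitching step, the sketch below uses a cross-reference keyed by source system and source natural key to roll behavioral events up to master IDs. The source names, key values, and master ID format are all invented for the example. An actual cross-reference would be exported from the CDM-MDM system.

```python
from collections import defaultdict

# Hypothetical cross-reference shadowed from the CDM-MDM world:
# (source system, source natural key) -> master individual ID.
xref = {
    ("web", "visitor-123"): "M-001",
    ("mobile", "555-0100"): "M-001",   # same person, different source
    ("web", "visitor-456"): "M-002",
}

# Hypothetical behavioral events, each carrying only its natural key.
events = [
    {"source": "web", "natural_key": "visitor-123", "action": "click"},
    {"source": "mobile", "natural_key": "555-0100", "action": "call"},
    {"source": "web", "natural_key": "visitor-456", "action": "click"},
    {"source": "web", "natural_key": "visitor-999", "action": "click"},
]


def events_per_master(event_rows, cross_reference):
    """Count events per master ID; return events with no cross-reference separately."""
    totals = defaultdict(int)
    unmatched = []
    for event in event_rows:
        master_id = cross_reference.get((event["source"], event["natural_key"]))
        if master_id is None:
            unmatched.append(event)    # natural key not yet mastered
        else:
            totals[master_id] += 1
    return dict(totals), unmatched


totals, unmatched = events_per_master(events, xref)
```

Here the web visitor and the mobile number resolve to the same master individual, so their activity is aggregated together even though the behavioral records themselves contain no PII.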
Remember from the previous post that it’s not just the dimensionalized, aggregated data that is reunited with the profile data, but also high-value behavioral analytics attributes (predictive scores, micro-segmentations, etc.) created courtesy of Big Analytics. The profile data is now greatly enriched by the output of the Big Data processing stream. And, if we really want to get crazy, we can consider using these enriched behavioral analytics attributes as part of the next cycle of matching: similar, complex behavior patterns can help tip the scales, causing two entities to match that might not otherwise have matched. Pretty cool.
So CDM-MDM and Big Data can live together, in harmony. Big Data doesn’t replace CDM-MDM but rather extends it. Quite nicely, in fact.