
You’ve undoubtedly heard all the hype about the volume of data doubling every two years. The sad fact is that, first, it’s true and, second, 50% of that data is coming from external (read: less controllable) sources. That means it’s getting harder and harder to see the forest (read: business insight) for the trees. Wouldn’t it be great if you had a data catalog that could show you exactly what data you have across your entire organization?

A Wall Street Journal article estimated that 50% to 80% of the time on any data project is spent on data discovery and data prep. I would estimate, based on talking to customers, that the great majority of that time is spent just on data discovery. It’s a fact: the first step in any data-driven digital transformation project is finding the data you will need to drive the new analytics and new business processes that are core to the success of that project.

Let’s look at a few examples. Customers tend to talk of the “journeys” they are on as they build the data strategy and architecture that will fuel their digital transformation initiatives, so let’s discuss it that way:

  • The Journey to Data Lakes:
    Nobody has achieved success by throwing all their data into a data lake and hoping for an amazing insight to emerge. But I have certainly seen organizations try. The more likely path to success is to pick a functional area (like Marketing) and an important business problem to solve (like optimized lead conversion, improved wallet share, etc.). The next step is to identify and gather all the data that is relevant to that specific problem and ingest it into the data lake. This will very likely be a combination of internal structured data, internal unstructured data (such as weblogs), and a wide variety of external data from third parties and partners. The question is: how will you find this data? It is scattered all over your organization.
  • The Journey to Cloud:
    A lot of organizations take the approach that they will simply start a new CRM cloud application, for example, by loading the relevant data from their current on-premises CRM system, and then, voila! They are ready to go. Unfortunately, if organizations stop at this point they will never get full value from their new CRM systems. Sure, a good starting point is to migrate your current CRM data over to the new system, but there is a lot of data from other systems that would be relevant in this environment. The biggest examples would be customer data and data from marketing automation systems. There also needs to be a data flow in the other direction. It is probably very important to sync data in the new cloud CRM system with other transactional and analytical systems, both on premises and in the cloud, on a regular basis. Suddenly, with a relatively simple example, we have a fairly complicated bi-directional data synchronization requirement between this cloud CRM system and multiple cloud and on-premises applications. How will you determine what data you have and what data is important to keep in sync across your very distributed environment?
  • The Journey to Enterprise Data Governance:
    So, let’s say that you are a new data steward for the Marketing function. Congratulations! Do you know where the data you are trying to manage is located? Or, take expensive marketing research: Do you know who has this data in your organization? Do you know if multiple people are buying this research without knowing whether other employees already have it? It’s like they say in baseball, “you can’t tell the players without a scorecard.”
  • The Journey to Modern Data Integration:
    This is a pretty common situation: Organizations have accumulated dozens or hundreds of systems over decades and many M&A transactions. After a while it looks like a technology history museum. And all of it is cobbled together with a collection of point-to-point data integrations that the business depends on as the “currency” that runs the business. How will you make a requested change to this complex environment?
    • To consolidate on a modern application?
    • To augment a data warehouse with a data lake and machine learning?

Touching anything could cause a major failure to the overall environment. You need to be able to start by figuring out what data the new systems require, where it resides today, and how to get it to where it needs to be in order to provide value to the organization – without disrupting the ongoing business.
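As a rough illustration of the "where does it reside today" question, here is a minimal sketch of the kind of lookup a catalog makes possible. The system names and field names are hypothetical, and a real catalog would gather this inventory by scanning systems rather than from a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical inventory: which fields each system holds.
# In practice a catalog would discover this by scanning metadata.
system_fields = {
    "onprem_crm": ["customer_id", "email", "account_owner"],
    "cloud_crm": ["customer_id", "email", "lead_source"],
    "marketing_automation": ["email", "campaign", "lead_source"],
}


def build_field_index(inventory):
    """Invert the inventory: map each field name to the systems that hold it."""
    index = defaultdict(set)
    for system, fields in inventory.items():
        for field in fields:
            index[field].add(system)
    return index


def locate(required_fields, index):
    """For each required field, list the systems holding it (empty if none)."""
    return {field: sorted(index.get(field, set())) for field in required_fields}


index = build_field_index(system_fields)
report = locate(["email", "campaign", "churn_score"], index)
for field, systems in report.items():
    print(field, "->", systems or "not found anywhere")
```

A field that shows up in several systems raises the duplicate-version question, and a field that shows up nowhere is a gap to close before the new system can deliver value.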

And the data management challenge does not stop there.

  • Can you find the data? It may very well be in multiple systems, cloud and on-premises. Do you have duplicate versions of the data? Which version do you choose?
  • Do you trust the data? Remember that a lot of that CRM data is typed in by sales reps. Just how accurate, fresh and complete is that data? You will also need to think about some data cleansing.
  • Suppose you want to do some analytics. That is undoubtedly going to require joining two or more tables together and anybody who has tried to do that will know how complicated finding and using the right keys can be.
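The three questions above can be sketched as a few simple checks. This is only an illustration under stated assumptions: the records, field names, and cutoff date are made up, and a real project would run equivalent checks against the actual source systems.

```python
from datetime import date

# Hypothetical sample records from two systems; all field names are illustrative.
crm_accounts = [
    {"account_id": "A1", "email": "ann@example.com", "updated": date(2018, 1, 5)},
    {"account_id": "A2", "email": None, "updated": date(2016, 3, 9)},
    {"account_id": "A2", "email": "bob@example.com", "updated": date(2017, 11, 2)},
]
marketing_leads = [
    {"account_id": "A1", "campaign": "spring"},
    {"account_id": "A3", "campaign": "fall"},
]


def duplicate_keys(rows, key):
    """Return the key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def completeness(rows, field):
    """Fraction of rows where the field is populated."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def stale_rows(rows, field, cutoff):
    """Rows last updated before the cutoff date."""
    return [row for row in rows if row[field] < cutoff]


def key_overlap(left, right, key):
    """Share of distinct left-hand keys that also exist on the right."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    if not left_keys:
        return 0.0
    return len(left_keys & right_keys) / len(left_keys)


print("duplicate account ids:", duplicate_keys(crm_accounts, "account_id"))
print("email completeness:", completeness(crm_accounts, "email"))
print("stale rows:", len(stale_rows(crm_accounts, "updated", date(2017, 1, 1))))
print("join key overlap:", key_overlap(crm_accounts, marketing_leads, "account_id"))
```

A low key overlap is an early warning that a join on that column will silently drop records, which is much cheaper to learn before the analytics are built than after.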

Data-driven digital transformation isn’t just about shiny new business models. It’s also about speed. A 2017 KPMG study found that “Speed to market” was the #1 CEO priority. So, how exactly do you achieve speed?

It can’t just be IT doing all the data management work we have described above. IT doesn’t have the resources or bandwidth to scale up for all the new initiatives that are coming its way. The focus on digital transformation means that the demand for trustworthy and timely data has never been higher. And worse, IT often lacks the business context needed to understand what the data means and which data is relevant.

What we are looking for here is a way for business people to discover, manage, and use data quickly and effectively. That takes an enterprise-class data catalog solution. What should you look for in a data catalog?

  • Enterprise Visibility:
    It must provide enterprise-wide visibility into ALL of your data. A great solution for only structured data or only cloud data (for example) will not solve the problem.

  • Ease of Use:
    It must be easy for business users to use. It must enable business analysts, data analysts, data stewards, and others to self-serve their data needs. Specifically, users of Tableau, Qlik, or MicroStrategy should be able to self-serve their data for analytics use without IT assistance.
  • Productive:
    It must be intelligent. It has to automate routine activities to make people more productive and willing to take on data tasks. More importantly, it has to provide intelligent recommendations. How do you get people to use prepared data sets that already exist instead of reinventing the wheel by doing the work from scratch? Experience has shown that people do not have much patience for searching for existing work to re-use, but if that existing work is proactively offered to them as an intelligent suggestion, it is an entirely different matter.
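To make the idea of a proactive suggestion concrete, here is a minimal sketch that ranks existing prepared data sets by how closely their columns match what a user is asking for. It uses plain Jaccard similarity on column names as a stand-in; this is an assumption for illustration, not how any particular catalog product works, and the data set names and columns are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of column names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical catalog of prepared data sets and their columns.
catalog = {
    "marketing_leads_clean": ["lead_id", "email", "campaign", "score"],
    "crm_accounts_master": ["account_id", "email", "owner", "region"],
    "web_sessions_daily": ["session_id", "page", "duration"],
}


def suggest(wanted_columns, catalog, min_score=0.2, limit=3):
    """Return names of existing data sets most similar to the requested columns."""
    scored = [(jaccard(wanted_columns, cols), name) for name, cols in catalog.items()]
    # Keep reasonable matches only; sort by score descending, then by name.
    ranked = sorted(
        (item for item in scored if item[0] >= min_score),
        key=lambda item: (-item[0], item[1]),
    )
    return [name for _, name in ranked[:limit]]


suggestions = suggest(["email", "campaign", "score"], catalog)
print(suggestions)
```

The point of the design is that the user never has to search: the moment they describe what they need, the closest existing work is put in front of them.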