Businesses rely more than ever on predictive analytics to make critical decisions, from how banks price risk to how sales teams predict which customers are most likely to purchase. Predictions about the future are based on data patterns mined from a staggering variety of sources, spanning real-time web clicks, psycho-demographics, and years of accumulated transaction histories. Predictive analytics is a powerful business weapon, as long as the underlying data definitions and flows are well managed.

The trouble starts when the underlying data foundation for predictive analytics is not well maintained or does not account for future changes. Over time, source system changes, breakdowns in communication, and lack of maintenance erode information accuracy (the input) and diminish predictive insight (the output). The process for consistently delivering clean, relevant data can break down in three ways:

1. Changes in Raw Source Data

Data variables can drift from their originally understood value ranges and definitions, the result of either a process change or a breakdown in the source system. The change can be as subtle as a data type switching from integer to decimal, as drastic as a wholesale change in valid value ranges to accommodate new realities of the business, or, even worse, poorly maintained system changes that render the data completely useless. These are often caused by source system breakdowns (e.g., part of the process that builds the supplied data fails to run, and the delivered data is damaged).
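A lightweight guard against these silent changes is to validate each incoming extract against the data types and value ranges the model was built on before it ever reaches the scoring process. Below is a minimal sketch in Python with pandas; the column names, types, and bounds are illustrative assumptions, not a reflection of any particular source system.

```python
import pandas as pd

# Expected schema and valid ranges, as understood when the model was built.
# Column names and bounds are placeholders for illustration only.
EXPECTED = {
    "account_id": {"dtype": "int64"},
    "balance":    {"dtype": "float64", "min": 0.0, "max": 10_000_000.0},
    "risk_grade": {"dtype": "object", "allowed": {"A", "B", "C", "D"}},
}

def validate_extract(df: pd.DataFrame) -> list[str]:
    """Return human-readable problems found in a source extract."""
    problems = []
    for col, rules in EXPECTED.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != rules["dtype"]:
            problems.append(f"{col}: dtype is {df[col].dtype}, expected {rules['dtype']}")
        if "min" in rules and df[col].min() < rules["min"]:
            problems.append(f"{col}: value below expected minimum {rules['min']}")
        if "max" in rules and df[col].max() > rules["max"]:
            problems.append(f"{col}: value above expected maximum {rules['max']}")
        if "allowed" in rules:
            unexpected = set(df[col].dropna().unique()) - rules["allowed"]
            if unexpected:
                problems.append(f"{col}: unexpected values {sorted(unexpected)}")
    return problems
```

Any non-empty result is a reason to pause the model refresh and talk to the data supplier before scoring.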

2. Changes in Business Definition

These are typically changes in the calculations behind the data used in a predictive model. They resemble raw source data changes, but are driven by a change in business rules rather than by errors. For example, a bank builds a loan profitability model based on Revenue (Price), Cost of Funds, Interest, and Risk. The model is trained using the current definition of ‘Interest Charge’, which includes interest on fees. After the model has been in use for several months, the bank decides to separate out the interest-on-fees portion of the calculation to meet new government regulations. This is implemented by changing the data element ‘Interest_Charge’ to no longer include interest on fees and introducing a new element, ‘Interest_on_Fees’. The total interest charge is now ‘Interest_Charge’ + ‘Interest_on_Fees’. Without a change in the model to include the new data element, the model would score accounts based on an illusory drop in interest charge (since the interest on fees is now left out of the model’s input data). Left alone, the model would inadvertently become sensitive to accounts with fees, which may be interesting but may have nothing to do with the profitability of those accounts.
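To keep the model’s input consistent after a definition change like this, the feed into the model has to recombine the split elements (or the model has to be re-trained against the new definitions). A brief sketch of the recombination, using the element names from the example; the sample values and the ‘Total_Interest_Charge’ name are illustrative:

```python
import pandas as pd

# Illustrative account data after the regulatory change: interest on fees
# has been split out of Interest_Charge into its own element.
accounts = pd.DataFrame({
    "account_id":       [1001, 1002],
    "Interest_Charge":  [120.00, 80.00],  # no longer includes interest on fees
    "Interest_on_Fees": [15.00, 0.00],
})

# Rebuild the measure the model was trained on, so scores are not driven by
# an illusory drop in interest for accounts that carry fees.
accounts["Total_Interest_Charge"] = (
    accounts["Interest_Charge"] + accounts["Interest_on_Fees"]
)
```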

3. External Environmental Change

Changes in the environment include shifts in external economic factors, new government regulation, or advancements in technology that render the initial modeling assumptions incorrect. For example, imagine a model that predicts the relative likelihood of sales to existing customers and is weighted heavily toward employee growth (i.e., 20% growth is good, no growth is bad). Some time later, the economy enters a period of stagnation and very few companies are growing their employee base by 20%. The model needs to reflect this new reality: a company holding flat in job growth might now be a relatively good prospect.

6 Steps to Keep the Predictive in Predictive Analytics

In banking and other areas where customers are exposed to risk, regulations such as the Sarbanes-Oxley Act (SOX) are in place to provide protection. Most risk models must pass SOX testing of data and process maintenance to ensure modeling accuracy.

What should you do in the absence of government-mandated processes to ensure data accuracy?

1. Choose data for your model that is robust and consistently defined. Ask those closest to the data about any known plans for changes, and make sure they, too, have change management processes in place. Avoid data supplied by untested, undocumented processes.

2. Document the sources. Make sure that the sources for model inputs are well documented, with special attention to the highest-weighted variables. Whoever is charged with maintaining these models should know how to validate the source data and which sources may require more scrutiny for change over time.

3. Get on the distribution list for changes from any data suppliers (internal and external). Make sure the producers of the data used in your models know that you run a process that is sensitive to any changes in the data they provide, and that if they are planning changes, you are kept in the loop.

4. Test models for performance. Are they losing their predictive strength? Even if the data is well maintained, models age due to socio-economic factors and may need to be re-trained. Data quality issues can also show up as a sudden change in model performance beyond any ‘natural aging’ of the model.
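One way to put this check on a schedule (a sketch only, assuming a binary-outcome model whose realized outcomes eventually become available, and using AUC from scikit-learn as the performance metric):

```python
from statistics import mean, stdev
from sklearn.metrics import roc_auc_score

def flag_degradation(period_scores, period_outcomes, historic_aucs, threshold=2.0):
    """Compare this period's AUC to its history; return (auc, degraded_flag).

    period_scores   -- model scores for accounts scored this period
    period_outcomes -- realized 0/1 outcomes for the same accounts
    historic_aucs   -- AUC values observed in prior periods
    """
    auc = roc_auc_score(period_outcomes, period_scores)
    if len(historic_aucs) >= 3:
        floor = mean(historic_aucs) - threshold * stdev(historic_aucs)
        return auc, auc < floor
    return auc, False  # not enough history yet to judge degradation
```

A sharp, sudden drop points at a data problem; a slow slide points at model aging.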

5. Profile the input data. Even rudimentary profiling of the input data (flagging anything more than two standard deviations from the historic average, maximum, minimum, or number of NULLs for each critical data element) helps maintain its accuracy. Several vendors sell data profiling tools, and the DBMS you use often has built-in profiling capabilities. This can also be a simple set of scripts run on the source data as a first step in the update process.
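As a concrete illustration of what such a script might look like (a sketch; the critical column names and the shape of the stored history are assumptions):

```python
import pandas as pd

CRITICAL_COLUMNS = ["balance", "Interest_Charge", "employee_count"]  # illustrative

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Current-load statistics for each critical data element."""
    return pd.DataFrame({
        "avg":   df[CRITICAL_COLUMNS].mean(),
        "max":   df[CRITICAL_COLUMNS].max(),
        "min":   df[CRITICAL_COLUMNS].min(),
        "nulls": df[CRITICAL_COLUMNS].isna().sum(),
    })

def flag_outliers(current: pd.DataFrame, history: pd.DataFrame, n_std: float = 2.0):
    """Flag statistics more than n_std standard deviations from their history.

    `history` is assumed to be the profile() outputs of prior loads,
    concatenated with pd.concat so the index repeats one row set per load.
    """
    hist_mean = history.groupby(level=0).mean()
    hist_std  = history.groupby(level=0).std()
    deviation = (current - hist_mean).abs()
    return deviation > n_std * hist_std  # True marks a suspect statistic
```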

6. Check intermediate steps of the analytic process. Validate values between major data transformation steps: is the data within its expected ranges? Even just checking that row counts fall within expected bounds can catch many unexpected data changes.
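A few assertions between steps go a long way; here is a minimal sketch with illustrative bounds:

```python
import pandas as pd

def check_step(df, step_name, min_rows, max_rows, value_bounds=None):
    """Fail fast if a transformation step produces data outside expected bounds.

    value_bounds maps column name -> (low, high) for a simple range check.
    """
    if not (min_rows <= len(df) <= max_rows):
        raise ValueError(f"{step_name}: row count {len(df)} outside "
                         f"expected range [{min_rows}, {max_rows}]")
    for col, (low, high) in (value_bounds or {}).items():
        if not df[col].dropna().between(low, high).all():
            raise ValueError(f"{step_name}: {col} values fall outside [{low}, {high}]")

# Example check after a hypothetical join step:
joined = pd.DataFrame({"account_id": [1, 2, 3], "balance": [100.0, 250.5, 0.0]})
check_step(joined, "join_transactions", min_rows=1, max_rows=1_000_000,
           value_bounds={"balance": (0.0, 10_000_000.0)})
```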

If you are making an investment in predictive analytics, protect that investment by ensuring the validity of the model’s inputs. As the volume of data that can be applied to models grows exponentially, so does the risk of creeping data inaccuracies. By building the appropriate validation and change management processes into the data flow, you increase the accuracy of source data and downstream predictions.