Recently I read a very interesting article that explores the role of intuition in data science. It was written by Tom Davenport, who is well known in the field of analytics:
“A hypothesis is an intuition about what’s going on in the data you have about the world. The difference with analytics, of course, is that you don’t stop with the intuition — you test the hypothesis to learn whether your intuition is correct”
There are many different definitions of intuition and most of us know it when we experience an intuitive insight or a priori knowledge. However, since there is no clear understanding of how it was formed, we often distrust it and ignore it. Tom suggests that forming a hypothesis is an intuitive process, and as such intuition is an important instrument of learning.
Whenever we are looking at a data set, any pattern we may (or may not) see there is a byproduct of our own belief system – we cannot recognize something we are not already aware of. It is not very different from ancient astronomers who recognized images of well-known objects in constellations. Recognizing that group A and group B seem to behave similarly under similar circumstances (i.e. correlate) is the first step in a learning process.
The next step is to measure and test these correlations until we are satisfied enough to assume that the probability of past behavior repeating in the future is relatively high. When this assumption is confirmed multiple times without (major) exception, people often treat such assumptions (i.e. models) as causal predictions. The better a model is (i.e. the longer it works as expected without re-calibration), the more we tend to forget that it was designed to calculate probabilities, not to discover a causal certainty. The consequence of such confusion is a relatively high probability of very large losses. The only uncertainty left is the timing of such losses. Every major market crash is exacerbated by the uncontrolled proliferation of predictive models. That leads critics of predictive modeling to rightly point out that we often forget the difference between correlation and causation. However, it is easy to throw the proverbial baby out with the bathwater: the use of models helps to improve our lives, from helping us prepare for likely weather conditions to finding better routes in traffic. Modeling is a viable methodology for developing knowledge.
To illustrate the thought process that goes into constructing such a model, consider the growing number of customer reviews available online that describe experiences with specific products, and the increasing trust consumers place in such reviews when making purchasing decisions. Observation of these two trends suggests a hypothesis: products that provide a better customer experience are in higher demand from well-informed market participants, and therefore their social media product reputation may influence market share dynamics.
Since the Likert scale (commonly used on customer review websites) was too coarse for the customer experience/product reputation measurements required for the study, we developed opinion mining algorithms to estimate NPS® (Net Promoter Score). We then aggregated and cleansed sufficient volumes of unstructured content data, converted it into structured information (NPS), and compared it with the available market share data to see whether there were any correlations.
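The conversion from unstructured reviews to a structured NPS value can be sketched as follows. This is only an illustration and not the algorithm used in the study: it assumes that some opinion mining step (not shown here) has already labeled each review as "promoter", "passive" or "detractor", and those label names and the function name are hypothetical. The only part taken from the standard NPS definition is the final formula, the percentage of promoters minus the percentage of detractors.

```python
from collections import Counter

# Hypothetical label names; the article does not describe the
# opinion mining algorithms that would produce them.
VALID_LABELS = ("promoter", "passive", "detractor")


def net_promoter_score(labels):
    """Return NPS in the range [-100, 100] from per-review labels.

    NPS = percentage of promoters - percentage of detractors.
    Labels outside VALID_LABELS are ignored.
    """
    counts = Counter(label for label in labels if label in VALID_LABELS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labeled reviews to score")
    return 100.0 * (counts["promoter"] - counts["detractor"]) / total
```

Computing this score for each reporting period would give a time series that can be placed next to the market share series for the same periods.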
This example shows the correlation of trends for Windows OS smartphones.
It suggests that, given a sufficient amount of historical data and ongoing measurements of customer feedback available online, it is possible to build predictive models that improve currently employed forecasting processes.
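A comparison of two trends like the one described above can be sketched with a plain Pearson correlation coefficient. The code below is a minimal sketch under stated assumptions, not the method used in the study: the function names are hypothetical, and the idea of shifting one series by a lag (to check whether reputation moves ahead of market share) is an assumption added here for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lagged_pearson(leading, trailing, lag):
    """Correlate leading[t] with trailing[t + lag] for a lag >= 0.

    The lag is an illustrative assumption: it asks whether changes in
    one series (e.g. NPS) show up in the other (e.g. market share)
    some periods later.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(leading, trailing)
    return pearson(leading[:-lag], trailing[lag:])
```

A coefficient close to 1 or -1 only says that the two series moved together over the measured period. It does not establish that one caused the other, and it needs to be recomputed as new data arrives.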
Causality can be learned and proven relatively easily in controlled experiments. However, such controlled conditions are rarely available in a real market environment with a very large number of players. When dealing with open systems, correlation is one of the best tools available for managing economic uncertainty, even though it can backfire if it is used to predict a single outcome without periodic re-calibration and a realistic assessment of the probability of error.