
The Internet of Things (IoT) is one big contributor to data generation. Machines are increasingly being equipped with sensors and can report back their state continuously, across the factory or across the globe. Thousands of these sensors can be remotely monitored, making traditional machinery more connected, and more visible to the managers who have invested in it.
And data is being harvested at incredible speed. Most websites are hooked up to Google Analytics, Google’s website statistics package; this retains data for at least 25 months (although some users report seeing older data) on every website it monitors. Google also collects information on our email habits, our social circles, our locations and our mobile devices.
All of this data sounds like a huge bonus to the human race. But how many people are using all of the data Google is storing on their behalf? And how many companies are collecting our data, just like Google, then selling it to someone else?
We are fast reaching a stage where data is infinite, and we are only limited by the speed at which we can store it – or the price of keeping it organised. The latter is a huge hurdle in our quest for data quality.
The Age of Big Data
On a global scale, the amount of data being created and transferred is staggering:
- According to IBM, we generated 2.5 billion gigabytes (or 2.5 exabytes) every single day in 2012
- About 75 per cent of the data we all generate is unstructured – in other words, random, and difficult to index
- From the dawn of time to 2003, the amount of data in existence was estimated to be 5 exabytes. We are now generating more than 5 exabytes of data every 48 hours
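As a quick sanity check, the figures above are consistent with each other. The sketch below assumes decimal units (1 gigabyte = 10^9 bytes, 1 exabyte = 10^18 bytes); the function names are illustrative only.

```python
# Decimal (SI) units are assumed here: 1 GB = 10**9 bytes, 1 EB = 10**18 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18


def gigabytes_to_exabytes(gigabytes: int) -> float:
    """Convert a quantity in gigabytes to exabytes."""
    return gigabytes * BYTES_PER_GB / BYTES_PER_EB


def hours_to_generate(target_eb: float, daily_eb: float) -> float:
    """Hours needed to generate target_eb at a steady rate of daily_eb per day."""
    return target_eb / daily_eb * 24


# IBM's 2012 figure: 2.5 billion gigabytes per day.
daily_eb = gigabytes_to_exabytes(2_500_000_000)
print(daily_eb)                          # exabytes per day
print(hours_to_generate(5, daily_eb))    # hours to reach 5 exabytes
```

At 2.5 exabytes a day, the pre-2003 estimate of 5 exabytes is matched in 48 hours, which is where the final bullet comes from.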
Big data hasn’t come about suddenly. Data production has grown gradually, and it’s presenting new opportunities for marketers and scientists alike. Businesses are trying to make sense of the data we all produce and use it for commercial advantage.
But simply having masses of data isn’t helpful if that data is old or inaccurate. Quality must prevail over quantity. The more data there is, the more resources we need to make sense of it. And the longer we store the data, the less useful it becomes.
Additionally, there are some big data dilemmas that businesses must note:
- Quality vs quantity
- Truth vs trust
- Correction vs curation
- Ontology vs anthology
Or, as the above Gartner blog states:
“Why store all data? Because we can. What’s in this data? Who knows.”
Much Too Much?
When it comes to data, we clearly have an embarrassment of riches. Even the biggest proponents of big data admit that data in itself is useless, and its volume does not negate its uselessness.
Only when data becomes information can it be put to work. Until then, it’s just ones and zeros. An infographic from Inc.com suggests that too much data is costing businesses money to the tune of $50 billion per year.
Apart from the cost, what are the other downsides of masses of data?
- While storage costs have decreased, it’s still wasteful to pay for storage space that you are unlikely to use. Warehousing massive amounts of data becomes expensive fast if you have no quality control at all
- Analytics require huge amounts of computational power, most likely in the cloud. While the cloud is undoubtedly powerful and useful, particularly for small businesses, extensive use of it puts data under third-party control, which raises security challenges
- Once data is stored, it’s difficult to retrieve. Ralf Dreischmeier, head of the Boston Consulting Group, says that one third of the data held by banks is never used for this reason
- There is no legal precedent for data ownership, and no clear way to define who owns the data that an individual entity generates
Storing all of this data is the result of immature processes. Or: ‘We don’t know why we need this data, but let’s save it anyway’.
Costly Waste
Let’s look at the price of data in more detail.
We can use the 1-10-100 rule to estimate the cost of waste in a database. The 1-10-100 rule says that:
- Verification of a record in a database costs $1
- Cleansing and deduplicating a record costs $10
- Working with an inaccurate, unclean and decayed record costs $100
So working with bad data is costly, and so is cleansing data. If we have masses of data that we don’t need, cleansing all of that data is unnecessarily expensive. If we only collect the data we really need, we instantly see a reduction in data cleansing and deduplication cost.
Also, note the final bullet point. Working with bad data is very expensive. If we collect masses of data and archive it, that data is going to present a massive resource drain when it’s retrieved.
The bigger the dataset, and the less controlled or accessible it is, the more errors will be lurking within it, and the more costly it will be.
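To make the 1-10-100 rule concrete, here is a minimal sketch that totals the cost of a database from the number of records in each state. The per-record costs come from the rule itself; the function name and the record counts in the example are hypothetical, chosen only to illustrate the arithmetic.

```python
# Per-record costs in dollars, taken from the 1-10-100 rule.
COST_VERIFY = 1     # verifying a record
COST_CLEANSE = 10   # cleansing and deduplicating a record
COST_BAD = 100      # working with an inaccurate, unclean or decayed record


def estimate_cost(verified: int, cleansed: int, bad: int) -> int:
    """Estimate total cost in dollars for records in each of the three states."""
    return verified * COST_VERIFY + cleansed * COST_CLEANSE + bad * COST_BAD


# Hypothetical database of 10,000 records (illustrative numbers only):
# 7,000 verified, 2,000 cleansed later, 1,000 left as bad data.
total = estimate_cost(verified=7_000, cleansed=2_000, bad=1_000)
print(f"Estimated cost: ${total:,}")
```

In this made-up example, the 1,000 bad records account for $100,000 of the $127,000 total, even though they are only a tenth of the database. Collecting fewer unnecessary records shrinks every term in the sum.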
False Economies
Most businesses leverage data for:
- Sales and marketing
- Analysis and reporting
- Decision making
- Speed and efficiency
All of these require quality data that is relevant, targeted and clean. A large dataset could be a symptom of a data quality problem, since there are more likely to be duplicate records hidden among the rows.
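Duplicates often hide behind small differences in spacing or capitalisation. The sketch below shows one simple way to surface them, assuming each record is a (name, email) pair; the record layout, the sample data and the function names are all hypothetical, and real deduplication usually needs fuzzier matching than this.

```python
from collections import Counter


def normalise(record: tuple[str, str]) -> tuple[str, str]:
    """Lower-case a (name, email) record and collapse stray whitespace."""
    name, email = record
    return (" ".join(name.lower().split()), email.strip().lower())


def count_duplicates(records: list[tuple[str, str]]) -> int:
    """Count the surplus records that match another record after normalisation."""
    counts = Counter(normalise(r) for r in records)
    return sum(n - 1 for n in counts.values() if n > 1)


# Illustrative sample data: the first two rows describe the same person.
records = [
    ("Ada Lovelace", "ada@example.com"),
    ("ada  lovelace", "ADA@example.com "),
    ("Alan Turing", "alan@example.com"),
]
print(count_duplicates(records))
```

An exact comparison would treat all three rows as distinct; normalising first reveals that one of them is redundant.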
When data is collected from a global “digital exhaust”, it is still just data. It is not sampled, and it is not consistent. It is just as liable to perish as data collected through other means. It presents no insight without deduplication and cleansing. And it does not bypass the need to be thorough in our quest for data quality.
