Big Data is in its infancy. While it has been around for many years, it is only now becoming mainstream. Large data providers like Experian, Acxiom, and D&B have been collecting data for a long time. What is different now? Today, Big Data takes the form of LinkedIn, Facebook, Twitter and Google. To learn more about why Big Data is big now, you must understand the continuum of getting at Big Data. In other words, today’s data must meet my 11 Big Data prerequisites.
1. The data must be there
This is the most exciting tipping point. As the CEO of a data-mining software company, I’m still dumbfounded when users expect to get information off the web that is not there. To begin, the data must actually exist.
2. You must be able to flag it
You can’t store everything; therefore, you must make choices. What is important? When is it important?
3. You must be able to find it
In the absence of a real-time data stream, you must be able to search through the data to find a “flag” of what you are looking for.
4. You must be able to parse it
This is the analysis of relevant grammatical constituents: identifying the parts you need from within potential noise. An example is parsing out the name of an inventor from within an article on nanotechnology.
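To make the inventor example concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the name follows the phrase "invented by" or the word "inventor" and is written as two or more capitalized words. The pattern, the function name parse_inventors and the sample sentence are all hypothetical; real articles are far noisier and usually need a proper natural-language parser.

```python
import re

# Hypothetical pattern: a cue word, then two or more capitalized words.
INVENTOR_PATTERN = re.compile(
    r"\b(?:inventor|invented by)\s+([A-Z][a-z]+(?: [A-Z][a-z]+)+)"
)

def parse_inventors(text):
    """Return every capitalized name that follows an inventor cue in the text."""
    return INVENTOR_PATTERN.findall(text)

# Made-up sample text standing in for an article on nanotechnology.
article = (
    "Carbon nanotubes are reshaping materials science. "
    "The coating was invented by Jane Example in 2010, "
    "and inventor John Sample later refined it."
)

print(parse_inventors(article))
```

Everything outside the two names is the "noise" the parser has to ignore.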
5. You must be able to extract it
This is not the same as parsing. What if the data is in a PDF file or HTML code? In many cases, extraction is about access. Is the data across five links within a single web page? Extraction as it relates to the Internet also encapsulates web crawling.
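As a rough illustration of extraction from HTML, the sketch below pulls the link targets out of a page using Python's standard html.parser module. The class name LinkExtractor and the helper extract_links are my own names for this example, and the sketch deliberately skips fetching pages, which is the part a web crawler adds on top.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return the link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

page = '<p><a href="/page1">One</a> and <a href="/page2">Two</a></p>'
print(extract_links(page))
```

A crawler would fetch each returned link and repeat the process, which is how data spread across several links on one page gets collected.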
6. You must be able to process it
This takes CPU cycles. Bigger problems need bigger computers.
7. You must normalize it
If you have multiple pieces of data on “The Container Company”, “Container Company, The”, “The Container Co”, etc., how do you merge that data? You must normalize like entities to a standard “canonical form”. Without it, we’ve got the Data Tower of Babel.
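Here is a minimal sketch of that normalization in Python. The rules are assumptions chosen to fit the three examples above (lowercase everything, move a trailing ", The" to the front, expand a few abbreviations, drop the leading article); the function name canonical_form and the abbreviation table are hypothetical, and a production system would need many more rules.

```python
import re

# Hypothetical abbreviation table; extend as needed.
ABBREVIATIONS = {"co": "company", "corp": "corporation", "inc": "incorporated"}

def canonical_form(name):
    """Reduce a company name to one comparable form."""
    name = name.strip().lower()
    # "container company, the" -> "the container company"
    match = re.fullmatch(r"(.*),\s*the", name)
    if match:
        name = "the " + match.group(1)
    # Remove punctuation such as the period in "Co."
    name = re.sub(r"[^\w\s]", "", name)
    words = [ABBREVIATIONS.get(word, word) for word in name.split()]
    # Drop the leading article so it never affects matching.
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)

for variant in ["The Container Company", "Container Company, The", "The Container Co"]:
    print(variant, "->", canonical_form(variant))
```

All three variants reduce to the same string, so records keyed on the canonical form can be merged.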
8. You must be able to store it
Big data takes up disk space.
9. You must be able to index it
If you ever want to find the data after you store it, the data needs to be indexed. This also means more disk space.
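One common way to index text is an inverted index, which maps each word to the documents that contain it. The sketch below is a toy version in Python, assuming whitespace-separated words and made-up document IDs; the names build_index and search are my own. Note that the index is a second data structure stored alongside the documents, which is where the extra disk space goes.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted IDs of documents containing the word."""
    return sorted(index.get(word.lower(), set()))

# Made-up documents for illustration.
documents = {
    1: "big data takes up disk space",
    2: "you must index the data",
}
index = build_index(documents)
print(search(index, "data"))
```

Looking up a word is now a single dictionary access instead of a scan through every stored document.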
10. You must be able to analyze it
Big Data needs big (or many distributed) CPUs to crunch the numbers and garner order from the chaos.
11. There must be a payoff
Putting Big Data together is expensive. Without an end goal in mind, that expense is hard to justify. Google and Facebook collect, process, index and store data for profit.
While it is tough to predict where Big Data will go next, we can start by looking at the requirements of Big Data, and where it comes from in the first place.