The rapid growth of data creation and collection has driven the need for big data solutions at many organizations. As enterprises look to innovate products and services at a faster pace and improve customer service, they cannot afford to overlook the massive amount of data locked away in data repositories on both sides of the firewall. However, getting the ball rolling on analyzing this fluid data set can be a huge challenge. While some of this information is housed in well-structured databases and applications, invariably a large percentage, perhaps up to 80%, is in complex and unstructured formats spanning multiple systems. In this post I’ll focus on the challenge of discovery in a world of unstructured data.

Part of the challenge of working with unstructured data in a big data environment is getting a handle on exactly what type of data you have available. Simply moving everything in bulk into Hadoop clusters and data warehouses is not a viable approach. Successful big data implementations take a phased approach, and deciding what data to roll into your big data platform is part of that process. This data exploration phase is critical in developing an understanding of what data exists, what is missing, and how the data ties to the use case scenarios most important to the business.

Discovery Through Search…

Search is an important tool in this exploration and discovery process. Data analysts must be able to execute queries over a range of repositories and aggregate the results in a meaningful way. In my previous post I discussed how a unified search interface enables this aggregate search capability.

In addition to enabling unified search across multiple repositories, the search interface must also help users derive meaning from the returned results. Considering the ever-increasing volume of data that is being searched, simply returning search results in the form of a long list will lead to frustration on the part of searchers.

Add Some Structure…

One way to help tame the sheer volume of search results is through structured navigation and visualization. Results are categorized into bins that users can then use to further refine their search. But how do you define these bins? One way is to use a static set of tags that have been applied to the source documents. These tags may have been assigned manually by content authors or automatically by classification software. The search platform can then index these tags and use them to bucket content into groups at search time.
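To make the mechanics concrete, here is a minimal sketch in Python of tag-based faceting. The result records and tag names are invented for illustration, not drawn from any particular search platform:

```python
from collections import defaultdict

# Hypothetical search hits; each carries the static tags applied
# to its source document (all names here are illustrative only).
results = [
    {"title": "Q3 sales report",      "tags": ["finance", "report"]},
    {"title": "Onboarding checklist", "tags": ["hr"]},
    {"title": "Budget forecast",      "tags": ["finance"]},
]

def facet_by_tag(hits):
    """Bucket hits into facets keyed by their indexed tags."""
    facets = defaultdict(list)
    for hit in hits:
        for tag in hit.get("tags", ["untagged"]):
            facets[tag].append(hit)
    return facets

for tag, hits in facet_by_tag(results).items():
    print(f"{tag} ({len(hits)})")   # e.g. "finance (2)", one refinement bin
```

Each bin becomes a clickable refinement in the navigation pane, letting users narrow the result set one facet at a time.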

This works well when examining a single repository where content creators have shown good discipline in tagging content. But when we consider the big data case of highly varied content stored in multiple repositories, consistent tagging will not be the norm, and additional steps must be taken to categorize the data. The problem is further exacerbated by the fact that as the size and variety of the data increase, the set of tags that can adequately cover it must be made more generic. The unfortunate side effect is that structured navigation based on that tag set also becomes broad and generic, making it difficult for users to drill down to precise results.

For documents and content that do not have quality metadata associated with them, entity extraction can help fill the void. Entity extraction is the process of automatically deriving document metadata from unstructured text. Extracting key entities such as person names, locations, dates, specialized terms and product terminology from free-form text can improve both keyword search and structured, faceted navigation.
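As a rough illustration, the snippet below runs an off-the-shelf named-entity recognizer (spaCy, a general-purpose NLP library, standing in for whatever extractor a given search platform ships with) over a sample sentence; the text and entities are invented:

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Acme Corp. opened a distribution center in Rotterdam "
        "on March 3, 2012, according to CEO Jane Smith.")

doc = nlp(text)
for ent in doc.ents:
    # Each extracted entity is a candidate metadata tag for faceting
    print(ent.text, ent.label_)   # e.g. "Rotterdam GPE", "Jane Smith PERSON"
```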

Entity extraction relies on the adoption of a controlled vocabulary or taxonomy for describing documents. This can be problematic for highly variable data sources. Defining a comprehensive taxonomy that suitably applies across varied data repositories is difficult at best, and even if such a taxonomy can be defined, maintaining its relevance on an ongoing basis can be very time consuming and expensive. Even so, terms derived from entity extraction can be a valuable complement to existing metadata tags.

Infer Meaning…

Dynamic tagging (or clustering) addresses the problems with static tag sets by inferring labels dynamically from the content itself, thus avoiding tag sets that are too general or simply unavailable. Furthermore, dynamic tagging can surface richer descriptive phrases as labels, rather than simple keywords.

Vivisimo Velocity’s dynamic tagging technique automatically organizes search results into groups of related content that are known as clusters. Velocity uses multiple heuristics to quickly identify meaningful groups that can be concisely described, and creates these groups as search results are returned. The costs and disadvantages of taxonomy maintenance therefore do not apply to Velocity’s clustering. Apart from some optional cluster tuning, all classification is done on the fly, with no intellectual effort or maintenance required by the organization.
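Velocity's actual heuristics are proprietary, but the general idea can be approximated with standard tools. The sketch below clusters a handful of invented result snippets with TF-IDF and k-means, then labels each cluster with its highest-weight terms; it is a generic stand-in for on-the-fly clustering, not Velocity's algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented result snippets standing in for live search results.
snippets = [
    "quarterly revenue and profit forecast",
    "profit margins and revenue growth outlook",
    "server outage incident postmortem",
    "database outage and failover report",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(snippets)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = np.array(vec.get_feature_names_out())

for c in range(km.n_clusters):
    # Label each cluster with the top-weighted terms of its centroid
    top = terms[km.cluster_centers_[c].argsort()[::-1][:3]]
    size = int((km.labels_ == c).sum())
    print(f"[{' / '.join(top)}] {size} results")
```

Here the labels fall out of the content itself; no taxonomy is defined up front, which mirrors the maintenance-free property described above.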

There is great value in clustering beyond simply assigning tags to documents dynamically. A federated search can be configured to range over several sources, combine their search results, and cluster them. Even though some of these sources may have metadata associated with their contents and some may not, dynamic clustering draws out common themes from the search results, allowing you to understand relationships between seemingly disjointed data sets, as the sketch below illustrates.
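A bare-bones sketch of that flow, with invented source adapters standing in for real repository connectors: the query fans out, the results pool into one list, and the pooled list can then be fed to a clustering step like the one shown earlier:

```python
# Hypothetical source adapters; each returns result dicts for a query.
def search_wiki(query):
    return [{"source": "wiki", "text": "VPN setup guide"}]

def search_helpdesk(query):
    return [{"source": "helpdesk", "text": "VPN login failure ticket"}]

def search_crm(query):
    return [{"source": "crm", "text": "customer complaint about VPN access"}]

def federated_search(query, sources):
    """Fan the query out to every source and pool the results.
    The pooled list can then be clustered regardless of which
    sources carry metadata and which do not."""
    merged = []
    for source in sources:
        merged.extend(source(query))
    return merged

hits = federated_search("vpn", [search_wiki, search_helpdesk, search_crm])
print(len(hits), "results from", {h["source"] for h in hits})
```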

In Summary…

Data exploration and discovery is a critical component of any big data initiative. Leveraging a comprehensive search platform such as Velocity as part of this process provides a complete overview of data housed in disparate systems without having to migrate it to a common repository. Taking advantage of the option to leave data in place for analysis significantly reduces the load on stretched IT resources.

In addition, offering users enhanced navigation and visualization through:

• Existing metadata
• Extracted entities
• Dynamic document clustering

will greatly improve their ability to extract value from burgeoning corporate data stores, going well beyond basic keyword search.