In an earlier blog post titled, “Use the Four V’s to Better Understand the Big Data Ecosystem,” I discussed the concepts of volume, velocity, variety and variability that represent the measurable dimensions of big data. I then reviewed some research on how the various tools that make up the big data “ecosystem” address these dimensions. Further vetting of these ideas has helped to fuel discussions about the role of enterprise search in addressing big data with customers, partners, analysts and a number of big data practitioners I met at the recent Strata conference in Santa Clara, California. One of the key takeaways of this research is the real-time element that search can add to a big data deployment—more on that later. As promised in my initial post, I developed the topic of search and the value it can bring to big data into a Vivisimo White Paper titled, Optimizing Big Data.
So this blog post is number two in a series on big data. My plan is to progress from general principles to specifics:
- Part 1 defined the nature and dimensions of “big data,” as well as the relative strengths of the available tools currently used to address big data, with special emphasis on the role that search can fulfill.
- In Part 2 (which you are reading today), I’m getting more specific and will identify some scenarios for deploying enterprise search as part of the big data ecosystem.
- In Part 3 and beyond, I will discuss uses of enterprise search to generate business value from big data in applications such as national security, legal discovery, social network analysis, customer experience management and revenue assurance—what we at Vivisimo call “big data optimization.”
If I may take a moment to define “enterprise search,” I am referring to a comprehensive indexing platform or service with the capability to index content from a variety of different sources and to provide a single point of access for search and discovery. This is a minimalist definition, because a full-featured enterprise search platform like Vivisimo’s Velocity Platform offers many more capabilities, from deep information discovery and collaboration features that are very obvious to end users, to the not-so-obvious but critical back-end capabilities such as entity extraction, security, scalability and fault tolerance. Velocity can also serve as a platform for search-based applications in which the central role of search is not immediately apparent to the end user, but defines the unique capabilities and business value of the application.
The deployment scenarios described below lay the foundation for generating business value from big data. Put another way, they define the architecture of potential search-based applications that leverage big data. A few themes run across all of these scenarios:
- Truly robust enterprise search adds a real-time element to the predominantly batch-oriented world of big data processing
- The ability to access multiple different data sources (the “variability” dimension discussed in my previous post) greatly expands the scope of possibilities for exploiting big data
- Search is accessible and usable to end users, whereas the typical hands-on big data user is a data scientist
So without further ado, I’ll walk through our four scenarios for enterprise search deployed as part of the big data ecosystem.
1. Indexing and Fusion of Big Data
In this scenario, the search platform indexes content that resides in a big data repository or “holding area” such as the Hadoop Distributed File System (HDFS). As discussed in my earlier post, such information is typically under the control of data scientists and not easily accessible to end users. Furthermore, the connections and relationships to information in other enterprise systems are not always apparent in the isolation of a big data laboratory. This is where enterprise search can step in, by enabling search across both the big data repository and other organizational information. This fusion of enterprise application data and data that has been placed in the big data lab can provide a unique view and insights that would not otherwise be apparent to either the analyst or the everyday user.
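To make the fusion idea concrete, here is a minimal sketch of a single index that spans two sources. It is a toy inverted index in plain Python, not how Velocity or any particular search product works; the `FusedIndex` class, the source names and the sample records are all hypothetical.

```python
from collections import defaultdict


class FusedIndex:
    """Toy inverted index spanning several sources (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {(source, doc_id)}
        self.docs = {}                    # (source, doc_id) -> original text

    def add(self, source, doc_id, text):
        """Index one document under a (source, doc_id) key."""
        key = (source, doc_id)
        self.docs[key] = text
        for term in text.lower().split():
            self.postings[term].add(key)

    def search(self, query):
        """Return sorted keys of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            hits &= self.postings.get(term, set())
        return sorted(hits)


index = FusedIndex()
# Hypothetical record pulled from the big data "holding area".
index.add("hdfs", "clickstream-001", "customer 4411 abandoned cart after payment error")
# Hypothetical records from an ordinary enterprise application.
index.add("crm", "ticket-87", "customer 4411 reported payment error on checkout")
index.add("crm", "ticket-88", "customer 9020 asked about shipping")

results = index.search("4411 payment")
```

A single query returns matches from both the repository and the enterprise system, which is the connection that stays hidden while the data sits in separate silos.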
2. Indexing and Search of Big Data Analytics
Most of the hard work done in the typical big data lab is designed to drive analytics. A single project could produce an extremely large number of results “packages” that are stored in individual files, aggregated into a single large file, or stored as database records. Future navigation and recall of these results, either individually or in related sets, can be problematic. If they are viewed once and allowed to languish on a file server somewhere, future value may be lost. What if we were to index these analytic products and make them accessible to a broader range of users, over time? Results from analytics can also of course be merged with results from other supported data sources, providing a critical fusion function for deeper insight into the business or mission context of the analysis.
3. Access and Loading of Content from Diverse Data Sources
The typical big data deployment needs to be “fed” data from wherever it is generated or collected. This step usually involves the creation of custom data adapters. A robust enterprise search platform such as Vivisimo’s Velocity can collect data from a wide range of external systems, transform it into a format that is useful for merging with other data, and pass it along for processing in a big data project. This process may bypass the normal indexing step, in which case the search platform is providing something similar to an “extract, transform and load” function.
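The collect, transform and pass-along steps can be sketched roughly as follows. This is an illustration only, assuming two made-up source formats (a CSV export and a JSON-lines log); the adapter functions, field names and sample data are hypothetical and do not represent any product's connector API.

```python
import csv
import io
import json


def extract_csv(text):
    """Hypothetical adapter: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def extract_json_lines(text):
    """Hypothetical adapter: yield one dict per non-empty JSON line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def transform(record, source):
    """Map a source-specific record onto one common shape."""
    return {
        "source": source,
        "id": str(record.get("id") or record.get("ticket_id")),
        "body": (record.get("text") or record.get("description") or "").strip(),
    }


def load(records):
    """Serialize as JSON lines, a format batch jobs can split line by line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Made-up sample inputs standing in for two external systems.
crm_export = "ticket_id,description\n87, Payment error on checkout \n"
app_log = '{"id": 1, "text": "cart abandoned"}\n'

unified = [transform(r, "crm") for r in extract_csv(crm_export)]
unified += [transform(r, "applog") for r in extract_json_lines(app_log)]
output = load(unified)
```

Nothing here touches an index: the records are normalized and handed on, which is the “extract, transform and load” role in miniature.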
4. Bulk Processing and Conversion of Extremely Large Data Sets
An enterprise search system can use the distributed batch processing capabilities of a big data framework such as Hadoop and MapReduce to perform bulk processing tasks such as entity extraction and document conversion against extremely large data sets. In this use case, the native analysis, conversion and metadata extraction processes of the search platform are either deployed within MapReduce or replaced with equivalent functions. The search system could then ingest the output of these processes and pass it along to the indexing stage of the pipeline. This is an attractive option for truly massive data sets, particularly where organizations have already invested in big data processing infrastructure to leverage commodity hardware for massively parallel processing.
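As a rough illustration of the pattern, the sketch below simulates the map, shuffle and reduce phases in a single Python process. It does not use Hadoop itself, and the "entity extraction" is a deliberately crude assumption (runs of two or more capitalized words) standing in for a real extractor; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a crude stand-in for real entity extraction, matching
# runs of two or more capitalized words.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def map_phase(doc_id, text):
    """Emit (entity, doc_id) pairs for one document."""
    for match in ENTITY_PATTERN.finditer(text):
        yield match.group(0), doc_id


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(entity, doc_ids):
    """Collapse one entity's values into a record ready for indexing."""
    return {"entity": entity, "doc_ids": sorted(set(doc_ids))}


# Made-up sample documents.
documents = {
    "doc-1": "The conference in Santa Clara drew many practitioners.",
    "doc-2": "Teams from Santa Clara and Palo Alto compared notes.",
}

pairs = [pair for doc_id, text in documents.items() for pair in map_phase(doc_id, text)]
records = [reduce_phase(entity, ids) for entity, ids in sorted(shuffle(pairs).items())]
```

In a real deployment the map and reduce functions would run in parallel across the cluster, and the resulting records are what the search system would ingest and pass to its indexing stage.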
In a future post, I’ll explore the business side of big data optimization, identifying applications and business solutions that can be deployed using these four scenarios, plus any that are introduced in future discussions.
Are you using search as part of your big data project, or do you plan to do so? What is the deployment architecture? How does search integrate with your other big data infrastructure? What business problems do you propose to solve?