Natural Language Processing (NLP) has long been one of the holy grails of computer science. While we all know that computers are better than humans at making sense of highly structured information, there are still some important areas where humans are undeniably better than machines. Understanding language is one of those areas.

For humans, understanding language is so natural we usually don’t even have to make a conscious effort to do it. In reality, though, processing language and turning it into meaningful information is an extremely complex and difficult task. Without consciously thinking about it, we correct grammar mistakes, resolve ambiguities, and infer meaning that isn’t explicitly stated.

Teaching computers to perform these tasks (even imperfectly) has huge implications in many areas of our lives, from the way we design products to the way we research cures for diseases to the way we get directions. In this article, we’re going to explore what Natural Language Processing is, how it works, and how it’s being paired with Big Data to solve problems in a wide array of fields.

What Is NLP?

Many of us already encounter NLP in our daily lives. It’s the technology that allows us to ask our smartphones for directions or help recognizing the song playing on the radio. It’s also the technology that powers the automated call centers we often reach when calling customer service.

The key to Natural Language Processing is taking data as complex and context-dependent as human language and translating it into the kind of structure that a computer can understand and act upon. But how do you do that? The earliest efforts at teaching computers how to understand human language looked a lot like a language class: Scientists tried to teach computers how language worked by explicitly teaching them the rules of grammar and syntax. But the way people actually speak and use language often doesn’t follow the rules. Misspellings, idioms, slang, and common grammatical errors may not prevent a human from understanding the meaning of a text, but a system that relies on explicit rules struggles whenever those rules aren’t followed to the letter.

This has changed with the advent of machine learning. Machine learning refers to the use of a combination of real-world and human-supplied characteristics (called “features”) to train computers to identify patterns and make predictions. In the case of NLP, using a real-world data set lets the computer and machine learning expert create algorithms that better capture how language is actually used in the real world, rather than how the rules of syntax and grammar say it should be used. This allows computers to devise more sophisticated, and more accurate, models than would be possible solely using a static set of instructions from human developers.

For example, a typical NLP task might involve identifying the names of people in Facebook posts. The first step of the process is feature extraction, which involves identifying meaningful characteristics of something that set it apart from something else. To do this, we’d start with a training set of real Facebook posts in which the names have already been labeled. We might say that a name usually begins with a capital letter and is likely to be found in a book of baby names. Each word can then be described by its list of feature values, known as a feature vector. Using these feature vectors, we would train the computer to recognize first names, taking all of our different features into account. (For instance, if the name “jason” appears uncapitalized, the computer might still recognize it as a name because it appears in our book of baby names.) Then, using a different set of Facebook posts, we’d test our computer’s model. If it successfully distinguishes names from non-names, we’ve built a successful model.
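The name-recognition workflow described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the `BABY_NAMES` set, the labeled `TRAINING` and `TEST` examples, and the function names are all invented here, and the "model" simply remembers the majority label for each combination of the two features rather than using a real learning algorithm.

```python
from collections import Counter

# Hypothetical stand-in for the "book of baby names" (an assumption).
BABY_NAMES = {"jason", "maria", "emily", "david", "sarah"}

def extract_features(token):
    """Describe a token by two features: capitalized? in the name book?"""
    return (token[:1].isupper(), token.lower() in BABY_NAMES)

def train(labeled_tokens):
    """Count how often each feature combination was labeled as a name."""
    counts = {}
    for token, is_name in labeled_tokens:
        counts.setdefault(extract_features(token), Counter())[is_name] += 1
    return counts

def predict(model, token):
    """Return the majority label seen in training for this feature combination."""
    seen = model.get(extract_features(token))
    if not seen:
        return False  # combination never seen: default to "not a name"
    return seen[True] > seen[False]

def accuracy(model, labeled_tokens):
    """Fraction of held-out tokens the model labels correctly."""
    correct = sum(predict(model, tok) == label for tok, label in labeled_tokens)
    return correct / len(labeled_tokens)

# Made-up labeled tokens standing in for real, hand-labeled posts.
TRAINING = [
    ("Jason", True), ("Maria", True), ("emily", True), ("David", True),
    ("sarah", True), ("Today", False), ("Happy", False), ("Paris", False),
    ("lunch", False), ("with", False),
]
# A separate, equally made-up test set used only for evaluation.
TEST = [("jason", True), ("Monday", False), ("Emily", True), ("pizza", False)]

model = train(TRAINING)
print(predict(model, "jason"))   # uncapitalized, but found in the name book
print(accuracy(model, TEST))
```

Even this toy version shows why combining features helps: capitalization alone would mislabel “Today” and “Paris,” while the name list alone would miss any name it doesn’t contain.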

NLP and Big Data

An important part of the Big Data revolution has been the rise in the use of unstructured data. Thanks in large part to systems like Hadoop and Spark, we now have the ability to quickly process huge troves of unstructured data that in the past would have just been left sitting in boxes and warehouses.

While many NLP tasks may not require the same kind of real-time streaming analytics as some other Big Data tasks, they do require facility working with large, unstructured datasets, whether in the form of text pulled from webpages, Facebook posts, search queries, text messages, or more.

Open-Source Tools for NLP

Some of the most common tasks for NLP include tokenization (splitting text into words and terms), tagging various parts of speech, creating parse trees (which are like sentence diagrams), and classifying some terms as named entities (for example, grouping together names of people, days of the week, or cities). From these basic tasks, it’s possible to create more sophisticated applications, like the ones we’ll explore in the next section.
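To make two of these basic tasks concrete, here is a minimal sketch of tokenization and a very crude form of named-entity labeling using only Python’s standard library. It is not how any particular NLP library does it: the regular expression, the `ENTITY_LISTS` lookup tables, and the function names are assumptions made for illustration, and real systems rely on trained models rather than fixed word lists.

```python
import re

# Hypothetical lookup tables (gazetteers); the contents are assumptions.
ENTITY_LISTS = {
    "DAY": {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"},
    "CITY": {"paris", "london", "chicago"},
}

def tokenize(text):
    """Split text into words (keeping simple contractions) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def tag_entities(tokens):
    """Pair each token with an entity label from ENTITY_LISTS, or None."""
    tagged = []
    for token in tokens:
        label = next(
            (name for name, members in ENTITY_LISTS.items()
             if token.lower() in members),
            None,
        )
        tagged.append((token, label))
    return tagged

tokens = tokenize("Let's meet in Paris on Friday.")
print(tokens)
print(tag_entities(tokens))
```

Part-of-speech tagging and parse trees are harder to approximate by hand, which is one reason the libraries discussed in this section are so widely used.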

Before we look at NLP’s more advanced applications, it’s worth noting that there are a number of open-source libraries that support both basic and more advanced NLP tasks. For example, Pattern and NLTK are written in Python and provide a number of classes and modules that make it easy to work with text. NLTK is designed to be an intuitive, practical, and modular tool for NLP. It’s well documented, with two books and an active community in both academia and industry. Pattern is billed as a web-mining module, and includes several tools that NLTK doesn’t, like a web crawler, HTML parser, and a number of APIs for major web services. Pattern also provides modules for graph data structures that show the relationship between nodes representing different words or concepts.

Stanford CoreNLP is a Java-based suite of tools that provides similar functionality to NLTK. Described as an “integrated framework,” CoreNLP is designed to make it easy to apply multiple tools to a single piece of text.


One of the trends in Big Data has been to recognize the value of information in all kinds of places we wouldn’t normally think to look, and NLP is no different. Organizations are just beginning to understand the enormous potential value stored in all the text we generate on a daily basis, in the form of emails, text messages, social media posts, search queries, medical and legal records, and more.

By leveraging NLP, many organizations are able to create new value and improve efficiency. Here are a few of the more advanced applications of NLP, and how organizations are using them.

  1. Automatic translation allows a computer to quickly translate a complex piece of text from one language into another. Because different languages are highly nuanced and idiosyncratic, this is an area where machine learning techniques are extremely useful. This is the technology that allows Google to automatically translate pages from French or Urdu or Mandarin into English. By looking at the way language is actually used across millions of webpages, the computer is able to offer much more accurate (and expressive) translations than if it were simply using a dictionary.
  1. Automatic summarization is the process of creating a short summary of a longer piece of text that captures the most relevant information. Think of the abstracts or executive summaries found at the beginnings of research papers and longer reports. This can be achieved by extracting key sentences and combining them into a concise paragraph, or by generating an original summary from keywords and phrases.
  1. Natural Language Generation (NLG) combines data analysis and text generation to take data and turn it into language that humans can understand. While it’s been used to create jokes and poems, it’s also being used to generate news articles based on stock market events and weather reports based on meteorological data.
  1. Speech processing is the specific technology that allows virtual assistants to translate verbal commands into discrete actions for the computer to perform. This technology allows Amazon Echo to translate your request to hear some dance music into a specific Pandora search, or Siri to turn your question about local hot spots into a Yelp search for dinner recommendations.
  1. Topic segmentation and information retrieval refer (respectively) to the process of dividing text into meaningful units and identifying meaningful pieces of information based on a search query. You enjoy the benefits of this technology every time you execute a Google search. Taken together, these two techniques are also being used by several legal tech companies to create searchable databases of legal opinions, allowing lawyers to more efficiently find relevant case law without having to scour briefs for hours on end.
  1. Biomedical text mining is a subset of text mining used by biomedical researchers to glean insights from massive databases of specialized research. Some of its applications include identifying relationships between different proteins and genes, as well as assisting in the creation of new hypotheses.
  1. Sentiment analysis is routinely used by social analytics companies to put numbers behind the feelings expressed on social media or the web in order to generate actionable insights. Marketers use sentiment analysis to inform brand strategies, while customer service and product departments can use it to identify bugs, product enhancements, and possible new features.
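As an illustration of one item from this list, the extractive approach to automatic summarization (picking key sentences and combining them) can be sketched roughly as follows. This is a simplified word-frequency heuristic chosen for brevity, not the method any particular product uses; the `STOPWORDS` set, the sentence-splitting rule, and the function names are assumptions.

```python
import re
from collections import Counter

# Small, hypothetical stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was"}

def split_sentences(text):
    """Naively split on whitespace that follows ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def content_words(sentence):
    """Lowercased words of a sentence, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)

sample = ("Cats sleep a lot. Cats chase mice and cats climb trees. "
          "The weather was mild.")
print(summarize(sample, max_sentences=2))
```

The other route mentioned above, generating an original summary from keywords and phrases, requires a text-generation model and is beyond what a short sketch like this can show.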

These are just a few of the ways organizations are using NLP to derive value from text. As with any machine learning project, you need clearly defined business goals that your predictions will serve. From there, you can define the data set that will be most relevant and then develop a training set that the computer will use to build a model.

Building a machine learning system is a complex undertaking, requiring data scientists to extract features and train algorithms. Depending on the size of the datasets you’re working with and your specific business requirements, you might also want a database expert to manage document storage, or data engineers to design and manage a data pipeline, especially if you need to analyze a constant stream of new data in or near real time.

Ready to get your Natural Language Processing project off the ground? Build the team you need with freelancers on Upwork today.