The business value of natural language processing (NLP) is indisputable, and the technology has never proven so useful. Just think: in the rapid shift to remote work in response to the global coronavirus pandemic, companies have leveraged NLP for everything from chatbots that onboard workers remotely to systems that interface safely with patients in healthcare settings.

It’s especially encouraging to see that, despite IT budgets trending downward (Gartner), enterprise leaders have not shied away from NLP investments. In fact, according to new research, survey respondents across industries, company sizes, and geographic locations reported that their organizations’ NLP technology budgets were 10-30% higher than last year’s.

With the proliferation of NLP services in the cloud, companies need not even install and manage open source NLP libraries. In fact, 77% of all survey respondents indicated that they use at least one of the NLP cloud services listed (Google, Amazon, Azure, and IBM). But, despite their popularity and availability, cloud services do not come without challenges – and usually include a hefty price tag.

Roughly a third (34%) of all technical leaders cited data privacy and security as a key challenge for cloud service adoption. This is especially acute in highly regulated industries such as healthcare and financial services, where users are often unable or unwilling to share data with third parties. In fact, privacy regulations in healthcare require that users strip medical records of any protected health information (PHI), a process known as de-identification. While de-identification has largely been manual and labor-intensive, open source NLP software can help automate it.
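To make this concrete, here is a minimal sketch of entity-based redaction using spaCy’s general-purpose English model. It is an illustration only: the sample text, the entity labels chosen for redaction, and the redact helper are assumptions, and a production de-identification pipeline would use models trained specifically on PHI categories (names, dates, medical record numbers, and so on).

```python
# Minimal sketch: redacting named entities as a rough first pass at de-identification.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def redact(text, labels=("PERSON", "DATE", "GPE", "ORG", "FAC")):
    """Replace entities of the given types with a [LABEL] placeholder."""
    doc = nlp(text)
    redacted = text
    # Work from the last entity backwards so character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

print(redact("John Smith was admitted to Mercy Hospital on June 3, 2020."))
# e.g. "[PERSON] was admitted to [ORG] on [DATE]." (exact labels depend on the model)
```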

Additionally, 32% of respondents in this group cited difficulty customizing models using cloud services. Language is highly application- and domain-specific, so models often must be customized. This is especially painful when a cloud-based service is trained on general usage but cannot recognize or disambiguate terms of art for a specific domain. For example, speech-to-text services for video transcripts might transcribe nearly every instance of “Docker” as “doctor,” which degrades the accuracy of cloud-based solutions.
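Open source libraries make this kind of domain customization relatively easy. As a hedged sketch (the terms, label, and sample sentence below are illustrative), spaCy’s EntityRuler can be placed ahead of the statistical NER component so that terms of art like “Docker” are recognized consistently:

```python
# Sketch: teaching a general-purpose pipeline domain terms it would otherwise miss.
# Assumes spaCy >= 3.0 with en_core_web_sm installed; the patterns are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

# A rule-based matcher placed before the statistical NER component takes priority,
# so these domain terms are tagged consistently regardless of the base model.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Docker"},
    {"label": "PRODUCT", "pattern": "Kubernetes"},
])

doc = nlp("We deploy the Docker images to Kubernetes every Friday.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Docker', 'PRODUCT'), ('Kubernetes', 'PRODUCT'), ('every Friday', 'DATE')]
```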

Security and model customization are factors that contribute to difficulties deploying NLP cloud services, but not surprisingly, the top concern cited by all respondents is cost. These services grow more expensive as the collection of documents grows. Here’s how a few of these services describe their pricing:

  • Amazon: The company’s Comprehend service for Entity Recognition, Sentiment Analysis, Syntax Analysis, Key Phrase Extraction, and Language Detection is measured in units of 100 characters, with a 300-character minimum charge per request.
  • Google: The Natural Language API, the cloud service that garnered the most users in the survey, charges in terms of ‘units.’ Each document sent to the API for analysis accounts for at least one unit.
  • Microsoft: Azure Text Records correspond to the number of 1,000-character units within a document provided as input to a Text Analytics API request.

It’s easy to see how the cost can add up when using NLP cloud services; users will likely pay thousands of dollars before even going live. While it’s easy to get started, it’s hard to scale, which is in part why more survey respondents in the exploratory phase of NLP consider cloud services, while those more established in their journey turn to open source options.
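To see how quickly the units accumulate, here is a back-of-the-envelope sketch based on the Comprehend unit rules described above. The workload figures and the per-unit price are placeholders, not quoted prices; check each provider’s pricing page for real numbers.

```python
# Back-of-the-envelope sketch of how per-request character units add up.
# Unit rules follow the Amazon Comprehend description above (100-character units,
# 300-character minimum per request); every number below is a placeholder.
import math

def comprehend_units(char_count):
    """Units billed for one request: 100-character units, 3-unit (300-char) minimum."""
    return max(math.ceil(char_count / 100), 3)

docs_per_day = 50_000       # hypothetical workload
avg_chars_per_doc = 1_200   # hypothetical average document length
apis_called = 2             # e.g. entity recognition + sentiment, billed separately
price_per_unit = 0.0001     # placeholder price in USD, not a quote

daily_units = docs_per_day * comprehend_units(avg_chars_per_doc) * apis_called
print(f"{daily_units:,} units/day ≈ ${daily_units * price_per_unit:,.2f}/day")
# 1,200,000 units/day ≈ $120.00/day, i.e. thousands of dollars per month
```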

No solution is one-size-fits-all, but when it comes to NLP, open source libraries have some clear advantages over cloud offerings. Given the challenges above, coupled with the ongoing commoditization of deep learning, open source NLP libraries provide relative ease of use and extensibility per application, proving to be a more cost-effective solution. The ability to train models on your own data – something not achievable with these cloud solutions – also comes into play when considering the needs of industries, such as healthcare and financial services, that may not be able to share documents.

This is likely why a third of survey respondents indicated that they use Spark NLP, an open source library built on top of Apache Spark that offers a Python API. A quarter indicated that they use spaCy, one of the most popular open source NLP libraries in the Python ecosystem, and more than half of all respondents (53%) used at least one of these two libraries. In other recently conducted surveys on data science and machine learning tools, these two libraries also ranked highly. This is significant, given that Spark NLP is only three years old compared to more established libraries and cloud services.
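Getting started with either library takes only a few lines. Here is a sketch of the Spark NLP Python API, assuming pyspark and spark-nlp are installed; the pretrained pipeline is downloaded on first use.

```python
# Sketch: running one of Spark NLP's published pretrained pipelines from Python.
# Assumes: pip install spark-nlp pyspark
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a local Spark session configured for Spark NLP

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships pretrained pipelines for common NLP tasks.")

print(list(result.keys()))   # e.g. document, sentence, token, pos, lemma, entities, ...
print(result["entities"])
```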

In addition to cost constraints, the cloud services surveyed don’t let you train your own models: a cloud API gives you access to the provider’s model, not one tuned to your data. Open source libraries such as Spark NLP are trainable, and training happens inside your own environment – a key consideration for domain-specific systems in healthcare and financial services that rely heavily on their own jargon and terminology and cannot always share documents.
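As an illustration of what “trainable” means in practice, here is a hedged sketch of training a domain-specific NER model with Spark NLP on your own labeled data. The file path is hypothetical and the embedding choice is an assumption; the point is that both the training data and the resulting model stay inside your own environment.

```python
# Sketch: training a custom NER model with Spark NLP on in-house, CoNLL-formatted data.
# Assumes pyspark and spark-nlp are installed; the path below is hypothetical.
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()

# CoNLL().readDataset yields document, sentence, token, pos, and label columns.
training_data = CoNLL().readDataset(spark, "data/your_domain_train.conll")

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_tagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(5)

# Training runs on your own cluster (or laptop); nothing leaves your environment.
model = Pipeline(stages=[embeddings, ner_tagger]).fit(training_data)
model.write().overwrite().save("models/domain_ner")
```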

When choosing the right NLP solution for your organization, it makes sense to explore a range of offerings from both open source libraries and cloud providers. After all, open source NLP libraries come with challenges of their own: language support, scalability, and integration are all areas for improvement, according to the survey results. That said, cloud providers have lagged when it comes to customization, extensibility, and pricing, and as a result cloud-based NLP services are generally perceived as a low-accuracy, high-cost option. It will be interesting to see how cloud providers respond to these market needs, or whether open source libraries will prevail.