geralt / Pixabay

A few years ago, the major bottleneck in data science was cleaning and processing huge volumes of data so that it could be analyzed. While new tools and processes have automated large parts of the data ingestion process, a new bottleneck has appeared: The analysis itself. Even though data science remains one of the fastest growing jobs categories in the country, with demand expected to increase 28% by 2020, there still simply aren’t enough trained data analysts to answer all the questions that organizations are creating.

As work across all industries becomes increasingly data driven, the ability to quickly access and interpret data has become more important than ever. While it might be great if every worker just learned how to write perfect SQL queries, it’s not going to happen anytime soon. To reduce some of the burden on already overstretched data teams, many organizations are looking for tools that allow non-developers to query their relational databases.

The Importance of Ad-Hoc Queries

There are plenty of business intelligence (BI) tools out there that provide dashboards with key metrics, but if you’re only looking at dashboards you’re missing out on all the insights that remain hidden inside your data. That’s why it’s important to be able to perform ad-hoc queries as well.

Unfortunately, writing a valid SQL query that answers the question you’re trying to ask isn’t always easy. Even seemingly simple questions, like “Which of our customers makes us the most money?” can actually be very hard to answer, let alone convert into a SQL query. How do you define your customer cohorts? By age, location, behavioral patterns? Are you looking at how much customers spend on a single visit, or over their entire customer lifecycle?

As more and more organizations try to integrate data into their decision-making processes, the ability for non-data specialists to write their own queries can be the key to major gains in productivity and improved decision-making. Especially as functions that have traditionally been less data-driven, like Marketing and HR, begin to explore their own data, it’s important for employees to be able to get answers to their questions in a timely manner.

The Advantages (and the Shortcomings) of Visual Query Builders

Lots of organizations turn to BI software to create dashboards with key performance indicators and other useful metrics. Some of these BI and visualization tools have begun offering support for more than just standard SQL queries, building drag-and-drop interfaces that allow people who don’t necessarily know how to structure a SQL query to get into the data.

These tools work by connecting to a database and then allowing users to select what variables they want and then map them onto a graph or chart. Unfortunately, these tools have a number of shortcomings. A visual query builder may generate technically valid SQL, but the query itself may end up being too complex to run or risk bringing your server down in the process.

Another shortcoming of visual query builders is that they still require users to know more or less what they’re asking for. To construct a query, you still have to manually select the right columns and input the appropriate conditions. In essence, the user still has to “convert” their question into terms a computer can understand even if they’re not writing the exact query.

This may be fine if a user is working with a database they’re already familiar with, but what if Marketing wants to ask a question about detailed user behavior on the company’s iOS app? Or what if a user isn’t sure which columns or operations are appropriate?

Looking to the Future: Natural-Language Queries

What we’re really asking here is how to convert natural human language into valid SQL queries. By itself, natural language processing (NLP) is one of the most challenging areas in AI research, and translating that question into a valid SQL query introduces a whole new layer of complexity. Still, there’ve been a number of efforts over the last few years to develop models that can do just that.

In August of 2017, Salesforce released a research paper detailing a model called Seq2SQL that uses neural networks and reinforcement learning to create valid and accurate SQL queries out of human questions. Salesforce’s work is a big deal for two reasons. First, the model uses reinforcement learning, meaning it will improve over time as more and more people use it. Already Seq2SQL has shown significant improvements over the previous state-of-the-art.

Second, Salesforce has released a massive open source training set, WikiSQL, which it claims to be orders of magnitude larger than existing training sets. This way, other developers and researchers working on NLP interfaces can take advantage of a high-quality training set, saving them one of the most time-consuming and difficult steps of building and training an algorithm.

So how does it work? Say we want to answer the question: “How many presidents since 1900 have won with less than 50% of the national popular vote?” Let’s say we’re using a database called presidential_elections, which looks something like this:

Year President PopVoteRaw PopVotePer PopVoteMar Runner-up Turnout
1824 John Q Adams 113,142 30.92 -38,221 Andrew Jackson 26.9
1828 Andrew Jackson 642,806 55.93 140,839 John Q Adams 57.3
1832 Andrew Jackson 702,735 54.74 228,628 Henry Clay 57.0
1836 Martin Van Buren 763,291 50.79 213,384 William Henry Harrison 56.5

And so on. The model should recognize that the “presidents” in our question points to the “President” column, while “How many” indicates that we want the COUNT operation. Thus, the first part of our query should ideally look something like:

SELECT COUNT (President)
FROM presidential_elections

Now’s the harder part. Hopefully the model will be able to tell that “since 1900” and “with less than 50% of the popular vote” represent two separate conditions with greater than and less than operators, respectively. So hopefully, we’ll get a condition that looks something like:

WHERE Year > 1900
AND PopVotePer < 50;

If all goes well, our query should return the correct answer: Nine.

Looking for a Data Analyst?

For the moment, natural language-based query engines remain more of an intriguing concept than a practical reality. Until then, you may still want to turn to human data analysts for your querying needs. Fortunately, data analysis skills are increasingly widespread, and even junior-level data analysts should be able to produce ad hoc queries for a variety of database systems and create visualizations to help you interpret their findings.