Gathering data is just one component of performing an audit. Before you analyze your data, you need to cleanse the data and prepare it for the analysis.

Without this step, your analysis may not work, so it’s important to be thorough here. The steps may take some time, but it’s worth getting them right.

Cleansing and preparation could account for half of your overall data focused effort, but it’s well worth spending the time here. If you dive in without fully cleansing, verifying, formatting, and standardizing your data, your analysis will be standing on a flimsy foundation. Taking your time and being thorough in this preparation stage will produce strong, reliable data that you can use to make informed recommendations.

Published with permission from the author.

Initial Cleanse, Exploration, and Verification

First we remove superfluous data. At the end of this step, we will have a data set that is shaped for preliminary exploration. If we skip this step, we may not be able to work with our data set. For each set of data that we have, we remove things like header records, blank or empty rows, and empty columns.

Next we want to understand and explore the data that we will be working with. The goal is to create an initial data profile, a broad understanding of the data and an initial set of data quality results.

We need to find the answers to these questions for each data set that we will be using: how many records and columns do we have? Is there any irrelevant data we can remove? Any strange outliers? Also, we check that there are no duplicate records, null values, or empty cells.
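These profiling questions can be answered in a few lines with a tool such as pandas. The sketch below is illustrative only: the DataFrame, column names, and sample values are hypothetical, and the duplicate row and empty cell are planted so the checks have something to find.

```python
import pandas as pd

# Hypothetical sample: a small transactions extract with one duplicate
# row and one empty cell.
df = pd.DataFrame({
    "invoice_id": [1001, 1002, 1002, 1004],
    "amount": [250.0, 99.5, 99.5, None],
})

# How many records and columns do we have?
n_rows, n_cols = df.shape

# Are there any duplicate records?
n_duplicates = int(df.duplicated().sum())

# Are there any null values or empty cells (per column)?
nulls_per_column = df.isna().sum()
```

Outliers can be inspected in the same pass, for example with `df.describe()` to eyeball minimums and maximums against expectations.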

We must also check that our data matches our broad expectations. At the end of this step, we will be ready to proceed with cleansing. Here are the steps we follow to ensure that we’re done with the pre-cleanse verification.

Check that the summary is reasonable. Based on our business understanding:

  • Do the totals make sense?
  • Does the number of transactions per period (day/month) make sense (average, minimum and maximum)?
  • Do the number of entities (e.g., customers, suppliers) and the number of transactions make sense?
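A quick way to produce these summary figures is to aggregate with pandas. This is a minimal sketch with made-up transaction data; in practice the totals and counts would be compared against what the business expects.

```python
import pandas as pd

# Hypothetical sample: a few transactions with dates and customer IDs.
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01",
                            "2023-01-02", "2023-01-03"]),
    "customer": ["A", "B", "A", "C"],
    "amount": [100.0, 40.0, 60.0, 25.0],
})

# Do the totals make sense?
total_amount = df["amount"].sum()

# Transactions per day: average, minimum, and maximum.
per_day = df.groupby("date").size()
avg_per_day = per_day.mean()
min_per_day, max_per_day = per_day.min(), per_day.max()

# Does the number of entities (customers here) make sense?
n_customers = df["customer"].nunique()
```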

At this point, we may not yet be ready to confirm that the data will enable us to answer our questions (or our hypotheses).


Cleanse, Format, Standardize

Data is rarely clean when we get it, and often the data columns need to be correctly formatted and standardized. In this step, we will clean, format and standardize the data so that it can be blended (joined) in preparation for analysis. If we skip this step, we may not be working with data in the necessary technical format.

There are three things we need to check in this step:

  1. Unique identifiers
  2. Formatting (e.g. all dates formatted the same)
  3. Whether the data sets fall within the same range (e.g., the same date range).
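The three checks can be sketched in pandas as follows. The column names and the January date range are hypothetical; the point is identifier uniqueness, one consistent date type, and a range check.

```python
import pandas as pd

# Hypothetical sample: orders with identifiers and dates stored as text.
df = pd.DataFrame({
    "order_id": ["A1", "A2", "A3"],
    "order_date": ["2023-01-05", "2023-01-06", "2023-01-07"],
})

# 1. Unique identifiers: every order_id should appear exactly once.
ids_unique = df["order_id"].is_unique

# 2. Formatting: convert the text dates to one consistent datetime type.
df["order_date"] = pd.to_datetime(df["order_date"])

# 3. Range: confirm the dates fall within the expected period.
in_range = df["order_date"].between("2023-01-01", "2023-01-31").all()
```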

Now, determine what needs to be corrected or adjusted and how to fix missing data. Before we reach the “joining” stage, we will create new data within one or more of the individual data sets. At the end of this step, we will have a few new columns in one or more of the data sets.

Our data does not always contain group columns that we need. For example, we may have granular geographical identifiers (e.g., cities) but need another form of geographical grouping (e.g., states). In such a case, we will map the cities to states, using a join, to create a new column and populate it. We will need a separate data set with cities and their corresponding states (usually open data), and we will need to know how to join (explained in the next step).
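As a sketch of that mapping join, assume a hypothetical transactions table with Australian cities and a separate city-to-state reference set (the cities, states, and amounts are invented for illustration):

```python
import pandas as pd

# Hypothetical transactions with cities but no state column.
transactions = pd.DataFrame({
    "city": ["Brisbane", "Melbourne", "Brisbane"],
    "amount": [100.0, 200.0, 50.0],
})

# Separate reference data set (often open data) mapping cities to states.
city_to_state = pd.DataFrame({
    "city": ["Brisbane", "Melbourne", "Sydney"],
    "state": ["QLD", "VIC", "NSW"],
})

# A left join keeps every transaction and adds the new, populated column.
transactions = transactions.merge(city_to_state, on="city", how="left")
```

A left join is used here so that no transaction is dropped even if a city is missing from the reference set.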


Joining and Finishing Touches

In this step, which is also known as blending, we will combine our data to enable meaningful analysis. The easiest way to understand the different join types is by looking at what the result of the joins will be, if applied to a pair of simple tables. For example, if we have a master table that contains information on all products (including unit prices) and a transaction table that has order information:

  • Use INNER join when you only want order information for which there is a unit price, AND you don’t need unit prices for products that have not had an order.
  • Use LEFT join when you want all order information, even if there is no unit price.
  • Use RIGHT join when you want to see all products, but only orders with a unit price.
  • Use FULL OUTER join when you want to see all orders, even if there is no unit price, AND (in the same table) you want to see all unit prices, even for products that have not had an order.
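The four join types can be demonstrated on a tiny version of that pair of tables. The products, prices, and order IDs below are invented; "ink" has no orders and "clip" has no unit price, so each join produces a different row count.

```python
import pandas as pd

# Master table: all products, including unit prices ("ink" has no orders).
products = pd.DataFrame({
    "product": ["pen", "pad", "ink"],
    "unit_price": [1.5, 3.0, 7.0],
})

# Transaction table: order information ("clip" has no unit price).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product": ["pen", "pad", "clip"],
})

inner = orders.merge(products, on="product", how="inner")  # matched rows only
left = orders.merge(products, on="product", how="left")    # all orders
right = orders.merge(products, on="product", how="right")  # all products
full = orders.merge(products, on="product", how="outer")   # all of both
```

Here the inner join returns two rows, the left and right joins three each, and the full outer join four, which mirrors the four descriptions above.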

Joins are tricky operations. Always check them. To verify the data, check for duplication, empty cells, special characters, and formatting for dates and formulas.

Now, check that the fixes have carried through to the new joined table. We confirm that in executing the joins, we have not unintentionally excluded data; this sometimes happens when we are using an inner join. We can perform the check by counting the number of primary keys in each table, then comparing that to the joined table. Are there any missing?
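One way to sketch this key-count check in pandas, using invented order and product tables where one order has no matching product, is shown below; pandas' `indicator=True` option on `merge` is a convenient way to list the exact rows that failed to match.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "product": ["pen", "pad", "clip"]})
products = pd.DataFrame({"product": ["pen", "pad"],
                         "unit_price": [1.5, 3.0]})

# An inner join silently drops the order with no matching product.
joined = orders.merge(products, on="product", how="inner")

# Compare key counts: any shortfall means the join excluded rows.
n_excluded = orders["order_id"].nunique() - joined["order_id"].nunique()

# merge(indicator=True) pinpoints exactly which rows failed to match.
check = orders.merge(products, on="product", how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]
```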

Now that we have our data joined, we can use one or more columns from each data set to plug any holes that we may have had in the underlying data.

For example, if we had data at different geographical levels in our underlying data sets, we may want to use one or both columns to fill gaps. Let’s say we had postcodes (zip codes) in one data set and state names in the other, but there were missing state names. We can now use the codes to populate the state names.
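That gap-filling step might look like the sketch below. The postcodes, states, and mapping are hypothetical (real Australian postcode-to-state data would come from a reference data set), but the pattern — fill missing values in one column from a lookup on another — is general.

```python
import pandas as pd

# Hypothetical joined data: every row has a postcode, but some state
# names are missing.
df = pd.DataFrame({
    "postcode": ["4000", "3000", "4000"],
    "state": ["QLD", None, None],
})

# Postcode-to-state mapping (from a reference data set).
postcode_to_state = {"4000": "QLD", "3000": "VIC"}

# Use the postcodes to populate the missing state names.
df["state"] = df["state"].fillna(df["postcode"].map(postcode_to_state))
```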

Look at the data and try to identify other such missing data that we can now populate.

Congrats! Your Data is Ready for Analysis.

We have successfully prepared our data for analysis. That’s a huge step. From here on, it’s all much easier. There may still be further cleansing or fixing required later; using data often involves iteration. But none of the later cleansing will be as challenging as what we have already completed.

Now that you’ve thoroughly cleaned and prepared the data, you can comfortably move on to analyzing it without worrying that you’ve missed a step or that the data is somehow faulty. You’re free to analyze your data with clarity and peace of mind.

For more advice on preparing and analysing data for audits, you can find The Data-Confident Internal Auditor on Amazon.

Yusuf Moolla has over twenty years of experience in data, assurance, and plaguing his colleagues with his dry sense of “humor.” A Certified Internal Auditor, Yusuf has worked for both Deloitte and KPMG, leading audits and data projects around the world.

Conor McGarrity’s background includes KPMG as well as the public sector. Passionate about leveraging data for better audit outcomes and redefining the limits of what’s possible, he also enjoys displaying a unique blend of “skill” and “finesse” on the soccer field.

Together, they founded Risk Insights, a specialist advisory firm that focuses on using data for internal audit and performance audit. Connect with them online at data-confident.com.

This post is adapted from The Data-Confident Internal Auditor.