Simpson’s Paradox: Is your data telling the truth?

Organisations around the world are operating in an increasingly uncertain, competitive and challenging environment, responding to economic, social and health changes whilst maintaining high levels of customer experience and service. The benefits of a data-driven approach to decision making, both to combat these contemporary challenges and to ensure high performance, cannot be overstated. To gain a true understanding of data and to make informed decisions, sophisticated statistical techniques are required. In this blog, we’re going to introduce you to Simpson’s Paradox and show the impact that it can have on informed decision making. It’s paramount that the relevant variables are accounted for when dealing with data and planning for the future. Let us show you why…

What is Simpson’s Paradox?

Simpson’s Paradox occurs when a trend that appears within separate groups of data reverses or disappears once the groups are combined.

Described in 1951 by Edward H. Simpson, a British codebreaker, statistician and civil servant
One of many Association Paradoxes that occur in statistics

A famous example comes from graduate admissions at the University of California, Berkeley, in 1973:

When looking at overall admission statistics, men were more likely to be admitted than women. However, when looking at the numbers at department level, women were more likely to be admitted in the majority of the departments. Men seemingly had a higher rate of admittance because a larger proportion of men than women applied to the departments that had higher rates of admittance.

Another famous example is in a trial of new treatments for kidney stones:

When looking at the overall success rates of treatment, Treatment B seems more effective. However, when looking at small and large stones separately, Treatment A was more effective.

How does Simpson’s Paradox work?

What better way of explaining Simpson’s Paradox than by using dogs?

Firstly, let’s observe 20 dogs jumping over a fence. We have 10 big dogs and 10 small dogs, and we want to calculate which size dog is better at jumping.

We observe that 5 out of 10 big dogs and 4 out of 10 small dogs have made it over the fence.

From this observation alone, we can see that a larger percentage of big dogs successfully jumped over the fence, and so it’s reasonable to conclude that they are the better jumpers.

However, there is something we aren’t considering. The fence itself isn’t uniform, and the height of the fence has an impact on the success of the jump. This is what we call an associated variable (often also called a confounding variable).

Let’s take a look at just the big dogs jumping over two types of fence: a smaller white fence and a taller black fence. 8 of the dogs attempt to jump over the smaller white fence while 2 of them try for the taller black fence.

5 out of 8 of them succeed in jumping over the white fence while zero of them jump over the black fence.

Let’s move onto observing the same experiment with the smaller dogs. This time, 3 of the dogs go for the white fence and 7 dogs attempt the black fence.

2 out of 3 of the dogs succeed in jumping the white fence and 2 out of 7 succeed in jumping over the black fence.

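When looking at the types of fence separately, we can conclude that small dogs are the better jumpers: 2 out of 3 (66.7%) against 5 out of 8 (62.5%) on the white fence, and 2 out of 7 (28.6%) against 0 out of 2 (0%) on the black fence.

However, looking at the aggregate we see the opposite: 5 out of 10 big dogs (50%) made it over, against 4 out of 10 small dogs (40%).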

When we looked at just the size of the dogs, we incorrectly concluded that the big dogs were better at jumping. However, their performance was largely driven by which type of fence they attempted to jump over: more big dogs went for the white fence, which was easier to jump, and so more of them made it over.
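The dog counts above can be checked with a few lines of Python. This is only a minimal sketch: the names jumps, success_rate and aggregate_rate are our own for illustration, and the numbers are the ones from the example.

```python
from fractions import Fraction

# (successful jumps, attempts) for each dog size and fence, from the example above
jumps = {
    "big": {"white": (5, 8), "black": (0, 2)},
    "small": {"white": (2, 3), "black": (2, 7)},
}


def success_rate(successes, attempts):
    """Exact success rate for one group of dogs on one fence."""
    return Fraction(successes, attempts)


def aggregate_rate(by_fence):
    """Pool both fences together, ignoring fence height."""
    successes = sum(s for s, _ in by_fence.values())
    attempts = sum(a for _, a in by_fence.values())
    return Fraction(successes, attempts)


# Success rate per dog size and fence type
per_fence = {
    size: {fence: success_rate(*counts) for fence, counts in fences.items()}
    for size, fences in jumps.items()
}
# Success rate per dog size with the fences pooled
overall = {size: aggregate_rate(fences) for size, fences in jumps.items()}

for fence in ("white", "black"):
    big, small = per_fence["big"][fence], per_fence["small"][fence]
    print(f"{fence} fence: big {float(big):.1%}, small {float(small):.1%}")
print(f"all fences: big {float(overall['big']):.1%}, small {float(overall['small']):.1%}")
```

Small dogs come out ahead on each fence, yet big dogs come out ahead once the fences are pooled, because 8 of the 10 big dogs attempted the easier white fence.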

What does this mean for you?

This paradox, and others like it, has the potential to affect your data any time that you calculate a percentage or ratio from data that is taken from differing situations. Taking into account as many variables as possible or correctly identifying associated variables helps to reach the correct conclusions from data.

An example of this can be seen in calculating changes in click-through rate (CTR) or cost per click (CPC). We’ve summarised an example based on something we observed when monitoring CTR changes during the early stages of the COVID-19 pandemic and lockdown in the UK. Taking a quick look at changes in the CTR of Google Search campaigns between February and March for all of our clients aggregated together, we saw a significant increase.

This seems odd at first glance: you would expect to see CTRs actually decrease due to the change in user behaviours, increases in costs and decreases in budgets. So let’s investigate it further…

Let’s start by taking into account an associated variable, other than the month, that will impact CTR: the client. Due to brand recognition and other factors, the CTR of a campaign can vary significantly from one client to another.

Looking at two different clients separately, we see that in fact both had a decrease in their CTR. The aggregate increase was due to a decrease in activity from Client 1, which had a generally lower CTR, and so a higher proportion of the clicks and impressions we were using to calculate the aggregate CTR in March belonged to Client 2. Since Client 2 had a generally higher CTR, the CTR when aggregating them together increased.
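To see how two falling CTRs can combine into a rising one, here is a small Python sketch. The click and impression counts are invented purely for illustration and are not the client figures we observed; the names stats, ctr and aggregate_ctr are likewise our own.

```python
# Hypothetical (clicks, impressions) per client and month.
# These numbers are made up for illustration; they are NOT real client data.
stats = {
    "Client 1": {"Feb": (2_000, 100_000), "Mar": (360, 20_000)},   # lower CTR, activity drops
    "Client 2": {"Feb": (5_000, 50_000), "Mar": (4_500, 50_000)},  # higher CTR, activity steady
}


def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions


def aggregate_ctr(month):
    """CTR with every client's clicks and impressions pooled together."""
    clicks = sum(months[month][0] for months in stats.values())
    impressions = sum(months[month][1] for months in stats.values())
    return ctr(clicks, impressions)


for name, months in stats.items():
    print(f"{name}: Feb {ctr(*months['Feb']):.2%}, Mar {ctr(*months['Mar']):.2%}")
print(f"Aggregate: Feb {aggregate_ctr('Feb'):.2%}, Mar {aggregate_ctr('Mar'):.2%}")
```

In this made-up data each client’s CTR falls from February to March, but the pooled CTR rises because Client 2’s share of impressions grows from a third to more than two thirds.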

SO WHAT?

As you can see, it’s vital to recognise associated variables and account for, eliminate or include them in our calculations. Once we’ve done this, we can calculate percentages at more segmented levels, before finding an average of all the percentages in that segment or category. This could involve calculating CTR and CPC at campaign level if possible (although some campaigns will not run over multiple time periods), and then calculating an average at the account level.
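As a rough sketch of the segment-then-average approach described above, the Python below computes CTR per segment first and then takes a simple (unweighted) mean of those rates, alongside the pooled figure for comparison. The rows are hypothetical numbers invented for illustration, the function names are our own, and the unweighted mean is just one reasonable choice of average.

```python
from statistics import mean

# Hypothetical rows: (segment, month, clicks, impressions).
# Invented for illustration only; NOT real client data.
rows = [
    ("Client 1", "Feb", 2_000, 100_000),
    ("Client 1", "Mar", 360, 20_000),
    ("Client 2", "Feb", 5_000, 50_000),
    ("Client 2", "Mar", 4_500, 50_000),
]


def pooled_ctr(rows, month):
    """CTR from all clicks and impressions added together (prone to the paradox)."""
    clicks = sum(c for _, m, c, _ in rows if m == month)
    impressions = sum(i for _, m, _, i in rows if m == month)
    return clicks / impressions


def segment_average_ctr(rows, month):
    """CTR calculated per segment first, then averaged across segments."""
    # Segments with no impressions in the month are skipped to avoid dividing by zero.
    rates = [c / i for _, m, c, i in rows if m == month and i > 0]
    return mean(rates)


for month in ("Feb", "Mar"):
    print(
        f"{month}: pooled {pooled_ctr(rows, month):.2%}, "
        f"segment average {segment_average_ctr(rows, month):.2%}"
    )
```

With these numbers the pooled CTR rises while the segment average falls, in line with what each segment actually did. The comparison is only fair when the same segments appear in both periods, which is why campaigns that did not run in both months need separate handling.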

But what happens if there are larger numbers of associated variables? It can get incredibly complex and time-consuming to calculate manually…

This is where statistical modelling comes into the picture! Statistical techniques allow us to predict percentages at an incredibly segmented level for a more accurate understanding of our data. By including all relevant variables in a statistical model, their associations are taken into account.

If you’re dealing with large amounts of data, you will want to find insight that will inform your decision making. However, much of this insight comes from surface-level analysis. In an education setting, this could include observing the admittance rates by gender or the enrolment rates of certain markets. But an applicant is not just their gender or their nationality. There are many variables that affect these behaviours.

We analysed admissions data over the past 5 years of all applicants to a client’s university to predict a future applicant’s rate of enrolment. Using statistical techniques we identified 5 significant variables that affect their rate of enrolment:

Age
Gender
Country of Residence
Entry system
Source of Interest

We then ran a regression model which calculated the rate of enrolment of an applicant while accounting for all the variables and avoiding the effects of Simpson’s Paradox.

How can Arke help?

Simpson’s Paradox is a common Association Paradox which can cause us to come to incorrect conclusions from our data. An understanding of associated variables and accounting for them is needed to avoid this paradox.

When dealing with complex data, sophisticated statistical techniques are needed to ensure we gain a true understanding of our data for informed decision making.

For help with turning your data into real customer stories, get in touch with one of our friendly experts below.