Introduction to ECDFs in Python: A Robust Histogram Replacement

Ditch the binning bias once and for all with ECDFs

Oct 15, 2024

Charts shouldn’t be open to interpretation.

And unfortunately, that’s rarely the case, especially with histograms. Today you’ll see how a histogram’s binning bias can mislead your analysis and how ECDF plots prevent the issue.

After reading, you’ll know:

What’s wrong with histograms, and when you should avoid them
How to replace histograms with ECDFs, a more robust method for examining data distributions
How to use and interpret multiple ECDFs in a single chart to compare distributions among different data segments

Let’s dig in!


What’s Wrong with Histograms?

In the words of Justin Bois from DataCamp, the problem is binning bias, and I couldn’t agree more. It means that changing the bin size of a histogram changes how the data distribution looks. Don’t take my word for it; the example below speaks for itself.

To start, I’ll import a couple of libraries for data analysis and visualization, and load the Titanic dataset straight from the web:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Optional: applies my custom theme file; remove this line if you don't have it
plt.style.use("custom_light.mplstyle")

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df.sample(5)

I’m using my custom-light theme to change the look of the visualizations. If you’re interested, you can learn how to create a Matplotlib theme from scratch in this article:

How to Create a Custom Matplotlib Theme And Make Your Charts Go From Boring To Amazing (Dario Radecic, September 3, 2024)

If you are unfamiliar with the dataset, here’s what a random sample of five rows looks like:

Image 1 - Titanic dataset sample (image by author)

Missing values can’t be visualized, so I’ll drop the rows with a missing Age for the purpose of the article:

df = df.dropna(subset=["Age"])

Next, I’ll declare a function for visualizing histograms. It takes a bunch of parameters, but the most important ones in our case are:

x – a single attribute for which you want to make a histogram
nbins – how many bins the histogram should have

def plot_histogram(x, nbins=10, title="Histogram", xlab="Age", ylab="Count"):
    plt.hist(x, bins=nbins, color="#895884", ec="#000000", linewidth=1)
    plt.title(title, size=20, y=1.04)
    plt.xlabel(xlab, size=14)
    plt.ylabel(ylab, size=14)

I’ll use this function twice: first to make a histogram with 10 bins, and then to make one with 30 bins. Here are the results:

Histogram with 10 bins:

plot_histogram(df["Age"], title="Histogram of passenger ages")

Histogram with 30 bins:

plot_histogram(df["Age"], nbins=30, title="Histogram of passenger ages")

The data is identical, but the different bin counts lead to binning bias: you perceive the same data differently because of a slight change in its visual representation.
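To see the same effect in numbers rather than pictures, here’s a minimal sketch. It uses synthetic, made-up ages (not the Titanic data), and the names `ages`, `counts_10`, and `counts_30` are my own assumptions for illustration:

```python
import numpy as np

# Synthetic, hypothetical "ages" (not the Titanic data): two overlapping groups.
rng = np.random.default_rng(seed=42)
ages = np.concatenate([
    rng.normal(loc=25, scale=6, size=300),
    rng.normal(loc=45, scale=8, size=200),
]).clip(0, 80)

# Bin the same array twice, changing only the number of bins.
counts_10, edges_10 = np.histogram(ages, bins=10)
counts_30, edges_30 = np.histogram(ages, bins=30)

# Every observation is counted exactly once in both cases (both totals are 500)...
print(counts_10.sum(), counts_30.sum())

# ...but the bar widths and heights, and therefore the shape you see, differ.
print("Bin width with 10 bins:", round(edges_10[1] - edges_10[0], 2))
print("Bin width with 30 bins:", round(edges_30[1] - edges_30[0], 2))
print("Tallest bar with 10 bins:", counts_10.max())
print("Tallest bar with 30 bins:", counts_30.max())
```

Nothing about the data changed between the two calls. Only the bin count did, and that alone is enough to change what the chart appears to say.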

What can you do to address this issue? ECDF plots are here to save the day.

ECDFs: A Robust Histogram Replacement

ECDF stands for the Empirical Cumulative Distribution Function. In reality, it’s way less fancy than it sounds and is also relatively easy to interpret.

Like histograms, ECDFs show the distribution of a single variable, but in a more efficient way. You’ve seen previously how histograms can be misleading due to different bin sizing options. That’s not the case with ECDFs. ECDFs show every data point, and the plot can be interpreted in only one way.

Think of ECDFs as scatter plots because they also have points along X and Y axes. To be more precise, here’s what ECDFs show on both axes:

X-axis: the quantity you’re measuring (Age in the example above)
Y-axis: the share of data points with a value smaller than or equal to the respective X value (at each point X, Y% of the values are smaller than or identical to X)

To make this sort of visualization, you need to do a bit of calculation first. Two arrays are required:

X: the sorted data (the Age column sorted from lowest to highest)
Y: evenly spaced values from 1/n to 1, where n is the number of data points and 1 stands for 100%

Use the following snippet to calculate X and Y values for a single column in a Pandas DataFrame:

def ecdf(df, column):
    x = np.sort(df[column])
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
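
To make the two arrays concrete, here’s a small worked example. The `toy` DataFrame is hypothetical data I made up for illustration, and the function is repeated only so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

def ecdf(df, column):
    x = np.sort(df[column])
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({"Age": [30, 10, 20, 20, 50]})
x, y = ecdf(toy, "Age")

print(x)  # [10 20 20 30 50]
print(y)  # [0.2 0.4 0.6 0.8 1. ]
```

Note the tie at 20: two points share the same X value, and the higher of the two Y values (0.6) is the one that tells you 60% of the values are 20 or lower.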

And to plot the ECDF, use the following:

def plot_ecdf(x, y, title="ECDF", xlab="Age", ylab="Percentage", color="#895884"):
    plt.scatter(x, y, color=color)
    plt.title(title, size=20, y=1.04)
    plt.xlabel(xlab, size=14)
    plt.ylabel(ylab, size=14)

Let’s use these functions to make an ECDF plot of the Age attribute:

x, y = ecdf(df, "Age")

plot_ecdf(x, y, title="ECDF of passenger ages")
plt.show()

Image 4 - Age ECDF plot (image by author)
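
The scatter plot is my choice for this article, but an ECDF is technically a step function, so a step line is a common alternative. Here’s a minimal sketch with Matplotlib’s `step()` on hypothetical toy ages (not the Titanic data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy ages, not the Titanic data.
ages = np.array([30, 10, 20, 20, 50])
x = np.sort(ages)
y = np.arange(1, len(x) + 1) / len(x)

fig, ax = plt.subplots()
# where="post" keeps the curve flat until the next observation, then jumps.
ax.step(x, y, where="post", color="#895884")
ax.set_xlabel("Age")
ax.set_ylabel("Fraction of data points")
ax.set_title("ECDF drawn as a step function")
```

Call `plt.show()` afterward to display it. Both versions carry the same information; the step line just makes the jumps at each observation explicit.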

To interpret:

Around 25% of the passengers are 20 years old or younger
Around 80% of the passengers are 40 years old or younger
Around 5% of the passengers are 60 years old or older (1 minus the percentage read off the Y-axis)
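
You don’t have to eyeball these numbers from the chart. Here’s a minimal sketch of how you could compute them, assuming a helper I’m calling `ecdf_at()` (a hypothetical name, not part of any library) and made-up toy ages instead of the Titanic data:

```python
import numpy as np

def ecdf_at(values, threshold):
    """Fraction of observations less than or equal to threshold."""
    sorted_values = np.sort(values)
    # side="right" counts every element <= threshold, including ties.
    return np.searchsorted(sorted_values, threshold, side="right") / len(sorted_values)

# Hypothetical toy ages, not the Titanic data.
ages = np.array([4, 18, 22, 25, 29, 31, 36, 40, 52, 63])

print(ecdf_at(ages, 20))     # 0.2 -> 20% are 20 or younger
print(ecdf_at(ages, 40))     # 0.8 -> 80% are 40 or younger
print(np.mean(ages >= 60))   # 0.1 -> 10% are 60 or older
```

The last line is the "1 minus" reading from the list above: it’s the share of the data that sits to the right of a given X value.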

The true power lies in plotting multiple ECDFs. Let’s see what that’s all about.

Multiple ECDF Plots

In the Titanic dataset, you have the Pclass attribute, which indicates the passenger class. This sort of class organization is still typical in travel today: first class is reserved for wealthier individuals, and everyone else travels in the other classes.

With the power of ECDFs, you can explore how the passenger age was distributed among classes. You’ll need to call the ecdf() function three times, as there were three classes on the ship. The rest of the code boils down to data visualization, which is self-explanatory:

x1, y1 = ecdf(df[df["Pclass"] == 1], "Age")
x2, y2 = ecdf(df[df["Pclass"] == 2], "Age")
x3, y3 = ecdf(df[df["Pclass"] == 3], "Age")

plt.scatter(x1, y1, color="#F07605", label="Pclass = 1")
plt.scatter(x2, y2, color="#0B6E4F", label="Pclass = 2")
plt.scatter(x3, y3, color="#9B1D20", label="Pclass = 3")
plt.title("ECDF of ages across passenger classes", size=20, y=1.04)
plt.xlabel("Age", size=14)
plt.ylabel("Percentage", size=14)
plt.legend(loc="lower right")
plt.show()

Image 5 - ECDFs across passenger classes (image by author)

As you can see, the third class (red) had a lot of children on board, which isn’t the case with the first class (orange). The population in the first class is noticeably older, too:

Only 20% of the first-class passengers are older than 50 years
Only 20% of the second-class passengers are older than 40 years
Only 20% of the third-class passengers are older than 34 years
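
"Only 20% are older than X" is the same as saying the ECDF reaches 80% at X. Here’s a rough sketch of how you could look that value up per class. The `toy` DataFrame is hypothetical stand-in data, so its numbers won’t match the Titanic ones above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Age":    [20, 35, 45, 50, 60, 2, 15, 22, 28, 40],
})

results = {}
# For each class, find the age at which the ECDF first reaches 80%.
for pclass, group in toy.groupby("Pclass"):
    x = np.sort(group["Age"])
    y = np.arange(1, len(x) + 1) / len(x)
    age_at_80 = x[np.searchsorted(y, 0.8)]
    results[int(pclass)] = int(age_at_80)
    print(f"Pclass {pclass}: 80% of passengers are {age_at_80} or younger")
```

With this toy data, the loop reports 50 for the first class and 28 for the third. Swap in the real DataFrame to reproduce the readings from the chart.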

Wrapping Up

In data visualization, don’t leave anything open to interpretation.

Just because you prefer to see 15 bins in a histogram doesn’t mean your coworker does too. Two people looking at differently binned histograms of the same data might come to different conclusions.

That’s not the case with ECDFs. You now know enough about them to include them in your next data analysis project. They might look strange, but they’re dead easy to explain and interpret.

Until next time.

-Dario

Data Doodles with Python
