Probability Distribution Function (PDF, PMF & CDF)

Jinendrasingh
11 min read · Apr 13, 2023


A Probability Distribution Function (PDF) is a function that tells us the probability of a random variable taking on a particular value or a range of values in a given probability distribution. It is used in statistics and data science to model and analyze data, make predictions, and perform statistical inference.

To understand this, we first need to understand random variables and probability distributions, so let’s dive deep into the Probability Distribution Function.

Random Variables:

A Random Variable is a variable whose value is determined by the outcome of a random experiment; it is described by the set of possible values that outcome can take.

For example, suppose we roll a fair six-sided die. The roll’s outcome is a random variable, as it is determined by a random process (the roll of the die). The possible values are the integers 1, 2, 3, 4, 5, and 6, and each outcome has an equal probability of 1/6.

X = {1, 2, 3, 4, 5, 6}

Here X denotes the random variable, and the set of values is the sample space.

Types of Random Variables:

Discrete Random Variable: one that can only take on certain specific values, with a defined probability for each value. Examples include the number of children in a family or the outcome of flipping a coin.

Continuous Random Variable: one that can take on any value within a certain range (possibly infinite), with the probabilities of ranges of values described by a probability density function. Examples include the height of a person or the time it takes to complete a task.

Probability Distributions:

A probability distribution is a list of all possible outcomes of a random variable along with their corresponding probability values.

Probability Distributions
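
For instance, the distribution of a fair six-sided die can be written down directly as a table of outcome-probability pairs. Here is a minimal sketch in Python, with a plain dict standing in for the table:

# the probability distribution of a fair six-sided die,
# written as a table of outcome-probability pairs
dist = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

# every probability is non-negative and they all sum to 1
assert all(p >= 0 for p in dist.values())
assert abs(sum(dist.values()) - 1) < 1e-9

print(dist[3])  # probability of rolling a 3 -> 0.1666...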

The Problem with Distribution Tables

In many scenarios, the number of outcomes can be much larger, and hence a table would be tedious to write down. Worse still, the number of possible outcomes could be infinite, in which case, good luck writing a table for that. Examples: the height of people, or rolling 10 dice together.

Solution — Function?

What if we use a mathematical function to model the relationship between outcome and probability?

Yes: a probability distribution function allows us to describe the behavior of the random variable in a concise and mathematically rigorous way, without having to list out every possible outcome.
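
Sketching the idea in Python, the fair-die table above collapses into a single function (fair_die_pmf is just an illustrative name):

# the same fair-die distribution expressed as a function
# instead of a table
def fair_die_pmf(x):
    return 1/6 if x in (1, 2, 3, 4, 5, 6) else 0

print(fair_die_pmf(4))  # 0.1666...
print(fair_die_pmf(7))  # 0, an impossible outcome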

Note — the terms Probability Distribution and Probability Distribution Function are often used interchangeably.

Types of Probability Distribution Functions

Based on the type of random variable, there are two types of probability distribution functions.

PDFs can describe either discrete or continuous data. The difference is that discrete variables can only take on specific values, such as integers, yes vs. no, times of day, and so on. A continuous variable, in contrast, contains all values along the curve, including very small fractions or decimals out to a theoretically infinite number of places.

Famous Probability Distributions


Why are Probability Distributions important?

  • Gives an idea about the shape/distribution of the data.
  • If our data follows a famous distribution then we automatically know a lot about the data.

A note on Parameters:

Parameters in probability distributions are numerical values that determine the shape, location, and scale of the distribution. Different probability distributions have different sets of parameters, and understanding these parameters is essential in statistical analysis and inference.
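
For example, the normal distribution is fully determined by a location parameter (its mean) and a scale parameter (its standard deviation). A minimal sketch with scipy.stats, using arbitrary parameter values:

from scipy.stats import norm

# two normal distributions with different location and scale parameters
standard = norm(loc=0, scale=1)  # mean 0, standard deviation 1
shifted = norm(loc=5, scale=3)   # mean 5, standard deviation 3

# the same value has a different density under each distribution
print(standard.pdf(0))  # ~0.3989, the peak of the standard normal
print(shifted.pdf(0))   # ~0.0332, far from the second curve's center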

Probability Distribution Functions:

A probability distribution function is a mathematical function that describes the probability of obtaining different values of a random variable in a particular probability distribution.

Types of Probability Distribution Functions

  1. Probability Mass Function
  2. Probability Density Function
  3. Cumulative Distribution Function

Probability Mass Function:

The probability mass function (PMF) is a function that gives the probability that a discrete random variable takes on a specific value; in other words, it describes the probability distribution of a discrete random variable.

The probabilities assigned by the PMF must satisfy two conditions:
a. The probability assigned to each value must be non-negative (i.e., greater than or equal to zero).
b. The sum of the probabilities assigned to all possible values must equal 1.

Let’s create a PMF for the experiment below.

Experiment: rolling a die 10,000 times.

Since this is a simulated random experiment, the empirical probabilities come out close to, but not exactly, 1/6.

import pandas as pd
import random

# simulate 10,000 rolls of a fair six-sided die
L = []
for i in range(10000):
    L.append(random.randint(1, 6))

# relative frequency of each outcome: the empirical PMF
s = (pd.Series(L).value_counts() / len(L)).sort_index()

print(s)

Output:
1 0.1619
2 0.1637
3 0.1726
4 0.1598
5 0.1678
6 0.1742
Name: count, dtype: float64

Cumulative Distribution Function (CDF) of PMF:

The Cumulative Distribution Function (CDF) of a Probability Mass Function (PMF) is a function that describes the probability that a discrete random variable takes on a value less than or equal to a given value.

Let X be a discrete random variable with a PMF p(x). The CDF of X, denoted by F(x), is defined as follows:

F(x) = P(X ≤ x) = ∑p(i) for all i ≤ x

Let's understand this with a simple example. For a die roll, the possible outcomes for X (a discrete random variable) are 1, 2, 3, 4, 5, or 6.

The PMF of X is given by:

p(1) = 1/6, p(2) = 1/6, p(3) = 1/6, p(4) = 1/6, p(5) = 1/6, p(6) = 1/6

To find the CDF of X, we can add up the probabilities of the outcomes up to and including each value of X:

F(1) = P(X ≤ 1) = p(1) = 1/6
F(2) = P(X ≤ 2) = p(1) + p(2) = 1/6 + 1/6 = 1/3
F(3) = P(X ≤ 3) = p(1) + p(2) + p(3) = 1/6 + 1/6 + 1/6 = 1/2
F(4) = P(X ≤ 4) = p(1) + p(2) + p(3) + p(4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
F(5) = P(X ≤ 5) = p(1) + p(2) + p(3) + p(4) + p(5) = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 5/6
F(6) = P(X ≤ 6) = p(1) + p(2) + p(3) + p(4) + p(5) + p(6) = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1

This means that the probability of getting 1 or fewer on a dice roll is 1/6, the probability of getting 3 or fewer is 1/2, and the probability of getting 6 or fewer is 1.
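
The same cumulative sums can be computed from the empirical PMF of the 10,000-roll simulation earlier; this is a minimal sketch that reuses the Series s from that code:

# the cumulative sum of the empirical PMF gives the empirical CDF
F = s.cumsum()
print(F)  # each entry approximates P(X <= x); the last entry is exactly 1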

PMF and CDF of a die roll

Probability Density Function (PDF)

A Probability Density Function (PDF) describes the probability distribution of a continuous random variable. It is a function that maps each value of the variable to a probability density.

Unlike a PMF for a discrete random variable, the PDF cannot give the probability of getting any specific value since the probability of a continuous random variable taking on any particular value is zero. Instead, the PDF describes the likelihood of the variable taking on a range of values.

Mathematically, the PDF is defined as the derivative of the Cumulative Distribution Function (CDF) of the continuous random variable. The CDF gives the probability that the random variable is less than or equal to a certain value, while the PDF gives the probability density at that value.

For example, let’s consider the height of a person as a continuous random variable X. We can use a PDF to describe the probability density of different heights. Let’s assume that the PDF of X is given by:

f(x) = 1/18 * (x − 4) for 4 ≤ x ≤ 10
f(x) = 0 otherwise

This means that the probability density of a height between 4 and 10 feet is proportional to the height minus 4, and the probability density of any other height is zero. We can verify that the PDF is properly normalized by integrating it over its entire domain:

∫4 to 10 f(x) dx = ∫4 to 10 (1/18 * (x − 4)) dx = 1

This tells us that the total probability over the range of 4 to 10 feet is 1, or 100%.

Using this PDF, we can calculate the probability of a person being between 5 and 7 feet tall:

P(5 ≤ X ≤ 7) = ∫5 to 7 f(x) dx = ∫5 to 7 (1/18 * (x − 4)) dx = 2/9

This means that the probability of a person’s height being between 5 and 7 feet is 2/9, or about 22.2%.
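
These integrals are easy to verify numerically, for example with scipy.integrate.quad. This is a sketch of the check, assuming the PDF defined above:

from scipy.integrate import quad

def f(x):
    # the example PDF: proportional to (x - 4) on [4, 10], zero elsewhere
    return (x - 4) / 18 if 4 <= x <= 10 else 0

total, _ = quad(f, 4, 10)
prob, _ = quad(f, 5, 7)
print(total)  # ~1.0, so the PDF is properly normalized
print(prob)   # ~0.2222, i.e. 2/9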

Normal Distribution (Probability Density Function)

Density Estimation

Density estimation is a statistical technique used to estimate the probability density function (PDF) of a random variable from a sample of data. In other words, it is a way to estimate the underlying probability distribution of a set of data points.

The main goal of density estimation is to construct a function that closely approximates the probability distribution of the data. This function can then be used to analyze and make predictions about the data.

There are various methods for density estimation, including parametric and non-parametric approaches. Parametric methods assume that the data follows a specific probability distribution (such as a normal distribution), while non-parametric methods do not make any assumptions about the distribution and instead estimate it directly from the data.

Commonly used techniques for density estimation include kernel density estimation (KDE), histogram estimation, and Gaussian mixture models (GMMs). The choice of method depends on the specific characteristics of the data and the intended use of the density estimate.

Parametric Density Estimation:

Parametric density estimation is a statistical technique that involves making assumptions about the shape or form of the probability density function (PDF) of a random variable, and then estimating the parameters of that function from a sample of data.

In other words, we assume that the data follows a certain probability distribution, such as the normal distribution or the Poisson distribution, and then estimate the parameters of that distribution, such as the mean and standard deviation for the normal distribution, from the available data.

Once we have estimated the parameters of the assumed distribution, we can use that distribution to model the PDF of the data. This allows us to make predictions about the data and perform statistical inference.

Here’s an Example:

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import normal
from scipy.stats import norm

# generate data from a normal distribution
sample = normal(loc=50, scale=5, size=1000)

# plot a histogram to inspect the shape of the data
plt.hist(sample, bins=10)
plt.show()

# estimate the distribution's parameters from the sample
sample_mean = sample.mean()
sample_std = sample.std()

# fit a normal distribution with the estimated parameters
dist = norm(sample_mean, sample_std)

# evaluate the fitted PDF across the range of the data
values = np.linspace(sample.min(), sample.max(), 100)
probabilities = dist.pdf(values)

# plot the normalized histogram and the fitted PDF together
plt.hist(sample, bins=10, density=True)
plt.plot(values, probabilities)
plt.show()

Parametric Density Estimation (Normal Distribution)

Non-Parametric Density Estimation

Non-parametric density estimation is a statistical technique that does not require making assumptions about the shape or form of the probability density function (PDF) of a random variable. Unlike parametric methods, it does not assume a predefined probability distribution such as the Gaussian.

It estimates the PDF directly from the available data using techniques such as kernel density estimation or histogram estimation.

It is useful when the distribution is unclear or does not match one of the famous families. However, non-parametric density estimation can be computationally intensive and may require more data than parametric methods to achieve accurate estimates.

Kernel Density Estimation

The KDE technique involves using a kernel function to smooth out the data and create a continuous estimate of the underlying density function.

In KDE, a smooth estimate of the density function is obtained by summing up a set of kernel functions, which are centered at each observation point and have a fixed bandwidth or window width. The kernel functions are usually chosen to be symmetric and bell-shaped, such as the Gaussian distribution.

The bandwidth or window width parameter determines the trade-off between the bias and variance of the density estimate. A smaller bandwidth will produce a more variable, jagged density estimate, while a larger bandwidth will produce a smoother, but potentially more biased, density estimate.

# generate a bimodal sample from two normal distributions
sample1 = normal(loc=20, scale=5, size=300)
sample2 = normal(loc=40, scale=5, size=700)
sample = np.hstack((sample1, sample2))

# plot a histogram to inspect the distribution
plt.hist(sample, bins=50)
plt.show()

from sklearn.neighbors import KernelDensity

model = KernelDensity(bandwidth=3, kernel='gaussian')

# scikit-learn expects a 2D array of shape (n_samples, n_features)
sample = sample.reshape((len(sample), 1))
model.fit(sample)

values = np.linspace(sample.min(), sample.max(), 100)
values = values.reshape((len(values), 1))

# score_samples() returns the log of the density estimate,
# so exponentiate to recover the density itself
probabilities = np.exp(model.score_samples(values))

# plot the normalized histogram and the KDE curve together
plt.hist(sample, bins=50, density=True)
plt.plot(values[:, 0], probabilities)
plt.show()

KDE
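
To see this bias-variance trade-off in action, we can refit the model with a few different bandwidths; this is a sketch that reuses the sample and values arrays from the code above:

# compare KDE curves at several bandwidths on the same data
plt.hist(sample[:, 0], bins=50, density=True, alpha=0.3)
for bw in [0.5, 3, 10]:
    kde = KernelDensity(bandwidth=bw, kernel='gaussian').fit(sample)
    plt.plot(values[:, 0], np.exp(kde.score_samples(values)), label=f'bandwidth={bw}')
plt.legend()
plt.show()
# a small bandwidth gives a jagged, high-variance curve; a large one
# oversmooths the two peaks into a single biased bump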

Cumulative Distribution Function (CDF) of PDF


The CDF is obtained by integrating the PDF over the range of the random variable. The formula for the CDF F(x) of a continuous random variable X with probability density function f(x) is:

F(x) = ∫_{-∞}^x f(t) dt

where t is the dummy variable of integration.

The CDF can be visualized as a graph where the x-axis represents the values of the random variable and the y-axis represents the cumulative probability, rising from 0 to 1.
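
As a concrete check (a minimal sketch using the standard normal as the example distribution), integrating the PDF up to a point x reproduces the library's own CDF value:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = 1.0
# integrate the standard normal PDF from -infinity up to x
integral, _ = quad(norm.pdf, -np.inf, x)
print(integral)     # ~0.8413
print(norm.cdf(x))  # ~0.8413, matching the integral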

How to use PDF in Data Science

In the code below, we estimate and plot PDFs to compare the three iris species. The code is useful for visualizing the distribution of the iris dataset, allowing us to compare the distributions of different measurements across the three species. The PDF and ECDF plots help us gain insights into the behavior of the data, estimate probabilities of certain events, and make predictions based on the underlying distribution of the data.

import seaborn as sns
import matplotlib.pyplot as plt

# load the iris dataset
df = sns.load_dataset('iris')

# plot the PDF of sepal length for all 3 species
sns.kdeplot(data=df, x='sepal_length', hue='species')
plt.show()

# plot the PDF of sepal width for all 3 species
sns.kdeplot(data=df, x='sepal_width', hue='species')
plt.show()

# plot the PDF of petal length for all 3 species
sns.kdeplot(data=df, x='petal_length', hue='species')
plt.show()

# plot the PDF of petal width for all 3 species
sns.kdeplot(data=df, x='petal_width', hue='species')
plt.show()

# plot the empirical CDF of petal width for all 3 species
sns.ecdfplot(data=df, x='petal_width', hue='species')
plt.show()

Different measurements across the three species.
Final Plot

2D Density Plots:

This code creates a 2D kernel density estimate (KDE) plot of the petal_length and sepal_length variables in the iris dataset using seaborn's jointplot() function. The x and y parameters specify the variables to plot, and the kind parameter is set to "kde" to create a KDE plot. The fill parameter is set to True to fill the contours with color, and the cbar parameter is set to True to show a color bar for the density scale.

The resulting plot shows the density of points for the petal_length vs. sepal_length measurements in the df dataset. The darker regions indicate a higher density of points, while the lighter regions indicate a lower density. This can help to identify patterns and relationships between the two variables. The marginal distributions for each variable are also shown on the sides of the plot as KDE plots, allowing you to see the distributions of each variable separately.

import seaborn as sns

# load the iris dataset
df = sns.load_dataset('iris')

# 2D KDE of petal length vs. sepal length, with filled contours and a color bar
sns.jointplot(data=df, x="petal_length", y="sepal_length", kind="kde", fill=True, cbar=True)

2D Density Plots


Written by Jinendrasingh

An Aspiring Data Analyst and Computer Science grad. Sharing insights and tips on data analysis through my blog. Join me on my journey!
