Normal Distribution in Data Science
In this blog I will cover the “Normal Distribution” and its use in Data Science. For previous topics in “Statistics in Data Science”, you can refer to my previous blogs. I have written blogs on multiple data science topics so far, and they all follow the “Data Science Mentorship Program” of CampusX by Nitish Sir. You can also use them for revision, as I wrote them for my own revision too. So let's start with the definition:
The normal distribution, also known as the Gaussian distribution or bell curve, is a probability distribution that is commonly used in statistics and probability theory. It is a continuous probability distribution that is symmetric, bell-shaped, and defined by two parameters: the mean (μ) and the standard deviation (σ).
The shape of the normal distribution is characterized by its bell-shaped curve, which is centered at the mean. The mean represents the average or expected value of the distribution, while the standard deviation measures the spread or dispersion of the data points around the mean. The standard deviation determines the width of the bell curve, with larger values indicating a wider spread of data and smaller values indicating a narrower spread.
In a normal distribution, the data is symmetrically distributed around the mean, with the highest probability density occurring at the mean and decreasing as the distance from the mean increases. The total area under the curve is equal to 1, representing the probability of all possible outcomes.
It is denoted as X ~ N(μ, σ²), where μ is the mean and σ is the standard deviation (σ² is the variance). Some texts write N(μ, σ) instead, so check which convention is in use.
Why is the Normal Distribution important?
The normal distribution has several important properties and is widely used in various fields of study. Many natural phenomena and real-world data approximately follow a normal distribution, such as heights and weights of people, test scores, errors in measurements, and financial market returns. It is also a fundamental assumption in many statistical methods and hypothesis-testing procedures. If data follows a normal distribution, its mean and standard deviation are enough to describe it completely, and we can compute the probability of any range of values.
The normal distribution is mathematically defined by the probability density function (PDF) equation:
f(x) = (1 / (σ√(2π))) · e^(-(x - μ)² / (2σ²))
where:
- x is a value of the random variable
- μ is the mean
- σ is the standard deviation
- π is a mathematical constant (approximately 3.14159)
- e is the base of the natural logarithm (approximately 2.71828)
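To make the PDF concrete, here is a minimal Python sketch that evaluates the standard normal density formula directly and compares it with `statistics.NormalDist` from the standard library. The function names `normal_pdf` and `area_under_pdf` and the parameters μ = 170, σ = 10 are my own illustrative assumptions, not values from the course.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Evaluate the normal density at x for mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


def area_under_pdf(mu: float, sigma: float, steps: int = 20_000) -> float:
    """Approximate the total area with the trapezoidal rule over mu +/- 8 sigma."""
    lo, hi = mu - 8.0 * sigma, mu + 8.0 * sigma
    width = (hi - lo) / steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lo + i * width, mu, sigma)
    return total * width


# Hypothetical parameters chosen only for illustration.
mu, sigma = 170.0, 10.0

print(normal_pdf(mu, mu, sigma))       # density at the mean (the peak), about 0.0399
print(NormalDist(mu, sigma).pdf(mu))   # same value from the standard library
print(area_under_pdf(mu, sigma))       # very close to 1
```

The density is highest at the mean, and the numerical area comes out very close to 1, which matches the properties described earlier.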
Standard Normal Variate
A Standard Normal Variate (Z) is a normally distributed random variable that has been standardized so that its mean is 0 and its standard deviation is 1.
Standardizing a normal distribution allows us to compare different distributions with each other, and to calculate probabilities using standardized tables or software.
To standardize a variable, you can follow these steps:
- Calculate the mean (μ) and standard deviation (σ) of the variable.
- For each data point, subtract the mean from the value.
- Divide the result by the standard deviation.
The formula for standardizing a variable x is:
z = (x - μ) / σ
where z represents the standardized value (z-score), x is the original value, μ is the mean, and σ is the standard deviation.
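The standardization steps can be sketched in a few lines of Python. This is only an illustration: the `standardize` helper and the `scores` list are made up, and I assume the population standard deviation (dividing by n), which is what σ usually means in this formula.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return the z-score of every value: (x - mean) / standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("all values are identical, so z-scores are undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical test scores, used only as an example.
scores = [52.0, 61.0, 70.0, 79.0, 88.0]
z_scores = standardize(scores)

print([round(z, 3) for z in z_scores])
print(fmean(z_scores), pstdev(z_scores))  # mean close to 0, standard deviation close to 1
```

After standardizing, the values have mean 0 and standard deviation 1, so they can be compared with values from any other standardized variable.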
A z-table tells you the area under the standard normal curve to the left of a given z-score, which is the probability of getting a value less than or equal to that z-score. You can find one at https://www.ztable.net/
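Instead of reading a printed z-table, the same area can be looked up in software. The sketch below uses `statistics.NormalDist` from the Python standard library. The score of 85 and the distribution N(70, 10) are hypothetical numbers, and `area_left_of` is just a helper name I picked.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def area_left_of(z: float) -> float:
    """Area under the standard normal curve to the left of z (a z-table lookup)."""
    return standard_normal.cdf(z)


# Hypothetical example: a score of 85 when scores have mean 70 and standard deviation 10.
x, mu, sigma = 85.0, 70.0, 10.0
z = (x - mu) / sigma          # 1.5
p_below = area_left_of(z)     # about 0.9332, same as the z-table entry for 1.5
p_above = 1.0 - p_below       # area to the right of z

print(z, round(p_below, 4), round(p_above, 4))
```

The area to the right of a z-score is one minus the table value, because the total area under the curve is 1.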
Properties of Normal Distribution
- Symmetry: The normal distribution is symmetric around its mean. This means that the distribution’s left and right halves are mirror images of each other.
- Measures of central tendency are equal: The mean, median, and mode of a normal distribution are all equal and located at the center of the distribution.
- Empirical rule: The normal distribution follows the empirical rule, also known as the 68–95–99.7 rule. According to this rule, approximately 68% of the data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and nearly 99.7% falls within three standard deviations.
- The area under the curve: The total area under the curve is equal to 1, representing the probability of all possible outcomes.
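The empirical rule can be checked numerically. Here is a small sketch using the standard library's `NormalDist`. The helper name `coverage_within` is my own, and the rule holds for any μ and σ, so the defaults are just the standard normal.

```python
from statistics import NormalDist


def coverage_within(k: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(coverage_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The exact values are about 68.27%, 95.45%, and 99.73%, which is where the rounded 68–95–99.7 figures come from.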
Skewness
A normal distribution is a bell-shaped, symmetrical distribution with a specific mathematical formula that describes how the data is spread out. Nonzero skewness indicates that the data is not symmetrical, which means it is not normally distributed.
Skewness is a measure of the asymmetry of a probability distribution. It quantifies the extent to which a distribution deviates from being symmetrical. In other words, skewness indicates whether the data is skewed to the left (negatively skewed), skewed to the right (positively skewed), or symmetrically distributed.
Skewness is typically defined in terms of the third standardized moment of a distribution. It is calculated by subtracting the mean from each data point, dividing by the standard deviation, raising the result to the power of three, and then averaging over all n data points. The formula for skewness is:
skewness = (1/n) * Σ((x - μ)/σ)³
The greater the skew, the greater the distance typically is between the mode, median, and mean.
The skewness value can be positive, negative, or zero. Here’s what each value represents:
- Positive skewness (right-skewed): A positive skewness value indicates that the tail of the distribution is longer on the right side, and the bulk of the data is concentrated towards the left. The mean is typically greater than the median in a right-skewed distribution.
- Negative skewness (left-skewed): A negative skewness value indicates that the tail of the distribution is longer on the left side, and the bulk of the data is concentrated towards the right. The mean is typically less than the median in a left-skewed distribution.
- Zero skewness: A perfectly symmetrical distribution, such as the normal distribution, has a skewness of zero, with an equal balance on both sides. The mean and median are equal in a symmetrical distribution. Note that a skewness of zero on its own does not prove that the data is symmetrical or normal.
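Here is a minimal sketch of the skewness formula in Python, applied to three small made-up datasets. The `skewness` function follows the (1/n) * Σ((x - μ)/σ)³ form with the population standard deviation. Library implementations may apply a sample-size correction, so their numbers can differ slightly.

```python
from statistics import fmean, median, pstdev


def skewness(values):
    """Third standardized moment: the mean of ((x - mu) / sigma) ** 3."""
    n = len(values)
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation (divides by n)
    if sigma == 0:
        raise ValueError("all values are identical, so skewness is undefined")
    return sum(((x - mu) / sigma) ** 3 for x in values) / n


# Hypothetical datasets, used only for illustration.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]   # long tail on the right
left_skewed = [-x for x in right_skewed]          # mirror image, long tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3), "mean:", fmean(data), "median:", median(data))
```

In the right-skewed data the mean (4.7) is pulled above the median (3) by the long right tail, and the opposite happens in the mirrored data.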
CDF of Normal Distribution
The cumulative distribution function (CDF) of a normal distribution gives the probability that a random variable X, following a normal distribution, takes on a value less than or equal to a given value x. The CDF of the standard normal distribution is often denoted as Φ(z), and the CDF of a general normal distribution can be written as Φ((x - μ)/σ).
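The CDF is obtained by integrating the PDF from -∞ up to x. This integral has no closed form in terms of elementary functions, so it is usually written using the error function (erf) and evaluated numerically or with a z-table:
F(x) = (1/2) · [1 + erf((x - μ) / (σ√2))]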
Evaluating the CDF at a specific value x gives the probability that a normally distributed random variable takes a value less than or equal to x. Differences of CDF values give the probability of an interval: P(a < X ≤ b) = P(X ≤ b) - P(X ≤ a).
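As a rough sketch, the normal CDF can be computed in Python with `math.erf`, using the standard error-function form of the normal CDF. The function names `normal_cdf` and `prob_between` and the parameters μ = 100, σ = 15 are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_between(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a < X <= b), computed as the difference of two CDF values."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)


# Hypothetical parameters, for illustration only.
mu, sigma = 100.0, 15.0

p_below_115 = normal_cdf(115.0, mu, sigma)        # one sigma above the mean, about 0.8413
p_85_115 = prob_between(85.0, 115.0, mu, sigma)   # within one sigma, about 0.6827

print(round(p_below_115, 4), round(p_85_115, 4))
print(NormalDist(mu, sigma).cdf(115.0))           # cross-check with the standard library
```

The second value is the "68%" of the empirical rule, obtained here as a difference of two CDF values.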
Use of normal distribution in Data Science
The normal distribution is extensively utilized in data science for various purposes. Here are some of its key applications:
- Descriptive statistics: The normal distribution is often used to describe and summarize data. Parameters such as mean and standard deviation are commonly calculated to provide insights into the central tendency and spread of the data.
- Data modeling: In many statistical and machine learning algorithms, assumptions of normality are made to simplify the analysis. Linear regression, ANOVA, t-tests, and other techniques often assume that the errors or residuals follow a normal distribution.
- Hypothesis testing: The normal distribution plays a crucial role in hypothesis testing. When sample sizes are sufficiently large and assumptions are met, statistical tests such as Z-tests and t-tests use the properties of the normal distribution to assess the significance of differences between groups or estimate confidence intervals.
- Estimation and inference: Maximum likelihood estimation (MLE) is a commonly used method for estimating the parameters of a distribution. In the case of the normal distribution, MLE provides estimates of the mean and standard deviation based on observed data.
- Quality control: The normal distribution is employed in quality control processes to monitor and assess the consistency and quality of products or processes. Control charts, such as the Shewhart control chart, rely on the normal distribution to set control limits (commonly the mean plus or minus three standard deviations) and flag measurements that fall outside them.
- Risk assessment and finance: Many financial models, such as the Black-Scholes model for option pricing, assume that asset prices follow a log-normal distribution, which is closely related to the normal distribution. Risk assessment and portfolio management also often assume that returns on investments are normally distributed.
- Simulation and random number generation: Simulating data based on assumed distributions is a common practice in data science. Generating random numbers from a normal distribution enables the creation of synthetic datasets for testing algorithms or conducting Monte Carlo simulations.
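Two of these applications, simulation and parameter estimation, can be tried together in a short sketch. It uses only the Python standard library. The true parameters (μ = 50, σ = 5), the seed, and the sample size are arbitrary assumptions, and the three-sigma limits at the end are only a simplified stand-in for a real control chart.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulate a synthetic dataset from an assumed normal distribution.
true_mu, true_sigma = 50.0, 5.0
sample = [rng.gauss(true_mu, true_sigma) for _ in range(10_000)]

# Maximum likelihood estimates for a normal distribution:
# the sample mean, and the standard deviation that divides by n (not n - 1).
mu_hat = fmean(sample)
sigma_hat = pstdev(sample, mu_hat)

# Simplified three-sigma limits, in the spirit of a control chart.
lower, upper = mu_hat - 3 * sigma_hat, mu_hat + 3 * sigma_hat
share_inside = sum(lower <= x <= upper for x in sample) / len(sample)

print(round(mu_hat, 2), round(sigma_hat, 2))
print(round(share_inside, 4))  # should be close to 0.997 for normal data
```

With a large sample the estimates land close to the true parameters, and roughly 99.7% of the simulated points fall inside the three-sigma limits, as the empirical rule predicts.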
These are just a few examples highlighting the wide-ranging use of the normal distribution in data science. Its properties and prevalence make it a valuable tool for analyzing and understanding data, making predictions, and making informed decisions.