Descriptive Statistics for Data Science
In this blog, I will try to cover the main topics of descriptive statistics. Descriptive statistics is a branch of statistics that focuses on summarizing and describing the main features of a dataset. It includes measures of central tendency and measures of variability (dispersion) and is often used to gain initial insights into a dataset and identify patterns or trends. Descriptive statistics are important for providing a clear and concise summary of data, which can guide further statistical analysis.
I am starting with the types of data, because in statistics we first need to know what kind of data we are working with.
Types of Data
Categorical Data (Qualitative): Refers to data that cannot be measured or quantified using numbers, such as gender, color, or type of car. Categorical data can be further divided into two subtypes: Nominal data, which has no natural ordering (such as colors or types of fruit), and Ordinal data, which has a natural ordering (such as rankings or ratings on a scale).
Numerical Data (Quantitative): Refers to data that can be measured or quantified using numbers, such as height, weight, temperature, or age. Numerical data can be further divided into two subtypes: Discrete data, which takes on only specific numerical values (such as the number of children in a family), and Continuous data, which can take on any value within a range (such as height or weight).
Measure of Central Tendency
Measures of central tendency are statistical values that represent the center or typical value of a dataset. In simpler terms, they are numbers that help us understand where most of the data lies in a distribution.
The three most common measures of central tendency are Mean, Median, and Mode.
Mean:
The mean is the average value of a dataset and is calculated by adding up all the values and dividing by the number of values.
Mean = (Sum of all values in the dataset) / (Number of values in the dataset)
Issue with mean: The mean can be affected by extreme values or outliers in a dataset. For example, if a dataset contains one or more unusually high or low values, the mean may not accurately represent the typical value of the data.
Median:
The Median represents the middle value of a dataset when the data is arranged in order.
For example (odd number of values): in the dataset {2, 5, 7, 9, 12}, the median is 7 because it’s the middle value when the data is arranged in order.
For example (even number of values): in the dataset {2, 5, 7, 9, 12, 15}, the median is (7+9)/2 = 8 because it’s the average of the two middle values (7 and 9) when the data is arranged in order.
The median is useful because it’s less sensitive to outliers or extreme values in a dataset compared to the mean. It can provide a more accurate representation of the typical value when the data is skewed or has extreme values. However, it doesn’t take into account all the values in the dataset, and it may not be as representative of the data as the mean in certain cases.
Mode:
Mode is the most frequently occurring value in a dataset. It’s the value that appears most often in the data.
To calculate the mode of a dataset, you need to identify which value or values occur most frequently. In some cases, there may be more than one mode if multiple values occur with the same highest frequency. The mode is mostly used for categorical data and discrete numerical data.
However, the mode has some limitations. It may not be as informative as the mean or median in certain cases, especially when the dataset is continuous or when there are several values with similar frequencies. Additionally, the mode may not exist or be unique in some datasets, especially if all values occur with the same frequency or if there is no clear clustering of values.
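As a quick illustration, here is a minimal Python sketch using the standard library’s statistics module; the dataset is made up for demonstration:

```python
import statistics

data = [2, 5, 7, 7, 9, 12, 35]  # 35 is an outlier that pulls the mean up

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(f"Mean: {mean:.2f}")  # 11.00 (inflated by the outlier 35)
print(f"Median: {median}")  # 7 (robust to the outlier)
print(f"Mode: {mode}")      # 7 (appears twice)
```

Notice how the single outlier drags the mean well above the median, which is exactly the issue described above.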
Weighted Mean:
The weighted mean is the sum of the products of each value and its weight, divided by the sum of the weights. It is used to calculate a mean when the values in the dataset have different importance or frequency.
weighted mean = (w₁x₁ + w₂x₂ + … + wₙxₙ) / (w₁ + w₂ + … + wₙ)
The weighted mean can be useful in situations where each value in the dataset has different importance or weight, such as when analyzing survey responses where each respondent may represent a different demographic group with varying proportions in the population.
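As a sketch, NumPy’s np.average accepts a weights argument; the course grades and weights below are hypothetical:

```python
import numpy as np

# Hypothetical course grades with different weights
scores = np.array([80, 90, 70])      # assignment, midterm, final exam
weights = np.array([0.2, 0.3, 0.5])  # the final exam counts the most

weighted_mean = np.average(scores, weights=weights)
print(weighted_mean)  # (0.2*80 + 0.3*90 + 0.5*70) / 1.0 = 78.0
```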
Trimmed Mean:
A trimmed mean is calculated by removing a certain percentage of the smallest and largest values from the dataset and then taking the mean of the remaining values. The percentage of values removed is called the trimming percentage.
The trimmed mean can be useful in situations where the data contains outliers or extreme values that may distort the mean and make it less representative of the majority of the data.
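SciPy provides this directly through scipy.stats.trim_mean; a minimal sketch with made-up data:

```python
from scipy import stats

data = [2, 5, 7, 7, 9, 12, 35]

# proportiontocut=0.2 removes 20% of the values from each tail,
# so here the smallest (2) and largest (35) values are dropped.
trimmed = stats.trim_mean(data, proportiontocut=0.2)
print(trimmed)  # mean of [5, 7, 7, 9, 12] = 8.0
```

Compare this with the ordinary mean of 11.0 for the same data: trimming the tails removes the influence of the outlier 35.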
Measure of Dispersion
The measure of dispersion is a numerical value that describes how a dataset is spread out or varied. It provides information about the variability or diversity of the data points around the central tendency, which can be useful in understanding the distribution and characteristics of the data.
The measures of central tendency alone are not adequate to describe data: two datasets can have the same mean but be entirely different. Thus, to describe data, one also needs to know the extent of variability, which is given by the measures of dispersion.
There are several commonly used measures of dispersion, including:
Range:
The range is a simple measure of dispersion that represents the difference between the largest and smallest values in a dataset.
Range = Largest Value − Smallest Value
It provides a quick way to assess the spread of the data, but it is sensitive to outliers, which can distort the range and give a misleading picture of the variability. For this reason, more robust measures of dispersion, such as the variance or standard deviation, are often used.
Variance:
Variance describes how much the individual values in a dataset deviate from the mean. The variance is the average of the squared differences between each data point and the mean.
Population Variance: σ² = Σ(xᵢ − μ)² / N
Sample Variance: s² = Σ(xᵢ − x̄)² / (n − 1)
Mean Absolute Deviation (MAD):
Mean absolute deviation (MAD) is a measure of dispersion that describes how much the individual values in a dataset deviate from the mean, on average.
Population MAD: σ₁ = (Σ|xᵢ − μ|) / N
Sample MAD: s₁ = (Σ|xᵢ − x̄|) / n
Unlike the variance, MAD is not in squared units, which can make it easier to interpret and compare to other measures of dispersion. It’s also less sensitive to outliers than the variance, which can make it a more robust measure of dispersion in some cases.
MAD has several limitations, however. First, it doesn’t take into account the direction of the deviations, meaning that positive and negative deviations are treated equally. Second, it can be less efficient as an estimator of dispersion compared to the variance when the data are normally distributed.
Standard deviation:
The standard deviation is the square root of the variance. It is a widely used measure of dispersion that is useful in describing the shape of a distribution.
The formula for the sample standard deviation is:
s = sqrt[Σ(xᵢ − x̄)² / (n − 1)]
The standard deviation measures the typical distance of each data point from the mean, in the same units as the original data, which makes it easier to interpret than the variance. Unlike the range, it takes every observation into account, making it a more comprehensive measure of variability.
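A minimal NumPy sketch computing these dispersion measures on made-up data (np.var and np.std default to the population formulas; ddof=1 gives the sample versions):

```python
import numpy as np

data = np.array([2, 5, 7, 7, 9, 12, 35])

sample_var = np.var(data, ddof=1)          # Σ(xᵢ − x̄)² / (n − 1)
sample_std = np.std(data, ddof=1)          # square root of the sample variance
mad = np.mean(np.abs(data - data.mean()))  # mean absolute deviation

print(f"Variance: {sample_var:.2f}")
print(f"Std dev: {sample_std:.2f}")
print(f"MAD: {mad:.2f}")
```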
Coefficient of Variation:
The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage. It is a dimensionless quantity that describes the amount of variability in a dataset relative to its mean, which makes it useful for comparing the variability of datasets with different means.
The formula for calculating the coefficient of variation is:
CV = (standard deviation / mean) × 100%
For example, it can be used to compare the variability of measurements taken on entirely different scales, such as body weights across different species, because the units cancel out.
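A quick sketch, assuming two made-up datasets measured in completely different units:

```python
import numpy as np

def cv(data):
    """Coefficient of variation as a percentage (sample std / mean)."""
    data = np.asarray(data, dtype=float)
    return np.std(data, ddof=1) / np.mean(data) * 100

mouse_weights = [20, 22, 19, 21, 23]         # grams
elephant_weights = [5000, 5200, 4900, 5100]  # kilograms

# Directly comparable despite wildly different scales, since CV is unitless
print(f"Mice: {cv(mouse_weights):.1f}%")
print(f"Elephants: {cv(elephant_weights):.1f}%")
```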
Graphs for Univariate Analysis
Univariate analysis is a statistical analysis technique that involves analyzing a single variable at a time.
In this section, we will learn which type of graph to use based on the type of data we have. The choice of graph depends on whether the data is numerical or categorical.
Categorical — Frequency Distribution Table & Cumulative Frequency:
A frequency distribution table is a table that summarizes the number of times (or frequency) that each value occurs in a dataset.
Let’s say we have a survey of 200 people and we ask them about their favorite type of vacation, which could be one of six categories: Beach, City, Adventure, Nature, Cruise, or Other.
Relative frequency is the proportion or percentage of a category in a dataset or sample. It is calculated by dividing the frequency of a category by the total number of observations in the dataset or sample.
Cumulative frequency is the running total of frequencies of a variable or category in a dataset or sample. It is calculated by adding up the frequencies of the current category and all previous categories in the dataset or sample.
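As a sketch of how such a table can be built with pandas, using fabricated counts for the vacation survey described above:

```python
import pandas as pd

# Hypothetical responses from the 200-person vacation survey
responses = pd.Series(
    ["Beach"] * 70 + ["City"] * 45 + ["Adventure"] * 30
    + ["Nature"] * 25 + ["Cruise"] * 20 + ["Other"] * 10
)

freq = responses.value_counts()  # frequency of each category
table = pd.DataFrame({
    "Frequency": freq,
    "Relative Frequency": freq / freq.sum(),  # proportion per category
    "Cumulative Frequency": freq.cumsum(),    # running total
})
print(table)
```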
Numerical — Frequency Distribution Table & Histogram
A numerical frequency distribution table is similar to the frequency distribution table mentioned earlier: for discrete data it lists each individual value along with its frequency of occurrence, while continuous data is usually grouped into intervals (bins) first. A histogram then displays these frequencies as bars.
Suppose we have the following data set: 8, 5, 6, 7, 9, 8, 7, 6, 8, 10
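A minimal matplotlib sketch that tabulates this dataset and draws its histogram:

```python
from collections import Counter
import matplotlib.pyplot as plt

data = [8, 5, 6, 7, 9, 8, 7, 6, 8, 10]

# Frequency of each individual value (discrete data)
print(sorted(Counter(data).items()))  # [(5, 1), (6, 2), (7, 2), (8, 3), (9, 1), (10, 1)]

# Histogram: integer bin edges so each value gets its own bar
plt.hist(data, bins=range(5, 12), edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of the sample data")
plt.show()
```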
Shapes of Histogram:
Common histogram shapes include symmetric (bell-shaped), right-skewed, left-skewed, uniform, and bimodal; the shape gives a quick first impression of how the data is distributed.
Graphs for Bivariate Analysis
Bivariate analysis is the analysis of two variables simultaneously. There are several types of graphs that can be used for bivariate analysis.
We have 3 Scenarios, based on which we create the graph:
- Categorical — Categorical
- Numerical — Numerical
- Numerical — Categorical
Categorical — Categorical:
Contingency Table/Crosstab: A contingency table, also known as a cross-tabulation or crosstab, is a type of table used in statistics to summarize the relationship between two categorical variables.
A contingency table displays the frequencies or relative frequencies of the observed values of the two variables, organized into rows and columns.
Suppose a sample of 200 people was surveyed, with 100 men and 100 women. They were asked to choose their favorite pizza toppings from a list of options. The results are summarized in the following contingency table:
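A sketch of how such a crosstab can be produced with pandas; the responses below are fabricated for illustration:

```python
import pandas as pd

# Hypothetical survey records: gender and favorite pizza topping
df = pd.DataFrame({
    "gender": ["Man", "Woman", "Man", "Woman", "Man", "Woman", "Man", "Woman"],
    "topping": ["Pepperoni", "Mushroom", "Pepperoni", "Pepperoni",
                "Mushroom", "Onion", "Onion", "Mushroom"],
})

# Rows: gender, columns: topping, cells: frequencies
print(pd.crosstab(df["gender"], df["topping"], margins=True))
```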
Numerical — Numerical
For bivariate numerical-numerical data, we can create a scatter plot. Scatter plots are a simple yet powerful tool for analyzing bivariate numerical data and can help us visualize complex relationships and patterns in the data.
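A minimal scatter-plot sketch with made-up height and weight data:

```python
import matplotlib.pyplot as plt

# Hypothetical heights (cm) and weights (kg) for eight people
heights = [160, 165, 170, 172, 175, 180, 185, 190]
weights = [55, 60, 65, 68, 72, 78, 84, 90]

plt.scatter(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height vs. Weight")
plt.show()
```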
Numerical — Categorical
For bivariate numerical-categorical data, we can create a variety of charts depending on the nature of the data and the question we are trying to answer. Here are a few common examples: Boxplot, Violin plot, Bar chart, and Scatter plot.
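For example, here is a seaborn sketch that compares a numerical variable across categories with a boxplot and a violin plot, using seaborn’s built-in tips dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# total_bill (numerical) broken down by day (categorical)
tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[0])
sns.violinplot(data=tips, x="day", y="total_bill", ax=axes[1])
plt.tight_layout()
plt.show()
```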
Quantiles and Percentiles
Quantiles are values that divide a dataset into equal subgroups, where each subgroup contains an equal number of data points. For example, the median, which is the middle value of a dataset, is a type of quantile that divides the dataset into two subgroups of equal size. Other common quantiles include quartiles, which divide the dataset into four subgroups, quintiles, which divide it into five, and deciles, which divide it into ten.
Percentiles, on the other hand, are values that divide a dataset into 100 equal subgroups, where each subgroup contains a certain percentage of the data points. For example, the 25th percentile is the value below which 25% of the data points fall, while the 75th percentile is the value below which 75% of the data points fall. The median, which is the 50th percentile, is also a type of percentile.
To calculate the percentile value for a given percentage (p) in a dataset, follow these steps:
- Arrange the dataset in order from lowest to highest.
- Calculate the index (i) of the percentile using the formula i = (p/100) × (n + 1), where n is the total number of data points in the dataset.
- Find the value at index (i) in the sorted dataset. This value is the percentile value for the given percentage (p).
Note: If (p/100) × (n + 1) is a whole number, then the percentile is the value at that index. If it is not a whole number, round up to the nearest whole number to get the index of the closest value.
For example, suppose you have the following dataset of 10 test scores:
50, 55, 60, 65, 70, 75, 80, 85, 90, 95
To calculate the value at the 75th percentile of this dataset using the formula, you would follow these steps:
Sort the dataset in ascending order:
50, 55, 60, 65, 70, 75, 80, 85, 90, 95
Calculate the index (i) of the 75th percentile using the formula:
i = (75/100) * (10 + 1) = 8.25
Round up to the nearest whole number to get i = 9.
Find the value at the 9th index in the sorted dataset:
The value at the 9th index is 90.
Therefore, the value at the 75th percentile of the dataset is 90.
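Here is a sketch implementing this exact rounding convention in plain Python. Note that library functions such as numpy.percentile interpolate between values by default, so they can return slightly different results:

```python
import math

def percentile_value(data, p):
    """Percentile using the (n + 1) position rule with round-up."""
    data = sorted(data)
    i = (p / 100) * (len(data) + 1)  # 1-based position in the sorted data
    idx = int(i) if i == int(i) else math.ceil(i)
    idx = min(idx, len(data))        # guard against p close to 100
    return data[idx - 1]             # convert to 0-based indexing

scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
print(percentile_value(scores, 75))  # i = 8.25 -> index 9 -> 90
```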
5 Number Summary
The five-number summary is a statistical summary of a dataset that includes five values: the minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum value. The five-number summary is useful for describing the central tendency, variability, and shape of a dataset.
Here are the steps to calculate the five-number summary of a dataset:
- Sort the dataset in ascending order.
- Find the minimum value, which is the smallest value in the dataset.
- Find the maximum value, which is the largest value in the dataset.
- Find the median (Q2), which is the middle value of the dataset. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.
- Find the first quartile (Q1), which is the median of the lower half of the dataset. To find Q1, calculate the median of the values below the median (Q2).
- Find the third quartile (Q3), which is the median of the upper half of the dataset. To find Q3, calculate the median of the values above the median (Q2).
The five-number summary provides a comprehensive summary of the distribution of the dataset. It can be used to identify outliers, measure the spread of the dataset, and describe the shape of the distribution.
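A NumPy sketch of the five-number summary. Note that there are several quartile conventions; numpy.percentile’s default linear interpolation can differ slightly from the split-halves method described above:

```python
import numpy as np

data = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95])

minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = data.max()

print(f"Min: {minimum}, Q1: {q1}, Median: {median}, Q3: {q3}, Max: {maximum}")
```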
Interquartile Range(IQR)
The interquartile range is a measure of variability that is based on the five-number summary of a dataset. Specifically, the IQR is defined as the difference between the third quartile (Q3) and the first quartile (Q1) of a dataset.
IQR = Q3 − Q1
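Continuing the sketch from the five-number summary:

```python
import numpy as np

data = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # spread of the middle 50% of the data
print(f"IQR: {iqr}")
```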
Boxplots
A box plot, also known as a box-and-whisker plot, is a graphical representation of a dataset that shows the distribution of the data. The box plot displays a summary of the data, including the minimum and maximum values, the first quartile (Q1), the median (Q2), and the third quartile (Q3).
Benefits of a Boxplot
○ Easy way to see the distribution of data
○ Tells about the skewness of data
○ Can identify outliers
○ Compare 2 categories of data
How to create a Boxplot
- Calculate the five-number summary (minimum, Q1, median, Q3, maximum) of the dataset.
- Draw a number line and mark the minimum and maximum values.
- Draw a box from Q1 to Q3, with a vertical line at the median.
- Draw whiskers from the box to the smallest and largest observations that are not considered outliers.
- Plot outliers as individual points outside the whiskers.
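Matplotlib performs these steps automatically; a minimal sketch with made-up data containing one extreme value:

```python
import matplotlib.pyplot as plt

# Made-up data; 150 is extreme enough to be flagged as an outlier
data = [50, 55, 60, 65, 70, 75, 80, 85, 90, 150]

# Matplotlib computes the quartiles, the whiskers (1.5 × IQR by default),
# and the outlier points for us.
plt.boxplot(data)
plt.ylabel("Value")
plt.title("Boxplot of the sample data")
plt.show()
```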
Covariance
In data science, covariance is used to analyze the relationship between two variables in a dataset. It can help identify whether two variables are positively or negatively related, and the strength of that relationship.
Covariance is a statistical measure that describes the degree to which two variables are linearly related. It measures how much two variables change together, such that when one variable increases, does the other variable also increase, or does it decrease?
If the covariance between two variables is positive, it means that the variables tend to move together in the same direction. If the covariance is negative, it means that the variables tend to move in opposite directions. A covariance of zero indicates that the variables are not linearly related.
How is it calculated?
The sample covariance between two variables X and Y is:
Cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
where x̄ and ȳ are the sample means and n is the number of paired observations.
Disadvantages of using Covariance:
One limitation of covariance is that it does not tell us about the strength of the relationship between two variables, since the magnitude of covariance is affected by the scale of the variables.
Covariance of a variable with itself:
The covariance of a variable with itself is simply the variance of that variable. This is because the covariance between two variables is calculated as the expected value of the product of their deviations from their respective means. If the two variables are identical, the two deviations are always equal, so their product is the squared deviation of the variable from its mean. Therefore, the covariance of a variable with itself is the expected value of the squared deviation from the mean, which is exactly the definition of variance. Mathematically, the covariance of a variable X with itself can be written as:
Cov(X, X) = E[(X − E[X])(X − E[X])] = E[(X − E[X])²] = Var(X)
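A quick NumPy sketch with fabricated data illustrating both points (np.cov returns the full covariance matrix and uses the sample formula with n − 1 by default):

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

cov_matrix = np.cov(x, y)  # 2×2 matrix; the diagonal holds the variances
print(cov_matrix[0, 1])    # Cov(X, Y): positive, so x and y move together

# The covariance of a variable with itself equals its variance
print(np.cov(x, x)[0, 1], np.var(x, ddof=1))
```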
Correlation
Correlation is a statistical measure that quantifies the strength and direction of the relationship between two variables.
Correlation is often measured using a statistical tool called the correlation coefficient, which ranges from -1 to 1. A correlation coefficient of -1 indicates a perfect negative correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect positive correlation.
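A minimal sketch computing the Pearson correlation coefficient with NumPy, reusing the made-up data from the covariance example:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

# Pearson correlation: covariance rescaled by both standard deviations,
# so it is unitless and always lies between -1 and 1
r = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {r:.2f}")
```

Because it is rescaled, correlation fixes the scale-dependence problem of covariance mentioned above.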
Correlation and Causation
The phrase “correlation does not imply causation” means that just because two variables are associated with each other, it does not necessarily mean that one causes the other. In other words, a correlation between two variables does not necessarily imply that one variable is the reason for the other variable’s behavior.
Suppose there is a positive correlation between the number of firefighters present at a fire and the amount of damage caused by the fire. One might be tempted to conclude that the presence of firefighters causes more damage. However, this correlation could be explained by a third variable: the severity of the fire. More severe fires might require more firefighters to be present and also cause more damage.
Thus, while correlations can provide valuable insights into how different variables are related, they cannot be used to establish causality. Establishing causality often requires additional evidence such as experiments, randomized controlled trials, or well-designed observational studies.