Statistics for Data Science
Let’s start with a question: what is statistics?
Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and presenting data. It involves using mathematical methods and techniques to draw conclusions and make informed decisions based on data.
In simpler terms, statistics helps us make sense of data and understand the relationships between different variables. It plays a crucial role in many fields, including science, engineering, business, healthcare, and the social sciences.
Let’s say you want to find out how many people in your city own a car. You could go door-to-door and ask everyone, but that would take a lot of time and effort. Instead, you could take a sample of people from your city and ask them if they own a car. You could then use statistical methods to analyze the data from your sample and make an estimate of how many people in your city own a car.
For example, if you sampled 100 people and 60 of them said they owned a car, you could estimate that 60% of people in your city own a car. This is an example of how statistics can help us make inferences about a population based on a sample of data.
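The car-ownership estimate above can be sketched in a few lines of Python. This is a minimal illustration, not a full survey analysis: the function name `estimate_proportion` is made up for this post, and the interval uses the common normal approximation, which assumes a reasonably large random sample.

```python
import math
from statistics import NormalDist

def estimate_proportion(successes, n, confidence=0.95):
    """Return the sample proportion and an approximate confidence interval."""
    p_hat = successes / n
    # z-score for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of a sample proportion (normal approximation)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return p_hat, (p_hat - margin, p_hat + margin)

# 60 car owners out of 100 people sampled, as in the example above
p_hat, (low, high) = estimate_proportion(60, 100)
print(f"Estimated share of car owners: {p_hat:.0%}")
print(f"Approximate 95% interval: {low:.1%} to {high:.1%}")
```

With only 100 people, the interval runs from roughly 50% to 70%, which is a reminder that a sample gives an estimate with some uncertainty rather than an exact answer.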
There are two main types of statistics:
- Descriptive statistics
- Inferential statistics
Descriptive statistics
Descriptive statistics involves summarizing and describing a set of data using measures such as mean, median, mode, standard deviation, and variance. Descriptive statistics are used to provide a clear understanding of the data and to identify patterns, trends, and relationships within the data.
For example, descriptive statistics can be used to summarize and visualize a dataset of customer purchase behavior, such as the number of purchases made, the average purchase amount, and the distribution of purchase amounts. By analyzing this data, we can gain insights into customer behavior and preferences, which can inform marketing strategies and product development.
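As a small sketch of these measures, the snippet below summarizes a handful of purchase amounts with Python's built-in `statistics` module. The numbers in `purchases` are invented purely for illustration.

```python
import statistics

# Hypothetical purchase amounts (in dollars) for seven customers
purchases = [20, 35, 35, 50, 60, 80, 140]

summary = {
    "count": len(purchases),
    "mean": statistics.mean(purchases),
    "median": statistics.median(purchases),
    "mode": statistics.mode(purchases),
    "variance": statistics.variance(purchases),  # sample variance (divides by n - 1)
    "std_dev": statistics.stdev(purchases),      # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean (60) is higher than the median (50) because one large purchase of 140 pulls the average up. Comparing the two is a quick way to spot a skewed distribution.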
Inferential statistics
Inferential statistics refers to the process of using statistical techniques and methods to make inferences or predictions about a population based on a sample of data. Inferential statistics are used to test hypotheses, determine the significance of relationships, and make predictions about future outcomes.
Inferential statistics are important in data science because they allow us to draw conclusions about a larger population based on a sample of data. This can be particularly useful when it is not feasible or practical to collect data on an entire population. For example, inferential statistics can be used to draw conclusions about the preferences of a large customer base based on a smaller sample of customer data.
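One common inferential tool is a confidence interval for a population mean. The sketch below is only an illustration: the customer data is simulated, the helper `mean_confidence_interval` is a name made up for this post, and the interval relies on the normal approximation. For small samples, a t-distribution would usually be the more appropriate choice.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Return the sample mean and an approximate confidence interval for it."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)   # sample standard deviation
    se = s / math.sqrt(n)          # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return x_bar, (x_bar - z * se, x_bar + z * se)

random.seed(42)  # fixed seed so the sketch is repeatable
# Simulated purchase amounts for a sample of 100 customers (hypothetical data)
sample = [random.gauss(50, 10) for _ in range(100)]

x_bar, (low, high) = mean_confidence_interval(sample)
print(f"Sample mean: {x_bar:.2f}")
print(f"Approximate 95% interval for the population mean: {low:.2f} to {high:.2f}")
```

The interval is a statement about the whole customer base, even though only 100 customers were looked at. That step from sample to population is what makes it inferential rather than descriptive.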
Difference between Descriptive and Inferential Statistics
The main difference between descriptive statistics and inferential statistics is the purpose and scope of the analysis.
Descriptive statistics summarize the data you actually have, while inferential statistics use that sample to draw conclusions or make predictions about a larger population.
Population vs. Sample
In data science, a population refers to the entire group of individuals, objects, or measurements of interest that we want to study and make predictions about. This can include all customers of a business, all users of a website, or all patients with a particular condition.
A sample, on the other hand, is a subset of the population that is selected to be studied in order to make predictions about the larger population. The sample is typically smaller than the population and is chosen using various sampling methods to ensure that it is representative of the population.
Examples
1. All cricket fans vs. the fans who were present in the stadium
2. All students vs. the students who visit the college for lectures
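To make the distinction concrete, here is a minimal sketch that builds a made-up population and draws a simple random sample from it. The ages are randomly generated, and simple random sampling is just one of several sampling methods.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 cricket fans
population = [random.randint(15, 70) for _ in range(10_000)]

# Simple random sample of 200 fans, drawn without replacement
sample = random.sample(population, k=200)

print(f"Population size: {len(population)}, mean age: {statistics.mean(population):.1f}")
print(f"Sample size:     {len(sample)}, mean age: {statistics.mean(sample):.1f}")
```

Because every fan has the same chance of being picked, the sample mean tends to land close to the population mean. A group such as the fans who happened to be in the stadium is still a sample, but it may not be as representative as a random one.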
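Parameter vs. Statistic
A parameter is a number that describes an entire population, such as the population mean. A statistic is a number computed from a sample, such as the sample mean. The main differences between parameters and statistics are: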
- Parameters are based on the entire population, while statistics are based on a sample from the population.
- Parameters are typically unknown and must be estimated using statistics from a sample.
- Parameters are denoted by Greek letters (such as μ, σ), while statistics are denoted by Roman letters (such as x̄, s).
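The sketch below puts the notation next to actual numbers. It assumes a simulated population of exam scores, which is hypothetical. In practice the whole population is rarely available, which is exactly why μ and σ usually have to be estimated by x̄ and s.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 5,000 exam scores
population = [random.gauss(70, 12) for _ in range(5_000)]
# A random sample of 50 scores from that population
sample = random.sample(population, k=50)

mu = statistics.fmean(population)      # parameter: population mean (μ)
sigma = statistics.pstdev(population)  # parameter: population std dev (σ), divides by N
x_bar = statistics.fmean(sample)       # statistic: sample mean (x̄)
s = statistics.stdev(sample)           # statistic: sample std dev (s), divides by n - 1

print(f"Parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"Statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

Note the two different functions for the standard deviation: `pstdev` treats the data as the full population, while `stdev` treats it as a sample and divides by n - 1.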
In this post, I covered descriptive and inferential statistics, population and sample, and parameters and statistics. Understanding these concepts and how they relate to data science leaves us better equipped to analyze and interpret data.
Remember, statistics is a vast field, and this post only scratches the surface. There is always more to learn, and I encourage you to continue exploring statistics and data science.
Thank you for your time, and happy data analyzing!