Intro to Data Science: A Summary of Key Concepts
Standard Deviation Calculation (by hand):
Formula:
$\displaystyle s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
where $x_i$ are the individual data points, $\bar{x}$ is the sample mean, and $n$ is the number of data points.
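A minimal sketch of the by-hand calculation, checked against NumPy. The data values are made up for illustration; `ddof=1` tells `np.std` to divide by n-1, as in the sample formula.

```python
import math
import numpy as np

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample

n = len(data)
mean = sum(data) / n                              # sample mean
squared_devs = [(x - mean) ** 2 for x in data]    # (x_i - mean)^2 for each point
sample_var = sum(squared_devs) / (n - 1)          # divide by n-1, not n
sample_sd = math.sqrt(sample_var)

numpy_sd = np.std(data, ddof=1)                   # same quantity via NumPy
```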
Indexing in Pandas DataFrames:
df[...]: Basic column selection.
df.loc[...]: Label-based indexing (rows/columns).
df.iloc[...]: Position-based indexing.
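A small sketch of the three indexing styles. The DataFrame and its row labels are hypothetical, chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {"Age": [25, 17, 32], "Gender": ["Male", "Female", "Male"]},
    index=["a", "b", "c"],
)

ages = df["Age"]                # column selection, returns a Series
age_b = df.loc["b", "Age"]      # label-based: row "b", column "Age"
first_gender = df.iloc[0, 1]    # position-based: row 0, column 1
label_slice = df.loc["a":"b"]   # label slices include the end label
pos_slice = df.iloc[0:1]        # position slices exclude the end position
```

Note the slicing difference: `loc` includes the end label, `iloc` follows normal Python slicing.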
Composite Boolean Selection:
- Use logical operators to combine conditions for filtering: & (AND), | (OR), ~ (NOT). Wrap each condition in parentheses.
Example:
df[(df['Age'] > 18) & (df['Gender'] == 'Male')]
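A runnable sketch of the three operators on a hypothetical DataFrame (the column values are invented for illustration).

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "Age": [25, 17, 32, 15],
    "Gender": ["Male", "Male", "Female", "Female"],
})

adult_males = df[(df["Age"] > 18) & (df["Gender"] == "Male")]    # AND
adult_or_male = df[(df["Age"] > 18) | (df["Gender"] == "Male")]  # OR
not_adult = df[~(df["Age"] > 18)]                                # NOT
```

The parentheses matter: `&`, `|` and `~` bind more tightly than comparisons such as `>` and `==`.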
Python Object Types:
Tuple: Immutable sequence.
List: Mutable sequence.
Dictionary (dict): Key-value pairs.
For loops:
for i in range(n): # i runs from 0 to n-1 (0-based indexing)
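A short sketch of the mutability difference between these types. All names and values are hypothetical.

```python
point = (3, 4)              # tuple: immutable
try:
    point[0] = 10           # raises TypeError, tuples cannot be changed in place
    tuple_changed = True
except TypeError:
    tuple_changed = False

scores = [90, 85]           # list: mutable
scores.append(77)

ages = {"Ana": 31}          # dict: key-value pairs
ages["Ben"] = 27            # add a new key

indices = [i for i in range(3)]   # range(n) yields 0, 1, ..., n-1
```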
scipy.stats.multinomial vs np.random.choice:
- Both involve random selection, but scipy.stats.multinomial models the counts of outcomes across multiple categories over n trials (and their probabilities), while np.random.choice draws individual random samples from a set of items.
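A sketch contrasting the two, assuming three hypothetical categories with made-up probabilities. `np.random.choice` returns the individual draws; `multinomial` works with the per-category counts.

```python
import numpy as np
from scipy import stats

categories = ["red", "green", "blue"]   # hypothetical categories
probs = [0.5, 0.3, 0.2]                 # hypothetical probabilities

# np.random.choice: ten individual draws, one category per draw
np.random.seed(0)
draws = np.random.choice(categories, size=10, p=probs)

# multinomial: how many of the ten trials fell in each category
counts = stats.multinomial.rvs(n=10, p=probs, random_state=0)

# multinomial also gives the probability of a specific count vector
pmf = stats.multinomial.pmf([5, 3, 2], n=10, p=probs)
```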
Conditional Probability and Independence:
- In general, Pr(A|B) ≠ Pr(A); the two are equal exactly when A and B are independent.
Differences Between Visualization Types:
Bar Plot: For categorical data (shows frequencies).
Histogram: For numerical data distribution.
Box Plot: Shows data distribution (quartiles and outliers).
Violin Plot: Combines box plot with density estimate.
Probability Density Function (PDF): Theoretical distribution of a continuous variable.
Fitting Distributions in Python:
Use scipy.stats to fit distributions:
from scipy.stats import norm
params = norm.fit(data)
Visualize using:
import matplotlib.pyplot as plt
plt.hist(data, bins=20, density=True)
plt.plot(x, norm.pdf(x, *params))
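A self-contained sketch of the fit, using synthetic data (an assumption; any 1-D numeric array would do). The plotting calls are left as comments so the block runs without a display.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=1000)   # synthetic data for illustration

params = norm.fit(data)            # maximum-likelihood estimates (loc, scale)
mu_hat, sigma_hat = params

x = np.linspace(data.min(), data.max(), 200)   # grid on which to evaluate the fitted PDF
pdf_values = norm.pdf(x, *params)

# To overlay the fit on a histogram:
# import matplotlib.pyplot as plt
# plt.hist(data, bins=20, density=True)   # density=True puts the histogram on the PDF scale
# plt.plot(x, pdf_values)
# plt.show()
```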
Variability and Sample Size:
- Larger samples reduce variability. The variability of a sample statistic reflects uncertainty about the population value, and it decreases as sample size increases.
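A simulation sketch of this effect, assuming a standard normal population (an illustrative choice). It measures how much the sample mean varies across repeated samples of two different sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

def sd_of_sample_means(n, reps=2000):
    """Standard deviation of the sample mean across `reps` simulated samples of size n."""
    samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
    return samples.mean(axis=1).std(ddof=1)

spread_small = sd_of_sample_means(10)     # close to 1/sqrt(10)
spread_large = sd_of_sample_means(1000)   # close to 1/sqrt(1000)
```

For this population the spread of the sample mean is about 1/sqrt(n), so a 100-fold larger sample gives roughly a 10-fold smaller spread.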
Bootstrapping and Statistical Terms:
- Use terms together:
“We can estimate the population parameter by calculating a sample statistic and use bootstrapping to generate a confidence interval for the statistic.”
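A minimal sketch of a percentile bootstrap confidence interval for the mean. The sample is synthetic, and the percentile method is one common choice among several; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.exponential(scale=5.0, size=200)   # hypothetical observed sample

n_boot = 5000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample the observed data with replacement, same size as the original
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

sample_mean = sample.mean()                                # the sample statistic
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])   # 95% percentile interval
```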
95% Confidence Interval Misinterpretation:
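Correct interpretation: Over repeated samples, about 95% of the intervals constructed this way cover the true population parameter. It does not mean that any single interval has a 95% probability of containing it.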
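A simulation sketch of the "repeated samples" reading. It assumes a normal population with a known mean and uses the normal-approximation interval (mean ± 1.96 standard errors) rather than bootstrapping, purely to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean, n, reps = 50.0, 100, 2000     # hypothetical population mean, sample size, repetitions

samples = rng.normal(loc=true_mean, scale=10.0, size=(reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)   # standard error of each sample mean

lower = means - 1.96 * ses
upper = means + 1.96 * ses

# Fraction of the intervals that cover the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
```

Each individual interval either covers the true mean or it does not; the 95% describes the long-run fraction.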
Sampling Distribution vs. Bootstrapped Sampling Distribution:
Sampling Distribution (under null hypothesis): Distribution of a statistic under the assumption that the null hypothesis is true.
Bootstrapped Sampling Distribution: Distribution generated by resampling the data with replacement.
Variance: Measures data spread; doesn’t assume any specific hypothesis.
P-value, α-Significance, and Evidence in Hypothesis Testing:
- P-value: Probability of observing your data (or more extreme) under the null hypothesis.
- α (alpha): Threshold for statistical significance (commonly set at 0.05).
- Strength of Evidence: Smaller p-values indicate stronger evidence against the null hypothesis.
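A sketch of a simulated two-sided p-value, assuming a hypothetical experiment of 60 heads in 100 coin flips with the null hypothesis of a fair coin.

```python
import numpy as np

rng = np.random.default_rng(4)
n_flips, observed_heads = 100, 60    # hypothetical observed data
alpha = 0.05

# Sampling distribution of the number of heads if the null (fair coin) is true
null_heads = rng.binomial(n=n_flips, p=0.5, size=100_000)

# Two-sided p-value: share of simulated results at least as far from 50 as the observed one
expected = n_flips * 0.5
p_value = np.mean(np.abs(null_heads - expected) >= abs(observed_heads - expected))

reject_null = p_value < alpha
```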
Importing Libraries:
Using statements like import pandas as pd to access functionalities from external libraries.
Loading Data:
Using pd.read_csv() to read data from a CSV file into a pandas DataFrame.
Missing Values:
Data Exploration:
Grouping and Aggregation:
Sorting and Indexing:
$\displaystyle \bar x = \frac{1}{n}\sum_{i=1}^n x_i$
$\displaystyle s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2$
$\displaystyle s = \sqrt{s^2}$
Python Object Types:
Tuple: An ordered, immutable collection.
List: An ordered, mutable collection.
Dictionary (dict): A collection of key-value pairs.
NumPy array (np.array): Efficient for numerical computations.
String (str): String data type.
For loops: Iterating over a sequence of elements.
for i in range(n):
for x in some_list:
for i, x in enumerate(some_list):
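The three loop forms side by side, on a hypothetical list.

```python
some_list = ["a", "b", "c"]   # hypothetical data
n = len(some_list)

by_index = []
for i in range(n):                   # i = 0, 1, ..., n-1
    by_index.append(some_list[i])

by_value = []
for x in some_list:                  # x is each element in turn
    by_value.append(x.upper())

pairs = []
for i, x in enumerate(some_list):    # position and element together
    pairs.append((i, x))
```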
Logical Flow Control:
Additional Object Type Features:
A DataFrame is a two-dimensional, tabular data structure, similar to a spreadsheet or a SQL table. DataFrames offer powerful methods for data manipulation and analysis.
Probability:
$\displaystyle P(A|B) = \frac{P(A \cap B)}{P(B)}$ (This formula is not explicitly given in the sources but is fundamental to understanding conditional probability.)
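A sketch of the formula applied to data, with B = "smoker" and A = "cough". The survey table is entirely hypothetical.

```python
import pandas as pd

# Hypothetical survey of 8 people
df = pd.DataFrame({
    "smoker": [True, True, True, False, False, False, False, False],
    "cough":  [True, True, False, True, False, False, False, False],
})

p_b = df["smoker"].mean()                          # P(B) = 3/8
p_a_and_b = (df["cough"] & df["smoker"]).mean()    # P(A and B) = 2/8
p_a_given_b = p_a_and_b / p_b                      # P(A|B) = 2/3
p_a = df["cough"].mean()                           # P(A) = 3/8
```

Here P(A|B) differs from P(A), so in this made-up table the two events are not independent.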
Data Visualization:
Creates interactive visualizations.
Seaborn: Provides a high-level interface for statistical graphics.
Matplotlib: A foundational plotting library in Python.
DataFrames have built-in plotting methods using Matplotlib.
Visualization Types:
Box plots: Display data distribution using quartiles and outliers.
Histograms: Represent data distribution by dividing the data into bins and showing the frequency of data points in each bin.
Kernel density estimates (KDE): Provide a smoothed estimate of data distribution using a kernel function.
Violin plots: Combine box plots and kernel density estimators to show data distribution for multiple groups.
Plot Elements:
Log transformation: Applying a logarithmic function to data to compress the scale and potentially reveal patterns.
Population vs. Samples:
Population: The entire group of interest.
Sample: A subset of the population used for analysis.
Inference: The process of drawing conclusions about a population based on sample data.
Bootstrapping: A resampling technique to estimate the sampling distribution of a statistic.
Variance:
Measures the spread of data points around the mean.
Standard error: The standard deviation of the sampling distribution of a statistic (often the mean).
Confidence interval: An interval that is likely to contain the true population parameter with a certain level of confidence (e.g., 95%).
Inference vs. Estimation:
Inference: Drawing conclusions about a population based on sample data.
Estimation: Determining approximate values for population parameters.
Sample size: The number of observations in a sample. Affects the precision of estimates and the width of confidence intervals.
Hypothesis testing: A statistical method for evaluating claims about a population parameter.
Null hypothesis: A statement of “no effect” or “no difference.” It is the hypothesis we aim to reject or fail to reject based on evidence from the data.
Alternative hypothesis: A statement that contradicts the null hypothesis. It represents the claim we are trying to find evidence for.
P-value:
The probability of observing data as extreme as, or more extreme than, the data we actually observed, assuming the null hypothesis is true.
One-sided test: Focuses on a specific direction of the effect (e.g., testing if a parameter is greater than a certain value).
Two-sided test: Considers both directions of the effect (e.g., testing if a parameter is different from a certain value).
Type I error: Rejecting the null hypothesis when it is actually true (false positive).
Type II error: Failing to reject the null hypothesis when it is actually false (false negative).
Significance level (α): The threshold for the p-value below which we reject the null hypothesis (commonly set to 0.05).
Standard Deviation (by hand):
$\displaystyle s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
where $x_i$ are data points, $\bar{x}$ is the sample mean, and $n$ is the number of data points.
Pandas Indexing:
df[...]: Column selection.
df.loc[...]: Label-based indexing.
df.iloc[...]: Position-based indexing.
Boolean Selection:
- Logical conditionals: & (AND), | (OR), ~ (NOT).
Example: df[(df['Age'] > 18) & (df['Gender'] == 'Male')]
Python Object Types:
Tuple: Immutable.
List: Mutable.
Dict: Key-value pairs.
For loop: for i in range(n):
scipy.stats.multinomial vs np.random.choice:
- multinomial: Probabilities of outcome counts across multiple categories.
- np.random.choice: Random sampling from a set of items.
Conditional Probability: Pr(A|B) ≠ Pr(A) unless independent.
Visualization Types:
Bar Plot: Categorical data.
Histogram: Numeric data distribution.
Box Plot: Data quartiles and outliers.
Violin Plot: Combines box plot with density estimate.
PDF: Theoretical continuous distribution.
Fitting Distributions:
from scipy.stats import norm
import matplotlib.pyplot as plt
params = norm.fit(data)
plt.hist(data, bins=20, density=True)
plt.plot(x, norm.pdf(x, *params))
Variability & Sample Size:
Larger sample size reduces variability (uncertainty decreases as $n$ increases).
Bootstrapping & Terms:
“We estimate population parameters using a sample statistic and apply bootstrapping to generate a confidence interval.”
Confidence Interval Misinterpretation:
Correct: “95% of such intervals contain the true population parameter.”
Sampling Distribution vs Bootstrapping:
- Null Hypothesis: Distribution if null is true.
- Bootstrapping: Resampling to estimate a statistic’s variability.
- Variance: Spread of data.
P-value & Hypothesis Testing:
- P-value: Probability of observing data at least as extreme as the actual data, assuming the null hypothesis is true.
- α (alpha): Significance threshold (commonly 0.05).
- Evidence: Lower p-values provide stronger evidence against the null.
Data Importing & Cleaning:
import pandas as pd
df = pd.read_csv('file.csv')
df.dropna(), df.isna().sum(), df.describe(), df.groupby("col1")["col2"].describe()
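A runnable sketch of these calls. A small hypothetical DataFrame stands in for the result of `pd.read_csv('file.csv')`, with `col1` and `col2` as placeholder column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv('file.csv')
df = pd.DataFrame({
    "col1": ["x", "x", "y", "y", "y"],
    "col2": [1.0, 3.0, np.nan, 4.0, 6.0],
})

missing_per_column = df.isna().sum()     # number of missing values in each column
clean = df.dropna()                      # drop rows that contain a missing value
summary = clean.describe()               # count, mean, std, min, quartiles, max
by_group = clean.groupby("col1")["col2"].describe()   # the same summary per group
```

`dropna()` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.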
Formulas:
- Sample Mean: $\displaystyle \bar x = \frac{1}{n}\sum_{i=1}^n x_i$
- Variance: $\displaystyle s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2$
- Standard Deviation: $\displaystyle s = \sqrt{s^2}$
Probability & Conditional Probability:
import numpy as np
from scipy import stats
stats.multinomial, np.random.choice
Conditional Probability: $\displaystyle P(A|B) = \frac{P(A \cap B)}{P(B)}$
Plotting in Python:
import matplotlib.pyplot as plt
plt.hist(), plt.boxplot(), plt.violinplot()
Key Concepts (Summarized):
- Bootstrapping: Resampling technique for estimating statistic variability.
- Hypothesis Testing: Test the null against the alternative hypothesis using the p-value and significance level ($\alpha$).