Intro to Data Science: A Summary of Key Concepts

Standard Deviation Calculation (by hand):


Formula:

$\displaystyle s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

where $x_i$ are the individual data points, $\bar{x}$ is the sample mean, and $n$ is the number of data points.
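
A minimal sketch of this calculation in Python (the data values are purely illustrative; numpy is used only as a cross-check):

    import numpy as np

    data = [2, 4, 4, 4, 5, 5, 7, 9]                      # illustrative data
    n = len(data)
    xbar = sum(data) / n                                  # sample mean
    s = (sum((x - xbar) ** 2 for x in data) / (n - 1)) ** 0.5
    print(s, np.std(data, ddof=1))                        # ddof=1 gives the n-1 (sample) version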


Indexing in Pandas DataFrames:


  • df[...]: Basic column selection.
  • df.loc[...]: Label-based indexing (rows/columns).
  • df.iloc[...]: Position-based indexing (integer positions).
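
A minimal sketch of the three access patterns (the DataFrame contents and column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({'Age': [22, 35, 41], 'Gender': ['Male', 'Female', 'Male']},
                      index=['a', 'b', 'c'])
    df['Age']              # basic column selection
    df.loc['b', 'Gender']  # label-based: row label 'b', column 'Gender'
    df.iloc[1, 1]          # position-based: second row, second column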

Composite Boolean Selection:


  • Use logical conditionals for filtering:
    • & (AND), | (OR), ~ (NOT).
    • Example:

      df[(df['Age'] > 18) & (df['Gender'] == 'Male')]


Python Object Types:


  • Tuple


    Immutable sequence.

  • List

    Mutable sequence.

  • Dictionary (dict)

    Key-value pairs.

  • For loops

    for i in range(n): # 0-based indexing
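
A short sketch putting these object types together (all values are illustrative):

    point = (3, 4)                        # tuple: immutable sequence
    scores = [88, 92, 75]                 # list: mutable sequence
    ages = {'Ana': 21, 'Bo': 19}          # dict: key-value pairs

    scores.append(100)                    # lists can be modified in place
    for i in range(len(scores)):          # 0-based indexing: i = 0, 1, 2, 3
        print(i, scores[i])
    for name, age in ages.items():        # iterating over a dict's key-value pairs
        print(name, age)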

scipy.stats.multinomial vs np.random.choice:


  • Both involve randomness, but scipy.stats.multinomial models the probabilities of counts across multiple categories, while np.random.choice performs simple random sampling from a set of options (see the sketch after this list).

  • Conditional Probability and Independence:
    • Pr(A|B) ≠ Pr(A) unless A and B are independent.
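
A minimal sketch contrasting the two functions (the category labels and probabilities are illustrative):

    import numpy as np
    from scipy import stats

    p = [0.2, 0.5, 0.3]                               # illustrative category probabilities
    rv = stats.multinomial(n=10, p=p)                 # counts over 3 categories in 10 trials
    print(rv.pmf([2, 5, 3]))                          # probability of that exact split of counts
    print(rv.rvs(size=1))                             # one simulated vector of counts

    draws = np.random.choice(['A', 'B', 'C'], size=10, p=p)  # simple random sampling
    print(draws)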

Differences Between Visualization Types:


  • Bar Plot


    For categorical data (shows frequencies).

  • Histogram

    For numerical data distribution.

  • Box Plot

    Shows data distribution (quartiles and outliers).

  • Violin Plot

    Combines box plot with density estimate.

  • Probability Density Function (PDF)

    Theoretical distribution of a continuous variable.

Fitting Distributions in Python:


  • Use scipy.stats to fit distributions:

    from scipy.stats import norm
    params = norm.fit(data)          # estimates (mean, std) from the data

  • Visualize using:

    x = np.linspace(data.min(), data.max(), 100)   # grid spanning the data range
    plt.hist(data, bins=20, density=True)
    plt.plot(x, norm.pdf(x, *params))


Variability and Sample Size:


  • Larger samples reduce the variability of sample statistics: the statistical uncertainty in an estimate (e.g., the standard error of the mean) decreases as the sample size increases.

Bootstrapping and Statistical Terms:


  • Use terms together:
    “We can estimate the population parameter by calculating a sample statistic and use bootstrapping to generate a confidence interval for the statistic.”

95% Confidence Interval Misinterpretation:


  • Correct interpretation


    If we repeated the sampling procedure many times, about 95% of the intervals constructed this way would contain the true population parameter.

Sampling Distribution vs. Bootstrapped Sampling Distribution:


  • Sampling Distribution (under null hypothesis)


    Distribution of a statistic under the assumption that the null hypothesis is true.

  • Bootstrapped Sampling Distribution

    Distribution generated by resampling the observed data with replacement (see the sketch after this list).

  • Variance

    Measures data spread; doesn’t assume any specific hypothesis.
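
A minimal simulation sketch of the two distributions (the 0/1 data and the null value p = 0.5 are illustrative):

    import numpy as np

    obs = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])            # illustrative 0/1 sample

    # Sampling distribution under the null hypothesis p = 0.5
    null_props = np.random.binomial(n=len(obs), p=0.5, size=5000) / len(obs)

    # Bootstrapped sampling distribution: resample the observed data with replacement
    boot_props = np.array([np.random.choice(obs, size=len(obs), replace=True).mean()
                           for _ in range(5000)])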

P-value, α-Significance, and Evidence in Hypothesis Testing:

  • P-value:
    Probability of observing your data (or more extreme) under the null hypothesis.

  • α (alpha)

    Threshold for statistical significance (commonly set at 0.05).

  • Strength of Evidence

    Smaller p-values indicate stronger evidence against the null hypothesis.


Here is a concise summary of the materials provided, including necessary Python code lines, terms, ideas, and formulas.
Week 1 Data Summarization
Key Concepts

Importing Libraries:


Using statements like import pandas as pd to access functionalities from external libraries.

Loading Data:


Using pd.read_csv() to read data from a CSV file into a pandas DataFrame.
Arguments like encoding control how the file is read; inplace is an argument of many DataFrame methods (e.g., dropna) that modifies the DataFrame in place instead of returning a copy.

Missing Values:


Using df.isna().sum() to count missing values in each column of a DataFrame.
You can use axis=1 and .any() to flag rows that contain at least one missing value.
Removing missing data:


df.dropna(): Removes rows containing any missing values.


del df['col']: Deletes a specific column from the DataFrame.


Applying del df['col'] before df.dropna() matters when a column has many missing values: dropping that column first avoids discarding rows that are missing only in that column (see the sketch below).
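
A minimal sketch of this cleaning workflow (the file name and column name are hypothetical):

import pandas as pd

df = pd.read_csv('data.csv')           # hypothetical file
print(df.isna().sum())                 # missing values per column
del df['mostly_missing_col']           # hypothetical column with many NaNs: drop it first...
df = df.dropna()                       # ...so dropna() only removes rows missing elsewhere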

Data Exploration:


df.shape: Returns a tuple with the number of rows and columns in the DataFrame.
df.columns: Returns the column labels.
df.dtypes: Shows the data type of each column.
.astype(): Used to convert a column to a different data type.
df.describe(): Provides summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for numeric columns.
df['col'].value_counts(): Counts the occurrences of unique values in a column, useful for categorical data.

Grouping and Aggregation:


df.groupby("col1")["col2"].describe(): Groups the DataFrame by 'col1', then calculates summary statistics for 'col2' within each group (see the sketch below).
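
A minimal exploration and grouping sketch (the DataFrame contents and the column names 'species' and 'age' are illustrative):

import pandas as pd

df = pd.DataFrame({'species': ['cat', 'dog', 'cat', 'dog'],    # illustrative data
                   'age': [3.0, 5.0, 1.0, 7.0]})
df.shape                                   # (number of rows, number of columns)
df.dtypes                                  # data type of each column
df['age'] = df['age'].astype(int)          # convert a column's type
df.describe()                              # summary statistics for numeric columns
df['species'].value_counts()               # counts of a categorical column
df.groupby('species')['age'].describe()    # per-group summaries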

Sorting and Indexing:


.sort_values(): Sorts the DataFrame by specified column(s).
0-based indexing: Accessing elements in a DataFrame using row and column indices starting from 0.


df[]: Can be used for both row and column indexing.


df.iloc[]: Used for fully 0-based indexing (integer-location based).
Boolean Selection: Filtering the DataFrame based on conditions (see the sketch below).


df[] or df.loc[]: Can be used with logical conditions to select specific rows.


Logical Operators: >=, >, <=, <, ==, != for comparisons; ~ (not), & (and), | (or) for combining conditions.
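
A short sketch of sorting and boolean selection (the DataFrame and column names are again hypothetical):

import pandas as pd

df = pd.DataFrame({'species': ['cat', 'dog', 'cat', 'dog'],    # illustrative data
                   'age': [3, 25, 1, 17]})
df.sort_values('age', ascending=False)                 # sort by a column, largest first
df.iloc[0:3, 0:2]                                      # first three rows, first two columns
df[(df['age'] >= 18) & (df['species'] == 'cat')]       # combine conditions with &
df.loc[~(df['age'] < 18)]                              # ~ negates a boolean condition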
Formulas
Sample Mean:
$\displaystyle \bar x = \frac{1}{n}\sum_{i=1}^n x_i$
Sample Variance:
$\displaystyle s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2$
Sample Standard Deviation:
$\displaystyle s = \sqrt{s^2}$
Week 2 Coding and Probability
Key Concepts

Python Object Types:


tuple:
An ordered, immutable collection.
list:
An ordered, mutable collection.
dict:
A collection of key-value pairs.
np.array:
A NumPy array, efficient for numerical computations.
str:
String data type.
For Loops:
Iterating over a sequence of elements.

for i in range(n):

for x in some_list:

for i,x in enumerate(some_list):

Logical Flow Control:

if, elif, else: Used for conditional execution of code blocks.

Additional Object Type Features:

type(): Returns the type of an object.
Indexing: Accessing individual elements of an object using indices.
.dtype: Returns the data type of a NumPy array.
.split(): Splits a string into a list of substrings based on a delimiter (whitespace by default).
Pandas DataFrame Objects:
DataFrames are a two-dimensional, tabular data structure, similar to a spreadsheet or a SQL table. They offer powerful methods for data manipulation and analysis.

Probability:

from scipy import stats: Imports the stats module from the SciPy library.
stats.multinomial: Represents a multinomial probability distribution.
np.random.choice: Randomly selects elements from an array based on specified probabilities.
Conditional Probability: Pr(A|B) – the probability of event A happening given that event B has already occurred.
Independence: Pr(A|B) = Pr(A) – events A and B are independent if the occurrence of B doesn’t affect the probability of A.
Formulas
Conditional Probability:
$\displaystyle P(A|B) = \frac{P(A \cap B)}{P(B)}$ (This formula is not explicitly given in the sources but is fundamental to understanding conditional probability.)
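
A minimal simulation sketch that checks the conditional probability formula (the die-roll events are illustrative):

import numpy as np

rolls = np.random.choice([1, 2, 3, 4, 5, 6], size=100_000)   # simulate fair die rolls
A = rolls == 6                       # event A: rolled a six
B = rolls >= 4                       # event B: rolled four or more
print((A & B).mean() / B.mean())     # P(A|B) = P(A and B) / P(B), approx. 1/3
print(A.mean())                      # P(A), approx. 1/6, so A and B are not independent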
Week 3 Data Visualization
Key Concepts

Data Visualization:

Libraries:


Plotly:
Creates interactive visualizations.


Seaborn:
Provides high-level interface for statistical graphics.


Matplotlib:
A foundational plotting library in Python.


Pandas:
DataFrames have built-in plotting methods using Matplotlib.

Visualization Types:

Box Plots:
Display data distribution using quartiles and outliers.


Histograms:
Represent data distribution by dividing the data into bins and showing the frequency of data points in each bin.


Kernel Density Estimators:
Provide a smoothed estimate of data distribution using a kernel function.


Violin Plots:
Combine box plots and kernel density estimators to show data distribution for multiple groups.

Plot Elements:

Legends: Annotate different elements in the plot.


Annotations: Add text labels to specific points or regions.


Figure Panels: Divide a figure into multiple subplots.
Log Transformations:
Applying a logarithmic function to data to compress the scale and potentially reveal patterns.
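
A minimal matplotlib sketch of a few of these plot types plus a log transformation (the simulated data are purely illustrative):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.lognormal(mean=0, sigma=1, size=500)   # illustrative right-skewed data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))         # figure panels (subplots)
axes[0].hist(data, bins=30)                             # histogram
axes[0].set_title('Histogram')
axes[1].boxplot(data)                                   # box plot: quartiles and outliers
axes[1].set_title('Box plot')
axes[2].hist(np.log(data), bins=30)                     # log transformation compresses the scale
axes[2].set_title('Histogram of log(data)')
plt.show()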

Population vs. Samples:

Population:
The entire group of interest.
Sample:
A subset of the population used for analysis.
Statistical Inference:
The process of drawing conclusions about a population based on sample data.
Week 4 Bootstrapping
Key Concepts
Bootstrapping:
A resampling technique to estimate the sampling distribution of a statistic.

Variance:

Measures the spread of data points around the mean.
Standard Error:
The standard deviation of the sampling distribution of a statistic (often the mean).
Confidence Interval:
An interval that is likely to contain the true population parameter with a certain level of confidence (e.g., 95%).
Bootstrapped Confidence Interval: A confidence interval constructed using bootstrapping.

Inference vs. Estimation:

Inference:
Drawing conclusions about a population based on sample data.
Estimation:
Determining approximate values for population parameters.
Sample Size (n):
The number of observations in a sample. Affects the precision of estimates and the width of confidence intervals.
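
A minimal bootstrapping sketch for a 95% confidence interval for the mean (the sample is simulated here purely for illustration):

import numpy as np

sample = np.random.normal(loc=10, scale=2, size=50)       # stand-in for an observed sample
boot_means = np.array([np.random.choice(sample, size=len(sample), replace=True).mean()
                       for _ in range(10_000)])
se = boot_means.std()                                     # bootstrap estimate of the standard error
ci = np.percentile(boot_means, [2.5, 97.5])               # 95% bootstrapped confidence interval
print(se, ci)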
Week 5 Hypothesis Testing
Key Concepts
Hypothesis Testing:
A statistical method for evaluating claims about a population parameter.
Null Hypothesis (H0):
A statement of “no effect” or “no difference.” It is the hypothesis we aim to reject or fail to reject based on evidence from the data.
Alternative Hypothesis (HA):
A statement that contradicts the null hypothesis. It represents the claim we are trying to find evidence for.

P-value:


The probability of observing data as extreme as, or more extreme than, the data we actually observed, assuming the null hypothesis is true.
A smaller p-value suggests stronger evidence against the null hypothesis.
One-Sided (One-Tailed) Test:
Focuses on a specific direction of the effect (e.g., testing whether a parameter is greater than a certain value).
Two-Sided (Two-Tailed) Test:
Considers both directions of the effect (e.g., testing whether a parameter is different from a certain value).
Type I Error:
Rejecting the null hypothesis when it is actually true (false positive).
Type II Error:
Failing to reject the null hypothesis when it is actually false (false negative).
Significance Level (α):
The threshold below which we reject the null hypothesis (commonly set to 0.05).
Key Python Code (Specific to Hypothesis Testing)
While there are no explicit code examples given for hypothesis testing in the provided materials, the following functions might be relevant depending on the specific test being performed:
stats.t.cdf(): Calculates the cumulative distribution function for a t-distribution (used in t-tests).
stats.norm.cdf(): Calculates the cumulative distribution function for a normal distribution (used in z-tests).
Functions for calculating test statistics, such as np.mean(), np.std(), etc.
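
A minimal sketch of a two-sided one-sample test using a normal approximation (the data and the null value mu0 = 0 are illustrative; a t-based version via stats.t.cdf or stats.ttest_1samp would be analogous):

import numpy as np
from scipy import stats

sample = np.random.normal(loc=0.3, scale=1, size=40)      # illustrative data
mu0 = 0                                                   # H0: population mean equals 0
z = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(len(sample)))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))                # two-sided p-value
print(z, p_value)                                         # reject H0 at alpha = 0.05 if p_value < 0.05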


Standard Deviation (by hand):

$\displaystyle s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

where $x_i$ are the data points, $\bar{x}$ is the sample mean, and $n$ is the number of data points.


Pandas Indexing:

  • df[...]: Column selection.
  • df.loc[...]: Label-based indexing.
  • df.iloc[...]: Position-based indexing.

Boolean Selection:

  • Logical conditionals: & (AND), | (OR), ~ (NOT).
    Example: df[(df['Age'] > 18) & (df['Gender'] == 'Male')].

Python Object Types:


  • Tuple


    Immutable.

  • List

    Mutable.

  • Dict: Key-value pairs.

  • For loop:

    for i in range(n):


scipy.stats.multinomial vs np.random.choice:


  • multinomial: Probabilities of counts across multiple categories.
  • np.random.choice: Simple random sampling.

  • Conditional Probability: Pr(A|B) ≠ Pr(A) unless independent.

Visualization Types:

  • Bar Plot


    Categorical data.

  • Histogram

    Numeric data distribution.

  • Box Plot

    Data quartiles and outliers.

  • Violin Plot

    Combines box plot with density estimate.
  • PDF:
    Theoretical continuous distribution.

Fitting Distributions:

from scipy.stats import norm

params = norm.fit(data)
plt.hist(data, bins=20, density=True)
plt.plot(x, norm.pdf(x, *params))


Variability & Sample Size:
Larger sample size reduces variability (uncertainty decreases as $n$ increases).


Bootstrapping & Terms:
“We estimate population parameters using a sample statistic and apply bootstrapping to generate a confidence interval.”


Confidence Interval Misinterpretation:
Correct: “95% of such intervals contain the true population parameter.”


Sampling Distribution vs Bootstrapping:

  • Sampling distribution under the null: distribution of the statistic if the null hypothesis is true.
  • Bootstrapping: resampling the observed data with replacement to estimate a statistic’s variability.

  • Variance

    Spread of data.

P-value & Hypothesis Testing:

  • P-value: Probability of observing the data under the null hypothesis.

  • α (alpha)

    Significance threshold (commonly 0.05).
  • Evidence: Lower p-values provide stronger evidence against the null.

Data Importing & Cleaning:

import pandas as pd

df = pd.read_csv('file.csv')
df.dropna(), df.isna().sum(), df.describe(), df.groupby("col1")["col2"].describe()


Formulas:

  • Sample Mean: $\displaystyle \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

  • Variance

    $\displaystyle s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
  • Standard Deviation:
    $\displaystyle s = \sqrt{s^2}$

Probability & Conditional Probability:

from scipy import stats

stats.multinomial, np.random.choice


  • Conditional Probability: $\displaystyle P(A|B) = \frac{P(A \cap B)}{P(B)}$

Plotting in Python:

import matplotlib.pyplot as plt

plt.hist(), plt.boxplot(), plt.violinplot()


Key Concepts (Summarized):

  • Bootstrapping: Resampling technique for estimating statistic variability.
  • Hypothesis Testing:
    Test the null vs. the alternative hypothesis using the p-value and the significance level ($\alpha$).