Statistics Cheat Sheet: Key Concepts and Formulas
Topic 2: Data Analysis and Descriptive Statistics
Types of Variables
Variables can be categorized as either categorical (non-numerical values, e.g., hair color) or numerical. Numerical variables can be further classified as discrete (integer values, e.g., goals scored) or continuous (decimal values, e.g., height or weight).
Data Classification
- Qualitative Data:
- Nominal: Categories with no inherent order (e.g., hair color).
- Ordinal: Ordered categories (e.g., education level).
- Quantitative Data:
- Interval: Numerical data where the zero point is arbitrary (e.g., temperature in Celsius).
- Ratio: Numerical data with a meaningful zero point (e.g., weight).
Frequency Distributions and Graphs
- Frequency Distribution Tables: Summarize the frequency of each value or category of a variable, including absolute frequency (ni), relative frequency (fi), cumulative absolute frequency (Ni), and cumulative relative frequency (Fi).
- Graphs for Categorical Variables:
- Bar Chart: Represents categories with bars.
- Pie Chart: Displays categories as slices of a circle.
- Graphs for Continuous Variables:
- Histogram: Shows the distribution of continuous data using bars.
- Line Chart: Plots data points connected by lines over time or another continuous variable.
- Graphs for Two or More Variables:
- Scatter Plot: Displays the relationship between two numerical variables.
- Contingency Table: Analyzes the association between categorical variables.
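As an illustration of the frequency distribution table described above, the following minimal Python sketch computes ni, fi, Ni and Fi for a small list of values. The function name frequency_table and the goals data are made up for this example.

```python
from collections import Counter

def frequency_table(values):
    """Return one row (value, ni, fi, Ni, Fi) per distinct value, sorted by value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        ni = counts[value]        # absolute frequency
        cumulative += ni          # cumulative absolute frequency Ni
        rows.append((value, ni, ni / n, cumulative, cumulative / n))
    return rows

# Hypothetical data: goals scored in 10 matches.
goals = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]
table = frequency_table(goals)

print("value  ni    fi  Ni    Fi")
for value, ni, fi, Ni, Fi in table:
    print(f"{value:>5} {ni:>3} {fi:>5.2f} {Ni:>3} {Fi:>5.2f}")
```

The last row always has Ni equal to the number of observations and Fi equal to 1, which is a quick sanity check on any frequency table.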
Measures of Central Tendency
- Mean (X̄): The sum of all values divided by the number of values.
- Median (Me): The middle value when data is ordered from least to greatest.
- Mode (Mo): The value that occurs most frequently.
Measures of Dispersion
- Range (Rg): The difference between the highest and lowest values.
- Variance (S^2x): Measures the spread of data around the mean.
- Standard Deviation (Sx): The square root of the variance.
- Coefficient of Variation: A relative measure of dispersion, expressed as a percentage of the mean.
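The measures of central tendency and dispersion listed above can be computed with Python's standard statistics module. This is a sketch under two assumptions: the heights are invented data, and the variance divides by n (the "population" form, pvariance). The cheat sheet does not say whether n or n-1 is intended, and statistics.variance would give the n-1 version.

```python
import statistics

# Hypothetical sample of heights in centimeters.
heights_cm = [160, 165, 165, 170, 180]

mean = statistics.mean(heights_cm)
median = statistics.median(heights_cm)
mode = statistics.mode(heights_cm)
data_range = max(heights_cm) - min(heights_cm)

# Assumption: variance computed by dividing by n (use statistics.variance for n-1).
variance = statistics.pvariance(heights_cm)
std_dev = statistics.pstdev(heights_cm)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent = 100 * std_dev / mean

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance}, sd={std_dev:.3f}, CV={cv_percent:.2f}%")
```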
Measures of Association
- Covariance (Sxy): Measures how two variables vary together. Its sign gives the direction of the linear relationship, but its size depends on the units of the variables.
- Correlation Coefficient (r): A standardized measure of the linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).
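To make the two measures of association concrete, here is a small Python sketch that computes the covariance and the correlation coefficient from their definitions. The data and function names are illustrative, and the covariance divides by n, which is an assumption since the cheat sheet does not fix the denominator. The correlation does not depend on that choice because the denominators cancel.

```python
from math import sqrt

def covariance(xs, ys):
    """Average product of deviations from the means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

print(f"covariance  = {covariance(hours, scores):.2f}")
print(f"correlation = {correlation(hours, scores):.4f}")
```

Note that covariance(xs, xs) is simply the variance of xs, which is why the same helper can be reused for the denominator.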
Topic 3: Probability
Basic Concepts
- Sample Space (E): The set of all possible outcomes of a random experiment.
- Events: Subsets of the sample space.
- Incompatible Events: Events that cannot occur simultaneously.
- Exhaustive Events: Events that cover all possible outcomes.
- Basic/Complementary Events: Events that are both incompatible and exhaustive.
Approaches to Measuring Probability
- Classical Approach (Laplace): P(s) = Favorable Outcomes / Possible Outcomes (assumes all outcomes are equally likely)
- Frequentist Approach: P(s) = lim n→∞ ns/n (the limit of the relative frequency of event s as the number of trials n increases)
- Subjective Approach: Based on personal belief or judgment.
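The classical and frequentist approaches can be compared with a short simulation. The sketch below, which assumes a fair six-sided die and a fixed random seed chosen only for reproducibility, estimates the probability of rolling a six by relative frequency and prints it next to the classical value 1/6.

```python
import random

def relative_frequency(trials, seed=0):
    """Fraction of simulated fair-die rolls that show a six."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)
    return hits / trials

classical = 1 / 6                      # favorable outcomes / possible outcomes

for n in (100, 10_000, 100_000):
    estimate = relative_frequency(n)
    print(f"n={n:>7}: relative frequency={estimate:.4f}, classical={classical:.4f}")
```

As n grows, the relative frequency tends to settle near the classical value, which is the idea behind the frequentist definition.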
Probability Rules and Formulas
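In the rules below, lowercase letters (a, b) denote events and uppercase letters (A, B) denote their complements.
- 0≤P(s)≤1
- P(∅)=0
- P(E)=1
- P(A)=1-P(a), so P(a)+P(A)=1
- P(a∪b)=P(a)+P(b)-P(a∩b)
- P(a)=P(a∩b)+P(a∩B)
- P(a|b)=P(a∩b)/P(b)
- P(a|B)=P(a∩B)/P(B)
- P(A∪B)=1-P(a∩b)
- P(A∩B)=1-P(a∪b)
- Bayes' theorem: P(H|E)=P(E|H)*P(H)/P(E) (here H is a hypothesis and E is the observed evidence, not the sample space)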
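The rules above can be checked numerically on a small sample space. The following Python sketch uses one fair die as an assumed example, with event a "the roll is even" and event b "the roll is greater than 3". The complements (written A and B in the rules) are called not_a and not_b in the code. Exact fractions are used so the equalities hold without rounding error.

```python
from fractions import Fraction

E = frozenset(range(1, 7))     # sample space: faces of one fair die (assumed example)

def prob(event):
    """Classical probability: favorable outcomes / possible outcomes."""
    return Fraction(len(event & E), len(E))

a = frozenset({2, 4, 6})       # event a: the roll is even
b = frozenset({4, 5, 6})       # event b: the roll is greater than 3
not_a, not_b = E - a, E - b    # complements of a and b

p_a_given_b = prob(a & b) / prob(b)              # conditional probability P(a|b)
p_b_given_a = p_a_given_b * prob(b) / prob(a)    # Bayes' theorem gives P(b|a)

checks = {
    "addition rule": prob(a | b) == prob(a) + prob(b) - prob(a & b),
    "partition of a": prob(a) == prob(a & b) + prob(a & not_b),
    "complement": prob(not_a) == 1 - prob(a),
    "union of complements": prob(not_a | not_b) == 1 - prob(a & b),
    "intersection of complements": prob(not_a & not_b) == 1 - prob(a | b),
    "Bayes matches direct computation": p_b_given_a == prob(a & b) / prob(a),
}
for name, holds in checks.items():
    print(f"{name}: {holds}")
print("P(a|b) =", p_a_given_b)
```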
Topic 4: Random Variables and Probability Distributions
Random Variables
- Discrete Random Variable: Takes on a finite or countable number of values.
- Continuous Random Variable: Can take on any value within a given range.
Probability Function
P(X=x) gives the probability that a discrete random variable X takes on the value x.
Properties:
- 0≤P(X=x)≤1
- Σx P(X=x)=1 (the probabilities of all possible values add up to 1)
Cumulative Probability Function
F(xo)=P(X≤xo) gives the probability that a random variable X is less than or equal to a certain value xo.
Properties:
- 0≤F(xo)≤1
- If B>A, then F(B)≥F(A)
- P(A<X≤B)=F(B)-F(A)
Expected Value
E(X)=μx=Σx x*P(X=x) is the probability-weighted average of the values of a discrete random variable.
Properties:
- If X=k (a constant), then E(k)=k
- E(a+bX)=a+b*E(X), with a and b constants
- For any two random variables X and Y: E(X+Y)=E(X)+E(Y)
- For two independent random variables X and Y: E(X*Y)=E(X)*E(Y)
Variance and Standard Deviation
V(X)=σ^2x=Σx (x-μx)^2*P(X=x)=E[(X-μx)^2] measures the spread of a random variable around its mean. The standard deviation is its square root, σx=√V(X).
Properties:
- If X=k (a constant), then V(k)=0
- V(a+bX)=b^2*V(X), with a and b constants
- For two independent random variables X and Y: V(X±Y)=V(X)+V(Y)
- For any two random variables X and Y: V(X±Y)=V(X)+V(Y)±2*Cov(X,Y)
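The expected value and variance formulas, together with some of their properties, can be verified on a simple discrete distribution. This sketch assumes a fair die as the random variable X and represents its probability function as a dictionary. The helper names (expected_value, variance, linear_transform, sum_of_independent) are illustrative only.

```python
from fractions import Fraction

# Probability function of a fair die (assumed example): P(X=x) = 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

def expected_value(pmf):
    """E(X) = sum over x of x * P(X=x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """V(X) = sum over x of (x - E(X))^2 * P(X=x)."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def linear_transform(pmf, a, b):
    """Probability function of a + b*X."""
    out = {}
    for x, p in pmf.items():
        out[a + b * x] = out.get(a + b * x, 0) + p
    return out

def sum_of_independent(pmf_x, pmf_y):
    """Probability function of X + Y when X and Y are independent."""
    out = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] = out.get(x + y, 0) + px * py
    return out

shifted = linear_transform(die, a=2, b=3)
two_dice = sum_of_independent(die, die)

print("E(X) =", expected_value(die), " V(X) =", variance(die))
print("E(2+3X) =", expected_value(shifted), " V(2+3X) =", variance(shifted))
print("E(X+Y) =", expected_value(two_dice), " V(X+Y) =", variance(two_dice))
```

The printed values agree with E(a+bX)=a+b*E(X), V(a+bX)=b^2*V(X), and the additivity of expectation and (for independent variables) variance.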
Discrete Random Variable Models
- Binomial (X~Bin(n,p)): Models the number of successes in n independent trials, each with probability p of success.
- P(X=x)=C(n,x)*p^x*(1-p)^(n-x), where C(n,x)=n!/(x!*(n-x)!)
- E(X)=μx=n*p
- V(X)=σ^2x=n*p*(1-p)
- Poisson (X~Poisson(λ)): Models the number of events occurring in a fixed interval of time or space, given an average rate of occurrence λ (λ=n*p when it is used to approximate a binomial).
- P(X=x)=e^(-λ)*λ^x/x!
- E(X)=μx=λ
- V(X)=σ^2x=λ
- Bernoulli Trial (X~Bin(1,p)): A special case of the binomial distribution with only one trial.
- P(X=x)=p^x*(1-p)^(1-x), for x=0 or x=1
- E(X)=p
- V(X)=p(1-p)
*When n is very large and p is very small, the Poisson distribution can be used to approximate the binomial distribution.
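The binomial and Poisson probability functions above translate directly into code, which also makes it easy to see the approximation at work. In this sketch the values n=1000 and p=0.003 are arbitrary choices for illustration, and the function names are not from any particular library.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X=x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X=x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

# Assumed example: many trials, small success probability.
n, p = 1000, 0.003
lam = n * p                      # Poisson rate used for the approximation

print(" x  binomial   poisson")
for x in range(6):
    print(f"{x:>2}  {binomial_pmf(x, n, p):.5f}   {poisson_pmf(x, lam):.5f}")
```

With n=1 the binomial function reduces to the Bernoulli case, so binomial_pmf(1, 1, p) returns p and binomial_pmf(0, 1, p) returns 1-p.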
Topic 5: Statistical Inference
Classic Statistics
- Descriptive Statistics: Summarizes and describes data using graphical and numerical methods.
- Inferential Statistics: Uses sample data to make inferences about a larger population, with a degree of uncertainty or error.
Estimation and Hypothesis Testing
- Estimation: Involves estimating population parameters (e.g., mean weight of Spanish women) using sample data.
- Hypothesis Testing: Tests claims about population parameters (e.g., testing if the population mean weight is 60kg).
*Inference is the process of drawing conclusions about a population based on sample results.
Population vs. Sample
- Population (N): The entire set of individuals or items of interest.
- Population Variables (E,n,o): Random features of interest in the population.
- Parameter: A numerical characteristic of a population (e.g., population mean (μ), population variance (σ^2)).
- Sample: A subset of the population used to collect data.
- Statistic: A numerical characteristic of a sample (e.g., sample mean (X̄), sample variance (S^2x)).
Properties of Simple Random Sampling
- The mean of the distribution of the sample mean is the population mean: μX̄=μ
- The standard deviation of the distribution of the sample mean decreases as the sample size n increases: σX̄=σ/√n
Central Limit Theorem
If the population is normally distributed, the sampling distribution of the sample mean (X̄) is normal. If it is not, the distribution of X̄ is still approximately normal when the sample size n is large enough (n→∞). In both cases its mean is μ and its standard deviation is σ/√n.
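A quick simulation can illustrate the behavior of the sample mean described above. The sketch below assumes a clearly non-normal population (rolls of a fair die, with mean 3.5 and variance 35/12), draws many samples of size 50, and compares the mean and standard deviation of the resulting sample means with the theoretical values. The seed, sample size and number of repetitions are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)               # fixed seed so the run is repeatable

population_mean = 3.5                 # E(X) for a fair die
population_sd = (35 / 12) ** 0.5      # square root of V(X) for a fair die

def sample_mean(n):
    """Mean of n simulated rolls of a fair die."""
    return statistics.fmean([rng.randint(1, 6) for _ in range(n)])

n = 50
means = [sample_mean(n) for _ in range(5_000)]

print(f"mean of sample means: {statistics.fmean(means):.4f} (theory: {population_mean})")
print(f"sd of sample means:   {statistics.stdev(means):.4f} "
      f"(theory: {population_sd / n ** 0.5:.4f})")
```

Plotting a histogram of the simulated means would show the bell shape, even though a single die roll is uniformly distributed.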
Sampling Distribution of p̂
For dichotomous data (0,1), the sampling distribution of the sample proportion (p̂) is approximately normal, with mean p and standard deviation √(p(1-p)/n), provided the sample is large enough (a common rule of thumb asks for both n*p and n*(1-p) to be at least 5).
Topic 6: Estimation
Estimators and Estimates
- Estimator (θ̂): A random variable that depends on sample information and provides an approximation to an unknown population parameter.
- Estimate (θ̂0): A specific value of the estimator obtained from an observed sample.
Confidence Interval
A range of values, computed from the sample, that contains the true population parameter with a stated level of confidence (γ%).
Confidence interval for the mean at confidence level γ: μ ∈ (X̄-zα/2*σ/√n, X̄+zα/2*σ/√n)
Confidence Factor (zα/2)
The value from the standard normal distribution that leaves a probability of α/2 to its right, where γ=1-α.
Margin of Error (ME)
ME=zα/2*σ/√n
Common confidence levels and the approximate z intervals that correspond to them:
- (-1,1) → about 68%
- (-2,2) → about 95% (the exact value for 95% is z=1.96)
- (-3,3) → about 99.7%
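The confidence factor, margin of error and interval can be computed with NormalDist from Python's standard statistics module. This is a sketch that assumes the population standard deviation is known (called sigma in the code). The sample mean of 62 kg, sigma of 8 kg and n of 100 are invented numbers, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

standard_normal = NormalDist()        # mean 0, standard deviation 1

def z_critical(confidence):
    """Value z that leaves alpha/2 to its right, where alpha = 1 - confidence."""
    alpha = 1 - confidence
    return standard_normal.inv_cdf(1 - alpha / 2)

def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval sample_mean +/- z * sigma / sqrt(n), assuming sigma is known."""
    margin = z_critical(confidence) * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

def coverage(k):
    """Probability that a standard normal value falls in (-k, k)."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Hypothetical sample: mean weight 62 kg, known sigma 8 kg, 100 observations.
low, high = mean_confidence_interval(sample_mean=62.0, sigma=8.0, n=100)
print(f"z for 95%: {z_critical(0.95):.3f}")
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
for k in (1, 2, 3):
    print(f"P(-{k} < Z < {k}) = {coverage(k):.4f}")
```

When sigma is unknown and estimated from a small sample, a t distribution is normally used instead of z; that case is outside this sketch.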