Simple Linear Regression
BIOS 7020: Introductory Biostatistics II Fall 2018
Hanwen Huang, Ph.D.
Department of Epidemiology & Biostatistics
College of Public Health, University of Georgia, huanghw@uga.edu
- Data Summaries (Mean, Median, ...)
- Introduction to Probability
- Inference for the one-sample problem: estimates, CIs, and tests for
  - continuous response: mean
  - binary response: proportion
- Inference for the two-sample problem: estimates, CIs, and tests for
  - continuous response: mean
  - binary response: proportion
- Power Analysis and Sample Size Determination
- Introduction to linear regression

More than two groups (variables) are of interest:
- Linear regression
  - Simple linear regression
  - Multiple linear regression
- ANOVA
  - One-way ANOVA
  - Two-way ANOVA
- Categorical data analysis (binary data)
  - Comparisons of proportions or odds
  - Logistic regression
- Survival Analysis
  - Basic concepts
  - Proportional hazards regression
Outline:
- Motivating examples
- Regression model
- Parameter estimation
- Hypothesis testing
- Prediction
Background:
- Time from exposure to AIDS
- Immunologic status, ranging from 0 to 10
Sample: 10 haemophilia patients
Objective: Predict number of exposure months as a function of immunologic status
Reference: Schork, M.A., and Remington, R.D. (2000). Statistics with Applications to the Biological and Health Sciences. Upper Saddle River, NJ: Prentice Hall.
Data
Immunologic Status Exposure Time (months)
8 4
9 2
3 15
7 6
5 7
8 3
4 12
6 9
3 17
4 11
Questions:
1. Does there appear to be any relationship between the two variables?
2. If so, what is the direction of that relationship?
Scatter Plot
Simple Linear Regression is used when:
- We are interested in the relationship between two variables;
- Both variables are continuous;
- We wish to predict the value of one variable from the value of the other.
Two types of variables:
Dependent Variable: sometimes called the "variable of interest" or Y-variable.
Independent Variable: sometimes called the explanatory variable, predictor variable, or X-variable.
Example:
Dependent Variable: Exposure Time
Independent Variable: Immunologic Status
$n$ = sample size
$x_i$ = independent variable for subject $i$ ($i = 1, \dots, n$)
$y_i$ = dependent variable for subject $i$ ($i = 1, \dots, n$)

The model is
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
where
$\alpha$ = y-intercept
$\beta$ = slope
$\varepsilon_i$ = (random) error
Assumptions
1. The data are realized from a linear regression model: $y_i = \alpha + \beta x_i + \varepsilon_i$.
2. The errors have population mean zero: $E(\varepsilon_i) = 0$.
3. Homoskedasticity: the variance of the errors does not depend on $x$: $\mathrm{var}(\varepsilon_i) = \sigma^2$.
4. Independence: the subjects (sample units) are independently sampled.
5. Normality: the errors are sampled from a normal distribution.
6. The independent variable is measured without error.
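To make these assumptions concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that simulates data satisfying them; the values of alpha, beta, and sigma are hypothetical, chosen only for demonstration.

```python
# Simulate data from the simple linear regression model (illustrative values only).
import numpy as np

rng = np.random.default_rng(2018)

n = 10          # sample size
alpha = 21.2    # hypothetical intercept
beta = -2.2     # hypothetical slope
sigma = 1.6     # hypothetical error standard deviation

x = rng.uniform(3, 9, size=n)         # independent variable, measured without error (assumption 6)
eps = rng.normal(0.0, sigma, size=n)  # independent, mean-zero, constant-variance normal errors (2-5)
y = alpha + beta * x + eps            # linear model y_i = alpha + beta * x_i + eps_i (assumption 1)
```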
Parameter Estimation

The parameters $\alpha$ and $\beta$ are estimated using the method of least squares.

Define the estimated regression line
$$\hat{y}_i = a + b x_i$$
where
$a$ = estimated value of the y-intercept $\alpha$
$b$ = estimated value of the slope $\beta$
$\hat{y}_i$ = fitted value of the dependent variable $y_i$ when the independent variable is equal to $x_i$

Residual

Define: $d_i = y_i - \hat{y}_i = y_i - a - b x_i$

Minimize the sum of the squared residuals:
$$S = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
Important quantities

Sum of squares for $x$ (with $\bar{x} = \sum_{i=1}^{n} x_i / n$):
$$L_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 / n$$

Sum of squares for $y$ (with $\bar{y} = \sum_{i=1}^{n} y_i / n$):
$$L_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2 / n$$

Sum of cross products:
$$L_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) / n$$

Estimated regression coefficients:
$$b = \frac{L_{xy}}{L_{xx}} \qquad \text{and} \qquad a = \bar{y} - b \bar{x}$$
Status ($x_i$)   Time ($y_i$)   $x_i^2$   $y_i^2$   $x_i y_i$
     8                4            64        16        32
     9                2            81         4        18
     3               15             9       225        45
     7                6            49        36        42
     5                7            25        49        35
     8                3            64         9        24
     4               12            16       144        48
     6                9            36        81        54
     3               17             9       289        51
     4               11            16       121        44
Sum: 57              86           369       974       393

Sample means: $\bar{x} = 5.7$ and $\bar{y} = 8.6$
Summary: $L_{xx} = 44.1$, $L_{xy} = -97.2$, $L_{yy} = 234.4$. The estimated regression coefficients are:
$$b = \frac{L_{xy}}{L_{xx}} = \frac{-97.2}{44.1} = -2.204$$
$$a = \bar{y} - b\bar{x} = 8.6 - (-2.204) \times 5.7 = 8.6 + 12.6 = 21.2$$
In summary, we obtain the estimated (fitted) regression line
$$\hat{y} = a + bx = 21.2 - 2.204x$$
Suppose that a subject with unknown exposure time has an immunologic status of x = 5. Then the predicted exposure time for that subject is
$$\hat{y} = 21.2 - 2.204 \times 5 = 10.2 \text{ months}$$
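A minimal sketch of this least-squares computation in Python (numpy only; the array names are my own, not from the lecture):

```python
# Least-squares fit and prediction for the AIDS example.
import numpy as np

status = np.array([8, 9, 3, 7, 5, 8, 4, 6, 3, 4], dtype=float)      # immunologic status (x)
months = np.array([4, 2, 15, 6, 7, 3, 12, 9, 17, 11], dtype=float)  # exposure time (y)

xbar, ybar = status.mean(), months.mean()        # 5.7 and 8.6
Lxx = np.sum((status - xbar) ** 2)               # 44.1
Lxy = np.sum((status - xbar) * (months - ybar))  # -97.2

b = Lxy / Lxx          # slope, about -2.204
a = ybar - b * xbar    # intercept, about 21.2

y_hat = a + b * 5.0    # predicted exposure time at x = 5, about 10.2 months
print(a, b, y_hat)
```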
Reference: Bruce, Kusumi, and Hosmer (1973). Maximal oxygen intake and nomographic assessment of functional aerobic impairment in cardiovascular disease. American Heart Journal 65, 546-562.

The data contain the variables:
- Case
- Duration (seconds)
- VO2Max
- Heart rate (beats per minute)
- Age (years)
- Height (cm)
- Weight (kg)
Objective: Predict VO2Max as a function of duration of exercise.
Question: What is your interpretation of this scatter plot?
Recall: Questions
1. Is exposure time a decreasing function of immunologic status?
2. If so, at what rate does exposure time decrease with increasing immunologic status?

Note: A complete answer requires testing the null hypothesis
$$H_0: \beta = 0$$
against the one-sided alternative hypothesis
$$H_a: \beta < 0$$
Step 1. Construct an Analysis of Variance table
Step 2. Estimate the variance $\sigma^2$ of the errors $\varepsilon_i$ in the model $y_i = \alpha + \beta x_i + \varepsilon_i$
Step 3. Compute the standard error of the estimated slope b
Step 4. Compute the test statistic and carry out the test.
The total variation in the dependent variable is measured by the Total Sum of Squares:
$$\text{Total SS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = L_{yy}$$
The total sum of squares may be partitioned into two terms:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{Model SS}} + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{Residual SS}}$$
The regression sum of squares
$$\text{Model SS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \frac{L_{xy}^2}{L_{xx}}$$
measures the variation of the estimated regression line about the sample mean.

The residual sum of squares
$$\text{Residual SS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = L_{yy} - \frac{L_{xy}^2}{L_{xx}}$$
measures the variation of the data about the estimated regression line.
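As a quick check for the AIDS example (my own arithmetic, using the summary quantities $L_{xx} = 44.1$, $L_{xy} = -97.2$, $L_{yy} = 234.4$ computed earlier):
$$\text{Model SS} = \frac{(-97.2)^2}{44.1} \approx 214.2, \qquad \text{Residual SS} = 234.4 - 214.2 = 20.2,$$
so the partition gives Total SS $= 234.4 = 214.2 + 20.2$.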
ANOVA

Source     d.f.      SS                              MS                   F
Model      1         $L_{xy}^2 / L_{xx}$             Model SS / 1         MS Model / MS Res
Residual   $n - 2$   $L_{yy} - L_{xy}^2 / L_{xx}$    Res SS / $(n - 2)$
Total      $n - 1$   $L_{yy}$
Here, we have two sources of variation, that due to regression and that due to error.
We have a total of n − 1 degrees of freedom. The regression has only one d.f. This leaves n − 2 d.f. for the residuals.
MS is the Mean Square. Mean Squares are obtained by dividing the sum of squares term by its d.f.
The F-statistic is obtained by dividing the MS Model by MS Residual. This may be used to perform a two-sided test of H0: β = 0.
Step 2: Estimate $\sigma^2$

The variance of the errors, $\sigma^2$, may then be estimated by the Mean Square Residual:
$$\hat{\sigma}^2 = \frac{\text{Res SS}}{n - 2} = \frac{L_{yy} - L_{xy}^2 / L_{xx}}{n - 2}$$

Example: AIDS data
$$\hat{\sigma}^2 = \frac{20.2}{8} = 2.52$$
Step 3: Compute the Standard Error of the Estimated Slope

Note: Since the estimated slope $b$ is a function of our random data, $b$ is also a random variable, so $b$ has a mean and a variance. Under our model assumptions:
$$E(b) = \beta$$
$$\mathrm{var}(b) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2}{L_{xx}}$$
The variance is estimated by
$$\widehat{\mathrm{var}}(b) = \frac{\hat{\sigma}^2}{L_{xx}}$$
and the standard error of $b$ is
$$SE(b) = \sqrt{\widehat{\mathrm{var}}(b)} = \frac{\hat{\sigma}}{\sqrt{L_{xx}}}$$
Step 4: Compute the Test Statistic and Carry out the Test

Consider the test of $H_0: \beta = 0$ against $H_a: \beta < 0$. Compute the test statistic:
$$t = \frac{b}{SE(b)}$$
Under $H_0$, $t$ is $t$-distributed with $n - 2$ d.f. Since we are carrying out a one-tailed test, reject $H_0$ at level $\alpha$ if
$$|t| > t_{n-2,\,1-\alpha} \quad \text{and} \quad b < 0$$

Notes:
- To test $H_0: \beta = 0$ against $H_a: \beta > 0$, reject $H_0$ at level $\alpha$ if
$$|t| > t_{n-2,\,1-\alpha} \quad \text{and} \quad b > 0$$
- To test $H_0: \beta = 0$ against $H_a: \beta \neq 0$, reject $H_0$ at level $\alpha$ if
$$|t| > t_{n-2,\,1-\alpha/2}$$
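For the AIDS example, a hedged Python sketch of Steps 2-4 (my own code; scipy is used only for the $t$ distribution, and the inputs are the summary quantities computed above):

```python
# Estimate sigma^2, the standard error of b, and carry out the one-sided t test
# of H0: beta = 0 against Ha: beta < 0 for the AIDS data.
import numpy as np
from scipy import stats

n, Lxx, Lxy, Lyy = 10, 44.1, -97.2, 234.4
b = Lxy / Lxx                                  # estimated slope, about -2.204

res_ss = Lyy - Lxy**2 / Lxx                    # residual SS, about 20.2
sigma2_hat = res_ss / (n - 2)                  # estimated error variance, about 2.52
se_b = np.sqrt(sigma2_hat / Lxx)               # standard error of b, about 0.24

t_stat = b / se_b                              # about -9.2 on n - 2 = 8 d.f.
p_one_sided = stats.t.cdf(t_stat, df=n - 2)    # P(T <= t), for Ha: beta < 0
t_crit = stats.t.ppf(0.95, df=n - 2)           # reject H0 at alpha = 0.05 if |t| > t_crit and b < 0
print(t_stat, p_one_sided, t_crit)
```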
Note: A complete conclusion should include:
- An indication of the direction of the effect, when significant.
- Evidence, including the test statistic, d.f., and p-value.
It should also include:
- The magnitude of the effect, using the estimated slope.
- An assessment of the uncertainty of our estimate, by including either the standard error or a 95% confidence interval.
The ANOVA table

ANOVA

Source     d.f.      SS                              MS                   F
Model      1         $L_{xy}^2 / L_{xx}$             Model SS / 1         MS Model / MS Res
Residual   $n - 2$   $L_{yy} - L_{xy}^2 / L_{xx}$    Res SS / $(n - 2)$
Total      $n - 1$   $L_{yy}$
may be used to test
$$H_0: \beta = 0$$
against the two-sided alternative
$$H_a: \beta \neq 0$$
Under $H_0$,
$$F = \frac{\text{MS Model}}{\text{MS Res}}$$
is $F$-distributed with 1 and $n - 2$ degrees of freedom. Here, we have two sets of degrees of freedom:
- The numerator degrees of freedom is equal to 1.
- The denominator degrees of freedom is equal to $n - 2$.
Both of these can be read directly from the ANOVA table. Reject $H_0$ at level $\alpha$ if
$$F > F_{1,\,n-2,\,1-\alpha}$$
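A minimal Python sketch of this F test for the AIDS example (my own code; scipy supplies the F distribution):

```python
# F test of H0: beta = 0 against the two-sided alternative for the AIDS data.
import numpy as np
from scipy import stats

n, Lxx, Lxy, Lyy = 10, 44.1, -97.2, 234.4

model_ss = Lxy**2 / Lxx                   # about 214.2, on 1 d.f.
res_ss = Lyy - model_ss                   # about 20.2, on n - 2 = 8 d.f.

F = (model_ss / 1) / (res_ss / (n - 2))   # about 85; note F = t^2 from the t test
p_value = stats.f.sf(F, dfn=1, dfd=n - 2)        # two-sided p-value
F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)     # reject H0 at alpha = 0.05 if F > F_crit
print(F, p_value, F_crit)
```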
Question: How well does the linear regression model fit the data? Consider the Analysis of Variance table:
ANOVA

Source       d.f.      SS              MS              F
Regression   1         SS Regression   MS Regression   F
Residual     $n - 2$   SS Residual     MSE
Total        $n - 1$   Total SS

Definition: The coefficient of determination is
$$r^2 = \frac{\text{SS Regression}}{\text{Total SS}}$$
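For the AIDS example (my own arithmetic, from the quantities computed earlier):
$$r^2 = \frac{L_{xy}^2 / L_{xx}}{L_{yy}} = \frac{214.2}{234.4} \approx 0.91,$$
so roughly 91% of the variation in exposure time is accounted for by immunologic status.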
(1 − α) × 100% Confidence Interval for β
Confidence intervals are often more easily interpreted than standard errors. They give a range of plausible values for the parameter of interest.
A $(1 - \alpha) \times 100\%$ confidence interval for the slope $\beta$ is given by
$$b \pm t_{n-2,\,1-\alpha/2} \times SE(b)$$
(1 − α) × 100% Confidence Interval for α
We can also compute a confidence interval for the intercept α. First, we need an expression for the standard error of the estimated intercept:
$$SE(a) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{L_{xx}} \right)}$$
Then a $(1 - \alpha) \times 100\%$ confidence interval for $\alpha$ is given by
$$a \pm t_{n-2,\,1-\alpha/2} \times SE(a)$$
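A sketch of 95% confidence intervals for $\beta$ and $\alpha$ in the AIDS example (my own code, using the standard errors defined above; the printed limits are approximate):

```python
# 95% confidence intervals for the slope and intercept in the AIDS data.
import numpy as np
from scipy import stats

n, Lxx, xbar = 10, 44.1, 5.7
a, b, sigma2_hat = 21.2, -2.204, 2.52

t_crit = stats.t.ppf(0.975, df=n - 2)                 # about 2.306

se_b = np.sqrt(sigma2_hat / Lxx)                      # SE of the slope
ci_beta = (b - t_crit * se_b, b + t_crit * se_b)      # roughly (-2.8, -1.7)

se_a = np.sqrt(sigma2_hat * (1 / n + xbar**2 / Lxx))  # SE of the intercept
ci_alpha = (a - t_crit * se_a, a + t_crit * se_a)     # roughly (17.9, 24.5)
print(ci_beta, ci_alpha)
```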
Prediction
The fitted regression line
$$\hat{y} = a + bx$$
may be used to:

1. Estimate the mean of the dependent variable $y$ for a population of subjects sharing a common level of the independent variable $x$ (public health setting). The standard error is
$$SE(\hat{y}) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$
and the $(1 - \alpha) \times 100\%$ confidence interval for the mean is
$$\hat{y} \pm t_{n-2,\,1-\alpha/2} \times \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$

2. Estimate the value of the dependent variable $y$ for a single subject whose value of the independent variable is $x$ (medical setting). The standard error is
$$SE(\hat{y}) = \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$
and the $(1 - \alpha) \times 100\%$ prediction interval is
$$\hat{y} \pm t_{n-2,\,1-\alpha/2} \times \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$
Interpretation and plot
- 95% confidence interval: for the mean
- 95% prediction interval: for an individual
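To contrast the two intervals at $x = 5$ in the AIDS example, a short Python sketch (my own code; the fitted value there is about 10.2 months, and the limits are approximate):

```python
# 95% confidence interval for the mean response vs. 95% prediction interval
# for a new individual at x = 5, AIDS data.
import numpy as np
from scipy import stats

n, Lxx, xbar = 10, 44.1, 5.7
a, b, sigma2_hat = 21.2, -2.204, 2.52

x0 = 5.0
y_hat = a + b * x0                       # about 10.2 months
t_crit = stats.t.ppf(0.975, df=n - 2)

se_mean = np.sqrt(sigma2_hat * (1 / n + (x0 - xbar)**2 / Lxx))      # mean response
se_pred = np.sqrt(sigma2_hat * (1 + 1 / n + (x0 - xbar)**2 / Lxx))  # new individual

ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)   # narrower: for the mean
pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)   # wider: for an individual
print(ci, pi)
```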