Simple Linear Regression

BIOS 7020:  Introductory Biostatistics II Fall 2018

Hanwen Huang, Ph.D.

Department of Epidemiology & Biostatistics

College of Public Health

University of Georgia

huanghw@uga.edu


- Data Summaries (Mean, Median, . . .)
- Introduction to Probability
- Inference for one sample problem: estimates, CIs and tests for
  - continuous response: mean
  - binary response: proportion
- Inference for two sample problem: estimates, CIs and tests for
  - continuous response: mean
  - binary response: proportion
- Power Analysis and Sample Size Determination
- Introduction to linear regression


More than two groups (variables) are of interest

Linear regression

- Simple linear regression
- Multiple linear regression

ANOVA

- One-way ANOVA
- Two-way ANOVA

Categorical data analysis (binary data)

- Comparisons of proportions or odds
- Logistic regression

Survival Analysis

- Basic concepts
- Proportional hazards regression


- Motivation examples
- Regression model
- Parameter estimation
- Hypothesis test
- Prediction


Background:

- Time from exposure to AIDS
- Immunological status ranging from 0 to 10

Sample: 10 haemophilia patients

Objective: Predict number of exposure months as a function of immunologic status

Reference: Schork, M.A., and Remington, R.D. (2000). Statistics with Applications to the Biological and Health Sciences. Upper Saddle River, NJ: Prentice Hall.


Data

Immunologic Status    Exposure Time (months)
------------------    ----------------------
8                     4
9                     2
3                     15
7                     6
5                     7
8                     3
4                     12
6                     9
3                     17
4                     11

Questions:

1. Does there appear to be any relationship between the two variables?
2. If so, what is the direction of that relationship?


Scatter Plot

[Figure: scatter plot of the AIDS data]


Simple Linear Regression is used when:

- We are interested in the relationship between two variables;
- Both variables are continuous;
- We wish to predict the value of one variable from the value of the other.


Two types of variables:

- Dependent Variable. Sometimes called the "variable of interest" or Y-variable
- Independent Variable. Sometimes called the explanatory variable, predictor variable or X-variable

Example:

- Dependent Variable: Exposure Time
- Independent Variable: Immunologic Status


- $n$ = Sample Size
- $x_i$ = Independent Variable for Subject $i$ ($i = 1, \dots, n$)
- $y_i$ = Dependent Variable for Subject $i$ ($i = 1, \dots, n$)

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

- $\alpha$ = y-intercept
- $\beta$ = Slope
- $\varepsilon_i$ = (Random) Error


Assumptions

1. The data are realized from a linear regression model

   $$y_i = \alpha + \beta x_i + \varepsilon_i$$

2. The errors have population mean of zero

   $$E(\varepsilon_i) = 0$$

3. Homoskedasticity: The variance of the errors does not depend on X

   $$\operatorname{var}(\varepsilon_i) = \sigma^2$$

4. Independence: The subjects (sample units) are independently sampled

5. Normality: The errors are sampled from a normal distribution

6. The independent variable is measured without error


Parameter Estimation

The parameters α and β are estimated using the Method of Least Squares.

Define the estimated regression line

$$\hat{y}_i = a + b x_i$$

where

- $a$ = Estimated value of the y-intercept α
- $b$ = Estimated value of the slope β
- $\hat{y}_i$ = Fitted value of the dependent variable $y_i$ when the independent variable is equal to $x_i$

Residual

Define: $d_i = y_i - \hat{y}_i = y_i - a - b x_i$

Minimize the sum of the squared residuals:

$$S = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
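To make the least squares criterion concrete, the sketch below evaluates S for two candidate lines on the AIDS data from the table earlier in these slides. The function name and both candidate lines are illustrative assumptions, not part of the lecture; the point is only that a line following the downward trend gives a much smaller S than a flat line.

```python
# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]


def sum_squared_residuals(a, b, xs, ys):
    """Return S(a, b) = sum over i of (y_i - a - b * x_i)^2."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(xs, ys))


# Two hypothetical candidate lines, chosen only for illustration
s_flat = sum_squared_residuals(8.6, 0.0, x, y)     # horizontal line at the mean of y
s_steep = sum_squared_residuals(21.0, -2.2, x, y)  # a downward-sloping line

print(f"S(8.6, 0.0)   = {s_flat:.2f}")
print(f"S(21.0, -2.2) = {s_steep:.2f}")
```

The Method of Least Squares picks the pair (a, b) for which this quantity is as small as possible.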


Important quantities

- Sum of Squares for $x$ (where $\bar{x} = \sum_{i=1}^{n} x_i / n$)

  $$L_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n$$

- Sum of Squares for $y$ (where $\bar{y} = \sum_{i=1}^{n} y_i / n$)

  $$L_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 / n$$

- Sum of Cross Products

  $$L_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) / n$$

Estimated regression coefficients:

$$b = \frac{L_{xy}}{L_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$
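The slides state the least squares estimates without derivation. As a sketch of the missing step, they follow from setting the partial derivatives of the sum of squared residuals $S = \sum_{i=1}^{n} (y_i - a - b x_i)^2$ to zero:

```latex
\begin{align*}
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \quad\Longrightarrow\quad a = \bar{y} - b\bar{x}, \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0 .
\end{align*}
% Substitute a = \bar{y} - b\bar{x} into the second equation:
\begin{align*}
\sum_{i=1}^{n} x_i (y_i - \bar{y}) &= b \sum_{i=1}^{n} x_i (x_i - \bar{x}) .
\end{align*}
% Since \sum (y_i - \bar{y}) = 0 and \sum (x_i - \bar{x}) = 0,
% x_i may be replaced by (x_i - \bar{x}) on both sides:
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) &= b \sum_{i=1}^{n} (x_i - \bar{x})^2
  \quad\Longrightarrow\quad b = \frac{L_{xy}}{L_{xx}} .
\end{align*}
```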


Status (x_i)    Time (y_i)    x_i²    y_i²    x_i y_i
------------    ----------    ----    ----    -------
8               4             64      16      32
9               2             81      4       18
3               15            9       225     45
7               6             49      36      42
5               7             25      49      35
8               3             64      9       24
4               12            16      144     48
6               9             36      81      54
3               17            9       289     51
4               11            16      121     44
------------    ----------    ----    ----    -------
57              86            369     974     393

Sample means: $\bar{x} = 5.7$ and $\bar{y} = 8.6$


Summary: $L_{xx} = 44.1$, $L_{xy} = -97.2$, $L_{yy} = 234.4$. The estimated regression coefficients are:

$$b = \frac{L_{xy}}{L_{xx}} = \frac{-97.2}{44.1} = -2.204$$

and

$$a = \bar{y} - b\bar{x} = 8.6 - (-2.204) \times 5.7 = 8.6 + 12.6 = 21.2$$
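The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the variable names mirror the slide notation but are otherwise my own choice.

```python
# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

sum_x, sum_y = sum(x), sum(y)

# Computational forms of the sums of squares and cross products
Lxx = sum(xi * xi for xi in x) - sum_x ** 2 / n
Lyy = sum(yi * yi for yi in y) - sum_y ** 2 / n
Lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n

# Least squares estimates of the slope and intercept
b = Lxy / Lxx
a = sum_y / n - b * sum_x / n

print(f"Lxx = {Lxx:.1f}, Lxy = {Lxy:.1f}, Lyy = {Lyy:.1f}")
print(f"b = {b:.3f}, a = {a:.1f}")
```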


In summary, we obtain the estimated (fitted) regression line

$$\hat{y} = a + bx = 21.2 - 2.204x$$

Suppose that a subject with unknown exposure time has an immunologic status of x = 5. Then the predicted exposure time for that subject is

$$\hat{y} = 21.2 - 2.204 \times 5 = 10.2 \text{ months}$$


Reference: Bruce, Kusumi, and Hosmer. (1973). Maximal oxygen intake and nomographic assessment of functional aerobic impairment in cardiovascular disease. American Heart Journal 85, 546-562.

The data contain variables:

- Case
- Duration (seconds)
- VO2Max
- Heart Rate (beats per minute)
- Age (years)
- Height (cm)
- Weight (kg)

Objective: Predict VO2Max as a function of duration of exercise.


[Figure: scatter plot]

Question: What is your interpretation of this scatter plot?


Recall:  Questions

1. Is exposure time a decreasing function of immunologic status?
2. If so, at what rate does exposure time decrease with increasing immunologic status?

Note: A complete answer requires testing the null hypothesis

$$H_0: \beta = 0$$

against the one-sided alternative hypothesis

$$H_a: \beta < 0$$


Step 1. Construct an Analysis of Variance table

Step 2. Estimate the variance $\sigma^2$ of the errors $\varepsilon_i$

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

Step 3. Compute the standard error of the estimated slope $b$

Step 4. Compute the test statistic and carry out the test.


The total variation in the dependent variable is measured by the Total Sum of Squares:

$$\text{Total SS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = L_{yy}$$

The total sum of squares may be partitioned into two terms:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{Model SS}} + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{Residual SS}}$$


The regression sum of squares

$$\text{Model SS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \frac{L_{xy}^2}{L_{xx}}$$

measures the variation of the estimated regression line about the sample mean.

The residual sum of squares

$$\text{Residual SS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = L_{yy} - \frac{L_{xy}^2}{L_{xx}}$$

measures the variation of the data about the estimated regression line.
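The partition of the total sum of squares can be verified numerically on the AIDS data. The sketch below computes each term directly from its definition (using the fitted values) rather than from the shortcut formulas; variable names are illustrative.

```python
# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares fit and fitted values
b = Lxy / Lxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

# Each sum of squares computed from its definition
total_ss = sum((yi - y_bar) ** 2 for yi in y)
model_ss = sum((fi - y_bar) ** 2 for fi in fitted)
resid_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

print(f"Total SS    = {total_ss:.1f}")
print(f"Model SS    = {model_ss:.1f}")
print(f"Residual SS = {resid_ss:.1f}")
```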


ANOVA

Source      d.f.     SS                  MS                  F
--------    -----    ----------------    ----------------    -----------------
Model       1        Lxy²/Lxx            Model SS / 1        MS Model / MS Res
Residual    n − 2    Lyy − Lxy²/Lxx      Res SS / (n − 2)
Total       n − 1    Lyy

Here, we have two sources of variation, that due to regression and that due to error.

We have a total of n − 1 degrees of freedom.  The regression has only one d.f.  This leaves n − 2 d.f.  for the residuals.

MS is the Mean Square.  Mean Squares are obtained by dividing the sum of squares term by its d.f.

The F-statistic is obtained by dividing the MS Model by MS Residual. This may be used to perform a two-sided test of $H_0: \beta = 0$.
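As an illustration, the ANOVA table for the AIDS data can be filled in with the shortcut formulas. This is a sketch in plain Python; the layout of the printed table is my own and not taken from any statistical package.

```python
# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lyy = sum((yi - y_bar) ** 2 for yi in y)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sums of squares from the shortcut formulas
model_ss = Lxy ** 2 / Lxx
resid_ss = Lyy - model_ss

# Mean squares: each sum of squares divided by its degrees of freedom
ms_model = model_ss / 1
ms_resid = resid_ss / (n - 2)
F = ms_model / ms_resid

print(f"{'Source':<10}{'d.f.':>5}{'SS':>10}{'MS':>10}{'F':>8}")
print(f"{'Model':<10}{1:>5}{model_ss:>10.2f}{ms_model:>10.2f}{F:>8.2f}")
print(f"{'Residual':<10}{n - 2:>5}{resid_ss:>10.2f}{ms_resid:>10.2f}")
print(f"{'Total':<10}{n - 1:>5}{Lyy:>10.2f}")
```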


Step 2: Estimate σ²

The variance of the errors $\sigma^2$ may then be estimated by the Mean Square Residual

$$\hat{\sigma}^2 = \frac{\text{Res SS}}{n - 2} = \frac{L_{yy} - L_{xy}^2 / L_{xx}}{n - 2}$$

Example: AIDS Data

$$\hat{\sigma}^2 = \frac{20.2}{8} = 2.52$$


Step 3: Compute the Standard Error of the Estimated Slope

Note: Since the estimated slope $b$ is a function of our random data, $b$ is also a random variable. So, $b$ has a mean and variance. Under our model assumptions:

- $E(b) = \beta$

- $\operatorname{var}(b) = \dfrac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{\sigma^2}{L_{xx}}$

- $\widehat{\operatorname{var}}(b) = \dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{\hat{\sigma}^2}{L_{xx}}$

- $SE(b) = \sqrt{\widehat{\operatorname{var}}(b)} = \sqrt{\dfrac{\hat{\sigma}^2}{L_{xx}}}$
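For the AIDS data, the standard error of the slope follows directly from the estimated error variance and $L_{xx}$. A minimal sketch, with illustrative variable names:

```python
import math

# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lyy = sum((yi - y_bar) ** 2 for yi in y)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Estimated error variance: Mean Square Residual
sigma2_hat = (Lyy - Lxy ** 2 / Lxx) / (n - 2)

# Standard error of the estimated slope
se_b = math.sqrt(sigma2_hat / Lxx)

print(f"sigma^2 estimate = {sigma2_hat:.2f}")
print(f"SE(b) = {se_b:.3f}")
```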


Step 4:  Compute the Test Statistic and Carry out the Test

Consider testing $H_0: \beta = 0$ against $H_a: \beta < 0$. Compute the test statistic:

$$t = \frac{b}{SE(b)}$$

Under $H_0$, $t$ is t-distributed with n − 2 d.f. Since we are carrying out a one-tailed test, reject $H_0$ at level α if

$$|t| > t_{n-2,\,1-\alpha} \quad \text{and} \quad b < 0$$

Notes:

- To test $H_0: \beta = 0$ against $H_a: \beta > 0$, reject $H_0$ at level α if

  $$|t| > t_{n-2,\,1-\alpha} \quad \text{and} \quad b > 0$$

- To test $H_0: \beta = 0$ against $H_a: \beta \neq 0$, reject $H_0$ at level α if

  $$|t| > t_{n-2,\,1-\alpha/2}$$
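Applied to the AIDS data, the one-sided test can be sketched as follows. The critical value 1.860 is the 0.95 quantile of the t distribution with 8 d.f., hard-coded here from a standard t table because the Python standard library has no t distribution; the 5% level is an assumption made for illustration.

```python
import math

# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lyy = sum((yi - y_bar) ** 2 for yi in y)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = Lxy / Lxx
sigma2_hat = (Lyy - Lxy ** 2 / Lxx) / (n - 2)
se_b = math.sqrt(sigma2_hat / Lxx)

# Test statistic for H0: beta = 0
t = b / se_b

# Assumed level 0.05; t_{8, 0.95} taken from a standard t table
T_CRIT_ONE_SIDED = 1.860
reject_h0 = abs(t) > T_CRIT_ONE_SIDED and b < 0

print(f"t = {t:.2f} on {n - 2} d.f.")
print("Reject H0 in favour of beta < 0:", reject_h0)
```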


Note:  A complete conclusion should have

- Indication of the direction of the effect, when significant.
- Evidence including the test statistic, d.f., and p-value.

Should also indicate

- Magnitude of the effect using the estimated slope.
- An assessment of the uncertainty of our estimate by including either the standard error or a 95% confidence interval.


The ANOVA table


ANOVA

Source      d.f.     SS                  MS                  F
--------    -----    ----------------    ----------------    -----------------
Model       1        Lxy²/Lxx            Model SS / 1        MS Model / MS Res
Residual    n − 2    Lyy − Lxy²/Lxx      Res SS / (n − 2)
Total       n − 1    Lyy

may be used to test

$$H_0: \beta = 0$$

against the two-sided alternative

$$H_a: \beta \neq 0$$

Under $H_0$

$$F = \frac{\text{MS Model}}{\text{MS Res}}$$

is F-distributed with 1 and n − 2 degrees of freedom. Here, we have two sets of degrees of freedom:

- The numerator degrees of freedom is equal to 1
- The denominator degrees of freedom is equal to n − 2

Both of these can be read directly from the ANOVA table. Reject $H_0$ at level α if

$$F > F_{1,\,n-2,\,1-\alpha}$$
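In simple linear regression the F-statistic equals the square of the t-statistic for the slope, so the F-test and the two-sided t-test always agree. The sketch below checks this on the AIDS data. The critical value 5.32 is the 0.95 quantile of the F distribution with 1 and 8 d.f., hard-coded from a standard table; the 5% level is an assumption for illustration.

```python
import math

# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lyy = sum((yi - y_bar) ** 2 for yi in y)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

ms_model = Lxy ** 2 / Lxx                  # Model SS on 1 d.f.
ms_resid = (Lyy - ms_model) / (n - 2)      # Residual SS on n - 2 d.f.

F = ms_model / ms_resid
t = (Lxy / Lxx) / math.sqrt(ms_resid / Lxx)

# Assumed level 0.05; F_{1, 8, 0.95} taken from a standard F table
F_CRIT = 5.32
reject_h0 = F > F_CRIT

print(f"F = {F:.2f}, t^2 = {t ** 2:.2f}")
print("Reject H0 (two-sided):", reject_h0)
```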


Question: How well does the linear regression model fit the data? Consider the Analysis of Variance table:

ANOVA

Source        d.f.     SS               MS               F
----------    -----    -------------    -------------    ---
Regression    1        SS Regression    MS Regression    F
Residual      n − 2    SS Residual      MSE
Total         n − 1    Total SS

Definition: The coefficient of determination is

$$r^2 = \frac{\text{SS Regression}}{\text{Total SS}}$$
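For the AIDS data the coefficient of determination can be computed from the same sums of squares. A short sketch, with illustrative variable names:

```python
# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lyy = sum((yi - y_bar) ** 2 for yi in y)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

ss_regression = Lxy ** 2 / Lxx   # SS Regression
total_ss = Lyy                   # Total SS

# Proportion of the total variation explained by the regression
r2 = ss_regression / total_ss

print(f"r^2 = {r2:.3f}")
```

A value close to 1 means most of the variation in exposure time is accounted for by the fitted line.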


(1 − α) × 100% Confidence Interval for β

Confidence intervals are often more easily interpreted than standard errors.  They give a range of plausible values for the parameter of interest.

A (1 − α) × 100% confidence interval for the slope β is given by

$$b \pm t_{n-2,\,1-\alpha/2} \times SE(b)$$
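As a worked illustration, a 95% confidence interval for the slope in the AIDS example can be sketched as below. The critical value 2.306 is the 0.975 quantile of the t distribution with 8 d.f., hard-coded from a standard t table rather than computed.

```python
import math

# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lyy = sum((yi - y_bar) ** 2 for yi in y)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = Lxy / Lxx
sigma2_hat = (Lyy - Lxy ** 2 / Lxx) / (n - 2)
se_b = math.sqrt(sigma2_hat / Lxx)

# t_{8, 0.975} from a standard t table (95% two-sided interval)
T_CRIT = 2.306
lower = b - T_CRIT * se_b
upper = b + T_CRIT * se_b

print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f})")
```

Because the whole interval lies below zero, it is consistent with the conclusion that exposure time decreases as immunologic status increases.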


(1 − α) × 100% Confidence Interval for α

We can also compute a confidence interval for the intercept α. First, we need an expression for the standard error of the estimated intercept:


$$SE(a) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{L_{xx}} \right)}$$

Then a (1 − α) × 100% confidence interval for α is given by

$$a \pm t_{n-2,\,1-\alpha/2} \times SE(a)$$
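The same approach gives a 95% confidence interval for the intercept in the AIDS example. In this sketch the critical value 2.306 (the 0.975 quantile of the t distribution with 8 d.f.) is again hard-coded from a standard t table, and the variable names are illustrative.

```python
import math

# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lyy = sum((yi - y_bar) ** 2 for yi in y)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = Lxy / Lxx
a = y_bar - b * x_bar
sigma2_hat = (Lyy - Lxy ** 2 / Lxx) / (n - 2)

# Standard error of the estimated intercept
se_a = math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / Lxx))

# t_{8, 0.975} from a standard t table (95% two-sided interval)
T_CRIT = 2.306
lower = a - T_CRIT * se_a
upper = a + T_CRIT * se_a

print(f"SE(a) = {se_a:.3f}")
print(f"95% CI for the intercept: ({lower:.2f}, {upper:.2f})")
```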


Prediction


$$\hat{y} = a + bx$$

Here, $\hat{y}$ may be used to:

- Estimate the mean of the dependent variable $y$ for a population of subjects sharing a common level of the independent variable $x$ (public health setting).

  $$SE(\hat{y}) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$

- Estimate the value of the dependent variable $y$ for a subject whose value of the independent variable is $x$ (medical setting).

  $$SE(\hat{y}) = \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$

Estimate the mean of the dependent variable $y$ for a population of subjects sharing a common level of the independent variable $x$.

$$SE(\hat{y}) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$

The (1 − α) × 100% confidence interval for the mean is

$$\hat{y} \pm t_{n-2,\,1-\alpha/2} \times \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$

Estimate the value of the dependent variable $y$ for a subject whose value of the independent variable is $x$.

$$SE(\hat{y}) = \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$

The (1 − α) × 100% prediction interval for the individual is

$$\hat{y} \pm t_{n-2,\,1-\alpha/2} \times \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{L_{xx}} \right)}$$
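The difference between the two intervals is easiest to see side by side. The sketch below computes both for the AIDS data at an immunologic status of 5 (the value used in the earlier prediction example). The critical value 2.306 is hard-coded from a standard t table, and `x0` and the other names are my own. Note that the unrounded coefficients give a fitted value of about 10.1 months, slightly below the 10.2 obtained from the rounded coefficients.

```python
import math

# AIDS example data (immunologic status, exposure time in months)
x = [8, 9, 3, 7, 5, 8, 4, 6, 3, 4]
y = [4, 2, 15, 6, 7, 3, 12, 9, 17, 11]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Lxx = sum((xi - x_bar) ** 2 for xi in x)
Lyy = sum((yi - y_bar) ** 2 for yi in y)
Lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = Lxy / Lxx
a = y_bar - b * x_bar
sigma2_hat = (Lyy - Lxy ** 2 / Lxx) / (n - 2)

x0 = 5                      # immunologic status of interest
y_hat = a + b * x0          # fitted value at x0

# Shared term 1/n + (x0 - x_bar)^2 / Lxx
leverage = 1 / n + (x0 - x_bar) ** 2 / Lxx
se_mean = math.sqrt(sigma2_hat * leverage)        # for the population mean
se_pred = math.sqrt(sigma2_hat * (1 + leverage))  # for a single subject

# t_{8, 0.975} from a standard t table (95% two-sided intervals)
T_CRIT = 2.306
ci_mean = (y_hat - T_CRIT * se_mean, y_hat + T_CRIT * se_mean)
pi_indiv = (y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred)

print(f"Fitted value at x = {x0}: {y_hat:.2f} months")
print(f"95% confidence interval (mean):       ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
print(f"95% prediction interval (individual): ({pi_indiv[0]:.2f}, {pi_indiv[1]:.2f})")
```

The prediction interval is always wider than the confidence interval because it also accounts for the random error of a single subject.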


Interpretation and plot

- 95% confidence interval: for mean
- 95% prediction interval: for individual