Understanding Statistical Significance and Portfolio Diversification in Finance
Understanding Statistical Significance in Regression Analysis
Recall that the probability of rejecting a correct null hypothesis is equal to the size of the test, denoted α. The possibility of rejecting a correct null hypothesis arises because test statistics are random variables: even when the null is true, they will take on extreme values that fall in the rejection region some of the time by chance alone. A consequence of this is that it will almost always be possible to find significant relationships between variables if enough variables are examined. The implication is that for any regression, if enough explanatory variables are employed, often one or more will be significant by chance alone. More concretely, if a test of size α% is used, on average one in every (100/α) regressions will have a significant slope coefficient by chance alone.
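To make this concrete, the short simulation below (an illustrative sketch, not taken from any study) regresses a purely random y on an unrelated x many times and counts how often the slope is "significant" at the 5% level; the rejection rate comes out close to α, i.e., roughly one regression in twenty.

```python
# Illustrative sketch: how often does an unrelated regressor look "significant"?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_regressions, alpha = 100, 10_000, 0.05

rejections = 0
for _ in range(n_regressions):
    x = rng.standard_normal(n_obs)
    y = rng.standard_normal(n_obs)   # y is genuinely unrelated to x
    result = stats.linregress(x, y)
    if result.pvalue < alpha:        # two-sided test of H0: slope = 0
        rejections += 1

print(f"Fraction of spuriously significant slopes: {rejections / n_regressions:.3f}")
# Expected to be close to alpha = 0.05, i.e. about one regression in twenty.
```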
Trying many variables in a regression without basing the selection of the candidate variables on a financial or economic theory is known as data mining or data snooping. The result in such cases is that the true significance level will be considerably greater than the nominal significance level assumed. For example, suppose that 20 separate regressions are conducted using a 5% nominal significance level; by chance alone we would expect roughly one of them to contain a significant regressor, so if three do, the true significance level is considerably higher than 5% (three out of twenty is already 15%). If the researcher then reports only the three regressions containing significant regressors and states that they are significant at the 5% level, inappropriate conclusions concerning the true significance of the variables would result.
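A back-of-the-envelope calculation (assuming, purely for illustration, that the 20 regressions are independent) shows how quickly the chance of at least one spurious rejection grows relative to the nominal 5% level:

```python
# Probability of at least one spuriously significant regressor across a batch
# of independent regressions (illustrative assumption of independence).
alpha, n_regressions = 0.05, 20

p_at_least_one = 1 - (1 - alpha) ** n_regressions
print(f"P(at least one spurious rejection) = {p_at_least_one:.2f}")  # ~0.64, far above 0.05
```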
As well as ensuring that the selection of candidate regressors for inclusion in a model is made on the basis of financial or economic theory, another way to avoid data mining is by examining the forecast performance of the model in an out-of-sample data set (see chapter 5). The idea is essentially that a proportion of the data is not used in model estimation but is retained for model testing. A relationship observed in the estimation period that is purely the result of data mining, and is therefore spurious, is very unlikely to be repeated for the out-of-sample period. Therefore, models that are the product of data mining are likely to fit very poorly and to give very inaccurate forecasts for the out-of-sample period.
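As a rough illustration of this procedure, the sketch below (using simulated data and an assumed 80/20 chronological split, not any particular dataset from the chapter) estimates a simple linear model on the first part of the sample and then evaluates its forecasts over the retained hold-out period:

```python
# Minimal sketch of out-of-sample evaluation, assuming a single-regressor
# linear model and an 80/20 chronological split.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.standard_normal(250)
y = 0.5 * x + rng.standard_normal(250)           # hypothetical data

split = int(0.8 * len(y))                        # estimation period vs hold-out period
X_in, X_out = sm.add_constant(x[:split]), sm.add_constant(x[split:])
model = sm.OLS(y[:split], X_in).fit()            # estimate on the in-sample data only

forecasts = model.predict(X_out)                 # forecast the retained period
oos_mse = np.mean((y[split:] - forecasts) ** 2)
print(f"In-sample R^2: {model.rsquared:.3f}, out-of-sample MSE: {oos_mse:.3f}")
# A data-mined (spurious) relationship would typically show a respectable
# in-sample fit but poor forecasts over the hold-out period.
```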
Jensen’s Alpha and Mutual Fund Performance
Jensen systematically tested the performance of mutual funds, and in particular examined whether any beat the market. He used a sample of annual returns on the portfolios of 115 mutual funds over the period 1945–64. Each of the 115 funds was subjected to a separate OLS time-series regression of the form:
Rjt − Rft = αj + βj(Rmt − Rft) + ujt
where Rjt is the return on portfolio j at time t, Rft is the return on a risk-free proxy (a 1-year government bond), Rmt is the return on a market portfolio proxy, ujt is an error term, and αj, βj are parameters to be estimated. The quantity of interest is the significance of αj, since this parameter defines whether the fund outperforms or underperforms the market index. Thus the null hypothesis is given by: H0 : αj = 0. A positive and significant αj for a given fund would suggest that the fund is able to earn significant abnormal returns in excess of the market-required return for a fund of this given riskiness. This coefficient has become known as Jensen’s alpha.
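The regression is straightforward to estimate by OLS for any single fund. The sketch below uses simulated returns rather than Jensen's original data, so the series and parameter values are purely illustrative:

```python
# Sketch of the Jensen regression for a single fund j, using simulated
# returns (hypothetical fund, market, and risk-free series, not Jensen's data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
T = 120
r_f = np.full(T, 0.003)                                          # risk-free proxy
r_m = 0.008 + 0.04 * rng.standard_normal(T)                      # market portfolio proxy
r_j = r_f + 1.1 * (r_m - r_f) + 0.02 * rng.standard_normal(T)    # fund returns (alpha = 0 by construction)

X = sm.add_constant(r_m - r_f)                                   # regressor: market excess return
results = sm.OLS(r_j - r_f, X).fit()                             # dependent variable: fund excess return

alpha_j, beta_j = results.params                                 # alpha_j is Jensen's alpha
print(f"alpha_j = {alpha_j:.4f} (p-value {results.pvalues[0]:.3f}), beta_j = {beta_j:.3f}")
# A positive, significant alpha_j would indicate abnormal returns beyond those
# justified by the fund's riskiness; here it should be insignificant, since the
# simulated fund earns no abnormal return.
```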
Portfolio Diversification and Risk Reduction
The return on a two-asset portfolio is Rp = w1 r1 + w2 r2, with w1 + w2 = 1. Diversification reduces risk to a degree that depends on the correlation between the assets held: combining assets that are not perfectly correlated lowers portfolio risk without sacrificing expected return, improving the risk–return trade-off. If the correlation were +1, the assets would move in perfect lockstep and no risk reduction would occur; if it were −1, the weights could be chosen so that portfolio risk is eliminated entirely. Starting from the portfolio variance, we can derive the weights that minimise the portfolio's variance (risk).
Var(rp) = w1² Var(r1) + w2² Var(r2) + 2 w1 w2 Cov(r1, r2)

Substituting w2 = 1 − w1 and minimising with respect to w1 by setting the derivative to zero:

dVar(rp)/dw1 = 2 w1 Var(r1) − 2(1 − w1) Var(r2) + 2 Cov(r1, r2) − 4 w1 Cov(r1, r2) = 0

Solving for w1 gives the minimum-variance weight:

w1 = (Var(r2) − Cov(r1, r2)) / (Var(r1) + Var(r2) − 2 Cov(r1, r2)),   w2 = 1 − w1
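As a numerical check of the formula, the sketch below plugs in illustrative (hypothetical) variances and a covariance, computes the minimum-variance weights, and confirms that the resulting portfolio variance is lower than that of either asset held alone:

```python
# Numerical check of the minimum-variance weight formula with illustrative inputs.
var_1, var_2, cov_12 = 0.04, 0.09, 0.006   # hypothetical Var(r1), Var(r2), Cov(r1, r2)

w1 = (var_2 - cov_12) / (var_1 + var_2 - 2 * cov_12)   # minimum-variance weight on asset 1
w2 = 1 - w1

port_var = w1**2 * var_1 + w2**2 * var_2 + 2 * w1 * w2 * cov_12
print(f"w1 = {w1:.3f}, w2 = {w2:.3f}, portfolio variance = {port_var:.4f}")
# With these inputs the portfolio variance (about 0.030) is below both individual
# variances, illustrating the risk reduction achieved through diversification.
```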