Wednesday, 9 October 2013

Regression Analysis in Statistics

Introduction:
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are fixed.
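In standard notation this conditional-expectation view reads as follows (a generic sketch: the function f and the parameter vector beta are placeholders for whatever model is chosen, not a specific formula from this post):

E[\, Y \mid X_1 = x_1, \ldots, X_p = x_p \,] = f(x_1, \ldots, x_p; \beta)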


Where is regression analysis used?
Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusory or spurious relationships, so caution is advisable: correlation does not imply causation.


Assumptions made for Regression Analysis:
·         The sample is representative of the population for the inference or prediction being made.
·         The error is a random variable with a mean of zero conditional on the explanatory variables.
·         The independent variables are measured with no error. (Note: if this is not so, modeling may be done instead using errors-in-variables model techniques.)
·         The predictors are linearly independent, i.e. it is not possible to express any predictor as a linear combination of the others (a numerical check is sketched after this list).
·         The errors are uncorrelated, that is, the variance-covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.
·         The variance of the error is constant across observations. If not, weighted least squares or other methods might instead be used.
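As a minimal sketch of checking the linear-independence assumption numerically in Python with NumPy (the design matrix here is made up purely for illustration):

import numpy as np

# Hypothetical design matrix: 5 observations, 3 predictors.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.5],
              [3.0, 4.0, 1.0],
              [4.0, 2.5, 2.0],
              [5.0, 3.0, 4.0]])

# If the predictors are linearly independent, X has full column rank.
print(np.linalg.matrix_rank(X))   # expect 3 for linearly independent columns

# A very large condition number signals near-collinearity even at full rank.
print(np.linalg.cond(X))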

Regression Model:
Linear Regression:
Linear regression is an approach to model the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. The case of one explanatory variable is called simple linear regression. Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters, and because the statistical properties of the resulting estimators are easier to determine.
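As a minimal sketch of simple linear regression in Python (the data are invented for illustration; NumPy's polyfit performs an ordinary least-squares fit):

import numpy as np

# Hypothetical observations of one explanatory variable x and response y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = b1*x + b0 by ordinary least squares (degree-1 polynomial).
b1, b0 = np.polyfit(x, y, deg=1)
print(f"slope={b1:.3f}, intercept={b0:.3f}")  # roughly slope 2, intercept 0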


Practical Use of Linear Regression Model:
Linear regression has many practical uses. Most applications fall into one of the following two broad categories:
·         If the goal is prediction, forecasting, or error reduction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y (see the sketch after this list).
·         Given a variable y and a number of variables X1, ..., Xp that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the Xj, to assess which Xj may have no relationship with y at all, and to identify which subsets of the Xj contain redundant information about y.
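A minimal sketch of both use cases with the statsmodels library (the data are fabricated; x2 is constructed to be pure noise, so its coefficient should be statistically indistinguishable from zero):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # unrelated to y by construction
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Use case 1: predict y for a new observation of X without its y value.
x_new = sm.add_constant(np.array([[0.5, -1.0]]), has_constant='add')
print(fit.predict(x_new))

# Use case 2: assess which predictors relate to y; x2's p-value should be large.
print(fit.pvalues)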

Assumptions for Linear Regression Model:
The following are the major assumptions made by standard linear regression models with standard estimation techniques.
·         Weak exogeneity. This essentially means that the predictor variables x can be treated as fixed values, rather than random variables. This means, for example, that the predictor variables are assumed to be error-free, that is, they are not contaminated with measurement errors.
·         Linearity. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values, linearity is really only a restriction on the parameters.
·         Constant variance. This means that different response variables have the same variance in their errors, regardless of the values of the predictor variables. In practice this assumption is often violated when the response variable can vary over a wide scale: typically, a response variable whose mean is large will have a greater variance than one whose mean is small. For example, a person whose income is predicted to be $100,000 may easily have an actual income of $80,000 or $120,000 (a standard deviation of around $20,000), while another person with a predicted income of $10,000 is unlikely to have the same $20,000 standard deviation, since that would imply their actual income could vary anywhere between -$10,000 and $30,000. A diagnostic check for this assumption is sketched after this list.
·         Independence of errors. This assumes that the errors of the response variables are uncorrelated with each other. Some methods are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors.
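As a minimal diagnostic sketch for the constant-variance assumption, using the Breusch-Pagan test from statsmodels (the heteroscedastic data are simulated on purpose, so the test should flag them):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
# Error spread grows with x, deliberately violating constant variance.
y = 2.0 * x + rng.normal(scale=0.5 * x)

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # a small p-value flags heteroscedasticity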

Extensions of Linear Regression:
Simple and Multiple Regression:
The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model.
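A minimal sketch of multiple linear regression with two predictors (fabricated data; np.linalg.lstsq solves the least-squares problem directly):

import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n),            # intercept column
                     rng.normal(size=n),    # predictor 1
                     rng.normal(size=n)])   # predictor 2
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Solve min ||X beta - y||^2 for the coefficient vector beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should be close to [1, 2, -3]
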
Hierarchical Linear Model:
Hierarchical Linear Model (or multilevel regression) organizes the data into a hierarchy of regressions, for example where A is regressed on B, and B is regressed on C. It is often used where the data have a natural hierarchical structure such as in educational statistics, where students are nested in classrooms, classrooms are nested in schools, and schools are nested in some administrative grouping, such as a school district. The response variable might be a measure of student achievement such as a test score, and different covariates would be collected at the classroom, school, and school district levels.
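A minimal sketch of a two-level (random-intercept) model using statsmodels' MixedLM; the data frame and its columns (score, hours, school) are hypothetical stand-ins for the education example above, and a full multilevel analysis would add classroom and district levels:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_schools, n_students = 10, 30
school = np.repeat(np.arange(n_schools), n_students)
school_effect = rng.normal(scale=2.0, size=n_schools)[school]   # school-level intercepts
hours = rng.uniform(0, 10, size=n_schools * n_students)
score = 50 + 3.0 * hours + school_effect + rng.normal(size=n_schools * n_students)

df = pd.DataFrame({"score": score, "hours": hours, "school": school})

# Students (rows) are nested in schools; 'groups' gives each school its own intercept.
model = smf.mixedlm("score ~ hours", df, groups=df["school"])
print(model.fit().summary())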