Regression Analysis in Statistics
Introduction:
Regression analysis is a statistical process for estimating the relationships
among variables. It includes many techniques for modeling and analyzing several
variables when the focus is on the relationship between a dependent variable
and one or more independent variables. More specifically, regression analysis
helps one understand how the typical value of the dependent variable (or
'criterion variable') changes when any one of the independent variables is
varied while the other independent variables are held fixed. Most commonly,
regression analysis estimates the conditional expectation of the dependent
variable given the independent variables – that is, the average value of the
dependent variable when the independent variables are fixed.
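To make the idea of a conditional expectation concrete, here is a minimal
sketch in Python (the simulated data and NumPy usage are illustrative
assumptions, not from the original text): a line is fit by ordinary least
squares, and the fitted value at a fixed x approximates the average value of
y at that x.

```python
# Simulate data whose true conditional mean is E[y | x] = 2 + 3x,
# fit a line by least squares, and evaluate the fit at a fixed x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = 2 + 3 * x + rng.normal(0, 1, size=500)   # noise has mean zero

# np.polyfit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(x, y, deg=1)

x0 = 4.0
print("fitted E[y | x=4]:", intercept + slope * x0)  # close to 2 + 3*4 = 14
```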
Where is regression analysis used?
Regression analysis is widely used for prediction and forecasting, where its
use has substantial overlap with the field of machine learning. Regression
analysis is also used to understand which among the independent variables are
related to the dependent variable, and to explore the forms of these
relationships. In restricted circumstances, regression analysis can be used to
infer causal relationships between the independent and dependent variables.
However, this can lead to illusory or spurious relationships, so caution is
advisable; for example, correlation does not imply causation.
Assumptions made for Regression Analysis:
· The sample is representative of the population about which inference or
prediction is to be made.
· The error is a random variable with a mean of zero conditional on the
explanatory variables.
· The independent variables are measured with no error. (Note: if this is not
so, modeling may be done instead using errors-in-variables techniques.)
· The predictors are linearly independent, i.e. it is not possible to express
any predictor as a linear combination of the others.
· The errors are uncorrelated, that is, the variance-covariance matrix of the
errors is diagonal and each non-zero element is the variance of the error.
· The variance of the error is constant across observations. If not, weighted
least squares or other methods might be used instead (see the sketch after
this list).
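As a concrete illustration of the last point, here is a minimal
weighted-least-squares sketch in Python (the simulated data and variable names
are assumptions for illustration): observations with larger error variance are
down-weighted by the inverse of that variance.

```python
# When the error variance grows with x, weighting each observation by
# 1 / variance restores efficient estimates. (In practice the variances
# would usually have to be estimated; here they are known by construction.)
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(1, 10, size=n)
sigma = 0.5 * x                       # non-constant error standard deviation
y = 1 + 2 * x + rng.normal(0, sigma)

X = np.column_stack([np.ones(n), x])  # design matrix with intercept column
w = 1.0 / sigma**2                    # weights = inverse error variances

# Solve the weighted normal equations (X^T W X) beta = X^T W y
XtW = X.T * w
beta = np.linalg.solve(XtW @ X, XtW @ y)
print("WLS intercept, slope:", beta)  # close to (1, 2)
```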
Regression Model:
Linear Regression:
Linear regression is an approach to modeling the relationship between a scalar
dependent variable y and one or more explanatory variables denoted X. The case
of one explanatory variable is called simple linear regression. Linear
regression was the first type of regression analysis to be studied rigorously,
and to be used extensively in practical applications. This is because models
which depend linearly on their unknown parameters are easier to fit than models
which are non-linearly related to their parameters, and because the statistical
properties of the resulting estimators are easier to determine.
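As an illustration of why linearity in the parameters makes fitting easy, here
is a minimal sketch (illustrative code, not from the source): the least-squares
estimate has the closed form beta = (X^T X)^(-1) X^T y, so no iterative
optimization is needed.

```python
# Fit a straight line by solving the normal equations in closed form.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 4 - 1.5 * x + rng.normal(0, 0.5, size=n)

X = np.column_stack([np.ones(n), x])      # intercept column plus x
beta = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations, one linear solve
print("intercept, slope:", beta)          # close to (4, -1.5)
```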
Practical Uses of the Linear Regression Model:
Linear regression has many practical uses. Most applications fall
into one of the following two broad categories:
· If the goal is prediction, forecasting, or error reduction, linear
regression can be used to fit a predictive model to an observed data set
of y and X values. After developing such a model, if an additional value
of X is then given without its accompanying value of y, the fitted model
can be used to make a prediction of the value of y (see the sketch after
this list).
· Given a variable y and a number of variables X1, ..., Xp that may be
related to y, linear regression analysis can be applied to quantify the
strength of the relationship between y and the Xj, to assess which Xj may
have no relationship with y at all, and to identify which subsets of the
Xj contain redundant information about y.
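A minimal sketch of the prediction workflow in the first bullet (the data and
numbers here are hypothetical): fit a model to observed (X, y) pairs, then
predict y for a new value of X whose y is not yet known.

```python
# Fit on observed data, then predict y for a fresh X value.
import numpy as np

rng = np.random.default_rng(3)
X_obs = rng.uniform(0, 5, size=100)
y_obs = 10 + 2 * X_obs + rng.normal(0, 1, size=100)

slope, intercept = np.polyfit(X_obs, y_obs, deg=1)

X_new = 3.7                         # new X without an accompanying y
y_pred = intercept + slope * X_new
print("predicted y for X =", X_new, ":", y_pred)
```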
Assumptions for the Linear Regression Model:
The following are the major assumptions made by standard linear
regression models with standard estimation techniques.
· Weak exogeneity. This essentially means that the predictor variables x can
be treated as fixed values rather than random variables. This means, for
example, that the predictor variables are assumed to be error-free, that is,
they are not contaminated with measurement errors.
· Linearity. This means that the mean of the response variable is a linear
combination of the parameters (regression coefficients) and the predictor
variables. Note that this assumption is much less restrictive than it may at
first seem. Because the predictor variables are treated as fixed values,
linearity is really only a restriction on the parameters (see the sketch after
this list).
· Constant variance (homoscedasticity). This means that different response
variables have the same variance in their errors, regardless of the values of
the predictor variables. In practice this assumption is invalid if the
response variables can vary over a wide scale. Typically, for example, a
response variable whose mean is large will have a greater variance than one
whose mean is small. For example, a person whose income is predicted to be
$100,000 may easily have an actual income of $80,000 or $120,000 (a standard
deviation of about $20,000), while another person with a predicted income of
$10,000 is unlikely to have the same $20,000 standard deviation, since that
would imply their actual income could vary anywhere between -$10,000 and
$30,000.
· Independence of errors. This assumes that the errors of the response
variables are uncorrelated with each other. Some methods are capable of
handling correlated errors, although they typically require significantly
more data unless some sort of regularization is used to bias the model
towards assuming uncorrelated errors.
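As a sketch of the linearity point above (illustrative assumptions
throughout): the model only needs to be linear in the parameters, so a curve
that is quadratic in x is still a linear regression, because the mean of y is
a linear combination of the basis terms 1, x, and x^2.

```python
# A quadratic curve in x, fit as an ordinary linear regression on (1, x, x^2).
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=300)
y = 1 + 0.5 * x - 2 * x**2 + rng.normal(0, 0.3, size=300)

X = np.column_stack([np.ones_like(x), x, x**2])  # basis: 1, x, x^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients for 1, x, x^2:", beta)       # close to (1, 0.5, -2)
```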
Extensions of Linear Regression:
Simple and Multiple Regression:
The very simplest case of a single scalar predictor variable x and a single
scalar response variable y is known as simple linear regression. The extension
to multiple and/or vector-valued predictor variables (denoted with a capital
X) is known as multiple linear regression. Nearly all real-world regression
models involve multiple predictors, and basic descriptions of linear
regression are often phrased in terms of the multiple regression model.
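A minimal sketch of the extension from one predictor to several (illustrative
data and names): the design matrix X simply gains one column per predictor,
and the fitting step is unchanged. Using np.linalg.lstsq rather than inverting
X^T X directly is a common choice for numerical stability.

```python
# Multiple linear regression: one design-matrix column per predictor.
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 1.0 * x1 - 0.5 * x2 + rng.normal(0, 0.2, size=n)

X = np.column_stack([np.ones(n), x1, x2])     # intercept plus two predictors
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solve
print("intercept, b1, b2:", beta)             # close to (3, 1.0, -0.5)
```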
Hierarchical Linear Model:
A hierarchical linear model (or multilevel regression) organizes the data
into a hierarchy of regressions, for example where A is regressed on B, and
B is regressed on C.
It is often used where the data have a natural hierarchical structure such as
in educational statistics, where students are nested in classrooms, classrooms
are nested in schools, and schools are nested in some administrative grouping,
such as a school district. The response variable might be a measure of student
achievement such as a test score, and different covariates would be collected
at the classroom, school, and school district levels.
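A minimal multilevel sketch (hypothetical data; assumes the statsmodels
package and its mixedlm interface): students are nested in schools, with a
random intercept per school and a fixed effect for a student-level covariate.

```python
# Two-level model: student test scores nested within schools.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n_schools, per_school = 20, 30
school = np.repeat(np.arange(n_schools), per_school)
school_effect = rng.normal(0, 2, size=n_schools)[school]  # school-level shift
hours = rng.uniform(0, 10, size=school.size)              # student covariate
score = 50 + 3 * hours + school_effect + rng.normal(0, 1, size=school.size)

df = pd.DataFrame({"score": score, "hours": hours, "school": school})

# Random intercept for each school; fixed effect for study hours
model = smf.mixedlm("score ~ hours", df, groups=df["school"])
print(model.fit().summary())
```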
