Regression
In the last chapter we saw how to describe the distribution of a single variable in a sample. However, most studies require describing several variables that are often related. For instance, a nutritional study should consider all the variables that could be related to weight, such as height, age, gender, smoking, diet, physical exercise, etc.
To understand a phenomenon that involves several variables, it is not enough to study each variable on its own. We have to study all the variables together to describe how they interact and the type of relation among them.
Usually in a dependency study there is a dependent variable $Y$, whose behaviour we want to explain, and one or more independent variables $X$ on which $Y$ is assumed to depend.
Joint distribution
Joint frequencies
To study the relation between two variables $X$ and $Y$, we have to study the joint frequencies of the two-dimensional variable $(X, Y)$.
Definition - Joint sample frequencies. Given a sample of size $n$ of a two-dimensional variable $(X, Y)$, for each pair of values $(x_i, y_j)$:
- Absolute frequency $n_{ij}$: is the number of times that the pair $(x_i, y_j)$ appears in the sample.
- Relative frequency $f_{ij}$: is the proportion of times that the pair $(x_i, y_j)$ appears in the sample,
$$f_{ij} = \frac{n_{ij}}{n}.$$
Joint frequency distribution
The set of values of the two-dimensional variable together with their frequencies is known as the joint frequency distribution, and it is represented in a joint frequency table.
Example (grouped data). The height (in cm) and weight (in kg) of a sample of 30 students is:
(172,62), (166,60), (194,90), (185,75), (162,55), (187,78),
(198,109), (177,61), (178,70), (165,58), (154,50), (183,93),
(166,51), (171,65), (175,70), (182,60), (167,59), (169,62),
(172,70), (186,71), (172,54), (176,68), (168,67), (187,80).
The joint frequency table is
Scatter plot
The joint frequency distribution can be represented graphically with a scatter plot, where the data are displayed as a collection of points in a Cartesian coordinate system.
Usually the independent variable is represented on the $X$ axis and the dependent variable on the $Y$ axis.
The result is a set of points that is usually known as a point cloud.
Example. The scatter plot below represents the distribution of heights and weights of the previous sample.
The shape of the point cloud in a scatter plot gives information about the type of relation between the variables.
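A scatter plot of this point cloud can be sketched with Python and matplotlib (a minimal sketch using the 24 pairs listed in the example; the remaining pairs of the 30-student sample do not appear in the text):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering, no display needed
import matplotlib.pyplot as plt

# Height (cm) and weight (kg) pairs listed in the example above
pairs = [(172, 62), (166, 60), (194, 90), (185, 75), (162, 55), (187, 78),
         (198, 109), (177, 61), (178, 70), (165, 58), (154, 50), (183, 93),
         (166, 51), (171, 65), (175, 70), (182, 60), (167, 59), (169, 62),
         (172, 70), (186, 71), (172, 54), (176, 68), (168, 67), (187, 80)]
heights = [h for h, _ in pairs]
weights = [w for _, w in pairs]

# Independent variable (height) on the X axis, dependent (weight) on the Y axis
fig, ax = plt.subplots()
ax.scatter(heights, weights)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Point cloud of heights and weights")
fig.savefig("heights_weights.png")
```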
Marginal frequency distributions
The frequency distributions of each variable of the two-dimensional variable are known as marginal frequency distributions.
We can get the marginal frequency distributions from the joint frequency table by summing the frequencies along rows and columns.
Example. The marginal frequency distributions for the previous sample of heights and weights are
and the corresponding statistics are
Covariance
To study the relation between two variables, we have to analyze the joint variation of them.
Dividing the point cloud of the scatter plot into 4 quadrants centered at the mean point $(\bar x, \bar y)$, the signs of the deviations from the means in each quadrant are:

Quadrant | $x_i - \bar x$ | $y_j - \bar y$ | $(x_i - \bar x)(y_j - \bar y)$
---|---|---|---
1 | $+$ | $+$ | $+$
2 | $-$ | $+$ | $-$
3 | $-$ | $-$ | $+$
4 | $+$ | $-$ | $-$
If there is an increasing linear relationship between the variables, most of the points will fall in quadrants 1 and 3, and the sum of the products of deviations from the mean will be positive.
If there is a decreasing linear relationship between the variables, most of the points will fall in quadrants 2 and 4, and the sum of the products of deviations from the means will be negative.
Using the products of deviations from the means we get the following statistic.

Definition - Sample covariance.
$$s_{xy} = \frac{\sum_{i,j} (x_i - \bar x)(y_j - \bar y)\, n_{ij}}{n}$$

It can also be calculated using the formula
$$s_{xy} = \frac{\sum_{i,j} x_i y_j n_{ij}}{n} - \bar x \bar y.$$
The covariance measures the linear relation between two variables:
- If $s_{xy} > 0$, there exists an increasing linear relation.
- If $s_{xy} < 0$, there exists a decreasing linear relation.
- If $s_{xy} = 0$, there is no linear relation.
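A quick numerical check of the covariance sign, using the 24 pairs listed earlier (ungrouped data, so every $n_{ij} = 1$ and the sums run directly over the pairs):

```python
import numpy as np

# Height (cm) and weight (kg) pairs from the example
heights = np.array([172, 166, 194, 185, 162, 187, 198, 177, 178, 165, 154, 183,
                    166, 171, 175, 182, 167, 169, 172, 186, 172, 176, 168, 187],
                   dtype=float)
weights = np.array([62, 60, 90, 75, 55, 78, 109, 61, 70, 58, 50, 93,
                    51, 65, 70, 60, 59, 62, 70, 71, 54, 68, 67, 80],
                   dtype=float)

# Covariance as the mean of products of deviations from the means
sxy = np.mean((heights - heights.mean()) * (weights - weights.mean()))

# Shortcut formula: mean of the products minus the product of the means
sxy_alt = np.mean(heights * weights) - heights.mean() * weights.mean()

print(sxy)  # positive, so the linear relation is increasing
```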
Example. Using the joint frequency table of the sample of heights and weights
we get that the covariance is equal to
This means that there is an increasing linear relation between weight and height.
Regression
In most cases the goal of a dependency study is not only to detect a relation between two variables, but also to express that relation with a mathematical function $y = f(x)$ that allows us to predict the dependent variable for each value of the independent one.
Simple regression models
There are many types of regression models. The most common ones are shown in the table below.

Model | Equation
---|---
Linear | $y = a + bx$
Quadratic | $y = a + bx + cx^2$
Cubic | $y = a + bx + cx^2 + dx^3$
Potential | $y = a x^b$
Exponential | $y = e^{a + bx}$
Logarithmic | $y = a + b \log x$
Inverse | $y = a + b/x$
Sigmoidal | $y = \dfrac{1}{1 + e^{a + bx}}$
The choice of model depends on the shape of the point cloud in the scatter plot.
Residuals or predictive errors
Once the type of regression model has been chosen, we have to determine which function of that family best explains the relation between the dependent and the independent variable, that is, the function that best predicts the dependent variable.
That function is the one that minimizes the distances from the observed values of $y$ in the sample to the values predicted by the function. These distances are known as residuals or predictive errors:
$$e_{ij} = y_j - f(x_i).$$
Least squares fitting
A way to get the regression function is the least squares method, which determines the function that minimizes the sum of squared residuals.
For a linear model $f(x) = a + bx$, the sum of squared residuals is
$$\sum_{i,j} e_{ij}^2\, n_{ij} = \sum_{i,j} (y_j - a - b x_i)^2\, n_{ij}.$$
This reduces the problem to determining the values of $a$ and $b$ that minimize this sum.
To solve the minimization problem, we set to zero the partial derivatives with respect to $a$ and $b$:
$$\frac{\partial}{\partial a} \sum_{i,j} (y_j - a - b x_i)^2\, n_{ij} = 0, \qquad \frac{\partial}{\partial b} \sum_{i,j} (y_j - a - b x_i)^2\, n_{ij} = 0.$$
Solving the system of equations, we get
$$b = \frac{s_{xy}}{s_x^2}, \qquad a = \bar y - \frac{s_{xy}}{s_x^2} \bar x.$$
These values minimize the residuals in $y$, and the resulting line
$$y = \bar y + \frac{s_{xy}}{s_x^2}(x - \bar x)$$
is known as the regression line of $y$ on $x$.
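The closed-form least-squares solution can be checked numerically on the listed pairs; `np.polyfit` serves only as an independent check of the same least-squares criterion:

```python
import numpy as np

heights = np.array([172, 166, 194, 185, 162, 187, 198, 177, 178, 165, 154, 183,
                    166, 171, 175, 182, 167, 169, 172, 186, 172, 176, 168, 187],
                   dtype=float)
weights = np.array([62, 60, 90, 75, 55, 78, 109, 61, 70, 58, 50, 93,
                    51, 65, 70, 60, 59, 62, 70, 71, 54, 68, 67, 80],
                   dtype=float)

# Covariance and variance of the independent variable (divisor n)
sxy = np.mean(heights * weights) - heights.mean() * weights.mean()
sx2 = np.var(heights)

# Regression line of weight on height from the formulas above
b = sxy / sx2
a = weights.mean() - b * heights.mean()

# Independent check with numpy's least-squares polynomial fit
b_np, a_np = np.polyfit(heights, weights, 1)
```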
Regression line
Example. Using the previous sample of heights ($X$) and weights ($Y$),
the regression line of weight on height is
And the regression line of height on weight is
Relative position of the regression lines
Usually, the regression line of $Y$ on $X$ and the regression line of $X$ on $Y$ are not the same, but they always intersect at the mean point $(\bar x, \bar y)$.
If there is a perfect linear relation between the variables, then both regression lines coincide, as that single line makes both the residuals in $X$ and the residuals in $Y$ zero.
If there is no linear relation between the variables, then both regression lines are constant and equal to the respective means, $y = \bar y$ and $x = \bar x$.
So, in that case they intersect perpendicularly.
Regression coefficient
The most important parameter of a regression line is its slope, known as the regression coefficient. For the regression line of $Y$ on $X$ it is
$$b_{yx} = \frac{s_{xy}}{s_x^2},$$
and it measures how much the dependent variable changes, according to the line, for each unit increase of the independent variable.
Example. In the sample of heights and weights, the regression line of weight on height was
Thus, the regression coefficient of weight on height is
That means that, according to the regression line of weight on height, the weight will increase
Regression predictions
Usually, regression models are used to predict the value of the dependent variable for given values of the independent variable.
Example. In the sample of heights and weights, to predict the weight of a person with a height of 180 cm, we have to use the regression line of weight on height,
But to predict the height of a person with a weight of 79 kg, we have to use the regression line of height on weight,
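Both predictions can be reproduced approximately from the 24 listed pairs (the exact numbers in the example come from the full 30-student sample, which is not fully listed in the text, so the values here may differ slightly):

```python
import numpy as np

heights = np.array([172, 166, 194, 185, 162, 187, 198, 177, 178, 165, 154, 183,
                    166, 171, 175, 182, 167, 169, 172, 186, 172, 176, 168, 187],
                   dtype=float)
weights = np.array([62, 60, 90, 75, 55, 78, 109, 61, 70, 58, 50, 93,
                    51, 65, 70, 60, 59, 62, 70, 71, 54, 68, 67, 80],
                   dtype=float)

def regression_line(u, v):
    """Slope and intercept of the least-squares regression line of v on u."""
    suv = np.mean(u * v) - u.mean() * v.mean()
    b = suv / np.var(u)
    return b, v.mean() - b * u.mean()

# Regression line of weight on height: predict the weight at 180 cm
b_yx, a_yx = regression_line(heights, weights)
weight_pred = a_yx + b_yx * 180

# Regression line of height on weight: predict the height at 79 kg
b_xy, a_xy = regression_line(weights, heights)
height_pred = a_xy + b_xy * 79
```

Note that each direction of prediction needs its own line: the line of weight on height minimizes residuals in weight, while the line of height on weight minimizes residuals in height.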
However, how reliable are these predictions?
Correlation
Once we have a regression model, in order to see if it is a good predictive model we have to assess the goodness of fit of the model and the strength of the relation set by it. The part of Statistics in charge of this is correlation.
Correlation studies the residuals of a regression model: the smaller the residuals, the greater the goodness of fit and the stronger the relation set by the model.
Residual variance
To measure the goodness of fit of a regression model, it is common to use the residual variance.

Definition - Sample residual variance.
$$s_e^2 = \frac{\sum_{i,j} e_{ij}^2\, n_{ij}}{n} = \frac{\sum_{i,j} (y_j - f(x_i))^2\, n_{ij}}{n}$$
The greater the residuals, the greater the residual variance and the smaller the goodness of fit.
When the linear relation is perfect, the residuals are zero and so is the residual variance. Conversely, when there is no relation, the residuals coincide with the deviations from the mean, and the residual variance is equal to the variance of the dependent variable: $s_e^2 = s_y^2$.
Explained and non-explained variation
The variation of the dependent variable can thus be decomposed into the variation explained by the regression model plus the non-explained or residual variation.
Coefficient of determination
From the residual variance it is possible to define another correlation statistic that is easier to interpret.
As the residual variance ranges from $0$ to $s_y^2$ (the variance of the dependent variable), we can define the following statistic.

Definition - Coefficient of determination.
$$r^2 = 1 - \frac{s_e^2}{s_y^2}$$

The greater $r^2$ is, the greater the goodness of fit and the stronger the relation set by the model. It ranges from 0 to 1:
- If $r^2 = 0$, there is no relation as set by the regression model.
- If $r^2 = 1$, the relation set by the model is perfect.
When the regression model is linear, the coefficient of determination can be computed with the formula
$$r^2 = \frac{s_{xy}^2}{s_x^2 s_y^2}.$$
Indeed, when the fitted model is the regression line, the residual variance is
$$s_e^2 = s_y^2 - \frac{s_{xy}^2}{s_x^2},$$
and the coefficient of determination is
$$r^2 = 1 - \frac{s_e^2}{s_y^2} = 1 - \frac{s_y^2 - s_{xy}^2/s_x^2}{s_y^2} = \frac{s_{xy}^2}{s_x^2 s_y^2}.$$
Example. In the sample of heights and weights, we had
Thus, the linear coefficient of determination is
This means that the linear model of weight on height explains 65% of the variation of weight, and the linear model of height on weight also explains 65% of the variation of height.
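These statistics can be reproduced on the 24 listed pairs (again, values may differ slightly from the 65% quoted for the full 30-student sample):

```python
import numpy as np

heights = np.array([172, 166, 194, 185, 162, 187, 198, 177, 178, 165, 154, 183,
                    166, 171, 175, 182, 167, 169, 172, 186, 172, 176, 168, 187],
                   dtype=float)
weights = np.array([62, 60, 90, 75, 55, 78, 109, 61, 70, 58, 50, 93,
                    51, 65, 70, 60, 59, 62, 70, 71, 54, 68, 67, 80],
                   dtype=float)

sxy = np.mean(heights * weights) - heights.mean() * weights.mean()
sx2 = np.var(heights)   # population variances (divisor n)
sy2 = np.var(weights)

# Linear coefficient of determination
r2 = sxy ** 2 / (sx2 * sy2)

# Residual variance of the regression line of weight on height
b = sxy / sx2
a = weights.mean() - b * heights.mean()
se2 = np.mean((weights - (a + b * heights)) ** 2)

# Correlation coefficient: square root of r2 with the sign of the covariance
r = sxy / np.sqrt(sx2 * sy2)
```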
Correlation coefficient
As the coefficient of determination $r^2$ does not tell the direction of a linear relation, for linear models we take its square root with the sign of the covariance, defining the correlation coefficient:
$$r = \frac{s_{xy}}{s_x s_y}.$$
It ranges from $-1$ to $1$.
The correlation coefficient measures not only the strength of the linear association but also its direction (increasing or decreasing):
- If $r = 0$, then there is no linear relation.
- If $r = 1$, then there is a perfect increasing linear relation.
- If $r = -1$, then there is a perfect decreasing linear relation.
Example. In the sample of heights and weights, we had
Thus, the correlation coefficient is
This means that there is a rather strong increasing linear relation between height and weight.
Different linear correlations
The scatter plots below show linear regression models with different correlations.
Reliability of regression predictions
The coefficient of determination explains the goodness of fit of a regression model, but there are other factors that influence the reliability of regression predictions:
- The coefficient of determination: the greater $r^2$, the greater the goodness of fit and the more reliable the predictions.
- The variability of the population distribution: the greater the variation, the more difficult it is to predict and the less reliable the predictions.
- The sample size: the greater the sample size, the more information we have and the more reliable the predictions.
Non-linear regression
The fit of a non-linear regression model can also be done by the least squares method.
However, in some cases the fitting of a non-linear model can be reduced to the fitting of a linear model by applying a simple transformation to the variables of the model.
Transformations of non-linear regression models
- Logarithmic: a logarithmic model $y = a + b \log x$ can be transformed into a linear model with the change $x' = \log x$:
$$y = a + b x'.$$
- Exponential: an exponential model $y = e^{a + bx}$ can be transformed into a linear model with the change $y' = \log y$:
$$y' = \log y = a + bx.$$
- Potential: a potential model $y = a x^b$ can be transformed into a linear model with the changes $x' = \log x$ and $y' = \log y$:
$$y' = \log a + b x'.$$
- Inverse: an inverse model $y = a + b/x$ can be transformed into a linear model with the change $x' = 1/x$:
$$y = a + b x'.$$
- Sigmoidal: a sigmoidal model $y = \dfrac{1}{1 + e^{a + bx}}$ can be transformed into a linear model with the changes $y' = \dfrac{1}{y} - 1$ and $y'' = \log y'$:
$$y'' = a + bx.$$
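As a check of the exponential transformation, fitting a regression line to $(x, \log y)$ on noise-free synthetic data generated from $y = e^{0.3 + 0.5x}$ (arbitrary illustrative coefficients) recovers $a$ and $b$:

```python
import numpy as np

# Noise-free synthetic data from an exponential model y = e^(a + bx)
a_true, b_true = 0.3, 0.5   # arbitrary illustrative coefficients
x = np.arange(10, dtype=float)
y = np.exp(a_true + b_true * x)

# The change of variable y' = log y turns the model into the line y' = a + bx
y_log = np.log(y)
b_hat, a_hat = np.polyfit(x, y_log, 1)

# Undoing the change, the fitted model is y = e^(a_hat + b_hat * x)
```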
Exponential relation
Example. The number of bacteria in a culture evolves with time according to the table below.
The scatter plot of the sample is shown below.
Fitting a linear model we get
Is it a good model?
Although the linear model is not bad, according to the shape of the point cloud of the scatter plot, an exponential model looks more suitable.
To construct an exponential model $y = e^{a + bx}$, we first apply the change of variable $y' = \log y$, which reduces the problem to fitting a regression line of $y'$ on $x$.
Now it only remains to compute the regression line of the logarithm of bacteria on hours,
and, undoing the change of variable,
Thus, the exponential model fits much better than the linear model.
Regression risks
Lack of fit does not mean independence
It is important to note that each type of regression model has its own coefficient of determination. A coefficient of determination close to zero only means that there is no relation between the variables of the type set by the model; they could still be related by a relation of a different type.
Outliers influence in regression
Outliers in regression studies are points that clearly do not follow the tendency of the rest of the points, even if their coordinates are not outliers for each variable separately.
Outliers in regression studies can provoke drastic changes in the regression models.
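A minimal synthetic illustration of this effect: ten points lying exactly on the line $y = 1 + 2x$ give slope 2, but a single added outlier drags the fitted slope far from the true trend.

```python
import numpy as np

# Ten points lying exactly on the line y = 1 + 2x
x = np.arange(10, dtype=float)
y = 1 + 2 * x
slope_clean = np.polyfit(x, y, 1)[0]

# One outlier that clearly does not follow the tendency of the rest
x_out = np.append(x, 20.0)
y_out = np.append(y, 0.0)
slope_out = np.polyfit(x_out, y_out, 1)[0]
```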
The Simpson’s paradox
Sometimes a trend can disappear or even reverse when we split the sample into groups according to a qualitative variable related to the dependent variable. This is known as Simpson's paradox.
Example. The scatter plot below shows an inverse relation between study hours and the score in an exam.
But if we split the sample into two groups (good and bad students), we get different trends, and now the relation is direct within each group, which makes more sense.
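The paradox can be reproduced with a tiny synthetic sample (hypothetical hours and scores): within each group the slope is positive, but pooling both groups reverses its sign.

```python
import numpy as np

# Hypothetical study hours and exam scores for two groups of students
hours_good = np.array([1.0, 2.0, 3.0])   # good students: few hours, high scores
score_good = np.array([7.0, 8.0, 9.0])
hours_bad = np.array([5.0, 6.0, 7.0])    # bad students: many hours, low scores
score_bad = np.array([3.0, 4.0, 5.0])

# Within each group the relation is direct (positive slope)
slope_good = np.polyfit(hours_good, score_good, 1)[0]
slope_bad = np.polyfit(hours_bad, score_bad, 1)[0]

# Pooling both groups reverses the trend (negative slope)
hours_all = np.concatenate([hours_good, hours_bad])
score_all = np.concatenate([score_good, score_bad])
slope_all = np.polyfit(hours_all, score_all, 1)[0]
```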