Supervised learning is a branch of machine learning in which algorithms learn by example. The dataset contains inputs paired with their expected outputs, and the algorithm learns from these pairs after the data is divided into training and testing sets.
When a supervised learning algorithm is applied to a dataset, the model can perform very well on the training data yet show high error when tested on new data. There can be several reasons for this, such as the bias-variance trade-off, over-modeling the training data, multicollinearity, and more.
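To make this concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (all names are illustrative), of the train/test split and the score gap that signals the problem:

```python
# A minimal sketch, assuming scikit-learn; the dataset is synthetic.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Hold out a test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A fully grown tree can memorize the training data.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

print("train R^2:", model.score(X_train, y_train))  # near 1.0
print("test  R^2:", model.score(X_test, y_test))    # noticeably lower
```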
Bias and variance are the two key measures that describe how a model's predictions deviate. Bias estimates how far the model's predictions are from the real value of the function. Variance estimates how much those predictions change when the model is trained on a different sample of the training dataset.
This means that when modeling the data we need to keep the bias as small as possible to ensure greater accuracy.
Likewise, changing the training sample should not drastically change the result: low variance is preferred for a better-performing model.
Now comes the catch: when we try to reduce the bias, the model fits one particular training sample very closely, but it cannot find the hidden patterns in the rest of the dataset, which it has not seen. So it is very likely that when a different sample is used to train the model, its output will deviate noticeably. The outcome is high variance.
In the same way, when we force the variance to be low so that predictions stay stable across distinct samples, the model cannot fit the data points closely, which leads to high bias.
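One way to see the variance side of this trade-off is to train the same flexible model on two independent samples of the data and compare its predictions at fixed query points. A hedged sketch, assuming scikit-learn and a synthetic sine-shaped target:

```python
# A sketch of high variance, assuming scikit-learn; the data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample_data(n=30):
    # Noisy observations of the same underlying function, sin(x).
    x = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(x).ravel() + rng.normal(scale=0.3, size=n)
    return x, y

x_query = np.linspace(-3, 3, 5).reshape(-1, 1)

# Fit the same flexible (low-bias) model on two independent samples.
for i in range(2):
    x, y = sample_data()
    flexible = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
    flexible.fit(x, y)
    print(f"sample {i} predictions:", np.round(flexible.predict(x_query), 2))
# The two prediction vectors differ noticeably: that spread is the variance.
```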
The combination of high variance and low bias is called overfitting. The model fits the available data with high accuracy, but when new data is fed in, its predictions break down and the test error grows. This mostly happens when the data has many variables: the model assumes every measured coefficient contributes and ends up over-estimating the real value.
Often only a small set of features is actually important for the predictions. When the variables that contribute little are high in number, they still add terms to the function fitted on the training data. When new data arrives that is not related to those extra variables, the predictions go wrong.
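A sketch of this effect, assuming scikit-learn: a plain linear regression is fitted on synthetic data where only 5 of 100 features are informative, and it memorizes the training set while losing accuracy on the test set.

```python
# A sketch, assuming scikit-learn: 100 features, but only 5 carry signal.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=120, n_features=100, n_informative=5,
                       noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# With more features than training samples, least squares fits the
# noise in the uninformative variables exactly.
model = LinearRegression().fit(X_train, y_train)
print("train R^2:", round(model.score(X_train, y_train), 3))  # 1.0
print("test  R^2:", round(model.score(X_test, y_test), 3))    # lower
```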
There are also cases where the model learns almost nothing from the training data and therefore generalizes nothing to the test data. This is called underfitting. It is not a major issue because the performance metrics expose it immediately: if the model performs poorly even on the training data, one can simply try other models to get better results. That is why underfitting is not as hot a topic as overfitting.
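Underfitting is easy to spot precisely because the model scores poorly even on its own training data. A small sketch with a synthetic, clearly non-linear target (scikit-learn assumed):

```python
# A sketch of underfitting, assuming scikit-learn; the data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(200, 1))
y = x.ravel() ** 2  # a clearly non-linear target

# A straight line cannot capture the quadratic shape,
# so the score is poor even on the training data itself.
underfit = LinearRegression().fit(x, y)
print("train R^2:", round(underfit.score(x, y), 3))  # close to 0
```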
The middle ground between overfitting and underfitting is a good fit. For real-world problems it is very difficult to find a perfectly fitting model; you will rarely get a good fit in one go and usually need to retrain the model several times to fit the data appropriately.
Techniques to avoid overfitting:
Cross-validation: The idea of cross-validation is that the initial training data can be used to generate several train-test splits. These splits can then be used to train the model and tune it more reliably.
In k-fold cross-validation, the data is partitioned into k subsets known as folds. The algorithm is then trained on k-1 of the folds and tested on the remaining fold, also called the holdout fold; this is repeated until each fold has served as the holdout once.
Cross-validation tunes the model's parameters using only the training set, ensuring that the final test data remains totally unseen by the model.
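A hedged sketch of 5-fold cross-validation using scikit-learn's cross_val_score on a synthetic dataset:

```python
# A sketch of 5-fold cross-validation, assuming scikit-learn; synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Each of the 5 folds serves once as the holdout fold;
# the model is retrained on the other 4 folds every time.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", scores.mean().round(3))
```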
Ensembling: This technique combines the predictions of several different models.
The methods for ensembling:
Bagging: This method decreases the probability of overfitting complex models. It trains a large number of strong base learners in parallel, each on a bootstrap sample of the data; these base learners are not constrained. Bagging then combines the learners to smooth out their predictions.
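A minimal bagging sketch, assuming scikit-learn 1.2 or later (where the base-learner argument is named estimator) and a synthetic dataset:

```python
# A bagging sketch, assuming scikit-learn >= 1.2; the dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Many unconstrained trees are trained in parallel on bootstrap samples;
# their votes are combined to smooth the final prediction.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # the unconstrained base learner
    n_estimators=100,
    random_state=0,
)
bagging.fit(X_train, y_train)
print("test accuracy:", round(bagging.score(X_test, y_test), 3))
```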
Boosting: This method trains the learners in sequence. The individual learners are weak, generally called constrained, and each one emphasizes learning from the previous learners' mistakes. Boosting then combines all the weak learners into a single strong learner.
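A matching boosting sketch using AdaBoost, under the same assumptions (scikit-learn 1.2+, synthetic data):

```python
# A boosting sketch with AdaBoost, assuming scikit-learn >= 1.2; synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak, depth-1 trees are trained in sequence; each round re-weights the
# samples the previous learners got wrong, and the final model is the
# weighted combination of all the weak learners.
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a constrained, weak learner
    n_estimators=200,
    random_state=0,
)
boosting.fit(X_train, y_train)
print("test accuracy:", round(boosting.score(X_test, y_test), 3))
```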
Regularization: This covers a set of techniques for making your model simpler.
Examples: pruning a decision tree, adding a penalty term to the cost function of a regression, etc.
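As a sketch of the regression example, assuming scikit-learn: ridge regression adds an L2 penalty on the coefficients to the least-squares cost, which simplifies the model and usually recovers test accuracy on data with many uninformative features.

```python
# A regularization sketch, assuming scikit-learn; reuses the
# synthetic many-features setup from the overfitting example above.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=120, n_features=100, n_informative=5,
                       noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

plain = LinearRegression().fit(X_train, y_train)
# Ridge adds an L2 penalty on the coefficients to the least-squares cost,
# shrinking the weights of uninformative variables toward zero.
ridge = Ridge(alpha=10.0).fit(X_train, y_train)

print("plain test R^2:", round(plain.score(X_test, y_test), 3))
print("ridge test R^2:", round(ridge.score(X_test, y_test), 3))
```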