Introduction to Python: The Basics via Descriptive Statistics and Libraries, Part III
In the last two introductory posts, we demonstrated Python through NumPy, SciPy, and Pandas, using their descriptive statistics functions to present Python code in Jupyter notebooks. These notebooks are available on GitHub for readers to download and modify for their own purposes. This final introductory post presents scikit-learn, a library used for machine learning. The library is large, so we focus on linear regression by creating a new Jupyter notebook with a two-dimensional data set. We first define regression, then look at scikit-learn, and finally walk through our sample program.
Linear regression models the relationship between two variables as a straight line. It is based on this formula: y = mx + b, where m is the slope (the coefficient) and b is the intercept. The coefficient, m, can be positive or negative: y rises with x when m is positive and falls when m is negative. In this post, we only look at the regression model and its plot; a later post goes into more detail. For now, understand that the relationship is linear.
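As a small illustration of the formula, the sketch below evaluates y = mx + b with NumPy for one positive and one negative slope. The values of x, m, and b are made up for the example and do not come from the notebook.

```python
import numpy as np

# Hypothetical inputs, chosen only to illustrate y = m*x + b
x = np.array([0.0, 1.0, 2.0, 3.0])
b = 1.0  # intercept: the value of y when x is 0

# Positive slope: y rises as x increases
y_rising = 2.0 * x + b

# Negative slope: y falls as x increases
y_falling = -2.0 * x + b

print(y_rising)
print(y_falling)
```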
To work with scikit-learn and derive the linear function for our data, we must perform these steps:
Create a data set using NumPy or load one of scikit-learn's built-in data sets. scikit-learn expects the x values as a 2D array, so if you build the data yourself, you must import NumPy and call the reshape() method on your x variable.
Build a model using the LinearRegression class and its fit() method. LinearRegression() creates a least-squares linear model, and fit() estimates its coefficients from the data.
Next, we find our slope and intercept to write out our linear model. We read the coef_ and intercept_ attributes of the fitted model to get these values.
Then we make predictions on y using the predict() method, which returns the predicted values as a NumPy array.
Next, we can calculate the coefficient of determination, R^2, to check how well the model fits the data.
Finally, we generate a scatterplot of our values.
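The steps above can be sketched end to end with one of scikit-learn's built-in data sets instead of hand-made arrays. This is a minimal sketch rather than the notebook's code: the choice of load_diabetes and of column 2 (body mass index) as the single predictor is an assumption made for illustration, and the plotting step is left out to keep the example short.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Step 1: load a built-in data set and keep a single feature.
# Indexing with [2] (a list) keeps X two-dimensional: many rows, one column.
X_all, y = load_diabetes(return_X_y=True)
X = X_all[:, [2]]  # assumption: body mass index as the only predictor

# Step 2: build the least-squares model and fit it to the data
model = LinearRegression().fit(X, y)

# Step 3: read the estimated slope and intercept
slope = model.coef_[0]
intercept = model.intercept_

# Step 4: predict y for the same inputs; the result is a NumPy array
y_pred = model.predict(X)

# Step 5: coefficient of determination, R^2
r_squared = model.score(X, y)

print(f"y = {intercept:.2f} + {slope:.2f}x, R^2 = {r_squared:.3f}")
```

Because the feature was selected with a list index, no reshape() call is needed here; the array is already two-dimensional.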
Now we will look at our sample program and explain each step. Always make sure you import the following so that the model can be built:
NumPy
scikit-learn (from sklearn.linear_model import LinearRegression)
matplotlib and %matplotlib inline. As in our last post, %matplotlib inline makes our plots appear directly within the Jupyter notebook.
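A first notebook cell along these lines would cover the imports listed above. Printing the version numbers is an addition for illustration, not part of the original notebook; it is simply a quick way to confirm that each library loads.

```python
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression

# In a Jupyter notebook, also run the magic below in a cell.
# It is left as a comment because it is not valid outside Jupyter/IPython:
# %matplotlib inline

# Optional sanity check that the libraries are installed
print("NumPy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
print("matplotlib:", matplotlib.__version__)
```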
Once you have these libraries imported into your program, you are ready to build your model. First, I defined two arrays, x and y. scikit-learn expects x to be two-dimensional, so I called reshape() on it to arrange the values into a 2D array. Because Python's built-in containers are lists, tuples, and dictionaries rather than numeric arrays, we call the np.array() function to create our data arrays.
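The array setup might look like the sketch below. The numbers are hypothetical: they were chosen so that a least-squares fit gives the slope and intercept reported in this post, but they are not the notebook's actual data.

```python
import numpy as np

# Hypothetical data, not the values from the original notebook
# reshape((-1, 1)) turns the flat array into 5 rows and 1 column;
# -1 tells NumPy to work out the number of rows itself
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))

# y stays one-dimensional: one target value per row of x
y = np.array([4.3, 5.2, 7.3, 8.6, 10.1])

print(x.shape)  # (5, 1)
print(y.shape)  # (5,)
```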
Next, we create a LinearRegression() model and call its fit() method to determine our linear model. When we read model.coef_ and model.intercept_, we get 1.5 and 2.6, respectively. Therefore, our linear model is y = 2.6 + 1.5x, where the slope, 1.5, is positive.
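A minimal sketch of the fitting step follows. It uses hypothetical data picked so that the fit reproduces the slope of 1.5 and intercept of 2.6 quoted above; the notebook's own arrays are not shown in this post, so treat these values as a stand-in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data, chosen to give a slope of 1.5 and an intercept of 2.6
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([4.3, 5.2, 7.3, 8.6, 10.1])

# Create the least-squares model and estimate its coefficients
model = LinearRegression()
model.fit(x, y)

# coef_ holds one slope per feature; we have a single feature
slope = model.coef_[0]
intercept = model.intercept_
print(f"y = {intercept:.1f} + {slope:.1f}x")

# predict() returns the fitted y value for each row of x as a NumPy array
y_pred = model.predict(x)
print(y_pred)
```

Note that coef_ is an array even with a single predictor, which is why the slope is read with model.coef_[0].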
R^2 was found to be approximately 0.987, which is close to 1.0, so the line explains almost all of the variation in y. The scatter plot tells the same story: the points lie almost exactly on the fitted line.
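The score and the plot can be sketched as follows, again with hypothetical data rather than the notebook's. With these stand-in values R^2 comes out near 0.989, close to but not the same as the 0.987 reported above. The axis labels, colors, and legend text are also illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data, not the values from the original notebook
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([4.3, 5.2, 7.3, 8.6, 10.1])

model = LinearRegression().fit(x, y)

# score() returns the coefficient of determination, R^2
r_squared = model.score(x, y)
print(f"R^2 = {r_squared:.3f}")

# Scatter the observed points and draw the fitted line over them
fig, ax = plt.subplots()
ax.scatter(x.ravel(), y, label="observed values")
ax.plot(x.ravel(), model.predict(x), color="red", label="fitted line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# With %matplotlib inline, Jupyter renders the figure when the cell finishes.
# In a plain script, call plt.show() here instead.
```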
In our next post, we shift gears and look at the Python language itself. The post will deal with input and output, look at variables, and define some of the keywords found in the language. We will also look at the Python interpreter and create a sample program there to ensure our libraries are installed properly.