7.3 Key Tips Before Execution


While many of the previous lessons covered theory, this course is ultimately about implementing these algorithms in Python and getting them to work well. To that end, here are a few key tips for making multiple linear regression work well in Python.

Overfitting

If your model is overfitting, your choice of features may be the reason. Feeding in many features without a sound theoretical justification can lead the algorithm to attach weight to what may be just random noise, which makes its predictions on new data inaccurate. As a common-sense check, ask yourself: does each feature you're inputting plausibly affect the output you want to predict? You may not know the exact relationship, or even whether it is proportional, but as long as there is a relationship between the two, you should be all right. For instance, if you are trying to predict how much a house costs, the color of its walls likely doesn't have any impact, so it would be better to omit that feature.
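To make the effect of irrelevant features concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the data is randomly generated, the helper names (`make_data`, `fit`, `r2`) are hypothetical, and it uses NumPy's least-squares solver directly rather than sklearn so that the arithmetic stays visible. It fits one model on a single meaningful feature and another on that feature plus 20 columns of pure noise, then compares R² on the training data and on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test = 30, 200
n_noise = 20  # hypothetical number of irrelevant features


def make_data(n):
    # One meaningful feature (think: square footage) plus pure-noise columns.
    useful = rng.uniform(0, 10, size=(n, 1))
    noise = rng.normal(size=(n, n_noise))
    y = 3.0 * useful[:, 0] + rng.normal(scale=2.0, size=n)
    return useful, np.hstack([useful, noise]), y


def fit(X, y):
    # Ordinary least squares with an intercept column prepended.
    A = np.hstack([np.ones((len(X), 1)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def r2(coef, X, y):
    # Coefficient of determination of the fitted model on (X, y).
    A = np.hstack([np.ones((len(X), 1)), X])
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


X_small_train, X_big_train, y_train = make_data(n_train)
X_small_test, X_big_test, y_test = make_data(n_test)

coef_small = fit(X_small_train, y_train)
coef_big = fit(X_big_train, y_train)

train_r2_small = r2(coef_small, X_small_train, y_train)
train_r2_big = r2(coef_big, X_big_train, y_train)
test_r2_small = r2(coef_small, X_small_test, y_test)
test_r2_big = r2(coef_big, X_big_test, y_test)

print(f"1 feature   : train R^2 = {train_r2_small:.3f}, test R^2 = {test_r2_small:.3f}")
print(f"21 features : train R^2 = {train_r2_big:.3f}, test R^2 = {test_r2_big:.3f}")
```

The model with extra columns can never score worse on the training data, because least squares is free to use the noise to explain the training targets. The comparison worth watching is the test score: if it drops for the larger model while the training score rises, that gap is the overfitting described above.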

Categorical Independent Variables

In machine learning, the phrase "independent variables" refers to the features in the learning problem. Categorical features are not an issue for linear regression as a concept, but they do matter when you use sklearn, so you should know how to handle them. A categorical independent variable is one that can take only a finite number of values or categories. It is essentially a discrete variable. For example, boolean variables are categorical because they can only be true or false, with nothing in between.

However, because linear regression is numerical by nature, sklearn's package doesn't handle categorical independent variables very well. Instead, these discrete values need to be replaced with corresponding numerical values so that multiple linear regression can be done. For a boolean variable, this would mean replacing true with 1 and false with 0.
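As a rough illustration of that replacement, the sketch below encodes a few made-up house records in plain Python. The field names (`sqft`, `has_garage`, `condition`) and the `encode` helper are hypothetical. The boolean is mapped to 1 or 0 as described above. For the three-valued `condition` field, the sketch assumes one common approach that this lesson does not cover in detail, often called one-hot encoding, in which each category gets its own 0/1 column.

```python
# Hypothetical records: one numeric feature, one boolean, one multi-category.
houses = [
    {"sqft": 1200, "has_garage": True, "condition": "good"},
    {"sqft": 850, "has_garage": False, "condition": "fair"},
    {"sqft": 2000, "has_garage": True, "condition": "excellent"},
]

# Fix the category order so every row gets its columns in the same positions.
categories = sorted({row["condition"] for row in houses})


def encode(row, categories):
    # Numeric feature passes through unchanged.
    encoded = [row["sqft"]]
    # Boolean feature: True -> 1, False -> 0.
    encoded.append(1 if row["has_garage"] else 0)
    # Multi-category feature: one 0/1 indicator column per category.
    encoded.extend(1 if row["condition"] == c else 0 for c in categories)
    return encoded


X = [encode(row, categories) for row in houses]

print(categories)
for row in X:
    print(row)
```

Every row is now a list of numbers of the same length, which is the shape a regression routine expects. In practice, one of the indicator columns is often dropped, since its value is implied by the others.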

Otherwise, running multiple linear regression in Python is almost the same as running simple linear regression, except that you'll be inputting more than one feature. We'll go over the process in a subsequent notebook.
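As a preview, here is a minimal sketch of what "more than one feature" looks like with sklearn's `LinearRegression`. The data is invented for illustration: the two columns stand for square footage and number of bedrooms, and the prices are generated from an exact, made-up linear rule so that the fitted values are easy to check by eye. A real dataset would not fit this cleanly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: each row is [square footage, number of bedrooms].
X = np.array(
    [
        [1000, 2],
        [1500, 3],
        [1800, 3],
        [2400, 4],
        [3000, 5],
    ],
    dtype=float,
)

# Made-up rule: price = 50,000 + 100 per square foot + 10,000 per bedroom.
y = 50_000 + 100 * X[:, 0] + 10_000 * X[:, 1]

# Same two steps as simple linear regression; X just has two columns now.
model = LinearRegression()
model.fit(X, y)

# One coefficient per feature, plus a single intercept.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# Predict the price of a 2,000 sq ft house with 3 bedrooms.
prediction = model.predict(np.array([[2000.0, 3.0]]))
print("prediction:", prediction[0])
```

The only change from the single-feature case is the shape of `X`: it has one column per feature, and the fitted model reports one coefficient for each column.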

Copyright © 2021 Code 4 Tomorrow. All rights reserved. The code in this course is licensed under the MIT License. If you would like to use content from any of our courses, you must obtain our explicit written permission and provide credit. Please contact classes@code4tomorrow.org for inquiries.