Each node has what’s called an Entropy. This measures how mixed the classes are among the data that reaches that particular node. For example, a node has zero Entropy if everything reaching it belongs to a single class, so the output class is 100% clear without any further testing. However, if data yielding a particular value for the test (think “Rectangular Nose” for a Nose Shape Test) is still equally likely to fit into either class (could be a cat or a dog), then the Entropy for that node is 1, which is the maximum when there are two classes. As you can see, Entropy acts like our cost function for Decision Trees. By minimizing the Entropy, we gain node purity, which usually means better prediction accuracy. Great!
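To make the two extremes concrete, here is a minimal sketch that computes the Entropy of a node from the class labels of the data reaching it. It assumes the usual Shannon definition with a base-2 logarithm (the sum over classes of p * log2(1/p), where p is the fraction of the data in that class). The function name `entropy` and the cat/dog labels are illustrative choices, not something defined earlier in this article.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels reaching a node."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: treat an empty node as perfectly pure
    counts = Counter(labels)
    # Sum p * log2(1/p) over the classes that actually occur in the node.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


pure_node = ["cat", "cat", "cat", "cat"]    # only one class reaches this node
mixed_node = ["cat", "dog", "cat", "dog"]   # a 50/50 mix of two classes

print(entropy(pure_node))   # 0.0
print(entropy(mixed_node))  # 1.0
```

Anything in between, such as three cats and one dog, gives a value between 0 and 1.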

There’s just one more important concept to define.

Information Gain

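Information Gain refers to the reduction in Entropy achieved by splitting the data on a particular test: the Entropy of the parent node minus the weighted average Entropy of the child nodes the split produces. By choosing, at each node, the test with the highest Information Gain, we can end up with a neat decision-making process that produces quick and accurate predictions.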
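As a rough sketch of how this could be computed, the snippet below takes the Entropy of the data before a split and subtracts the size-weighted average Entropy of the groups after it. It assumes base-2 Shannon entropy; the names `entropy` and `information_gain` and the tiny cat/dog dataset are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: an empty group contributes no entropy
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["cat", "cat", "dog", "dog"]

# A test that separates the classes perfectly removes all of the entropy.
good_split = [["cat", "cat"], ["dog", "dog"]]
# A test that leaves every group as mixed as before gains nothing.
bad_split = [["cat", "dog"], ["cat", "dog"]]

print(information_gain(parent, good_split))  # 1.0
print(information_gain(parent, bad_split))   # 0.0
```

Comparing candidate tests by this number is one way to decide which test to place at a node: the higher the gain, the purer the resulting groups.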

Now, we’re ready to discuss the algorithm, so let’s jump into it!