Decision tree learning algorithm in Python

Ade Adeagbo
3 min read · Jun 26, 2020

Decision trees are one of the most popular prediction tools in machine learning. Also called CART (classification and regression trees), they are a great resource because their top-down approach makes it possible to visually explain the decision-making process and the outcome at each level.

The algorithm makes sequential, top-down decisions to solve a problem. A classification tree is used to predict the class to which a discrete outcome belongs (e.g., Yes/No likelihood of loan default), while a regression tree is used when the predicted outcome is a real number (e.g., using square footage to predict the price of a house).

When implementing decision trees, the algorithm chooses at each step the variable that best splits the data into two subsets, taking into account the homogeneity of the target variable within each subset. A metric such as Gini impurity or information gain is used to determine the best split; during recursive partitioning it is evaluated for every candidate split, and the impurities of the resulting subsets are combined into a weighted average that measures the quality of the split.
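To make the idea of split quality concrete, here is a minimal sketch, not from the original post, of Gini impurity and the weighted impurity of a candidate binary split (the function names are my own):

import numpy as np

def gini_impurity(y):
    # probability of each class within the subset
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(y_left, y_right):
    # weighted average of the child impurities; lower means a better split
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini_impurity(y_left) + (len(y_right) / n) * gini_impurity(y_right)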

The architecture of a decision tree model consists of a root node, from which the first split is launched, internal nodes, and terminal (leaf) nodes. Root and internal nodes each represent a test on an input variable (X), while a terminal node holds the value of the output variable (y) used to make a prediction.

This post compares results from a decision tree implementation written from scratch in Python with results from the Scikit-Learn implementation. You can learn more about the statistical intuition behind decision trees on this page.

Predicting financial distress

The data set used in this implementation deals with financial distress prediction for a sample of companies.

The target variable is denoted “Financial Distress”: if its value is greater than -0.50, the company is considered healthy (0); otherwise, it is regarded as financially distressed (1).

The data set is imbalanced and skewed: 136 companies are financially distressed against 286 healthy ones, i.e., 136 firm-year observations are financially distressed while 3,546 firm-year observations are healthy. The F1 score should therefore be employed as the performance evaluation criterion.

import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import classification_report, confusion_matrix
import graphviz

df = pd.read_csv('Financial Distress.csv')

# convert target variable to binary values 1 and 0
def classifiers(x):
    if x > -0.50:
        return 0   # healthy
    else:
        return 1   # financially distressed

df['Financial_Distress'] = df['Financial Distress'].apply(classifiers)

# drop original target variable
df.drop('Financial Distress', axis=1, inplace=True)

# create feature names for viz
feature_names = df.drop('Financial_Distress', axis=1).columns

# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Financial_Distress', axis=1), df['Financial_Distress'],
    test_size=0.30, random_state=101)

# fit the scikit-learn decision tree classifier
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# inspect the fitted classifier's parameters
pprint(clf)

# evaluate on the held-out test set
prediction = clf.predict(X_test)
print(classification_report(y_test, prediction))

# visualize decision tree
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=feature_names,
                                class_names=['Healthy', 'Distressed'],  # class 0, class 1
                                filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph.render()

Result from Scikit-Learn decision tree algorithm

Result from scikit-learn decision tree
  • We have imbalanced classes, but both classes are important, so we choose the classifier that achieves high F1 scores on both classes as well as a low misclassification rate. Our weighted-average F1 score of 0.94 indicates good model performance (the snippet below shows one way to compute it directly).
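The weighted-average F1 score reported in the classification report can also be computed directly. A small sketch, not part of the original post, using the predictions from above:

from sklearn.metrics import f1_score

# weighted-average F1 across both classes, matching the classification report
print(f1_score(y_test, prediction, average='weighted'))

# confusion matrix to inspect where the misclassifications occur
print(confusion_matrix(y_test, prediction))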
Decision tree for classification task

Decision tree in Python from scratch

This implementation uses entropy as the measure of the “best split”. First, the entropy calculation.
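The embedded gist does not appear here, so the following is a minimal sketch of what the entropy calculation could look like, assuming the class labels are supplied as a NumPy array (the function name is my own):

import numpy as np

def entropy(y):
    # entropy (in bits) of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))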

Decision Tree class
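The class itself is also missing from the extracted post. Below is a hedged sketch of a recursive decision tree classifier that splits on information gain, reusing the entropy helper above; the class and method names, the dictionary-based node representation, and the default hyperparameters are my own choices, not necessarily the author's.

import numpy as np

class DecisionTree:
    """Recursive binary classification tree that splits on information gain."""

    def __init__(self, max_depth=5, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree_ = None

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.tree_ = self._build(X, y, depth=0)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_row(row, self.tree_) for row in X])

    def _build(self, X, y, depth):
        # stop if the node is pure, too small, or the tree is deep enough
        if depth >= self.max_depth or len(y) < self.min_samples_split or len(np.unique(y)) == 1:
            return self._leaf(y)
        feature, threshold, gain = self._best_split(X, y)
        if feature is None or gain <= 0:
            return self._leaf(y)
        left = X[:, feature] <= threshold
        return {
            'feature': feature,
            'threshold': threshold,
            'left': self._build(X[left], y[left], depth + 1),
            'right': self._build(X[~left], y[~left], depth + 1),
        }

    def _best_split(self, X, y):
        best = (None, None, 0.0)
        parent_entropy = entropy(y)  # entropy helper defined above
        for feature in range(X.shape[1]):
            for threshold in np.unique(X[:, feature]):
                left = X[:, feature] <= threshold
                if left.all() or (~left).all():
                    continue
                # information gain = parent entropy minus weighted child entropy
                w = left.mean()
                gain = parent_entropy - (w * entropy(y[left]) + (1 - w) * entropy(y[~left]))
                if gain > best[2]:
                    best = (feature, threshold, gain)
        return best

    def _leaf(self, y):
        # predict the majority class at this node
        values, counts = np.unique(y, return_counts=True)
        return {'leaf': values[np.argmax(counts)]}

    def _predict_row(self, row, node):
        while 'leaf' not in node:
            node = node['left'] if row[node['feature']] <= node['threshold'] else node['right']
        return node['leaf']

The exhaustive search over every feature and every observed threshold keeps the code short, at the cost of speed on a wide data set like this one.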

Classification analysis using Financial Distress data
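As a sketch of how such a class might be applied to the same train/test split used for the Scikit-Learn model (max_depth=5 is an arbitrary choice, not taken from the original post):

# fit the from-scratch tree on the same split used for the scikit-learn model
scratch_clf = DecisionTree(max_depth=5)
scratch_clf.fit(X_train.values, y_train.values)
scratch_pred = scratch_clf.predict(X_test.values)

# compare against the scikit-learn classification report above
print(classification_report(y_test, scratch_pred))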

This story is being updated
