CART (Classification and Regression Trees): Mastering Interpretable Tree-Based Learning

CART (Classification and Regression Trees) stands as one of the most interpretable and versatile algorithms in machine learning. Developed by Leo Breiman and colleagues in 1984, CART revolutionized predictive modeling by creating tree-based models that humans can easily understand and interpret. This algorithm forms the backbone of many modern ensemble methods and remains a go-to choice when interpretability is crucial.

What is CART?

CART is a decision tree algorithm that builds binary trees for both classification and regression problems. Unlike other tree algorithms that can create multi-way splits, CART always creates binary splits, making the resulting trees easier to interpret and implement. The algorithm automatically handles both numerical and categorical features and includes built-in methods for dealing with missing values.

Core Principles

1. Binary Splits

CART always creates binary (yes/no) splits at each node, even for categorical variables with multiple values.

2. Recursive Partitioning

The algorithm recursively splits the data into increasingly homogeneous subsets.

3. Greedy Optimization

At each step, CART chooses the split that provides the best immediate improvement in the chosen criterion.

How CART Works

  1. Start with root node: Contains all training data
  2. Find best split: Test all possible binary splits on all features
  3. Choose optimal split: Select split that best improves purity measure
  4. Create child nodes: Split data based on chosen criterion
  5. Repeat recursively: Apply process to each child node
  6. Stop when criteria met: Minimum samples, maximum depth, or purity achieved
  7. Prune tree: Remove branches that don’t improve validation performance
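
To make steps 2–4 concrete, here is a minimal sketch of a single greedy split decision on a hypothetical one-dimensional dataset (the values are made up purely for illustration):

# One greedy split on a toy 1-D dataset (illustrative values only)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])  # single feature
y = np.array([0, 0, 0, 1, 1, 1])               # class labels

def gini(labels):
    """Gini impurity of a set of class labels."""
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

best_threshold, best_score = None, np.inf
for t in np.unique(x)[:-1]:                    # candidate thresholds (step 2)
    left, right = y[x <= t], y[x > t]
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    if score < best_score:                     # keep the lowest weighted impurity (step 3)
        best_threshold, best_score = t, score
print(f"Best split: x <= {best_threshold} (weighted Gini {best_score:.3f})")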

Splitting Criteria

For Classification (Impurity Measures)

Gini Impurity (CART Default)

Gini = 1 - Σ(i=1 to c) p_i²

Entropy (Information Gain)

Entropy = -Σ(i=1 to c) p_i * log₂(p_i)

For Regression

Mean Squared Error (MSE)

MSE = (1/n) * Σ(i=1 to n) (y_i - ȳ)²

Mean Absolute Error (MAE)

MAE = (1/n) * Σ(i=1 to n) |y_i - median(y)|
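
These measures are straightforward to evaluate directly; the snippet below plugs toy numbers into each formula (the class proportions and target values are made up for illustration):

# Evaluating the splitting criteria on toy numbers
import numpy as np

p = np.array([0.7, 0.3])                        # class proportions at a node
gini = 1 - np.sum(p ** 2)                       # Gini impurity
entropy = -np.sum(p * np.log2(p))               # entropy
print(f"Gini: {gini:.3f}, Entropy: {entropy:.3f}")

y_node = np.array([2.0, 3.5, 4.0, 10.0])        # targets falling into a regression node
mse = np.mean((y_node - y_node.mean()) ** 2)    # squared error around the mean
mae = np.mean(np.abs(y_node - np.median(y_node)))  # absolute error around the median
print(f"MSE: {mse:.3f}, MAE: {mae:.3f}")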

Implementation Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree, export_text
from sklearn.datasets import load_iris, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, r2_score
# Example 1: Classification with CART
print("=== CART Classification Example ===")
# Load iris dataset for interpretability
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create CART classifier
cart_classifier = DecisionTreeClassifier(
    criterion='gini',
    max_depth=3,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42
)
cart_classifier.fit(X_train, y_train)
y_pred = cart_classifier.predict(X_test)
print(f"Classification Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Tree depth: {cart_classifier.get_depth()}")
print(f"Number of leaves: {cart_classifier.get_n_leaves()}")
# Example 2: Regression with CART
print("\n=== CART Regression Example ===")
X_reg, y_reg = make_regression(n_samples=1000, n_features=1, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42)
cart_regressor = DecisionTreeRegressor(
    criterion='squared_error',
    max_depth=5,
    min_samples_split=10,
    random_state=42
)
cart_regressor.fit(X_train_reg, y_train_reg)
y_pred_reg = cart_regressor.predict(X_test_reg)
print(f"Regression R² Score: {r2_score(y_test_reg, y_pred_reg):.4f}")
# Simple CART implementation for classification
class SimpleCARTNode:
    def __init__(self):
        self.feature = None
        self.threshold = None
        self.left = None
        self.right = None
        self.prediction = None
        self.samples = None
class SimpleCARTClassifier:
    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None
        
    def gini_impurity(self, y):
        """Calculate Gini impurity"""
        if len(y) == 0:
            return 0
        proportions = np.bincount(y) / len(y)
        return 1 - np.sum(proportions ** 2)
    
    def information_gain(self, parent, left_child, right_child):
        """Calculate information gain from a split"""
        n_parent = len(parent)
        n_left = len(left_child)
        n_right = len(right_child)
        
        if n_parent == 0:
            return 0
        
        gini_parent = self.gini_impurity(parent)
        gini_left = self.gini_impurity(left_child)
        gini_right = self.gini_impurity(right_child)
        
        weighted_gini = (n_left/n_parent) * gini_left + (n_right/n_parent) * gini_right
        return gini_parent - weighted_gini
    
    def best_split(self, X, y):
        """Find the best split for the current node"""
        best_gain = -1
        best_feature = None
        best_threshold = None
        
        n_features = X.shape[1]
        
        for feature in range(n_features):
            thresholds = np.unique(X[:, feature])
            
            for threshold in thresholds:
                left_indices = X[:, feature] <= threshold
                right_indices = ~left_indices
                
                if np.sum(left_indices) == 0 or np.sum(right_indices) == 0:
                    continue
                
                gain = self.information_gain(y, y[left_indices], y[right_indices])
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        
        return best_feature, best_threshold, best_gain
    
    def build_tree(self, X, y, depth=0):
        """Recursively build the decision tree"""
        node = SimpleCARTNode()
        node.samples = len(y)
        
        # Check stopping criteria
        if (depth >= self.max_depth or 
            len(y) < self.min_samples_split or 
            len(np.unique(y)) == 1):
            # Leaf node
            node.prediction = np.argmax(np.bincount(y))
            return node
        
        # Find best split
        best_feature, best_threshold, best_gain = self.best_split(X, y)
        
        if best_gain <= 0:
            # No good split found
            node.prediction = np.argmax(np.bincount(y))
            return node
        
        # Split the data
        left_indices = X[:, best_feature] <= best_threshold
        right_indices = ~left_indices
        
        node.feature = best_feature
        node.threshold = best_threshold
        
        # Recursively build child nodes
        node.left = self.build_tree(X[left_indices], y[left_indices], depth + 1)
        node.right = self.build_tree(X[right_indices], y[right_indices], depth + 1)
        
        return node
    
    def fit(self, X, y):
        """Train the CART classifier"""
        self.root = self.build_tree(X, y)
    
    def predict_sample(self, x, node):
        """Predict a single sample"""
        if node.prediction is not None:
            return node.prediction
        
        if x[node.feature] <= node.threshold:
            return self.predict_sample(x, node.left)
        else:
            return self.predict_sample(x, node.right)
    
    def predict(self, X):
        """Predict multiple samples"""
        return np.array([self.predict_sample(x, self.root) for x in X])
# Test simple implementation
simple_cart = SimpleCARTClassifier(max_depth=3)
simple_cart.fit(X_train, y_train)
simple_predictions = simple_cart.predict(X_test)
simple_accuracy = accuracy_score(y_test, simple_predictions)
print(f"\nSimple CART Accuracy: {simple_accuracy:.4f}")

Pruning in CART

CART includes sophisticated pruning techniques to prevent overfitting:

Cost Complexity Pruning (Post-Pruning)

CART uses a cost-complexity parameter (α) to balance tree complexity and accuracy:

Cost = Error + α × |Leaves|
# Demonstrate pruning with cost complexity
from sklearn.tree import DecisionTreeClassifier
# Train tree with different complexity parameters
alphas = [0.0, 0.01, 0.05, 0.1, 0.2]
results = []
for alpha in alphas:
    tree = DecisionTreeClassifier(
        criterion='gini',
        random_state=42,
        ccp_alpha=alpha  # Cost complexity pruning parameter
    )
    tree.fit(X_train, y_train)
    
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    n_leaves = tree.get_n_leaves()
    
    results.append({
        'alpha': alpha,
        'train_accuracy': train_acc,
        'test_accuracy': test_acc,
        'n_leaves': n_leaves
    })
    
    print(f"Alpha: {alpha:.2f} | Leaves: {n_leaves:2d} | "
          f"Train Acc: {train_acc:.3f} | Test Acc: {test_acc:.3f}")
# Plot pruning results
import pandas as pd
results_df = pd.DataFrame(results)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(results_df['alpha'], results_df['train_accuracy'], 'o-', label='Training')
plt.plot(results_df['alpha'], results_df['test_accuracy'], 'o-', label='Testing')
plt.xlabel('Cost Complexity Parameter (α)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Pruning Parameter')
plt.legend()
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(results_df['alpha'], results_df['n_leaves'], 'o-', color='green')
plt.xlabel('Cost Complexity Parameter (α)')
plt.ylabel('Number of Leaves')
plt.title('Tree Complexity vs Pruning Parameter')
plt.grid(True)
plt.tight_layout()
plt.show()
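
Rather than guessing a list of alphas, scikit-learn can also compute the full sequence of effective alphas generated by cost-complexity pruning. A minimal sketch, reusing X_train and y_train from above:

# Compute the cost-complexity pruning path on the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
print("Effective alphas:", np.round(path.ccp_alphas, 4))
print("Total leaf impurities:", np.round(path.impurities, 4))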

Advantages of CART

  • Highly Interpretable: Easy to understand and visualize decision paths
  • No Assumptions: Makes no statistical assumptions about data distribution
  • Handles Mixed Data: Works with both numerical and categorical features
  • Missing Value Handling: Built-in methods for dealing with missing data
  • Feature Selection: Automatically selects most important features
  • Non-linear Relationships: Captures complex decision boundaries
  • Fast Prediction: Prediction cost grows with tree depth, roughly O(log n) for a balanced tree
  • No Preprocessing: Doesn't require feature scaling or normalization

Limitations of CART

  • Overfitting Prone: Can create overly complex trees without pruning
  • Instability: Small changes in data can result in different trees
  • Bias: Favors features with more levels or continuous variables
  • Limited Expressiveness: Axis-parallel splits only
  • Difficulty with Linear Relationships: Many splits needed for simple linear patterns
  • Greedy Algorithm: May not find globally optimal tree

Real-World Applications

  • Medical Diagnosis: Decision support systems for healthcare
  • Credit Scoring: Loan approval and risk assessment
  • Marketing: Customer segmentation and targeting
  • Manufacturing: Quality control and process optimization
  • HR Analytics: Employee performance prediction
  • Fraud Detection: Identifying suspicious transactions
  • Customer Service: Automated decision trees for support

Feature Importance in CART

# Analyze feature importance
feature_importance = cart_classifier.feature_importances_
# Create feature importance plot
plt.figure(figsize=(10, 6))
indices = np.argsort(feature_importance)[::-1]
plt.bar(range(len(feature_importance)), feature_importance[indices])
plt.title('Feature Importance in CART')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.xticks(range(len(feature_importance)), 
           [feature_names[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()
# Print feature importance
print("Feature Importance Rankings:")
for i, idx in enumerate(indices):
    print(f"{i+1}. {feature_names[idx]}: {feature_importance[idx]:.4f}")

Handling Missing Values

Classical CART has built-in mechanisms for handling missing values (note that scikit-learn's trees do not implement surrogate splits; a common workaround is sketched after this list):

  • Surrogate Splits: Use alternative features when primary feature is missing
  • Default Direction: Send missing values to the child with more samples
  • Missing Value Category: Treat missing as a separate category
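
Because scikit-learn lacks surrogate splits, missing values are usually imputed before fitting. Below is a minimal sketch using a hypothetical toy array X_missing containing NaNs:

# Impute missing values before fitting a CART tree (toy NaN data for illustration)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X_missing = np.array([[5.1, np.nan], [4.9, 3.0], [6.2, 2.8], [np.nan, 3.1]])
y_missing = np.array([0, 0, 1, 1])

imputed_tree = make_pipeline(
    SimpleImputer(strategy='median'),           # fill NaNs with the column median
    DecisionTreeClassifier(max_depth=2, random_state=42)
)
imputed_tree.fit(X_missing, y_missing)
print(imputed_tree.predict([[5.0, 3.0]]))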

CART vs Other Decision Tree Algorithms

  • vs ID3: CART handles continuous variables and missing values better
  • vs C4.5: CART uses binary splits and Gini impurity with cost-complexity pruning, while C4.5 allows multi-way splits and uses gain ratio
  • vs Random Forest: CART is single tree, RF is ensemble of trees
  • vs Gradient Boosting: CART is standalone, GB builds trees sequentially

Parameter Tuning

# Grid search for optimal parameters
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy']
}
# Perform grid search
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
# Test best model
best_tree = grid_search.best_estimator_
best_accuracy = best_tree.score(X_test, y_test)
print("Test accuracy with best parameters:", best_accuracy)

Ensemble Methods Built on CART

CART forms the foundation for many powerful ensemble methods; a quick comparison with a single tree is sketched after the list:

  • Random Forest: Multiple CART trees with random feature selection
  • Gradient Boosting: Sequential CART trees correcting previous errors
  • Extra Trees: Extremely randomized trees
  • XGBoost: Optimized gradient boosting with CART base learners
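
On a small dataset like iris the gap may be modest, but the pattern is easy to see. A minimal sketch, reusing the earlier train/test split:

# Compare a single CART tree with two CART-based ensembles
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    'Single CART tree': DecisionTreeClassifier(max_depth=3, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.4f}")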

Best Practices

  • Control Tree Depth: Use max_depth to prevent overfitting
  • Set Minimum Samples: Use min_samples_split and min_samples_leaf
  • Use Pruning: Apply cost complexity pruning for better generalization
  • Cross-Validation: Use CV to select optimal parameters
  • Feature Engineering: Create meaningful features that align with tree splits
  • Ensemble Methods: Consider Random Forest or Gradient Boosting for better performance
  • Validate Interpretability: Ensure tree remains interpretable for your use case

When to Use CART

Choose CART when:

  • Interpretability is crucial
  • You need to explain decisions to stakeholders
  • Data contains mixed types (numerical and categorical)
  • You have missing values in your dataset
  • You want to identify important features
  • You need a baseline model quickly
  • Non-linear relationships exist in your data

Consider alternatives when:

  • You prioritize predictive performance over interpretability
  • Your data has strong linear relationships
  • You have very large datasets
  • Features are highly correlated
  • You need probabilistic outputs

CART represents the perfect balance between simplicity and power in machine learning. Its ability to create highly interpretable models while handling complex, non-linear relationships makes it invaluable in domains where understanding the decision-making process is as important as the accuracy of predictions. Whether used standalone for interpretable modeling or as the building block for sophisticated ensemble methods, CART remains one of the most important algorithms in the data science toolkit. Understanding CART provides essential insights into tree-based learning and serves as the foundation for mastering more advanced ensemble techniques.
