CART (Classification and Regression Trees): Mastering Interpretable Tree-Based Learning

CART (Classification and Regression Trees) stands as one of the most interpretable and versatile algorithms in machine learning. Developed by Leo Breiman and colleagues in 1984, CART revolutionized predictive modeling by creating tree-based models that humans can easily understand and interpret. This algorithm forms the backbone of many modern ensemble methods and remains a go-to choice when interpretability is crucial.

What is CART?

CART is a decision tree algorithm that builds binary trees for both classification and regression problems. Unlike other tree algorithms that can create multi-way splits, CART always creates binary splits, making the resulting trees easier to interpret and implement. The algorithm automatically handles both numerical and categorical features and includes built-in methods for dealing with missing values.

Core Principles

1. Binary Splits

CART always creates binary (yes/no) splits at each node, even for categorical variables with multiple values.

2. Recursive Partitioning

The algorithm recursively splits the data into increasingly homogeneous subsets.

3. Greedy Optimization

At each step, CART chooses the split that provides the best immediate improvement in the chosen criterion.

How CART Works

  1. Start with root node: Contains all training data
  2. Find best split: Test all possible binary splits on all features
  3. Choose optimal split: Select split that best improves purity measure
  4. Create child nodes: Split data based on chosen criterion
  5. Repeat recursively: Apply process to each child node
  6. Stop when criteria met: Minimum samples, maximum depth, or purity achieved
  7. Prune tree: Remove branches that don’t improve validation performance
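
To make steps 2–4 concrete, here is a minimal sketch of a single greedy split decision on a hypothetical one-dimensional dataset (the values are made up purely for illustration):

# One greedy split on a toy 1-D dataset (illustrative values only)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])  # single feature
y = np.array([0, 0, 0, 1, 1, 1])               # class labels

def gini(labels):
    """Gini impurity of a set of class labels."""
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

best_threshold, best_score = None, np.inf
for t in np.unique(x)[:-1]:                    # candidate thresholds (step 2)
    left, right = y[x <= t], y[x > t]
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    if score < best_score:                     # keep the lowest weighted impurity (step 3)
        best_threshold, best_score = t, score
print(f"Best split: x <= {best_threshold} (weighted Gini {best_score:.3f})")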

Splitting Criteria

For Classification (Impurity Measures)

Gini Impurity (CART Default)

Gini = 1 - Σ(i=1 to c) p_i²

Entropy (Information Gain)

Entropy = -Σ(i=1 to c) p_i * log₂(p_i)

For Regression

Mean Squared Error (MSE)

MSE = (1/n) * Σ(i=1 to n) (y_i - ȳ)²

Mean Absolute Error (MAE)

MAE = (1/n) * Σ(i=1 to n) |y_i - median(y)|
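
These measures are straightforward to evaluate directly; the snippet below plugs toy numbers into each formula (the class proportions and target values are made up for illustration):

# Evaluating the splitting criteria on toy numbers
import numpy as np

p = np.array([0.7, 0.3])                        # class proportions at a node
gini = 1 - np.sum(p ** 2)                       # Gini impurity
entropy = -np.sum(p * np.log2(p))               # entropy
print(f"Gini: {gini:.3f}, Entropy: {entropy:.3f}")

y_node = np.array([2.0, 3.5, 4.0, 10.0])        # targets falling into a regression node
mse = np.mean((y_node - y_node.mean()) ** 2)    # squared error around the mean
mae = np.mean(np.abs(y_node - np.median(y_node)))  # absolute error around the median
print(f"MSE: {mse:.3f}, MAE: {mae:.3f}")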

Implementation Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree, export_text
from sklearn.datasets import load_iris, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, r2_score
# Example 1: Classification with CART
print("=== CART Classification Example ===")
# Load iris dataset for interpretability
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create CART classifier
cart_classifier = DecisionTreeClassifier(
    criterion='gini',
    max_depth=3,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42
)
cart_classifier.fit(X_train, y_train)
y_pred = cart_classifier.predict(X_test)
print(f"Classification Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Tree depth: {cart_classifier.get_depth()}")
print(f"Number of leaves: {cart_classifier.get_n_leaves()}")
# Example 2: Regression with CART
print("\n=== CART Regression Example ===")
X_reg, y_reg = make_regression(n_samples=1000, n_features=1, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42)
cart_regressor = DecisionTreeRegressor(
    criterion='squared_error',
    max_depth=5,
    min_samples_split=10,
    random_state=42
)
cart_regressor.fit(X_train_reg, y_train_reg)
y_pred_reg = cart_regressor.predict(X_test_reg)
print(f"Regression R² Score: {r2_score(y_test_reg, y_pred_reg):.4f}")
# Simple CART implementation for classification
class SimpleCARTNode:
    def __init__(self):
        self.feature = None
        self.threshold = None
        self.left = None
        self.right = None
        self.prediction = None
        self.samples = None
class SimpleCARTClassifier:
    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None
        
    def gini_impurity(self, y):
        """Calculate Gini impurity"""
        if len(y) == 0:
            return 0
        proportions = np.bincount(y) / len(y)
        return 1 - np.sum(proportions ** 2)
    
    def information_gain(self, parent, left_child, right_child):
        """Calculate information gain from a split"""
        n_parent = len(parent)
        n_left = len(left_child)
        n_right = len(right_child)
        
        if n_parent == 0:
            return 0
        
        gini_parent = self.gini_impurity(parent)
        gini_left = self.gini_impurity(left_child)
        gini_right = self.gini_impurity(right_child)
        
        weighted_gini = (n_left/n_parent) * gini_left + (n_right/n_parent) * gini_right
        return gini_parent - weighted_gini
    
    def best_split(self, X, y):
        """Find the best split for the current node"""
        best_gain = -1
        best_feature = None
        best_threshold = None
        
        n_features = X.shape[1]
        
        for feature in range(n_features):
            thresholds = np.unique(X[:, feature])
            
            for threshold in thresholds:
                left_indices = X[:, feature] <= threshold
                right_indices = ~left_indices
                
                if np.sum(left_indices) == 0 or np.sum(right_indices) == 0:
                    continue
                
                gain = self.information_gain(y, y[left_indices], y[right_indices])
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        
        return best_feature, best_threshold, best_gain
    
    def build_tree(self, X, y, depth=0):
        """Recursively build the decision tree"""
        node = SimpleCARTNode()
        node.samples = len(y)
        
        # Check stopping criteria
        if (depth >= self.max_depth or 
            len(y) < self.min_samples_split or 
            len(np.unique(y)) == 1):
            # Leaf node
            node.prediction = np.argmax(np.bincount(y))
            return node
        
        # Find best split
        best_feature, best_threshold, best_gain = self.best_split(X, y)
        
        if best_gain <= 0:
            # No good split found
            node.prediction = np.argmax(np.bincount(y))
            return node
        
        # Split the data
        left_indices = X[:, best_feature] <= best_threshold
        right_indices = ~left_indices
        
        node.feature = best_feature
        node.threshold = best_threshold
        
        # Recursively build child nodes
        node.left = self.build_tree(X[left_indices], y[left_indices], depth + 1)
        node.right = self.build_tree(X[right_indices], y[right_indices], depth + 1)
        
        return node
    
    def fit(self, X, y):
        """Train the CART classifier"""
        self.root = self.build_tree(X, y)
    
    def predict_sample(self, x, node):
        """Predict a single sample"""
        if node.prediction is not None:
            return node.prediction
        
        if x[node.feature] <= node.threshold:
            return self.predict_sample(x, node.left)
        else:
            return self.predict_sample(x, node.right)
    
    def predict(self, X):
        """Predict multiple samples"""
        return np.array([self.predict_sample(x, self.root) for x in X])
# Test simple implementation
simple_cart = SimpleCARTClassifier(max_depth=3)
simple_cart.fit(X_train, y_train)
simple_predictions = simple_cart.predict(X_test)
simple_accuracy = accuracy_score(y_test, simple_predictions)
print(f"\nSimple CART Accuracy: {simple_accuracy:.4f}")

Pruning in CART

CART includes sophisticated pruning techniques to prevent overfitting:

Cost Complexity Pruning (Post-Pruning)

CART uses a cost-complexity parameter (α) to balance tree complexity and accuracy:

Cost = Error + α × |Leaves|
# Demonstrate pruning with cost complexity
from sklearn.tree import DecisionTreeClassifier
# Train tree with different complexity parameters
alphas = [0.0, 0.01, 0.05, 0.1, 0.2]
results = []
for alpha in alphas:
    tree = DecisionTreeClassifier(
        criterion='gini',
        random_state=42,
        ccp_alpha=alpha  # Cost complexity pruning parameter
    )
    tree.fit(X_train, y_train)
    
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    n_leaves = tree.get_n_leaves()
    
    results.append({
        'alpha': alpha,
        'train_accuracy': train_acc,
        'test_accuracy': test_acc,
        'n_leaves': n_leaves
    })
    
    print(f"Alpha: {alpha:.2f} | Leaves: {n_leaves:2d} | "
          f"Train Acc: {train_acc:.3f} | Test Acc: {test_acc:.3f}")
# Plot pruning results
import pandas as pd
results_df = pd.DataFrame(results)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(results_df['alpha'], results_df['train_accuracy'], 'o-', label='Training')
plt.plot(results_df['alpha'], results_df['test_accuracy'], 'o-', label='Testing')
plt.xlabel('Cost Complexity Parameter (α)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Pruning Parameter')
plt.legend()
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(results_df['alpha'], results_df['n_leaves'], 'o-', color='green')
plt.xlabel('Cost Complexity Parameter (α)')
plt.ylabel('Number of Leaves')
plt.title('Tree Complexity vs Pruning Parameter')
plt.grid(True)
plt.tight_layout()
plt.show()
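
Rather than guessing a list of alphas, scikit-learn can also compute the full sequence of effective alphas generated by cost-complexity pruning. A minimal sketch, reusing X_train and y_train from above:

# Compute the cost-complexity pruning path on the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
print("Effective alphas:", np.round(path.ccp_alphas, 4))
print("Total leaf impurities:", np.round(path.impurities, 4))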

Advantages of CART

  • Highly Interpretable: Easy to understand and visualize decision paths
  • No Assumptions: Makes no statistical assumptions about data distribution
  • Handles Mixed Data: Works with both numerical and categorical features
  • Missing Value Handling: Built-in methods for dealing with missing data
  • Feature Selection: Automatically selects most important features
  • Non-linear Relationships: Captures complex decision boundaries
  • Fast Prediction: Prediction cost grows with tree depth, roughly O(log n) for a balanced tree
  • No Preprocessing: Doesn't require feature scaling or normalization

Limitations of CART

  • Overfitting Prone: Can create overly complex trees without pruning
  • Instability: Small changes in data can result in different trees
  • Bias: Favors features with more levels or continuous variables
  • Limited Expressiveness: Axis-parallel splits only
  • Difficulty with Linear Relationships: Many splits needed for simple linear patterns
  • Greedy Algorithm: May not find globally optimal tree

Real-World Applications

  • Medical Diagnosis: Decision support systems for healthcare
  • Credit Scoring: Loan approval and risk assessment
  • Marketing: Customer segmentation and targeting
  • Manufacturing: Quality control and process optimization
  • HR Analytics: Employee performance prediction
  • Fraud Detection: Identifying suspicious transactions
  • Customer Service: Automated decision trees for support

Feature Importance in CART

# Analyze feature importance
feature_importance = cart_classifier.feature_importances_
# Create feature importance plot
plt.figure(figsize=(10, 6))
indices = np.argsort(feature_importance)[::-1]
plt.bar(range(len(feature_importance)), feature_importance[indices])
plt.title('Feature Importance in CART')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.xticks(range(len(feature_importance)), 
           [feature_names[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()
# Print feature importance
print("Feature Importance Rankings:")
for i, idx in enumerate(indices):
    print(f"{i+1}. {feature_names[idx]}: {feature_importance[idx]:.4f}")

Handling Missing Values

Classical CART has built-in mechanisms for handling missing values (note that scikit-learn's trees do not implement surrogate splits; a common workaround is sketched after this list):

  • Surrogate Splits: Use alternative features when primary feature is missing
  • Default Direction: Send missing values to the child with more samples
  • Missing Value Category: Treat missing as a separate category
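
Because scikit-learn lacks surrogate splits, missing values are usually imputed before fitting. Below is a minimal sketch using a hypothetical toy array X_missing containing NaNs:

# Impute missing values before fitting a CART tree (toy NaN data for illustration)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X_missing = np.array([[5.1, np.nan], [4.9, 3.0], [6.2, 2.8], [np.nan, 3.1]])
y_missing = np.array([0, 0, 1, 1])

imputed_tree = make_pipeline(
    SimpleImputer(strategy='median'),           # fill NaNs with the column median
    DecisionTreeClassifier(max_depth=2, random_state=42)
)
imputed_tree.fit(X_missing, y_missing)
print(imputed_tree.predict([[5.0, 3.0]]))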

CART vs Other Decision Tree Algorithms

  • vs ID3: CART handles continuous variables and missing values better
  • vs C4.5: CART uses binary splits and Gini impurity with cost-complexity pruning, while C4.5 allows multi-way splits and uses gain ratio
  • vs Random Forest: CART is single tree, RF is ensemble of trees
  • vs Gradient Boosting: CART is standalone, GB builds trees sequentially

Parameter Tuning

# Grid search for optimal parameters
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy']
}
# Perform grid search
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
# Test best model
best_tree = grid_search.best_estimator_
best_accuracy = best_tree.score(X_test, y_test)
print("Test accuracy with best parameters:", best_accuracy)

Ensemble Methods Built on CART

CART forms the foundation for many powerful ensemble methods; a quick comparison with a single tree is sketched after the list:

  • Random Forest: Multiple CART trees with random feature selection
  • Gradient Boosting: Sequential CART trees correcting previous errors
  • Extra Trees: Extremely randomized trees
  • XGBoost: Optimized gradient boosting with CART base learners
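
On a small dataset like iris the gap may be modest, but the pattern is easy to see. A minimal sketch, reusing the earlier train/test split:

# Compare a single CART tree with two CART-based ensembles
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    'Single CART tree': DecisionTreeClassifier(max_depth=3, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.4f}")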

Best Practices

  • Control Tree Depth: Use max_depth to prevent overfitting
  • Set Minimum Samples: Use min_samples_split and min_samples_leaf
  • Use Pruning: Apply cost complexity pruning for better generalization
  • Cross-Validation: Use CV to select optimal parameters
  • Feature Engineering: Create meaningful features that align with tree splits
  • Ensemble Methods: Consider Random Forest or Gradient Boosting for better performance
  • Validate Interpretability: Ensure tree remains interpretable for your use case

When to Use CART

Choose CART when:

  • Interpretability is crucial
  • You need to explain decisions to stakeholders
  • Data contains mixed types (numerical and categorical)
  • You have missing values in your dataset
  • You want to identify important features
  • You need a baseline model quickly
  • Non-linear relationships exist in your data

Consider alternatives when:

  • You prioritize predictive performance over interpretability
  • Your data has strong linear relationships
  • You have very large datasets
  • Features are highly correlated
  • You need probabilistic outputs

CART represents the perfect balance between simplicity and power in machine learning. Its ability to create highly interpretable models while handling complex, non-linear relationships makes it invaluable in domains where understanding the decision-making process is as important as the accuracy of predictions. Whether used standalone for interpretable modeling or as the building block for sophisticated ensemble methods, CART remains one of the most important algorithms in the data science toolkit. Understanding CART provides essential insights into tree-based learning and serves as the foundation for mastering more advanced ensemble techniques.
