Claude vs Copilot for Python Data Science

Python data science work spans multiple domains: data wrangling with pandas, statistical modeling with sklearn, visualization with matplotlib/seaborn/plotly, and exploratory analysis in Jupyter. Claude and GitHub Copilot take fundamentally different approaches — Claude as a conversational assistant for complex analysis, Copilot as an inline completion engine. This guide tests both on real data science tasks.

Task 1: Pandas Data Cleaning Pipeline

Scenario: A messy CSV with mixed types, nulls, inconsistent date formats, and duplicate rows. Build a cleaning pipeline.

Claude (conversational prompt):

I have a sales CSV with these issues:
- 'amount' column has values like "$1,234.56", "1234.56", and nulls
- 'date' column has mixed formats: "2025-03-15", "15/03/2025", "Mar 15, 2025"
- 'customer_id' has duplicates (keep latest by date)
- 'region' column has "US", "usa", "United States" (normalize to ISO)

Write a pandas pipeline using method chaining.

Claude’s output:

import pandas as pd
import numpy as np
from dateutil import parser as dateutil_parser

REGION_MAP = {
    "us": "US", "usa": "US", "united states": "US", "u.s.": "US",
    "uk": "GB", "united kingdom": "GB", "england": "GB",
    "ca": "CA", "canada": "CA",
    # extend as needed
}

def parse_amount(val):
    if pd.isna(val):
        return np.nan
    return float(str(val).replace("$", "").replace(",", "").strip())

def parse_flexible_date(val):
    if pd.isna(val):
        return pd.NaT
    try:
        return dateutil_parser.parse(str(val), dayfirst=False)
    except (ValueError, OverflowError):
        return pd.NaT

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df
        .assign(
            amount=lambda x: x["amount"].map(parse_amount),
            date=lambda x: x["date"].map(parse_flexible_date),
            region=lambda x: (
                x["region"]
                .str.strip()
                .str.lower()
                .map(REGION_MAP)
                .fillna(x["region"].str.upper())  # keep unknown as-is
            ),
        )
        .dropna(subset=["amount", "date"])
        .sort_values("date", ascending=False)
        .drop_duplicates(subset=["customer_id"], keep="first")
        .reset_index(drop=True)
        .astype({"amount": "float64"})
        .assign(date=lambda x: pd.to_datetime(x["date"]))
    )

Claude produces a fully chained pipeline with custom parsers, handles edge cases in region normalization, and uses dateutil for flexible date parsing. It also explains why drop_duplicates(keep="first") after sort_values(date, ascending=False) keeps the latest row.

Copilot (inline in Jupyter):

When you type def clean_sales(df): and add a comment # normalize region names, parse currency, parse dates, Copilot generates a simpler version:

def clean_sales(df):
    # normalize region names
    df['region'] = df['region'].str.upper().str.strip()

    # parse currency
    df['amount'] = df['amount'].str.replace('$', '').str.replace(',', '').astype(float)

    # parse dates
    df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True, errors='coerce')

    # remove duplicates
    df = df.sort_values('date').drop_duplicates('customer_id', keep='last')

    return df

Copilot’s version is simpler but has issues: infer_datetime_format=True is deprecated in pandas 2.x, .str.replace('$', '') raises an error on NaN values, and it mutates the input DataFrame instead of returning a clean copy.

Winner: Claude — handles edge cases, idiomatic chaining, deprecation-aware.

Task 2: Sklearn Pipeline with Custom Transformer

Prompt: “Build a sklearn pipeline for a classification task. Include: StandardScaler for numeric, OneHotEncoder for categorical, a custom transformer that adds log1p features for skewed numerics, then LogisticRegression. Make it compatible with GridSearchCV.”

Claude:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd

class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Applies log1p to columns with skewness above threshold."""

    def __init__(self, skew_threshold: float = 0.5):
        self.skew_threshold = skew_threshold
        self.skewed_cols_: list[int] = []

    def fit(self, X, y=None):
        if isinstance(X, pd.DataFrame):
            X = X.values
        skewness = pd.DataFrame(X).skew()
        self.skewed_cols_ = list(skewness[abs(skewness) > self.skew_threshold].index)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float).copy()
        X[:, self.skewed_cols_] = np.log1p(X[:, self.skewed_cols_])
        return X


def build_pipeline(
    numeric_features: list[str],
    categorical_features: list[str],
) -> Pipeline:
    numeric_transformer = Pipeline([
        ("log1p", Log1pTransformer(skew_threshold=0.5)),
        ("scaler", StandardScaler()),
    ])

    categorical_transformer = OneHotEncoder(
        handle_unknown="ignore",
        sparse_output=False,
    )

    preprocessor = ColumnTransformer([
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

    return Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000, random_state=42)),
    ])


# GridSearchCV-compatible parameter grid
param_grid = {
    "preprocessor__num__log1p__skew_threshold": [0.3, 0.5, 0.8],
    "classifier__C": [0.01, 0.1, 1.0, 10.0],
    "classifier__solver": ["lbfgs", "saga"],
}

pipeline = build_pipeline(
    numeric_features=["age", "income", "tenure_days"],
    categorical_features=["region", "plan_type"],
)

grid_search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="roc_auc", n_jobs=-1, verbose=1
)

Claude correctly implements the sklearn estimator API (fit/transform, BaseEstimator, TransformerMixin) and uses the __ double-underscore notation for GridSearchCV parameter names.

Copilot struggles here without context — it generates a basic pipeline but doesn’t implement a proper custom transformer. It typically uses FunctionTransformer instead, which doesn’t expose hyperparameters for GridSearchCV.

Winner: Claude — proper sklearn API, GridSearchCV-compatible custom transformer.

Task 3: EDA and Visualization

For Jupyter-based EDA, Copilot has an advantage: it sees your DataFrame columns and generates relevant visualizations inline.

Copilot inline (after loading a DataFrame):

# Type: # plot distribution of amount by region
# Copilot generates:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution plot
sns.histplot(data=df, x='amount', hue='region', bins=30, ax=axes[0])
axes[0].set_title('Amount Distribution by Region')
axes[0].set_xlabel('Amount ($)')

# Box plot for outlier visibility
sns.boxplot(data=df, x='region', y='amount', ax=axes[1])
axes[1].set_title('Amount by Region (Box Plot)')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Copilot’s visualization completions are fast and contextually relevant. Claude requires more prompt effort but generates more sophisticated plots with proper statistical annotations.

Task 4: Time Series Feature Engineering

Time series work is where the gap between the two tools becomes clear. Claude can reason about lag features, rolling statistics, and seasonality; Copilot generates boilerplate without understanding the domain.

Prompt to Claude: “I have a daily sales DataFrame with columns: date, sales, store_id. Create a feature engineering function that adds: 7-day rolling mean and std, lag features for 1, 7, and 30 days, day-of-week and month indicators, and a ‘is_weekend’ flag. Handle the NaN values from lagging correctly.”

Claude:

import pandas as pd
import numpy as np


def engineer_time_features(
    df: pd.DataFrame,
    date_col: str = "date",
    target_col: str = "sales",
    group_col: str = "store_id",
) -> pd.DataFrame:
    """
    Add time-series features to a daily sales DataFrame.
    Applies lag and rolling features within each store group
    to avoid data leakage across stores.
    """
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])
    df = df.sort_values([group_col, date_col]).reset_index(drop=True)

    def add_features(group: pd.DataFrame) -> pd.DataFrame:
        g = group.copy()

        # Lag features — shift within the group
        for lag in [1, 7, 30]:
            g[f"{target_col}_lag_{lag}d"] = g[target_col].shift(lag)

        # Rolling statistics (requires lag to avoid leakage)
        g[f"{target_col}_roll7_mean"] = (
            g[target_col].shift(1).rolling(7, min_periods=4).mean()
        )
        g[f"{target_col}_roll7_std"] = (
            g[target_col].shift(1).rolling(7, min_periods=4).std()
        )

        return g

    df = df.groupby(group_col, group_keys=False).apply(add_features)

    # Calendar features (don't need grouping)
    df["day_of_week"] = df[date_col].dt.dayofweek       # 0=Monday
    df["month"] = df[date_col].dt.month
    df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
    df["quarter"] = df[date_col].dt.quarter

    return df


# Usage
featured_df = engineer_time_features(raw_df)

# NaN summary — expected for early rows
print(featured_df.isna().sum())

# Drop rows where any lag feature is NaN before training
train_df = featured_df.dropna(subset=[
    "sales_lag_1d", "sales_lag_7d", "sales_lag_30d",
    "sales_roll7_mean", "sales_roll7_std",
])

The critical detail: using .shift(1) before .rolling() prevents data leakage. Copilot sometimes generates rolling features on the raw column, which means the rolling mean at time T includes the value at time T — leaking the target into the feature.

Winner: Claude — understands data leakage risk, applies groupby correctly per store.

Task 5: Debugging pandas Errors

Claude’s conversational ability shines for debugging. Paste in a stack trace and Claude identifies root causes accurately.

Error to debug:

ValueError: Cannot convert non-finite values (NA or inf) to integer

Claude’s diagnosis: This error occurs when casting a float column containing NaN to int. Claude identifies the three root causes and gives the fix for each:

# Root cause 1: Column has NaN, astype(int) fails
df["col"].astype(int)           # Fails if any NaN

# Fix 1: Fill NaN before casting
df["col"].fillna(0).astype(int)

# Fix 2: Use nullable integer type (pandas 1.0+)
df["col"].astype("Int64")       # Capital I — supports pd.NA

# Fix 3: Drop NaN rows first
df.dropna(subset=["col"])["col"].astype(int)

# Root cause 2: Column has inf values from division
df["rate"] = df["a"] / df["b"]  # inf when b==0

# Fix: Replace inf before casting
df["rate"] = df["rate"].replace([np.inf, -np.inf], np.nan)

Copilot doesn’t debug errors conversationally — you’d need to paste the error into a comment and hope it completes a fix, which works inconsistently.

Tool Selection Guide

Task	Best Tool	Why
Complex data cleaning pipeline	Claude	Handles edge cases, no deprecation issues
Custom sklearn transformers	Claude	Proper API implementation
Quick EDA visualizations	Copilot	Inline, sees DataFrame context
Statistical analysis explanations	Claude	Interprets results, not just code
Boilerplate sklearn pipelines	Copilot	Fast inline completion
Debugging pandas errors	Claude	Full error + code context
Feature engineering ideas	Claude	Conversational exploration
Time series features	Claude	Understands leakage, groupby patterns
Simple plot generation	Copilot	Faster with existing column context

Task 4: Debugging Statistical Analysis

Scenario: Your model’s validation AUC is 0.68 (bad), but training AUC is 0.95 (overfitting).

Claude prompt:

Training AUC: 0.95, Validation AUC: 0.68

This is a severe overfitting problem. My dataset has:
- 10,000 training samples
- 2,000 validation samples
- 150 features (numerical and categorical)
- Logistic Regression with default parameters

What are the top 5 most likely causes?

Claude generates:

Too many features relative to samples (150:10000 ratio is bad)
Categorical features not properly encoded (might have high cardinality)
No regularization (add C parameter, use L1/L2)
Data leakage (feature contains target information)
Class imbalance (minority class overfit)

Claude then provides the diagnostic code:

# Check feature importance — might reveal leakage
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(model, X_val, y_val, n_repeats=10)

# Check for extreme feature values
print(X_train.describe())

# Verify no target leakage
print("Features correlated with target:")
print(X_train.corrwith(y_train).abs().sort_values(ascending=False).head(10))

Copilot in the same scenario would suggest:

# Add regularization
LogisticRegression(C=0.01, max_iter=1000)

Copilot gives the fix but not the diagnostic. Claude explains why each cause is likely.

Task 5: Hyperparameter Tuning Strategy

Prompt to Claude:

I'm tuning a RandomForest for a classification task:
- Dataset: 50,000 samples, 30 features
- Training takes 5 minutes per model
- Need results within 24 hours (288 minutes = ~57 models max)

Design a hyperparameter tuning strategy. I can evaluate:
1. n_estimators: [50, 100, 200, 500]
2. max_depth: [5, 10, 20, None]
3. min_samples_split: [2, 5, 10, 20]
4. min_samples_leaf: [1, 2, 4, 8]
5. max_features: ['sqrt', 'log2', None]

Which parameters matter most? What's the best search strategy?

Claude analyzes:

Priority ranking (impact on performance):
1. n_estimators: High impact, test [100, 300] first
2. max_depth: High impact, test [10, 20] first
3. min_samples_split: Medium impact, test [5, 10]
4. max_features: Medium impact, test ['sqrt', 'log2']
5. min_samples_leaf: Low impact, skip (use default)

Recommended approach:
- Phase 1 (10 models): GridSearch on n_estimators × max_depth
  Time: 50 minutes. Keep top 3 parameter combinations.
- Phase 2 (15 models): RandomSearch on remaining 3 params around winning combo
  Time: 75 minutes
- Phase 3 (20 models): Bayesian optimization to fine-tune top performers
  Time: 100 minutes
- Total: 225 minutes (leaves 63 minutes buffer)

Implementation:
"""
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from skopt import BayesSearchCV

# Phase 1: GridSearch (most important parameters)
grid = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 15, 20, 25]
    },
    cv=5, scoring='roc_auc', n_jobs=-1
)
grid.fit(X_train, y_train)

# Keep best params
best_params = grid.best_params_

# Phase 2: RandomSearch around winners
search = RandomizedSearchCV(
    RandomForestClassifier(**best_params, n_jobs=-1, random_state=42),
    {
        'min_samples_split': [2, 5, 10, 20],
        'max_features': ['sqrt', 'log2'],
        'min_samples_leaf': [1, 2, 4]
    },
    n_iter=15, cv=5, scoring='roc_auc', n_jobs=-1, random_state=42
)
search.fit(X_train, y_train)
"""

This strategy is impossible for Copilot — it can't reason about time constraints and feature importance ranking.

## Handling Imbalanced Data: Claude vs Copilot

**Prompt:** "My dataset is 95% class 0, 5% class 1. How do I fix this?"

**Copilot** suggests:
```python
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
model.fit(X_train_balanced, y_train_balanced)

This is correct but incomplete — SMOTE can cause leakage if applied before train-test split.

Claude explains:

SMOTE during training is common but risky. Best practice:

1. Split first: X_train/y_train, X_test/y_test
2. Apply SMOTE only to training data
3. Evaluate on original test distribution
4. Watch for inflated metrics due to synthetic minority samples

Better code:
"""
from sklearn.model_selection import train_test_split
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SMOTE inside pipeline — applies only during fit
pipeline = Pipeline([
 ('smote', SMOTE(random_state=42)),
 ('classifier', LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test) # Evaluated on original distribution
"""

Also mention: if SMOTE hurts performance, try class_weight='balanced' instead.

Tool Decision Framework

Data Science Task	Claude	Copilot	Winner
Exploratory Data Analysis	Good (explains patterns)	Excellent (sees columns)	Copilot
Feature engineering ideation	Excellent	Good	Claude
Debugging model overfitting	Excellent	Weak	Claude
Sklearn pipeline building	Excellent	Good	Claude
Visualization code	Good	Excellent (in-notebook)	Copilot
Statistical test interpretation	Excellent	Weak	Claude
Quick boilerplate code	Moderate	Excellent	Copilot
Cross-validation strategy	Excellent	Weak	Claude
Hyperparameter tuning design	Excellent	Poor	Claude

Real Workflow: ML Practitioners Use Both

Smart teams use Claude for reasoning and Copilot for speed:

# Step 1: Use Copilot to generate boilerplate
# Type: "import pandas" → Copilot fills in common imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Step 2: Load data with Copilot
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Step 3: Use Claude for strategy
# "I have 50k samples, 30 features, binary classification.
# Training is slow. What features should I engineer?"
# → Claude generates domain-specific feature ideas

# Step 4: Back to Copilot for implementation
# Type comment: "# normalize numerical features"
# → Copilot suggests StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Claude for debugging
# Paste error: "ValueError: y_true and y_pred have different shapes"
# → Claude explains and provides fix

# Step 6: Copilot for testing boilerplate
# Type: "# test model performance"
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, model.predict(X_test_scaled)))

This hybrid approach is faster than either tool alone.

Built by theluckystrike — More at zovo.one