Claude Code for Rook Ceph Storage Workflow Guide

Rook Ceph has become the de facto solution for running Ceph storage clusters on Kubernetes. When combined with Claude Code, developers can automate complex storage workflows, manage persistent volumes, and handle disaster recovery scenarios with far less manual effort. This guide walks you through practical applications of Claude Code in managing Rook Ceph storage operations.

Understanding the Rook Ceph Architecture

Before diving into automation, it’s essential to understand how Rook Ceph integrates with Kubernetes. Rook acts as a storage orchestrator that transforms Ceph—a distributed storage system—into a self-managing, self-scaling storage layer native to Kubernetes.

The architecture consists of three primary components: the Rook operator (which manages the Ceph cluster lifecycle), the Ceph cluster (providing the actual storage), and the Kubernetes CSI (Container Storage Interface) drivers that expose storage to workloads.

Claude Code can interact with all these layers through kubectl commands, custom resources, and the Rook operator’s API endpoints. By writing Claude Code scripts, you can orchestrate complex multi-step operations that would otherwise require extensive manual intervention.

Setting Up Claude Code for Ceph Management

First, ensure your environment is properly configured. You’ll need:

- kubectl configured with access to your cluster
- A running Rook Ceph cluster (operator and Ceph daemons deployed)
- The Rook toolbox pod deployed, for running Ceph CLI commands

Here’s a basic Claude Code configuration script to validate your setup:

#!/bin/bash
# Validate Rook Ceph environment

ROOK_NAMESPACE="${ROOK_NAMESPACE:-rook-ceph}"
CEPH_POOL_NAME="${CEPH_POOL_NAME:-replicapool}"

# Check Rook operator status
echo "Checking Rook operator status..."
kubectl get pods -n "$ROOK_NAMESPACE" -l app=rook-ceph-operator

# Verify Ceph cluster health (the toolbox pod name varies, so target the deployment)
kubectl exec -n "$ROOK_NAMESPACE" deploy/rook-ceph-tools -- ceph status

# List available storage classes
kubectl get storageclass | grep ceph

This script forms the foundation for more complex automation. Run this before executing any storage operations to ensure your cluster is healthy.

Automating Storage Pool Creation

One of the most common tasks in Ceph management is creating new storage pools. Claude Code can automate this process with a reusable function:

import json
import subprocess

def create_ceph_pool(pool_name: str, replica_count: int = 3) -> dict:
    """
    Create a new Ceph pool with the specified replica count.

    Args:
        pool_name: Name of the pool to create
        replica_count: Number of replicas (default: 3)
    """
    pool_definition = {
        "apiVersion": "ceph.rook.io/v1",
        "kind": "CephBlockPool",
        "metadata": {
            "name": pool_name,
            "namespace": "rook-ceph"
        },
        "spec": {
            "replicated": {
                "size": replica_count,
                "requireSafeReplicaSize": True
            }
        }
    }

    # kubectl accepts JSON manifests, so pipe the definition straight to
    # `kubectl apply -f -` rather than writing a temporary YAML file
    result = subprocess.run(
        ["kubectl", "apply", "-f", "-"],
        input=json.dumps(pool_definition),
        capture_output=True,
        text=True
    )

    return {
        "status": "success" if result.returncode == 0 else "failed",
        "output": result.stdout,
        "error": result.stderr
    }

This function creates a manifest and applies it to your cluster. You can extend it to automatically create corresponding StorageClass objects, making the pool immediately available to developers.
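
That extension can follow the same manifest-generation pattern. Here is a minimal sketch of a companion StorageClass generator; the provisioner name and CSI secret names below follow Rook's documented defaults (`rook-csi-rbd-provisioner`, `rook-csi-rbd-node`), so verify them against your deployment, and `generate_storageclass_manifest` itself is an illustrative helper, not part of any API:

```python
def generate_storageclass_manifest(pool_name: str, namespace: str = "rook-ceph") -> dict:
    """Build a StorageClass manifest exposing a CephBlockPool via the RBD CSI driver."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": f"ceph-block-{pool_name}"},
        "provisioner": "rook-ceph.rbd.csi.ceph.com",
        "parameters": {
            "clusterID": namespace,
            "pool": pool_name,
            "imageFormat": "2",
            "imageFeatures": "layering",
            # Secret names below are Rook's defaults; confirm on your cluster
            "csi.storage.k8s.io/provisioner-secret-name": "rook-csi-rbd-provisioner",
            "csi.storage.k8s.io/provisioner-secret-namespace": namespace,
            "csi.storage.k8s.io/node-stage-secret-name": "rook-csi-rbd-node",
            "csi.storage.k8s.io/node-stage-secret-namespace": namespace,
            "csi.storage.k8s.io/fstype": "ext4",
        },
        "reclaimPolicy": "Delete",
        "allowVolumeExpansion": True,
    }
```

The resulting dict can be applied the same way as the pool definition, by piping its JSON form to kubectl apply.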

Managing Persistent Volumes with Dynamic Provisioning

Dynamic volume provisioning eliminates the need for pre-provisioned storage. When a PersistentVolumeClaim (PVC) is created, Rook’s CSI driver automatically provisions the underlying storage. Here’s how to optimize this workflow:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-database-storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-block
  resources:
    requests:
      storage: 50Gi

Claude Code can help you generate these manifests with sensible defaults, validate them against your cluster’s capacity, and apply them with proper error handling. You might create a prompt template that generates PVCs based on workload requirements:

def generate_pvc_manifest(workload_name: str, size_gb: int, storage_class: str = "ceph-block"):
    """Generate an optimized PVC manifest for a given workload."""
    
    manifest = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": f"{workload_name}-pvc",
            "labels": {
                "app": workload_name,
                "managed-by": "claude-code"
            }
        },
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": f"{size_gb}Gi"}},
            "storageClassName": storage_class
        }
    }
    return manifest
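
The capacity validation mentioned above can be a small pre-flight check. This is a sketch that assumes you obtain the cluster's available capacity separately (for example from `ceph df`); `parse_gi` and `validate_pvc_request` are illustrative helpers, not part of any API:

```python
def parse_gi(quantity: str) -> int:
    """Parse a Kubernetes quantity like '50Gi' into an integer number of GiB."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"expected a Gi quantity, got {quantity!r}")
    return int(quantity.removesuffix("Gi"))

def validate_pvc_request(manifest: dict, available_gi: int) -> bool:
    """Return True if the PVC's storage request fits within the available capacity."""
    requested = parse_gi(manifest["spec"]["resources"]["requests"]["storage"])
    return requested <= available_gi
```

Running this check before kubectl apply lets the automation reject a claim that would push the cluster past capacity, instead of letting the PVC sit Pending.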

Implementing Disaster Recovery Workflows

Rook Ceph provides robust mechanisms for data protection, but orchestrating disaster recovery requires careful planning. Claude Code can automate snapshot creation, backup verification, and restoration procedures.

Here’s a comprehensive disaster recovery script:

#!/bin/bash
# Automated Ceph snapshot and backup workflow

SNAPSHOT_NAME="backup-$(date +%Y%m%d-%H%M%S)"
PVC_NAME="${1:-my-database-pvc}"
BACKUP_LOCATION="${BACKUP_LOCATION:-s3://ceph-backups}"

echo "Creating snapshot: $SNAPSHOT_NAME for PVC: $PVC_NAME"

# Create VolumeSnapshotClass if needed
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
driver: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
deletionPolicy: Retain
EOF

# Create the snapshot
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ${SNAPSHOT_NAME}
  namespace: default
spec:
  volumeSnapshotClassName: csi-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: ${PVC_NAME}
EOF

# Snapshot creation is asynchronous; wait until the snapshot is usable
echo "Snapshot requested; waiting for it to become ready..."
kubectl wait --for=jsonpath='{.status.readyToUse}'=true \
  "volumesnapshot/${SNAPSHOT_NAME}" --timeout=120s
kubectl get volumesnapshot "${SNAPSHOT_NAME}"

This script creates on-demand snapshots that serve as the foundation for your backup strategy. For production environments, extend this to include offsite replication and regular automated testing of restoration procedures.
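
Restoration testing can reuse the manifest-generation pattern: the CSI snapshot support lets a new PVC name a VolumeSnapshot as its dataSource, which provisions a fresh volume from the snapshot's contents. A sketch (`generate_restore_pvc` is an illustrative helper; the restored PVC must request at least the snapshot's size):

```python
def generate_restore_pvc(snapshot_name: str, size: str,
                         storage_class: str = "ceph-block") -> dict:
    """Build a PVC manifest that provisions a new volume from a VolumeSnapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{snapshot_name}-restore"},
        "spec": {
            "storageClassName": storage_class,
            # dataSource points the CSI driver at the snapshot to clone from
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }
```

Applying this manifest, mounting the restored volume in a throwaway pod, and checksumming its contents gives you an automated restore test.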

Monitoring Ceph Cluster Health

Proactive monitoring prevents data loss and performance degradation. Claude Code can aggregate health metrics and alert on critical conditions:

def check_ceph_health() -> dict:
    """Query Ceph cluster health status."""
    import subprocess
    import json
    
    result = subprocess.run(
        ["kubectl", "exec", "-n", "rook-ceph", 
         "rook-ceph-tools-0", "--", "ceph", "status", "-f", "json"],
        capture_output=True,
        text=True
    )
    
    if result.returncode != 0:
        return {"status": "error", "message": result.stderr}
    
    status = json.loads(result.stdout)
    
    # Extract key health indicators (field layout from `ceph status -f json`;
    # older Ceph releases nest some of these fields one level deeper)
    health_indicators = {
        "overall_status": status.get("health", {}).get("status", "UNKNOWN"),
        "pg_status": status.get("pgmap", {}).get("pgs_by_state", []),
        "osd_count": status.get("osdmap", {}).get("num_osds", 0),
        "pool_count": status.get("pgmap", {}).get("num_pools", 0)
    }
    
    return health_indicators

def alert_on_degraded_health(health_status: dict):
    """Send alerts when cluster health is degraded."""
    if health_status["overall_status"] != "HEALTH_OK":
        # Integration with your alerting system
        print(f"ALERT: Ceph cluster health is {health_status['overall_status']}")
        # Add PagerDuty, Slack, or email integration here

Best Practices and Actionable Advice

When working with Rook Ceph and Claude Code, follow these proven patterns:

Always use replica counts appropriate to your fault tolerance requirements. For production workloads, maintain at least three replicas with requireSafeReplicaSize set to true. This prevents data loss during node failures.

Implement proper capacity planning. Monitor usage trends and set up alerts at 70% and 85% capacity thresholds. Over-provisioning is preferable to running out of storage mid-operation.
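
The threshold check can be scripted against the output of `ceph df -f json`. A sketch: recent Ceph releases report cluster-wide totals under a top-level "stats" key (`total_bytes`, `total_used_bytes`), but confirm the field names on your Ceph version:

```python
def capacity_alert_level(ceph_df: dict, warn_pct: float = 70.0,
                         crit_pct: float = 85.0) -> str:
    """Classify cluster fullness from parsed `ceph df -f json` output."""
    stats = ceph_df.get("stats", {})
    total = stats.get("total_bytes", 0)
    used = stats.get("total_used_bytes", 0)
    if total == 0:
        return "unknown"
    used_pct = 100.0 * used / total
    if used_pct >= crit_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"
```

Running this on a schedule and feeding "warning" and "critical" results into the alerting hook covers the 70% and 85% thresholds above.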

Version control your storage manifests. Store all Ceph manifests in Git alongside your application code. This enables reproducible deployments and simplifies audit trails.

Test disaster recovery procedures regularly. Automate your DR testing with Claude Code scripts that create temporary PVCs from snapshots, verify data integrity, and clean up afterward.

Leverage the Rook toolbox for debugging. The toolbox pod provides direct Ceph commands for troubleshooting. Claude Code can generate diagnostic reports automatically when issues arise.
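
Report generation can start from a fixed list of diagnostic commands run through the toolbox. A minimal sketch under stated assumptions: the command list is illustrative, and collecting and formatting the output with subprocess is left to the caller:

```python
# Ceph subcommands worth capturing in a diagnostic report (illustrative list)
DIAG_COMMANDS = [
    ["health", "detail"],
    ["osd", "tree"],
    ["osd", "df"],
    ["df"],
]

def toolbox_command(ceph_args: list, namespace: str = "rook-ceph") -> list:
    """Build a kubectl argv that runs a Ceph subcommand inside the Rook toolbox."""
    return ["kubectl", "exec", "-n", namespace,
            "deploy/rook-ceph-tools", "--", "ceph"] + ceph_args
```

Each entry can then be passed to subprocess.run(toolbox_command(args), capture_output=True, text=True) and the outputs concatenated into a single report file.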

Conclusion

Claude Code transforms Rook Ceph management from manual operations into automated, repeatable workflows. By investing time in building comprehensive scripts for common tasks—pool creation, volume provisioning, snapshots, and monitoring—you’ll reduce operational overhead and improve reliability.

Start with the foundational scripts in this guide, then extend them to match your organization’s specific requirements. The combination of Kubernetes’ orchestration capabilities, Ceph’s robust storage primitives, and Claude Code’s automation power creates a formidable platform for modern cloud-native applications.

Built by theluckystrike — More at zovo.one