Kubernetes operators are complex. They combine CRD design, controller reconciliation logic, finalizers, status subresources, and RBAC rules. Writing an operator means coordinating across multiple Kubernetes concepts simultaneously. AI coding assistants struggle with this complexity because operator patterns are specialized and context-dependent.
This guide tests how well current AI tools handle operator development and where they fail.
Why Operators Are Hard for AI
Operators require understanding:
- Custom Resource Definitions (CRDs): Schema design, validation rules, status/spec separation
- Controller Logic: Reconciliation loops, event handling, idempotency
- Kubernetes Patterns: Finalizers, owner references, backoff strategies
- Error Handling: Transient vs permanent failures, retry logic
- RBAC: Role and RoleBinding generation for least privilege
AI models trained mostly on general Go code treat operators as complex state machines built on unfamiliar APIs. They generate code that compiles but misses edge cases such as leader election, graceful shutdown, and reconciling resources that have already been deleted. AI output also includes very few tests, a serious gap for operators that run production infrastructure.
Top AI Tools for Operators
| Tool | Strength | Weakness | Cost |
|---|---|---|---|
| Claude (GPT-4 level) | Architecture design, CRD planning | Sometimes verbose reconciliation loops | $20/mo |
| GitHub Copilot | Boilerplate code, controller scaffolding | Weak on RBAC, finalizer logic | $10/mo |
| Cursor | Full-codebase understanding | Limited operator examples in training | $20/mo |
| v0 (Vercel) | UI scaffolding only | Not applicable to operators | Free |
| Replit AI | Quick prototyping | Poor on production-ready code | Free/paid |
What Works: CRD Design
AI excels at generating CRD YAML when you describe the resource clearly.
Good input prompt:
Generate a CRD for a PostgreSQL Backup resource with:
- spec: dbName (string), schedule (cron), retentionDays (int), s3Bucket (string)
- status: lastBackupTime, nextScheduledBackup, status (enum: pending/running/succeeded/failed)
- validation: schedule must be valid cron, retentionDays >= 1
AI Output (80% usable):
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: pgbackups.db.example.com
spec:
  group: db.example.com
  names:
    kind: PGBackup
    plural: pgbackups
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [dbName, schedule, s3Bucket]
              properties:
                dbName:
                  type: string
                  minLength: 1
                schedule:
                  type: string
                  pattern: '^(\*|([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])|\*\/([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])) (\*|([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])|\*\/([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])) (\*|([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])|\*\/([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])) (\*|([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])|\*\/([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])) (\*|([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])|\*\/([0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]))$'
                  description: CRON schedule for backup
                retentionDays:
                  type: integer
                  minimum: 1
                  default: 7
                s3Bucket:
                  type: string
            status:
              type: object
              properties:
                lastBackupTime:
                  type: string
                  format: date-time
                nextScheduledBackup:
                  type: string
                  format: date-time
                status:
                  type: string
                  enum: [pending, running, succeeded, failed]
      subresources:
        status: {}
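What needs fixing: The cron validation regex is both too strict and too loose. It accepts only *, */n, or a single number from 0 to 59 in every field, so it rejects valid ranges and lists such as 1-5 or 0,30 while accepting out-of-range values such as hour 59 or month 0. Validate the schedule in the controller with an actual cron-parser library instead. Status properties should have descriptions. Add a conditions array so the controller can report state transitions.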
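As a rough illustration of field-by-field validation, here is a minimal stdlib-only sketch. It is not a substitute for a maintained cron library. The names validateCron, validatePart, and fieldBounds are invented for this example, and it assumes plain five-field syntax with numeric values only (no names such as MON, no @daily macros, day of week limited to 0-6).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fieldBounds lists the inclusive range of each of the five cron fields:
// minute, hour, day of month, month, day of week.
var fieldBounds = [5][2]int{{0, 59}, {0, 23}, {1, 31}, {1, 12}, {0, 6}}

// validateCron checks a five-field cron expression. It accepts lists (1,15),
// ranges (1-5), and steps (*/10 or 0-30/5), and enforces per-field bounds.
func validateCron(expr string) error {
	fields := strings.Fields(expr)
	if len(fields) != 5 {
		return fmt.Errorf("expected 5 fields, got %d", len(fields))
	}
	for i, field := range fields {
		for _, part := range strings.Split(field, ",") {
			if err := validatePart(part, fieldBounds[i][0], fieldBounds[i][1]); err != nil {
				return fmt.Errorf("field %d (%q): %w", i+1, field, err)
			}
		}
	}
	return nil
}

// validatePart checks one list element: "*", "n", or "a-b", optionally
// followed by "/step".
func validatePart(part string, lo, hi int) error {
	base, step, hasStep := strings.Cut(part, "/")
	if hasStep {
		n, err := strconv.Atoi(step)
		if err != nil || n < 1 {
			return fmt.Errorf("invalid step %q", step)
		}
	}
	if base == "*" {
		return nil
	}
	start, end, isRange := strings.Cut(base, "-")
	a, err := strconv.Atoi(start)
	if err != nil {
		return fmt.Errorf("invalid value %q", start)
	}
	b := a
	if isRange {
		if b, err = strconv.Atoi(end); err != nil {
			return fmt.Errorf("invalid value %q", end)
		}
	}
	if a < lo || b > hi || a > b {
		return fmt.Errorf("%q is outside %d-%d", base, lo, hi)
	}
	return nil
}

func main() {
	for _, expr := range []string{"0 2 * * *", "*/15 * * * 1-5", "0 24 * * *"} {
		fmt.Printf("%-16q -> %v\n", expr, validateCron(expr))
	}
}
```

A check like this could run at the start of reconciliation or in a validating webhook, with the CRD pattern relaxed or removed so that the two do not disagree.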
What Fails: Reconciliation Logic
AI generates reconciliation loops that compile but lack production rigor.
AI Output (40% usable):
func (r *PGBackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	backup := &dbv1.PGBackup{}
	if err := r.Get(ctx, req.NamespacedName, backup); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Check if backup should run
	if shouldRunBackup(backup) {
		if err := runBackup(ctx, backup); err != nil {
			backup.Status.Status = "failed"
			r.Update(ctx, backup)
			return ctrl.Result{RequeueAfter: 5 * time.Minute}, err
		}
		backup.Status.Status = "succeeded"
		backup.Status.LastBackupTime = metav1.Now()
		r.Update(ctx, backup)
	}
	return ctrl.Result{}, nil
}
Critical missing pieces:
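- No finalizer for cleanup when a backup is deleted
- No leader election, so multiple operator replicas would reconcile simultaneously
- No owner reference from the PGBackup resource to the Job or Pod it creates
- Status update happens without checking whether the object changed, and it goes through r.Update rather than r.Status().Update, so with the status subresource enabled the status change is not persisted
- No idempotency guard, so the same backup can run twice
- Error handling returns the error but doesn’t distinguish transient from permanent failures
- RequeueAfter of 5 minutes is arbitrary, not based on the actual schedule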
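The last point can be sketched with nothing but the standard library: derive the requeue delay from the next scheduled backup time instead of hard-coding it. This is a minimal sketch under assumptions: requeueDelay, minRequeue, and mustTime are hypothetical names, and the 10-second floor is only an illustrative value.

```go
package main

import (
	"fmt"
	"time"
)

// minRequeue keeps an overdue backup from being requeued in a hot loop.
// The value is an illustrative assumption, not a recommendation.
const minRequeue = 10 * time.Second

// requeueDelay returns how long to wait before the next reconcile, given the
// current time and the next scheduled backup time.
func requeueDelay(now, next time.Time) time.Duration {
	if d := next.Sub(now); d > minRequeue {
		return d
	}
	return minRequeue
}

// mustTime parses an RFC 3339 timestamp and panics on bad input (demo helper).
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustTime("2024-01-01T01:30:00Z")
	fmt.Println(requeueDelay(now, mustTime("2024-01-01T02:00:00Z"))) // 30m0s
	fmt.Println(requeueDelay(now, mustTime("2024-01-01T01:00:00Z"))) // 10s
}
```

Inside Reconcile, the result would feed ctrl.Result{RequeueAfter: ...} in place of the fixed 5 minutes.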
What you must add:
// Add finalizer for cleanup
const finalizerName = "pgbackup.example.com/cleanup"
if backup.ObjectMeta.DeletionTimestamp != nil {
	if controllerutil.ContainsFinalizer(backup, finalizerName) {
		// Do cleanup (delete S3 objects, etc.)
		controllerutil.RemoveFinalizer(backup, finalizerName)
		if err := r.Update(ctx, backup); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
if !controllerutil.ContainsFinalizer(backup, finalizerName) {
	controllerutil.AddFinalizer(backup, finalizerName)
	if err := r.Update(ctx, backup); err != nil {
		return ctrl.Result{}, err
	}
}

// Owner reference linking
job := &batchv1.Job{}
// ...
job.SetOwnerReferences([]metav1.OwnerReference{
	*metav1.NewControllerRef(backup, dbv1.GroupVersion.WithKind("PGBackup")),
})

// Idempotency: check if job already exists before creating
err := r.Get(ctx, client.ObjectKey{Name: jobName, Namespace: backup.Namespace}, &batchv1.Job{})
if err == nil {
	// Job exists, don't recreate
	return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
}
if !apierrors.IsNotFound(err) {
	// The lookup itself failed; apierrors is k8s.io/apimachinery/pkg/api/errors
	return ctrl.Result{}, err
}

// Distinction between errors
if err := runBackup(ctx, backup); err != nil {
	if isTransient(err) {
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil // Retry soon
	}
	// Permanent error: report and don't requeue immediately
	backup.Status.Status = "failed"
	backup.Status.Error = err.Error() // assumes an Error field is added to the status
	// Write through the status subresource and surface a failed write
	if updateErr := r.Status().Update(ctx, backup); updateErr != nil {
		return ctrl.Result{}, updateErr
	}
	return ctrl.Result{RequeueAfter: 1 * time.Hour}, nil
}
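The isTransient helper called above is never defined in the snippet. One possible shape, using only the standard library, is sketched here. The sentinel errors ErrBucketNotFound and ErrInvalidCredentials and the demo value errUploadTimeout are hypothetical, and which errors count as transient depends entirely on what runBackup actually returns.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// Hypothetical sentinels standing in for failures that retrying cannot fix.
var (
	ErrBucketNotFound     = errors.New("s3 bucket not found")
	ErrInvalidCredentials = errors.New("invalid credentials")
)

// errUploadTimeout is a demo value: a deadline error wrapped with context.
var errUploadTimeout = fmt.Errorf("upload backup: %w", context.DeadlineExceeded)

// isTransient reports whether err is worth retrying soon.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	// Known permanent failures: retrying will not help.
	if errors.Is(err, ErrBucketNotFound) || errors.Is(err, ErrInvalidCredentials) {
		return false
	}
	// Deadlines and network timeouts usually clear up on their own.
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	// Unknown errors are treated as permanent so they surface in status
	// instead of being retried silently. This default is a design choice.
	return false
}

func main() {
	fmt.Println(isTransient(errUploadTimeout))  // true
	fmt.Println(isTransient(ErrBucketNotFound)) // false
}
```

Because the checks use errors.Is and errors.As, they still work when runBackup wraps the underlying error with additional context.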
Better Approach: Operator SDK Scaffolding
Don’t ask AI to write operators from scratch. Use Operator SDK to generate scaffolding, then ask AI to enhance specific reconciliation logic.
operator-sdk init --domain example.com --repo github.com/example/pgbackup-operator
operator-sdk create api --group db --version v1alpha1 --kind PGBackup --resource --controller
This generates:
- CRD structure
- Reconciler boilerplate
- Makefile, tests, Dockerfile
- RBAC rules stub
Then use AI: Ask Claude to implement specific reconciliation behavior—run backup job, update status, handle errors—within the generated Reconciler struct.
// AI fills in THIS part only:
func (r *PGBackupReconciler) reconcileBackup(ctx context.Context, backup *dbv1.PGBackup) error {
	// Actual backup logic here. AI can handle this well when the scaffold is provided.
	return nil
}
Testing: The Biggest Gap
AI rarely generates comprehensive controller tests. Example:
// AI generates basic test (30% coverage)
func TestReconcile_Success(t *testing.T) {
	r := &PGBackupReconciler{}
	_, err := r.Reconcile(context.Background(), ctrl.Request{})
	if err != nil {
		t.Fatal(err)
	}
	// Usually this ends here, with no assertion of controller behavior
}

// You must add (for real testing)
func TestReconcile_CreatesJobWithCorrectSpec(t *testing.T) {
	// Register built-in and PGBackup types so the fake client can store them.
	// runtime is k8s.io/apimachinery/pkg/runtime;
	// clientgoscheme is k8s.io/client-go/kubernetes/scheme.
	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme)
	_ = dbv1.AddToScheme(scheme)

	backup := &dbv1.PGBackup{
		ObjectMeta: metav1.ObjectMeta{Name: "test", Namespace: "default"},
		Spec:       dbv1.PGBackupSpec{Schedule: "0 2 * * *", RetentionDays: 7},
	}
	// Named c so it does not shadow the client package
	c := fake.NewClientBuilder().WithScheme(scheme).WithObjects(backup).Build()
	r := &PGBackupReconciler{Client: c, Scheme: scheme}

	_, err := r.Reconcile(context.Background(), ctrl.Request{
		NamespacedName: types.NamespacedName{Name: "test", Namespace: "default"},
	})
	assert.NoError(t, err)

	job := &batchv1.Job{}
	err = c.Get(context.Background(), types.NamespacedName{
		Name: "test-backup", Namespace: "default",
	}, job)
	assert.NoError(t, err)
	// batchv1.Job has no Schedule field (that belongs to CronJob),
	// so assert on the pod template instead.
	assert.NotEmpty(t, job.Spec.Template.Spec.Containers) // Verify job has a container to run
	assert.NotNil(t, metav1.GetControllerOf(job))         // Verify owner reference
}
Real-World Tool Comparison
Claude for CRD planning:
- Prompt: “Design a CRD for database migration tracking with spec and status”
- Result: Well-structured YAML, good validation rules
- Usability: 85%
Copilot for reconciler scaffold:
- Prompt: “Write a basic reconciliation function that creates a Job”
- Result: Boilerplate compiles, missing 70% of production requirements
- Usability: 35%
Cursor for full operator:
- Upload entire repository, ask it to implement backup scheduling
- Result: Understands codebase, generates contextually appropriate code
- Usability: 60%
Practical Workflow
- Design CRD manually or with AI: Use Claude to flesh out spec/status schema. It does this well.
- Generate scaffolding with Operator SDK: Don’t ask AI to do this. Use the tool.
- Ask AI for specific functions: “Implement the function that parses CRON schedule and returns next run time” or “Generate RBAC rules for this controller.”
- Review everything: Never trust AI-generated operator code in production without thorough code review and testing.
- Write tests yourself: AI can’t generate meaningful tests for controllers. Focus here.
- Use AI for documentation: API docs and architecture explanations where AI excels.
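To give a sense of scale for the cron function mentioned in the workflow above, here is a deliberately limited sketch. It is an illustration only: nextRun, matches, and mustTime are hypothetical names; each field supports only *, */n, or a single number; day-of-month and day-of-week are combined with AND, which differs from standard cron when both are restricted; and the search gives up after one year. For production, a maintained library such as github.com/robfig/cron/v3 is the safer choice.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// matches reports whether value satisfies one cron field. Supported forms are
// "*", "*/n" (counted from lo, the field's minimum), and a single number.
// Any other form never matches.
func matches(field string, value, lo int) bool {
	if field == "*" {
		return true
	}
	if strings.HasPrefix(field, "*/") {
		n, err := strconv.Atoi(strings.TrimPrefix(field, "*/"))
		return err == nil && n > 0 && (value-lo)%n == 0
	}
	n, err := strconv.Atoi(field)
	return err == nil && n == value
}

// nextRun returns the first whole minute strictly after `after` that matches
// the five-field expression, scanning minute by minute for up to one year.
func nextRun(expr string, after time.Time) (time.Time, error) {
	f := strings.Fields(expr)
	if len(f) != 5 {
		return time.Time{}, fmt.Errorf("expected 5 fields, got %d", len(f))
	}
	t := after.Truncate(time.Minute).Add(time.Minute)
	for limit := after.AddDate(1, 0, 0); t.Before(limit); t = t.Add(time.Minute) {
		if matches(f[0], t.Minute(), 0) && matches(f[1], t.Hour(), 0) &&
			matches(f[2], t.Day(), 1) && matches(f[3], int(t.Month()), 1) &&
			matches(f[4], int(t.Weekday()), 0) {
			return t, nil
		}
	}
	return time.Time{}, errors.New("no matching time within one year")
}

// mustTime parses an RFC 3339 timestamp and panics on bad input (demo helper).
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	next, err := nextRun("0 2 * * *", mustTime("2024-01-01T01:30:00Z"))
	fmt.Println(next, err)
}
```

A function of this size, with clear inputs and outputs, is also easy to cover with table-driven tests before it goes anywhere near a reconciler.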
Tools for Each Operator Stage
| Stage | Best Tool | Effort | Quality |
|---|---|---|---|
| CRD design | Claude | Low | High |
| Scaffolding | Operator SDK CLI | Low | High |
| Reconciliation logic | Copilot + human review | Medium | Medium |
| RBAC rules | Claude | Low | High |
| Testing | Manual + testing libraries | High | High |
| Documentation | Claude | Low | High |
Avoiding Common Mistakes
- Don’t generate entire operators end-to-end: AI lacks Kubernetes expertise. Scaffold first.
- Always add finalizers: AI frequently omits them. Your cleanup logic depends on them.
- Implement leader election: If running multiple replicas, AI won’t add this. Essential for production.
- Handle transient failures differently: Distinguish retry (RequeueAfter 30s) from backoff (1 hour). AI misses this distinction.
- Test owner reference behavior: Deleting a PGBackup resource shouldn’t orphan the Jobs it created. Verify with tests.
- Add proper RBAC: AI generates permissive rules. Tighten them for least privilege.
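The retry-versus-backoff point in the list above can also be expressed as a single capped exponential schedule. The following is a minimal sketch: backoff, baseDelay, and maxDelay are hypothetical names, and the attempt counter is assumed to be tracked somewhere durable, for example in the resource status.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 30 * time.Second // first retry, the 30s figure from the list above
	maxDelay  = time.Hour        // cap, the 1 hour figure from the list above
)

// backoff returns the delay before retry number attempt (0-based), doubling
// each time and never exceeding maxDelay.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	for attempt := 0; attempt <= 8; attempt++ {
		fmt.Printf("attempt %d: retry in %v\n", attempt, backoff(attempt))
	}
}
```

Note that controller-runtime already applies its own exponential backoff when Reconcile returns a non-nil error, so an explicit schedule like this matters mainly when you return a nil error together with RequeueAfter.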
The Bottom Line
AI tools excel at CRD design and can scaffold reconcilers. They struggle with production-ready error handling, testing, and Kubernetes-specific patterns. Use them to accelerate scaffolding and design, then build production hardening yourself.
For most teams: Use Operator SDK to scaffold, ask Claude to explain Kubernetes patterns you don’t understand, write reconciliation logic with Copilot suggestions, and manually build comprehensive tests. This cuts development time by roughly 40% compared to writing from scratch while maintaining code quality.
Estimated time: 2-3 weeks for a simple operator (backup, scaling), versus 4-5 weeks without AI. The gap narrows for complex operators because AI can’t make architectural decisions, and you have to make those either way.