Claude Code · Codex · Gemini CLI

Senior MLOps judgment for production ML systems.

Structured decision frameworks for the full ML lifecycle — from deciding whether to build, through architecture, evaluation, and deployment, to governance and monitoring.

$ claude plugin install https://github.com/deepak-karkala/production-mlops-skills

Works with Claude Code · Codex · OpenCode · MIT License

Claude Code — ml-project/

❯ /production-mlops:mlops-arch-review
Reading project context...
Analyzing ML system design...
── MLOPS MATURITY ASSESSMENT ───────────────
Level 1 → Manual pipeline, no automation
Target: Level 2 → Automated training + CD
── TWO-LANE TOPOLOGY ───────────────────────

Code lane	→	Standard CI/CD · git-based
Model lane	→	Model registry + eval gate

✓ architecture-review.html written
⚠ Lane conflation risk detected (P1)

Capabilities

11 skills · 4 specialist agents · 7 commands.

Skills activate automatically from plain language. Explicit slash commands are available for every workflow stage.

Problem Framing

1 skill

ML fit scoring across 5 dimensions before you commit
Label leakage and proxy trap detection
Go/no-go decision with documented rationale

mlops-opportunity-framing

System Design

1 skill

MLOps maturity assessment (Level 1→3)
Two-deployment-lane topology: code vs model
SLO-driven serving mode selection

production-ml-system-design

Data & Features

2 skills

Data lineage graph, schema contracts, freshness SLAs
Feature platform: offline/online stores, point-in-time correctness
Training-serving skew prevention and feature reuse policy

data-lineage-and-quality · feature-platform-design
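
Point-in-time correctness means a training row only sees feature values that were already known at the row's timestamp. The sketch below is one minimal way to express that lookup in Python. It is an illustration only, not the plugin's implementation: the function name and the (timestamp, value) history layout are assumptions.

```python
from bisect import bisect_right
from datetime import date


def point_in_time_lookup(feature_history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    `feature_history` is a hypothetical list of (timestamp, value) pairs,
    sorted by timestamp. Values recorded after `as_of` are never returned,
    which is what prevents future information from leaking into training.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)  # count of entries with ts <= as_of
    if idx == 0:
        return None  # nothing was known yet at `as_of`
    return feature_history[idx - 1][1]


# Illustrative data: number of support tickets per customer over time.
ticket_counts = [
    (date(2024, 1, 1), 2),
    (date(2024, 2, 1), 5),
    (date(2024, 3, 1), 9),
]

# A label observed on 15 Feb may only use the value known on 1 Feb.
value_for_training = point_in_time_lookup(ticket_counts, date(2024, 2, 15))
```

Using the same lookup for both the offline training join and the online read path is one way to keep the two consistent and reduce training-serving skew.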

Training Pipeline

1 skill

DAG design: Airflow vs Kubeflow vs SageMaker Pipelines
Idempotence and reproducibility enforcement
Experiment tracking and hyperparameter management

training-pipeline-and-orchestration

Evaluation

1 skill

5-layer testing pyramid: data, feature, unit, behavioral, integration
Slice-aware evaluation on critical subpopulations
3-stage promotion gate: dev → staging → prod

ml-test-and-eval-design

Deploy & Monitor

2 skills

Serving mode: batch vs online vs streaming vs embedded
Rollout strategy: shadow → canary → blue-green
Data drift, concept drift, and automated retraining triggers

deployment-readiness-for-ml · ml-observability-and-drift

Governance

1 skill

Risk tier classification and compliance mapping
Model card authoring and fairness audits
Approval workflows for regulated ML systems

ml-governance-and-production-readiness

Specialist Agents

4 agents

ML Systems Architect: complex multi-model decomposition
Evaluation Auditor: deep eval gap analysis
Reliability Engineer: rollout risk analysis
Platform Economics Analyst: infrastructure ROI modeling

ml-systems-architect · ml-evaluation-auditor · ml-reliability-engineer · ml-platform-economics-analyst

How it Works

A skill for every stage of the ML lifecycle.

From deciding whether to build, through data engineering and training, all the way to governance — skills activate automatically as you describe your problem.

01	Frame	/mlops-opportunity	ML fit scoring · Anti-pattern detection
02	Design	/mlops-plan	Maturity assessment · Two-lane topology
03	Data	/data-lineage	Schema contracts · Feature platform
04	Train	/training-pipeline	DAG design · Reproducibility
05	Evaluate	/mlops-evals	5-layer testing · Slice-aware evals
06	Deploy	/mlops-ops	Rollout strategy · Drift monitoring
07	Govern	/mlops-governance	Risk classification · Model cards · Fairness

Each stage has a dedicated skill with step-by-step workflows, decision tables, and artifact outputs.

The Difference

Structured ML judgment, not trial and error.

Every ML system decision — whether to build, how to architect, when to deploy — deserves a defensible answer backed by proven patterns, not guesswork.

Without the plugin

No systematic way to decide if you should even use ML.

Teams skip problem framing and jump straight to model training — only to discover months later that a rule-based system would have worked just as well.

ML fit assessment done ad hoc or skipped entirely
Label leakage and proxy traps surface only in production
No documented go/no-go decision for stakeholders

Without plugin

❯ Should we use ML for customer churn prediction?
Yes, ML is great for churn prediction! You could use XGBoost or a neural network. Make sure you have enough data. Consider using SHAP for explainability. Maybe try AutoML first!
✗ No fit assessment
✗ No label leakage check
✗ No go/no-go artifact

With production-mlops

ML fit score, anti-pattern scan, and a documented decision.

The plugin routes to the mlops-opportunity-framing skill automatically. It scores ML fit across 5 dimensions, checks for label leakage and proxy traps, and writes a go/no-go artifact.

5-dimension ML fit score with a concrete signal per dimension
Label leakage and proxy trap detection before training
Documented decision artifact stakeholders can review

With plugin

❯ Should we use ML for customer churn prediction?
→ routing to mlops-opportunity-framing...
── ML FIT SCORE ───────────────────────────

Pattern complexity	✓ High	(behavioral signals)
Label quality	✓ Clear	(30-day churn window)
Data volume	✓ 2M rows	(sufficient)
Leakage risk	⚠ Check	(support tickets)

◆ BUILD · ML fit: 4.1/5 · Watch: proxy leakage in support features

The anti-pattern it catches

Lane conflation silently breaks rollback and promotion.

The most common MLOps architectural mistake: mixing code and model deployment into a single pipeline. When a model needs rollback, the code rolls back too. The plugin catches this before you build it.

Two-deployment-lane topology: code lane vs model lane — always separate
Model registry as the boundary between training and serving
Eval gate required for promotion: model never goes to prod without a score

Architecture Review — Anti-pattern detection

❯ /production-mlops:mlops-arch-review
── ANTI-PATTERN SCAN ──────────────────────
✗ Lane conflation detected (P1)
  Model + code in same CI/CD pipeline → rollback will revert both together
✗ No eval gate before serving (P1)
  Model goes to prod without score threshold
── RECOMMENDED TOPOLOGY ───────────────────
→ Code lane: standard CI/CD (git-triggered)
→ Model lane: registry → eval gate → serving
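
To make the two-lane idea concrete, here is a minimal Python sketch of a model lane that promotes only through an eval gate and rolls back without touching the code version. The class names, the 0.82 threshold, and the version strings are all hypothetical and are not taken from the plugin.

```python
from dataclasses import dataclass, field


@dataclass
class ModelLane:
    """Hypothetical model lane: a registry history guarded by an eval gate."""
    min_score: float
    history: list = field(default_factory=list)  # promoted versions, newest last

    def promote(self, version, eval_score):
        # Eval gate: a model without a passing score never reaches serving.
        if eval_score is None or eval_score < self.min_score:
            return False
        self.history.append(version)
        return True

    def rollback(self):
        # Reverts the model only; the code lane is not involved.
        if len(self.history) < 2:
            raise RuntimeError("no earlier model version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def serving(self):
        return self.history[-1] if self.history else None


@dataclass
class Deployment:
    code_version: str      # owned by the code lane (standard CI/CD)
    model_lane: ModelLane  # owned by the model lane (registry + eval gate)


deployment = Deployment(code_version="app-1.4.0", model_lane=ModelLane(min_score=0.82))
deployment.model_lane.promote("churn-v1", eval_score=0.85)  # passes the gate
deployment.model_lane.promote("churn-v2", eval_score=0.88)  # passes the gate
deployment.model_lane.promote("churn-v3", eval_score=None)  # rejected: no score
deployment.model_lane.rollback()                            # back to churn-v1
```

After the rollback, the serving model is the earlier version while `code_version` is unchanged, which is the property that lane conflation breaks.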

See it in Action

Watch the skills work.

Select a workflow to see how the plugin responds — structured decision frameworks, not free-form conversation.

Opportunity Framing — Customer Churn Prediction

❯ Should we use ML for customer churn prediction?
→ routing to mlops-opportunity-framing...
── ML FIT SCORE ──────────────────────────────────

Pattern complexity	✓ High	(non-linear behavioral signals)
Label availability	✓ Clear	(30-day churn window defined)
Data volume	✓ 2M rows	(sufficient for training)
Feedback loop	~ 30 days	(acceptable latency)
Actionability	✓ High	(intervention exists)

── ANTI-PATTERN SCAN ─────────────────────────────
⚠ Proxy leakage risk: support ticket features may contain post-churn signals → check temporal order
✓ No label definition ambiguity detected
✓ No class imbalance blocker (12% churn rate)

◆ BUILD · ML fit: 4.1/5 · Watch: temporal leakage in support features
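
The page does not show how the plugin computes the fit score, so the following Python sketch is only one plausible shape for it. The dimension names come from the demo output above. The 1 to 5 rating scale, the equal weighting, the 3.5 build threshold, and the example ratings are assumptions, and the example is not meant to reproduce the 4.1 shown in the demo.

```python
from dataclasses import dataclass

# Dimension names follow the demo output; everything else is hypothetical.
DIMENSIONS = (
    "pattern_complexity",
    "label_availability",
    "data_volume",
    "feedback_loop",
    "actionability",
)


@dataclass(frozen=True)
class FitDecision:
    score: float
    verdict: str
    warnings: tuple


def score_ml_fit(ratings, warnings=(), build_threshold=3.5):
    """Average 1-5 ratings across the five dimensions and derive a verdict."""
    missing = [name for name in DIMENSIONS if name not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for name in DIMENSIONS:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} must be rated between 1 and 5")
    score = sum(ratings[name] for name in DIMENSIONS) / len(DIMENSIONS)
    verdict = "BUILD" if score >= build_threshold else "NO-GO"
    # Warnings are carried into the decision so they end up in the artifact.
    return FitDecision(round(score, 1), verdict, tuple(warnings))


decision = score_ml_fit(
    {
        "pattern_complexity": 5,
        "label_availability": 4,
        "data_volume": 4,
        "feedback_loop": 3,
        "actionability": 5,
    },
    warnings=["temporal leakage in support features"],
)
```

Keeping the warnings alongside the verdict reflects the demo's "BUILD, but watch" outcome: a passing score does not erase an open leakage risk.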

Architecture Review — Recommendation Ranking System

❯ /production-mlops:mlops-arch-review
Reading system context...
Delegating to ml-systems-architect...
── MLOPS MATURITY ────────────────────────────────
Current: Level 1 → Manual pipelines, ad-hoc retraining
Target: Level 2 → Automated training + CD
── TWO-LANE TOPOLOGY ─────────────────────────────

Code lane	✓	Standard CI/CD · git-triggered
Model lane	✗	Missing — code+model in same pipeline

── ANTI-PATTERNS DETECTED ────────────────────────
✗ P1 Lane conflation → rollback affects both code + model
✗ P1 No eval gate → model promotes without score threshold
⚠ P2 No feature store → training-serving skew risk
✓ architecture-review.html written to .agentic/artifacts/

Evaluation Design — Churn Model

❯ /production-mlops:mlops-evals
Designing evaluation strategy for churn model...
── 5-LAYER TESTING PYRAMID ───────────────────────

Layer 1	Data validation	✓ Schema + freshness checks
Layer 2	Feature tests	✓ Parity + leakage guards
Layer 3	Unit tests	✓ Preprocessing + transforms
Layer 4	Behavioral tests	⚠ Add: edge case assertions
Layer 5	Integration tests	✗ Missing end-to-end test

── SLICE-AWARE EVALUATION ────────────────────────
Critical slices: High-LTV customers · New users (<30 days)
Threshold: Recall ≥ 0.75 on high-LTV slice (non-negotiable)
── PROMOTION GATE ────────────────────────────────
Dev→Staging: AUC ≥ 0.82 + no data validation failures
Staging→Prod: Shadow mode ≥ 7 days + business KPI neutral
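
The slice threshold and the Dev→Staging gate from this output can be sketched in plain Python. The thresholds (recall ≥ 0.75 on the high-LTV slice, AUC ≥ 0.82, no data validation failures) are taken from the demo. The function names, the row layout, and the choice to fold the slice check into the Dev→Staging gate are assumptions made for illustration.

```python
from collections import defaultdict


def recall(pairs):
    """Recall over (y_true, y_pred) pairs; NaN when the slice has no positives."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def recall_by_slice(rows):
    """`rows` is a hypothetical iterable of (slice_name, y_true, y_pred)."""
    grouped = defaultdict(list)
    for slice_name, y_true, y_pred in rows:
        grouped[slice_name].append((y_true, y_pred))
    return {name: recall(pairs) for name, pairs in grouped.items()}


def dev_to_staging_gate(auc, slice_recalls, validation_failures,
                        min_auc=0.82, critical_slice="high_ltv", min_recall=0.75):
    """Return (passed, reasons). A missing or NaN slice recall fails the gate."""
    reasons = []
    if auc < min_auc:
        reasons.append(f"AUC {auc:.2f} below {min_auc}")
    if validation_failures > 0:
        reasons.append(f"{validation_failures} data validation failure(s)")
    slice_recall = slice_recalls.get(critical_slice, float("nan"))
    if not slice_recall >= min_recall:  # also true when slice_recall is NaN
        reasons.append(f"recall on {critical_slice} below {min_recall}")
    return (not reasons, reasons)


rows = [
    ("high_ltv", 1, 1), ("high_ltv", 1, 1), ("high_ltv", 1, 1),
    ("high_ltv", 1, 0), ("high_ltv", 0, 0),
    ("new_user", 1, 0), ("new_user", 1, 1),
]
slice_recalls = recall_by_slice(rows)
passed, reasons = dev_to_staging_gate(0.84, slice_recalls, validation_failures=0)
```

Returning the list of reasons along with the boolean keeps the gate decision reviewable, which matters when a strong aggregate AUC hides a weak critical slice.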

Deployment Readiness — Churn Model v2

❯ /production-mlops:mlops-ops
Checking deployment readiness for churn-model-v2...
── SERVING MODE ──────────────────────────────────
Batch inference → daily scoring (latency not critical)
Artifact: ONNX → framework-independent, versioned
── ROLLOUT STRATEGY ──────────────────────────────

Phase 1	Shadow mode	7 days · compare vs current model
Phase 2	Canary 10%	3 days · monitor business KPIs
Phase 3	Full rollout	On KPI threshold met

── DRIFT MONITORS ────────────────────────────────
✓ Input drift: PSI on 12 top features (threshold 0.2)
✓ Output drift: prediction distribution shift alert
✓ Business KPI: weekly churn rate proxy (±15%)
Retraining trigger: PSI > 0.2 OR KPI drift > 15%
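
For readers unfamiliar with PSI (population stability index), here is a minimal Python sketch of the input-drift check and the retraining trigger shown above. The 0.2 PSI threshold and the 15% KPI band come from the demo. The equal-width binning, the epsilon floor for empty bins, and the function names are assumptions, and the plugin may compute PSI differently.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` against the `expected` baseline.

    Bins are equal-width over the baseline range (an assumption; quantile
    bins are also common). Empty bins are floored at `eps` to avoid log(0).
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline has no spread; cannot bin")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def should_retrain(feature_psi, kpi_relative_change,
                   psi_threshold=0.2, kpi_threshold=0.15):
    """Trigger when any feature's PSI or the KPI change exceeds its threshold."""
    worst_psi = max(feature_psi.values(), default=0.0)
    return worst_psi > psi_threshold or abs(kpi_relative_change) > kpi_threshold


baseline = [i / 100 for i in range(100)]       # illustrative training distribution
shifted = [0.5 + i / 200 for i in range(100)]  # illustrative production sample
drift = psi(baseline, shifted)
retrain = should_retrain({"tenure_days": drift}, kpi_relative_change=0.02)
```

The feature name `tenure_days` is a placeholder. In practice the dictionary would hold one PSI value per monitored feature, and the trigger fires on the worst of them.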

What Gets Produced

Shareable artifacts, not just chat.

Flagship skills write structured HTML reports to disk — stakeholder-ready, version-controllable, reopenable anytime. These are real outputs from the plugin.

Get Started

Up and running in three steps.

No API keys. No config required. Skills activate automatically when you describe your problem.

Install the plugin

One command in your terminal. Works with Claude Code, Codex, and OpenCode.

claude plugin install github:deepak-karkala/production-mlops-skills

Run setup in your project

Optional but recommended — initializes artifact paths and ML framework context.

/production-mlops:setup-production-mlops

Describe your problem

Skills auto-route from plain language. Or use explicit slash commands for any workflow.

Should we use ML for this use case?

Full Skill Inventory

Everything that's included.

11 skills, each with step-by-step workflows, decision tables, and explicit scope boundaries.

mlops-opportunity-framing: ML fit scoring (5 dimensions), label leakage detection, proxy trap scan, go/no-go decision with artifact

production-ml-system-design: MLOps maturity assessment (Level 1-3), two-deployment-lane topology, SLO-driven serving mode selection

data-lineage-and-quality-design: Data sourcing, lineage graph, quality gates, schema registry, batch vs streaming decision, serving-path contracts

feature-platform-architecture: Offline/online feature stores, point-in-time correctness, training-serving parity audit, feature reuse policy

training-pipeline-and-orchestration: Training DAG design, orchestration platform selection (Airflow/Kubeflow/SageMaker), idempotence, experiment tracking

ml-test-and-eval-design: 5-layer testing pyramid, slice-aware evaluation, 3-stage promotion gate policy (dev→staging→prod), behavioral tests

deployment-readiness-for-ml: Serving mode selection (batch/online/streaming/embedded), artifact packaging, rollout strategy, pre-deployment gates

ml-observability-and-drift-response: Data drift, concept drift, output distribution shift, business KPI proxies, automated retraining triggers

ml-governance-and-production-readiness: Risk tier classification, model card authoring, fairness audits, approval workflows, compliance mapping

mlops-handoff: Structured handoff package: system status, lane status, open risks, owner assignments, artifact index

setup-production-mlops: One-time repo config: artifact paths, ML framework selection, cloud platform settings, config.yml creation

Ready to build production-grade ML systems?