Claude Code · Codex · Gemini CLI

Senior MLOps judgment for production ML systems.

Structured decision frameworks for the full ML lifecycle — from deciding whether to build, through architecture, evaluation, and deployment, to governance and monitoring.

$ claude plugin install https://github.com/deepak-karkala/production-mlops-skills

Works with Claude Code · Codex · OpenCode · MIT License

Claude Code — ml-project/
/production-mlops:mlops-arch-review
Reading project context...
Analyzing ML system design...
── MLOPS MATURITY ASSESSMENT ───────────────
Level 1 → Manual pipeline, no automation
Target: Level 2 → Automated training + CD
── TWO-LANE TOPOLOGY ───────────────────────
Code lane    Standard CI/CD · git-based
Model lane   Model registry + eval gate
architecture-review.html written
Lane conflation risk detected (P1)

11 skills · 4 specialist agents · 7 commands.

Skills activate automatically from plain language. Explicit slash commands available for every workflow stage.

Problem Framing

1 skill
  • ML fit scoring across 5 dimensions before you commit
  • Label leakage and proxy trap detection
  • Go/no-go decision with documented rationale
mlops-opportunity-framing

System Design

1 skill
  • MLOps maturity assessment (Level 1→3)
  • Two-deployment-lane topology: code vs model
  • SLO-driven serving mode selection
production-ml-system-design

Data & Features

2 skills
  • Data lineage graph, schema contracts, freshness SLAs
  • Feature platform: offline/online stores, point-in-time correctness
  • Training-serving skew prevention and feature reuse policy
data-lineage-and-quality-design · feature-platform-architecture

Training Pipeline

1 skill
  • DAG design: Airflow vs Kubeflow vs SageMaker Pipelines
  • Idempotence and reproducibility enforcement
  • Experiment tracking and hyperparameter management
training-pipeline-and-orchestration

Evaluation

1 skill
  • 5-layer testing pyramid: data, feature, unit, behavioral, integration
  • Slice-aware evaluation on critical subpopulations
  • 3-stage promotion gate: dev → staging → prod
ml-test-and-eval-design

Deploy & Monitor

2 skills
  • Serving mode: batch vs online vs streaming vs embedded
  • Rollout strategy: shadow → canary → blue-green
  • Data drift, concept drift, and automated retraining triggers
deployment-readiness-for-ml · ml-observability-and-drift-response

Governance

1 skill
  • Risk tier classification and compliance mapping
  • Model cards authoring and fairness audits
  • Approval workflows for regulated ML systems
ml-governance-and-production-readiness

Specialist Agents

4 agents
  • ML Systems Architect: complex multi-model decomposition
  • Evaluation Auditor: deep eval gap analysis
  • Reliability Engineer: rollout risk analysis
  • Platform Economics Analyst: infrastructure ROI modeling
ml-systems-architect · ml-evaluation-auditor · ml-reliability-engineer · ml-platform-economics-analyst

A skill for every stage of the ML lifecycle.

From deciding whether to build, through data engineering and training, all the way to governance — skills activate automatically as you describe your problem.

01 Frame · /mlops-opportunity · ML fit scoring, anti-pattern detection
02 Design · /mlops-plan · Maturity assessment, two-lane topology
03 Data · /data-lineage · Schema contracts, feature platform
04 Train · /training-pipeline · DAG design, reproducibility
05 Evaluate · /mlops-evals · 5-layer testing, slice-aware evals
06 Deploy · /mlops-ops · Rollout strategy, drift monitoring
07 Govern · /mlops-governance · Risk classification, model cards, fairness

Each stage has a dedicated skill with step-by-step workflows, decision tables, and artifact outputs.

Structured ML judgment, not trial and error.

Every ML system decision — whether to build, how to architect, when to deploy — deserves a defensible answer backed by proven patterns, not guesswork.

Without the plugin

No systematic way to decide if you should even use ML.

Teams skip problem framing and jump straight to model training — only to discover months later that a rule-based system would have worked just as well.

  • ML fit assessment done ad hoc or skipped entirely
  • Label leakage and proxy traps surface only in production
  • No documented go/no-go decision for stakeholders
Without plugin
Should we use ML for customer churn prediction?
Yes, ML is great for churn prediction! You could use XGBoost or a neural network. Make sure you have enough data. Consider using SHAP for explainability. Maybe try AutoML first!
✗ No fit assessment
✗ No label leakage check
✗ No go/no-go artifact
With production-mlops

ML fit score, anti-pattern scan, and a documented decision.

The plugin routes to the mlops-opportunity-framing skill automatically. It scores ML fit across 5 dimensions, checks for label leakage and proxy traps, and writes a go/no-go artifact.

  • 5-dimension ML fit score with concrete signal per dimension
  • Label leakage and proxy trap detection before training
  • Documented decision artifact stakeholders can review
With plugin
Should we use ML for customer churn prediction?
→ routing to mlops-opportunity-framing...
── ML FIT SCORE ───────────────────────────
Pattern complexity   ✓ High (behavioral signals)
Label quality        ✓ Clear (30-day churn window)
Data volume          ✓ 2M rows (sufficient)
Leakage risk         ⚠ Check (support tickets)
◆ BUILD · ML fit: 4.1/5 · Watch: proxy leakage in support features
The anti-pattern it catches

Lane conflation silently breaks rollback and promotion.

The most common MLOps architectural mistake: mixing code and model deployment into a single pipeline. When a model needs rollback, the code rolls back too. The plugin catches this before you build it.

  • Two-deployment-lane topology: code lane vs model lane — always separate
  • Model registry as the boundary between training and serving
  • Eval gate required for promotion: a model never reaches prod without meeting a score threshold
Architecture Review — Anti-pattern detection
/production-mlops:mlops-arch-review
── ANTI-PATTERN SCAN ──────────────────────
Lane conflation detected (P1)
  Model + code in same CI/CD pipeline → rollback will revert both together
No eval gate before serving (P1)
  Model goes to prod without score threshold
── RECOMMENDED TOPOLOGY ───────────────────
Code lane: standard CI/CD (git-triggered)
Model lane: registry → eval gate → serving
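To make the two-lane idea concrete, here is a minimal Python sketch of a model lane with an eval gate. It is an illustration under stated assumptions, not the plugin's implementation: `ModelRegistry`, `ModelVersion`, the version names, and the `app-1.4.0` release string are all hypothetical, and the 0.82 threshold is borrowed from the demo promotion gate shown elsewhere on this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelVersion:
    name: str
    eval_score: Optional[float] = None  # None: the model was never evaluated


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry; the model lane promotes and rolls back here."""
    min_score: float
    history: List[ModelVersion] = field(default_factory=list)  # newest last

    def promote(self, candidate: ModelVersion) -> bool:
        # Eval gate: reject any model without a score at or above the threshold.
        if candidate.eval_score is None or candidate.eval_score < self.min_score:
            return False
        self.history.append(candidate)
        return True

    def rollback(self) -> ModelVersion:
        # Model-lane rollback: drop the newest model only; code is not touched.
        if len(self.history) < 2:
            raise RuntimeError("no earlier model version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def serving(self) -> Optional[ModelVersion]:
        return self.history[-1] if self.history else None


code_release = "app-1.4.0"  # code lane: shipped by ordinary git-triggered CI/CD
registry = ModelRegistry(min_score=0.82)

registry.promote(ModelVersion("churn-v1", eval_score=0.84))
registry.promote(ModelVersion("churn-v2", eval_score=0.86))
blocked = registry.promote(ModelVersion("churn-v3"))  # unevaluated, so rejected

restored = registry.rollback()  # churn-v2 -> churn-v1; code_release is unchanged
print(blocked, restored.name, code_release)  # False churn-v1 app-1.4.0
```

Because the registry is the only thing the rollback touches, reverting a model never reverts the application release, which is the failure mode lane conflation creates.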

Watch the skills work.

Select a workflow to see how the plugin responds — structured decision frameworks, not free-form conversation.

Opportunity Framing — Customer Churn Prediction
Should we use ML for customer churn prediction?
→ routing to mlops-opportunity-framing...
── ML FIT SCORE ──────────────────────────────────
Pattern complexity   ✓ High (non-linear behavioral signals)
Label availability   ✓ Clear (30-day churn window defined)
Data volume          ✓ 2M rows (sufficient for training)
Feedback loop        ~ 30 days (acceptable latency)
Actionability        ✓ High (intervention exists)
── ANTI-PATTERN SCAN ─────────────────────────────
Proxy leakage risk: support ticket features may contain post-churn signals → check temporal order
No label definition ambiguity detected
No class imbalance blocker (12% churn rate)
◆ BUILD · ML fit: 4.1/5 · Watch: temporal leakage in support features
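The "check temporal order" advice in the scan above can be sketched in a few lines of Python. This is a hedged illustration, not how the skill itself works: the function `find_temporal_leaks`, the feature names, and the dates are hypothetical. The idea is only that any feature fed by an event later than the prediction time could not have been known when the prediction was made.

```python
from datetime import datetime


def find_temporal_leaks(feature_events, prediction_time):
    """Return names of features whose newest source event is after prediction_time.

    feature_events maps a feature name to the timestamp of the latest event
    that feeds it. Later-than-cutoff events are leakage candidates.
    """
    return sorted(
        name
        for name, event_time in feature_events.items()
        if event_time > prediction_time
    )


# Hypothetical customer snapshot: the prediction is made on 1 March.
prediction_time = datetime(2024, 3, 1)
feature_events = {
    "logins_last_30d": datetime(2024, 2, 28),
    "plan_tier": datetime(2024, 1, 15),
    "support_tickets_count": datetime(2024, 3, 9),  # ticket filed after the cutoff
}

leaks = find_temporal_leaks(feature_events, prediction_time)
print(leaks)  # ['support_tickets_count']
```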
Architecture Review — Recommendation Ranking System
/production-mlops:mlops-arch-review
Reading system context...
Delegating to ml-systems-architect...
── MLOPS MATURITY ────────────────────────────────
Current: Level 1 → Manual pipelines, ad-hoc retraining
Target: Level 2 → Automated training + CD
── TWO-LANE TOPOLOGY ─────────────────────────────
Code lane    Standard CI/CD · git-triggered
Model lane   Missing: code+model in same pipeline
── ANTI-PATTERNS DETECTED ────────────────────────
✗ P1 Lane conflation: rollback affects both code + model
✗ P1 No eval gate → model promotes without score threshold
⚠ P2 No feature store → training-serving skew risk
architecture-review.html written to .agentic/artifacts/
Evaluation Design — Churn Model
/production-mlops:mlops-evals
Designing evaluation strategy for churn model...
── 5-LAYER TESTING PYRAMID ───────────────────────
Layer 1   Data validation     ✓ Schema + freshness checks
Layer 2   Feature tests       ✓ Parity + leakage guards
Layer 3   Unit tests          ✓ Preprocessing + transforms
Layer 4   Behavioral tests    ⚠ Add: edge case assertions
Layer 5   Integration tests   ✗ Missing end-to-end test
── SLICE-AWARE EVALUATION ────────────────────────
Critical slices: High-LTV customers · New users (<30 days)
Threshold: Recall ≥ 0.75 on high-LTV slice (non-negotiable)
── PROMOTION GATE ────────────────────────────────
Dev→Staging: AUC ≥ 0.82 + no data validation failures
Staging→Prod: Shadow mode ≥ 7 days + business KPI neutral
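A slice-aware gate like the one in this demo can be sketched as follows. The thresholds (AUC ≥ 0.82, recall ≥ 0.75 on the high-LTV slice, zero data validation failures) come from the output above; everything else is an assumption made for illustration, including the function names, the row layout, the toy data, and the choice to combine all three checks with AND in a single gate.

```python
def recall(labels, predictions):
    """Recall = true positives / actual positives; 0.0 when there are no positives."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    actual_pos = sum(1 for y in labels if y == 1)
    return true_pos / actual_pos if actual_pos else 0.0


def slice_recall(rows, slice_name):
    """Recall restricted to the rows tagged with slice_name."""
    subset = [r for r in rows if r["slice"] == slice_name]
    return recall([r["label"] for r in subset], [r["pred"] for r in subset])


def passes_dev_to_staging(auc, data_validation_failures, high_ltv_recall):
    # Thresholds copied from the demo output; AND-ing them is an assumption.
    return auc >= 0.82 and data_validation_failures == 0 and high_ltv_recall >= 0.75


# Hypothetical labelled predictions, tagged by customer slice.
rows = [
    {"slice": "high_ltv", "label": 1, "pred": 1},
    {"slice": "high_ltv", "label": 1, "pred": 1},
    {"slice": "high_ltv", "label": 1, "pred": 1},
    {"slice": "high_ltv", "label": 1, "pred": 0},
    {"slice": "high_ltv", "label": 0, "pred": 0},
    {"slice": "new_user", "label": 1, "pred": 0},
    {"slice": "new_user", "label": 0, "pred": 0},
]

high_ltv = slice_recall(rows, "high_ltv")  # 3 of 4 churners caught -> 0.75
print(high_ltv, passes_dev_to_staging(0.85, 0, high_ltv))  # 0.75 True
```

The point of the slice check is that a model with a healthy aggregate AUC can still be blocked when recall on a critical subpopulation falls below its floor.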
Deployment Readiness — Churn Model v2
/production-mlops:mlops-ops
Checking deployment readiness for churn-model-v2...
── SERVING MODE ──────────────────────────────────
Batch inference → daily scoring (latency not critical)
Artifact: ONNX → framework-independent, versioned
── ROLLOUT STRATEGY ──────────────────────────────
Phase 1   Shadow mode    7 days · compare vs current model
Phase 2   Canary 10%     3 days · monitor business KPIs
Phase 3   Full rollout   On KPI threshold met
── DRIFT MONITORS ────────────────────────────────
Input drift: PSI on 12 top features (threshold 0.2)
Output drift: prediction distribution shift alert
Business KPI: weekly churn rate proxy (±15%)
Retraining trigger: PSI > 0.2 OR KPI drift > 15%
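For readers unfamiliar with PSI, the drift trigger above can be sketched in plain Python. This is a rough illustration, not the plugin's monitor: it assumes equal-width bins over the baseline range, clamps out-of-range values into the edge bins, and uses a small `eps` to avoid taking the log of zero. The function names and sample data are hypothetical; the 0.2 and 15% limits are taken from the demo output.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp into the edge bins
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(feature_psi, kpi_drift, psi_limit=0.2, kpi_limit=0.15):
    # Mirrors the demo trigger: PSI > 0.2 OR KPI drift > 15% in either direction.
    return feature_psi > psi_limit or abs(kpi_drift) > kpi_limit


baseline = list(range(100))            # hypothetical training-time feature values
shifted = [v + 50 for v in baseline]   # same feature after a large upward shift

stable = psi(baseline, baseline)       # identical samples give a PSI of 0.0
drifted = psi(baseline, shifted)
print(should_retrain(stable, 0.02), should_retrain(drifted, 0.02))  # False True
```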

Shareable artifacts, not just chat.

Flagship skills write structured HTML reports to disk — stakeholder-ready, version-controllable, reopenable anytime. These are real outputs from the plugin.

Up and running in three steps.

No API keys. No config required. Skills activate automatically when you describe your problem.

01

Install the plugin

One command in your terminal. Works with Claude Code, Codex, and OpenCode.

claude plugin install github:deepak-karkala/production-mlops-skills
02

Run setup in your project

Optional but recommended — initializes artifact paths and ML framework context.

/production-mlops:setup-production-mlops
03

Describe your problem

Skills auto-route from plain language. Or use explicit slash commands for any workflow.

Should we use ML for this use case?

Everything that's included.

11 skills, each with step-by-step workflows, decision tables, and explicit scope boundaries.

mlops-opportunity-framing: ML fit scoring (5 dimensions), label leakage detection, proxy trap scan, go/no-go decision with artifact
production-ml-system-design: MLOps maturity assessment (Level 1-3), two-deployment-lane topology, SLO-driven serving mode selection
data-lineage-and-quality-design: Data sourcing, lineage graph, quality gates, schema registry, batch vs streaming decision, serving-path contracts
feature-platform-architecture: Offline/online feature stores, point-in-time correctness, training-serving parity audit, feature reuse policy
training-pipeline-and-orchestration: Training DAG design, orchestration platform selection (Airflow/Kubeflow/SageMaker), idempotence, experiment tracking
ml-test-and-eval-design: 5-layer testing pyramid, slice-aware evaluation, 3-stage promotion gate policy (dev→staging→prod), behavioral tests
deployment-readiness-for-ml: Serving mode selection (batch/online/streaming/embedded), artifact packaging, rollout strategy, pre-deployment gates
ml-observability-and-drift-response: Data drift, concept drift, output distribution shift, business KPI proxies, automated retraining triggers
ml-governance-and-production-readiness: Risk tier classification, model card authoring, fairness audits, approval workflows, compliance mapping
mlops-handoff: Structured handoff package: system status, lane status, open risks, owner assignments, artifact index
setup-production-mlops: One-time repo config: artifact paths, ML framework selection, cloud platform settings, config.yml creation

Ready to build production-grade ML systems?

11 skills. 4 specialist agents. Decision frameworks for the full ML lifecycle.

$ claude plugin install https://github.com/deepak-karkala/production-mlops-skills
