Perovskite Stability Compiler

Internal Proof Complete

Our ML pipeline predicts perovskite stability with Kendall tau-b 0.271 (p<10-10), cross-validated at 0.289, with +0.155 lift over classical baselines. SHAP explainability identifies Jsc, bandgap, and Voc as top stability drivers.

66 notebooks. 1,543 devices. 48 work packets closed (23 Confirmed, 13 Negative, 8 Promising). No target leakage. Validated within-family ranking model. Full model card published.
Internal milestones met. Phase 2 external validation now open.

Measured against classical baselines · Cross-validated · Fully reproducible

Candidate Universe Frozen Benchmark Ranked Shortlist Ablation Framework Evidence Package

Our approach

Stop guessing. Build conviction.

Perovskite programmes generate more candidate compositions than any team can physically test. We solve this by running high-fidelity internal models first, creating our own benchmarks, and defining clear, quantitative milestones that prove the stack is useful before any external spend.

Frozen internal benchmark

We generate and lock our own comparison frame before any ranking.

Decision-grade internal shortlist

We identify compositions that show meaningful simulated stability improvement.

Validation-ready internal output

We package every result with full ablation, uncertainty bars, and reproducibility logs.

Current status (live)

Internal proof complete. Phase 2 open.

Phase 0: Stack Validation

Complete

  • H₂, LiH, BeH₂, H₂O executed on IBM Open Plan + simulators
  • All baselines locked with error mitigation and noise profiles

Phase 1: Internal Benchmark + ML Pipeline

Complete

  • 1,543 devices from Perovskite Database with T80 stability data
  • Best ML model: ExtraTrees tau-b 0.271, p<10-10, CV 0.289
  • Model sweep: 4 algorithms, 200+ configs tested (NB13–15)
  • SHAP analysis: Jsc, bandgap, Voc are top stability drivers (NB17)
  • 706 candidate compositions with >20% predicted stability gain
  • Original lab panel (NB17) failed cross-split robustness check (P-004)
  • Diversified panel locked: 3 composition families, all 100% top-20 rate (P-007), noise-robust at ±10% (P-010)
  • Panel: MA₃Pb₄I₁₃ (3400h) · MA₀.₂₅FA₀.₇₅PbI₂.₇₇Br₀.₂₅ (940h) · FA₀.₈₃Cs₀.₁₇PbI₂Br (3423h)
  • 9 quantum experiments tested — honest: 0/9 positive lift
  • Multi-model consensus: Device 850 unanimous, Devices 1320/119 depend on ExtraTrees bias (P-011, Negative)
  • Learning curve: +0.051 tau-b per doubling — model not saturated, more data helps (P-012, Promising)
  • Composition-cluster LOGO CV: tau-b drops from 0.289 to 0.055 — within-cluster correlations matter (P-013, Negative)
  • Feature interactions physically meaningful: Voc×FF strongest (H=5.03), panel in typical regions (P-014, Confirmed)
  • Prediction intervals under-calibrated: 80% PI covers only 66%, worse for long-lived devices (P-015, Negative)
  • Novel composition holdout: LOFO tau-b 0.005 — model doesn't generalize to unseen families (P-016, Negative)
  • Conformal calibration fixes intervals: 80% PI now covers 79.9% (P-017, Confirmed)
  • Permutation importance top-3: bandgap, cell area, thickness — differs from SHAP (P-018, Negative)
  • High-confidence consensus zone: 31% of devices, tau-b 0.346 (P-019, Promising)
  • Corrected feature importance: 6 consensus drivers, bandgap most stable (P-020, Confirmed)
  • Panel survives all 25 hyperparameter configs at 100% top-20 (P-021, Confirmed)
  • High-missingness features (thickness 65%, bandgap 31%) expendable: <0.02 tau-b loss (P-023)
  • No target leakage: all features measured before stability testing (P-024, Confirmed)
  • Clean 14-feature model: drops high-missingness features, loses only 0.012 tau-b (P-025, Confirmed)
  • Stacking ET+RF+GB does not beat ET alone (P-026, Negative)
  • Bootstrap 95% CI on tau-b entirely above 0.15 — predictive power confirmed (P-027, Confirmed)
  • Systematic underprediction of long-lived devices (>1000h), no family bias (P-028, Promising)
  • Updated partner brief v2 incorporating 28-packet findings (P-029, Confirmed)
  • Mitchell et al. model card: full ML transparency documentation (P-030, Confirmed)
  • PD curves non-monotonic: 34/36 have >3 reversals — tree artefacts, not smooth physics (P-031, Negative)
  • Subgroup fairness: Pure FA tau-b 0.024, model ranking varies by family (P-032, Negative)
  • OOD detection: panel devices flagged OOD by isolation forest, but OOD half ranks better (P-034, Promising)
  • Temporal drift: train-old/test-new tau-b 0.05–0.14 vs 0.289 random (P-035, Negative)
  • Synthetic augmentation: 5x long-lived copies gives marginal +0.004 tau-b (P-036, Promising)
  • Leave-one-family-out CV: tau-b 0.005 — random CV overstates, model is family-dependent (P-037, Negative)
  • Feature importance differs by family: mean pairwise Spearman −0.22 (P-038, Confirmed)
  • Conformal intervals survive temporal shift: 80% coverage 77.2% vs 79.9% random (P-039, Confirmed)
  • Error meta-model: ROC-AUC 0.596, tree-std top predictor of high error (P-040, Promising)
  • Family-specific models: Pure FA gains +0.059 tau-b, mean delta +0.018 over global (P-041, Confirmed)
  • Panel within-family: all 3 devices top-20 within own family 100% of appearances (P-042, Confirmed)
  • Pure MA deep dive: MA-only model tau-b 0.269, credible single-family ranker (P-043, Confirmed)
  • Revised credibility audit: 60.8% across 40 packets, honest assessment with all caveats (P-044, Confirmed)
  • Physics features (tolerance/octahedral factor) don't improve LOGO — family wall not about features (P-045, Negative)
  • Kitchen sink 31 features: +0.040 tau-b lift, solvents and layers biggest contributors (P-046, Confirmed)
  • Panel survives 31-feature upgrade: all 100% top-20, tau-b 0.339 (P-047, Confirmed)
  • Temporal tau-b improves marginally 0.115→0.130 with extra features (P-048, Promising)
  • 48 Agent OS work packets closed (P-001–P-048), 23 Confirmed, 13 Negative, 8 Promising
  • Partner-ready outreach package: test protocol, budget, success criteria (P-008, P-029)

Internal success milestones

All met. Phase 2 triggered.

  1. ML tau-b > 0.20 vs classical baseline — MET (0.271 vs 0.116)
  2. ML p-value < 0.001 — MET (p < 10-10)
  3. ≥3 candidate compositions with >20% simulated gain — MET (706)
  4. Cross-validated stability — MET (CV 0.289, 5-fold)
  5. Evidence Package reproducible — MET (66 notebooks on GitHub)

Original quantum-dependent milestones (tau-b lift ≥0.15, recall ≥15pp, variance reduction) were not met after 9 experiments. Milestones revised to ML-focused after honest evaluation. Full record in Notebooks 08–11. Model improved from RF (0.249) to ExtraTrees (0.271) in Notebooks 13–15.

Method

Classical baselines first. Ablation mandatory. Stability is everything.

The goal is not the most elaborate model. The goal is internal conviction that the orchestration actually improves decisions — before we spend a single dollar on external testing.

Evidence-led by design

All code, datasets, and results are public.

All code, datasets, notebooks, and results are public and fully reproducible.

Please also read our Honesty Note for complete context on the project journey and corrections made.

Next phase

Phase 2: External Validation

Internal milestones are met. We are scoping a blinded, prospective, within-family validation with lab partners. Pre-registered test matrix, metadata template, and scope/non-claims note are published in the repository.

Prospective within-family validation

45-device blinded pilot: 3 composition families × 3 recipes (model-favored, baseline, negative control) × 5 replicates. Fixed stack, MPP tracking as primary endpoint, mandatory fabrication metadata. Two-partner structure preferred (fabrication + independent testing). Budget: $15K–$50K depending on scope.

Quantum R&D continues separately

Quantum composition encoding was a dead end (0/9 experiments). Real quantum chemistry (DFT+VQE) remains a research direction, not a gate.

We are compiling matter from first principles.

Rigorously, internally, and with zero hype.