AOMTS

AOMTS (Aurora Optimized Multi-Token Superposition) is a controlled ablation grid I ran to measure whether Token Superposition Training (TST) and Multi-Token Prediction (MTP) improve pretraining quality, alone and combined, before scaling the winners to larger models.

All nine checkpoints share the same ~100M-parameter backbone (12 layers, d=512, 8 heads, SwiGLU FFN, 16k BPE, 2,048 context, RMSNorm, RoPE, tied embeddings) and an equal 3,000-step budget on open-index/open-wikipedia-markdown. Aurora optimizes 2D matrix weights; AdamW handles embeddings and norms. TST runs use bag size s=6 (900 superposition steps compressing 12,288 raw tokens into 2,048 positions, then 2,100 recovery steps with optimizer/LR state carried over). MTP auxiliary heads (weight 0.1) are training-only and excluded from reported val loss.
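The token accounting implied by the TST settings can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the AOMTS repos. The batch size is not stated above, so the figures are per sequence slot and assume the same batch size in both phases. All names are illustrative.

```python
# Per-sequence token accounting for the TST schedule described above.
# Assumption: batch size is identical in both phases (it is not stated in
# the text), so all figures are per sequence slot, not per step.

BAG_SIZE = 6                 # s: raw tokens superposed into one position
CONTEXT = 2_048              # positions the model sees per sequence
TOTAL_STEPS = 3_000
SUPERPOSITION_STEPS = 900
RECOVERY_STEPS = TOTAL_STEPS - SUPERPOSITION_STEPS


def raw_tokens_per_sequence(step: int) -> int:
    """Raw tokens consumed by one sequence at a 0-indexed training step."""
    if not 0 <= step < TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS}), got {step}")
    if step < SUPERPOSITION_STEPS:
        return CONTEXT * BAG_SIZE   # phase 1: bags of s tokens per position
    return CONTEXT                  # phase 2: ordinary one-token positions


tst_total = sum(raw_tokens_per_sequence(t) for t in range(TOTAL_STEPS))
base_total = TOTAL_STEPS * CONTEXT

print(f"recovery steps:            {RECOVERY_STEPS}")
print(f"raw tokens in phase 1 seq: {raw_tokens_per_sequence(0):,}")
print(f"TST raw tokens per slot:   {tst_total:,}")
print(f"base raw tokens per slot:  {base_total:,}")
print(f"ratio:                     {tst_total / base_total:.2f}x")
```

Under that assumption, a TST run reads 2.5x as many raw tokens as a baseline run in the same 3,000 steps, which is worth keeping in mind when comparing the two at an equal step budget.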

Key findings (validation loss in nats, lower is better; baseline = Base 0MTP run2 at 2.287):
• TST alone: 2.214, ~0.073 nats better than baseline
• MTP=1 alone: 2.276, ~0.011 nats better than baseline
• Best: TST + MTP=1 → 2.205 (AOMTS-TST-s6-100M-3k-1MTP-v1)
• TST + MTP=2 did not beat MTP=1 at this scale (2.215 vs 2.205)
• Resetting optimizer state at TST phase 2 hurt quality (2.303 vs 2.214 for TST-only)
• Cosine LR underperformed WSD for MTP=1 baseline (2.355 vs 2.276)
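For readers unfamiliar with the two schedules compared in the last bullet, here is a minimal sketch of a warmup-stable-decay (WSD) schedule next to a cosine schedule. The warmup length, decay length, peak learning rate, and the linear decay shape are all assumptions chosen for illustration; the values actually used in the AOMTS runs are not stated here.

```python
import math


def warmup(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup that reaches peak_lr on the last warmup step."""
    return peak_lr * (step + 1) / warmup_steps


def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup, then a flat plateau at peak_lr, then a linear decay to min_lr."""
    if step < warmup_steps:
        return warmup(step, peak_lr, warmup_steps)
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


def cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Warmup, then a half-cosine from peak_lr down to min_lr."""
    if step < warmup_steps:
        return warmup(step, peak_lr, warmup_steps)
    span = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / span, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to make the shapes visible.
TOTAL, PEAK, WARMUP, DECAY = 3_000, 1e-3, 100, 600

for step in (0, 99, 1_500, 2_700, 3_000):
    w = wsd_lr(step, TOTAL, PEAK, WARMUP, DECAY)
    c = cosine_lr(step, TOTAL, PEAK, WARMUP)
    print(f"step {step:>5}: wsd={w:.2e}  cosine={c:.2e}")
```

The practical difference is that WSD holds the peak rate for most of the run and decays only at the end, while cosine starts decaying right after warmup. Whether that explains the gap reported above is not something this sketch can show.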

Full grid (sorted by val loss):
• TST+MTP=1: 2.205
• TST-only: 2.214
• TST+MTP=2: 2.215
• Base+MTP=1: 2.276
• Base+MTP=2: 2.284
• Base 0MTP run2: 2.287
• TST reset: 2.303
• Base+MTP=1 cosine: 2.355
• early Base 0MTP v1: 2.376
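The improvement figures quoted in the key findings follow directly from this grid. As a small sketch (the dictionary keys are just the labels used above, and the helper name is illustrative), the deltas against the baseline can be recomputed like this:

```python
# Validation losses in nats, copied from the grid above.
VAL_LOSS = {
    "TST+MTP=1": 2.205,
    "TST-only": 2.214,
    "TST+MTP=2": 2.215,
    "Base+MTP=1": 2.276,
    "Base+MTP=2": 2.284,
    "Base 0MTP run2": 2.287,
    "TST reset": 2.303,
    "Base+MTP=1 cosine": 2.355,
    "early Base 0MTP v1": 2.376,
}
BASELINE = "Base 0MTP run2"


def improvements(losses: dict[str, float], baseline: str) -> dict[str, float]:
    """Nats gained over the baseline run (positive means lower val loss)."""
    ref = losses[baseline]
    return {name: round(ref - loss, 3) for name, loss in losses.items()}


deltas = improvements(VAL_LOSS, BASELINE)
ranked = sorted(VAL_LOSS, key=VAL_LOSS.get)

for name in ranked:
    print(f"{name:<20} {VAL_LOSS[name]:.3f}  ({deltas[name]:+.3f} vs baseline)")

# How close is the combined gain to the sum of the individual gains?
additive = deltas["TST-only"] + deltas["Base+MTP=1"]
print(f"TST + MTP=1 gain: {deltas['TST+MTP=1']:.3f}, sum of parts: {additive:.3f}")
```

On these numbers the combined TST + MTP=1 gain (0.082 nats) is close to the sum of the two individual gains (0.084 nats). These are single runs per condition, so differences of a few thousandths of a nat should be read with caution.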

Screening at ~100M / 3k steps was chosen because prior 200M–500M sweeps showed the winning configuration was usually ahead by step 2,000, so a 3,000-step run is long enough to rank conditions and cheap enough to run many of them in parallel. Top candidates from this series informed my Chinchilla-optimal 300M ConvSwiGLU vs SwiGLU runs. Each repo includes modeling_aomts.py for inference.