Chinchilla 300M: ConvSwiGLU vs SwiGLU

A paired release of two ~300M-parameter language models trained under identical Chinchilla-optimal conditions (~20 tokens/parameter, ~6B tokens each) on open-index/open-wikipedia-markdown. The two models differ only in the feed-forward block: standard dense SwiGLU versus ConvSwiGLU, which is SwiGLU with a depthwise causal 1D convolution (kernel k=2) and an extra SiLU inserted between the gated activation and the down-projection, following the Universal Reasoning Model design.
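To make the one architectural difference concrete, here is a minimal NumPy sketch of the two feed-forward variants as described above. It is not the code from modeling_chinchilla_300m.py: the function names, weight shapes, the conv bias term, and the exact ordering (convolution first, then the extra SiLU) are assumptions made for illustration.

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def causal_depthwise_conv1d(h, weight, bias):
    """h: (seq, channels), weight: (k, channels), bias: (channels,).
    Each channel is filtered independently (depthwise), and the output at
    position t only sees h[t-k+1 .. t] (causal, via left zero-padding)."""
    k = weight.shape[0]
    seq, channels = h.shape
    padded = np.concatenate([np.zeros((k - 1, channels), dtype=h.dtype), h], axis=0)
    out = np.zeros_like(h)
    for j in range(k):
        # weight[k-1] multiplies the current token, weight[0] the oldest one
        out += padded[j:j + seq] * weight[j]
    return out + bias

def swiglu(x, w_gate, w_up, w_down):
    # Baseline: gated activation followed directly by the down-projection
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def conv_swiglu(x, w_gate, w_up, w_down, conv_w, conv_b):
    h = silu(x @ w_gate) * (x @ w_up)                      # gated activation
    h = silu(causal_depthwise_conv1d(h, conv_w, conv_b))   # conv + extra SiLU (order assumed)
    return h @ w_down                                      # down-projection

# Toy sizes for illustration only, not the released model's dimensions
rng = np.random.default_rng(0)
seq, d_model, d_ff, k = 5, 8, 16, 2
x = rng.standard_normal((seq, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
conv_w = rng.standard_normal((k, d_ff))
conv_b = np.zeros(d_ff)

y = conv_swiglu(x, w_gate, w_up, w_down, conv_w, conv_b)
print(y.shape)  # (5, 8)
```

In this sketch the convolution adds only (k + 1) values per hidden channel, which is consistent with the two models having nearly identical parameter counts.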

Shared stack: 18-layer decoder-only transformer with d=896, a 16k BPE vocabulary, GQA (14 Q / 14 KV heads), RMSNorm, RoPE, and tied embeddings. Training uses Token Superposition Training (TST, bag size s=6, 30% of steps), Multi-Token Prediction (MTP, depth 1, weight 0.1), Aurora on 2D matrix weights plus AdamW on embeddings, norms, and conv weights, a WSD (warmup-stable-decay) LR schedule, and NVFP4 precision via Transformer Engine on an NVIDIA RTX 5090. Each run covers ~36.5k steps on train shards en-00001–en-00004, with shard en-00014 held out for validation.
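The budget numbers quoted above can be checked with a few lines of arithmetic. This is a rough sketch using the rounded figures from the text. It assumes the TST steps form one contiguous block at the start of training (phase 1) and that the MTP head's loss is simply added to the main loss with the stated weight. The `combined_loss` helper is hypothetical and not taken from the training code.

```python
PARAMS = 300e6           # ~300M parameters (rounded)
TOKENS_PER_PARAM = 20    # Chinchilla-optimal rule of thumb
TOTAL_STEPS = 36_500     # ~36.5k optimizer steps
TST_FRACTION = 0.30      # share of steps spent in TST
MTP_WEIGHT = 0.1         # weight on the depth-1 MTP loss

token_budget = PARAMS * TOKENS_PER_PARAM            # ~6B tokens
tst_steps = round(TOTAL_STEPS * TST_FRACTION)       # steps in phase 1
recovery_steps = TOTAL_STEPS - tst_steps            # remaining steps
avg_tokens_per_step = token_budget / TOTAL_STEPS    # average only; TST bags may change the per-step count

def combined_loss(main_ce, mtp_ce, weight=MTP_WEIGHT):
    # Hypothetical helper: main next-token loss plus weighted MTP loss (assumed form)
    return main_ce + weight * mtp_ce

print(f"token budget:        {token_budget:.2e}")
print(f"TST steps:           {tst_steps}")
print(f"recovery steps:      {recovery_steps}")
print(f"avg tokens per step: {avg_tokens_per_step:,.0f}")
```

With these rounded inputs, phase 1 ends near step 11k, which is why the step-16k comparison falls early in the recovery phase.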

Results (next-token CE val loss on held-out shard en-00014, main head only):
• SwiGLU baseline — 299.3M params, phase-1 (TST end) 4.768, final 2.664
• ConvSwiGLU — 299.4M params (+0.05%), phase-1 4.774, final 2.515 (−0.149 nats vs SwiGLU)
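For readers who prefer perplexity, the losses above convert directly because they are reported in nats. The snippet below is a small sketch that only re-derives numbers from the two bullet points; the dictionary layout is an illustrative choice.

```python
import math

# Validation losses in nats, copied from the results above
results = {
    "SwiGLU":     {"phase1": 4.768, "final": 2.664},
    "ConvSwiGLU": {"phase1": 4.774, "final": 2.515},
}

final_delta = results["ConvSwiGLU"]["final"] - results["SwiGLU"]["final"]
phase1_delta = results["ConvSwiGLU"]["phase1"] - results["SwiGLU"]["phase1"]

# Perplexity is exp(cross-entropy in nats)
ppl = {name: math.exp(r["final"]) for name, r in results.items()}
ppl_ratio = math.exp(final_delta)  # ConvSwiGLU perplexity / SwiGLU perplexity

print(f"final delta:  {final_delta:+.3f} nats")
print(f"phase-1 delta: {phase1_delta:+.3f} nats")
print(f"perplexity:   SwiGLU {ppl['SwiGLU']:.2f}, ConvSwiGLU {ppl['ConvSwiGLU']:.2f}")
print(f"ratio:        {ppl_ratio:.3f}")
```

A gap of 0.149 nats corresponds to roughly 14% lower validation perplexity for ConvSwiGLU, while the phase-1 gap of 0.006 nats is small enough to treat as a tie.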

Phase 1 (TST) was effectively tied. ConvSwiGLU lagged early in the recovery phase (e.g. step 16k: 2.810 for ConvSwiGLU vs 2.530 for SwiGLU), crossed below SwiGLU around step 28k, and finished stronger: at step 36k, 2.516 vs 2.668.

Training logs are on W&B. Each repo ships modeling_chinchilla_300m.py for self-contained inference. This experiment scales the TST + MTP=1 recipe from my AOMTS screening series up to a Chinchilla-optimal budget while isolating the FFN architecture variable.