DAT-Byte-Small

My Age: 17

DAT-Byte is a family of byte-level Differential Attention Transformers trained from scratch on a single RTX 5090. DAT-Byte-Small is the smallest member of the family at ~200M parameters. It is a decoder-only transformer with RoPE positional embeddings and pre-layernorm: 768 hidden size, 3072 FFN size, 12 attention heads, 28 layers, and a 259-token byte-level vocabulary (256 byte values plus 3 special tokens). Differential Attention (Ye et al., 2024) reduces attention noise, which is especially useful at byte granularity, where individual tokens carry little information. The model was trained on Gutenberg English, OpenDiscord (ChatML-formatted Discord dumps), proprietary Discord data, and a diverse set of public-domain English Bible translations, for 31,200 steps at a maximum sequence length of 2048, with roughly 5–10B tokens seen during training.
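The core idea of Differential Attention can be sketched in a few lines: two independent softmax attention maps are computed and their difference, scaled by a learnable λ, weights the values, cancelling common-mode attention noise. The NumPy sketch below is illustrative only, not the training code; it is single-head and omits causal masking, the per-head GroupNorm, and the λ reparameterization from the paper. All weight names are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Single-head Differential Attention (Ye et al., 2024), simplified.

    Two attention maps are computed from separate query/key projections;
    their difference (scaled by lam) attends over the values, so noise
    common to both maps cancels out.
    """
    d = Wq1.shape[1]                       # head dimension for scaling
    q1, k1 = x @ Wq1, x @ Wk1              # first query/key projection
    q2, k2 = x @ Wq2, x @ Wk2              # second query/key projection
    v = x @ Wv
    a1 = softmax(q1 @ k1.T / np.sqrt(d))   # first attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))   # second attention map
    return (a1 - lam * a2) @ v             # differential map over values
```

With λ = 0 this reduces to ordinary softmax attention over the first map, which is a handy sanity check when implementing it.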

Two larger siblings were also in progress: DAT-Byte-Medium (350M), which completed training, and DAT-Byte-Large (700M), which was mid-training. Both were discontinued before publication after I identified an issue in the architecture implementation, compounded by underwhelming benchmark performance relative to similarly sized models. The family may be revisited once the implementation issue is resolved.