Physics of Language Models: Architecture Design and the Magic of Canon Layers
by nkko on 5/4/25, 4:25 PM with 1 comment
by darknoon on 5/15/25, 12:19 AM
Anyone know why they mix in the 3 previous tokens? They could just as easily have done 5 or 2, right?