Physics of Language Models: Architecture Design and the Magic of Canon Layers

by nkko on 5/4/25, 4:25 PM with 1 comment
by darknoon on 5/15/25, 12:19 AM

Anyone know why they mix in the 3 previous tokens? Could have just as easily done 5 or 2, right?