I'd like to know how it can support a 32k context when all the other models I've seen are 2-4k. Does that mean it has a bigger attention layer, or is it just several times larger in parameter count?
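To make the distinction concrete: in a standard dense Transformer, the weights don't grow with context length, but the attention score matrix does, quadratically. A rough illustrative sketch in Python (the head count is borrowed from GPT-3's published config purely for illustration; GPT-4's actual architecture is undisclosed):

    def attention_scores_bytes(seq_len, num_heads=96, bytes_per_value=2):
        # Memory for one layer's materialized attention score matrix (per sequence).
        # num_heads=96 mirrors GPT-3's published config; purely an illustrative assumption.
        return num_heads * seq_len * seq_len * bytes_per_value

    for ctx in (2048, 4096, 32768):
        print(f"context {ctx:>6}: ~{attention_scores_bytes(ctx) / 2**30:.1f} GiB per layer")

So a longer context mostly stresses activation memory and attention compute (which is why people reach for sparse/windowed attention or better kernels), rather than requiring proportionally more parameters.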
Well, if the model is so smart, could it be that it's actually aware of its own layers and parameters?
As the discussion of GPT-4 heats up, the absence of details on its technical implementation only becomes more glaring. As an engineer, I haven't learned anything applicable from the newest OpenAI publication that I didn't already know yesterday!
I have been investigating LLM training and inference for quite some time, and have developed a number of hypotheses about future SoTA models, which I believe very likely apply to GPT-4.