so here's a question: i was thinking about the structure of english and how it might affect the learning of positional embeddings. i went back to @karpathy's GPT2 video, where he plots the wpe matrix of GPT2: the values of 3 specific channels (out of the 768 dimensions) as a function of position (0–1023, the context size). he points out that the learned positional embeddings have structure in them. i got curious and plotted the same thing for 2 more open-source models, EleutherAI/gpt-neo-125M and facebook/opt-125m, and got the same result (i guess?). in the original transformer paper, the authors used a fixed sinusoidal function for positional embeddings. so why do models trained on natural language learn a sinusoidal-looking structure? is it because english itself has a periodic structure? subjects usually precede verbs, clauses follow temporal or causal order, etc.?
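for anyone who wants to reproduce the plot, here's a minimal sketch with huggingface transformers (the channel indices are arbitrary picks, not the ones from the video; attribute paths differ per model, e.g. gpt-neo also exposes .wpe while opt-125m keeps its learned positions under model.decoder.embed_positions, i believe):

```python
# sketch: plot a few channels of GPT2's learned positional embedding
# matrix (wpe) as a function of position, similar to the plot described above.
import matplotlib.pyplot as plt
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")         # 124M params, 1024 positions x 768 dims
wpe = model.wpe.weight.detach().cpu().numpy()     # shape (1024, 768)

for ch in (10, 100, 700):                         # three arbitrary channels
    plt.plot(wpe[:, ch], label=f"channel {ch}")

plt.xlabel("position")
plt.ylabel("embedding value")
plt.legend()
plt.show()
```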