n-simplex attention makes a lot of sense because of its honesty: it literally says you can spend more compute on the attention operation to get more gains, a trend we've seen play out many times. This differs from a lot of 'suspicious' claims that you can use less compute and perform similarly (i.e., subquadratic compute matching quadratic compute).
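To make the "more compute per attention op" contrast concrete, here is a minimal sketch (not any paper's actual implementation; the function names and the exact way the two value streams are combined are my assumptions): standard attention scores query–key pairs, roughly O(n²·d), while a 2-simplicial / higher-order variant scores query–key–key triples, roughly O(n³·d).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(q, k, v):
    # O(n^2 * d): pairwise logits, softmax over key positions
    logits = q @ k.T / np.sqrt(q.shape[-1])            # (n, n)
    weights = softmax(logits, axis=-1)                 # (n, n)
    return weights @ v                                 # (n, d)

def two_simplicial_attention(q, k1, k2, v1, v2):
    # O(n^3 * d): trilinear logits over (query, key1, key2) triples
    n, d = q.shape
    logits = np.einsum('id,jd,kd->ijk', q, k1, k2) / np.sqrt(d)      # (n, n, n)
    weights = softmax(logits.reshape(n, n * n), axis=-1).reshape(n, n, n)
    # combine values from the two key positions (elementwise product here,
    # which is one common choice, not necessarily the one used in practice)
    return np.einsum('ijk,jd,kd->id', weights, v1, v2)               # (n, d)

n, d = 8, 16
x = np.random.default_rng(0).standard_normal((n, d))
print(standard_attention(x, x, x).shape)           # (8, 16)
print(two_simplicial_attention(x, x, x, x, x).shape)  # (8, 16)
```

The point of the sketch is just the scaling: the triple-interaction tensor is where the extra compute goes, and that is the knob being turned up rather than down.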