Papers, conjectures, and the occasional rabbit hole.
A benchmark in which each frontier model both solves problems and authors problems for the others, yielding two ratings per model (one as solver, one as author) and avoiding the saturation typical of fixed test sets.