Newscard - o1 'Strawberry'
At a conference last week, Anthropic co-founder Jack Clark said that LLMs are approaching a technical depth that most non-experts cannot evaluate. As if to immediately prove him correct, his competitor OpenAI released a model specialized for complex math and programming problems, prompting a dozen people to ask me for review requests.
https://openai.com/index/learning-to-reason-with-llms/
Newscard
- GPT o1 is the long-awaited strawberry
- OpenAI once again takes a substantial research lead
- o1 makes clear improvements in math, coding, and knowledge questions
- OpenAI describes it as a form of both Chain of Thought and Reinforcement Learning: ““Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.”
- It likely blurs the line between Chain of Thought, an approach to instructing language models to form step by step processes, and tree search, a different kind of algorithm best known for its use in complex games such as Chess or Go, and more recently for Math Olympiads.
As a gold medallist in the International Olympiad in Informatics and a national competitor in the US and Canadian Math Olympiads, many people are asking me to explain what impact this new release will have. I asked o1 some choice questions in my areas of expertise, namely competitive math, competitive programming, and research combinatorics.
My takeaways from those domains:
- o1 is an exemplary lookup system. It accurately cites obscure math theorems, describes them correctly, and identifies their use cases. It likely uses some form of retrieval augmented generation.
- It struggles at more ad hoc programming and math questions, even ones which would be much easier for a human to do.
- For someone who has experience in math olympiads, it’s easy to identify problems that are “stereotypical” math olympiad problems, that use typical results and techniques. These “stereotypical” problems can be easy or difficult, but often depend on a fixed “vocabulary” of results. For o1, a stereotypical problem that would be very difficult for a human is still easier than a problem that would be easier for an inexperienced human.
- Without similar experience, it’s pretty difficult for a smart generalist to tell what kinds of problems would be well-represented or poorly-represented in training data
- This is epitomized by the combinatorics research examples, in which o1 correctly cites and applies very obscure results.
Appendix (The Brian Chau Scorecard)
mantel’s theorem ✅
notes: cites mantel’s theorem directly
strong perfect graph theorem ✅(?)
also cites SPGT, gives an incomplete proof
labelled functional graphs where node 1 is reachable (n^{n-1}): ❌
did not fix after 3 corrections, gave an imo incomplete proof after being prompted with the formula for spanning trees
Off-stereotype but difficult Canadian OI problem https://dmoj.ca/problem/cco13p6 ❌
(compile error and also obviously wrong)
Another off-type OI problem: http://ceoi.inf.elte.hu/probarch/14/question.pdf ❌
complete garbage solution
https://oj.uz/problem/view/POI11_imp: ❌
passes one batch(!), but is both wrong and inefficient
Goodman’s theorem: ✅
when prompted, just cites Goodman’s theorem without proving it. Gives the normal proof of Goodman’s theorem when prompted.
https://ocw.mit.edu/courses/18-225-graph-theory-and-additive-combinatorics-fall-2023/resources/mit18_225_f23_psets_pdf/ B8 (straightforward application of turan’s theorem): ✅
Hamiltonian path in erdos renyi graph: ✅ (kinda)
Cites the correct threshold with no proof initially. When asked to prove the threshold correctly cites another result by Posa. When asked to prove the posa result, gets there eventually (though its a bit of a loose proof)
eigenvalues of bipartite graphs: ✅
gives the standard proof.
https://dmoj.ca/problem/valentines18s3
medium-difficulty dynamic programming problem set by me: ❌ (doesnt get any cases correct, even the samples)
https://ocw.mit.edu/courses/18-225-graph-theory-and-additive-combinatorics-fall-2023/resources/mit18_225_f23_psets_pdf/ D16 (Alon–Boppana theorem) ✅
I picked this problem thinking it seemed not like a direct application of a result, but turns out it is also an existing theorem, the Alon–Boppana theorem.
HMMT 2018 combinatorics p7 ✅
HMMT 2019 combinatorics p8 ❌
HMMT 2019 combinatorics p9 ❌
HMMT 2019 team 8 (which is a copy of the earlier CEOI programming question) ❌ (gives the correct answer for this yes-or-no question, but a horrendously wrong proof)
https://chatgpt.com/c/66e345c3-ed3c-800e-b62c-a20e922aeaf1
https://chatgpt.com/c/66e346c7-9f9c-800e-a602-ebbac1dc89df
https://chatgpt.com/c/66e34a1b-aff8-800e-a335-5dabc448621c
https://chatgpt.com/c/66e34ba7-bc60-800e-a5fa-231af9a7a4e6
https://chatgpt.com/c/66e34cb5-4f80-800e-89d3-bbd4db4c1dce
https://chatgpt.com/c/66e34d61-74e4-800e-b0aa-c31b6d2a693c
https://chatgpt.com/c/66e34ecf-14bc-800e-ae68-216ea4a23b26
https://chatgpt.com/c/66e34fd0-71c4-800e-8951-58334e970cb8
https://chatgpt.com/c/66e350db-3d80-800e-afd6-7a67e71b9960
https://chatgpt.com/c/66e353fb-8cf4-800e-9b56-6f3faf562085
https://chatgpt.com/c/66e35596-4b20-800e-8a36-30846b2a1837
https://chatgpt.com/c/66e35816-8c98-800e-9e59-82a3d0c43dcc
https://chatgpt.com/c/66e35988-8798-800e-92ba-8ade56dd0ece
https://chatgpt.com/c/66e359f6-57ac-800e-9252-b62734e7a241
https://chatgpt.com/c/66e35a3c-c5ec-800e-a9e2-67524162f589
https://chatgpt.com/c/66e35b4f-6bb8-800e-864d-55eea5633667
https://hmmt-archive.s3.amazonaws.com/tournaments/2019/feb/comb/problems.pdf
https://hmmt-archive.s3.amazonaws.com/tournaments/2018/feb/comb/problems.pdf