Understanding mesa-optimization using toy models — LessWrong
Our project, "Understanding Search in Transformers", involved training transformers to solve 2D mazes and attempting to understand the search algorithms they learned. Specifically, I examined the generalization behavior of current models, trained transformer models, and performed mechanistic analysis using direct logit attribution. AI Safety Camp is a research programme focused on mitigating AI risks; I participated in the cohort that ran from March to June 2023. This project was funded by AI Safety Support.