The emergence of deep-learning-based methods for search poses several challenges and opportunities, not just for modeling but also for benchmarking and measuring progress in the field. Some of these challenges are new, while others have evolved from existing challenges in IR benchmarking, exacerbated by the scale at which deep learning models operate. Evaluation efforts such as the TREC Deep Learning track and the MS MARCO public leaderboard are intended to encourage research and track our progress, addressing big questions in our field. The goal is not simply to identify which run is "best" but to move the field forward by developing new, robust techniques that work in many different settings and are adopted in research and practice. This entails a wider conversation in the IR community about what constitutes meaningful progress, how benchmark design can encourage or discourage certain outcomes, and how valid our findings are. In this talk, I will present a brief overview of what we have learned from our work on MS MARCO and the TREC Deep Learning track, and reflect on the state of the field and the road ahead.