ML Seminar - Alexander Stapleton
Centre for Theoretical Physics and Astronomy
Date: 13 February 2026
Time: 14:30 - 15:30
Location: 114 G.O. Jones Building
Title: A path to natural language through tokenisation and transformers
Abstract: Natural language exhibits robust statistical regularities, most notably Zipf's and Heaps' laws, yet how these relate to the tokenisation schemes used in transformer-based language models remains unclear. In this talk, I will examine how byte-pair encoding (BPE) reshapes corpus statistics and mediates between linguistic structure and transformer predictions. Starting from an idealised Zipfian setting, I will derive the expected Shannon entropy of token slots and show that increasing BPE depth drives token frequencies toward a Zipfian power law while inducing characteristic entropy growth. Training transformers at increasing tokenisation depths will reveal a convergence of predictive entropies toward Zipf-based expectations and a reduction in local token dependencies. Together, these results will position BPE as a statistical transformation that reconstructs key informational properties of natural language. Finally, I will explore how Zipf's law constitutes an RG-type scheme.
Updated by: Dimitrios Bachtis
