Rising Stars 2020:

Arsha Nagrani

Research Scientist


PhD '20 University of Oxford

Areas of Interest

  • Artificial Intelligence
  • Computer Vision
  • Machine Learning


Speech2Action: Cross-modal Supervision for Action Recognition


Our experience of the world is multimodal; however, deep learning networks have traditionally been designed for and trained on unimodal inputs such as images, audio segments, or text. In this work we investigate the link between spoken words and actions in movies. We use a form of cross-modal supervision, in which labels from a supervision-rich modality are used to learn representations in another, supervision-starved target modality, avoiding the need for costly manual annotation in the target modality. By using a text-based model to predict actions from speech segments alone, we demonstrate superior action recognition performance from video on standard action recognition benchmarks, without using a single manually labelled action example.


Arsha Nagrani is a Research Scientist at Google Research. She obtained her PhD from the Visual Geometry Group (VGG) at the University of Oxford and her BA and MEng degrees from the University of Cambridge, UK. Her research interests lie at the intersection of computer vision and speech technology, focusing on cross-modal and multi-modal machine learning techniques for video recognition. She has also spent time as a visiting researcher at the Wadhwani AI Research non-profit organisation in Mumbai and has a keen interest in AI for Social Good. Her work has been recognised by a Best Student Paper Award at Interspeech, a Google PhD Fellowship and a Townsend Scholarship, and has been covered by major outlets such as New Scientist, MIT Technology Review and Verdict.

Personal home page