Welcome to History of Data Science. Discover the stories of heroes who transformed our daily lives!

BROUGHT TO YOU BY Dataiku

Donald Rubin: The Statistician Who Caused a Stir

While the police concentrate on looking for missing people, Harvard Professor Emeritus Donald Rubin (born 1943) focuses his attention on dealing with missing data. This world-famous American statistician spent much of his illustrious career hunting down causes, effects, potential outcomes and data that had gone AWOL.

Indecisive or curious?

Born in Washington D.C., Donald Rubin was an excellent student and embarked on an accelerated PhD physics program at Princeton University. Along the way, he switched to psychology, before being told to swot up on stats. He did just that, going on to earn a PhD in statistics, but not before dabbling in computer science and teaching himself to program in Fortran.

The missing pieces

Fresh from university, he took on a consulting role at the US Educational Testing Service. Unleashed to research what he wanted (within reason), he set about establishing the causal model that would later be named after him. Drawing on the work of Polish mathematician Jerzy Neyman, his approach was based on the idea of potential outcomes: the results an individual, or group, would experience under each possible change to their environment, only one of which can ever be observed. He also tackled the problem of missing data, that is to say, cases where there is no data value for a particular variable, e.g. because of nonresponse or dropout. Missing data is particularly common in economics, sociology and other social sciences, and it can have a significant impact on the validity of the conclusions.
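To make the potential-outcomes idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Rubin's own formulation: the function names, the simulated outcome distribution and the treatment effect of 2.0 are all hypothetical. Every simulated unit has two potential outcomes, but only the one matching its randomly assigned treatment is kept, so the other is, in effect, missing data.

```python
import random
import statistics


def simulate_potential_outcomes(n=10_000, true_effect=2.0, seed=0):
    """Simulate units with two potential outcomes, revealing only one each.

    All numbers here are made up for illustration.
    """
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        y0 = rng.gauss(10.0, 3.0)         # outcome if the unit is NOT treated
        y1 = y0 + true_effect             # outcome if the unit IS treated
        treated = rng.random() < 0.5      # randomized assignment (coin flip)
        observed = y1 if treated else y0  # the other potential outcome is never seen
        units.append((treated, observed))
    return units


def difference_in_means(units):
    """Estimate the average treatment effect from observed outcomes only."""
    treated = [y for t, y in units if t]
    control = [y for t, y in units if not t]
    return statistics.mean(treated) - statistics.mean(control)


units = simulate_potential_outcomes()
estimate = difference_in_means(units)
print(f"estimated average treatment effect: {estimate:.2f}")
```

Because assignment is randomized in this sketch, the simple difference in group means should land close to the assumed effect of 2.0, even though no single unit ever reveals both of its outcomes.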

“Often decisions about interventions must be made, even if based on limited empirical evidence, and we should help decision-makers make sensible decisions under clearly stated assumptions.”

Key causal concepts

Rubin didn’t stop there. Through a number of prestigious academic roles, he kept busy optimizing survey sampling, building on Bayesian inference and working on the Expectation-Maximization (EM) algorithm for finding maximum likelihood estimates when data are incomplete. He established the Propensity Score, the probability of receiving a treatment given observed covariates, as a way of reducing or eliminating selection bias in observational studies by balancing covariates. Time and time again he went back to his lifelong passion for data, developing, most notably, Multiple Imputation, which fills in each missing value several times to help account for the uncertainty about what the missing values really were.
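How does Multiple Imputation account for uncertainty? One way to see it is through the pooling step commonly known as Rubin's rules, sketched below in Python. The function name and the input numbers are hypothetical, chosen only for illustration. The same analysis is run on each imputed dataset, and the results are combined so that disagreement between the imputations inflates the final variance.

```python
import statistics


def pool_with_rubins_rules(estimates, variances):
    """Combine per-imputation results into one estimate and one total variance.

    estimates: the point estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")

    pooled = statistics.mean(estimates)       # overall point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between    # total variance under Rubin's rules
    return pooled, total


# Made-up results from m = 3 imputed datasets.
pooled_estimate, total_variance = pool_with_rubins_rules(
    [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]
)
print(pooled_estimate, total_variance)
```

In this toy example the three imputations disagree quite a lot, so the total variance ends up well above the 0.5 reported by any single imputed dataset. That extra spread is the uncertainty a single filled-in dataset would hide.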

Beyond the big theories

His work on statistics and, in particular, causal inference has helped bring causality to the heart of social science, revolutionizing development economics and randomized field experiments, not to mention psychology and medicine, by addressing dropout and noncompliance. This important contribution has been widely recognized, earning him numerous awards and positions, including fellowship of the American Statistical Association.