What are the sources for Our World in Data’s population estimates?

Population size is our most commonly used metric throughout Our World in Data. It is used directly to understand population growth over time or indirectly to calculate per capita adjustments of the many other metrics we care about: from extreme poverty to electricity access, CO₂ emissions, and vaccination rates.

Many population datasets cover a specific period – for example, the UN publishes data from 1950 onwards. However, few maintain very long-term datasets that are continually updated to the present day.

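Our team, therefore, builds and maintains a long-run dataset on population by country, region, and for the world, based on three key sources:

  • HYDE version 3.2, for the period before 1800;
  • Gapminder, for the period 1800 to 1949;
  • UN World Population Prospects, for the period from 1950 onwards.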

The scripts that produce this long-run dataset can be accessed in our GitHub repository.

In all sources we rely on, historical population estimates are based on today’s geographical borders.

We provide a full citation for each of the sources below. If you cite population data for a specific period, please cite the source. For example, for the period 1950 onwards, please cite the UN World Population Prospects. You can add “via Our World in Data” if you downloaded the data from us.

You can find the complete list of the sources used for each particular country and year here.

HYDE version 3.2

The very ambitious HYDE database (History Database of the Global Environment) is maintained by researchers at the Netherlands Environmental Assessment Agency.

Full citation: Klein Goldewijk, K., A. Beusen, J. Doelman and E. Stehfest (2017), Anthropogenic land use estimates for the Holocene; HYDE 3.2, Earth System Science Data, 9, 927-953.

The HYDE estimates go up to 2020, but they are only available once per decade for the period 1800–2020. Therefore, from 1800 onwards, when data is available from both HYDE and Gapminder, we favor the Gapminder version 7 dataset, as it provides annual estimates.

Gapminder

Gapminder version 7

Gapminder maintains a population dataset based on Angus Maddison and Clio Infra data. Their documentation provides the following details on their sources:

We use Maddison population data improved by Clio Infra in April 2015 and Gapminder v3 documented in greater detail by Mattias Lindgren. The main source of v3 was Angus Maddison’s data which is maintained and improved by the Clio Infra Project. The updated Maddison data by Clio Infra were based on the following improvements:

i. Whenever estimates by Maddison were available, his figures were followed in favor of estimates by Gapminder;

ii. For Africa, estimates by Frankema and Jerven (2014) for 1850-1960 have been added to the existing database;

iii. For Latin America, estimates by Abad & Van Zanden (2014) for 1500-1940 have been added.

Full citation: Gapminder doesn’t provide a preferred citation. We cite their work as: Gapminder population dataset version 7, based on data by Angus Maddison improved by Clio Infra. https://www.gapminder.org/data/documentation/gd003/

We use this dataset as our source from 1800 to 1949. In addition, we use their population estimates for the Vatican until 2100 since these are missing in UN WPP’s dataset.
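The year ranges described in this article can be read as a simple priority rule. The following is a minimal sketch of that rule, not the script from our GitHub repository: the function name `pick_source` and the source labels are hypothetical, and the cut-offs are the ones stated in this article (HYDE before 1800, Gapminder version 7 for 1800 to 1949, UN WPP from 1950 onwards, and Gapminder for the Vatican up to 2100).

```python
def pick_source(country: str, year: int) -> str:
    """Return a label for the source used for a country-year (illustrative only)."""
    if year > 2100:
        raise ValueError("no source described here covers years after 2100")
    if year < 1800:
        return "HYDE v3.2"
    # The Vatican is missing from UN WPP, so Gapminder is used for it up to 2100.
    if country == "Vatican" or year <= 1949:
        return "Gapminder v7"
    return "UN WPP"


if __name__ == "__main__":
    for country, year in [("France", 1750), ("France", 1900),
                          ("France", 1990), ("Vatican", 1990)]:
        print(country, year, pick_source(country, year))
```

The sketch ignores the former countries taken from Systema Globalis, which are handled separately.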

Systema Globalis

Systema Globalis is Gapminder’s primary dataset, used in tools on their official website. 

Full citation: Gapminder doesn’t provide a preferred citation. We cite their work as: Gapminder Systema Globalis. https://github.com/open-numbers/ddf--gapminder--systema_globalis

Data from this source covers the period 1555-2008. We use it to complement our population dataset with data from former countries (e.g., the Soviet Union, Yugoslavia, etc.) and other data not present in other sources.
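"Complementing" here means adding only the entries that the other sources lack, without overwriting anything they already provide. A minimal sketch of that idea follows, assuming each dataset is a dictionary keyed by (entity, year). The function name `complement`, the entity names, and the population values are placeholders for illustration, not real data or the code from our repository.

```python
def complement(primary: dict, secondary: dict) -> dict:
    """Add entries from `secondary` only where `primary` has no value."""
    combined = dict(secondary)
    combined.update(primary)  # values from the primary dataset always win
    return combined


if __name__ == "__main__":
    # Placeholder values, not real population figures.
    primary = {("Existing country", 1990): 100}
    secondary = {("Former country", 1990): 250, ("Existing country", 1990): 999}
    print(complement(primary, secondary))
```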

UN World Population Prospects

We rely on the latest United Nations World Population Prospects (UN WPP) revision as our primary source for recent historical data and future projections. We use this data for its reliability, its consistent methods, and because it includes population estimates for almost all territories in the world. The UN updates its dataset every two years with the following:

  • Annual historical estimates running from 1950 to the year before the most recent dataset publication;
  • Annual projections running from the year of the most recent dataset publication to 2100. The UN publishes multiple projections based on different scenarios of global fertility rates: a low, medium, and high scenario. In our dataset, we use the medium-variant scenario.
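The split between estimates and projections depends only on the publication year of the revision. A small sketch under that reading follows; the function name `classify_year` is hypothetical, and the 1950 and 2100 bounds are the ones given in the list above.

```python
def classify_year(year: int, publication_year: int) -> str:
    """Label a UN WPP year as a historical estimate or a projection."""
    if not 1950 <= year <= 2100:
        raise ValueError("UN WPP covers 1950 to 2100")
    # Estimates end the year before publication; projections start in the
    # publication year itself.
    return "estimate" if year < publication_year else "projection"


if __name__ == "__main__":
    # For a revision published in 2022:
    print(classify_year(2021, 2022))
    print(classify_year(2022, 2022))
```

In our dataset, every year labeled as a projection takes its value from the medium-variant scenario.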

The United Nations estimates may not always reflect the latest censuses or national figures. However, there are several reasons why we use this data over country-by-country national population estimates:

  • The UN WPP dataset is the standard in research. The main reason is that it applies a reliable, standardized methodology to all countries. If we used individual country data instead, some national figures would include overseas workers, expatriates, or undocumented immigrants, while others would not. The UN WPP dataset tries to maintain a consistent methodology across all countries.
  • Using data from the UN allows us to get accurate population estimates for all territories worldwide. Finding and maintaining estimates based on national censuses would be time-consuming and more prone to errors.
  • Other reasons include the availability of yearly data (national censuses are only conducted every few years) and avoiding double-counting in cases of border disputes.

Full citation: United Nations, Department of Economic and Social Affairs, Population Division (2022). World Population Prospects 2022, Online Edition. Rev. 1.

