Data. If you are writing a thesis or doing scientific research, and you want to perform some type of statistical analysis, then you will need data to answer your research question. However, it’s not always obvious where you can find the right data for the job. A question that students often ask me is therefore: “Where can I find data for my thesis?” In this post, I have collected a list of secondary data sources that may help you find the right data for your own research. I have divided them into the following categories: surveys, databases, scientific projects, data search engines, scientific journals, researcher websites, and other lists of datasets.
Not familiar with secondary data? Let’s start with a quick introduction!
Primary and secondary data
Let’s first make a distinction between two types of data. Primary data is data that you collect yourself, for example by running an experiment or administering a survey. Secondary data is data that has already been collected for you, for example by the national statistics bureau of a country, or through an existing survey such as the World Values Survey. Secondary data can also be data that is stored in an existing database, such as the genetic variation data from the European Variation Archive.
Okay, but I was looking for primary data?
In the remainder of this post, I will talk about secondary data. However, I would like to emphasize that nowadays (primary) data can be found everywhere. In fact, I encourage you to take advantage of the many sources of information that are available, whether you collect data yourself by crawling Twitter, designing a simple online survey with Google Forms, using computer game statistics, or analyzing YouTube videos.1 Data is everywhere!
A survey is a set of questions that aim to measure one or multiple topics. Usually, surveys are designed to gauge the opinions or preferences of individuals on a certain matter, but more objective questions (such as “How much do you earn per hour?”) are often also included.2
Although surveys are increasingly administered digitally, many global surveys still rely on researchers who go ‘out in the field’ and interview respondents.3 After a survey is administered, the responses are compiled into a dataset. This makes this type of data very easy to use, because all the hard work of compiling and cleaning the survey data has already been done for you.
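To give an idea of how little work remains once a survey has been compiled into a dataset, here is a minimal Python sketch. It assumes the survey was exported as a CSV file; the column names (`country`, `trust_in_government`) and all rows are made up for illustration and do not come from any real survey.

```python
import csv
import io
from collections import defaultdict

# Hypothetical survey export: made-up rows, NOT real survey data.
SAMPLE_CSV = """respondent_id,country,trust_in_government
1,NL,4
2,NL,2
3,DE,5
4,DE,3
5,DE,
"""


def mean_score_by_country(csv_text, score_column):
    """Return the mean of score_column per country, skipping missing answers."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[score_column].strip()
        if value:  # an empty string means the respondent did not answer
            scores[row["country"]].append(float(value))
    return {country: sum(vals) / len(vals) for country, vals in scores.items()}


means = mean_score_by_country(SAMPLE_CSV, "trust_in_government")
print(means)
```

With a real survey file you would replace the sample string with the downloaded file, and check the codebook of the survey to see how missing answers are coded.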
Below is a list of popular surveys that may help you with your thesis research. Most datasets are freely accessible; for others, you have to register an account on the website of the survey before you can access the data.
- World Values Survey (WVS): one of the first global surveys, it mostly contains information on people’s perceptions and values with regard to politics, the economy, and life in general.
- European Values Study (EVS): similar to the WVS, but it only includes European countries.
- Panel Study of Income Dynamics (PSID): an ongoing survey of about 5,000 families from the United States that started in 1968. As the title suggests, it contains data on the economic background of individuals, but also a lot more: child rearing, marriage, education, and so forth.
- German Socio-Economic Panel (G-SOEP): similar to the PSID, but it focuses exclusively on Germany.
- Understanding Society: similar to the PSID, but it focuses exclusively on the United Kingdom.
- Pew Research Center Surveys: includes surveys on United States politics and public opinion.
- The DHS Program: the Demographic and Health Surveys (DHS) Program collects data on health, diseases and related topics (such as nutrition) for many countries in the world.
- The Generations and Gender Programme: focuses on studying family and partner relationships. Individuals from multiple countries are surveyed, predominantly from Europe.
Tip: national surveys
For some countries, multiple surveys are available. For example, in the United States alone there are about 15 different surveys on households!
Databases collect and store data from other sources. This may be data from national statistics offices, surveys, scientific articles, and so forth. Because most databases offer large amounts of data, and because you can often easily search this data, they are a great source to use for your dissertation.
Below is a list of popular databases that are worth browsing for your thesis. Most datasets are freely accessible; for others, you have to register an account on the website of the database before you can access the data.
- World Bank: offers country data for the period 1975-today on social, political and economic variables such as the literacy rate, quality of government and GDP per capita.
- UN Data: similar to the World Bank database.
- Quality of Government (QoG): the Quality of Government dataset from the University of Gothenburg offers national (country) and regional data on various governance indicators. Furthermore, the national data file includes popular metrics from other surveys (such as the World Values Survey) and variables from popular scientific papers (example of data from a paper).
- EuroStat: offers European data (national and regional) on social, political and economic indicators.
- Historical Statistics: a database with historical financial, economic, and social datasets.
- Correlates of War: extensive database on the history of war and related topics, such as military disputes, formal alliances, diplomatic connections and bilateral trade.
- Economagic: economic and financial time series data for many countries in the world, on various topics (unemployment, GDP growth, interest rates, and so forth).
- Federal Reserve Economic Data (FRED): similar to Economagic.
- Penn World Tables: similar to Economagic, but focuses on productivity.
- Luxembourg Income Study Database: a database of household and person-level income data for various countries. Also check the Luxembourg Wealth Study Database, which is similar to the Income study, but focuses on wealth (i.e. assets, debt) instead.
- CIA World Factbook: a database with various country characteristics (e.g. level of democracy, geography, military status). Tip: the original data is only available in print, but datasets containing CIA World Factbook data have been compiled by others. Part of the data is also included in the previously mentioned QoG.
- Comparative Political Data Set (CPDS): country-level data on various political and institutional indicators.
Tip: national databases
Most countries have national statistics offices. Usually, these collect data on various topics for a specific country, which makes them a great source if you are interested in studying a single country. A list of national statistics offices can be found here.
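If you combine indicators from two databases (say, GDP per capita from one source and a governance index from another), the usual step is to match rows on a shared country code and year. Below is a minimal Python sketch of such a join. It assumes both sources identify countries with three-letter ISO codes; the variable names and all values are placeholders that I invented for illustration, not real figures.

```python
# Placeholder records keyed by (country code, year): invented values only.
gdp_per_capita = {
    ("NLD", 2015): 100.0,
    ("DEU", 2015): 200.0,
    ("FRA", 2015): 300.0,
}
governance_index = {
    ("NLD", 2015): 0.9,
    ("DEU", 2015): 0.8,
    ("BEL", 2015): 0.7,
}


def inner_join(left, right):
    """Keep only the (country, year) keys that appear in both sources."""
    shared_keys = sorted(left.keys() & right.keys())
    return [
        {
            "country": country,
            "year": year,
            "gdp_per_capita": left[(country, year)],
            "governance": right[(country, year)],
        }
        for country, year in shared_keys
    ]


merged = inner_join(gdp_per_capita, governance_index)

# Keys that appear in only one source: worth inspecting before analysis.
unmatched = sorted(gdp_per_capita.keys() ^ governance_index.keys())

print(merged)
print(unmatched)
```

Looking at the unmatched keys is worthwhile, because sources often differ in which countries and years they cover, or in how they code country names.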
Some scientific problems are too difficult to tackle all alone. This is why researchers collaborate in scientific projects. Often, the data from these projects is freely accessible. Below are some examples.
- Open Source Psychometrics Project: data from many different personality tests (Psychology) such as the Big-Five and Generic Conspiracist Beliefs Scale, collected by researchers from this project.
- 1000 Genomes Project: data from this project, which aimed to record human genetic variation.
- European Climate Assessment Project: this dataset aims to measure the impact of global warming on Europe and the Middle East.4 It records various weather and climate variables and is accessible for free!
The Registry of Research Data Repositories (re3data) collects information on project data for studies from all disciplines. Definitely worth a look!
Increasingly, publicly available datasets are being indexed by search engines. These allow you to quickly find interesting data. Rather than being a full-fledged database, these search engines direct you to the place where the data can be retrieved. Some examples:
- Google Dataset Search: Launched in 2018, this Google service aims to index as many datasets as possible. Combined with the ‘traditional’ Google search functionality which most of us are familiar with, this is a good place to start.
- Quandl: a search engine connected to millions of financial, economic, and social datasets. Not all data is freely accessible.
- Datahub.io: similar to Quandl.
- Plenar.io: similar to Quandl, although most data is freely accessible.
- Data.world: similar to Quandl, but with an added bonus: it categorizes data based on academic discipline. For example, say you were interested in browsing Psychology datasets.
- Socrata Open data: a search engine that focuses exclusively on (open) Government data.
- Nation Master: a site that aggregates data on a multitude of ‘popular’ topics, ranging from birth rates to crime statistics and alcohol consumption. Covers many countries in the world.
- Statista: similar to Nation Master, but it focuses on market and industry data (e.g. number of smartphones sold, market size of the cosmetics industry, etc.).
Scientific journals often demand that researchers make their dataset publicly available before an article is accepted and published. Usually, this dataset is posted on the website of the journal. Therefore, it may pay off to check a journal’s website to see if an interesting dataset is available.
Did you read an interesting article? Check the Appendix of the article to see whether data has been made available online. If data has been made available, the article will usually say something along the lines of: “Our data is available from the online appendix of
Additionally, scientific journals often maintain lists of various (publicly available) datasets, for example:
- American Psychological Association
- Harvard Dataverse: hosts a collection of datasets used in scientific papers and journals, which you can search using various queries.
- Inter-university Consortium for Political and Social Research (ICPSR): similar to the Harvard Dataverse, the ICPSR hosts a large collection of published data.
Tip: Open data
In the last decade, many journals have started to focus on sharing research data. These are called ‘open data’ journals. Examples are the Journal of Open Psychology Data or the Open Data Journal for Agricultural Research.
Many researchers share data that they use for scientific publications on their personal (University) web page. I will list just a few examples here.
- Prof. Enrico Spolaore: A professor of Economics who has researched the economic consequences of genetic and cultural distances between countries.5
- Prof. Robert Putnam: World-renowned sociologist Robert Putnam shares his research data (in this example: on social capital) on his website.
- Vanderbilt Biostatistics Data: some university research groups, such as the Biostatistics group at Vanderbilt University, post datasets on their website.
Tip: visit the researcher’s website
I personally think it is always a good idea to visit the website of a researcher or professor whose article you found interesting: besides links to other useful articles and perhaps some data, it will also tell you whether they have made any recent advances in the study of a topic!
Obviously, I am not the first to compile a list of datasets. Below are some other examples that you may find helpful.
- Open Data Tools
- Google Public Data
- Stata data
- The Economics Network
- Reddit: Although not necessarily a research website, the popular discussion platform Reddit also houses many discussions on where to find data. Check here, here or here.
You are at the end of this post, so by now I’m pretty convinced that you will use data in your thesis. If so, you also need to present this data in tables and figures. I have written a post on this topic:
I sincerely hope that this post has helped you discover new and exciting datasets for your thesis research. Thanks for reading, and good luck with your research!