Princeton Web Census Data Release
We are releasing the entire Princeton Web Census data containing privacy measurements of 1 million sites conducted regularly from December 2015 to June 2019.
By Steven Englehardt, Gunes Acar, Dillon Reisman, and Arvind Narayanan.
This page is part of the Princeton Web Transparency and Accountability Project.
Since 2015, we have conducted a web census to study third-party online tracking. Each month, our bot visits the web’s 1 million most popular sites and records data pertaining to user privacy, including cookies, fingerprinting scripts, the effect of browser privacy tools, and the exchange of tracking data between different sites ("cookie syncing").
Our open-source measurement software, OpenWPM, has been used in dozens of other studies. In 2016 we published a paper "Online Tracking: A 1-million-site Measurement and Analysis" based on a snapshot of this data, and released that snapshot.
Now we are releasing the entire Princeton Web Census data -- about 15 terabytes -- containing privacy measurements of 1 million sites conducted each month from December 2015 to June 2018.
We plan to run one or two more crawls in the next few months (until mid-2019), and we will update this data release periodically. (Update: the November 2018 and June 2019 crawls have been added to the release.)
Send an email to firstname.lastname@example.org to request access to the dataset. Please tell us who you are and give a high-level description of what you plan to use the data for. (We'll approve all requests, but we'd like to get an idea of how people are using the data.)
Overview of the data
Each month, we run measurements in eight configurations at scales ranging from 10,000 sites to 1 million sites, summarized here. Please visit the dataset details page for usage information, a timeline of changes, and known issues with the data.
|Type of measurement||Number of sites||Sample|
|Stateless (cookies and other state cleared between visits to different sites)||1,000,000||Sample|
|Stateful (cookies and other state are loaded from the seed profile of a 10K-site crawl)||100,000||Sample|
|Automatic detection of identifying cookies enabled (stateful: cookies and other state are retained between visits to different sites)||25,000||Sample|
|Visit home page + 4 inner pages per site (all other measurements visit only one page per site, the home page; this and all following measurements are stateless)||10,000||Sample|
|Ghostery installed and set to block all possible trackers||50,000||Sample|
|Firefox set to block all third-party cookies||50,000||Sample|
|DoNotTrack header is turned on||50,000||Sample|