Princeton Web Census Data Release - Dataset Details

The Princeton Web Census Dataset contains compressed archives of SQLite databases (schema), log files, JavaScript sources (when available), and the Alexa site list used in the crawl. An overview of the different crawl types is given in the table below. We used Amazon EC2 instances to run the crawls.

Crawl type | Description | Number of sites | Sample (1,000 sites)
Stateless | Parallel Stateless Crawl | 1,000,000 | Sample
Stateful | Parallel Stateful Crawl -- 10,000-site seed profile | 100,000 | Sample
ID Detection | Sequential Stateful Crawls -- Flash enabled -- synced with the pairing crawl | 25,000 [1] | Sample A, Sample B
Spider | Parallel Stateless Crawl -- homepages and 4 inner pages | 10,000 | Sample
Blocking - Ghostery | Parallel Stateless Crawl -- Ghostery installed and set to block all possible trackers | 50,000 | Sample
Blocking - HTTPSEverywhere | Parallel Stateless Crawl -- HTTPSEverywhere installed | 50,000 | Sample
Blocking - Block cookies | Parallel Stateless Crawl -- Firefox set to block all third-party cookies | 50,000 | Sample
Blocking - DoNotTrack | Parallel Stateless Crawl -- Do Not Track header turned on | 50,000 | Sample

[1] Except the January 2016 crawls, which cover 10,000 sites.

The OpenWPM code and database structure underwent occasional changes; a list of these changes is given below.

For convenience, we unified all databases before the release, so all released databases have the same structure. Columns added after a crawl was run are populated with NULL. For instance, the func_name column was added to the javascript table in January 2017, so all 2015 and 2016 crawl databases have NULL in the func_name column. The only exception is visit_id, since it enables joining data that belongs to a particular visit; this column is populated with the correct visit ID in all released databases.
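For example, rows from different tables can be joined on visit_id even in the oldest crawl databases, while columns that postdate a crawl (such as func_name) come back as NULL. Here is a minimal sketch using Python's sqlite3 module; the database filename is hypothetical, and the table and column names (site_visits, javascript, symbol, func_name) follow the schema linked above:

    import sqlite3

    # Hypothetical filename of a decompressed crawl database.
    conn = sqlite3.connect("2016-01_1m_stateless.sqlite")

    # Join JavaScript calls to the site visit they occurred on via visit_id.
    # func_name was only added in January 2017, so it may be NULL here.
    query = """
    SELECT sv.site_url, js.symbol, js.func_name
    FROM javascript AS js
    JOIN site_visits AS sv ON js.visit_id = sv.visit_id
    LIMIT 10
    """

    for site_url, symbol, func_name in conn.execute(query):
        # Guard against NULLs in columns that postdate the crawl.
        print(site_url, symbol, func_name or "<NULL>")

    conn.close()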

After the public release in November 2018, crawl databases will be shared without unification, to keep up with changes in OpenWPM's DB schema.

Database schema changes

Added tables:

Added/removed columns:

- http_requests, http_responses, javascript tables:
- flash_cookies, javascript_cookies, profile_cookies tables:
- javascript table:
- http_requests table:
- http_responses table:
- crawl_history table:

To decompress .tar.lz4 archives, run this command:

     lz4 -dc --no-sparse [CRAWL_ARCHIVE_NAME.tar.lz4] | tar xf -

See the example Jupyter notebooks on GitHub to get started with the data.
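Independently of those notebooks, a quick first look at a crawl database takes only a few lines of Python; the filename below is hypothetical:

    import sqlite3

    # Hypothetical filename of a decompressed crawl database.
    conn = sqlite3.connect("2015-12_1m_stateless.sqlite")

    # List the tables shipped in the unified schema.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (name,) in tables:
        # Row counts give a quick sense of each table's size.
        count = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
        print(f"{name}: {count} rows")

    conn.close()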

Please cite the following work if you use the Princeton Web Census data in your studies:

Englehardt, S., & Narayanan, A. (2016, October). Online tracking: A 1-million-site measurement and analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 1388-1401). ACM.

This data release is part of Princeton University's WebTAP project. Dillon Reisman helped with the data collection. Gunes Acar prepared the data for release.