Princeton Web Census Data Release - Dataset Details
The Princeton Web Census Dataset contains compressed archives of SQLite databases (schema), log files, JavaScript sources (when available) and the Alexa site list used in the crawl. An overview of the different crawl types is given in the table below. We used Amazon EC2 instances to run the crawls.
| Crawl type | Description | Number of sites | Sample (1000 sites) |
|---|---|---|---|
| Stateless | Parallel Stateless Crawl | 1,000,000 | Sample |
| Stateful | Parallel Stateful Crawl -- 10,000 site seed profile | 100,000 | Sample |
| ID Detection | Sequential Stateful Crawls -- Flash enabled -- Synced with the pairing crawl | 25,000 | Sample A, Sample B |
| Spider | Parallel Stateless Crawl -- Homepages and 4 inner pages | 10,000 | Sample |
| Blocking - Ghostery | Parallel Stateless Crawl -- Ghostery is installed and set to block all possible trackers | 50,000 | Sample |
| Blocking - HTTPSEverywhere | Parallel Stateless Crawl -- HTTPSEverywhere is installed | 50,000 | Sample |
| Blocking - Block cookies | Parallel Stateless Crawl -- Firefox set to block all third-party cookies | 50,000 | Sample |
| Blocking - DoNotTrack | Parallel Stateless Crawl -- DoNotTrack header is turned on | 50,000 | Sample |
Timeline of important changes
The OpenWPM code and database structure underwent occasional changes. We provide a list of such changes below.

For convenience, we unified all databases before the release (i.e. all released databases have the same structure). Columns that were added after a crawl are populated with NULL. For instance, the func_name column was added to the javascript table in January 2017, so all 2015 and 2016 crawl databases will have NULL in the func_name column. The only exception to this is visit_id, since it enables joining of data that belongs to a particular visit; this column is populated with the correct visit ID for all released databases. After the public release in Nov 2018, crawl databases will be shared without unification to keep up with the changes in OpenWPM's DB schema.
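For example, visit_id makes it possible to join records in the javascript table to the page visit they belong to. A minimal sketch, assuming a decompressed crawl database (the filename and the site_url column of site_visits are assumptions here):

```python
import sqlite3

# Join JavaScript calls to the page visit they belong to via visit_id.
# "crawl-data.sqlite" is a placeholder filename; the site_url column of
# site_visits is assumed here.
conn = sqlite3.connect("crawl-data.sqlite")
query = """
    SELECT sv.visit_id, sv.site_url, js.func_name
    FROM javascript AS js
    JOIN site_visits AS sv ON js.visit_id = sv.visit_id
    LIMIT 10
"""
for row in conn.execute(query):
    print(row)
conn.close()
```

Note that func_name will be NULL for 2015 and 2016 crawls, as described above.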
- May 2016: Command sequence, site_visits table and visit_id column were first added.
- January 2017: HTTP extension-based instrumentation replaced mitmproxy-based instrumentation.
Database schema changes
Added tables:
- 2016-05: site_visits
http_requests, http_responses, javascript tables:
- 2016-05: removed top_url, added visit_id
flash_cookies, javascript_cookies, profile_cookies tables:
- 2016-05: removed page_url, added visit_id
javascript table:
- 2016-05: added script_line, script_col, call_stack
- 2017-01: added func_name, script_loc_eval
- 2017-04: parameter_index and parameter_value are combined into arguments
- 2018-06: added top_level_url, document_url
http_requests table:
- 2017-01: added top_level_url, is_XHR, is_frame_load, is_full_page, is_third_party_channel, is_third_party_window, triggering_origin, loading_origin, loading_href, req_call_stack, content_policy_type
- 2017-04: added post_body
- 2017-12: added channel_id
http_responses table:
- 2017-01: added is_cached
- 2017-12: added channel_id
crawl_history table:
- 2018-08: added visit_id
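Because the released databases are unified, a column added after a crawl's date will exist in the schema but contain only NULLs. A minimal sketch for checking whether a column was actually collected in a given crawl (the database filename is a placeholder):

```python
import sqlite3

# Check whether a column contains any non-NULL values: in the unified
# databases, columns added after the crawl date exist but hold only NULLs.
# "crawl-data.sqlite" is a placeholder for a decompressed crawl database.
conn = sqlite3.connect("crawl-data.sqlite")

def column_collected(table, column):
    query = f"SELECT EXISTS(SELECT 1 FROM {table} WHERE {column} IS NOT NULL)"
    return bool(conn.execute(query).fetchone()[0])

print(column_collected("javascript", "func_name"))     # False for pre-2017 crawls
print(column_collected("http_requests", "post_body"))   # column added in 2017-04
conn.close()
```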
Data issues
- 2017 and 2018 crawls use the Alexa list from 11 November 2016, while previous crawls used an up-to-date Alexa list downloaded right before the crawl (see this GitHub issue). To address this issue we added two columns to the site_visits table:
  - alexa_rank: rank based on the Alexa top 1M list at the time of the start of the crawl. The column is NULL for sites that are not on the Alexa top 1M list as of the start of the crawl.
  - crawled_alexa_rank: rank based on the list used for the crawl. For 2015 and 2016 crawls this rank will be equal to alexa_rank; for later crawls it will be based on the Alexa top 1M list of 11 November 2016.
- In some crawls, a small number of websites may not have been crawled despite being on the list of sites to crawl.
- Some entries in the http_requests, http_responses and javascript tables have a visit_id of -1, i.e. they are not associated with any particular page visit. We measured the rate of such entries and found them to be extremely rare (e.g. 1-8 per million for stateless crawls). They can be filtered out as shown in the sketch after this list.
- Internal pages were not browsed in spidering crawls from June 2016 through November 2016.
- Blocking crawls with Ghostery may have additional requests to Ghostery-owned domains.
- Crawls between 2015-12 and 2016-04 may have some data loss in javascript.ldb due to a LevelDBAggregator related issue.
- 2016-03_1m_stateless: Missing the first ~200 top-level domains.
- While running stateful crawls, browser state might have been lost several times due to browser crashes.
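As referenced above, entries with a visit_id of -1 can be excluded when joining to site_visits. A minimal sketch (the filename and the site_url column of site_visits are assumptions):

```python
import sqlite3

# Count requests per visited site, skipping the rare entries whose visit_id
# is -1 (not attributable to a particular page visit). "crawl-data.sqlite"
# is a placeholder filename; the site_url column of site_visits is assumed.
conn = sqlite3.connect("crawl-data.sqlite")
query = """
    SELECT sv.site_url, COUNT(*) AS n_requests
    FROM http_requests AS req
    JOIN site_visits AS sv ON req.visit_id = sv.visit_id
    WHERE req.visit_id != -1
    GROUP BY sv.visit_id
"""
for site_url, n_requests in conn.execute(query):
    print(site_url, n_requests)
conn.close()
```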
Usage
To decompress .tar.lz4 archives, run this command:
lz4 -dc --no-sparse [CRAWL_ARCHIVE_NAME.tar.lz4] | tar xf -
See the example Jupyter notebooks on GitHub to get started with the data.
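For instance, after decompressing an archive, a table from the crawl database can be loaded into a notebook with sqlite3 and pandas. A minimal sketch (the database filename is a placeholder):

```python
import sqlite3
import pandas as pd

# Load the list of page visits from a decompressed crawl database into a
# pandas DataFrame. "crawl-data.sqlite" is a placeholder filename.
conn = sqlite3.connect("crawl-data.sqlite")
visits = pd.read_sql_query("SELECT * FROM site_visits", conn)
print(visits.head())
conn.close()
```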
Reference
Please cite the following work if you use Princeton Web Census Data in your studies:
Englehardt, S., & Narayanan, A. (2016, October). Online tracking: A 1-million-site measurement and analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 1388-1401). ACM.
About
This data release is part of Princeton University's WebTAP project. Dillon Reisman helped with the data collection. Gunes Acar prepared the data for release.