Princeton Web Census Data Release - Dataset Details
The Princeton Web Census Dataset contains compressed archives of SQLite databases (schema), log files, JavaScript sources (when available) and the Alexa site list used in the crawl. An overview of the different crawl types is given in the table below. We used Amazon EC2 instances to run the crawls.
| Crawl type | Description | Number of sites | Sample (1000 sites) |
|---|---|---|---|
| Stateless | Parallel Stateless Crawl | 1,000,000 | Sample |
| Stateful | Parallel Stateful Crawl -- 10,000 site seed profile | 100,000 | Sample |
| ID Detection | Sequential Stateful Crawls -- Flash enabled -- Synced with the pairing crawl | 25,000 | Sample A, Sample B |
| Spider | Parallel Stateless Crawl -- Homepages and 4 inner pages | 10,000 | Sample |
| Blocking - Ghostery | Parallel Stateless Crawl -- Ghostery is installed and set to block all possible trackers | 50,000 | Sample |
| Blocking - HTTPSEverywhere | Parallel Stateless Crawl -- HTTPSEverywhere is installed | 50,000 | Sample |
| Blocking - Block cookies | Parallel Stateless Crawl -- Firefox set to block all third-party cookies | 50,000 | Sample |
| Blocking - DoNotTrack | Parallel Stateless Crawl -- DoNotTrack header is turned on | 50,000 | Sample |
Timeline of important changes
The OpenWPM code and database structure underwent occasional changes. We provide a list of such changes below.

For convenience, we unified all databases before the release (i.e. all released databases have the same structure). Columns that were added after a crawl are populated with NULL. For instance, the func_name column was added to the javascript table in January 2017, so all 2015 and 2016 crawl databases will have NULL in the func_name column. The only exception to this is visit_id, since it enables joining of data that belongs to a particular visit; this column is populated with the correct visit ID for all released databases. After the public release in Nov 2018, crawl databases will be shared without unification to keep up with the changes in OpenWPM's DB schema.
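For example, visit_id makes it possible to join records in the javascript table to the page visit they belong to. A minimal sketch, assuming a decompressed crawl database (the filename and the site_url column of site_visits are assumptions here):

```python
import sqlite3

# Join JavaScript calls to the page visit they belong to via visit_id.
# "crawl-data.sqlite" is a placeholder filename; the site_url column of
# site_visits is assumed here.
conn = sqlite3.connect("crawl-data.sqlite")
query = """
    SELECT sv.visit_id, sv.site_url, js.func_name
    FROM javascript AS js
    JOIN site_visits AS sv ON js.visit_id = sv.visit_id
    LIMIT 10
"""
for row in conn.execute(query):
    print(row)
conn.close()
```

Note that func_name will be NULL for 2015 and 2016 crawls, as described above.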
- May 2016: Command sequence, site_visits table and visit_id column were first added.
- January 2017: HTTP extension-based instrumentation replaced mitmproxy-based instrumentation.
Database schema changes
Added tables:
- 2016-05: site_visits
http_requests, http_responses, javascript tables:
- 2016-05: removed top_url, added visit_id
flash_cookies, javascript_cookies, profile_cookies tables:
- 2016-05: removed page_url, added visit_id
javascript table:
- 2016-05: added script_line, script_col, call_stack
- 2017-01: added func_name, script_loc_eval
- 2017-04: parameter_index and parameter_value are combined into arguments
- 2018-06: added top_level_url, document_url
http_requests table:
- 2017-01: added top_level_url, is_XHR, is_frame_load, is_full_page, is_third_party_channel, is_third_party_window, triggering_origin, loading_origin, loading_href, req_call_stack, content_policy_type
- 2017-04: added post_body
- 2017-12: added channel_id
http_responses table:
- 2017-01: added is_cached
- 2017-12: added channel_id
crawl_history table:
- 2018-08: added visit_id
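Because the released databases are unified, a column added after a crawl's date will exist in the schema but contain only NULLs. A minimal sketch for checking whether a column was actually collected in a given crawl (the database filename is a placeholder):

```python
import sqlite3

# Check whether a column contains any non-NULL values: in the unified
# databases, columns added after the crawl date exist but hold only NULLs.
# "crawl-data.sqlite" is a placeholder for a decompressed crawl database.
conn = sqlite3.connect("crawl-data.sqlite")

def column_collected(table, column):
    query = f"SELECT EXISTS(SELECT 1 FROM {table} WHERE {column} IS NOT NULL)"
    return bool(conn.execute(query).fetchone()[0])

print(column_collected("javascript", "func_name"))     # False for pre-2017 crawls
print(column_collected("http_requests", "post_body"))   # column added in 2017-04
conn.close()
```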
Data issues
- 2017 and 2018 crawls use the Alexa list from 11 November 2016, while previous crawls used an up-to-date Alexa list downloaded right before the crawl (see this GitHub issue). To address this issue we added two columns to the site_visits table:
  - alexa_rank: rank based on the Alexa top 1M list at the time of the start of the crawl. The column is NULL for sites that are not on the Alexa top 1M list as of the start of the crawl.
  - crawled_alexa_rank: rank based on the list used for the crawl. For 2015 and 2016 crawls this rank will be equal to alexa_rank; for later crawls it will be based on the Alexa top 1M list of 11 November 2016.
- In some crawls, a small number of websites may not have been crawled despite being on the list of sites to crawl.
- Some entries in the http_requests, http_responses and javascript tables have a visit_id of -1, i.e. they are not associated with any particular page visit. We measured the rate of such entries and found them to be extremely rare (e.g. 1-8 per million for stateless crawls). They can be filtered out as shown in the sketch after this list.
- Internal pages were not browsed in spidering crawls from June 2016 through November 2016.
- Blocking crawls with Ghostery may have additional requests to Ghostery-owned domains.
- Crawls between 2015-12 and 2016-04 may have some data loss in javascript.ldb due to a LevelDBAggregator related issue.
- 2016-03_1m_stateless: Missing the first ~200 top-level domains.
- While running stateful crawls, browser state might have been lost several times due to browser crashes.
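As referenced above, entries with a visit_id of -1 can be excluded when joining to site_visits. A minimal sketch (the filename and the site_url column of site_visits are assumptions):

```python
import sqlite3

# Count requests per visited site, skipping the rare entries whose visit_id
# is -1 (not attributable to a particular page visit). "crawl-data.sqlite"
# is a placeholder filename; the site_url column of site_visits is assumed.
conn = sqlite3.connect("crawl-data.sqlite")
query = """
    SELECT sv.site_url, COUNT(*) AS n_requests
    FROM http_requests AS req
    JOIN site_visits AS sv ON req.visit_id = sv.visit_id
    WHERE req.visit_id != -1
    GROUP BY sv.visit_id
"""
for site_url, n_requests in conn.execute(query):
    print(site_url, n_requests)
conn.close()
```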
Usage
To decompress .tar.lz4 archives, run this command:
lz4 -dc --no-sparse [CRAWL_ARCHIVE_NAME.tar.lz4] | tar xf -
See the example Jupyter notebooks on GitHub to get started with the data.
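For instance, after decompressing an archive, a table from the crawl database can be loaded into a notebook with sqlite3 and pandas. A minimal sketch (the database filename is a placeholder):

```python
import sqlite3
import pandas as pd

# Load the list of page visits from a decompressed crawl database into a
# pandas DataFrame. "crawl-data.sqlite" is a placeholder filename.
conn = sqlite3.connect("crawl-data.sqlite")
visits = pd.read_sql_query("SELECT * FROM site_visits", conn)
print(visits.head())
conn.close()
```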
Reference
Please cite the following work if you use Princeton Web Census Data in your studies:
Englehardt, S., & Narayanan, A. (2016, October). Online tracking: A 1-million-site measurement and analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 1388-1401). ACM.
About
This data release is part of Princeton University's WebTAP project. Dillon Reisman helped with the data collection. Gunes Acar prepared the data for release.