zakird/crux-top-lists

Downloadable snapshots of the Chrome Top Million Websites pulled from public CrUX data in BigQuery.

Python
These are the star and fork statistics for the zakird/crux-top-lists repository. As of 10 May 2024, this repository has 682 stars and 34 forks.

Cached Chrome Top Million Websites

Recent research showed that the list of the top million most popular websites published by Google Chrome via its UX Report (CrUX) is significantly more accurate than other top lists like the Alexa Top Million and Tranco Top Million. This repository caches a CSV version of the Chrome top sites, queried from the CrUX data in Google BigQuery.

You can browse all of the cached lists in this repository. The most up-to-date top million global websites can be downloaded directly at: https://raw.githubusercontent.com/zakird/crux-top-lists/main/data/global/current.csv.gz.

Data Structure

The CrUX dataset has several important differences from other top lists:

- Websites are bucketed by rank magnitude order, not by specific rank. Rank will be 1000, 10K, 100K, or 1M in the provided files.
- The data is ordered by rank magnitude. Within each order of magnitude, websites are listed randomly.
- Websites are identified by origin (e.g., https://www.google.com), not by domain or FQDN.
- Data is released monthly, typically on the second Tuesday of the month.

This is an example of what the data looks like:

    origin,rank
    https://www.ptwxz.com,1000
    https://ameblo.jp,1000
    https://danbooru.donmai.us,1000
    https://game8.jp,1000
    https://www.google.com.au,1000
    https://www.repubblica.it,1000
    https://www.w3schools.com,1000
    https://animekimi.com,1000

Websites are ranked by completed pageloads (measured by First Contentful Paint) and aggregated by web origin. The dataset adheres as closely as possible to user-initiated pageloads (e.g., it excludes traffic from iframes). More information about CrUX and its data collection methodology can be found on its official website: https://developer.chrome.com/docs/crux/about/.

Why 1 Million Sites?

This repository does not contain all of the website ranking data published by Chrome. Their global list of popular websites contains approximately 15M websites.
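As a minimal sketch of working with the file format described under Data Structure, the Python below parses a gzip-compressed `origin,rank` CSV, counts how many origins fall into each rank bucket, and extracts a hostname from an origin. The function names (`read_crux_csv`, `bucket_sizes`, `hostname`) are hypothetical helpers, not part of this repository, and the code assumes that ranks are stored as plain integers, as in the sample rows above. The demo builds a tiny file in memory instead of downloading `current.csv.gz`.

```python
import csv
import gzip
import io
from collections import Counter
from urllib.parse import urlsplit


def read_crux_csv(gz_bytes):
    """Parse gzip-compressed CSV bytes with an `origin,rank` header.

    Assumption: the rank column holds plain integers (e.g., 1000).
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    return [(row["origin"], int(row["rank"])) for row in reader]


def bucket_sizes(rows):
    """Count how many origins carry each rank-magnitude value."""
    return Counter(rank for _, rank in rows)


def hostname(origin):
    """Return the host part of an origin such as https://www.google.com."""
    return urlsplit(origin).hostname


# Demo on an in-memory file built from two of the sample rows above.
# For real data, the bytes could instead be read from the downloaded
# current.csv.gz file.
sample = gzip.compress(
    b"origin,rank\n"
    b"https://ameblo.jp,1000\n"
    b"https://www.google.com.au,1000\n"
)
rows = read_crux_csv(sample)
print(bucket_sizes(rows))
print([hostname(origin) for origin, _ in rows])
```

Because origins within a bucket are listed in random order, the position of a row inside its bucket carries no ranking information; only the bucket value does.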
The top million websites capture over 95% of user traffic in Chrome by both Page Loads and Time on Page (Ruth et al.) and are a reasonable approximation. If you want to use more or fewer websites, this is the approximate breakdown of coverage:

| Websites | Page Loads |
|----------|------------|
| 1000     | 50%        |
| 10K      | 70%        |
| 100K     | 87%        |
| 1M       | 95%        |
| 5M       | 99%        |

The following SQL can be used to generate a similar list of all globally popular websites:

    SELECT DISTINCT origin, experimental.popularity.rank
    FROM `chrome-ux-report.experimental.global`
    WHERE yyyymm = ? -- e.g., integer 202210
    GROUP BY origin, experimental.popularity.rank
    ORDER BY experimental.popularity.rank;

Country-Specific Websites

Ruth et al. also showed that browsing behavior is localized and a global top list skews towards global sites (e.g., technology and gaming) and away from local sites (e.g., education, government, and finance). As such, researchers may also want to investigate whether trends hold across individual countries. Chrome publishes country-specific top lists in BigQuery, and the following SQL can be used to dump out country-specific top websites:

    SELECT DISTINCT country_code, origin, experimental.popularity.rank
    FROM `chrome-ux-report.experimental.country`
    WHERE yyyymm = ? -- e.g., integer 202210
      AND experimental.popularity.rank <= 1000000
    GROUP BY country_code, origin, experimental.popularity.rank
    ORDER BY country_code, experimental.popularity.rank;

The CrUX dataset is based on data collected from Google Chrome and is thus biased away from countries with limited Chrome usage (e.g., China). If you're specifically interested in looking at domain popularity in China, consider Building an Open, Robust, and Stable Voting-Based Domain Top List, which is based on data collected from 114DNS, a large DNS provider in China.

Supporting Research

The data in this repo is all publicly posted by Google to their CrUX dataset in Google BigQuery. This is simply a cache of that public data.
Many of the arguments in this README are based on two recent research papers. The first describes how we evaluated the accuracy of lists of top websites. The second is a study on web browsing more broadly.

- Toppling Top Lists: Evaluating the Accuracy of Popular Website Lists
  Kimberly Ruth, Deepak Kumar, Brandon Wang, Luke Valenta, and Zakir Durumeric
  ACM Internet Measurement Conference (IMC), October 2022

- A World Wide View of Browsing the World Wide Web
  Kimberly Ruth, Aurore Fass, Jonathan Azose, Mark Pearson, Emma Thomas, Caitlin Sadowski, and Zakir Durumeric
  ACM Internet Measurement Conference (IMC), October 2022
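As a closing illustration of the coverage breakdown given under Why 1 Million Sites?, the sketch below picks the smallest list size that reaches a target share of page loads and trims a list of `(origin, rank)` pairs to that size. The coverage figures are copied from this README. The names `COVERAGE`, `smallest_list_for`, and `trim_to_bucket` are hypothetical and not part of this repository, and the code assumes ranks are plain integers.

```python
# Approximate share of Chrome page loads covered by the top-N websites,
# copied from the coverage breakdown in this README.
COVERAGE = {
    1_000: 0.50,
    10_000: 0.70,
    100_000: 0.87,
    1_000_000: 0.95,
    5_000_000: 0.99,
}


def smallest_list_for(target):
    """Return the smallest listed size whose coverage is at least `target`.

    Returns None when no listed size reaches the target.
    """
    for size in sorted(COVERAGE):
        if COVERAGE[size] >= target:
            return size
    return None


def trim_to_bucket(rows, max_rank):
    """Keep only (origin, rank) pairs whose rank bucket is at most `max_rank`."""
    return [(origin, rank) for origin, rank in rows if rank <= max_rank]


# Demo with made-up origins (example.* names are placeholders, not real data).
rows = [
    ("https://a.example", 1_000),
    ("https://b.example", 10_000),
    ("https://c.example", 1_000_000),
]
size = smallest_list_for(0.70)
print(size, trim_to_bucket(rows, size))
```

Since the files in this repository stop at the top million, any target above the 95% figure would require querying the full CrUX dataset in BigQuery rather than trimming a cached file.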