Reproducible research artifacts.
Every measurement on the site is backed by a downloadable artifact. These exports are streamed live from the same MongoDB the site reads, so the version you download tonight matches what the page shows tonight. Cite the snapshot ID where applicable.
Lookalikes (full)
CSV · live-counted rows · CC BY 4.0
Every brand-impersonation match across our brand list and 1,160 TLDs (1,081 gTLDs from CZDS + 79 ccTLDs from DoH probe), with detection layer, liveness, RDAP enrichment, Tranco rank, blocklist cross-reference, IP geo+ASN, and linked-bad graph attribution. Default precision_mode=high; pass ?precision_mode=raw for the unfiltered corpus.
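A minimal sketch of selecting the precision mode when building a download link. The endpoint `EXPORT_URL` and the helper `export_url` are hypothetical placeholders, not the site's real route; only the `precision_mode` parameter and its two values (`high`, `raw`) come from the description above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# Hypothetical endpoint: substitute the real download link from the page.
EXPORT_URL = "https://example.org/exports/lookalikes.csv"

# The two values named in the description; other values are not documented here.
KNOWN_MODES = ("high", "raw")


def export_url(base: str = EXPORT_URL, precision_mode: str = "high") -> str:
    """Return the export URL with the precision_mode query parameter set."""
    if precision_mode not in KNOWN_MODES:
        raise ValueError(f"unknown precision_mode: {precision_mode!r}")
    parts = urlsplit(base)
    query = urlencode({"precision_mode": precision_mode})
    # Rebuild the URL, replacing any existing query string and dropping fragments.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

Calling `export_url(precision_mode="raw")` yields the link for the unfiltered corpus; omitting the argument keeps the default high-precision view.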
Lookalikes - known-bad only
CSV · live-counted rows · CC BY 4.0
Subset that appears in any of the abuse feeds we ingest (PhishTank, URLhaus, CryptoScamDB, OpenPhish, ThreatFox, Phishing.Database, MetaMask eth-phishing-detect, Phishing Army, abuse.ch MalwareBazaar / Feodo / SSLBL). High-precision evaluation set.
Lookalikes - live only
CSV · live-counted rows · CC BY 4.0
Just the matches that resolved to an A or AAAA record at the last DoH sweep. The operationally dangerous subset.
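The liveness test can be sketched as a check on a single DoH answer. This assumes the sweep uses the JSON response layout offered by public resolvers (a `Status` code plus an `Answer` list of records with numeric `type`); the pipeline's actual resolver and wire format are not stated here, and `is_live` is a hypothetical helper.

```python
# DNS resource-record type codes for IPv4 and IPv6 address records.
TYPE_A = 1
TYPE_AAAA = 28


def is_live(doh_json: dict) -> bool:
    """True if the response is NOERROR and carries at least one A or AAAA record."""
    if doh_json.get("Status") != 0:  # 0 = NOERROR; 3 = NXDOMAIN, etc.
        return False
    answers = doh_json.get("Answer", [])
    return any(rr.get("type") in (TYPE_A, TYPE_AAAA) for rr in answers)
```

A response that contains only a CNAME, or that returns NXDOMAIN, counts as not live under this sketch.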
Lookalikes - combosquat layer only
CSV · live-counted rows · CC BY 4.0
<brand>+phishing-keyword pattern hits (claim, login, support, airdrop, ...). The most realistic phishing phenotype.
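To make the combosquat pattern concrete, here is a rough sketch of a matcher. It is not the detection layer itself: the keyword tuple holds only the four examples listed above (the real list is longer), `is_combosquat` is a hypothetical name, and the input is assumed to be a registered domain such as `brand-claim.io` rather than an arbitrary hostname.

```python
# Only the keywords named in the description; the production list is longer.
KEYWORDS = ("claim", "login", "support", "airdrop")


def is_combosquat(domain: str, brand: str, keywords=KEYWORDS) -> bool:
    """True if the leftmost label contains the brand plus a phishing keyword."""
    label = domain.lower().split(".")[0]  # assumes a registered domain, no subdomains
    brand = brand.lower()
    if brand not in label or label == brand:
        return False  # brand absent, or the label is exactly the brand itself
    remainder = label.replace(brand, "", 1)  # what is left once the brand is removed
    return any(keyword in remainder for keyword in keywords)
```

For example, `is_combosquat("metamask-airdrop.io", "metamask")` matches, while `metamask.io` and `metamask-news.io` do not.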
ccTLD snapshot (per-ccTLD JSONL.gz, daily folder)
CSV (CSV via API; JSONL.gz on disk) · live-counted rows · CC BY 4.0
DoH-probed ccTLD impostors, one .jsonl.gz file per ccTLD, plus a manifest.json summary. Mirrors the layout we use for CZDS gTLD zone files. Available on cb2 and cb1 at /data/domaindefender/cctld_data/<DATE>/.
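A small sketch of reading one daily folder. It assumes each file is named after its ccTLD (for example `de.jsonl.gz`), which is a guess at the naming scheme, and it makes no assumption about the fields inside each record or inside manifest.json. `iter_snapshot` and `load_manifest` are hypothetical helper names.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_snapshot(day_dir: Path) -> Iterator[tuple[str, dict]]:
    """Yield (file stem, record) for every JSON line in every .jsonl.gz file."""
    for path in sorted(day_dir.glob("*.jsonl.gz")):
        stem = path.name[: -len(".jsonl.gz")]  # assumed to be the ccTLD
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines
                    yield stem, json.loads(line)


def load_manifest(day_dir: Path) -> dict:
    """Load the manifest.json summary that sits next to the per-ccTLD files."""
    return json.loads((day_dir / "manifest.json").read_text(encoding="utf-8"))
```

Streaming line by line keeps memory flat even when a single ccTLD file is large.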
approved_tlds.combined.txt (gTLDs)
TXT · 1,081 rows · CC BY 4.0
1,081 gTLDs we have CZDS access to. One TLD per line, sorted, no leading dot.
approved_cctlds.combined.txt (ccTLDs)
TXT · 79 rows · CC BY 4.0
The 79 ccTLDs for which we have at least one DoH-probe hit. Refreshed by the daily pipeline. Same format as approved_tlds.combined.txt.
approved_tlds.all.txt (gTLD + ccTLD union)
TXT · 1,160 rows · CC BY 4.0
Combined 1,160 TLD list, deduped and sorted. Use this if you want the full coverage set in one file.
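If you prefer to rebuild the union from the two source files, a sketch like the following should reproduce it, assuming the one-TLD-per-line format described above. `combine_tld_lists` is a hypothetical helper; it takes file contents as strings so it can be tried without touching disk. Since 1,081 + 79 = 1,160, the published counts suggest the two lists do not overlap, but the dedupe step is kept for safety.

```python
def combine_tld_lists(*lists: str) -> list[str]:
    """Merge newline-separated TLD lists into one deduped, sorted list."""
    tlds = set()
    for text in lists:
        for line in text.splitlines():
            tld = line.strip().lstrip(".").lower()  # tolerate stray dots and case
            if tld:
                tlds.add(tld)
    return sorted(tlds)
```

Typical use would be `combine_tld_lists(open("approved_tlds.combined.txt").read(), open("approved_cctlds.combined.txt").read())`.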
Tranco list (snapshot we joined against)
CSV · 1,000,000 rows · Tranco terms
The exact Tranco ID used as the popularity oracle. Pinned for reproducibility.
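A sketch of using the pinned list as a rank lookup, assuming the standard Tranco CSV layout of `rank,domain` with no header row. How the site itself performs the join (exact match, registrable domain, or something else) is not stated, so the exact-match lookup below is an assumption, and `load_tranco` and `tranco_rank` are hypothetical names.

```python
import csv
from typing import Iterable, Optional


def load_tranco(lines: Iterable[str]) -> dict[str, int]:
    """Build a domain -> rank mapping from Tranco CSV lines (rank,domain)."""
    return {domain.lower(): int(rank) for rank, domain in csv.reader(lines)}


def tranco_rank(domain: str, ranks: dict[str, int]) -> Optional[int]:
    """Return the rank for an exact domain match, or None if unlisted."""
    return ranks.get(domain.lower())
```

Pass an open file handle (opened with `newline=""`) as `lines`; a domain missing from the list returns `None` rather than a sentinel rank.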
Abuse feeds we cross-reference (linked upstream).
Fetched daily into /data/domaindefender/external_feeds/<DATE>/ on cb1, with a manifest.json recording row counts, source URLs, and license notes. We do not republish these; most carry research-attribution or non-commercial licenses, so cite upstream. Inspired by the 10-feed cross-reference in Sommese et al., DarkDNS, IMC 2024.
On the wishlist (academic-access pending): OpenINTEL (active forward-DNS for the ccTLD coverage gap) · CAIDA DZDB (historical zone-file archive for time-series claims) · Farsight DNSDB (passive DNS at registry scale).