Datasets

Reproducible research artifacts.

Every measurement on the site is backed by a downloadable artifact. These exports are streamed live from the same MongoDB the site reads, so the version you download tonight matches what the page shows tonight. Cite the snapshot ID where applicable.

Downloads 9 artifacts

Lookalikes (full)

CSV · live-counted rows · CC BY 4.0

Every brand-impersonation match across our brand list and 1,160 TLDs (1,081 gTLDs from CZDS + 79 ccTLDs from DoH probe), with detection layer, liveness, RDAP enrichment, Tranco rank, blocklist cross-reference, IP geo+ASN, and linked-bad graph attribution. Default precision_mode=high; pass ?precision_mode=raw for the unfiltered corpus.

cite · DomainDefender / Tranco L7KW4 (2026-04-23) / CryptoScamDB / Phishing.Database
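The precision_mode switch described above can be sketched as a small URL builder. Only the parameter name and its two values (high, raw) come from this page; the base URL below is a hypothetical placeholder, since the real export path is not stated here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical placeholder: this page does not state the real export URL.
BASE_URL = "https://example.invalid/api/lookalikes.csv"

# The two values named on this page; "high" is the documented default.
PRECISION_MODES = ("high", "raw")


def export_url(precision_mode: str = "high", base_url: str = BASE_URL) -> str:
    """Return the export URL with precision_mode set explicitly."""
    if precision_mode not in PRECISION_MODES:
        raise ValueError(f"unknown precision_mode: {precision_mode!r}")
    scheme, netloc, path, query, fragment = urlsplit(base_url)
    params = dict(parse_qsl(query))  # keep any query parameters already present
    params["precision_mode"] = precision_mode
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))
```

Setting the parameter explicitly, even for the default, records in the URL itself which corpus a cited download came from.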

Lookalikes - known-bad only

CSV · live-counted rows · CC BY 4.0

The subset of lookalikes that appears in at least one of the abuse feeds we ingest (PhishTank, URLhaus, CryptoScamDB, OpenPhish, ThreatFox, Phishing.Database, MetaMask eth-phishing-detect, Phishing Army, abuse.ch MalwareBazaar / Feodo / SSLBL). Use it as a high-precision evaluation set.

cite · DomainDefender / abuse-feed cross-reference
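A cross-reference of this kind amounts to a membership test against each feed. The sketch below assumes every feed has already been reduced to a set of lowercase domain names; the function name and the feed labels in the usage note are illustrative, not the site's implementation.

```python
def known_bad(lookalikes, feeds):
    """Map each lookalike found in at least one feed to the sorted names of those feeds.

    lookalikes: iterable of domain names.
    feeds: dict of feed name -> set of domain names (assumed pre-normalised).
    """
    hits = {}
    for domain in lookalikes:
        sources = sorted(name for name, domains in feeds.items() if domain in domains)
        if sources:
            hits[domain] = sources
    return hits
```

For example, `known_bad(["a.test", "b.test"], {"feed_x": {"a.test"}})` keeps only `a.test`, tagged with `feed_x`.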

Lookalikes - live only

CSV · live-counted rows · CC BY 4.0

Only the matches that resolved to an A or AAAA record at the last DoH sweep. This is the operationally dangerous subset.

cite · DomainDefender
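The liveness rule above (a match counts as live when it resolves to an A or AAAA record) can be sketched as a check on one parsed DoH response. This assumes the JSON response format offered by the Cloudflare and Google public resolvers; the page does not say which wire format the sweep actually uses, and `is_live` is an illustrative name.

```python
# DNS record type numbers from the IANA registry.
TYPE_A = 1
TYPE_AAAA = 28


def is_live(doh_json: dict) -> bool:
    """True when a NOERROR response carries at least one A or AAAA record."""
    if doh_json.get("Status") != 0:  # 0 is NOERROR; 3 would be NXDOMAIN
        return False
    return any(
        answer.get("type") in (TYPE_A, TYPE_AAAA)
        for answer in doh_json.get("Answer", [])
    )
```

A response that contains only a CNAME, or no Answer section at all, is treated as not live under this rule.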

Lookalikes - combosquat layer only

CSV · live-counted rows · CC BY 4.0

<brand>+phishing-keyword pattern hits (claim, login, support, airdrop, ...). The most realistic phishing phenotype.

cite · DomainDefender
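The brand-plus-keyword pattern can be illustrated with a simple substring rule. This is a sketch, not the site's detector: it uses only the four keywords named above (the page's list is elided), and it assumes the input is a registered domain of the form label.tld.

```python
# Keywords named on this page; the page's own list is elided, so this is incomplete.
KEYWORDS = ("claim", "login", "support", "airdrop")


def is_combosquat(domain: str, brand: str, keywords=KEYWORDS) -> bool:
    """Illustrative rule: the leftmost label contains the brand plus a keyword."""
    label = domain.lower().rstrip(".").split(".")[0]
    brand = brand.lower()
    if brand not in label:
        return False
    # Remove the first occurrence of the brand and look for a keyword in what is left.
    remainder = label.replace(brand, "", 1)
    return any(keyword in remainder for keyword in keywords)
```

Under this rule `paypal-login.com` matches the brand `paypal`, while `paypal.com` itself does not, because nothing remains once the brand is removed.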

ccTLD snapshot (per-ccTLD JSONL.gz, daily folder)

CSV via API; JSONL.gz on disk · live-counted rows · CC BY 4.0

DoH-probed ccTLD impostors, one .jsonl.gz file per ccTLD, plus a manifest.json summary. Mirrors the layout we use for CZDS gTLD zone files. Available on cb2 and cb1 at /data/domaindefender/cctld_data/<DATE>/.

cite · DomainDefender (DoH probe via Cloudflare 1.1.1.1 + Google 8.8.8.8)
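A daily folder laid out as described (one .jsonl.gz per ccTLD plus a manifest.json) can be read with the standard library alone. The file naming `<cctld>.jsonl.gz` is an assumption, and neither the record fields nor the manifest schema are documented on this page, so the sketch returns plain dicts.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_snapshot(day_dir: Path) -> Iterator[Tuple[str, dict]]:
    """Yield (file stem, record) for every non-empty line of every *.jsonl.gz file."""
    for path in sorted(day_dir.glob("*.jsonl.gz")):
        cctld = path.name[: -len(".jsonl.gz")]  # assumed naming: <cctld>.jsonl.gz
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield cctld, json.loads(line)


def load_manifest(day_dir: Path) -> dict:
    """Return manifest.json as a dict; its schema is not documented here."""
    return json.loads((day_dir / "manifest.json").read_text(encoding="utf-8"))
```

Streaming line by line keeps memory use flat regardless of how large a single ccTLD file grows.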

approved_tlds.combined.txt (gTLDs)

TXT · 1,081 rows · CC BY 4.0

1,081 gTLDs we have CZDS access to. One TLD per line, sorted, no leading dot.

cite · DomainDefender

approved_cctlds.combined.txt (ccTLDs)

TXT · 79 rows · CC BY 4.0

The 79 ccTLDs for which we have at least one DoH-probe hit. Refreshed by the daily pipeline. Same format as approved_tlds.combined.txt.

cite · DomainDefender

approved_tlds.all.txt (gTLD + ccTLD union)

TXT · 1,160 rows · CC BY 4.0

Combined 1,160 TLD list, deduped and sorted. Use this if you want the full coverage set in one file.

cite · DomainDefender
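The union file can be reproduced from the two component lists, which is a quick way to check a download. The sketch below assumes the one-TLD-per-line format described above; the function names are illustrative.

```python
from typing import List


def parse_tld_list(text: str) -> List[str]:
    """Parse one-TLD-per-line text, normalising case and dropping any leading dot."""
    return [line.strip().lower().lstrip(".") for line in text.splitlines() if line.strip()]


def union_tlds(gtlds: List[str], cctlds: List[str]) -> List[str]:
    """Return the deduplicated, sorted union of two TLD lists."""
    return sorted(set(gtlds) | set(cctlds))
```

Because 1,081 + 79 = 1,160, the two published lists should not overlap, so `len(union_tlds(g, c)) == len(g) + len(c)` is a cheap consistency check.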

Tranco list (snapshot we joined against)

CSV · 1,000,000 rows · Tranco terms

The exact Tranco list (ID L7KW4) used as the popularity oracle. Pinned for reproducibility.

cite · Le Pochat et al., Tranco List ID L7KW4 (2026-04-23)
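Joining against the pinned list reduces to a dictionary lookup. The sketch assumes the usual Tranco CSV layout of headerless `rank,domain` rows; how the site itself performs the join (for example, whether subdomains are collapsed first) is not stated here.

```python
import csv
import io
from typing import Dict, Optional


def load_tranco_ranks(csv_text: str) -> Dict[str, int]:
    """Build a domain -> rank lookup from headerless rank,domain rows."""
    ranks = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == 2:  # skip blank or malformed lines
            rank, domain = row
            ranks[domain.lower()] = int(rank)
    return ranks


def tranco_rank(domain: str, ranks: Dict[str, int]) -> Optional[int]:
    """Return the rank for an exact domain match, or None if it is unranked."""
    return ranks.get(domain.lower().rstrip("."))
```

Loading the list once and reusing the dict keeps each lookup constant-time across the full lookalike export.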
External cross-reference sources

Abuse feeds we cross-reference (linked upstream).

Fetched daily into /data/domaindefender/external_feeds/<DATE>/ on cb1, with a manifest.json recording row counts, source URLs, and license notes. We do not republish these feeds: most carry research-attribution or non-commercial licenses, so cite the upstream source. Inspired by the 10-feed cross-reference in Sommese et al., DarkDNS, IMC 2024.

On the wishlist (academic-access pending): OpenINTEL (active forward-DNS for the ccTLD coverage gap) · CAIDA DZDB (historical zone-file archive for time-series claims) · Farsight DNSDB (passive DNS at registry scale).

Every download above is also available as a JSON endpoint; see the /api/ overview, the Swagger UI, and the OpenAPI JSON.