From population to measurement.
The data flow that produces every claim on this site. The pipeline is
idempotent and can be re-run end-to-end via
web/lookalike_mining/run_pipeline.sh.
Three sources, three different views of the domain space.
Daily zone files for 1,081 gTLDs, ~265M apex domains. The universe of registered gTLD domains. Comprehensive for gTLDs, but with no ccTLD coverage.
Top 1M domains by aggregated popularity (CrUX + Farsight +
Majestic + Cloudflare Radar + Cisco Umbrella). Pinned ID
L7KW4 (2026-04-23).
Cloudflare 1.1.1.1 used as an existence oracle for ccTLDs we can't get zone files for (.ru, .cn, .ir, .de, .ph, ...). 85 ccTLDs probed; 79 returned at least one hit. Bypasses per-registry agreements entirely.
5 detection layers, 242 brands, 1,160 TLDs.
242 brands = 42 hand-curated wallet/fintech/bank/saas + 200 auto-expanded from Tranco top 1K (skipping noisy short SLDs and brands already curated).
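The auto-expansion step can be sketched roughly as follows. This is a minimal illustration, not the code in brands.py: the function name `expand_brands`, the `min_sld_len` threshold, and the naive label split (no public-suffix handling) are all assumptions made for the example.

```python
def expand_brands(curated, ranked_domains, target=200, min_sld_len=4):
    """Pick up to `target` extra brand labels from a popularity-ranked list.

    `min_sld_len` is a hypothetical cut-off for "noisy short SLDs";
    the real threshold is not stated in the text.
    """
    seen = {b.lower() for b in curated}
    extra = []
    for domain in ranked_domains:
        # Naive split: treats the first label as the SLD.
        sld = domain.split(".")[0].lower()
        if len(sld) < min_sld_len or sld in seen:
            continue  # too short, already curated, or already added
        seen.add(sld)
        extra.append(sld)
        if len(extra) == target:
            break
    return extra


curated = ["paypal"]
ranked = ["google.com", "x.com", "paypal.com", "amazon.com", "google.de"]
extra = expand_brands(curated, ranked, target=2)
```

Here `x.com` is skipped as too short and `paypal.com` as already curated, so the two added labels are `google` and `amazon`.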
Two existence oracles, one shared output.
Per-brand chunked $in queries against the zone_file collection. Misses are discarded. Each match is tagged with its most specific layer (homoglyph outranks DL-1, since homoglyph matches are a strict subset of DL-1 matches).
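A rough sketch of the chunking and layer-priority logic described above. The field name `domain`, the chunk size, and the layer labels `homoglyph` and `dl1` are assumptions for illustration; only the rule that homoglyph outranks DL-1 comes from the text.

```python
from itertools import islice

# Most specific first. Labels are hypothetical; the real pipeline has 5 layers.
LAYER_PRIORITY = ["homoglyph", "dl1"]


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def build_queries(candidates, chunk_size=1000, field="domain"):
    """Build one MongoDB-style $in filter per chunk of candidate names."""
    return [{field: {"$in": chunk}} for chunk in chunked(candidates, chunk_size)]


def most_specific_layer(layers):
    """Return the highest-priority layer among those that generated a match."""
    for layer in LAYER_PRIORITY:
        if layer in layers:
            return layer
    raise ValueError(f"no known layer in {layers!r}")


queries = build_queries(["paypa1.com", "paypai.com", "paypal.co"], chunk_size=2)
layer = most_specific_layer({"dl1", "homoglyph"})
```

Each filter would then be passed to a `find()` call on the zone-file collection. Chunking keeps any single query document well under MongoDB's size limits.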
For each (brand, ccTLD) candidate, do a DoH A/AAAA lookup. If
Cloudflare returns a record, the domain exists AND is live.
Matches are written with layer = cctld_squat
and tagged data_source: doh_probe_cctld. No
registry agreement needed.
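The probe above might look something like this minimal sketch. It assumes Cloudflare's JSON DoH endpoint at cloudflare-dns.com; the function names and the `domain` output key are hypothetical, while `layer` and `data_source` use the values given in the text.

```python
import json
import urllib.parse
import urllib.request

DOH_URL = "https://cloudflare-dns.com/dns-query"  # assumed endpoint
RR_TYPES = {"A": 1, "AAAA": 28}  # numeric DNS record types


def has_address_record(response, rrtype):
    """True if a DoH JSON response is NOERROR and carries a record of rrtype."""
    if response.get("Status") != 0:
        return False
    want = RR_TYPES[rrtype]
    return any(rr.get("type") == want for rr in response.get("Answer", []))


def doh_query(name, rrtype):
    """Fetch one DoH JSON answer over HTTPS (performs a network call)."""
    params = urllib.parse.urlencode({"name": name, "type": rrtype})
    req = urllib.request.Request(
        f"{DOH_URL}?{params}", headers={"accept": "application/dns-json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def probe(brand, cctld, query=doh_query):
    """Return a match row if brand.cctld has an A or AAAA record, else None."""
    domain = f"{brand}.{cctld}"
    if any(has_address_record(query(domain, t), t) for t in ("A", "AAAA")):
        return {
            "domain": domain,
            "layer": "cctld_squat",
            "data_source": "doh_probe_cctld",
        }
    return None
```

Passing `query` as a parameter keeps the parsing logic testable without network access.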
Re-resolve every match over DoH.
Cloudflare 1.1.1.1 with Google 8.8.8.8 fallback. ~50 concurrent lookups; ~6 minutes for the full set. The live flag is the most consequential filter on the page: dead or parked candidates are speculative, live candidates are operational.
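One way to bound concurrency and fall back between resolvers is sketched below with asyncio. The resolver callables are placeholders supplied by the caller, and the choice to fall back only when the primary raises (rather than when it returns no records) is an assumption, not something the text specifies.

```python
import asyncio


async def resolve_with_fallback(domain, primary, fallback):
    """Try the primary resolver; use the fallback only if the primary errors."""
    try:
        return await primary(domain)
    except Exception:
        return await fallback(domain)


async def recheck_live(domains, primary, fallback, concurrency=50):
    """Return {domain: live} with at most `concurrency` lookups in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(domain):
        async with sem:
            try:
                records = await resolve_with_fallback(domain, primary, fallback)
                return domain, bool(records)
            except Exception:
                # Both resolvers failed: treat the domain as not live.
                return domain, False

    return dict(await asyncio.gather(*(one(d) for d in domains)))
```

In a real run, `primary` and `fallback` would wrap DoH requests to the two providers; here they are any async callables that return a list of records.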
RDAP + Tranco + IP intel + 6 abuse feeds + DNS RRs.
Each match gets joined against domain_metadata (RDAP), tranco (popularity), ip_intel (country / ASN / ISP via ip-api.com), dns_rrs (full A/AAAA/MX/NS/TXT/CAA records, populated for ccTLDs by DoH), and the blocklist collection (PhishTank + URLhaus + CryptoScamDB + OpenPhish + ThreatFox + Phishing.Database). The cross-reference produces the known-bad flag and powers the blind-spot evaluation.
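As a simplified picture of that join, the sketch below uses in-memory dicts keyed by domain in place of the Mongo collections. The output field names (`rdap`, `tranco_rank`, `known_bad`, `known_bad_feeds`) are hypothetical, and ip_intel and dns_rrs are omitted for brevity.

```python
def enrich(match, metadata, tranco_rank, blocklist):
    """Join one match row against lookup tables keyed by domain.

    `blocklist` maps a domain to the set of abuse feeds that list it.
    """
    domain = match["domain"]
    feeds = sorted(blocklist.get(domain, ()))
    return {
        **match,
        "rdap": metadata.get(domain),
        "tranco_rank": tranco_rank.get(domain),
        "known_bad": bool(feeds),  # listed by at least one feed
        "known_bad_feeds": feeds,
    }


row = enrich(
    {"domain": "paypa1.example", "layer": "homoglyph"},
    metadata={"paypa1.example": {"registrar": "Example Registrar"}},
    tranco_rank={},
    blocklist={"paypa1.example": {"URLhaus", "PhishTank"}},
)
```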
For auto-expanded brands (amazon, ebay, shopify, ...) some ccTLD
"hits" are the brand's own subsidiaries (amazon.de etc.), not
impostors. flag_defensive.py marks rows where the
domain shares a SaaS verification token (Atlassian / Adobe /
Microsoft / Google / Stripe) with multiple ccTLDs of the same
brand, OR shares authoritative NS with the legitimate apex.
These are tagged defensive_likely: true and
excluded from impostor counts.
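The two defensive rules can be illustrated with the sketch below. It is not the actual flag_defensive.py: the row fields (`txt_tokens`, `ns`), the `apex_ns` lookup, and the reading of "multiple" as at least two other domains of the same brand are assumptions for the example.

```python
from collections import defaultdict


def flag_defensive(rows, apex_ns, min_shared=2):
    """Add defensive_likely to each row.

    rows: dicts with domain, brand, txt_tokens (set), ns (set).
    apex_ns: brand -> set of authoritative NS names of the legitimate apex.
    """
    # Index which domains of each brand carry each verification token.
    token_domains = defaultdict(set)
    for row in rows:
        for token in row["txt_tokens"]:
            token_domains[(row["brand"], token)].add(row["domain"])

    flagged = []
    for row in rows:
        shared_token = any(
            len(token_domains[(row["brand"], token)] - {row["domain"]}) >= min_shared
            for token in row["txt_tokens"]
        )
        shared_ns = bool(set(row["ns"]) & apex_ns.get(row["brand"], set()))
        flagged.append({**row, "defensive_likely": shared_token or shared_ns})
    return flagged
```

Rows with `defensive_likely` set to true would then be left out of impostor counts.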
Campaigns + infrastructure + TXT operator fingerprint.
Groups of 3 or more matches sharing the same day, registrar, and country are promoted into campaign clusters.
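The grouping rule can be sketched as a plain group-by. The field names `created`, `registrar`, and `country` are assumptions, as is treating rows with a missing key field as unclusterable.

```python
from collections import defaultdict


def campaign_clusters(rows, min_size=3):
    """Group rows by (day, registrar, country); keep groups of min_size or more."""
    groups = defaultdict(list)
    for row in rows:
        key = (row.get("created"), row.get("registrar"), row.get("country"))
        if None in key:
            continue  # incomplete metadata: cannot be placed in a cluster
        groups[key].append(row["domain"])
    return {key: sorted(domains) for key, domains in groups.items() if len(domains) >= min_size}
```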
Shared non-CDN IPs and non-cloud ASNs are operator signals. CDN/cloud prefixes filtered out so we don't spuriously link everyone-on-Cloudflare.
SaaS verification tokens (Atlassian / Adobe / Stripe / Google / Microsoft / ...) are per-account. Same token across many domains = same operator.
linked_bad: any candidate sharing a non-CDN IP or non-cloud ASN with a confirmed-bad row is operationally adjacent. Adds a graph-inferred tier between known-bad and blind-spot. See /eval/ and /infra/.
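A minimal sketch of the linked_bad inference, assuming each row carries `known_bad`, a list of `ips`, and an `asn`. Those field names, the function signature, and the way CDN prefixes and cloud ASNs are passed in are hypothetical.

```python
import ipaddress


def is_cdn(ip, cdn_prefixes):
    """True if ip falls inside any of the given CDN/cloud networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in cdn_prefixes)


def linked_bad(rows, cdn_prefixes=(), cloud_asns=frozenset()):
    """Return domains that share a non-CDN IP or non-cloud ASN with a known-bad row."""
    bad_ips, bad_asns = set(), set()
    for row in rows:
        if not row["known_bad"]:
            continue
        bad_ips.update(ip for ip in row["ips"] if not is_cdn(ip, cdn_prefixes))
        if row["asn"] is not None and row["asn"] not in cloud_asns:
            bad_asns.add(row["asn"])

    return sorted(
        row["domain"]
        for row in rows
        if not row["known_bad"]
        and (set(row["ips"]) & bad_ips or row["asn"] in bad_asns)
    )
```

Filtering shared infrastructure before building the bad sets is what prevents a single confirmed-bad domain behind a large CDN from tainting every other tenant.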
Run it yourself.
cd /home/C00621463/DomainDefender/web/lookalike_mining
./run_pipeline.sh
# logs at /tmp/dd-pipeline-*
End-to-end refresh on the current Mongo state. Each stage is idempotent and can be
re-run independently. The brand list is in brands.py, the generator in
generate.py, and the miner in match.py. See
/datasets/ for full citations.