Pipeline

From population to measurement.

The data flow that produces every claim on this site. The pipeline is idempotent and can be re-run end-to-end via web/lookalike_mining/run_pipeline.sh.

Stage 1 / Inputs

Three sources, three different views of the domain space.

~265M zone_file rows
1,000,000 Tranco rows
79 ccTLDs (DoH-probed)

ICANN CZDS (gTLD)

Daily zone files for 1,081 gTLDs, covering ~265M apex domains: the universe of registered gTLD domains. Comprehensive for gTLDs, but with no ccTLD coverage.

Tranco daily list

Top 1M domains by aggregated popularity (CrUX + Farsight + Majestic + Cloudflare Radar + Cisco Umbrella). Pinned ID L7KW4 (2026-04-23).

DoH probe (ccTLD)

Cloudflare 1.1.1.1 used as an existence oracle for ccTLDs we can't get zone files for (.ru, .cn, .ir, .de, .ph, ...). 85 ccTLDs probed; 79 returned at least one hit. Bypasses per-registry agreements entirely.

Stage 2 / Generate candidates

5 detection layers, 242 brands, 1,160 TLDs.

generate.py
~68M candidates
tld_squat: exact <brand>.<X> for every gTLD
homoglyph: m/rn, l/1, o/0, i/1, e/3, a/4, s/5, b/6
idn_punycode: Cyrillic/Greek confusables (а→a, о→o, е→e) encoded as xn--
combosquat: <brand>+phish-keyword (claim, login, support, ...)
dl1: single insertion / deletion / substitution / transposition

242 brands = 42 hand-curated wallet/fintech/bank/saas + 200 auto-expanded from Tranco top 1K (skipping noisy short SLDs and brands already curated).
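
The layers above can be sketched in a few lines each. This is a minimal illustration, not the contents of generate.py: the function names, the one-swap-per-variant rule, the Cyrillic-only confusable table, and the "" / "-" separators for combosquats are all assumptions.

```python
import string

# Look-alike swaps listed for the homoglyph layer.
HOMOGLYPHS = {"m": ["rn"], "l": ["1"], "o": ["0"], "i": ["1"],
              "e": ["3"], "a": ["4"], "s": ["5"], "b": ["6"]}
# Latin -> Cyrillic confusables for the idn_punycode layer (assumed subset).
CONFUSABLES = {"a": "\u0430", "o": "\u043e", "e": "\u0435"}
ALPHABET = string.ascii_lowercase + string.digits + "-"


def homoglyph(label):
    """Swap one character at a time for its visual look-alike."""
    out = set()
    for i, ch in enumerate(label):
        for sub in HOMOGLYPHS.get(ch, []):
            out.add(label[:i] + sub + label[i + 1:])
    return out


def dl1(label):
    """All labels at Damerau-Levenshtein distance 1 from `label`."""
    out = set()
    for i in range(len(label) + 1):
        for c in ALPHABET:
            out.add(label[:i] + c + label[i:])              # insertion
    for i in range(len(label)):
        out.add(label[:i] + label[i + 1:])                  # deletion
        for c in ALPHABET:
            out.add(label[:i] + c + label[i + 1:])          # substitution
    for i in range(len(label) - 1):
        out.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])  # transposition
    out.discard(label)
    # Drop empty labels and labels with a leading or trailing hyphen.
    return {s for s in out if s and not s.startswith("-") and not s.endswith("-")}


def idn_punycode(label):
    """Replace one Latin letter with a Cyrillic twin and encode as xn--."""
    out = set()
    for i, ch in enumerate(label):
        if ch in CONFUSABLES:
            spoof = label[:i] + CONFUSABLES[ch] + label[i + 1:]
            out.add(spoof.encode("idna").decode("ascii"))
    return out


def combosquat(label, keywords=("claim", "login", "support")):
    """Append each phishing keyword, with and without a hyphen."""
    return {f"{label}{sep}{kw}" for kw in keywords for sep in ("", "-")}
```

Each function returns second-level labels only; pairing them with the TLD list to form full candidates is left out of the sketch.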

Stage 3 / Existence filter

Two existence oracles, one shared output.

match.py + cctld_probe.py
40,350 gTLD + 6,557 ccTLD = 46,907

gTLD path: zone_file lookup

Per-brand chunked $in queries against the zone_file collection. Misses are discarded. Each match is tagged with its most specific layer (homoglyph outranks dl1, since homoglyph is a strict subset).
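
A rough sketch of the chunked lookup and the layer tagging follows. The field name domain, the chunk size, and every entry in LAYER_PRIORITY other than "homoglyph before dl1" are assumptions made for illustration; zone_file stands for any object with a pymongo-style find().

```python
from itertools import islice

# Lower number = more specific. Only "homoglyph outranks dl1" comes from the
# text above; the remaining order is assumed.
LAYER_PRIORITY = {"tld_squat": 0, "idn_punycode": 1, "homoglyph": 2,
                  "combosquat": 3, "dl1": 4}


def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def existing_domains(zone_file, candidates, chunk_size=10_000):
    """Return the candidates that are present in the zone_file collection."""
    found = set()
    for batch in chunks(candidates, chunk_size):
        cursor = zone_file.find({"domain": {"$in": batch}},
                                {"domain": 1, "_id": 0})
        found.update(doc["domain"] for doc in cursor)
    return found  # anything not in `found` is a miss and is dropped


def most_specific_layer(layers):
    """Pick the highest-priority layer among those that produced a match."""
    return min(layers, key=LAYER_PRIORITY.__getitem__)
```

Chunking keeps each $in array bounded, so a brand with many candidates does not turn into one oversized query document.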

ccTLD path: DoH-as-oracle

For each (brand, ccTLD) candidate, run a DoH A/AAAA lookup. If Cloudflare returns a record, the domain exists AND is live. Matches are written with layer = cctld_squat and tagged data_source: doh_probe_cctld. No registry agreement needed.
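
The probe could look roughly like this, using Cloudflare's JSON DoH endpoint. It is a sketch, not cctld_probe.py: the function names, the 5-second timeout, and the rule "NOERROR plus at least one record of the requested type" are assumptions.

```python
import json
import urllib.parse
import urllib.request

DOH_URL = "https://cloudflare-dns.com/dns-query"
RR_TYPES = {"A": 1, "AAAA": 28}  # DNS record type codes


def doh_query(name, rtype):
    """Fetch one DoH JSON answer from Cloudflare (performs a network call)."""
    url = f"{DOH_URL}?{urllib.parse.urlencode({'name': name, 'type': rtype})}"
    req = urllib.request.Request(url, headers={"accept": "application/dns-json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def has_record(response, rtype):
    """True if a NOERROR response carries at least one record of `rtype`."""
    if response.get("Status") != 0:
        return False
    return any(rr.get("type") == RR_TYPES[rtype]
               for rr in response.get("Answer", []))


def probe(domain, query=doh_query):
    """Return a match row if the domain resolves, else None."""
    if any(has_record(query(domain, t), t) for t in ("A", "AAAA")):
        return {"domain": domain, "layer": "cctld_squat",
                "data_source": "doh_probe_cctld"}
    return None
```

The query parameter exists so the resolver can be swapped for a stub in tests; probe("paypal.ru") with the default would hit the network.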

Stage 4 / Liveness filter

Re-resolve every match over DoH.

liveness.py
31,339 live (77.7%)

Cloudflare 1.1.1.1 with Google 8.8.8.8 as fallback. ~50 concurrent lookups; ~6 minutes for the full set. The live flag is the most consequential filter on the page: dead/parked candidates are speculative, live candidates are operational.
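
One way to get bounded concurrency with a fallback resolver is an asyncio semaphore around an ordered resolver list. The sketch below assumes each resolver is an async callable returning a list of addresses, and that fallback happens only when a resolver raises; liveness.py may behave differently.

```python
import asyncio


async def is_live(domain, resolvers):
    """Try each resolver in order; the first one that answers decides."""
    for resolve in resolvers:
        try:
            return bool(await resolve(domain))
        except Exception:
            continue  # resolver unreachable: fall back to the next one
    return False


async def check_all(domains, resolvers, concurrency=50):
    """Resolve every domain with at most `concurrency` lookups in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(domain):
        async with sem:
            return domain, await is_live(domain, resolvers)

    return dict(await asyncio.gather(*(one(d) for d in domains)))
```

Called as asyncio.run(check_all(domains, [cloudflare, google])), where cloudflare and google are hypothetical DoH client coroutines, this returns a domain-to-bool map that can be written back as the live flag.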

Stage 5 / Cross-reference

RDAP + Tranco + IP intel + 6 abuse feeds + DNS RRs.

enrich.py + ip_intel.py + cctld_dns_parse.py + attach_to_lookalikes.py
252 known-bad · 4,352 linked-bad

Each match gets joined against domain_metadata (RDAP), tranco (popularity), ip_intel (country / ASN / ISP via ip-api.com), dns_rrs (full A/AAAA/MX/NS/TXT/CAA records, populated for ccTLDs by DoH), and the blocklist collection (PhishTank + URLhaus + CryptoScamDB + OpenPhish + ThreatFox + Phishing.Database). The cross-reference produces the known-bad flag and powers the blind-spot evaluation.
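
In outline, the join is a set of keyed lookups per match. The sketch below uses plain dicts in place of the Mongo collections; the field names (ips, tranco_rank, blocklist_feeds) and the rule "known_bad if any feed lists the domain" are assumptions.

```python
def enrich(match, rdap, tranco, ip_intel, blocklist):
    """Join one match against the lookup tables, keyed by domain or IP.

    rdap, tranco: dict domain -> record / rank
    ip_intel:     dict ip -> {"country": ..., "asn": ..., "isp": ...}
    blocklist:    dict domain -> set of feed names that list it
    """
    domain = match["domain"]
    feeds = sorted(blocklist.get(domain, ()))
    return {
        **match,
        "rdap": rdap.get(domain),
        "tranco_rank": tranco.get(domain),
        "ip_intel": [ip_intel[ip] for ip in match.get("ips", []) if ip in ip_intel],
        "blocklist_feeds": feeds,
        "known_bad": bool(feeds),
    }
```

A live match with known_bad false is exactly the kind of row the blind-spot evaluation looks at.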

Defensive-registration filter

For auto-expanded brands (amazon, ebay, shopify, ...) some ccTLD "hits" are the brand's own subsidiaries (amazon.de etc.), not impostors. flag_defensive.py marks rows where the domain shares a SaaS verification token (Atlassian / Adobe / Microsoft / Google / Stripe) with multiple ccTLDs of the same brand, OR shares authoritative NS with the legitimate apex. These are tagged defensive_likely: true and excluded from impostor counts.
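
The two rules can be sketched as below. This is not flag_defensive.py: the row shape (brand, txt_tokens, ns), the apex_ns lookup, and reading "multiple" as "at least two domains of the same brand" are assumptions.

```python
from collections import defaultdict


def flag_defensive(rows, apex_ns, min_shared=2):
    """Set defensive_likely on each row.

    rows:    dicts with domain, brand, txt_tokens (set), ns (set)
    apex_ns: dict brand -> set of NS hostnames of the legitimate apex
    """
    # Rule 1 index: which domains of a brand carry each verification token.
    token_domains = defaultdict(set)
    for r in rows:
        for tok in r["txt_tokens"]:
            token_domains[(r["brand"], tok)].add(r["domain"])

    for r in rows:
        shares_token = any(len(token_domains[(r["brand"], t)]) >= min_shared
                           for t in r["txt_tokens"])
        # Rule 2: any authoritative NS in common with the brand's real apex.
        shares_ns = bool(r["ns"] & apex_ns.get(r["brand"], set()))
        r["defensive_likely"] = shares_token or shares_ns
    return rows
```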

Stage 6 / Cluster

Campaigns + infrastructure + TXT operator fingerprint.

cluster.py + cluster_infra.py + cluster_txt.py + linked_bad.py
1,590 non-CDN IP clusters · 4,352 linked-bad

RDAP-based campaigns

Groups of three or more matches that share the same day, registrar, and country in RDAP are promoted into campaign clusters.
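
A minimal grouping sketch, assuming "same day" means the RDAP creation date and that each row carries created (ISO 8601 string), registrar, and country; those field names are hypothetical.

```python
from collections import defaultdict


def rdap_campaigns(rows, min_size=3):
    """Group rows by (creation day, registrar, country); keep groups >= min_size."""
    groups = defaultdict(list)
    for r in rows:
        day = r["created"][:10]  # "2026-04-01T10:00:00Z" -> "2026-04-01"
        groups[(day, r["registrar"], r["country"])].append(r["domain"])
    return {key: sorted(doms) for key, doms in groups.items()
            if len(doms) >= min_size}
```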

IP / ASN infrastructure

Shared non-CDN IPs and non-cloud ASNs are operator signals. CDN/cloud prefixes filtered out so we don't spuriously link everyone-on-Cloudflare.
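
The IP half of this can be sketched with the standard ipaddress module. CDN_PREFIXES here is a placeholder holding one documentation range; a real run would load published CDN/cloud prefix lists. The row shape and the minimum cluster size of 2 are assumptions.

```python
import ipaddress
from collections import defaultdict

# Placeholder prefix (an RFC 5737 documentation range), standing in for a
# real CDN/cloud prefix list.
CDN_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]


def is_cdn(ip):
    """True if the address falls inside any known CDN/cloud prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CDN_PREFIXES)


def ip_clusters(rows, min_size=2):
    """Map each non-CDN IP to the set of domains hosted on it."""
    by_ip = defaultdict(set)
    for r in rows:
        for ip in r["ips"]:
            if not is_cdn(ip):
                by_ip[ip].add(r["domain"])
    return {ip: doms for ip, doms in by_ip.items() if len(doms) >= min_size}
```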

TXT verification fingerprint

SaaS verification tokens (Atlassian / Adobe / Stripe / Google / Microsoft / ...) are per-account. Same token across many domains = same operator.
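
As a sketch, the fingerprint is a group-by on verification TXT values. The prefixes below are commonly seen formats for these vendors, not necessarily the list cluster_txt.py uses, and the dns_rrs shape (domain -> list of TXT strings) is assumed.

```python
from collections import defaultdict

# Commonly seen verification-record prefixes (assumed, not exhaustive).
VERIFICATION_PREFIXES = (
    "atlassian-domain-verification=",
    "adobe-idp-site-verification=",
    "stripe-verification=",
    "google-site-verification=",
    "MS=",
)


def extract_tokens(txt_records):
    """Keep only TXT values that look like SaaS verification tokens."""
    cleaned = (rec.strip().strip('"') for rec in txt_records)
    return {rec for rec in cleaned if rec.startswith(VERIFICATION_PREFIXES)}


def txt_clusters(dns_rrs, min_size=2):
    """Map each token to the domains publishing it; keep shared tokens only."""
    by_token = defaultdict(set)
    for domain, txts in dns_rrs.items():
        for tok in extract_tokens(txts):
            by_token[tok].add(domain)
    return {tok: doms for tok, doms in by_token.items() if len(doms) >= min_size}
```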

linked_bad: any candidate sharing a non-CDN IP or non-cloud ASN with a confirmed-bad row is operationally adjacent. Adds a graph-inferred tier between known-bad and blind-spot. See /eval/ and /infra/.
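
The propagation is a one-hop lookup from known-bad rows. In this sketch the row fields (known_bad, ips, asn) and the cdn_ips / cloud_asns exclusion sets are hypothetical stand-ins for whatever linked_bad.py reads.

```python
def linked_bad(rows, cdn_ips=frozenset(), cloud_asns=frozenset()):
    """Return domains that share a non-CDN IP or non-cloud ASN with a known-bad row."""
    bad_ips, bad_asns = set(), set()
    for r in rows:
        if r["known_bad"]:
            bad_ips |= set(r["ips"]) - cdn_ips
            if r.get("asn") is not None and r["asn"] not in cloud_asns:
                bad_asns.add(r["asn"])

    linked = set()
    for r in rows:
        if r["known_bad"]:
            continue  # already in the stronger tier
        if set(r["ips"]) & bad_ips or r.get("asn") in bad_asns:
            linked.add(r["domain"])
    return linked
```

Only one hop is followed, so a linked-bad row never makes its own neighbours linked-bad.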

Reproducibility

Run it yourself.

cd /home/C00621463/DomainDefender/web/lookalike_mining
./run_pipeline.sh
# logs at /tmp/dd-pipeline-*

End-to-end refresh on the current Mongo state. Each stage is idempotent and can be re-run independently. The brand list is in brands.py, the generator in generate.py, and the miner in match.py. See /datasets/ for full citations.
