Evaluation

The blind-spot measurement.

How much of our brand-impersonation surface ever shows up in the abuse feeds defenders actually use? We cross-reference our lookalikes set at population scale against the feeds we ingest: PhishTank, URLhaus, CryptoScamDB, OpenPhish, ThreatFox, Phishing.Database, MetaMask eth-phishing-detect, Phishing Army, and abuse.ch MalwareBazaar / Feodo / SSLBL (the live count is on /feeds/). We then add a linked-bad graph layer: any candidate that shares a non-CDN IP or non-cloud ASN with a confirmed-bad row is treated as operationally adjacent. The result is a known / linked / blind-spot trichotomy.

The flow

Trichotomy at a glance.

Every candidate enters at the left and exits at one of three terminals. Bar widths are proportional to row count; numbers populate live.

widths proportional · CDN/cloud ASNs filtered before graph-linking · defensive_likely rows excluded

Citation chain

How each number is computed.

Every claim above expands into the exact Mongo query, source files, and join logic that produced it. Reviewers should be able to walk from a headline number to a SQL-equivalent query in one hop.

Lookalikes total — the candidate set →

db.lookalikes.countDocuments({ defensive_likely: { $ne: true } })

Rows where layer ∈ {tld_squat, homoglyph, idn_punycode, combosquat, dl1, cctld_squat}. The defensive_likely filter excludes brand-operated portfolio domains (auto-brands like amazon.de, which would otherwise surface as a ccTLD lookalike of their own parent brand). Rows are generated by lookalike_mining/match.py and cctld_probe.py; deduplication is on the (brand, domain) tuple.
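As a rough illustration of the filter and deduplication described above, the sketch below rebuilds a candidate set from in-memory rows. The function name `candidate_set`, the sample rows, and the lowercase normalization of the domain are assumptions made for this example; they are not taken from lookalike_mining/match.py.

```python
# Layer names as listed in the text above.
LAYERS = {"tld_squat", "homoglyph", "idn_punycode", "combosquat", "dl1", "cctld_squat"}


def candidate_set(rows):
    """Keep one row per (brand, domain) and drop defensive_likely rows."""
    seen = set()
    kept = []
    for row in rows:
        if row.get("layer") not in LAYERS:
            continue
        # Mirrors { defensive_likely: { $ne: true } }: a missing or false flag is kept.
        if row.get("defensive_likely") is True:
            continue
        # Assumption: domains are compared case-insensitively.
        key = (row["brand"], row["domain"].lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Hypothetical rows, for illustration only.
sample_rows = [
    {"brand": "example", "domain": "examp1e.com", "layer": "homoglyph"},
    {"brand": "example", "domain": "EXAMP1E.com", "layer": "homoglyph"},  # duplicate
    {"brand": "amazon", "domain": "amazon.de", "layer": "cctld_squat", "defensive_likely": True},
    {"brand": "paypal", "domain": "paypa1.com", "layer": "dl1", "defensive_likely": False},
]
candidates = candidate_set(sample_rows)
print(len(candidates))  # 2
```

Only the first copy of a duplicated (brand, domain) pair survives, so the count matches what a unique index on that tuple would produce.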

Known-bad — rows in any abuse feed →

db.lookalikes.countDocuments({ blocklist_hit: { $exists: true, $ne: [] } })

Joined against blocklist on exact apex match. Six feeds are ingested by lookalike_mining/blocklist_refresh.py: PhishTank online-valid.csv, URLhaus urls.txt, CryptoScamDB blacklist.json, OpenPhish public feed, abuse.ch ThreatFox CSV, and Phishing.Database ALL-phishing-domains.txt (mitchellkrogza). The "live" subset is the rows with resolves_a == true at the last DoH sweep.
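The exact-apex join can be pictured as a set lookup per feed. The sketch below is a minimal version under two assumptions: each feed has already been reduced to a set of apex domains, and each lookalike row's `domain` is itself an apex. The function names and sample data are hypothetical.

```python
def tag_blocklist_hits(rows, feeds):
    """Set row["blocklist_hit"] to the sorted names of feeds listing the row's apex.

    feeds maps a feed name to a set of lowercase apex domains.
    """
    for row in rows:
        apex = row["domain"].strip().lower().rstrip(".")
        row["blocklist_hit"] = sorted(
            name for name, apexes in feeds.items() if apex in apexes
        )
    return rows


def count_known_bad(rows, live_only=False):
    """Count rows with at least one feed hit; optionally only those that resolve."""
    return sum(
        1
        for row in rows
        if row.get("blocklist_hit")
        and (not live_only or row.get("resolves_a") is True)
    )


# Hypothetical feeds and rows, for illustration only.
feeds = {
    "PhishTank": {"examp1e.com"},
    "URLhaus": {"examp1e.com", "exampl3.test"},
}
rows = tag_blocklist_hits(
    [
        {"domain": "examp1e.com", "resolves_a": True},
        {"domain": "exampl3.test", "resolves_a": False},
        {"domain": "exarnple.com", "resolves_a": True},
    ],
    feeds,
)
print(count_known_bad(rows), count_known_bad(rows, live_only=True))  # 2 1
```

Because the match is exact, a feed entry for a subdomain or a full URL only counts once it has been reduced to its apex during ingest.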

Linked-bad — graph-adjacent to a known-bad row →

db.lookalikes.countDocuments({ linked_bad: true, blocklist_hit: { $in: [null, []] } })

A row is linked_bad if its IP or ASN matches at least one known-bad row's IP/ASN, after filtering CDN/cloud anchors. Filter list: Cloudflare, Akamai, AWS, Google, Microsoft, Fastly, Apple, Vercel, Netlify, CloudFront, Hetzner, OVH (when fronting CDN). Implemented in lookalike_mining/linked_bad.py, which uses both the RDAP-derived ASN (rdap.asn) and the ip-api/ip_intel-derived ASN (ip_intel.ip_asn) for each row. The query excludes rows that already have a blocklist_hit, so Linked-bad is mutually exclusive with Known-bad.
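The linking rule above can be sketched as two passes: collect IPs and ASNs from known-bad rows that are not on a filtered network, then flag any row that shares one of them. This is a simplified illustration, not the code in lookalike_mining/linked_bad.py. The `ip` field name, the function names, and the sample data are assumptions; the ASNs and IP addresses come from ranges reserved for documentation.

```python
def row_asns(row):
    """Collect the ASNs recorded for a row from rdap.asn and ip_intel.ip_asn."""
    asns = set()
    for section, field in (("rdap", "asn"), ("ip_intel", "ip_asn")):
        value = (row.get(section) or {}).get(field)
        if value is not None:
            asns.add(int(value))
    return asns


def mark_linked_bad(rows, filtered_asns):
    """Set row["linked_bad"] for every row, ignoring CDN/cloud anchors."""
    bad_ips, bad_asns = set(), set()
    # Pass 1: known-bad rows on non-filtered networks become anchors.
    for row in rows:
        asns = row_asns(row)
        if not row.get("blocklist_hit") or asns & filtered_asns:
            continue
        bad_asns |= asns
        if row.get("ip"):
            bad_ips.add(row["ip"])
    # Pass 2: a row is linked if it shares an anchor IP or ASN.
    for row in rows:
        asns = row_asns(row)
        if asns & filtered_asns:
            row["linked_bad"] = False
        else:
            row["linked_bad"] = row.get("ip") in bad_ips or bool(asns & bad_asns)
    return rows


def count_linked_only(rows):
    """Mirror the query: linked_bad rows that have no blocklist hit."""
    return sum(1 for r in rows if r.get("linked_bad") and not r.get("blocklist_hit"))


# Hypothetical data: AS64496 stands in for a filtered CDN/cloud network.
FILTERED = {64496}
rows = mark_linked_bad(
    [
        {"ip": "192.0.2.10", "rdap": {"asn": 64500}, "blocklist_hit": ["PhishTank"]},
        {"ip": "192.0.2.10", "rdap": {"asn": 64500}},        # shares the bad IP
        {"ip": "198.51.100.7", "ip_intel": {"ip_asn": 64500}},  # shares the bad ASN
        {"ip": "203.0.113.5", "rdap": {"asn": 64496}},        # on a filtered network
        {"ip": "198.51.100.99", "rdap": {"asn": 64501}},      # no link
    ],
    FILTERED,
)
print(count_linked_only(rows))  # 2
```

Known-bad rows trivially link to themselves in this sketch; excluding them at count time keeps the two buckets disjoint, as the query does.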

True blind-spot % — the headline negative result →

blind_spot_pct = (total − known_bad − linked_bad) / total × 100

Computed in api/app/routers/lookalikes.py :: blindspot(). The remainder of the corpus — rows present in zone files / DoH oracle that have no abuse-feed signal and no IP/ASN graph link to a known-bad row — is what we call the true blind-spot. This is the steady-state coverage gap, not a temporal one.
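To make the arithmetic concrete, the following sketch evaluates the formula on made-up counts. The function below is an illustration written for this page, not the body of blindspot() in api/app/routers/lookalikes.py, and the numbers are hypothetical rather than measured.

```python
def blind_spot_pct(total, known_bad, linked_bad):
    """Percentage of rows that are neither known-bad nor linked-bad.

    Assumes known_bad and linked_bad are mutually exclusive counts
    drawn from the same candidate set as total.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if known_bad < 0 or linked_bad < 0 or known_bad + linked_bad > total:
        raise ValueError("known_bad + linked_bad must lie between 0 and total")
    # Multiply before dividing so round counts give an exact result.
    return 100 * (total - known_bad - linked_bad) / total


# Hypothetical counts: 1,000 candidates, 20 on a feed, 80 linked only.
print(blind_spot_pct(1000, 20, 80))  # 90.0
```

The guard on known_bad + linked_bad only holds if all three counts use the same filter; a count taken without the defensive_likely exclusion could exceed the remainder it is subtracted from.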

Per-source rows — bars in the next section →

db.lookalikes.aggregate([{ $unwind: "$blocklist_hit" }, { $group: { _id: "$blocklist_hit", n: { $sum: 1 } } }])

"of ours" = how many of OUR lookalikes appear in that feed. "feed rows" = the feed's total row count at last fetch (for ratio context, not directly compared). Source feed-row totals come from blocklist.feeds_meta updated each ingest.

Per-source coverage

Where the few we DO catch turn up.

Coverage is so low that single-digit hit counts are the norm. CryptoScamDB, the only crypto-specific feed in the comparison set, does materially better but still misses the overwhelming majority of rows.
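The per-source bars come from the $unwind / $group aggregation shown earlier. An in-memory equivalent, sketched below with hypothetical rows, shows why the per-source counts can sum to more than the known-bad total: a row listed on two feeds is counted once under each.

```python
from collections import Counter


def per_source_counts(rows):
    """Count lookalike rows per feed, like $unwind followed by $group."""
    counts = Counter()
    for row in rows:
        # Rows with a missing, null, or empty blocklist_hit contribute nothing,
        # which matches the default behavior of $unwind.
        for feed in row.get("blocklist_hit") or []:
            counts[feed] += 1
    return counts


# Hypothetical rows, for illustration only.
rows = [
    {"domain": "examp1e.com", "blocklist_hit": ["PhishTank", "OpenPhish"]},
    {"domain": "exampl3.test", "blocklist_hit": ["PhishTank"]},
    {"domain": "exarnple.com", "blocklist_hit": []},
    {"domain": "examplle.com"},
]
counts = per_source_counts(rows)
print(counts["PhishTank"], counts["OpenPhish"])  # 2 1
```

Here two rows are known-bad, but the per-source counts add up to three.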

Why this matters

The candidate set is the limiting factor.

Detection papers in this space (WalletProbe, TxPhishScope, Interface Illusions) operate on candidate sets: PhishTank URLs, Twitter scam reports, or vendor blocklists. They analyze what happens when a wallet connects to a known-suspicious site.

This page demonstrates that the candidate set is the bottleneck, not the detection pipeline downstream of it. A pipeline that starts from the population (zone files + Tranco) reveals dramatically more of the operational threat surface.

For each domain in the lookalikes corpus, we record the first time we observed it in zone_file. Where a blocklist hit exists, we also have the time the source recorded it. The temporal gap between a domain being observable in the zone and being flagged on a feed is the pre-victimization window: how long an impostor sits live before existing intel sees it.

The blind-spot measurement is the population-coverage version of the temporal-gap claim: no matter how long a domain has been live, most of our surface does not appear on any of these feeds in the current snapshot.
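The pre-victimization window described above is a simple timestamp difference. A minimal sketch follows, assuming both timestamps are timezone-aware UTC datetimes; the parameter names and the dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def pre_victimization_window(
    first_seen_zone: datetime, first_flagged: Optional[datetime]
) -> Optional[timedelta]:
    """Time between first zone observation and first feed listing.

    Returns None for rows that no feed has flagged (the blind-spot case).
    A negative result means a feed listed the domain before we observed it.
    """
    if first_flagged is None:
        return None
    return first_flagged - first_seen_zone


# Made-up dates, for illustration only.
seen = datetime(2024, 1, 1, tzinfo=timezone.utc)
flagged = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(pre_victimization_window(seen, flagged).days)  # 14
print(pre_victimization_window(seen, None))          # None
```

Rows that return None have no window to measure, which is why the population-coverage figure is reported separately from the temporal gap.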

Honest limits

"Blind-spot" here means "absent from every feed we currently ingest AND not adjacent to a known-bad row by IP/ASN." Some lookalikes may have appeared on a feed earlier and been pruned; we measure the live snapshot. Live feed list and per-source coverage on /feeds/.
The feeds we ingest are not exhaustive. APWG eCrime Exchange, vendor-private feeds (e.g. ScamSniffer), and ML-based wallet-protection extensions (Blockaid, Wallet Guard) likely catch a different subset. Academic-access applications to APWG / OpenINTEL are future work and have not yet been submitted.
"Linked-bad" is a graph claim, not a confirmed-bad claim. A row sharing a non-CDN IP or non-cloud ASN with a known-bad row is operationally adjacent but not necessarily malicious. We exclude common cloud/CDN ASNs (Cloudflare, Akamai, AWS, Google, etc.) before linking.
Some lookalikes are false positives by intent: a real popular site whose name happens to be DL-1 of a brand keyword is not impersonation. The Tranco-rank column on /lookalikes/ surfaces these so you can judge.
The evaluation now includes ccTLD impostors (collected via DoH-as-oracle on 85 ccTLDs). Rows the brand itself appears to operate (signals: SaaS-verification TXT cluster shared with the legitimate apex, or shared NS) are tagged defensive_likely: true and excluded from the impostor count.
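The defensive_likely signals named in the last item (a shared SaaS-verification TXT cluster or shared NS) could be checked along the following lines. This is a sketch under stated assumptions, not the project's tagging code: the record layout, the function names, the list of verification prefixes, and the rule that any single overlap is enough are all illustrative choices.

```python
# Assumption: an illustrative, non-exhaustive set of verification TXT prefixes.
VERIFICATION_PREFIXES = (
    "google-site-verification=",
    "facebook-domain-verification=",
    "MS=",
)


def verification_tokens(txt_records):
    """Return the TXT values that look like SaaS ownership-verification tokens."""
    return {
        txt.strip()
        for txt in txt_records
        if txt.strip().startswith(VERIFICATION_PREFIXES)
    }


def normalize_ns(ns_records):
    """Lowercase nameserver hostnames and drop the trailing root dot."""
    return {ns.strip().rstrip(".").lower() for ns in ns_records}


def is_defensive_likely(candidate, apex):
    """True if the candidate shares a verification token or a nameserver with the apex."""
    shared_txt = verification_tokens(candidate.get("txt", [])) & verification_tokens(
        apex.get("txt", [])
    )
    shared_ns = normalize_ns(candidate.get("ns", [])) & normalize_ns(apex.get("ns", []))
    return bool(shared_txt or shared_ns)


# Hypothetical records, for illustration only.
apex = {"ns": ["ns1.brand-dns.example."], "txt": ["google-site-verification=abc123"]}
candidate = {"ns": ["NS1.brand-dns.example"], "txt": []}
print(is_defensive_likely(candidate, apex))  # True
```

A nameserver shared through a large hosted-DNS provider is weak evidence of common ownership, so a stricter rule would probably discount those providers before comparing NS sets.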