The blind-spot measurement.
How much of our brand-impersonation surface ever shows up in the abuse feeds defenders actually use? We cross-reference our lookalikes set at population scale against the feeds we ingest (PhishTank, URLhaus, CryptoScamDB, OpenPhish, ThreatFox, Phishing.Database, MetaMask eth-phishing-detect, Phishing Army, abuse.ch MalwareBazaar / Feodo / SSLBL; the live count is on /feeds/). We then add a linked-bad graph layer: any candidate sharing a non-CDN IP or non-cloud ASN with a confirmed-bad row is operationally adjacent. The result is a known / linked / blind-spot trichotomy.
Trichotomy at a glance.
Every candidate enters at the left and exits at one of three terminals. Bar widths are proportional to row count; numbers populate live.
How each number is computed.
Every claim above expands into the exact Mongo query, source files, and join logic that produced it. Reviewers should be able to walk from a headline number to a SQL-equivalent query in one hop.
Lookalikes total — the candidate set →
db.lookalikes.countDocuments({ defensive_likely: { $ne: true } })
Candidate classes: {tld_squat, homoglyph, idn_punycode, combosquat, dl1, cctld_squat}.
defensive_likely excludes brand-operated portfolio domains, for example amazon.de, which matches as a ccTLD lookalike but is the brand's own subsidiary site.
Generated by lookalike_mining/match.py + cctld_probe.py; deduplication is on the (brand, domain) tuple.
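The candidate count above can be mirrored on in-memory rows. This is a minimal sketch, not the code in match.py: the row shape (brand, domain, defensive_likely keys), the lowercase normalisation of the domain, and the sample data are all assumptions made for illustration.

```python
from typing import Iterable


def candidate_count(rows: Iterable[dict]) -> int:
    """Count unique (brand, domain) pairs, skipping defensive_likely rows."""
    seen = set()
    for row in rows:
        # Mirrors { defensive_likely: { $ne: true } }: a missing or false flag passes.
        if row.get("defensive_likely") is True:
            continue
        # Assumption: domains are compared case-insensitively when deduplicating.
        seen.add((row["brand"], row["domain"].lower()))
    return len(seen)


# Hypothetical sample rows, not real corpus data.
sample_candidates = [
    {"brand": "amazon", "domain": "amaz0n.com"},
    {"brand": "amazon", "domain": "AMAZ0N.com"},  # duplicate (brand, domain) pair
    {"brand": "amazon", "domain": "amazon.de", "defensive_likely": True},
    {"brand": "paypal", "domain": "paypa1.net"},
]
total_candidates = candidate_count(sample_candidates)
```

In the pipeline described here deduplication happens at generation time, so the Mongo count can rely on one document per pair; the sketch folds both steps into one function only to keep it self-contained.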
Known-bad — rows in any abuse feed →
db.lookalikes.countDocuments({ blocklist_hit: { $exists: true, $ne: [] } })
A row is matched against the blocklist on exact apex match. Six feeds are ingested by lookalike_mining/blocklist_refresh.py:
PhishTank online-valid.csv,
URLhaus urls.txt,
CryptoScamDB blacklist.json,
OpenPhish public feed,
abuse.ch ThreatFox CSV,
Phishing.Database ALL-phishing-domains.txt (mitchellkrogza).
"live" subset = resolves_a == true at last DoH sweep.
Linked-bad — graph-adjacent to a known-bad row →
db.lookalikes.countDocuments({ linked_bad: true, blocklist_hit: { $in: [null, []] } })
A row is linked_bad if its IP or ASN matches at least one known-bad row's IP/ASN, after filtering CDN/cloud anchors.
Filter list: Cloudflare, Akamai, AWS, Google, Microsoft, Fastly, Apple, Vercel, Netlify, CloudFront, Hetzner, OVH (when fronting CDN).
Implemented in lookalike_mining/linked_bad.py; uses both RDAP-derived ASN (rdap.asn) and ip-api/ip_intel-derived ASN (ip_intel.ip_asn) per row.
We exclude rows already in blocklist_hit, so this set is mutually exclusive with Known-bad.
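The linking rule can be sketched over in-memory rows as follows. This is an illustrative approximation, not the contents of linked_bad.py: the flat ip / asn / asn_org fields, the shortened anchor set, and the sample rows are assumptions, and the sketch collapses the two ASN sources (rdap.asn and ip_intel.ip_asn) into a single asn field.

```python
# Hypothetical, shortened anchor list; the filter list above is longer.
CDN_CLOUD_ANCHORS = {"cloudflare", "akamai", "aws", "google", "microsoft", "fastly"}


def is_anchor(row: dict) -> bool:
    """True when the row sits behind a CDN/cloud operator we refuse to link through."""
    return row.get("asn_org", "").lower() in CDN_CLOUD_ANCHORS


def linked_bad_domains(rows: list) -> set:
    """Domains not on any feed that share an IP or ASN with a known-bad row."""
    known_bad = [r for r in rows if r.get("blocklist_hit")]
    bad_ips = {r["ip"] for r in known_bad if r.get("ip") and not is_anchor(r)}
    bad_asns = {r["asn"] for r in known_bad if r.get("asn") and not is_anchor(r)}
    linked = set()
    for r in rows:
        # Known-bad rows are excluded so the two sets stay mutually exclusive.
        if r.get("blocklist_hit") or is_anchor(r):
            continue
        if r.get("ip") in bad_ips or r.get("asn") in bad_asns:
            linked.add(r["domain"])
    return linked


# Hypothetical sample rows using documentation IP ranges and private-use ASNs.
sample_graph = [
    {"domain": "bad.example", "ip": "203.0.113.10", "asn": 64500,
     "asn_org": "ExampleHost", "blocklist_hit": ["phishtank"]},
    {"domain": "sibling.example", "ip": "203.0.113.10", "asn": 64500,
     "asn_org": "ExampleHost", "blocklist_hit": []},
    {"domain": "asn-mate.example", "ip": "203.0.113.99", "asn": 64500,
     "asn_org": "ExampleHost"},
    {"domain": "cdn-bad.example", "ip": "198.51.100.5", "asn": 64501,
     "asn_org": "Cloudflare", "blocklist_hit": ["urlhaus"]},
    {"domain": "cdn-neighbour.example", "ip": "198.51.100.5", "asn": 64501,
     "asn_org": "Cloudflare"},
    {"domain": "lonely.example", "ip": "192.0.2.7", "asn": 64502,
     "asn_org": "OtherHost"},
]
linked = linked_bad_domains(sample_graph)
```

Filtering anchors on both sides of the join is the important design choice: without it, one phishing site behind a large CDN would mark every other tenant of that CDN as adjacent.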
True blind-spot % — the headline negative result →
blind_spot_pct = (total − known_bad − linked_bad) / total × 100
Computed in api/app/routers/lookalikes.py :: blindspot(). The remainder of the corpus (rows present in zone files / DoH oracle that have no abuse-feed signal and no IP/ASN graph link to a known-bad row) is what we call the true blind-spot.
This is the steady-state coverage gap, not a temporal one.
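As a worked instance of the formula, the sketch below computes the percentage from three counts. The function name matches the formula's left-hand side, but the zero-total guard and the input numbers are assumptions chosen for easy arithmetic; they are not measured results.

```python
def blind_spot_pct(total: int, known_bad: int, linked_bad: int) -> float:
    """Share of the candidate set with no feed hit and no IP/ASN link, in percent."""
    if total <= 0:
        # Assumption: an empty corpus reports 0.0 rather than raising.
        return 0.0
    # known_bad and linked_bad are mutually exclusive, so they can be subtracted directly.
    return (total - known_bad - linked_bad) / total * 100


# Made-up counts for illustration only.
example_pct = blind_spot_pct(total=200, known_bad=10, linked_bad=40)
```

With these inputs the remainder is 150 of 200 rows, that is 75 percent. The subtraction is only valid because linked-bad excludes rows that already have a blocklist hit.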
Per-source rows — bars in the next section →
db.lookalikes.aggregate([{ $unwind: "$blocklist_hit" }, { $group: { _id: "$blocklist_hit", n: { $sum: 1 } } }])
blocklist.feeds_meta is updated on each ingest.
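For readers less familiar with the aggregation pipeline, the $unwind + $group pair is equivalent to the following count over in-memory rows. It assumes blocklist_hit is a list of feed-name strings; the feed names and rows in the sample are illustrative.

```python
from collections import Counter


def per_source_counts(rows: list) -> Counter:
    """One count per (row, feed) pair, like $unwind followed by $group with $sum: 1."""
    counts = Counter()
    for row in rows:
        # $unwind emits nothing for a missing, null, or empty array.
        for feed in row.get("blocklist_hit") or []:
            counts[feed] += 1
    return counts


# Hypothetical sample rows.
sample_hits = [
    {"domain": "a.example", "blocklist_hit": ["phishtank", "cryptoscamdb"]},
    {"domain": "b.example", "blocklist_hit": ["cryptoscamdb"]},
    {"domain": "c.example", "blocklist_hit": []},
    {"domain": "d.example"},
]
source_counts = per_source_counts(sample_hits)
```

Because a row listed on two feeds contributes to both bars, the per-source counts can sum to more than the Known-bad total.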
Where the few we DO catch turn up.
Coverage is so low that single-digit hits are the norm. CryptoScamDB, the only crypto-specific feed in the comparison set, does materially better but still misses the overwhelming majority.
The candidate set is the limiting factor.
Detection papers in this space (WalletProbe, TxPhishScope, Interface Illusions) operate on candidate sets drawn from PhishTank URLs, Twitter scam reports, or vendor blocklists. They analyze what happens when a wallet connects to a known-suspicious site.
This page demonstrates that the candidate set is the bottleneck, not the detection pipeline downstream of it. A pipeline that starts from the population (zone files + Tranco) reveals dramatically more of the operational threat surface.
For each domain in the lookalikes corpus, we record the first time we observed it in zone_file. Where a blocklist hit exists, we also have the time the source recorded it. The temporal gap between "observable in zone" and "flagged on a feed" is the pre-victimization window: how long an impostor sits live before existing intel sees it.
The blind-spot measurement is the population-coverage version of the temporal-gap claim: even given infinite time, most of our surface never appears on any of these feeds.
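The pre-victimization window described above can be sketched as a per-row timestamp difference. The field names zone_first_seen and feed_first_flagged are hypothetical, as are the dates; rows without a feed timestamp have no gap to measure and fall into the population-coverage question instead.

```python
from datetime import datetime, timezone
from statistics import median


def pre_victimization_days(rows: list) -> list:
    """Days between first zone-file sighting and first feed listing, per flagged row."""
    gaps = []
    for row in rows:
        first_seen = row.get("zone_first_seen")
        flagged = row.get("feed_first_flagged")
        if first_seen is None or flagged is None:
            continue  # never flagged (or never observed): no temporal gap exists
        # A negative value would mean the feed listed the domain before we observed it.
        gaps.append((flagged - first_seen).total_seconds() / 86400)
    return gaps


UTC = timezone.utc
# Hypothetical sample rows.
sample_timeline = [
    {"domain": "a.example",
     "zone_first_seen": datetime(2024, 1, 1, tzinfo=UTC),
     "feed_first_flagged": datetime(2024, 1, 11, tzinfo=UTC)},
    {"domain": "b.example",
     "zone_first_seen": datetime(2024, 2, 1, tzinfo=UTC),
     "feed_first_flagged": None},
    {"domain": "c.example",
     "zone_first_seen": datetime(2024, 3, 1, tzinfo=UTC),
     "feed_first_flagged": datetime(2024, 3, 5, tzinfo=UTC)},
]
gap_days = pre_victimization_days(sample_timeline)
median_gap = median(gap_days)
```

The unflagged row is the interesting one: it contributes nothing to the median, which is why a temporal-gap statistic alone understates the coverage problem.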
- "Blind-spot" here means "absent from every feed we currently ingest AND not adjacent to a known-bad row by IP/ASN." Some lookalikes may have appeared on a feed earlier and been pruned; we measure the live snapshot. Live feed list and per-source coverage on /feeds/.
- The feeds we ingest are not exhaustive. APWG eCrime Exchange, vendor-private feeds (e.g. ScamSniffer), and ML-based wallet-protection extensions (Blockaid, Wallet Guard) likely catch a different subset. Academic-access applications to APWG / OpenINTEL are future work, not yet sent.
- "Linked-bad" is a graph claim, not a confirmed-bad claim. A row sharing a non-CDN IP or non-cloud ASN with a known-bad row is operationally adjacent but not necessarily malicious. We exclude common cloud/CDN ASNs (Cloudflare, Akamai, AWS, Google, etc.) before linking.
- Some lookalikes are false positives by intent: a real popular site whose name happens to be DL-1 of a brand keyword is not impersonation. The Tranco-rank column on /lookalikes/ surfaces these so you can judge.
- The evaluation now includes ccTLD impostors (collected via DoH-as-oracle on 85 ccTLDs). Rows the brand itself appears to operate (signals: SaaS-verification TXT cluster shared with the legitimate apex, or shared NS) are tagged defensive_likely: true and excluded from the impostor count.
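The defensive_likely signals named in the last caveat can be approximated with two set intersections. This is a rough sketch under stated assumptions, not the tagging code itself: the ns and verification_txt fields are hypothetical, and treating any single shared value as sufficient is a simplification that the real heuristic may tighten (shared NS alone is weak when both domains use the same large DNS provider).

```python
def looks_defensive(candidate: dict, apex: dict) -> bool:
    """True when a candidate shares nameservers or verification TXT tokens with the apex."""
    cand_ns = {ns.lower().rstrip(".") for ns in candidate.get("ns", [])}
    apex_ns = {ns.lower().rstrip(".") for ns in apex.get("ns", [])}
    # Verification tokens are compared exactly; they are account-specific values.
    cand_txt = set(candidate.get("verification_txt", []))
    apex_txt = set(apex.get("verification_txt", []))
    return bool(cand_ns & apex_ns) or bool(cand_txt & apex_txt)


# Hypothetical records; names and tokens are invented.
legit_apex = {
    "domain": "brand.example",
    "ns": ["ns1.brand-dns.example.", "ns2.brand-dns.example."],
    "verification_txt": ["saas-verification=abc123"],
}
cctld_sibling = {
    "domain": "brand.example.de",
    "ns": ["NS1.brand-dns.example"],
    "verification_txt": [],
}
impostor = {
    "domain": "brand-login.example",
    "ns": ["ns1.cheap-host.example"],
    "verification_txt": ["saas-verification=zzz999"],
}
sibling_is_defensive = looks_defensive(cctld_sibling, legit_apex)
impostor_is_defensive = looks_defensive(impostor, legit_apex)
```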