Platform architecture

Five layers, one lifecycle.

Every domain passes through the same stack: zone-file collection, RDAP lookup, DoH liveness and IP intel, brand-impersonation matching, and cross-referencing against abuse feeds. Each layer deepens the record.

Zone domains

across CZDS-accessible TLDs

Records

RDAP responses persisted

TLDs touched

with at least one record

Pipeline status

Active

Layers 01 + 02 actively collecting

Data flow

A record crosses five stations.


Layers 01 and 02 are actively collecting. Layers 03, 04, 05 are scoped below but not yet persisting records at scale.

The pipeline

Five layers. Data flowing through.

Each layer deepens the record. Watch data flow from ground-truth registration facts up to detection-ready signals.

01

Zone File Collection
ICANN CZDS · RFC 1035 parser

Every registered domain under every accessible TLD, with all DNS record types.

domain tld dns_records{} parsed_at

02

RDAP enrichment
RDAP (IANA bootstrap)

Registrar, IANA ID, registration dates, nameservers, DNSSEC, EPP status, registrant country.

registrar registrar_iana_id creation_date expiry_date updated_date nameservers[] dnssec status[] registrant_country

03

DoH liveness + IP intel
Cloudflare 1.1.1.1 · Google 8.8.8.8 · ip-api

Re-resolve every match over DoH. For each live IP, ASN / country / ISP via ip-api enrichment.

resolves resolves_a[] live_check_at ip_intel.{asn,country,isp}

04

Brand-impersonation surface
match.py · cctld_probe.py · DoH-as-oracle

242 brands × 5 detection layers (TLD-squat, homoglyph, idn-punycode, combosquat, dl1) joined against 1,160 TLDs.

brand layer candidate_sld tld tranco_rank

05

Cross-reference + linking
PhishTank · URLhaus · CryptoScamDB · OpenPhish · ThreatFox · Phishing.Database

Six abuse feeds + a non-CDN IP/ASN graph layer. Trichotomy: known-bad / linked-bad / blind-spot.

known_bad_sources[] linked_bad defensive_likely blocklist_hit[]

Layer 01

Zone File Collection

records in collection

live

Captures

Every registered domain under every accessible generic TLD, with every DNS record type the zone publishes.

How

ICANN CZDS API for downloads; RFC 1035-compliant parser; dynamic record-type capture (no hardcoded whitelist).
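A minimal sketch of what dynamic record-type capture could look like, assuming the zone file has already been flattened to one resource record per line with a fully qualified owner name. The function name parse_zone_lines and the output shape are illustrative, not the platform's actual parser; multi-line records, omitted owner names, and $INCLUDE are out of scope here.

```python
from collections import defaultdict

# DNS classes that may appear in a master-file line (RFC 1035, section 5.1).
DNS_CLASSES = {"IN", "CH", "HS"}


def parse_zone_lines(lines, tld):
    """Group resource records by owner name, keyed by whatever type appears.

    Assumption: one record per line, owner name always present.
    """
    records = defaultdict(lambda: defaultdict(list))
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comment lines, and $ORIGIN / $TTL directives.
        if not line or line.startswith(";") or line.startswith("$"):
            continue
        fields = line.split()
        owner, rest = fields[0].rstrip(".").lower(), fields[1:]
        # TTL and class are optional and may appear in either order.
        while rest and (rest[0].isdigit() or rest[0].upper() in DNS_CLASSES):
            rest = rest[1:]
        if len(rest) < 2:
            continue  # no type + rdata left; ignore the malformed line
        # No whitelist: any type token the zone publishes becomes a key.
        rtype, rdata = rest[0].upper(), " ".join(rest[1:])
        records[owner][rtype].append(rdata)
    return [
        {
            "domain": owner,
            "tld": tld,
            "dns_records": dict(by_type),
            "record_count": sum(len(v) for v in by_type.values()),
        }
        for owner, by_type in records.items()
    ]
```

Because the record type is read from the line rather than checked against a fixed list, an unfamiliar type such as TYPE65534 is stored like any other.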

Why it matters

This is ground truth. A domain's presence or absence in tonight's zone file is the anchor for every higher-level signal.

Schema

domain tld dns_records (NS, A, AAAA, MX, TXT, SOA, RRSIG, DNSKEY, ...) record_count parsed_at zone_file_date

Live, actively collecting.


Layer 02

RDAP lookup

records

live

Captures

Registrar identity, registration / expiry / update dates, EPP status codes, nameservers, DNSSEC delegation, registrant country.

How

IANA RDAP bootstrap plus per-TLD registry RDAP endpoints; registry-aware orchestrator with per-TLD rate limits.
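The bootstrap step and the per-TLD rate limit can be sketched as follows. This assumes the bootstrap document has already been fetched and decoded into a dict in the RFC 9224 layout (a "services" list of [tlds, urls] pairs). The names rdap_base_urls, rdap_domain_url, and PerTldThrottle are hypothetical, and the interval values are placeholders rather than real registry limits.

```python
import time


def rdap_base_urls(bootstrap, tld):
    """Return the RDAP base URLs listed for a TLD in an RFC 9224 bootstrap dict."""
    for tlds, urls in bootstrap.get("services", []):
        if tld.lower() in (t.lower() for t in tlds):
            return list(urls)
    return []  # TLD has no RDAP service registered


def rdap_domain_url(base_url, domain):
    """Build an RFC 9082 domain lookup URL from a bootstrap base URL."""
    return base_url.rstrip("/") + "/domain/" + domain.lower()


class PerTldThrottle:
    """Track a minimum interval between requests to each TLD's registry."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s  # {tld: seconds}, placeholder values
        self.clock = clock                    # injectable so it can be tested
        self._next_free = {}

    def wait_time(self, tld, default=1.0):
        """Seconds the caller should sleep before its next request to this TLD."""
        interval = self.min_interval_s.get(tld, default)
        now = self.clock()
        start = max(now, self._next_free.get(tld, now))
        self._next_free[tld] = start + interval  # reserve the slot
        return start - now
```

An orchestrator would call wait_time(tld), sleep for the returned duration, then issue the request to the URL built by rdap_domain_url.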

Why it matters

Who registered it, when, and where they say they are. This is the context that separates a newly registered domain from one inherited from a prior owner.

Schema

registrar registrar_iana_id creation_date expiry_date updated_date status[] nameservers[] dnssec registrant_country

Live, actively collecting.


Layer 03

DoH liveness + IP intel

records in collection

live

Captures

For every match in the lookalikes corpus, an A/AAAA re-resolution over DoH; for each live IP, ASN / country / ISP via ip-api enrichment.

How

Cloudflare 1.1.1.1 and Google 8.8.8.8 DoH endpoints with parallel asyncio + retry. ip-api.com for the post-resolve enrichment, cached at the IP level in Mongo.
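A rough sketch of the parallel asyncio + retry pattern described above. The `query` argument is a hypothetical coroutine standing in for the HTTP call to a DoH JSON endpoint; it is assumed to return the decoded JSON body, with a "Status" code and an "Answer" list. The function names, concurrency limit, and backoff values are illustrative only.

```python
import asyncio

A, AAAA = 1, 28  # DNS type codes as they appear in DoH JSON answers


def extract_addresses(doh_json):
    """Pull A and AAAA addresses out of a decoded DoH JSON response."""
    a, aaaa = [], []
    if doh_json.get("Status") != 0:  # 0 = NOERROR; 3 = NXDOMAIN, and so on
        return a, aaaa
    for ans in doh_json.get("Answer", []):
        if ans.get("type") == A:
            a.append(ans["data"])
        elif ans.get("type") == AAAA:
            aaaa.append(ans["data"])
    return a, aaaa  # CNAME and other answer types are ignored


async def resolve_with_retry(query, domain, attempts=3, base_delay=0.5):
    """Await query(domain), retrying with exponential backoff on I/O errors."""
    for attempt in range(attempts):
        try:
            return await query(domain)
        except (OSError, asyncio.TimeoutError):
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def check_liveness(query, domains, concurrency=50, base_delay=0.5):
    """Resolve many domains in parallel, bounded by a semaphore."""
    sem = asyncio.Semaphore(concurrency)

    async def one(domain):
        async with sem:
            try:
                body = await resolve_with_retry(query, domain, base_delay=base_delay)
                a, aaaa = extract_addresses(body)
            except (OSError, asyncio.TimeoutError):
                a, aaaa = [], []  # treat persistent failure as not resolving
        return {"domain": domain, "resolves": bool(a or aaaa),
                "resolves_a": a, "resolves_aaaa": aaaa}

    return await asyncio.gather(*(one(d) for d in domains))
```

Passing the resolver in as a callable keeps the retry and concurrency logic independent of any particular HTTP client, and makes it easy to swap between the two DoH providers.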

Why it matters

Static registration data isn't enough. Whether a domain currently resolves to a host, and where that host lives, is the signal that turns a candidate into an operational threat.

Schema

resolves resolves_a[] resolves_aaaa[] live_check_at ip_intel.{asn,country,isp,city}

Live, actively collecting.


Layer 04

Brand-impersonation surface

records

live

Captures

242 brands × 5 detection layers (TLD-squat, homoglyph, idn-punycode, combosquat, dl1) joined against 1,160 TLDs (1,081 gTLDs from CZDS + 79 ccTLDs via DoH-as-existence-oracle).

How

match.py walks every parsed zone file, applies the layer transforms per brand, writes hits to the lookalikes collection. cctld_probe.py uses DoH as an existence oracle for ccTLDs we don't have CZDS access to.
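To illustrate the idea of per-brand layer transforms, here is a simplified sketch covering four of the five layers (TLD-squat, idn-punycode, homoglyph, combosquat). It is not the logic in match.py: the confusables table is a tiny hand-picked sample, combosquat detection only looks at hyphen-separated tokens, and the dl1 layer is left out because its definition is not given here. The names match_layer and scan are hypothetical.

```python
# Tiny illustrative confusables map; a real table would be far larger.
# The last three keys are Cyrillic letters that look like Latin a, e, o.
CONFUSABLES = str.maketrans({
    "0": "o", "1": "l", "3": "e", "5": "s",
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
})


def match_layer(sld, brand):
    """Return the layer under which sld resembles brand, or None."""
    sld, brand = sld.lower(), brand.lower()
    if sld == brand:
        # Same label; the caller is expected to check that the TLD
        # differs from the brand's own.
        return "TLD-squat"
    if sld.startswith("xn--"):
        try:
            decoded = sld.encode("ascii").decode("idna")
        except UnicodeError:
            return None
        return "idn-punycode" if decoded.translate(CONFUSABLES) == brand else None
    if sld.translate(CONFUSABLES) == brand:
        return "homoglyph"
    tokens = sld.split("-")
    if len(tokens) > 1 and brand in tokens:
        return "combosquat"
    return None


def scan(slds, tld, brands):
    """Apply match_layer to every (sld, brand) pair and collect hit rows."""
    hits = []
    for sld in slds:
        for brand in brands:
            layer = match_layer(sld, brand)
            if layer:
                hits.append({"brand": brand, "layer": layer,
                             "candidate_sld": sld, "tld": tld})
    return hits
```

For example, scan(["paypa1", "example"], "xyz", ["paypal"]) yields a single homoglyph row for paypa1. At zone-file scale the nested loop would need an index over brands, but the row shape mirrors the schema fields listed for this layer.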

Why it matters

Population-scale candidate generation. Much published detection work is evaluated on small vendor blocklists; this layer runs against every registered domain on every TLD we cover.

Schema

brand layer candidate_sld tld tranco_rank first_seen_at

Live, actively collecting.


Layer 05

Cross-reference + linking

records

live

Captures

Multiple abuse feeds + a non-CDN IP/ASN graph layer + TXT-cluster operator fingerprint + defensive-registration filter. Trichotomy: known-bad / linked-bad / blind-spot. Live feed list on /feeds/.

How

blocklist_refresh.py ingests PhishTank / URLhaus / CryptoScamDB / OpenPhish / ThreatFox / Phishing.Database. linked_bad.py builds the IP+ASN adjacency graph after filtering CDN/cloud anchors. cluster_txt.py groups SaaS-verification token siblings. flag_defensive.py marks brand-operated portfolio rows.
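The trichotomy can be illustrated with a small sketch. It assumes each candidate has been flattened to rows of (domain, ip, asn), that feed hits are available as a dict from domain to feed names, and that "linked" means sharing an IP with a known-bad domain once CDN/cloud ASNs are excluded. That linking rule is a simplification of the IP+ASN adjacency graph, and the function name and input shapes are hypothetical.

```python
# Precedence when a domain has several rows: the strongest label wins.
RANK = {"blind-spot": 0, "linked-bad": 1, "known-bad": 2}


def trichotomy(rows, feed_hits, cdn_asns):
    """Label each domain known-bad, linked-bad, or blind-spot.

    rows:      list of {"domain": str, "ip": str, "asn": int}
    feed_hits: {domain: [feed names]} from the abuse feeds
    cdn_asns:  set of ASNs treated as shared CDN/cloud anchors
    """
    # IPs that host at least one known-bad domain, ignoring CDN anchors,
    # since a shared CDN edge says nothing about a shared operator.
    bad_ips = {
        r["ip"] for r in rows
        if r["domain"] in feed_hits and r["asn"] not in cdn_asns
    }
    labels = {}
    for r in rows:
        domain = r["domain"]
        if domain in feed_hits:
            label = "known-bad"
        elif r["ip"] in bad_ips:
            label = "linked-bad"
        else:
            label = "blind-spot"
        if RANK[label] > RANK.get(labels.get(domain), -1):
            labels[domain] = label
    return labels
```

Counting the three labels over the lookalikes corpus gives the coverage-gap figure: the blind-spot share is the portion of candidates that no feed and no infrastructure link accounts for.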

Why it matters

Even given infinite time, most of the impostor surface never appears in any abuse feed. The trichotomy quantifies the coverage gap as a publishable result.

Schema

known_bad_sources[] blocklist_hit[] linked_bad defensive_likely txt_cluster

Live, actively collecting.


Where to go from here

See what the layers produce.

Open the dashboard · Read the insights · Browse the corpus · See the record schema
