Platform architecture

Five layers, one lifecycle.

Every domain passes through the same stack: zone-file capture, RDAP lookup, DoH liveness and IP intel, brand-impersonation matching, and cross-referencing against abuse feeds. Each layer deepens the record.

Zone domains: domains across CZDS-accessible TLDs
Records: RDAP responses persisted
TLDs touched: TLDs with at least one record
Pipeline status: Active (Layers 01 + 02 actively collecting)
Data flow

A record crosses five stations.

01 Zone → 02 RDAP → 03 DoH → 04 Brand-impersonation → 05 Cross-reference

Layers 01 and 02 are actively collecting. Layers 03, 04, 05 are scoped below but not yet persisting records at scale.

The pipeline

Five layers. Data flowing through.

Each layer deepens the record, moving from ground-truth registration facts up to detection-ready signals.

  1. Layer 01: Zone File Collection

    ICANN CZDS · RFC 1035 parser

    Every registered domain under every accessible TLD, with all DNS record types.

    domain tld dns_records{} parsed_at
  2. Layer 02: RDAP enrichment

    RDAP (IANA bootstrap)

    Registrar, IANA ID, registration dates, nameservers, DNSSEC, EPP status, registrant country.

    registrar registrar_iana_id creation_date expiry_date updated_date nameservers[] dnssec status[] registrant_country
  3. Layer 03: DoH liveness + IP intel

    Cloudflare 1.1.1.1 · Google 8.8.8.8 · ip-api

    Re-resolve every match over DoH. For each live IP, ASN / country / ISP via ip-api enrichment.

    resolves resolves_a[] live_check_at ip_intel.{asn,country,isp}
  4. Layer 04: Brand-impersonation surface

    match.py · cctld_probe.py · DoH-as-oracle

    242 brands × 5 detection layers (TLD-squat, homoglyph, idn-punycode, combosquat, dl1) joined against 1,160 TLDs.

    brand layer candidate_sld tld tranco_rank
  5. Layer 05: Cross-reference + linking

    PhishTank · URLhaus · CryptoScamDB · OpenPhish · ThreatFox · Phishing.Database

    Six abuse feeds + a non-CDN IP/ASN graph layer. Trichotomy: known-bad / linked-bad / blind-spot.

    known_bad_sources[] linked_bad defensive_likely blocklist_hit[]
Layer 01

Zone File Collection

Captures

Every registered domain under every accessible generic TLD, with every DNS record type the zone publishes.

How

ICANN CZDS API for downloads; an RFC 1035-compliant parser; dynamic record-type capture (no hardcoded whitelist of record types).
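As a rough illustration of dynamic record-type capture, the sketch below groups zone-file lines by registered domain and by whatever record type each line declares. It is not the platform's parser and not a full RFC 1035 implementation: it assumes one record per line with owner, TTL, class, type, and rdata all present, and the function name `parse_zone_lines` is hypothetical. Attributing glue records (such as ns1.example.com) to their parent domain is also a choice made here for illustration.

```python
from collections import defaultdict


def parse_zone_lines(lines, tld):
    """Group records by registered domain and record type (simplified sketch)."""
    records = defaultdict(lambda: defaultdict(list))
    suffix = "." + tld
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comment lines, and directives such as $ORIGIN / $TTL.
        if not line or line[0] in ";$":
            continue
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # assumption: owner, TTL, class, type, rdata are all present
        owner, _ttl, _rclass, rtype, rdata = parts
        owner = owner.rstrip(".").lower()
        if not owner.endswith(suffix):
            continue  # zone apex or out-of-zone owner
        # The label directly left of the TLD identifies the registered domain.
        sld = owner[: -len(suffix)].rsplit(".", 1)[-1]
        # No whitelist: any record type that appears becomes a key.
        records[sld + suffix][rtype.upper()].append(rdata)
    return [
        {
            "domain": domain,
            "tld": tld,
            "dns_records": dict(by_type),
            "record_count": sum(len(v) for v in by_type.values()),
        }
        for domain, by_type in records.items()
    ]
```

Because record types are used as dictionary keys as they appear, a type the parser has never seen before is stored like any other.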

Why it matters

This is ground truth. A domain's presence or absence in tonight's zone file is the anchor for every higher-level signal.

Schema
domain, tld, dns_records (NS, A, AAAA, MX, TXT, SOA, RRSIG, DNSKEY, ...), record_count, parsed_at, zone_file_date
Live, actively collecting.
Layer 02

RDAP lookup

Captures

Registrar identity, registration / expiry / update dates, EPP status codes, nameservers, DNSSEC delegation, registrant country.

How

IANA RDAP bootstrap plus per-TLD registry RDAP endpoints; registry-aware orchestrator with per-TLD rate limits.
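The bootstrap-then-query flow can be sketched as two pure functions: one that picks a registry base URL out of the IANA bootstrap document, and one that flattens an RDAP domain response into the schema fields listed for this layer. This is a minimal sketch, not the platform's orchestrator: the function names are hypothetical, rate limiting is left out, and registrant_country is not attempted because registrant contact data is frequently redacted.

```python
def rdap_base_url(bootstrap, tld):
    """Return the first RDAP base URL the bootstrap document lists for a TLD."""
    # Each service entry is a pair: [list of TLDs, list of base URLs].
    for tlds, urls in bootstrap.get("services", []):
        if tld.lower() in tlds:
            return urls[0]
    return None


def rdap_domain_url(base, domain):
    """Build the domain lookup URL under a registry base URL."""
    return base.rstrip("/") + "/domain/" + domain


def extract_fields(rdap):
    """Flatten an RDAP domain response (a parsed JSON dict) into schema fields."""
    events = {e.get("eventAction"): e.get("eventDate") for e in rdap.get("events", [])}
    registrar = iana_id = None
    for entity in rdap.get("entities", []):
        if "registrar" not in entity.get("roles", []):
            continue
        for pid in entity.get("publicIds", []):
            if pid.get("type") == "IANA Registrar ID":
                iana_id = pid.get("identifier")
        # jCard layout: ["vcard", [[name, params, type, value], ...]]
        for prop in entity.get("vcardArray", ["vcard", []])[1]:
            if prop[0] == "fn":
                registrar = prop[3]
    return {
        "registrar": registrar,
        "registrar_iana_id": iana_id,
        "creation_date": events.get("registration"),
        "expiry_date": events.get("expiration"),
        "updated_date": events.get("last changed"),
        "status": rdap.get("status", []),
        "nameservers": [
            ns["ldhName"].lower() for ns in rdap.get("nameservers", []) if "ldhName" in ns
        ],
        "dnssec": rdap.get("secureDNS", {}).get("delegationSigned"),
    }
```

Keeping both functions free of network calls makes them easy to test against saved responses; the HTTP fetch and per-TLD throttling would sit around them.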

Why it matters

Who registered the domain, when, and where they say they are located. This is the context that separates a newly registered domain from one inherited from a prior owner.

Schema
registrar, registrar_iana_id, creation_date, expiry_date, updated_date, status[], nameservers[], dnssec, registrant_country
Live, actively collecting.
Layer 03

DoH liveness + IP intel

Captures

For every match in the lookalikes corpus, an A/AAAA re-resolution over DoH; for each live IP, ASN / country / ISP via ip-api enrichment.

How

Cloudflare 1.1.1.1 and Google 8.8.8.8 DoH endpoints with parallel asyncio + retry. ip-api.com for the post-resolve enrichment, cached at the IP level in Mongo.
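A liveness check along these lines might look like the sketch below, which queries A and AAAA concurrently with asyncio and retries against the next resolver on failure. It assumes the resolvers' JSON interfaces at cloudflare-dns.com/dns-query and dns.google/resolve; the page does not say which endpoint form the platform uses. The function names, the injectable `fetch` parameter, and the retry policy are all illustrative, and the ip-api enrichment and Mongo cache are left out. It needs Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# Assumed JSON endpoints for the two resolvers named above.
DOH_ENDPOINTS = ["https://cloudflare-dns.com/dns-query", "https://dns.google/resolve"]
TYPE_CODES = {"A": 1, "AAAA": 28}


def fetch_json(endpoint, name, rtype):
    """Blocking DoH JSON query; raises OSError subclasses on network failure."""
    query = urllib.parse.urlencode({"name": name, "type": rtype})
    req = urllib.request.Request(
        endpoint + "?" + query, headers={"accept": "application/dns-json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def addresses(answer, rtype):
    """Pull addresses of one record type out of a DoH JSON answer."""
    if answer.get("Status") != 0:  # non-zero DNS status, e.g. NXDOMAIN
        return []
    code = TYPE_CODES[rtype]
    return [a["data"] for a in answer.get("Answer", []) if a.get("type") == code]


async def check_liveness(name, fetch=fetch_json, retries=2, backoff=0.1):
    async def query(rtype):
        for attempt in range(retries + 1):
            # Alternate resolvers so a retry does not hit the same endpoint.
            endpoint = DOH_ENDPOINTS[attempt % len(DOH_ENDPOINTS)]
            try:
                data = await asyncio.to_thread(fetch, endpoint, name, rtype)
                return addresses(data, rtype)
            except OSError:
                await asyncio.sleep(backoff * (attempt + 1))
        return []  # every attempt failed; treated as not resolving

    a, aaaa = await asyncio.gather(query("A"), query("AAAA"))
    return {
        "resolves": bool(a or aaaa),
        "resolves_a": a,
        "resolves_aaaa": aaaa,
        "live_check_at": datetime.now(timezone.utc).isoformat(),
    }
```

Treating an exhausted retry budget as "not resolving" conflates resolver outages with dead domains; a production check would likely record the failure separately.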

Why it matters

Static registration data isn't enough. Whether a domain currently resolves to a host, and where that host lives, is the signal that turns a candidate into an operational threat.

Schema
resolves, resolves_a[], resolves_aaaa[], live_check_at, ip_intel.{asn,country,isp,city}
Live, actively collecting.
Layer 04

Brand-impersonation surface

Captures

242 brands × 5 detection layers (TLD-squat, homoglyph, idn-punycode, combosquat, dl1) joined against 1,160 TLDs (1,081 gTLDs from CZDS + 79 ccTLDs via DoH-as-existence-oracle).

How

match.py walks every parsed zone file, applies the layer transforms per brand, writes hits to the lookalikes collection. cctld_probe.py uses DoH as an existence oracle for ccTLDs we don't have CZDS access to.
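To make the per-brand layer transforms concrete, here is a small sketch that assigns one second-level label to at most one of the five layers. It is not match.py: the lookalike tables are tiny illustrative samples, the precedence between layers is arbitrary, and "dl1" is assumed here to mean an edit distance of one (insert, delete, substitute, or adjacent swap), which the page does not define. All function names are hypothetical.

```python
# Tiny illustrative lookalike tables; real confusable sets are far larger.
ASCII_LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})
UNICODE_LOOKALIKES = str.maketrans(
    {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}
)


def within_one_edit(a, b):
    """True if a != b and one insert, delete, substitute, or adjacent swap maps a to b."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        diff = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diff) == 1:
            return True  # single substitution
        return (
            len(diff) == 2
            and diff[1] == diff[0] + 1
            and a[diff[0]] == b[diff[1]]
            and a[diff[1]] == b[diff[0]]
        )  # adjacent transposition
    short, long_ = sorted((a, b), key=len)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def classify(sld, tld, brand, brand_tld="com"):
    """Return the first matching layer name for a candidate, or None."""
    sld = sld.lower()
    if sld == brand:
        return "tld-squat" if tld != brand_tld else None
    if sld.startswith("xn--"):
        try:
            decoded = sld.encode("ascii").decode("idna")
        except UnicodeError:
            return None
        folded = decoded.translate(UNICODE_LOOKALIKES)
        return "idn-punycode" if folded == brand else None
    if sld.translate(ASCII_LOOKALIKES).replace("rn", "m") == brand:
        return "homoglyph"
    if within_one_edit(sld, brand):
        return "dl1"
    if brand in sld:
        return "combosquat"
    return None
```

Under these assumptions, `classify("paypal-login", "com", "paypal")` falls through to the combosquat layer, while the legitimate `paypal.com` itself returns `None`.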

Why it matters

Population-scale candidate generation. Detection papers operate on tiny vendor blocklists; we operate on every registered domain on every TLD we cover.

Schema
brand, layer, candidate_sld, tld, tranco_rank, first_seen_at
Live, actively collecting.
Layer 05

Cross-reference + linking

Captures

Multiple abuse feeds + a non-CDN IP/ASN graph layer + TXT-cluster operator fingerprint + defensive-registration filter. Trichotomy: known-bad / linked-bad / blind-spot. Live feed list on /feeds/.

How

blocklist_refresh.py ingests PhishTank / URLhaus / CryptoScamDB / OpenPhish / ThreatFox / Phishing.Database. linked_bad.py builds the IP+ASN adjacency graph after filtering CDN/cloud anchors. cluster_txt.py groups SaaS-verification token siblings. flag_defensive.py marks brand-operated portfolio rows.
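The trichotomy can be sketched as a two-pass labelling over candidate rows: collect the IPs of feed-listed domains, drop any that sit on CDN or cloud networks, then mark unlisted domains that share a remaining IP. This is only an illustration of the idea, not linked_bad.py: the `hosts` row field and the function name are hypothetical, the three AS numbers (Cloudflare, Google, Amazon) stand in for a much longer anchor list, and ASN-level adjacency, TXT clustering, and the defensive-registration filter are omitted.

```python
# Illustrative CDN/cloud anchors: AS13335 Cloudflare, AS15169 Google, AS16509 Amazon.
CDN_ASNS = frozenset({13335, 15169, 16509})


def trichotomy(rows, cdn_asns=CDN_ASNS):
    """Label each row known-bad, linked-bad, or blind-spot (simplified sketch)."""

    def anchor_ips(row):
        # Shared CDN/cloud IPs link unrelated tenants, so they are filtered out.
        return {h["ip"] for h in row.get("hosts", []) if h.get("asn") not in cdn_asns}

    # Pass 1: non-CDN IPs used by domains that appear in at least one abuse feed.
    bad_ips = set()
    for row in rows:
        if row.get("known_bad_sources"):
            bad_ips |= anchor_ips(row)

    # Pass 2: label every row.
    labels = {}
    for row in rows:
        if row.get("known_bad_sources"):
            labels[row["domain"]] = "known-bad"
        elif anchor_ips(row) & bad_ips:
            labels[row["domain"]] = "linked-bad"
        else:
            labels[row["domain"]] = "blind-spot"
    return labels
```

The share of rows that end up in the blind-spot bucket is the coverage gap: candidates that no feed lists and that no infrastructure link ties to a listed domain.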

Why it matters

Even given infinite time, most of the impostor surface never appears in any abuse feed. The trichotomy quantifies the coverage gap as a publishable result.

Schema
known_bad_sources[], blocklist_hit[], linked_bad, defensive_likely, txt_cluster
Live, actively collecting.
Where to go from here

See what the layers produce.
