How the project has evolved.
A dated record of the platform, the dataset, and the research. Entries are filed by month and tagged by the kind of change: milestones, data-shape changes, and infrastructure moves each have their own tone.
ccTLD coverage + trichotomy + operator profiles
Brand-impersonation pipeline expanded: 5 detection layers (added idn_punycode + cctld_squat) × 1,160 TLDs (1,081 gTLDs from CZDS + 79 ccTLDs via DoH-as-existence-oracle). Linked-bad graph layer added: rows sharing a non-CDN IP or non-cloud ASN with a known-bad row are tagged adjacent (CDN/cloud anchor ASNs filtered before linking). Trichotomy is the headline result: known-bad / linked-bad / true-blind-spot. Defensive-likely heuristic wired (TXT-cluster or shared-NS with legitimate apex).
Abuse-feed cross-reference extended to 11 sources (PhishTank, URLhaus, CryptoScamDB, OpenPhish, ThreatFox, Phishing.Database, MetaMask eth-phishing-detect, Phishing Army, abuse.ch MalwareBazaar / Feodo / SSLBL), adding ~600K rows of crypto + general-phishing intel; live coverage on /feeds/. Server-side precision filter added to the API: drops infrastructure brands, dl1-of-short-brand noise, ccTLD-squat portfolio rows, and Tranco-popular collisions; pass ?precision_mode=raw on any endpoint to see the unfiltered corpus.
New pages: /feeds/ (live abuse-feed status), /operators/ (IP + TXT cluster profiles), /lookalikes-page/ + /d-page/ + /registrar-page/ (catch-alls so non-curated paths stop 404'ing). New /eval/ artifacts: per-claim citation chain, trichotomy Sankey diagram, multi-feed coverage bars. New /lookalikes/ artifact: brand × ccTLD heatmap collapsed to top-pairs view (the full grid is sparse pre-flag-defensive). New /hosting/ artifact: top-ASN-by-impostor table (joins resolves_a × ip_intel.asn). Site-wide pipeline-status strip above the footer.
New API surface: /v1/lookalikes/{brand-tld-matrix, brand-summary, feeds, operators, daemons, top-asn, dns/{d}}, /v1/public/{domain, registrar} (no-auth alias for the catch-all pages). /certs/ + CertStream daemon + crt.sh poller decommissioned. New external dataset mirror: 16 feeds in /data/domaindefender/external_feeds/<DATE>/ on cb1, with manifest.json + daily refresh.
All headline counts on the site are now live-fetched from the API rather than baked at build time.
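The trichotomy and the linked-bad rule described in this entry can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: the `Row` fields, the `ANCHOR_ASNS` set, and the `classify` function are hypothetical names, and CDN IP filtering is folded into the ASN filter for brevity.

```python
from dataclasses import dataclass

# Hypothetical CDN/cloud anchor ASNs excluded before linking.
# The real anchor list is not given in this changelog.
ANCHOR_ASNS = {13335, 16509}


@dataclass(frozen=True)
class Row:
    domain: str
    ip: str
    asn: int


def classify(rows, known_bad_domains):
    """Return {domain: 'known-bad' | 'linked-bad' | 'true-blind-spot'}."""
    bad_rows = [r for r in rows if r.domain in known_bad_domains]
    # Infrastructure used by known-bad rows, with anchor ASNs filtered out first.
    bad_ips = {r.ip for r in bad_rows if r.asn not in ANCHOR_ASNS}
    bad_asns = {r.asn for r in bad_rows if r.asn not in ANCHOR_ASNS}

    labels = {}
    for r in rows:
        if r.domain in known_bad_domains:
            labels[r.domain] = "known-bad"
        elif r.asn not in ANCHOR_ASNS and (r.ip in bad_ips or r.asn in bad_asns):
            labels[r.domain] = "linked-bad"
        else:
            labels[r.domain] = "true-blind-spot"
    return labels
```

Filtering anchor ASNs before building the link sets matters: without it, a single known-bad row hosted on a large CDN would mark every other tenant of that CDN as adjacent.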
Brand-impersonation surface + blind-spot measurement live
Population-scale lookalikes pipeline shipped: 42 brands × 4 detection layers (TLD-squat, DL-1, homoglyph, combosquat) joined against 1,081 gTLDs ⇒ 40,350 real apex domains in zone files we track. Re-resolved every match over DoH (31,339 live, 77.7%). Cross-referenced against PhishTank, URLhaus, CryptoScamDB, OpenPhish: only 50 of 40,350 (0.12%) appear in any abuse feed, so the remaining 99.88% blind spot is the headline finding. New pages: /lookalikes/, per-brand SSG /lookalikes/<brand>/ ×42, /eval/, /infra/, /hosting/, /pipeline/, /datasets/, /recent/, /keywords/. New API surface: /v1/lookalikes/{summary, brand, blindspot, infra, recent, search, hosting, by-tld, methodology, campaigns, export.csv}. Daily refresh timer at 06:00. CertStream daemon + crt.sh fallback poller installed (both currently waiting on upstream).
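The percentages in this entry follow directly from the three published counts. The sketch below reproduces the arithmetic and shows one plausible way to compute feed coverage from sets of domains; `blind_spot_stats` and its argument shapes are hypothetical, only the counts come from the entry.

```python
def blind_spot_stats(matched, feeds):
    """matched: set of domains; feeds: dict of feed name -> set of domains.

    Returns (covered_count, covered_pct, blind_pct).
    """
    listed = set().union(*feeds.values()) if feeds else set()
    covered = matched & listed
    covered_pct = 100.0 * len(covered) / len(matched)
    return len(covered), covered_pct, 100.0 - covered_pct


# Counts taken from the entry above.
matched_total = 40_350   # apex domains matched by the pipeline
live_total = 31_339      # matches that resolved over DoH
in_feeds_total = 50      # matches present in at least one abuse feed

live_pct = 100.0 * live_total / matched_total
covered_pct = 100.0 * in_feeds_total / matched_total
blind_pct = 100.0 - covered_pct

print(f"live: {live_pct:.1f}%  covered: {covered_pct:.2f}%  blind spot: {blind_pct:.2f}%")
# prints: live: 77.7%  covered: 0.12%  blind spot: 99.88%
```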
DomainDefender Intelligence API (v1) — internal alpha
FastAPI service live with 12+ endpoints under /v1/: domain lookup, TLD and registrar aggregates, country stats, multi-facet search with cursor pagination, and the lifecycle endpoints (fresh / expiring / pending-delete / stale) that are DomainDefender's core differentiator against infrastructure-graph vendors. X-API-Key auth with SHA-256 hashed keys, per-tier rate limits, OpenAPI docs at /v1/docs, systemd-managed. First key issued to internal owner. Commercial tiers, Stripe, and public signup deferred pending IP / commercialization review.
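Storing only SHA-256 digests means a leaked key table does not reveal usable keys. A minimal sketch of that idea follows; it is not the service's actual code, and the function names, the digest-to-tier mapping, and the tier label are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


def new_api_key():
    """Generate a random key plus the SHA-256 hex digest to store server-side."""
    key = secrets.token_urlsafe(32)
    return key, hashlib.sha256(key.encode("utf-8")).hexdigest()


def verify_api_key(presented, stored_digests):
    """Return the tier for a presented X-API-Key value, or None if unknown.

    stored_digests maps hex digest -> tier name (hypothetical layout).
    """
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    for stored, tier in stored_digests.items():
        # Constant-time comparison avoids leaking digest prefixes via timing.
        if hmac.compare_digest(digest, stored):
            return tier
    return None
```

The plaintext key is shown to the owner once at issue time; afterwards only the digest is needed to authenticate requests and pick the rate-limit tier.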
Cloudflare Access gate
Public URL now sits behind Cloudflare Access: visitors authenticate via a one-time PIN sent to their email, then the allow-list policy checks the email against an internal allow-list. The Access setup is automated end-to-end via the Cloudflare API; session duration is 24h.
Three.js live wave-grid background
A full-viewport canvas behind every page renders ~3,000 cyan-purple particles animated with double sine waves, additive blending, fog, and gentle mouse parallax. Respects prefers-reduced-motion and pauses on hidden tabs. Inspired by mtd-playground-demo.vercel.app's Three.js plane.
Platform page redesign
Replaces the text-heavy layer list with numbered layer cards, per-layer icons, live DB counts per layer, and an animated SVG data-flow strip showing records moving through the five pipeline stations. Headline honesty banner reflects the crawlbox2 → crawlbox1 migration plan.
Deploy pipeline + systemd service + atomic-swap builds
Preview now runs as a systemd user service with linger enabled, so it survives reboots. Build script uses an atomic dist/ swap so rebuilds never blank the live site. wrangler CLI installed and deploy script ready to push dist/ to Cloudflare Pages once credentials land.
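The changelog does not say how the atomic swap is implemented. One common approach, sketched here as an assumption, is to make dist/ a symlink and repoint it with a single rename; `atomic_publish` and the `.tmp` suffix are hypothetical names, and the sketch assumes a POSIX filesystem.

```python
import os


def atomic_publish(build_dir, live_link):
    """Point `live_link` (a symlink, or not yet existing) at `build_dir`.

    Assumes POSIX semantics: os.replace maps to rename(2), which swaps
    the symlink in one step, so readers see the old or the new build
    and never an empty directory.
    """
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(os.path.abspath(build_dir), tmp_link)
    os.replace(tmp_link, live_link)
```

A build script following this pattern would write each build into a fresh directory, call `atomic_publish(new_build, "dist")`, and only then delete older builds.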
Storage plan: everything targets crawlbox1
Decided all data (2a zone_file, 2b domain_metadata, future 2c+) will live on crawlbox1:27018. crawlbox2 stays the current source until Run 3 finishes, then a mongodump + scp + mongorestore migration runs via scripts/migrate_mongo_to_crawlbox1.sh. Team env_setup.sh files updated to a single unified MONGO_* block.
Interactive world map with real country polygons
Masthead gets a compact rotating globe with drag-to-rotate, a lens switcher (All / Fresh 30d / Expiring / Pending), and click-for-popover showing top registrars, top TLDs, DNSSEC %, and lens counts per country. Full /worldmap section upgraded to real country outlines from Natural Earth 110m in both flat and globe views.
Platform page redesign
Replaces the text-heavy layer list with numbered layer cards, per-layer icons, live record counts from the database, and an animated data-flow strip showing records crossing the five pipeline stations.
Real-schema honesty pass
Schema introspection revealed that several components assumed fields that don't exist in domain_metadata (asn, hosting_country, rdap_server, ip_addresses). CloudProviders, RegistrarASNFlow, and PerRegistryThrottle components were removed; WorldMap, HostingGeography, and the country × TLD heatmap switched to the real registrant_country field with an ISO-3166 filter; schema and per-domain deep-link pages corrected.
Naming-pattern and EPP-status views land
Public aggregations for SLD length, hyphen/numeric/IDN patterns, and EPP status distribution are derived directly from the dataset at build time.
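A rough sketch of how such naming-pattern features could be derived per domain is shown below. The feature names and the choice to measure IDN labels in their decoded Unicode form are assumptions; the changelog does not specify either.

```python
from collections import Counter


def sld_features(domain):
    """Describe the second-level label of an apex domain such as 'my-shop1.example'."""
    label = domain.lower().rstrip(".").split(".")[0]
    is_idn = label.startswith("xn--")
    text = label
    if is_idn:
        try:
            # Decode the punycode A-label so the 'xn--' prefix is not counted.
            text = label.encode("ascii").decode("idna")
        except UnicodeError:
            pass  # keep the raw A-label if it does not decode
    return {
        "length": len(text),
        "has_hyphen": "-" in text,
        "has_digit": any(c.isdigit() for c in text),
        "all_numeric": text.isdigit(),
        "is_idn": is_idn,
    }


def length_histogram(domains):
    """Count domains per SLD length."""
    return Counter(sld_features(d)["length"] for d in domains)
```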
Open research site goes public
Project platform, methodology, and live dataset slices are published openly. Dataset access is available to researchers on request.
Registry-aware RDAP orchestration
Lookup workers moved to per-registry daily quota tracking with adaptive backoff. Identity-Digital-gated TLDs now complete without triggering WAF bans.
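Per-registry quota tracking with adaptive backoff might look something like the sketch below. The class name, the default delays, and the set of HTTP statuses treated as throttling signals are all assumptions; the real workers may use different thresholds.

```python
import datetime as dt


class RegistryBudget:
    """Tracks one registry's daily lookup quota and inter-request delay."""

    # Assumed throttling signals; a real registry may use others.
    THROTTLE_STATUSES = (403, 429, 503)

    def __init__(self, daily_quota, base_delay=1.0, max_delay=300.0):
        self.daily_quota = daily_quota
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay
        self.day = None
        self.used = 0

    def try_acquire(self, today):
        """Reserve one lookup; return False once today's quota is spent."""
        if today != self.day:
            self.day, self.used = today, 0  # new day, fresh quota
        if self.used >= self.daily_quota:
            return False
        self.used += 1
        return True

    def record(self, status):
        """Double the delay on throttling, halve it on success; return the delay in seconds."""
        if status in self.THROTTLE_STATUSES:
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            self.delay = max(self.base_delay, self.delay / 2)
        return self.delay
```

A worker would keep one `RegistryBudget` per registry operator, call `try_acquire(dt.date.today())` before each lookup, and sleep for the value returned by `record` after each response.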
Nested registrar-entity parsing
Registrar names that only appear in sub-entity vCards (e.g. .berlin) are now extracted correctly. Backfill reached 4,925 previously-unlabelled records.
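RDAP responses nest entities, and each entity may carry a jCard in `vcardArray` whose `fn` property holds the display name. The sketch below walks that structure recursively; it is one plausible reading of the fix, not the project's parser. In particular, treating any sub-entity under a registrar as a candidate name source is an assumption, and a production parser would likely skip contact roles such as abuse.

```python
def vcard_fn(entity):
    """Return the 'fn' (formatted name) value from an entity's jCard, if any."""
    vcard = entity.get("vcardArray")
    if not (isinstance(vcard, list) and len(vcard) == 2):
        return None
    for prop in vcard[1]:
        # jCard properties are [name, params, type, value].
        if len(prop) >= 4 and prop[0] == "fn" and prop[3]:
            return prop[3]
    return None


def find_registrar(entities, inherited=False):
    """Depth-first search for a registrar name, including nested sub-entities."""
    for ent in entities or []:
        is_registrar = inherited or "registrar" in ent.get("roles", [])
        if is_registrar:
            name = vcard_fn(ent)
            if name:
                return name
        # Descend: the name may only exist on a sub-entity's vCard.
        name = find_registrar(ent.get("entities"), inherited=is_registrar)
        if name:
            return name
    return None
```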
Primary Mongo moves to crawlbox1
New MongoDB 7.0 node provisioned as the primary data store. Migration from the bootstrap crawlbox is scheduled for after Run 3.
Zone-wide lookup pass
First dataset-wide data refresh targeting the full CZDS footprint. Covers RDAP registration facts for every reachable TLD.
DNSSEC, EPP status, and expiry indexed
Record shape extended to carry DNSSEC flag, full EPP status vector, and expiry dates. Opens the door to lifecycle and abuse-lock analyses.
Batch IP-to-ASN lookup
Hosting columns populated via ip-api batch endpoint for every zone-published A record, with daily quota budgeting.
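Batching and quota budgeting for an IP-to-ASN pass can be sketched without any network calls, as below. The batch size of 100 and the `"AS15169 Google LLC"` shape of the returned `as` field reflect my understanding of the ip-api batch endpoint and should be checked against its documentation; the function names and the budget unit (batches per day) are hypothetical.

```python
import re

BATCH_SIZE = 100  # assumption: per-request cap of the batch endpoint

AS_RE = re.compile(r"^AS(\d+)\s*(.*)$")


def chunk(ips, size=BATCH_SIZE):
    """Deduplicate IPs (keeping order) and split them into request-sized batches."""
    unique = list(dict.fromkeys(ips))
    return [unique[i:i + size] for i in range(0, len(unique), size)]


def parse_as(field):
    """Split an 'AS15169 Google LLC' style string into (15169, 'Google LLC')."""
    m = AS_RE.match(field or "")
    return (int(m.group(1)), m.group(2)) if m else (None, None)


def plan_day(ips, daily_batch_budget):
    """Return (batches to send today, batches deferred to a later day)."""
    batches = chunk(ips)
    return batches[:daily_batch_budget], batches[daily_batch_budget:]
```

Deduplicating before chunking keeps shared-hosting IPs from consuming quota more than once per pass.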
First end-to-end data pass
Zone collection → RDAP lookup → IP/ASN lookup → Mongo persistence completes on a single TLD sample end-to-end.
Project inception
DomainDefender chartered as an open research platform for DNS lifecycle measurement.