Crawler & DMCA policy
How the BuildCalcAPI-Crawler operates, robots.txt rules, citation requirements, and DMCA contact
The BuildCalcAPI-Crawler is the automated agent that populates the
products vertical (/v1/products/*) by fetching spec data from public
federal certification directories and manufacturer-published catalog
PDFs. This page documents the crawler's operational rules, identification,
rate behavior, and the DMCA takedown channel for rightsholders.
Identification
| Field | Value |
|---|---|
| User-Agent | BuildCalcAPI-Crawler/1.0 (+https://buildcalcapi.dev/crawler) |
| Operator contact | [email protected] |
| Documentation | This page (/crawler) |
If you see requests with this UA in your logs and want to verify they originate from BuildCalc API, the source IP will be a Cloudflare or Render egress range. The UA is the canonical identifier.
Rate behavior
The crawler self-throttles to a maximum of 1 request per second per
host. The throttle is enforced in code (app.etl._products_common._host_throttle),
not as a best-effort target. The intent is to keep load below the
threshold where a small public directory would notice us at all.
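As an illustration of what a hard per-host limit can look like, here is a minimal sketch. It is not the contents of app.etl._products_common._host_throttle; the class name HostThrottle, its parameters, and the injectable clock are assumptions made for the example.

```python
import threading
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Allow at most one request per `min_interval` seconds per host (illustrative)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock      # injectable so the behavior can be tested without waiting
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start
        self._lock = threading.Lock()

    def wait(self, url):
        """Block until a request to the URL's host is allowed; return seconds slept."""
        host = urlsplit(url).netloc.lower()
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot before sleeping so concurrent callers queue up
            # behind each other instead of all firing at once.
            self._next_allowed[host] = start + self._min_interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Calling wait(url) before every fetch makes the limit a precondition of the request, so a caller cannot skip it by accident. Requests to different hosts do not delay each other.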
robots.txt
robots.txt is honored unconditionally. The parser (Python's
standard urllib.robotparser) is invoked once per host on first contact
and the result is cached for the crawler's lifetime. A Disallow rule that
matches our path raises an internal RobotsDisallowed exception and the
fetch never happens.
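The check-then-fetch flow described above could be sketched roughly as follows. Only urllib.robotparser and the RobotsDisallowed name come from this page; the RobotsGate class, its load_robots hook, and the bare user-agent token used for matching are assumptions for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "BuildCalcAPI-Crawler"  # product token matched against robots.txt groups


class RobotsDisallowed(Exception):
    """Raised when robots.txt forbids fetching a URL."""


class RobotsGate:
    """Load robots.txt once per origin, then answer every later check from cache."""

    def __init__(self, load_robots=None):
        self._parsers = {}  # origin -> RobotFileParser
        self._load = load_robots or self._fetch

    @staticmethod
    def _fetch(robots_url):
        parser = RobotFileParser(robots_url)
        parser.read()  # performs the network request for robots.txt
        return parser

    def check(self, url):
        """Raise RobotsDisallowed if the URL may not be fetched; otherwise return None."""
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            parser = self._load(origin + "/robots.txt")
            self._parsers[origin] = parser
        if not parser.can_fetch(USER_AGENT, url):
            raise RobotsDisallowed(url)
```

Because check() raises before any request for the target URL is made, a disallowed path is never fetched. The exception only has to be handled, or allowed to abort the job.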
To block the crawler from a specific path, add to your robots.txt:
    User-agent: BuildCalcAPI-Crawler
    Disallow: /path/to/block
To block it entirely:
    User-agent: BuildCalcAPI-Crawler
    Disallow: /
What we extract and what we don't
The crawler extracts factual spec values — SEER2 numbers, U-factor numbers, model numbers, certification IDs — and stores them in a JSONB column keyed by canonical spec name. We do not store:
- Descriptive marketing prose from product catalogs
- Photographs, renders, or other image bytes (we store only a URL into
  the original source CDN in products.image_url)
- Full table layouts as compiled works
- Pricing data (out of scope for v1)
- Reseller or distributor information
Every product row carries a source_url and (where applicable) a
certification_number so downstream consumers can verify against the
authority of record.
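One way to picture the facts-only rule is an allow-list that maps source labels to canonical spec names and drops everything else. The sketch below is a hypothetical illustration: the CANONICAL_SPECS mapping, the function name, and the sample labels and values are assumptions, not the crawler's actual schema.

```python
import json

# Hypothetical mapping from labels seen in a source to canonical spec names.
CANONICAL_SPECS = {
    "SEER2": "seer2",
    "U-Factor": "u_factor",
}


def extract_specs(raw_fields):
    """Keep only allow-listed factual fields, keyed by canonical spec name."""
    specs = {}
    for label, value in raw_fields.items():
        key = CANONICAL_SPECS.get(label)
        if key is not None:
            specs[key] = value
    return specs


# Made-up input: one factual value and one piece of marketing prose.
raw = {"SEER2": 16.5, "Description": "Whisper-quiet comfort all year"}
specs_json = json.dumps(extract_specs(raw))  # the shape a JSONB column would receive
```

With an allow-list, an unrecognized field is discarded by default, so descriptive prose in a source never reaches storage unless someone explicitly maps it.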
Source classification
The crawler operates only against sources classified T0-T2 plus one T3
(ICC-ES ESR, deferred until LLC formation). Sources behind authentication
walls or with anti-automation infrastructure (UL Prospector, UL
Certifications, AHRI's paid subscription API) are flagged
scrapeable: false in our internal registry and rejected at the
framework level — no per-source code can fetch them.
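A framework-level rejection of this kind might look like the sketch below, where every fetch passes through one entry point that consults the registry first. The registry keys, the Source dataclass, and the SourceNotScrapeable exception are hypothetical names chosen for the example. Only the scrapeable flag and the T1 tier for ENERGY STAR come from this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class SourceNotScrapeable(Exception):
    """Raised when a fetch is attempted against a source flagged scrapeable=False."""


@dataclass(frozen=True)
class Source:
    name: str
    scrapeable: bool
    tier: Optional[str] = None  # None where this page does not state a tier


# Hypothetical registry entries for illustration.
REGISTRY = {
    "energy_star_hvac": Source("ENERGY STAR HVAC", scrapeable=True, tier="T1"),
    "ul_prospector": Source("UL Prospector", scrapeable=False),
}


def fetch(source_key: str, fetcher: Callable[[Source], object]):
    """Single entry point: per-source code supplies `fetcher` but cannot bypass the check."""
    source = REGISTRY[source_key]
    if not source.scrapeable:
        raise SourceNotScrapeable(source.name)
    return fetcher(source)
```

Putting the check in the shared entry point, not in each per-source module, means adding a new source cannot accidentally reach a blocked one.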
Active sources (as of 2026-05-28)
| Source | Category | Tier | Endpoint |
|---|---|---|---|
| ENERGY STAR (Heat Pumps + Geothermal + Boilers + Furnaces) | hvac | T1 | data.energystar.gov/api/views/*/rows.csv |
| ENERGY STAR Storm Windows | windows | T1 | data.energystar.gov/api/views/qaxz-ikcb/rows.csv |
| ENERGY STAR Insulation | insulation | T1 | data.energystar.gov/api/views/kphf-22jd/rows.csv |
| ENERGY STAR Ceiling + Vent Fans | electrical | T1 | data.energystar.gov/api/views/{2te3-nmxp,8dv7-nngq}/rows.csv |
| EPA WaterSense | plumbing | T1 | api.epa.gov/watersense/products/{type}/?offset=N |
AHRI Reference Numbers are recorded as product_certifications.cert_type='ahri'
rows when present in the ENERGY STAR HVAC CSV — this gives agents a
cross-link into the AHRI Directory without us needing the paid AHRI
Data Subscription Program license.
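The cross-linking step can be sketched as a pass over the CSV that emits one certification row per non-empty AHRI value. The column headers, the output key names other than cert_type, and the sample data below are assumptions for illustration. The real ENERGY STAR column names may differ.

```python
import csv
import io

# Assumed column headers; check the actual CSV before relying on these.
AHRI_COLUMN = "AHRI Reference Number"
MODEL_COLUMN = "Model Number"


def ahri_cert_rows(csv_text):
    """Return one cert_type='ahri' row for each record that carries an AHRI number."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        number = (record.get(AHRI_COLUMN) or "").strip()
        if number:  # the value is optional, so skip blanks
            rows.append({
                "cert_type": "ahri",
                "cert_number": number,  # hypothetical column name
                "model_number": (record.get(MODEL_COLUMN) or "").strip(),
            })
    return rows


# Made-up sample: one record with an AHRI number, one without.
sample = "Model Number,AHRI Reference Number\nABC-1,12345\nXYZ-2,\n"
cert_rows = ahri_cert_rows(sample)
```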
See ADR-0015 for the full per-source tier table and the three-prong legal framework (Feist + ToS + CFAA) that grounds the crawler's posture.
DMCA takedown
If you believe a specific product row infringes your copyright (e.g.,
we extracted protected expression rather than fact), send a
§512(c)(3)-compliant notice to:
Acknowledgement SLA: 24 hours. Removal SLA: 72 hours from receipt of a
valid, complete notice. We follow the §512(g) counter-notice process and
maintain a repeat-infringer policy per ADR-0015.
To expedite review, include in your notice:
- The specific product_id (visible in /v1/products/{id} responses) or the source_url of the row in question
- A description of the protected work
- A signed statement of good-faith belief that the material is infringing and not authorized by you, your agent, or the law
- Your contact information
We respond to all complete notices regardless of the requesting party's size or jurisdiction.
Reporting other concerns
For non-copyright concerns (a fact that's incorrect, a model number
that no longer exists, a source URL that 404s), email
[email protected] — these route to a different inbox and a
faster, non-legal review.