Six Registries, One Confidence Score: An AI Business-Classification Pipeline

How I built an AI-powered NAICS classification tool by fanning out to six public registries in parallel and using an LLM purely as a synthesis and confidence-scoring layer over retrieved evidence, not as the source of truth.

A client's operational data had a classification problem. Every business record in their system carries a NAICS code — the standard industry-classification code the U.S. government uses to describe what kind of business an entity is — and a meaningful number of those codes were wrong. That's not a cosmetic error. Downstream systems use NAICS to make real decisions: what priority a facility gets during an outage, what risk profile a business is assigned, what regulatory bucket it falls into. A hospital misclassified as a general office building gets treated like a general office building, right up until that's a problem.

They wanted a tool that could look up and validate the correct code fast, without a research team doing it by hand. I built it in an afternoon. This is the technical write-up: the architecture, the design decisions, and the pattern underneath it — because the pattern generalizes a lot further than NAICS codes.

The problem: the data exists, it's just scattered

There's no single authority that assigns or maintains NAICS codes. Businesses self-report one to whichever government registry they happen to interact with, and none of those registries talk to each other. A company's NAICS code might show up correctly in an SEC filing and incorrectly (or not at all) everywhere else. So the ground truth isn't missing — it's fragmented across half a dozen public systems, each with its own query shape, its own rate limits, and its own blind spots.

I mapped six structured public registries that each hold a slice of the answer space:

SEC EDGAR — public company 10-K filings include SIC codes and industry descriptions. Free, no key.
SAM.gov — any company that has registered to do business with the federal government self-reports a primary NAICS code here at registration. Free, but requires an API key.
NPPES — CMS's registry of every licensed healthcare provider in the country, with provider taxonomy codes that map directly to NAICS. This is the one that catches the "urgent care that used to be a restaurant" case that trips up name-based guessing. Free, no key.
USASpending.gov — every federal contract and grant, including the NAICS code the recipient self-reported when bidding. Free, no key.
OpenStreetMap Nominatim — geocoding data that tags facility types (fire stations, airports, schools, broadcast facilities) in a way that maps cleanly onto NAICS categories. Free, no key.
CMS Care Compare — separate CMS datasets specifically for nursing homes and dialysis centers, with state-level filtering. Free, no key.

None of these is a "look up any business by name" API on its own. Together, they cover most of the answer space — if you're willing to query all of them and reconcile what comes back.

Architecture: fan out first, synthesize second

The naive approach is to pick the "best" registry and accept its blind spots. The approach I built instead: query all six in parallel, and let agreement across independent sources do the confidence-scoring, rather than trusting any single source's answer at face value.
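Here's roughly what that fan-out looks like, as an illustrative sketch rather than the tool's actual code. The `fanOut` name, the connector table, and the shape of an evidence record are all assumptions made for the example; the only real API in play is the standard `Promise.allSettled`.

```javascript
// Hypothetical connector table: each value is an async function that takes a
// query and resolves to an array of evidence records (possibly empty).
async function fanOut(connectors, query) {
  const names = Object.keys(connectors);

  // Start every lookup at once. allSettled never rejects, so one failing
  // source cannot abort the others. The Promise.resolve().then(...) wrapper
  // also converts a synchronous throw into an ordinary rejection.
  const settled = await Promise.allSettled(
    names.map((name) => Promise.resolve().then(() => connectors[name](query)))
  );

  const evidence = [];
  const failures = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      // Tag each record with the source it came from so the trail survives.
      for (const record of outcome.value) {
        evidence.push({ source: names[i], ...record });
      }
    } else {
      failures.push({ source: names[i], reason: String(outcome.reason) });
    }
  });
  return { evidence, failures };
}
```

Total latency is bounded by the slowest connector rather than the sum of all six, and a rejected lookup shows up in `failures` instead of sinking the whole request.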

Each source is its own isolated serverless function on Cloudflare Pages — six independent connectors, each responsible for exactly one registry's query shape and response format. That's a deliberate failure-isolation choice, not just a code-organization one: if one registry is slow, rate-limited, or down, that single path returns empty and every other path keeps running unaffected. Nothing blocks on the slowest or least reliable source. A monolithic handler that queried all six sequentially inside one function would fail (or at minimum slow to a crawl) the instant any one upstream had a bad day; six independent functions don't have that failure mode.
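To make "returns empty" concrete, here is a sketch of what one connector could look like. Everything in it is an assumption for illustration: the `safeLookup`, `nppesUrl`, and `parseNppes` helpers are invented names, the four-second timeout is arbitrary, and the NPPES query parameters and response fields reflect my reading of the public NPI Registry API and should be checked against its documentation before use.

```javascript
// Run one upstream lookup with a deadline. Any failure (timeout, non-2xx
// status, malformed JSON) collapses to an empty array.
async function safeLookup(url, parse, { fetchImpl = fetch, timeoutMs = 4000 } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    if (!res.ok) return [];
    return parse(await res.json());
  } catch {
    return [];
  } finally {
    clearTimeout(timer);
  }
}

// Assumed NPPES query shape (version 2.1, organization_name, optional state).
function nppesUrl(name, state) {
  const u = new URL("https://npiregistry.cms.hhs.gov/api/");
  u.searchParams.set("version", "2.1");
  u.searchParams.set("organization_name", name);
  if (state) u.searchParams.set("state", state);
  return u.toString();
}

// Assumed response shape: results[].taxonomies[] with code, desc, primary.
function parseNppes(body) {
  return (body.results ?? []).flatMap((r) =>
    (r.taxonomies ?? [])
      .filter((t) => t.primary)
      .map((t) => ({
        entity: r.basic?.organization_name ?? null,
        taxonomyCode: t.code,
        taxonomyDesc: t.desc,
      }))
  );
}

// In a Pages Function file this handler would be exported as onRequestGet.
async function onRequestGet(context) {
  const params = new URL(context.request.url).searchParams;
  const url = nppesUrl(params.get("name") ?? "", params.get("state"));
  const records = await safeLookup(url, parseNppes);
  return Response.json({ source: "nppes", records });
}
```

The connector never throws to its caller, so a bad upstream is indistinguishable from "no match" as far as the rest of the pipeline is concerned. Whether that trade-off is right depends on how much you want to surface outages separately.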

A seventh path exists as the last resort: if all six structured sources come back empty, the tool fetches the company's own website, strips it down to readable text, and hands that raw context to an LLM for classification without any structured source backing it. That path is explicitly the lowest-confidence tier — it's there so the tool degrades gracefully instead of returning nothing, but it's never confused with an actual registry hit.
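A rough sketch of that fallback follows. The regex-based `htmlToText` is a deliberately crude stand-in for whatever text extraction the real tool uses, and the `withWebFallback` name, the `tier` labels, and the 8,000-character cap are all invented for the example.

```javascript
// Crude HTML-to-text reduction: drop script/style blocks, then all tags,
// decode two common entities, and collapse whitespace. A real HTML parser
// would be more robust; this only illustrates the idea.
function htmlToText(html, maxChars = 8000) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/&nbsp;/gi, " ")
    .replace(/&amp;/gi, "&")
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, maxChars);
}

// Only reach for the website when no structured source returned anything,
// and label the result so it can never be mistaken for a registry hit.
async function withWebFallback(structuredEvidence, fetchSiteHtml) {
  if (structuredEvidence.length > 0) {
    return { tier: "structured", evidence: structuredEvidence };
  }
  try {
    const text = htmlToText(await fetchSiteHtml());
    return { tier: "web-fallback", evidence: [{ source: "company-website", text }] };
  } catch {
    return { tier: "none", evidence: [] };
  }
}
```

Carrying the `tier` label alongside the evidence is what lets the synthesis step cap the confidence of a scrape-only answer.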

Where the AI is — and isn't — doing the work

Once the fan-out returns, everything collected goes to Claude in a single synthesis call: not one model call per source, but one call over all the retrieved evidence at once. The model's job is narrow and specific: cross-reference what came back, flag agreement or conflict between sources, and assign a confidence tier based on how many independent sources point to the same answer. Two or more structured sources agreeing scores Very High. A single structured-source hit scores High on its own. Web-scrape-only inference scores Medium or Low depending on how directly the page text states the business type.
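In the tool the model does this cross-referencing itself. The sketch below shows the same counting in plain code, purely to make "how many independent sources point to the same answer" concrete. It could also serve as a sanity check on the model's output. The `summarizeAgreement` name and the `{ source, naics }` record shape are assumptions for illustration.

```javascript
// evidence: [{ source: "sam", naics: "621493" }, ...]
// Counts distinct sources per NAICS code, so two hits from the same
// registry never count as two independent confirmations.
function summarizeAgreement(evidence) {
  const sourcesByCode = new Map();
  for (const { source, naics } of evidence) {
    if (!naics) continue; // skip records that carry no code
    if (!sourcesByCode.has(naics)) sourcesByCode.set(naics, new Set());
    sourcesByCode.get(naics).add(source);
  }

  const candidates = [...sourcesByCode.entries()]
    .map(([naics, sources]) => ({
      naics,
      sources: [...sources].sort(),
      count: sources.size,
    }))
    // Most-supported code first; ties broken by code for stable output.
    .sort((a, b) => b.count - a.count || a.naics.localeCompare(b.naics));

  return {
    candidates,
    agreeing: candidates.length > 0 ? candidates[0].count : 0,
    conflict: candidates.length > 1, // more than one distinct code reported
  };
}
```

The `agreeing` count is the input a tiering rule would key on, and `conflict` is the flag that tells a reader the sources disagreed even when one code clearly leads.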

The important architectural decision here is what the model is not asked to do: it is never asked to classify a business from its own training knowledge. It's asked to reason over evidence retrieved at request time, and only that evidence. That distinction is the difference between "the AI guessed" and "the AI synthesized verified, citable evidence" — and it's what makes the output auditable rather than a black box. Every result carries the source trail back to which registries said what, so anyone reading the output can see exactly why the tool landed on that classification instead of just being asked to trust it.
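One way to enforce that constraint is in how the request is assembled. The sketch below builds a request for the Anthropic Messages API. The endpoint, headers, and body fields follow that API as I understand it, but the `buildSynthesisRequest` name, the system prompt wording, the requested output shape, and the `max_tokens` value are illustrative assumptions, and the model identifier is left as a parameter rather than guessed.

```javascript
// Builds (but does not send) the synthesis request. The evidence array is
// the only business-specific information the model receives.
function buildSynthesisRequest(businessName, evidence, { apiKey, model }) {
  const system = [
    "You assign NAICS codes to businesses.",
    "Use ONLY the evidence records in the user message.",
    "Do not rely on prior knowledge of the business.",
    "Cite the source field of every record that supports your answer.",
    "If the evidence is empty or contradictory, say so instead of guessing.",
    "Reply as JSON with keys: naics, confidence, sources, notes.",
  ].join(" ");

  return {
    url: "https://api.anthropic.com/v1/messages",
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
      },
      body: JSON.stringify({
        model, // supplied by the caller; no specific model is assumed here
        max_tokens: 1024,
        system,
        messages: [
          {
            role: "user",
            content: JSON.stringify({ business: businessName, evidence }, null, 2),
          },
        ],
      }),
    },
  };
}
```

Sending it is then a single `fetch(req.url, req.init)`. Because every evidence record keeps its `source` field, the answer that comes back can be checked line by line against what each registry actually returned.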

The stack, and why it's this small

Cloudflare Pages for static hosting. Cloudflare Pages Functions — plain ES modules, one per data source — for the connectors. Claude (via the Anthropic API) as the synthesis layer. No database, anywhere in the request path.

The no-database choice is deliberate, not an oversight. A classification lookup doesn't need to persist anything server-side to do its job, and skipping a database means there's no PII sitting at rest, nothing to back up, and nothing to patch. Total run cost at real usage volume is a few dollars a month in Anthropic API calls; everything else on the stack lives comfortably inside free tiers.

The pattern is the real deliverable

The specific tool solves NAICS classification. The pattern underneath it is reusable well past that one problem: identify the public registries that plausibly already contain the answer, query them concurrently with independent failure domains so one bad upstream can't take down the rest, and use an LLM as a synthesis-and-confidence-scoring layer over evidence that was actually retrieved — not as the source of truth itself. That pattern applies to contractor licensing verification, provider credentialing, business-risk screening, property classification — any problem where the ground truth exists but is scattered across several authoritative, incomplete sources, and nobody has built the connector yet.

The AI synthesis is genuinely the easy 20% of this build. Handing six structured JSON blobs to a model and getting a scored answer back is not hard. The actual engineering work was mapping which registry holds which slice of the answer space, learning each one's query shape and rate limits, and building fan-out and fallback logic that degrades gracefully instead of failing outright the moment one source has a bad day.

I'm Jesse Myers — Marine veteran, 32 years in enterprise IT, now building production AI systems. This site is where I write about what I've actually built, technically, in my own words.