New York City Open Data Explorer
Methodology

What this is

A re-presentation of New York City's public Open Data catalog — the same datasets you'd find at data.cityofnewyork.us, organized for browsing, with a search engine that's more forgiving than the official portal's. Every link points back to the City's authoritative version of the dataset.

Source

The catalog is fetched from the public Socrata Discovery API:

https://api.us.socrata.com/api/catalog/v1?domains=data.cityofnewyork.us

This is the same API the City uses to power its own portal search. Every record includes the dataset's name, agency-written description, the City's category assignment, tags, last-updated timestamp, and view counts. We pull the full catalog in pages of 100.
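The paging described above can be sketched as follows. This is an illustrative sketch, not the project's actual fetch code (which lives in build/fetch_catalog.py); the function name fetch_catalog and the injectable get parameter are ours, added so the loop can be exercised without a live network call.

```python
import requests

API = "https://api.us.socrata.com/api/catalog/v1"
PAGE_SIZE = 100

def fetch_catalog(domain="data.cityofnewyork.us", get=requests.get):
    """Pull every catalog record, PAGE_SIZE at a time, until a short page
    signals the end of the catalog."""
    records, offset = [], 0
    while True:
        resp = get(API, params={"domains": domain,
                                "limit": PAGE_SIZE,
                                "offset": offset})
        page = resp.json().get("results", [])
        records.extend(page)
        if len(page) < PAGE_SIZE:   # last (possibly empty) page
            return records
        offset += PAGE_SIZE
```

Injecting get also makes the loop easy to point at a mirror or a recorded fixture during development.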

Categories

Categories on the home page are mostly the City's own assignments (domain_category in the Socrata API). Where the City didn't categorize a dataset, it appears under "Uncategorized."

One deliberate departure from the City's labeling: the City's "City Government" bucket holds 904 datasets, which is unhelpful — every dataset on the portal is technically city government data. We split that one bucket into five more useful buckets using deterministic, transparent rules. For example, a small number of Parks Department datasets that the City filed under City Government are moved to Recreation. The full set of rules is in refine_government() in build/fetch_catalog.py. All other City categories are preserved as published.
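The shape of such a rule can be sketched like this. Only the Parks-to-Recreation move documented above is shown; the authoritative and complete logic is refine_government() in build/fetch_catalog.py, and the field names here (domain_category, agency) are illustrative.

```python
def refine_government(record):
    """Return the display category for a catalog record.

    Sketch of one re-bucketing rule: Parks Department datasets that the
    City filed under "City Government" are shown under "Recreation"."""
    category = record.get("domain_category") or "Uncategorized"
    if category != "City Government":
        return category
    agency = (record.get("agency") or "").lower()
    if "parks" in agency:   # Parks Dept datasets move to Recreation
        return "Recreation"
    return "City Government"
```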

Plain-language summaries

Each card shows a one-sentence summary. That summary is the first sentence of the agency-written description, with HTML markup, entity codes, and stray whitespace removed and the length capped at 240 characters. We do not paraphrase or rewrite agency language at this stage. Future iterations may add a model-generated plain-English layer, which would be clearly labeled as such on the methodology page when it ships.
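A minimal sketch of that cleanup pipeline, assuming the steps named above (tag stripping, entity decoding, whitespace collapse, 240-character cap); the sentence split on terminal punctuation is a simplification of whatever the build actually does:

```python
import html
import re

MAX_LEN = 240

def summarize(description: str) -> str:
    """First sentence of the agency-written description: HTML tags and
    entity codes removed, whitespace collapsed, capped at MAX_LEN chars."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", description or ""))
    text = re.sub(r"\s+", " ", text).strip()
    # naive sentence split on terminal punctuation; agency prose is
    # usually plain enough for this to hold
    first = re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]
    return first[:MAX_LEN]
```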

Freshness pills

The colored pill on each card is computed from the dataset's data_updated_at timestamp:

Fresh (green): updated within the last 30 days
Recent (yellow): updated within the last year
Stale (red): older than a year, or no update timestamp

Note: "updated" reflects the dataset's data refresh, not metadata edits. Some agencies update on schedules (daily, monthly, quarterly); a "stale" pill doesn't always mean the dataset is abandoned, only that it hasn't refreshed in a year.
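The thresholds above can be expressed as a small classifier. A sketch under the stated rules; the function name and the injectable now parameter are ours, and the real computation lives in the build scripts:

```python
from datetime import datetime, timedelta, timezone

def freshness_pill(data_updated_at, now=None):
    """Map an ISO-8601 data_updated_at timestamp to a pill label."""
    now = now or datetime.now(timezone.utc)
    if not data_updated_at:
        return "stale"   # no update timestamp at all
    updated = datetime.fromisoformat(data_updated_at.replace("Z", "+00:00"))
    age = now - updated
    if age <= timedelta(days=30):
        return "fresh"
    if age <= timedelta(days=365):
        return "recent"
    return "stale"
```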

Agency normalization

Agency names on the City portal are inconsistent — the Department of Finance appears as both "Department of Finance" and "Department of Finance (DOF)", and curly versus straight apostrophes create more duplicates ("Mayor’s" with a curly apostrophe vs. "Mayor's" with a straight one). For the agency filter, we group records whose agency name matches after stripping a trailing parenthesized acronym, normalizing apostrophes, and lowercasing. The display name is the most common original spelling for that group. The exact rule is in normalize_agency() in build/fetch_catalog.py.
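A sketch of that grouping, assuming only the three normalizations named above; the authoritative version is normalize_agency() in build/fetch_catalog.py, and display_name is a helper we added here to illustrate the "most common original spelling" rule:

```python
import re
from collections import Counter

def normalize_agency(name: str) -> str:
    """Grouping key: trailing "(ACRONYM)" stripped, curly apostrophes
    straightened, whitespace trimmed, lowercased."""
    name = re.sub(r"\s*\([A-Z&./ ]+\)\s*$", "", name or "")
    return name.replace("\u2019", "'").strip().lower()

def display_name(originals):
    """Most common original spelling within a normalized group."""
    return Counter(originals).most_common(1)[0][0]
```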

Search

Free-text search runs entirely in your browser using Fuse.js across dataset names, summaries, tags, and publishing agencies. It's a fuzzy match — typos and partial matches will still return results — and weights names highest. There's no logging, tracking, or server-side query handling.

The search box also supports a small set of operators, parsed before the fuzzy match runs.

Favorites

Click the heart on any dataset card to save it to your favorites. Favorites are stored in your browser's localStorage — there's no account, no login, and no data sent to any server. The downside: favorites don't sync across devices or browsers. If you clear your browser data, they're gone. The "My favorite datasets" section appears at the top of the page whenever you have any.

"The week in city data" cards

The five small cards near the top of the page show numbers pulled directly from City datasets refreshed in the last 7 days, queried via Socrata's SoQL. The current set: 311 complaint volume + most-common type, motor-vehicle crashes + persons injured, DOB construction permits issued, restaurant inspections completed + grade-A share, and total datasets refreshed. Each card links to its source dataset on data.cityofnewyork.us. If a query times out or 500s on rebuild, the card is silently skipped (the page still works). The exact queries are in build/weekly_stats.py; output lands in data/weekly_stats.json.
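One of those SoQL queries might be built like this. This is a hypothetical reconstruction, not the code in build/weekly_stats.py: erm2-nwe9 is the public Socrata id for the City's 311 Service Requests dataset, and created_date / unique_key are its real column names, but the exact $select and $where clauses the project uses are assumptions.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# 311 Service Requests resource endpoint on the City's Socrata domain
RESOURCE = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"

def weekly_311_query(today=None):
    """URL for a SoQL count of 311 requests created in the last 7 days."""
    today = today or date.today()
    since = (today - timedelta(days=7)).isoformat()
    params = {"$select": "count(unique_key) AS n",
              "$where": f"created_date >= '{since}'"}
    return RESOURCE + "?" + urlencode(params)
```

Fetching that URL (with a timeout, so a slow endpoint just skips the card as described above) returns a one-row JSON array holding the count.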

"In the news" rail section

Once per rebuild, we fetch recent NYC headlines from Google News's public RSS feed and pair each headline with the most relevant dataset on data.cityofnewyork.us. The pairing is editorial, not a clever NLP guess — we maintain a curated topic dictionary in build/news_match.py that maps news keywords (e.g. "shooting," "eviction," "subway delay") to specific datasets. Headlines that don't match a curated topic are skipped. The output goes to data/news_matches.json. Headlines link out to the original news article; dataset names link to data.cityofnewyork.us. Add or refine topics by editing the TOPICS list in the script.
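The shape of that keyword-to-dataset mapping can be sketched as below. The entries and the match_headline helper are illustrative; the real curated TOPICS list lives in build/news_match.py.

```python
# Hypothetical shape of the curated topic dictionary; the authoritative
# list is TOPICS in build/news_match.py.
TOPICS = [
    {"keywords": ("shooting", "shots fired"),
     "dataset": "NYPD Shooting Incident Data"},
    {"keywords": ("eviction",),
     "dataset": "Evictions"},
]

def match_headline(headline: str):
    """Return the curated dataset for the first matching topic,
    or None (headline skipped, per the rule above)."""
    lowered = headline.lower()
    for topic in TOPICS:
        if any(kw in lowered for kw in topic["keywords"]):
            return topic["dataset"]
    return None
```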

"What's fresh" strip

Two horizontal carousels at the top of the page surface what's actually moving in the catalog: brand new this month (datasets whose createdAt is in the last 30 days, sorted by creation date) and updated this week (datasets whose data_updated_at is in the last 7 days, sorted by view count). Both lists are precomputed during the weekly build, so no client-side scanning is needed.
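The precomputation described above amounts to two filtered sorts over the catalog. A sketch under the stated rules; the field names (createdAt, data_updated_at, view_count) follow the Socrata conventions used elsewhere on this page, and the function itself is ours:

```python
from datetime import datetime, timedelta, timezone

def fresh_strips(records, now=None):
    """Build the two carousels: datasets created in the last 30 days
    (newest first) and datasets updated in the last 7 days (most viewed
    first)."""
    now = now or datetime.now(timezone.utc)

    def ts(s):
        return datetime.fromisoformat(s.replace("Z", "+00:00")) if s else None

    new_this_month = sorted(
        (r for r in records
         if ts(r.get("createdAt"))
         and now - ts(r["createdAt"]) <= timedelta(days=30)),
        key=lambda r: r["createdAt"], reverse=True)

    updated_this_week = sorted(
        (r for r in records
         if ts(r.get("data_updated_at"))
         and now - ts(r["data_updated_at"]) <= timedelta(days=7)),
        key=lambda r: r.get("view_count", 0), reverse=True)

    return new_this_month, updated_this_week
```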

Journalist picks

About thirty datasets carry a gold star — these are an explicitly editorial selection of the datasets New York City newsrooms most often rely on (NYPD complaint data, 311, motor-vehicle collisions, ACRIS deeds, restaurant inspections, evictions, school quality, payroll, campaign finance, lobbying, jail population, and a few others). Each pick has a hand-written "why journalists use it" note and a "gotcha" callout. The list is in data/journalist_picks.json and is curated, not algorithmic — pull requests welcome. This layer is editorial; everything else on the page is mechanical.

Per-category feeds

For every category we publish a JSON and an RSS 2.0 feed of the most recently updated datasets, plus a master "newest datasets" feed. Feeds live under feeds/ and are regenerated on every weekly rebuild by build/generate_feeds.py.
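A minimal per-category RSS 2.0 feed might be assembled like this. This is a sketch, not build/generate_feeds.py itself; the channel text and the expected dataset fields (name, link, data_updated_at) are assumptions, and a production feed would format pubDate as RFC 822 dates.

```python
import xml.etree.ElementTree as ET

def category_rss(category, datasets):
    """Serialize a category's recently updated datasets as RSS 2.0."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"NYC Open Data: {category}"
    ET.SubElement(channel, "link").text = "https://data.cityofnewyork.us"
    ET.SubElement(channel, "description").text = (
        f"Recently updated datasets in {category}")
    for d in datasets:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = d["name"]
        ET.SubElement(item, "link").text = d["link"]
        # RSS 2.0 expects RFC 822 dates here; conversion omitted in
        # this sketch
        ET.SubElement(item, "pubDate").text = d["data_updated_at"]
    return ET.tostring(rss, encoding="unicode")
```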

Refresh cadence

The catalog is rebuilt weekly. The "Catalog refreshed" date in the search bar reflects the last successful fetch. Datasets new to the City portal appear in the explorer at the next rebuild.

Limitations and what's not here

Code and rebuild

The fetch script is build/fetch_catalog.py in the repository. To rebuild locally: run python3 build/fetch_catalog.py, which writes data/catalog.json (full archive) and data/catalog.min.json (the file the front end actually loads).

Contact

Questions, bugs, or category gripes: file a GitHub issue on the project repository.