How to Access and Query the FiveThirtyEight Internet Archive Index with Python 2025
Prerequisites and Setup Checklist
Before you write a single line of code, make sure your environment matches what's listed below. The fivethirtyeightindex.com site is a SvelteKit application that exposes server-side JSON data files alongside a downloadable CSV. Most of the gotchas people hit come from environment mismatches or misunderstanding the data structure, not the code itself.
- [ ] Python 3.9 or higher installed (python --version)
- [ ] requests 2.31+ (pip install requests)
- [ ] pandas 2.0+ (pip install pandas)
- [ ] lxml or html5lib for any HTML parsing (pip install lxml)
- [ ] A stable internet connection (the Wayback Machine API is rate-limited)
- [ ] Optional: a GitHub account if you want to clone the index source
| Component | Required Version | Install Command |
|---|---|---|
| Python | 3.9+ | — |
| requests | 2.31+ | pip install requests |
| pandas | 2.0+ | pip install pandas |
| lxml | 4.9+ | pip install lxml |
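The version requirements above can be checked from Python itself rather than by eye. The following is a minimal sketch using only the standard library; the helper names (`as_tuple`, `meets_minimum`, `check_environment`) are illustrative and not part of any package mentioned in this guide, and the version parsing is deliberately simple (it ignores suffixes such as `rc1`).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the table above.
MINIMUMS = {"requests": "2.31", "pandas": "2.0", "lxml": "4.9"}


def as_tuple(v: str) -> tuple:
    """Turn '2.31.0' into (2, 31, 0); stop at the first non-numeric part."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted version strings numerically."""
    return as_tuple(installed) >= as_tuple(minimum)


def check_environment(minimums: dict = MINIMUMS) -> dict:
    """Return {package: 'ok' | 'too old (x.y.z)' | 'missing'}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for pkg, status in check_environment().items():
        print(f"{pkg}: {status}")
```

Running the script prints one line per package, so a "missing" or "too old" entry points directly at the install command to re-run.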
Understanding the data structure: The site aggregates 38,593 items across five content types: articles (20,780), datasets (166), podcasts (1,233), graphics (13,276), and illustrations (3,138). The SvelteKit app hydrates the page with a large inline JSON blob (look for __sveltekit_ in the page source), and the same data is available as a flat CSV download. The CSV columns you'll work with are: Date, Headline, Byline, and type (values: article, dataset, podcast, graphic, illustration).
Internet Archive basics: Archive.org preserves fivethirtyeight.com snapshots via their Wayback Machine. You can programmatically check whether a URL has been archived using the Availability API at https://archive.org/wayback/available?url=<url>. No API key required, but keep requests under ~15/minute to avoid HTTP 429 errors.
Estimated time: 30–45 minutes.
Step 1: Download the Full Index as a CSV File
The CSV download is the fastest path to working with the complete dataset. It's a single flat file containing every article, dataset, podcast, graphic, and illustration that FiveThirtyEight ever published — all with consistent columns. Starting here gives you a local copy you can filter and re-query without making repeated HTTP requests.
Locating the CSV download link
The CSV is linked directly from the fivethirtyeightindex.com homepage with the label "Download the full index as CSV". As of 2025, the download URL is https://fivethirtyeightindex.com/index.csv. The file is regenerated from the upstream GitHub repository maintained by Ben Welsh.
Fetching and loading the CSV with Python
import requests
import pandas as pd
from io import StringIO

CSV_URL = "https://fivethirtyeightindex.com/index.csv"


def download_index(url: str = CSV_URL, filepath: str = "fte_index.csv") -> pd.DataFrame:
    """Download the FiveThirtyEight index CSV and return a cleaned DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Write raw bytes to disk for caching
    with open(filepath, "wb") as f:
        f.write(response.content)
    print(f"Downloaded {len(response.content):,} bytes to {filepath}")

    # Read with explicit encoding to handle special characters in headlines
    df = pd.read_csv(
        filepath,
        encoding="utf-8-sig",  # handles BOM if present
        parse_dates=["Date"],
        dtype={"Headline": str, "Byline": str, "type": str},
    )

    # Normalize column names
    df.columns = [c.strip().lower() for c in df.columns]
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["year"] = df["date"].dt.year
    print(f"Loaded {len(df):,} rows | Columns: {list(df.columns)}")
    return df


df = download_index()
print(df.head())
print(df["type"].value_counts())
Note: If fivethirtyeightindex.com is unreachable, the raw CSV is also committed to the project's GitHub repository. Check https://github.com/palewire/fivethirtyeight-index for the latest release assets.
After running this, you'll have a DataFrame with roughly 38,593 rows and columns: date, headline, byline, type, and the derived year. The type value_counts will confirm the breakdown across articles, datasets, podcasts, graphics, and illustrations.
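To make that confirmation explicit, the observed counts can be compared against the totals quoted earlier in this guide. This is a small sketch, not part of the index project: `EXPECTED_COUNTS` and `compare_type_counts` are names introduced here for illustration, and the function works on a plain dictionary so it can be tried without downloading anything.

```python
# Totals quoted in this guide for each value of the `type` column.
EXPECTED_COUNTS = {
    "article": 20780,
    "dataset": 166,
    "podcast": 1233,
    "graphic": 13276,
    "illustration": 3138,
}


def compare_type_counts(observed: dict, expected: dict = EXPECTED_COUNTS) -> dict:
    """Return {type: observed - expected} for every type whose count differs."""
    diffs = {}
    for name in sorted(set(observed) | set(expected)):
        delta = observed.get(name, 0) - expected.get(name, 0)
        if delta != 0:
            diffs[name] = delta
    return diffs


# Usage with the DataFrame returned by download_index():
#   observed = df["type"].value_counts().to_dict()
#   print(compare_type_counts(observed) or "All counts match")
```

An empty result means the local copy matches the published breakdown; a non-empty one shows which types gained or lost rows, which is a quick signal that the upstream file has changed.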
Step 2: Query the JSON API Endpoints Directly
The SvelteKit data files that power fivethirtyeightindex.com return clean JSON responses when hit directly. This is useful when you want summary statistics or structured metadata (like podcast series names or graphic categories) without parsing the full CSV.
Discovering the endpoints
From the SvelteKit page source, the app fetches five distinct data endpoints. Based on the observed network responses, the patterns follow SvelteKit's __data.json convention. The inline hydration data confirms the totals: 20,780 articles, 166 datasets, 1,233 podcasts, 13,276 graphics, and 3,138 illustrations.
import requests
import json

BASE_URL = "https://fivethirtyeightindex.com"

# These endpoint paths correspond to SvelteKit route data files
ENDPOINTS = {
    "articles": f"{BASE_URL}/articles/__data.json",
    "datasets": f"{BASE_URL}/datasets/__data.json",
    "podcasts": f"{BASE_URL}/podcasts/__data.json",
    "graphics": f"{BASE_URL}/graphics/__data.json",
    "illustrations": f"{BASE_URL}/illustrations/__data.json",
}

# Fallback: known values from the source, used when an endpoint cannot be fetched or parsed
FALLBACK = {
    "articles": {"total": 20780},
    "datasets": {"total": 166},
    "podcasts": {"total": 1233, "series": ["gerrymandering", "hot-takedown", "podcast-19", "politics", "the-lab", "whats-the-point"]},
    "graphics": {"total": 13276, "categories": ["chart", "chart-screenshot", "infographic", "map", "table"]},
    "illustrations": {"total": 3138},
}


def fetch_summary_stats(endpoints: dict = ENDPOINTS) -> dict:
    """Fetch and print summary statistics from the FiveThirtyEight index API."""
    stats = {}
    headers = {"Accept": "application/json", "User-Agent": "fte-index-client/1.0"}
    for name, url in endpoints.items():
        try:
            resp = requests.get(url, headers=headers, timeout=15)
            resp.raise_for_status()
            data = resp.json()
            stats[name] = data
            print(f"{name}: total={data.get('total', 'N/A')}")
            if "series" in data:
                print(f"  Series: {data['series']}")
            if "categories" in data:
                print(f"  Categories: {data['categories']}")
        except (requests.RequestException, ValueError) as e:
            # RequestException covers HTTP errors, timeouts, and connection failures;
            # ValueError covers a response body that is not valid JSON
            print(f"Failed to fetch {name}: {e}")
            stats[name] = FALLBACK.get(name, {})
    return stats


result = fetch_summary_stats()
print(json.dumps(result, indent=2))
Note: SvelteKit's __data.json endpoint paths are tied to the deployed route structure. If the site is redeployed with updated chunk hashes, fall back to scraping the inline __sveltekit_* variable from the homepage HTML, or use the CSV download as a stable alternative.
The podcast series list (gerrymandering, hot-takedown, podcast-19, politics, the-lab, whats-the-point) and the graphics category list (chart, chart-screenshot, infographic, map, table) are useful for building filtered views.
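One caveat is worth sketching before relying on keys like `total` or `series`. SvelteKit typically serializes `__data.json` payloads in a flattened, index-referenced form rather than as plain nested JSON: a single list in which dictionaries and lists hold integer positions of their values. That behavior is an assumption here, not something the source confirms for this site, and the location of the flat list inside the payload (often under `nodes[i]["data"]`) is likewise assumed. The hypothetical `unflatten` helper below rebuilds a nested structure from such a list and handles only dictionaries, lists, and primitives.

```python
def unflatten(values: list):
    """Rebuild a nested structure from a flattened, index-referenced list.

    Assumed layout: values[0] is the root; dict values and list items are
    integer indices pointing at other entries of `values`.
    """
    cache = {}

    def hydrate(index: int):
        if index < 0:
            # Assumed: negative indices mark undefined or other special values.
            return None
        if index in cache:
            return cache[index]
        value = values[index]
        if isinstance(value, dict):
            result = {}
            cache[index] = result
            for key, ref in value.items():
                result[key] = hydrate(ref)
        elif isinstance(value, list):
            if value and isinstance(value[0], str):
                # Assumed: a list starting with a string is a typed value
                # (for example a date); this sketch leaves it untouched.
                result = value
                cache[index] = result
            else:
                result = []
                cache[index] = result
                result.extend(hydrate(ref) for ref in value)
        else:
            result = value
            cache[index] = result
        return result

    return hydrate(0)


# Synthetic example shaped like the podcast summary described above.
flat = [{"total": 1, "series": 2}, 1233, [3, 4], "politics", "the-lab"]
print(unflatten(flat))
```

If the live endpoints turn out to return plain nested JSON, this step is unnecessary and the parsed response can be used directly.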
Step 3: Filter and Explore Articles by Year and Byline
Once you have the DataFrame loaded from Step 1, the real analysis begins. FiveThirtyEight spanned 2008–2024 and had 554 distinct bylines. Knowing the distribution by author and year is essential for any downstream use — whether you're building a search tool, training a model, or just doing journalism research.
Filtering by year range and ranking bylines
import pandas as pd

# Assumes df is already loaded from Step 1

# Filter to articles only (exclude datasets, podcasts, graphics, illustrations)
articles = df[df["type"] == "article"].copy()
print(f"Total articles: {len(articles):,}")

# --- Filter by year range (2016–2020) ---
mask = articles["year"].between(2016, 2020)
articles_2016_2020 = articles[mask]
print(f"Articles published 2016–2020: {len(articles_2016_2020):,}")

# --- Rank top 10 bylines by total article count ---
top_bylines = (
    articles["byline"]
    .value_counts()
    .head(10)
    .reset_index()
)
top_bylines.columns = ["byline", "article_count"]
print("\nTop 10 bylines (all years):")
print(top_bylines.to_string(index=False))

# --- Group articles by author and year ---
byline_year = (
    articles.groupby(["byline", "year"])
    .size()
    .reset_index(name="count")
    .sort_values(["byline", "year"])
)

# Show Nate Silver's annual output
nate = byline_year[byline_year["byline"] == "Nate Silver"]
print("\nNate Silver articles by year:")
print(nate.to_string(index=False))

# --- Cross-tab: top 5 authors vs. year for 2016–2020 ---
top5_names = top_bylines["byline"].head(5).tolist()
crosstab = pd.crosstab(
    articles_2016_2020["byline"],
    articles_2016_2020["year"]
).loc[lambda x: x.index.isin(top5_names)]
print("\nTop 5 authors, articles per year (2016–2020):")
print(crosstab)
The confirmed top bylines from the source data are: Nate Silver (4,533), Neil Paine (1,442), Walt Hickey (1,210), Aaron Bycoffe (1,184), Oliver Roeder (712), Nathaniel Rakich (680), Harry Enten (673), Galen Druke (569), Dhrumil Mehta (552), and Perry Bacon Jr (479). These numbers include all content types attributed to that byline, so filter by type == 'article' if you only want editorial pieces.
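Because the totals above mix content types, it can help to see each byline's count split by type next to the combined figure. The sketch below assumes the normalized column names from Step 1 (`byline`, `type`); the function name `byline_counts_by_type` and the tiny demo frame are illustrative only.

```python
import pandas as pd


def byline_counts_by_type(df: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Rows are bylines, columns are content types, plus an 'all_types' total."""
    table = pd.crosstab(df["byline"], df["type"])
    table["all_types"] = table.sum(axis=1)
    return table.sort_values("all_types", ascending=False).head(top_n)


# Tiny synthetic frame standing in for the real index.
demo = pd.DataFrame(
    {
        "byline": ["A", "A", "A", "B", "B"],
        "type": ["article", "graphic", "article", "article", "podcast"],
    }
)
summary = byline_counts_by_type(demo)
print(summary)
```

On the real DataFrame, the `article` column of this table is the figure to quote when only editorial pieces matter, while `all_types` corresponds to the combined totals. Rows with a missing byline are dropped by `pd.crosstab`.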
Step 4: Retrieve Archived Article URLs from the Internet Archive
FiveThirtyEight.com's original URLs still work for many articles, but ABC News (which acquired the site) has deprecated or redirected some older paths. The Internet Archive's Wayback Machine is your safety net. This step shows you how to programmatically resolve a stable web.archive.org snapshot URL for any article in the index.
Using the Wayback Machine Availability API
The Availability API is free, requires no authentication, and returns the closest archived snapshot for a given URL. The endpoint is: https://archive.org/wayback/available?url=<url>&timestamp=<YYYYMMDD>.
import requests
import time
import pandas as pd
from typing import Optional  # keeps the annotations valid on Python 3.9

WAYBACK_API = "https://archive.org/wayback/available"


def get_archive_url(article_url: str, timestamp: str = "20230101") -> Optional[str]:
    """Return the closest Wayback Machine snapshot URL for a given article URL."""
    params = {"url": article_url, "timestamp": timestamp}
    try:
        resp = requests.get(WAYBACK_API, params=params, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        snapshots = data.get("archived_snapshots", {})
        closest = snapshots.get("closest", {})
        if closest.get("available"):
            return closest["url"]
        return None
    except requests.RequestException as e:
        print(f"Error fetching archive URL for {article_url}: {e}")
        return None


def bulk_fetch_archive_urls(
    df: pd.DataFrame,
    url_column: str = "url",
    max_articles: int = 20,
    sleep_seconds: float = 4.0,
) -> pd.DataFrame:
    """
    Resolve Wayback Machine snapshot URLs for a list of FiveThirtyEight articles.
    Rate-limited to ~15 requests/minute to avoid HTTP 429.
    """
    sample = df.head(max_articles).copy()
    archive_urls = []
    for i, row in enumerate(sample.itertuples(), start=1):
        article_url = getattr(row, url_column, None)
        if not article_url:
            archive_urls.append(None)
            continue
        # Report progress against the actual sample size, which may be smaller than max_articles
        print(f"[{i}/{len(sample)}] Checking: {article_url}")
        archived = get_archive_url(article_url)
        archive_urls.append(archived)
        # Stay well under the rate limit
        time.sleep(sleep_seconds)
    sample["archive_url"] = archive_urls
    return sample


# Example: resolve archive URLs for the first 10 articles with a known URL
# If your DataFrame has a 'url' column:
# result = bulk_fetch_archive_urls(articles_with_urls, url_column="url", max_articles=10)

# Quick single-URL test:
test_url = "https://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/"
archived = get_archive_url(test_url)
print(f"Archived URL: {archived}")
Note: Set sleep_seconds=4.0 for a safe rate of 15 requests/minute. If you get HTTP 429 responses, increase it to 6.0. The Wayback Machine CDX API (at http://web.archive.org/cdx/search/cdx) gives you a full list of all snapshots for a URL if you need to pick the best capture date rather than the closest.
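Once a full CDX listing is in hand, choosing a capture is a small local computation. The sketch below assumes the listing was requested with `output=json` and the fields `timestamp` and `original`, so that the first row is a header and timestamps use the 14-digit `YYYYMMDDhhmmss` form. The function name `pick_closest_snapshot` is illustrative, and the rows in the example are synthetic rather than real captures.

```python
from datetime import datetime
from typing import List, Optional


def pick_closest_snapshot(rows: List[List[str]], target: str) -> Optional[str]:
    """Pick the capture nearest to `target` (YYYYMMDD) from CDX JSON rows.

    `rows` is assumed to start with a header row naming the columns.
    """
    if len(rows) < 2:
        return None  # header only, or nothing at all
    header, *captures = rows
    ts_i = header.index("timestamp")
    url_i = header.index("original")
    target_dt = datetime.strptime(target, "%Y%m%d")

    def distance(row: List[str]):
        captured = datetime.strptime(row[ts_i], "%Y%m%d%H%M%S")
        return abs(captured - target_dt)

    best = min(captures, key=distance)
    return f"https://web.archive.org/web/{best[ts_i]}/{best[url_i]}"


# Synthetic rows in the assumed CDX JSON shape.
example_rows = [
    ["timestamp", "original"],
    ["20150101000000", "https://fivethirtyeight.com/features/example/"],
    ["20221215120000", "https://fivethirtyeight.com/features/example/"],
    ["20240301000000", "https://fivethirtyeight.com/features/example/"],
]
print(pick_closest_snapshot(example_rows, "20230101"))
```

Swapping the `distance` function changes the selection policy, for example preferring the earliest capture on or after publication instead of the nearest one.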
Step 5: Access Dataset Metadata and GitHub Links
The 166 FiveThirtyEight datasets are a goldmine — cleaned, documented data used in published journalism. Each dataset entry in the index includes a dataset_url pointing to the GitHub repository, an archive_url on archive.org, and the date it was first associated with an article. Fetching them directly from GitHub means you get the exact data the reporters used.
Parsing dataset entries and fetching raw CSVs
import time  # needed for the polite delay in fetch_all_datasets
import requests
import pandas as pd
from io import StringIO
from typing import Optional  # keeps the annotations valid on Python 3.9

# Known dataset slugs from the index source
DATASET_SLUGS = [
    "ahca-polls",
    "airline-safety",
    "alcohol-consumption",
    "bad-drivers",
    "bechdel",
]

GITHUB_RAW_BASE = "https://raw.githubusercontent.com/fivethirtyeight/data/master"

# Map dataset slug to the specific CSV filename within the repo directory
DATASET_FILES = {
    "airline-safety": "airline-safety/airline-safety.csv",
    "alcohol-consumption": "alcohol-consumption/drinks.csv",
    "bad-drivers": "bad-drivers/bad-drivers.csv",
    "bechdel": "bechdel/movies.csv",
    "ahca-polls": "ahca-polls/ahca_polls.csv",
}


def fetch_dataset_csv(slug: str, files_map: dict = DATASET_FILES) -> Optional[pd.DataFrame]:
    """Fetch a raw CSV from the fivethirtyeight GitHub data repository."""
    relative_path = files_map.get(slug)
    if not relative_path:
        print(f"No file mapping found for slug: {slug}")
        return None
    url = f"{GITHUB_RAW_BASE}/{relative_path}"
    print(f"Fetching: {url}")
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    df = pd.read_csv(StringIO(resp.text))
    print(f"  → {len(df)} rows, {len(df.columns)} columns: {list(df.columns)}")
    return df


# Example: load the Airline Safety dataset
airline_df = fetch_dataset_csv("airline-safety")
if airline_df is not None:
    print("\nAirline Safety Dataset preview:")
    print(airline_df.head())
    print("\nIncidents per airline (1985–1999):")
    print(
        airline_df[["airline", "incidents_85_99", "fatal_accidents_85_99"]]
        .sort_values("incidents_85_99", ascending=False)
        .head(10)
        .to_string(index=False)
    )


# Iterate over all known datasets
def fetch_all_datasets(slugs: list, files_map: dict = DATASET_FILES) -> dict:
    """Fetch all datasets and return a dict of slug -> DataFrame."""
    results = {}
    for slug in slugs:
        try:
            df = fetch_dataset_csv(slug, files_map)
            if df is not None:
                results[slug] = df
        except requests.HTTPError as e:
            print(f"Skipping {slug}: {e}")
        time.sleep(1)  # polite delay between GitHub requests
    return results


all_datasets = fetch_all_datasets(DATASET_SLUGS)
print(f"\nSuccessfully fetched {len(all_datasets)} datasets.")
The airline-safety CSV has columns like airline, avail_seat_km_per_week, incidents_85_99, fatal_accidents_85_99, fatalities_85_99, incidents_00_14, fatal_accidents_00_14, and fatalities_00_14. It's one of the most-cited FiveThirtyEight datasets and a good integration test for your pipeline.
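As a quick integration test, raw incident counts can be turned into an exposure-adjusted rate using the columns listed above. The sketch below is illustrative only: it assumes `avail_seat_km_per_week` can be extrapolated over the whole 1985–1999 window (52 weeks times 15 years), which is a rough simplification, and the function name and the demo figures are made up for the example.

```python
import pandas as pd

WEEKS_PER_YEAR = 52
YEARS_85_99 = 15  # 1985 through 1999 inclusive


def add_incident_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Add incidents per trillion available seat-km for 1985-1999.

    Assumes weekly seat-km was constant across the period.
    """
    out = df.copy()
    seat_km = out["avail_seat_km_per_week"] * WEEKS_PER_YEAR * YEARS_85_99
    out["incidents_per_trillion_seat_km_85_99"] = out["incidents_85_99"] / seat_km * 1e12
    return out


# Synthetic rows using the same column names as the real CSV.
demo_airlines = pd.DataFrame(
    {
        "airline": ["Demo Air", "Sample Jet"],
        "avail_seat_km_per_week": [1_000_000_000, 2_000_000_000],
        "incidents_85_99": [78, 0],
    }
)
rated = add_incident_rate(demo_airlines)
print(rated[["airline", "incidents_per_trillion_seat_km_85_99"]])
```

Applied to the real data, `add_incident_rate(airline_df)` gives a column that can be sorted in place of the raw incident count, which otherwise favors small carriers.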
Common Issues and Fixes
Most problems with this workflow fall into four categories. Here's how to diagnose and fix each one.
Error: HTTP 404 when fetching archived fivethirtyeight.com URLs
Cause: The Wayback Machine didn't capture every FiveThirtyEight URL, and some older article paths were restructured when ABC News took over the site.
Fix: Use the CDX API to find any available snapshot instead of relying on the closest match. When the Availability API finds nothing near the requested timestamp, it returns an empty archived_snapshots object, which get_archive_url from Step 4 reports as None. In that case, retry without a timestamp, or search the CDX index:
import requests
from typing import Optional  # keeps the annotation valid on Python 3.9


def find_any_snapshot(article_url: str) -> Optional[str]:
    cdx_url = "http://web.archive.org/cdx/search/cdx"
    params = {"url": article_url, "output": "json", "limit": 1, "fl": "timestamp,original", "filter": "statuscode:200"}
    resp = requests.get(cdx_url, params=params, timeout=15)
    resp.raise_for_status()
    data = resp.json()
    if len(data) > 1:  # first row is header
        ts, url = data[1]
        return f"https://web.archive.org/web/{ts}/{url}"
    return None
Error: UnicodeDecodeError when reading the CSV with pd.read_csv()
Cause: Some FiveThirtyEight headlines contain em dashes, curly quotes, or other non-ASCII characters. If the CSV was saved with a Windows BOM or Latin-1 encoding, pandas will raise a UnicodeDecodeError with the default UTF-8 setting.
Fix: Pass encoding="utf-8-sig" first; if that still fails, fall back to encoding="latin-1":
try:
    df = pd.read_csv("fte_index.csv", encoding="utf-8-sig")
except UnicodeDecodeError:
    df = pd.read_csv("fte_index.csv", encoding="latin-1")
Error: HTTP 429 Too Many Requests from archive.org
Cause: The Wayback Machine Availability API enforces a rate limit. Hitting it with tight loops — especially in multithreaded code — triggers 429 responses that can temporarily block your IP.
Fix: Use time.sleep(4) between requests (15 requests/minute). For bulk operations, implement exponential backoff:
import time
import requests


def get_with_backoff(url: str, params: dict, max_retries: int = 4) -> requests.Response:
    delay = 4
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=15)
        if resp.status_code == 429:
            print(f"Rate limited. Waiting {delay}s (attempt {attempt + 1})")
            time.sleep(delay)
            delay *= 2  # double the wait after each rate-limited attempt
        else:
            resp.raise_for_status()
            return resp
    raise RuntimeError("Max retries exceeded due to rate limiting")
Error: SvelteKit JSON endpoint returns 404 after site redeployment
Cause: SvelteKit's __data.json files are tied to the specific route and build hash. When fivethirtyeightindex.com is redeployed, the endpoint paths may shift.
Fix: Fall back to the CSV download (https://fivethirtyeightindex.com/index.csv), which is a stable, versioned file. Alternatively, parse the homepage HTML and extract the inline __sveltekit_* JSON blob:
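import json
import re


def extract_inline_data(homepage_html: str) -> dict:
    # Find the data passed to kit.start()
    match = re.search(r'data:\s*(\[null,\{.*?\}\])', homepage_html, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            # The inline blob may be a JavaScript literal rather than strict JSON
            return {}
    return {}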
| Issue | Cause | Fix |
|---|---|---|
| HTTP 404 on archived URL | Wayback Machine gap or URL change | Use CDX API with statuscode:200 filter |
| UnicodeDecodeError in CSV | BOM or Latin-1 encoding mismatch | Try utf-8-sig, then latin-1 |
| HTTP 429 from archive.org | Exceeding ~15 req/min rate limit | time.sleep(4) + exponential backoff |
| 404 on SvelteKit endpoint | Build hash changed after redeployment | Fall back to CSV download or parse inline JSON |
FAQ
Q: Can I use this index to train an LLM or build a RAG pipeline on FiveThirtyEight data?
Yes, and the index structure is well-suited for it. Use the CSV to build a metadata store: chunk each row as {headline} | {byline} | {date} | {type} and embed it into a vector store like Chroma or Pinecone. The headline field is dense enough to carry semantic signal, especially for political and sports content. For full-text RAG, you'll need to pair archive.org snapshot URLs (from Step 4) with an HTML-to-text extractor like trafilatura to get article bodies. The 166 datasets are immediately usable as structured context — load them as DataFrames and inject relevant rows as context at query time.
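The row-to-chunk step described above can be sketched without committing to any particular vector store. The helper names `row_to_chunk` and `build_chunks`, the `fte-<n>` id scheme, and the sample rows are all hypothetical; the only thing taken from this guide is the `{headline} | {byline} | {date} | {type}` layout and the normalized column names from Step 1.

```python
from typing import Dict, Iterable, List


def row_to_chunk(row: Dict[str, object]) -> str:
    """Format one index row as 'headline | byline | date | type'."""
    fields = [row.get("headline"), row.get("byline"), row.get("date"), row.get("type")]
    return " | ".join("" if f is None else str(f).strip() for f in fields)


def build_chunks(rows: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Turn index rows into records ready to be embedded and stored."""
    chunks = []
    for i, row in enumerate(rows):
        chunks.append(
            {
                "id": f"fte-{i}",  # hypothetical id scheme
                "text": row_to_chunk(row),
                "metadata": {"type": row.get("type"), "byline": row.get("byline")},
            }
        )
    return chunks


# Made-up rows for illustration, not real index entries.
sample_rows = [
    {"headline": "An Example Headline", "byline": "Jane Doe", "date": "2014-07-18", "type": "article"},
    {"headline": "Another Example", "byline": None, "date": "2016-01-02", "type": "graphic"},
]
for chunk in build_chunks(sample_rows):
    print(chunk["id"], "->", chunk["text"])
```

With the real DataFrame, `build_chunks(df.to_dict("records"))` produces the same structure; the `text` field is what gets embedded, and `metadata` supports filtering by type or byline at query time. Missing values loaded by pandas arrive as NaN rather than None, so it may be worth calling `df.fillna("")` first.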
Q: How often is the fivethirtyeightindex.com data updated?
The index is maintained by Ben Welsh as an open-source project on GitHub. FiveThirtyEight no longer publishes new content (ABC News shut down the operation in March 2025), so the dataset is effectively frozen at its final state of 38,593 items, with content dated through 2024. The GitHub repository (github.com/palewire/fivethirtyeight-index or similar) is the canonical source, so watch it for any corrections or schema changes. The Internet Archive preserves both the site itself and individual dataset files, so even if the domain goes offline, the data remains accessible via archive.org.
Q: Is it legal to scrape and use archived FiveThirtyEight articles and datasets?
The datasets in the GitHub repository (github.com/fivethirtyeight/data) are published under a Creative Commons Attribution 4.0 license, which explicitly permits reuse with attribution — including commercial use. The article text is a different matter: it's copyrighted by ABC News / ESPN. Using archived article headlines and metadata (date, byline, type) for research, indexing, or building search tools falls comfortably within fair use. Reproducing full article text for commercial products without a license does not. The fivethirtyeightindex.com index itself, created by Ben Welsh, is open source — check its repository license before redistributing derivative works.