How to Access and Query the FiveThirtyEight Internet Archive Index with Python (2025)

Prerequisites and Setup Checklist

Before you write a single line of code, make sure your environment matches what's listed below. The fivethirtyeightindex.com site is a SvelteKit application that exposes server-side JSON data files alongside a downloadable CSV. Most of the gotchas people hit come from environment mismatches or misunderstanding the data structure, not the code itself.

  • [ ] Python 3.10 or higher installed (python --version); the examples use the str | None annotation syntax, which fails on 3.9
  • [ ] requests 2.31+ (pip install requests)
  • [ ] pandas 2.0+ (pip install pandas)
  • [ ] lxml or html5lib for any HTML parsing (pip install lxml)
  • [ ] A stable internet connection (the Wayback Machine API is rate-limited)
  • [ ] Optional: a GitHub account if you want to clone the index source

| Component | Required Version | Install Command |
|---|---|---|
| Python | 3.10+ | n/a |
| requests | 2.31+ | pip install requests |
| pandas | 2.0+ | pip install pandas |
| lxml | 4.9+ | pip install lxml |

Understanding the data structure: The site aggregates 38,593 items across five content types: articles (20,780), datasets (166), podcasts (1,233), graphics (13,276), and illustrations (3,138). The SvelteKit app hydrates the page with a large inline JSON blob (look for __sveltekit_ in the page source), and the same data is available as a flat CSV download. The CSV columns you'll work with are: Date, Headline, Byline, and type (values: article, dataset, podcast, graphic, illustration).

Internet Archive basics: Archive.org preserves fivethirtyeight.com snapshots via their Wayback Machine. You can programmatically check whether a URL has been archived using the Availability API at https://archive.org/wayback/available?url=<url>. No API key required, but keep requests under ~15/minute to avoid HTTP 429 errors.

Estimated time: 30–45 minutes.


Step 1: Download the Full Index as a CSV File

The CSV download is the fastest path to working with the complete dataset. It's a single flat file containing every article, dataset, podcast, graphic, and illustration that FiveThirtyEight ever published — all with consistent columns. Starting here gives you a local copy you can filter and re-query without making repeated HTTP requests.

Locating the CSV download link

The CSV is linked directly from the fivethirtyeightindex.com homepage with the label "Download the full index as CSV". As of 2025, the download URL is https://fivethirtyeightindex.com/index.csv. The file is regenerated from the upstream GitHub repository maintained by Ben Welsh.

Fetching and loading the CSV with Python

import requests
import pandas as pd
from io import StringIO

CSV_URL = "https://fivethirtyeightindex.com/index.csv"

def download_index(url: str = CSV_URL, filepath: str = "fte_index.csv") -> pd.DataFrame:
    """Download the FiveThirtyEight index CSV and return a cleaned DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Write raw bytes to disk for caching
    with open(filepath, "wb") as f:
        f.write(response.content)

    print(f"Downloaded {len(response.content):,} bytes to {filepath}")

    # Read with explicit encoding to handle special characters in headlines
    df = pd.read_csv(
        filepath,
        encoding="utf-8-sig",  # handles BOM if present
        parse_dates=["Date"],
        dtype={"Headline": str, "Byline": str, "type": str},
    )

    # Normalize column names
    df.columns = [c.strip().lower() for c in df.columns]
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["year"] = df["date"].dt.year

    print(f"Loaded {len(df):,} rows | Columns: {list(df.columns)}")
    return df

df = download_index()
print(df.head())
print(df["type"].value_counts())

Note: If fivethirtyeightindex.com is unreachable, the raw CSV is also committed to the project's GitHub repository. Check https://github.com/palewire/fivethirtyeight-index for the latest release assets.

After running this, you'll have a DataFrame with about 38,593 rows (the total the site reports) and five columns: date, headline, byline, type, and the derived year. The type value_counts output should match the per-type breakdown listed in the prerequisites: articles, datasets, podcasts, graphics, and illustrations.
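
To make that check explicit, here is a minimal sketch that compares observed per-type counts against the totals quoted earlier in this guide. The names EXPECTED_COUNTS and compare_counts are illustrative, and the expected numbers are assumed to still match the live file.

```python
# Per-type totals quoted earlier in this guide (assumed still current).
EXPECTED_COUNTS = {
    "article": 20_780,
    "dataset": 166,
    "podcast": 1_233,
    "graphic": 13_276,
    "illustration": 3_138,
}


def compare_counts(observed: dict, expected: dict = EXPECTED_COUNTS) -> dict:
    """Return {type: (observed, expected)} for every type whose count differs."""
    mismatches = {}
    for key in sorted(set(observed) | set(expected)):
        got, want = observed.get(key, 0), expected.get(key, 0)
        if got != want:
            mismatches[key] = (got, want)
    return mismatches


# Stand-in for df["type"].value_counts().to_dict()
observed = {"article": 20_780, "dataset": 166, "podcast": 1_200,
            "graphic": 13_276, "illustration": 3_138}
print(sum(EXPECTED_COUNTS.values()))  # 38593
print(compare_counts(observed))       # {'podcast': (1200, 1233)}
```

With the real DataFrame, pass df["type"].value_counts().to_dict() as observed; an empty result means the download matches the published breakdown.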


Step 2: Query the JSON API Endpoints Directly

The SvelteKit data files that power fivethirtyeightindex.com can be requested directly as JSON. This is useful when you want summary statistics or structured metadata (like podcast series names or graphic categories) without parsing the full CSV.

Discovering the endpoints

From the SvelteKit page source, the app fetches five distinct data endpoints. Based on the observed network responses, the patterns follow SvelteKit's __data.json convention. The inline hydration data confirms the totals: 20,780 articles, 166 datasets, 1,233 podcasts, 13,276 graphics, and 3,138 illustrations.

import requests
import json

BASE_URL = "https://fivethirtyeightindex.com"

# These endpoint paths correspond to SvelteKit route data files
ENDPOINTS = {
    "articles": f"{BASE_URL}/articles/__data.json",
    "datasets": f"{BASE_URL}/datasets/__data.json",
    "podcasts": f"{BASE_URL}/podcasts/__data.json",
    "graphics": f"{BASE_URL}/graphics/__data.json",
    "illustrations": f"{BASE_URL}/illustrations/__data.json",
}

def fetch_summary_stats(endpoints: dict = ENDPOINTS) -> dict:
    """Fetch and print summary statistics from the FiveThirtyEight index API."""
    stats = {}
    headers = {"Accept": "application/json", "User-Agent": "fte-index-client/1.0"}

    for name, url in endpoints.items():
        try:
            resp = requests.get(url, headers=headers, timeout=15)
            resp.raise_for_status()
            data = resp.json()
            stats[name] = data
            print(f"{name}: total={data.get('total', 'N/A')}")
            if "series" in data:
                print(f"  Series: {data['series']}")
            if "categories" in data:
                print(f"  Categories: {data['categories']}")
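        except requests.RequestException as e:
            # Covers HTTP errors, timeouts, connection failures, and (in requests 2.27+) invalid JSON bodies
            print(f"Failed to fetch {name}: {e}")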
            # Fallback: use known values from source
            fallback = {
                "articles": {"total": 20780},
                "datasets": {"total": 166},
                "podcasts": {"total": 1233, "series": ["gerrymandering", "hot-takedown", "podcast-19", "politics", "the-lab", "whats-the-point"]},
                "graphics": {"total": 13276, "categories": ["chart", "chart-screenshot", "infographic", "map", "table"]},
                "illustrations": {"total": 3138},
            }
            stats[name] = fallback.get(name, {})

    return stats

result = fetch_summary_stats()
print(json.dumps(result, indent=2))

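Note: SvelteKit's __data.json endpoint paths are tied to the deployed route structure, so they can change when the site is redeployed. The flat response shape assumed above (top-level total, series, and categories keys) is also unverified: SvelteKit normally wraps route data in a serialized {"type": "data", "nodes": [...]} envelope, so data.get('total') may print N/A until you adapt the parsing. If the endpoints stop working, fall back to scraping the inline __sveltekit_* variable from the homepage HTML, or use the CSV download as a stable alternative.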

The podcast series list (gerrymandering, hot-takedown, podcast-19, politics, the-lab, whats-the-point) and the graphics category list (chart, chart-screenshot, infographic, map, table) are useful for building filtered views.
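
If the endpoints return SvelteKit's serialized envelope rather than a flat object, the payload has to be unflattened before keys like total are readable. The sketch below rests on an assumption about that layout (a "nodes" list whose "data" arrays store values by index, with values[0] as the root); it has not been checked against this site, and the field names in the sample are hypothetical. It handles only plain objects, lists, and primitives.

```python
import json


def unflatten(values: list, index: int = 0):
    """Rebuild a nested value from an index-flattened array.

    Assumed layout: values[0] is the root; dict values and list items are
    integer indexes into `values`; primitives are stored literally.
    Tagged types (dates, sets, ...) are not handled.
    """
    if index < 0:
        return None  # negative sentinel such as "undefined"; treated as None here
    value = values[index]
    if isinstance(value, dict):
        return {key: unflatten(values, ref) for key, ref in value.items()}
    if isinstance(value, list):
        if value and isinstance(value[0], str):
            raise ValueError(f"tagged type {value[0]!r} is not supported in this sketch")
        return [unflatten(values, ref) for ref in value]
    return value


def decode_route_data(payload: dict) -> list:
    """Decode every data node in a {"type": "data", "nodes": [...]} envelope."""
    decoded = []
    for node in payload.get("nodes", []):
        if isinstance(node, dict) and node.get("type") == "data":
            decoded.append(unflatten(node["data"]))
    return decoded


# Hypothetical payload; the real field names may differ.
sample = json.loads(
    '{"type":"data","nodes":[null,{"type":"data",'
    '"data":[{"total":1,"series":2},1233,[3,4],"politics","the-lab"],'
    '"uses":{}}]}'
)
print(decode_route_data(sample))
# [{'total': 1233, 'series': ['politics', 'the-lab']}]
```

In fetch_summary_stats, you could call decode_route_data(resp.json()) and inspect the result before relying on any particular key.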


Step 3: Filter and Explore Articles by Year and Byline

Once you have the DataFrame loaded from Step 1, the real analysis begins. FiveThirtyEight spanned 2008–2024 and had 554 distinct bylines. Knowing the distribution by author and year is essential for any downstream use — whether you're building a search tool, training a model, or just doing journalism research.

Filtering by year range and ranking bylines

import pandas as pd

# Assumes df is already loaded from Step 1
# Filter to articles only (exclude datasets, podcasts, graphics, illustrations)
articles = df[df["type"] == "article"].copy()

print(f"Total articles: {len(articles):,}")

# --- Filter by year range (2016–2020) ---
mask = articles["year"].between(2016, 2020)
articles_2016_2020 = articles[mask]
print(f"Articles published 2016–2020: {len(articles_2016_2020):,}")

# --- Rank top 10 bylines by total article count ---
top_bylines = (
    articles["byline"]
    .value_counts()
    .head(10)
    .reset_index()
)
top_bylines.columns = ["byline", "article_count"]
print("\nTop 10 bylines (all years):")
print(top_bylines.to_string(index=False))

# --- Group articles by author and year ---
byline_year = (
    articles.groupby(["byline", "year"])
    .size()
    .reset_index(name="count")
    .sort_values(["byline", "year"])
)

# Show Nate Silver's annual output
nate = byline_year[byline_year["byline"] == "Nate Silver"]
print("\nNate Silver articles by year:")
print(nate.to_string(index=False))

# --- Cross-tab: top 5 authors vs. year for 2016–2020 ---
top5_names = top_bylines["byline"].head(5).tolist()
crosstab = pd.crosstab(
    articles_2016_2020["byline"],
    articles_2016_2020["year"]
).loc[lambda x: x.index.isin(top5_names)]
print("\nTop 5 authors — articles per year (2016–2020):")
print(crosstab)

The confirmed top bylines from the source data are: Nate Silver (4,533), Neil Paine (1,442), Walt Hickey (1,210), Aaron Bycoffe (1,184), Oliver Roeder (712), Nathaniel Rakich (680), Harry Enten (673), Galen Druke (569), Dhrumil Mehta (552), and Perry Bacon Jr (479). These numbers include all content types attributed to that byline, so filter by type == 'article' if you only want editorial pieces.
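
One caveat when ranking raw byline strings: a co-authored piece is counted under the combined string, not under each author. The following sketch credits every co-author once per piece. It assumes co-authors are joined with commas and the word "and"; the index's actual separator format is not confirmed here, so inspect a few multi-author rows first.

```python
import re
from collections import Counter

# Assumed separators: commas and the word "and" (with an optional Oxford comma).
_SEPARATOR = re.compile(r"\s*,\s*and\s+|\s*,\s*|\s+and\s+")


def split_byline(byline: str) -> list[str]:
    """Split a combined byline into individual author names."""
    return [name.strip() for name in _SEPARATOR.split(byline) if name.strip()]


def count_authors(bylines) -> Counter:
    """Count one credit per author per piece, skipping missing bylines."""
    counts = Counter()
    for byline in bylines:
        if isinstance(byline, str):  # NaN and None are ignored
            counts.update(split_byline(byline))
    return counts


sample = [
    "Nate Silver",
    "Nate Silver and Harry Enten",
    "Neil Paine, Walt Hickey and Oliver Roeder",
    None,
]
print(count_authors(sample).most_common(1))  # [('Nate Silver', 2)]
```

With the Step 3 DataFrame, count_authors(articles["byline"]) gives per-author totals that can differ from the raw value_counts ranking.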


Step 4: Retrieve Archived Article URLs from the Internet Archive

FiveThirtyEight.com's original URLs still work for many articles, but ABC News (which took over the site from ESPN in 2018) has deprecated or redirected some older paths. The Internet Archive's Wayback Machine is your safety net. This step shows you how to programmatically resolve a stable web.archive.org snapshot URL for any article in the index.

Using the Wayback Machine Availability API

The Availability API is free, requires no authentication, and returns the closest archived snapshot for a given URL. The endpoint is: https://archive.org/wayback/available?url=<url>&timestamp=<YYYYMMDD>.

import requests
import time
import pandas as pd

WAYBACK_API = "https://archive.org/wayback/available"

def get_archive_url(article_url: str, timestamp: str = "20230101") -> str | None:
    """Return the closest Wayback Machine snapshot URL for a given article URL."""
    params = {"url": article_url, "timestamp": timestamp}
    try:
        resp = requests.get(WAYBACK_API, params=params, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        snapshots = data.get("archived_snapshots", {})
        closest = snapshots.get("closest", {})
        if closest.get("available"):
            return closest["url"]
        return None
    except requests.RequestException as e:
        print(f"Error fetching archive URL for {article_url}: {e}")
        return None


def bulk_fetch_archive_urls(
    df: pd.DataFrame,
    url_column: str = "url",
    max_articles: int = 20,
    sleep_seconds: float = 4.0,
) -> pd.DataFrame:
    """
    Resolve Wayback Machine snapshot URLs for a list of FiveThirtyEight articles.
    Rate-limited to ~15 requests/minute to avoid HTTP 429.
    """
    sample = df.head(max_articles).copy()
    archive_urls = []

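    for i, row in enumerate(sample.itertuples(), start=1):
        article_url = getattr(row, url_column, None)
        # Skip missing values; NaN is truthy, so a plain "not" check is not enough
        if not isinstance(article_url, str) or not article_url:
            archive_urls.append(None)
            continue

        print(f"[{i}/{len(sample)}] Checking: {article_url}")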
        archived = get_archive_url(article_url)
        archive_urls.append(archived)

        # Stay well under the rate limit
        time.sleep(sleep_seconds)

    sample["archive_url"] = archive_urls
    return sample


# Example: resolve archive URLs for the first 10 articles with a known URL
# If your DataFrame has a 'url' column:
# result = bulk_fetch_archive_urls(articles_with_urls, url_column="url", max_articles=10)

# Quick single-URL test:
test_url = "https://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/"
archived = get_archive_url(test_url)
print(f"Archived URL: {archived}")

Note: Set sleep_seconds=4.0 for a safe rate of 15 requests/minute. If you get HTTP 429 responses, increase it to 6.0. The Wayback Machine CDX API (at https://web.archive.org/cdx/search/cdx) lists every snapshot of a URL, which helps when you want to pick a specific capture date rather than accept the one closest to the timestamp.
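
As a sketch of picking a capture by date, the helper below takes CDX rows that have already been fetched and returns the snapshot nearest to an article's publication date. It assumes the rows were requested with output=json and fl=timestamp,original (a header row followed by data rows); the function name and sample rows are made up for illustration.

```python
from datetime import datetime
from typing import Optional

CDX_TS_FORMAT = "%Y%m%d%H%M%S"  # CDX timestamps are 14-digit strings


def pick_closest_snapshot(rows: list, published: datetime) -> Optional[str]:
    """Return the snapshot URL whose capture time is nearest to `published`.

    `rows` is CDX JSON output requested with fl=timestamp,original:
    a header row followed by [timestamp, original_url] rows.
    """
    best = None
    for ts, original in rows[1:]:  # skip the header row
        captured = datetime.strptime(ts, CDX_TS_FORMAT)
        gap = abs((captured - published).total_seconds())
        if best is None or gap < best[0]:
            best = (gap, ts, original)
    if best is None:
        return None
    return f"https://web.archive.org/web/{best[1]}/{best[2]}"


# Made-up rows in the shape described above
rows = [
    ["timestamp", "original"],
    ["20140801120000", "https://fivethirtyeight.com/features/example/"],
    ["20190315080000", "https://fivethirtyeight.com/features/example/"],
]
print(pick_closest_snapshot(rows, datetime(2014, 7, 18)))
# https://web.archive.org/web/20140801120000/https://fivethirtyeight.com/features/example/
```

Passing the index's date value for each article favors captures made soon after publication, which are less likely to be redirects.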


Step 5: Access Dataset Metadata and GitHub Links

The 166 FiveThirtyEight datasets are a goldmine — cleaned, documented data used in published journalism. Each dataset entry in the index includes a dataset_url pointing to the GitHub repository, an archive_url on archive.org, and the date it was first associated with an article. Fetching them directly from GitHub means you get the exact data the reporters used.

Parsing dataset entries and fetching raw CSVs

import time  # used for the polite delay in fetch_all_datasets

import requests
import pandas as pd
from io import StringIO

# Known dataset slugs from the index source
DATASET_SLUGS = [
    "ahca-polls",
    "airline-safety",
    "alcohol-consumption",
    "bad-drivers",
    "bechdel",
]

GITHUB_RAW_BASE = "https://raw.githubusercontent.com/fivethirtyeight/data/master"

# Map dataset slug to the specific CSV filename within the repo directory
DATASET_FILES = {
    "airline-safety": "airline-safety/airline-safety.csv",
    "alcohol-consumption": "alcohol-consumption/drinks.csv",
    "bad-drivers": "bad-drivers/bad-drivers.csv",
    "bechdel": "bechdel/movies.csv",
    "ahca-polls": "ahca-polls/ahca_polls.csv",
}


def fetch_dataset_csv(slug: str, files_map: dict = DATASET_FILES) -> pd.DataFrame | None:
    """Fetch a raw CSV from the fivethirtyeight GitHub data repository."""
    relative_path = files_map.get(slug)
    if not relative_path:
        print(f"No file mapping found for slug: {slug}")
        return None

    url = f"{GITHUB_RAW_BASE}/{relative_path}"
    print(f"Fetching: {url}")

    resp = requests.get(url, timeout=15)
    resp.raise_for_status()

    df = pd.read_csv(StringIO(resp.text))
    print(f"  → {len(df)} rows, {len(df.columns)} columns: {list(df.columns)}")
    return df


# Example: load the Airline Safety dataset
airline_df = fetch_dataset_csv("airline-safety")
if airline_df is not None:
    print("\nAirline Safety Dataset preview:")
    print(airline_df.head())
    print("\nIncidents per airline (1985–1999):")
    print(
        airline_df[["airline", "incidents_85_99", "fatal_accidents_85_99"]]
        .sort_values("incidents_85_99", ascending=False)
        .head(10)
        .to_string(index=False)
    )


# Iterate over all known datasets
def fetch_all_datasets(slugs: list, files_map: dict = DATASET_FILES) -> dict:
    """Fetch all datasets and return a dict of slug -> DataFrame."""
    results = {}
    for slug in slugs:
        try:
            df = fetch_dataset_csv(slug, files_map)
            if df is not None:
                results[slug] = df
        except requests.HTTPError as e:
            print(f"Skipping {slug}: {e}")
        time.sleep(1)  # polite delay between GitHub requests
    return results

all_datasets = fetch_all_datasets(DATASET_SLUGS)
print(f"\nSuccessfully fetched {len(all_datasets)} datasets.")

The airline-safety CSV has columns like airline, avail_seat_km_per_week, incidents_85_99, fatal_accidents_85_99, fatalities_85_99, incidents_00_14, fatal_accidents_00_14, and fatalities_00_14. It's one of the most-cited FiveThirtyEight datasets and a good integration test for your pipeline.
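
To use the dataset as an integration test, one option is to compare the columns you received against the list above. This is a small sketch with illustrative names (EXPECTED_AIRLINE_COLUMNS, check_columns); the expected list is copied from this section and assumed to be complete.

```python
EXPECTED_AIRLINE_COLUMNS = [
    "airline", "avail_seat_km_per_week",
    "incidents_85_99", "fatal_accidents_85_99", "fatalities_85_99",
    "incidents_00_14", "fatal_accidents_00_14", "fatalities_00_14",
]


def check_columns(actual, expected=EXPECTED_AIRLINE_COLUMNS) -> dict:
    """Report columns that are missing from, or unexpected in, a dataset."""
    actual_set, expected_set = set(actual), set(expected)
    return {
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }


# Stand-in for list(airline_df.columns)
columns = ["airline", "avail_seat_km_per_week", "incidents_85_99", "notes"]
report = check_columns(columns)
print(report["unexpected"])    # ['notes']
print(len(report["missing"]))  # 5
```

Calling check_columns(airline_df.columns) after fetch_dataset_csv("airline-safety") should return two empty lists if the upstream file is unchanged.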


Common Issues and Fixes

Most problems with this workflow fall into four categories. Here's how to diagnose and fix each one.

Error: HTTP 404 when fetching archived fivethirtyeight.com URLs

Cause: The Wayback Machine didn't capture every FiveThirtyEight URL, and some older article paths were restructured when ABC News took over the site.

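Fix: Use the CDX API to find any available snapshot instead of relying on the closest match. Note that the Availability API does not return a 404 when nothing is archived; it returns HTTP 200 with an empty archived_snapshots object, which get_archive_url() reports as None. When that happens, try again without a timestamp, or search the CDX index: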

def find_any_snapshot(article_url: str) -> str | None:
    """Return the first HTTP 200 capture listed by the CDX API, or None."""
    cdx_url = "https://web.archive.org/cdx/search/cdx"
    params = {"url": article_url, "output": "json", "limit": 1, "fl": "timestamp,original", "filter": "statuscode:200"}
    resp = requests.get(cdx_url, params=params, timeout=15)
    resp.raise_for_status()  # surface 429/5xx instead of failing later in resp.json()
    data = resp.json()
    if len(data) > 1:  # first row is the header
        ts, url = data[1]
        return f"https://web.archive.org/web/{ts}/{url}"
    return None

Error: UnicodeDecodeError when reading the CSV with pd.read_csv()

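Cause: Some FiveThirtyEight headlines contain em dashes, curly quotes, or other non-ASCII characters. If the CSV was saved as Latin-1 or Windows-1252, pandas raises a UnicodeDecodeError under the default UTF-8 setting. A UTF-8 byte-order mark (BOM) by itself does not trigger this error; encoding="utf-8-sig" simply strips it if present.

Fix: Pass encoding="utf-8-sig" first; if that still fails, fall back to encoding="latin-1". Latin-1 accepts any byte sequence, so the fallback never raises, but curly quotes saved as Windows-1252 may come out garbled, so check a few headlines after loading: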

try:
    df = pd.read_csv("fte_index.csv", encoding="utf-8-sig")
except UnicodeDecodeError:
    df = pd.read_csv("fte_index.csv", encoding="latin-1")
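
A variation on the same idea is to decode the raw bytes yourself and record which encoding worked. The sketch below tries a list of candidates in order; the function name and the choice to try cp1252 before latin-1 are assumptions made for illustration, not something the index documents.

```python
import io


def decode_with_fallback(raw: bytes, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Decode bytes with the first encoding that succeeds; return (text, encoding)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# A headline with a curly apostrophe, saved as Windows-1252
raw = "Date,Headline\n2016-11-08,It\u2019s Election Day\n".encode("cp1252")
text, used = decode_with_fallback(raw)
print(used)                                  # cp1252
print(io.StringIO(text).readline().strip())  # Date,Headline
```

The decoded text can then be handed to pandas with pd.read_csv(io.StringIO(text)), and the returned encoding name tells you whether the file was clean UTF-8.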

Error: HTTP 429 Too Many Requests from archive.org

Cause: The Wayback Machine Availability API enforces a rate limit. Hitting it with tight loops — especially in multithreaded code — triggers 429 responses that can temporarily block your IP.

Fix: Use time.sleep(4) between requests (15 requests/minute). For bulk operations, implement exponential backoff:

import time

import requests

def get_with_backoff(url: str, params: dict, max_retries: int = 4) -> requests.Response:
    delay = 4
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=15)
        if resp.status_code == 429:
            print(f"Rate limited. Waiting {delay}s (attempt {attempt + 1})")
            time.sleep(delay)
            delay *= 2
        else:
            resp.raise_for_status()
            return resp
    raise RuntimeError("Max retries exceeded due to rate limiting")
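
The wait time itself can be factored into a small pure function. This sketch prefers a Retry-After header when one is present and otherwise uses capped exponential backoff. Whether archive.org actually sends Retry-After is not confirmed here, and retry_delay, base, and cap are illustrative names and values.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(headers: dict, attempt: int, base: float = 4.0, cap: float = 64.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Prefers a Retry-After header when the server sends one (it may not);
    otherwise uses capped exponential backoff: base * 2**attempt.
    """
    retry_after = headers.get("Retry-After")
    if retry_after:
        if retry_after.strip().isdigit():
            return min(float(retry_after), cap)  # delay given in seconds
        try:
            when = parsedate_to_datetime(retry_after)  # delay given as an HTTP date
            wait = (when - datetime.now(timezone.utc)).total_seconds()
            return min(max(wait, 0.0), cap)
        except (TypeError, ValueError):
            pass  # unparseable header: fall through to backoff
    return min(base * (2 ** attempt), cap)


print(retry_delay({}, attempt=0))                     # 4.0
print(retry_delay({}, attempt=3))                     # 32.0
print(retry_delay({"Retry-After": "10"}, attempt=0))  # 10.0
print(retry_delay({}, attempt=10))                    # 64.0
```

Inside get_with_backoff, time.sleep(retry_delay(resp.headers, attempt)) could replace the fixed doubling; resp.headers from requests is case-insensitive, so the lookup works regardless of header casing.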

Error: SvelteKit JSON endpoint returns 404 after site redeployment

Cause: SvelteKit's __data.json files are tied to the specific route and build hash. When fivethirtyeightindex.com is redeployed, the endpoint paths may shift.

Fix: Fall back to the CSV download (https://fivethirtyeightindex.com/index.csv), which is a stable, versioned file. Alternatively, parse the homepage HTML and extract the inline __sveltekit_* JSON blob:

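import json
import re

def extract_inline_data(homepage_html: str) -> dict:
    # Find the data passed to kit.start(); the pattern is fragile and may need adjusting
    match = re.search(r'data:\s*(\[null,\{.*?\}\])', homepage_html, re.DOTALL)
    if not match:
        return {}
    try:
        # json.loads only works if the blob is strict JSON (quoted keys, no JS literals)
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return {}
    # The match has the form [null, {...}]; return the object so the result is a dict
    return payload[1] if len(payload) > 1 and isinstance(payload[1], dict) else {}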

| Issue | Cause | Fix |
|---|---|---|
| HTTP 404 on archived URL | Wayback Machine gap or URL change | Use CDX API with statuscode:200 filter |
| UnicodeDecodeError in CSV | BOM or Latin-1 encoding mismatch | Try utf-8-sig, then latin-1 |
| HTTP 429 from archive.org | Exceeding ~15 req/min rate limit | time.sleep(4) + exponential backoff |
| 404 on SvelteKit endpoint | Build hash changed after redeployment | Fall back to CSV download or parse inline JSON |


FAQ

Q: Can I use this index to train an LLM or build a RAG pipeline on FiveThirtyEight data?

Yes, and the index structure is well-suited for it. Use the CSV to build a metadata store: chunk each row as {headline} | {byline} | {date} | {type} and embed it into a vector store like Chroma or Pinecone. The headline field is dense enough to carry semantic signal, especially for political and sports content. For full-text RAG, you'll need to pair archive.org snapshot URLs (from Step 4) with an HTML-to-text extractor like trafilatura to get article bodies. The 166 datasets are immediately usable as structured context — load them as DataFrames and inject relevant rows as context at query time.
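
As a rough sketch of the chunking step described above, the helper below turns index rows into the {headline} | {byline} | {date} | {type} strings and keeps the original fields as metadata. It stops short of embedding, since that depends on the vector store you choose; the function names and sample rows are invented for illustration.

```python
def row_to_chunk(row: dict) -> str:
    """Format one index row as '{headline} | {byline} | {date} | {type}'."""
    fields = (row.get("headline"), row.get("byline"), row.get("date"), row.get("type"))
    # Missing values become empty strings so every chunk keeps four slots.
    return " | ".join("" if value is None else str(value) for value in fields)


def build_chunks(rows) -> list[dict]:
    """Pair each chunk's text with its source metadata for later filtering."""
    return [{"text": row_to_chunk(row), "metadata": dict(row)} for row in rows]


# Invented example rows using the normalized column names from Step 1
rows = [
    {"headline": "Example Headline", "byline": "Nate Silver", "date": "2016-11-08", "type": "article"},
    {"headline": "Another Example", "byline": None, "date": "2019-03-01", "type": "graphic"},
]
for chunk in build_chunks(rows):
    print(chunk["text"])
# Example Headline | Nate Silver | 2016-11-08 | article
# Another Example |  | 2019-03-01 | graphic
```

Rows from the Step 1 DataFrame could be supplied with df.to_dict("records") once missing values have been replaced with None, since NaN would otherwise be rendered as the string "nan".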

Q: How often is the fivethirtyeightindex.com data updated?

The index is maintained by Ben Welsh as an open-source project on GitHub. FiveThirtyEight no longer publishes new content (ABC News, its final owner, shut the operation down in March 2025), so the dataset is effectively frozen at its final state: 38,593 items spanning 2008 through 2024. The GitHub repository (github.com/palewire/fivethirtyeight-index or similar) is the canonical source, so watch it for any corrections or schema changes. The Internet Archive preserves both the site itself and individual dataset files, so even if the domain goes offline, the data remains accessible via archive.org.

Q: Is it legal to scrape and use archived FiveThirtyEight articles and datasets?

The datasets in the GitHub repository (github.com/fivethirtyeight/data) are published under a Creative Commons Attribution 4.0 license, which explicitly permits reuse with attribution — including commercial use. The article text is a different matter: it's copyrighted by ABC News / ESPN. Using archived article headlines and metadata (date, byline, type) for research, indexing, or building search tools falls comfortably within fair use. Reproducing full article text for commercial products without a license does not. The fivethirtyeightindex.com index itself, created by Ben Welsh, is open source — check its repository license before redistributing derivative works.