Social Media Link Rot with Wayback CDX Data

Wayback Machine CDX data is useful for social media link rot analysis because it gives you an archive index: captured URLs, timestamps, status codes, MIME types, digests, and response lengths. It does not prove what the archived page contained. Use CDX records to find candidate captures, then clean, dedupe, visualize, and manually validate the actual snapshots before making claims.
| Step | What to do | Why it matters |
|---|---|---|
| 1. Collect seed URLs | Store original URL, source, date collected, and target type | Keeps provenance visible |
| 2. Normalize variants | Remove tracking noise without hiding meaningful platform differences | Prevents false duplicates |
| 3. Query CDX | Pull timestamp, original URL, MIME type, status code, digest, and length | Creates the raw archive dataset |
| 4. Enrich rows | Parse timestamps, infer platform, classify profile/post/thread targets | Makes the data analyzable |
| 5. Dedupe carefully | Compare digest, URL, and target groups | Avoids overstating preservation |
| 6. Validate snapshots | Open https://web.archive.org/web/{timestamp}/{original_url} | Separates indexed rows from evidence |
| 7. Visualize coverage | Chart captures by platform, status, date, and target | Reveals gaps and review priorities |
CDX is the index layer for archived captures. Think of it as a catalog entry, not the archived page itself.
For adjacent workflows, see Pandas Data Cleaning, Pandas to_datetime, Pandas Drop Duplicates, and Web Scraping with Python.
Why Social Media Archive Data Is Messy
Social URLs look stable, but platform behavior makes archive analysis noisy. A Twitter/X post may appear under twitter.com, x.com, or mobile.twitter.com. A Reddit thread may include old paths, comment anchors, query parameters, and shortlinks. Instagram URLs vary by profile, post, reel, and embedded media paths.
These variants matter because two URLs that look equivalent to a human may appear as separate CDX keys. A broad query may collect unrelated assets. A narrow query may miss captures under another hostname.
Common noise sources:
- URL variants: hostnames, trailing slashes, fragments, and tracking parameters.
- Timestamp density: many captures may represent repeated identical responses.
- Platform behavior: redirects, login walls, interstitials, app shells, or JavaScript-heavy pages.
- Evidence confusion: a CDX row is often mistaken for content proof.
The goal is not to make social archive data perfect. The goal is to make the workflow auditable.
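One way to keep hostname variants auditable is to generate them explicitly before querying, rather than relying on a broad pattern. The sketch below is illustrative only: HOST_ALIASES and hostname_variants are hypothetical names, and the alias list simply mirrors the Twitter/X hostnames mentioned above.

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical alias table: hostnames under which the same post may have been captured.
HOST_ALIASES = {
    "x": ["twitter.com", "x.com", "mobile.twitter.com"],
}


def hostname_variants(url: str) -> list[str]:
    """Return the URL rewritten under each known alias host, input host first."""
    parsed = urlparse(url.strip())
    # A leading "www." is dropped so www.twitter.com is treated like twitter.com.
    host = (parsed.hostname or "").removeprefix("www.")
    for aliases in HOST_ALIASES.values():
        if host in aliases:
            ordered = [host] + [h for h in aliases if h != host]
            return [urlunparse(parsed._replace(netloc=h)) for h in ordered]
    return [url.strip()]  # unknown host: query it as-is


variants = hostname_variants("https://twitter.com/example/status/1234567890")
for variant in variants:
    print(variant)
```

Each variant can then be queried separately and stored with its own provenance, so a later reviewer can see which hostname produced which rows.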
Public Data Availability Matrix
Use this matrix to avoid overclaiming before the analysis starts.
| Data type | Twitter/X | Reddit | Instagram | TikTok |
|---|---|---|---|---|
| Public profile metadata | CDX and partial snapshots | Public JSON and CDX | CDX and partial snapshots | CDX and partial snapshots |
| Public post text | Snapshot dependent | Public JSON and snapshots | Snapshot dependent | Snapshot dependent |
| Public post media | Best effort | Often easier than Instagram | Difficult and partial | Difficult and partial |
| Deleted public posts | Only if captured before deletion | Only if captured before deletion | Rare | Rare |
| Historical engagement counts | Snapshot dependent | Recent JSON or snapshot dependent | Usually unavailable | Usually unavailable |
Archive data is strongest for public page history. It is not a private export, a complete account history, or proof that missing records never existed.
What the CDX API Returns
The Wayback CDX API returns archive index rows for a URL pattern. A compact query looks like this:
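```bash
curl "https://web.archive.org/cdx/search/cdx?url=twitter.com/example/status/*&output=json&fl=timestamp,original,mimetype,statuscode,digest,length&filter=statuscode:200&limit=100"
```

The documented wildcard forms for the url parameter are a trailing * (prefix match) and a leading *. (domain match), so this query is scoped to a single account path. The core fields are: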
| Field | Meaning | Use |
|---|---|---|
| timestamp | Capture time in YYYYMMDDhhmmss | Build timelines |
| original | Original captured URL | Recover and normalize targets |
| mimetype | Archived response MIME type | Separate HTML, images, JSON, and assets |
| statuscode | HTTP status recorded by the archive | Filter redirects, errors, and blocked responses |
| digest | Content fingerprint | Find repeated identical captures |
| length | Response size | Flag tiny error pages or unusual responses |
collapse=digest can reduce duplicates, but for research workflows it is often better to keep raw rows first and dedupe later with explicit rules.
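As a rough sketch of the "keep raw rows first" approach, the Python below builds an uncollapsed, unfiltered CDX query and turns the JSON output (a header row followed by data rows) into dictionaries. The helper names build_cdx_url, rows_to_records, and fetch_cdx are assumptions for illustration, and the sample payload is made up so the example runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
FIELDS = ["timestamp", "original", "mimetype", "statuscode", "digest", "length"]


def build_cdx_url(url_pattern: str, limit: int = 100) -> str:
    """Build a CDX query URL requesting JSON output, with no collapse or filter."""
    params = {"url": url_pattern, "output": "json", "fl": ",".join(FIELDS), "limit": limit}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def rows_to_records(payload: list[list[str]]) -> list[dict[str, str]]:
    """Convert CDX JSON output (header row, then data rows) into dicts."""
    if not payload:
        return []
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


def fetch_cdx(url_pattern: str, limit: int = 100) -> list[dict[str, str]]:
    """Fetch raw rows from the live API. Needs network access; not called here."""
    with urlopen(build_cdx_url(url_pattern, limit), timeout=30) as response:
        return rows_to_records(json.load(response))


# Offline demonstration with a made-up payload in the same shape.
sample = [
    FIELDS,
    ["20200101000000", "https://twitter.com/example/status/1234567890",
     "text/html", "200", "EXAMPLEDIGEST", "1234"],
]
records = rows_to_records(sample)
print(records[0]["timestamp"], records[0]["statuscode"])
```

Saving the raw records before any deduplication keeps the later dedupe rules explicit and reversible.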
Build the Dataset
Start with explicit seed URLs:
```csv
label,platform,url
example_x_post,x,https://twitter.com/example/status/1234567890
example_reddit_thread,reddit,https://www.reddit.com/r/example/comments/abc123/example_thread/
example_instagram_profile,instagram,https://www.instagram.com/example/
```

Add fields such as source, collection_date, expected_platform, target_type, and notes. Provenance is part of the dataset, not a separate memory task.
Normalize conservatively. Strip obvious tracking noise, but preserve paths that may identify a specific post, comment, or media item:
```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid", "igshid"}


def normalize_url(url: str) -> str:
    parsed = urlparse(url.strip())
    # Drop known tracking parameters; keep every other query parameter.
    query = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if key not in TRACKING_KEYS
        and not any(key.startswith(prefix) for prefix in TRACKING_PREFIXES)
    ]
    # Remove one trailing slash, except on the root path.
    path = parsed.path[:-1] if parsed.path != "/" and parsed.path.endswith("/") else parsed.path
    # Force https, lowercase the host, and discard the fragment.
    return urlunparse(("https", parsed.netloc.lower(), path, "", urlencode(query), ""))
```

If two URLs might represent different objects, keep both and add a derived normalized_group field instead of forcing them into one value.
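A derived grouping key might look like the following sketch. The GROUP_HOSTS mapping and the normalized_group function are hypothetical; which hostnames belong together is a research decision, not something the archive defines. The path is left case-sensitive because some platforms use case-sensitive identifiers.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical mapping from host variants to one grouping host.
GROUP_HOSTS = {
    "x.com": "twitter.com",
    "mobile.twitter.com": "twitter.com",
    "www.twitter.com": "twitter.com",
}


def normalized_group(url: str) -> str:
    """Derive a grouping key (host + path) without altering the stored URL."""
    parsed = urlparse(url.strip())
    host = parsed.hostname or ""  # hostname is already lowercased and has no port
    host = GROUP_HOSTS.get(host, host)
    return f"{host}{parsed.path.rstrip('/')}"


seed_urls = [
    "https://twitter.com/example/status/1234567890",
    "https://x.com/example/status/1234567890/",
    "https://www.instagram.com/example/",
]

# Original URLs stay intact; the group key only links rows that may be related.
groups = defaultdict(list)
for url in seed_urls:
    groups[normalized_group(url)].append(url)

for key, members in groups.items():
    print(key, len(members))
```

Because the original URLs are preserved alongside the key, a grouping that turns out to be wrong can be undone without re-collecting data.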
Clean and Enrich CDX Rows
After exporting CDX JSON or CSV, parse the timestamp, coerce numeric fields, infer the platform, and create Wayback snapshot URLs:
```python
from urllib.parse import urlparse

import pandas as pd

df = pd.read_csv("social_cdx_export.csv")

# Unparseable timestamps become NaT instead of raising.
df["captured_at"] = pd.to_datetime(
    df["timestamp"].astype(str),
    format="%Y%m%d%H%M%S",
    utc=True,
    errors="coerce",
)
df["capture_month"] = df["captured_at"].dt.strftime("%Y-%m")
df["statuscode"] = pd.to_numeric(df["statuscode"], errors="coerce")
df["length"] = pd.to_numeric(df["length"], errors="coerce")


def platform_from_url(url: str) -> str:
    # hostname is lowercased and excludes any port that may appear in the URL.
    host = urlparse(str(url)).hostname or ""

    def matches(domain: str) -> bool:
        # Exact or subdomain match; a substring test for "x.com" would also hit netflix.com.
        return host == domain or host.endswith("." + domain)

    if matches("twitter.com") or matches("x.com"):
        return "x"
    if matches("reddit.com"):
        return "reddit"
    if matches("instagram.com"):
        return "instagram"
    return "other"


df["platform"] = df["original"].apply(platform_from_url)
df["wayback_url"] = (
    "https://web.archive.org/web/"
    + df["timestamp"].astype(str)
    + "/"
    + df["original"].astype(str)
)
```

Then dedupe based on the question you are answering:
```python
deduped = (
    df.sort_values("captured_at")
    .drop_duplicates(subset=["original", "digest"], keep="first")
    .copy()
)
duplicate_ratio = 1 - (len(deduped) / len(df))
```

If you care about every time the archive saw a URL, keep more rows. If you care about distinct content states, dedupe more aggressively.
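To see how much the choice of rule matters, it can help to count the rows each rule keeps on the same data. The rows and the rule names below are made up for illustration.

```python
import pandas as pd

# Made-up rows: one digest captured twice under one URL and once under a host variant,
# plus a later capture of the first URL with different content.
df = pd.DataFrame(
    {
        "original": [
            "https://twitter.com/example/status/1",
            "https://twitter.com/example/status/1",
            "https://x.com/example/status/1",
            "https://twitter.com/example/status/1",
        ],
        "digest": ["AAA", "AAA", "AAA", "BBB"],
        "timestamp": ["20200101000000", "20200201000000", "20230801000000", "20240101000000"],
    }
)

# Hypothetical rule names mapped to the columns that define a duplicate.
rules = {
    "every_capture": ["original", "timestamp"],
    "per_url_content_state": ["original", "digest"],
    "distinct_content": ["digest"],
}

kept = {name: len(df.drop_duplicates(subset=cols)) for name, cols in rules.items()}
print(kept)  # {'every_capture': 4, 'per_url_content_state': 3, 'distinct_content': 2}
```

Reporting the rule alongside the count makes it clear whether a number describes capture events or distinct content states.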
Validate Before You Claim
A CDX row with statuscode=200 is a lead, not a conclusion. The snapshot might be the original post, but it might also be a login wall, redirect destination, app shell, interstitial, or partial capture.
Use three confidence levels:
| Level | Meaning | Use it for |
|---|---|---|
| Indexed | CDX returned a row | Discovery and queue building |
| Retrieved | The Wayback snapshot opens and returns meaningful content | Candidate analysis |
| Validated | A reviewer confirmed the needed content or metadata is visible | Evidence-sensitive work |
For every row that supports a conclusion, add review fields:
| Column | Purpose |
|---|---|
| wayback_url | Direct snapshot URL |
| validation_status | indexed, retrieved, valid_snapshot, login_wall, redirect_only, error, or unknown |
| validation_notes | Human-readable reason for the status |
This is especially important for Instagram. Bragg and Weigle's study of Instagram captures in the Internet Archive found that many mementos redirect to login pages or miss post images, so raw capture counts can be misleading.
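The review fields can be enforced with a small record type so that free-text statuses do not creep into the dataset. This is a sketch, not a required schema: ReviewRecord and is_evidence are hypothetical names, and the rule that evidence also needs a non-empty note is an assumption added here.

```python
from dataclasses import dataclass

ALLOWED_STATUSES = {
    "indexed", "retrieved", "valid_snapshot",
    "login_wall", "redirect_only", "error", "unknown",
}


@dataclass
class ReviewRecord:
    wayback_url: str
    validation_status: str = "indexed"  # every row starts as a lead, not a conclusion
    validation_notes: str = ""

    def __post_init__(self) -> None:
        # Reject statuses outside the agreed vocabulary.
        if self.validation_status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown validation_status: {self.validation_status!r}")

    @property
    def is_evidence(self) -> bool:
        # Assumption: evidence requires a validated snapshot plus a written reason.
        return self.validation_status == "valid_snapshot" and bool(self.validation_notes.strip())


queue = [
    ReviewRecord("https://web.archive.org/web/20200101000000/https://twitter.com/example/status/1234567890"),
    ReviewRecord(
        "https://web.archive.org/web/20210101000000/https://www.instagram.com/example/",
        "login_wall",
        "Snapshot shows a login form only",
    ),
]
evidence = [record for record in queue if record.is_evidence]
print(len(evidence))  # 0
```

Neither sample row counts as evidence: the first has only been indexed, and the second was reviewed and labeled as a login wall.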
Visualize Coverage
Once the dataset is clean, visualization helps decide whether more review is worth the effort:
- capture timeline by platform,
- status-code distribution by platform,
- MIME-type distribution,
- unique digest count per target,
- first and last capture date by URL,
- duplicate ratio by platform.
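Two of the views listed above, the capture timeline and the status-code distribution, can be approximated as plain tables before any charting. The rows below are made up; the column names follow the fields used elsewhere in this guide.

```python
import pandas as pd

# Made-up rows standing in for a cleaned CDX export.
df = pd.DataFrame(
    {
        "platform": ["x", "x", "reddit", "instagram"],
        "timestamp": ["20200115120000", "20200220080000", "20200116090000", "20210301000000"],
        "statuscode": [200, 301, 200, 200],
    }
)
df["captured_at"] = pd.to_datetime(df["timestamp"], format="%Y%m%d%H%M%S", utc=True)
df["capture_month"] = df["captured_at"].dt.strftime("%Y-%m")

# Rows are months, columns are platforms, cells are capture counts.
timeline = pd.crosstab(df["capture_month"], df["platform"])

# Rows are platforms, columns are status codes, cells are capture counts.
status_by_platform = pd.crosstab(df["platform"], df["statuscode"])

print(timeline)
print(status_by_platform)
```

Months with zero captures for a platform show up as zeros in the timeline table, which is often the quickest way to spot coverage gaps worth manual review.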
For notebook exploration, PyGWalker can inspect the dataframe interactively:
```python
import pygwalker as pyg

pyg.walk(deduped)
```

For a quick summary table:
```python
summary = (
    df.groupby("platform")
    .agg(
        total_captures=("original", "count"),
        unique_urls=("original", "nunique"),
        unique_digests=("digest", "nunique"),
        first_capture=("captured_at", "min"),
        last_capture=("captured_at", "max"),
    )
    .reset_index()
)
# Approximate ratio: digests are counted per platform, not per URL.
summary["duplicate_ratio"] = 1 - summary["unique_digests"] / summary["total_captures"]
```

Graphic Walker is useful for a browser-based version of the same workflow: load the cleaned dataset, drag timestamp and platform fields into a timeline, then switch between status-code, MIME-type, and digest summaries.
What CDX Data Can and Cannot Prove
CDX data can help establish that a public URL had archive captures, when those captures were made, which status code and MIME type were recorded, and whether repeated captures appear to share a digest.
CDX data cannot prove by itself that a deleted post existed in a specific form, that a screenshot is authentic, that the archived page rendered the same way for every user, or that every asset, comment, image, or embed was preserved.
Use CDX for discovery and structure. Use snapshot validation for evidence.
Common Failure Cases
| Failure case | What to do |
|---|---|
| 200 response is a login wall | Label it as login_wall, not valid evidence |
| Too many duplicate captures | Compare digest before interpreting capture volume |
| Redirect chains | Keep redirect-only rows, but review the destination |
| Missing assets | Inspect rendered snapshots when images or video matter |
| Hostname migration | Track twitter.com, x.com, and mobile hostnames explicitly |
If you prefer a UI-first workflow, PeekVault can search public Wayback records for social media URLs and export rows as CSV, JSON, or HTML. Treat any export as a starting dataset that still needs validation.
Responsible Use
Archive analysis can touch sensitive contexts, especially when social media content has changed, disappeared, or become contested.
Use these boundaries:
- Work with public archive records.
- Do not imply access to private accounts or private content.
- Do not treat missing archive records as proof that something never existed.
- Do not treat a CDX row as proof of content without opening the snapshot.
- Keep provenance and validation notes.
- Respect takedown, privacy, and safety concerns.
These boundaries are ethical, but they are also practical. A dataset that separates raw index rows from validated evidence is easier to review, reproduce, and challenge.
FAQ
What is Wayback Machine CDX data?
Wayback Machine CDX data is archive index metadata. It lists captured URLs, timestamps, status codes, MIME types, digests, and lengths. It helps you find candidate snapshots, but it is not the archived page content itself.
Can CDX data prove that a deleted social media post existed?
Not by itself. CDX data can show that a public URL had archive records, but you still need to inspect the actual Wayback snapshot and confirm that the content you care about is visible and meaningful.
Why does a 200 status code sometimes fail in archive analysis?
A 200 status means the archive recorded a successful HTTP response, but the response may still be a login wall, app shell, redirect page, or incomplete capture. Treat status code as a filter, not as final evidence.
How should I dedupe repeated Wayback captures?
Start with original plus digest so repeated identical responses for the same URL collapse into one row. For broader analysis, compare deduping by platform, target type, normalized URL, and digest.
Related Guides
- Data Science Topic Hub
- Best Public Datasets for Your Projects
- Pandas Data Cleaning
- Pandas to_datetime
- Pandas Drop Duplicates
- Export a DataFrame to CSV
- Web Scraping with Python
References
- Internet Archive: Wayback CDX Server API documentation
- Bragg and Weigle: Discovering the Traces of Disinformation on Instagram in the Internet Archive
- Kanaries: PyGWalker documentation
- Kanaries: Graphic Walker documentation