R dplyr Data Wrangling Pipeline: From Raw Data to Clean Tables

Problem

Analysts lose time writing repetitive R loops to clean datasets. The result is brittle scripts, inconsistent columns, and slow iterations whenever a new file arrives.

Agitate

Manual steps stack up: selecting the wrong columns, applying transforms in the wrong order, or forgetting to handle factors versus characters. Each mistake forces a rerun and risks shipping incorrect metrics to downstream dashboards.

Solution

Adopt a consistent dplyr pipeline that mirrors SQL logic and keeps every transform readable. The verbs below cover column selection, row filtering, new features, summaries, ordering, and safe joins so data is tidy before modeling or visualization.

Core pipeline in action

library(dplyr)
 
clean_sales <- raw_sales %>%
  mutate(
    region = as.character(region),        # avoid accidental factor levels
    revenue = price * quantity
  ) %>%
  filter(!is.na(revenue), revenue > 0) %>%
  select(order_id, region, revenue, channel, order_date) %>%
  arrange(desc(revenue)) %>%
  group_by(region, channel) %>%
  summarise(
    orders = n(),
    revenue = sum(revenue),
    avg_order = mean(revenue),
    .groups = "drop"
  )

dplyr verbs and SQL equivalents

dplyr verb	Purpose	SQL analogy	Quick note
`select()`	keep or rename columns	`SELECT col1, col2`	Use `everything()` to reorder quickly.
`filter()`	row subsets	`WHERE condition`	Combine with `%in%` for multiple values.
`mutate()`	add/transform columns	`SELECT col1, col2*2 AS col2x`	Prefer `if_else()` for type-stable logic.
`summarise()` + `group_by()`	grouped metrics	`GROUP BY ...`	Add `.groups = "drop"` to ungroup.
`arrange()`	ordering rows	`ORDER BY`	Use `desc()` for descending.
`left_join()`	enrich with lookup	`LEFT JOIN`	Keep join keys the same type (character vs factor).
`bind_rows()` / `bind_cols()`	stack or widen tables	`UNION ALL` / column concatenation	Ensure matching schemas when stacking.

Joins vs. binds at a glance

# Join: add columns from a lookup table
sales_with_regions <- sales %>%
  left_join(region_lookup, by = "region_id")
 
# Bind: stack identical schemas
all_sales <- bind_rows(sales_2023, sales_2024)

Avoid factor and encoding pitfalls

Convert identifiers to character before joins to prevent mismatched levels.
Use mutate(across(where(is.factor), as.character)) if a CSV was read with implicit factors.
Standardize text with stringr::str_trim and tolower before grouping to avoid duplicate categories.

Quick starter template

prep_data <- function(df) {
  df %>%
    mutate(across(where(is.factor), as.character)) %>%
    filter(!is.na(key_id)) %>%
    distinct() %>%
    select(key_id, everything())
}

Checklist before saving

Columns are explicitly selected and ordered.
Key joins use matching types and clear suffixes when needed (suffix = c(".src", ".lkp")).
Summaries drop grouping to avoid surprises in later steps.
Outputs contain only clean, human-readable columns ready for plots or models.

Related guides

Build tidy tables from scratch: Creating Dataframe in R
Group data for summaries: Grouping in R with group_by()
Prep for plotting: R ggplot2 Quickstart