Skip to content

R dplyr Data Wrangling Pipeline: From Raw Data to Clean Tables

Problem

Analysts lose time writing repetitive R loops to clean datasets. The result is brittle scripts, inconsistent columns, and slow iterations whenever a new file arrives.

Agitate

Manual steps stack up: selecting the wrong columns, applying transforms in the wrong order, or forgetting to handle factors versus characters. Each mistake forces a rerun and risks shipping incorrect metrics to downstream dashboards.

Solution

Adopt a consistent dplyr pipeline that mirrors SQL logic and keeps every transform readable. The verbs below cover column selection, row filtering, new features, summaries, ordering, and safe joins so data is tidy before modeling or visualization.

Core pipeline in action

library(dplyr)
 
clean_sales <- raw_sales %>%
  mutate(
    region = as.character(region),        # avoid accidental factor levels
    revenue = price * quantity
  ) %>%
  filter(!is.na(revenue), revenue > 0) %>%
  select(order_id, region, revenue, channel, order_date) %>%
  arrange(desc(revenue)) %>%
  group_by(region, channel) %>%
  summarise(
    orders = n(),
    revenue = sum(revenue),
    avg_order = mean(revenue),
    .groups = "drop"
  )

dplyr verbs and SQL equivalents

dplyr verbPurposeSQL analogyQuick note
select()keep or rename columnsSELECT col1, col2Use everything() to reorder quickly.
filter()row subsetsWHERE conditionCombine with %in% for multiple values.
mutate()add/transform columnsSELECT col1, col2*2 AS col2xPrefer if_else() for type-stable logic.
summarise() + group_by()grouped metricsGROUP BY ...Add .groups = "drop" to ungroup.
arrange()ordering rowsORDER BYUse desc() for descending.
left_join()enrich with lookupLEFT JOINKeep join keys the same type (character vs factor).
bind_rows() / bind_cols()stack or widen tablesUNION ALL / column concatenationEnsure matching schemas when stacking.

Joins vs. binds at a glance

# Join: add columns from a lookup table
sales_with_regions <- sales %>%
  left_join(region_lookup, by = "region_id")
 
# Bind: stack identical schemas
all_sales <- bind_rows(sales_2023, sales_2024)

Avoid factor and encoding pitfalls

  • Convert identifiers to character before joins to prevent mismatched levels.
  • Use mutate(across(where(is.factor), as.character)) if a CSV was read with implicit factors.
  • Standardize text with stringr::str_trim and tolower before grouping to avoid duplicate categories.

Quick starter template

prep_data <- function(df) {
  df %>%
    mutate(across(where(is.factor), as.character)) %>%
    filter(!is.na(key_id)) %>%
    distinct() %>%
    select(key_id, everything())
}

Checklist before saving

  • Columns are explicitly selected and ordered.
  • Key joins use matching types and clear suffixes when needed (suffix = c(".src", ".lkp")).
  • Summaries drop grouping to avoid surprises in later steps.
  • Outputs contain only clean, human-readable columns ready for plots or models.

Related guides