R dplyr Data Wrangling Pipeline: From Raw Data to Clean Tables
Problem
Analysts lose time writing repetitive R loops to clean datasets. The result is brittle scripts, inconsistent columns, and slow iterations whenever a new file arrives.
Agitate
Manual steps stack up: selecting the wrong columns, applying transforms in the wrong order, or forgetting to handle factors versus characters. Each mistake forces a rerun and risks shipping incorrect metrics to downstream dashboards.
Solution
Adopt a consistent dplyr pipeline that mirrors SQL logic and keeps every transform readable. The verbs below cover column selection, row filtering, new features, summaries, ordering, and safe joins so data is tidy before modeling or visualization.
Core pipeline in action
library(dplyr)
clean_sales <- raw_sales %>%
mutate(
region = as.character(region), # avoid accidental factor levels
revenue = price * quantity
) %>%
filter(!is.na(revenue), revenue > 0) %>%
select(order_id, region, revenue, channel, order_date) %>%
arrange(desc(revenue)) %>%
group_by(region, channel) %>%
summarise(
orders = n(),
revenue = sum(revenue),
avg_order = mean(revenue),
.groups = "drop"
)dplyr verbs and SQL equivalents
| dplyr verb | Purpose | SQL analogy | Quick note |
|---|---|---|---|
select() | keep or rename columns | SELECT col1, col2 | Use everything() to reorder quickly. |
filter() | row subsets | WHERE condition | Combine with %in% for multiple values. |
mutate() | add/transform columns | SELECT col1, col2*2 AS col2x | Prefer if_else() for type-stable logic. |
summarise() + group_by() | grouped metrics | GROUP BY ... | Add .groups = "drop" to ungroup. |
arrange() | ordering rows | ORDER BY | Use desc() for descending. |
left_join() | enrich with lookup | LEFT JOIN | Keep join keys the same type (character vs factor). |
bind_rows() / bind_cols() | stack or widen tables | UNION ALL / column concatenation | Ensure matching schemas when stacking. |
Joins vs. binds at a glance
# Join: add columns from a lookup table
sales_with_regions <- sales %>%
left_join(region_lookup, by = "region_id")
# Bind: stack identical schemas
all_sales <- bind_rows(sales_2023, sales_2024)Avoid factor and encoding pitfalls
- Convert identifiers to
characterbefore joins to prevent mismatched levels. - Use
mutate(across(where(is.factor), as.character))if a CSV was read with implicit factors. - Standardize text with
stringr::str_trimandtolowerbefore grouping to avoid duplicate categories.
Quick starter template
prep_data <- function(df) {
df %>%
mutate(across(where(is.factor), as.character)) %>%
filter(!is.na(key_id)) %>%
distinct() %>%
select(key_id, everything())
}Checklist before saving
- Columns are explicitly selected and ordered.
- Key joins use matching types and clear suffixes when needed (
suffix = c(".src", ".lkp")). - Summaries drop grouping to avoid surprises in later steps.
- Outputs contain only clean, human-readable columns ready for plots or models.
Related guides
- Build tidy tables from scratch: Creating Dataframe in R
- Group data for summaries: Grouping in R with group_by()
- Prep for plotting: R ggplot2 Quickstart