PySpark: Convert DataFrame or Column to Python List (Beginner-Friendly Guide)
Working with PySpark often involves converting distributed Spark DataFrames into native Python objects.
One common need—especially during debugging, exporting, or data transformation—is converting a PySpark DataFrame into a Python list.
Although PySpark does not provide a built-in .tolist() method like Pandas, there are several reliable ways to achieve the same result depending on dataset size and memory constraints.
This updated guide covers:
- What “tolist” means in PySpark
- Best techniques for converting Spark DataFrames → Python lists
- Handling single/multiple columns
- Performance considerations
- Code examples for small and large datasets
Want an AI agent that truly understands your PySpark, Pandas, and Jupyter workflows?
RunCell is a JupyterLab AI agent that can read your code, analyze DataFrames, understand notebook context, debug errors, and even generate & execute code for you.
It works directly inside JupyterLab—no switching windows or copy-pasting.
👉 Try RunCell: https://www.runcell.dev
What Does "tolist()" Mean in PySpark?
Unlike Pandas, PySpark DataFrames do not have a native .tolist() method.
When PySpark users refer to “tolist”, they usually mean:
✔ Option A — Convert the entire DataFrame into a Python list
```python
df.collect()
```

✔ Option B — Convert a DataFrame to Pandas, then to a list

```python
df.toPandas().values.tolist()
```

✔ Option C — Convert a single column to a Python list

```python
df.select("col").rdd.flatMap(lambda x: x).collect()
```

This guide walks through all these methods with clear examples.
Method 1: Convert a PySpark DataFrame to a Python List (Small Data)
This is the most common pattern, but should be used only when the dataset fits in memory.
```python
df.toPandas().values.tolist()
```

Example

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkTutorial").getOrCreate()

data = [('Alice', 1), ('Bob', 2), ('Charlie', 3), ('David', 4)]
df = spark.createDataFrame(data, ['Name', 'Age'])

df.toPandas().values.tolist()
```

Output

```python
[['Alice', 1], ['Bob', 2], ['Charlie', 3], ['David', 4]]
```

Method 2: Convert Spark DataFrame to List Without Pandas (Recommended for Large Data)
If your dataset is big, always avoid toPandas().
Use Spark’s distributed API instead:
```python
df.collect()
```

This returns a list of Row objects:

```python
[Row(Name='Alice', Age=1), Row(Name='Bob', Age=2), ...]
```

To convert rows into plain Python lists (note that asDict().values() returns a dict_values view, so wrap it in list()):

```python
[list(row.asDict().values()) for row in df.collect()]
```

Or convert each row to a dict:

```python
[row.asDict() for row in df.collect()]
```

Method 3: Convert a Single Column to Python List
A very common use case.
Option A: Using RDD (fast & scalable)
```python
df.select("Name").rdd.flatMap(lambda x: x).collect()
```

Option B: Using Pandas (small data)

```python
df.toPandas()["Name"].tolist()
```

Method 4: Convert a DataFrame With an Index to a Python List
PySpark DataFrames don’t have a built-in index, but you can add one manually:
```python
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df_index = df.withColumn(
    "index", row_number().over(Window.orderBy("Name"))
)
df_index.show()
```

Convert to a list of dictionaries:

```python
df_index.toPandas().to_dict("records")
```

Output
```python
[
    {'Name': 'Alice', 'Age': 1, 'index': 1},
    {'Name': 'Bob', 'Age': 2, 'index': 2},
    {'Name': 'Charlie', 'Age': 3, 'index': 3},
    {'Name': 'David', 'Age': 4, 'index': 4}
]
```

Performance Notes (Must Read)
🚫 Avoid df.toPandas() when:
- Dataset is large
- Cluster memory is limited
- Columns contain large binary/text objects
✔ Use collect() or RDD operations when:
- Working with medium-to-large data
- You only need specific columns
- You want to avoid driver memory overload
✔ Convert only what you need
Instead of doing:
```python
df.toPandas()
```

Prefer:

```python
df.select("col_of_interest").toPandas()
```

Conclusion
PySpark does not include a native .tolist() function, but converting a DataFrame into a Python list is very straightforward using:
- toPandas().values.tolist() — for small datasets
- collect() or RDD operations — for scalable workloads
- to_dict("records") — for JSON-friendly output
Choose the method that fits your data size and workflow.
Frequently Asked Questions
1. How do I convert a PySpark DataFrame to a Python list?
Use df.collect() or df.toPandas().values.tolist() depending on data size.
2. How do I convert a single column to a list?
```python
df.select("col").rdd.flatMap(lambda x: x).collect()
```

3. How do I convert a Spark Row to a dict?

```python
row.asDict()
```