
PySpark: Convert DataFrame or Column to Python List (Beginner-Friendly Guide)


Working with PySpark often involves converting distributed Spark DataFrames into native Python objects.
One common need—especially during debugging, exporting, or data transformation—is converting a PySpark DataFrame into a Python list.

Although PySpark does not provide a built-in .tolist() method like Pandas, there are several reliable ways to achieve the same result depending on dataset size and memory constraints.
This updated guide covers:

  • What “tolist” means in PySpark
  • Best techniques for converting Spark DataFrames → Python lists
  • Handling single/multiple columns
  • Performance considerations
  • Code examples for small and large datasets

Want an AI agent that truly understands your PySpark, Pandas, and Jupyter workflows?

RunCell is a JupyterLab AI agent that can read your code, analyze DataFrames, understand notebook context, debug errors, and even generate & execute code for you.
It works directly inside JupyterLab—no switching windows or copy-pasting.

👉 Try RunCell: https://www.runcell.dev


What Does "tolist()" Mean in PySpark?

Unlike Pandas, PySpark DataFrames do not have a native .tolist() method.

When PySpark users refer to “tolist”, they usually mean:

✔ Option A — Convert the entire DataFrame into a Python list

df.collect()

✔ Option B — Convert a DataFrame to Pandas, then to a list

df.toPandas().values.tolist()

✔ Option C — Convert a single column to a Python list

df.select("col").rdd.flatMap(lambda x: x).collect()

This guide walks through all these methods with clear examples.


Method 1: Convert a PySpark DataFrame to a Python List (Small Data)

This is the most common pattern, but it should only be used when the dataset fits comfortably in the driver's memory.

df.toPandas().values.tolist()

Example

from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName("PySparkTutorial").getOrCreate()
 
data = [('Alice', 1), ('Bob', 2), ('Charlie', 3), ('David', 4)]
df = spark.createDataFrame(data, ['Name', 'Age'])
 
df.toPandas().values.tolist()

Output

[['Alice', 1], ['Bob', 2], ['Charlie', 3], ['David', 4]]
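Note that recent Pandas versions recommend .to_numpy() over the older .values attribute. An equivalent one-liner, assuming the same df as above:

df.toPandas().to_numpy().tolist()
# [['Alice', 1], ['Bob', 2], ['Charlie', 3], ['David', 4]]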

Method 2: Convert Spark DataFrame to List Without Pandas (Recommended for Large Data)

If your dataset is large, avoid toPandas(): it pulls every row to the driver and then materializes a full Pandas DataFrame on top of it. collect() also brings all rows to the driver, but it skips the Pandas conversion overhead and pairs well with select() and filter() to shrink the result first:

df.collect()

This returns:

[Row(Name='Alice', Age=1), Row(Name='Bob', Age=2), ...]

To convert rows into plain Python lists:

[list(row) for row in df.collect()]

Or convert each row to a dict:

[row.asDict() for row in df.collect()]
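For the small Name/Age DataFrame created in Method 1, these two variants produce:

[list(row) for row in df.collect()]
# [['Alice', 1], ['Bob', 2], ['Charlie', 3], ['David', 4]]

[row.asDict() for row in df.collect()]
# [{'Name': 'Alice', 'Age': 1}, {'Name': 'Bob', 'Age': 2}, {'Name': 'Charlie', 'Age': 3}, {'Name': 'David', 'Age': 4}]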

Method 3: Convert a Single Column to Python List

A very common use case.

Option A: Using RDD (fast & scalable)

df.select("Name").rdd.flatMap(lambda x: x).collect()

Option B: Using Pandas (small data)

df.toPandas()["Name"].tolist()

Method 4: Convert a DataFrame With an Index to a Python List

PySpark DataFrames don’t have a built-in index, but you can add one manually:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
 
df_index = df.withColumn(
    "index", row_number().over(Window.orderBy("Name"))
)
df_index.show()

Convert to list of dictionaries:

df_index.toPandas().to_dict("records")

Output

[
 {'Name': 'Alice', 'Age': 1, 'index': 1},
 {'Name': 'Bob', 'Age': 2, 'index': 2},
 {'Name': 'Charlie', 'Age': 3, 'index': 3},
 {'Name': 'David', 'Age': 4, 'index': 4}
]
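If you don't need consecutive numbers, monotonically_increasing_id() is a cheaper alternative: a global Window.orderBy moves all rows to a single partition, while monotonically_increasing_id() runs fully in parallel. The trade-off is that the IDs are only guaranteed to be increasing and unique, not sequential. A minimal sketch:

from pyspark.sql.functions import monotonically_increasing_id

# IDs are unique and increasing, but not consecutive
df_id = df.withColumn("index", monotonically_increasing_id())
df_id.toPandas().to_dict("records")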

Performance Notes (Must Read)

🚫 Avoid df.toPandas() when:

  • Dataset is large
  • Cluster memory is limited
  • Columns contain large binary/text objects

✔ Use collect() or RDD operations when:

  • Working with medium-sized data that still fits on the driver
  • You only need specific columns or filtered rows
  • You want to skip the extra Pandas conversion overhead

✔ Convert only what you need

Instead of doing:

df.toPandas()

Prefer:

df.select("col_of_interest")

Conclusion

PySpark does not include a native .tolist() function, but converting a DataFrame into a Python list is very straightforward using:

  • toPandas().values.tolist() — for small datasets
  • collect() or rdd operations — for scalable workloads
  • to_dict("records") — for JSON-friendly output

Choose the method that fits your data size and workflow.


Frequently Asked Questions

1. How do I convert a PySpark DataFrame to a Python list?

Use df.collect() or df.toPandas().values.tolist() depending on data size.

2. How do I convert a single column to a list?

df.select("col").rdd.flatMap(lambda x: x).collect()

3. How do I convert a Spark Row to a dict?

row.asDict()
