

PySpark tolist() Function Made Easy: A Comprehensive Guide


As a data scientist, you are probably familiar with PySpark, the Python library for Apache Spark, a unified analytics engine for big data processing. In PySpark, DataFrames are a key data structure used for data processing, and one common task is converting a PySpark DataFrame into a Python list. In this guide, we will explain how to use the tolist() pattern in PySpark to accomplish this task.


What is PySpark tolist() function?

Strictly speaking, tolist() is not a built-in PySpark function: it is a Pandas (NumPy) method. The common pattern in PySpark is to convert the DataFrame to a Pandas DataFrame with toPandas() and then call values.tolist() on the result. The call takes no arguments and returns a list of the rows in the DataFrame, where each row is represented as a list of values.

How to use PySpark tolist() function?

Using the tolist() pattern in PySpark is straightforward. Here is the basic syntax:

python_list = df.toPandas().values.tolist()

In the above code, we are using the PySpark toPandas() function to convert the DataFrame into a Pandas DataFrame. Then, we are using the Pandas values.tolist() method to convert the Pandas DataFrame into a Python list.
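To see what values.tolist() does on the Pandas side in isolation, here is a minimal Pandas-only sketch; the DataFrame below is just a stand-in for the result of toPandas():

```python
import pandas as pd

# Stand-in for the result of df.toPandas()
pdf = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [1, 2]})

# .values returns a NumPy array; .tolist() converts it into nested Python lists
python_list = pdf.values.tolist()
print(python_list)
# [['Alice', 1], ['Bob', 2]]
```

Note that with mixed column types the underlying array has object dtype, so the values come back as ordinary Python objects.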

Let's take a closer look at how to use the PySpark tolist() function with examples.

Example 1: Converting a PySpark DataFrame to a Python List

Let's say we have the following PySpark DataFrame:

from pyspark.sql import SparkSession
# create SparkSession
spark = SparkSession.builder.appName('PySparkTutorial').getOrCreate()
# create DataFrame
data = [('Alice', 1), ('Bob', 2), ('Charlie', 3), ('David', 4)]
df = spark.createDataFrame(data, ['Name', 'Age'])
# display DataFrame
df.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
|  David|  4|
+-------+---+

To convert this DataFrame into a Python list, we can use the tolist() pattern as follows:

# convert DataFrame to Python list
python_list = df.toPandas().values.tolist()
print(python_list)

[['Alice', 1], ['Bob', 2], ['Charlie', 3], ['David', 4]]
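A pure-PySpark alternative, with no Pandas round trip, is to call collect() and unpack each Row. A minimal sketch of the unpacking step, using plain tuples to stand in for the pyspark.sql.Row objects that collect() returns:

```python
# Rows returned by DataFrame.collect() behave like tuples,
# so each one can be converted to a list directly.
# Plain tuples stand in for pyspark.sql.Row objects here.
collected = [('Alice', 1), ('Bob', 2), ('Charlie', 3), ('David', 4)]

python_list = [list(row) for row in collected]
print(python_list)
# [['Alice', 1], ['Bob', 2], ['Charlie', 3], ['David', 4]]
```

Like toPandas(), collect() pulls the full DataFrame onto the driver, so both approaches are only suitable for data that fits in driver memory.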

Example 2: Converting a PySpark DataFrame with Index to a Python List

PySpark DataFrames do not carry an index the way Pandas DataFrames do, but you can create one with the row_number() window function and include it in the Python list. Here is an example that shows how to do this:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
# create DataFrame with index
df_index = df.withColumn("index", row_number().over(Window.orderBy("Name")))
# display DataFrame with index
df_index.show()
+-------+---+-----+
|   Name|Age|index|
+-------+---+-----+
|  Alice|  1|    1|
|    Bob|  2|    2|
|Charlie|  3|    3|
|  David|  4|    4|
+-------+---+-----+

To convert this DataFrame into a Python list that includes the index, we can use the Pandas to_dict() function with the 'records' orientation, which returns one dictionary per row:

# convert DataFrame to a list of dictionaries
python_list = df_index.toPandas().to_dict('records')
print(python_list)
[{'Name': 'Alice', 'Age': 1, 'index': 1},
 {'Name': 'Bob', 'Age': 2, 'index': 2},
 {'Name': 'Charlie', 'Age': 3, 'index': 3},
 {'Name': 'David', 'Age': 4, 'index': 4}]
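The to_dict('records') step happens entirely on the Pandas side, so it can be sketched in isolation; the DataFrame below is a stand-in for df_index.toPandas():

```python
import pandas as pd

# Stand-in for df_index.toPandas()
pdf = pd.DataFrame({'Name': ['Alice', 'Bob'],
                    'Age': [1, 2],
                    'index': [1, 2]})

# 'records' orientation yields one dict per row, keyed by column name
python_list = pdf.to_dict('records')
print(python_list)
# [{'Name': 'Alice', 'Age': 1, 'index': 1}, {'Name': 'Bob', 'Age': 2, 'index': 2}]
```

Unlike 'dict' (the default orientation, which is keyed by column), 'records' preserves row order, which is what we want when treating the result as a list.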


In this guide, we have learned how to convert PySpark DataFrames into Python lists by combining toPandas() with the Pandas tolist() and to_dict() methods. We have also shown examples of how to use this pattern with and without a row index. We hope this guide has been helpful in your data processing tasks, and we encourage you to explore other PySpark functions to further enhance your skills.