PySpark tolist() Function Made Easy: A Comprehensive Guide

As a data scientist, you are probably familiar with PySpark, the Python API for Apache Spark, a unified analytics engine for large-scale data processing. DataFrames are PySpark's central data structure, and a common task is converting a PySpark DataFrame into a Python list. PySpark DataFrames have no tolist() method of their own, so this guide explains the usual pattern: convert the DataFrame to pandas with toPandas(), then call tolist() on the result.

What is PySpark tolist() function?

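Strictly speaking, tolist() is not a PySpark function. It is a method of NumPy arrays and pandas Series. In the PySpark context, "tolist()" refers to the pattern of converting a PySpark DataFrame to a pandas DataFrame with toPandas() and then calling values.tolist() on it. The call takes no arguments and returns a nested Python list in which each row of the DataFrame is represented as a list of values. Because toPandas() collects all rows onto the driver, the pattern is only suitable for data that fits in the driver's memory.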

How to use PySpark tolist() function?

Converting a PySpark DataFrame to a Python list is straightforward. Here is the basic syntax:

df.toPandas().values.tolist()

In the above code, the PySpark toPandas() method converts the Spark DataFrame into a pandas DataFrame. The pandas values attribute then exposes the data as a NumPy array, and the array's tolist() method converts it into a nested Python list.
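To see the second half of that chain in isolation, here is a minimal sketch that needs only pandas. The DataFrame pdf is built by hand as a stand-in for what toPandas() would return; its name and contents are assumptions for illustration.

```python
import pandas as pd

# Hand-built stand-in for the result of df.toPandas()
# (assumption: two columns, like the guide's example data)
pdf = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [1, 2]})

# .values exposes the data as a 2-D NumPy array, one row per DataFrame row
array = pdf.values

# .tolist() turns that array into a nested Python list
rows = array.tolist()
print(rows)  # [['Alice', 1], ['Bob', 2]]
```

The column names are not carried into the result; only the values survive, in column order.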

Let's take a closer look at this pattern with examples.

Example 1: Converting a PySpark DataFrame to a Python List

Let's say we have the following PySpark DataFrame:

from pyspark.sql import SparkSession
 
# create SparkSession
spark = SparkSession.builder.appName('PySparkTutorial').getOrCreate()
 
# create DataFrame
data = [('Alice', 1), ('Bob', 2), ('Charlie', 3), ('David', 4)]
df = spark.createDataFrame(data, ['Name', 'Age'])
 
# display DataFrame
df.show()

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
|  David|  4|
+-------+---+

To convert this DataFrame into a Python list, we can chain toPandas() and values.tolist() as follows:

df.toPandas().values.tolist()

Output:

[['Alice', 1], ['Bob', 2], ['Charlie', 3], ['David', 4]]

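Example 2: Converting a PySpark DataFrame with an Index Column to a Python List

Unlike pandas DataFrames, PySpark DataFrames have no built-in row index. If you want a row number in the Python list, you can add one as a regular column with row_number() over a window. Note that a window with orderBy() but no partitionBy() moves all rows into a single partition, which is fine for small data but slow for large data. Here is an example: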

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
 
# create DataFrame with index
df_index = df.withColumn("index", row_number().over(Window.orderBy("Name")))
 
# display DataFrame with index
df_index.show()

Output:

+-------+---+-----+
|   Name|Age|index|
+-------+---+-----+
|  Alice|  1|    1|
|    Bob|  2|    2|
|Charlie|  3|    3|
|  David|  4|    4|
+-------+---+-----+

To convert this DataFrame into a Python list that includes the index, we can use the pandas to_dict() method with the 'records' orientation, which directly returns a list of dictionaries, one per row:

df_index.toPandas().to_dict('records')

Output:

[{'Name': 'Alice', 'Age': 1, 'index': 1},
 {'Name': 'Bob', 'Age': 2, 'index': 2},
 {'Name': 'Charlie', 'Age': 3, 'index': 3},
 {'Name': 'David', 'Age': 4, 'index': 4}]
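The difference between the two output shapes can be sketched with pandas alone. The DataFrame pdf_index below is a hand-built stand-in for df_index.toPandas(), with the name and contents assumed for illustration.

```python
import pandas as pd

# Hand-built stand-in for df_index.toPandas() (assumed sample data)
pdf_index = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [1, 2], "index": [1, 2]}
)

# One dictionary per row, keyed by column name
records = pdf_index.to_dict("records")

# One list per row, values only, in column order
rows = pdf_index.values.tolist()

print(records)  # [{'Name': 'Alice', 'Age': 1, 'index': 1}, {'Name': 'Bob', 'Age': 2, 'index': 2}]
print(rows)     # [['Alice', 1, 1], ['Bob', 2, 2]]
```

The list of dictionaries keeps the column names with every value, which is convenient when the index column must remain identifiable. The nested list is more compact but relies on column position.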

Conclusion

In this guide, we have learned how to convert PySpark DataFrames into Python lists by combining toPandas() with the pandas tolist() and to_dict() methods. We have also shown examples with and without an added index column. We hope this guide has been helpful in your data processing tasks, and we encourage you to explore other PySpark functions to further enhance your skills.

Frequently Asked Questions

  1. How do I convert a DataFrame to a list in Python?

    To convert a pandas DataFrame to a list in Python, you can use values.tolist(). This returns a nested list where each inner list represents a row in the DataFrame. For a PySpark DataFrame, call toPandas() first. The resulting list can be used for further processing or analysis.

  2. Can I convert a specific column of a DataFrame to a list?

    Yes. For a pandas DataFrame, use the indexing operator [] to access the column by name and then call tolist() on the resulting Series, for example df['Name'].tolist(). This returns a flat list containing the values of the selected column.

  3. Is it possible to convert multiple columns of a DataFrame to a list?

    Yes. For a pandas DataFrame, pass a list of column names to the indexing operator [] and then apply values.tolist(), for example df[['Name', 'Age']].values.tolist(). This returns a nested list where each inner list holds the selected columns' values for one row.
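The three answers above can be sketched together with pandas. The DataFrame pdf, its City column, and the variable names are hypothetical sample data, not part of the guide's examples.

```python
import pandas as pd

# Hypothetical sample data; the City column is added only for illustration
pdf = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [1, 2], "City": ["Oslo", "Lima"]}
)

# 1. Whole DataFrame: one inner list per row
all_rows = pdf.values.tolist()

# 2. One column: a flat list of that column's values
one_column = pdf["Name"].tolist()

# 3. Several columns: one inner list per row, selected columns only
two_columns = pdf[["Name", "Age"]].values.tolist()

print(all_rows)     # [['Alice', 1, 'Oslo'], ['Bob', 2, 'Lima']]
print(one_column)   # ['Alice', 'Bob']
print(two_columns)  # [['Alice', 1], ['Bob', 2]]
```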