Skip to content

Pandas Crosstab: Create Simple Cross Tabulation Tables in Python

Updated on

Cross-tabulation is a powerful tool in data analysis, allowing us to inspect the relationship between two or more categorical or discrete variables. This article will delve into the world of pandas crosstab, a function in Python's pandas library that simplifies the creation of cross-tabulation tables. We'll explore what pandas crosstab is, how it compares to the pivot table function, and how to use it effectively for data manipulation.

Pandas is a popular data manipulation tool in Python, offering a variety of functions for data analysis. One such function is crosstab, which allows for easy cross-tabulation, providing a simple yet effective way to understand the relationships in your data. Whether you're a seasoned data analyst or a beginner in the field, understanding how to use pandas crosstab can significantly enhance your data manipulation skills.

Want to quickly create Data Visualization from Python Pandas Dataframe with No code?

PyGWalker is a Python library for Exploratory Data Analysis with Visualization. PyGWalker (opens in a new tab) can simplify your Jupyter Notebook data analysis and data visualization workflow, by turning your pandas dataframe (and polars dataframe) into a tableau-alternative User Interface for visual exploration.

PyGWalker for Data visualization (opens in a new tab)

What is Pandas Crosstab?

Pandas crosstab is a function that provides a simple way to create a cross-tabulation of two (or more) factors. It computes a simple cross-tabulation of two or more data arrays. By default, it provides a frequency table of the factors unless an array of values and an aggregation function are passed.

Here's a basic example of how to use pandas crosstab:

import pandas as pd
 
## Sample data
data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['small', 'large', 'large', 'small', 'small', 'large'],
        'D': [1, 2, 2, 3, 3, 4]}
df = pd.DataFrame(data)
 
## Use crosstab
table = pd.crosstab(df['A'], df['B'])
print(table)

In this example, we create a DataFrame from a dictionary and then use pandas crosstab to create a cross-tabulation table that shows the frequency of 'B' categories for each 'A' category.

Should I Use Pivot Table or Crosstab in Pandas?

While both pivot tables and crosstabs in pandas serve similar purposes, they are used in slightly different scenarios. The pivot table is a more general solution, used for creating a new derived table out of a given one. On the other hand, crosstab is more specialized for calculating group frequencies.

The pivot table can deal with multiple types of input data and can handle multiple index and column names, while crosstab is mainly used for frequency tables. So, if you're looking to create a frequency table, pandas crosstab is the way to go. However, if you're dealing with more complex data manipulation tasks, a pivot table might be a better choice.

How to Define Crosstab in Python?

Defining a crosstab in Python using pandas is straightforward. The basic syntax for creating a crosstab is as follows:

pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: str = 'All', dropna: bool = True, normalize=False)

Here's what each parameter means:

  • index: array-like, values to group by in the rows.
  • columns: array-like, values to

group by in the columns.

  • values: array-like, optional, array of values to aggregate according to the factors.
  • rownames: sequence, default None, if passed, must match number of row arrays passed.
  • colnames: sequence, default None, if passed, must match number of column arrays passed.
  • aggfunc: function, optional, if passed, requires values to be specified as well.
  • margins: bool, default False, add row/column margins (subtotals).
  • margins_name: str, default 'All', name of the row/column that will contain the totals when margins is True.
  • dropna: bool, default True, do not include columns whose entries are all NaN.
  • normalize: bool, columns, or 1, default False. Normalize by dividing all values by the sum of values.

How to Create a Crosstab with Percentages in Pandas?

Creating a crosstab with percentages in pandas is quite simple. You just need to set the normalize parameter to True or specify the axis (index or columns) you want to normalize. Here's an example:

import pandas as pd
 
## Sample data
data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['small', 'large', 'large', 'small', 'small', 'large'],
        'D': [1, 2, 2, 3, 3, 4]}
df = pd.DataFrame(data)
 
## Use crosstab with normalization
table = pd.crosstab(df['A'], df['B'], normalize=True)
print(table)

In this example, the crosstab will show the proportion of each category instead of the count, providing a clearer picture of the distribution of categories.

What are Some Easy Ways to Crosstab in Pandas?

There are several ways to create crosstabs in pandas, and the best method depends on your specific needs. Here are three easy ways to create crosstabs:

  1. Basic Crosstab: The simplest way to create a crosstab is to pass two series to the pandas crosstab function, which will provide a frequency table.

  2. Crosstab with Aggregation: If you have a third series of values, you can pass it to the crosstab function along with an aggregation function (like mean, sum, etc.) to get a more complex crosstab.

  3. Crosstab with Normalization: By setting the normalize parameter to True, you can create a crosstab that shows the proportion of each category, which can be useful for understanding the distribution of categories.

Pandas Crosstab Example

Let's dive into a practical example of using pandas crosstab. Suppose we have a dataset of a store's sales, including the type of product sold, the salesperson who made the sale, and the number of units sold. We want to create a crosstab that shows the total units sold by each salesperson for each product type.

import pandas as pd
 
# Sample data
data = {'Product': ['Apples', 'Oranges', 'Bananas', 'Apples', 'Bananas', 'Bananas'],
        'Salesperson': ['John', 'John', 'Claire', 'Claire', 'John', 'Claire'],
        'Units': [5, 3, 8, 7, 2, 4]}
df = pd.DataFrame(data)
 
# Use crosstab with aggregation
table = pd.crosstab(df['Salesperson'], df['Product'], values=df['Units'], aggfunc='sum')
print(table)

In this example, the crosstab shows the total units of each product type sold by each salesperson. This is a powerful way to summarize and visualize your data.

How to Sort Crosstab in Pandas

Sorting a crosstab in pandas is as simple as using the sort_values function. You can sort by a specific column or by the index. Here's an example:

# Continuing from the previous example
# Sort by 'Apples'
table = table.sort_values('Apples', ascending=False)
print(table)

In this example, the crosstab is sorted by the 'Apples' column in descending order. You can easily change this to sort by a different column or in ascending order.

Pandas Crosstab Count and Percentage

Sometimes, you might want to get both the count and the percentage in your crosstab. This can be achieved by creating two crosstabs and concatenating them. Here's how:

# Continuing from the first example
# Create count crosstab
count_table = pd.crosstab(df['Salesperson'], df['Product'])
 
# Create percentage crosstab
perc_table = pd.crosstab(df['Salesperson'], df['Product'], normalize='index')
 
# Concatenate the two tables
final_table = pd.concat([count_table, perc_table], keys=['Count', 'Percentage'])
print(final_table)

In this example, the final crosstab shows both the count and the percentage of each category. This can be very useful for getting a comprehensive view of your data.

FAQs

What is pandas crosstab?

Pandas crosstab is a function in Python's pandas library that allows you to create a cross-tabulation of two or more factors. It provides a simple and effective way to understand the relationships in your data.

How do you define crosstab in Python?

Defining a crosstab in Python using pandas is straightforward. The basic syntax for creating a crosstab is pd.crosstab(index, columns), where index and columns are the values to group by in the rows and columns, respectively.

How to create a crosstab with percentages in pandas?

To create a crosstab with percentages in pandas, you need to set the normalize parameter to True or specify the axis (index or columns) you want to normalize. This will create a crosstab that shows the proportion of each category, providing a clearer picture of the distribution of categories.