
How to Use DuckDB and Pandas for Data Analysis

When it comes to data analysis, the tools you use can make a world of difference. That's where DuckDB and Pandas come in. These two have been making waves in the data analysis world, and for good reason. This article is all about getting you up to speed with these tools, their strengths, their differences, and how they can work together to supercharge your data analysis tasks.

What is DuckDB?

Let's kick things off with DuckDB. Now, DuckDB is an in-process analytical database that's written in C++. It's designed to support analytical SQL queries and data-intensive tasks, and it can run entirely in memory or persist data to a single file. But here's the kicker: DuckDB is fast. I mean, really fast. Its columnar, vectorized engine is built for speed, which makes it a top pick for tasks that need to be done in a jiffy.

But it's not just about speed. DuckDB is also about efficiency. It's designed to run complex queries on large datasets, which is a big plus if you're dealing with a ton of data. And the best part? It integrates seamlessly with Pandas DataFrames, allowing you to run these queries directly on top of them. No need to move data in and out. How cool is that?

What is Pandas?

Now, let's talk about Pandas. If you're into data analysis, chances are you've crossed paths with Pandas. It's a flexible data analysis library in Python that's loved by many for its simplicity and power. It can handle a wide range of data formats and sizes, and for the most part, it's pretty intuitive to use.

But here's where it gets interesting. If you're familiar with SQL, you know how powerful it can be when it comes to handling complex data transformations. And that's where DuckDB comes in. By combining the speed and efficiency of DuckDB with the flexibility of Pandas, you get a powerful tool for data analysis that's hard to beat.

Want to quickly create data visualizations from a Python Pandas DataFrame with no code?

PyGWalker is a Python library for exploratory data analysis with visualization. PyGWalker can simplify your Jupyter Notebook analysis and visualization workflow by turning your pandas (or Polars) DataFrame into a Tableau-style user interface for visual exploration.

PyGWalker for Data Visualization

Comparing DuckDB and Pandas

Okay, so we've talked about DuckDB and Pandas separately. But how do they stack up against each other? Let's find out.

DuckDB vs Pandas

When it comes to performance on analytical workloads, such as aggregations and joins over large datasets, DuckDB has a clear edge: its columnar, vectorized engine is faster and uses less memory than the equivalent Pandas operations. But that's not all. DuckDB also shines when it comes to SQL capabilities. It allows you to run SQL queries directly on Pandas DataFrames, which can be a game-changer if you're more comfortable with SQL than with Pandas' own API.

On the other hand, Pandas is no slouch either. It's incredibly flexible and powerful, capable of handling a wide range of data formats and sizes. Plus, it's pretty intuitive to use, which is always a plus.

But here's the thing: you don't have to choose between DuckDB and Pandas. In fact, they work best when used together. By combining the speed and efficiency of DuckDB with the flexibility of Pandas, you get a powerful tool for data analysis that's hard to beat.

But don't just take my word for it. Let's take a look at some examples.

Getting Started with DuckDB and Pandas

Before we dive into the examples, let's talk about how you can get started with DuckDB and Pandas. Don't worry, it's pretty straightforward.

Installation Guide

First things first, you'll need to install DuckDB and Pandas. If you're using Python, you can do this using pip, the Python package installer. Here's how:

pip install duckdb pandas

That's it! You now have DuckDB and Pandas installed and ready to go.
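If you want to double-check that the install worked, a quick version check is enough. This is just a sanity check; the version numbers on your machine will differ:

import duckdb
import pandas as pd
 
# Print the installed versions as a quick sanity check
print(duckdb.__version__)
print(pd.__version__)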

Setting Up DuckDB with Pandas

Once you've installed DuckDB and Pandas, you can start using them together. The first step is to import the libraries in your Python script or Jupyter notebook. Here's how you can do it:

import duckdb
import pandas as pd
import numpy as np

With these lines of code, you're telling Python that you want to use the DuckDB, Pandas, and NumPy libraries in your script or notebook. NumPy isn't strictly required by DuckDB, but we'll use it below to generate sample data.
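The examples in this article use DuckDB's module-level functions, which run against a default in-memory database. If you'd rather work with an explicit connection object, for example to keep things tidy in a larger script, here's a minimal sketch that does the same thing:

# Create an in-memory DuckDB connection (nothing is written to disk)
con = duckdb.connect()
 
# Run a trivial query to confirm everything is wired up
print(con.execute("SELECT 42 AS answer").fetchdf())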

Using DuckDB to Query Pandas DataFrames

Now that you've set up DuckDB and Pandas, let's talk about how you can use DuckDB to query Pandas DataFrames. This is where the magic happens.

Basic SQL Queries on DataFrames

With DuckDB, you can run SQL queries directly on Pandas DataFrames. This means you can use the power and familiarity of SQL to manipulate and analyze your data. Here's an example:

# Create a Pandas DataFrame
df = pd.DataFrame({
   'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
   'C': np.random.randn(8),
   'D': np.random.randn(8)
})
 
# Use DuckDB to run a SQL query on the DataFrame
result = duckdb.query("SELECT A, AVG(D) FROM df GROUP BY A").to_df()

In this example, we're creating a Pandas DataFrame and then using DuckDB to run a SQL query on it. The query calculates the average of column 'D' for each unique value in column 'A', and the result comes back as a new Pandas DataFrame.
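The query string can be any SQL that DuckDB understands, so filtering and sorting work the same way. As a quick sketch that reuses the df defined above, this keeps only the rows where 'C' is positive and sorts them in descending order:

# Filter and sort the same DataFrame using plain SQL
result = duckdb.query("SELECT A, B, C FROM df WHERE C > 0 ORDER BY C DESC").to_df()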

Advanced SQL Operations with DuckDB

But DuckDB isn't just for basic SQL queries. You can also use it for more advanced operations, like joins, window functions, and common table expressions. Here's an example of a join operation:

# Create two Pandas DataFrames
df1 = pd.DataFrame({
   'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
   'C': np.random.randn(8),
   'D': np.random.randn(8)
})
 
df2 = pd.DataFrame({
   'A': ['foo', 'bar', 'baz', 'bat'],
   'E': ['apple', 'orange', 'banana', 'grape']
})
 
# Use DuckDB to run a SQL join operation on the DataFrames
result = duckdb.query("SELECT df1.A, df1.B, df2.E FROM df1 JOIN df2 ON df1.A = df2.A").to_df()

In this example, we're creating two Pandas DataFrames and then using DuckDB to run a SQL join operation on them. The result is a new DataFrame with the joined data.
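Window functions work just as well. As a rough sketch that reuses df1 from above, this ranks each row within its group in column 'A' by the value of column 'D':

# Rank rows within each group of column 'A' by the value of column 'D'
result = duckdb.query("""
    SELECT A, B, D,
           RANK() OVER (PARTITION BY A ORDER BY D DESC) AS rank_in_group
    FROM df1
""").to_df()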

Conclusion

And that's just the tip of the iceberg. With DuckDB and Pandas, you can supercharge your data analysis tasks, making them faster, more efficient, and more intuitive. So why not give it a try? You might just find that it's the dynamic duo you've been looking for.

Frequently Asked Questions

Is DuckDB faster than Pandas?

Yes, DuckDB is generally faster than Pandas for analytical queries, such as aggregations and joins, on large datasets. This is because DuckDB is an in-process analytical database with a columnar, vectorized, multithreaded engine designed for exactly this kind of workload. However, the actual performance depends on the specific task and the size of the dataset.
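If you want to see how this plays out on your own data, a rough timing comparison is easy to put together. This is a sketch, not a proper benchmark, and the numbers will depend heavily on your dataset and hardware:

import time
import duckdb
import numpy as np
import pandas as pd
 
# A synthetic DataFrame with one million rows
df = pd.DataFrame({
   'key': np.random.randint(0, 1000, 1_000_000),
   'value': np.random.randn(1_000_000)
})
 
start = time.perf_counter()
pandas_result = df.groupby('key')['value'].mean()
print(f"Pandas groupby: {time.perf_counter() - start:.3f}s")
 
start = time.perf_counter()
duckdb_result = duckdb.query("SELECT key, AVG(value) FROM df GROUP BY key").to_df()
print(f"DuckDB query:   {time.perf_counter() - start:.3f}s")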

What is the difference between DuckDB and Polars?

DuckDB and Polars are both tools for fast data analysis, but they take different approaches. DuckDB is an in-process analytical database that excels at executing SQL queries on large datasets. Polars, on the other hand, is a DataFrame library implemented in Rust with a Python API, designed to be fast and flexible. While DuckDB focuses on providing SQL capabilities, Polars provides a DataFrame API that is similar to Pandas.

Is DuckDB multithreaded?

Yes, DuckDB is multithreaded. It uses multiple threads to execute queries, which can lead to significant performance improvements on multicore systems. This is one of the reasons why DuckDB can be faster than other data analysis tools for certain tasks.
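You can check and change the number of threads DuckDB uses through a SQL setting. The value of 4 below is just an illustration; by default DuckDB picks a thread count based on your machine:

import duckdb
 
# Limit DuckDB to four threads, then confirm the setting took effect
duckdb.execute("SET threads TO 4")
print(duckdb.query("SELECT current_setting('threads') AS threads").to_df())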

Is SQLite faster than Pandas?

SQLite and Pandas serve different purposes and their performance can vary depending on the task. SQLite is a database engine that is optimized for data storage and retrieval, while Pandas is a data analysis library that excels at data manipulation and transformation. For tasks that involve complex SQL queries, SQLite might be faster. However, for tasks that involve complex data manipulation, Pandas might be faster. When used together, they can provide a powerful tool for data analysis.