Python Collections Module: A Guide to Counter, defaultdict, deque, and namedtuple

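Python's built-in data structures (lists, dicts, tuples, and sets) cover most tasks. Once your code grows beyond toy examples, though, you start hitting their limits: counting elements means writing a manual dictionary loop; grouping data litters your code with defensive if key not in dict checks; using a list as a queue costs O(n) every time you pop from the front; and representing structured records as plain tuples turns field access into an unreadable game of guess-the-index. Each workaround is small on its own, but they add up quickly, making code harder to read, slower to run, and more error-prone.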

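The collections module in Python's standard library addresses these pain points with a set of purpose-built container types: Counter counts elements in one line; defaultdict removes KeyError handling by supplying default values automatically; deque makes operations at both ends of a sequence O(1); namedtuple adds field names to tuples without the overhead of a full class; and OrderedDict and ChainMap handle order-sensitive operations and layered lookups that a plain dict cannot express cleanly.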

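This guide covers every major class in the collections module, with runnable code, performance analysis, and real-world usage. Whether you are processing log files, building caches, managing layered configuration, or assembling data pipelines, these containers can make your code shorter, faster, and more reliable.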

collections Module Overview

The collections module provides specialized container data types that extend Python's general-purpose built-in containers.

import collections
 
# See all available classes
print([name for name in dir(collections) if not name.startswith('_')])
# ['ChainMap', 'Counter', 'OrderedDict', 'UserDict', 'UserList',
#  'UserString', 'abc', 'defaultdict', 'deque', 'namedtuple']
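
| Class | Purpose | Replaces |
| --- | --- | --- |
| Counter | Count hashable objects | Hand-written dict counting loops |
| defaultdict | dict with automatic default values | dict.setdefault(), if key not in checks |
| deque | Double-ended queue with O(1) operations at both ends | list used as a queue or stack |
| namedtuple | tuple with named fields | Plain tuple, simple data class |
| OrderedDict | dict that remembers insertion order | dict (before 3.7), order-sensitive operations |
| ChainMap | Layered dictionary lookup | Manually merging dicts |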

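Counter: Counting Elements

Counter is a dict subclass for counting hashable objects. It maps elements to their number of occurrences and provides methods for common frequency analysis.

Creating a Counter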

from collections import Counter
 
# From an iterable
words = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
word_count = Counter(words)
print(word_count)
# Counter({'apple': 3, 'banana': 2, 'cherry': 1})
 
# From a string
letter_count = Counter('mississippi')
print(letter_count)
# Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
 
# From a dictionary
inventory = Counter({'shirts': 25, 'pants': 15, 'hats': 10})
 
# From keyword arguments
stock = Counter(laptops=5, monitors=12)

most_common() and Frequency Ordering

from collections import Counter
 
text = "to be or not to be that is the question"
words = Counter(text.split())
 
# Get the 3 most common words
print(words.most_common(3))
# [('to', 2), ('be', 2), ('or', 1)]
 
# Get all elements sorted by frequency
print(words.most_common())
# [('to', 2), ('be', 2), ('or', 1), ('not', 1), ('that', 1), ('is', 1), ('the', 1), ('question', 1)]
 
# Least common: reverse the list or slice from the end
print(words.most_common()[-3:])
# [('is', 1), ('the', 1), ('question', 1)]

Counter Arithmetic

Counter supports addition, subtraction, intersection, and union, so you can treat it as a multiset.

from collections import Counter
 
a = Counter(x=4, y=2, z=1)
b = Counter(x=1, y=3, z=5)
 
# Addition: combine counts
print(a + b)  # Counter({'z': 6, 'x': 5, 'y': 5})
 
# Subtraction: drops zero and negative results
print(a - b)  # Counter({'x': 3})
 
# Intersection (min of each)
print(a & b)  # Counter({'y': 2, 'x': 1, 'z': 1})
 
# Union (max of each)
print(a | b)  # Counter({'z': 5, 'x': 4, 'y': 3})
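
The operators above always return a new Counter and discard non-positive counts. The in-place methods update() and subtract() behave differently, which matters when you need to track deficits. A minimal sketch (the stock data is made up for illustration):

```python
from collections import Counter

stock = Counter(apples=3, pears=1)

# update() adds counts in place (dict.update would overwrite them instead)
stock.update({'apples': 2, 'plums': 4})
print(stock)  # Counter({'apples': 5, 'plums': 4, 'pears': 1})

# subtract() also works in place and keeps zero and negative counts
stock.subtract({'pears': 3, 'plums': 4})
print(stock['pears'], stock['plums'])  # -2 0

# Unary plus returns a copy without the zero and negative entries
print(+stock)  # Counter({'apples': 5})
```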

Practical Counter Patterns

from collections import Counter
 
# Count log entries by severity level
log_entries = [
    "ERROR: disk full",
    "WARNING: high memory",
    "ERROR: disk full",
    "ERROR: timeout",
    "WARNING: high memory",
    "ERROR: disk full",
    "INFO: backup complete",
]
error_types = Counter(entry.split(":")[0].strip() for entry in log_entries)
print(error_types)
# Counter({'ERROR': 4, 'WARNING': 2, 'INFO': 1})
 
# Find unique elements (count == 1)
data = [1, 2, 3, 2, 1, 4, 5, 4]
unique = [item for item, count in Counter(data).items() if count == 1]
print(unique)  # [3, 5]
 
# Check whether two words contain exactly the same letters (anagram check)
def is_anagram(word1, word2):
    return Counter(word1.lower()) == Counter(word2.lower())
 
print(is_anagram("listen", "silent"))  # True
print(is_anagram("hello", "world"))    # False
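
The anagram check compares two Counters for equality. A related multiset question is containment: does one collection have enough of every element to cover another? One way to sketch it relies on Counter subtraction dropping non-positive counts. The helper name can_build is hypothetical, chosen only for this example:

```python
from collections import Counter

def can_build(word, letters):
    """Return True if `word` can be spelled from the available `letters`."""
    need = Counter(word.lower())
    have = Counter(letters.lower())
    # Subtraction keeps only positive counts, so an empty result means
    # every required letter is available often enough.
    return not (need - have)

print(can_build("tent", "listen"))  # False: only one 't' is available
print(can_build("tile", "listen"))  # True
```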

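For a deeper look at Counter, see our dedicated Python Counter guide.

defaultdict: Automatic Default Values

defaultdict is a dict subclass that calls a factory function to supply a default value whenever a missing key is looked up with d[key], which eliminates KeyError and the defensive checks that go with it.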

Factory functions

from collections import defaultdict
 
# int factory: default is 0
counter = defaultdict(int)
counter['apples'] += 1
counter['oranges'] += 3
print(dict(counter))  # {'apples': 1, 'oranges': 3}
 
# list factory: default is []
groups = defaultdict(list)
pairs = [('fruit', 'apple'), ('veggie', 'carrot'), ('fruit', 'banana'), ('veggie', 'pea')]
for category, item in pairs:
    groups[category].append(item)
print(dict(groups))
# {'fruit': ['apple', 'banana'], 'veggie': ['carrot', 'pea']}
 
# set factory: default is set()
index = defaultdict(set)
words = [('file1', 'python'), ('file2', 'python'), ('file1', 'java'), ('file3', 'python')]
for filename, lang in words:
    index[lang].add(filename)
print(dict(index))
# {'python': {'file1', 'file2', 'file3'}, 'java': {'file1'}}  (order inside each set may vary)

The Grouping Pattern

Grouping related items under a shared key is the most common use of defaultdict(list). Compare it with the manual approach:

from collections import defaultdict
 
students = [
    ('Math', 'Alice'), ('Science', 'Bob'), ('Math', 'Charlie'),
    ('Science', 'Diana'), ('Math', 'Eve'), ('History', 'Frank'),
]
 
# Without defaultdict -- verbose and error-prone
groups_manual = {}
for subject, name in students:
    if subject not in groups_manual:
        groups_manual[subject] = []
    groups_manual[subject].append(name)
 
# With defaultdict -- clean and direct
groups = defaultdict(list)
for subject, name in students:
    groups[subject].append(name)
 
print(dict(groups))
# {'Math': ['Alice', 'Charlie', 'Eve'], 'Science': ['Bob', 'Diana'], 'History': ['Frank']}

Nested defaultdict

You can build multi-level data structures without initializing each level by hand.

from collections import defaultdict
 
# Two-level nested defaultdict
def nested_dict():
    return defaultdict(int)
 
sales = defaultdict(nested_dict)
sales['2025']['Q1'] = 150000
sales['2025']['Q2'] = 175000
sales['2026']['Q1'] = 200000
print(sales['2025']['Q1'])  # 150000
print(sales['2024']['Q3'])  # 0 (auto-created, no KeyError)
 
# Arbitrary depth nesting with a recursive factory
def deep_dict():
    return defaultdict(deep_dict)
 
config = deep_dict()
config['database']['primary']['host'] = 'localhost'
config['database']['primary']['port'] = 5432
config['database']['replica']['host'] = 'replica.local'
print(config['database']['primary']['host'])  # localhost
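
Auto-creation has a side effect worth knowing: merely reading a missing key with square brackets stores it. The sketch below, using a made-up visits counter, shows the effect and two ways to avoid it:

```python
from collections import defaultdict

visits = defaultdict(int)
visits['home'] += 1

# Reading a missing key with [] inserts it as a side effect
_ = visits['about']
print('about' in visits)  # True
print(len(visits))        # 2

# .get() and `in` never call the factory, so they leave the dict untouched
print(visits.get('contact'))  # None
print('contact' in visits)    # False

# Setting default_factory to None switches auto-creation off from here on
visits.default_factory = None
try:
    visits['pricing']
except KeyError:
    print("KeyError: 'pricing' was not auto-created")
```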

Custom Factory Functions

from collections import defaultdict
 
# Lambda for custom defaults
scores = defaultdict(lambda: 100)  # Every student starts with 100
scores['Alice'] -= 5
scores['Bob'] -= 10
print(scores['Charlie'])  # 100 (new student gets default)
print(dict(scores))  # {'Alice': 95, 'Bob': 90, 'Charlie': 100}
 
# Named function for complex defaults
def default_user():
    return {'role': 'viewer', 'active': True, 'login_count': 0}
 
users = defaultdict(default_user)
users['alice']['role'] = 'admin'
print(users['bob'])  # {'role': 'viewer', 'active': True, 'login_count': 0}

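For more patterns, see the Python defaultdict guide.

deque: Double-Ended Queue

deque (pronounced "deck") provides O(1) append and pop operations at both ends. On a list, pop(0) and insert(0, x) are O(n) because every remaining element has to shift. Whenever your workload frequently touches both ends of a sequence, deque is the right choice.

Core Operations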

from collections import deque
 
d = deque([1, 2, 3, 4, 5])
 
# O(1) operations on both ends
d.append(6)         # Add to right: [1, 2, 3, 4, 5, 6]
d.appendleft(0)     # Add to left:  [0, 1, 2, 3, 4, 5, 6]
 
right = d.pop()     # Remove from right: 6
left = d.popleft()  # Remove from left:  0
print(d)  # deque([1, 2, 3, 4, 5])
 
# Extend from both sides
d.extend([6, 7])          # Right extend: [1, 2, 3, 4, 5, 6, 7]
d.extendleft([-1, 0])     # Left extend (reversed): [0, -1, 1, 2, 3, 4, 5, 6, 7]

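Bounded deques with maxlen

When maxlen is set, adding an element to a full deque automatically discards one from the opposite end. This is a good fit for sliding windows and caches.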

from collections import deque
 
# Keep only the last 5 items
recent = deque(maxlen=5)
for i in range(10):
    recent.append(i)
 
print(recent)  # deque([5, 6, 7, 8, 9], maxlen=5)
 
# Sliding window average
def moving_average(iterable, window_size):
    window = deque(maxlen=window_size)
    for value in iterable:
        window.append(value)
        if len(window) == window_size:
            yield sum(window) / window_size
 
data = [10, 20, 30, 40, 50, 60, 70]
print(list(moving_average(data, 3)))
# [20.0, 30.0, 40.0, 50.0, 60.0]

Rotation

rotate(n) shifts the elements n steps to the right; a negative value rotates to the left.

from collections import deque
 
d = deque([1, 2, 3, 4, 5])
 
d.rotate(2)   # Rotate right by 2
print(d)  # deque([4, 5, 1, 2, 3])
 
d.rotate(-3)  # Rotate left by 3
print(d)  # deque([2, 3, 4, 5, 1])
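
One common use of rotation is round-robin scheduling: take the item at the front, then rotate it to the back. A small sketch, with hypothetical worker names and a hypothetical next_worker helper:

```python
from collections import deque

workers = deque(['w1', 'w2', 'w3'])

def next_worker():
    """Return the worker at the front and move it to the back."""
    worker = workers[0]
    workers.rotate(-1)  # rotate left by one: the head becomes the tail
    return worker

order = [next_worker() for _ in range(5)]
print(order)  # ['w1', 'w2', 'w3', 'w1', 'w2']
```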

deque vs list Performance

from collections import deque
import time
 
# Benchmark: append/pop from left side
n = 100_000
 
# List: O(n) for each insert at position 0
start = time.perf_counter()
lst = []
for i in range(n):
    lst.insert(0, i)
list_time = time.perf_counter() - start
 
# Deque: O(1) for appendleft
start = time.perf_counter()
dq = deque()
for i in range(n):
    dq.appendleft(i)
deque_time = time.perf_counter() - start
 
print(f"List insert(0, x): {list_time:.4f}s")
print(f"Deque appendleft:  {deque_time:.4f}s")
print(f"Deque is {list_time / deque_time:.0f}x faster")
# Typical output:
# List insert(0, x): 1.2340s
# Deque appendleft:  0.0065s
# Deque is 190x faster
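
| Operation | list | deque |
| --- | --- | --- |
| append(x) (right) | O(1) amortized | O(1) |
| pop() (right) | O(1) | O(1) |
| insert(0, x) / appendleft(x) | O(n) | O(1) |
| pop(0) / popleft() | O(n) | O(1) |
| access by index [i] | O(1) | O(n) in the middle, O(1) at the ends |
| Memory per element | Lower | Slightly higher |

Use deque when you need fast operations at both ends; use list when you need fast random access by index.

For the full treatment, see: Python deque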

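namedtuple: Tuples with Named Fields

namedtuple creates a tuple subclass with named fields, making code more self-documenting while avoiding the overhead of defining a full class.

Creating a namedtuple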

from collections import namedtuple
 
# Define a type
Point = namedtuple('Point', ['x', 'y'])
p = Point(3, 4)
 
# Access by name or index
print(p.x)     # 3
print(p[1])    # 4
print(p)       # Point(x=3, y=4)
 
# Alternative field definition styles
Color = namedtuple('Color', 'red green blue')        # Space-separated string
Config = namedtuple('Config', 'host, port, database')  # Comma-separated string

Why use namedtuple instead of a plain tuple?

from collections import namedtuple
 
# Plain tuple: which index is what?
employee_tuple = ('Alice', 'Engineering', 95000, True)
print(employee_tuple[2])  # 95000 -- but what does index 2 mean?
 
# namedtuple: self-documenting
Employee = namedtuple('Employee', 'name department salary active')
employee = Employee('Alice', 'Engineering', 95000, True)
print(employee.salary)     # 95000 -- immediately clear
print(employee.department) # Engineering

Key Methods

from collections import namedtuple
 
Employee = namedtuple('Employee', 'name department salary')
emp = Employee('Alice', 'Engineering', 95000)
 
# _replace: create a new instance with some fields changed (immutable)
promoted = emp._replace(salary=110000)
print(promoted)  # Employee(name='Alice', department='Engineering', salary=110000)
print(emp)       # Employee(name='Alice', department='Engineering', salary=95000)  -- unchanged
 
# _asdict: convert to a dict (a regular dict on Python 3.8+, an OrderedDict on 3.1 through 3.7)
print(emp._asdict())
# {'name': 'Alice', 'department': 'Engineering', 'salary': 95000}
 
# _fields: get field names
print(Employee._fields)  # ('name', 'department', 'salary')
 
# _make: create from an iterable
data = ['Bob', 'Marketing', 85000]
emp2 = Employee._make(data)
print(emp2)  # Employee(name='Bob', department='Marketing', salary=85000)

Default Values

from collections import namedtuple
 
# defaults parameter (Python 3.7+); values apply to the rightmost fields
Connection = namedtuple('Connection', 'host port timeout', defaults=[5432, 30])
conn1 = Connection('localhost')               # port=5432, timeout=30
conn2 = Connection('db.example.com', 3306)    # timeout=30
conn3 = Connection('db.example.com', 3306, 60)
 
print(conn1)  # Connection(host='localhost', port=5432, timeout=30)
print(conn2)  # Connection(host='db.example.com', port=3306, timeout=30)

The typing.NamedTuple Alternative

If you want type annotations and a more class-like syntax, use typing.NamedTuple:

from typing import NamedTuple
 
class Point(NamedTuple):
    x: float
    y: float
    label: str = "origin"
 
p = Point(3.0, 4.0, "A")
print(p.x, p.label)  # 3.0 A
 
# Still a tuple -- supports unpacking, indexing, iteration
x, y, label = p
print(f"({x}, {y})")  # (3.0, 4.0)

namedtuple vs dataclass

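| Feature | namedtuple | dataclass |
| --- | --- | --- |
| Immutable by default | Yes | No (requires frozen=True) |
| Memory footprint | Same as a tuple (small) | Larger (regular class) |
| Iteration / unpacking | Supported (it is a tuple) | Not supported (unless you add methods) |
| Type annotations | Via typing.NamedTuple | Built in |
| Methods / properties | Requires subclassing | Supported directly |
| Inheritance | Limited | Full class inheritance |
| Best for | Lightweight data records | Complex mutable objects |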
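
The first and third rows of the comparison can be checked directly. This is a minimal sketch; PointNT and PointDC are hypothetical names used only to keep the two versions apart:

```python
from collections import namedtuple
from dataclasses import dataclass, astuple

PointNT = namedtuple('PointNT', 'x y')

@dataclass
class PointDC:
    x: float
    y: float

nt = PointNT(1.0, 2.0)
dc = PointDC(1.0, 2.0)

# namedtuple: unpacks like a tuple, rejects attribute assignment
x, y = nt
try:
    nt.x = 5.0
except AttributeError:
    print("namedtuple fields are read-only")

# dataclass: mutable by default, but not iterable
dc.x = 5.0
print(dc)  # PointDC(x=5.0, y=2.0)
try:
    x, y = dc
except TypeError:
    print("dataclass instances do not unpack; use astuple()")
print(astuple(dc))  # (5.0, 2.0)
```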

OrderedDict: Ordered Dictionary Operations

Since Python 3.7, the regular dict preserves insertion order. So what is OrderedDict still good for?

Where OrderedDict Still Matters

from collections import OrderedDict
 
# 1. Equality considers order
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}
print(d1 == d2)  # True -- regular dicts ignore order in comparison
 
od1 = OrderedDict([('a', 1), ('b', 2)])
od2 = OrderedDict([('b', 2), ('a', 1)])
print(od1 == od2)  # False -- OrderedDict considers order
 
# 2. move_to_end() for reordering
od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
od.move_to_end('a')           # Move 'a' to the end
print(list(od.keys()))  # ['b', 'c', 'a']
 
od.move_to_end('c', last=False)  # Move 'c' to the beginning
print(list(od.keys()))  # ['c', 'b', 'a']

Building an LRU Cache with OrderedDict

from collections import OrderedDict
 
class LRUCache:
    def __init__(self, capacity):
        self.cache = OrderedDict()
        self.capacity = capacity
 
    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)  # Mark as recently used
        return self.cache[key]
 
    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # Remove oldest
 
cache = LRUCache(3)
cache.put('a', 1)
cache.put('b', 2)
cache.put('c', 3)
cache.get('a')       # Access 'a', moves it to end
cache.put('d', 4)    # Evicts 'b' (least recently used)
print(list(cache.cache.keys()))  # ['c', 'a', 'd']

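ChainMap: Layered Dictionary Lookup

ChainMap combines multiple dicts into a single view for lookups. It searches each dict in order and returns the first match. This suits layered configuration, scoped variable lookup, and context management.

Basic Usage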

from collections import ChainMap
 
defaults = {'theme': 'light', 'language': 'en', 'timeout': 30}
user_prefs = {'theme': 'dark'}
session = {'language': 'fr'}
 
config = ChainMap(session, user_prefs, defaults)
 
# Lookup searches session -> user_prefs -> defaults
print(config['theme'])     # 'dark'    (from user_prefs)
print(config['language'])  # 'fr'      (from session)
print(config['timeout'])   # 30        (from defaults)
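
Lookups search every layer, but writes and deletes only ever touch the first mapping. The sketch below illustrates this with a two-layer chain (the dict contents are illustrative):

```python
from collections import ChainMap

defaults = {'theme': 'light', 'timeout': 30}
user_prefs = {'theme': 'dark'}
config = ChainMap(user_prefs, defaults)

# Writes always land in the first mapping, even for keys found deeper
config['timeout'] = 60
print(user_prefs)  # {'theme': 'dark', 'timeout': 60}
print(defaults)    # {'theme': 'light', 'timeout': 30}

# Deleting only affects the first mapping as well
del config['theme']
print(config['theme'])  # light (the default shows through again)

# .maps exposes the underlying dicts in lookup order
print(config.maps)
# [{'timeout': 60}, {'theme': 'light', 'timeout': 30}]
```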

Configuration Layering

from collections import ChainMap
import os
 
# Real-world config pattern: CLI args > env vars > config file > defaults
defaults = {
    'debug': False,
    'log_level': 'WARNING',
    'port': 8080,
    'host': '0.0.0.0',
}
 
config_file = {
    'log_level': 'INFO',
    'port': 9090,
}
 
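# Environment values are always strings, and any matching variable set in
# your shell (for example PORT) would take precedence over the layers below.
env_vars = {
    k.lower(): v for k, v in os.environ.items()
    if k.lower() in defaults
}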
 
cli_args = {'debug': True}  # Parsed from argparse
 
config = ChainMap(cli_args, env_vars, config_file, defaults)
print(config['debug'])      # True (from cli_args)
print(config['log_level'])  # 'INFO' (from config_file)
print(config['host'])       # '0.0.0.0' (from defaults)

Scoped Contexts with new_child()

from collections import ChainMap
 
# Simulating variable scoping (like nested function scopes)
global_scope = {'x': 1, 'y': 2}
local_scope = ChainMap(global_scope)
 
# Enter a new scope
inner_scope = local_scope.new_child()
inner_scope['x'] = 10  # Shadows global x
inner_scope['z'] = 30  # New local variable
 
print(inner_scope['x'])  # 10 (local)
print(inner_scope['y'])  # 2  (falls through to global)
print(inner_scope['z'])  # 30 (local)
 
# Exit scope -- original is unchanged
print(local_scope['x'])  # 1 (global still intact)

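Comparison of All Collection Types

| Type | Base Class | Mutable | Use Case | Key Advantage |
| --- | --- | --- | --- | --- |
| Counter | dict | Yes | Counting | most_common(), multiset operations |
| defaultdict | dict | Yes | Auto-initializing missing keys | No KeyError, factory function |
| deque | -- | Yes | Double-ended queue | O(1) at both ends, maxlen |
| namedtuple | tuple | No | Structured data records | Named field access, lightweight |
| OrderedDict | dict | Yes | Order-sensitive dict | move_to_end(), order affects equality |
| ChainMap | -- | Yes | Layered lookup | Configuration layering, scoped contexts |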

Performance Benchmarks

Counter vs Manual Counting

from collections import Counter, defaultdict
import time
 
data = list(range(1000)) * 1000  # 1 million items, 1000 unique
 
# Method 1: Counter
start = time.perf_counter()
c = Counter(data)
counter_time = time.perf_counter() - start
 
# Method 2: defaultdict(int)
start = time.perf_counter()
dd = defaultdict(int)
for item in data:
    dd[item] += 1
dd_time = time.perf_counter() - start
 
# Method 3: Manual dict
start = time.perf_counter()
manual = {}
for item in data:
    manual[item] = manual.get(item, 0) + 1
manual_time = time.perf_counter() - start
 
print(f"Counter:         {counter_time:.4f}s")
print(f"defaultdict(int):{dd_time:.4f}s")
print(f"dict.get():      {manual_time:.4f}s")
# Typical: Counter ~0.03s, defaultdict ~0.07s, dict.get() ~0.09s

deque vs list for Queue Operations

from collections import deque
import time
 
n = 100_000
 
# Simulate a FIFO queue: append right, pop left
# List
start = time.perf_counter()
q = list(range(n))
while q:
    q.pop(0)
list_queue_time = time.perf_counter() - start
 
# Deque
start = time.perf_counter()
q = deque(range(n))
while q:
    q.popleft()
deque_queue_time = time.perf_counter() - start
 
print(f"List pop(0):     {list_queue_time:.4f}s")
print(f"Deque popleft(): {deque_queue_time:.4f}s")
print(f"Deque is {list_queue_time / deque_queue_time:.0f}x faster")
# Typical: List ~2.5s, Deque ~0.004s -> ~600x faster

Real-World Examples

Log Analysis with Counter

from collections import Counter
from datetime import datetime
 
# Parse and analyze server logs
log_lines = [
    "2026-02-18 10:15:03 GET /api/users 200",
    "2026-02-18 10:15:04 POST /api/login 401",
    "2026-02-18 10:15:05 GET /api/users 200",
    "2026-02-18 10:15:06 GET /api/products 500",
    "2026-02-18 10:15:07 POST /api/login 200",
    "2026-02-18 10:15:08 GET /api/users 200",
    "2026-02-18 10:15:09 GET /api/products 500",
    "2026-02-18 10:15:10 POST /api/login 401",
]
 
# Count status codes
status_codes = Counter(line.split()[-1] for line in log_lines)
print("Status codes:", status_codes.most_common())
# [('200', 4), ('401', 2), ('500', 2)]
 
# Count endpoints
endpoints = Counter(line.split()[3] for line in log_lines)
print("Top endpoints:", endpoints.most_common(2))
# [('/api/users', 3), ('/api/login', 3)]
 
# Count error endpoints (status >= 400)
errors = Counter(
    line.split()[3] for line in log_lines
    if int(line.split()[-1]) >= 400
)
print("Error endpoints:", errors)
# Counter({'/api/login': 2, '/api/products': 2})

Configuration Management with ChainMap

from collections import ChainMap
import json
 
# Multi-layer config system for a web application
def load_config(config_path=None, cli_overrides=None):
    # Layer 1: Hard-coded defaults
    defaults = {
        'host': '127.0.0.1',
        'port': 8000,
        'debug': False,
        'db_pool_size': 5,
        'log_level': 'WARNING',
        'cors_origins': ['http://localhost:3000'],
    }
 
    # Layer 2: Config file
    file_config = {}
    if config_path:
        with open(config_path) as f:
            file_config = json.load(f)
 
    # Layer 3: CLI overrides (highest priority)
    cli = cli_overrides or {}
 
    # ChainMap searches cli -> file_config -> defaults
    return ChainMap(cli, file_config, defaults)
 
# Usage
config = load_config(cli_overrides={'debug': True, 'port': 9000})
print(config['debug'])        # True (CLI override)
print(config['port'])         # 9000 (CLI override)
print(config['db_pool_size']) # 5    (default)
print(config['log_level'])    # WARNING (default)

A Recent-Items Cache with deque

from collections import deque
 
class RecentItemsTracker:
    """Track the N most recent unique items."""
 
    def __init__(self, max_items=10):
        self.items = deque(maxlen=max_items)
        self.seen = set()
 
    def add(self, item):
        if item in self.seen:
            # Mark as most recent by removing and re-appending (remove() is O(n))
            self.items.remove(item)
            self.items.append(item)
        else:
            if len(self.items) == self.items.maxlen:
                # Remove the oldest item from the set too
                oldest = self.items[0]
                self.seen.discard(oldest)
            self.items.append(item)
            self.seen.add(item)
 
    def get_recent(self):
        return list(reversed(self.items))
 
# Track recently viewed products
tracker = RecentItemsTracker(max_items=5)
for product in ['shoes', 'shirt', 'hat', 'shoes', 'jacket', 'belt', 'hat']:
    tracker.add(product)
 
print(tracker.get_recent())
# ['hat', 'belt', 'jacket', 'shoes', 'shirt']

Building a Data Pipeline with namedtuple

from collections import namedtuple, Counter, defaultdict
 
# Define structured records
Transaction = namedtuple('Transaction', 'id customer product amount date')
 
transactions = [
    Transaction(1, 'Alice', 'Widget', 29.99, '2026-02-01'),
    Transaction(2, 'Bob', 'Gadget', 49.99, '2026-02-01'),
    Transaction(3, 'Alice', 'Widget', 29.99, '2026-02-03'),
    Transaction(4, 'Charlie', 'Gadget', 49.99, '2026-02-05'),
    Transaction(5, 'Alice', 'Gizmo', 19.99, '2026-02-07'),
    Transaction(6, 'Bob', 'Widget', 29.99, '2026-02-08'),
]
 
# Most popular products
product_count = Counter(t.product for t in transactions)
print("Popular products:", product_count.most_common())
# [('Widget', 3), ('Gadget', 2), ('Gizmo', 1)]
 
# Revenue by customer
revenue = defaultdict(float)
for t in transactions:
    revenue[t.customer] += t.amount
print("Revenue:", dict(revenue))
# {'Alice': 79.97, 'Bob': 79.98, 'Charlie': 49.99}
 
# Convert to DataFrame for visualization
import pandas as pd
df = pd.DataFrame(transactions, columns=Transaction._fields)
print(df.groupby('customer')['amount'].sum())

Visualizing Collection Data with PyGWalker

After processing data with Counter, defaultdict, or namedtuple, you often need to visualize the results. PyGWalker turns any pandas DataFrame into a Tableau-like interactive visualization interface that runs inside Jupyter notebooks:

from collections import Counter
import pandas as pd
import pygwalker as pyg
 
# Process data with collections
log_data = ["ERROR", "WARNING", "ERROR", "INFO", "ERROR", "WARNING", "INFO", "INFO"]
counts = Counter(log_data)
 
# Convert to DataFrame
df = pd.DataFrame(counts.items(), columns=['Level', 'Count'])
 
# Launch interactive visualization
walker = pyg.walk(df)

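It supports dragging and dropping fields, building charts, filtering data, and interactively exploring distributions and patterns, all without writing visualization code. It is especially helpful for understanding the shape of your data after you have processed a large dataset and grouped it into summary statistics with Counter or defaultdict.

If you want to run these collections experiments interactively, RunCell provides an AI-powered Jupyter environment where you can iterate on data processing pipelines with immediate feedback.

Combining Multiple Collection Types

The real power of collections often shows when several of its types are chained together in a single pipeline.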

from collections import Counter, defaultdict, namedtuple, deque
 
# Named record type
LogEntry = namedtuple('LogEntry', 'timestamp level message')
 
# Simulated log stream
log_stream = deque([
    LogEntry('10:01', 'ERROR', 'Connection timeout'),
    LogEntry('10:02', 'INFO', 'Request processed'),
    LogEntry('10:03', 'ERROR', 'Connection timeout'),
    LogEntry('10:04', 'WARNING', 'High memory'),
    LogEntry('10:05', 'ERROR', 'Disk full'),
    LogEntry('10:06', 'INFO', 'Request processed'),
    LogEntry('10:07', 'ERROR', 'Connection timeout'),
], maxlen=100)
 
# Count error types
error_counts = Counter(
    entry.message for entry in log_stream if entry.level == 'ERROR'
)
print("Error types:", error_counts.most_common())
# [('Connection timeout', 3), ('Disk full', 1)]
 
# Group entries by level
by_level = defaultdict(list)
for entry in log_stream:
    by_level[entry.level].append(entry)
 
for level, entries in by_level.items():
    print(f"{level}: {len(entries)} entries")
# ERROR: 4 entries
# INFO: 2 entries
# WARNING: 1 entries

FAQ

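What is the Python collections module?

The collections module is part of the Python standard library. It provides specialized container data types that add capabilities on top of the built-in types (dict, list, tuple, set). The main classes are Counter, defaultdict, deque, namedtuple, OrderedDict, and ChainMap. Each one solves a specific class of data-handling problem more efficiently than the built-in types alone.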

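When should I use Counter, and when defaultdict(int)?

Use Counter when your main goal is counting or comparing frequency distributions: it provides most_common(), arithmetic operators (+, -, &, |), and can count an entire iterable in one step at construction time. Use defaultdict(int) when counting is only one part of a larger data structure pattern, or when you need a general-purpose dictionary with integer defaults.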
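
As a rough illustration of that split, the sketch below counts occurrences with Counter and accumulates arbitrary totals with defaultdict(int). The request data and host names are invented for the example:

```python
from collections import Counter, defaultdict

requests = [('web1', 512), ('web2', 2048), ('web1', 1024), ('web1', 256)]

# Counting occurrences: Counter does it in one call and can rank the result
hits = Counter(host for host, _ in requests)
print(hits.most_common(1))  # [('web1', 3)]

# Accumulating integer totals: defaultdict(int) is a plain dictionary
# whose missing keys start at 0
bytes_sent = defaultdict(int)
for host, size in requests:
    bytes_sent[host] += size
print(dict(bytes_sent))  # {'web1': 1792, 'web2': 2048}
```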

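Is deque thread-safe in Python?

Yes. In CPython, deque.append(), deque.appendleft(), deque.pop(), and deque.popleft() are atomic operations because of the GIL (Global Interpreter Lock), so a deque can serve as a thread-safe queue without additional locks. Note, however, that compound operations (such as a check-then-act sequence) still need explicit synchronization.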
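
The difference between a single operation and a compound one can be sketched as follows. The thread count, item count, and the pop_if_available helper are assumptions made for this example, not part of any library API:

```python
from collections import deque
from threading import Thread, Lock

events = deque()

def producer(worker_id, count):
    for i in range(count):
        events.append((worker_id, i))  # a single append: no lock needed

threads = [Thread(target=producer, args=(w, 1000)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(events))  # 4000

# Check-then-act spans two operations, so it is guarded with a lock
lock = Lock()

def pop_if_available():
    with lock:
        if events:
            return events.popleft()
        return None

print(pop_if_available() is not None)  # True
```

An alternative to the lock is to call popleft() directly and catch IndexError, which keeps the whole step a single operation.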

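What is the difference between namedtuple and dataclass?

namedtuple creates an immutable tuple subclass with named fields. It is lightweight, supports iteration and unpacking, and uses very little memory. dataclass (from the dataclasses module, Python 3.7+) creates a full class whose attributes are mutable by default, with support for methods, properties, and inheritance. Use namedtuple for simple immutable records; use dataclass when you need mutability, complex behavior, or richer type annotations.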

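Does OrderedDict still matter in Python 3.7+?

Yes, but mainly in two situations. First, OrderedDict equality comparison takes element order into account (OrderedDict(a=1, b=2) != OrderedDict(b=2, a=1)), whereas regular dict comparison does not. Second, OrderedDict provides move_to_end() for reordering elements, which is useful when implementing LRU caches and priority-based data structures. In most other cases a regular dict is sufficient and performs better.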

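How is ChainMap different from merging dictionaries?

ChainMap provides a lookup view over multiple dicts without copying any data: a lookup searches each dict in order, and changes to the underlying dicts are reflected in the ChainMap immediately. By contrast, {**d1, **d2} or d1 | d2 (Python 3.9+) creates a new dict and copies all the data. For large dictionaries, ChainMap uses less memory and preserves the layered structure, which makes it a good fit for configuration and scoping patterns.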
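
The view-versus-copy distinction is easy to see by changing an underlying dict after both objects exist. A minimal sketch with illustrative settings:

```python
from collections import ChainMap

defaults = {'port': 8080, 'debug': False}
overrides = {'debug': True}

view = ChainMap(overrides, defaults)  # references both dicts
merged = {**defaults, **overrides}    # one-time copy; later keys win

defaults['port'] = 9090  # change an underlying dict afterwards

print(view['port'])    # 9090 (the view sees the change)
print(merged['port'])  # 8080 (the copy does not)
```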

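Can collections types be used with type hints?

Yes. On Python 3.9+, you can declare a typed Counter with collections.Counter[str], a typed defaultdict with collections.defaultdict[str, list[int]], and a typed deque with collections.deque[int]. For namedtuple, typing.NamedTuple is preferred because it lets you write type annotations directly in the class definition. All of these types work with type checkers such as mypy.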
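
Put together, annotated code might look like the sketch below. It assumes Python 3.9 or later; the Reading record and the summarize function are hypothetical names for this example:

```python
from collections import Counter, defaultdict, deque
from typing import NamedTuple

class Reading(NamedTuple):
    sensor: str
    value: float

def summarize(
    readings: list[Reading],
) -> tuple[Counter[str], defaultdict[str, list[float]]]:
    """Count readings per sensor and collect their values."""
    counts: Counter[str] = Counter()
    values: defaultdict[str, list[float]] = defaultdict(list)
    for r in readings:
        counts[r.sensor] += 1
        values[r.sensor].append(r.value)
    return counts, values

recent: deque[Reading] = deque(maxlen=3)
recent.extend([Reading('a', 1.0), Reading('b', 2.5), Reading('a', 3.0)])

counts, values = summarize(list(recent))
print(counts)        # Counter({'a': 2, 'b': 1})
print(dict(values))  # {'a': [1.0, 3.0], 'b': [2.5]}
```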

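Summary

Python's collections module provides six specialized container types that eliminate common boilerplate: Counter replaces hand-written counting loops; defaultdict removes KeyError handling; deque provides efficient double-ended operations; namedtuple adds readable field names to tuples; OrderedDict handles order-sensitive comparison and reordering; and ChainMap manages layered dictionary lookups without copying data.

Each type is a better fit than the built-in containers for a specific problem. Knowing when to reach for them makes your Python code shorter, faster, and easier to maintain. The key is to match the data structure to the access pattern: counting (Counter), grouping (defaultdict), queues and stacks (deque), structured records (namedtuple), ordered operations (OrderedDict), and layered lookup (ChainMap).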