Python Collections Module: A Guide to Counter, defaultdict, deque, and namedtuple
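Python's built-in data structures (lists, dicts, tuples, and sets) cover most tasks. Once your code moves beyond toy examples, though, you start hitting their limits: counting elements means hand-writing a dictionary loop; grouping data scatters defensive if key not in dict checks through your code; using a list as a queue costs O(n) for every pop from the front; and representing structured records as plain tuples turns field access into an unreadable index-guessing game. Each workaround is small on its own, but they add up quickly, making code harder to read, slower to run, and more error-prone.
The collections module in Python's standard library addresses these pain points with a set of purpose-built container types: Counter counts in a single line; defaultdict removes KeyError handling by supplying default values automatically; deque makes appends and pops at both ends of a sequence O(1); namedtuple adds field names to tuples without the overhead of a full class; and OrderedDict and ChainMap handle order-sensitive operations and layered lookups, patterns that plain dicts cannot express cleanly.
This guide covers every major class in the collections module, with runnable code, performance analysis, and real-world usage. Whether you are processing log files, building a cache, managing layered configuration, or assembling a data pipeline, these containers can make your code shorter, faster, and more reliable.
Overview of the collections module
The collections module provides specialized container data types that extend Python's general-purpose built-in containers.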
import collections
# See all available classes
print([name for name in dir(collections) if not name.startswith('_')])
# ['ChainMap', 'Counter', 'OrderedDict', 'UserDict', 'UserList',
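# 'UserString', 'abc', 'defaultdict', 'deque', 'namedtuple']
| Class | Purpose | Replaces |
|---|---|---|
| Counter | Counts hashable objects | Hand-written dict counting loops |
| defaultdict | dict with automatic default values | dict.setdefault(), if key not in checks |
| deque | Double-ended queue with O(1) operations at both ends | list used as a queue or stack |
| namedtuple | tuple with named fields | Plain tuples, simple data classes |
| OrderedDict | dict that remembers insertion order | dict (before 3.7), order-dependent operations |
| ChainMap | Layered dictionary lookup | Manually merging dicts |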
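Counter: counting elements
Counter is a dict subclass for counting hashable objects. It maps each element to its number of occurrences and provides common methods for frequency analysis.
Creating a Counter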
from collections import Counter
# From an iterable
words = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
word_count = Counter(words)
print(word_count)
# Counter({'apple': 3, 'banana': 2, 'cherry': 1})
# From a string
letter_count = Counter('mississippi')
print(letter_count)
# Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
# From a dictionary
inventory = Counter({'shirts': 25, 'pants': 15, 'hats': 10})
# From keyword arguments
stock = Counter(laptops=5, monitors=12)
most_common() and frequency ordering
from collections import Counter
text = "to be or not to be that is the question"
words = Counter(text.split())
# Get the 3 most common words
print(words.most_common(3))
# [('to', 2), ('be', 2), ('or', 1)]
# Get all elements sorted by frequency
print(words.most_common())
# [('to', 2), ('be', 2), ('or', 1), ('not', 1), ('that', 1), ('is', 1), ('the', 1), ('question', 1)]
# Least common: reverse the list or slice from the end
print(words.most_common()[-3:])
# [('is', 1), ('the', 1), ('question', 1)]
Counter arithmetic
Counter supports addition, subtraction, intersection, and union, so you can use it as a multiset.
from collections import Counter
a = Counter(x=4, y=2, z=1)
b = Counter(x=1, y=3, z=5)
# Addition: combine counts
print(a + b) # Counter({'z': 6, 'x': 5, 'y': 5})
# Subtraction: drops zero and negative results
print(a - b) # Counter({'x': 3})
# Intersection (min of each)
print(a & b) # Counter({'y': 2, 'x': 1, 'z': 1})
# Union (max of each)
print(a | b) # Counter({'z': 5, 'x': 4, 'y': 3})
Practical Counter patterns
from collections import Counter
# Count log entries by severity level
log_entries = [
    "ERROR: disk full",
    "WARNING: high memory",
    "ERROR: disk full",
    "ERROR: timeout",
    "WARNING: high memory",
    "ERROR: disk full",
    "INFO: backup complete",
]
error_types = Counter(entry.split(":")[0].strip() for entry in log_entries)
print(error_types)
# Counter({'ERROR': 4, 'WARNING': 2, 'INFO': 1})
# Find unique elements (count == 1)
data = [1, 2, 3, 2, 1, 4, 5, 4]
unique = [item for item, count in Counter(data).items() if count == 1]
print(unique) # [3, 5]
# Check whether two words are anagrams (identical letter counts)
def is_anagram(word1, word2):
    return Counter(word1.lower()) == Counter(word2.lower())
print(is_anagram("listen", "silent")) # True
print(is_anagram("hello", "world")) # False
For a deeper look at Counter, see our dedicated Python Counter guide.
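Beyond the operators shown above, Counter also has in-place methods. The following is a minimal sketch using a made-up inventory (the item names are illustrative only). It shows how update() and subtract() differ from + and -, and how missing keys behave.

```python
from collections import Counter

inventory = Counter(apples=3, pears=1)

# update() adds counts in place (dict.update would overwrite the values instead)
inventory.update(["apples", "plums"])

# subtract() also works in place and, unlike the - operator,
# keeps zero and negative counts
inventory.subtract({"pears": 2})

# A missing key reads as 0, and the lookup does not insert it
missing = inventory["kiwis"]

# elements() repeats each key by its count and skips counts below 1
expanded = sorted(inventory.elements())

print(inventory["apples"], inventory["pears"], missing)  # 4 -1 0
print("kiwis" in inventory)                              # False
print(expanded)  # ['apples', 'apples', 'apples', 'apples', 'plums']
```

Because subtract() can leave negative counts behind, it suits running balances, while the - operator suits strict multiset differences.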
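defaultdict: automatic default values
defaultdict is a dict subclass that calls a factory function to supply a default value whenever a missing key is accessed with d[key], which removes KeyError handling and most defensive checks.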
Factory functions
from collections import defaultdict
# int factory: default is 0
counter = defaultdict(int)
counter['apples'] += 1
counter['oranges'] += 3
print(dict(counter)) # {'apples': 1, 'oranges': 3}
# list factory: default is []
groups = defaultdict(list)
pairs = [('fruit', 'apple'), ('veggie', 'carrot'), ('fruit', 'banana'), ('veggie', 'pea')]
for category, item in pairs:
    groups[category].append(item)
print(dict(groups))
# {'fruit': ['apple', 'banana'], 'veggie': ['carrot', 'pea']}
# set factory: default is set()
index = defaultdict(set)
words = [('file1', 'python'), ('file2', 'python'), ('file1', 'java'), ('file3', 'python')]
for filename, lang in words:
    index[lang].add(filename)
print(dict(index))
# {'python': {'file1', 'file2', 'file3'}, 'java': {'file1'}} (set element order may vary)
Grouping pattern
Grouping related items is the most common use of defaultdict(list). Compare it with the hand-written approach:
from collections import defaultdict
students = [
    ('Math', 'Alice'), ('Science', 'Bob'), ('Math', 'Charlie'),
    ('Science', 'Diana'), ('Math', 'Eve'), ('History', 'Frank'),
]
# Without defaultdict -- verbose and error-prone
groups_manual = {}
for subject, name in students:
    if subject not in groups_manual:
        groups_manual[subject] = []
    groups_manual[subject].append(name)
# With defaultdict -- clean and direct
groups = defaultdict(list)
for subject, name in students:
    groups[subject].append(name)
print(dict(groups))
# {'Math': ['Alice', 'Charlie', 'Eve'], 'Science': ['Bob', 'Diana'], 'History': ['Frank']}
Nested defaultdict
You can build multi-level data structures without initializing each level by hand.
from collections import defaultdict
# Two-level nested defaultdict
def nested_dict():
    return defaultdict(int)
sales = defaultdict(nested_dict)
sales['2025']['Q1'] = 150000
sales['2025']['Q2'] = 175000
sales['2026']['Q1'] = 200000
print(sales['2025']['Q1']) # 150000
print(sales['2024']['Q3']) # 0 (auto-created, no KeyError)
# Arbitrary depth nesting with a recursive factory
def deep_dict():
    return defaultdict(deep_dict)
config = deep_dict()
config['database']['primary']['host'] = 'localhost'
config['database']['primary']['port'] = 5432
config['database']['replica']['host'] = 'replica.local'
print(config['database']['primary']['host']) # localhost
Custom factory functions
from collections import defaultdict
# Lambda for custom defaults
scores = defaultdict(lambda: 100) # Every student starts with 100
scores['Alice'] -= 5
scores['Bob'] -= 10
print(scores['Charlie']) # 100 (new student gets default)
print(dict(scores)) # {'Alice': 95, 'Bob': 90, 'Charlie': 100}
# Named function for complex defaults
def default_user():
    return {'role': 'viewer', 'active': True, 'login_count': 0}
users = defaultdict(default_user)
users['alice']['role'] = 'admin'
print(users['bob']) # {'role': 'viewer', 'active': True, 'login_count': 0}
For more patterns, see the Python defaultdict guide.
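One behavior worth keeping in mind is that indexing a missing key stores the default, while get() and in do not. The sketch below uses a hypothetical page-visit tally to show the difference. It also shows how setting default_factory to None restores ordinary KeyError behavior once the structure is built.

```python
from collections import defaultdict

visits = defaultdict(int)
visits["home"] += 2

# Indexing a missing key calls the factory and stores the result
_ = visits["about"]
keys_after_index = sorted(visits)

# get() and "in" never call the factory, so they do not add keys
maybe = visits.get("contact")
has_contact = "contact" in visits

# Setting default_factory to None turns missing-key lookups back into KeyError
visits.default_factory = None
try:
    visits["pricing"]
    raised = False
except KeyError:
    raised = True

print(keys_after_index)    # ['about', 'home']
print(maybe, has_contact)  # None False
print(raised)              # True
```

Disabling the factory after the build phase is one way to stop read-only code from silently growing the dictionary.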
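deque: double-ended queue
deque (pronounced "deck") provides O(1) append and pop operations at both ends. For a list, pop(0) and insert(0, x) are O(n) because every remaining element has to be shifted. Whenever your workload frequently touches both ends of a sequence, deque is the right choice.
Core operations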
from collections import deque
d = deque([1, 2, 3, 4, 5])
# O(1) operations on both ends
d.append(6) # Add to right: [1, 2, 3, 4, 5, 6]
d.appendleft(0) # Add to left: [0, 1, 2, 3, 4, 5, 6]
right = d.pop() # Remove from right: 6
left = d.popleft() # Remove from left: 0
print(d) # deque([1, 2, 3, 4, 5])
# Extend from both sides
d.extend([6, 7]) # Right extend: [1, 2, 3, 4, 5, 6, 7]
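d.extendleft([-1, 0]) # Left extend (reversed): [0, -1, 1, 2, 3, 4, 5, 6, 7]
Bounded deque with maxlen
When maxlen is set, appending to a full deque automatically discards an element from the opposite end. This is ideal for sliding windows and caches.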
from collections import deque
# Keep only the last 5 items
recent = deque(maxlen=5)
for i in range(10):
    recent.append(i)
print(recent) # deque([5, 6, 7, 8, 9], maxlen=5)
# Sliding window average
def moving_average(iterable, window_size):
    window = deque(maxlen=window_size)
    for value in iterable:
        window.append(value)
        if len(window) == window_size:
            yield sum(window) / window_size
data = [10, 20, 30, 40, 50, 60, 70]
print(list(moving_average(data, 3)))
# [20.0, 30.0, 40.0, 50.0, 60.0]
Rotation
rotate(n) shifts elements n steps to the right; a negative n rotates to the left.
from collections import deque
d = deque([1, 2, 3, 4, 5])
d.rotate(2) # Rotate right by 2
print(d) # deque([4, 5, 1, 2, 3])
d.rotate(-3) # Rotate left by 3
print(d) # deque([2, 3, 4, 5, 1])
deque vs list performance comparison
from collections import deque
import time
# Benchmark: append/pop from left side
n = 100_000
# List: O(n) for each insert at position 0
start = time.perf_counter()
lst = []
for i in range(n):
    lst.insert(0, i)
list_time = time.perf_counter() - start
# Deque: O(1) for appendleft
start = time.perf_counter()
dq = deque()
for i in range(n):
    dq.appendleft(i)
deque_time = time.perf_counter() - start
print(f"List insert(0, x): {list_time:.4f}s")
print(f"Deque appendleft: {deque_time:.4f}s")
print(f"Deque is {list_time / deque_time:.0f}x faster")
# Typical output (exact timings vary by machine and Python version):
# List insert(0, x): 1.2340s
# Deque appendleft: 0.0065s
# Deque is 190x faster
| Operation | list | deque |
|---|---|---|
| append(x) (right) | O(1) amortized | O(1) |
| pop() (right) | O(1) | O(1) |
| insert(0, x) / appendleft(x) | O(n) | O(1) |
| pop(0) / popleft() | O(n) | O(1) |
| access by index [i] | O(1) | O(n) in the middle, O(1) at the ends |
| Memory per element | Lower | Slightly higher |
Use deque when you need fast operations at both ends; use list when you need fast random access by index.
For full coverage, see: Python deque.
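A common place where the O(1) popleft() matters is breadth-first search. The following is a minimal sketch, assuming a graph stored as an adjacency-list dict; the function name bfs_distances and the sample graph are hypothetical and only illustrate the queue pattern.

```python
from collections import deque

def bfs_distances(graph, start):
    """Return the number of edges from start to every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()          # O(1) removal from the front
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)  # O(1) insertion at the back
    return distances

# Hypothetical adjacency list used only for this illustration
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
}
result = bfs_distances(graph, "A")
print(result)  # {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 3}
```

With a list and pop(0), each dequeue would shift every remaining element, so the traversal could degrade to quadratic time on large graphs.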
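namedtuple: tuples with named fields
namedtuple creates a tuple subclass with named fields, which makes code more self-documenting while avoiding the overhead of defining a full class.
Creating a namedtuple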
from collections import namedtuple
# Define a type
Point = namedtuple('Point', ['x', 'y'])
p = Point(3, 4)
# Access by name or index
print(p.x) # 3
print(p[1]) # 4
print(p) # Point(x=3, y=4)
# Alternative field definition styles
Color = namedtuple('Color', 'red green blue') # Space-separated string
Config = namedtuple('Config', 'host, port, database') # Comma-separated string
Why use namedtuple instead of a plain tuple?
from collections import namedtuple
# Plain tuple: which index is what?
employee_tuple = ('Alice', 'Engineering', 95000, True)
print(employee_tuple[2]) # 95000 -- but what does index 2 mean?
# namedtuple: self-documenting
Employee = namedtuple('Employee', 'name department salary active')
employee = Employee('Alice', 'Engineering', 95000, True)
print(employee.salary) # 95000 -- immediately clear
print(employee.department) # Engineering
Key methods
from collections import namedtuple
Employee = namedtuple('Employee', 'name department salary')
emp = Employee('Alice', 'Engineering', 95000)
# _replace: create a new instance with some fields changed (immutable)
promoted = emp._replace(salary=110000)
print(promoted) # Employee(name='Alice', department='Engineering', salary=110000)
print(emp) # Employee(name='Alice', department='Engineering', salary=95000) -- unchanged
# _asdict: convert to a dict (a regular dict since Python 3.8; an OrderedDict before that)
print(emp._asdict())
# {'name': 'Alice', 'department': 'Engineering', 'salary': 95000}
# _fields: get field names
print(Employee._fields) # ('name', 'department', 'salary')
# _make: create from an iterable
data = ['Bob', 'Marketing', 85000]
emp2 = Employee._make(data)
print(emp2) # Employee(name='Bob', department='Marketing', salary=85000)
Default values
from collections import namedtuple
# defaults parameter (Python 3.7+); defaults apply to the rightmost fields
Connection = namedtuple('Connection', 'host port timeout', defaults=[5432, 30])
conn1 = Connection('localhost') # port=5432, timeout=30
conn2 = Connection('db.example.com', 3306) # timeout=30
conn3 = Connection('db.example.com', 3306, 60)
print(conn1) # Connection(host='localhost', port=5432, timeout=30)
print(conn2) # Connection(host='db.example.com', port=3306, timeout=30)
The typing.NamedTuple alternative
If you want type annotations and a more class-like syntax, use typing.NamedTuple:
from typing import NamedTuple
class Point(NamedTuple):
    x: float
    y: float
    label: str = "origin"
p = Point(3.0, 4.0, "A")
print(p.x, p.label) # 3.0 A
# Still a tuple -- supports unpacking, indexing, iteration
x, y, label = p
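print(f"({x}, {y})") # (3.0, 4.0)
namedtuple vs dataclass
| Feature | namedtuple | dataclass |
|---|---|---|
| Immutable by default | Yes | No (requires frozen=True) |
| Memory footprint | Same as a tuple (small) | Larger (regular class) |
| Iteration/unpacking | Supported (it is a tuple) | Not supported (unless you add methods) |
| Type annotations | Via typing.NamedTuple | Built in |
| Methods/properties | Requires subclassing | Supported directly |
| Inheritance | Limited | Full class inheritance |
| Best for | Lightweight data records | Complex mutable objects |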
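The rows on immutability and unpacking can be checked with a short side-by-side sketch. The class names PointNT and PointDC are hypothetical; the dataclass is declared with frozen=True so that both types reject attribute assignment.

```python
from collections import namedtuple
from dataclasses import FrozenInstanceError, astuple, dataclass

PointNT = namedtuple("PointNT", "x y")

@dataclass(frozen=True)
class PointDC:
    x: float
    y: float

nt = PointNT(1.0, 2.0)
dc = PointDC(1.0, 2.0)

# A namedtuple unpacks like any tuple
x, y = nt

# A dataclass is not iterable; astuple() converts it explicitly
dx, dy = astuple(dc)

# Both reject attribute assignment, with different exception types
try:
    nt.x = 5.0
    nt_error = None
except AttributeError as exc:
    nt_error = type(exc).__name__

try:
    dc.x = 5.0
    dc_error = None
except FrozenInstanceError as exc:
    dc_error = type(exc).__name__

# A namedtuple equals a plain tuple with the same values; a dataclass does not
print(nt == (1.0, 2.0), dc == (1.0, 2.0))  # True False
print(nt_error, dc_error)                  # AttributeError FrozenInstanceError
```

The equality difference is easy to overlook: two namedtuples of different types with the same values compare equal, which may or may not be what you want.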
OrderedDict: ordered dictionary operations
Since Python 3.7, regular dicts preserve insertion order. So what value does OrderedDict still offer?
Where OrderedDict still matters
from collections import OrderedDict
# 1. Equality considers order
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}
print(d1 == d2) # True -- regular dicts ignore order in comparison
od1 = OrderedDict([('a', 1), ('b', 2)])
od2 = OrderedDict([('b', 2), ('a', 1)])
print(od1 == od2) # False -- OrderedDict considers order
# 2. move_to_end() for reordering
od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
od.move_to_end('a') # Move 'a' to the end
print(list(od.keys())) # ['b', 'c', 'a']
od.move_to_end('c', last=False) # Move 'c' to the beginning
print(list(od.keys())) # ['c', 'b', 'a']
Building an LRU cache with OrderedDict
from collections import OrderedDict
class LRUCache:
    def __init__(self, capacity):
        self.cache = OrderedDict()
        self.capacity = capacity
    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key) # Mark as recently used
        return self.cache[key]
    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False) # Remove oldest
cache = LRUCache(3)
cache.put('a', 1)
cache.put('b', 2)
cache.put('c', 3)
cache.get('a') # Access 'a', moves it to end
cache.put('d', 4) # Evicts 'b' (least recently used)
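print(list(cache.cache.keys())) # ['c', 'a', 'd']
ChainMap: layered dictionary lookup
ChainMap combines multiple dicts into a single view for lookups. It searches each dict in order and returns the first match, which makes it a good fit for layered configuration, scoped variable lookup, and context management.
Basic usage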
from collections import ChainMap
defaults = {'theme': 'light', 'language': 'en', 'timeout': 30}
user_prefs = {'theme': 'dark'}
session = {'language': 'fr'}
config = ChainMap(session, user_prefs, defaults)
# Lookup searches session -> user_prefs -> defaults
print(config['theme']) # 'dark' (from user_prefs)
print(config['language']) # 'fr' (from session)
print(config['timeout']) # 30 (from defaults)
Configuration layering
from collections import ChainMap
import os
# Real-world config pattern: CLI args > env vars > config file > defaults
defaults = {
    'debug': False,
    'log_level': 'WARNING',
    'port': 8080,
    'host': '0.0.0.0',
}
config_file = {
    'log_level': 'INFO',
    'port': 9090,
}
# Note: values read from os.environ are always strings
env_vars = {
    k.lower(): v for k, v in os.environ.items()
    if k.lower() in defaults
}
cli_args = {'debug': True} # Parsed from argparse
config = ChainMap(cli_args, env_vars, config_file, defaults)
# The outputs below assume none of these keys are set in the environment
print(config['debug']) # True (from cli_args)
print(config['log_level']) # 'INFO' (from config_file)
print(config['host']) # '0.0.0.0' (from defaults)
Scoped contexts with new_child()
from collections import ChainMap
# Simulating variable scoping (like nested function scopes)
global_scope = {'x': 1, 'y': 2}
local_scope = ChainMap(global_scope)
# Enter a new scope
inner_scope = local_scope.new_child()
inner_scope['x'] = 10 # Shadows global x
inner_scope['z'] = 30 # New local variable
print(inner_scope['x']) # 10 (local)
print(inner_scope['y']) # 2 (falls through to global)
print(inner_scope['z']) # 30 (local)
# Exit scope -- original is unchanged
print(local_scope['x']) # 1 (global still intact)
Comparison of all collection types
| Type | Base Class | Mutable | Use case | Key advantage |
|---|---|---|---|---|
| Counter | dict | Yes | Counting | most_common(), multiset operations |
| defaultdict | dict | Yes | Auto-initializing missing keys | No KeyError, factory function |
| deque | -- | Yes | Double-ended queue | O(1) at both ends, maxlen |
| namedtuple | tuple | No | Structured data records | Named field access, lightweight |
| OrderedDict | dict | Yes | Order-sensitive dict | move_to_end(), order-aware equality |
| ChainMap | -- | Yes | Layered lookup | Configuration layering, scoped contexts |
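The Base Class column can be verified directly with issubclass(). The sketch below does that; the Point type is a hypothetical record created only for the check. It also shows that deque and ChainMap, which do not inherit from a built-in container, still satisfy the abstract interfaces in collections.abc.

```python
from collections import ChainMap, Counter, OrderedDict, defaultdict, deque, namedtuple
from collections.abc import MutableMapping, MutableSequence

Point = namedtuple("Point", "x y")  # hypothetical record type for the check

checks = {
    "Counter is a dict": issubclass(Counter, dict),
    "defaultdict is a dict": issubclass(defaultdict, dict),
    "OrderedDict is a dict": issubclass(OrderedDict, dict),
    "namedtuple type is a tuple": issubclass(Point, tuple),
    "deque is a list": issubclass(deque, list),
    "deque is a MutableSequence": issubclass(deque, MutableSequence),
    "ChainMap is a dict": issubclass(ChainMap, dict),
    "ChainMap is a MutableMapping": issubclass(ChainMap, MutableMapping),
}
for claim, holds in checks.items():
    print(f"{claim}: {holds}")
```

This matters in practice when code uses isinstance(obj, dict): a Counter or defaultdict passes that test, while a ChainMap does not.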
Performance benchmarks
Counter vs manual counting
from collections import Counter, defaultdict
import time
data = list(range(1000)) * 1000 # 1 million items, 1000 unique
# Method 1: Counter
start = time.perf_counter()
c = Counter(data)
counter_time = time.perf_counter() - start
# Method 2: defaultdict(int)
start = time.perf_counter()
dd = defaultdict(int)
for item in data:
    dd[item] += 1
dd_time = time.perf_counter() - start
# Method 3: Manual dict
start = time.perf_counter()
manual = {}
for item in data:
    manual[item] = manual.get(item, 0) + 1
manual_time = time.perf_counter() - start
print(f"Counter: {counter_time:.4f}s")
print(f"defaultdict(int):{dd_time:.4f}s")
print(f"dict.get(): {manual_time:.4f}s")
# Typical: Counter ~0.03s, defaultdict ~0.07s, dict.get() ~0.09s (varies by machine and Python version)
deque vs list for queue operations
from collections import deque
import time
n = 100_000
# Simulate a FIFO queue: append right, pop left
# List
start = time.perf_counter()
q = list(range(n))
while q:
    q.pop(0)
list_queue_time = time.perf_counter() - start
# Deque
start = time.perf_counter()
q = deque(range(n))
while q:
    q.popleft()
deque_queue_time = time.perf_counter() - start
print(f"List pop(0): {list_queue_time:.4f}s")
print(f"Deque popleft(): {deque_queue_time:.4f}s")
print(f"Deque is {list_queue_time / deque_queue_time:.0f}x faster")
# Typical: List ~2.5s, Deque ~0.004s -> ~600x faster (varies by machine and Python version)
Real-world examples
Log analysis with Counter
from collections import Counter
# Parse and analyze server logs
log_lines = [
    "2026-02-18 10:15:03 GET /api/users 200",
    "2026-02-18 10:15:04 POST /api/login 401",
    "2026-02-18 10:15:05 GET /api/users 200",
    "2026-02-18 10:15:06 GET /api/products 500",
    "2026-02-18 10:15:07 POST /api/login 200",
    "2026-02-18 10:15:08 GET /api/users 200",
    "2026-02-18 10:15:09 GET /api/products 500",
    "2026-02-18 10:15:10 POST /api/login 401",
]
# Count status codes
status_codes = Counter(line.split()[-1] for line in log_lines)
print("Status codes:", status_codes.most_common())
# [('200', 4), ('401', 2), ('500', 2)]
# Count endpoints
endpoints = Counter(line.split()[3] for line in log_lines)
print("Top endpoints:", endpoints.most_common(2))
# [('/api/users', 3), ('/api/login', 3)]
# Count error endpoints (status >= 400)
errors = Counter(
    line.split()[3] for line in log_lines
    if int(line.split()[-1]) >= 400
)
print("Error endpoints:", errors)
# Counter({'/api/login': 2, '/api/products': 2})
Configuration management with ChainMap
from collections import ChainMap
import json
# Multi-layer config system for a web application
def load_config(config_path=None, cli_overrides=None):
    # Layer 1: Hard-coded defaults
    defaults = {
        'host': '127.0.0.1',
        'port': 8000,
        'debug': False,
        'db_pool_size': 5,
        'log_level': 'WARNING',
        'cors_origins': ['http://localhost:3000'],
    }
    # Layer 2: Config file
    file_config = {}
    if config_path:
        with open(config_path) as f:
            file_config = json.load(f)
    # Layer 3: CLI overrides (highest priority)
    cli = cli_overrides or {}
    # ChainMap searches cli -> file_config -> defaults
    return ChainMap(cli, file_config, defaults)
# Usage
config = load_config(cli_overrides={'debug': True, 'port': 9000})
print(config['debug']) # True (CLI override)
print(config['port']) # 9000 (CLI override)
print(config['db_pool_size']) # 5 (default)
print(config['log_level']) # WARNING (default)
A recent-items cache with deque
from collections import deque
class RecentItemsTracker:
    """Track the N most recent unique items."""
    def __init__(self, max_items=10):
        self.items = deque(maxlen=max_items)
        self.seen = set()
    def add(self, item):
        if item in self.seen:
            # Move to the most-recent end by removing and re-adding (remove() is O(n))
            self.items.remove(item)
            self.items.append(item)
        else:
            if len(self.items) == self.items.maxlen:
                # Remove the oldest item from the set too
                oldest = self.items[0]
                self.seen.discard(oldest)
            self.items.append(item)
            self.seen.add(item)
    def get_recent(self):
        return list(reversed(self.items))
# Track recently viewed products
tracker = RecentItemsTracker(max_items=5)
for product in ['shoes', 'shirt', 'hat', 'shoes', 'jacket', 'belt', 'hat']:
    tracker.add(product)
print(tracker.get_recent())
# ['hat', 'belt', 'jacket', 'shoes', 'shirt']
Building a data pipeline with namedtuple
from collections import namedtuple, Counter, defaultdict
# Define structured records
Transaction = namedtuple('Transaction', 'id customer product amount date')
transactions = [
    Transaction(1, 'Alice', 'Widget', 29.99, '2026-02-01'),
    Transaction(2, 'Bob', 'Gadget', 49.99, '2026-02-01'),
    Transaction(3, 'Alice', 'Widget', 29.99, '2026-02-03'),
    Transaction(4, 'Charlie', 'Gadget', 49.99, '2026-02-05'),
    Transaction(5, 'Alice', 'Gizmo', 19.99, '2026-02-07'),
    Transaction(6, 'Bob', 'Widget', 29.99, '2026-02-08'),
]
# Most popular products
product_count = Counter(t.product for t in transactions)
print("Popular products:", product_count.most_common())
# [('Widget', 3), ('Gadget', 2), ('Gizmo', 1)]
# Revenue by customer
revenue = defaultdict(float)
for t in transactions:
    revenue[t.customer] += t.amount
print("Revenue:", dict(revenue))
# Approximately {'Alice': 79.97, 'Bob': 79.98, 'Charlie': 49.99} (float sums may show small rounding errors)
# Convert to DataFrame for visualization
import pandas as pd
df = pd.DataFrame(transactions, columns=Transaction._fields)
print(df.groupby('customer')['amount'].sum())
Visualizing collection data with PyGWalker
Once you have processed data with Counter, defaultdict, or namedtuple, you often need to visualize the results. PyGWalker turns any pandas DataFrame into an interactive, Tableau-like visualization interface that runs inside Jupyter notebooks:
from collections import Counter
import pandas as pd
import pygwalker as pyg
# Process data with collections
log_data = ["ERROR", "WARNING", "ERROR", "INFO", "ERROR", "WARNING", "INFO", "INFO"]
counts = Counter(log_data)
# Convert to DataFrame
df = pd.DataFrame(counts.items(), columns=['Level', 'Count'])
# Launch interactive visualization
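walker = pyg.walk(df)
It supports dragging and dropping fields, building charts, filtering data, and interactively exploring distributions and patterns, all without hand-written visualization code. It is especially helpful after you have processed a large dataset and aggregated statistics with Counter or defaultdict, because it lets you understand the shape of the data faster.
If you want to run these collections experiments interactively, RunCell provides an AI-powered Jupyter environment where you can iterate on data processing pipelines with immediate feedback.
Combining multiple collection types
The real power of collections often shows when several types are chained together in a single pipeline.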
from collections import Counter, defaultdict, namedtuple, deque
# Named record type
LogEntry = namedtuple('LogEntry', 'timestamp level message')
# Simulated log stream
log_stream = deque([
    LogEntry('10:01', 'ERROR', 'Connection timeout'),
    LogEntry('10:02', 'INFO', 'Request processed'),
    LogEntry('10:03', 'ERROR', 'Connection timeout'),
    LogEntry('10:04', 'WARNING', 'High memory'),
    LogEntry('10:05', 'ERROR', 'Disk full'),
    LogEntry('10:06', 'INFO', 'Request processed'),
    LogEntry('10:07', 'ERROR', 'Connection timeout'),
], maxlen=100)
# Count error types
error_counts = Counter(
    entry.message for entry in log_stream if entry.level == 'ERROR'
)
print("Error types:", error_counts.most_common())
# [('Connection timeout', 3), ('Disk full', 1)]
# Group entries by level
by_level = defaultdict(list)
for entry in log_stream:
    by_level[entry.level].append(entry)
for level, entries in by_level.items():
    print(f"{level}: {len(entries)} entries")
# ERROR: 4 entries
# INFO: 2 entries
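# WARNING: 1 entries
FAQ
What is the Python collections module?
The collections module is part of Python's standard library. It provides specialized container data types that add capabilities on top of the built-in types (dict, list, tuple, set). The main classes are Counter, defaultdict, deque, namedtuple, OrderedDict, and ChainMap. Each one solves a specific class of data-handling problem more efficiently than relying on the built-in types alone.
When should I use Counter, and when defaultdict(int)?
Use Counter when your main goal is counting or comparing frequency distributions: it provides most_common(), the arithmetic operators (+, -, &, |), and can count an entire iterable at construction time. Use defaultdict(int) when counting is only one part of a larger data-structure pattern, or when you need a general-purpose dictionary with an integer default.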
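A small sketch makes the practical difference concrete. The sample sentence is arbitrary. Both containers end up with the same counts, but they differ in how they are built and in what happens when a missing key is read.

```python
from collections import Counter, defaultdict

words = "the cat and the hat".split()

counted = Counter(words)   # counts the whole iterable at construction
tallied = defaultdict(int)
for word in words:         # defaultdict(int) needs an explicit loop
    tallied[word] += 1

# Both hold the same counts
same_counts = dict(counted) == dict(tallied)

# Reading a missing key: Counter returns 0 without storing it,
# while defaultdict(int) returns 0 and inserts the key
counted["dog"]
tallied["dog"]

print(same_counts, "dog" in counted, "dog" in tallied)  # True False True
print(counted.most_common(1))                           # [('the', 2)]
```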
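Is deque thread-safe in Python?
Yes, for individual operations. In CPython, deque.append(), deque.appendleft(), deque.pop(), and deque.popleft() are atomic thanks to the GIL (Global Interpreter Lock), and the official documentation describes appends and pops from either end as thread-safe. A deque can therefore serve as a thread-safe queue without extra locking. Note, however, that compound operations (for example, a check-then-act sequence) still require explicit synchronization.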
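The distinction between single operations and compound operations can be sketched as follows. The producer and consumer functions, the thread counts, and the item ranges are all hypothetical. Producers append without a lock, while consumers guard the check-then-pop sequence with a threading.Lock.

```python
import threading
from collections import deque

queue = deque()
lock = threading.Lock()
consumed = []

def producer(start, count):
    for i in range(start, start + count):
        queue.append(i)  # a single append needs no extra lock

def consumer():
    while True:
        # Check-then-act: the emptiness test and popleft() must happen together
        with lock:
            if not queue:
                return
            consumed.append(queue.popleft())

producers = [threading.Thread(target=producer, args=(n * 1000, 1000)) for n in range(4)]
for t in producers:
    t.start()
for t in producers:
    t.join()

produced_total = len(queue)

consumers = [threading.Thread(target=consumer) for _ in range(3)]
for t in consumers:
    t.start()
for t in consumers:
    t.join()

print(produced_total, len(consumed), len(queue))  # 4000 4000 0
```

An alternative to the lock is to call popleft() unconditionally and catch IndexError, which turns the check and the action into a single step. For blocking producer/consumer workflows, queue.Queue from the standard library is usually the more convenient tool.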
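What is the difference between namedtuple and dataclass?
namedtuple creates an immutable tuple subclass with named fields. It is lightweight, supports iteration and unpacking, and uses little memory. dataclass (from the dataclasses module, Python 3.7+) creates a full class whose attributes are mutable by default and that supports methods, properties, and inheritance. Use namedtuple for simple immutable records; use dataclass when you need mutability, complex behavior, or richer type annotations.
Does OrderedDict still matter in Python 3.7+?
Yes, but mainly in two situations. First, equality comparisons between OrderedDicts take order into account (OrderedDict(a=1, b=2) != OrderedDict(b=2, a=1)), whereas regular dict comparisons do not. Second, OrderedDict provides move_to_end() for reordering elements, which is useful when implementing LRU caches and priority-based data structures. In most other cases, a regular dict is sufficient and performs better.
How is ChainMap different from merging dictionaries?
ChainMap provides a lookup view over multiple dicts without copying data: lookups search each dict in order, and changes to the underlying dicts are reflected in the ChainMap immediately. By contrast, {**d1, **d2} or d1 | d2 creates a new dict and copies all the data. For large dictionaries, ChainMap uses less memory and preserves the layered structure, which makes it well suited to configuration and scoping patterns.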
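The view-versus-copy difference is easiest to see by changing an underlying dict after both objects exist. This is a minimal sketch with made-up settings. It also shows that writes and deletes through a ChainMap only touch the first mapping.

```python
from collections import ChainMap

defaults = {"theme": "light", "timeout": 30}
overrides = {"theme": "dark"}

view = ChainMap(overrides, defaults)  # no copying; keeps references to both dicts
merged = {**defaults, **overrides}    # new dict; later mappings win

# A later change to an underlying dict shows up in the view only
defaults["timeout"] = 60
print(view["timeout"], merged["timeout"])  # 60 30

# Writes through the ChainMap go to the first mapping only
view["language"] = "fr"
print(overrides)               # {'theme': 'dark', 'language': 'fr'}
print("language" in defaults)  # False

# Deleting the override exposes the value from the next layer
del view["theme"]
print(view["theme"])           # light
```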
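Can the collections types be used with type hints?
Yes. On Python 3.9 and later, you can write collections.Counter[str] for a typed Counter, collections.defaultdict[str, list[int]] for a typed defaultdict, and collections.deque[int] for a typed deque. For namedtuple, typing.NamedTuple is preferred because it lets you write type annotations directly in the class definition. All of these types work with type checkers such as mypy.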
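As an illustration, the sketch below annotates a small function with these generic forms. It assumes Python 3.9 or later; the Reading record and the summarize function are hypothetical names chosen for the example.

```python
from collections import Counter, defaultdict, deque
from typing import NamedTuple

class Reading(NamedTuple):  # hypothetical record type
    sensor: str
    value: float

def summarize(
    readings: list[Reading],
) -> tuple[Counter[str], defaultdict[str, list[float]], deque[Reading]]:
    # How many readings each sensor produced
    per_sensor: Counter[str] = Counter(r.sensor for r in readings)
    # All values grouped by sensor
    values: defaultdict[str, list[float]] = defaultdict(list)
    for r in readings:
        values[r.sensor].append(r.value)
    # Only the two most recent readings
    latest: deque[Reading] = deque(readings, maxlen=2)
    return per_sensor, values, latest

counts, values, latest = summarize(
    [Reading("a", 1.0), Reading("b", 2.5), Reading("a", 3.0)]
)
print(counts["a"], values["a"], len(latest))  # 2 [1.0, 3.0] 2
```

The annotations have no effect at runtime; they exist for readers and for static checkers.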
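Summary
Python's collections module provides six specialized container types that eliminate common boilerplate patterns: Counter replaces hand-written counting loops; defaultdict removes KeyError handling; deque provides efficient operations at both ends; namedtuple adds readable field names to tuples; OrderedDict handles order-sensitive comparison and reordering; and ChainMap manages layered dictionary lookups without copying data.
Each type fits a particular problem better than the built-in containers do. Knowing when to use them will make your Python code shorter, faster, and easier to maintain. The key is to match the data structure to the access pattern: counting (Counter), grouping (defaultdict), queues/stacks (deque), structured records (namedtuple), ordered operations (OrderedDict), and layered lookups (ChainMap).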