FluentPythonCh14

 20th April 2019 at 8:05am

This section is really well written; if this material has gotten rusty, it is worth coming back to for a review.

Why Sequences Are Iterable: The iter Function

import re
import reprlib

RE_WORD = re.compile(r'\w+')

class Sentence:
    def __init__(self, text):
        self.text = text
        self.words = RE_WORD.findall(text)

    def __getitem__(self, index):
        return self.words[index]   

    def __len__(self):   
        return len(self.words)

    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)

When the Python interpreter needs to iterate over an object x, it first calls iter(x) to obtain an iterator. iter then:

  1. Checks whether the object implements __iter__, and calls that to obtain an iterator.
  2. If __iter__ is not implemented, but __getitem__ is implemented, Python creates an iterator that attempts to fetch items in order, starting from index 0 (zero).
  3. If that fails, Python raises TypeError, usually saying “C object is not iterable,” where C is the class of the target object.

The __getitem__ fallback, however, exists only for backward compatibility and is not encouraged (though it has not been deprecated either). Python's built-in sequence types all implement __iter__. The __subclasshook__ of the Iterable ABC only checks for an __iter__ method and ignores __getitem__, so the Sentence class above fails the issubclass(Sentence, abc.Iterable) test. If you want to know whether an object is iterable, the best practice as of Python 3.4 is to call iter(x) and see whether it raises TypeError.
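
A quick sketch of both checks, assuming the Sentence class defined above (the sample text and the is_iterable helper are just for illustration):

from collections import abc

s = Sentence('The time has come')
print(issubclass(Sentence, abc.Iterable))   # False: Sentence has no __iter__
print(iter(s))                              # works anyway, via the __getitem__ fallback
print(list(s))                              # ['The', 'time', 'has', 'come']

def is_iterable(x):
    # Recommended check: try to get an iterator and catch TypeError.
    try:
        iter(x)
        return True
    except TypeError:
        return False

print(is_iterable(s), is_iterable(42))      # True False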

Iterables Versus Iterators

iterable
Any object from which the iter built-in function can obtain an iterator. Objects implementing an __iter__ method returning an iterator are iterable. Sequences are always iterable; as are objects implementing a __getitem__ method that takes 0-based indexes.

Python obtains iterators from iterables.

The two snippets below demonstrate Python's iteration machinery:

>>> s = 'ABC'
>>> for char in s:
...     print(char)
...
A
B
C

# the code above is equivalent to below
>>> s = 'ABC'
>>> it = iter(s)
>>> while True:
...     try:
...         print(next(it))
...     except StopIteration:
...         del it
...         break
...
A
B
C

The Iterator ABC defines two methods, __next__ and __iter__. __next__ is invoked by the built-in next(it); it must return the next item or raise StopIteration. __iter__ usually just returns self, so that iter(it) is it and an iterator is itself iterable.

iterator
Any object that implements the __next__ no-argument method that returns the next item in a series or raises StopIteration when there are no more items. Python iterators also implement the __iter__ method so they are iterable as well.

Sentence Take #2: A Classic Iterator

Reworking the example above to introduce an explicit iterator class:

import re
import reprlib

RE_WORD = re.compile(r'\w+')

class Sentence:
    def __init__(self, text):
        self.text = text
        self.words = RE_WORD.findall(text)
        
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)
        
    def __iter__(self):   
        return SentenceIterator(self.words)   
        
class SentenceIterator:
    def __init__(self, words):
        self.words = words   
        self.index = 0   
        
    def __next__(self):
        try:
            word = self.words[self.index]   
        except IndexError:
            raise StopIteration()   
        self.index += 1   
        return word   
        
    def __iter__(self):   
        return self

This is a fairly verbose example, written to set the stage for the generator mechanism that follows.

Sentence Take #3: A Generator Function

import re
import reprlib

RE_WORD = re.compile(r'\w+')

class Sentence:
    def __init__(self, text):
        self.text = text
        self.words = RE_WORD.findall(text)
        
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)
        
    def __iter__(self):
        for word in self.words:   
            yield word

A function whose body contains the yield keyword is a generator function. The way a generator function works:

A generator function builds a generator object that wraps the body of the function. When we invoke next(…) on the generator object, execution advances to the next yield in the function body, and the next(…) call evaluates to the value yielded when the function body is suspended. Finally, when the function body returns, the enclosing generator object raises StopIteration, in accordance with the Iterator protocol.
Calling a generator function returns a generator. A generator yields or produces values. A generator doesn’t “return” values in the usual way: the return statement in the body of a generator function causes StopIteration to be raised by the generator object.
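
A minimal sketch of that lifecycle (gen_123 is just an illustrative name):

def gen_123():
    # Any function with yield in its body is a generator function.
    yield 1
    yield 2
    yield 3   # after this the body returns, so the generator raises StopIteration

g = gen_123()       # calling it builds a generator object; no body code has run yet
print(next(g))      # 1 -- runs up to the first yield, then suspends
print(next(g))      # 2
print(next(g))      # 3
# A further next(g) raises StopIteration, per the iterator protocol.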

Sentence Take #4: A Lazy Implementation

The previous examples are not lazy, because the whole words list is built in advance. When you only need the first few words, materializing the entire list can be a needless cost in time and memory. Here is a lazy implementation:

import re
import reprlib

RE_WORD = re.compile(r'\w+')

class Sentence:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)

    def __iter__(self):
        for match in RE_WORD.finditer(self.text):
            yield match.group()

Generator functions are convenient, but generator expressions are even more so.

Sentence Take #5: A Generator Expression

An easy way to understand generator expressions:

A generator expression can be understood as a lazy version of a list comprehension: it does not eagerly build a list, but returns a generator that will lazily produce the items on demand. In other words, if a list comprehension is a factory of lists, a generator expression is a factory of generators.
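
A small sketch of that laziness difference; gen_AB here is just a throwaway generator that prints when it runs:

def gen_AB():
    print('start')
    yield 'A'
    print('continue')
    yield 'B'
    print('end.')

eager = [x * 3 for x in gen_AB()]    # list comprehension: gen_AB runs to completion right now
lazy = (x * 3 for x in gen_AB())     # generator expression: nothing has executed yet
for item in lazy:                    # only iterating the genexp drives gen_AB forward
    print('-->', item)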

In the simplified code, the generator is no longer returned by a generator function; it is produced by a generator expression:

import re
import reprlib

RE_WORD = re.compile(r'\w+')

class Sentence:
    def __init__(self, text):
        self.text = text
        
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)
        
    def __iter__(self):
        return (match.group() for match in RE_WORD.finditer(self.text))

Another Example: Arithmetic Progression Generator

The previous examples all iterate over items in a collection; now the author gives some other examples, such as a range-like class:

class ArithmeticProgression:
    def __init__(self, begin, step, end=None):   
        self.begin = begin
        self.step = step
        self.end = end  # None -> "infinite" series
        
    def __iter__(self):
        result = type(self.begin + self.step)(self.begin)      # [1]
        forever = self.end is None   
        index = 0
        while forever or result < self.end:   
            yield result   
            index += 1
            result = self.begin + self.step * index            # [2]

This code is not very complex, but the author wrote it so carefully and rigorously that I want to note it down. The author's annotations on the code:

At [1], the author explains in detail why the implicit type coercion is done, and how he asked the mailing list for an idiomatic way to write it:

This line produces a result value equal to self.begin, but coerced to the type of the subsequent additions. In Python 2, there was a coerce() built-in function but it’s gone in Python 3, deemed unnecessary because the numeric coercion rules are implicit in the arithmetic operator methods. So the best way I could think of to coerce the initial value to be of the same type as the rest of the series was to perform the addition and use its type to convert the result. I asked about this in the Python-list and got an excellent response from Steven D’Aprano.
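
A quick sketch of what [1] buys, reusing the ArithmeticProgression class above: the first item comes out with the same type as the later sums.

from fractions import Fraction
from decimal import Decimal

print(list(ArithmeticProgression(0, 1, 3)))          # [0, 1, 2] -- ints stay ints
print(list(ArithmeticProgression(0, .5, 3)))         # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5] -- begin coerced to float
print(list(ArithmeticProgression(0, Fraction(1, 3), 1)))              # Fractions throughout
print(list(ArithmeticProgression(0, Decimal('.1'), Decimal('.3'))))   # Decimals throughout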

At [2], the author deliberately maintains an index variable to keep floating-point errors from accumulating:

[...], instead of simply incrementing the result with self.step iteratively, I opted to use an index variable and calculate each result by adding self.begin to self.step multiplied by index to reduce the cumulative effect of errors when working with floats.
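
A rough sketch of the error being avoided; the step and count here are arbitrary:

step, n = 0.1, 100_000

total = 0.0
for _ in range(n):
    total += step          # rounding error compounds on every addition

recomputed = 0.0 + step * n    # the book's style: derive each value from the integer index

print(total)        # drifts away from 10000.0 after n additions
print(recomputed)   # a single multiplication keeps the error to one rounding step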

Generator Functions in the Standard Library

The author groups the generator functions scattered around the standard library by purpose; this part is worth flipping through. For concrete examples, see the book's code and the official documentation of the itertools module.

itertools feels hard to master; the Further Reading section points to some extra material, including recipes worth consulting.

This material has a strong functional-programming flavor.
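
For a taste, here is roughly the book's aritprog_gen, which rebuilds the arithmetic progression on top of two itertools generator functions, count and takewhile:

import itertools

def aritprog_gen(begin, step, end=None):
    first = type(begin + step)(begin)           # same coercion trick as before
    ap_gen = itertools.count(first, step)       # count: an endless arithmetic progression
    if end is not None:
        ap_gen = itertools.takewhile(lambda n: n < end, ap_gen)
    return ap_gen

print(list(aritprog_gen(0, 0.5, 3)))   # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]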

New Syntax in Python 3.3: yield from

>>> def chain(*iterables):
...     for i in iterables:
...         yield from i
...
>>> s = 'ABC'
>>> t = tuple(range(3))
>>> list(chain(s, t))
['A', 'B', 'C', 0, 1, 2]

yield from looks like syntactic sugar, but it is more than that. Chapter 16, on coroutines, comes back to it.

A Closer Look at the iter Function

iter has one more special form: it can take two arguments. The first is a callable that must accept no arguments; the second is a sentinel. The callable is invoked repeatedly, and as soon as it returns the sentinel value, the iterator raises StopIteration and the iteration ends:

with open('mydata.txt') as fp:
    for line in iter(fp.readline, ''):
        process_line(line)

The loop ends once fp.readline returns an empty string, which happens at the end of the file (a blank line yields '\n', not '', so it does not stop the loop).
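
The same two-argument form, roughly the book's dice example: keep calling a no-argument callable until it returns the sentinel (the sentinel itself is never yielded):

from random import randint

def d6():
    return randint(1, 6)

for roll in iter(d6, 1):    # rolls the die until a 1 comes up; the 1 is swallowed
    print(roll)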

Soapbox

Discusses whether generator functions should have gotten a keyword of their own. Guido dislikes adding new keywords, but the author thinks one is warranted here, because generator functions are hard to refactor:

def f(): 
    x = 0
    while True:
        x += 1
        yield x

# If you want to reuse the yield logic by factoring it out, this approach is broken
def f():
    def do_yield(n):
        yield n
    x = 0
    while True:
        x += 1
        do_yield(x)

# You could write yield from do_yield(x) instead, but that departs from the usual function-call semantics