[译]Let’s Build A Simple Interpreter. Part 2.

原文地址：Let’s Build A Simple Interpreter. Part 2.

在《The 5 Elements of Effective Thinking》¹这本令人惊叹的书中，作者 Burger 和 Starbird 分享了一个故事，讲述了他们如何观察国际知名小号演奏家 Tony Plog 为有造诣的小号演奏家举办大师班。学生们首先演奏了复杂的音乐片段，他们演奏的很好。但随后他们被要求演奏非常基础、简单的音符。当他们演奏这些音符时，和前面演奏的复杂片段相比这些音符显得非常幼稚。当他们演奏完，大师也演奏了相同的音符，但他演奏时，这些音符听上去并不幼稚。差别很惊人。Tony 解释说，掌握简单音符的演奏会使得对复杂作品有更好的控制。这一课要传递的想法很清楚──要掌握真正的高超技艺，就必须专注于掌握简单基础的概念。

故事中的教训显然不仅适用于音乐，也适用于软件开发。这个故事很好地提醒我们所有人，不要忽视深入研究简单、基础想法的重要性，即使这有时感觉像是退步。不仅精通所使用的工具或框架很重要，知道它们背后的原理同样非常重要。正如 Ralph Waldo Emerson 所说：

If you learn only methods, you’ll be tied to your methods. But if you learn principles, you can devise your own methods.

让我们以此为指导，再次深入到解释器和编译器当中。

今天我会向你展示第一部分中计算器的一个新版本，它可以做到：

处理输入字符串中任何位置的空白符
处理输入中的多位数
两个整数相减（当前只有加法）

下面是新版计算器的源代码，它可以做到上述几点：

# Token types
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, MINUS, EOF = 'INTEGER', 'PLUS', 'MINUS', 'EOF'

class Token():
    def __init__(self, type, value):
	# token type: 'INTEGER', 'PLUS', 'MINUS', or 'EOF'
	self.type  = type
	# token value: non-negative integer value, '+', '-', or None
	self.value = value

    def __str__(self):
	"""String representation of class instance

	Examples:
	    Token(INTEGER, 3)
	    Token(PLUS, '+')
	"""
	return f'Token({self.type}, {self.value})'

    def __repr__(self):
	return self.__str__()

class Interpreter():

    def __init__(self, text):
	# client string input, e.g. "3 + 5", "12 - 5", etc
	self.text = text
	# self.pos is an index into self.text
	self.pos  = 0
	# current token instance
	self.current_token = None
	self.current_char  = self.text[self.pos]

    def error(self):
	raise Exception('Error parsing input')

    def advance(self):
	"""Advance the 'pos' pointer and set the 'current_char' variable."""
	self.pos += 1
	if self.pos >= len(self.text):
	    self.current_char = None  # Indicates end of input
	else:
	    self.current_char = self.text[self.pos]

    def skip_whitespace(self):
	while self.current_char is not None and self.current_char.isspace():
	    self.advance()

    def integer(self):
	"""Return a (multidigit) integer consumed from the input."""
	result = ''
	while self.current_char is not None and self.current_char.isdigit():
	    result += self.current_char
	    self.advance()
	return int(result)

    def get_next_token(self):
	"""Lexical analyzer (also known as scanner or tokenizer)

	This method is responsible for breaking a sentence
	apart into tokens.
	"""
	while self.current_char is not None:
	    if self.current_char.isspace():
		self.skip_whitespace()
		continue
	    if self.current_char.isdigit():
		return Token(INTEGER, self.integer())
	    if self.current_char == '+':
		self.advance()
		return Token(PLUS, '+')
	    if self.current_char == '-':
		self.advance()
		return Token(MINUS, '-')

	    self.error()

	return Token(EOF, None)

    def eat(self, token_type):
	# compare the current token type with the passed token
	# type and if they match then "eat" the current token
	# and assign the next token to the self.current_token,
	# otherwise raise an exception.
	if self.current_token.type == token_type:
	    self.current_token = self.get_next_token()
	else:
	    self.error()

    def expr(self):
	"""Parser / Interpreter

	expr -> INTEGER PLUS INTEGER
	expr -> INTEGER MINUS INTEGER
	"""

	# set current token to the first token from the input
	self.current_token = self.get_next_token()

	# we expect the current token to be an integer
	left = self.current_token
	self.eat(INTEGER)

	# we expect the current token to be either a '+' or '-'
	op = self.current_token
	if op.type == PLUS:
	    self.eat(PLUS)
	elif op.type == MINUS:
	    self.eat(MINUS)
	else:
	    self.error()

	# we expect the current token to be an integer
	right = self.current_token
	self.eat(INTEGER)
	# after the above call the self.current_token is set to
	# EOF token

	# at this point either the INTEGER PLUS INTEGER or
	# the INTEGER MINUS INTEGER sequence of tokens
	# has been successfully found and the method can just
	# return the result of adding or subtracting two integers,
	# thus effectively interpreting client input
	if op.type == PLUS:
	    result = left.value + right.value
	elif op.type == MINUS:
	    result = left.value - right.value
	else:
	    self.error()
	return result

def main():
    while True:
	try:
	    # To run under Python3 replace 'raw_input' call with 'input'
	    text = input('calc> ')
	except EOFError:
	    break
	if not text:
	    continue
	interpreter = Interpreter(text)
	result      = interpreter.expr()
	print(result)

if __name__ == '__main__':
    main()

把以上代码保存到名为 calc2.py 中，或者直接从 GitHub 上下载。试一试。亲眼看一下它可以按预期运行：它可以处理输入中的任何位置的空白符；它接受多位整数，除了整数相加还可以处理整数相减。

下面是在我笔记本上的一次尝试：

$ python calc2.py
calc> 27 + 3
30
calc> 27 - 7
20
calc>

与第一部分相比代码的主要变化有：

get_next_token 方法做了一点重构。增加指针 pos 的逻辑被重构到了方法 advance 中。
增加了两个方法： skip_whitespace 用来忽略空白符， integer 用来处理输入中的多位整数。
expr 方法在修改后，除了可以识别 INTEGER -> PLUS -> INTEGER 这个组合(phrase) 之外，还可以识别INTEGER -> MINUS -> INTEGER。而且在成功识别相应的组合后，也可以进行相应的加减操作。

在第一部分你尝到了两个重要的概念，即 token 和 词法分析器 。今天我想聊一聊 lexeme 、 parsing 和 parser 。

你已经知道 token 了。但为了叙述方便，我需要介绍一下 lexeme。什么是 lexeme？ lexeme 是组成 token 的一个字符序列。在下面的图片中是一些 token 和 lexeme 的例子，希望它能把两者之间的关系表达清楚：

Figure 1: Lexemes

现在还记得 expr 方法吗？我以前说过这是真正解释算术表达式的地方。但在解释一个表达式之前，你需要知道它是哪种组合，比如相加或相减。这是 expr 方法本质上做的事：它从 get_next_token 方法得到的 token 流中找到结构，然后解释它识别出的组合，产生算术表达式的结果。

从 token 流中查找结构，或者说从 token 流中识别组合，的过程叫做 parsing. 解释器或编译器中执行这部分任务的叫 parser.

现在你知道解释器的 parsing 和解释都在 expr 方法中了── expr 方法首先尝试从 token 流中识别（即parse） INTEGER -> PLUS -> INTEGER 或 the INTEGER -> MINUS -> INTEGER 组合，在成功识别到（即parsed）其中一个组合时，该方法就解释执行它并返回给调用者两个整数相加或相减的结果。

又到了做练习的时间了。

Figure 2: exercises

扩展计算器以处理两个整数相乘
扩展计算器以处理两个整数相除
修改代码以使它可以解释包含任意个数字的加减操作，如“9 - 5 + 3 + 11”

检查你的理解。

什么是 lexeme？
在 token 流中找到结构的过程叫什么？或者这么问，在 token 流中识别出特定组合的过程叫什么？
解释器（编译器）做 parsing 工作的部分叫什么？

我希望你能喜欢今天讲的内容。在本系列的下一篇文章中你会扩展你的计算器来处理更复杂的算术表达式。敬请关注。

下面是我推荐的帮助你学习解释器和编译器的一些书：

顺便说一下，我还在写一本书“et’s Build A Web Server: First Steps”，解释了怎么从头写一个基本的 Web 服务器。你可以从这里，这里和这里来感受一下这本书。

本系列的所有文章：

Footnotes:

The 5 Elements of Effective Thinking

Last Updated 2018-05-26 Sat 14:12.
Render by hexo-renderer-org with Emacs 25.2.2 (Org mode 9.1.13)