Getting Started with Pyparsing

The grammar specification should be a natural-looking part of the Python program, easy-to-read, and familiar in style and format to Python programmers. - Zen of Pyparsing.

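In the previous post, on the Interpreter design pattern, I mentioned the pyparsing library and used it to process commands for a very simple DSL. This post covers the library in more detail. pyparsing is a text-processing library that can parse HTML, log files, complex data structures, commands, and so on. Typical uses include stripping comments from source code or writing a small DSL.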


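Text processing comes up constantly in programming: handling JSON strings (with any of the JSON libraries), parsing HTML in a crawler (re, bs, lxml), and so on. Simple strings can usually be handled with the built-in str methods such as split(), index(), and startswith(), and source files have the lex/yacc tools. Regular expressions are another powerful tool, but most people are not regex experts; writing a correct regex for complex text is hard, and the result is not very readable. As an example, here is how to match an IP address followed by a US-format phone number: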

# Regular expression
pat = r'(\d{1,3}(?:\.\d{1,3}){3})\s+(\(\d{3}\)\d{3}-\d{4})'

# pyparsing: more code, but easier to read, maintain, and extend
ipField = Word(nums, max=3)
ipAddr = Combine( ipField + "." + ipField + "." + ipField + "." + ipField )
phoneNum = Combine( "(" + Word(nums, exact=3) + ")" +
                    Word(nums, exact=3) + "-" + Word(nums, exact=4) )
userdata = ipAddr + phoneNum
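
To see the two approaches side by side, here is a minimal runnable sketch. The sample string is made up for illustration, and the imports are the ones the snippet above leaves implicit.

```python
import re
from pyparsing import Word, Combine, nums

# Regex version
pat = r'(\d{1,3}(?:\.\d{1,3}){3})\s+(\(\d{3}\)\d{3}-\d{4})'

# pyparsing version
ipField = Word(nums, max=3)
ipAddr = Combine(ipField + "." + ipField + "." + ipField + "." + ipField)
phoneNum = Combine("(" + Word(nums, exact=3) + ")" +
                   Word(nums, exact=3) + "-" + Word(nums, exact=4))
userdata = ipAddr + phoneNum

sample = "192.168.0.1 (555)123-4567"  # made-up input

regex_fields = list(re.match(pat, sample).groups())
parsed_fields = userdata.parseString(sample).asList()
print(regex_fields)
print(parsed_fields)
```

Both versions should pull out the same two fields; the difference is that the pyparsing grammar names each piece, so it can be reused or extended later.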

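pyparsing has the following features:

  • Pure Python, compatible with Python 2 and 3 (as of the 2.x releases); easy to develop with and maintain.
  • Many built-in expressions for common patterns:
    • C, C++, Java, Python, and HTML comments
    • quoted strings
    • HTML and XML tags
    • expressions delimited by commas or any other delimiter
  • A single Python source file (again in 2.x), so it is easy to port and use
  • MIT license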

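The right way to use pyparsing

1. Import the functions or classes you need

2. Define the grammar and helper functions:

  • For example, define a variable name: identifier = Word(alphas, alphanums + '_')
  • Define an integer or floating-point number: number = Word(nums + ".")
  • Define an assignment statement: assignmentExpr = identifier + "=" + (identifier | number)
# The assignmentExpr defined above can parse all of the following assignments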
a = 10
a_2=100
pi=3.14159
goldenRatio = 1.61803
E = mc2
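
As a quick check, the three definitions above can be run against those five statements. This is only a sketch; the list name statements is introduced here for illustration.

```python
from pyparsing import Word, alphas, alphanums, nums

identifier = Word(alphas, alphanums + "_")
number = Word(nums + ".")
assignmentExpr = identifier + "=" + (identifier | number)

statements = ["a = 10", "a_2=100", "pi=3.14159", "goldenRatio = 1.61803", "E = mc2"]

# Parse every statement and keep the tokens as plain lists
parsed = [assignmentExpr.parseString(s).asList() for s in statements]
for tokens in parsed:
    print(tokens)
```

Note that Word(nums + ".") is deliberately loose: it accepts any run of digits and dots, so a string like 1.2.3 would also count as a number.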

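We use only the following Backus-Naur Form (BNF) shorthand:

  • ::= means "is defined as"
  • + means "1 or more"
  • * means "0 or more"
  • items enclosed in [] are optional
  • a succession of terms means that the matched tokens must occur in that sequence
  • | means that either of the terms may occur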

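3. Use the grammar to parse the input text

  • parseString: applies the grammar to the input text, starting at the beginning
  • scanString: scans the input text looking for matches
  • searchString: a wrapper around scanString that returns a list of all matched tokens
  • transformString: another wrapper around scanString that simplifies matching tokens and modifying the text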
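
The four methods are easiest to compare on one small input. The sketch below uses a made-up sentence and a trivial grammar (a run of digits); the doubler parse action is only there to give transformString something to substitute.

```python
from pyparsing import Word, nums

integer = Word(nums)
text = "order 66 shipped 3 boxes"  # made-up input

# parseString: the grammar must match from the start of the input
first = integer.parseString("42 apples").asList()

# scanString: a generator of (tokens, start, end) for every match
spans = [(tokens[0], start, end) for tokens, start, end in integer.scanString(text)]

# searchString: just the matched tokens, one group per match
found = integer.searchString(text).asList()

# transformString: replace each match with the value returned by its parse action
doubler = Word(nums).setParseAction(lambda t: str(int(t[0]) * 2))
doubled = doubler.transformString(text)

print(first, spans, found, doubled, sep="\n")
```

parseString is the one to use when the whole input follows the grammar; the other three are for picking matches out of surrounding text.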

4. Process the results of parsing

parseString returns a ParseResults object, which can be used as a plain list or accessed by named attributes.

# As a list
assignmentTokens = assignmentExpr.parseString("pi=3.14159")
print(assignmentTokens)    # prints ['pi', '=', '3.14159']

# By named attributes
assignmentExpr = identifier.setResultsName("lhs") + "=" + \
    (identifier | number).setResultsName("rhs")
assignmentTokens = assignmentExpr.parseString("pi=3.14159")
# prints 3.14159 is assigned to pi
print(assignmentTokens.rhs, "is assigned to", assignmentTokens.lhs)
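
The same ParseResults object supports several access styles at once. A small sketch, repeating the definitions so it runs on its own (the by_* variable names are just for illustration):

```python
from pyparsing import Word, alphas, alphanums, nums

identifier = Word(alphas, alphanums + "_")
number = Word(nums + ".")
assignmentExpr = identifier.setResultsName("lhs") + "=" + \
    (identifier | number).setResultsName("rhs")

tokens = assignmentExpr.parseString("pi=3.14159")

by_index = (tokens[0], tokens[2])          # positional access, like a list
by_key = (tokens["lhs"], tokens["rhs"])    # dict-style access by results name
by_attr = (tokens.lhs, tokens.rhs)         # attribute access by results name
as_dict = tokens.asDict()                  # only the named tokens, as a dict
print(by_index, by_key, by_attr, as_dict)
```

Access by name keeps working if the grammar later gains extra tokens, whereas the positional indexes would shift.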

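The "Hello World!" example

We will parse "Hello, World!" to show how pyparsing is used. The general form of such a greeting is "word, word !", and the parser should handle all of the following: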

Hello, World!
Hi, Mom!
Good morning, Miss Crabtree!
Yo, Adrian!
Whattup, G?
How's it goin', Dude?
Hey, Jude!
Goodbye, Mr. Chips!

We define the format of these strings in BNF (the example is overkill for such a simple task, but it covers the basics of using pyparsing):

greeting ::= salutation comma greetee endpunc
salutation ::= word+
comma ::= ,
greetee ::= word+
word ::= a collection of one or more characters, which are any alpha or ' or .
endpunc ::= ! | ?

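This BNF translates almost directly into pyparsing, using the classes Word, Literal, and OneOrMore and the helper function oneOf. BNF is written top-down, but in pyparsing we build the grammar bottom-up, because each expression has to be defined before it is used: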

word = Word(alphas + "'.")
salutation = OneOrMore(word)
comma = Literal(",")
greetee = OneOrMore(word)
endpunc = oneOf("! ?")    # oneOf avoids the more verbose Literal("!") | Literal("?")
greeting = salutation + comma + greetee + endpunc

The complete code:

from pyparsing import *    # wildcard import, for brevity only

tests = """\
Hello, World!
Hi, Mom!
Good morning, Miss Crabtree!
Yo, Adrian!
Whattup, G?
How's it goin', Dude?
Hey, Jude!
Goodbye, Mr. Chips!
"""

word = Word(alphas + "'.")
salutation = OneOrMore(word)
comma = Literal(",")
greetee = OneOrMore(word)
endpunc = oneOf("! ?")    # oneOf avoids the more verbose Literal("!") | Literal("?")
greeting = salutation + comma + greetee + endpunc
for test_str in tests.splitlines():
    # print(test_str)
    print(greeting.parseString(test_str))

The output:

['Hello', ',', 'World', '!']
['Hi', ',', 'Mom', '!']
['Good', 'morning', ',', 'Miss', 'Crabtree', '!']
['Yo', ',', 'Adrian', '!']
['Whattup', ',', 'G', '?']
["How's", 'it', "goin'", ',', 'Dude', '?']
['Hey', ',', 'Jude', '!']
['Goodbye', ',', 'Mr.', 'Chips', '!']

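Notice that a multi-word salutation or greetee is split into several separate tokens. Let's change the definitions to group them:

salutation = Group( OneOrMore(word) )
greetee = Group( OneOrMore(word) )
greeting = salutation + comma + greetee + endpunc    # rebuild so greeting uses the grouped expressions

The result: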

[['Hello'], ',', ['World'], '!']
[['Hi'], ',', ['Mom'], '!']
[['Good', 'morning'], ',', ['Miss', 'Crabtree'], '!']
[['Yo'], ',', ['Adrian'], '!']
[['Whattup'], ',', ['G'], '?']
[["How's", 'it', "goin'"], ',', ['Dude'], '?']
[['Hey'], ',', ['Jude'], '!']
[['Goodbye'], ',', ['Mr.', 'Chips'], '!']

Next, let's unpack the results and change how they are printed:

for test_str in tests.splitlines():
    salutation, dummy, greetee, endpunc = greeting.parseString(test_str)
    print(salutation, greetee, endpunc)

The output:

['Hello'] ['World'] !
['Hi'] ['Mom'] !
['Good', 'morning'] ['Miss', 'Crabtree'] !
['Yo'] ['Adrian'] !
['Whattup'] ['G'] ?
["How's", 'it', "goin'"] ['Dude'] ?
['Hey'] ['Jude'] !
['Goodbye'] ['Mr.', 'Chips'] !

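Tokens we are not interested in can be suppressed:

comma = Suppress(Literal(","))    # drop the comma from the results
# Rebuild the grammar so it uses the suppressed comma
greeting = Group(OneOrMore(word)) + comma + Group(OneOrMore(word)) + oneOf("! ?")
for test_str in tests.splitlines():
    salutation, greetee, endpunc = greeting.parseString(test_str)
    print(salutation, greetee, endpunc)

Now let's collect all of the salutations: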

salutation_list = []
for test_str in tests.splitlines():
    salutation, greetee, endpunc = greeting.parseString(test_str)
    salutation_list.append((" ".join(salutation)))
print(salutation_list)
# prints ['Hello', 'Hi', 'Good morning', 'Yo', 'Whattup', "How's it goin'", 'Hey', 'Goodbye']

What makes pyparsing special?

Class names are easier to read and understand than specialized typography.

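pyparsing makes code easier to maintain and easier to read. The introduction gave one example; here is another. Suppose we want to match a C function call with zero or more arguments: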

# A regex for a C function call; not very readable
(\w+)\((((\d+|\w+)(,(\d+|\w+))*)?)\)
# With pyparsing (note the parentheses: + binds more tightly than |)
Word(alphas) + "(" + Group(Optional((Word(nums) | Word(alphas)) + ZeroOrMore("," + (Word(nums) | Word(alphas))))) + ")"
# The form x + ZeroOrMore("," + x) is so common that pyparsing provides the helper delimitedList,
# which simplifies the expression further to
Word(alphas) + "(" + Group( Optional(delimitedList(Word(nums) | Word(alphas))) ) + ")"

Whitespace markers clutter and distract from the grammar definition.

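Whitespace is often irrelevant when processing text. The regex above matches abc(1,2,def,5) but not abc(1, 2, def, 5); to handle the latter, the regex needs explicit whitespace handling:
(\w+)\s*\(\s*(((\d+|\w+)(\s*,\s*(\d+|\w+))*)?)\s*\)
The pyparsing version needs no change, because it skips whitespace between tokens automatically.
If the arguments may also contain code comments, we can handle them as follows (something that is very hard to do with a regex):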

cFunction = Word(alphas)+ "(" + \
    Group( Optional(delimitedList(Word(nums)|Word(alphas)))  ) + ")"
cFunction.ignore( cStyleComment  )
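
A short sketch that exercises both claims: the whitespace-free regex rejects a spaced-out call, while the pyparsing grammar accepts it, with or without an embedded comment. The two input strings and the variable names are made up for illustration.

```python
import re
from pyparsing import Word, Group, Optional, alphas, nums, delimitedList, cStyleComment

# The first regex from above, with no whitespace handling
strict_regex = re.compile(r'(\w+)\((((\d+|\w+)(,(\d+|\w+))*)?)\)')

cFunction = Word(alphas) + "(" + \
    Group(Optional(delimitedList(Word(nums) | Word(alphas)))) + ")"
cFunction.ignore(cStyleComment)

spaced = "abc(1, 2, def, 5)"
commented = "abc(1, /* second arg */ 2, def)"

regex_matches_spaced = strict_regex.fullmatch(spaced) is not None
spaced_tokens = cFunction.parseString(spaced).asList()
commented_tokens = cFunction.parseString(commented).asList()
print(regex_matches_spaced)
print(spaced_tokens)
print(commented_tokens)
```

delimitedList drops the commas from the results, so the grouped argument list contains only the arguments themselves.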

The results of the parsing process should do more than just represent a nested list of tokens, especially when grammars get complicated.

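For complex grammars, the results come back as a ParseResults object. Its tokens can be accessed by index, like a list, or by results name, like attributes, which makes the results much easier to work with.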

Parse time is a good time for additional text processing.

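pyparsing lets you attach callbacks that run when an expression matches (parse-time callbacks, called parse actions). The example below matches a quoted string and uses a lambda to strip the surrounding quotes. Parse actions can also perform extra validation of the matched text.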

quotedString.setParseAction( lambda t: t[0][1:-1]  )
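
Here is a runnable sketch of that parse action. One assumption differs from the line above: it attaches the action to a copy, because setParseAction modifies the expression in place and quotedString is shared at module level. The input text is made up, and removeQuotes is the helper pyparsing ships for the same job.

```python
from pyparsing import quotedString, removeQuotes

# Work on copies so the module-level quotedString is left unchanged
unquoted = quotedString.copy().setParseAction(lambda t: t[0][1:-1])
unquoted_builtin = quotedString.copy().setParseAction(removeQuotes)

text = 'say "hello world" or \'bye\''  # made-up input

plain = quotedString.searchString(text).asList()
stripped = unquoted.searchString(text).asList()
stripped_builtin = unquoted_builtin.searchString(text).asList()
print(plain)
print(stripped)
print(stripped_builtin)
```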

Grammars must tolerate change, as grammar evolves or input text becomes more challenging.

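As the input text changes, processing it gets more complicated. pyparsing makes the code easier to modify and extend, and makes it easier to write self-documenting code.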


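Extracting data from a table with parse actions and ParseResults

Consider a simple data set of school game results. Each line holds a date followed by two school names, each with its score.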

09/04/2004  Virginia         44   Temple             14
09/04/2004  LSU              22   Oregon State       21
09/09/2004  Troy State       24   Missouri           14
01/02/2003  Florida State   103   University of Miami 2

The BNF for this data:

digit      ::= '0'..'9'
alpha      ::= 'A'..'Z' 'a'..'z'
date       ::= digit+ '/' digit+ '/' digit+
schoolName ::= ( alpha+ )+
score      ::= digit+
schoolAndScore ::= schoolName score
gameResult ::= date schoolAndScore schoolAndScore

We translate the BNF into pyparsing classes:

num = Word(nums)
date = num + "/" + num + "/" + num
schoolName = OneOrMore(Word(alphas))
# Combine the definitions above into more complex expressions
score = Word(nums)
schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore

The complete code:

from pyparsing import Word, OneOrMore, Combine, alphas, nums

tests = """\
09/04/2004  Virginia         44   Temple             14
09/04/2004  LSU              22   Oregon State       21
09/09/2004  Troy State       24   Missouri           14
01/02/2003  Florida State   103   University of Miami 2
"""

num = Word(nums)
date = num + "/" + num + "/" + num
schoolName = OneOrMore(Word(alphas))
# Combine the definitions above into more complex expressions
score = Word(nums)
schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore

for test in tests.splitlines():
    stats = gameResult.parseString(test)
    print(stats.asList())
"""  The output is a flat, unstructured list of strings
['09', '/', '04', '/', '2004', 'Virginia', '44', 'Temple', '14']
['09', '/', '04', '/', '2004', 'LSU', '22', 'Oregon', 'State', '21']
['09', '/', '09', '/', '2004', 'Troy', 'State', '24', 'Missouri', '14']
['01', '/', '02', '/', '2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']
"""


# Combine the date into a single token by changing date to
date = Combine(num + "/" + num + "/" + num)
gameResult = date + schoolAndScore + schoolAndScore

""" The output becomes
['09/04/2004', 'Virginia', '44', 'Temple', '14']
['09/04/2004', 'LSU', '22', 'Oregon', 'State', '21']
['09/09/2004', 'Troy', 'State', '24', 'Missouri', '14']
['01/02/2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']
"""

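Another problem is that multi-word school names are still split into separate tokens, so we attach a parse action to schoolName:
schoolName.setParseAction( lambda tokens: " ".join(tokens) )

Parse actions are also commonly used for data validation. For example, to validate the date (this needs import time and pyparsing's ParseException):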

def validateDateString(tokens):
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])
date.setParseAction(validateDateString)
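
To watch the validation reject a bad date, the parse action can be wrapped in a small check. This is a sketch: is_valid_date is a hypothetical helper added for the demonstration, and the sample dates are made up.

```python
import time
from pyparsing import Word, Combine, nums, ParseException

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)

def validateDateString(tokens):
    # strptime raises ValueError for impossible dates such as month 13
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])

date.setParseAction(validateDateString)

def is_valid_date(text):
    """Hypothetical helper: True if text parses as a valid date."""
    try:
        date.parseString(text)
        return True
    except ParseException:
        return False

print(is_valid_date("09/04/2004"), is_valid_date("13/45/2004"))
```

The grammar alone only checks the shape digits/digits/digits; the parse action is what turns a well-formed but impossible date into a parse failure.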

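Next, the Group class. Group nests the parsed tokens into a sublist. If we change the definition to schoolAndScore = Group( schoolName + score ), the output is as follows; each school and its score are now bracketed together in their own list: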

['09/04/2004', ['Virginia', '44'], ['Temple', '14']]
['09/04/2004', ['LSU', '22'], ['Oregon State', '21']]
['09/09/2004', ['Troy State', '24'], ['Missouri', '14']]
['01/02/2003', ['Florida State', '103'], ['University of Miami', '2']]

We also want score as an int rather than a string, which a parse action can do during parsing:

score = Word(nums).setParseAction( lambda tokens : int(tokens[0])  )
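
Why the conversion matters can be seen by comparing two scores from the table, 103 and 44, with and without it. A minimal sketch; text_score and int_score are names made up for this comparison.

```python
from pyparsing import Word, nums

text_score = Word(nums)
int_score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))

a, b = "103", "44"

# Without the parse action the tokens are strings, compared character by character
as_text = text_score.parseString(a)[0] > text_score.parseString(b)[0]
# With it the tokens are ints, compared numerically
as_int = int_score.parseString(a)[0] > int_score.parseString(b)[0]
print(as_text, as_int)
```

Comparing the raw strings would pick the wrong winner for that game, which is exactly the kind of bug the parse action removes.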

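Finally, we add labels to the results. So far the results have been lists, which means accessing fields by index, and that is not very elegant. By giving each field a results name, we can access the results by name, which makes the code easier to maintain. It is like a function that returns a tuple such as (res, err): wrapping it in a namedtuple or a class lets the caller write Result.res instead.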

schoolAndScore = Group(
    schoolName.setResultsName("school") +
    score.setResultsName("score")
)
gameResult = date.setResultsName("date") + \
    schoolAndScore.setResultsName("team1") + \
    schoolAndScore.setResultsName("team2")

The complete code:

from pyparsing import Word, Group, Combine, OneOrMore, alphas, nums, \
    stringEnd, ParseException
import time

tests = """\
09/04/2004  Virginia         44   Temple             14
09/04/2004  LSU              22   Oregon State       21
09/09/2004  Troy State       24   Missouri           14
01/02/2003  Florida State   103   University of Miami 2
""".splitlines()

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)
def validateDateString(tokens):
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])

date.setParseAction(validateDateString)
schoolName = OneOrMore( Word(alphas) )
schoolName.setParseAction( lambda tokens: " ".join(tokens) )
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
schoolAndScore = Group( schoolName.setResultsName("school") + \
        score.setResultsName("score") )
gameResult = date.setResultsName("date") + schoolAndScore.setResultsName("team1") + \
        schoolAndScore.setResultsName("team2")
for test in tests:
    stats = (gameResult + stringEnd).parseString(test)
    if stats.team1.score != stats.team2.score:
        if stats.team1.score > stats.team2.score:
            result = "won by " + stats.team1.school
        else:
            result = "won by " + stats.team2.school
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (
        stats.date, stats.team1.school, stats.team1.score,
        stats.team2.school, stats.team2.score, result))
    # or print one of these alternative formats
    # print("%(date)s %(team1)s %(team2)s" % stats)
    # print(stats.asXML("GAME"))

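Extracting data from web pages

There are many libraries for extracting data from web pages, such as lxml, BeautifulSoup, pyquery, and the built-in HTMLParser and htmllib. I still think bs is the better choice here, since pyparsing requires more work. Here is a simple example that extracts the attributes of img tags: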

from pyparsing import makeHTMLTags
html = """
<div class="content clearfix">

    <dl class="">
        <dt>
            <a href="https://book.douban.com/subject/7564420/" onclick="moreurl(this, {'total': 10, 'clicked': '7564420', 'pos': 0, 'identifier': 'book-rec-books'})"><img class="m_sub_img" src="https://img3.doubanio.com/spic/s8950064.jpg"></a>
        </dt>
        <dd>
        <a href="https://book.douban.com/subject/7564420/" onclick="moreurl(this, {'total': 10, 'clicked': '7564420', 'pos': 0, 'identifier': 'book-rec-books'})" class="">
            软件之道
        </a>
        </dd>
    </dl>

    <dl class="">
        <dt>
            <a href="https://book.douban.com/subject/7063664/" onclick="moreurl(this, {'total': 10, 'clicked': '7063664', 'pos': 1, 'identifier': 'book-rec-books'})"><img class="m_sub_img" src="https://img3.doubanio.com/spic/s10180950.jpg"></a>
        </dt>
        <dd>
        <a href="https://book.douban.com/subject/7063664/" onclick="moreurl(this, {'total': 10, 'clicked': '7063664', 'pos': 1, 'identifier': 'book-rec-books'})" class="">
            程序设计中实用的数据结构
        </a>
        </dd>
    </dl>
</div>
"""

# define expression for <img> tag
imgTag,endImgTag = makeHTMLTags("img")
# search for matching tags, and
# print key attributes
for img in imgTag.searchString(html):
    print("'%(class)s' : %(src)s" % img)
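
The same approach extends to tags that have a body. The sketch below pairs makeHTMLTags with SkipTo to pull the href and link text out of anchor tags; the HTML snippet, the example.com URLs, and the results name "label" are all made up for illustration.

```python
from pyparsing import makeHTMLTags, SkipTo

# Made-up HTML snippet; the example.com URLs are placeholders
html = """
<ul>
  <li><a href="https://example.com/a">First book</a></li>
  <li><a href="https://example.com/b" class="x">Second book</a></li>
</ul>
"""

aStart, aEnd = makeHTMLTags("a")
# SkipTo(aEnd) captures everything between the opening and closing tag
link = aStart + SkipTo(aEnd).setResultsName("label") + aEnd

links = [(match.href, match.label) for match in link.searchString(html)]
for href, label in links:
    print(href, "->", label)
```

The tag attributes come back as results names on the match, so extra attributes such as class on the second link do not need any special handling.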