Writing a meta-command execution script in Python

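A student I semi-supervise runs experiments with Python libraries, so I figured I should have at least a minimal knowledge of the language, and I have been writing some Python since around the middle of this week. Below is a Python translation of my meta-command execution script, the one I use most often among the scripts I have written in Ruby.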

#!/usr/bin/env python
#  cont: execute meta-commands with disjunctive arguments
import sys, re, os

if len (sys.argv) < 3:
    sys.exit ("Usage: cont [n][c|p] command")

# handle options
opt = sys.argv[1]
nruns, flag = (int (opt[:-1]), opt[-1] == 'p') \
    if re.match (r'\d+[cp]$', opt) else sys.exit ("unrecognized opt: %s" % opt)

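def pair (args):
    """Return a list of commands by pairing disjunctive arguments."""
    xs = set (len (x) for x in args)
    # all arguments must offer the same number of alternatives, except
    # single-valued ones, which are repeated to match
    return (zip (*(x * max (xs) if len (x) == 1 else x for x in args))) \
        if len (xs - set ([1])) <= 1 \
        else sys.exit ("unbalanced pairing: %s" % args)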

def combo (args):
    """Return a list of commands by combining disjunctive arguments."""
    return [[]] if not args \
        else ([x] + rest for x in args[0] for rest in combo (args[1:]))

args = []
for x in sys.argv[2:]:
    m = re.search (r'\[(.+?)\]', x) # disjunctive arguments
    args.append ([x.replace (m.group (), s) for s in m.group (1).split ('|')] \
                     if m else [x])
else: # execute
    for com in (' '.join (z) for z in (pair (args) if flag else combo (args))):
        if nruns == 1:
            sys.stderr.write ("\x1b[34m> %s\x1b[0m\n" % com)
        for j in range (nruns):
            if nruns != 1:
                sys.stderr.write ("\x1b[34m(%d/%d)> %s\x1b[0m\n"
                                  % (j + 1, nruns, com))
            for line in os.popen (com):
                sys.stderr.write (line)
            sys.stderr.write ("\n")

I use it to run several commands in succession while changing only some of their arguments, such as machine-learning hyperparameters. For example:

> cont 2p run svm-train -t 1 -d '[1|2]' -c '0.[1|005]' train 'model[1|2]'

> run svm-train -t 1 -d 1 -c 0.1   train model1
   run svm-train -t 1 -d 1 -c 0.1   train model1
   run svm-train -t 1 -d 2 -c 0.005 train model2
   run svm-train -t 1 -d 2 -c 0.005 train model2
> cont 1c run svm-train -t 1 -d '[1|2]' -c '0.[1|005]' train model

> run svm-train -t 1 -d 1 -c 0.1   train model
   run svm-train -t 1 -d 1 -c 0.005 train model
   run svm-train -t 1 -d 2 -c 0.1   train model
   run svm-train -t 1 -d 2 -c 0.005 train model

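That is roughly how it works. It might be worth building in the run command, which measures memory consumption and elapsed time, and modifying the script to report their averages. Since I translated the Ruby script into Python quite literally, some parts may be unnatural as Python.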
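The difference between the p (pairing) and c (combination) modes can be sketched without running any command. The following is a minimal Python 3 sketch, not the script itself: the helper name expand is hypothetical, and itertools.product is swapped in for the recursive combo, on the assumption that only the expansion order matters.

```python
import itertools
import re

def expand(arg):
    """Expand the first '[a|b|...]' group in arg into its alternatives."""
    m = re.search(r'\[(.+?)\]', arg)
    if not m:
        return [arg]
    return [arg.replace(m.group(), s) for s in m.group(1).split('|')]

def pair(args):
    """Zip alternatives position by position; single values are repeated."""
    n = max(len(x) for x in args)
    if any(len(x) not in (1, n) for x in args):
        raise ValueError("unbalanced pairing: %s" % args)
    return [list(z) for z in zip(*(x * n if len(x) == 1 else x for x in args))]

def combo(args):
    """Take the Cartesian product of all alternatives."""
    return [list(z) for z in itertools.product(*args)]

argv = ['svm-train', '-d', '[1|2]', '-c', '0.[1|005]', 'train', 'model']
args = [expand(a) for a in argv]
for com in pair(args):
    print('p>', ' '.join(com))   # 2 commands: alternatives taken in lockstep
for com in combo(args):
    print('c>', ' '.join(com))   # 4 commands: every combination
```

Pairing yields as many commands as there are alternatives in a group, while combination yields the product of the group sizes.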
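I still know only the basics of the syntax, but it seems that with aggressive use of slices and comprehensions/generators, Python code can end up shorter than Ruby. As for speed, comparing ruby 1.9.2 with python 2.6, Python was several times faster. According to Python (programming language) - Wikipedia,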

While offering choice in coding methodology, the Python philosophy rejects exuberant syntax, such as in Perl, in favor of a sparser, less-cluttered grammar.

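That may be so, but I feel there is more variety in how things can be written than I expected (list comprehensions, generators, map/filter, slices in place of methods, and so on; the choice apparently affects efficiency in some cases). Some things do feel odd, such as those discussed in "','.join() がなぜキモイのか - methaneのブログ" (why ','.join() feels weird) and "len が関数になっている理由 - methaneのブログ" (why len is a function), but I expect they will stop bothering me once I get used to them. Regular expressions are a little awkward to use, but Python looks fine for everyday scripting.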
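As a small illustration of that variety, here are several ways to double the even numbers in a list, plus a slice used where a method call would also do. This is a Python 3 sketch; the data and variable names are made up for illustration.

```python
xs = list(range(10))

# list comprehension
a = [2 * x for x in xs if x % 2 == 0]

# generator expression, materialized with list()
b = list(2 * x for x in xs if x % 2 == 0)

# map/filter with lambdas
c = list(map(lambda x: 2 * x, filter(lambda x: x % 2 == 0, xs)))

# slice with a step instead of a filter
# (works here only because the even values sit at even indices)
d = [2 * x for x in xs[::2]]

# slice instead of a method call: drop a known two-character suffix
s = "12:1"
via_slice = s[:-2]
via_method = s.rsplit(":", 1)[0]

print(a == b == c == d, via_slice == via_method)
```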
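Passive Aggressive-I for binary features is also easy to write. For training and testing on svm-light format data, it looks like the following (feature values are all assumed to be 1 and are ignored).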

#!/usr/bin/env python
import sys, collections

# handle arguments
if len (sys.argv) != 5:
    sys.exit ("Usage: %s train test c iter\n" % sys.argv[0])

train, test  = sys.argv[1:3]
c    = float (sys.argv[3])
iter = int   (sys.argv[4])

# read data into memory; assuming value :1
examples = [(int (line[:2]), [int (fi[:-2]) for fi in line[3:-1].split (' ')])
            for line in open (train)]

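# estimate w with PA-I: on a margin violation, step by min (c, loss / |x|^2);
# for binary features |x|^2 == len (x)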
w = collections.defaultdict (float)
for i in range (iter):
    for y, x in examples:
        margin = y * sum (w[fi] for fi in x)
        if margin <= 1:
            t = y * min (c, (1 - margin) / len (x))
            for fi in x:
                w[fi] += t
    sys.stderr.write(".")
sys.stderr.write("done.\n")

# test
result = [0] * 4
for line in open (test):
    m = sum (w[int (fi[:-2])] for fi in line[3:-1].split (' '))
    result[((int (line[:2]) > 0) << 1) + (m > 0)] += 1

nn, np, pn, pp = result # there was a bug, here (corrected)
sys.stderr.write ("acc. %2.3f%% (pp %d) (pn %d) (np %d) (nn %d)\n" % \
                      (float (pp + nn) * 100 / (sum (result)), pp, pn, np, nn))

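Keeping features as strings (int (fi[:-2]) -> fi[:-2]) runs faster but uses nearly twice the memory. Writing nearly the same program in C++ took 97 lines, about three times the code of the Python implementation. Since Python is a scripting language, I suppose it cannot be helped that the Python implementation runs 30-50 times slower than the C++ one, but I cannot accept at all that it used more than five times the memory.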
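The update rule in the script above can be isolated from the file handling. The following is a minimal Python 3 sketch under the same assumption that every feature value is 1; the function names train_pa1 and predict and the toy data are hypothetical, and it skips the update when the hinge loss is exactly zero.

```python
import collections

def train_pa1(examples, c, n_iter):
    """Train PA-I on (label, feature-id list) pairs with binary features."""
    w = collections.defaultdict(float)
    for _ in range(n_iter):
        for y, x in examples:
            margin = y * sum(w[fi] for fi in x)
            loss = max(0.0, 1.0 - margin)      # hinge loss
            if loss > 0 and x:
                # |x|^2 == len(x) because every feature value is 1
                tau = min(c, loss / len(x))
                for fi in x:
                    w[fi] += tau * y
    return w

def predict(w, x):
    """Return +1 or -1 from the sign of the score."""
    return 1 if sum(w.get(fi, 0.0) for fi in x) > 0 else -1

# toy data (made up): feature 1 marks positives, feature 2 marks negatives
examples = [(+1, [1, 3]), (-1, [2, 3]), (+1, [1]), (-1, [2])]
w = train_pa1(examples, c=1.0, n_iter=5)
print([predict(w, x) for _, x in examples])
```

The constant c caps the step size, so a single noisy example cannot move the weights arbitrarily far.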
Incidentally, here is a Python translation of the word-count script from おまけ - ny23の日記:

#!/usr/bin/env python
import collections
counter = collections.defaultdict (int)

for line in open ("unigram_raw.txt"):
    counter[line[:-1]] += 1

for k, v in counter.items ():
    print k, v

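This finished in about 125 s, which is more than 10 times faster than the Ruby one-liner (against 1.8.6; about 3 times against 1.9.2) and only about 4 times slower than C++. It used 3 times the memory of the C++ implementation (hash). For word counting, the Perl one-liner (5.12.3) is about as fast as Python, and its memory consumption is only slightly higher than that of C++.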
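For reference, the same counting can also be written with collections.Counter, which is available from Python 2.7 / 3.1 onward and so was not an option on python 2.6. This is a Python 3 sketch with no claim about speed or memory; the function name count_lines is hypothetical and the in-memory sample stands in for unigram_raw.txt.

```python
import collections
import io

def count_lines(lines):
    """Count identical lines, ignoring the trailing newline."""
    return collections.Counter(line.rstrip("\n") for line in lines)

# stand-in for open("unigram_raw.txt"); the contents are made up
sample = io.StringIO("the\nof\nthe\nand\nthe\n")
counter = count_lines(sample)
for k, v in counter.most_common():
    print(k, v)
```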