Specifying the marking characters from within the input string itself #4

The code shown earlier had a problem.

...snip
    while (1) {
...snip
        c = getchar();
        if (c == current_start_mark_char) {
            do {
                int c2 = getchar();
                if (c2 == current_start_mark_char) {
...snip
                } else {
                    ungetc(c2, stdin); /* ++++++ (2) ++++++ */
                    break;
                }
            } while ((c = getchar()) == current_start_mark_char);
        }
        if (c == EOF || c == current_start_mark_char || c == current_end_mark_char || c == current_start_annotation_char) {
            ungetc(c, stdin); /* ++++++ (1) ++++++ */
...snip
            return STRING_EXCEPT_SPECIAL;
        }
    }
...snip

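When a metacharacter is read while a string containing no metacharacters is being built, one character is pushed back with ungetc at (1).
If a metacharacter-change sequence had been detected partway through at that point, one character has already been pushed back at (2), before (1).
In that case, then, two characters are pushed back in a row.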

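However, the standard library function ungetc is only guaranteed to push back a single character.
How many characters can actually be pushed back depends on the library implementation. In the environment I am using it appears to work, as the following runs show:

$ echo -n "a(({{({{({)" | ./parser
string [a]
marked-string [({]

$ echo -n "a(({b{({{({)" | ./parser
string [ab]
marked-string [({]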
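
To see how far a particular C library goes beyond the guaranteed single character, a small probe along the following lines could be used. This is only a rough sketch: count_pushbacks is a hypothetical helper that is not part of the parser, and the byte 'x' and the limit passed in are arbitrary choices. It simply calls ungetc until it fails, then reads the pushed-back bytes out again so the stream is left as it was.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of the parser): push the byte 'x' back
   onto fp up to `limit` times and report how many calls succeeded. */
int count_pushbacks(FILE *fp, int limit)
{
    int n = 0;
    int i;

    assert(fp != NULL && limit >= 0);

    /* ungetc returns EOF once the library refuses another pushback. */
    while (n < limit && ungetc('x', fp) != EOF)
        n++;

    /* Read the pushed-back bytes out again so the stream is unchanged. */
    for (i = 0; i < n; i++)
        (void)getc(fp);

    return n;
}
```

Calling count_pushbacks(stdin, 8), for example, should return at least 1 on any conforming library. Anything above 1 is a property of that particular implementation and cannot be relied on elsewhere.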

For the source to be portable, though, it needs a mechanism of its own that guarantees at least two characters of pushback.

unget.h

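#ifndef UNGET_H_INCLUDED_
#define UNGET_H_INCLUDED_
/* Read one character, returning pushed-back characters first. */
int getchar_(void);
/* Push c back; returns EOF if two characters are already pending. */
int ungetchar_(int c);
#endif /* UNGET_H_INCLUDED_ */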

unget.c

#include <stdio.h>
#include "unget.h"

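/* Pushback stack: b[0..p-1] holds the characters pushed back so far. */
static int b[2];
static int p = 0;

/* Return the most recently pushed-back character if there is one;
   otherwise read from standard input. */
int getchar_(void)
{
    switch (p) {
    case 0:
        return getchar();
    case 1:
    case 2:
        return b[--p];
    default:
        return EOF; /* not reached as long as p stays within 0..2 */
    }
}

/* Push c back so that the next getchar_ returns it. At most two
   characters can be pending; a third attempt fails with EOF.
   Unlike ungetc, an EOF value is stored like any other character. */
int ungetchar_(int c)
{
    switch (p) {
    case 0:
    case 1:
        return b[p++] = c;
    default:
        return EOF;
    }
}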

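It is pretty rough, but something like this should do for the present purpose of pushing back at most two characters.
Note, however, that every read from standard input must now go through getchar_, and every pushback through ungetchar_.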
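
That restriction exists because the buffer is a pair of file-scope variables tied to stdin. As a variation, the same two-slot stack could be attached to an explicit reader object, which makes the behaviour easy to check in isolation. The following is only a sketch under assumptions: struct pushback_reader, pushback_init, pushback_getc, pushback_ungetc and PUSHBACK_MAX are all hypothetical names that do not appear in the parser. Unlike ungetchar_ above, this version refuses EOF, mirroring what ungetc does.

```c
#include <assert.h>
#include <stdio.h>

#define PUSHBACK_MAX 2 /* the parser needs at most two pending characters */

/* Hypothetical type: an input stream plus its own small pushback stack. */
struct pushback_reader {
    FILE *fp;
    int buf[PUSHBACK_MAX];
    int count; /* number of characters currently pushed back */
};

void pushback_init(struct pushback_reader *r, FILE *fp)
{
    assert(r != NULL && fp != NULL);
    r->fp = fp;
    r->count = 0;
}

/* Return the most recently pushed-back character if there is one;
   otherwise read the next character from the underlying stream. */
int pushback_getc(struct pushback_reader *r)
{
    if (r->count > 0)
        return r->buf[--r->count];
    return getc(r->fp);
}

/* Push c back. Returns c on success, or EOF when the stack is already
   full or c itself is EOF (the same rule ungetc applies to EOF). */
int pushback_ungetc(struct pushback_reader *r, int c)
{
    if (c == EOF || r->count >= PUSHBACK_MAX)
        return EOF;
    r->buf[r->count++] = c;
    return c;
}
```

With a reader set up by pushback_init(&r, stdin), pushing back '(' and then '{' makes the next two pushback_getc calls return '{' and then '(', so characters come back in the reverse of the order in which they were pushed, just as with the (2)-then-(1) sequence in the parser. A third pushback before any read fails with EOF.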