unbound's prefetch behavior is complicated and hard to understand

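Even if you introduce a DNS cache server to reduce the overhead of DNS name resolution, responses can still be delayed when the cache server frequently has to query its upstream resolver or the authoritative servers. When tuning an environment where DNS resolution is the bottleneck, it seems the problem would be solved if the cache server automatically refreshed cached records before their TTL expired.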

unbound, which many ISPs deploy as a full-service resolver, has a prefetch feature. Judging by the name, it sounds as though unbound itself will automatically refresh cached records before their TTL expires.

prefetch: <yes or no> (1.4.2-)
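    If yes, message cache elements are prefetched before they expire to keep the cache up to date. Default is no. Turning it on gives about 10 percent more traffic and load on the machine, but popular items do not expire from the cache.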

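The setting is described as above in unbound.conf(5) from the Japan Unbound User Group (日本Unboundユーザー会), but the documentation does not tell you how the feature actually behaves. While evaluating prefetch, I tested it and then read the source code, and this article summarizes what I found. The unbound version I examined is 1.6.4. If you only want the conclusion, skip to the summary at the end.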

unbound's query processing and the conditions under which prefetch happens

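When you send a query to unbound with prefetch enabled, the answer is cached and a response is returned. Even when the cached records reach the end of their TTL, unbound does not prefetch them on its own to refresh the cache. To understand when a prefetch actually happens, follow the unbound source code from its entry point.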

The code below shows how an unbound worker operates: when unbound receives a query, it processes it in the worker_handle_request() function.

daemon/worker.c:990

int 
worker_handle_request(struct comm_point* c, void* arg, int error,
    struct comm_reply* repinfo)
{
...
lookup_cache:
    /* Lookup the cache.  In case we chase an intermediate CNAME chain
    * this is a two-pass operation, and lookup_qinfo is different for
    * each pass.  We should still pass the original qinfo to
    * answer_from_cache(), however, since it's used to build the reply. */
    if(!edns_bypass_cache_stage(edns.opt_list, &worker->env)) {
        h = query_info_hash(lookup_qinfo, sldns_buffer_read_u16_at(c->buffer, 2));
        if((e=slabhash_lookup(worker->env.msg_cache, h, lookup_qinfo, 0))) {
            /* answer from cache - we have acquired a readlock on it */
            if(answer_from_cache(worker, &qinfo, 
                cinfo, &need_drop, &alias_rrset, &partial_rep,
                (struct reply_info*)e->data, 
                *(uint16_t*)(void *)sldns_buffer_begin(c->buffer), 
                sldns_buffer_read_u16_at(c->buffer, 2), repinfo, 
                &edns)) {
                /* prefetch it if the prefetch TTL expired.
                * Note that if there is more than one pass
                * its qname must be that used for cache
                * lookup. */
                if((worker->env.cfg->prefetch || worker->env.cfg->serve_expired)
                    && *worker->env.now >=
                    ((struct reply_info*)e->data)->prefetch_ttl) {
                    time_t leeway = ((struct reply_info*)e->
                        data)->ttl - *worker->env.now;
                    if(((struct reply_info*)e->data)->ttl
                        < *worker->env.now)
                        leeway = 0;
                    lock_rw_unlock(&e->lock);
                    reply_and_prefetch(worker, lookup_qinfo,
                        sldns_buffer_read_u16_at(c->buffer, 2),
                        repinfo, leeway);
                    if(!partial_rep) {
                        rc = 0;
                        regional_free_all(worker->scratchpad);
                        goto send_reply_rc;
                    }
                } else if(!partial_rep) {
                    lock_rw_unlock(&e->lock);
                    regional_free_all(worker->scratchpad);
                    goto send_reply;
                }
                /* We've found a partial reply ending with an
                * alias.  Replace the lookup qinfo for the
                * alias target and lookup the cache again to
                * (possibly) complete the reply.  As we're
                * passing the "base" reply, there will be no
                * more alias chasing. */
                lock_rw_unlock(&e->lock);
                memset(&qinfo_tmp, 0, sizeof(qinfo_tmp));
                get_cname_target(alias_rrset, &qinfo_tmp.qname,
                    &qinfo_tmp.qname_len);
                if(!qinfo_tmp.qname) {
                    log_err("unexpected: invalid answer alias");
                    regional_free_all(worker->scratchpad);
                    return 0; /* drop query */
                }
                qinfo_tmp.qtype = qinfo.qtype;
                qinfo_tmp.qclass = qinfo.qclass;
                lookup_qinfo = &qinfo_tmp;
                goto lookup_cache;
            }
            verbose(VERB_ALGO, "answer from the cache failed");
            lock_rw_unlock(&e->lock);
        }
        if(!LDNS_RD_WIRE(sldns_buffer_begin(c->buffer))) {
            if(answer_norec_from_cache(worker, &qinfo,
                *(uint16_t*)(void *)sldns_buffer_begin(c->buffer), 
                sldns_buffer_read_u16_at(c->buffer, 2), repinfo, 
                &edns)) {
                regional_free_all(worker->scratchpad);
                goto send_reply;
            }
            verbose(VERB_ALGO, "answer norec from cache -- "
                "need to validate or not primed");
        }
    }

Within this function, the following if statement checks that prefetch is enabled (or serve_expired is enabled) and that the current time is greater than or equal to the prefetch_ttl of the cached reply.

                if((worker->env.cfg->prefetch || worker->env.cfg->serve_expired)
                    && *worker->env.now >=
                    ((struct reply_info*)e->data)->prefetch_ttl) {

The comment above the if statement says that the prefetch happens once the prefetch TTL has expired. Inside the if block there is a call to reply_and_prefetch(), which looks like where the prefetch is carried out.

We will follow the prefetch itself later; first, let's look at how prefetch_ttl is assigned.

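Because unbound is a DNS cache server, it returns the cached answer to the client when it has one for the incoming query. On the first lookup, however, nothing is cached, so unbound processes the query according to the rules in its configuration, such as stub-zone and forward-zone, and returns the answer to the client. At that point it caches the records so that the next identical query can be answered from the cache, and it attaches a prefetch_ttl to the cached entry, which is what the check above consults when prefetch is enabled.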

To see how prefetch_ttl is computed, follow mesh_new_client(), which is called near the bottom of worker_handle_request() to handle the case where nothing is in the cache.

 /* grab a work request structure for this new request */
    mesh_new_client(worker->env.mesh, &qinfo, cinfo,
        sldns_buffer_read_u16_at(c->buffer, 2),
        &edns, repinfo, *(uint16_t*)(void *)sldns_buffer_begin(c->buffer));

The place where prefetch_ttl is actually computed is the following line, in the code that builds the response for the client.

./util/data/msgreply.c:426:

 rep->prefetch_ttl = PREFETCH_TTL_CALC(rep->ttl);

The PREFETCH_TTL_CALC macro is defined at ./util/data/msgreply.h:63:

#define PREFETCH_TTL_CALC(ttl) ((ttl) - (ttl)/10)

By this formula, a record with a TTL of 60 seconds, such as mackerel.io, gets a prefetch_ttl of 54 seconds.
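To make the arithmetic concrete, here is a minimal sketch that reuses the macro exactly as quoted above. The helper name prefetch_threshold() is hypothetical and exists only for illustration; it is not part of unbound.

```c
#include <assert.h>
#include <time.h>

/* Same definition as the one quoted from util/data/msgreply.h. */
#define PREFETCH_TTL_CALC(ttl) ((ttl) - (ttl)/10)

/* Hypothetical helper (not in unbound): given a relative TTL in seconds,
 * return the number of seconds after which a cache hit may trigger a
 * prefetch. */
time_t prefetch_threshold(time_t ttl)
{
    assert(ttl >= 0);
    /* Integer division: (ttl)/10 is rounded toward zero. */
    return PREFETCH_TTL_CALC(ttl);
}
```

With this helper, a TTL of 60 gives 54 and a TTL of 3600 gives 3240. Because the division is integer division, a TTL below 10 seconds gives a threshold equal to the TTL itself.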

Here is what I observed on an unbound instance with prefetch enabled.

$ dig @127.0.0.1 mackerel.io A; sleep 54; dig @127.0.0.1 mackerel.io A; sleep 6; dig @127.0.0.1 mackerel.io A
# first query
...
;; ANSWER SECTION:
mackerel.io.        60 IN  A   52.69.239.167
...
# second query
...
;; ANSWER SECTION:
mackerel.io.        6  IN  A   52.69.239.167
...
# third query
...
;; ANSWER SECTION:
mackerel.io.        54 IN  A   13.113.31.156
...

The prefetch appears to have happened right after the second query was answered. If you run tcpdump or set unbound's verbosity to 5 at this point, you can see unbound return the cached answer to the client and then perform the prefetch.

unbound log

[1503937596] unbound[15412:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 0
;; QUESTION SECTION:
mackerel.io.    IN      A

;; ANSWER SECTION:
mackerel.io.    60      IN      A       52.69.239.167
...
[1503937650] unbound[15412:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 0
;; QUESTION SECTION:
mackerel.io.    IN      A

;; ANSWER SECTION:
mackerel.io.    60      IN      A       13.113.31.156
...

So a prefetch happens when a query arrives between the point where 10% of the cache entry's TTL remains and the point where the entry expires.
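The trigger condition can be sketched in isolation as follows. This is a simplified model, not unbound's code: struct cached_reply and both functions are hypothetical. It assumes that ttl and prefetch_ttl are stored as absolute times, which is inferred from the comparison against *worker->env.now in worker_handle_request(), and it ignores the serve_expired alternative.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical, simplified stand-in for unbound's struct reply_info.
 * Assumption: both fields hold absolute times. */
struct cached_reply {
    time_t ttl;          /* absolute time at which the entry expires */
    time_t prefetch_ttl; /* absolute time from which a hit triggers prefetch */
};

/* Illustrative helper: build an entry cached at `now` with relative TTL
 * `ttl`, applying the same formula as PREFETCH_TTL_CALC. */
struct cached_reply cache_store(time_t now, time_t ttl)
{
    struct cached_reply r;
    assert(ttl >= 0);
    r.ttl = now + ttl;
    r.prefetch_ttl = now + (ttl - ttl/10);
    return r;
}

/* Mirrors the prefetch half of the if condition quoted from
 * worker_handle_request(): a cache hit at `now` triggers a prefetch only
 * when the feature is enabled and the threshold has been reached. */
int hit_triggers_prefetch(const struct cached_reply* r, time_t now,
    int prefetch_enabled)
{
    return prefetch_enabled && now >= r->prefetch_ttl;
}
```

For an entry cached at time 1000 with a TTL of 60, a hit at 1053 does not trigger a prefetch and a hit at 1054 does, which matches the dig experiment above. The key point of the model is that nothing happens unless a query actually arrives inside that window.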

How the prefetch is carried out

The test and code reading above showed how a prefetch is triggered. The rest of this article digs a little deeper into what prefetch does, starting with the reply_and_prefetch() function mentioned earlier.

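reply_and_prefetch() sends the response to the client, increments the prefetch counter in the statistics, and then calls mesh_new_prefetch(). mesh_new_prefetch() creates or updates a prefetch job in the mesh, a structure that acts something like unbound's job queue. The last argument, leeway + PREFETCH_EXPIRY_ADD, is the remaining TTL plus 60 (a constant), and this value is put into the queue entry.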

/daemon/worker.c:764

/** Reply to client and perform prefetch to keep cache up to date.
 * If the buffer for the reply is empty, it indicates that only prefetch is
 * necessary and the reply should be suppressed (because it's dropped or
 * being deferred). */
static void
reply_and_prefetch(struct worker* worker, struct query_info* qinfo,
        uint16_t flags, struct comm_reply* repinfo, time_t leeway)
{
        /* first send answer to client to keep its latency 
         * as small as a cachereply */
        if(sldns_buffer_limit(repinfo->c->buffer) != 0)
                comm_point_send_reply(repinfo);
        server_stats_prefetch(&worker->stats, worker);
        
        /* create the prefetch in the mesh as a normal lookup without
         * client addrs waiting, which has the cache blacklisted (to bypass
         * the cache and go to the network for the data). */
        /* this (potentially) runs the mesh for the new query */
        mesh_new_prefetch(worker->env.mesh, qinfo, flags, leeway +
                PREFETCH_EXPIRY_ADD);
}
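The value that ends up in the mesh can be worked out by combining the leeway computation in worker_handle_request() with the addition in reply_and_prefetch(). The sketch below does only that arithmetic. prefetch_leeway_for() is a hypothetical helper, and PREFETCH_EXPIRY_ADD is set to 60 because that is the value stated in this article; treat it as an assumption for other unbound versions.

```c
#include <assert.h>
#include <time.h>

/* Value stated in the article; an assumption for other versions. */
#define PREFETCH_EXPIRY_ADD 60

/* Hypothetical helper (not in unbound). `ttl` is the absolute expiry time
 * of the cached reply and `now` is the current time. */
time_t prefetch_leeway_for(time_t ttl, time_t now)
{
    time_t leeway = ttl - now;  /* seconds of TTL still remaining */
    if(ttl < now)
        leeway = 0;             /* entry already expired: clamp to zero */
    assert(leeway >= 0);
    /* reply_and_prefetch() passes leeway + PREFETCH_EXPIRY_ADD on. */
    return leeway + PREFETCH_EXPIRY_ADD;
}
```

In the mackerel.io experiment, the second query arrives with 6 seconds of TTL left, so the value passed to mesh_new_prefetch() would be 66 under these assumptions.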

At the end of mesh_new_prefetch(), mesh_run() has the modules execute the query stored in the mesh. With the iterator module, it is handled like any ordinary query by process_request(), defined at iterator/iterator.c:3174.

services/mesh.c:536

void mesh_new_prefetch(struct mesh_area* mesh, struct query_info* qinfo,
        uint16_t qflags, time_t leeway)
{
        struct mesh_state* s = mesh_area_find(mesh, NULL, qinfo,
                qflags&(BIT_RD|BIT_CD), 0, 0);
#ifdef UNBOUND_DEBUG
        struct rbnode_type* n;
#endif
        /* already exists, and for a different purpose perhaps.
         * if mesh_no_list, keep it that way. */
        if(s) {
                /* make it ignore the cache from now on */
                if(!s->s.blacklist)
                        sock_list_insert(&s->s.blacklist, NULL, 0, s->s.region);
                if(s->s.prefetch_leeway < leeway)
                        s->s.prefetch_leeway = leeway;
                return;
        }
        if(!mesh_make_new_space(mesh, NULL)) {
                verbose(VERB_ALGO, "Too many queries. dropped prefetch.");
                mesh->stats_dropped ++;
                return;
        }

        s = mesh_state_create(mesh->env, qinfo, NULL,
                qflags&(BIT_RD|BIT_CD), 0, 0);
        if(!s) {
                log_err("prefetch mesh_state_create: out of memory");
                return;
        }
#ifdef UNBOUND_DEBUG
        n =
#else
       (void)
#endif
        rbtree_insert(&mesh->all, &s->node);
        log_assert(n != NULL);
        /* set detached (it is now) */
        mesh->num_detached_states++;
        /* make it ignore the cache */
        sock_list_insert(&s->s.blacklist, NULL, 0, s->s.region);
        s->s.prefetch_leeway = leeway;
        
    if(s->list_select == mesh_no_list) {
        /* move to either the forever or the jostle_list */
        if(mesh->num_forever_states < mesh->max_forever_states) {
            mesh->num_forever_states ++;
            mesh_list_insert(s, &mesh->forever_first, 
                &mesh->forever_last);
            s->list_select = mesh_forever_list;
        } else {
            mesh_list_insert(s, &mesh->jostle_first, 
                &mesh->jostle_last);
            s->list_select = mesh_jostle_list;
        }
    }
    mesh_run(mesh, s, module_event_new, NULL);

prefetch_leeway is eventually handed to iter_dns_store() when the response is stored in the cache.

iterator/iterator.c:2236

static int
processQueryResponse(struct module_qstate* qstate, struct iter_qstate* iq,
    int id)
{
...
        if(!qstate->no_cache_store)
            iter_dns_store(qstate->env, &iq->response->qinfo,
                iq->response->rep, 0, qstate->prefetch_leeway,
                iq->dp&&iq->dp->has_parent_side_NS,
                qstate->region, qstate->query_flags);

iterator/iter_utils.c:503

void
iter_dns_store(struct module_env* env, struct query_info* msgqinf,
    struct reply_info* msgrep, int is_referral, time_t leeway, int pside,
    struct regional* region, uint16_t flags)
{
    if(!dns_cache_store(env, msgqinf, msgrep, is_referral, leeway,
        pside, region, flags))
        log_err("out of memory: cannot store data in cache");
}

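According to the function's documentation, prefetch_leeway is the leeway applied when TTLs are updated during a prefetch. It appears to make rrsets (other than type NS) time out sooner so that they are refreshed with a new full TTL. (I ran out of patience before tracing the exact behavior.)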

services/cache/dns.h:60

 * @param leeway: during prefetch how much leeway to update TTLs.
 *  This makes rrsets (other than type NS) timeout sooner so they get
 *  updated with a new full TTL.
 *  Type NS does not get this, because it must not be refreshed from the
 *  child domain, but keep counting down properly.

Summary

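Enabling prefetch does not make unbound refresh cached records automatically when their TTL expires.
A prefetch happens when a query arrives between the point where 10% of the TTL remains and the point where the TTL expires. (The 10% is hard-coded.)
The documented 10% increase in load is the worst case, in which queries keep arriving during the last 10% of the TTL.
prefetch_leeway does not seem to be something you need to worry about for prefetch itself.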

