Name%Server%Selec+on%of%DNS% … · Unbound%in%Scenario%1% dnscache%in%Scenario%1% Windows%DNS*in%Scenario%1% …

Name Server Selec+on of DNS Caching Resolvers

Yingdi Yu1, Duane Wessels2, Ma> Larson2, and Lixia Zhang1

1

1 UCLA 2 VeriSign Labs

1UCLA 2VeriSign Lab

Why Select Servers

•  Query the fastest server –  To minimize itera+ve query delays.

•  Also query other servers from +me to +me –  to tell which server is the fastest. –  to avoid overloading fast servers. –  to prevent from being a>acked, e.g., Kaminsky-‐style cache poisoning.

2

Ques+ons to Answer

•  Do caching resolvers select the fastest server every +me?

•  If some resolvers select slow servers: is that inten+onal, or by mistake?

•  Can resolvers promptly detect changes in query response delays?

3

Tested Caching Resolvers

•  BIND 9.7.3 •  BIND 9.8.0 •  PowerDNS 3.1.5 •  Unbound 1.4.10 •  DNSCache 1.05 •  Windows DNS 6.1 (Windows Server 2008)

4

Measurement Testbed

5

Root

13 servers

SLD

⋯⋯ .com Network Emulator

Caching Resolver

Trace Replayer

Test domain

Redirect queries to the Test domain

Use Wildcard to terminate queries

Emulate different packet loss and delays to different servers.

Replay a DNS lookup trace

The tested caching resolvers (default cache size)

Query Load

•  Trace data –  From a large U.S. ISP –  10 minutes of resolver logs –  About 3.5 million lookups –  408,808 unique DNS names –  154,165 Second Level Domains

•  Average itera+ve query rate: –  250 queries/sec –  Could be higher if

•  Cache is small •  TTL of DNS RRs are short

6

70ms%

Caching%%Resolver%

TLD%NS%1%

TLD%NS%2%

TLD%NS%3%

TLD%NS%13%

Test Scenarios •  Scenario 1:

–  Whether caching resolvers can tell the differences in RTT.

7

70ms%

Caching%%Resolver%

TLD%NS%1%

TLD%NS%2%

TLD%NS%3%

TLD%NS%13%



•  Scenario 2: –  How caching resolver handle a

unreachable server?

8

70ms%

Caching%%Resolver%

Become%reachable%again%a5er%5%mins%

TLD%NS%1%

TLD%NS%2%

TLD%NS%3%

TLD%NS%13%



•  Scenario 2: –  How caching resolver handle a

unreachable server?

•  Scenario 3: –  Whether caching resolvers can

detect server recovery in a short +me.

9

Server Selec+on •  Strategy 1:

–  Select the one with the least Smoothed RTT (SRTT). –  Responses update SRTT of corresponding servers. –  Other servers: SRTT decays exponen+ally. –  Implementa+ons: BIND 9.8, PowerDNS

•  Strategy 2: –  Select a server based on an SRTT-‐related probability. –  Responses update SRTT of corresponding servers. –  Slow servers can be selected, but probabili+es may be small. –  Implementa+ons: BIND 9.7, Unbound

10

Results

11

Some prefer the fastest •  PowerDNS

–  Always selects the least SRTT server.

•  BIND 9.7 –  Selects sta+s+cally among all servers

with SRTT < 128ms. –  Smaller SRTT, higher probability. –  Decays all servers’ SRTT

•  Even a server’s RTT > 128ms, it can s+ll be selected ajer certain +me.

12

Name server indexed by RTT(ms)50 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

5

10

15

20


Que

ries

rece

ived

(%)

0

20

40

60

80

100

Bind 9.7 in Scenario 1

PowerDNS in Scenario 1

Some do NOT prefer the fastest •  DNSCache:

–  Does not measure RTT. –  Randomly selects among all

servers.

•  Unbound: –  Measures RTT. –  Randomly selects among

servers with SRTT < 400ms.

13


Que

ries

rece

ived

(%)

0

2

4

6

8

10

Name server indexed by RTT(ms)Inf. 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

2

4

6

8

10


Que

ries

rece

ived

(%)

0

2

4

6

8

10

Unbound in Scenario 1 dnscache in Scenario 1 Windows DNS* in Scenario 1

Queries are distributed evenly among servers.

* Without source code, we do not know reasons for query distribu+on of Windows DNS.

Some sends more queries to slower

•  BIND 9.8 –  Always selects the least SRTT

server (same as PowerDNS).

•  Measurements show that more queries are sent to slower servers.

•  Why?

14


Que

ries

rece

ived

(%)

0

3

6

9

12

BIND 9.8 in Scenario 1

* ISC has been no+fied about this problem and we are awai+ng their response.

How does BIND 9.8 select a server? •  The por+on of queries to a

name server NS is:

•  Both tselect and tupdate are approximately equal to RTT.

•  Faster decaying, smaller differences in tdecay!

•  Decaying speed is coupled with query rate.

15

tdecay tselect tupdate

+me

SRTT

t1 t2

rmin

rrtt

t3 t4

Periodical SRTT varia+on of a name server NS

tselect

tdecay + tselect + tupdate Nquery =

If query rate is low


Que

ries

rece

ived

(%)

0

3

6

9

12


Que

ries

rece

ived

(%)

0

3

6

9

12

16

Brief Article

The Author

December 5, 2011

1 First section

Your text goes here.

SRTTn = � · SRTTn+1 (1)

1.1 A subsection

More text.

1

Brief Article

The Author

December 5, 2011

1 First section

Your text goes here.

SRTTn = � · SRTTn+1 (1)

1.1 A subsection

More text.

1

250 queries/second 25 queries/second

BIND 9.8 under different query rates in Scenario 1

Lower query rate Larger tdecay Fewer queries to slower servers

tselect


What leads to high itera+ve query rate

•  Factors at resolver side: –  Cache is small in resolver. –  A resolver is shared by a large number of users.

•  Factors at domain side: –  TTL of resource record is short. –  Domain has a lot of records that are ojen queried.

17

If a server is unresponsive •  BIND may send much more

queries to the unresponsive one. •  Treat unresponsive as

“responsive” with a large SRTT. –  Queries are s+ll sent to the

unresponsive server for tselect seconds. –  tselect ≈ the value of +meout +mer.

•  Longer tselect & shorter tdecay, larger Nquery.

18


Que

ries

rece

ived

(%)

0

5

10

15

20


Que

ries

rece

ived

(%)

0

5

10

15

20

25



tselect


How others handle unresponsive servers

19


Que

ries

rece

ived

(%)

0

2

4

6

8

10


Que

ries

rece

ived

(%)

0

2

4

6

8

10


Que

ries

rece

ived

(%)

0

2

4

6

8

10

Unbound (✔): Can detect the unresponsive server, then use a few queries to probe it.

DNSCache (✖): No historical RTT record. Windows DNS (✔): Unknown


Que

ries

rece

ived

(%)

0

20

40

60

80

100

PowerDNS (✔): Slow decaying reduces frequency of selec+ng slow servers.

Latency to detect server recovery

•  Unbound: up to 15 minutes –  Large interval between two consecu+ve probes.

•  PowerDNS: 3 minutes –  Caused by slow decaying

•  BIND: depends on query rate •  Windows DNS: about 1 second

20

Conclusions

21

Conclusions

•  Comprehensive test is needed. •  Some “seemingly sound” implementa+ons may not work as expected. –  e.g. SRTT decaying.

•  An unresponsive server should NOT be treated as a regular server with a large SRTT.

•  Unresponsive servers can impact zone performance. –  Should be repaired ASAP, even if others work well.

22

Thanks!

23

Documents

Name%Server%Selec+on%of%DNS% … · Unbound%in%Scenario%1% dnscache%in%Scenario%1% Windows%DNS*in%Scenario%1% …