23
Name Server Selec+on of DNS Caching Resolvers Yingdi Yu 1 , Duane Wessels 2 , Ma> Larson 2 , and Lixia Zhang 1 1 1 UCLA 2 VeriSign Labs 1 UCLA 2 VeriSign Lab

Name%Server%Selec+on%of%DNS% … · Unbound%in%Scenario%1% dnscache%in%Scenario%1% Windows%DNS*in%Scenario%1% …

Embed Size (px)

Citation preview

Name  Server  Selec+on  of  DNS  Caching  Resolvers  

Yingdi  Yu1,  Duane  Wessels2,  Ma>  Larson2,  and  Lixia  Zhang1    

1  

1  UCLA  2  VeriSign  Labs  

1UCLA  2VeriSign  Lab  

Why  Select  Servers  

•  Query  the  fastest  server  –  To  minimize  itera+ve  query  delays.  

•  Also  query  other  servers  from  +me  to  +me  –  to  tell  which  server  is  the  fastest.  –  to  avoid  overloading  fast  servers.  –  to  prevent  from  being  a>acked,  e.g.,  Kaminsky-­‐style  cache  poisoning.  

2  

Ques+ons  to  Answer  

•  Do  caching  resolvers  select  the  fastest  server  every  +me?  

•  If  some  resolvers  select  slow  servers:  is  that  inten+onal,  or  by  mistake?  

•  Can  resolvers  promptly  detect  changes  in  query  response  delays?  

3  

Tested  Caching  Resolvers  

•  BIND  9.7.3  •  BIND  9.8.0  •  PowerDNS  3.1.5  •  Unbound  1.4.10  •  DNSCache  1.05  •  Windows  DNS  6.1  (Windows  Server  2008)  

4  

Measurement  Testbed  

5  

Root  

13  servers  

SLD  

⋯⋯  .com  Network    Emulator  

Caching  Resolver  

Trace  Replayer  

Test  domain  

Redirect  queries  to  the  Test  domain  

Use  Wildcard  to  terminate  queries    

Emulate  different  packet  loss  and  delays  to  different  servers.  

Replay  a  DNS  lookup  trace  

The  tested  caching  resolvers    (default  cache  size)  

Query  Load  

•  Trace  data  –  From  a  large  U.S.  ISP  –  10  minutes  of  resolver  logs  –  About  3.5  million  lookups  –  408,808  unique  DNS  names  –  154,165  Second  Level  Domains  

•  Average  itera+ve  query  rate:  –  250  queries/sec  –  Could  be  higher  if    

•  Cache  is  small  •  TTL  of  DNS  RRs  are  short  

6  

70ms%

Caching%%Resolver%

TLD%NS%1%

TLD%NS%2%

TLD%NS%3%

TLD%NS%13%

Test  Scenarios    •  Scenario  1:  

–  Whether  caching  resolvers  can  tell  the  differences  in  RTT.  

7  

70ms%

Caching%%Resolver%

TLD%NS%1%

TLD%NS%2%

TLD%NS%3%

TLD%NS%13%

Test  Scenarios    •  Scenario  1:  

–  Whether  caching  resolvers  can  tell  the  differences  in  RTT.  

•  Scenario  2:  –  How  caching  resolver  handle  a  

unreachable  server?  

8  

70ms%

Caching%%Resolver%

Become%reachable%again%a5er%5%mins%

TLD%NS%1%

TLD%NS%2%

TLD%NS%3%

TLD%NS%13%

Test  Scenarios    •  Scenario  1:  

–  Whether  caching  resolvers  can  tell  the  differences  in  RTT.  

•  Scenario  2:  –  How  caching  resolver  handle  a  

unreachable  server?  

•  Scenario  3:  –  Whether  caching  resolvers  can  

detect  server  recovery  in  a  short  +me.  

9  

Server  Selec+on  •  Strategy  1:  

–  Select  the  one  with  the  least  Smoothed  RTT  (SRTT).  –  Responses  update  SRTT  of  corresponding  servers.  –  Other  servers:  SRTT  decays  exponen+ally.  –  Implementa+ons:  BIND  9.8,  PowerDNS  

•  Strategy  2:  –  Select  a  server  based  on  an  SRTT-­‐related  probability.  –  Responses  update  SRTT  of  corresponding  servers.  –  Slow  servers  can  be  selected,  but  probabili+es  may  be  small.  –  Implementa+ons:  BIND  9.7,  Unbound  

10  

Results  

11  

Some  prefer  the  fastest  •  PowerDNS    

–  Always  selects  the  least  SRTT  server.  

•  BIND  9.7  –  Selects  sta+s+cally  among  all  servers  

with  SRTT  <  128ms.    –  Smaller  SRTT,  higher  probability.  –  Decays  all  servers’  SRTT  

•  Even  a  server’s  RTT  >  128ms,  it  can  s+ll  be  selected  ajer  certain  +me.  

12  

Name server indexed by RTT(ms)50 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

5

10

15

20

Name server indexed by RTT(ms)50 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

20

40

60

80

100

Bind  9.7  in  Scenario  1  

PowerDNS  in  Scenario  1  

Some  do  NOT  prefer  the  fastest    •  DNSCache:  

–  Does  not  measure  RTT.  –  Randomly  selects  among  all  

servers.  

•  Unbound:  –  Measures  RTT.  –  Randomly  selects  among  

servers  with  SRTT  <  400ms.  

13  

Name server indexed by RTT(ms)50 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

2

4

6

8

10

Name server indexed by RTT(ms)Inf. 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

2

4

6

8

10

Name server indexed by RTT(ms)50 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

2

4

6

8

10

Unbound  in  Scenario  1   dnscache  in  Scenario  1   Windows  DNS*  in  Scenario  1  

Queries  are  distributed  evenly  among  servers.    

*  Without  source  code,  we  do  not  know  reasons  for  query  distribu+on  of  Windows  DNS.  

Some  sends  more  queries  to  slower  

•  BIND  9.8    –  Always  selects  the  least  SRTT  

server  (same  as  PowerDNS).  

•  Measurements  show  that  more  queries  are  sent  to  slower  servers.  

•  Why?  

14  

Name server indexed by RTT(ms)50 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

3

6

9

12

BIND  9.8  in  Scenario  1  

*  ISC  has  been  no+fied  about  this  problem  and  we  are  awai+ng  their  response.  

How  does  BIND  9.8  select  a  server?  •  The  por+on  of  queries  to  a  

name  server  NS  is:  

•  Both  tselect and  tupdate  are  approximately  equal  to  RTT.  

•  Faster  decaying,  smaller  differences  in  tdecay!  

•  Decaying  speed  is  coupled  with  query  rate.  

15  

tdecay tselect tupdate

+me  

SRTT  

t1 t2

rmin

rrtt

t3 t4

Periodical  SRTT  varia+on  of  a  name  server  NS    

tselect

tdecay + tselect + tupdate Nquery =  

If  query  rate  is  low  

Name server indexed by RTT(ms)50 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

3

6

9

12

Name server indexed by RTT(ms)50 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

3

6

9

12

16  

Brief Article

The Author

December 5, 2011

1 First section

Your text goes here.

SRTTn = � · SRTTn+1 (1)

1.1 A subsection

More text.

1

Brief Article

The Author

December 5, 2011

1 First section

Your text goes here.

SRTTn = � · SRTTn+1 (1)

1.1 A subsection

More text.

1

250  queries/second   25  queries/second  

BIND  9.8  under  different  query  rates  in  Scenario  1    

Lower  query  rate   Larger  tdecay   Fewer  queries  to  slower  servers  

tselect

tdecay + tselect + tupdate Nquery =  

What  leads  to  high  itera+ve  query  rate  

•  Factors  at  resolver  side:  –  Cache  is  small  in  resolver.  –  A  resolver  is  shared  by  a  large  number  of  users.  

•  Factors  at  domain  side:  –  TTL  of  resource  record  is  short.  –  Domain  has  a  lot  of  records  that  are  ojen  queried.  

17  

If  a  server  is  unresponsive  •  BIND  may  send  much  more  

queries  to  the  unresponsive  one.  •  Treat  unresponsive  as  

“responsive”  with  a  large  SRTT.  –  Queries  are  s+ll  sent  to  the  

unresponsive  server  for  tselect  seconds.  –  tselect  ≈  the  value  of  +meout  +mer.  

•  Longer  tselect  &  shorter  tdecay,  larger  Nquery.  

18  

Name server indexed by RTT(ms)Inf. 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

5

10

15

20

Name server indexed by RTT(ms)Inf. 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

5

10

15

20

25

Bind  9.7  in  Scenario  2  

Bind  9.8  in  Scenario  2  

tselect

tdecay + tselect + tupdate Nquery =  

How  others  handle  unresponsive  servers  

19  

Name server indexed by RTT(ms)Inf. 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

2

4

6

8

10

Name server indexed by RTT(ms)Inf. 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

2

4

6

8

10

Name server indexed by RTT(ms)Inf. 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

2

4

6

8

10

Unbound  (✔):  Can  detect  the  unresponsive  server,  then  use  a  few  queries  to  probe  it.  

DNSCache  (✖):  No  historical  RTT  record.   Windows  DNS  (✔):  Unknown  

Name server indexed by RTT(ms)Inf. 60 70 80 90 100 110 120 130 140 150 160 170

Que

ries

rece

ived

(%)

0

20

40

60

80

100

PowerDNS  (✔):  Slow  decaying  reduces  frequency  of  selec+ng  slow  servers.  

Latency  to  detect  server  recovery  

•  Unbound:  up  to  15  minutes  –  Large  interval  between  two  consecu+ve  probes.  

•  PowerDNS:  3  minutes  –  Caused  by  slow  decaying  

•  BIND:  depends  on  query  rate  •  Windows  DNS:  about  1  second  

20  

Conclusions  

21  

Conclusions  

•  Comprehensive  test  is  needed.  •  Some  “seemingly  sound”  implementa+ons  may  not  work  as  expected.  –  e.g.  SRTT  decaying.  

•  An  unresponsive  server  should  NOT  be  treated  as  a  regular  server  with  a  large  SRTT.  

•  Unresponsive  servers  can  impact  zone  performance.  –  Should  be  repaired  ASAP,  even  if  others  work  well.  

22  

Thanks!  

23