5
Disaster Recovery Solution A Case Study (A design using VMware/Netapp/Doubletake/esxpress) ELJAY ENGINEERING US: +1 877 718 7947 UK: +44 208994 4454 INDIA: +9144 42187886 www.eljayindia.com http://www.planetsolutions.co.uk http://www.planetconvergence.co.uk [email protected] [email protected]

Case Study DR PE - Planet Solutionsplanetsolutions.co.uk/.../Disaster_Recovery_Solution_Case_Study.pdf · Case!Study(Our!design!using Netapp/Doubletake/esxpress)!!!! Issuesandsolutions!!!

  • Upload
    vanbao

  • View
    218

  • Download
    1

Embed Size (px)

Citation preview

   

Disaster  Recovery  Solution      

         

A  Case  Study  (A  design  using  VMware/Netapp/Doubletake/esxpress)  

 

ELJAY  ENGINEERING  

         US:  +1  877  718  7947  

             UK:  +44  20-­‐8994  4454  

     INDIA:  +91-­‐44  -­‐42187886  

www.eljayindia.com http://www.planetsolutions.co.uk  

http://www.planetconvergence.co.uk  

 

 

[email protected]  

[email protected]  

 

     MANAGING  COMMUNICATIONS  OVER  IP

`

 

     MANAGING  COMMUNICATIONS  OVER  IP

     A  hosting  firm  XYZ  with  leased  datacenter  space  wanted  to  offer  one  of  its  hosting  clients  ABC,  a  disaster  recovery  solution.    XYZ   had   created   a   basic   design   for   ABC   with   their   in-­‐house  engineers,  but  the  design  had  some  obvious  glitches.  XYZ   approached   ELJAY   for   help   on   the   design   and  implementation  of  the  DR  solution.  Consultants   from   ELJAY   were   accurately   able   to   find   out   the  design   issues   and   rectify   the   problem  areas,   leading   to   project  success.          Situation    ABC   operated   an   environment   of   both   physical   and   virtual  servers  (VMware  based)  in  their  production  environment.  The  DR  solution  designed  in-­‐house  provided  a  DR  solution  with  2  NetApp,  one  FAS2040  at  the  ABC  production  site  and  the  second  (FAS  2050)  at  the  DR  site  (XYZ  datacenter)  The   idea   was   to   use   Netapp   snapmirror   to   transfer   data   from  production  to  DR.  A  15  Mbps  MPLS  dedicated  site  to  site  link  connected  had  been  obtained  from  AT&T  for  replication  traffic.  For  the  physical  servers  at  the  production,  double-­‐take  software  was  being  used  to  dump  data  to  volumes  on  the  FAS2040.  The  virtual  servers  (VMware  VI3)  at  production  site  were  backed  up   by   esxpress   and   backups   targeted   to   volumes   on   the  FAS2040.  The  FAS2040  replicated  these  volumes  to  FAS2050  at  the  DR  site  via  snapmirror.  The   DR   would   involve   restoring   the   esxpress   and   Doubletake  backups   (from   FAS2050)   to   a   VI3   environment   in   the   DR  datacenter.                

 Case  Study  (Our  design  using  Netapp/Doubletake/esxpress)  

     Issues  and  solutions      These   are   just   a   few   of   the  many   issues  that  were  rectified.      1.   The   replication  needed  more   than   the  estimated   maximum   bandwidth   on   the  link,   this   lead   to   snapmirror   lagging  back  by   a   few   days   for   each   volume.   This  means  the  data  on  the  DR  site  was  older  than  production   site  by   a   few  days   (high  RPO)  

   Solution:    We   enabled   dedupe   on   all   source  volumes  and  changed  the  original  NetApp  snapmirror   replication   design   to   use  Volume   snapmirror   (VSM)   instead   of  Qtree   snapmirror   (QSM).     This   led   to  dedupe   savings   being   passed   to  replication   traffic   savings.   Replication  traffic  came  down  by  30%.  We  also  change  esxpress  settings  to  daily  DELTA  backups  and  to  take  FULL  backups  only   when   DELTA   backup   differs   by   60%  or  every  2  months.  This  reduced  the  load  of   monthly   FULL   backups   of   virtual  machines   clogging   the   link   around   the  same  dates.            

   2.  In  an  event  of  an  actual  disaster,  restoration  of  backups  had  to  been  done  manually.  This  increased  the  RTO  of  the  DR  solution.  

 Solution:  We   used   a   feature   of   esxpress   called   auto-­‐restore,   which  automatically   restores   backups   from   a   target   when   newer  backups   are   placed.   We   configured   the   cron   jobs   needed   to  achieve   this   and   were   able   to   successfully   demonstrate   this   by  performing  a  DR  test.  The  cron  job  ran  every  hour  to  check  if  new  backups   had   arrived   to   the   FAS2050   volume,   and   restore   these  backups  to  the  DR  VI3  environment.      During  the  DR  test,  instead  of  manually  restoring  backups  on  the  DR  side,  XYZ  engineers  simply  needed  to  power  on  the  DR  VMs.            3.   Exchange   and   SQL   databases   came   up   "dirty"   during   DR  testing.   Database   repairs   had   to   be   performed   to   get   the  databases  to  mount  successfully.    Solution:  The  root  cause  of  the  problem  was  Netapp  snapmirror  usage  with  Doubletake.   Doubletake   backs   up   data   continuously   to   the  FAS2040  volumes.    Snapmirror   snapshots   this   volume   and   sends   it   over   to   the  NetApp  FAS2050  on  the  DR  site.  Although  the  backups   taken  by  Doubletake   are   "quiesced"   backups,   the   snapmirror   snapshots  can  happen  at  anytime  rendering  the  backups  "unquiesced".  This  problem   does   not   affect   regular   files   too  much,   it   considerable  affects  databases.  We   found   that  20%  of   all   databases   came  up  corrupt  on  the  DR  site.    To   have   a   proper   working   solution,   we   implemented   Exchange  SCR   (standby   continuous   replication)   instead   of   using   NetApp  snapmirror.   SCR   is   a   well   tested   and   documented   solution  recommended  by  Microsoft,  and  can  be  used  over  WAN  links.    For  SQL,  we  found  an  easier  solution  by  redirecting  regular  daily  backups  of  SQL  databases  to  a  NetApp  2040  volume.  This  NetApp  2040  volume  was  configured  to  be  replicated  over  to  the  DR  site  by  using  Netapp  VSM.      

 

     MANAGING  COMMUNICATIONS  OVER  IP

FAS2050

A B

App

Net

FAS2050

A B

App

Net

SNAP

Mirro

r

Asyn

chron

ous

Appli

catio

n

Oper

ating

Sys

tem ESX

Serve

rOper

ating

Sys

tem

Appli

catio

n

Stora

ge1

Stora

ge2

CIFS

/SMB

VMwa

re ES

X

CIFS

/SMB

Auto

Resto

re mo

st rec

ent d

elta +

full b

acku

p

A X

I O M

SLM

500

DA

TA S

YS

TEM

S™

FC

Tier1

Tier2VM

FS w

/ VMD

K res

tores

VMwa

re ES

X Cl

uster

Stora

ge1

Stora

ge2

esXp

ress R

eplic

ation

Stora

ge2

FAS2050

A B

App

Net

FAS2050

A B

App

Net

SNAP

Mirr

or

Asyn

chro

nous

vmdk

vmdk

vmdk

NFS

Volum

e

VMwa

re E

SX

NFS

2003

32bit

Phys

ical S

erve

rs 20

03 x6

4

LAN

vmdk

vmdk

vmdk

NFS

Volum

e

vmdk

vmdk

vmdk

NetA

pp

Flexc

lone

writa

ble

volum

e

VMwa

re E

SX C

luster

2003

64bit

Phys

ical S

erve

rs 20

03 x8

6

Doub

le Ta

ke R

eplic

ation