

FAX UPDATE

26TH AUGUST 2013

Ilija Vukotic [email protected] 2

CONTENT

Running issues

FAX failover

Moving to new AMQ server

Informing on endpoint status

Monitoring developments

Monitoring validation

dCache monitor 5.0.0

Collector

Dashboard

50 shades of green


RUNNING ISSUES

Dead endpoints:

Frascati, Manchester, LAL

cmsd services are dead at:

Taiwan-lcg2, LPSC, Protvino, SWT2_CPB

/atlas/dq2/user/gangarbt lookups

• Made half of the federation endpoints inaccessible from upstream redirectors.

• Johannes will explain in more detail.

Remaining issues with x509

• Communicating our wish to get it turned on.

• BU, DESY-HH, DESY-ZN, FZK, LRZ-LMU, MPPMU, Freiburg, Wuppertal, Geogrid


RUNNING ISSUES

Rather green considering it’s August!

Quite a bit of traffic considering it’s August!

New functional HC tests should not contribute much AFAIK


FAX FAILOVER

FAX failover works: http://pandamon.cern.ch/fax/failover

Developments:

• Cloud is now shown and queue names are corrected

• Side menu

In the works:

• Filtering on user

• Graphing

To ponder:

• Site admins are not aware of this possibility. How do we communicate to them that it is in their best interest to turn it on?


FAX FAILOVER

FAX dedicated submenu. Panda brokered job statistics will be added here.

Production jobs failing over to FAX


MOVING TO NEW AMQ SERVER

All FAX related info was sent to pilot.msg.cern.ch

There was no authentication

Moved to Dashboard test broker

Consumer now uses STOMP+SSL

Required change to new stomp version

This week will move to production server


INFORMING ON ENDPOINT STATUS

Mailing from SSB works and gives results.

Do we want SAM updates too?

What would it take?

Who would do it?


MONITORING DEVELOPMENTS

There is a need to remotely check if cmsd works.

• We had (and still have) sites showing as green for direct access and red for downstream redirection.

• Investigation shows that the cmsd’s are actually dead or not responding.

• Need a way to directly probe cmsd’s

• Andy will look at the ways to do it.
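Until a direct cmsd probe exists, the symptom described above (green for direct access, red for downstream redirection) can be used to flag candidate sites. The following is a minimal sketch under stated assumptions: the dictionary layout, the field names `direct` and `downstream`, and the site names in the usage note are hypothetical and not taken from SSB. It only shortlists sites and does not probe cmsd itself.

```python
def suspect_cmsd(status_by_site):
    """Return sites whose direct access is green but downstream redirection is red.

    status_by_site maps a site name to a dict such as
    {"direct": "green", "downstream": "red"} (hypothetical layout).
    """
    suspects = []
    for site, status in status_by_site.items():
        direct_ok = status.get("direct") == "green"
        downstream_bad = status.get("downstream") == "red"
        if direct_ok and downstream_bad:
            suspects.append(site)
    # Sorted so the output is stable from run to run.
    return sorted(suspects)
```

For example, `suspect_cmsd({"SITE_A": {"direct": "green", "downstream": "red"}})` returns `["SITE_A"]`, a shortlist of sites where the cmsd is worth checking by hand.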

To develop new columns for SSB:

• xRootD version

• Rucio support

• Monitoring status


MONITORING VALIDATION

First step is validation that results shown by Matevz’s collector are correct.

I sent xrootd summary messages to the collector and checked what appears in the plots. The messages arrive and are shown, but something is wrong in how the summaries are calculated or plotted.


DCACHE MONITOR 5.0.0

dCache monitor mostly rewritten:

• dCache compatible logging

• UDP messaging from same ports

• Sends “=” stream

• Sends more data (substitutes DN \CN with username etc.)

• Made compatible with collector

Tested at MWT2. Very good results.

At the end of the week, an RPM will be produced and placed in the WLCG repository. CMS will be informed about the new version.


COLLECTOR

New version being prepared by Matevz

• New AMQ version

BIG ISSUE:

Some CMS sites are sending info to our collector. Will be raised with Brian B.


DCACHE MONITOR 5.0.0

Now gives really important and actionable information. Just during debugging I noticed:

• Files opened, read a small percentage and kept open for hours.

• Same file open twice in the same session (?!)

• Rather small usage of vector reads.


IN DASHBOARD

Why the difference between table and plots?

What’s the idea of the “Site history” tab?

Need to investigate why CMS sites appear here (CERN-CMSTEST)


PANDA RE-BROKERING

Discussed at last CERN S&C week

We agreed to provide PANDA with an estimate of the cost of moving data over the WAN, so that it could re-broker jobs from very long queues to sites with free slots that have a good connection to the data.

The cost matrix exists in SSB.

Code that reads it from SSB and applies exponential decay smoothing is running and sends the info to AGIS.
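The smoothing step could look roughly like the sketch below. It assumes that "exponential decay smoothing" means a standard exponentially weighted moving average over successive cost measurements; the function name, the `(source, destination)` key layout, and the `alpha` value are illustrative assumptions, not the code actually running.

```python
def smooth_costs(previous, latest, alpha=0.3):
    """Blend the newest cost measurements into the smoothed cost matrix.

    previous, latest: dicts mapping (source, destination) -> cost.
    alpha: weight given to the newest measurement (assumed value).
    """
    smoothed = {}
    for link, value in latest.items():
        if link in previous:
            # Older measurements decay by a factor (1 - alpha) at each update.
            smoothed[link] = alpha * value + (1 - alpha) * previous[link]
        else:
            # First measurement for this link: nothing to blend with yet.
            smoothed[link] = value
    # Links without a new measurement keep their last smoothed value.
    for link, value in previous.items():
        smoothed.setdefault(link, value)
    return smoothed
```

A larger `alpha` follows recent changes in link quality more quickly; a smaller one damps short-lived fluctuations.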

Have to check scalability of AGIS bulk update.

Waiting for Artem to code moving data from AGIS to schedconfig.

The next step is for Tadashi to use that table from schedconfig and actually re-broker.

Finally we’ll have to monitor it the same way we do with Failover.

No developments


50 SHADES OF GREEN

Green color in any of the FAX SSB monitor metrics is based on one and the same file.

This involves a lot of cached information.

Need to find the percentage of files successfully obtained from a much larger file pool while avoiding caching effects.

Simple code was developed to test all endpoints that have FDR datasets. It does _file0->ls() on each of the ~800 files, sequentially.

Currently run by hand.

You may find it in FAXtools/FAXtestsFDR of our CERN FAX git repo.
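The per-endpoint percentage described on this slide can be sketched as follows. This is not the code in the repository: `can_open` is a hypothetical callable standing in for whatever actually opens and lists a file at an endpoint, and the endpoint and file names are placeholders.

```python
def success_percentages(endpoints, files, can_open):
    """Return, per endpoint, the percentage of files that could be opened.

    can_open(endpoint, path) is a hypothetical probe that returns True
    when the file is readable through that endpoint.
    """
    results = {}
    for endpoint in endpoints:
        opened = 0
        for path in files:  # sequential, one file at a time
            try:
                if can_open(endpoint, path):
                    opened += 1
            except Exception:
                # A probe that raises counts as a failed file.
                pass
        results[endpoint] = 100.0 * opened / len(files) if files else 0.0
    return results
```

Running the probe over many distinct files, rather than one fixed file, is what makes the resulting number less sensitive to cached lookups.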