C-RORC PRR 3
Introduction
• C-RORC: hardware design of ALICE
• Types of firmware
– Test firmware used during production, mainly developed by ALICE; test procedures discussed between ALICE and ATLAS (loopback connector used for tests developed by ATLAS)
– ALICE specific
– ATLAS specific
  • RobinNP: C-RORC to become the Gen-III ROS ROBIN
  • “Dozolar”: data source for 12 S-links, for testing S-link inputs of RobinNP
  • RoIBuilder: if the C-RORC replaces the VME-based RoI Builder, specific firmware may be needed; it is not excluded that the RobinNP firmware can be used
14 April 2014
C-RORC PRR 4
C-RORC
• C-RORC picture with some explanation
Could be removed to improve air circulation (will be discussed later)
C-RORC PRR 5
This review
• Production Readiness of C-RORC hardware
• First prototypes produced by Cerntech (Hungary)
– PCBs from Exception PCB, UK
• After tendering, production contract was awarded to Hapro (Norway)
– PCBs from Suntak, China
• 20 pre-production cards under test since mid February
C-RORC PRR 6
Hapro and Cerntech C-RORC
• PCB build different: copper balancing on Cerntech board (better spread of heat during manufacturing of board)
• Cooler + fan different; Hapro board within PCIe height limit
• Hapro FPGA: commercial grade (0 – 85 °C), Cerntech FPGA: industrial grade (-40 – 100 °C)
[Photos: Hapro board, Cerntech board]
C-RORC PRR 7
Pre-Series test at contractor’s site and tests by ALICE
Described in presentation by H. Engel
C-RORC PRR 8
Tests performed by ATLAS
• Visual inspection of the pre-series cards
• Already mentioned by H. Engel: on one card 3 LEDs only soldered on one side
– Fixed by CERN SMD Workshop
• Card at Nikhef: some VIAs filled with solder
[Photos: Hapro and Cerntech boards]
C-RORC PRR 9
Tests performed by ATLAS I
With RobinNP firmware:
• robinnpbist program:
– Checks register contents
– Measures FPGA temperature
– Sets clock frequencies for S-Links
– On-board memory tests
– DMA speed tests
– Interrupt tests, including performance benchmarking
– Tests of speed and data integrity for page handling and transfers into buffer memory
• Temperature measurements, readout via PCIe or via JTAG using Chipscope
C-RORC PRR 10
Tests performed by ATLAS II
• Standard data taking environment using ReadoutApplication:
– “Indexing” incoming data and managing buffer memory pages
– Receiving requests via network from the ROSTester program
– Forwarding requests for data to RobinNP
– Sending data via network to the ROSTester program
• Data generated by internal test generator or by DOLARs or MDT RODs
• For short fragments (50 words) stable running has been seen over periods of 11 hours (limited by ROSTester)
• Fragments larger than ~180 words cause a lockup of the firmware for a request fraction of 100% after a short time (10 – 50 s). The cause is a logic error in the FPGA's internal arbitration for access to shared resources. There is no obvious dependence on features of the C-RORC hardware. A fix for the lockups has been found, consisting of minor (but clearly significant) changes to a couple of state transitions in the Memory Controller's Finite State Machines. The memory is being operated at 303 MHz DDR; it is likely that with more work this can be scaled up
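As a rough illustration of the memory figure quoted above, the sketch below converts the 303 MHz DDR clock into a transfer rate and a theoretical peak bandwidth. The 64-bit data bus width is an assumption (typical for a DDR3 module), not a number taken from the slides, and the function names are illustrative only.

```python
def ddr_transfer_rate_mts(clock_mhz: float) -> float:
    """DDR moves data on both clock edges, i.e. 2 transfers per clock cycle."""
    return 2.0 * clock_mhz


def peak_bandwidth_gbytes(clock_mhz: float, bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth in GB/s (10**9 bytes per second).

    bus_width_bits=64 is an assumption, not a value from the slides.
    """
    transfers_per_second = ddr_transfer_rate_mts(clock_mhz) * 1e6
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


rate = ddr_transfer_rate_mts(303)
bandwidth = peak_bandwidth_gbytes(303)
print(f"{rate:.0f} MT/s, peak {bandwidth:.3f} GB/s per module (64-bit bus assumed)")
```

Under these assumptions 303 MHz corresponds to 606 MT/s, which appears consistent with the "DDR3@606" label used for the RHUL setup. Sustained throughput in the firmware will be lower than the theoretical peak.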
C-RORC PRR 11
Test setups
Nikhef: Intel dual CPU server, 2 C-RORCs, 2 dual-port 10 GbE NICs, 1 40 GbE NIC
CERN:
• 2 C-RORCs used as Dozolar
• 2 GEN-III candidate PCs with 2 RobinNPs each
• 1 PC with 3 DOLAR cards
RHUL: single CPU server, 2 C-RORCs, 2 dual-port 10 GbE NICs
C-RORC PRR 12
Observations
• Current for a few cards ~10% higher than for the other cards, but cards do function normally
• Boards from Cerntech seem to be less sensitive to air flow
– With good air flow and functioning fan, temperature of FPGA not a problem (< ~65 °C)
C-RORC PRR 13
C-RORC FPGA Core Temperatures
Site    Motherboard         C-RORC Manufacturer  FPGA Temperature (°C)  Lab Air Temperature (°C)
RHUL    Supermicro X9SRL-F  Cerntech             50                     25
RHUL    Supermicro X9SRL-F  Hapro                60                     25
CERN    Dell R720           Hapro                53                     15-20
CERN    Dell R720           Hapro                53                     15-20
CERN    Supermicro X9SRW-F  Hapro                61                     15-20
CERN    Supermicro X9SRW-F  Hapro                63                     15-20
Nikhef  Intel S2600CP       Cerntech             56 53                  ~20
Nikhef  Intel S2600CP       Hapro                48                     ~20
RHUL: 2 x DDR3@606 Single Rank, 100 MHz oscillator
Measurements at RHUL for system without lid
Nikhef: 100 MHz oscillator
1 subROB configuration; 2 subROBs: ~ +5 °C
Accuracy FPGA temperature sensor: ± 4 °C
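One way to compare the readings in the table is to look at the FPGA temperature rise above lab air, grouped by manufacturer. The sketch below does this under stated assumptions: the "15-20" lab air range is represented by its midpoint (17.5), "~20" by 20, and the two values listed for the Nikhef Cerntech card (56 and 53) are treated as two separate readings. None of these choices come from the slides.

```python
from statistics import mean

# (site, manufacturer, FPGA temperature in deg C, lab air temperature in deg C)
# Lab air values for CERN and Nikhef are assumptions derived from "15-20" and "~20".
MEASUREMENTS = [
    ("RHUL", "Cerntech", 50, 25.0),
    ("RHUL", "Hapro", 60, 25.0),
    ("CERN", "Hapro", 53, 17.5),
    ("CERN", "Hapro", 53, 17.5),
    ("CERN", "Hapro", 61, 17.5),
    ("CERN", "Hapro", 63, 17.5),
    ("Nikhef", "Cerntech", 56, 20.0),  # first of the two listed values
    ("Nikhef", "Cerntech", 53, 20.0),  # second of the two listed values
    ("Nikhef", "Hapro", 48, 20.0),
]

SENSOR_ACCURACY_C = 4  # quoted accuracy of the FPGA temperature sensor


def mean_rise_above_ambient(manufacturer: str) -> float:
    """Mean of (FPGA temperature - lab air temperature) for one manufacturer."""
    rises = [fpga - air for _, maker, fpga, air in MEASUREMENTS if maker == manufacturer]
    return mean(rises)


for maker in ("Cerntech", "Hapro"):
    rise = mean_rise_above_ambient(maker)
    print(f"{maker}: mean rise {rise:.1f} deg C (sensor accuracy +/- {SENSOR_ACCURACY_C} deg C)")
```

The sample is small and the machines differ in motherboard, lid state and air flow, so any difference between the two means should be read against the ± 4 °C sensor accuracy rather than as a firm result.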
C-RORC PRR 14
Infrared photos
Hapro C-RORC in machine with Supermicro MB at Nikhef, 4 U high machine with lid open
ALICE test firmware: FPGA sensor ~ 70 °C
RobinNP firmware: FPGA sensor ~ 64 °C
C-RORC PRR 15
Temperature
• High data rates: no significant change of temperature of FPGA
• No relation with presence or absence of QSFPs
• Reporting and monitoring of fan failure & over temperature: via Icinga (Nagios), automatic flushing of FPGA configuration to reduce power dissipation
– To be implemented
– Discuss common solution with ALICE
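The monitoring described above is still to be implemented, so the following is only a minimal sketch of what a Nagios/Icinga-style check could look like. It uses the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds, the function name and the way temperature and fan status reach the check are all assumptions; reading them from the card is not shown.

```python
# Standard Nagios/Icinga plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_fpga(temp_c, fan_ok, warn_c=65.0, crit_c=80.0):
    """Return (exit_code, message) for one card.

    warn_c and crit_c are hypothetical thresholds, chosen below the
    85 deg C upper limit of a commercial-grade FPGA.
    """
    if temp_c is None:
        return UNKNOWN, "UNKNOWN - no temperature reading"
    if not fan_ok:
        return CRITICAL, f"CRITICAL - fan failure, FPGA at {temp_c:.0f} C"
    if temp_c >= crit_c:
        return CRITICAL, f"CRITICAL - FPGA at {temp_c:.0f} C"
    if temp_c >= warn_c:
        return WARNING, f"WARNING - FPGA at {temp_c:.0f} C"
    return OK, f"OK - FPGA at {temp_c:.0f} C"


# Example readings (made up for illustration)
for temp, fan in [(53.0, True), (70.0, True), (60.0, False), (None, True)]:
    code, message = check_fpga(temp, fan)
    print(code, message)
```

In a real plugin the message would be printed and the code passed to `sys.exit()`. A CRITICAL result is where the automatic flushing of the FPGA configuration could be triggered.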
C-RORC PRR 16
Identification of cards
• DNA id of FPGA: unique number
• Hapro serial number printed on PCB of card
• ATLAS number
• No registration of all QSFPs / memory modules, but ROS team will keep a record of malfunctioning devices
C-RORC PRR 17
Number of cards to be produced
• Total, including pre-production: 210 for ATLAS, 170 for ALICE
• ATLAS:
– Sub-detectors have been asked if they would like to purchase C-RORCs for test setups, deadline for requests: 15 April. Two requests received so far
– ATLAS with 210 C-RORCs: about 10% spares + ~10 cards for validation system at CERN and test systems at developer labs
  • Need for a (small) additional batch of C-RORCs, to be discussed
– Plan to have complete Gen-III ROS PCs available as spares (at least 4, depends on plans with pre-series)
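The card budget above can be turned into a rough count of cards available for installation. The slides do not say whether "about 10% spares" refers to the total of 210 or to the installed cards, so the sketch below shows both readings. Both are assumptions, and the results are estimates only.

```python
TOTAL_ATLAS = 210       # total for ATLAS, including pre-production
TEST_CARDS = 10         # "~10 cards" for validation and test systems
SPARE_FRACTION = 0.10   # "about 10% spares"


def installed_if_spares_of_total(total=TOTAL_ATLAS, test_cards=TEST_CARDS,
                                 spare_fraction=SPARE_FRACTION):
    """Reading 1 (assumption): spares are 10% of the total order."""
    spares = round(total * spare_fraction)
    return total - spares - test_cards


def installed_if_spares_of_installed(total=TOTAL_ATLAS, test_cards=TEST_CARDS,
                                     spare_fraction=SPARE_FRACTION):
    """Reading 2 (assumption): spares are 10% of the installed cards,
    i.e. installed * (1 + spare_fraction) + test_cards = total."""
    return int((total - test_cards) / (1 + spare_fraction))


print("Reading 1:", installed_if_spares_of_total(), "cards for installation")
print("Reading 2:", installed_if_spares_of_installed(), "cards for installation")
```

Either way the number of cards left for installation comes out at roughly 180.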
C-RORC PRR 18
Testing upon arrival
• Repeat Hapro test on small sample
• Subset of Hapro test for all cards (no loopback, no FMC)
• robinnpbist testing with RobinNP firmware
• Run a test partition with Dolars or Dozolar sending test data to C-RORC under test and ReadoutApplication and ROSTester programs
• After installation of Gen-III ROS PCs run again with test partition and verify that loading new firmware is OK
C-RORC PRR 19
Deployment environment
• USA-15
• 2 U high server PCs
• 2 C-RORCs + 2 dual port 10 GbE NICs per PC
• Purchase contract for PCs not yet awarded, tendering closed, two candidate PCs under test in bldg. 4
C-RORC PRR 20
S-Link tolerance test
QSFP related:
• Set up a ROL between a DOLAR and a RobinNP and measure with a variable attenuator at what attenuation the link starts to fail
• LC-MPO fan out will be tested at the same time
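The attenuation measurement described above amounts to a stepwise sweep. The sketch below shows the idea with a simulated link. The `link_ok` callback, the step size, the sweep range and the 12.3 dB budget of the simulated link are all hypothetical; in a real setup the callback would set the attenuator and check the link for errors.

```python
def find_failure_attenuation(link_ok, start_db=0.0, stop_db=20.0, step_db=0.5):
    """Increase the attenuation in fixed steps and return the first setting
    (in dB) at which link_ok(attenuation) reports a failure, or None if the
    link survives the whole sweep."""
    steps = int(round((stop_db - start_db) / step_db))
    for i in range(steps + 1):
        attenuation_db = start_db + i * step_db
        if not link_ok(attenuation_db):
            return attenuation_db
    return None


def simulated_link_ok(attenuation_db, budget_db=12.3):
    """Stand-in for the real measurement: the link works up to a made-up budget."""
    return attenuation_db <= budget_db


print("Link starts to fail at", find_failure_attenuation(simulated_link_ok), "dB")
```

The resolution of the result is limited by the step size, so a finer second sweep around the first failing value may be worthwhile.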
C-RORC PRR 21
Schedule slippages
• There have been some significant slippages in the schedule. In particular:
– Delivery of the Pre-Series C-RORC cards was delayed, initially by a change of FPGA fan (to meet the PCIe thickness spec) and then more significantly by changes in the PCB build requested by the company (NB: without the efforts of Tivadar Kiss these probably could not have been solved).
– The RobinNP firmware has taken longer to produce than expected. Although it now all exists, there are still issues remaining. A fix has been found for the issue in the buffer handling for full-size fragments from multiple channels, but optimization and further checking of the firmware are needed.
– Procurement of the GEN-III ROS PCs has been delayed, mainly in getting the tender launched, so that tests of the candidate PCs are only just starting.
C-RORC PRR 22
Effect on testing of schedule slippages
• Thus not yet able to start a long-duration stability test using pre-series cards in the final configuration
• But there is a growing body of evidence from tests by ALICE and ourselves at CERN, RHUL and Nikhef that the C-RORC H/W works reliably
• Thus we no longer plan to run a long-duration (6-week) stability test prior to the main C-RORC production - the risk by not running the test is small and outweighed by the consequence of the extra delay it would cause
C-RORC PRR 23
Support
• 5 years warranty by Hapro
• Test setup at CERN for first diagnosis, remote access by experts possible
• Test setups at RHUL and Nikhef for further investigations
C-RORC PRR 24
Installation schedule
Boundary conditions:
• The ROS system has to be stable and tested by 1 February 2015
• In case of a major problem with the GEN-III, re-installing and re-testing the GEN-II H/W takes ~6 weeks
Step                                                      Completion date
Ordering C-RORCs, memory and QSFPs                        1 May
Ordering PCs                                              15 May
Delivery of C-RORCs, memory, QSFPs and PCs                15 August
Upgrade of the first ROS rack (rack yet to be nominated)  15 September (with some contingency)
Stability test with this rack                             1 October
Upgrade of the remaining racks                            15 November
Decision: Keep GEN-III or revert to GEN-II                1 December
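The boundary conditions can be checked against the decision date with a few lines of date arithmetic. The sketch assumes that the step dates without a year refer to 2014 and that "~6 weeks" means exactly 42 days; both are assumptions.

```python
from datetime import date, timedelta

DECISION_DATE = date(2014, 12, 1)      # "1 December", year 2014 assumed
ROS_STABLE_BY = date(2015, 2, 1)       # boundary condition from the slide
REVERT_DURATION = timedelta(weeks=6)   # "~6 weeks", taken as exactly 42 days

# If the decision on 1 December is to revert, when is GEN-II back in place?
revert_done = DECISION_DATE + REVERT_DURATION
margin = ROS_STABLE_BY - revert_done

print("GEN-II re-installed and re-tested by:", revert_done.isoformat())
print("Margin before 1 February 2015:", margin.days, "days")
```

Under these assumptions a revert decided on 1 December would finish around 12 January 2015, leaving about 20 days of margin.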
C-RORC PRR 25
Concluding remarks
• RobinNP firmware not yet finalized, but to the best of our knowledge there are no hardware related issues
• ALICE is happy with starting the production
• If we do not start production now the deployment of the Gen III ROS for 2015 is not likely to be possible
C-RORC PRR 31
Test setup at Nikhef
Intel server with 2 C-RORCs
VME crate with 12 MRODs and SBC
Rack with Gen-I and Gen-II ROS PCs with Dolars and 10 GbE NICs, and with E5-1620 based machine with 40 GbE dual-port NIC and 10 GbE NICs
C-RORC PRR 33
Machine with SuperMicro MB at CERN
Picture from Supermicro website, machine at CERN has 1 CPU
Fan may not be optimally positioned for max. air flow over PCIe cards
C-RORC PRR 34
Test setup configuration (Nikhef)
[Diagram: test setup configuration at Nikhef. Labels in the diagram:]
• ROS PC running ReadoutApplication, 2 x E5-2690 CPU (only 1 CPU used)
• Cerntech C-RORC and Hapro C-RORC
• Gen II ROS PC running ROSTester (2x)
• SLC6 64-bit PC with E5-1620 CPU
• Intel 2-port 10 Gb/s NIC (5x)
• Gen-I ROS PC, 3 DOLARs, 12 S-links
• 1 word = 4 Bytes
• 1 subROB
C-RORC PRR 35
Test with fix for lock up
10% readout fraction, 12 x 150 word fragments
4 x 10**9 events generated: ROSTester stops
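To give a feeling for the scale of this test, the sketch below estimates the data volume from the quoted numbers, using 1 word = 4 Bytes as stated for the Nikhef setup. It assumes that every event carries one fragment per channel and that the readout fraction applies uniformly to all fragments; both are simplifying assumptions, and the function names are illustrative.

```python
BYTES_PER_WORD = 4  # "1 word = 4 Bytes"


def event_size_bytes(channels: int, words_per_fragment: int) -> int:
    """Size of one event summed over all channels."""
    return channels * words_per_fragment * BYTES_PER_WORD


def volumes_tb(events: int, channels: int, words_per_fragment: int,
               readout_fraction: float):
    """Return (generated, requested) data volume in TB (10**12 bytes)."""
    generated = events * event_size_bytes(channels, words_per_fragment)
    requested = generated * readout_fraction
    return generated / 1e12, requested / 1e12


generated_tb, requested_tb = volumes_tb(
    events=4 * 10**9, channels=12, words_per_fragment=150, readout_fraction=0.10
)
print(f"Event size: {event_size_bytes(12, 150)} bytes")
print(f"Generated: {generated_tb:.1f} TB, requested: {requested_tb:.2f} TB")
```

With these inputs one event is 7200 bytes, so the run corresponds to roughly 28.8 TB written into buffer memory and about 2.9 TB read out.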
C-RORC PRR 39
Test with fix for lock up
45% readout fraction, 12 x 200 word fragments
4 x 10**9 events
C-RORC PRR 41
Test with fix for lock up
70% *) readout fraction, 12 x 200 word fragments
*) 1 ROSTester requesting 100% of fragments, other ROSTester requesting 40% of fragments
Slide corrected on 15 April
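The 70% figure in the footnote is consistent with a simple average of the two ROSTester request fractions. The sketch below shows that arithmetic as a weighted mean. It assumes that both ROSTesters handle the same share of events (equal weights), which the slide does not state explicitly.

```python
def combined_readout_fraction(fractions, weights=None):
    """Weighted mean of per-requester readout fractions.

    With weights=None every requester is assumed to handle an equal share
    of the events (an assumption, not stated on the slide).
    """
    if weights is None:
        weights = [1.0] * len(fractions)
    total_weight = sum(weights)
    return sum(f * w for f, w in zip(fractions, weights)) / total_weight


combined = combined_readout_fraction([1.00, 0.40])
print(f"Combined readout fraction: {combined:.0%}")
```

If the two ROSTesters handled different shares of the events, the weights could be set accordingly and the combined fraction would shift toward the busier one.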