C M L CSE 520: Advanced Computer Architecture: Reliability Aviral Shrivastava

Page 1: CSE 520: Advanced Computer Architecture: Reliability

CML

CSE 520: Advanced Computer Architecture:

Reliability

Aviral Shrivastava

Page 2: CSE 520: Advanced Computer Architecture: Reliability

CMLWeb page: aviral.lab.asu.edu CML

Therac-25 (1985-1987)

The Therac-25 was a machine for administering radiation therapy, generally for treating cancer patients.

An ‘arithmetic overflow’ sometimes occurred during automatic safety checks.

If, at this precise moment, the operator was configuring the machine, the safety checks would fail and the metal target would not be moved into place.

The result was that a beam of roughly 100 times the intended dose would be fired into the patient, causing radiation poisoning.

This happened on 6 known occasions, causing the later death of 4 patients.
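A minimal sketch of how an arithmetic overflow can make a safety check silently disappear. It assumes, hypothetically, a one-byte flag that is incremented on every pass and means "run the check" whenever it is nonzero. The names `FLAG_BITS`, `increment_flag` and `passes_that_skip_check` are illustrative, and this is not the Therac-25 code.

```python
FLAG_BITS = 8  # assumption: the flag is stored in a single byte


def increment_flag(flag):
    """Increment a one-byte flag, wrapping around the way 8-bit storage does."""
    return (flag + 1) % (1 << FLAG_BITS)


def passes_that_skip_check(total_passes):
    """Return the pass numbers on which the flag wrapped to 0, so the check was skipped."""
    skipped = []
    flag = 0
    for pass_number in range(1, total_passes + 1):
        flag = increment_flag(flag)
        if flag == 0:  # overflow: the flag now reads "no check needed"
            skipped.append(pass_number)
    return skipped


if __name__ == "__main__":
    print(passes_that_skip_check(600))  # [256, 512]
```

In this sketch the check is skipped only on every 256th pass, which is why such a fault shows up rarely and only with particular timing. One way to avoid the wraparound is to assign the flag a fixed nonzero value instead of incrementing it.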

Page 3: CSE 520: Advanced Computer Architecture: Reliability

Patriot Missile Bug - February 25th, 1991

During Operation Desert Storm, a US Patriot missile battery failed to track and intercept an incoming Scud missile, which hit a US Army barracks, killing 28 soldiers and injuring a further 98.

The internal clock would ‘drift’ (much like any clock) further and further from accurate time. The system had been left running for about 100 hours, by which point the internal clock had drifted by 0.34 of a second.

So when the system calculated where to look for the target, it was looking over half a kilometer away from the missile’s true location.
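The drift figures above can be reproduced with a short calculation. This is a sketch under assumptions taken from commonly cited analyses of the incident rather than from the slide: the clock counted tenths of a second, the constant 1/10 was stored as a binary fraction chopped to 23 bits after the binary point, and the incoming missile travelled at roughly 1676 m/s. All names in the code are illustrative.

```python
from fractions import Fraction

# Assumptions (from commonly cited analyses, not from the slide):
FRACTION_BITS = 23         # bits kept after the binary point for 1/10
TICKS_PER_SECOND = 10      # the clock counts tenths of a second
SCUD_SPEED_M_PER_S = 1676  # approximate speed of the incoming missile

EXACT_TICK = Fraction(1, 10)
# Chop (truncate) 1/10 to FRACTION_BITS binary digits, as fixed-point storage would.
STORED_TICK = Fraction(int(EXACT_TICK * 2**FRACTION_BITS), 2**FRACTION_BITS)


def clock_drift_seconds(hours_running):
    """Accumulated timing error after the given number of hours of uptime."""
    ticks = hours_running * 3600 * TICKS_PER_SECOND
    return float(ticks * (EXACT_TICK - STORED_TICK))


def tracking_error_metres(hours_running):
    """Distance the target travels during the accumulated timing error."""
    return clock_drift_seconds(hours_running) * SCUD_SPEED_M_PER_S


if __name__ == "__main__":
    print(round(clock_drift_seconds(100), 2))  # 0.34
    print(round(tracking_error_metres(100)))   # 575
```

Each tick is short by less than a ten-millionth of a second, but 3.6 million ticks in 100 hours add up to about a third of a second, which at missile speeds is more than half a kilometer.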

Page 4: CSE 520: Advanced Computer Architecture: Reliability

Skynet Brings Judgement Day (1997)

Cost: 6 billion dead, near-total destruction of human civilization and animal ecosystems (fictional)

Disaster: Human operators attempt to shut off the Skynet global computer network.  Skynet responds by firing U.S. nuclear missiles at Russia, initiating global nuclear war on what became known as Judgement Day (August 29, 1997).

Cause: Cyberdyne, the leading weapons manufacturer, installed Skynet technology in all military hardware including stealth bombers and missile defense systems. The Skynet technology formed a seamless network and effectively removed humans from strategic defense.  Eventually Skynet became sentient, was threatened when the humans tried to take it offline, sought to survive, and retaliated with nuclear war.

Page 5: CSE 520: Advanced Computer Architecture: Reliability

Cold War Missile Crisis - September 26, 1983

Soviet military officer Stanislav Petrov received an alert that the US had launched five Minuteman intercontinental ballistic missiles.

Petrov found it strange that the US would attack with just a handful of warheads.

Considering that the early warning system was known to have flaws and had been rushed into service, Petrov decided to rule the alert as a false alarm.

It was later determined that the early detection software had picked up the sun’s reflection from the top of clouds and misinterpreted it as missile launches.

Page 6: CSE 520: Advanced Computer Architecture: Reliability

Michigan Dept. of Corrections Grants Prisoners Early Release

In October 2005, The Register reported on the early release of 23 prisoners due to a computer programming glitch with the Michigan Department of Corrections.

The prisoners were released between 39 and 161 days early, while an undisclosed number of inmates were kept in jail past their release dates.

State assembly representative Rick Jones was concerned about the matter, but noted that he was “glad it’s not murderers.”

Page 7: CSE 520: Advanced Computer Architecture: Reliability

North American Blackout - August 14, 2003

Affecting around 55 million people, mainly in the northeastern United States, but also in Ontario, Canada, this was one of the biggest power blackouts in history.

While the causes of this blackout had nothing to do with a software bug, it could have been averted were it not for a software bug in the control centre alarm system.

The control centre alarm system had a ‘race condition’, which caused it to freeze and stop processing alerts. The alarm system failed ‘silently’, and didn’t notify anybody.
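To make the term concrete, here is a minimal sketch of a race condition on shared state. It is not the control centre's software: `pending_alarms` and both functions are hypothetical. The first function replays one bad interleaving of two writers by hand so that the lost update is deterministic; the second shows the usual remedy of guarding the read-modify-write with a lock.

```python
import threading


def lost_update_demo():
    """Replay one bad interleaving of two writers that each record one alarm."""
    pending_alarms = 0
    a_read = pending_alarms      # writer A reads 0
    b_read = pending_alarms      # writer B reads 0 before A writes back
    pending_alarms = a_read + 1  # A writes 1
    pending_alarms = b_read + 1  # B overwrites with 1: A's alarm is lost
    return pending_alarms


def count_with_lock(writers, alarms_each):
    """Record alarms from several threads, serialising each update with a lock."""
    pending_alarms = 0
    lock = threading.Lock()

    def writer():
        nonlocal pending_alarms
        for _ in range(alarms_each):
            with lock:  # only one thread may read-modify-write at a time
                pending_alarms += 1

    threads = [threading.Thread(target=writer) for _ in range(writers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return pending_alarms


if __name__ == "__main__":
    print(lost_update_demo())        # 1, although two alarms were raised
    print(count_with_lock(4, 1000))  # 4000
```

The bad interleaving depends on timing, so a bug of this kind can pass testing and appear only under an unusual load.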

Page 8: CSE 520: Advanced Computer Architecture: Reliability

Blue screen of death

Page 9: CSE 520: Advanced Computer Architecture: Reliability

Source of Errors

Specification errors
    Functionality in footnotes
Programming errors
    Incorrect implementation (Michigan prison error)
    Algorithm error (Cold war missile crisis)
    Floating point errors (Patriot missile)
    Race conditions (Blackout)
Manufacturing errors
    Process variations
    Silicon failures
Runtime errors
    Negative Bias Temperature Instability (NBTI)
    Noise effects
    Voltage emergencies
Environmental
    Soft errors

Assuming systems are mechanically and physically protected!

Page 10: CSE 520: Advanced Computer Architecture: Reliability

Fault Tolerant Computing is not new!

1940s: ENIAC, with 17.5K vacuum tubes and 1000s of other electrical elements, failed once every 2 days

1950s: Early ideas by von Neumann (multichannel, with voting) and Moore-Shannon (“crummy” relays)
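The "multichannel, with voting" idea can be sketched as a majority voter over redundant channels. This is an illustrative software analogue, not von Neumann's construction; `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of channels, else raise."""
    if not outputs:
        raise ValueError("no channel outputs to vote on")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise ValueError("no majority among channels")
    return value


if __name__ == "__main__":
    # Three redundant channels compute the same result; one of them is faulty.
    print(majority_vote([42, 42, 7]))  # 42
```

With three channels the voter masks any single faulty channel. If two channels fail in the same way the voter is fooled, and if all three disagree it can only report the failure.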

Page 11: CSE 520: Advanced Computer Architecture: Reliability

Need is changing: Automation

Space age
Age of Automation
Proliferation of robots

Page 12: CSE 520: Advanced Computer Architecture: Reliability

Need is changing: Proximity

Near body computing
    Google glass
In-body computing
    Accurate drug delivery
    Robotic surgery

Page 13: CSE 520: Advanced Computer Architecture: Reliability

Need is changing: Technology

Transistors are smaller
    Even low-energy particles can cause soft errors.
    Exponentially more low-energy particles
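A soft error is a transient bit flip in stored or in-flight data. The sketch below models a particle strike as a single flipped bit and detects it with one parity bit. Parity is used here only as the simplest illustrative detection scheme, and the function names are hypothetical.

```python
def parity(word):
    """Even-parity bit: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def flip_bit(word, bit):
    """Model a particle strike that inverts one bit of the stored word."""
    return word ^ (1 << bit)


def is_corrupted(word, stored_parity):
    """Report a mismatch between the word and the parity saved when it was written."""
    return parity(word) != stored_parity


if __name__ == "__main__":
    original = 0b10110010
    saved = parity(original)
    struck = flip_bit(original, 3)
    print(is_corrupted(original, saved))  # False
    print(is_corrupted(struck, saved))    # True
```

A single parity bit detects any odd number of flipped bits but cannot correct them, and it misses an even number of flips.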

Page 14: CSE 520: Advanced Computer Architecture: Reliability

Welcome

To the course on designing reliable computing systems

Focus of the course will be on “soft errors”

Class webpage: http://www.public.asu.edu/~ashriva6/teaching/ARC/