







Hardware and Environment Failures

• Moving parts, high speed, low tolerance, high complexity: disks, tape drives/libraries

• Lowest MTBF found in fans and power supplies

• Often fans fail gradually → subtle, sporadic failures in CPU, memory, backplane

• Environment: power, cooling, dehumidifying, cables, fire, collapsing racks, ventilation, earthquakes, ...
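
Why low-MTBF parts such as fans and power supplies dominate can be shown with a rough calculation. The sketch below assumes independent components with constant failure rates, all of which are needed for the system to work (a series system), so the failure rates add. The MTBF figures are invented for illustration and are not measurements.

```python
# Hypothetical MTBF figures in hours, chosen only for illustration.
component_mtbf_hours = {
    "fan": 50_000,
    "power_supply": 100_000,
    "disk": 500_000,
    "cpu": 2_000_000,
}


def series_mtbf(mtbfs):
    """MTBF of a series system, assuming independent, constant failure rates."""
    total_rate = sum(1.0 / m for m in mtbfs)  # failure rates add in series
    return 1.0 / total_rate


system_mtbf = series_mtbf(component_mtbf_hours.values())
print(f"System MTBF: {system_mtbf:,.0f} hours")
```

Under these assumptions the system MTBF is always lower than that of its weakest component, which is why the least reliable parts deserve the most attention.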



Hardware - Background

Chip designers, device engineers and the high-reliability community recognize that reliability concerns ultimately limit the scalability of any generation of microelectronics technology

Statistical methods and reliability physics provide the foundation for better understanding the next generation of scaled microelectronics

Microelectronics device physics
Reliability analysis and modeling
Experimentation
Accelerated testing
Failure analysis

The design, fabrication and implementation of highly aggressive advanced microelectronics require expert controls, modern reliability approaches and novel qualification strategies



ITRS Roadmap
 ITRS predicts the main trends in the semiconductor industry up to 15 years into the future.
 The International Technology Roadmap for Semiconductors is sponsored by the five leading chip manufacturing regions in the world: Europe, Japan, Korea, Taiwan, and the United States.
 The objective of the ITRS is to ensure cost-effective advancements in the performance of the integrated circuit and the products that employ such devices, thereby continuing the health and success of this industry.



ITRS Roadmap

• www.itrs.net

• Editions:

- 1994, 1997, 1999, 2001, 2003, 2005, 2007, 2009, 2012

- Previously: SIA Roadmap





The problem to be solved:

How to design a reliable system out of unreliable hardware?
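
One classic answer is redundancy with voting. The sketch below illustrates triple modular redundancy (TMR), chosen here only as an example technique: three replicas compute the same result and a voter takes the majority. The reliability formula assumes independent module failures and a perfect voter; the function names are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, else raise."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replicas")
    return value


def tmr_reliability(r):
    """P(at least 2 of 3 modules work), assuming independence and a perfect voter."""
    return 3 * r**2 - 2 * r**3


print(majority_vote([42, 42, 7]))  # the single faulty replica is masked
print(tmr_reliability(0.9))        # higher than the single-module value 0.9
print(tmr_reliability(0.4))        # lower than 0.4: redundancy only helps if r > 0.5
```

Note the last line: voting among modules that are more likely to fail than to work makes the system worse, not better.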

Human Factors


The role of humans in safety-critical systems
Human Reliability Analysis
task analysis
human error identification
human error model: Reason
human reliability quantification
mitigating human error
Safe user interface design



Have we learnt since Therac-25?

Software for Certain Medtronic Implanted Infusion Pumps Recalled

FDA Patient Safety News: Show #32, October 2004

• Medtronic is recalling certain software application cards. They're used in the company's Model 8840 N'Vision Clinician Programmers. These hand-held devices are used to program a number of implantable devices, including the SynchroMed and SynchroMed EL implantable infusion pumps.


 The recall is prompted by reports of data entry errors that have led to serious drug overdoses, including two patient deaths. The overdoses occurred when clinicians who were programming the pump entered the wrong time duration or the wrong interval, for example by mistakenly putting the time interval between periodic drug boluses in the "minutes" field instead of the "hours" field.
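
A unit mix-up of this kind can be made harder to commit by carrying the unit together with the value and rejecting implausible results. The following is a minimal sketch of that idea, not the recalled software or its fix: the names `parse_interval` and `BolusInterval` and the plausibility limits are hypothetical, and real limits would be a clinical decision.

```python
from dataclasses import dataclass

MINUTES_PER_UNIT = {"minutes": 1, "hours": 60}

# Hypothetical plausibility limits, for illustration only.
MIN_INTERVAL_MINUTES = 30
MAX_INTERVAL_MINUTES = 24 * 60


@dataclass(frozen=True)
class BolusInterval:
    minutes: int


def parse_interval(value, unit):
    """Convert (value, unit) to minutes and reject implausible intervals."""
    if unit not in MINUTES_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    if value <= 0:
        raise ValueError("interval must be positive")
    minutes = value * MINUTES_PER_UNIT[unit]
    if not MIN_INTERVAL_MINUTES <= minutes <= MAX_INTERVAL_MINUTES:
        raise ValueError(
            f"interval of {minutes} min is outside the plausible range; confirm the unit"
        )
    return BolusInterval(minutes)


print(parse_interval(4, "hours"))  # accepted: 240 minutes
try:
    parse_interval(4, "minutes")   # the mistake described above
except ValueError as err:
    print("rejected:", err)
```

A range check does not catch every slip, but it turns a silent error into one that the operator must explicitly confirm.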



#### Automation

- A driving force of automation is to compensate for human disadvantages
  - humans are unreliable components of systems requiring replacement by reliable computers
  - humans have limited capabilities in response time and capacity
- However, humans play an essential role in safety-critical decision making
  - computers are not flexible or adaptable, e.g., response in emergency situations
  - computers cannot make creative judgements or strategic decisions




## Human Error and Risk

- Automation yields
  - Increased capacity and productivity
  - Reduction in manual workload and fatigue
  - Increased safety
- But
  - Need specialised training
  - Cost of maintenance
- Impact on human operators
  - Unclear if overall workload reduced
  - Increased complacency due to overconfidence?


### Role of Humans

- Monitor: detecting errors
  - it may not be possible to determine if an error has occurred
  - the system may provide inadequate feedback
  - operators may become complacent
- Backup: in an emergency
  - operators may become de-skilled
  - information provided may be inadequate for intervention
  - automated systems are usually too complicated



### Role of Humans

- Partner: responsible for part of a task
  - humans may be assigned "hard to automate" part
  - humans may be responsible for monitoring and maintaining
  - division of responsibility may make building a mental model harder






What are humans good at?

• Detecting correlations and exceptions

- Patterns/clusters in graphical data

- Breaks in lines

- Visual/sound disturbances

• Detecting isolated movement

- Waving

- Flashing lights

• Detecting differences

- Sounds, alarms, etc

- Lights on/off

- etc.









#### Human Machine Interaction (HMI)

- Hybrid discipline: psychology, engineering, ergonomics, medicine, sociology, mathematics
- Concerned with the impact of human operators and maintainers on system performance, safety and productivity
- Concerned with enhancing the efficiency, flexibility, comprehensibility and robustness of user interaction
- In the safety-critical context, the primary concern is to enhance robustness, possibly at the expense of efficiency and flexibility



### Human Reliability Analysis (HRA)

- Identify potential operator errors that may lead to hazards and reduce error where risk is sufficiently high
- Four steps:
  - task analysis: characterise the actions performed to achieve particular goals
  - human error identification: identify possible erroneous actions in performing the task
  - human reliability quantification: estimate likelihood of error
  - mitigation of human error: identify control options
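
The quantification step can be sketched with a simple calculation. Assuming each step of a task has its own human error probability (HEP), that steps fail independently, and that there is no recovery, the probability of at least one error is one minus the product of the per-step success probabilities. The steps and HEP values below are hypothetical.

```python
import math

# Hypothetical per-step human error probabilities for an illustrative task.
step_hep = {
    "read prescription": 0.003,
    "select unit field": 0.01,
    "enter value": 0.005,
    "confirm entry": 0.001,
}


def task_error_probability(heps):
    """P(at least one step fails), assuming independent steps and no recovery."""
    p_all_correct = math.prod(1.0 - p for p in heps)
    return 1.0 - p_all_correct


p_error = task_error_probability(step_hep.values())
riskiest = max(step_hep, key=step_hep.get)
print(f"P(task error) = {p_error:.4f}; largest contributor: {riskiest}")
```

The step with the largest HEP is the natural first candidate for the mitigation step.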




### Task Analysis

- Tasks are activities to transform some given initial state into a goal state, i.e., goal-directed
- Structured from sub-tasks and elementary actions
- Each elementary action is concerned with a manipulation to be performed upon an object in the task domain
- Procedures for
  - normal operation of the system
  - maintenance of the system
  - emergency situations
- Logical sequence of actions that the operator engages in and the detailed physical executions that the operator performs

#### Human-Task Mismatch

- Human error is not a useful term
  - Implies possible to improve humans
  - Human-Task Mismatch better term
  - Erroneous behaviour inextricably connected to the behaviour needed to complete a task
- Tasks
  - Involve problem solving, decision making
  - Need adaptation, experimentation, optimisation
- Levels of cognitive control [Rasmussen]
  - Skills-based behaviour (smooth, sensory based)
  - Rule-based behaviour (conscious problem solving)
  - Knowledge-based behaviour (goal known, planning by selection, trial and error, etc.)



# Experimentation versus Error

- Designer relies mostly on knowledge-based behaviour
- Operator employs all three
  - In training, from knowledge- or rule-based to skills-based
  - In unfamiliar situations, use knowledge-based to develop rule-based
  - Needs to maintain knowledge-based throughout
- Experimentation
  - Test a set of hypotheses through mental reasoning
  - May be unreliable
- Human error
  - Unsuccessful experiments, in an unkind environment
- Design for error tolerance




### Human as Monitor

- Monitoring, rather than active control
  - Responsible for detecting/repairing problems
- Humans perform badly
  - Task may be impossible
    - Cannot check in real time if computer performs correctly
  - Operator dependent on information provided
    - Too much or too little is bad
    - Information is indirect
  - System handles most functionality
    - Failures may be silent or masked
    - E.g. autopilot disengages
  - Tasks are such that lower alertness results
    - Mechanical, lack of stimulation, can act without noticing





Accident Models

• Reduce description of accident to a set of events and conditions

• Used in investigations, for prediction, etc.

• Domino models

- Social environment → Fault of a person → Unsafe act or mechanical/physical hazard → Accident → Injury

• Chain-of-events

- Event trees, fault trees

• System theory

- Accidents result from complex interactions
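
A chain-of-events model such as a fault tree can be evaluated numerically. The sketch below assumes independent basic events, so an AND gate multiplies probabilities and an OR gate is one minus the product of the complements. The tree and its probabilities are hypothetical and only illustrate the mechanics.

```python
def and_gate(*probs):
    """All inputs must occur: multiply probabilities (independence assumed)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def or_gate(*probs):
    """At least one input occurs: 1 - product of complements (independence assumed)."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


# Hypothetical basic-event probabilities, for illustration only.
p_fan_fails = 0.02
p_sensor_fails = 0.01
p_operator_misses_alarm = 0.1

# Top event: overheating goes unhandled =
#   fan fails AND (temperature sensor fails OR operator misses the alarm)
p_unnoticed = or_gate(p_sensor_fails, p_operator_misses_alarm)
p_top = and_gate(p_fan_fails, p_unnoticed)
print(f"P(top event) = {p_top:.5f}")
```

The independence assumption is exactly what the system-theory view questions: when events interact, the simple products above no longer hold.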

Human Tasks

• Simple tasks

- Uncomplicated sequences

• Vigilance tasks

- Detection of signals

• Emergency response tasks

- May involve complex reactions

- Performed under stress

• Complex tasks

- Defined tasks, involve decision-making

**Human Error Models**

• Cognitive, e.g. Reason's model: eight primary error groups

- False sensation (lack of correspondence between subjective experience and reality)

- Attentional failures (distraction, dividing attention)

- Memory lapses (forgetting items)

- Unintended words/actions

- Recognition failures (wrongly observed signals)

- Inaccurate and blocked recall (misremembering sequences)

- Errors in judgement (misconceptions)

- Reasoning errors (false deduction)

• Also Norman's model of slips, mistakes in planning

Human-Task Mismatch again...

• Errors are an integral part of learning!

• Mechanisms of human malfunction

- Skills-based level

• Disorientation, motor skills failure

• Stereotype take-over

- Rule-based level

• Incorrect recall of rules

• Stereotype function

- Knowledge-based level

• Mental overload

• Premature hypothesis (way of least resistance, point of no return)

• Also performance affecting factors (separately)

- Work conditions, stress, social aspects





Verification vs. Validation

• Verification:

"Are we building the system right?"

- The system should conform to its specification

• Validation:

"Are we building the right system?"

- The system should do what the user really requires



Introduction

• Formal methods – use of mathematical techniques in the specification, design and analysis of hardware and software

• Many of the problems associated with the development of safety-critical systems are related to deficiencies in specification







Formal Methods

• Based on formal languages

• Very precise rules

• System (formal) specification languages

• Can only assist!

• Main advantage: automated tests

• Requirements → spec → design

• Possibility to prove



Formal Specification Languages

• These languages involve the explicit specification of a state model: the system's desired behaviour is described with abstract mathematical objects such as sets, relations and functions.

- VDM (Vienna Development Method, ISO standardised)

- Z-language

- B-Method
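
To give a feel for a state model built from sets, relations and functions, here is a loose Python analogy. It is not VDM, Z or B syntax; the `AccessControl` example, its invariant and its operations are all hypothetical. The state consists of a set and a relation, an invariant constrains them, and each operation has a precondition.

```python
class AccessControl:
    """State model: a set of known users and a relation user -> granted rooms."""

    def __init__(self):
        self.users = set()    # state component: a set
        self.grants = set()   # state component: a relation, as (user, room) pairs
        self._check_invariant()

    def _check_invariant(self):
        # Invariant: only known users may appear in the relation.
        assert all(user in self.users for user, _ in self.grants)

    def add_user(self, user):
        assert user not in self.users  # precondition
        self.users = self.users | {user}
        self._check_invariant()

    def grant(self, user, room):
        assert user in self.users      # precondition
        self.grants = self.grants | {(user, room)}
        self._check_invariant()

    def rooms_of(self, user):
        # Function derived from the relation: the relational image of {user}.
        return {room for u, room in self.grants if u == user}


ac = AccessControl()
ac.add_user("ann")
ac.grant("ann", "lab")
print(ac.rooms_of("ann"))
```

In a formal specification language the invariant and preconditions would be proved to hold for every operation, rather than checked at run time as the assertions do here.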











Verification Methods

Deductive verification
Model checking
Equivalence checking
Simulation - performed on the model
Emulation, prototyping - product + environment
Testing - performed on the actual product (manufacturing test)

Formal Verification

• Deductive reasoning (theorem proving)

- uses axioms, rules to prove system correctness

- no guarantee that it will terminate

- difficult, time consuming: for critical applications only

• Model checking

- automatic technique to prove correctness of concurrent systems: digital circuits, communication protocols, etc.

• Equivalence checking

- check if two circuits are equivalent

- OK for combinational circuits, unsolved for sequential
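
For small combinational circuits, equivalence checking can be sketched as an exhaustive comparison of truth tables. The three circuits below are hypothetical examples. Enumeration grows as 2^n in the number of inputs, so practical tools rely on BDDs or SAT solvers instead; this sketch only shows the principle.

```python
from itertools import product


def circuit_a(a, b, c):
    """Reference implementation: (a AND b) OR (a AND c)."""
    return (a and b) or (a and c)


def circuit_b(a, b, c):
    """Optimised implementation: a AND (b OR c)."""
    return a and (b or c)


def circuit_buggy(a, b, c):
    """Faulty variant: a OR (b AND c)."""
    return a or (b and c)


def find_counterexample(f, g, n_inputs):
    """Return an input vector on which f and g differ, or None if equivalent."""
    for inputs in product([False, True], repeat=n_inputs):
        if bool(f(*inputs)) != bool(g(*inputs)):
            return inputs
    return None


print(find_counterexample(circuit_a, circuit_b, 3))      # None: equivalent
print(find_counterexample(circuit_a, circuit_buggy, 3))  # a distinguishing input
```

A counterexample is more useful than a plain "not equivalent" verdict because it can be replayed in simulation to locate the fault.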





Model Checking

• Algorithmic method of verifying correctness of (finite state) concurrent systems against temporal logic specifications

- A practical approach to formal verification

• Basic idea

- System is described in a formal model

• derived from high level design (HDL, C), circuit structure, etc.

- The desired behavior is expressed as a set of properties

• expressed as temporal logic specification

- The specification is checked against the model

Model Checking

• How does it work

- System is modeled as a state transition structure (Kripke structure)

- Specification is expressed in propositional temporal logic (CTL formula)

• asserts how system behavior evolves over time

- Efficient search procedure checks the transition system to see if it satisfies the specification
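
The idea can be sketched with a tiny explicit-state checker. The model below is a hypothetical two-process mutual exclusion protocol, used as the Kripke structure. EF is computed as a least fixpoint and AG is derived from it. Industrial tools use symbolic representations rather than the explicit state sets used here.

```python
from itertools import product

LOCAL = ("n", "t", "c")  # non-critical, trying, critical
STATES = set(product(LOCAL, repeat=2))
INITIAL = ("n", "n")


def successors(state):
    """Interleaved moves: n -> t, t -> c only if the other is not in c, c -> n."""
    result = set()
    for i in (0, 1):
        me, other = state[i], state[1 - i]
        if me == "n":
            nxt = "t"
        elif me == "t":
            if other == "c":
                continue  # blocked: must wait for the other process
            nxt = "c"
        else:
            nxt = "n"
        new = list(state)
        new[i] = nxt
        result.add(tuple(new))
    return result


def ef(target):
    """States satisfying EF target: least fixpoint of Z = target | EX Z."""
    z = set(target)
    while True:
        new = z | {s for s in STATES if successors(s) & z}
        if new == z:
            return z
        z = new


def ag(good):
    """AG good is equivalent to NOT EF (NOT good)."""
    return STATES - ef(STATES - good)


mutex = {s for s in STATES if s != ("c", "c")}
in_c1 = {s for s in STATES if s[0] == "c"}
print("AG not(c1 and c2) holds initially:", INITIAL in ag(mutex))
print("EF c1 holds initially:", INITIAL in ef(in_c1))
```

Because the state set is finite, each fixpoint iteration either adds a state or stops, so the procedure always terminates with a yes or no answer.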

Model Checking

Characteristics
- searches the entire solution space
- always terminates with YES or NO
- relatively easy, can be done by experienced designers
- widely used in industry
- can be automated

Challenges
- state space explosion - use symbolic methods, BDDs
History
- Clarke, Emerson [1981] USA
- Queille, Sifakis [1980s] France

Model Checking - Tasks

• Modeling

- converts a design into a formalism: state transition system

• Specification

- state the properties that the design must satisfy

- use logical formalism: temporal logic

- asserts how system behavior evolves over time

• Verification

- automated procedure (algorithm)

















The Validation Challenge

• Microprocessor validation continues to be driven by the economics of Moore's Law

• Each new process generation doubles the number of transistors available to microprocessor architects and designers

• Some of this increase is consumed by larger structures (caches, TLB, etc.), which have no significant impact on validation

• The rest goes to increased complexity:

• Out-of-order, speculative execution machines

• Deeper pipelines

• New technologies (Hyper-Threading, 64-bit extensions, virtualization, security, ...)

• Multi-core designs

• Increased complexity => increased validation effort and risk

High volumes magnify the cost of a validation escape

Microprocessor Design Scope

• Typical lead CPU design requires:

- 500+ person design team:

• logic and circuit design

• physical design

• validation and verification

• design automation

- 2-2½ years from start of RTL development to A0 tapeout

- 9-12 months from A0 tapeout to production qual (may take longer for workstation/server products)

One design cycle = 2 process generations

Pentium® 4 Processor

RTL coding started: 2H'96

First cluster models released: late '96

First full-chip model released: Q1'97

RTL coding complete: Q2'98

"All bugs coded for the first time!"

RTL under full ECO control: Q2'99

RTL frozen: Q3'99

A-0 tapeout: December '99

First packaged parts available: January 2000

First samples shipped to customers: Q1'00

Production ship qualification granted: October 2000







How do you verify a design which has bugs like this??

• The FMUL instruction, when the rounding mode is set to "round up", incorrectly sets the sticky bit when the source operands are:

$\text{src1}[67:0] = X \cdot 2^{i+15} + 1 \cdot 2^{i}$

$\text{src2}[67:0] = Y \cdot 2^{j+15} + 1 \cdot 2^{j}$

where $i + j = 54$ and $\{X, Y\}$ are integers
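
To see how specific this trigger condition is, the operand pattern can be constructed directly. The sketch below builds one operand pair of the stated form. The helper names are hypothetical, and restricting i to 0..54 (so that j is non-negative) is an assumption not stated in the original.

```python
WIDTH = 68  # operand width, from src[67:0]


def make_operand(x, i):
    """Build X * 2**(i + 15) + 1 * 2**i and check that it fits in 68 bits."""
    value = (x << (i + 15)) + (1 << i)
    if value >= 1 << WIDTH:
        raise ValueError("operand does not fit in 68 bits")
    return value


def bug_pattern_pair(x, y, i):
    """Return (src1, src2) with exponents i and j = 54 - i."""
    if not 0 <= i <= 54:  # assumption: both exponents are non-negative
        raise ValueError("i must be in 0..54")
    j = 54 - i
    return make_operand(x, i), make_operand(y, j)


src1, src2 = bug_pattern_pair(x=3, y=5, i=20)
print(f"src1 = {src1:#x}")
print(f"src2 = {src2:#x}")
```

Operands must match this bit pattern in both sources at once, and the rounding mode must be "round up", which suggests why randomly generated tests are unlikely to hit the case.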

And the answer is...

Hire 70+ validation engineers
Buy several thousand compute servers
Write 12,000 validation tests
Run up to 1 billion simulation cycles per day for 200 days
Check 2,750,000 manually-defined properties
Find, diagnose, track, and resolve 7,855 bugs
Apply formal verification with 10,000 proofs to the instruction decoder and FP units
This found that obscure FMUL bug!

Pentium 4 Validation - Staffing

10 people in initial "nucleus" from previous project
40 new hires in 1997
20 new hires in 1998







Power Reduction Validation

Power consumption was a big concern for Pentium 4

Need to stay within the cost-effective thermal envelope for desktop systems at 1.5+ GHz

Extensive clock gating in every part of the design

Mounted a focused effort to validate that:

Committed features were implemented as per plan
Functional correctness was maintained in the face of clock gating

Changes to the design did not impact power savings

12 person years of effort, 5 heads at peak
Fully functional on A-step silicon, measured savings of ~20W achieved for typical workloads

Formal Verification in P4 Validation

Based on model checking
Given a finite-state concurrent system
Express specifications as temporal logic formulas
Use symbolic algorithms to check whether the specification holds in the model
Constructed a database of 10,000 "proofs"
Over 100 bugs found
20 were "high quality" bugs not likely to be found by simulation
Example errors: FADD, FMUL

Validation Results

• 5809 bugs identified by simulation

- 3411 bugs found by cluster-level testing

- 2398 found using full-chip model

• 1554 bugs found by code inspection

• 492 bugs found by formal verification

• Largest sources of bugs: memory cluster (25%)
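
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers from this slide, confirms that the simulation sub-totals add up, and computes each method's share of all bugs.

```python
bugs_by_method = {
    "simulation": 5809,
    "code inspection": 1554,
    "formal verification": 492,
}
simulation_breakdown = {
    "cluster-level testing": 3411,
    "full-chip model": 2398,
}

total = sum(bugs_by_method.values())

# The two simulation sub-totals should add up to the simulation figure.
assert sum(simulation_breakdown.values()) == bugs_by_method["simulation"]

for method, count in bugs_by_method.items():
    print(f"{method:20s} {count:5d}  {100 * count / total:5.1f}%")
print(f"{'total':20s} {total:5d}")
```

The total matches the 7,855 bugs quoted earlier in the validation summary.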







Verification

• There is a separate course: IAF0620 Verification of Digital Systems (autumn semester)

