Study Unit IT3 - System Defects

Copyright Notice: This material was written and published in Wales by Derek J. Smith (Chartered Engineer). It forms part of a multifile e-learning resource, and subject only to acknowledging Derek J. Smith's rights under international copyright law to be identified as author may be freely downloaded and printed off in single complete copies solely for the purposes of private study and/or review. Commercial exploitation rights are reserved. The remote hyperlinks have been selected for the academic appropriacy of their contents; they were free of offensive and litigious content when selected, and will be periodically checked to have remained so. Copyright © 2001-2004, Derek J. Smith (Chartered Engineer). 

 First published [v1.0] 10:35 GMT 6th February 2001; this version [v1.2 - new link] dated 08:00 BST 19th April 2004

This is the third of nine post-foundation study units making up the INFORMATICS e-learning resource published and supported by Derek J. Smith (Chartered Engineer). For further information, please e-mail me.

Unit Aims and Outcomes: There is no point in changing systems unless and until they are known to be faulty. This study unit therefore looks at the how and why of capturing and analysing system defect data. When you have completed it, you will be able to deploy with enhanced confidence and accuracy the specific skills and vocabulary listed below:

Specific Skills and Vocabulary

1. Treat problem solving as a cyclical behaviour requiring active management throughout

Vocabulary: iteration; iterative process; post-implementation report; problem solving wheel

2. Design and maintain a helpdesk information system to record incidents and known faults; apply the "five whys" method of root cause analysis; devise optimal workmix

Vocabulary: bug; defect analysis; incident handling; known faults log; Lord Bridges' Law; prioritisation and priority codes; proximate cause; root cause analysis; Service Level Agreement (SLA); system owner; workmix

3. Assess the impact of a currently defective system on the efficient discharge of organisational functions, express this in terms of unmet system requirement, cost this unmet requirement, and summarise your conclusions in a formal problem statement

Vocabulary: Requirements Specification

Unit Structure: This unit contains three short lessons, each contributing to the overall unit outcomes, each with its own hyperlinked support material, and each with its own additional reading and tutorial task(s). Here is the learning sequence:

Lesson IT3.1: The Cyclical Nature of Problem Solving

Lesson IT3.2: IT Incidents and Known Faults

Lesson IT3.3: The Formal Problem Statement

So Where To Next?

References


Lesson IT3.1: The Cyclical Nature of Problem Solving

IT problem solving is merely a subset of problem solving in general, and the key to doing it successfully is to approach the process systematically. Problems are best solved in stages, with each stage not only supporting and informing the next, but providing a controlled exit point should the process need to be aborted. Theorists differ slightly over exactly how many stages and substages should be recognised and how to name them, but here is a typical eight-stage description of what is involved:

Problem Definition Stages: The first three stages of problem solving serve to crystallise your thoughts on exactly what it is you are trying to resolve. Basically, they force you to calm down and check your facts, thus avoiding misdirected, "knee-jerk", or other inappropriate responses. Problem definition will be assisted by the use of effective incident handling and defect analysis systems, as described in Lesson IT3.2. Here is a brief description of what each stage sets out to achieve:

Step 1 - Identify Your Problem: This is where you first formally record the existence of a problem.

Step 2 - Gather Facts: This is where anything of relevance to that problem is researched and collated.

Step 3 - Analyse Those Facts: This is where the information collated in Step 2 is analysed, with a view to producing a precise diagnosis of what is going wrong and how much it is costing.

Project Feasibility Stages: The next two stages of problem solving derive a solution to the problem which is both technically and commercially viable, a topic dealt with in further detail in Unit IT4 and Unit IT5. Here is a brief description of what each stage sets out to achieve:

Step 4 - Suggest Possible Solutions: If the problem is within your power to solve, then this is where you start to consider possible solutions (which may or may not include IT solutions).

Step 5 - Identify the Best Solution: This is where the real decision making comes in. It is where the full range of potential solutions is critically evaluated, and it is where the costs and the benefits of each option are weighed against each other, so that a final judgement can be made as to which option is best on balance.

Project Execution Stages: The final three stages of problem solving are where the recommended solution is put into effect as a properly planned and managed system development project. This is dealt with in further detail in Unit IT6. Here is a brief description of what each stage sets out to achieve:

Step 6 - Plan Implementation: This is the resourcing stage. It is where plans are drawn up to deploy the resources likely to be available to deliver your recommended solution.

Step 7 - Build, Implement, and Test: This is where the solution is actually put together. This might involve anything from a small improvement to an existing system to a major system development exercise.

Step 8 - Reassess Situation: This is the last step. It is where you check how well what you did actually worked. You carry out what is known as a Post-Implementation Review, and you write up your findings in a Post-Implementation Report.

So much for the theory. In practice, of course, as soon as you solve one problem, another - often worse, and often the result of a botched attempt at solving the first one - will appear to take its place. So you end up going through the stages again, and again, and again, in a continuous cycle of improvement. Indeed, problem solving theorists have frequently portrayed problem solving as a series of activities arranged around a wheel. They call this the problem solving wheel, and this is how it looks and works:

[Figure: The Problem Solving Wheel]

Problems are best solved in stages, as shown diagrammatically in the Figure opposite. You begin at the top and move round clockwise until you get back to the top .....

..... by which time you will have another problem!

This makes problem solving a cyclical behaviour, and means that you will never stop doing it.

Steps 1 to 3 of the IT problem solving process are described in detail in the remainder of this study unit, and Steps 4 to 8 in later units.
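The cyclical nature of the wheel can be sketched in a few lines of Python. This is a minimal illustration only: the names `STEPS` and `next_step` are invented for the example and are not part of the course material. It simply holds the eight steps listed above in order and uses wrap-around arithmetic so that the step after Step 8 is Step 1 again.

```python
# The eight problem solving steps, in the order given in this lesson.
STEPS = [
    "Identify Your Problem",
    "Gather Facts",
    "Analyse Those Facts",
    "Suggest Possible Solutions",
    "Identify the Best Solution",
    "Plan Implementation",
    "Build, Implement, and Test",
    "Reassess Situation",
]


def next_step(step_number: int) -> int:
    """Return the number (1 to 8) of the step that follows step_number.

    The modulo arithmetic makes the sequence wrap, so Step 8 leads
    back to Step 1, which is what turns the list into a wheel.
    """
    if not 1 <= step_number <= len(STEPS):
        raise ValueError("step_number must be between 1 and 8")
    return step_number % len(STEPS) + 1


if __name__ == "__main__":
    step = 1
    for _ in range(10):  # once round the wheel, then Steps 1 and 2 again
        print(f"Step {step} - {STEPS[step - 1]}")
        step = next_step(step)
```

The point of the sketch is that there is no terminating condition in `next_step`: the only way to leave the wheel is to decide to stop turning it.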

LESSON RATIONALE: And why does all this matter? Because the final problem solving stage will always read "start again from the top". When dealing with business systems, therefore, you should expect them to be in a state of more or less active improvement - with all the attendant chaos - for ever.

EXERCISES (AND STANDARD STUDY TIMES): Depending on how thoroughly you have been exploring the hyperlinks provided, it has probably taken you less than 30 minutes to read the foregoing text, and now you have to do some real work. Complete the following exercises, taking careful note of the expected study times:

IT3.1.1

Consider five problems currently facing you in the workplace. How many are currently being worked on? What stages on the problem solving wheel are they at? What is the limiting factor in deciding how many problems can be solved simultaneously (i.e. if you are currently managing four active problems, why can't you manage five)? [30 minutes.]

IT3.1.2

Obtain dictionary definitions of the words "iterate", "iteration", and "iterative", and get your IT Department contact to explain the term "development lifecycle". [30 minutes.]

IT3.1.3

Research the modern management concept of "continuous improvement". [30 minutes.]

Submitting Exercises for Assessment and Feedback (Fee-Paying Clients Only): Simply e-mail your answer(s) for full tutorial feedback. State each conclusion clearly, and briefly explain how you arrived at it. You may do this one exercise at a time, or all at once. Additional questions may then be asked, and additional tasks given as required. [Submit an Exercise] Please cooperate with this student-tutor exchange, because it will eventually form the basis of your individual student progress record. Do not proceed to Lesson IT3.2 until all the tutorial tasks are completed and signed off.


Lesson IT3.2: IT Incidents and Known Faults

We now move to problem solving specifically in an IT context, and the central point of this lesson is that IT problems can only be reliably identified if system performance has been formally monitored. This means that incidents and known faults logging systems (often referred to as "helpdesk systems") need to have been in place for long enough to accumulate reliable data. Of course, you only need to be concerned with logging service provision levels if you are the "owner" of the system in question, and you will probably not be the owner of any of your organisation's large corporate systems. However, you may well be the owner of one or more small departmental systems, and the comments in the remainder of this unit presume that you are responsible for at least one such small local system.

The Rule: You must formally monitor and record system performance. There are a number of software packages on the market to help you do this vital IT administration task, but because they themselves can be quite expensive they are best left to larger departments where there are dedicated helpdesk staff. For smaller workplaces, you can get away quite comfortably with a simple pen-and-paper system, or perhaps a PC spreadsheet.

The main things you have to record are incidents, that is to say, individual occurrences of departures from specified functionality or level of performance. Each incident report should record such things as date, time, who reported the incident, who took the call, who was affected, what they did to recover from it, and how long it all took. The accumulation of incident reports over time gives you an incidents log. All incidents should be recorded, especially if they are regular failures, and even if they are cured immediately as part of a routine corrective procedure (because only then can a true picture of defect costs be built up).
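For readers who prefer to keep the log on a PC rather than on paper, the incident report described above might be sketched as a simple record. This is an illustration under stated assumptions, not a prescribed design: the class name `IncidentReport`, its field names, and the sample entry are all invented for the example; only the list of things to record comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentReport:
    """One occurrence of a departure from specified functionality or performance."""
    occurred_at: datetime      # date and time of the incident
    reported_by: str           # who reported the incident
    taken_by: str              # who took the call
    affected: list[str]        # who was affected
    recovery_action: str       # what was done to recover from it
    time_lost: timedelta       # how long it all took


# The incidents log is simply the accumulation of reports over time.
incidents_log: list[IncidentReport] = []

# A made-up entry, for illustration only.
incidents_log.append(IncidentReport(
    occurred_at=datetime(2004, 4, 19, 9, 15),
    reported_by="A. User",
    taken_by="B. Helper",
    affected=["A. User", "C. Colleague"],
    recovery_action="Restarted the application",
    time_lost=timedelta(minutes=20),
))

# Summing the time lost is the first step towards costing the defects.
total_time_lost = sum((r.time_lost for r in incidents_log), timedelta())
print(f"{len(incidents_log)} incident(s), total time lost {total_time_lost}")
```

A spreadsheet with one column per field and one row per incident would serve exactly the same purpose.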

Key Illustration: A system might only have one fault, but it might strike once a day. Another system might have 365 faults, but they might each only go wrong once a year. Both systems will have 365 incident reports in their incidents log at the end of one year.
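The illustration above can be checked with a short sketch. The fault identifiers ("A-1", "B-1", and so on) are hypothetical labels invented for the example; the only figures taken from the text are the one fault striking daily and the 365 faults striking once a year each.

```python
from collections import Counter

# System A: a single fault (hypothetical id "A-1") that strikes once a day.
system_a_incidents = ["A-1"] * 365

# System B: 365 different faults, each striking once in the year.
system_b_incidents = [f"B-{n}" for n in range(1, 366)]

for name, incidents in [("A", system_a_incidents), ("B", system_b_incidents)]:
    faults = Counter(incidents)  # fault id -> number of incidents it caused
    print(f"System {name}: {len(incidents)} incidents, {len(faults)} distinct faults")
```

Both systems show 365 incidents, but grouping by fault reveals one underlying fault in the first case and 365 in the second, which is why the incidents log alone is not enough.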

Key Document - Service Level Agreements (SLAs): These are contractually binding agreements between the providers and purchasers of a service. They specify what is and what is not acceptable in such areas as system availability and response time. Their use is recommended whenever the support people are organisationally distant from those who have to rely on them, and essential when support has been contracted out to external agencies.

The incidents log then needs to be periodically analysed, looking for the underlying system defects - or "bugs" - and each defect should then be recorded in a known faults log. This process is known as defect analysis, and the known faults log serves two important purposes:

Incident Handling: The known faults log is a prime source of instruction for new system users, because it shows how to get around incidents when they occur. Again this does not have to be a big system - perhaps a simple advisory notice kept close to an existing PC, or next to the telephone.

Planned Upgrade: The known faults log is also - for obvious reasons - the primary document for helping you decide what needs to be repaired. Simply follow Lord Bridges' Law, which reads: "Find and cure the key defect. All else will follow automatically". For this reason, it is important to add a priority code of some sort to each defect. This will allow high priority defects to be rapidly identified, and thus cured by a single IT investment. This process is sometimes referred to as setting the "workmix" for a proposed project.
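The two uses of the known faults log described above can be sketched as follows. Note the assumptions: the text asks only for "a priority code of some sort", so ranking faults by total staff time lost (with 1 as the highest priority) is one possible scheme chosen for illustration, and the names `KnownFault`, `assign_priorities`, and `select_workmix`, together with the sample faults, are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class KnownFault:
    fault_id: str        # hypothetical identifier
    description: str
    workaround: str      # what users should do when it strikes (incident handling)
    incidents: int       # number of incident reports traced to this fault
    minutes_lost: int    # total staff time lost across those incidents
    priority: int = 0    # priority code; 1 = highest (assumed convention)


def assign_priorities(faults: list[KnownFault]) -> list[KnownFault]:
    """Rank faults by total staff time lost, costliest first, and number them."""
    ranked = sorted(faults, key=lambda f: f.minutes_lost, reverse=True)
    for position, fault in enumerate(ranked, start=1):
        fault.priority = position
    return ranked


def select_workmix(faults: list[KnownFault], max_priority: int) -> list[KnownFault]:
    """Return the faults whose priority code lies between 1 and max_priority."""
    return [f for f in faults if 1 <= f.priority <= max_priority]


# Made-up entries, for illustration only.
known_faults_log = [
    KnownFault("F1", "Report totals wrong after month end", "Re-run the report", 12, 360),
    KnownFault("F2", "Printer queue stalls", "Restart the print spooler", 30, 150),
    KnownFault("F3", "Login screen slow at 9 am", "Wait and retry", 5, 25),
]

ranked = assign_priorities(known_faults_log)
workmix = select_workmix(ranked, max_priority=2)
for fault in workmix:
    print(fault.priority, fault.fault_id, fault.description, "->", fault.workaround)
```

In this sketch the most frequent fault (F2) is not the most expensive one (F1), which is the kind of distinction a priority code is meant to bring out.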

Defect analysis can sometimes require considerable detective work. You begin by accumulating totals and averages, and you then go progressively further afield, looking for "knock on" costs in other departments. It is also worth finding out how others have attempted to solve the problem in the past, why they failed, whether others are currently "on the case" at other sites, and so on, and so on. Remember that the most compelling statistic is how much your problem is costing in staff time per month, and be particularly suspicious as to why the problem has not been solved before, because it may well be that it is actually too difficult or too expensive to solve.
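The staff time statistic mentioned above is simple arithmetic, sketched here under stated assumptions. The function name, its parameters, and the sample figures are all invented for illustration; in practice the inputs would come from your own incidents log and local staff costs.

```python
def monthly_staff_cost(incidents_per_month: int,
                       avg_minutes_per_incident: float,
                       staff_affected: int,
                       hourly_rate: float) -> float:
    """Estimate the monthly cost of a defect in staff time.

    All four inputs are assumptions to be replaced with figures taken
    from your own incidents log and staff cost data.
    """
    hours_lost = incidents_per_month * avg_minutes_per_incident * staff_affected / 60
    return hours_lost * hourly_rate


# Illustrative figures only: 20 incidents a month, 15 minutes each,
# 3 members of staff held up each time, at 12.00 per staff hour.
cost = monthly_staff_cost(20, 15, 3, 12.00)
print(f"Estimated staff cost per month: {cost:.2f}")
```

This covers direct staff time only; any "knock on" costs in other departments would need to be added separately.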

It is also informative to get to the bottom of your problem, because it did not just happen: it was at best allowed to happen, and at worst caused. The problem here is learning how to distinguish between proximate (or "proximal" or "immediate") cause and root cause. This is a distinction first developed by lawyers, and means looking not so much at what happened in the seconds leading up to an accident, but at what had gone before. The proximate cause of injury in a road accident, for example, is physical impact, but that explains little. A slightly less proximate cause might be drunkenness on the part of the driver, which might, in turn, have been due to depression, which might, in turn, have been due to unemployment, which might, in turn, have been due to poor education, and so on. So the problem with root cause analysis is knowing how far to go, and for the purposes of this course, we recommend the method of "the five whys" (see, for example, Pojasek, 2000/2003 online). Further advice on root cause analysis in healthcare is available from Medical Risk Management Associates, and further worked examples are available here.
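The road accident example above can be walked through as a chain of "whys". This is a rough sketch rather than a definitive statement of the method: the function `five_whys` and its `limit` parameter are invented for the example, and treating the point where the questioning stops as the root cause is a simplifying assumption.

```python
# Each entry answers "why?" about the entry before it. The chain is the
# road accident example from the text above.
CAUSAL_CHAIN = [
    "injury in a road accident",
    "physical impact",
    "drunkenness on the part of the driver",
    "depression",
    "unemployment",
    "poor education",
]


def five_whys(chain: list[str], limit: int = 5) -> tuple[str, str]:
    """Ask "why?" at most `limit` times along the chain.

    Returns (proximate_cause, root_cause), where the root cause is taken
    to be the last cause reached before the questioning stops.
    """
    if len(chain) < 2 or limit < 1:
        raise ValueError("need a problem, at least one cause, and limit >= 1")
    causes = chain[1:limit + 1]
    for number, (effect, cause) in enumerate(zip(chain, causes), start=1):
        print(f"Why {number}: why {effect}? Because of {cause}.")
    return causes[0], causes[-1]


proximate, root = five_whys(CAUSAL_CHAIN)
print(f"Proximate cause: {proximate}")
print(f"Root cause (after five whys): {root}")
```

The fixed limit is what answers the question of "how far to go": the chain could always be extended, but the method stops after the fifth answer.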

LESSON RATIONALE: And why does all this matter? Because an important part of producing a sound business case is to select the right problem in the first place.

EXERCISES (AND STANDARD STUDY TIMES): Depending on how thoroughly you have been exploring the hyperlinks provided, it has probably taken you less than 30 minutes to read the foregoing text, and now you have to do some real work. Complete the following exercises, taking careful note of the expected study times:

IT3.2.1

Local departments frequently run small departmental systems to compensate for deficiencies in, or lack of access to, major corporate systems. Who "owns" these small systems? Who should run the incidents and known faults logging systems for them? What is the key strategic weakness in such an approach? [30 minutes.]

IT3.2.2

Select one of your personal small systems, and, if you do not currently run an incidents log, produce one retrospectively to cover the last ten failures of that system. Produce (or, if you already have one, update) a comprehensive known faults log. [2 hours.]

IT3.2.3

Get your IT Department contact to show you a Service Level Agreement and last month's systems metrics reports. [30 minutes.]

IT3.2.4

Use the "five whys" method to explain (a) why the RMS Titanic sank [click for details], (b) why Custer got his men massacred at the Little Big Horn [click for story], and (c) why the functionality provided by your selected small system is not already provided by one of the major corporate systems. [30 minutes.]

Submitting Exercises for Assessment and Feedback (Fee-Paying Clients Only): Simply e-mail your answer(s) for full tutorial feedback. State each conclusion clearly, and briefly explain how you arrived at it. You may do this one exercise at a time, or all at once. Additional questions may then be asked, and additional tasks given as required. [Submit an Exercise] Please cooperate with this student-tutor exchange, because it will eventually form the basis of your individual student progress record. Do not proceed to Lesson IT3.3 until all the tutorial tasks are completed and signed off.


Lesson IT3.3: The Formal Problem Statement

Having identified and costed your problem, you then need to inform your superiors, because they may be blissfully unaware of it. This is often a problem in itself, because senior managers have a different set of priorities. Many will not know about your department at all, for example, and some will not have time to think about your problems because they have enough of their own. The best way to identify a problem to yourself is to mark it off in red highlighting pen on your document flowchart, and the best way to identify it to others is to describe what is going wrong in a few brief sentences of text. This calls for some very precise wording, but can readily be taught by worked example, as follows:

Skeleton Problem Statements: Here is a suggested problem statement structure. When specific details are inserted into the skeleton sentences at the points shown, it gives a succinct statement of what is currently going wrong with your system. Note the repeated reference to functionality, as taught in Unit IT2. There are three basic versions of the statement, one to cope with computer systems which are making mistakes, another to cope with computer systems which are doing what they were originally designed to do, but now need updating because the outside world has changed in some key respect, and a third to cope with manual systems.

Version A - IT System Repair Required: "The X system [you must name it, and locate it precisely within your document flowchart] is a [state system type] system, owned by [state department nominally responsible for the system]. Its prime function is to [state prime function] and it does this by taking information from [cross-reference to the system input(s) shown on the document flowchart] and feeding it to [cross-reference to the system output(s) shown on the document flowchart]. The detailed functionality is set out in a Requirements Specification dated [state date], and the required level of service is governed by a [state whether formal or informal] Service Level Agreement dated [state date] between [state departments involved]. Currently [sometimes only "imminently"] a [summarise fault or faults] mean(s) that the system is unable to deliver the [state defective function or functions] function(s), resulting in [state problem frequency] costly [state nature of incident]."

Version B - IT System Enhancement Required: THE FIRST THREE SENTENCES ARE THE SAME AS VERSION A, THEN ..... "However, these documents fail to take account of [state change in policy], and the system [is already/will become] unable to deliver the [state defective function or functions] function(s), resulting in [state predicted problem frequency] costly [state nature of incident]."

Version C - Manual System: "The X system [you must name it, and locate it precisely within your document flowchart] is a manually operated business system, owned by [state department nominally responsible for the system]. Its prime function is to [state prime function] and it does this by taking information from [cross-reference to the system input(s) shown on the document flowchart] and feeding it to [cross-reference to the system output(s) shown on the document flowchart]. There is no Requirements Specification and no Service Level Agreement. Currently [sometimes only "imminently"] a [summarise fault or faults] mean(s) that the system is unable to deliver the [state defective function or functions] function(s), resulting in [state problem frequency] costly [state nature of incident]."

The Final Forms: This is the sort of paragraph you will end up with (the translation cannot always be totally word for word, so a little resourcefulness is required):

Version A - System Repair Required: "The SEPSIS system [IT system names are frequently acronyms, and frequently end -IS because the full name ends with the words "information system"] is a networked database system [the basic strategies and platforms are outlined in Unit IT4] owned by the Finance Department. Its prime function is to reduce the amount of money tied up in unofficial small stores, and it achieves this by taking information from the bought ledger and stores systems, and comparing it to spot-check audit data. The detailed functionality is set out in a Requirements Specification dated 23rd July 1996, and the required level of service is governed by an informal Service Level Agreement dated 17th August 1996 between Mr So-and-So, the then Finance Director, and Mrs Whats-her-Name, Head of IT. Currently, a serious programming error means that the system is unable to deliver any of its outputs on time, resulting in impaired strategic purchasing."

Version B - System Enhancement Required: THE FIRST THREE SENTENCES ARE THE SAME AS VERSION A, THEN ..... "However, these documents fail to take account of the management reorganisation on 1st January 1999, and the system is already unable to deliver any of its outputs on time, resulting in impaired strategic purchasing."

Version C - Manual System: "The patient appointments system is a manually operated business system, owned by the XXX Department. Its prime function is to inform patients when their next clinic is, and it does this by taking information from clinicians' diaries and record cards, and passing it to the departmental secretary for typing and posting. There is no Requirements Specification and no Service Level Agreement. Currently a number of factors prevent the system responding to the resulting patient queries quickly enough, resulting in an average of 10 unnecessary DNAs (patients who "did not attend") per week, at an estimated annual cost to the department of £xxxxx."
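The step from skeleton to final form can be sketched as a fill-in-the-blanks exercise. This is an illustrative sketch only: the names `Prompts`, `VERSION_C`, and `draft_statement`, and the placeholder names in braces, are invented for the example. The skeleton wording follows Version C above, and the details supplied are those of the patient appointments example. Any detail not yet supplied is left showing as a bracketed prompt rather than being guessed.

```python
class Prompts(dict):
    """A dict that turns any missing detail back into a bracketed prompt."""

    def __missing__(self, key: str) -> str:
        return f"[state {key.replace('_', ' ')}]"


# Skeleton wording follows Version C; the {placeholder} names are assumptions.
VERSION_C = (
    "The {system_name} system is a manually operated business system, "
    "owned by {owner}. Its prime function is to {prime_function} and it does "
    "this by taking information from {inputs} and feeding it to {outputs}. "
    "There is no Requirements Specification and no Service Level Agreement. "
    "Currently a {fault_summary} mean(s) that the system is unable to deliver "
    "the {defective_functions} function(s), resulting in {problem_frequency} "
    "costly {nature_of_incident}."
)


def draft_statement(skeleton: str, details: dict[str, str]) -> str:
    """Insert the known details and leave the rest as prompts to be completed."""
    return skeleton.format_map(Prompts(details))


statement = draft_statement(VERSION_C, {
    "system_name": "patient appointments",
    "owner": "the XXX Department",
    "prime_function": "inform patients when their next clinic is",
    "inputs": "clinicians' diaries and record cards",
    "outputs": "the departmental secretary for typing and posting",
})
print(statement)
```

The output is a first draft, not a finished statement: as the text notes, the translation cannot always be word for word, so the final sentence will usually need rewording by hand once the remaining prompts have been answered.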

LESSON RATIONALE: And why does all this matter? Because another precondition of a sound business case is the ability to describe the defect convincingly to those with the cash to cure it.

EXERCISES (AND STANDARD STUDY TIMES): Depending on how thoroughly you have been exploring the hyperlinks provided, it has probably taken you less than 30 minutes to read the foregoing text, and now you have to do some real work. Complete the following exercises, taking careful note of the expected study times:

IT3.3.1

Get your IT Department contact to explain the derivation of some of their system acronyms. [30 minutes.]

IT3.3.2

Get your IT Department contact to show you a Requirements Specification and a System Enhancement Request. Get them to comment also on the split between system maintenance expenditure (money spent running to stand still) versus system enhancement expenditure (money spent delivering new functionality). [30 minutes.]

Submitting Exercises for Assessment and Feedback (Fee-Paying Clients Only): Simply e-mail your answer(s) for full tutorial feedback. State each conclusion clearly, and briefly explain how you arrived at it. You may do this one exercise at a time, or all at once. Additional questions may then be asked, and additional tasks given as required. [Submit an Exercise] Please cooperate with this student-tutor exchange, because it will eventually form the basis of your individual student progress record. Do not proceed until all the tutorial tasks are completed and signed off.


So Where To Next?

Congratulations!! You have reached the end of Unit IT3 of the INFORMATICS programme. Click to proceed to Unit IT4, and good luck!


References

See the Master References List
