Recovery & Trouble Shooting

COURTESY :- vrindawan.in

Wikipedia

In computing, data recovery is a process of salvaging deleted, inaccessible, lost, corrupted, damaged, or formatted data from secondary storage, removable media or files, when the data stored in them cannot be accessed in a usual way. The data is most often salvaged from storage media such as internal or external hard disk drives (HDDs), solid-state drives (SSDs), USB flash drives, magnetic tapes, CDs, DVDs, RAID subsystems, and other electronic devices. Recovery may be required due to physical damage to the storage devices or logical damage to the file system that prevents it from being mounted by the host operating system (OS).

Data recovery - Wikipedia

Data recovery can be a very simple or technical challenge. This is why there are specific software companies specialized in this field.

The most common data recovery scenarios involve an operating system failure, malfunction of a storage device, logical failure of storage devices, accidental damage or deletion, etc. (typically, on a single-drive, single-partition, single-OS system), in which case the ultimate goal is simply to copy all important files from the damaged media to another new drive. This can be accomplished using a Live CD, or DVD by booting directly from a ROM or an USB drive instead of the corrupted drive in question. Many Live CDs or DVDs provide a means to mount the system drive and backup drives or removable media, and to move the files from the system drive to the backup media with a file manager or optical disc authoring software. Such cases can often be mitigated by disk partitioning and consistently storing valuable data files (or copies of them) on a different partition from the replaceable OS system files.

Another scenario involves a drive-level failure, such as a compromised file system or drive partition, or a hard disk drive failure. In any of these cases, the data is not easily read from the media devices. Depending on the situation, solutions involve repairing the logical file system, partition table, or master boot record, or updating the firmware or drive recovery techniques ranging from software-based recovery of corrupted data, to hardware- and software-based recovery of damaged service areas (also known as the hard disk drive’s “firmware”), to hardware replacement on a physically damaged drive which allows for the extraction of data to a new drive. If a drive recovery is necessary, the drive itself has typically failed permanently, and the focus is rather on a one-time recovery, salvaging whatever data can be read.

In a third scenario, files have been accidentally “deleted” from a storage medium by the users. Typically, the contents of deleted files are not removed immediately from the physical drive; instead, references to them in the directory structure are removed, and thereafter space the deleted data occupy is made available for later data overwriting. In the mind of end users, deleted files cannot be discover able through a standard file manager, but the deleted data still technically exists on the physical drive. In the meantime, the original file contents remain, often several disconnected fragments, and may be recoverable if not overwritten by other data files.

The term “data recovery” is also used in the context of forensic applications or espionage, where data which have been encrypted, hidden, or deleted, rather than damaged, are recovered. Sometimes data present in the computer gets encrypted or hidden due to reasons like virus attacks which can only be recovered by some computer forensic experts.

A wide variety of failures can cause physical damage to storage media, which may result from human errors and natural disasters. CD-ROMs can have their metallic substrate or dye layer scratched off; hard disks can suffer from a multitude of mechanical failures, such as head crashes, PCB failure, and failed motors; tapes can simply break.

Physical damage to a hard drive, even in cases where a head crash has occurred, does not necessarily mean there will be a permanent loss of data. The techniques employed by many professional data recovery companies can typically salvage most, if not all, of the data that had been lost when the failure occurred.

Of course, there are exceptions to this, such as cases where severe damage to the hard drive platters may have occurred. However, if the hard drive can be repaired and a full image or clone created, then the logical file structure can be rebuilt in most instances.

Most physical damage cannot be repaired by end users. For example, opening a hard disk drive in a normal environment can allow airborne dust to settle on the platter and become caught between the platter and the read/write head. During normal operation, read/write heads float 3 to 6 nanometers above the platter surface, and the average dust particles found in a normal environment are typically around 30,000 nanometers in diameter. When these dust particles get caught between the read/write heads and the platter, they can cause new head crashes that further damage the platter and thus compromise the recovery process. Furthermore, end users generally do not have the hardware or technical expertise required to make these repairs. Consequently, data recovery companies are often employed to salvage important data with the more reputable ones using class 100 dust- and static-free cleanrooms.

Recovering data from physically damaged hardware can involve multiple techniques. Some damage can be repaired by replacing parts in the hard disk. This alone may make the disk usable, but there may still be logical damage. A specialized disk-imaging procedure is used to recover every readable bit from the surface. Once this image is acquired and saved on a reliable medium, the image can be safely analyzed for logical damage and will possibly allow much of the original file system to be reconstructed.

A common misconception is that a damaged printed circuit board (PCB) may be simply replaced during recovery procedures by an identical PCB from a healthy drive. While this may work in rare circumstances on hard disk drives manufactured before 2003, it will not work on newer drives. Electronics boards of modern drives usually contain drive-specific adaptation data (generally a map of bad sectors and tuning parameters) and other information required to properly access data on the drive. Replacement boards often need this information to effectively recover all of the data. The replacement board may need to be reprogrammed. Some manufacturers (Seagate, for example) store this information on a serial EEPROM chip, which can be removed and transferred to the replacement board.

Each hard disk drive has what is called a system area or service area; this portion of the drive, which is not directly accessible to the end user, usually contains drive’s firmware and adaptive data that helps the drive operate within normal parameters. One function of the system area is to log defective sectors within the drive; essentially telling the drive where it can and cannot write data.

The sector lists are also stored on various chips attached to the PCB, and they are unique to each hard disk drive. If the data on the PCB do not match what is stored on the platter, then the drive will not calibrate properly. In most cases the drive heads will click because they are unable to find the data matching what is stored on the PCB.

Troubleshooting is a form of problem solving, often applied to repair failed products or processes on a machine or a system. It is a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again. Troubleshooting is needed to identify the symptoms. Determining the most likely cause is a process of elimination—eliminating potential causes of a problem. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

In general, troubleshooting is the identification or diagnosis of “trouble” in the management flow of a system caused by a failure of some kind. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining and remedying the causes of these symptoms.

A system can be described in terms of its expected, desired or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the “print” option from various computer applications is intended to result in a hardcopy emerging from some specific device). Any unexpected or undesirable behavior is a symptom. Troubleshooting is the process of isolating the specific cause or causes of the symptom. Frequently the symptom is a failure of the product or process to produce any results. (Nothing was printed, for example). Corrective action can then be taken to prevent further failures of a similar kind.

The methods of forensic engineering are useful in tracing problems in products or processes, and a wide range of analytical techniques are available to determine the cause or causes of specific failures. Corrective action can then be taken to prevent further failure of a similar kind. Preventive action is possible using failure mode and effects (FMEA) and fault tree analysis (FTA) before full-scale production, and these methods can also be used for failure analysis.

Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. So the initial focus is often on recent changes to the system or to the environment in which it exists. (For example, a printer that “was working when it was plugged in over there”). However, there is a well known principle that correlation does not imply causality. (For example, the failure of a device shortly after it has been plugged into a different outlet doesn’t necessarily mean that the events were related. The failure could have been a matter of coincidence.) Therefore, troubleshooting demands critical thinking rather than magical thinking.

It is useful to consider the common experiences we have with light bulbs. Light bulbs “burn out” more or less at random; eventually the repeated heating and cooling of its filament, and fluctuations in the power supplied to it cause the filament to crack or vaporize. The same principle applies to most other electronic devices and similar principles apply to mechanical devices. Some failures are part of the normal wear-and-tear of components in a system.

The first basic principle in troubleshooting is to be able to reproduce the problem, at wish. Second basic principle in troubleshooting is to reduce the “system” to its simplest form that still show the problem. Third basic principle in troubleshooting is to “know what you are looking for. In other words, to fully understand the way the system is supposed to work, so you can “spot” the error when it happens.

A troubleshooter could check each component in a system one by one, substituting known good components for each potentially suspect one. However, this process of “serial substitution” can be considered degenerate when components are substituted without regard to a hypothesis concerning how their failure could result in the symptoms being diagnosed.

Simple and intermediate systems are characterized by lists or trees of dependencies among their components or subsystems. More complex systems contain cyclical dependencies or interactions (feedback loops). Such systems are less amenable to “bisection” troubleshooting techniques.

It also helps to start from a known good state, the best example being a computer reboot. A cognitive walk through is also a good thing to try. Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.

A common cause of problems is bad design, for example bad human factors design, where a device could be inserted backward or upside down due to the lack of an appropriate forcing function (behavior-shaping constraint), or a lack of error-tolerant design. This is especially bad if accompanied by habituation, where the user just doesn’t notice the incorrect usage, for instance if two parts have different functions but share a common case so that it is not apparent on a casual inspection which part is being used.

Troubleshooting can also take the form of a systematic checklist, troubleshooting procedure, flowchart or table that is made before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought about the steps to take in troubleshooting and organizing the troubleshooting into the most efficient troubleshooting process. Troubleshooting tables can be computerized to make them more efficient for users.

Some computerized troubleshooting services (such as Primefax, later renamed MaxServ), immediately show the top 10 solutions with the highest probability of fixing the underlying problem. The technician can either answer additional questions to advance through the troubleshooting procedure, each step narrowing the list of solutions, or immediately implement the solution he feels will fix the problem. These services give a rebate if the technician takes an additional step after the problem is solved: report back the solution that actually fixed the problem. The computer uses these reports to update its estimates of which solutions have the highest probability of fixing that particular set of symptoms.

Efficient methodical troubleshooting starts on with a clear understanding of the expected behavior of the system and the symptoms being observed. From there the troubleshooter forms hypotheses on potential causes, and devises (or perhaps references a standardized checklist of) tests to eliminate these prospective causes. This approach is often called “divide and conquer”.

Two common strategies used by troubleshooters are to check for frequently encountered or easily tested conditions first (for example, checking to ensure that a printer’s light is on and that its cable is firmly seated at both ends). This is often referred to as “milking the front panel.”

Then, “bisect” the system (for example in a network printing system, checking to see if the job reached the server to determine whether a problem exists in the subsystems “towards” the user’s end or “towards” the device).

This latter technique can be particularly efficient in systems with long chains of serialized dependencies or interactions among its components. It is simply the application of a binary search across the range of dependencies and is often referred to as “half-splitting”. It is similar to the game of “twenty questions”: Anyone can isolate one option out of a million by dividing the set of alternatives in half 20 times (because 2^10 = 1024 and 2^20 = 1,048,576).

One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and resolved. Often considerable effort and emphasis in troubleshooting is placed on reproducibility … on finding a procedure to reliably induce the symptom to occur.

Some of the most difficult troubleshooting issues relate to symptoms which occur intermittently. In electronics this often is the result of components that are thermally sensitive (since resistance of a circuit varies with the temperature of the conductors in it). Compressed air can be used to cool specific spots on a circuit board and a heat gun can be used to raise the temperatures; thus troubleshooting of electronics systems frequently entails applying these tools in order to reproduce a problem.

In computer programming race conditions often lead to intermittent symptoms which are extremely difficult to reproduce; various techniques can be used to force the particular function or module to be called more rapidly than it would be in normal operation (analogous to “heating up” a component in a hardware circuit) while other techniques can be used to introduce greater delays in, or force synchronization among, other modules or interacting processes.

Intermittent issues can be thus defined:

An intermittent is a problem for which there is no known procedure to consistently reproduce its symptom.

— Steven Litt,

In particular he asserts that there is a distinction between the frequency of occurrence and a “known procedure to consistently reproduce” an issue. For example, knowing that an intermittent problem occurs ” within” an hour of a particular stimulus or event … but that sometimes it happens in five minutes and other times it takes almost an hour … does not constitute a “known procedure” even if the stimulus does increase the frequency of observable exhibitions of the symptom.

Nevertheless, sometimes troubleshooters must resort to statistical methods … and can only find procedures to increase the symptom’s occurrence to a point at which serial substitution or some other technique is feasible. In such cases, even when the symptom seems to disappear for significantly longer periods, there is a low confidence that the root cause has been found and that the problem is truly solved.

Also, tests may be run to stress certain components to determine if those components have failed.

Isolating single component failures that cause reproducible symptoms is relatively straightforward.

However, many problems only occur as a result of multiple failures or errors. This is particularly true of fault tolerant systems, or those with built-in redundancy. Features that add redundancy, fault detection and failover to a system may also be subject to failure, and enough different component failures in any system will “take it down.”

Even in simple systems, the troubleshooter must always consider the possibility that there is more than one fault. (Replacing each component, using serial substitution, and then swapping each new component back out for the old one when the symptom is found to persist, can fail to resolve such cases. More importantly, the replacement of any component with a defective one can actually increase the number of problems rather than eliminating them).

Note that, while we talk about “replacing components” the resolution of many problems involves adjustments or tuning rather than “replacement.” For example, intermittent breaks in conductors — or “dirty or loose contacts” might simply need to be cleaned and/or tightened. All discussion of “replacement” should be taken to mean “replacement or adjustment or other modification.”