
System Recovery

Philip A. Bernstein, Eric Newcomer, in Principles of Transaction Processing (Second Edition), 2009

User Techniques

Although most optimizations of system recovery are only available to database system implementers, there are a few things that a user can do to speed up restart and thereby improve availability, such as the following:

If the checkpointing frequency can be adjusted by the system administrator, then increasing it will reduce the amount of work needed at restart. Running a benchmark with different checkpointing frequencies will help determine the expense of using frequent checkpoints to improve recovery time. Depending on the overhead of the checkpointing algorithm used, this might require buying extra hardware, to ensure satisfactory transaction performance while checkpointing is being done.

Partition the database across more disks. The restart algorithm is often I/O-bound. Although it reads the log sequentially (which is fast), it accesses the database randomly. Spreading the database over more disks increases the effective disk bandwidth and can reduce restart time.

Increase the system resources available to the restart program. After the operating system recovers from a failure, it runs recovery scripts that include calling the database restart algorithm. It may not allocate main memory resources optimally, if left to its own defaults. The restart algorithm benefits from a huge cache, to reduce its I/O. If memory allocation can be controlled, tuning it can help reduce restart time.

In general, one should benchmark the performance of restart to determine its sensitivity to a variety of conditions and thereby be able to tune it to balance restart running time against checkpointing overhead.
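As a concrete illustration of such a benchmark, the sketch below times restart under a range of checkpoint intervals. It is only an illustration: the functions that change the checkpoint interval, drive the workload, and crash and restart the server are passed in as placeholder callables, because the real commands depend entirely on the particular database system being tuned.

```python
"""Illustrative sketch only: a tiny harness for measuring restart time as a
function of checkpoint interval. All DBMS-specific steps are supplied as
callables, since the real commands vary by database product."""
import time
from typing import Callable, Dict, Iterable


def benchmark_restart(
    intervals_sec: Iterable[int],
    set_checkpoint_interval: Callable[[int], None],  # e.g., issues an admin/config command
    run_workload: Callable[[], None],                # drives a representative transaction load
    crash_and_restart: Callable[[], None],           # kills the server and runs its recovery
) -> Dict[int, float]:
    results: Dict[int, float] = {}
    for interval in intervals_sec:
        set_checkpoint_interval(interval)
        run_workload()
        start = time.monotonic()
        crash_and_restart()                          # time only the crash-recovery path
        results[interval] = time.monotonic() - start
    return results


if __name__ == "__main__":
    # Dummy callables so the sketch runs standalone; replace them with real hooks.
    timings = benchmark_restart(
        intervals_sec=[30, 60, 300, 900],
        set_checkpoint_interval=lambda s: None,
        run_workload=lambda: None,
        crash_and_restart=lambda: time.sleep(0.01),
    )
    for interval, seconds in timings.items():
        print(f"checkpoint every {interval:>4}s -> restart took {seconds:.3f}s")
```

Plotting restart time against checkpoint interval, alongside the measured throughput cost of each interval, gives the trade-off curve needed to pick a production setting.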


URL: https://www.sciencedirect.com/science/article/pii/B978155860623400007X

MCSE 70-293: Planning, Implementing, and Maintaining a High-Availability Strategy

Martin Grasdal, ... Dr. Thomas W. Shinder, Technical Editor, in MCSE (Exam 70-293) Study Guide, 2003

How ASR Works

ASR involves two main processes:

ASR backup The process of creating an ASR set, which consists of a 1.44 MB floppy diskette and linked backup media containing ASR-created backup data. These two components are necessary for performing an ASR restore and must be kept together.

ASR restore The process of re-creating the operating system and system-related disk partitions/volumes from an ASR set. In addition to the ASR set, you will need to have the original media used to install Windows Server 2003 on your server.

An ASR backup creates a set of all of the information necessary to re-create the operating system at the time the ASR backup is performed. When an ASR restore is performed, the operating system is reinstalled using the original Windows Server 2003 media. However, instead of generating new disk signatures, security identifiers, and Registry content, these items are restored from the ASR set.

NOTE

When operating on a nonclustered server, members of the Backup Operators group can perform ASR backups. This is not the case on clustered servers. Either a member of the Administrators group must perform the ASR backup or the Backup Operators group must be added to the security descriptor for the cluster service.


URL: https://www.sciencedirect.com/science/article/pii/B9781931836937500129

Operations Architecture

James V. Luisi, in Pragmatic Enterprise Architecture, 2014

6.1.5 System Recovery Architecture

System recovery architecture is responsible for the standards and frameworks for system recovery involving each type of technology across each type of operating system environment, typically including:

z/OS,

Windows,

Novell, and

UNIX.

As well as each type of platform, such as:

tablet,

smart phone,

desktop/portable,

Intel box, or

mainframe.

This architectural discipline serves to assess, plan for, and monitor the design of standard recovery capabilities, failover, and disaster recovery for the various types of environments deployed globally across an organization. Its specialization and scope are such that it comprises failover architecture and disaster recovery architecture.

6.1.5.1 Failover Architecture

Failover architecture is responsible for the standards and frameworks associated with failover of applications and infrastructure technologies involving the various types of components that comprise automated capabilities.

At a high level, there are a number of options for the recovery of an application system or technology.

First is the option of “no failover,” which is appropriate for business capabilities that can incur an indefinite delay for recovery without significant cost or risk to the company.

Second is the option of a “cold failover,” which is appropriate when the SLA's recovery window accommodates the time required to restore the database and application server from backups, including applying the transaction journals and rolling back incomplete units of work.

Often such an environment is not dedicated, but compatible with the environment being recovered, including networking. This option provides failover ranging from days to weeks, depending upon the number of application systems being recovered.

The third option is a “warm failover,” where a dedicated environment is ready and backups have already been applied; the tasks that remain are simply applying transaction journals, rolling back incomplete units of work, and potentially manually switching the network over to the new environment.

This option provides failover ranging from hours to days, depending upon the number of application systems being recovered.
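The journal-replay step shared by the cold and warm options above can be pictured with the simplified sketch below. It is not any vendor's recovery code: the journal record layout and the single redo/undo pass are invented for illustration, and real recovery managers use log sequence numbers and far more elaborate bookkeeping.

```python
"""Minimal sketch of replaying a transaction journal against a restored
backup image and rolling back units of work that never committed."""
from typing import Dict, List, Tuple

# Each journal record: (txid, operation, key, value); operation is
# "begin", "update", or "commit".
JournalRecord = Tuple[int, str, str, str]


def apply_journal(database: Dict[str, str], journal: List[JournalRecord]) -> Dict[str, str]:
    committed = {txid for txid, op, _, _ in journal if op == "commit"}
    undo: Dict[str, str] = {}

    # Redo pass: reapply every logged update on top of the restored backup,
    # remembering the pre-image of keys first touched by uncommitted transactions.
    for txid, op, key, value in journal:
        if op == "update":
            if txid not in committed and key not in undo:
                undo[key] = database.get(key, "")
            database[key] = value

    # Undo pass: roll back incomplete units of work.
    for key, old_value in undo.items():
        database[key] = old_value
    return database


if __name__ == "__main__":
    restored = {"balance:alice": "100"}
    journal = [
        (1, "begin", "", ""), (1, "update", "balance:alice", "90"), (1, "commit", "", ""),
        (2, "begin", "", ""), (2, "update", "balance:alice", "0"),   # never committed
    ]
    print(apply_journal(restored, journal))   # {'balance:alice': '90'}
```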

The fourth option is a “hot failover,” also known as high availability (HA) failover, which generally has three levels associated with it. A hot failover may use HA clustering and low-latency messaging technology to keep its alternate configuration current, whether active-active, active-backup, or both.

The three levels associated with “hot failover” are:

HA hot-cold,

HA hot-warm, and

HA hot-hot.

6.1.5.1.1 HA hot-cold

The first, “HA hot-cold,” has a dedicated system already synchronized, including its application and data, but requires relatively simple manual intervention, such as having a second gateway server in standby mode that can be turned on when the master fails.

This option provides failover ranging from minutes to hours, depending upon the number of application systems being recovered.

6.1.5.1.2 HA hot-warm

The second, “HA hot-warm,” has a dedicated system already synchronized, including its application and data, but requires only the slightest manual intervention, such as having a second gateway server running in standby mode that can be enabled when the master fails.

This option provides failover ranging from seconds to minutes, depending upon the number of application systems being recovered.

6.1.5.1.3 HA hot-hot

The third is “HA hot-hot,” which has a dedicated system already synchronized, including its application and data, and requires no manual intervention. This is accomplished by having a second gateway server actively running simultaneously.

This option provides immediate failover regardless of the number of application systems being recovered.
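The difference between these levels comes down to how much of the switchover is automated. The sketch below illustrates, under assumed names for the health-check and promotion hooks (nothing here comes from the chapter), a heartbeat monitor of the kind a hot-cold or hot-warm configuration might rely on; in a hot-hot configuration both nodes are already active, so there is nothing to promote.

```python
"""Illustrative sketch, not a production HA stack: a standby is promoted
automatically once the primary misses enough heartbeats."""
import time
from typing import Callable


def monitor_and_failover(
    primary_alive: Callable[[], bool],    # e.g., a TCP health check against the primary
    promote_standby: Callable[[], None],  # e.g., repoint a virtual IP or enable the standby gateway
    interval_sec: float = 1.0,
    missed_beats_allowed: int = 3,
) -> None:
    missed = 0
    while True:
        if primary_alive():
            missed = 0
        else:
            missed += 1
            if missed >= missed_beats_allowed:
                promote_standby()         # failover: the standby becomes the active node
                return
        time.sleep(interval_sec)


if __name__ == "__main__":
    beats = iter([True, True, False, False, False])   # simulated heartbeat results
    monitor_and_failover(
        primary_alive=lambda: next(beats, False),
        promote_standby=lambda: print("standby promoted to active"),
        interval_sec=0.01,
    )
```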

6.1.5.1.4 Fail back

One additional failover strategy is referred to as a “fail back,” which is the process of restoring a system to the configuration it had prior to the failure.

6.1.5.2 Disaster Recovery Architecture

Disaster recovery (DR) architecture represents the technology side of business continuity architecture. While business continuity is responsible for identifying business capabilities, and the applications and technologies upon which they depend, from a business perspective, disaster recovery takes this to a greater level of detail from an IT perspective.

Behind the scenes of automation, applications and technologies depend upon a myriad of application, database, and hardware and software infrastructure components, which only IT is qualified to identify in their entirety.

The standards and frameworks of disaster recovery extend into standing up a disaster recovery capability, its regular testing, and cooperation with IT compliance to make disaster recovery standards and frameworks available to regulators.

Although it is obvious that the more holistically a company approaches disaster recovery architecture the better, it is actually having an appropriate set of subject matter experts looking out for their specialized areas of interest that makes it holistic.

For example, the disaster recovery architecture must cooperate closely with nearly all of the other enterprise architectural disciplines, including:

technology portfolio architecture,

infrastructure architecture,

network architecture,

firewall architecture,

application and database server architectures,

data in motion architecture,

operational utilities architecture,

application architecture,

reporting architecture,

workflow architecture,

failover architecture,

configuration management architecture,

release management architecture,

compliance architecture, and

SLA architecture.

This will help ensure that the synergies among the architectural disciplines are being appropriately identified and integrated into their corresponding standards and frameworks.

Any operating environment that cannot be virtualized, such as certain technologies that cannot be stood up in a VMware environment, must have dedicated equipment on which every last DLL and driver is replicated at the disaster recovery site before the particular system can successfully be made operational.

The enterprises that look at disaster recovery most holistically realize the tremendous benefit that mainframe applications enjoy, because the mainframe greatly simplifies the majority of disaster recovery issues. In fact, it is easier to recover an entire mainframe in a disaster recovery scenario than it is to just identify the sequence in which hundreds of application systems across a distributed environment must be recovered to optimally support the business priorities for just a single line of business.

Modern enterprise architecture also recognizes that disaster recovery that depends upon the fewest individuals, and upon the least individual knowledge, is far more advantageous in an emergency, when it cannot be predicted which staff will be available.


URL: https://www.sciencedirect.com/science/article/pii/B9780128002056000068

Privacy and Security in Healthcare

Timothy Virtue, Justin Rainey, in HCISPP Study Guide, 2015

Systems Recovery

Systems recovery and contingency planning comprise the processes and procedures that healthcare organizations implement to ensure availability (remember the availability leg of the CIA triangle) of electronic protected health information (EPHI). All healthcare organizations should develop a comprehensive business resiliency program to ensure the availability of healthcare information against natural or man-made disruptions. In addition to contingency planning being a security best practice, the HIPAA Contingency Plan Standard also addresses the importance of contingency planning through five implementation specifications. Per the HIPAA Standard [§164.308(a)(7)(ii)(A)–(E)], the five implementation specifications are a data backup plan (required), a disaster recovery plan (required), an emergency mode operation plan (required), testing and revision procedures (addressable), and an application and data criticality analysis (addressable). They are defined as follows:

Data backup plan: “Establish and implement procedures to create and maintain retrievable exact copies of electronic protected health information.”

Disaster recovery plan: “Establish (and implement as needed) procedures to restore any loss of data.”

Emergency mode operation plan: “Establish (and implement as needed) procedures to enable continuation of critical business processes for protection of the security of electronic protected health information while operating in emergency mode.”

Testing and revision procedures: “Implement procedures for periodic testing and revision of contingency plans.”

Application and data criticality analysis: “Assess the relative criticality of specific applications and data in support of other contingency plan components.”
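As a purely illustrative example of the "retrievable exact copies" idea behind the data backup plan specification (this is not HIPAA guidance itself, and the file paths are placeholders), the sketch below copies a file and verifies the copy with a cryptographic hash before recording it.

```python
"""Minimal sketch: create a backup copy and prove it is byte-for-byte exact."""
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_verification(source: Path, destination: Path) -> str:
    destination.parent.mkdir(parents=True, exist_ok=True)
    source_hash = sha256_of(source)
    shutil.copy2(source, destination)          # copy data plus timestamps
    copy_hash = sha256_of(destination)
    if source_hash != copy_hash:               # the copy must be exact to count as a backup
        raise IOError(f"backup verification failed for {source}")
    return copy_hash                           # record alongside the backup for later audits


if __name__ == "__main__":
    # Self-contained demo with a throwaway file; point at real data stores in practice.
    demo = Path("demo_record.txt")
    demo.write_text("example record")
    checksum = backup_with_verification(demo, Path("backups/demo_record.txt"))
    print("verified backup, sha256 =", checksum)
```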


URL: https://www.sciencedirect.com/science/article/pii/B9780128020432000045

Embedded Systems Analysis

Ronald van der Knijff, in Handbook of Digital Forensics and Investigation, 2010

File System Recovery

File system recovery uses the acquired low-level data to rebuild the high-level hierarchy of directories, subdirectories, and files. For data originating from flash file systems (as explained earlier in this chapter), this means finding out how the flash translation layer maps physical data to logical data and how the difference between active and deleted data can be determined.13 The result of this “flash translation layer analysis” is a method that splits the physical data into two parts: one part with all logical sectors in the right order belonging to the actual file system, and another part with all other data not belonging to the (current) file system. The first part can be further analyzed with existing forensic tools for file system recovery (Carrier, 2005). Analysis of the second part is more complex and depends a lot on the setup and user behavior of the originating system.
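The splitting step can be illustrated with the simplified sketch below. Real translation layers are vendor specific and considerably more complex; here each physical page is assumed to carry a logical sector number and a write-sequence counter, which is just enough to show how a dump separates into a current logical image and a pool of stale pages that may still contain deleted data.

```python
"""Illustrative sketch only: split a flash dump into the current logical image
and the leftover (stale or unmapped) pages."""
from typing import Dict, List, Optional, Tuple

# One physical page: (logical_sector or None, write_sequence, data)
PhysicalPage = Tuple[Optional[int], int, bytes]


def split_flash_dump(pages: List[PhysicalPage]) -> Tuple[bytes, List[bytes]]:
    newest: Dict[int, Tuple[int, bytes]] = {}
    leftovers: List[bytes] = []

    for sector, sequence, data in pages:
        if sector is None:                           # page not mapped to any logical sector
            leftovers.append(data)
            continue
        if sector not in newest or sequence > newest[sector][0]:
            if sector in newest:
                leftovers.append(newest[sector][1])  # superseded (stale) copy
            newest[sector] = (sequence, data)
        else:
            leftovers.append(data)                   # older copy of an already-seen sector

    ordered = b"".join(data for _, (_, data) in sorted(newest.items()))
    return ordered, leftovers


if __name__ == "__main__":
    dump = [(0, 1, b"AAAA"), (1, 1, b"BBBB"), (0, 2, b"CCCC"), (None, 0, b"????")]
    image, stale = split_flash_dump(dump)
    print(image)   # b'CCCCBBBB'          -> current logical file system image
    print(stale)   # [b'AAAA', b'????']   -> candidate deleted/unallocated data
```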


URL: https://www.sciencedirect.com/science/article/pii/B9780123742674000082

Contingency Planning

Stephen D. Gantz, Daniel R. Philpott, in FISMA and the Risk Management Framework, 2013

Alternate Storage Site

Incidents or outages affecting information systems often preclude the use of the primary operating location for system recovery. In such circumstances contingency operations teams perform system recovery activities at an alternate site, until the primary facility is again operational or a permanent replacement is chosen. To facilitate alternate site recovery and reconstitution, control CP-6 requires agencies to establish an alternate storage site for its moderate- and high-impact systems, and to put in place the agreements necessary to permit storage of and access to backup information to support recovery phase processes. Alternate storage sites also must be located far enough away from the primary storage site that the alternate site is not likely to be affected by the same hazards that might disrupt the primary site. Interpretations of what constitutes a sufficient geographic distance vary among organizations, but when selecting alternate sites, agencies must also identify potential issues (and describe actions to mitigate those issues) for accessing the alternate storage site in the event of a widespread outage or disaster [13]. Because all moderate- and high-impact systems share this requirement, agencies often designate one or more standard alternate storage sites to be used to support multiple systems; in such cases system owners or common control providers are responsible for implementing control mechanisms to transport system backups and other recovery information to the alternate storage site.


URL: https://www.sciencedirect.com/science/article/pii/B9781597496414000151

Data Science: Theory and Applications

Diana Rypkema, Shripad Tuljapurkar, in Handbook of Statistics, 2021

6 Summary

Here, we outline how to fit the GEV and incorporate ECE frequency, intensity, and damage, and system recovery into ecological population models. We provide a framework that downscales GEV models to an ecologically relevant spatial scale and discuss challenges and considerations when applying GEVs to different event types and systems. The GEV is well-supported, flexible, and proves to be a useful tool, with a long history as the modeling standard for extreme events. Although the GEV is not as detailed as some climate simulations (e.g., the Coupled Model Intercomparison Project, Eyring et al., 2016), it is very straightforward to use and interpret—providing an opportunity for ecologists to explicitly model ECEs in their study system with a low barrier to entry. By explicitly including ECE frequency and intensity, we can estimate the strongest sources of ecosystem change, modeling damage and recovery during and after an ECE. Our case study underscores the importance of explicitly modeling ECE characteristics; by incorporating the effects of climate change on our system through the lenses of both hurricane frequency and intensity, we found (surprisingly) that the stronger hurricanes of the future may not significantly impact our focal species. Without explicitly modeling hurricane frequency and intensity, we would not have been able to parse out the effects of probable future climate change scenarios on A. escallonioides and may have overestimated the potential influence of future storms on our study species. Better understanding the impacts of ECEs on ecosystems can improve our plans for the future, supporting conservation prioritization and effective natural resource management. As climate change continues, it is imperative that we accurately and explicitly model extreme events so we can understand current ecosystem dynamics, project future conditions, and plan for the future.
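As a minimal illustration of the fitting step described above (an assumed workflow using SciPy, not the chapter's own code or data), the sketch below fits a GEV to synthetic annual maxima and reads off a 100-year return level.

```python
"""Minimal sketch: fit a GEV to annual maxima and estimate a return level."""
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
# Placeholder data: in practice these would be observed annual maxima
# (e.g., peak hurricane wind speeds per year at the study site).
annual_maxima = genextreme.rvs(c=-0.1, loc=40.0, scale=8.0, size=60, random_state=rng)

# Maximum-likelihood fit of the three GEV parameters. Note that SciPy's shape
# parameter c uses the opposite sign convention from the usual GEV shape.
shape, loc, scale = genextreme.fit(annual_maxima)

# 100-year return level: the value exceeded with probability 1/100 each year.
return_level_100 = genextreme.isf(1.0 / 100.0, shape, loc, scale)

print(f"shape={shape:.2f}, loc={loc:.1f}, scale={scale:.1f}")
print(f"estimated 100-year return level: {return_level_100:.1f}")
```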


URL: https://www.sciencedirect.com/science/article/pii/S0169716120300511

An Introduction to Virtualization

In Virtualization for Security, 2009

Frequently Asked Questions

Q:

What is virtual machine technology used for?

A:

Virtual machine technology serves a variety of purposes. It enables hardware consolidation, simplified system recovery, and the re-hosting of earlier applications because multiple operating systems can run on one computer. One key application for virtual machine technology is cross-platform integration. Other key applications include server consolidation, the automation and consolidation of development and testing environments, the re-hosting of earlier versions of applications, simplifying system recovery environments, and software demonstrations.

Q:

How does virtualization address a CIO's pain points?

A:

IT organizations need to control costs, improve quality, reduce risks and increase business agility, all of which are critical to a business' success. With virtualization, lower costs and improved business agility are no longer trade-offs. By enabling IT resources to be pooled and shared, IT organizations are provided with the ability to reduce costs and improve overall IT performance.

Q:

What is the status of virtualization standards?

A:

True open standards for getting all the layers talking and working together aren't ready yet, let alone giving users interoperable choices between competitive vendors. Users are forced to rely on de facto standards at this time. For instance, users can deploy two different virtualization products within one environment, especially if each provides the ability to import virtual machines from the other. But that is about as far as interoperability currently extends.

Q:

When is a product not really virtualization but something else?

A:

Application vendors have been known to overuse the term and label their product “virtualization ready.” But by definition, the application should not be able to tell whether it is on a virtualized platform or not. Some vendors also label their isolation tools as virtualization. To isolate an application means files are installed but are redirected or shielded from the operating system. That is not the same as true virtualization, which lets you change any underlying component, even network and operating system settings, without having to tweak the application.

Q:

What is the ideal way to deploy virtualization?

A:

Although enterprises gain incremental benefits from applying virtualization in one area, they gain much more by using it across every tier of the IT infrastructure. For example, when server virtualization is deployed with network and storage virtualization, the entire infrastructure becomes more flexible, making it capable of dynamically adapting to various business needs and demands.

Q:

What are some of the issues to watch out for?

A:

Companies beginning to deploy virtualization technologies should be cautious of the following: software costs/licensing from proliferating virtual machines, capacity planning, training, high and unrealistic consolidation expectations, and upfront hardware investment, to name a few. Also, sufficient planning upfront is important to avoid issues that can cause unplanned outages affecting a larger number of critical business applications and processes.


URL: https://www.sciencedirect.com/science/article/pii/B9781597493055000013

Macintosh Forensic Analysis

Anthony Kokocinski, in Handbook of Digital Forensics and Investigation, 2010

File Deletion

Because of the nature of HFS, HFS+, and HFSX volumes, recovery of deleted file system details is particularly difficult. Unlike many Windows and Unix systems, recovery of deleted file system information on Macintosh systems is nearly impossible because nodes in the Catalog file are so frequently overwritten. This is largely due to the B-tree balancing process in the Catalog file, which obliterates previous file system details. Standard data carving techniques can be used to recover some file contents from Macintosh volumes, but only the data fork. This fork retains the “typical” structure of media, document, and application files, with the well-defined file formats and header signatures that tools like Foremost rely on for carving, as discussed in Chapter 2, “Forensic Analysis.” The real difficulty comes in further correlation; pairing the carved data for each fork would be very difficult, especially since a lot of the data in resource forks is undocumented. Additionally, tying specific files to a user can be very difficult when file system details are no longer available in a Catalog file record. Therefore, attributing a file to a particular user is often possible only through forensic examination of data contents.
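The header-signature carving mentioned above can be pictured with the simplified sketch below. It handles only JPEG signatures and operates on a raw volume image whose path is a placeholder, whereas tools such as Foremost support many file types and far larger images.

```python
"""Simplified sketch of header/footer-based carving for JPEG data only."""
from pathlib import Path
from typing import List

JPEG_HEADER = b"\xff\xd8\xff"
JPEG_FOOTER = b"\xff\xd9"
MAX_FILE_SIZE = 10 * 1024 * 1024  # give up after 10 MB without a footer


def carve_jpegs(raw: bytes) -> List[bytes]:
    carved: List[bytes] = []
    offset = 0
    while True:
        start = raw.find(JPEG_HEADER, offset)
        if start == -1:
            break
        end = raw.find(JPEG_FOOTER, start, start + MAX_FILE_SIZE)
        if end != -1:
            carved.append(raw[start:end + len(JPEG_FOOTER)])  # header through footer
            offset = end + len(JPEG_FOOTER)
        else:
            offset = start + len(JPEG_HEADER)                 # skip past a false positive
    return carved


if __name__ == "__main__":
    image_path = Path("volume.dd")   # placeholder path to a raw volume image
    if image_path.exists():
        files = carve_jpegs(image_path.read_bytes())
        print(f"carved {len(files)} JPEG candidates")
```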


URL: https://www.sciencedirect.com/science/article/pii/B9780123742674000070

Domain 7

Eric Conrad, ... Joshua Feldman, in Eleventh Hour CISSP® (Third Edition), 2017

Failure and recovery metrics

A number of metrics are used to quantify how frequently systems fail, how long a system may exist in a failed state, and the maximum time to recover from failure. These metrics include the Recovery Point Objective (RPO), Recovery Time Objective (RTO), Work Recovery Time (WRT), Mean Time Between Failures (MTBF), Mean Time to Repair (MTTR), and Minimum Operating Requirements (MOR).

Recovery point objective

The RPO is the amount of data loss or system inaccessibility (measured in time) that an organization can withstand. “If you perform weekly backups, someone made a decision that your company could tolerate the loss of a week's worth of data. If backups are performed on Saturday evenings and a system fails on Saturday afternoon, you have lost the entire week's worth of data. This is the RPO. In this case, the RPO is 1 week.”3

The RPO represents the maximum acceptable amount of data/work loss for a given process because of a disaster or disruptive event.

Recovery time objective and work recovery time

The RTO describes the maximum time allowed to recover business or IT systems. RTO is also called the systems recovery time. This is one part of the Maximum Tolerable Downtime (MTD); once the system is physically running, it must be configured.

Crunch Time

WRT describes the time required to configure a recovered system. “Downtime consists of two elements, the systems recovery time and the WRT. Therefore, MTD = RTO + WRT.”3
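A tiny worked example, with invented numbers, makes the relationship concrete:

```python
# Illustrative only: the hour values are invented.
rto_hours = 4.0   # systems recovery time: get the failed system running again
wrt_hours = 2.0   # work recovery time: configure the system and resume work
mtd_hours = rto_hours + wrt_hours

print(f"MTD = RTO + WRT = {rto_hours} + {wrt_hours} = {mtd_hours} hours")
# If the business can tolerate at most 5 hours of downtime, this plan fails
# the requirement, and either the RTO or the WRT must be reduced.
```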

Mean time between failures

MTBF quantifies how long a new or repaired system will run before failing. It is typically generated by a component vendor and is largely applicable to hardware as opposed to applications and software.

Mean time to repair

The MTTR describes how long it will take to recover a specific failed system. It is the best estimate for reconstituting the IT system so that business continuity may occur.

Minimum operating requirements

MOR describes the minimum environmental and connectivity requirements needed to operate computer equipment. It is important to determine and document the MOR for each critical IT asset because, in the event of a disruptive event or disaster, proper analysis can then be conducted quickly to determine whether the IT assets will be able to function in the emergency environment.


URL: https://www.sciencedirect.com/science/article/pii/B9780128112489000073

Is the period of time within which systems, applications, or functions must be recovered after an outage?

The recovery time objective (RTO) is the maximum tolerable length of time that a computer, system, network or application can be down after a failure or disaster occurs.

Are those that occur suddenly, with little warning, taking the lives of people and destroying the means of production?

A rapid onset disaster refers to an event or hazard that occurs suddenly, with little warning, taking the lives of people, and destroying economic structures and material resources. Rapid onset disasters may be caused by earthquakes, floods, storm winds, tornadoes, or mud flows.

Is the point in time to which lost systems and data can be recovered after an outage as determined by the business unit?

The recovery point objective (RPO) is the metric defined as the point in time to which lost systems and data can be recovered after an outage, as determined by the business unit.

Why must the alert roster and the notification procedures that use it be tested more frequently than other components of the DR plan?

The alert roster must be tested more frequently than other components of a disaster recovery plan because it is subject to continual change due to employee turnover. Training focuses on the particular roles each individual is expected to execute during an actual disaster.