Hence, what is proposed here is a technique widely used in reliability engineering: Software Rejuvenation, a proactive fault tolerance technique in which the system is periodically rebooted to clean the memory. In fact, it is well known that most critical SW failures are transient. These may be caused by error conditions due to the software aging phenomenon, i.e., the tendency of SW to exhibit data corruption or unbounded resource consumption during its execution time. Although faults left by developers still remain, periodic rejuvenation helps to remove, or at least minimize, transient failures, thus reducing possible outages.
In order to evaluate the rejuvenation impact, an extension of the PRISM CTMC model seen in Section 5.3 is provided. Fig. 6 shows the new states added to the cluster model. It is assumed that the cluster, after a certain time interval (α), moves to the failure-prone state 9. From that moment, unexpected shutdowns of the Active node (with rate ) may be experienced, requiring the switch-over to the Standby node (), which completes after the Time to Switch (θ). Only a rejuvenation action – occurring with rate β – can bring the cluster back from state 9 to the state of higher robustness.
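To make the extended model concrete, the cycle among the robust state, the failure-prone state 9 and the switch-over can be sketched as a small CTMC whose steady state is computed numerically. The following is a minimal sketch, assuming a simplified three-state abstraction and purely illustrative rates (not the tuned values used in the paper):

```python
# Minimal sketch: steady state of a simplified 3-state CTMC capturing the
# rejuvenation cycle. All rates are illustrative placeholders (per hour),
# not the values obtained from the PRISM analysis.
ALPHA = 1 / 240.0   # robust -> failure-prone (aging onset), assumed
BETA  = 1 / 24.0    # failure-prone -> robust (rejuvenation), assumed
LAMB  = 1 / 1000.0  # active-node shutdown in the failure-prone state, assumed
SWITCH = 60.0       # switch-over completion rate = 1 / Time-to-Switch, assumed

# States: 0 = robust, 1 = failure-prone (state 9), 2 = switching to Standby
Q = [
    [-ALPHA,          ALPHA,          0.0],
    [ BETA,   -(BETA + LAMB),        LAMB],
    [ SWITCH,          0.0,       -SWITCH],
]

def steady_state(Q):
    """Solve pi @ Q = 0 with sum(pi) = 1 by Gaussian elimination."""
    n = len(Q)
    # Build A x = b with A = Q^T; replace the last (redundant) balance
    # equation by the normalization constraint sum(pi) = 1.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):                      # forward elimination with pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):            # back substitution
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

pi = steady_state(Q)
print("P(robust), P(failure-prone), P(switching):", pi)
```

The resulting vector π gives the long-run fraction of time spent in each state; with these placeholder rates the cluster sits mostly in the robust state, which is the qualitative behavior the rejuvenation extension is meant to enforce.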
The process of rejuvenation is leveraged as an opportunity to periodically switch the Active node running the TMS application, thus avoiding a situation where one node always runs, accumulates SW errors, and stresses the same HW units. Hence, when the rejuvenation begins, the cluster enforces the switch to the other node within time θ. Therefore, the cluster will have only one server available during the time interval required to complete the hardware reboot (rate γ) of the node under rejuvenation. This is acceptable since the rejuvenation action may be triggered in time windows of low workload. Rates α and β have been tuned through PRISM formal checking. Results, in line with , demonstrate that the optimal values are h and h.
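The trade-off just described – rejuvenating more often shortens aging exposure but adds one-server reboot windows – can be explored with the closed-form balance equations of a hypothetical four-state abstraction (robust, failure-prone, rejuvenation reboot with one server, switch-over with one server). Every rate below is an invented placeholder, not a value from the paper's analysis:

```python
# Hypothetical four-state abstraction of the rejuvenation cycle; the steady
# state follows in closed form from the balance equations. Rates per hour,
# all illustrative.
def one_server_fraction(alpha, beta, lam, gamma, theta):
    """Fraction of time the cluster runs on a single server.

    States: R (robust, 2 servers), P (failure-prone, 2 servers),
    J (rejuvenation reboot, 1 server), S (switch-over, 1 server).
    Transitions: R -a-> P, P -b-> J, P -l-> S, J -g-> R, S -(1/theta)-> R.
    """
    pi_r = 1.0                             # unnormalized reference mass
    pi_p = pi_r * alpha / (beta + lam)     # balance of state P
    pi_j = pi_p * beta / gamma             # balance of state J
    pi_s = pi_p * lam * theta              # balance of state S (theta = duration)
    total = pi_r + pi_p + pi_j + pi_s
    return (pi_j + pi_s) / total

# Sweep the rejuvenation rate beta with the other rates fixed:
# aging onset every ~240 h, 10-minute reboot (gamma = 6/h), 1-minute switch.
for beta in (1 / 168.0, 1 / 72.0, 1 / 24.0):
    f = one_server_fraction(alpha=1 / 240.0, beta=beta,
                            lam=1 / 1000.0, gamma=6.0, theta=1 / 60.0)
    print(f"beta = {beta:.5f}/h -> one-server fraction = {f:.6f}")
```

Sweeping β this way mirrors, in spirit, the tuning that the paper performs with PRISM: the one-server fraction stays small across the sweep, supporting the claim that rejuvenation windows are an acceptable cost.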
The rejuvenation yields a better failure rate: . Such an estimate – compliant with SIL2 requirements – now needs an on-field verification.
Experimental validation

This section illustrates the experimental validation conducted on a TMS cluster test bed having the same architecture seen in Section 3. In such a test bed, the TMS receives messages on the railway status from a simulated IXL actually used by ASTS for test purposes. Experiments aim at proving the validity of the model results and of the proposed rejuvenation strategy. The analysis of the experiments moves in two directions. The idea is to test the TMS under stressful conditions by looking at: (1) the time to failure of the cluster, and (2) the degradation trend induced by software aging. In both cases, two widely accepted techniques ,  have been used to obtain valid estimates and, at the same time, minimize the test duration.

A Quantitative Accelerated Life Test (QALT) is the technique adopted to evaluate (1). QALT is designed to provide reliability information on a component or system through failure data obtained from an accelerated test. It is usually adopted for hardware component testing, but it has also been demonstrated by Matias et al.  to be an accurate solution to observe – in short-term and stressed executions – the behavior of systems suffering from software aging. QALT allows a quantification of the MTTF by applying controlled stresses useful to reduce the test period. Then, QALT uses the lifetime data obtained under stress conditions to estimate the lifetime distribution of the system in its normal use condition.

An Accelerated Degradation Test (ADT) is the method used to analyze (2). ADT extends the QALT technique. Rather than looking at failure times, ADT uses degradation measures to produce a time to failure. ADT fits well for aging-related failures, as these are particularly difficult to observe empirically. However, ADT is designed for physical systems. Hence, this paper adopts the approach of Matias et al. , who made ADT applicable to software aging studies.
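The core of the ADT approach for software aging can be illustrated with a short sketch: fit a linear trend to a degradation measure and extrapolate it to a failure threshold, obtaining a "pseudo" time to failure. The series below (resident memory of a hypothetical process) and the threshold are entirely synthetic:

```python
# ADT-style pseudo time-to-failure, in the spirit of applying degradation
# analysis to software aging: fit a linear trend to a degradation measure
# and extrapolate to a failure threshold. Data are synthetic.
samples_h   = [0, 24, 48, 72, 96, 120]        # observation times (hours)
memory_mb   = [512, 530, 551, 569, 590, 608]  # degradation measure (assumed)
THRESHOLD_MB = 2048                           # memory level treated as failure (assumed)

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

a, b = linear_fit(samples_h, memory_mb)
pseudo_ttf_h = (THRESHOLD_MB - a) / b         # time at which the trend crosses the threshold
print(f"degradation rate: {b:.3f} MB/h, pseudo TTF: {pseudo_ttf_h:.0f} h")
```

Each stressed run yields one such pseudo failure time; the set of pseudo failure times is then analyzed exactly as QALT failure data would be.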
Both QALT and ADT make use of a life-stress relationship to model the connection between the failure/degradation times observed at given stress levels and the system lifetime distribution in its normal use conditions.
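As an illustration of such a life-stress relationship, the inverse power law ln(L) = a − n·ln(S) can be fitted to mean lifetimes observed at elevated stress levels and then evaluated at the normal use level. All numbers below are invented for the sketch:

```python
# Illustrative inverse-power-law life-stress extrapolation: fit
# ln(L) = a - n*ln(S) to lifetimes observed at two accelerated stress
# levels, then predict the lifetime at the normal use level.
# Every figure is a made-up example value.
import math

stress = [4.0, 8.0]       # accelerated stress levels (e.g., x normal load)
life_h = [400.0, 120.0]   # mean lifetimes observed under each stress (hours)

# Two observations -> exact fit of the two-parameter log-linear model.
n = (math.log(life_h[0]) - math.log(life_h[1])) / \
    (math.log(stress[1]) - math.log(stress[0]))
a = math.log(life_h[0]) + n * math.log(stress[0])

use_stress = 1.0          # normal use condition
predicted_life_h = math.exp(a - n * math.log(use_stress))
print(f"exponent n = {n:.2f}, predicted use-level lifetime = {predicted_life_h:.0f} h")
```

With more than two stress levels the same model would be fitted by regression; the extrapolated use-level lifetime is what allows a short accelerated campaign to say something about the much longer lifetime under normal operation.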