TY - GEN
T1 - Towards a Safe and Latency-Aware Fault-tolerant Scheduling Technique for Multi-rate Task Chains
AU - Gohari, Pourya
AU - Voeten, Jeroen
AU - Nasri, Mitra
PY - 2025/1/3
Y1 - 2025/1/3
N2 - In safety-critical real-time systems such as autonomous cars, fault-tolerance is essential for system reliability but can increase end-to-end latency and hinder schedulability. This paper presents a novel, safe, and latency-aware fault-tolerant scheduling technique for multi-rate task chains. A naive use of traditional fault-tolerance mechanisms, such as checkpointing and re-execution with recovery blocks, can violate end-to-end latency requirements. Our technique uses recovery blocks but leverages inherent task redundancies in multi-rate task chains (where data producers and consumers have different periods) to reduce the need for recovery. Moreover, it determines the priority of the recovery blocks such that the end-to-end latency of the task chain is reduced in the presence of transient faults. Our experiments show that our technique significantly improves schedulability and reduces data age compared to the state-of-the-art checkpointing method. For instance, for systems with 4 to 16 cores and 10 to 40 tasks, we achieve up to 6 times higher schedulability and reduce data age by 21% under various fault levels.
AB - In safety-critical real-time systems such as autonomous cars, fault-tolerance is essential for system reliability but can increase end-to-end latency and hinder schedulability. This paper presents a novel, safe, and latency-aware fault-tolerant scheduling technique for multi-rate task chains. A naive use of traditional fault-tolerance mechanisms, such as checkpointing and re-execution with recovery blocks, can violate end-to-end latency requirements. Our technique uses recovery blocks but leverages inherent task redundancies in multi-rate task chains (where data producers and consumers have different periods) to reduce the need for recovery. Moreover, it determines the priority of the recovery blocks such that the end-to-end latency of the task chain is reduced in the presence of transient faults. Our experiments show that our technique significantly improves schedulability and reduces data age compared to the state-of-the-art checkpointing method. For instance, for systems with 4 to 16 cores and 10 to 40 tasks, we achieve up to 6 times higher schedulability and reduce data age by 21% under various fault levels.
U2 - 10.1145/3696355.3699708
DO - 10.1145/3696355.3699708
M3 - Conference contribution
SP - 25
EP - 36
BT - RTNS '24
PB - Association for Computing Machinery, Inc
CY - New York
T2 - 32nd International Conference on Real-Time Networks and Systems, RTNS 2024
Y2 - 6 November 2024 through 8 November 2024
ER -