A Cross-Layer Tutorial on Mitigating SDCs, from silicon physics to fleet-scale distributed systems.
As process nodes shrink and datacenters scale toward exascale, Silent Data Corruptions (SDCs) have emerged as a primary challenge to computational integrity at scale. Once dismissed as rare anomalies, SDCs are now implicated in corrupted AI training weights, silent database corruptions, and elusive bugs that surface only after billions of compute-hours.
This full-day tutorial offers a cross-layer journey through the SDC landscape, led by academic and industry experts from AMD, Meta, and Google. We begin at the silicon level — examining the defects, variability, and aging phenomena at the root of these corruptions — then ascend through architectural mitigations (ECC, residue checking, lockstepping), software-level testing and resilience, and finally workload behavior in distributed systems and large-scale AI. The tutorial bridges the dialogue between those who design silicon and those who manage the software running upon it, leaving attendees with a clear view of the open research questions that will define the next decade of reliable systems design.
Target Audience: Graduate students seeking a research area combining hardware, reliability, and systems; dependability researchers wishing to understand industrial hyperscaler constraints; and practitioners responsible for device and datacenter reliability. Basic knowledge of computer architecture and systems programming is expected.
Full-Day Tutorial (6 Hours)
Defects, variability, and aging; trends and remediation at the process and circuit levels.
Speaker TBD
ECC, residue checking, lockstepping, and modular redundancy, with their power, area, and performance trade-offs.
Vilas Sridharan (AMD)
Simulation-driven and compiler-driven test generation for detecting SDC-causing defects in a fleet.
Caroline Trippel (Stanford); Nikos Karystinos, Odysseas Chatzopoulos & Dimitris Gizopoulos (University of Athens)
Building software and compilers robust to hardware defects, with insights from large-scale production deployments.
David Bacon (Google)
Hyperscale failure trends drawn from recent large-scale industry reports on fleet-wide hardware reliability.
Harish Dixit (Meta)
Shrinking nodes, AI/ML for failure prediction, and the road ahead.
Speaker TBD
Vilas Sridharan — AMD
Abstract. This talk will cover the challenges, current state, and future directions of addressing the needs of data center reliability with a focus on architectural approaches to quantify and remediate the threat of silicon defects.
Bio. Vilas Sridharan is an AMD Senior Fellow, where he leads the RAS (Reliability, Availability, and Serviceability) Architecture team. His research focuses on the modeling of hardware faults and on architectural and micro-architectural approaches to reliability and fault tolerance in high-performance microprocessors. Vilas received his Ph.D. and M.S.E. from the Department of Electrical and Computer Engineering at Northeastern University, and his B.S.E. in Computer Engineering from Princeton University in 2000. From 2000 to 2004, he worked in the SPARC server division at Sun Microsystems. Since 2010, he has been on AMD's RAS Architecture team.
Two-part session
Abstract. Hyperscaler reports of silent data corruptions (SDCs), presumed to be caused by silicon manufacturing defects, have motivated the development of functional tests for detecting defective CPUs and their use in hyperscaler fleet studies. Interestingly, all such tests seem to assume that defects induce consistent errors: two instances of the same instruction within the same thread, given the same architectural inputs, always produce the same wrong architectural output. We find that this assumption unnecessarily restricts which programs can serve as tests, biasing which defect-induced errors are triggered and detected. It also limits identification of affected instructions to those impacted by errors that short, targeted tests can reproduce, biasing how errors are characterized.
This talk will present ITHICA, which automatically generates functional tests for defect-induced errors from arbitrary programs by inserting intra-thread, instruction-level error checks, primarily leveraging instruction duplication and output comparison. Our key insight, challenging the assumption above, is that the most pernicious defects (those most likely to escape manufacturing testing) cause inconsistent errors: two executions of the same instruction, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. By exploiting this insight, ITHICA enables arbitrary programs to serve as tests and localizes affected instructions upon error detection, overcoming both aforementioned limitations of prior functional tests. We use ITHICA to transform industrial hyperscaler test programs (our baseline), datacenter workloads, and common libraries into functional tests, and evaluate them on over 3,000 CPU servers. ITHICA error checks detect 39% more defective servers than native checks within the ITHICA tests derived from our baseline programs, and enable novel findings on defect behavior that challenge conclusions drawn by prior hyperscaler fleet studies.
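The duplicate-and-compare idea named in the abstract can be illustrated with a toy, software-only sketch. This is not ITHICA itself, which inserts checks at the instruction level. The names `checked`, `FlakyMultiplier`, `MismatchDetected`, and `dot` are hypothetical, and the "defect" is simulated by flipping a result bit on every third call, an assumption made purely for illustration.

```python
import operator
from typing import Callable, Sequence


class MismatchDetected(Exception):
    """Two executions of the same operation on the same inputs disagreed."""


def checked(op: Callable[[int, int], int], a: int, b: int) -> int:
    """Execute `op` twice on identical inputs and compare the two outputs."""
    first = op(a, b)
    second = op(a, b)  # duplicated execution of the same operation
    if first != second:
        name = getattr(op, "__name__", repr(op))
        raise MismatchDetected(f"{name}({a}, {b}) gave {first} and {second}")
    return first


class FlakyMultiplier:
    """Hypothetical defective unit: flips one result bit on every `period`-th call."""

    def __init__(self, period: int = 3) -> None:
        self.period = period
        self.calls = 0

    def mul(self, a: int, b: int) -> int:
        self.calls += 1
        result = a * b
        if self.calls % self.period == 0:
            result ^= 1 << 4  # simulated context-dependent (inconsistent) error
        return result


def dot(xs: Sequence[int], ys: Sequence[int],
        mul: Callable[[int, int], int]) -> int:
    """Dot product in which every multiplication is duplicated and compared."""
    total = 0
    for x, y in zip(xs, ys):
        total += checked(mul, x, y)
    return total


if __name__ == "__main__":
    # Healthy multiplier: the checks pass and the result is returned.
    print(dot([1, 2, 3], [4, 5, 6], operator.mul))
    # Simulated defect: the two executions disagree and the check fires.
    try:
        dot([1, 2, 3], [4, 5, 6], FlakyMultiplier(period=3).mul)
    except MismatchDetected as err:
        print("detected:", err)
```

Note that a defect producing the same wrong value on both executions would pass this check. The sketch only catches the inconsistent errors that the abstract identifies as the most likely to escape manufacturing testing, and the failing call also points at the affected operation.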
Bio. Caroline Trippel is an Assistant Professor in the Computer Science and Electrical Engineering Departments at Stanford University, where she leads the High Assurance Computer Architectures Lab. A central theme of her work is leveraging formal methods, especially automated reasoning techniques, to design and verify hardware systems. Trippel's research has been recognized with IEEE Top Picks and Best Paper Award distinctions, a Sloan Research Fellowship, an NSF CAREER Award, the Intel Rising Star Faculty Award, the 2020 ACM SIGARCH/IEEE CS TCCA Outstanding Dissertation Award, and the 2020 CGS/ProQuest® Distinguished Dissertation Award in Mathematics, Physical Sciences, & Engineering.
Abstract. In this talk we present the state of the art in microarchitectural modeling and simulation-based methods for demystifying the problem of silicon defects in modern CPU and AI accelerator (AIA) chips. Microarchitectural simulation is harnessed in two ways: to develop effective functional programs for defect detection (catching defects before they silently corrupt user programs), and to measure the probabilities and rates of silent and loud data corruptions in modern architectures (supporting efficient fault-tolerance design across the abstraction layers).
Bios. Nikos Karystinos is a Ph.D. student and a member of the Computer Architecture Lab at the Department of Informatics and Telecommunications of the University of Athens. His research interests include microarchitectural simulation, hardware reliability, and program generation. Karystinos received his BSc and MSc degrees in computer science (with a specialization in computer systems: software and hardware) from the University of Athens.
Odysseas Chatzopoulos is a Ph.D. student and a member of the Computer Architecture Lab at the Department of Informatics and Telecommunications of the University of Athens. His research interests include reliability analysis of heterogeneous system architectures from edge devices to hyperscale systems. Chatzopoulos received his BSc degree in computer science from the University of Athens.
Dimitris Gizopoulos is a professor at the Department of Informatics and Telecommunications of the University of Athens, Greece, and director of the Computer Architecture Lab. His team's research interests include the complex interactions among performance, power, and reliability of computing systems built on CPUs, GPUs, and AI accelerators. He serves as associate editor and guest editor for several IEEE and ACM publications (including IEEE CAL, ACM CSUR, and IEEE TC) and is a member of the steering, organizing, and program committees of international computer architecture, hardware, and systems conferences. He is a Fellow of IEEE, an ACM Distinguished Member, and a Golden Core member of the IEEE Computer Society. He is the General Chair of the IEEE/ACM MICRO 2026 symposium.
David Bacon — Google
Abstract forthcoming.
Harish Dixit — Meta
Abstract forthcoming.