RESIST Tutorial @ DSN

About the Tutorial

As process nodes shrink and datacenters scale toward exascale, Silent Data Corruptions (SDCs) have emerged as a primary challenge to computational integrity. Once dismissed as rare anomalies, SDCs are now implicated in corrupted AI training weights, silent database corruptions, and elusive bugs that surface only after billions of compute-hours.

This full-day tutorial offers a cross-layer journey through the SDC landscape, led by academic and industry experts from AMD, Meta, and Google. We begin at the silicon level — examining the defects, variability, and aging phenomena at the root of these corruptions — then ascend through architectural mitigations (ECC, residue checking, lockstepping), software-level testing and resilience, and finally workload behavior in distributed systems and large-scale AI. The tutorial bridges the dialogue between those who design silicon and those who manage the software running upon it, leaving attendees with a clear view of the open research questions that will define the next decade of reliable systems design.

Target Audience: Graduate students seeking a research area combining hardware, reliability, and systems; dependability researchers wishing to understand industrial hyperscaler constraints; and practitioners responsible for device and datacenter reliability. Basic knowledge of computer architecture and systems programming is expected.

Tutorial Schedule

Full-Day Tutorial (6 Hours)

Morning (3 hours)

45 min

Silicon Marginality and Fault Origins

Defects, variability, and aging; trends and remediation at the process and circuit levels.

Speaker TBD

45 min

Architectural Defenses and Redundancy

ECC, residue checking, lockstepping, and modular redundancy, with their power, area, and performance trade-offs.

Vilas Sridharan (AMD)

1.5 hrs

Testing for SDCs via Software Approaches

Simulation-driven and compiler-driven test generation for detecting SDC-causing defects in a fleet.

Caroline Trippel (Stanford); Nikos Karystinos, Odysseas Chatzopoulos & Dimitris Gizopoulos (University of Athens)

Afternoon (3 hours)

45 min

Software Robustness to SDCs

Building software and compilers robust to hardware defects, with insights from large-scale production deployments.

David Bacon (Google)

1.5 hrs

At-Scale Failure Trends and Debugging

Hyperscale failure trends drawn from recent large-scale industry reports on fleet-wide hardware reliability.

Harish Dixit (Meta)

30 min

The Future of Reliable Systems Panel

Shrinking nodes, AI/ML for failure prediction, and the road ahead.

15 min

Open Discussion

Talks & Speakers

Silicon Marginality and Fault Origins

Speaker TBD

Architectural Defenses and Redundancy

Vilas Sridharan — AMD

Abstract. This talk will cover the challenges, current state, and future directions of addressing the needs of data center reliability with a focus on architectural approaches to quantify and remediate the threat of silicon defects.

Bio. Vilas Sridharan is an AMD Senior Fellow, where he leads the RAS (Reliability, Availability and Serviceability) Architecture team. His research focuses on the modeling of hardware faults and on architectural and micro-architectural approaches to reliability and fault tolerance in high-performance microprocessors. Vilas received his Ph.D. and M.S.E. from the Department of Electrical and Computer Engineering at Northeastern University, and his B.S.E. in Computer Engineering from Princeton University in 2000. From 2000 to 2004, he worked in the SPARC server division at Sun Microsystems. Since 2010, he has been on AMD's RAS Architecture team.

Testing for SDCs via Software Approaches

Two-part session

Part 1 — Caroline Trippel (Stanford University)

Abstract. Hyperscaler reports of silent data corruptions (SDCs), presumed to be caused by silicon manufacturing defects, have motivated the development of functional tests for detecting defective CPUs and their use in hyperscaler fleet studies. Interestingly, all such tests seem to assume that defects induce consistent errors: two instances of the same instruction within the same thread, given the same architectural inputs, always produce the same wrong architectural output. We find that this assumption has two costs. It unnecessarily restricts which programs can serve as tests, biasing which defect-induced errors are triggered and detected. It also limits identification of affected instructions to those impacted by errors that short, targeted tests can reproduce, biasing how errors are characterized.

This talk will present ITHICA, which automatically generates functional tests for defect-induced errors from arbitrary programs by inserting intra-thread, instruction-level error checks, primarily leveraging instruction duplication and output comparison. Our key insight, challenging the assumption above, is that the most pernicious defects (those most likely to escape manufacturing testing) cause inconsistent errors: two executions of the same instruction, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. By exploiting this insight, ITHICA enables arbitrary programs to serve as tests and localizes affected instructions upon error detection, overcoming both aforementioned limitations of prior functional tests. We use ITHICA to transform industrial hyperscaler test programs (our baseline), datacenter workloads, and common libraries into functional tests, and evaluate them on over 3,000 CPU servers. ITHICA error checks detect 39% more defective servers than native checks within the ITHICA tests derived from our baseline programs, and enable novel findings on defect behavior that challenge conclusions drawn by prior hyperscaler fleet studies.
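The duplicate-and-compare idea named in the abstract can be illustrated with a small software model. The sketch below is not ITHICA and does not reflect its implementation; every name in it (`make_inconsistent_multiplier`, `consistent_faulty_multiply`, `duplicate_and_compare`, `count_mismatches`) is hypothetical, and the fault models are assumptions chosen only for illustration. It shows why running an operation twice and comparing the outputs can expose an inconsistent error, while a consistent error (the same wrong answer both times) passes the comparison unnoticed.

```python
import random


def make_inconsistent_multiplier(error_rate, seed=0):
    """Hypothetical model of a defective multiplier with inconsistent errors:
    with probability error_rate, one of the low 8 bits of the product is flipped."""
    rng = random.Random(seed)

    def multiply(a, b):
        product = a * b
        if rng.random() < error_rate:
            product ^= 1 << rng.randrange(8)  # flip one randomly chosen low bit
        return product

    return multiply


def consistent_faulty_multiply(a, b):
    """Hypothetical model of a consistent error: bit 0 is always flipped,
    so the same inputs always give the same wrong output."""
    return (a * b) ^ 1


def duplicate_and_compare(op, a, b):
    """Execute op twice on identical inputs and compare the two outputs.
    Returns (first_result, mismatch_detected)."""
    first = op(a, b)
    second = op(a, b)
    return first, first != second


def count_mismatches(op, pairs):
    """Count how many input pairs trigger a duplicate-and-compare mismatch."""
    return sum(1 for a, b in pairs if duplicate_and_compare(op, a, b)[1])


pairs = [(a, a + 3) for a in range(200)]
healthy = count_mismatches(lambda a, b: a * b, pairs)
consistent = count_mismatches(consistent_faulty_multiply, pairs)
inconsistent = count_mismatches(make_inconsistent_multiplier(0.5), pairs)
print(healthy, consistent, inconsistent)
```

In this model the healthy and consistently faulty multipliers both report zero mismatches, while the inconsistent one reports many. A consistent error would instead need a check against an independently known correct result.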

Bio. Caroline Trippel is an Assistant Professor in the Computer Science and Electrical Engineering Departments at Stanford University, where she leads the High Assurance Computer Architectures Lab. A central theme of her work is leveraging formal methods, especially automated reasoning techniques, to design and verify hardware systems. Trippel's research has been recognized with IEEE Top Picks and Best Paper Award distinctions, a Sloan Research Fellowship, an NSF CAREER Award, the Intel Rising Star Faculty Award, the 2020 ACM SIGARCH/IEEE CS TCCA Outstanding Dissertation Award, and the 2020 CGS/ProQuest® Distinguished Dissertation Award in Mathematics, Physical Sciences, & Engineering.

Part 2 — Nikos Karystinos, Odysseas Chatzopoulos & Dimitris Gizopoulos (University of Athens)

Abstract. In this talk we present the state of the art in microarchitectural modeling and simulation-based methods to demystify the problem of silicon defects in modern CPU and AI accelerator (AIA) chips. Microarchitectural simulation is harnessed in two ways: to develop effective functional programs for defect detection, catching defects before they silently corrupt user programs; and to measure the probabilities and rates of silent and loud data corruptions in modern architectures, assisting efficient fault-tolerance design across the abstraction layers.

Bios. Nikos Karystinos is a Ph.D. student and a member of the Computer Architecture Lab at the Department of Informatics and Telecommunications of the University of Athens. His research interests include microarchitectural simulation, hardware reliability, and program generation. Karystinos received his BSc and MSc degrees in computer science (with a specialization in computer systems: software and hardware) from the University of Athens.

Odysseas Chatzopoulos is a Ph.D. student and a member of the Computer Architecture Lab at the Department of Informatics and Telecommunications of the University of Athens. His research interests include reliability analysis of heterogeneous system architectures from edge devices to hyperscale systems. Chatzopoulos received his BSc degree in computer science from the University of Athens.

Dimitris Gizopoulos is a professor at the Department of Informatics and Telecommunications of the University of Athens, Greece, and director of the Computer Architecture Lab. His team's research interests include the complex interactions among performance, power, and reliability of computing systems built on CPUs, GPUs, and AI accelerators. He serves as associate editor and guest editor for several IEEE and ACM publications (including IEEE CAL, ACM CSUR, and IEEE TC) and is a member of the steering, organizing, and program committees of international computer architecture, hardware, and systems conferences. He is a Fellow of IEEE, an ACM Distinguished Member, and a Golden Core member of the IEEE Computer Society. He is the General Chair of the IEEE/ACM MICRO 2026 symposium.

Software Robustness to SDCs

David Bacon — Google

Abstract forthcoming.

At-Scale Failure Trends and Debugging

Harish Dixit — Meta

Abstract forthcoming.