Debugging of Flash Issues Observed in Hyperscale Environment at Scale

Abstract

A deep dive into the methodology and tooling we use at Meta to improve the debuggability of failures in our datacenters, especially failures of components such as SSDs, where privacy requirements may prohibit us from sending the components back for failure analysis (FA) or adding custom instrumentation in our datacenters. In particular, we will talk about how the tracewatch tool, coupled with the Latency Monitoring log page, helps us trigger trace collection on failures using BPF-based triggers. We will present the retrace tool, which can then be used to analyze the captures in a variety of formats, convert between those formats, and filter down to the stack of a single I/O from the application layer down to the drive. We will also present dialog, our collection mechanism for file-system-based logging, and the associated sanitization process. Finally, we will talk about the ways in which we’re collaborating with the industry to design efficient logging built into flash drives.
