Data Erasure In Distributed Systems
Encoding Respect
My talk at this year’s Privacy Engineering Practice and Respect (PEPR) conference came on the heels of the Colorado House voting to pass the state’s comprehensive privacy legislation. This regulatory news sums up one of my talk’s main points for engineers: users are looking for respectful systems, where respect is built into the processes that handle their data. Building respectful systems is both the right thing to do and—increasingly—the approach demanded by regulations worldwide.
For folks who could not attend PEPR, or those who want to revisit my talk, I’ve written up a couple of highlights on modern data erasure in distributed systems.
Key Take-Away #1: Erasing personal data is never just a matter of deleting a single row. Method and order of deletion matter.
A single company will hold personal data across dozens of data systems, and these data systems often refer to one another. Thus, simply deleting a row or individual field could wreak havoc downstream on operations that, for example, rely on a string in a particular field that no longer exists. Maintaining these relationships is referred to as preserving “referential integrity”. In most cases, erasure should not interrupt those existing data relationships held across tables. At the same time, erasure methods must be comprehensive, visiting every table in every data system at least once and effectively erasing all identifiable information. Failing to do so could yield a costly fine for non-compliance with the growing number of privacy regulations worldwide.
Performing erasure at scale requires a systematic traversal of all data systems that could hold users’ data. That said, the essential algorithm is fairly straightforward:
- Receive a user’s erasure request and look up their identity in your system
- Collect initial row IDs that correspond to the user
- For each row, collect all related row IDs, and repeat until all rows are identified
- Erase personal data in all of the collected rows
Without getting too into the weeds on data structures, this approach effectively creates a tree. Starting at the base (or root), a set of row IDs is collected. Then, for each of those row IDs, the related row IDs branch off repeatedly until no further relationships exist. These are the “leaves” of the tree, and the safest order of deletion for this kind of structure is to start at the leaves and work back toward the root. With this reverse-order traversal, the erasure process can safely be interrupted without leaving the system in a bad state, because no remaining row is ever left pointing at a row that has already been erased. In fact, your database constraints may force this!
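To make the traversal concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production implementation: the `RELATED` map, the row ID format, and the `erase_row` callback are all hypothetical stand-ins for whatever your own systems use to describe relationships and to scrub a row.

```python
# Hypothetical relationship map: each row ID lists the rows that depend on it.
# In a real system this would be derived from foreign keys or schema metadata.
RELATED = {
    "users:1": ["orders:10", "orders:11", "addresses:5"],
    "orders:10": ["payments:100"],
    "orders:11": [],
    "addresses:5": [],
    "payments:100": [],
}


def erasure_order(root_ids, related):
    """Return row IDs leaves-first, so each row follows the rows that depend on it."""
    order = []
    seen = set()  # guards against visiting a shared row twice

    def visit(row_id):
        if row_id in seen:
            return
        seen.add(row_id)
        for child in related.get(row_id, []):
            visit(child)
        order.append(row_id)  # appended only after all of its children

    for root in root_ids:
        visit(root)
    return order


def erase_user(root_ids, related, erase_row):
    """Erase every collected row, working from the leaves back to the root."""
    for row_id in erasure_order(root_ids, related):
        erase_row(row_id)


erased = []
erase_user(["users:1"], RELATED, erased.append)
print(erased)
# ['payments:100', 'orders:10', 'orders:11', 'addresses:5', 'users:1']
```

If the process stops partway through, the rows that remain still form a consistent tree rooted at the user, so the request can simply be retried from the top.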
Key Take-Away #2: Avoid storing personal data in data warehouses whenever possible.
Structural differences between application databases and data warehouses make the latter less conducive to this iterative approach. Data warehouses often contain many copies of personal data, as they are often denormalized tables. Furthermore, warehouse tables are usually indexed completely differently than application databases, which means that typically fast operations (like querying by user ID) may become painfully slow. As a result, there are typically more tables to visit, less efficiently, so erasure in a data warehouse is generally slower than it is in an application database.
Nevertheless, erasure is possible in a data warehouse. When implementing the same general algorithm as above, visiting thousands of these rows iteratively can be inefficient. To improve this, I’d recommend looking for where you can run bulk updates of records, taking advantage of the partitions or indexes that do exist in your data warehouse to carry out the operation at scale.
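As a rough sketch of the bulk approach, the example below uses Python's built-in `sqlite3` module as a stand-in for a warehouse. The `events` table, its columns, and the `erase_in_bulk` helper are assumptions made for illustration; a real warehouse will have its own bulk-update and partitioning mechanisms, and the right batch size depends on the engine.

```python
import sqlite3


def erase_in_bulk(conn, user_ids, batch_size=500):
    """Null out the email column for many users with one UPDATE per batch."""
    erased = 0
    for start in range(0, len(user_ids), batch_size):
        batch = user_ids[start:start + batch_size]
        placeholders = ",".join("?" for _ in batch)
        cur = conn.execute(
            f"UPDATE events SET email = NULL WHERE user_id IN ({placeholders})",
            batch,
        )
        erased += cur.rowcount  # rows touched by this batch
    conn.commit()
    return erased


# Hypothetical denormalized table that repeats the user's email on every event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, email TEXT, action TEXT)")
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "login"),
        (1, "a@example.com", "purchase"),
        (2, "b@example.com", "login"),
        (3, "c@example.com", "login"),
    ],
)

erased_rows = erase_in_bulk(conn, [1, 2])
print(erased_rows)  # 3
```

The point of the design is that several pending erasure requests are folded into a single statement that can use an existing index, instead of issuing one update per row.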
Key Take-Away #3: Take a proactive privacy approach to reduce the overall complexity of data erasure.
As a company, you’re only responsible for erasing data that you actually store. It sounds almost vacuously true, but it’s increasingly important to emphasize data minimization—storing only the necessary amount of data, for the minimum amount of time—as a design mindset in building data systems.
As the demand for data-driven applications has exploded, developers have largely defaulted to collecting as much personal data as possible, since it’s convenient to do so. This mindset has simply been the industry norm. However, proactive privacy steps like maintaining detailed metadata on categories of personal data aren’t typically a part of the development lifecycle. Instead, these tasks happen infrequently after deployment, taking up a lot of time. I believe we can and should build privacy into the design of data systems. Respectful systems are systems that users deserve, and they simplify regulatory compliance so you can focus on engineering instead of ticking checkboxes. Take data minimization as an example. Developers can practice respectful design by minimizing the data they collect from the outset, and furthermore proactively erasing data after it has fulfilled its business purposes. For example: expire all collected session data every few hours. In this case, erasure takes care of itself!
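The session example can be sketched in a few lines of Python. This is only an illustration of the idea: the `SessionStore` class, its three-hour lifetime, and the injected clock are hypothetical, and many real session backends offer a built-in expiry setting that would do the same job.

```python
import time


class SessionStore:
    """In-memory session store where every entry expires after a fixed lifetime."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without waiting
        self._sessions = {}  # session_id -> (expires_at, data)

    def put(self, session_id, data):
        self._sessions[session_id] = (self._clock() + self._ttl, data)

    def get(self, session_id):
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            del self._sessions[session_id]  # expired data is dropped on access
            return None
        return data

    def purge_expired(self):
        """Delete every expired session and return how many were removed."""
        now = self._clock()
        expired = [sid for sid, (exp, _) in self._sessions.items() if now >= exp]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)


# Simulated clock: advance time by hand instead of sleeping.
now = [0.0]
store = SessionStore(ttl_seconds=3 * 60 * 60, clock=lambda: now[0])
store.put("s1", {"cart": ["book"]})
now[0] += 4 * 60 * 60  # four hours later
purged = store.purge_expired()
print(purged)  # 1
```

Because the data removes itself on a schedule, a later erasure request has nothing left to find in this store.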
Virtually every company will need to implement privacy measures. It’s only a question of whether you’ll implement them proactively and efficiently, or retroactively, weighing down your team’s ability to innovate in the process.
Conclusion
On a conceptual level, data deletion is one of the most straightforward ideas in modern data privacy. Putting it into action across distributed systems, however, is more of a challenge. Modern data rights are here to stay (see: Colorado on the brink of passing the latest comprehensive privacy law in the US), and frankly we as developers can make respectful design choices to make it less difficult!
In my opinion, waiting for regulation to catch up to reality is pretty unnecessary; building more respectful systems makes better products, because they help you earn users’ trust. As I was preparing for PEPR, I came across a tweet from Dr. Ann Cavoukian, the scholar who put the Privacy By Design framework on the map. In a conversation about privacy vs. innovation, Dr. Cavoukian replied:
Get rid of the “vs.” You can have privacy AND innovation! You just need to proactively embed privacy-protective measures into the design of your operations — bake it into the code, by design!
Couldn’t have said it better myself.