
GDPR in Machine Learning projects


Originally published at medium.com by @l.mokrzycki on April 22, 2018.

“A Paris square beside the Louvre filled with people, a garden fountain, and the petit Arc de Triomphe” by Jace Grandinetti on Unsplash

The General Data Protection Regulation takes effect in the European Union on 25th May and will strongly influence how Machine Learning products are developed. Without a doubt, it will increase the amount of work required to ship an ML project to production, but on the other hand, it will be solid protection for the rights of users, which also includes us, the creators of these projects: the rights that help us keep control over the data collected about us and protect us from unfair outcomes of fully autonomous systems.
Now, a user will have a legal basis to access, rectify, transfer or delete their private data. With this post, I would like to propose best practices for ML model developers that reduce the amount of work and issues implied by the new law.

It is important to note that this is not official legal advice for companies or individuals, only my subjective ideas to help developers with the process of complying with the new standards. Remember that the details of GDPR implementation may differ slightly between countries.

Before we jump into good patterns that help comply with GDPR, we need to get familiar with a few concise definitions of terms widely used in documents and materials about the new regulation.

Data Subject: a single person in your system who is either already identified or is directly or indirectly identifiable based on physical, genetic, economic or other factors.

Personal Data: any information related to a Data Subject. It may be misleading that “personal” suggests only a narrow subset of user data, like health or financial records, but it truly means any data gathered about users, including but not limited to online behavior, work performance or physical location.

Data Controller: a person, company or institution that collects users’ data and determines the purpose of its processing. The Data Controller is responsible for protecting users’ data according to GDPR standards and may be obligated to assign one of its employees to the Data Protection Officer role.

Data Processor: a person, company or institution that processes Personal Data on behalf of a Data Controller. Usually, these are third-party services which perform specific tasks with users’ data. According to GDPR, they need to provide proper technical and security measures.

Profiling: any form of analysis or prediction model based on an individual’s Personal Data.

Automated decision making: a process that makes a decision or performs an action without human involvement. The level of automation of such systems varies widely, from merely advice-giving solutions to fully autonomous ones.

Of course, these are plain-language explanations; for the full version of these descriptions you can visit sites like gdpr-info.eu or gdpreu.org.

Documentation is a fundamental tool for building more thorough supervision of an ML project. First of all, it helps new developers quickly jump into a system that may have been developed over years, but more importantly it provides a permanent source of knowledge about the under-the-hood behavior of the product. According to the new law, a Data Subject has the right to obtain a detailed explanation of why an automated decision with a significant impact on their life or career produced a specific outcome, as well as a full description of how their Personal Data is used. The sooner we introduce the idea of up-to-date documentation into the ML workflow, the sooner we will be confident and calm about the possibility of such a request. What should such documentation contain? It depends on the type of project, but some parts should be universal: a description of data formats, a list of all data sources, a list of Data Processors with details about what kind of data they can access, and a description of the process for obtaining data consents. In the case of Machine Learning models, even if we use black-box models, we should at least document the inputs and every possible output of the model.
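
As a minimal sketch of what such documentation could look like in machine-readable form (the `ModelCard` fields and example values below are my own hypothetical choice, not anything mandated by the regulation):

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    """Minimal, versionable documentation for an ML model."""
    name: str
    purpose: str                 # why the model exists and who it affects
    input_features: list[str]    # every field the model consumes
    possible_outputs: list[str]  # the full range of outcomes it can produce
    data_sources: list[str]      # where the training data comes from
    data_processors: list[str] = field(default_factory=list)  # third parties and what they access
    consent_process: str = ""    # how consent for this processing is obtained


card = ModelCard(
    name="churn-predictor-v3",
    purpose="Estimate the probability that a customer cancels within 30 days",
    input_features=["tenure_months", "support_tickets", "plan_type"],
    possible_outputs=["churn probability in [0, 1]"],
    data_sources=["billing database", "support ticket system"],
    data_processors=["CloudTrainCo (feature store, pseudonymized records only)"],
    consent_process="Checkbox at signup, logged with timestamp and policy version",
)
```

Keeping a record like this next to the training code means the documentation is versioned together with the model it describes.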

Control over data can be described as the ability to quickly perform any of the actions defined as Data Subject rights in GDPR: access, rectification, portability and erasure. Every experienced software developer can foresee a big refactoring overhaul for this compliance if the system wasn’t designed to support those actions easily. That’s a strong argument to plan for those capabilities up front, in order to avoid the cost of future changes and prepare the ML pipeline for recalibration after data rectifications.
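
One way to plan for this is to treat the four rights as a single interface that every store of Personal Data in the pipeline has to implement. A hypothetical sketch (the class and method names are my own, not a standard API):

```python
import json
from abc import ABC, abstractmethod


class DataSubjectRights(ABC):
    """Interface for every store of Personal Data in the pipeline, so
    access/rectification/portability/erasure are each one call away."""

    @abstractmethod
    def access(self, subject_id: str) -> dict:
        """Return all Personal Data held about the subject."""

    @abstractmethod
    def rectify(self, subject_id: str, corrections: dict) -> None:
        """Apply corrections; downstream models should be flagged for recalibration."""

    @abstractmethod
    def export(self, subject_id: str) -> str:
        """Portability: hand the data over in a common, machine-readable format."""

    @abstractmethod
    def erase(self, subject_id: str) -> None:
        """Delete the subject's data, including derived features."""


class InMemoryStore(DataSubjectRights):
    def __init__(self):
        self._records: dict[str, dict] = {}

    def access(self, subject_id):
        return self._records.get(subject_id, {})

    def rectify(self, subject_id, corrections):
        self._records.setdefault(subject_id, {}).update(corrections)

    def export(self, subject_id):
        return json.dumps(self.access(subject_id))

    def erase(self, subject_id):
        self._records.pop(subject_id, None)
```

With every store behind the same interface, handling a rights request becomes a loop over stores instead of an ad-hoc hunt through the system.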

Anonymization middleware is connected with the part of GDPR that permits more relaxed work with data when it’s fully anonymized and there’s no possibility of re-identifying an individual from such a subset of information. If the final product doesn’t require identification of specific users, it’s smart to simply get rid of such information by introducing an anonymization layer at the beginning of the data collection pipeline, saving yourself a lot of additional work. The tricky part is that an individual may still be affected by the results of automated decision making even after, in theory, secure removal of all direct and indirect identifiers, and so far there’s no easy answer for that.
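
A toy sketch of such a layer, assuming events arrive as plain dictionaries: it drops direct identifiers and replaces the user ID with a keyed hash. Note that keyed hashing is pseudonymization, not anonymization; as long as the key exists, the data still counts as Personal Data under GDPR:

```python
import hashlib
import hmac

# Direct identifiers to strip entirely; everything else passes through.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "address"}

SECRET_KEY = b"rotate-me-and-keep-out-of-version-control"


def pseudonymize(user_id: str) -> str:
    """Keyed hash so records stay linkable internally but the raw ID
    never enters the pipeline."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()


def anonymization_layer(event: dict) -> dict:
    """First stage of the collection pipeline: drop direct identifiers,
    replace the user ID with a pseudonym."""
    cleaned = {k: v for k, v in event.items() if k not in DIRECT_IDENTIFIERS}
    if "user_id" in cleaned:
        cleaned["user_id"] = pseudonymize(cleaned["user_id"])
    return cleaned


event = {"user_id": "42", "email": "a@b.com", "page": "/pricing", "dwell_s": 31}
print(anonymization_layer(event))
# {'user_id': '<hash>', 'page': '/pricing', 'dwell_s': 31}
```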

Simple Algorithms First means starting small and asking yourself whether you really need techniques like Deep Learning to solve your problem with successful results. Descriptive statistics or decision trees may be good enough for the task and provide a huge benefit in the form of interpretability of the results and the underlying rules. Additionally, the performance gain of black-box models in a production environment may not compensate for the cost of their implementation. Slightly less accurate models can be a good compromise, and their interpretability makes compliance with GDPR faster to achieve. What if using, for example, Deep Learning is crucial for the project? In such a scenario, developers should remember to document and test input and output data, think about the possibility of manual corrections and supervision of the model’s output, and prepare validation datasets which illustrate the behavior of the black box in specific scenarios. The last one could be called “testing for ethics” and should check that the algorithm doesn’t discriminate against or unintentionally affect the Data Subject based on any direct or more fuzzy correlation.
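
For illustration, a short sketch assuming scikit-learn is available: a shallow decision tree whose decision paths can be printed verbatim, followed by a rough “testing for ethics” check comparing outcome rates across a hypothetical group attribute:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data; in practice these would be your documented input features.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = ["tenure", "usage", "tickets", "region_code"]

# A shallow tree: slightly less accurate than a black box, but every
# decision path can be printed and explained to a Data Subject.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))

# "Testing for ethics": compare positive-outcome rates across a
# (stand-in, randomly assigned) protected group on a validation set.
group = np.random.RandomState(0).randint(0, 2, size=len(y))
preds = tree.predict(X)
gap = abs(preds[group == 0].mean() - preds[group == 1].mean())
print(f"outcome-rate gap between groups: {gap:.3f}")
# In a real test suite this would be an assertion against an agreed tolerance.
```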

Transparency translates into being honest with users about what personal data is processed in your product and how. It’s the foundation of the GDPR’s idea: giving the user the ability to give or revoke consent for managing personal data or for performing profiling and automated decision making by the Data Controller. Those consents require conscious user action and should be logged and safely stored in case of a potential audit. But transparency is not only about that; it also connects strongly with the ability to explain how an ML model works and what legal or other significant impact it may have on the lives of the product’s users. In other words, be honest about what you’re going to do with the data.
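
A minimal sketch of such consent logging, here with a plain append-only JSONL file (a production system would want tamper-evident storage; the field names are illustrative):

```python
import json
import time
import uuid


def record_consent(subject_id: str, purpose: str, granted: bool,
                   policy_version: str, log_path: str = "consent_log.jsonl") -> dict:
    """Append one consent event per conscious user action, so an auditor
    can reconstruct who consented to what, when, and under which policy."""
    event = {
        "event_id": str(uuid.uuid4()),
        "subject_id": subject_id,
        "purpose": purpose,         # e.g. "profiling", "automated_decision_making"
        "granted": granted,         # True = consent given, False = revoked
        "policy_version": policy_version,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event


record_consent("user-42", "profiling", granted=True, policy_version="2018-05-25")
record_consent("user-42", "profiling", granted=False, policy_version="2018-05-25")
```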

Those are generic ideas, which I’ll cover in more detail and with examples in future posts on this blog. I hope that just keeping the above concepts in mind, while starting new or refactoring existing Machine Learning projects, will minimize the cost of compliance with General Data Protection Regulation standards. But in my opinion, that’s not the main benefit. The biggest takeaway is creating transparent and interpretable systems, and prioritizing and protecting ethical standards with proper testing and supervision of the ML pipeline, to gain users’ trust in AI/ML-powered solutions.

By Guest Author