Jun 02, 2025

AI Compliance in Companies (Part II) – When Does an AI Model Fall Within the Scope of the GDPR?

The General Data Protection Regulation (GDPR) was deliberately drafted in a technology-neutral manner, so it is not surprising that the long arm of the GDPR also extends deep into processes involving AI. This is somewhat inevitable, as the development of Large Language Models (LLMs) requires the processing of ever-larger data sets. The development of an AI system can undoubtedly involve a number of activities relevant to data protection law on the part of the controller, from the development phase to the deployment phase. At the heart of every AI system is the underlying AI model, the neural network developed using machine learning. To build it, training data must be collected and prepared, and the AI model must then be trained on that data. The collection and preparation of data may constitute processing within the meaning of the GDPR if the training data is personal data. Anonymizing personal data prior to training also constitutes processing, which is why the GDPR must be observed at this stage as well. In the deployment phase, i.e., when the AI system is used, the processing of personal data is often also envisaged and must likewise be reviewed from a data protection perspective. However, in addition to these more obvious forms of personal data processing, the question arises as to whether an AI model that has been trained with personal data itself contains personal data. In other words, the question is whether the AI model itself can be subject to data subject rights under Art. 12 et seq. GDPR. In addition, supervisory authorities could order measures to remedy the unlawfulness of the processing of personal data in the development phase of an AI model. These include fines, temporary restrictions, the deletion of unlawfully processed data sets (in whole or in part), or even the deletion of the AI model itself.
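Purely for illustration, the following Python sketch (with hypothetical column names and invented data) shows the kind of pre-processing step referred to above: direct identifiers are stripped from a training data set before model training. Even this step is itself processing within the meaning of the GDPR, and replacing names with hashes will typically amount to pseudonymization rather than anonymization, because the remaining free text may still allow the person to be identified.

```python
# Illustrative sketch only (hypothetical column names, invented data):
# stripping direct identifiers from a training data set before training.
# Under the GDPR, this pre-processing is itself "processing", and hashing
# identifiers is generally pseudonymization, not anonymization.
import hashlib
import pandas as pd

tickets = pd.DataFrame({
    "customer_name": ["Erika Mustermann"],
    "email": ["erika@example.com"],
    "ticket_text": ["My contract no. 4711 was billed twice in May."],
})

# Drop direct identifiers and replace them with a shortened hash; the free
# text may still contain indirect identifiers that allow re-identification.
prepared = tickets.drop(columns=["customer_name", "email"])
prepared["record_id"] = [
    hashlib.sha256(name.encode()).hexdigest()[:12]
    for name in tickets["customer_name"]
]
print(prepared)
```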

Is an AI Model Anonymous or Does It Contain Personal Data?

Whether an AI model itself is anonymous depends on whether the AI model contains personal data. According to Art. 4 No. 1 GDPR, personal data is any information relating to an identified or identifiable natural person. In contrast, the GDPR does not apply to anonymous data, i.e., data that does not relate to an identified or identifiable natural person, or personal data that has been anonymized in such a way that the data subject is not or no longer identifiable. If an AI model has been trained (also) with personal data, the question arises as to what extent the AI model contains personal data as a result of this training. In this context, the Hamburg Commissioner for Data Protection and Freedom of Information stated in its discussion paper “Large Language Models and Personal Data” on the applicability of the GDPR to large language models that the mere storage of an LLM does not constitute processing within the meaning of Art. 4 No. 2 GDPR, as no personal data is stored in LLMs themselves. This is justified by the fact that LLMs work on the basis of tokens (linguistic fragments) and embeddings (mathematical representations of the relationships between tokens) and represent “highly abstracted and aggregated data points from the training data and their relationships to each other without concrete characteristics or references to natural persons.” In a recent statement “on certain data protection aspects of the processing of personal data in the context of AI models,” the EDPB has now rejected the Hamburg Data Protection Commissioner’s thesis. The EDPB clarifies that an AI model trained with personal data cannot automatically be considered anonymous in every case. The claimed anonymity must instead be examined by the competent supervisory authorities on a case-by-case basis.
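To make the tokens-and-embeddings argument easier to picture, the following toy Python sketch (invented vocabulary and random vectors, not a real LLM) shows how a sentence mentioning a person is reduced to token IDs and numerical embedding vectors. The Hamburg view relies on the fact that only such abstracted parameters are stored in the model; the EDPB’s point is that this abstraction alone does not guarantee that personal data can no longer be extracted.

```python
# Toy illustration (not a real LLM): text mentioning a person is reduced to
# token IDs and embedding vectors. A model's parameters hold numerical
# representations like these rather than verbatim records about the person.
import numpy as np

# Hypothetical subword vocabulary and a tiny random embedding table.
vocab = {"Er": 0, "ika": 1, " Muster": 2, "mann": 3, " owes": 4, " 500": 5, " EUR": 6}
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocab), 4))  # one 4-dimensional vector per token

tokens = ["Er", "ika", " Muster", "mann", " owes", " 500", " EUR"]
token_ids = [vocab[t] for t in tokens]
vectors = embedding_table[token_ids]

print(token_ids)      # [0, 1, 2, 3, 4, 5, 6]
print(vectors.shape)  # (7, 4): highly abstracted numerical representations
```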

How Will the Distinction Be Made?

An AI model can only be considered anonymous if two cumulative conditions are met: both the probability of direct (including probabilistic) extraction of personal data about the individuals whose data was used for training and the probability that such personal data can be obtained through intentional or unintentional queries must be negligible for each individual concerned. This is convincing, as information may also relate to a natural person if it is encoded in such a way that the relationship is not immediately apparent. Although AI models do not usually contain direct records of personal data, but only parameters representing probabilistic relationships derived from the training data, it may nevertheless be possible to derive information from the AI model. Under certain circumstances, statistically derived personal data can be extracted from the AI model. The probability assessment to be carried out should take into account all means likely to be used by the controller or by another person in the exercise of their normal activities, including the unintended (re)use or disclosure of the AI model. According to the EDPB, the criteria for assessing the residual probability of identification should include the characteristics of the training data set (e.g., uniqueness of the data sets, accuracy), the methods used for training, and the implementation of technical and organizational measures to reduce identifiability (e.g., regularization methods, differential privacy). The results of tests checking the model’s resistance to attacks such as attribute and membership inference, exfiltration, or regurgitation of training data, the context in which the AI model is released and/or processed (e.g., public availability versus internal use), and additional information that could be available to another person for identification must also be taken into account. Controllers must document the measures taken to reduce the probability of identification and the possible remaining risks, not least because this documentation in particular is to be taken into account by the competent authorities when assessing the anonymity of an AI model. If, after reviewing the documentation and the measures implemented, the competent authority cannot confirm anonymity, it can be assumed that the controller has not fulfilled its accountability obligations under Art. 5(2) GDPR. Careful documentation is therefore strongly recommended.
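For orientation, the following deliberately simplified Python sketch (synthetic data; NumPy and scikit-learn are assumed to be available) illustrates the intuition behind one of the tests mentioned above, a membership inference attack: a model is often more confident on records it was trained on, so an unusually low loss on a candidate record can indicate that the record was part of the training set. Real assessments in the sense of the EDPB statement rely on far more rigorous, state-of-the-art attack methods.

```python
# Highly simplified membership-inference intuition (illustrative only):
# the model tends to assign lower loss to records it was trained on, so a
# large loss gap between a candidate record and outside data is a signal
# that the candidate may have been part of the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=1)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

def per_record_loss(x, y):
    # Negative log-likelihood the model assigns to the true label of one record.
    p = model.predict_proba(x.reshape(1, -1))[0, y]
    return -np.log(p + 1e-12)

member_loss = per_record_loss(X_train[0], y_train[0])            # record seen in training
outsider = rng.normal(size=5)
outsider_loss = per_record_loss(outsider, int(outsider[0] > 0))  # record never seen

print(member_loss, outsider_loss)
```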

Attorney Anton Schröder

I.  https://fin-law.de

E. info@fin-law.de
