Legal Update for UK AI Companies: Data Scraping and AI Development

 

Introduction

Data scraping refers to the automated extraction of large volumes of data from publicly accessible websites or platforms. It is one of the foundational methods used to collect training data for AI models: the resulting datasets are critical for teaching a model how to recognise patterns, generate content, or make predictions.

The recent wave of litigation targeting major AI players like OpenAI, Microsoft, and Stability AI has brought one issue to the forefront: the legality of using scraped data to train generative models. The New York Times lawsuit against OpenAI and Microsoft alleges wholesale reproduction of its copyrighted content without consent, while Getty Images v. Stability AI raises the alarm over unauthorised image scraping. Nvidia, too, is facing questions around whether its data collection practices may violate copyright or data protection standards.

Many UK-based AI companies are left asking: What is permissible under current law? The answers are nuanced.

Scraping: Not Unlawful in Principle

English law does not prohibit data scraping as such, but there is limited legal authority on the point, so the Courts would have little precedent to follow should a dispute arise. The legal treatment of scraping depends on a matrix of factors, including intellectual property rights, contractual restrictions, and data protection law.

Between January and September 2024, the Information Commissioner’s Office (ICO) ran a consultation on how the UK GDPR applies to generative AI, particularly in the context of large-scale data harvesting. Meanwhile, the Government’s white paper, A Pro-Innovation Approach to AI Regulation (published in March 2023, with the Government’s consultation response following in February 2024), suggests a light-touch, developer-friendly framework is in the works. It explicitly prioritises innovation over early regulatory rigidity and highlights ambitions to make the UK a “science and technology superpower by the end of the decade”.

As it stands, developers may extract and use publicly available data so long as: (i) doing so does not breach any laws; and (ii) where the data includes personal data, there is a valid lawful basis for processing it under the UK GDPR.

Key Legal Risks for AI Developers

COPYRIGHT INFRINGEMENT

Copyright protection arises automatically in original works such as written content and images.

The threshold is relatively low: any element of creativity or intellectual input can trigger protection. A developer who scrapes images, editorial copy, or stylised representations may therefore be infringing the original author's rights.

In December 2024, the UK government launched a public consultation (now closed), Copyright and Artificial Intelligence, proposing a new copyright and database exception to permit text and data mining (including for commercial AI training) unless rights holders opt out via a rights-reservation system. The proposal aims to give greater legal certainty to AI developers while offering protections and compensation mechanisms for creators. The government is presently analysing the feedback and has yet to publish the outcome of the consultation.

Accordingly, the legal position remains unsettled. There is no reported UK case confirming that scraping for AI training amounts to copyright infringement. The Getty v. Stability AI litigation is expected to clarify this issue.

BREACH OF CONTRACT (TERMS OF USE)

Many websites include Terms of Service/Use which expressly prohibit scraping. However, enforceability will depend on how those terms are presented:

  • if the user had to click “Accept” (clickwrap agreement), the contract is likely enforceable;

  • if the terms were passively linked (browsewrap agreements), enforceability depends on how visible and prominent they were.

UK GDPR AND PERSONAL DATA

Even if data is publicly accessible, scraping personal data triggers obligations under the UK GDPR. Developers must identify a lawful basis to process the data.

In the context of AI training, the most relevant basis is likely to be legitimate interests. This requires satisfying a three-part test.

  1. Purpose test – Is there a legitimate reason for scraping the data?

  2. Necessity test – Is scraping necessary to achieve that purpose, or is there a less intrusive way of doing so?

  3. Balancing test – Do the rights of individuals override your interest?
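For teams that want to keep a written record of this assessment, the three limbs can be captured in a simple structure. The following is a minimal sketch only: the class and field names are hypothetical, and the code records conclusions reached by a human reviewer rather than performing any legal analysis itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LimbAnswer:
    """A reviewer's conclusion on one limb of the test (hypothetical record)."""
    satisfied: bool
    reasoning: str


@dataclass
class LegitimateInterestsAssessment:
    """Records the purpose, necessity and balancing limbs for one scraping project."""
    purpose: LimbAnswer
    necessity: LimbAnswer
    balancing: LimbAnswer

    def failed_limbs(self) -> List[str]:
        # List the limbs the reviewer marked as not satisfied.
        limbs = {
            "purpose": self.purpose,
            "necessity": self.necessity,
            "balancing": self.balancing,
        }
        return [name for name, answer in limbs.items() if not answer.satisfied]

    def all_limbs_satisfied(self) -> bool:
        # All three limbs must be satisfied, not just a majority.
        return not self.failed_limbs()


assessment = LegitimateInterestsAssessment(
    purpose=LimbAnswer(True, "Training a model to summarise property listings."),
    necessity=LimbAnswer(False, "A licensed dataset without personal data is available."),
    balancing=LimbAnswer(True, "Limited data, low impact on individuals."),
)
print(assessment.failed_limbs())
print(assessment.all_limbs_satisfied())
```

In this illustrative example the necessity limb is not met, so the record shows that legitimate interests could not be relied on without revisiting the project design.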

Where a developer scrapes identifiable personal data (e.g. agent names, contact details, user reviews), these tests become critical. Active data governance policies can mitigate the risk of breaching the UK GDPR.

Developers must also inform individuals how their data is used and provide mechanisms for exercising data subject rights (e.g. access, deletion).
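As one illustration of the data governance measures mentioned above, a developer might strip obvious contact details from scraped text before it enters a training set. The sketch below is an assumption about what such a step could look like, not a method described by the ICO: the function name and patterns are hypothetical, the phone pattern is a rough match for UK-style numbers, and neither pattern will catch every case.

```python
import re

# Simple email pattern (illustrative, not exhaustive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Rough pattern for UK-style numbers such as "020 7946 0000" or "+44 20 7946 0000".
PHONE_RE = re.compile(r"(?:\+44\s?|0)\d(?:[\s-]?\d){8,9}")


def redact_contact_details(text):
    """Replace emails and phone numbers with placeholders.

    Returns the redacted text and a count of each type of replacement,
    which can be logged as part of a data governance record.
    """
    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[PHONE]", text)
    return text, {"email": n_email, "phone": n_phone}


sample = "Contact Jane Doe at jane.doe@example.co.uk or 020 7946 0000."
redacted, counts = redact_contact_details(sample)
print(redacted)
print(counts)
```

Pattern matching of this kind does not detect names or free-text reviews, so it would only ever be one part of a wider governance policy and does not remove the need for a lawful basis.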

 

Dealing with Scraped Data: Internal and Third-Party Use

If an AI model will be used internally, the developer retains control. However, if the model is deployed via an API or licensing arrangement, the developer may be liable for how others use the model.

The ICO has warned that AI companies must take steps to ensure their models are used only for their intended lawful purpose. Examples of such steps may include:

  • implementing token-based access controls;

  • building an API gateway to limit use cases and volume; and/or

  • requiring third-party users to sign binding terms of use.

This is particularly important for start-ups building out commercial use cases or collaborating with other developers.
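The three steps listed above can be combined in a single gateway check. The following is a minimal sketch under stated assumptions: the `Client` and `Gateway` names, the use-case labels, and the per-minute limit are all hypothetical, and a production system would typically rely on an established API gateway product rather than hand-written code.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Client:
    """A third-party user of the model API (hypothetical record)."""
    name: str
    accepted_terms: bool              # has signed the binding terms of use
    allowed_use_cases: Set[str]       # purposes agreed in those terms
    max_requests_per_minute: int
    request_times: List[float] = field(default_factory=list)


class Gateway:
    """Checks token, terms, use case and volume before a request reaches the model."""

    def __init__(self) -> None:
        self._clients: Dict[str, Client] = {}

    def issue_token(self, client: Client) -> str:
        # Token-based access control: each client gets an unguessable token.
        token = secrets.token_urlsafe(32)
        self._clients[token] = client
        return token

    def authorise(self, token: str, use_case: str,
                  now: Optional[float] = None) -> Tuple[bool, str]:
        now = time.monotonic() if now is None else now
        client = self._clients.get(token)
        if client is None:
            return False, "unknown token"
        if not client.accepted_terms:
            return False, "terms of use not accepted"
        if use_case not in client.allowed_use_cases:
            return False, "use case not permitted"
        # Volume limit: keep only requests made in the last 60 seconds.
        client.request_times = [t for t in client.request_times if now - t < 60]
        if len(client.request_times) >= client.max_requests_per_minute:
            return False, "rate limit exceeded"
        client.request_times.append(now)
        return True, "ok"


gateway = Gateway()
token = gateway.issue_token(
    Client("acme", accepted_terms=True,
           allowed_use_cases={"property-valuation"}, max_requests_per_minute=2)
)
print(gateway.authorise(token, "property-valuation"))
print(gateway.authorise(token, "profiling"))
```

Keeping the permitted use cases alongside the token means a request for a purpose outside the signed terms is refused before it reaches the model, which gives the developer a record of the steps taken to limit third-party use.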

Looking Ahead

The legal landscape surrounding data scraping and AI is still maturing. While UK regulators have signalled pro-innovation leanings, developers must act responsibly and proactively manage risk.

Until the Courts or Parliament provide clearer guidance, companies operating in this space will need to strike a careful balance between innovation and compliance.


This article is intended for information purposes only and provides a general overview of the relevant legal topic. It does not constitute legal advice and should not be relied upon as such. While we strive for accuracy, the law is subject to change, and we cannot guarantee that the information is current or applicable to specific circumstances. Costigan King accepts no liability for any reliance placed on this material. For further details concerning the subject of the article or for specific advice, please contact a member of our team.


 
 

Chira Santea

Trainee

