Microsoft's AI Models Trained on Unlicensed Web Data, Contradicting Clean Data Claims

The Decoder | June 5, 2026 | 1 min brief

In brief

- Microsoft has revealed that its new MAI models were trained on unlicensed web data such as Common Crawl, despite earlier claims that it used exclusively "clean and commercially licensed" datasets.
- This practice mirrors that of many other AI companies, which rely on fair use arguments and expect website owners to block their crawlers if they object.
- The admission raises questions about the transparency and accuracy of Microsoft's marketing around its AI products.
- While the company emphasizes the quality of its data, critics argue that using unlicensed material undermines its claims of clean, licensed, enterprise-grade training data.
- Looking ahead, this could spark broader discussion about data sourcing practices in the AI industry and how companies communicate their methods to users and developers.

Terms in this brief

MAI: Short for Microsoft AI, the label Microsoft uses for its suite of artificial intelligence models and services.
Common Crawl: A large-scale dataset created by crawling the web and widely used for training AI models, despite concerns over licensing and data ownership.
