Bluesky — Data Breach
Executive Summary
A Hugging Face employee published a dataset of 1 million Bluesky posts collected via the public Firehose API, including text, metadata, and users' decentralized identifiers (DIDs). After immediate backlash, the dataset was removed within a day. Larger datasets quickly appeared, however, including one of nearly 300 million non-anonymized posts (roughly 42.5% of all Bluesky posts). Bluesky acknowledged it could not enforce consent preferences outside its own systems.
What Happened
On November 26, 2024, Daniel van Strien, a machine learning librarian at Hugging Face, published a dataset containing 1 million public Bluesky posts collected via the platform's Firehose API, including text content, metadata, and users' decentralized identifiers. Van Strien removed the dataset after backlash, but larger datasets subsequently appeared, including one containing nearly 300 million non-anonymized posts representing roughly 42.5% of all Bluesky posts. Bluesky acknowledged it cannot enforce user consent preferences outside its own systems and stated it is exploring ways for users to communicate consent preferences externally, though enforcement would depend on outside developers voluntarily respecting those settings.
Who Is Affected
Any Bluesky user who has posted publicly on the platform may be affected: posts, metadata, and user identifiers were included in these datasets, which were intended for machine learning research. The larger dataset of nearly 300 million posts covers roughly 42.5% of all posts on the platform, so it likely includes content from a substantial portion of the user base. The source material does not specify geographic or demographic details of affected users.
Why It Matters
This incident demonstrates that Bluesky's open API architecture lets third parties collect public posts at scale for AI training, even though the platform's own policy is not to use user content for such training. The event highlights a fundamental limitation of decentralized social platforms: data posted publicly can be harvested at scale without enforceable consent mechanisms. As Bluesky experiences rapid user growth, the platform now faces the same data-scraping scrutiny as established social networks, but with fewer technical controls to protect user preferences.
What You Should Do
Bluesky users should assume that any content they post publicly can be collected and used for AI training or other purposes beyond the platform's control. Users concerned about scraping should limit the personal information they include in public posts. Because Bluesky has stated it cannot enforce consent preferences outside its systems, users should also watch for platform updates on the tools it is exploring for communicating consent preferences externally, keeping in mind that those settings would only take effect where outside developers voluntarily respect them.
AI-Assisted
Event summaries are generated by Claude AI from verified sources and reviewed by humans before publication.
Sources