Content Moderation

Content moderation refers to the removal, demotion, or labelling of potentially harmful user posts on social media.

How should AI be used in the process of content moderation? And how should AI-generated content itself be moderated?


Key Points:

  • AI can be used in content moderation to enable large-scale enforcement of platform rules and to reduce harmful impacts on human reviewers.
  • The use of AI in content moderation can make enforcement errors more likely.
  • We can expect to see increasing quantities of harmful AI-generated content being posted on social media.
  • Short of banning AI-generated content altogether, platforms should moderate it according to the same principles applied to human-generated content.

Social media platforms (such as Facebook, Instagram, TikTok, YouTube, and Bluesky) have elaborate policies governing what people are allowed to share. For example, platforms routinely prohibit threats of violence, incitements to violence, hate speech, bullying and harassment, sexual content, and much else. Platforms also have policies concerning what is and is not amplified to users – i.e., what gets recommended at the top of users’ feeds, or appears prominently in search results. For example, a platform may allow users to post certain forms of misinformation, but deamplify it to reduce its visibility (also known as demotion).

Content moderation policies are broadly designed to mitigate the risks of potentially harmful speech (including physical or psychological damage to particular individuals or groups, or damage to our physical or social environment). They are also influenced, of course, by business interests. These can overlap: removing child sexual abuse material or overt racism both prevents harm and reassures the advertisers from whom platforms make their money.

Platforms also seek to amplify engaging content – showing users more of the content that keeps their attention (though this practice is sometimes called content curation rather than moderation, since it isn’t about enforcing rules).

Additionally, platforms face a set of legal compliance issues, including demands to comply with local speech laws and ad hoc takedown requests from governments.

Given the immense size of some social media platforms, actively enforcing content rules can be an enormous task. Meta, which owns Facebook, Instagram, and Threads, reported removing millions of posts per day in December 2024, yet this still constituted less than 1% of the total content posted. Platforms would face serious challenges were they to rely on humans alone to accomplish this task. It is clearly impossible for human reviewers to examine every post that appears on such platforms to check whether it needs to be removed or demoted. And while platforms also invite users to flag and report potentially violating content, there is too much reported content for human teams to review in a timely manner. If a death threat stays up for days before it is flagged, assessed, and finally taken down, the damage is likely already done. Further, it is increasingly recognised that viewing huge volumes of violent, abusive, or otherwise disturbing material is liable to cause serious stress and trauma to human reviewers.

Platforms can and do use AI to cope with these challenges, since it can be deployed at scale, speedily, and without harming worker health. AI classifiers are algorithms that learn from training data to evaluate novel inputs – deciding, for example, whether a user’s new post counts as hate speech or incitement. These classifiers are deployed both to assess posts reported by users and to proactively search for violating content. Posts can then be automatically removed, demoted, or escalated to human teams for further review – typically when the classifier is less confident that the content violates platform rules. Finally, AI can be used to double-check human judgements, as with Meta’s use of large language models (LLMs) to provide ‘second opinions’ on enforcement decisions.
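As a rough illustration of this routing logic (not any platform’s actual pipeline), the following minimal sketch assumes a hypothetical classifier that returns the probability that a post violates platform rules; the thresholds, action labels, and placeholder classifier are all invented for the example:

```python
# Purely illustrative sketch of confidence-based routing in AI-assisted
# moderation. The thresholds and the placeholder classifier are invented;
# a real system would call a trained model and apply platform-specific rules.
from dataclasses import dataclass

REMOVE_THRESHOLD = 0.95  # high confidence that the post violates the rules
REVIEW_THRESHOLD = 0.60  # uncertain cases are escalated to human reviewers

@dataclass
class Post:
    post_id: str
    text: str

def violation_probability(post: Post) -> float:
    """Stand-in for a trained classifier estimating P(post violates policy)."""
    # Placeholder heuristic for demonstration only.
    return 0.99 if "kill you" in post.text.lower() else 0.05

def route(post: Post) -> str:
    score = violation_probability(post)
    if score >= REMOVE_THRESHOLD:
        return "remove"        # enforced automatically
    if score >= REVIEW_THRESHOLD:
        return "human_review"  # lower confidence: escalate for a second look
    return "allow"

print(route(Post("1", "I will kill you")))         # -> remove
print(route(Post("2", "Great game last night!")))  # -> allow
```

In practice, of course, the thresholds, the classifier, and the available actions (removal, demotion, labelling) vary by platform and by policy area.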

While there are certainly opportunities for improving content moderation through AI, the technology also carries risks. As with any rule enforcement, there will inevitably be errors – both ‘false negatives’ (i.e. violating posts that are missed) and ‘false positives’ (i.e. non-violating posts that are accidentally removed or demoted). One reason to suppose that AIs are more prone than humans to make classification errors is that interpreting speech requires sensitivity to ever-changing contexts and conventions. For example, the ability to distinguish between a legitimate political slogan and an illegitimate threat might depend on sophisticated, up-to-date linguistic, social, and cultural awareness. The extent to which AIs can replicate such awareness remains to be seen. Moreover, the opacity of advanced AI systems – the fact that their inner workings are often a “black box” inscrutable even to the engineers who design them – raises additional questions about whether AI-based moderation is sufficiently transparent to be used in moderating users’ speech.
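To make the trade-off between these two error types concrete, the sketch below uses invented classifier scores and ground-truth labels to show how lowering an automatic-removal threshold reduces false negatives at the price of more false positives:

```python
# Purely illustrative: how the choice of removal threshold trades false
# positives against false negatives. All scores and labels are invented.
posts = [  # (classifier_score, actually_violating)
    (0.97, True), (0.92, True), (0.85, False), (0.70, True),
    (0.40, False), (0.30, True), (0.10, False), (0.05, False),
]

for threshold in (0.9, 0.6, 0.3):
    removed = [violating for score, violating in posts if score >= threshold]
    kept = [violating for score, violating in posts if score < threshold]
    false_positives = sum(1 for v in removed if not v)  # benign posts removed
    false_negatives = sum(1 for v in kept if v)         # violating posts missed
    print(f"threshold={threshold}: "
          f"{false_positives} false positives, {false_negatives} false negatives")

# Lowering the threshold catches more violations (fewer false negatives)
# but wrongly removes more benign posts (more false positives).
```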

One of the most pressing challenges in content moderation concerns how to moderate content that has itself been generated by AI tools (whether the moderation is carried out by humans or by other AIs). Platforms are now flooded with text and audiovisual content produced by generative AI technologies – including diffusion models like Midjourney and LLMs like GPT-5. These constitute a new threat to platform integrity, insofar as they enable fast and cheap production of potentially harmful content.

Consider the problem of deepfakes: false but highly realistic portrayals of individuals, depicted doing things they may never have done. We have already seen these deployed to spread falsehoods (as when fake images of the pope wearing a puffer jacket went viral); to abuse and harass particular individuals (perhaps most egregiously via the circulation of non-consensual intimate deepfakes); and even to subvert electoral processes (as when fake audio of a politician confessing to a crime was released just ahead of the 2023 Slovakian election). While similar effects might be achieved using older technologies, AI massively facilitates the production and spread of potentially harmful misinformation – and in this way exacerbates the problem of enforcing platform rules effectively.

One possible response would be to ban AI-generated content altogether. This is currently infeasible, since it is very difficult to tell whether a piece of content is AI-generated. The difficulty is partly practical – AI-generated content is often perceptually indistinguishable from other content, and there may be no technical means of tracing its provenance (despite attempts to introduce watermarking and labelling regimes). There is also a theoretical issue concerning how AI-generated content should be defined: do re-touched or filtered images count? What about text that has been edited with the aid of an LLM?
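As a purely hypothetical illustration of why labelling regimes are fragile: a check for embedded provenance metadata (the field name below is invented, not taken from any real standard) can only report what it finds, and the absence of a label is uninformative, since metadata may have been stripped or never attached in the first place:

```python
# Hypothetical sketch: checking a piece of content for embedded provenance
# metadata. The "generator" field is invented for the example; absence of a
# label proves nothing, since metadata can be stripped or never attached.
from typing import Optional

def provenance_verdict(metadata: Optional[dict]) -> str:
    if metadata and metadata.get("generator"):
        return f"labelled as generated by {metadata['generator']}"
    # Could be human-made, or AI-made with the label stripped.
    return "unknown provenance"

print(provenance_verdict({"generator": "some-image-model"}))  # labelled as ...
print(provenance_verdict(None))                               # unknown provenance
print(provenance_verdict({}))                                 # unknown provenance
```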

In addition to these concerns about feasibility, there is a worry that banning all AI-generated content goes too far – throwing the baby out with the bathwater. It would seem that some AI-generated content has genuine value. By enabling the production and sharing of compelling or beautiful content, generative AI can help us express ourselves and understand other people and possibilities. Prohibiting such content from social media platforms could needlessly inhibit our public discourse (though some worry that platforms saturated in AI-generated content will devolve into artificially sculpted info-tainment spheres).

If AI-generated content is not banned, how exactly should it be moderated? Some platforms have sui generis policies to deal with the threat of generative AI. For example, Meta for several years explicitly disallowed videos or images depicting false events that included the likenesses of actual individuals.

It is not clear, though, why AI-generated content should require its own set of content rules. Insofar as the danger of realistic portrayals of false events lies in their propensity to mislead – or, where real individuals’ likenesses are used, to harass – such activities can be covered by existing rules. For example, egregious forms of misleading content (about physically harmful activities, say, or electoral processes) are (or should be) addressed in platform policies. The same goes for the sharing of intimate, or otherwise demeaning, deepfakes. A reasonable approach, then, may be simply to apply existing content moderation principles to AI-generated content in cases where it is harmful. This approach has the added advantage of covering the full range of potentially harmful AI-generated content, including forms that platforms may not have explicitly identified in their sui generis policies (for example, non-photo-realistic propagandistic images, or false or pejorative outputs of LLMs).

This entry has explored AI’s important implications for content moderation, first as a technology for enforcing rules against harmful content, and second as a technology for generating harmful content. There are, to be sure, other uses of AI in social media. For example, platforms use AI to make data-driven predictions about what sort of content users will find engaging – keeping them on the platform for longer, and so boosting advertising revenue. This use of AI raises a wide range of concerns about manipulation, privacy, and other problems intrinsic to (what is sometimes called) surveillance capitalism. As new regulations like the EU Digital Services Act and the UK Online Safety Act come into force, what to do about these many challenges is swiftly becoming one of the most contentious topics in philosophy and public policy.