Blog  |  September 29, 2025

Taming Modern Data Challenges: Generative AI Created Content

In our last post, we discussed the emergence of emojis as discoverable evidence: how they have evolved, why they have become important in discovery, the challenges associated with collecting and producing them, and best practices for addressing those challenges.

Perhaps the newest modern data challenge is the explosion of generative AI (GenAI) content in organizations. Most discussion of GenAI in relation to eDiscovery focuses on how it can streamline review and other discovery workflows; far less attention is paid to the growing volume of AI-created content within organizations generated through tools like ChatGPT, Gemini, Claude, and Microsoft Copilot. In this post, we will discuss examples of AI-generated content, when that content could be relevant in discovery, and the challenges, considerations, and best practices associated with discovering it.

The Many Faces and Forms of AI-Generated Content

AI can generate a wide spectrum of electronically stored information (ESI). Examples include conversation logs with chatbots, first drafts of contracts or emails, code snippets created by developer tools, and marketing copy produced at scale. Some of this content may be produced within your organization through approved platforms; for example, Microsoft Copilot may have been integrated into your organization’s M365 environment. However, AI-generated content can also come from shadow IT usage, where employees turn to free or consumer versions of ChatGPT or other chatbots without corporate oversight. According to one study, 85% of IT decision-makers say employees are adopting AI tools before IT can even assess them.

From an eDiscovery perspective, the use of unapproved AI is hugely important. Enterprise platforms may provide logging, user IDs, and metadata that support defensible collection. Shadow IT tools, by contrast, often leave little to no record of what was created, when, or by whom. This can complicate discovery from these tools considerably.

When AI-Generated Content Becomes Relevant

There are several scenarios where generative AI output could be squarely at issue in litigation or investigations. Here are a few examples:

  • Employment disputes: could be impacted by whether performance reviews or internal memos were AI-drafted.
  • Intellectual property cases: can involve claims that publications were infringed, that AI-generated code or text infringes third-party rights, or that confidential information was improperly disclosed to public systems.
  • Contract disputes: could turn on whether key language was authored by a person or a machine.
  • Fraud and misrepresentation claims: disputes could arise if AI-generated reports, statements, or even images were represented as human-authored.

In these contexts, both the prompt (input) and the AI response (output) can be relevant evidence. And the volume is potentially huge. In the copyright infringement litigation between OpenAI and The New York Times (NYT), OpenAI’s expert suggested producing a sample of 20 million logs to determine how frequently ChatGPT was being used to regurgitate articles and circumvent news sites’ paywalls; in response, NYT requested 120 million chat logs!

Discovery Challenges for AI-Generated Content

Challenges associated with discovery of AI-generated content go beyond just the “nuts and bolts” of preserving, collecting, reviewing, and producing the ESI – this content also raises complex questions around authorship, provenance, and authenticity.

  • Authorship and attribution: Who is the true “author” of AI-generated material: the human who prompted it, the organization, or the AI vendor? Courts may need both the input and the output to evaluate responsibility.
  • Metadata and provenance: Many AI platforms do not preserve metadata by default. While Microsoft Copilot keeps user-linked interaction logs, consumer tools may not. Without proper provenance, it is difficult to establish whether content was AI-created, user-edited, or copied from elsewhere.
  • Validation and authenticity: Federal Rule of Evidence 901 requires authentication of evidence, but proving the authenticity of AI-generated drafts or chatbot conversations is a developing challenge.
  • Retention and legal hold: Not all AI interactions are automatically stored. Unless organizations proactively enable audit logs or capture outputs, potentially relevant content may be lost. This raises spoliation risks if AI content should have been preserved under a litigation hold.
  • Privilege and confidentiality: AI prompts may contain sensitive or privileged material; for example, legal strategy questions posed to a chatbot. If those questions are entered into a public platform like ChatGPT or Claude, privilege could be waived, and those prompts may be subject to discovery requests.

These are considerations that organizations should address proactively – before a case involving AI-generated content arises – rather than being caught “flat-footed” when a case is filed.

Considerations for Collecting Generative AI Content

Collecting generative AI content presents its own set of technical and strategic challenges. Unlike email or traditional document repositories, AI systems are dynamic environments where drafts, prompts, and outputs may not be stored in a stable location. For eDiscovery purposes, collection strategies should consider:

  • Enterprise logging and compliance tools: In Microsoft 365 environments, Copilot interactions can often be captured through Purview audit logs or related compliance solutions. These may include records of user prompts, timestamps, and AI-generated outputs. Exporting these logs in structured formats ensures they can be ingested into eDiscovery platforms alongside traditional documents and communications; the first sketch after this list illustrates what that flattening can look like.
  • Conversation histories and exports: In environments where AI chats are retained (e.g., ChatGPT Enterprise, Slack or Teams integrations), organizations should preserve both the user input and the AI response, ideally with session metadata (user ID, date/time, context). By contrast, consumer versions of some AI tools do not provide export APIs, which means collection may require endpoint forensics, screenshots, or custodian self-production – methods that are less scalable and less defensible.
  • Format and normalization: Once collected, AI-created material must be converted into reviewable formats. Prompts and responses may be best represented in chat-style transcripts or structured JavaScript Object Notation (JSON) or Comma Separated Values (CSV) exports before loading into an eDiscovery platform. Proper normalization ensures reviewers can see the full context of the interaction and not just a snippet of output; the second sketch after this list shows one way to normalize a chat export.
  • Search and review considerations: Because prompts often contain sensitive or privileged content, legal teams must be careful in applying keyword searches or TAR/AI review workflows. Clear segregation of prompts, outputs, and human-authored revisions may be necessary to manage review efficiently.
  • Shadow IT scenarios: Where employees used unapproved AI tools, collection is often manual. Interviews, endpoint collections, and review of browser history may be required to identify responsive AI-generated content. Counsel must weigh proportionality arguments carefully in these cases.

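To make “exporting these logs in structured formats” concrete, here is a minimal Python sketch (the first sketch referenced above) that flattens a Purview-style unified audit log CSV export into one row per interaction for ingestion. The column names (CreationDate, UserIds, Operations, AuditData) reflect a typical unified audit log export but should be verified against your own environment, and the file names are hypothetical; treat this as an illustration under those assumptions, not a turnkey collection tool.

    import csv
    import json

    # Minimal sketch: flatten a Purview-style unified audit log CSV export
    # into one row per interaction for eDiscovery ingestion. Column names
    # (CreationDate, UserIds, Operations, AuditData) are assumptions based
    # on a typical export; verify them against your own environment.
    def flatten_audit_export(in_path, out_path):
        with open(in_path, newline="", encoding="utf-8") as src, \
             open(out_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            writer.writerow(["timestamp", "user", "operation", "detail_json"])
            for row in csv.DictReader(src):
                # AuditData holds a JSON blob with the interaction details;
                # re-serialize it intact so no context is lost downstream.
                detail = json.loads(row.get("AuditData") or "{}")
                writer.writerow([
                    row.get("CreationDate", ""),
                    row.get("UserIds", ""),
                    row.get("Operations", ""),
                    json.dumps(detail, ensure_ascii=False),
                ])

    # Hypothetical file names for illustration only.
    flatten_audit_export("purview_export.csv", "copilot_interactions.csv")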
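And for the normalization step, here is a second minimal sketch that converts a chat session export into a flat CSV load file. The JSON structure it assumes (session_id, user_id, and a messages list of role/timestamp/content objects) is hypothetical; real exports from ChatGPT Enterprise or Teams/Slack integrations will differ, so the field mapping would need to be adapted.

    import csv
    import json

    # Minimal sketch, assuming a hypothetical export format: a JSON file
    # with "session_id", "user_id", and a "messages" list of objects with
    # "role", "timestamp", and "content". Real exports (ChatGPT Enterprise,
    # Teams, Slack) differ, so adapt the field mapping accordingly.
    def chat_export_to_loadfile(json_path, csv_path):
        with open(json_path, encoding="utf-8") as f:
            session = json.load(f)
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            # One row per message keeps prompt/response pairs in context;
            # the "role" column lets reviewers segregate prompts ("user")
            # from AI outputs ("assistant") during review.
            writer.writerow(["session_id", "user_id", "timestamp", "role", "content"])
            for msg in session.get("messages", []):
                writer.writerow([
                    session.get("session_id", ""),
                    session.get("user_id", ""),
                    msg.get("timestamp", ""),
                    msg.get("role", ""),
                    msg.get("content", ""),
                ])

    # Hypothetical file names for illustration only.
    chat_export_to_loadfile("chat_session.json", "chat_session_loadfile.csv")

Keeping prompts and outputs in one table with a role column preserves the full context of each interaction while still allowing review teams to segregate prompts from outputs, which supports the privilege screening discussed above.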
Ultimately, the defensibility of collection depends on demonstrating a systematic process: identifying the repositories, preserving AI interaction data, documenting chain of custody, and ensuring reviewability in established platforms.

Five Best Practices for Discovery of AI-Generated Content

As discussed above, organizations can’t afford to treat generative AI content as an afterthought. With that in mind, here are five best practices for discovery of AI-generated content:

  • Company policies: Develop clear policies distinguishing between approved enterprise use of GenAI tools and unapproved shadow IT use.
  • Custodian interviews: Incorporate questions about AI content into custodian interviews to determine their use of chatbots and AI drafting tools – whether authorized or unauthorized.
  • Leverage enterprise solutions: Use compliance platforms such as Microsoft Purview or Google Vault to collect enterprise AI logs and outputs.
  • Defensible documentation: Document workflows to establish how AI content was generated, edited, and used in decision-making.
  • Education: Educate staff to recognize AI-created material and understand its discovery implications.

Conclusion

Generative AI is transforming the way organizations create information, which means it’s transforming discovery as well. Whether it’s a draft contract, a code snippet, or a chatbot conversation, AI-created content is potentially discoverable ESI. By updating policies, collection practices, and legal hold protocols to account for AI-generated content, organizations can better position themselves to navigate the discovery risks and opportunities it presents.

In our next post in the series, we will discuss how important an information governance program has become in preparing your legal team for modern data discovery challenges!

For more regarding Cimplifi data reduction & analytics capabilities, click here.
