
Digital Innovations from FTI Technology: Testing the Intersection of E-Discovery and ChatGPT (Part 2)

As discussed in part one of this series, when testing ChatGPT for document summarisation, we found that the way we structured and asked questions in ChatGPT had a significant influence on the responses we received. Hence the deep dive into role-based prompt engineering: the process of designing and fine-tuning prompts to help a language model provide a more appropriate response and thereby improve its performance.

Taking this one step further in document summarisation testing, we found it helpful to use the ChatGPT API to adjust prompts according to three types of messages:

  • System message, which allows the user to provide instructions such as, “act as an investigator and summarise the document in 150 words.” This instruction helps the model understand the boundaries and style of the conversation.
  • User message and assistant message, which can be paired. The user message contains a prompt from the user, and the assistant message contains an expected response from the assistant (ChatGPT). Several of these pairs can be supplied to temporarily guide the model toward a particular conversation style and domain-specific terminology. A sketch of this message structure appears below.

Even with a variety of prompt options, one roadblock to using ChatGPT for document summarisation is the limit on prompt length. These limits can make it challenging to submit a highly detailed prompt along with the documents to be summarised. Limits are measured in tokens rather than characters, and on average one token is equivalent to about four characters. OpenAI’s latest GPT-3.5 model (gpt-3.5-turbo) has a maximum limit of 4,096 tokens. This is roughly 16,384 characters, which for perspective is about the length of two articles like this one. OpenAI’s GPT-4 models have a maximum limit of 8,192 or 32,768 tokens, depending on the model used (approximately the length of four or 16 articles, respectively).

In our testing, we asked ChatGPT to answer questions based on a set of documents we provided, but the amount of text in these documents far exceeded the token limit of the GPT-3.5 model. Further examination uncovered the following potential (albeit flawed) workarounds:

  • Ask ChatGPT to summarise each document to a specified number of words and use those summaries as context. With a larger set of documents, this essentially results in summaries, and then summaries of summaries. This method is lossy (“a class of data compression methods that uses inexact approximations and partial data discarding to represent the content…used to reduce data size for storing, handling and transmitting content”), and will therefore miss potentially important details within the documents. The longer the conversation runs, or the larger the document set, the more is potentially lost. The first sketch after this list illustrates the approach.
  • Ask ChatGPT to help with creating embeddings of the documents and then store them in a vector database. With this approach, when a question is asked, it is converted to embeddings and used to search the vector database with semantic search techniques. The search results and the question can then be fed into ChatGPT as a prompt. This is the more technical workaround for the length limitation, as it avoids losing substantial detail from within the documents. The second sketch after this list illustrates the approach.

Additional Considerations

Beyond the issues of prompt engineering and token limitations, our testing surfaced additional potential challenges and limitations in the use of ChatGPT for document summarisation and other e-discovery activities. These included:

  • Custom Models. At the time of testing, ChatGPT did not have existing APIs to allow users to add their own data to the model. There are, however, fine-tuning APIs for the GPT-3 models (not GPT-3.5 or GPT-4); fine-tuning is the process of further training an existing model to generate higher-quality results. Using the fine-tuning APIs, teams can potentially teach the model to understand and generate domain-specific language and improve accuracy (the first sketch after this list outlines the flow). However, to become reliable in an e-discovery setting, custom models would be needed.
  • Other deployment options. Microsoft Azure recently announced that its hosted OpenAI models will allow users to create chatbots that incorporate unique, proprietary data via the cognitive search function. This is an emerging feature and an area we are investigating further as it’s implemented and evolves.
  • Hallucinations. AI hallucination has been widely reported in the news. This is when an AI interface provides a confident response that seems plausible but is factually incorrect. Due to the confident tone and the use of natural language in responses, users may perceive these responses as correct and reliable, or even as a signal that the tool possesses a highly developed intelligence.

    Causes for hallucinations vary, but insufficient training data and biased training data are two of the major ones. Prompting ChatGPT to only consider given data sets can minimise hallucinations. Again, without rigorous quality control and an understanding of the underlying technology, hallucinations may cause (and already have caused) significant problems in legal and many other settings.

  • Rate Limits. Like other cloud service APIs, the ChatGPT API limits the number of requests a user can make within a specified period of time. These rate limits are set to prevent flooding attacks such as DDoS, ensure fair availability for all users and manage infrastructure load. ChatGPT enforces rate limits in two ways: RPM (requests per minute) and TPM (tokens per minute), with different levels allowed per model. Users looking to summarise large groups of documents could easily hit a rate limit by exceeding either the RPM or TPM threshold, whichever occurs first. This can be managed with techniques common among software developers, such as retrying with exponential backoff (the second sketch after this list shows one such approach).

Generative AI is raising countless questions across industries. Our Digital Insights & Risk Management experts and our data innovation lab continue to watch emerging issues and run tests on disruptive technologies. Stay tuned for future instalments of this series, in which we’ll share research into generative AI data protection issues, in-depth testing of the Azure OpenAI hosted models, security capabilities and more.

The views expressed herein are those of the author(s) and not necessarily the views of FTI Consulting, its management, its subsidiaries, its affiliates, or its other professionals.