Comparing GPT-3.5 & GPT-4: A Thought Framework on When To Use Each Model
Intro
Since the spring of 2022, there has been an explosion of Large Language Models (LLMs) onto the market. Companies like OpenAI, Microsoft, Anthropic, Meta, and AI21 Labs have released several iterations of their proprietary LLMs, sparking a technological paradigm shift. As with all significant advances in technology, there is ample room for analysis paralysis to creep in when deciding how and when to leverage this innovative technology. This effect is compounded by the array of different LLMs available today. While I won't speak to every LLM on the market, I will shed some light on the popular suite of GPT models available on Azure OpenAI. The aim of this blog is to demystify the use and capabilities of the different GPT models, and by the end you'll have a decision framework to help answer the question "when should I use which model?".
To quickly level set, generative pre-trained transformers (GPTs) are machine learning models designed for natural language processing. They are trained on vast amounts of data, such as books and web pages, to produce language that is contextually relevant and semantically coherent. In other words, GPTs can generate text that resembles human writing without being explicitly programmed to do so. This makes them highly versatile and adaptable for natural language processing tasks, including answering questions, translating languages, and summarizing text. The most capable members of the GPT family are GPT-3.5 and GPT-4. Even within these model releases, there are several versions with subtle yet important differences. While these models can be used for similar natural language tasks, each has its distinct strengths and weaknesses. To aid your decision making, I'll compare these models along the following factors:
- Context windows
- Training dataset cutoff
- Cost
- Capabilities
- Fine-tunability
- Latency
Let's define and examine these one by one.
Context windows
Context windows refer to the number of tokens a model will accept as input. This input includes both your system prompt and the user prompt. The context window plays a significant role in the operability of your AI application and can be a determining factor in your overall application design. For example, imagine you are tasked with building an application that leverages an LLM for summarization. You want your end users to be able to summarize text that is quite lengthy, so the model you choose needs to be able to "read and remember" multiple pages of text. The larger the context window, the more text you can fit into your prompts. At the time of this blog, GPT-3.5 (1106) has an input context window of 16k tokens and an output window of 4k, while GPT-4 (1106 and 0125) has a context window of up to 128k for input and 4k for output. That is the difference between reading and remembering roughly 16 pages versus 200+ pages of text. Techniques such as applying a chunking strategy allow you to process larger amounts of text through a smaller context window. However, while this will achieve your goal of processing more text, it also adds complexity to the application, with more components to manage and additional engineering challenges to solve.
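To make the chunking idea concrete, here is a minimal sketch in Python using the tiktoken tokenizer. The file name and token budget are illustrative, and the downstream summarize-each-chunk step is left to your application.

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 3000, model: str = "gpt-3.5-turbo") -> list[str]:
    """Split text into pieces that each fit within a smaller context window."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    # Slice the token list and decode each slice back into text.
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

# Illustrative usage: summarize each chunk, then summarize the combined summaries
# (a simple map-reduce summarization pattern).
chunks = chunk_by_tokens(open("report.txt").read(), max_tokens=3000)
print(f"{len(chunks)} chunks to summarize")
```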
Training dataset cutoff
Training dataset cutoff refers to the date on which the model stopped "learning". Given that these LLMs are just large machine learning models, most principles of machine learning still apply to them. One concrete principle is the need to aggregate and engineer the training dataset used for the model. Though some of the data used to train these models is historical, much of it relates to current events (such as who the current world leaders are, or the death of a celebrity). Because these models aren't trained incrementally or in real time, there is typically a set cutoff point beyond which the model has no further knowledge (i.e., data). Typically, newer model versions have a more up-to-date training dataset. For example, GPT-3.5 is trained with data up until September 2021, while some versions of GPT-4 are trained with data as recent as December 2023. While this may seem like a significant limitation, techniques such as Retrieval Augmented Generation (RAG) can work around the cutoff date by providing the model with up-to-date information at query time. This begs the question: "If I could just use RAG, why do I care about the training data cutoff?" RAG typically requires adding more tokens to your prompt, which can result in higher latency and higher costs. Having a model pretrained on more recent and relevant data can reduce the amount of information you need to include in your RAG implementation.
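As a rough illustration of why RAG adds prompt tokens, here is a minimal sketch of assembling a grounded prompt. The retrieval step itself (for example, a search index query) is assumed to happen elsewhere, and the passage shown is a placeholder.

```python
def build_rag_prompt(question: str, retrieved_passages: list[str]) -> list[dict]:
    """Ground the model in retrieved, up-to-date passages instead of relying on its training cutoff."""
    context = "\n\n".join(retrieved_passages)
    return [
        {"role": "system", "content": "Answer using only the provided context. "
                                      "If the answer is not in the context, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Every retrieved passage becomes additional prompt tokens, which is why a more recent
# training cutoff can let you retrieve (and pay for) less.
messages = build_rag_prompt(
    "Who is the current CEO of Contoso?",
    retrieved_passages=["<passage returned by your search index>"],
)
```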
Cost
Cost is typically one of the largest factors when designing your AI application. Cost consideration for working with LLMs usually comes down to token measurement. Tokens are whole words or pieces of words used by the model to interpret natural language. For GPT-3.5 and GPT-4, 1 token represents approximately 4 characters. That said, not all tokens are created equal: there are prompt tokens and completion tokens. Prompt tokens are the tokens you pass to the LLM; this includes your system message, your RAG context, and your user prompt. Prompt tokens are typically what contribute most to reaching context window limits when working with LLMs. Conversely, completion tokens are the tokens the LLM generates. These count against your token limits as well, but can easily be capped by setting the "max_tokens" parameter. Token pricing tends to vary, so I'd recommend viewing the Azure OpenAI Service pricing page for the most up-to-date prices. As of this writing, GPT-3.5 prices are significantly lower than GPT-4 prices.
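Here is a minimal sketch of estimating cost from token counts. The per-1K-token prices below are placeholders only (use the Azure OpenAI Service pricing page for real numbers), and the expected completion length is an assumption you supply.

```python
import tiktoken

# Placeholder per-1K-token prices: check the Azure OpenAI Service pricing page for current rates.
PRICES = {
    "gpt-35-turbo": {"prompt": 0.0005, "completion": 0.0015},
    "gpt-4-turbo":  {"prompt": 0.0100, "completion": 0.0300},
}

def estimate_cost(prompt: str, expected_completion_tokens: int, model_key: str) -> float:
    """Count prompt tokens, assume a completion length, and apply per-1K-token prices."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the GPT-3.5/GPT-4 chat models
    prompt_tokens = len(enc.encode(prompt))
    prices = PRICES[model_key]
    return (prompt_tokens / 1000) * prices["prompt"] + (expected_completion_tokens / 1000) * prices["completion"]

print(estimate_cost("Summarize the attached quarterly report ...", 500, "gpt-35-turbo"))
print(estimate_cost("Summarize the attached quarterly report ...", 500, "gpt-4-turbo"))
```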
Capabilities
When comparing GPT models, understanding their capabilities is key. A good way to think about this is that there are both incremental and exponential improvements in the GPT family of models. GPT-3.5 has four versions (0301, 0613, 1106, and 0125), and GPT-4 has a comparable set (0314, 0613, 1106, and 0125). These versions represent incremental improvements such as lower latency, support for function calling, and minor bug fixes. For example, GPT-3.5 (0613) added function calling and is roughly 60% faster than 0301. GPT-3.5 (1106) improves upon 0613 by supporting parallel function calling and compatibility with the Assistants API, while 0125 offers bug fixes for response formatting and text-encoding issues. The jump from GPT-3.5 to GPT-4, by contrast, is an exponential improvement. GPT-3.5 has roughly 175 billion parameters, while GPT-4 is rumored to have close to 1 trillion. This larger scale allows the GPT-4 versions to exhibit more advanced reasoning and formatting capabilities than their GPT-3.5 counterparts. For this reason, it is generally recommended to use GPT-4 in situations where more "complex" reasoning capabilities are required. Defining a use case as "complex" can be subjective, but it typically includes uses like multi-agent systems, combined image and text analysis, and classification workloads. Prompt engineering has been shown to bring GPT-3.5's capabilities on par with GPT-4 for certain tasks. While prompt engineering can be a viable approach, it typically relies on techniques such as chain-of-thought (CoT) prompting and few-shot learning, which consume more tokens.
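To show that token trade-off, here is a minimal sketch of a few-shot, chain-of-thought style prompt that can pull GPT-3.5 closer to GPT-4 on a narrow task. The ticket-classification task and labels are purely illustrative.

```python
# Each worked example nudges GPT-3.5 toward the desired reasoning and format,
# but every example also adds prompt tokens (and therefore cost and latency).
messages = [
    {"role": "system", "content": "Classify the support ticket as 'billing', 'technical', or 'other'. "
                                  "Think step by step, then give the label on the last line."},
    {"role": "user", "content": "Ticket: I was charged twice this month."},
    {"role": "assistant", "content": "The ticket describes a duplicate charge, which is a payment issue.\nbilling"},
    {"role": "user", "content": "Ticket: The app crashes every time I upload a file."},
    {"role": "assistant", "content": "The ticket describes a malfunction in the application itself.\ntechnical"},
    {"role": "user", "content": "Ticket: Can I get an invoice for my last order?"},
]
```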
Fine-tunability
While most use cases and features in a GenAI application can be addressed through a combination of RAG and prompt engineering, there are instances where fine-tuning an LLM is the best solution. Fine-tuning is the process of tailoring the model to your specific data. It works similarly to the prompt-engineering concept of "few-shot learning" in that you provide many examples of what you'd like your model to know or how you'd like it to respond. These examples are then used to alter the model weights to better tailor the model to an industry, company, persona, etc. As the name suggests, with few-shot learning you can get away with providing a few examples (2-5), but fine-tuning typically requires thousands of examples to make a meaningful impact on the model. When fine-tuning an LLM you also run the risk of making the model worse rather than better, due to an observed phenomenon known as catastrophic forgetting. When considering fine-tuning, I would recommend first experimenting with a small language model (SLM) such as phi-2, because it is significantly easier to meaningfully shift an SLM's weights, and therefore its behavior, with a smaller dataset (typically hundreds of records). That said, Azure OpenAI supports fine-tuning for GPT-3.5, either through the Azure AI Studio or the Python SDK.
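For a sense of what fine-tuning GPT-3.5 on Azure OpenAI involves, here is a minimal sketch using the openai Python SDK (v1+). The endpoint, API version, file name, and model name are assumptions; check the Azure OpenAI fine-tuning documentation for the values supported in your region.

```python
# train.jsonl: one JSON chat example per line, typically thousands of lines, e.g.
# {"messages": [{"role": "system", "content": "You are a contract-review assistant."},
#               {"role": "user", "content": "Summarize the termination clause."},
#               {"role": "assistant", "content": "Either party may terminate with 30 days' written notice."}]}

from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://<your-resource>.openai.azure.com",
                     api_key="<your-key>", api_version="2024-02-01")  # API version is an assumption

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-35-turbo-0613")
print(job.id, job.status)
```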
Latency
Latency, in the context of LLMs, is the time delay between a prompt submission and a returned response. When measuring latency, there are two metrics to pay attention to: time to first token (TTFT) and end-to-end latency (E2E). As the names imply, these refer to the time it takes the model to generate the first token and the time it takes for the full response to be generated and returned, respectively. When comparing GPT-3.5 and GPT-4, a general observation is that all versions of GPT-3.5 are typically faster than their GPT-4 counterparts. This is likely due to the sheer size of GPT-4 compared to GPT-3.5. Rumored to have roughly 1 trillion parameters, GPT-4 is significantly more capable than GPT-3.5, but at the cost of being more computationally intensive. A smaller model like GPT-3.5 will be cheaper and faster. If your workload requires both complex reasoning and low latency, there are techniques you can implement, such as prompt compression and semantic caching, that make it more practical to leverage GPT-4.
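One way to see these numbers for yourself is to stream responses and time them. Below is a minimal sketch using the openai Python SDK (v1+) against Azure OpenAI; the endpoint, API version, and deployment names are placeholders for your own.

```python
import time
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://<your-resource>.openai.azure.com",
                     api_key="<your-key>", api_version="2024-02-01")  # API version is an assumption

def measure_latency(deployment: str, prompt: str):
    """Stream a completion and record time to first token (TTFT) and end-to-end latency (E2E)."""
    start = time.perf_counter()
    ttft = None
    stream = client.chat.completions.create(
        model=deployment,  # on Azure, this is your deployment name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start  # first generated token arrived
    return ttft, time.perf_counter() - start

print(measure_latency("gpt-35-turbo", "Explain semantic caching in two sentences."))
print(measure_latency("gpt-4", "Explain semantic caching in two sentences."))
```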
Decision Tree
Below is a decision framework that can help focus your thinking. While not a comprehensive guide, it is a great starting point to lead into conversations about technical and business requirements. As mentioned above, "complex" can be subjective, but here it includes use cases related to multi-agent systems, image and textual analysis, and classification.
Conclusion
The choice between GPT-3.5 and GPT-4 should be made with a clear understanding of your specific needs and constraints. GPT-3.5 offers cost-effectiveness and faster response times, making it suitable for applications where these factors are critical. Meanwhile, GPT-4's enhanced reasoning capabilities and broader context window make it the superior choice for complex tasks that require deeper understanding and elaborate content generation. It is worth mentioning that you can also take a “best-of-both-worlds” approach using an LLM orchestration framework to call GPT-3.5 as default and flex to GPT-4 when necessary. Ultimately, the decision should factor in not just the technical aspects, such as context window size, training dataset recency, and latency, but also the practical considerations of cost, ease of fine-tuning, and the specific demands of the task at hand. By carefully weighing these aspects, you can harness the full potential of the GPT family of LLMs to drive innovation and efficiency in your applications. Remember that the LLM landscape is rapidly evolving, and staying informed about the latest developments will enable you to make the most out of these powerful tools.
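As a closing illustration of the "best-of-both-worlds" approach, here is a minimal sketch of routing between deployments. The deployment names and the routing signal are placeholders; in practice the decision might come from an orchestration framework, a lightweight classifier, or a confidence check on the GPT-3.5 answer.

```python
def route_completion(client, messages, needs_complex_reasoning: bool):
    """Default to GPT-3.5 for speed and cost; escalate to GPT-4 only when the task demands it."""
    deployment = "gpt-4" if needs_complex_reasoning else "gpt-35-turbo"  # your Azure deployment names
    return client.chat.completions.create(model=deployment, messages=messages)
```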