Maximising Efficiency With LLMs: Leveraging ChatGPT-3.5 vs GPT-4

Engineering

April 2, 2024

Alex Morton

Software Engineer

Welcome to The Observatory, the community newsletter from Orbit.

Each week we go down rabbit holes so you don't have to. We share tactics, trends and valuable resources we've observed in the world of community building.
‍

Introduction

In our work at Orbit, we’ve spent the past year building Community Search: a system that processes communities’ online conversations and enables users to search through those conversations to find answers to their questions.

Community conversations are made up of any number of individual user messages.

Take a Discord thread, for example. There is one original message that was posted, and then there are any number of responses inside of the thread. The Discord thread is the conversation.

In addition to Discord, we also pull conversations from GitHub (Issues and Pull Requests), Twitter (Tweets and Replies), and Discourse (Forum posts and replies).

Now, consider a given online community. Community members can potentially send up to hundreds of thousands of messages, so that means there can be thousands of conversations to process.

Using LLMs

With the rise of Large Language Models (LLMs), software engineering teams are increasingly turning to tools like OpenAI’s ChatGPT (or Meta’s LLaMA, Google’s BERT, etc.) to manipulate data and enhance text-based processes.

With each new model iteration (e.g. ChatGPT-4 following ChatGPT-3.5), advancements in capabilities, performance, and accuracy push the boundaries of what LLMs can achieve. As a result, it can be tempting to choose the highest-powered model by default instead of exploring what’s possible with prior iterations.

When ChatGPT-4 was released, for example, our team was tempted to use it directly, as the latest and most powerful model of the ChatGPT armada. However, we couldn’t ignore the fact that the higher accuracy of GPT-4 came with a significant increase in cost and decrease in speed.

So at a certain point, we asked ourselves: Would it be possible to use the GPT-3.5 model instead of GPT-4 — allowing us to hit our data accuracy targets, maximise efficiency, and see a boost in performance all while minimising cost?

In this exploration, we’ll delve into how far we were able to push the capabilities of GPT-3.5, weighing its performance against the power of GPT-4.

While GPT-4 boasts superior capabilities compared to its predecessor, GPT-3.5, the distinction between the two goes beyond their on-paper specifications.

By understanding how to effectively prompt and use each model, we learned that we could optimise results while minimising costs. In other words, it all came down to prompt engineering.

While GPT-4 excels in handling complex scenarios and providing deeper insights, GPT-3.5 can still be leveraged effectively with the strategic usage and structuring of prompts.

Processing Community Conversations Using ChatGPT

For our Community Search tool, we’ve created a system that needs to process thousands of conversations from a particular product community. Each of those online conversations should be given specific properties that we define to be able to group those conversations down the line.

For example, we’d like to know whether a particular conversation starts as a question (is_question), whether it has been solved (is_solved), and, if it has, who solved it (solved_by). Each of these questions can be answered by attaching a predefined property to the conversation, and we use LLMs to help us do this.

So, when we provide the LLM with a list of 10 questions, one for each of the 10 properties, we use its answers as the corresponding property values to add to each conversation.

The LLM receives a conversation as input. We process the conversation by asking the LLM a series of questions about it, such as “Is there a question posed in the conversation?”, and the LLM generates a standardised JSON object based on its answers.
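As an illustration, here is a minimal sketch of how the three properties named above might be represented and checked once the LLM's JSON comes back. The class name, the function name, and the rule that solved_by is cleared for unsolved conversations are assumptions made for this example, not Orbit's actual schema, and the other seven properties are not named in this post.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationProperties:
    """Hypothetical container for the three properties named in the post."""
    is_question: bool
    is_solved: bool
    solved_by: Optional[str]  # None when nobody solved the conversation


def parse_properties(raw_json: str) -> ConversationProperties:
    """Parse the LLM's JSON answer and check the field types."""
    data = json.loads(raw_json)
    is_question = data["is_question"]
    is_solved = data["is_solved"]
    solved_by = data.get("solved_by")
    if not isinstance(is_question, bool) or not isinstance(is_solved, bool):
        raise ValueError("is_question and is_solved must be booleans")
    if solved_by is not None and not isinstance(solved_by, str):
        raise ValueError("solved_by must be a string or null")
    if not is_solved:
        # Assumption: a solver only makes sense for a solved conversation.
        solved_by = None
    return ConversationProperties(is_question, is_solved, solved_by)


example = parse_properties(
    '{"is_question": true, "is_solved": true, "solved_by": "user_42"}'
)
print(example)
```

Validating the types up front means a malformed answer from the model fails loudly instead of silently attaching a wrong property to a conversation.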

Comparing ChatGPT-3.5 to ChatGPT-4

With the use case above - prompting the LLM with a list of 10 questions to answer regarding each conversation - GPT-4 clearly outperforms GPT-3.5.

When we ran the experiment on each model, we saw that GPT-4 provided more accurate responses and reasoning out of the box, while we saw a lot of clear misses and inaccuracies from GPT-3.5.

Now, it’s at this point that we had a decision to make: 1) to go forward with GPT-4 or 2) to explore our options for reducing cost and increasing efficiency with GPT-3.5. We chose the second option.

When comparing GPT-3.5 to GPT-4, we found that the 1-to-1 question/answer format (i.e. prompt: 10 questions about a conversation, LLM response: 10 corresponding answers to those questions) doesn’t work as well with GPT-3.5 as it does with GPT-4. GPT-3.5 would often return false positives and obviously wrong answers to the questions.

Since it was clear that the results of the two models were so different when using the same prompting strategy, we decided to experiment with different ways of prompting GPT-3.5 to give results that approached the accuracy of GPT-4.

This is where our team discovered that rephrasing the prompt and breaking it into two parts for GPT-3.5, along with instructing it to explain the reasoning behind its answers, led to significant improvements in accuracy.

In the diagram below, the green LLM Response box indicates the final response we received after prompting it with our list of questions:

Using GPT-4, we could prompt it to answer all questions and return a JSON object to us with the information we needed in one request. With GPT-3.5, we achieved the same JSON object response by breaking up the initial prompt into two separate requests.

For GPT-3.5, we used Prompt #1 to send the questions, asking it to answer each one and explain its reasoning. Then, for Prompt #2, we took the answers it generated for Prompt #1 and instructed the LLM to return a JSON object with the information we needed:

Prompting GPT-3.5 by instructing it to explain its reasoning and breaking the original prompt into two parts.

This method allows GPT-3.5 to approach the accuracy of GPT-4 while remaining faster, lighter, and more cost-effective.

The only difference is that it required two API calls instead of one.
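The two-request flow can be sketched roughly as follows. This is an illustration under stated assumptions rather than our production code: the prompt wording, the function names, and the llm callable (any function that takes a prompt string and returns the model's text reply) are all hypothetical, and fake_llm is a stand-in so the sketch runs without calling a real API.

```python
import json
from typing import Callable, Dict, List

# Hypothetical type: sends one prompt to a model and returns its text reply.
LLM = Callable[[str], str]


def build_reasoning_prompt(conversation: str, questions: List[str]) -> str:
    """Prompt #1: free-text answers with reasoning, no formatting constraints."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Read the conversation below, then answer each question and "
        "explain your reasoning for each answer.\n\n"
        f"Conversation:\n{conversation}\n\nQuestions:\n{numbered}"
    )


def build_json_prompt(reasoned_answers: str, keys: List[str]) -> str:
    """Prompt #2: convert the reasoned answers into a JSON object."""
    return (
        "Convert the answers below into a single JSON object with exactly "
        f"these keys: {', '.join(keys)}. Return only the JSON.\n\n"
        f"Answers:\n{reasoned_answers}"
    )


def two_step_properties(
    conversation: str, questions: List[str], keys: List[str], llm: LLM
) -> Dict[str, object]:
    reasoned = llm(build_reasoning_prompt(conversation, questions))  # API call 1
    raw_json = llm(build_json_prompt(reasoned, keys))                # API call 2
    return json.loads(raw_json)


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call, with canned replies."""
    if prompt.startswith("Convert"):
        return '{"is_question": true, "is_solved": false, "solved_by": null}'
    return "1. Yes, the first message asks a question. 2. No reply resolves it."


result = two_step_properties(
    "alice: How do I reset my API token?",
    ["Is there a question posed in the conversation?", "Is it solved?"],
    ["is_question", "is_solved", "solved_by"],
    fake_llm,
)
print(result)
```

Keeping the model call behind a plain callable means the same flow can be tested offline and pointed at whichever model is being evaluated.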

Explanation

The key to this approach lies in understanding how the two models process information and respond to prompts.

While GPT-4 is powerful enough to answer the questions as they are asked (in the first request/response cycle), we need a different approach with GPT-3.5. By breaking down the prompt and requesting explanations, we make sure to prompt GPT-3.5 to justify its answers, leading to more accurate responses.

Regarding LLMs and text-completion, it’s important to understand that when you prompt an LLM, the response depends heavily on how the prompt is formatted.

This is because, once the request is sent (the user prompt), LLMs generate their response in real time, one token after the other. The LLM does not have the ability to go back during its response and change something it said one sentence or paragraph earlier (in the same response).

So, in our example, when our prompt involves 10 questions, the LLM goes down the list and responds to each question in order. But if question #2 depends on the answer to question #4, then we run into a problem. By the time the LLM answers question #4, it cannot go back and update the answer it gave for question #2 (at least, not in the same response).

The difference between GPT-3.5 and GPT-4 also becomes clear when asking the model to format its responses in a certain way (for example, as a JSON object).

Again, LLMs don’t plan their answers. They “think” while outputting the tokens. So when we directly ask for a formatted JSON object, the tokens that are spent in structuring the JSON (with keys, curly braces, quotes, etc…) are not helping it get closer to the answer.

If anything, these added formatting instructions move the LLM away from the correct answer(s). And the weaker the model (e.g. GPT-3.5 vs GPT-4), the more noticeable the inaccuracies get.

Overall, the goal when prompting GPT-3.5 is to keep it as focused as possible on your prompt(s) and to reduce the risk of it drifting away from the desired answer(s).

Tradeoffs Between Models

In this post, we’ve touched on the tradeoffs between using GPT-3.5 vs GPT-4, but let’s dive a little bit deeper into the implications of using one over the other.

Speed

Because GPT-4 is a larger model and requires more computational resources, both to process inputs of varying complexity and to generate responses for them, it is slower than the smaller GPT-3.5.

The response time of an LLM is an important factor in a project, especially if a more real-time experience is necessary (e.g. a chatbot). If your model is powerful but takes a long time to produce responses, that will affect the user experience of your project.

When thinking about how your project would scale, if you need to process many items as complex inputs for your model and receive outputs for each input, then the difference in model speeds can be the difference between completing a task in minutes vs hours (or longer).

Cost

In our specific experiment, we compared the costs of running similar datasets through our LLM calls as inputs and the corresponding outputs we received.

At the time of this experiment, processing around 3000 conversations cost $1.80 with GPT-4, while GPT-3.5 cost $0.02 (two cents!) for those same 3000 conversations.

And while $1.80 is not necessarily a high cost on its own, comparing it to $0.02 makes it clear that the savings from GPT-3.5 become most significant as the project scales and more conversations are processed.

Another point from this example: processing one million conversations with GPT-3.5 would cost the same as processing just 11,167 conversations using GPT-4 ($6.70).
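The arithmetic behind these figures can be reproduced with a short sketch. It assumes cost scales linearly with the number of conversations, which is a simplification since real costs depend on token counts per conversation. The only inputs are the two totals quoted above, and the variable and function names are just for illustration.

```python
# Figures quoted in the post, for a run of around 3000 conversations.
GPT4_COST_USD = 1.80
GPT35_COST_USD = 0.02
CONVERSATIONS = 3000

# Assumption: cost per conversation is constant (linear scaling).
gpt4_per_conversation = GPT4_COST_USD / CONVERSATIONS
gpt35_per_conversation = GPT35_COST_USD / CONVERSATIONS


def projected_cost(per_conversation_usd: float, n_conversations: int) -> float:
    """Projected cost in USD for n_conversations at a fixed per-conversation rate."""
    return per_conversation_usd * n_conversations


# How many times cheaper GPT-3.5 was in this experiment (about 90x).
cost_ratio = GPT4_COST_USD / GPT35_COST_USD

# One million conversations with GPT-3.5: about $6.67, rounded to $6.70 in the post.
gpt35_million = projected_cost(gpt35_per_conversation, 1_000_000)

# Conversations GPT-4 could process for $6.70: about 11,167.
gpt4_equivalent = 6.70 / gpt4_per_conversation

print(f"GPT-3.5 is about {cost_ratio:.0f}x cheaper")
print(f"1M conversations on GPT-3.5: ${gpt35_million:.2f}")
print(f"$6.70 on GPT-4 covers about {gpt4_equivalent:,.0f} conversations")
```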

Conclusion

Our experimentation shows that, with the right prompting strategy, GPT-3.5 can approach the accuracy of GPT-4 while being markedly faster and cheaper.

By optimising our prompts and leveraging the capabilities of GPT-3.5, we've achieved comparable results at a fraction of the cost.

The significant reduction in token costs, from $1.80 to $0.02 for around 3000 conversations (a whopping 90x cheaper!), demonstrates the scalability and cost-effectiveness of this approach.

Moving forward, this nuanced understanding of LLMs will empower us to maximise efficiency while minimising expenditure in our text-based processes.

‍