Decoding LLM Implementation: Balancing Accuracy and Complexity

Exploring different technical implementations of Large Language Models, trade-offs, and use cases. 

Large Language Models (LLMs) have emerged as tremendously powerful tools with wide business application. LLMs are sophisticated artificial intelligence systems built on extensive neural networks, capable of processing, comprehending, and generating human-like text. These models are trained on massive datasets and equip organizations with advanced natural language processing capabilities, facilitating tasks ranging from content creation to data analysis.

Product managers may be unsure which technical implementation and approach is the best for their product or company. Should an organization just leverage an off-the-shelf LLM? Or fine-tune a model for a specific task? Or even enhance LLM prompts with company-specific context and data? This blog lays out the main options for optimizing LLM results, discusses the trade-offs for each, and suggests the best fit for different scenarios. Note that, although these methods will be assessed separately, they are often complementary: product managers can often use a combination of approaches to create an optimal solution.

The graphic below is a handy visualization of the main options available today when considering an LLM implementation. The level of customization (and associated complexity as measured by engineering effort and future maintenance) is displayed in the top bar. Further, the options are segmented by colors to denote which ones require no optimization to the prompt and no customization to the model itself.

Before jumping in, let’s clearly delineate different types of prompts that feed into a model. For the purposes of this blog, we have segmented prompts into simple prompts and engineered prompts. Simple prompts are basic inquiries that are clear and direct, but do not involve any nuanced or tailored language to guide the model more specifically. In contrast, engineered prompts are deliberately constructed from precise words and phrases to elicit desired responses from an LLM. Engineered prompts can be manually crafted, usually by someone with domain knowledge and insight into how a model might respond to certain inputs. Engineered prompts can also be programmatically generated to structure the input of information, usually through a user checklist or questionnaire.
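
To make this distinction concrete, the snippet below contrasts a simple prompt with a manually engineered prompt for the same task; the wording (and the Python framing) is purely illustrative.

```python
# Illustrative only: the same request phrased as a simple prompt versus an
# engineered prompt that specifies role, constraints, and output format.

simple_prompt = "Summarize this customer review."

engineered_prompt = (
    "You are a customer-experience analyst. Summarize the customer review below "
    "in exactly three bullet points: overall sentiment, the main complaint or "
    "compliment, and one suggested follow-up action. Use neutral, concise language.\n\n"
    "Review: {review_text}"  # template placeholder filled in at run time
)
```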

Commercial LLM API with Simple Prompts

The simplest approach uses an out-of-the-box, pre-trained LLM with simple prompts to perform a variety of tasks, such as text generation and summarization, sentiment analysis, and language translation. However, a pre-trained language model with no prompt or model customization can return inaccurate or incomplete results, especially for complex questions that require a deep understanding of the topic. The easiest and most common way for organizations to access a commercial LLM capability is to connect through an API or via the text box in a question-and-answer chatbot.
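
For illustration, the sketch below sends a simple prompt to a commercial LLM through an API. It assumes the OpenAI Python SDK and an API key set in the environment purely as one example; any hosted provider could be substituted.

```python
# A minimal sketch of calling a commercial LLM API with a simple prompt.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY
# environment variable; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the key risks of cloud migration."}],
)

print(response.choices[0].message.content)
```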

  • Pros: 

    • Cheapest and fastest – eliminates the need for extensive model development, data engineering, and prompt infrastructure setup

    • Easiest option for less technical organizations

    • A good fit for tasks where output accuracy is not critical 

    • The LLM’s broad knowledge across a wide variety of topics offers an interactive way to enhance search

  • Cons: 

    • Lack of specialization, relevance, and accuracy of results

    • Potential data security implications may arise if employees share sensitive data or IP with third parties through an external API

  • Example Use Cases: 

    • Generalized Code Development: Commercial LLMs can be powerful developer support tools – developers can obtain general code snippets, explanations, and guidance on programming languages, libraries, and frameworks to get projects started

    • Language Translation: A pre-trained translation API is often sufficient for applications requiring basic language translation between common languages

Prompt Engineering

Prompt engineering is the process of actively or programmatically composing structured prompts or queries to guide the output of any LLM to generate better search results or to format the output for further processing. This approach involves understanding a model's behavior and iteratively refining prompts to achieve the best results.

Prompt engineering can be enhanced by using a questionnaire, checklist, or other method to intake structured user information to improve the search input by providing more context and thereby reducing the search space. Prompt engineering does not modify the underlying model, but does require a consistent, even programmatic (like prompt chaining), approach to improve output accuracy.
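
As a sketch of the programmatic approach, the example below folds structured questionnaire answers into a prompt template to narrow the search space; the field names and template wording are hypothetical.

```python
# A minimal sketch of programmatic prompt engineering: structured answers from
# a user questionnaire are folded into a prompt template. Field names and
# template wording are hypothetical.

PROMPT_TEMPLATE = (
    "You are an assistant for {industry} professionals.\n"
    "Audience: {audience}\n"
    "Tone: {tone}\n"
    "Task: {task}\n"
    "Respond in no more than {max_words} words and state any assumptions."
)

def build_prompt(questionnaire: dict) -> str:
    """Validate the questionnaire answers and fill the prompt template."""
    required = {"industry", "audience", "tone", "task", "max_words"}
    missing = required - questionnaire.keys()
    if missing:
        raise ValueError(f"Questionnaire is missing fields: {missing}")
    return PROMPT_TEMPLATE.format(**questionnaire)

prompt = build_prompt({
    "industry": "healthcare",
    "audience": "hospital administrators",
    "tone": "formal",
    "task": "Summarize the new telehealth reimbursement rules.",
    "max_words": 200,
})
print(prompt)
```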

  • Pros: 

    • Produces more accurate and consistent results than simple prompting, with fewer hallucinations expected

    • Prompts can be manually created or programmatically designed

  • Cons:

    • Although prompt design guides the output, results will not be as specialized as from a fine-tuned model

    • Does not pull in outside data beyond what the user directly inputs to supplement or support results

    • Prompts engineered for one LLM may need to be re-verified when switching to another model, since prompt effectiveness is often model-specific

    • Prompt engineering/design can be challenging because it often requires deep understanding of a model’s sensitivity and limitations

  • Example Use Cases: 

    • Health Diagnosis: Crafting prompts that present patient data and symptoms to LLMs for diagnostic assistance, narrowing total search space and improving the accuracy of medical diagnoses

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a newer approach to LLM queries that combines the strengths of pre-trained language models with outside context. RAG addresses the limitations of pre-trained models by retrieving and using external information, which can be proprietary or industry-specific, such as documents, articles, and code snippets, at the prompting stage (prior to model ingestion). This information is automatically retrieved from a vector database based on the contents of the text input and pulled in to supplement the prompt. Because the RAG approach does not require re-training the LLM itself, model inference stays relevant by continually supplementing prompts with up-to-date documentation and data.

While the model is untouched, RAG does require significant effort and expertise in data engineering to build the underlying data pipelines and vector database architecture and then to integrate with the LLM itself.
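
The sketch below shows the core RAG flow at toy scale: documents are embedded into a small in-memory stand-in for a vector database, the most relevant ones are retrieved for a query, and they are prepended to the prompt. It assumes the sentence-transformers library for embeddings; the documents and prompt wording are invented for illustration, and a production system would use a real vector database.

```python
# A toy RAG sketch: embed documents, retrieve the most similar ones for a
# query, and prepend them to the prompt. Assumes the sentence-transformers
# library; documents and prompt wording are invented for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium subscribers receive priority support via live chat.",
    "Shipping to international addresses takes 7 to 14 business days.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(documents, normalize_embeddings=True)  # in-memory "vector database"

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def build_rag_prompt(question: str) -> str:
    """Supplement the user's question with retrieved context before sending it to an LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_rag_prompt("How long do I have to return an item?"))
```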

  • Pros:

    • Improved contextual understanding due to pulling in external information sources for each query 

    • Can generate responses with enhanced relevance, particularly for more complex questions

  • Cons:

    • Building and maintaining the architecture can require significant time and resources depending on the size, scale, and scope of the information in the vector database

    • Maintaining large knowledge bases requires constant effort and expense 

  • Example Use Cases: 

    • Journalism Support: A journalist could utilize RAG to write more informative and comprehensive news articles based on past articles written, interviews conducted, and secondary sources referenced

    • Informed Customer Service:  A customer service representative could use RAG to personalize a customer interaction by automatically pulling from company CRM data

    • Aided Academic Research: A researcher could use RAG to generate new hypotheses and insights based on troves of scientific data and documentation

    • Proposal Creation: A company can shorten the time needed to create proposals by drawing on proprietary past-work qualifications or third-party state-of-the-art papers

Fine-Tuning

Fine-tuning involves changing model weights based on specific, selected training examples to specialize LLM results for a certain task. Adjusting the model weights in effect retrains the model and creates a unique version of the original LLM with improved performance on that task. Fine-tuning takes advantage of a pre-existing model’s broad understanding of language and then hones it to specialize in a narrower, domain-specific scenario. This approach is typically better suited for inputs that are constrained to a topic area and do not have large variability.
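
As a hedged sketch of what fine-tuning preparation often looks like, the example below writes domain-specific training examples to a JSONL file in a chat-style format; the exact schema varies by provider and framework, and the examples are invented.

```python
# A minimal sketch of preparing fine-tuning data: curated, domain-specific
# examples written as chat-style records to a JSONL file. The schema follows
# a common chat-message format, but required fields vary by provider and
# fine-tuning framework; the examples are invented.
import json

training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a contracts-review assistant."},
            {"role": "user", "content": "Flag any auto-renewal clauses in this paragraph: ..."},
            {"role": "assistant", "content": "The paragraph contains an auto-renewal clause: ..."},
        ]
    },
    # ...hundreds or thousands more curated, domain-specific examples
]

with open("fine_tune_train.jsonl", "w", encoding="utf-8") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```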

  • Pros: 

    • Improves performance and consistency around a specific task; minimizes unnecessary results

    • Faster development than training a model from scratch

    • Creates company-specific version of a generalized LLM

  • Cons: 

    • Does not pull in outside data to supplement a prompt (as RAG does)

    • Requires the collection of some amount of domain-specific training data to re-train the model

    • Can lead to overfitting, where the model becomes too specialized and performs poorly on inputs that deviate even slightly from the training data

    • Since fine tuning creates a unique version of the general model, resources are required to catalog and monitor this version for accuracy and drift

    • If the business task is even slightly modified, the model will need to be re-trained

    • Re-training a model can be costly in terms of computational power, time, data, and specialized expertise 

  • Example Use Cases:

    • Fraud Detection for Financial Services: Using extensive company data on specific, unique behaviors of identity theft in loan applications, a large bank can re-train a language model to better detect this activity

    • Legal Document Review: A model can be re-trained to help legal professionals review legal documents, contracts, or patents, and to quickly identify relevant information and flag issues

Build and Train New Model

Building a complete, new LLM from the ground up is an expensive endeavor. The cost of training GPT-4 was estimated at $100M,¹ which may be beyond most non-governmental, corporate budgets. However, building from scratch provides the ultimate flexibility to create a specialized model with known, ground-truth data, which may minimize hallucinations. The model can also be optimized for other factors, such as size and speed of execution.

  • Pros: 

    • Ultimate flexibility in the use of ground truth, verified data

    • Total control over speed, size, and security

    • Training costs likely to drop over time

  • Cons: 

    • Expensive to design, build, train and maintain

    • Does not leverage data/training of other LLMs

    • Requires specialized knowledge of and access to fast cloud infrastructure

    • May not stay current as new features are created and delivered by commercial offerings

  • Example Use Cases:

    • Government: Sovereign governments, intelligence agencies, and military organizations have a vested interest in controlling their models and training on non-public datasets

    • Very Large Corporations: Some specific companies may seek a competitive advantage by using their own models trained on large proprietary datasets 

We have described varying levels of customization and complexity for leveraging LLMs. For most companies, a combination of methods – prompt engineering, RAG, and fine-tuning – will be needed to deliver a broad range of results. However, any level of customization usually means higher upfront development costs and ongoing maintenance costs. For companies with fewer available resources, a commercial API-driven LLM may be more practical. In general, the level of customization needed for a given level of accuracy and performance correlates with complexity and long-term cost. Ultimately, the right approach will depend on need and available resources.

1. https://en.wikipedia.org/wiki/GPT-4


