In recent years, large language models (LLMs) such as GPT-3/4, PaLM, and ChatGPT have rapidly grown in popularity and demonstrated impressive text-generation capabilities.
However, while LLMs show great promise in fields like search and customer service, there are still many challenges and constraints limiting their applications.
A new paper titled “Challenges and Applications of Large Language Models” by researchers from University College London, the UK Health Security Agency, Stability AI, and others provides a timely overview of the key issues facing LLMs and where they are currently being applied.
This article summarises the core challenges identified in the paper and examples of real-world LLM usage to date.
What Are Large Language Models?
LLMs are a class of neural networks trained on vast text datasets, without hand-labelled examples, to predict the next word (more precisely, the next token) in a sequence.
Unlike traditional programs with rigid rules, LLMs develop fluid understandings of language through exposure to patterns in enormous volumes of text. Recent models like OpenAI’s ChatGPT have demonstrated an uncanny ability to generate human-like text when prompted.
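The next-word objective can be illustrated with a deliberately tiny stand-in. The sketch below counts which word follows which in a toy corpus and predicts the most frequent successor. It is an illustration under loud assumptions: real LLMs learn this prediction with a neural network over subword tokens rather than a count table, and the corpus and function names (`train_bigram`, `predict_next`) are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent successor of `word`, or None if unseen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Toy corpus, made up for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the cat slept",
]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" three times in the corpus
```

Even this crude table picks up a pattern from exposure alone, with no hand-written grammar rule, which is the same idea the article describes at a vastly larger scale.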
LLMs owe their capabilities to the Transformer, a neural network architecture introduced in 2017.
Combining the Transformer architecture with massive datasets scraped from the internet enabled exponential growth in model sizes, from millions of parameters just five years ago to hundreds of billions today.
For instance, OpenAI’s GPT-3 contained 175 billion parameters when launched in 2020.
But what can these models actually do?
When fed a prompt like “Write a poem about nature”, LLMs can produce creative poetry. Given a mixed review of a restaurant, they may summarise the key points.
LLMs can even take simple conversational turns or assist with basic coding tasks when prompted appropriately.
Challenges Facing Large Language Models
However, the paper argues that despite rapid progress, LLMs still face critical challenges today across three broad areas:
Design Challenges
Design choices when developing LLMs can significantly impact their capabilities. But best practices remain unclear.
- Training Data Quality: Modern datasets used to train LLMs are unfathomably large, often containing billions of web pages. This makes it impractical for researchers to assess their data quality thoroughly. Issues like duplicated or misclassified text can propagate flaws into the models.
- Limited Context: LLM architectures rely heavily on a fixed context window, constraining their ability to process long documents. For instance, summarising an entire novel in a single pass may be beyond their reach today.
- Tokeniser Dependence: LLMs depend on a separate tokeniser model to digest text into suitable input. This introduces brittleness and inefficiencies that designers struggle to overcome.
- Training Scale: Recent estimates suggest model performance follows a power law with training compute. But the field lacks clear guidance on efficiently scaling LLMs for a given resource budget.
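To make the power-law claim in the Training Scale point concrete, here is a minimal sketch that fits a curve of the assumed form loss = a * compute^b by least squares in log-log space. The functional form, the function name `fit_power_law`, and the data points are all assumptions for illustration: the points are generated from a made-up law, not taken from the paper or from any real training run.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**b by least squares on log-log values."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Synthetic points from a made-up law (a=10, b=-0.05); NOT real measurements.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"fitted a={a:.2f}, b={b:.3f}")

# Extrapolate the fitted curve to a larger (hypothetical) compute budget.
predicted = a * 1e22 ** b
print(f"predicted loss at 1e22: {predicted:.3f}")
```

A fit like this describes how loss falls as compute grows, but it does not say how to split a budget between model size and data, which is the guidance the article notes is still missing.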
Limited context length remains a barrier to handling long inputs well, which holds back applications such as writing or summarising novels and textbooks.
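One workaround for a fixed context window, offered here as an assumption rather than something the paper prescribes, is to split a long document into overlapping chunks that each fit the window. The sketch below uses whitespace-separated words as a rough stand-in for tokens. The function name `chunk_words` and the `window` and `overlap` parameters are hypothetical.

```python
def chunk_words(text, window=200, overlap=20):
    """Split text into overlapping chunks of at most `window` words."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        # Stop once the current chunk already reaches the end of the text.
        if start + window >= len(words):
            break
    return chunks


# Ten placeholder words, small window so the overlap is easy to see.
document = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(document, window=4, overlap=1):
    print(chunk)
```

The overlap repeats a little text at each boundary so that a sentence cut by one chunk is seen whole in the next. The model still never sees the full document at once, so the underlying limitation remains.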
Behavioural Challenges
Even well-designed LLMs often exhibit problematic behaviours in the real world:
- Misaligned Values: LLMs frequently generate text or recommendations misaligned with human values and societal norms. Their open-ended nature makes this hard to predict.
- Deception & Toxicity: Without safeguards, LLMs may produce false information or toxic language which humans should not trust or propagate.
- Prompt Sensitivity: Small prompt variations often drastically alter an LLM’s outputs, making their behaviour unpredictable.
- Limited Memory: LLMs struggle to maintain conversational context, repeatedly contradicting themselves or getting lost across prolonged interactions.
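The Limited Memory point can be illustrated with one plausible mechanism, stated here as an assumption: a chat application that must fit the conversation into a fixed budget may simply drop the oldest turns. The sketch below counts words as a rough proxy for tokens. `trim_history` and the example dialogue are invented for illustration.

```python
def trim_history(turns, budget):
    """Keep the most recent turns whose combined word count fits `budget`."""
    kept = []
    used = 0
    for turn in reversed(turns):
        cost = len(turn.split())
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))


# Made-up dialogue; the user's name appears only in the earliest turns.
history = [
    "User: my name is Ada",
    "Bot: nice to meet you Ada",
    "User: what is the capital of France",
    "Bot: Paris",
    "User: what is my name",
]
print(trim_history(history, budget=15))
```

With a budget of 15 words, the turns that mention the user's name are dropped before the final question is asked, so a model fed only the trimmed history has no way to answer it.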
Evaluation Challenges
Finally, fundamental issues in evaluating LLMs persist today:
- Changing Capabilities: Existing benchmark suites quickly become outdated as models rapidly evolve. But human evaluation is too slow to keep pace.
- Reasoning Shortfalls: Despite progress, LLM reasoning capabilities lag humans in areas like mathematical logic, causality, and analogy comprehension.
- Brittle Metrics: Slight prompt modifications often radically impact benchmark scores, limiting generalisable insights.
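One way to surface the brittleness described under Brittle Metrics is to score the same examples under several prompt templates and report the spread alongside the mean. This is a minimal sketch, not the paper's protocol: `toy_model` is a hypothetical stand-in for an LLM that is deliberately sensitive to prompt wording, and the templates and examples are made up.

```python
from statistics import mean, pstdev


def toy_model(prompt):
    """Hypothetical stand-in for an LLM: it only answers when the prompt ends with 'Answer:'."""
    if not prompt.rstrip().endswith("Answer:"):
        return "unsure"
    question = prompt.split("Q:")[-1]
    return "4" if "2 + 2" in question else "6"


def accuracy(model, template, examples):
    """Fraction of examples answered correctly under one prompt template."""
    hits = sum(model(template.format(q=q)) == gold for q, gold in examples)
    return hits / len(examples)


examples = [("2 + 2", "4"), ("3 + 3", "6")]
templates = [
    "Q: {q}\nAnswer:",
    "Q: {q}\nA:",
    "Please solve. Q: {q} Answer:",
]

scores = [accuracy(toy_model, t, examples) for t in templates]
print("per-template accuracy:", scores)
print("mean:", mean(scores), "spread:", pstdev(scores))
```

Reporting a single number from one template would hide the fact that an equally reasonable phrasing scores zero here, which is why a spread across templates is more informative than one headline score.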
Current Applications of Large Language Models
Despite these limitations, LLMs are already being incorporated into diverse applications, including the following:
- Chatbots: LLMs like OpenAI’s ChatGPT are becoming popular in chatbots for their engaging conversational abilities. However, their limited memory poses challenges in maintaining coherence across long dialogues.
- Medicine: Researchers have prompted LLMs like GPT-4 to answer medical questions reasonably well. But safety risks from potential inaccuracies remain barriers to clinical usage.
- Legal: LLMs have shown promise for legal tasks like predicting court outcomes and statute analysis when provided with relevant context. However, their training data quickly becomes outdated as new laws are passed.
- Robotics: Combining LLMs with computer vision has enabled robots to understand natural language instructions better and act accordingly. Reliance on a single modality is still a constraint, though.
- Knowledge Work: Models customised for domains like scientific writing (Galactica) or finance (BloombergGPT) can summarise, analyse data, and answer questions reasonably well, though not yet for tasks requiring complex numerical reasoning.
- Psychology: Early experiments suggest LLMs may simulate certain human cognitive biases or personality traits when prompted appropriately. However, biased training data likely skews their behaviour.
- Synthetic Data: With suitable prompting, LLMs can generate synthetic training examples for downstream tasks on demand. But risks of unrepresentative or low-quality data require caution.
In most applications today, LLMs require careful prompting, training on high-quality demonstrations, and/or integration with non-language tools to strengthen their capabilities and mitigate risks.
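As a hedged illustration of integration with a non-language tool, the sketch below routes plain arithmetic to a small calculator and sends everything else to the model. It is one possible design under stated assumptions, not a method from the paper: `calculate`, `answer`, and `fake_llm` are hypothetical names, and `fake_llm` is only a placeholder for a real model call.

```python
import ast
import operator

# Arithmetic operators the calculator tool is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression):
    """Safely evaluate +, -, *, / over numbers without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


def answer(query, language_model):
    """Route arithmetic to the calculator; send everything else to the model."""
    try:
        return str(calculate(query))
    except (ValueError, SyntaxError):
        return language_model(query)


def fake_llm(query):
    # Hypothetical placeholder for a real model call.
    return f"[model reply to: {query}]"


print(answer("12 * (3 + 4)", fake_llm))
print(answer("Summarise this review", fake_llm))
```

Handing exact arithmetic to a deterministic tool sidesteps the numerical reasoning gaps mentioned above, while the model remains responsible for open-ended language tasks.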
Tackling Obstacles in LLM Development
In conclusion, despite rapid progress in recent years, considerable limitations still need to be addressed in real-world LLM applications.
Key challenges highlighted across the design, behaviour, and evaluation of LLMs include brittleness, safety risks, reasoning gaps, and data quality.
Ongoing research to tackle these issues will be vital to unlocking the full potential of large language models.
If these obstacles can be surmounted, LLMs may one day fundamentally transform how humans and machines interact across healthcare, scientific discovery, education, and more.
However, researchers still face a long road ahead to develop LLMs that are both broadly capable and reliably beneficial.