[rough draft, updated over time as papers are added. It started out high level but became more technical as I threw things in, so I’ll restructure it sometime for different audiences. Further along it gets into ways to fix the problems with hallucination & reasoning]
For decades people have had the idea ingrained that computers are logical entities, but current LLMs are not fully logical. Wharton professor Ethan Mollick has written a great deal about how to make practical use of current LLM technology, including an essay on “Blinded by Analogies: What is this AI thing? The wrong model can lead us astray.” Another author explained “ChatGPT is Lanley, not Spock (Artificial Empath, Part 1)”. As Yann LeCun, Chief AI Scientist at Meta, put it, these “Auto-Regressive LLMs can't plan (and can't really reason).” Some computer scientists see this topic as still subject to debate: “Can Large Language Models Reason?”
Since LLMs often mimic logical argument and write authoritatively, many are concerned that people will believe things they should not, without skeptically examining the information to be sure the AI is not hallucinating. There is a phenomenon referred to as “garbage in, gospel out”: people often trust the output of a computer system as valid regardless of whether they should.
Professor Mollick suggests treating current AI systems like interns whose work needs to be validated. Unfortunately, in the real world that may not always happen in the rush to get work done if the results seem accurate, especially if the flaws in their logic are ones the human reviewing their work may also be prone to make. The current state of the art AI may lead to workers being more productive overall, with the tradeoff that certain types of errors may become more frequent. Future systems may be able to combat some of these flaws.
Current AIs are trained on human knowledge, so they are prone to the same types of cognitive fallacies humans fall prey to. The current generation of AIs may increase the frequency of certain types of logic errors by reinforcing the views of humans who have already made them.
One easy-to-see example of their difficulty with simple reasoning tasks comes from a data scientist at Meta, who tweeted a graph of the results of this experiment:
I asked GPT 3 and 4 to add random 7 digit numbers 12,000 times each (150 distinct pairs of addends × 8 possible values for Number of Carries × 10 responses each). The prompt was just "{a}+{b}=", and in most cases it just responded a number. Kind of wild results!
The accuracy numbers for GPT-4 appear only on a graph, but they range from roughly 75% for additions requiring 0 carries down to below 20% for 7 carries. As he tweeted:
There is something so beautifully absurd about performing 10 trillion floating point operations to add a pair of numbers with 75% accuracy
A paper exploring the issue of “Teaching Arithmetic to Small Transformers” indicates LLMs have difficulty learning arithmetic even with large numbers of examples: one small model was not fully accurate even after being fine-tuned on 200k examples, and fine-tuning for 4-digit arithmetic made it forget arithmetic with fewer digits. The authors could achieve accurate results with only thousands of training examples if the answer’s digits were written in reverse order, for example writing “128+367=495” as “128+367=594”. That suggests which concepts LLMs learn may depend partly on the accidental structure of the unstructured text rather than on the inherent logical difficulty of the task.
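As a concrete illustration of that reversed-answer formatting, here is a minimal sketch of the data-formatting idea only (not the paper’s training code):

```python
# Minimal sketch of the reversed-answer formatting idea: write the sum with its
# least-significant digit first, matching the order in which carries propagate.

def plain_example(a: int, b: int) -> str:
    return f"{a}+{b}={a + b}"

def reversed_example(a: int, b: int) -> str:
    # Reverse only the answer's digits: 128+367=495 becomes 128+367=594
    return f"{a}+{b}={str(a + b)[::-1]}"

if __name__ == "__main__":
    print(plain_example(128, 367))     # 128+367=495
    print(reversed_example(128, 367))  # 128+367=594
```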
One study notes that their apparent successes in logical reasoning seem to be based on having seen that logic in their training data, which leads them to make errors in logic when they haven’t:
But, do they truly acquire the ability to perform reasoning tasks that humans find easy to execute? NO
We find that GPT3, ChatGPT, and GPT4 cannot fully solve compositional tasks even with in-context learning, fine-tuning, or using scratchpads
…We show that Transformers' successes are heavily linked to having seen significant portions of the required computation graph during training! This revelation, where models reduce multi-step reasoning into subgraph matching, raises questions about sparks of AGI claims
In fact, a study summarized in this Twitter thread explains how dependent they are on the structure of this logic graph:
Does a language model trained on “A is B” generalize to “B is A”? E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?” Our new paper shows they cannot!
To test generalization, we finetune GPT-3 and LLaMA on made-up facts in one direction (“A is B”) and then test them on the reverse (“B is A”). We find they get ~0% accuracy! This is the Reversal Curse.
…In Experiment 2, we looked for evidence of the Reversal Curse impacting models in practice.
We discovered 519 facts about celebrities that pretrained LLMs can reproduce in one direction but not in the other.
A long Twitter thread by a DeepMind researcher explores what that suggests about how we should intuitively think about the way LLMs function; a highlight:
This is the 1st rigorous treatment (and 3rd verification) I've seen
….P.S. better explanation. LLMs can do deductive logic *in the context window* because they index into data that's doing deductive logic. Training data: "I am a dog. Dogs have fur. Thus I have" Prediction: "I am a cat. Cats have eyes. Thus I have" This kind of thing. :)
The paper those threads (both worth reading) are discussing:
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
…We also evaluate ChatGPT (GPT3.5 and GPT-4) on questions about real-world celebrities, such as “Who is Tom Cruise’s mother? [A: Mary Lee Pfeiffer]” and the reverse “Who is Mary Lee Pfeiffer’s son?”. GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse.
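A minimal sketch (hypothetical code, not the authors’) of how such forward and reverse test pairs can be generated from made-up facts of the kind the paper fine-tunes on:

```python
# Build "A is B" fine-tuning facts plus forward and reverse test questions.
# The made-up names echo the fictitious examples used in the paper.

import random

NAMES = ["Daphne Barrington", "Uriah Hawthorne", "Mallory Quimby"]
ROLES = [
    "the director of 'A Journey Through Time'",
    "the composer of 'Abyssal Melodies'",
]

def make_pair(rng: random.Random) -> dict:
    name, role = rng.choice(NAMES), rng.choice(ROLES)
    return {
        "train": f"{name} is {role}.",          # fine-tuning fact ("A is B")
        "forward_test": f"Who is {name}?",      # same direction; answer: the role
        "reverse_test": f"Who is {role}?",      # reversed; answer: the name (~0% accuracy reported)
        "reverse_answer": name,
    }

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_pair(rng))
```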
The DeepMind researcher referenced above also has a long Twitter thread on the related issue that LLMs have trouble with hierarchical structures:
For anyone interested in future LLM development One of the bigger unsolved deep learning problems: learning of hierarchical structure
[…]But there's still a strong case to be made for symbolic AI here — or at least for solving the hierarchical structure problem. It means you can have more intelligence with less data — with less training — etc.
One paper suggests tuning LLMs by providing abstract symbolic example tasks to provide that logic graph:
From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning
…To explore the potential of symbolic tasks, we carry out an extensive case study on the representative symbolic task of SQL execution. Empirical results on various benchmarks validate that the integration of SQL execution leads to significant improvements in zero-shot scenarios, particularly in table reasoning
However, the paper referenced earlier on the difficulty of teaching arithmetic to transformers using standard notation suggests that certain types of logic graphs may be difficult to impart to LLMs. Researchers at Google explored methods to teach arithmetic to LLMs, and although they report improved results:
Teaching Algorithmic Reasoning via In-context Learning
…We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques. In particular, for long parity, addition, multiplication and subtraction, we achieve an error reduction of approximately 10x, 9x, 5x and 2x respectively compared to the best available baselines.
Their accuracy on these simple tasks still ranged from only 65.6% to 95%, whereas 100% accuracy is trivial to achieve in traditional software.
A recent academic paper has explored addressing the flaws in LLMs’ ability to plan tasks:
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
[...]so far, LLMs cannot reliably solve long-horizon planning problems. By contrast, classical planners, once a problem is given in a formatted way, can use efficient search algorithms to quickly identify correct, or even optimal, plans.[...]Via a comprehensive set of experiments on these benchmark problems, we find that LLM+P is able to provide optimal solutions for most problems, while LLMs fail to provide even feasible plans for most problems
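The division of labor is roughly that the LLM translates natural language into a formal problem description and a sound classical planner does the actual search. A minimal sketch under that reading, where `llm` and `classical_planner` are hypothetical placeholders rather than the authors’ implementation:

```python
# Sketch of the LLM+P pipeline: translate to PDDL, plan with a sound planner,
# then (optionally) translate the plan back to natural language.

def llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError

def classical_planner(domain_pddl: str, problem_pddl: str) -> str:
    """Placeholder for a sound classical planner."""
    raise NotImplementedError

def llm_plus_p(task: str, domain_pddl: str, example_problem: str) -> str:
    # 1. The LLM translates the natural-language task into a PDDL problem file,
    #    given the domain and one worked example as context.
    problem_pddl = llm(
        f"Domain:\n{domain_pddl}\n\nExample problem:\n{example_problem}\n\n"
        f"Translate this task into a PDDL problem file:\n{task}"
    )
    # 2. The classical planner, not the LLM, searches for a correct plan.
    plan = classical_planner(domain_pddl, problem_pddl)
    # 3. The LLM renders the formal plan back into plain English for the user.
    return llm(f"Explain this plan in plain English:\n{plan}")
```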
Researchers from MIT and Boston University published:
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining?
[…]Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving.
[…] Do humans also perform worse with unfamiliar counterfactual conditions? In fact, there is increasing evidence from cognitive science that human reasoning is scaffolded by rich causal models of the world
A Science magazine article by a Santa Fe Institute professor explores the question “How do we know how smart AI systems are?” and notes:
The author took these results as support for the claim that GPT-4 had developed a sophisticated theory of mind. However, a follow-up study took the same tests and performed the kinds of systematic, carefully controlled experiments that Michael Frank advocates. They found that rather than having robust theory-of-mind abilities, GPT-4 and other language models seem instead to rely on “shallow heuristics” to perform the tasks from the original paper.
These “shallow heuristics” match the idea above that they usually rely on simple computation or logic graphs implicit in their training for many results, even if the right prompting can sometimes lead them to more complex results. Another study notes that current systems struggle with a basic form of problem-solving reasoning used in programming, and that a different type of AI works better:
Can Transformers Learn to Solve Problems Recursively?
Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. […] Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.
Many in the public are not used to that type of reasoning either. While experienced programmers are used to the types of reasoning that neural model AIs currently struggle with, even professional engineers are being warned to “be critical of the outputs of large language models, as they tend to hallucinate and produce inaccurate or incorrect code.” While AI can generate code quickly, that code can contain subtle bugs that are difficult to spot. Engineers are used to the need to fully test their systems to be sure code is not released with such bugs, but the public is not.
Those subtle remaining flaws may not crash a program, since obvious bugs would be noticed and addressed; instead, for instance, a business model may produce an answer that is wrong without anyone realizing it. Many are touting the potential for AI to assist non-programmers by creating software. The problem is that most people are not schooled in the rigorous logical processes used to create and validate software, and neither are current AI systems.
Software engineers use specialized languages to program computers, or to logically specify in detail what a computer program should do. Natural language like English is usually too ambiguous to fully specify exactly how a computer should solve a problem. People leave out assumptions about the world that they take for granted other humans will understand, or they simply do not think things through rigorously enough to notice the ambiguity.
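A hypothetical illustration of that ambiguity: the request “split the bill evenly among the group” sounds complete, but a program must decide what to do with the cents that don’t divide evenly, a detail the English never specified.

```python
# The English request never said what to do with a remainder; the code must choose.

def split_bill(total_cents: int, people: int) -> list[int]:
    base, remainder = divmod(total_cents, people)
    # Decision the request left implicit: the first `remainder` people pay one
    # extra cent so the shares still sum exactly to the total.
    return [base + 1 if i < remainder else base for i in range(people)]

print(split_bill(1000, 3))  # [334, 333, 333] -- sums to 1000
```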
Non-programmers explain to developers what a program should do, and developers then think through all the details of the program to fully specify its functionality, asking questions of the non-programmers when needed to resolve ambiguity. Engineers need to think rigorously through the logic of the program. Since LLMs are not rigorously logical, they cannot currently fully replace the role of engineers in creating reliable programs. Eventually AIs may be trained, or constructed differently, to engage in the sort of rigorous analysis software engineers perform.
Until then, the programs AI produces may seem “good enough” for many uses, but society risks suffering from a rise of “almost, but not quite right” programs with bugs their users do not realize are there. Society may accumulate what is referred to, for individual companies, as technical debt.
An article listing open research problems relevant to the sort of AI that will have a transformative impact on society refers to the problems LLMs experience due to lack of a causal model:
The list of open research problems relevant to transformative AI continues. Learning a causal model is one. Ortega et al. show a naive case where a sequence model that takes actions can experience delusions without access to a causal model.
Many people have been working to find ways to force LLMs to reason more carefully, for instance even just telling a chatbot to “reason step by step.” That may be useful for some purposes, but it may be better to consider using tools that already focus on rigorous logic. New neural-inspired AI architectures or different training of the existing ones may lead to the rise of fully logical AIs, but it’s unclear when that will happen. Prominent VC firm Andreessen Horowitz posted that there may be a drastically rising cost to train these types of models to be more accurate:
For example, it may take an investment of $20 million to build a robot that can pick cherries with 80% accuracy, but the required investment could balloon to $200 million if you need 90% accuracy. Getting to 95% accuracy might take $1 billion. Not only is that a ton of upfront investment to get adequate levels of accuracy without relying too much on humans (otherwise, what is the point?), but it also results in diminishing marginal returns on capital invested. In addition to the sheer amount of dollars that may be required to hit and maintain the desired level of accuracy, the escalating cost of progress can serve as an anti-moat for leaders—they burn cash on R&D while fast-followers build on their learnings and close the gap for a fraction of the cost.
Humans took a long time to evolve to where they could logically think things through, and then to further collectively evolve a process of logically thinking through problems using external tools like writing to aid in the process. Current AI systems may benefit from a similar approach. Many people hope for AI to reach the level of human intelligence, AGI, yet even humans benefit from external tools to deal with things like logical reasoning.
The saying “when the only tool you have is a hammer, everything looks like a nail” seems to be the approach many are taking. They are using LLMs for tasks they aren’t the best tool for, because they are part of a larger task that LLMs are useful for. There are other tools available better suited to other tasks that may be combined with LLMs to address these issues.
It’s useful for LLMs to be able to explain their reasoning to humans, especially given the need to verify logic that may be flawed. Unfortunately an LLM’s explanations may not match its actual reasoning, an issue referred to as “faithfulness”:
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
[…]We found that LLM's explanations have low precision and that precision does not correlate with plausibility.
A roundup of recent papers describes that paper's results informally:
When you ask a text model to generate an explanation for its answer, you’d hope that its future responses are consistent with that explanation. E.g., suppose it says a bacon sandwich is hard to get in a certain area because bacon is hard to get there; if you then ask it whether bacon is hard to get there and it says “no”, there’s an inconsistency.
Turns out GPT-3 and GPT-4 often generate inconsistent explanations.
…Also, how often a model generates a seemingly satisfying explanation doesn’t correlate with how often its explanations are consistent.
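A minimal sketch of that consistency check (not the paper’s exact method; `llm` is a hypothetical placeholder): ask for an answer with an explanation, then ask the follow-up question the explanation implies and see whether the model contradicts itself.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model call

def explanation_is_consistent() -> bool:
    explanation = llm(
        "Is it hard to get a bacon sandwich in this area? "
        "Answer and explain your reasoning."
    )
    # Suppose the explanation claims "because bacon is hard to get there";
    # the implied answer to the follow-up below is then "yes".
    follow_up = llm("Is bacon hard to get in this area? Answer yes or no.")
    return follow_up.strip().lower().startswith("yes")
```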
Anthropic researchers also found:
Measuring Faithfulness in Chain-of-Thought Reasoning
[…]As models become larger and more capable, they produce less faithful reasoning on most tasks we study
[…]. We also see that the degree of post-hoc reasoning often shows inverse scaling, getting worse with increasingly capable models, suggesting that smaller models may be better to use if faithful reasoning is important.
Yann LeCun, Chief AI Scientist at Meta, explains:
I totally agree with [Subbarao Kambhampati] that LLMs can't plan. In fact, one of the main features of the cognitive architecture I propose in my position paper is its ability to plan (and reason) by searching for values of actions (or latent variables) that minimize an objective.
He is referring to Professor Subbarao Kambhampati, past president of the Association for the Advancement of Artificial Intelligence, who has explained:
LLMs can’t Plan (But they can help you in Planning)
In the case of "reasoning" tasks, we may consider that an LLM was able to reach a conclusion by something akin to theorem proving from base facts. But then we are missing the simple fact that the linguistic knowledge on the web not only contains "facts" and "rules" but chunks of the deductive closure of these facts/rules.
One of his papers reports:
On the Planning Abilities of Large Language Models -- A Critical Investigation
[…] Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ~12% across the domains. However, the results in the heuristic mode show more promise. In the heuristic mode, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation.
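A rough sketch of that back-prompting loop, with `llm` and `verify_plan` as hypothetical placeholders (the real work would use a sound external plan validator):

```python
# The external verifier, not the LLM, provides the correctness guarantee;
# its error messages are fed back to the LLM for another attempt.

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder

def verify_plan(domain: str, problem: str, plan: str) -> tuple[bool, str]:
    """Placeholder: returns (is_valid, error_message)."""
    raise NotImplementedError

def plan_with_backprompting(domain: str, problem: str, max_rounds: int = 5):
    prompt = f"Domain:\n{domain}\n\nProblem:\n{problem}\n\nWrite a plan."
    for _ in range(max_rounds):
        candidate = llm(prompt)
        ok, error = verify_plan(domain, problem, candidate)
        if ok:
            return candidate
        prompt += f"\n\nYour plan was invalid: {error}\nPlease fix it."
    return None  # give up; a sound planner could take over at this point
```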
A general audience article explains:
Large language models can’t plan, even if they write fancy essays
Large language models perform very poorly at tasks that require methodical planning
However, “reasoning” is often used broadly in these benchmarks and studies, Kambhampati believes. What LLMs are doing, in fact, is creating a semblance of planning and reasoning through pattern recognition.
In an invited talk at ICML 2023 he explained:
Avenging Polanyi's Revenge: Exploiting the Approximate Omniscience of LLMs in Planning without Deluding Yourself In the Process
Trained as they are on everything ever written on the web, LLMs exhibit "approximate omniscience"--they can provide answers to all sorts of queries, with nary a guarantee. This could herald a new era for knowledge-based AI systems--with LLMs taking the role of (blowhard?) experts. But first, we have to stop confusing the impressive form of the generated knowledge for correct content, and resist the temptation to ascribe reasoning powers to approximate retrieval by these n-gram models on steroids. We have to focus instead on LLM-Modulo techniques that complement the unfettered idea generation of LLMs with careful vetting by model-based AI systems.
with slides noting:
Our Poor Intuitions about Approximate Omniscience make it hard to tell whether LLMs are reasoning or retrieving[…]
It is worth understanding that our intuitions about what exactly is in the 600gb of text on the web are very poor.
One of the big surprises when Google came out with Palm LLM was that it could “explain” jokes. But did you know that there are sites on the web that explain jokes …
In general, memory reduces the need to reason from first principles.
He notes that LLMs are poor at reasoning, something most people don’t appreciate; in fact, one study showed human planners did worse with an LLM assisting them:
With LLM assistance: of 48 human planners, 33 (~69%) came up with a valid plan.
Without LLM assistance: of 49 human planners, 39 (~80%) came up with a valid plan.
Another paper argues, despite the hype (though there has been some critique of this paper that I haven’t explored):
GPT-4 Can't Reason
[…]Despite the genuinely impressive improvement, however, there are good reasons to be highly skeptical of GPT-4's ability to reason. This position paper discusses the nature of reasoning; criticizes the current formulation of reasoning problems in the NLP community and the way in which the reasoning performance of LLMs is currently evaluated; introduces a collection of 21 diverse reasoning problems; and performs a detailed qualitative analysis of GPT-4's performance on these problems. Based on the results of this analysis, the paper argues that, despite the occasional flashes of analytical brilliance, GPT-4 at present is utterly incapable of reasoning.
AI assistants that perform tasks for users, or aid users in doing tasks, require a model of the steps needed to perform a task and a model of the current state of the world. They need to reason about how to perform those tasks given that state of the world. They benefit from a model of the user’s preferences when there are multiple ways to perform a task. Unfortunately LLMs by themselves can’t rigorously reason about that sort of planning in a trustworthy fashion, nor do they keep a world model or a user model.
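A hypothetical sketch of what keeping those models explicitly might look like: world state and user preferences live in ordinary data structures outside the LLM, so a proposed step can be checked against recorded facts rather than taken on the model’s say-so.

```python
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    facts: set[str] = field(default_factory=set)            # e.g. "calendar_free_3pm"

@dataclass
class UserModel:
    preferences: dict[str, str] = field(default_factory=dict)

# Illustrative preconditions each step needs before it is acceptable.
PRECONDITIONS = {"book_meeting_3pm": {"calendar_free_3pm"}}

def step_is_acceptable(step: str, world: WorldModel) -> bool:
    return PRECONDITIONS.get(step, set()) <= world.facts

world = WorldModel(facts={"calendar_free_3pm"})
user = UserModel(preferences={"meeting_length": "30min"})
print(step_is_acceptable("book_meeting_3pm", world))        # True, because the state says so
print(user.preferences.get("meeting_length", "60min"))      # preferences resolve open choices
```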
Hybrid Approaches to Address LLM Limitations
Prior to the hardware advancements that made current neural net approaches to AI useful, AI researchers often focused on creating AI systems based on symbolic logic approaches that were capable of more rigorously reasoning about the world. It may be that eventually neural net approaches will be capable of embodying more rigorous logic using different training methods or architecture, but until that happens it may be useful to create hybrid systems that combine symbolic AI approaches with neural nets.
A New York Times article on applying AI to mathematics illustrates the stark contrast between the current capabilities of the neural nets that still struggle to beat humans, vs. the capabilities of symbolic AI:
Another set of tools uses machine learning, which synthesizes oodles of data and detects patterns but is not good at logical, step-by-step reasoning. […]
The model obtained scores that were better than an average 16-year-old student on high school math exams. […]
Lean uses automated reasoning, which is powered by what is known as good old-fashioned artificial intelligence, or GOFAI — symbolic A.I., inspired by logic. So far, the Lean community has verified an intriguing theorem about turning a sphere inside out as well as a pivotal theorem in a scheme for unifying mathematical realms, among other gambits.
The drawback of these symbolic AI tools is that they usually are not built to understand problems and reasoning expressed in natural language. OpenAI has been testing one approach to addressing the problem by allowing its LLM-based AI systems to use tools like Wolfram’s symbolic math software. One concern with this approach is that flaws in an LLM’s reasoning may lead it to make flawed use of these tools.
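A minimal sketch of that tool-use pattern (placeholder code, not OpenAI’s or Wolfram’s actual API): the model is asked to route the question, and any exact arithmetic is done by deterministic code. As noted above, the weak link is the routing step itself, which still relies on the LLM’s judgment.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def calculator(a: float, op: str, b: float) -> float:
    return OPS[op](a, b)          # deterministic, unlike sampled LLM digits

def llm(prompt: str) -> str:
    raise NotImplementedError     # hypothetical placeholder

def answer(question: str) -> str:
    # The LLM only decides whether and how to call the tool.
    route = llm(f"Does this need exact arithmetic? Reply 'MATH a op b' or 'TEXT'.\n{question}")
    if route.startswith("MATH"):
        _, a, op, b = route.split()
        return str(calculator(float(a), op, float(b)))
    return llm(question)
```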
An alternative approach is for more logical symbolic AI systems to use LLMs to interact with humans and translate their natural language into forms the symbolic AI can deal with. A symbolic reasoning system can attempt to determine when natural language is ambiguous and interact with people to resolve things it doesn’t understand.
Professor Subbarao Kambhampati has written about a hybrid approach to planning that uses an LLM to aid humans in building a world model used for a symbolic planner:
Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning
…In this work, we introduce a novel alternative paradigm that constructs an explicit world (domain) model in planning domain definition language (PDDL) and then uses it to plan with sound domain-independent planners. To address the fact that LLMs may not generate a fully functional PDDL model initially, we employ LLMs as an interface between PDDL and sources of corrective feedback, such as PDDL validators and humans. For users who lack a background in PDDL, we show that LLMs can translate PDDL into natural language and effectively encode corrective feedback back to the underlying domain model. Our framework not only enjoys the correctness guarantee offered by the external planners but also reduces human involvement by allowing users to correct domain models at the beginning, rather than inspecting and correcting (through interactive prompting) every generated plan as in previous work.
One early attempt at integrating LLMs with symbolic reasoning is a system to create knowledge graphs using ChatGPT:
By bridging the gap between the unstructured textual world of ChatGPT and the structured clarity of knowledge graphs, we strive to enhance the effectiveness and reliability of AI language models.
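A hypothetical sketch of the general pattern (not that system’s code): prompt the model to emit structured triples and keep only the lines that parse, so downstream reasoning works on a graph rather than on prose.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a ChatGPT-style model call

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    raw = llm("Extract facts as lines of the form subject|predicate|object.\n\n" + text)
    triples = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):       # discard malformed lines
            triples.append((parts[0], parts[1], parts[2]))
    return triples
```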
Other recent papers discuss related ideas like “Unifying Large Language Models and Knowledge Graphs: A Roadmap” and “Knowledge Graphs: Opportunities and Challenges”.
Symbolic artificial intelligence researchers have built reason or truth maintenance systems that embody an explicitly logical model of the world. The Cyc project has over 1.5 million concepts, 25 million axioms and 40 thousand predicates in its commercial version. An open source version of it has 239 thousand concepts and over 2 million facts. AI pioneer and founder of Cyc Doug Lenat published:
Getting from Generative AI to Trustworthy AI: What LLMs might learn from Cyc
…We lay out 16 desiderata for future AI, and discuss an alternative approach to AI which could theoretically address many of the limitations associated with current approaches …We suggest that any trustworthy general AI will need to hybridize the approaches, the LLM approach and more formal approach, and lay out a path to realizing that dream.
LLMs may be used to handle input and output from humans and senses, with more rigorous logical systems either driving the LLMs or available to LLMs to use to address issues of hallucination and logical fallacies. Future services might offer APIs for “Logic as a service” or “World model as a service”.
Different LLMs and other AIs that are tailored to specific tasks may use each other as tools and have a need to communicate. Rather than communicating in natural language and risking ambiguities or misunderstandings, they could interact with a shared formal world model. A research paper explains:
The Roles of Symbols in Neural-based AI: They are Not What You Think!
ABSTRACT: We propose that symbols are first and foremost external communication tools used between intelligent agents that allow knowledge to be transferred in a more efficient and effective manner than having to experience the world directly. But, they are also used internally within an agent through a form of self-communication to help formulate, describe and justify subsymbolic patterns of neural activity that truly implement thinking.
[…]We suggest here that symbols and symbolic reasoning are communication tools used externally and internally within an agent to explain decisions, to help guide reasoning, and to bias learning. Symbols are based on categories that help humans make sense of a complex world, individually and as groups. They provide the means by which agents can learn concepts with each other more efficiently and effectively using less energy and with lower risk. But symbolic representations are of even greater importance to agents because: (1) they can also be used internally in a form of self-communication to describe, justify, and guide subsymbolic patterns of neural activity, and (2) they provide an inductive bias to guide future learning of new symbols, concepts and their relations
This approach parallels the idea of integrating traditional databases as a retrieval mechanism for information an LLM can use to ground itself. One article describing this approach notes: “In a recent Sequoia survey, 88% of respondents believe that retrieval will be a key component of their stack.” A study from Facebook AI Research and University College London compared models that rely only on facts stored in their weights (which chatbots demonstrate leads to hallucinations in general models) with models that integrate information retrieval:
When pre-trained on large unsupervised textual corpora, language models are able to store and retrieve factual knowledge to some extent, making it possible to use them directly for zero-shot cloze-style question answering. However, storing factual knowledge in a fixed number of weights of a language model clearly has limitations. Previous approaches have successfully provided access to information outside the model weights using supervised architectures that combine an information retrieval system with a machine reading component. In this paper, we go a step further and integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way. We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline.
Another paper from them reported:
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
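A minimal retrieval-augmented sketch (hypothetical helpers, not the paper’s architecture): retrieve the passages most similar to the question and have the model answer only from them.

```python
def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder embedding model

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder language model

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def rag_answer(question: str, corpus: list[str], k: int = 3) -> str:
    q = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(embed(doc), q), reverse=True)
    context = "\n\n".join(ranked[:k])
    return llm(
        "Answer using only the passages below; say 'not found' otherwise.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```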
Operating LLMs is expensive compared to the computations involved in symbolic AI. One article on “Symbolic AI vs LLM: Cost Comparison” suggests that at a high volume of transactions, for some tasks the fixed upfront cost of the knowledge engineering effort for a human to explicitly encode task-specific knowledge may lead to lower total transaction costs than relying on a general-purpose LLM. That author has also done initial work on combining LLMs with more formal models:
To empirically test the richness and usefulness of ChatGPT’s abstractions I have created a computer program, called Finchbot, that prompts ChatGPT to create domain models using the Accord Project Concerto data modelling language. These models (if syntactically valid) can then be rendered as UML diagrams. If the initial model doesn’t meet requirements it can be iteratively refined.
[…]Firstly, Finchbot creates syntactically valid models approximately 80% of the time
[…]In summary, I think Finchbot is operating at the level of a novice user of Concerto, and a novice business analyst / data modeller, but FOR ALL THE WORLD’S DATA. This is incredibly impressive, and exciting.
Other early work on combining symbolic and neural AI suggests hybrid systems can be more efficient:
Neuro-Symbolic AI: An Emerging Class of AI Workloads and their Characterization
Abstract—Neuro-symbolic artificial intelligence is a novel area of AI research which seeks to combine traditional rules-based AI approaches with modern deep learning techniques. Neurosymbolic models have already demonstrated the capability to outperform state-of-the-art deep learning models in domains such as image and video reasoning. They have also been shown to obtain high accuracy with significantly less training data than traditional models. Due to the recency of the field’s emergence and relative sparsity of published results, the performance characteristics of these models are not well understood
In addition to being more accurate:
The Impact of Symbolic Representations on In-context Learning for Few-shot Reasoning
Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations (or “chain-of-thought” (CoT)) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To make progress towards understanding in-context learning, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from knowledge bases (KBs). Then we revisit neuro-symbolic approaches and use Language Models as Logic Programmer (LMLP) that learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog’s backward chaining algorithm. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with fewer parameters.
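For intuition, here is a toy propositional cousin of the Prolog-style backward chaining the paper recovers (real Prolog and LMLP handle variables and unification, which this sketch omits): to prove a goal, find a rule whose head matches it and recursively prove the rule’s body.

```python
FACTS = {"dog(rex)"}
RULES = [
    ("has_fur(rex)", ["dog(rex)"]),        # head :- body
    ("mammal(rex)", ["has_fur(rex)"]),
]

def prove(goal: str) -> bool:
    if goal in FACTS:
        return True
    return any(
        head == goal and all(prove(sub) for sub in body)
        for head, body in RULES
    )

print(prove("mammal(rex)"))    # True: mammal(rex) <- has_fur(rex) <- dog(rex)
print(prove("mammal(felix)"))  # False: nothing in the knowledge base supports it
```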
An overview of some relevant research:
Neuro-Symbolic RDF and Description Logic Reasoners: The State-Of-The-Art and Challenges
Additionally, advancements in automated knowledge base construction have led to the creation of large and expressive ontologies that are often noisy and inconsistent, posing further challenges for conventional reasoners. To address these challenges, researchers have explored neuro-symbolic approaches that combine neural networks’ learning capabilities with symbolic systems’ reasoning abilities. In this chapter, we provide an overview of the existing literature in the field of neuro-symbolic deductive reasoning supported by RDF(S), EL, ALC, and OWL 2 RL, discussing the techniques employed, the tasks they address, and other relevant efforts in this area.
While work has begun on addressing the issues, they aren’t easy to resolve. Society in the near term will need to deal with AI systems that sound like experts but need to be treated like interns.
Lessons from Expert Systems
A prior AI boom attempted to formally encode the knowledge of experts into AI systems, often using explicit rules. The difficulty of the task was underestimated and was one factor leading to an AI winter where many became skeptical of the utility of the field.
Early artificial intelligence researchers encountered what is referred to as the frame problem. They didn’t have the luxury of vast digital stores of human knowledge. They tried to create systems to solve small problems that would require only a limited set of knowledge about the world. They wished to frame certain problems in a way that would let a computer solve them without needing to provide it with all the knowledge a human has.
People attempted to create expert systems that would encode expert knowledge in a particular field like medicine. They often used explicit rules or examples, similar to the instruction examples used to fine-tune LLMs, to guide behavior. They discovered that such experts often weren’t merely relying on a limited set of domain-specific knowledge, but also on general common sense and knowledge about the rest of the world that they could apply when required. The systems were able to achieve a certain level of utility, but the gaps in ability were problematic.
It turns out that many small real-world problems people solve either require, or at least benefit from, broad knowledge about the world rather than merely a small subset of it. The attempt to create expert systems using formalized rules within a specific area parallels the new attempts to collect examples of how an LLM should behave to fine-tune it for a specific task.
The benefit of LLMs is that they superficially appear to bypass the frame problem because they have wide general knowledge about the world, so fine-tuning can focus on the details of a particular niche. The concern is that even if someone provides useful rules for how to deal accurately with particular examples within a field, the general logical flaws LLMs have when dealing with the rest of the world will undermine their utility. There is a risk that LLMs for some purposes may be flawed just as expert systems were, even if they are good for many tasks. The open question is whether their flaws can be spotted and worked around well enough.
Humans need to turn to external sources of knowledge as tools, just as the new AI wave may. Grounding LLMs with explicit knowledge bases for particular tasks may help, but also a crowd-created shared world model growing over time may be of use to cope with the frame problem.
IBM AI developers who have been dealing with real-world applications also suggest it’s useful to integrate rule-based AI with LLMs:
Approaches in Using Generative AI for Business Automation: The Path to Comprehensive Decision Automation
[…]In short, rules and LLMs have their sweet spots. The challenge is: how can we combine these technologies to capitalize on their strengths, just as we have invented composite materials to go beyond the properties of iron and carbon?
[…] While LLMs offer impressive capabilities, they lack reliable and repeatable reasoning skills to meet strict decision-making requirements. To bridge this gap, the blog introduces five innovative approaches that combine LLMs with rule-based reasoning engines.
More from IBM:
5. Rules to bring reliable reasoning in a chatbot
Description: In this approach, a Large Language Model (LLM) is used to drive the conversational experience, handling Natural Language Processing (NLP) tasks. The LLM delegates to a rule-based decision engine to apply business decisions.
…IBM is actively working on incorporating this pattern to bring decision-making capabilities into Watson Orchestrate[6]. Additionally, clients can already develop tools in open-source frameworks like LangChain to invoke rule-based decisions from a bot as shown in this ODM with LangChain post.
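A hypothetical sketch of that delegation pattern (placeholder code, not IBM’s ODM or LangChain APIs): the LLM handles the conversation and field extraction, while the business decision itself comes from deterministic, auditable rules.

```python
def loan_decision(credit_score: int, amount: float) -> str:
    # Auditable business rules (illustrative thresholds), evaluated the same way every time.
    if credit_score < 620:
        return "declined: credit score below policy minimum"
    if amount > 50_000 and credit_score < 700:
        return "referred: large amount requires a higher score"
    return "approved"

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical placeholder

def chatbot_turn(user_message: str) -> str:
    # The LLM only extracts structured fields; it never makes the decision.
    fields = llm(f"Extract credit_score and amount as 'score,amount': {user_message}")
    score, amount = fields.split(",")
    decision = loan_decision(int(score), float(amount))
    return llm(f"Politely tell the customer this decision: {decision}")
```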
A service providing world models to LLMs could also provide a framework for “rules as a service”, including integration with existing rule-based systems. Some computer scientists at Princeton have taken lessons learned from symbolic AI and proposed:
Cognitive Architectures for Language Agents
Recent efforts have incorporated large language models (LLMs) with external resources (e.g., the Internet) or internal control flows (e.g., prompt chaining) for tasks requiring grounding or reasoning. However, these efforts have largely been piecemeal, lacking a systematic framework for constructing a fully-fledged language agent. To address this challenge, we draw on the rich history of agent design in symbolic artificial intelligence to develop a blueprint for a new wave of cognitive language agents. We first show that LLMs have many of the same properties as production systems, and recent efforts to improve their grounding or reasoning mirror the development of cognitive architectures built around production systems. We then propose Cognitive Architectures for Language Agents (CoALA), a conceptual framework to systematize diverse methods for LLM-based reasoning, grounding, learning, and decision making as instantiations of language agents in the framework. Finally, we use the CoALA framework to highlight gaps and propose actionable directions toward more capable language agents in the future.
Lessons From Journalism
Society has dealt before with difficulties with systems that aren’t completely accurate. People tolerated errors in journalism for a long time, often choosing to not question them.
Twenty years ago, author Michael Crichton suggested most people suffered from what he termed the Gell-Mann Amnesia effect. People would read a news article on a topic they were well informed about and decide “the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect.”
Then they would go on to read an article on a topic where they had no expertise and forget their concerns about the publication’s quality and trust it as if it were probably completely accurate. Perhaps they implicitly rationalized they had no way to know for sure about the content and may as well hope for the best.
Unfortunately we risk similar problems today with LLMs that people grasp are flawed in areas they are knowledgeable about, but too easily trust information in other domains they aren’t experts in.