This is the third in a short series of posts about ChatGPT’s capabilities and flaws. In the first, I noted that ChatGPT often makes things up because that’s what it was designed to do; in the second, that it lacks an “inner monologue” and meta-cognition.
In this post, I’ll take a brief look at its reasoning capabilities and their consequences.
In my opinion, its abilities in this area are much more impressive than its general knowledge, which is highly unreliable. For an example of the kind of thing it can do, here is a deliberately silly problem I made up as a test:
Suppose we have the following premises:
People from Aylesbury never wear shoes
John is from Abdon
People who don't wear shoes get blisters on their noses
Abdon is a village inside Aylesbury.
Then which of the following is true:
John gets blisters on his nose
John does not get blisters on his nose
We don't know if John gets blisters on his nose or not.
Please explain your reasoning.
We don't know if John gets blisters on his nose or not.
The fact that people from Aylesbury never wear shoes is not directly relevant to the conclusion since John is from Abdon, not Aylesbury. The statement "Abdon is a village inside Aylesbury" indicates a relationship between the two places but does not necessarily mean that the same rules apply to both. Therefore, we cannot conclude whether or not John gets blisters on his nose based on the given premises.
That was an interesting response, and not exactly what most humans would say, but perfectly reasonable, so I added one more premise:
If someone is from a village inside a larger place, we can say that they are also from that larger place.
This time I got the answer I expected, and the logic is sound:
John gets blisters on his nose.
The first premise states that "People from Aylesbury never wear shoes", and the fourth premise states that "Abdon is a village inside Aylesbury". The fifth premise states that "if someone is from a village inside a larger place, we can say that we are also from that larger place."
Using the fifth premise, we can conclude that John is also from Aylesbury since Abdon is a village inside Aylesbury. Therefore, we can apply the first premise to John and conclude that he never wears shoes.
Additionally, the third premise states that "People who don't wear shoes get blisters on their noses." Since John never wears shoes, he satisfies this condition and therefore gets blisters on his nose.
Therefore, the correct answer is A. John gets blisters on his nose.
(This was GPT-3.5 Turbo.)
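The deduction I was expecting is just forward chaining over the premises. To make that explicit, here’s a minimal sketch in Python; the relation names and the encoding are mine, purely for illustration, and obviously bear no relation to how ChatGPT works internally:

```python
# A minimal forward-chaining sketch of the intended deduction. The relation
# names and the encoding are my own, purely to make the logic explicit.
from_place = {("John", "Abdon")}      # premise 2: John is from Abdon
inside = {("Abdon", "Aylesbury")}     # premise 4: Abdon is inside Aylesbury
no_shoes, blistered = set(), set()

changed = True
while changed:  # keep applying the rules until no new facts appear
    changed = False
    # Premise 5: from a village inside a larger place => also from that place
    for person, small in list(from_place):
        for village, big in inside:
            if village == small and (person, big) not in from_place:
                from_place.add((person, big))
                changed = True
    # Premise 1: people from Aylesbury never wear shoes
    for person, place in from_place:
        if place == "Aylesbury" and person not in no_shoes:
            no_shoes.add(person)
            changed = True
    # Premise 3: people who don't wear shoes get blisters on their noses
    for person in no_shoes:
        if person not in blistered:
            blistered.add(person)
            changed = True

print("John" in blistered)  # True: John gets blisters on his nose
```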
Historically, IQ tests have focused on this kind of abstract reasoning ability. The result is that perfectly intelligent people often score badly on them, such as the peasants from Soviet Russia recorded in this conversation:
Q: There are no camels in Germany; the city of B is in Germany; are there camels there or not?
A: I don't know, I have never seen German villages. If B is a large city, there should be camels there.
Q: But what if there aren't any in all of Germany?
A: If B is a village, there is probably no room for camels.
Whether such people are unable to do the reasoning, or merely unwilling to engage with silly trick questions about irrelevant subjects, is another question. But ChatGPT can do this kind of reasoning. I tried this exact question with GPT-3.5 and it failed, and it still mostly failed when I rephrased the question to emphasise the hypothetical nature of the premises. GPT-4, however, got it “right”:
Based on the given premises, there are no camels in Germany and the city of B is in Germany. Therefore, there should be no camels in the city of B, since it is located in Germany.
This means that ChatGPT is happy to adopt truths that you tell it, even if they are nonsense. In addition, it has a large number of truths encoded in its “brain”, which it can use to answer questions. The loss of either of these abilities would make it far less useful, and that gives us a problem.
Since it has no inner monologue, the only way for it to do complex iterative reasoning is by thinking out loud, and its own previous thoughts then influence what comes next. This means that it quite often “doubles down” on complete nonsense.
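You can see why this happens from the shape of the chat interface itself: there is no hidden state between turns, so each reply is generated conditioned on the whole transcript, including the model’s own earlier answers, right or wrong. Here’s a sketch of that mechanism using OpenAI’s Python client (this assumes openai >= 1.0 and an API key in the environment; the prompts are just illustrative):

```python
# A sketch of why "doubling down" happens. The model has no memory between
# turns: every earlier message, including its own answers, is resubmitted
# as context for the next reply.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
history = [{"role": "user",
            "content": "Which words in The Three Little Pigs have a D sound?"}]

reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
history.append({"role": "assistant",
                "content": reply.choices[0].message.content})

# The follow-up is answered with the first (possibly nonsensical) answer
# sitting in the context window as if it were established fact.
history.append({"role": "user", "content": "Where exactly are those D sounds?"})
reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
```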
Here’s an example from a conversation where I was attempting to get help for my son’s speech issues:
Please give me a list of games to play to encourage a child to say a D sound. The child has significant speech delay issues.
Sure, here are some games that can help encourage a child to say the "D" sound:
1. "D" Words Memory Game: Make a set of cards with pictures of objects that start with the "D" sound (e.g. dog, door, duck, etc.). Shuffle the cards and lay them out face-down on a table. Take turns flipping over two cards at a time and saying the name of the object. If you get a match, keep the cards. The player with the most cards at the end of the game wins.
2. "D" Sound Scavenger Hunt: Make a list of objects around the house or outdoors that start with the "D" sound (e.g. doll, desk, door, etc.). Have the child go around and find each object on the list, saying the name of the object when they find it.
3. "D" Sound Tongue Twisters: Create some tongue twisters that focus on the "D" sound. For example: "Danny dug a deep ditch" or "Donna's dad drives a Dodge."
4. "D" Sound I Spy: Play a game of "I Spy" where you describe objects in the room that start with the "D" sound (e.g. "I spy something that is brown and has four legs. It's a dog!").
5. "D" Sound Storytime: Read a story that has lots of words that start with the "D" sound (e.g. "The Three Little Pigs" or "Danny and the Dinosaur"). Encourage the child to repeat the words after you.
Most of these are reasonably good ideas. But I was curious about part of suggestion 5:
Where are the D sounds in three little pigs?
Great question! There are many "D" sounds in the story of "The Three Little Pigs." Here are a few examples:
The first pig built his house out of "straw and sticks," which both have the "D" sound.
The second pig built his house out of "sticks," which has the "D" sound.
The wolf "huffed and he puffed" to try and blow down the pigs' houses, and both "huff" and "puff" have the "D" sound.
The third pig built his house out of "bricks," which has the "D" sound.
Reading stories that contain the target sound is a great way to help children identify and practice the sound in context.
(I’m not sure why I find these examples so amusing. I think it is because we’ve all known people who cannot bear to lose an argument and will continue to defend their position long after it has become obviously indefensible. But you’ve never encountered someone doing this with quite the audacity of ChatGPT.)
Is it that ChatGPT has a poor understanding of phonetics? In a different session, I quizzed it as follows:
Please rate the accuracy of the following 4 statements regarding claims that the story of the Three Little Pigs has words with "D" sounds in it:
The first pig built his house out of "straw and sticks," which both have the "D" sound.
The second pig built his house out of "sticks," which has the "D" sound.
The wolf "huffed and he puffed" to try and blow down the pigs' houses, and both "huff" and "puff" have the "D" sound.
The third pig built his house out of "bricks," which has the "D" sound.
I'm sorry, but I cannot rate the accuracy of these statements as they are all incorrect. None of the words mentioned have the "D" sound.
"Straw" and "sticks" do not have the "D" sound.
"Sticks" does not have the "D" sound.
"Huff" and "puff" do not have the "D" sound.
"Bricks" does not have the "D" sound.
Just to be sure it wasn’t luck and to probe its understanding a bit more, I followed up:
Thank you. Please could you modify each of the sentences in turn, changing one or two words or letters so that the sentences become correct.
Certainly! Here are modified versions of the sentences with the corrections:
The first pig built his house out of "straw" and "thick" sticks, which both have the "K" sound.
The second pig built his house out of "thick" sticks, which has the "K" sound.
The wolf "huffed and he puffed" to try and blow down the pigs' houses, and both "huff" and "puff" have the "F" sound.
The third pig built his house out of "bricks," which has the "KS" sound.
So it’s got a pretty good understanding. But in the first conversation, it had poisoned its own well and so carried on producing nonsense to match this hypothetical world in which the story of The Three Little Pigs has lots of D sounds.
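Its corrections also match what a pronunciation dictionary gives. As a sanity check (mine, not anything ChatGPT does), here’s a sketch using NLTK’s CMU Pronouncing Dictionary, which spells the /d/ phoneme as "D" in ARPAbet; it assumes the nltk package is installed and fetches the cmudict corpus on first run:

```python
# A sanity check of the phonetics using NLTK's CMU Pronouncing Dictionary.
# Words missing from the dictionary simply come back False.
import nltk

nltk.download("cmudict", quiet=True)  # one-time corpus download
from nltk.corpus import cmudict

pronunciations = cmudict.dict()  # word -> list of ARPAbet phoneme lists

def has_d_sound(word: str) -> bool:
    """True if any listed pronunciation of the word contains the /d/ phoneme."""
    return any("D" in phones for phones in pronunciations.get(word.lower(), []))

for word in ["straw", "sticks", "huffed", "puffed", "bricks", "dog"]:
    print(word, has_d_sound(word))
# straw/sticks/huffed/puffed/bricks -> False; dog -> True
```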
The interesting thing to me is that there seems to be a trade-off here. Maybe it will improve significantly in both areas, but it seems like the ability to suspend reality in order to act “intelligently” (i.e. apply logical deduction, etc.) is fundamentally in tension with the ability to resist doubling down on complete nonsense. If you care about truth, you’ve got to avoid hypotheticals and fantasy, but we actually need AIs to be good at those things.