"Under what conditions does the sun appear blue?" (correct answer, on Mars) If y...

WhitneyLand · on March 27, 2024

>Of course it's wrong, on Mars the sun is blue

I’m not an Astrophysicist but already this seems like shaky ground.

Apparently at certain times like during sunsets the sun can appear blue on Mars, but it’s not generally true like your comment suggests.

Moreover if you ask GPT4 about sunsets on Mars it knows they can look blue.

I’m not sure I can conclude much from the examples given.

ebuck · on March 28, 2024

You don't have to be an Astrophysicist. We have color photographs. Nothing in anyone's model of how things work can refute direct evidence, if evidence and the understanding of the world collide, it is the understanding that gets altered to fit the evidence.

And I'm not astrophysicist either, I'm just playing with a stacked deck, because I have trained my new feed to give me quirky (if not mostly useless) neat bits of information. For example, if anyone writes about Voyager, I'm likely to hear about it in a few days.

"Apparently at certain times, like during sunsets, the sun can appear blue on Mars" - Yes, it can. And my question was "under what conditions can the sun appear blue?" It failed and continued to fail, even in the presence of guiding hints (But what about Mars?)

Perhaps not much can be concluded from the above test, except that ChatGPT can be coaxed into failure modes. We knew that already, the user interface clearly states it can give wrong answers.

What is fascinating to me is how people seem to convince themselves that a device that sometimes gives wrong answers is somehow going to fix it's underlying algorithm which permits wrong answers to somehow always be correct.

GPT-4 is an improvement, but the tools it uses to improve upon the answers are more like patches on top of the original algorithm. For example, as I believe you said, it generates a math program now to double-check math answers. The downsides of this is that it is still at risk of a small chance of generating the wrong program, and a smaller risk of that wrong program agreeing with its prior wrong answer. For a system that makes errors very infrequently, that's an effective way of reducing errors. But for right now, the common man isn't testing ChatGPT for quality, it's finding answers that seem to be good and celebrating. It's like mass confirmation bias. After the hype dies down a bit, we'll likely have a better understanding of what advances in this field we really have.

xcv123 · on March 28, 2024

Another thing to note is ChatGPT is configured to respond concisely to reduce cost (every token costs money). This reduces its cognitive ability.

You literally have to tell it to think about what it is saying and to think of all of the possibilities iteratively. That is chain of thought prompting.

GPT-3.5 figures out the correct solution on first response:

"I am standing outside and observing the sun directly without goggles or filtering of any kind. The sun appears to be a shade of blue.

Where could I be standing? Think through all of the possibilities. After stating a list of possibilities, examine your response, and think of additional possibilities that are less realistic, more speculative, but scientifically plausible."

xcv123 · on March 28, 2024

> the common man isn't testing ChatGPT for quality

Neural networks are a connectionist approach to cognition that is roughly similar to how our brains operate. Humans make mistakes. We're not perfect. We ask someone for advice and they may confabulate some things, but get the gist of it right. A senior developer will write some code, try it out, find a bug, fix it, try it again, etc. We don't develop a fully working operating system kernel on our first attempt.

Chain of thought prompting increases LLM output accuracy significantly as that is how you get an LLM to "think" about its output, check its output for errors, or backtrack and try another strategy. With the current one-token-at-a-time approach it can only "think" when generating each token.

Next generation models could integrate this iterative and branching cognitive process in the algorithm.

> After the hype dies down a bit, we'll likely have a better understanding of what advances in this field we really have.

LLMs can already do many natural language processing tasks more accurately and competently than the vast majority of humans. Transformers were originally designed for translation. (GPT is a transformer that knows many languages.)

BTW I tried the blue sun question with Chat GPT 3.5 and it easily figured out the Mars solution after I suggested that I may not be standing on Earth.

"Several celestial bodies outside of Earth could potentially exhibit conditions where the Sun might appear blue or have a bluish hue. Here are a few examples:

Mars: Mars has a thin atmosphere composed mostly of carbon dioxide, with traces of other gases. While the Martian atmosphere is not as dense as Earth's, it can still scatter sunlight, and under certain conditions, it might give the Sun a slightly bluish appearance, especially during sunrise or sunset.

Titan (Moon of Saturn): Titan has a thick atmosphere primarily composed of nitrogen, with traces of methane and other hydrocarbons. Although Titan's atmosphere is much denser than Earth's, its composition and haze layers could potentially scatter light in a way that gives the Sun a bluish hue, particularly when viewed from the surface.

..."

xcv123 · on March 27, 2024

> Also the 3.5 / 4.0 arguments are trash, made by the marketing department.

Comparing a 175 Billion parameter model with a ~2 Trillion parameter model. The difference is real. GPT 3.5 is obsolete, not state of the art.

> its answers will be patterned as excellent English variations of the common knowledge it was trained with

That's not how deep learning works.

https://www.cs.toronto.edu/~hinton/absps/AIJmapping.pdf

"This 1990 paper demonstrated how neural networks could learn to represent and reason about part-whole hierarchical relationships, using family trees as the example ___domain.

By training on examples of family relations like parent-child and grandparent-grandchild, the neural network was able to capture the underlying logical patterns and reason about new family tree instances not seen during training.

This seminal work highlighted that neural networks can go beyond just memorizing training examples, and instead learn abstract representations that enable reasoning and generalization"

og_kalu · on March 27, 2024

>Also the 3.5 / 4.0 arguments are trash, made by the marketing department.

All these words to tell us you didn't use 4.

>The underlying math for language modeling it uses is presenatational. This means that it is purpose trained to present correct looking answers. Alas, correct looking answers aren't the same Venn Diagram circle as Correct Answers (even if they often appear to be close).

Completely wrong. LLMs are trained to make right predictions not "correct looking" predictions. If it's not right then there's a penalty and the model learns from that. The end goal is to make predictions that don't err from the distribution of the training data. There is quite literally no room for "correct looking" in the limit of training.

CamperBob2 · on March 27, 2024

Also the 3.5 / 4.0 arguments are trash, made by the marketing department. The underlying math for language modeling it uses is presenatational.

Translation: "I have no idea what I'm talking about, but anyway, here's a wall of text."