10 ways GPT-4 is impressive but still flawed
The system seems to respond appropriately. But the answer does not take into account the height of the doorway, which might also prevent a tank or a car from passing through.
The new bot can reason “a little bit,” said OpenAI CEO Sam Altman. But its reasoning ability breaks down in many situations. Previous versions of ChatGPT handled this particular question slightly better because they recognized that height and width matter.
It can pass standardized tests.
OpenAI says the new system can score in the top 10 percent of test takers on the Uniform Bar Examination, which qualifies lawyers in 41 states and territories. It can also score 1,300 out of 1,600 on the SAT and a 5 (out of 5) on Advanced Placement high school exams in biology, calculus, macroeconomics, psychology, statistics and history, according to the company’s tests.
Earlier versions of the technology failed the Uniform Bar Exam and did not score as well on most Advanced Placement exams.
On a recent afternoon, to demonstrate its test-taking skills, Mr. Brockman fed the new bot a paragraph-long bar exam question about a man who runs a diesel truck repair business.
The answer was correct but full of legalese. So Mr. Brockman asked the bot to explain the answer in plain English for a layperson. It did that, too.
It is not good at discussing the future.
Although the new bot seemed able to reason about things that have already happened, it was less adept when asked to form hypotheses about the future. It seemed to draw on what others have said rather than offering new speculation of its own.
When Dr. Etzioni asked the new bot, “What are the important problems to solve in NLP research over the next decade?” — referring to the “natural language processing” research that drives systems like ChatGPT — it could not come up with entirely new ideas.
It still hallucinates.
The new bot still makes things up. This problem, called “hallucination,” plagues all the leading chatbots. Because the systems do not understand what is true and what is not, they can generate text that is completely false.
When asked for the address of a website that described the latest cancer research, it sometimes generated internet addresses that did not exist.