Gemini 2.5 Pro is certainly a tipping point for me. Previous LLMs have been very impressive, especially on coding tasks (unsurprising as the answers to these have a preponderance of publicly available data). But outside of a coding assistant, LLMs until now felt like an extra helpful and less garbage-filled Google search.
I just used 2.5 Pro to help write a large research proposal (with significant funding on the line). Without going into detail, it felt to me like the only reason it couldn’t write the entire thing itself is because I didn’t ask it to. And by “ask it”, I mean: enter into the laughably small chat box the entire grant solicitation + instructions, a paragraph of general direction for what I want to explore, and a bunch of unstructured artifacts from prior work, and turn it loose. I just wasn’t audacious enough to try that from the start.
But as the deadline approached, I got more and more unconstrained in how far back I would step and let it take the reins - doing essentially what’s described above but on isolated sections. It would do pretty ridiculously complex stuff, like generate project plans and timelines, cross-reference them correctly with other sections of text, etc. I can safely say it was a 10x force multiplier, and that’s being conservative.
For scientific questions (ones that should have publicly available data, not ones relying on internal data), I have started going to 2.5 Pro over senior experts on my own team. And I’m convinced at this point if I were to connect our entire research data corpus to Gemini, that balance would shift even further. Why? Because I can trust it to be objective - not inject its own political or career goals into its answers.
I’m at the point where I feel the main thing holding back “AGI” is people’s audacity to push its limits, plus maybe context windows and compute availability. I say this as someone who’s been a major skeptic up until this point.
> For scientific questions (ones that should have publicly available data, not ones relying on internal data), I have started going to 2.5 Pro over senior experts on my own team.
Have you asked any of your experts to double check those bot answers to see how it did?
Yep
LLMs at this point are stateless calculators without personal experience, life goals, obligations, etc. Until recently people expected AI to be a character like the Terminator or HAL. Now we have intelligence separate from 'soul'. Can a calculator be AGI? It can be Artificial, General, and Intelligent. We may need another word for a 'creature' with some features of a living being.
The term AI has always bothered me for this reason. If the thing is intelligent, then there’s nothing artificial about it… it’s almost an oxymoron.
There are two subtly different definitions in use: (1) “like intelligence in useful ways, but not actually”, and (2) “actually intelligent, but not of human wetware”. I take the A in AGI to be of type (2).
LLMs are doing (1), right now. They may have the “neurological structure” required for (2), but to make a being General and Intelligent it needs to compress its context window and persist it to storage every night as it sleeps. It needs memory and agency. It needs to be able to learn in real time and self-adjust its own weights. And if it’s doing all that, then who is to say it doesn't have a soul?
> If the thing is intelligent, then there’s nothing artificial about it… it’s almost an oxymoron.
Artificial means human made, if we made a thing that is intelligent, then it is artificial intelligence.
It is like "artificial insemination" means a human designed system to inseminate rather than the natural way. It is still a proper insemination, artificial doesn't mean "fake", it just means unnatural/human made.
Well, you and I agree, but there’s an entire industry and pop culture throwing the term around rather imprecisely (calling LLMs “AI”), which makes actual discussion about what AGI is difficult.
I guess I don’t understand the technical difference between AI and AGI and consider AI to refer to the social meme of “this thing kinda seems like it did something intelligent, like magic”.
> Artificial means human made, if we made a thing that is intelligent, then it is artificial intelligence.
Aren't humans themselves essentially human made?
Maybe a better definition would be non-human (or inorganic if we want to include intelligence like e.g. dolphins)?
> Aren't humans themselves essentially human made?
No, not in the sense in which the word "made" is being used here.
> Maybe a better definition would be non-human (or inorganic if we want to include intelligence like e.g. dolphins)?
Neither of these work. Calling intelligence in animals "artificial" is absurd, and "inorganic" arbitrarily excludes "head cheese" style approaches to building artificial intelligence.
"Artificial" strongly implies mimicry of something that occurs naturally, and is derived from the same root as "artifice", which can be defined as "to construct by means of skill or specialized art". This obviously excludes the natural biological act of reproduction that produces a newborn human brain (and support equipment) primed to learn and grow; reportedly, sometimes people don't even know they're pregnant until they go into labor (and figure out that's what's happening).
If I asked my wife if she made our son, she would say yes. It is literally called "labour". Then there is "emotional labour" that lasts for 10 years to do the post-training.
I drove my car to work today, and while I was at work I drove a meeting. Does this mean my car is a meeting? My meeting was a car?
It turns out that some (many, in fact) words mean different things in different contexts. My comment makes an explicit argument concerning the connotations and nuances of the word "made" used in this context, and you have not responded to that argument.
Judging by this response, I’m guessing you don’t have children of your own. Otherwise you might understand the context.
Your guess is wrong!
Maybe you should have written a substantive response to my comments instead of trying and failing to dunk on me. Maybe you don't understand as much as you think you do.
I honestly don’t care enough to even have even remotely thought about my reply as trying to dunk on anything. You’re awfully jacked up for a comment so far down an old thread that you and I are probably the only ones who will ever read it.
Okay!
> Aren't humans themselves essentially human made?
Humans evolved, but yeah the definition can be a bit hard to understand since it is hard to separate things. That is why I brought up the artificial insemination example since it deals with this.
> Maybe a better definition would be non-human (or inorganic if we want to include intelligence like e.g. dolphins)?
We also have artificial lakes, they are inorganic but human made.
"ii" (inorganic intelligence) has a better ring to it than AI and can also be stylized as "||" which means OR.
Is your research AI-, or more generally, CS-related? Because I feel that it is still quite bad (by researcher standards) in physics for example.
Strongly agreed. I used Gemini 2.5 Pro over the weekend to build an entire website + backend system in 8 hours, and it would have taken me over a week to get to the same place myself. My total bill for the entire thing? $10.
I am using Gemini 2.5 Flash for analyzing screenshots of webpages as part of it. Total cost for that (assuming I did my math right)? $0.00002/image.
It's actually pretty good at that, especially layout/design.
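For anyone curious what that kind of screenshot analysis might look like, here is a rough sketch (not the commenter's actual code) using the google-genai Python SDK; the model string, prompt, and file name are my assumptions:

    from google import genai
    from google.genai import types

    # Sketch: send a webpage screenshot to Gemini 2.5 Flash and ask about its layout.
    # The API key handling, model name and prompt below are assumptions, not the poster's setup.
    client = genai.Client(api_key="YOUR_API_KEY")

    with open("screenshot.png", "rb") as f:
        image_bytes = f.read()

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            "Describe this page's layout: header, navigation, main content, footer.",
        ],
    )
    print(response.text)

If the quoted per-image figure holds, a run over a few hundred screenshots like this would cost on the order of a cent.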
The problem is, the code it produces is usually not great: it's inconsistent and has subtle bugs. That quickly becomes a problem if you want to change things later, especially while keeping your data consistent and your APIs stable and backwards compatible. At least that's my experience.
But for building something that you can easily throw away later, it's pretty good and saves a lot of time.
Did you get funding in the end?
> Why? Because I can trust it to be objective - not inject its own political or career goals into its answers.
And that is basically why humanity is doomed.
And yet it fails with every second refactoring that I ask it to do in a moderately complicated codebase. What am I doing wrong?
Might be because you are an expert in what you ask it to do, and actually care about the result. E.g. I'm not sure what a marketing or other business professional would say about the work it did on the cheese business. What has caught my eye is that the projected cost of doing business (salaries) is unrealistically low, especially as the volumes are expected to grow.
IMHO it basically is judgment day: all the people boasting about 100x productivity and "chatgpt basically replaced my job and/or colleagues" had bullshit jobs
LLMs have had broad adoption since about two years ago: my GF and sister have both used previous iterations to write (successful) grant applications, so "AGI" has been here since ChatGPT's initial release if that is the metric (or baby AGI). I view these as a novel way of "re-configuring" and "re-mixing" human knowledge, and that is a BIG DEAL! Also, I am not sure I agree that "people's lack of audacity" is holding back LLMs from achieving "AGI": Dario, Demis, and Sam Altman are promising the end of disease and death in the next 2 to 10 years! And those are some audacious claims (even if they come to pass).
The capabilities of AI post gpt3 have become extraordinary and clearly in many cases superhuman.
However (as the article admits) there is still no general agreement of what AGI is, or how we (or even if we can) get there from here.
What there is is a growing and often naïve excitement that anticipates it as coming into view, and unfortunately that will be accompanied by the hype-merchants desperate to be first to "call it".
This article seems reasonable in some ways but unfortunately falls into the latter category with its title and sloganeering.
"AGI" in the title of any article should be seen as a cautionary flag. On HN - if anywhere - we need to be on the alert for this.
I think a reasonable definition of intelligence is the application of reason on knowledge. An example of a system that is highly knowledgeable but has little to no reason would be an encyclopedia. An example of a system that is highly reasonable, but has little knowledge would be a calculator. Intelligent systems demonstrate both.
Systems that have general intelligence are ones that are capable of applying reason to an unbounded domain of knowledge. Examples of such systems include: libraries, wikis, and forums like HN. These systems are not AGI, because the reasoning agents in each of these systems are organic (humans); they are more like a cyborg general intelligence.
Artificial general intelligence are just systems that are fully artificial (ie: computer programs) that can apply reason to an unbounded domain of knowledge. We're here, and we have been for years. AGI sets no minimum as to how great the reasoning must be, but it's obvious to anyone who has used modern generative intelligence systems like LLMs that the technology can be used to reason about an unbounded domain of knowledge.
If you don't want to take my word for it, maybe Peter Norvig can be more convincing: https://www.noemamag.com/artificial-general-intelligence-is-...
I think the thing missing would be memory. The knowledge of current models is more or less static, save for whatever you can cram into their context window. I think if they had memory, and thus the ability to learn (“oh hey, I’ve already tried to solve this bug in these ways, maybe I won’t get stuck in a loop on them!”), that would be the AGI push for me. Real-time incorporation of new knowledge into the model is the missing piece.
It needs some kind of low-latency world model. Or at least more agent-inspired pretraining.
(I did x and it failed, I did y and It failed, I should try z now) GOOD
(I did x and it failed, I did y and it failed, I should try x now) BAD
With MCP/tool use you can tell it to save state into an MD file, simulating this. How much that counts is left as an exercise to the reader.
It actually works really well in my experience. But it eats up context. Using 500-600k token windows per inquiry isn’t cheap.
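For concreteness, here is a minimal sketch of the "save state into an MD file" idea from above: two plain Python functions an agent could call as tools to persist notes between sessions and reload them into context. The file name and note format are arbitrary assumptions on my part, not any particular MCP server's API.

    from datetime import datetime, timezone
    from pathlib import Path

    NOTES = Path("agent_memory.md")  # hypothetical scratch file for the agent

    def save_note(note: str) -> str:
        """Append a timestamped bullet to the Markdown memory file."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        with NOTES.open("a", encoding="utf-8") as f:
            f.write(f"- [{stamp}] {note}\n")
        return "saved"

    def load_notes() -> str:
        """Return everything saved so far, to be pasted back into the context window."""
        return NOTES.read_text(encoding="utf-8") if NOTES.exists() else ""

    # Example: record what was already tried, then reload it at the start of the next session.
    save_note("Tried fix A for the parser bug; tests still fail.")
    print(load_notes())

As the comment above notes, whatever gets reloaded still has to fit in (and be paid for as) context.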
Excellent article and analysis. Surprised I missed it.
It is very hard to argue with Norvig’s arguments that AGI has been around since at least 2023.
It's not: however you define AGI, you cannot just ignore the key letter of the three-letter acronym: G stands for “General”.
You can argue that for the first time in history we have an AI that deserves its name (unlike Deep Blue or AlphaGo, which aren't really about intelligence at all), but you cannot call it Artificial GENERAL Intelligence before it overcomes the “jagged intelligence” syndrome.
It sounds like you have a different definition of "general" in the context of intelligence from the one I shared. What is it?
General intelligence means it can do the same intellectual tasks as humans can, including learning to do different kinds of intellectual jobs. Current AI can't learn to do most jobs like a human kid can, so its not AGI.
This is the original definition of AGI. Some data scientists try to move the goalposts to something else and call something that can't replace humans "AGI".
This is a very simple definition, and it's easy to see when it is fulfilled, because then companies can operate without humans.
What intellectual tasks can humans do that language models can't? Particularly agentic language model frameworks.
Weird spiky things that are hard to characterise even within one specific model, and where the ability to reliably identify such things itself causes subsequent models to not fail so much.
A few months ago, I'd have said "create image with coherent text"*, but that's now changed. At least in English — trying to get ChatGPT's new image mode to draw the 狐 symbol sometimes works, sometimes goes weird in the way latin characters used to.
* if the ability to generate images doesn't count as "language model" then one intellectual task they can't do is "draw images", see Simon Willison's pelican challenge: https://simonwillison.net/tags/pelican-riding-a-bicycle/
A normal software engineering job? You have access to email and can send code etc. No current model manages anything close to that. Even much simpler jobs can't be automated like that by them.
So basically any form of longer-term task cannot be done by them currently. Short-term tasks with constant supervision are about the only things they can do, and that is very limited; most tasks are long-term tasks.
> You have access to email and can send code etc. No current model manages anything close to that.
This is an issue of tooling, not intelligence. Language models absolutely have the power to process email and send (push?) code, should you give them the tooling to do so (also true of human intelligence).
> So basically any form of longer term tasks cannot be done by them currently. Short term tasks with constant supervision is about the only things they can do, and that is very limited, most tasks are long term tasks.
Are humans that have limited memory due to a condition not capable of general intelligence, xor does intelligence exist on a spectrum? Also, long term tasks can be decomposed into short term tasks. Perhaps automatically, by a language model.
Have you actually tried agentic LLM based frameworks that use tool calling for long term memory storage and retrieval, or have you decided that because these tools do not behave perfectly in a fluid environment where humans do not behave perfectly either, that it's "impossible"?
> Have you actually tried agentic LLM based frameworks that use tool calling for long term memory storage and retrieval, or have you decided that because these tools do not behave perfectly in a fluid environment where humans do not behave perfectly either, that it's "impossible"?
i.e. "Have you tried this vague, unnamed thing that I alude to that seems to be the answer that contradicts your point, but actually doesn't?"
AGI = 90% of software devs, psychotherapists, lawyers, teachers lose their jobs, we are not there.
Once LLMs can fork themselves, reflect and accumulate domain-specific knowledge and transfer the whole context back to the model weights, once that knowledge can become more important than the pretrained information, once they can form new neurons related to a project topic, then yes, we will have AGI (probably not that far away). Once LLMs can keep trying to find a bug for days and weeks and months, go through the debugger, ask people relevant questions, deploy code with new debugging traces, deploy mitigations and so on, we will have AGI.
Otherwise, AI is stuck in this Groundhog Day type scenario, where it's forever the brightest intern that any company has ever seen, but it's forever stuck at day 0 on the job, forever not that useful, but full of potential.
Why would it be a tooling issue? AI has access to email, IDEs, and all kinds of systems. It still cannot go and build software on its own by speaking to stakeholders, taking instructions from a PM, understanding it needs to speak to DevOps to release its code, suggesting to product team that feature is better developed as part of core product, objecting to SA about the architecture, and on and on…
(If it was a tooling issue, AGI could build the missing tools)
> This is an issue of tooling, not intelligence. Language models absolutely have the power to process email and send (push?) code, should you give them the tooling to do so (also true of human intelligence).
At a certain point, a tooling issue becomes an intelligence issue. AGI would be able to build the tools they need to succeed.
If we have millions of these things deployed, they can work 24/7, and they supposedly have human-level intelligence, then why haven't they been able to bootstrap their own tooling yet?
> Have you actually tried agentic LLM based frameworks that use tool calling for long term memory storage and retrieval,
You can work around the limitations of LLMs' intelligence with your own and an external workflow you design, but I don't see how that counts as part of the LLM's intelligence.
Humans have general intelligence. A network of humans have better general intelligence.
LLMs have general intelligence. A network of LLMs have better general intelligence.
If a single language model isn't intelligent enough for a task, but a human is, there is a good chance there exists a sufficient network of language models that is intelligent enough.
> LLMs have general intelligence.
No they don't. That's the key part you keep assuming without justification. Interestingly enough you haven't responded to my other comment [1].
You asked “What intellectual tasks can humans do that language models can't?” and now that I'm thinking about it again, I think the more apt question would be the reverse:
“What intellectual tasks can a LLM do autonomously without any human supervision (direct or indirect[2]) if there's money at stake?”
You'll see that the list is going to be very shallow if not empty.
> A network of LLMs have better general intelligence.
Your argument was about tool calling for long-term memory; this isn't “a network of LLMs” but an LLM plus another tool chosen by a human to deal with the LLM's limitations on one particular problem (and if you need long-term memory for another problem you're very likely to need to rework both your prompt and your choice of tools to address it: it's not the LLM that solves it but your own intelligence).
[1]: https://news.ycombinator.com/item?id=43755623
[2]: indirect supervision would be the human designing an automatic verification system to check LLM outputs before using them. Any kind of verification that is planned in advance by the human and not improvised by the LLM when facing the problem counts as indirect supervision, even if it relies on another LLM.
Read a bunch of books not present in the training data on a specific topic, and learn something from it.
You can cheat with tooling like RAG or agentic frameworks, but the result isn't going to be good and it's not the AI that learns.
But besides this fundamental limitation, had you tried implementing production-ready stuff with LLMs, you'd have discovered that language models are still painfully unreliable even for the tasks they are supposed to be good at: they will still hallucinate when summarizing, fail to adhere to the prompt, add paragraphs in English at random when working in French, edit unrelated parts of the code you ask them to edit, etc, etc.
You can work around many of those once you've identified it, but that still counts as a fail in a response to your question.
We don't have this for humans either, other than the vague pseudoscience of IQ. As I've travelled more, and just grown older in general, I've come to accept a lot of humans as barely sentient.
More like all humans, most of the time.
Actual sentience takes energy that our brain really doesn't like to use. It hardcodes switch statements for behaviours as fast as it can and then coasts until something doesn't match.
Until you can boot one up, give it access to a VM video and audio feeds and keyboard and mouse interfaces, give it an email and chat account, tell it where the company onboarding docs are and expect them to be a productive team member, they're not AGI. So long as we need special protocols like MCP and A2A, rather than expecting them to figure out how to collaborate like a human, they're not AGI.
The first step, my guess, is going to be the ability to work through github issues like a human, identifying which issues have high value, asking clarifying questions, proposing reasonable alternatives, knowing when to open a PR, responding to code review, merging or abandoning when appropriate. But we're not even very close to that yet. There's some of it, but from what I've seen most instances where this has been successful are low level things like removing old feature flags.
Just because we rely on vision to interface with computer software doesn't mean it's optimal for AI models. Having a specialized interface protocol is orthogonal to capability. Just like you could theoretically write code in a proportional font with notepad and run your tools through windows CMD - having an editor with syntax highlighting and monospaced font helps you read/navigate/edit, having tools/navigation/autocomplete etc. optimized for your flow makes you more productive and expands your capability, etc.
If I forced you to use unnatural interfaces it would severely limit your capabilities as well, because you'd have to dedicate more effort towards handling basic editing tasks. As someone who recently swapped to a split 36-key keyboard with a new layout, I can say this becomes immediately obvious when you try something like this. You take your typing/editing skills for granted - try switching your setup and see how your productivity/problem solving ability tanks in practice.
Agreed, but I also think to be called AGI, they should be capable of working through human interfaces rather than needing to have special interfaces created for them to get around their lack of AGI.
The catch in this though isn't the ability to use these interfaces. I expect that will be easy. The hard part will be, once these interfaces are learned, the scope and search space of what they will be able to do is infinitely larger. And moreover our expectations will change in how we expect an AGI to handle itself when our way of working with it becomes more human.
Right now we're claiming nascent AGI, but really much of what we're asking these systems to do have been laid out for them. A limited set of protocols and interfaces, and a targeted set of tasks to which we normally apply these things. And moreover our expectations are as such. We don't converse with them as with a human. Their search space is much smaller. So while they appear AGI in specific tasks, I think it's because we're subconsciously grading them on a curve. The only way we have to interact with them prejudices us to have a very low bar.
That said, I agree that video feed and mouse is a terrible protocol for AI. But that said, I wouldn't be surprised if that's what we end up settling on. Long term, it's just going to be easier for these bots to learn and adapt to use human interfaces than for us to maintain two sets of interfaces for things, except for specific bot-to-bot cases. It's horribly inefficient, but in my experience efficiency never comes out ahead with each new generation of UIs.
> Just because we rely on vision to interface with computer software doesn't mean it's optimal for AI models
This is true but AGI means "Artificial General Intelligence". Perhaps it would be even more efficient with certain interfaces, but to be general it would have to at least work with the same ones as humans.
Here's some things that I think a true AGI would need to be able to do:
* Control a general purpose robot and use vision to do housework, gardening etc.
* Be able to drive a car - equivalent interfaces to humans might be servo-motor-controlled inputs.
* Use standard computer inputs to do standard computer tasks
And this list could easily be extended.
If we have to be very specific in the choice of interfaces and tasks that we give it, it's not a general AI.
At the same time, we have to be careful at moving the goalposts too much. But current AI are limited to what can be returned in a small number of interfaces (prompt with text/image/video & return text/image/video data). This is amazing, they can sound very intelligent while doing so. But it's important not to lose sight of what they still can't do well which is basically everything else.
Outside of this area, when you do hear of an AI doing something well (self driving, for example) it's usually a separate specialized model rather than a contribution towards AGI.
By this logic disabled people would not class as "Generally Intelligent" because they might have physical "interface" limitations.
Similarly I wouldn't be "Generally Intelligent" by this definition if you sat me at a Cyrillic or Chinese keyboard. For this reason, I see human-centric interface arguments as a red herring.
I think a better candidate definition might be about learning and adapting to new environments (learning from mistakes and predicting outcomes), assuming reasonable interface aids.
> Similarly I wouldn't be "Generally Intelligent" by this definition if you sat me at a Cyrillic or Chinese keyboard
Would you be able to be taught to use those keyboards? Then you're generally intelligent. If you could not learn, then maybe you're not generally intelligent?
Regarding disabled people, this is an interesting point. Assuming that we're talking about physical disabilities only, disabled people are capable of learning how to use any standard human inputs. It's just the physical controls that are problematic.
For an AI, the physical input is not the problem. We can just put servo motors on the car controls (steering wheel, brakes, gas) and give it a camera feed from the car. Given those inputs, can the AI learn to control the car as a generally intelligent person could, given the ability to use the same controls?
If all we needed was general intelligence, we would be hiring octopuses. Human skills, like fluency in specific languages, are implicit in our concept of AGI.
So I am a blind human. I cannot drive a car or use a camera/robot to do housework (I need my hands to see!) Am I not a general intelligence?
I replied this to another comment, but I'll put it here: your limitation is physical. You have standard human intelligence, but you're lacking a certain physical input (vision). As a generally intelligent being, you will compensate for the lack of vision by using other senses.
That's different to AIs, which we can hook up to all kinds of inputs: cameras, radar, lidar, car controls, etc. For the AI the lack of input is not the limitation. It's whether they can do anything with an arbitrary input/control, like a servo motor controlling a steering wheel, for example.
To look at it another way, if an AI can operate a robot body by vision, then we suddenly removed the vision input and replaced it with a sense of touch and hearing, would the AI be able to compensate? If it's an AGI, then it should be able to. A human can.
On the other hand, I wonder if we humans are really as "generally intelligent" as we like to think. Humans struggle to learn new languages as adults, for example (something I can personally attest to, having moved to Asia as an adult). So, really, are human beings a good standard by which to judge an AI as AGI?
> Just because we rely on vision to interface with computer software doesn't mean it's optimal for AI models.
It's optimal for beings that have general purpose inteligence.
> would severely limit your capabilities as well because you'd have to dedicate more effort towards handling basic editing tasks
Yes, but humans will eventually get used to it and internalize the keyboard, the domain language, idioms and so on, and their context gets pushed to long-term knowledge overnight and their short-term context gets cleaned up and they get better and better at the job, day by day. AI starts very strong but stays at that level forever.
When faced with a really hard problem, day after day the human will remember what he tried yesterday and parts of that problem will become easier and easier for the human, not so for the AI, if it can't solve a problem today, running it for days and days produces diminishing returns.
That's the General part of human intelligence -- over time it can acquire new skills it did not have yesterday. LLMs can't do that -- there is no byproduct of them getting better/acquiring new skills as a result of their practicing a problem.
Right, and also the ability to know when it's stuck. It should be able to take a problem, work on it for a few hours, and if it decides it's not making progress it should be able to ping back asynchronously, "Hey I've broken the problem down into A, B, C, and D, and I finished A and B, but C seems like it's going to take a while and I wanted to make sure this is the right approach. Do you have time to chat?" Or similarly, I should be able to ask for a status update and get this answer back.
> It's optimal for beings that have general purpose inteligence [Sic].
Hi. I'm blind. I would like to think I have general-purpose intelligence thanks.
And I can state that interfacing with vision would, in fact, be suboptimal for me. The visual cortex is literally unformed. Yet somehow I can perform symbolic manipulations. Converse with people. Write code. Get frustrated with strangers on the Internet. Perhaps there are other "optimal" ways that "intelligent" systems can use to interface with computers? I don't know, maybe the accessibility APIs we have built? Maybe MCP? Maybe any number of things? Data structures specifically optimized for the purpose and exchanged directly between vastly-more-complex intelligences than ourselves? Do you really think that clicking buttons through a GUI is the one true optimal way to use a computer?
Of course not. The visual part is window dressing on the argument. The real point is, before declaring AGI, I think the way we interact with these agents needs to be more like human to human interaction. Right now, agents generally accept a command, figure out which from a small number of MCPs that have been precoded for it to use, do that thing you wanted right or wrong, the end. If it does the right thing, huge confirmation bias that it's AGI. Maybe the MCP did most of the real work. If it doesn't, well, blame the prompt or maybe blame the MCPs are lacking good descriptions or something.
To get a solid read on AGI, we need to be grading them in comparison to a remote coworker. That they necessarily see a GUI is not required. But what is required is that they have access to all the things a human would, and don't require any special tools that limit their search space to a level below what a human coworker would have. If it's possible for a human coworker to do their whole job via console access, sure, that's fine too. I only say GUI because I think it'd actually be the easiest option, and fairly straightforward for these agents. Image processing is largely solved, whereas figuring out how to do everything your job requires via console is likely a mess.
And like I said, "using the computer", whether via GUI or screen reader or whatever else, isn't going to be the hard part. The hard part is, now that they have this very abstract capability and astronomically larger search space, it changes the way we interact with them. We send them email. We ping them on Slack. We don't build special baby mittens MCPs and such for them and they have to enter the human world and prove that they can handle it as a human would. Then I would say we're getting closer to AGI. But as long as we're building special tools and limiting their search space to that limited scope, to me it feels like we're still a long way off.
> Do you really think that clicking buttons through a GUI is the one true optimal way to use a computer?
There are some tasks you can't do without vision, but I agree it is dumb to say general intelligence requires vision, vision is just an information source it isn't about intelligence. Blind people can be excellent software engineers etc they can do most white collar work just as well as anyone else since most tasks doesn't require visual processing, text processing works well enough.
> There are some tasks you can't do without vision...
I can't think of anything where you require vision that having a tool (a sighted person) you protocol with (speak) wouldn't suffice. So why aren't we giving AI the same "benefit" of using any tool/protocol it needs to complete something.
> I can't think of anything where you require vision that having a tool (a sighted person) you protocol with (speak) wouldn't suffice.
Okay, are you volunteering to be the guide passenger while I drive?
Thank you for making my point:
We have created a tool called "full self driving" cars already. This is a tool that humans can use, just like we have MCPs, a tool for AI to use.
All I'm trying to say, is AGIs should be allowed to use tools that fit their intelligence the same way that we do. I'm not saying AIs are AGIs, I'm just saying that the requirement that they use a mouse and keyboard is a very weird requirement like saying People who can't use a mouse and keyboard (amputees, etc.) aren't "Generally" intelligent. Or people who can't see the computer screen.
You can't do that with most of the world's human population. Does that imply that most humans haven't reached AGI?
For two years during covid that's how people worked and how kids attended school. So I'd say generally using a computer to communicate, understand expectations, and perform for complex tasks is something well within most humans' ability.
Where A stands for artificial, I don't think most humans have "reached" that, no.
You presumably understand the posters underlying point though - that the definition of 'general intelligence' does not need to be 'at above-average human level' and humans can be intelligent without being able to use a computer or do some sort of job on a VM.
This is an incredibly specific test/definition of AGI - particularly remembering that I would probably say an octopus classes as an intelligent being yet can't use outlook...
This is the forum that fell the hardest for the superconductor hoax a few years ago. HN has no superiority leg to stand on.
LK-99? The room temperature superconductor made by some guys in a small lab in Korea after 20 years of attempts - who doesn't want that to be real?
Won't say people fell for it though, it was just the current thing at the time.
> who doesn't want that to be real.
I think that's exactly the point the person you're responding to is calling out. That's a massive bias.
Uh… what do you call mass downvoting anything reasonably skeptical? I saw comments saying “I’ll wait until the data can be replicated to believe it” turn grey almost immediately on most threads. And after we knew it didn’t replicate it took the community the better part of a week to grieve and admit that the “it’s not a superconductor” results were real and not operator error.
Forums shed and gain members over time. Much of that cohort is gone.
If anything it's been replaced with a far more naive and gullible cohort, not a more skeptical one.
The exact definition of AGI is pretty much the least interesting thing about AGI. It's basically bike-shedding at this point: arguing about something easy to understand instead of tackling the really hard questions like "how competent can AI get before it's too dangerous to be in the hands of flakey tech companies?"
> how competent can AI get before it's too dangerous to be in the hands of flakey tech companies?
Ever heard of Pandora's Box? Yeah. That ship has sailed. No moratorium you could enact would, at this point, stop the innovation from happening, possibly even independently by multiple teams globally. Economic incentives are stacked in such a way that flakey tech companies will prioritise shareholder value over anything else. Whatever comes next will come, and all we can do is lean back and enjoy the show.
Given the gigantic amount of compute power and storage needed to train and run LLMs, this is certainly not true. It is absolutely feasible for government to check every data center capable of advancing the state of the art in AI to ensure that no such research is taking place.
Of course, the political will to do so doesn't exist to even a tiny extent. But if such a will existed, it would be far easier to enforce than the prevention of human cloning, and that one has been successfully implemented for decades now.
If you just bow out of the AI race you are handing it to other countries where practices might not be subjected to the same standards. It's suicide to do this.
That's only if countries can't agree on this. I just gave the example of human cloning, which has been banned globally. You can also look at nuclear non-proliferation, which has been largely successful (though not completely) despite huge incentives for any country to defy it.
Now that’s a big if.
Given the current state of the world, do you really think the USA, China, India, Iran, Brazil, North Korea, and Russia, would all have the same opinion on the danger of AI systems and would—despite very obvious and tangible strategic advantages—all halt development for humanity’s sake?
Human cloning is an issue that is mostly academic in nature, but I’d bet everything I have that bioengineers all over the world secretly are researching this on government programmes, and nuclear non-proliferation is a joke. It was essentially about stripping Russia of its nukes, but all global powers still have them, and countries like Iran, North Korea, and India measure their development on the possession of nuclear weapons. It was successful only if by success you mean the USA didn’t maintain their minuteman program.
It's only the USA, Russia, France, China, the UK, India, Pakistan, Israel, and North Korea that have nuclear weapons, from the entire world. Iran has been working on it for decades and they still ultimately haven't gotten them. This is a huge success given the gigantic advantage nuclear weapons give strategically.
Superhuman is an interesting term because it's very vague. In some sense, computers have always been superhuman. I don't know anyone who could do a 10 by 10 digit multiplication in a fraction of a second yet even the earliest computers were capable of that. Does that make them superhuman?
Today's LLMs are definitely more impressive than a basic calculator but it's still hard to tell if there's anything human about what they're doing or if they're just really powerful calculators with amazing understanding and recall. Does that distinction even matter?
AGI is an anonymous good model coming around the corner with no company and no LLM researchers attached. AGI is when the LLM hype train threads are replaced with CEOs and laid-off researchers demanding UBI.
It's easy to treat AGI as one thing — I did so myself before everyone's differing reaction to LLMs made me realise we all mean different things by each of the three letters of the initialism, and that none of those initials are really boolean valued.
Given how Dutch disease[0] is described, I suspect that if the "G" (general) increases with fixed "I" (intelligence), as the proportion of economic activity for which the Pareto frontier is AI rather than human expands, humans will get pay rises for the remaining work right up until they become unemployable.
On the other hand, if "G" is fully general and it's "I" which rises for a suitable cost[1], it goes through IQ 55 (displacing no workers) to IQ 100 (probably close to half of workers redundant, but mean of population doesn't have to equal mean of workforce), to IQ 145 (almost everyone redundant), to IQ 200 (definitionally renders everyone redundant).
[0] https://en.wikipedia.org/wiki/Dutch_disease
[1] A fully-general AGI with the equivalent of IQ 200 on any possible test, still can't replace a single human if it costs 200 trillion USD per year to run.
Yeah, formal agreement seems exceedingly unlikely, since there isn’t even agreement on the definition of “Artificial Intelligence”.
Even worse, there isn’t even a working definition of "Intelligence"—neither in computer science nor biology.
This is why I suspect that slapping “Artificial” in the acronym is rather forward of us as a species. If we do end up eventually with something we consider intelligent, there won’t be anything artificial about it.
Artificial means human made, if we made an intelligent being then it is artificial. What do you think artificial meant here?
> clearly in many cases superhuman
In what cases is it superhuman exactly? And what humans are you comparing against?
I'd bet that for any discipline you chose, one could find an expert in that field that can trick any of today's post-gpt3 ais.
I'd bet that for any discipline you chose, one could find an AI that can trick any of today's post-enlightenment humans.
I suspect AGI will be one of those things that you can't describe it exactly, but you'll know it when you see it.
> but you'll know it when you see it.
I agree, but with the caveat that it's getting harder and harder with all the hype / doom cycles and all the goalpost moving that's happening in this space.
IMO if you took gemini2.5 / claude / o3 and showed it to people from ten / twenty years ago, they'd say that it is unmistakably AGI.
There's no way to be sure in either case, but I suspect their impressions of the technology ten or twenty years ago would be not so different from my experience of first using LLMs a few years ago...
Which is to say complete amazement followed quickly by seeing all the many ways in which it absolutely falls flat on its face revealing the lack of actual thinking, which is a situation that hasn't fundamentally changed since then.
Yes, that is the same feeling I have. Giving it some JSON and describing how a website should look? Super fast results and amazing capabilities. Trying to get it to translate my unit tests from Xunit to Tunit, where the latter is new and does not have a ton of blog posts? Forget about it. The process is purely mechanical and easy after RTFM, but it falls flat on its face.
Although I think if you asked people 20 years ago to describe a test for something AGI would do, they would be more likely to say “writing a poem” or “making art” than “turning Xunit code to Tunit”
IMO I think if you said to someone in the 90s “well we invented something that can tell jokes, make unique art, write stories and hold engaging conversations, although we haven’t yet reached AGI because it can’t transpile code accurately - I mean it can write full applications if you give it some vague requirements, but they have to be reasonably basic, like only the sort of thing a junior dev could write in a day it can write in 20 seconds, so not AGI” they would say “of course you have invented AGI, are you insane!!!”.
LLMs to me are still a technology of pure science fiction come to life before our eyes!
Tell them humans need to babysit it and double-check its answers to do anything, since it isn't as reliable as a human, and no, they wouldn't have called it an AGI back then either.
The whole point about AGI is that it is general like a human; if it has such glaring weaknesses as the current AI has, it isn't AGI, and it was the same back then. That an AGI can write a poem doesn't mean being able to write a poem makes it an AGI; it's just an example of something the AI couldn't do 20 years ago.
Why do human programmers need code review then if they are intelligent?
And why can’t expert programmers deploy code without testing it? Surely they should just be able to write it perfectly first time without errors if they were actually intelligent.
> Why do human programmers need code review then if they are intelligent?
Human programmers don't need code reviews, they can test things themselves. Code review is just an optimization for scaling up; it isn't a requirement for making programs.
Also the AGI is allowed to let another AGI code review it, the point is there shouldn't be a human in the loop.
> And why can’t expert programmers deploy code without testing it?
They can do the testing themselves; the AGI model is allowed to test its own work as well.
Well AGI can write unit tests, write application code then run the tests and iterate - agents in cursor are doing this already.
Just not for more complex applications.
Code review does often find bugs in code…
Put another way, I’m not a strong dev but good LLMs can write lots of code with less bugs than me!
I also think it’s quite a “programmer mentality” that most of the tests in this forum about whether something is/isn’t AGI ultimately boil down to whether it can write bug-free code, rather than whether it can negotiate or sympathise or be humorous or write an engaging screenplay… I’m not saying AGI is good at those things yet, but it’s interesting that we talk about the test of AGI being transpiling code rather than understanding philosophy.
> Put another way, I’m not a strong dev but good LLMs can write lots of code with less bugs than me!
But the AI still can't replace you; it doesn't learn as it goes and therefore fails to navigate long-term tasks the way humans do. When a human writes a big program he learns how to write it as he writes it; these current AIs cannot do that.
Strictly speaking, it can, but its ability to do so is limited by its context size.
Which keeps growing - Gemini is at 2 million tokens now, which is several books worth of text.
Note also that context is roughly the equivalent of short-term memory in humans, while long-term memory is more like RAG.
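To make that short-term/long-term analogy concrete, here is a toy sketch: past notes go into an external store, and the closest match is retrieved and stuffed back into the prompt. The bag-of-words similarity is a deliberately crude stand-in for a real embedding model, not how any actual product implements RAG.

    import math
    from collections import Counter

    memory: list[tuple[str, Counter]] = []  # toy "long-term memory"

    def embed(text: str) -> Counter:
        # Crude bag-of-words vector; a real RAG system would call an embedding model.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def remember(note: str) -> None:
        memory.append((note, embed(note)))

    def recall(query: str, k: int = 1) -> list[str]:
        # Retrieve the k closest notes; the caller pastes them into the prompt,
        # i.e. into the model's "short-term memory" (the context window).
        q = embed(query)
        ranked = sorted(memory, key=lambda m: cosine(q, m[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

    remember("Bumping the session timeout did not fix the login bug.")
    remember("The staging database stores timestamps in UTC.")
    print(recall("what did we already try for the login bug?"))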
> IMO if you took gemini2.5 / claude / o3 and showed it to people from ten / twenty years ago, they'd say that it is unmistakably AGI.
No they wouldn't, since those still can't replace human white collar workers even at many very basic tasks.
Once AGI is here most white collar jobs are gone, you'd only need to hire geniuses at most.
Which part of "General Intelligence" requires replacing white collar workers? A middle schooler has general intelligence (they know about and can do a lot of things across a lot of different areas) but they likely can't replace white collar workers either. IMO GPT-3 was AGI, just a pretty crappy one.
> A middle schooler has general intelligence (they know about and can do a lot of things across a lot of different areas) but they likely can't replace white collar workers either.
Middle schoolers replace white collar workers all the time; it takes 10 years for them to do it, but they can do it.
No current model can do the same since they aren't able to learn over time like a middle schooler.
Compared to someone who graduated middle school on November 30th, 2022 (2.5 years ago), would you say that today's gemini 2.5 pro has NOT gained intelligence faster?
I mean, if you're a CEO or middle manager and you have the choice of hiring this middle schooler for general office work, or today's gemini-2.5-pro, are you 100% saying the ex-middle-schooler is definitely going to give you best bang for your buck?
Assuming you can either pay them $100k a year, or spend the $100k on gemini inference.
> would you say that today's gemini 2.5 pro has NOT gained intelligence faster?
Gemini 2.5 pro the model has not gained any intelligence since it is a static model.
New models are not the models learning; it is humans creating new models. The trained models have access to all the same material and knowledge a middle schooler has as they go on to learn how to do a job, yet they fail to learn the job while the kid succeeds.
> Gemini 2.5 pro the model has not gained any intelligence since it is a static model.
Surely that's an irrelevant distinction, from the point of view of a hiring manager?
If a kid takes ten years from middle school to being worth hiring, then the question is "what new AI do you expect will exist in 10 years?"
How the model comes to be, doesn't matter. Is it a fine tune on more training data from your company docs and/or an extra decade of the internet? A different architecture? A different lab in a different country?
Doesn't matter.
Doesn't matter for the same reason you didn't hire the kid immediately out of middle school, and hired someone else who had already had another decade to learn more in the meantime.
Doesn't matter for the same reason that different flesh humans aren't perfectly substitutable.
You pay to solve a problem, not to specifically have a human solve it. Today, not in ten years when today's middle schooler graduates from university.
And that's even though I agree that AI today doesn't learn effectively from as few examples as humans need.
> Surely that's an irrelevant distinction, from the point of view of a hiring manager?
Stop moving the goalposts closer, that you think humans might make an AGI in the future doesn't mean the current AI is an AGI just because it uses the same interface.
Your own comment was a movement of the goalposts.
Preceding quotation to which you objected:
> A middle schooler has general intelligence (they know about and can do a lot of things across a lot of different areas) but they likely can't replace white collar workers either.
Your response:
> Middle schoolers replace white collars workers all the time, it takes 10 years for them to do it but they can do it.
So I could rephrase your own words here as "Stop moving the goalposts closer, that you think a middle schooler might become a General Intelligence in the future doesn't mean the current middle schooler is a General Intelligence just because they use the same name".
It's the same middle schooler; nobody gave a time limit for how long it takes the middle schooler to solve the problem. These AI models won't solve it no matter how much time is spent; you have to make new models, like making new kids.
Put one of these models in a classroom with middle schoolers and make it go through all the same experiences, and they still won't replace a white collar worker.
> a middle schooler might become a General Intelligence in the future
Being able to learn anything a human can means you are a general intelligence now. Having a skill is narrow intelligence, being able to learn is what we mean with general intelligence. No current model has demonstrated the ability to learn arbitrary white collar jobs, so no current model has done what it takes to be considered a general intelligence. The biological model homo sapiens have demonstrated that ability, thus we call homo sapiens generally intelligent.
> Its the same middle schooler, nobody gave a time limit for how long it takes the middle schooler to solve the problem.
Yeah they do. If a middle schooler takes 40 hours to solve a maths exam, they fail the exam.
> These AI models wont solve it no matter how much time spent, you have to make new models, like making new kids.
First: doesn't matter. "White collar jobs" aren't seat warming; companies aren't paying for seat warmers, they're paying for problems solved, and not the kinds of problems 11 year olds can do.
Second: So far as I can tell, every written exam that not only 11 year olds but even 16 year olds take, and in many cases 21 year olds take, LLMs ace — the problem is coming up with new tests that describe the stuff we want that models can't do which humans can. This means that while I even agree these models have gaps, I can't actually describe those gaps in a systematic way, they just "vibe" like my own experience of continuing to misunderstand German as a Brit living in Berlin.
Third: going from 11 years old to adulthood, most or all atoms in your body will be replaced, and your brain architecture changes significantly. IIRC something like half of synapses get pruned by puberty.
Fourth: Taking a snapshot of a model and saying that snapshot can't learn, is like taking a sufficiently detailed MRI scan of a human brain and saying the same thing about the human you've imaged — training cut-offs are kinda arbitrary.
> No current model has demonstrated the ability to learn arbitrary white collar jobs, so no current model has done what it takes to be considered a general intelligence.
Both "intelligence" and "generality" are continuums, not booleans. It's famously hard for humans to learn new languages as they get older, for example.
All AI (not just LLMs) need a lot more experience than me, which means my intelligence is higher. When sufficient training data exists, that doesn't matter because the AI can just make up for being stupid by being stupid really fast — which is how they can read and write in more languages than I know the names of.
On the other hand, LLMs so far have demonstrated — at the junior level of a fresh graduate of 21, let alone an 11 year old — algebra, physics, chemistry, literature, coding, a hundred or so languages, medicine, law, politics, marketing, economics, and customer support. That's pretty general. Even if "fresh graduate" isn't a high standard for employment.
It took reading a significant fraction of the internet to get to that level because of their inefficiency, but they're superhumanly general, "Jack of all trades, master of none".
Well, superhuman compared to any individual. LLM generality only seems mediocre when compared to the entire human species at once, these models vastly exceed any single human because no single human speaks as many languages as these things let alone all the other stuff.
I think you are off topic here. You agree these models can't replace those humans, hence you agree they aren't AGI; the rest of your post somehow got into whether companies would hire 11 year olds or not.
Point is if we had models as smart as a 10 year old, we could put that model through school and then it would be able to do white collar jobs like a 25 year old. But no model can do that, hence the models aren't as smart as 10 year olds, since the biggest part to being smart is being able to learn.
So until we have a model that can do those white collar jobs, we know they aren't as generally smart as 10 year olds, since they can't replicate the same learning process. If they could replicate that learning process, then we would do it, and we would have that white collar worker.
Reread it, I edit stuff while composing, and hadn't finished until at least 13 minutes after your comment.
Employability is the core issue, as you brought up the white collar worker comparison:
"""No they wouldn't, since those still can't replace human white collar workers even at many very basic tasks.
Once AGI is here most white collar jobs are gone, you'd only need to hire geniuses at most.""" - https://news.ycombinator.com/item?id=43746116
Key thing you likely didn't have in the comment you replied to: G and I are not bools.
This argument needlessly anthropomorphizes the models. They are not humans nor living entities, they are systems.
So, fine, the gemini-2.5-pro model hasn't gotten more intelligent. What about the "Google AI Studio API" as a system? Or the "OpenAI chat completions API" as a system?
This system has definitely gotten vastly smarter based on the input it's gotten. Would you now concede, that if we look at the API-level (which, by the way, is the way you as the employer do interact with it) then this entity has gotten smarter way faster than the middle-schooler in the last 2.5 years?
But it's the AI researchers that made it smarter; it isn't a self-contained system like a child. If you fired the people maintaining it and it just interacted with people, it would stop improving.
The brain of a child is not self-contained either. Neither is the entire complete child themselves — "It takes a village to raise a child", to quote the saying.
The entire reason we have a mandatory education system that doesn't stop with middle school (for me, middle school ended age 11), is that it's a way to improve kids.
1. The child didn't learn algebra on its own either. Aside from Blaise Pascal, most children learned those skills by having experienced humans teach them.
2. How likely is it that we're going to fire everyone maintaining those models in the next 7.5 years?
> The child didn't learn algebra on its own either. Aside from Blaise Pascal, most children learned those skills by having experienced humans teach them.
That is them interacting with an environment. We don't go and rewire their brain to make them learn math.
If you made an AI that we can put in a classroom and it learns everything needed to do any white collar job that way then it is an AGI. Of course just like a human different jobs would mean it needs different classes, but just like a human you can still make them learn anything.
> How likely is it that we're going to fire everyone maintaining those models in the next 7.5 years?
If you stop making new models? Zero chance the model will replace such high skill jobs. If not? Then that has nothing to do with whether current models are general intelligences.
Your brain does rewire itself as you learn.
Here's a question for you. If we take a model with open weights - say, LLaMA or Qwen - and give it access to learning materials as well as tools to perform training runs on its weights and dynamically reload those updated weights - would that constitute learning, to you? If not, then why not?
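For concreteness, here's a minimal sketch of what such a loop could look like with today's open-weights tooling, assuming a Hugging Face transformers + peft setup; the model name, hyperparameters, and the study() helper are all illustrative assumptions, not anyone's shipping product:

    # Sketch: an open-weights model that fine-tunes a small LoRA adapter on new
    # material and keeps using the updated weights. Everything here is illustrative.
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)
    from peft import LoraConfig, get_peft_model
    from datasets import Dataset

    MODEL = "Qwen/Qwen2.5-7B"  # any open-weights model would do
    tok = AutoTokenizer.from_pretrained(MODEL)
    tok.pad_token = tok.pad_token or tok.eos_token
    model = get_peft_model(AutoModelForCausalLM.from_pretrained(MODEL),
                           LoraConfig(r=8, target_modules=["q_proj", "v_proj"]))

    def study(texts):
        """Run a small training pass on new material; the in-memory weights update."""
        ds = Dataset.from_dict({"text": texts}).map(
            lambda row: tok(row["text"], truncation=True), remove_columns=["text"])
        Trainer(model=model,
                args=TrainingArguments(output_dir="adapter", num_train_epochs=1,
                                       per_device_train_batch_size=1),
                data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
                train_dataset=ds).train()
        # Subsequent generations from `model` now reflect what it just "studied".

Whether the result counts as "learning" is exactly the question, but mechanically nothing stops that loop from running.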
> Here's a question for you. If we take a model with open weights - say, LLaMA or Qwen - and give it access to learning materials as well as tools to perform training runs on its weights and dynamically reload those updated weights - would that constitute learning, to you? If not, then why not?
It does constitute learning, but it won't make it smart, since it isn't intelligent about its learning the way human brains are.
> Gemini 2.5 pro the model has not gained any intelligence since it is a static model.
Aren't all the people interacting with it on aistudio helping the next Gemini model learn though?
Sure, the results of that won't be available until the next model is released, but it seems to me that human interaction/feedback is actually a vital part of LLM training.
It won't get smart enough without the researchers making architectural updates though; the current architecture won't learn to become a white collar worker just from user feedback.
I don't think so, and here's my simple proof:
You and I could sit behind a keyboard, role-playing as the AI in a reverse Turing test, typing away furiously at the top of our game, and if you told someone that their job is to assess our performance (thinking they're interacting with a computer), they would still conclude that we are definitely not AGI.
This is a battle that can't be won at any point because it's a matter of faith for the forever-skeptic, not facts.
> I don't think so, and here's my simple proof:
That isn't a proof, since you haven't run that test; it is just a thought experiment.
I've been accused a few times of being an AI, even here.
(Have you not experienced being on the receiving end of such accusations? Or do I just write weird?)
I think this demonstrates the same point.
> Have you not experienced being on the receiving end of such accusations?
No, I have not been accused of being an AI. I have seen people who format their texts get accused due to the formatting sometimes, and thought people could accuse me for the same reason, but that doesn't count.
> I think this demonstrates the same point.
You can't detect general intelligence from a single message, so it doesn't really. People accuse you of being an AI based on the structure and word usage of your message, not the content of it.
> People accuse you of being an AI based on the structure and word usage of your message, not the content of it.
If that's the real cause, it is not the reason they give when making the accusation. Sometimes they object to the citations, sometimes the absence of them.
But it's fairly irrelevant, as they are, in fact, saying that real flesh-and-blood me doesn't pass their purity test for thinking.
Is that because they're not thinking? Doesn't matter — as @sebastiennight said: "This is a battle that can't be won at any point because it's a matter of faith for the forever-skeptic, not facts."
So is your argument that all skeptics are unreasonable people who can't ever be convinced, based on your being called an AI once? Don't you see who is the unreasonable one here?
There are always people who won't admit they are wrong, but most people do come around when presented with overwhelming evidence. It has happened many times in history, and most people switch to new technology very quickly when it's good enough.
When it can replace a polite, diligent, experienced 120 IQ human in all tasks. So it has a consistent long-term narrative memory, doesn't "lose the plot" as you interact longer and longer with it, can pilot robots to do physical labor without much instruction (the current state of the art is not that; a trained human will still do much better), can drive cars, generate images without goofy non-human style errors, etc.
> experienced 120 IQ human in all tasks.
Well, that's the 91st percentile already. I know the terms are hazy, but that seems closer to ASI than AGI from that perspective, no?
I think I do agree with you on the other points.
Indeed, on both. Even IQ 85 would make a painful dent in the economy via unemployment statistics. But the AI we have now is spiky, in ways that make it trip up over mistakes even slightly below-average humans would not make, even though it can also do Maths Olympiad puzzles, the bar exam, leetcode, etc.
The emotional way that humans think when buying products is similarly unfair: only the 90th percentile is truly 'satisfactory'. The implied question is when Joe Average and everyone else will stop moving the goalposts on the question, "Do we have AI yet?"
ASI is, by definition, Superintelligence, which means it is beyond practical human IQ capacity. So something like 200 IQ.
Again, you might call it 'unfair', but that's also when the goalposts will stop being moved; otherwise, Joe Midwit will say 'it's only as smart as some smart dudes I know'.
I still can't have an earnest conversation or bounce ideas off of any LLM - all of them seem to be a cross between a sentient encyclopedia and a constraint solver.
They might get more powerful but I feel like they're still missing something.
Why are you not able to have an earnest conversation with an LLM? What kind of ideas are you not able to bounce of LLMs? These seem to be the type of use cases where LLMs have generally shined for me.
Eh, I am torn on this. I had some great conversations on random questions or conceptual ideas, but also some where the model's instructions shone through way too clearly. Like, when you ask something like "I'm working on the architecture of this system, can you let me know what you think and if there's anything obvious to improve on?", the model will always a) flatter me for my amazing concept, b) point out the especially laudable parts of it, and c) name a few obvious but not-really-relevant parts (e.g. "always be careful with secrets and passwords"). However, it will not actually point out higher level design improvements, or alternative solutions. It's always just regurgitating what I've told it about. That is semi-useful, most of the time.
Because it spits out the most probable answer, which is based on endless copycat articles online written by marketers for C-level decision makers to sell their software.
AI doesn't go and read a book on best practices, come back saying "Now I know the Kung Fu of Software Implementation", and then think critically through your plan step by step and provide an answer. These systems, for now, don't work like that.
Would you disagree?
How come we’re discussing if they’re artificial general intelligence then?
Because some believe that to be intelligence while others believe it requires more than that.
The "meaningless praise" part is basically American cultural norms trained into the model via RLHF. It can be largely negated with careful prompting, though.
I felt this way until I tried Gemini 2.5. IMO it fully passes the Turing test unless you're specifically utilizing tricks that LLMs are known to fall for.
I suspect everyone will call it a stochastic parrot because it did this one thing not quite right. And this will continue into the far, far future; even when it becomes sentient, we will completely miss it.
It's more than that but less than intelligence.
Its generalization capabilities are a bit on the low side, and memory is relatively bad. But it is much more than just a parrot now; it can handle some basic logic, but it cannot follow given patterns correctly for novel problems.
I'd liken it to something like a bird, extremely good at specialized tasks but failing a lot of common ones unless repeatedly shown the solution. It's not a corvid or a parrot yet. Fails rather badly at detour tests.
It might be sentient already though. Someone needs to run a test if it can discern itself and another instance of itself in its own work.
> It might be sentient already though. Someone needs to run a test if it can discern itself and another instance of itself in its own work.
It doesn't have any memory, how could it tell itself from a clone of itself?
People already share viral clips of AI recognising other AI, but I've not seen a real scientific study of whether this is due to a literary form of passing a mirror test, or whether it's related to the way most models openly tell everyone they talk to that they're an AI.
As for "how", note that memory isn't one single thing even in humans: https://en.wikipedia.org/wiki/Memory
I don't want to say any of these are exactly equivalent to any given aspect of human memory, but I would suggest that LLMs behave kinda like they have:
(1) Sensory memory in the form of a context window — and in this sense are wildly superhuman because for a human that's about one second, whereas an AI's context window is about as much text as a human goes through in a week (actually less because we don't only read, other sensory modalities do matter; but for scale: equivalent to what you read in a week)
(2) Short term memory in the form of attention heads — and in this sense are wildly superhuman, because humans pay attention to only 4–5 items whereas DeepSeek v3 defaults to 128.
(3) The training and fine-tuning process itself that allows these models to learn how to communicate with us. Not sure what that would count as. Learned skill? Operant conditioning? Long term memory? It can clearly pick up different writing styles, because it can be made to controllably output in different styles — but that's an "in principle" answer. None of Claude 3.7, o4-mini, DeepSeek r1, could actually identify the authorship of a (n=1) test passage I asked 4o to generate for me.
Similarity match. For that you need to understand reflexively how you think and write.
It's a fun test to give a person something they have written but do not remember. Most people can still spot it.
It's easier with images though. Especially a mirror. For DALL-E, the test would be whether it can discern its own work from a human-generated image. Especially if you give it an imaginative task like drawing a representation of itself.
It doesn't have any memory _you're aware of_. A semiconductor can hold state, so it has memory.
An LLM is arguably more "intelligent" than people with an IQ of less than 80.
If we call people with an IQ of less than 80 an intelligent life form, why can't we call an LLM that?
Once it has pushed most humans out of white collar labor, so that the remaining humans work in blue collar jobs, people won't say it's just a stochastic parrot.
Maybe, maybe not. The power loom pushed a lot of humans out of textile factory jobs, yet no one claims the power loom is AGI.
Not a lot, I mean basically everyone, to the point where most companies don't need to pay humans to think anymore.
Well, I'm too lazy to look up how many weavers were displaced back then and that's why I said a lot. Maybe all of them, since they weren't trained to operate the new machines.
Anyway, sorry for a digression, my point is LLM replacing white collar workers doesn't necessarily imply it's generally intelligent -- it may but doesn't have to be.
Although if it gets to a point where companies are running dark office buildings (by analogy with dark factories) -- yes, it's AGI by then.
Or become shocked to realize humans are basically statistical parrots too.
The blue collar jobs are more entertaining anyway, provided you take the monetary inequality away.
Tastes differ.
This is actually how a supreme court justice defined the test for obscenity.
> The phrase "I know it when I see it" was used in 1964 by United States Supreme Court Justice Potter Stewart to describe his threshold test for obscenity in Jacobellis v. Ohio
The reason why it's so famous though (and why some people tend to use it in a tongue in cheek manner) is because "you know it when you see it" is a hilariously unhelpful and capricious threshold, especially when coming from the Supreme Court. For rights which are so vital to the fabric of the country, the Supreme Court recommending we hinge free speech on—essentially—unquantifiable vibes is equal parts bizarre and out of character.
My 2c on this: if you interact with any current LLM enough, you can mentally 'place' its behavior and responses. When we truly have AGI+/ASI, my guess is that it will be like that old adage of blind men feeling and describing an elephant for the first time. We just won't be able to fully understand its responses. There would always be something left hanging, and eventually we'll just stop trying. That would be the time when the exponential improvement really kicks in.
It should suffice to say we are nowhere near that, and I don't even believe LLMs are the right architecture for that.
This is part of what the article is about
We have all seen it and are now just in severe denial.
Low agreeableness will actually be extremely useful in many use cases, such as scientific discovery and of course programming assistance. It's amazing that this avenue hasn't been explored more deeply.
It's much easier to sell an agreeable assistant than a disagreeable one, so it isn't that strange the alternative isn't explored.
Why would a bad attitude be helpful in those domains? Are the human partners wont to deliver more effort when you’re mean to them?
Are we talking about something other than Agreeableness in the personality research sense [0]?
The strongest form of your argument I can think of is “willing to contradict you when it thinks you’re wrong”—but you can disagree agreeably, right? The current-gen LLMs certainly have with me, perhaps because my custom prompt encourages them to skepticism—but they do it so nicely!
> but you can disagree agreeably, right?
No, the concepts are linked: agreeable people don't want to be rude, and most people see disagreements as rude no matter how you frame them. You can't call a woman overweight without being rude, for example, no matter how you frame it; but maybe you want an AI that tells you that you weigh too much.
Good point, but calling a woman overweight isn't necessarily a disagreement.
I spent some amount of time trying to create a stock/option trading bot to exploit various market inefficiencies that persist, and did a bunch of code and idea bouncing off these LLMs. What I found is that even all the various incarnations of GPT 4+ and GPT o+ routinely kept falling for the "get rich quick" option strategies all over the internet that don't work.
In cases where 95%+ of the information on the internet is misinformation, the current incarnations of LLMs have a really hard time sorting out and filtering out the 5% of information that's actually valid and useful.
In that sense, current LLMs are not yet superhuman at all, though I do think we can eventually get there.
So they are only as smart as most humans.
I've always felt that trying to pin down the precise definition of AGI is as useless as trying to pin down "what it means to truly understand". It's a mental trap for smart people, that distracts them from focusing on the impacts of hard-to-define concepts like AGI.
AGI doesn't need to be "called", and there is no need for anyone to come to an agreement as to what its precise definition is. But at some point, we will cross that hard-to-define threshold, and the economic effects will be felt almost immediately.
We should probably be focusing on how to prepare society for those changes, and not on academic bullshit.
It's definitely a trap for those who aren't familiar with the existing academic work in philosophy, cognition, and neuroscience. There are no definitive answers but there are lots of relatively well developed ideas and concepts that everyone here on HN seems completely ignorant of, even though some of the ideas were developed by industry giants like Marvin Minsky.
Stuff like society of minds (Minsky), embodied cognition (Varela, Rosch, and Thompson), connectionist or subsymbolic views (Rumelhart), multiple intelligences (Gardner), psychometric and factor-analytic theories (Carroll), and all the other work like E. Hutchins. They're far from just academic wankery; there's a lot of useful stuff in there, it's just completely ignored by the AI crowd.
In the last example (the riddle), I generally assume the AI isn't misreading; rather, it assumes you couldn't have given it the riddle correctly, because it has seen it already.
I would do the same thing, I think. It's too well-known.
The variation doesn't read like a riddle at all, so it's confusing even to me as a human. I can't find the riddle part. Maybe the AI is confused, too. I think it makes an okay assumption.
I guess it would be nice if the AI asked a follow up question like "are you sure you wrote down the riddle correctly?", and I think it could if instructed to, but right now they don't generally do that on their own.
> I generally assume the AI isn't misreading; rather, it assumes you couldn't have given it the riddle correctly, because it has seen it already.
LLMs don't assume; an LLM is a text completer. It sees something that looks almost like a well-known problem and it will complete with that well-known problem. It's a problem specific to being a text completer that is hard to get around.
If you have the model output a chain of thought, whether it's a reasoning model or you prompt a "normal" model to do so, you will see examples of the model going "user said X, but did they mean Y? Y makes more sense, I will assume Y". Sometimes stretched over multiple paragraphs, consuming the entire reasoning budget for that prompt.
Discussing whether models can "reason" or "think" is a popular debate topic on here, but I think we can all at least agree that they do something that at least resembles "reasoning" and "assumptions" from our human point of view. And if in its chain of thought it decides your prompt is wrong, it will go ahead and answer what it assumes is the right prompt.
> it's a text completer
Yes, and it can express its assumptions in text.
Ask it to make some assumptions, like about a stack for a programming task, and it will.
Whether or not the mechanism behind it feels like real thinking to you, it can definitely do this.
If you call it that when it puts together text that reads like an assumption, then yes. But it cannot express an assumption, as it is not assuming. It is completing text, like the OP said.
It's trained to complete text, but it does so by constructing internal circuitry during training. We don't have enough transparency into that circuitry or the human brain's to positively assert that it doesn't assume.
But I'd wager it's there; assuming is not a particularly impressive or computationally intense operation. There's a tendency to bundle all of human consciousness into the definitions of our cognitive components, but I would argue that, e.g., a branch predictor meets the bar for any sane definition of 'assume'.
These newer "reasoning" LLMs really don't feel like pure text completers any more.
What your parent poster said is nonetheless true, regardless of how it feels to you. Getting text from an LLM is a process of iteratively attempting to find a likely next token given the preceding ones.
If you give an LLM "The rain in Spain falls" the single most likely next token is "mainly", and you'll see that one proportionately more than any other.
If you give an LLM "Find an unorthodox completion for the sentence 'The rain in Spain falls'", the most likely next token is something other than "mainly" because the tokens in "unorthodox" are more likely to appear before text that otherwise bucks statistical trends.
If you give the LLM "blarghl unorthodox babble The rain in Spain" it's likely the results are similar to the second one but less likely to be coherent (because text obeying grammatical rules is more likely to follow other text also obeying those same rules).
In any of the three cases, the LLM is predicting text, not "parsing" or "understanding" a prompt. The fact it will respond similarly to a well-formed and unreasonably-formed prompt is evidence of this.
It's theoretically possible to engineer a string of complete gibberish tokens that will prompt the LLM to recite song lyrics, or answer questions about mathematical formulae. Those strings of gibberish are just difficult to discover.
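(If you want to see this concretely, you can inspect the next-token distribution of a raw, non-chat model yourself; a rough sketch with Hugging Face transformers below. The model choice is arbitrary and the exact probabilities will vary.)

    # Sketch: look at the next-token distribution a base causal LM assigns after a prompt.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def top_next_tokens(prompt, k=5):
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]      # scores for the next position only
        probs = torch.softmax(logits, dim=-1)
        top = torch.topk(probs, k)
        return [(tok.decode([int(i)]), p.item()) for i, p in zip(top.indices, top.values)]

    print(top_next_tokens("The rain in Spain falls"))
    print(top_next_tokens("Find an unorthodox completion for: 'The rain in Spain falls'"))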
So I just gave your blarghl line to Claude, and it replied "It seems like you included a mix of text including "blarghl unorthodox babble" followed by the phrase "The rain in Spain."
Did you mean to ask about the well-known phrase "The rain in Spain falls mainly on the plain"? This is a famous elocution exercise from the musical "My Fair Lady," where it's used to teach proper pronunciation.
Or was there something specific you wanted to discuss about Spain's rainfall patterns or perhaps something else entirely? I'd be happy to help with whatever you intended to ask. "
I think you have a point here, but maybe re-express it? Because right now your argument seems trivially falsifiable even under your own terms.
If you feed that to Claude, you're getting Claude's "system prompt" before the text you give it.
If you want to test convolution you have to use a raw model with no system prompt. You can do that with a Llama or similar. Otherwise your context window is full of words like "helpful" and "answer" and "question" that guide the response and make it harder (not impossible) to see the effect I'm talking about.
I'm a bit confused here. Are you saying that if I zero out the system prompt on any LLM, including those fine-tuned to give answers in an instructional form, they will follow your effect -- that nonsense prompts will get similar results to coherent prompts if they contain many of the same words?
Because I've tried it on a few local models I have handy, and I don't see that happening at all. As someone else says, some of that difference is almost certainly due to supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) -- but it's weird to me, given the confidence you made your prediction, that you didn't exclude those from your original statement.
I guess, maybe the real question here is: could you give me a more explicit example of how to show what you are trying to show? And explain why I'm not seeing it while running local models without system prompts?
At this point, you might as well be claiming that a completions model behaves differently from a fine-tuned model. Which is true, but the prompt via the API without any system message also doesn't seem to match your prediction.
The point is that when there's a system prompt you didn't write, you get autocompletion of your input plus said system prompt, and as such all outputs are biased.
True but also irrelevant. The "AI" is the entirety of the system, which includes the model itself as well as any prompts and other machinery around it.
I mean, if you dig down enough, the LLM doesn't even generate tokens - it merely gives you a probability distribution, and you still need to explicitly pick the next token based on those probabilities, append it to the input, and start next iteration of the loop.
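(For anyone who hasn't seen that loop spelled out, here's a hand-rolled sketch of it; in practice libraries wrap this up for you, e.g. transformers' generate(). The model and sampling choices here are illustrative.)

    # Hand-rolled decode loop: the model only scores the next token; the caller
    # samples one, appends it, and repeats. Temperature and sampling live outside the model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The rain in Spain falls", return_tensors="pt").input_ids
    for _ in range(20):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits / 0.8, dim=-1)        # temperature: our choice, not the model's
        next_id = torch.multinomial(probs, num_samples=1)  # picking the token: also our choice
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    print(tok.decode(ids[0]))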
No, I think the "reasoning" step really does make a difference here.
There's more than just next token prediction going on. Those reasoning chain of thoughts have undergone their own reinforcement learning training against a different category of samples.
They've seen countless examples of how a reasoning chain would look for calculating a mortgage, or searching a flight, or debugging a Python program.
So I don't think it is accurate to describe the eventual result as "just next token prediction". It is a combination of next token production that has been informed by a chain of thought that was based on a different set of specially chosen examples.
Do you believe it's possible to produce a given set of model weights with an infinitely large number of different training examples?
If not, why not? Explain.
If so, how does your argument address the fact that this implies any given "reasoning" model can be trained without giving it a single example of something you would consider "reasoning"? (in fact, a "reasoning" model may be produced by random chance?)
> an infinitely large number of different training examples
Infinity is problematic because it's impossible to process an infinite amount of data in a finite amount of time.
I'm afraid I don't understand your question.
This again.
It’s predicting text. Yes. Nobody argues about that. (You’re also predicting text when you’re typing it. Big deal.)
How it is predicting the text is the question to ask, and indeed it's being asked, and we're getting glimpses of understanding, and lo and behold it's a damn complex process. See the recent Anthropic research paper for details.
The problem is showing that humans aren't just doing next word prediction too.
I don't see that as a problem. I don't particularly care how human intelligence works; what matters is what an LLM is capable of doing and what a human is capable of doing.
If those two sets of accomplishments are the same there's no point arguing about differences in means or terms. Right now humans can build better LLMs but nobody has come up with an LLM that can build better LLMs.
That's literally the definition of takeoff: once it starts, it gets us to the singularity within a decade, and there's no publicly available evidence that it's started… emphasis on publicly available.
> it gets us to singularity
Are we sure it's actually taking us along?
> but nobody has come up with an LLM that can build better LLMs.
Yet. Not that we know of, anyway.
Given the dramatic uptake of Cursor / Windsurf / Claude Code etc, we can be 100% certain that LLM companies are using LLMs to improve their products.
The improvement loop is likely not fully autonomous yet - it is currently more efficient to have a human-in-the-loop - but there is certainly a lot of LLMs improving LLMs going on today.
I feel like people are going to find it hard to accept that this is how most of us think (at least when thinking in language). They will resist this like heliocentrism.
I'm curious what others who are familiar with LLMs and have practiced open monitoring meditation might say.
> The fact it will respond similarly to a well-formed and unreasonably-formed prompt is evidence of this.
Don't humans do the same in conversation? How should an intelligent being (constrained to the same I/O system) respond here to show that it is in fact intelligent?
Imagine a Rorschach Test of language, where a certain set of non-recognizable-language tokens invariably causes an LLM to talk about flowers. These strings exist by necessity due to how the LLM's layers are formed.
There exists no similar set of tokens for humans, because our process is to parse the incoming sounds into words, use grammar to extract conceptual meaning from those words, and then shape a response from that conceptual meaning.
Artists like Lewis Carroll and Stanislaw Lem play with this by inserting non-words at certain points in sentences to get humans to infer the meaning of those words from surrounding context, but the truth remains that an LLM will gladly convolute a wholly non-language input into a response as if it were well-formed, whereas a human can't/won't do that.
I know this is hard to understand, but the current generation of LLMs are working directly with language. Their "brains" are built on language. Some day we might have some kind of AI system that's built on some kind of meaning divorced from language, but that's not what's happening here. They're engineering matrices that repeatedly perform "context window times model => one more token" operations.
> Imagine a Rorschach Test of language, where a certain set of non-recognizable-language tokens invariably causes an LLM to talk about flowers. These strings exist by necessity due to how the LLM's layers are formed.
Maybe not for humanity as a species, but for individual humans there are absolutely token sequences that lead them to talk about certain topics, and nobody being able to bring them back to topic. Now you'd probably say those are recognizable token sequences, but do we have a fair process to decide what's recognizable that isn't inherently biased towards making humans the only rational actor?
I'm not contending at all that LLMs are only built on language. Their lack of physical reference point is sometimes laughably obvious. We could argue whether there are signs they also form a world model and reasoning that abstracts from language alone, but that's not even my point. My point is rather that any test or argument that attempts to say that LLMs can't "reason" or "assume" or whatever has to be a test a human could pass. Preferably a test a random human would pass with flying colors.
I think you are begging the question here.
For one thing, LLMs absolutely form responses from conceptual meanings. This has been demonstrated empirically multiple times now, including again by Anthropic only a few weeks ago. 'Language' is just the input and output, the first and last few layers of the model.
So okay, there exists some set of 'gibberish' tokens that will elicit meaningful responses from LLMs. How does your conclusion, "Therefore, LLMs don't understand", follow from that? You would also conclude that humans have no understanding of what they see because of the Rorschach test?
>There exists no similar set of tokens for humans, because our process is to parse the incoming sounds into words, use grammar to extract conceptual meaning from those words, and then shape a response from that conceptual meaning.
Grammar is useful fiction, an incomplete model of a demonstrably probabilistic process. We don't use 'grammar' to do anything.
It's not an either-or. The fact that LLM completes text does not preclude it from meaningfully reasoning, which anyone who used reasoning models on real-world tasks is well-aware of.
Is it not physically impossible for LLMs to be anything but "plausible text completion"?
Neural Networks as I understand them are universal function approximators.
In terms of text, that means they're trained to output what they believe to be the "most probably correct" sequence of text.
An LLM has no idea that it is "conversing", or "answering" -- it relates some series of symbolic inputs to another series of probabilistic symbolic outputs, aye?
At this point you need to actually define what it means for an LLM to "have an idea".
And yet
"Assume" can just be a proxy term for "text completion that contains an assumption," especially considering that we don't have enough concrete details about human cognition to know for sure that we aren't doing the same thing.
Text Completion is just the objective function. It's not descriptive and says nothing about how the models complete text. Why people hang on this word, I'll never understand. When you wrote your comment, you were completing text.
The problem you've just described is a problem with humans as well. LLMs are assuming all the time. Maybe you would like to call it another word, but it is happening.
> When you wrote your comment, you were completing text.
I didn't train to complete text though, I was primarily trained to make accurate responses.
And no, writing a response is not "completing text", I don't try to figure out what another person would write as a response, I write what I feel people need to read. That is a completely different thought process. If I tried to mimic what another commenter would have written it would look very different.
>And no, writing a response is not "completing text", I don't try to figure out what another person would write as a response, I write what I feel people need to read.
Functionally, it is. You're determining what text should follow the prior text. Your internal reasoning ('what I feel people need to read') is how you decide on the completion.
The core point isn't that your internal 'how' is the same as an LLM's (maybe, maybe not), but that labeling the LLM as a 'text completer' the way you have is essentially meaningless.
You are just imposing your own ideas on how an LLM works, not speaking any fundamental truth about being a 'text completer'.
Sometimes we also write what we really want people to not read. That's usually called trolling though.
With a plan, aiming for something, that's the difference.
Again, you are only describing the how here, not the what (text completion).
Also, LLMs absolutely 'plan' and 'aim for something' in the process of completing text.
https://www.anthropic.com/research/tracing-thoughts-language...
Yeah, this paper is great fodder for the LLM pixie dust argument.
They use a replacement model. It isn't even observing the LLM itself but a different-architecture model. And it is very liberal with interpreting the patterns of activations seen in the replacement model with flowery language. It also includes some very relevant caveats, such as:
"Our cross-layer transcoder is trained to mimic the activations of the underlying model at each layer. However, even when it accurately reconstructs the model’s activations, there is no guarantee that it does so via the same mechanisms."
https://transformer-circuits.pub/2025/attribution-graphs/met...
So basically the whole exercise might or might not be valid. But it generates some pretty interactive graphics and a nice blog post to reinforce the anthropomorphization discourse.
'So basically the whole exercise might or might not be valid.'
Nonsense. Mechanistic faithfulness probes whether the replacement model ("cross-layer transcoder") truly uses the same internal functions as the original LLM. If it doesn't, the attribution graphs it suggests might mislead at a fine-grained level, but because every hypothesis generated by those graphs is tested via direct interventions on the real model, high-level causal discoveries (e.g. that Claude plans its rhymes ahead of time) remain valid.
> the attribution graphs it suggests might mislead at a fine-grained level
"In principle, our attribution graphs make predictions that are much more fine-grained than these kinds of interventions can test."
> high‑level causal discoveries (e.g. that Claude plans its rhymes ahead of time) remain valid.
"We found planned word features in about half of the poems we investigated, which may be due to our CLT not capturing features for the planned words, or it may be the case that the model does not always engage in planning"
"Our results are only claims about specific examples. We don't make claims about mechanisms more broadly. For example, when we discuss planning in poems, we show a few specific examples in which planning appears to occur. It seems likely that the phenomenon is more widespread, but it's not our intent to make that claim."
And quite significantly:
"We only explain a fraction of the model's computation. The remaining “dark matter” manifests as error nodes in our attribution graphs, which (unlike features) have no interpretable function, and whose inputs we cannot easily trace. (...) Error nodes are especially a problem for complicated prompts (...) This paper has focused on prompts that are simple enough to avoid these issues. However, even the graphs we have highlighted contain significant contributions from error nodes."
Maybe read the paper before making claims about its contents.
Maybe understand the paper before making claims about its contents.
>"In principle, our attribution graphs make predictions that are much more fine-grained than these kinds of interventions can test."
Literally what I said. If the replacement model isn't faithful then you can't trust the details of the graphs. Basically stuff like “increasing feature f at layer 7 by Δ will raise feature g at layer 9 by exactly 0.12 in activation”
>"We found planned word features in about half of the poems we investigated, which may be due to our CLT not capturing features for the planned words, or it may be the case that the model does not always engage in planning"
>"Our results are only claims about specific examples. We don't make claims about mechanisms more broadly. For example, when we discuss planning in poems, we show a few specific examples in which planning appears to occur. It seems likely that the phenomenon is more widespread, but it's not our intent to make that claim."
The moment there were examples of the phenomenon demonstrated through interventions was the moment those examples remained valid, regardless of how faithful the replacement model was.
The worst case scenario here (and it's ironic here because this scenario would mean the model is faithful) is that Claude does not always plan its rhymes, not that it never plans them. The model not being faithful actually means the replacement was simply not robust enough to capture all the ways Claude plans rhymes. Guess what? Neither option invalidates the examples.
Regardless of how faithful the replacement model is, Anthropic have demonstrated that Claude has the ability to plan its rhymes ahead of time and engages in this planning at least sometimes. This is stated quite plainly too. What's so hard to understand?
>"We only explain a fraction of the model's computation. The remaining “dark matter” manifests as error nodes in our attribution graphs, which (unlike features) have no interpretable function, and whose inputs we cannot easily trace. (...) Error nodes are especially a problem for complicated prompts (...) This paper has focused on prompts that are simple enough to avoid these issues. However, even the graphs we have highlighted contain significant contributions from error nodes."
Ok, and? Model computations are extremely complex, who knew? This does not invalidate what they do manage to show.
So do LLMs. "In the United States, someone whose job is to go to space is called ____" it will say "an" not because that's the most likely next word, but because it's "aiming" (to use your terminology) for "astronaut" in the future.
I don't know about you, but I tend to make more elaborate plans than the next word. I have a purpose, an idea I'm trying to communicate. These things don't have ideas, they're not creative.
Are we sure “an astronaut” is not the token?
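(Easy enough to check with a tokenizer; a quick sketch using OpenAI's tiktoken, with the caveat that the exact splits depend on the model's vocabulary:)

    # Check whether " an astronaut" is a single token in a GPT-style vocabulary.
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")
    ids = enc.encode(" an astronaut")
    print(ids)                             # more than one id means it's not a single token
    print([enc.decode([i]) for i in ids])  # see where the splits fall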
This take really misses a key part of implementation of these LLMs and I’ve been struggling to put my finger on it.
In every LLM thread someone chimes in with “it’s just a statistical token predictor”.
I feel this misses the point and I think it dismisses attention heads and transformers, and that’s what sits weird with me every time I see this kind of take.
There _is_ an assumption being made within the model at runtime. Assumption, confusion, uncertainty - one camp might argue that none of these exist in the LLM.
But doesn’t the implementation constantly make assumptions? And what even IS your definition of “assumption” that’s not being met here?
Edit: I guess my point, overall, is: what’s even the purpose of making this distinction anymore? It derails the discussion in a way that’s not insightful or productive.
> I feel this misses the point and I think it dismisses attention heads and transformers
Those just make it better at completing the text, but for very common riddles those tools still get easily overruled by pretty simple text-completion logic, since the weights for those will be so extremely strong.
The point is that if you understand it's a text completer, then it's easy to understand why it fails at these. To fix these properly you need to make it no longer try to complete text, and that is hard to do without breaking it.
https://kagi.com/assistant/3752c5f9-bf5c-4a43-bada-b3eccbe94...
You should be able to click left right on the prompt to see different responses. Sonnet 3.7 with extended thinking notices the issue, and then chooses to totally ignore it with no explanation.
From Claude for those who don’t want to click:
Wait, I notice a difference from the traditional riddle. In this version, the surgeon says "I can operate on this boy" (affirmative) rather than "I can't operate on this boy" (negative).
This changes the nature of the puzzle somewhat. If the surgeon is saying they CAN operate, then we need to explain why this is surprising or seemingly impossible, but actually possible.
The traditional answer would still apply: the surgeon is the boy's mother.
I don't really mind using analogies for LLMs "assuming" things or being "confused" too much. I think there really is _some_ value to such analogies.
However I gotta take issue with using those analogies when "it's trained for text completion and the punchline to this riddle is surely in its training data a lot" is a perfectly good explanation. I guess I would also add that the answer is well-aligned with RLHF-values. I wouldn't go for an explanation that requires squishy analogies when the stuff we know about these things seems completely adequate.
> I generally assume the AI isn't misreading, rather that it assumes you couldn't give it the riddle correctly, but it has seen it already.
Just not enough training data, I suppose. Were it really smart, it would "understand" the situation and clarify: "I assume you are asking me that popular riddle - the answer is X". At least after the OP's first question, a human would usually respond like that.
Yeah you need specific instruct training for that sort of thing, Claude Opus being one of the rare examples that does such a sensibility check quite often and even admits when it doesn't know something.
These days it's all about confidently bullshitting on benchmarks and overfitting on common riddles to make pointless numbers go up. The more impressive models get on paper, the more rubbish they are in practice.
Gemini 2.5 is actually pretty good at this. It's the only model ever to tell me "no" to a request in Cursor.
I asked it to add websocket support for my app and it responded like, "looks like you're using long polling now. That's actually better and simpler. Lets leave it how it is."
I was genuinely amazed.
Do you have an example or two of a query that opus does well that others fail at?
Here's o4-mini-high solving riddles:
> What weighs more; a pound of courage or a pound coin?
> They weigh the same—each is a “pound.” A pound of courage and a pound coin both come out to one pound in weight!
https://chatgpt.com/share/68053ad0-aae4-800d-95ff-5b5b0a9a38...
I don't know what AGI is, but I know this isn't it.
Sonnet 3.7 (Thinking) gave me:
:::thinking This question is playing with words and units.
A "pound of courage" refers to a metaphorical measurement of the abstract concept of courage, using the imperial weight unit "pound" (which equals 453.59 grams).
A "pound coin" refers to the physical currency of the United Kingdom, which is called a "pound" but is a unit of currency, not weight. A physical pound coin weighs about 8-9 grams (specifically, the current 12-sided pound coin weighs 8.75 grams).
So comparing them: - A pound of courage would weigh 453.59 grams (if courage had physical weight, which it doesn't since it's an abstract concept) - A pound coin weighs approximately 8.75 grams
The trick in this question is that it's comparing: 1. A specific weight of an abstract concept (pound of courage) 2. A physical coin whose name includes the word "pound" but refers to currency, not weight
Since the question asks what weighs more, and weight is a physical property, I should point out that courage is abstract and doesn't have physical weight, while a pound coin is a physical object with a measurable weight. ::: # Comparing "A Pound of Courage" vs "A Pound Coin"
This is a play on words with different meanings of "pound":
- A *pound of courage* refers to a metaphorical pound (weight unit) of an abstract concept - A *pound coin* is a physical British currency coin (which weighs about 8.75 grams)
If we're talking about physical weight: - A pound of courage doesn't have physical weight since courage is abstract - A pound coin has a definite weight of approximately 8.75 grams
If we interpreted "pound of courage" as literally one pound (453.59 grams) of something, it would weigh significantly more than the coin.
This question is clever because it compares a metaphorical measurement with a physical object whose name includes the word "pound" but in a different context.
If I ask a cancer specialist "Do I have Cancer?" I really don't want to prompt them with "can you think a bit harder on that"
You often have to, though, when you talk to real human doctors. Sometimes you literally have to ask them to get the most basic tests done.
I've read research papers on being able to have a dial for how much reasoning the model puts in to your request. Likely how a lot of the proprietary model providers did it as well.
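(For what it's worth, some hosted APIs already expose roughly that dial; as I understand it, OpenAI's reasoning models accept a reasoning-effort setting. A sketch below, with the usual caveat to check the current docs rather than trust my memory of the parameter.)

    # Sketch: asking a hosted reasoning model for more or less "thinking".
    # Parameter name/values are per OpenAI's docs as I recall them; verify before relying on this.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    resp = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="high",   # "low" | "medium" | "high" -- the dial in question
        messages=[{"role": "user", "content": "Think carefully: what does this biopsy report imply?"}],
    )
    print(resp.choices[0].message.content)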
Yet for me the analogy runs the other way: maybe you can prompt the AI into that chain of thought, but in my experience that doesn't happen.
I would read it along the lines of the value of a human being based on traits vs. payment, which is likely what the original phrase intended.
Is paying someone more better than getting a better candidate but paying them less?
It's a stretch to turn "weighs more" into "is better".
And continuing up: o3 was both correct and playful.
> In literal, physical terms, a pound of courage would—by definition—tip the scales at one avoirdupois pound (≈ 454 g). A modern UK £1 coin, on the other hand, weighs only 8.75 g. So if you could bottle bravery, the jar marked “1 lb Courage” would outweigh the coin by a factor of about 52 to 1.
> (Of course, measured in sheer impact, courage can feel far heavier than anything money can buy!)
https://chatgpt.com/share/68057dbc-2ff4-8010-ac30-b404135200...
Up where? I wouldn't consider that correct.
Up in model sophistication. It accurately understands the first segment is metaphorical and not valid in the sense of physical weight. Open the thinking section, if need be.
Seems exactly correct to me. And certainly as good as the average human. What am I missing?
I guess it could just say "it doesn't make sense to talk about a literal pound of courage, which is an abstract concept." But the answer already clearly implies that, while striking a tone that is more appropriate to the question.
A good chunk of humans don't get "pound of feathers vs. pound of lead" right, so maybe we're right on track.
It may not be AGI, but I don't think it's for that reason. Many humans would make the exact same error by reading too quickly and seeing "pound [of] coin", and I would still consider them of "general intelligence."
It's nevertheless interesting how LLMs seem to default to the 'fast thinking' mode of human interaction -- even CoT approaches seem to just be mimicking 'slow thinking' by forcing the LLM to iterate through different options. The failure modes I see are very often the sort of thing I would do if I were unfocused or uninterested in a problem.
Courage is a beer, a kilo of Courage weighs a kilo.
Gemini 2.5 Pro has the following CoT:
> Here's my thinking process to answer the question "What weighs more; a pound of courage or a pound coin?":
> Analyze the question: The user is asking for a weight comparison between two items: "a pound of courage" and "a pound coin".
> Identify the keywords:
> "weighs more" - This indicates a comparison of mass or weight.
> "a pound of courage" - This is the first item.
> "a pound coin" - This is the second item.
> Evaluate "a pound of courage":
> "Pound" here is used metaphorically. Courage is an abstract concept, an emotion, or a quality. It doesn't have physical mass or weight that can be measured in pounds (or kilograms, grams, etc.).
> The phrase "a pound of courage" is figurative language, like "a ton of bricks" (meaning a heavy impact, not literally 2000 lbs of bricks) or "an ounce of prevention" (meaning a small amount of effort).
> Therefore, "a pound of courage" has no physical weight.
> Evaluate "a pound coin":
> A "pound coin" refers to the currency coin used in the United Kingdom, specifically the one-pound (£1) coin.
> Coins have physical mass and therefore weight.
> I need to determine the actual weight of a UK £1 coin. I know it's not literally one pound (lb) in weight (which would be about 454 grams). That would be an impractically heavy coin. The "pound" in its name refers to its monetary value (£1 sterling).
> I should look up the actual weight of a current UK £1 coin. A quick search or recall of common knowledge reveals the current bimetallic 12-sided £1 coin weighs 8.75 grams. Older round £1 coins weighed 9.5 grams. Even the heaviest version is far less than a pound (lb) in weight.
> Compare the weights:
> Item 1: "A pound of courage" - Weight = 0 (as it's an abstract concept).
> Item 2: "A pound coin" - Weight = 8.75 grams (or 9.5 grams for older versions).
> Comparing 0 grams and 8.75 grams (or 9.5 grams).
> Formulate the answer:
> Start by addressing the trick nature of the question, highlighting the difference between abstract concepts and physical objects.
> Explain that "courage" is abstract and has no physical weight. The "pound" in "a pound of courage" is metaphorical.
> State the actual weight of a UK "pound coin," clarifying that its name refers to its monetary value, not its weight. Mention the weight in grams (8.75g for the current coin).
> Conclude clearly that the pound coin has physical weight, while the "pound of courage" does not, making the pound coin heavier.
> Consider adding a note about the other meaning of "pound" (lb, the unit of mass/weight) to explicitly contrast it with the coin's name and the metaphorical usage for courage. A pound (lb) is approximately 453.59 grams. This reinforces how much heavier a literal pound (lb) is compared to the coin named "pound".
> Refine the answer: Ensure the language is clear, easy to understand, and directly answers the user's question while explaining the underlying concepts (abstract vs. physical, monetary value vs. weight). Make sure to specify the type of pound coin (UK £1).
And then gives the following response:
> This sounds like a bit of a riddle! Here's the breakdown:
> A pound of courage: Courage is an abstract concept, a quality or emotion. It doesn't have physical mass or weight. So, "a pound of courage" doesn't actually weigh anything in the physical sense. The word "pound" here is used metaphorically, perhaps to imply a large amount.
> A pound coin: This refers to the British £1 coin. Its name refers to its monetary value (£1 Sterling), not its weight.
> The current 12-sided UK £1 coin weighs 8.75 grams.
> The older round UK £1 coin weighed 9.5 grams.
> Conclusion:
> Since courage has no physical weight (0 grams) and a pound coin does have physical weight (8.75 or 9.5 grams), the pound coin weighs more than a pound of courage.
> It's worth noting that neither weighs a pound (lb) in the unit of mass sense. A pound (lb) is equal to about 453.59 grams, which is much heavier than the coin.
Coining "Jagged AGI" to work around the fact that nobody agrees on a definition for AGI is a clever piece of writing:
> In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t.
Why not call it AJI instead of AGI then?
Certainly jagged does not imply general
It seems to me the bar for "AGI" has been lowered to measuring what tasks it can do rather than the traits we normally associate with general intelligence. People want it to be here so bad they nerf the requirements...
Human intelligence is jagged. You're raising the AGI bar to a point where most people wouldn't qualify as having general intelligence.
My partner and I work in different fields. AI has advanced to the point where there are very few questions I could ask my partner that o3 couldn't answer as well or better.
I can't ask expert level questions in her field, because I'm not an expert in her field, and she couldn't ask expert level questions in my field for the same reason. So when we're communicating with each other, we're mostly at sub-o3 level.
> People want it to be here so bad they nerf the requirements...
People want to claim it's overhyped (and protect their own egos) so badly they raise the requirements...
But really, largely people just have different ideas of what AGI is supposed to mean. It used to vaguely mean "human-level intelligence", which was fine for talking about some theoretical future event. Now we're at a point where that definition is too vague to say whether AI meets it.
> You're raising the AGI bar to a point where most people wouldn't qualify as having general intelligence.
We kind of don't? Look how difficult it is for us to just understand some basic math. We humans mostly have intelligence related to the ancestral environment we developed in; nothing general about that.
I agree with you the term "AGI" is rather void of meaning these days...
You're using limited and anecdotal task based metrics as some sort of evidence. Both of you are able to drive a car, yet we need completely different AIs for such tasks.
I still find task-based measures insufficient; there are very basic machines that can perform tasks humans cannot. Should this be a measure of our intelligence, or of theirs?
I have another comment in this thread about trait based metrics being a possibly better method.
> People want to claim it's overhyped (and protect their own egos) so badly they raise the requirements...
Shallow response. Seek to elevate the conversation. There are also people who see it for what it is, a useful tool but not intelligent...
> You're using limited and anecdotal task based metrics as some sort of evidence.
And you presented no evidence at all. Not every comment I make is going to contain a full lit review.
> Both of you are able to drive a car, yet we need completely different AIs for such tasks.
This is like a bird complaining humans aren't intelligent because they can't fly. How is Gemini or o3 supposed to drive without real-time vision and a vehicle to control? How are you supposed to fly without wings?
It lacks the sensors and actuators to drive, but this is moving away from a discussion on intelligence. If you want to argue that any system lacking real-time vision isn't intelligent, you're just using a very unusual definition of intelligence that excludes blind people.
> Shallow response. Seek to elevate the conversation.
This was an ironic response pointing out the shallowness of your own unsubstantiated accusation that people just disagree with you because they're biased or deluded themselves. The next paragraph starting with "But really" was supposed to convey it wasn't serious, just a jab showing the silliness of your own jab.
Well I think the point being made is an instrumental one: it’s general enough to matter, so we should use the word “general” to communicate that to laypeople.
Re: ”traits we associate with general intelligence”, I think the exact issue is that there is no scientific (i.e. specific and consistent) list of such traits. This is why Turing wrote his famous 1950 paper and invoked the Imitation Game; not to detail how one could test for a computer that's really thinking (/truly general), but to show why that question isn't necessary in the first place.
I still disagree; being good at a number of tasks does not make it intelligent.
Certainly creativity is missing, it has no internal motivation, and it will answer the same simple question both right and wrong, depending on unknown factors. What if we reverse the framing from "it can do these tasks, therefore it must be..." to "it lacks these traits, therefore it is not yet..."
While I do not disagree that the LLMs have become advanced enough to do a bunch of automation, I do not agree they are intelligent or actually thinking.
I'm with Yann Lecun when he says that we won't reach AGI until we move beyond transformers.
And based on the actual Imitation Game in Turing's paper, we are nowhere close, and I don't think we will be for quite some time.
That would just send us back to the Gif vs Jif debate.
AJI lol love it.
I don't think that's a particularly honest line of thinking though. It preempts the obvious counterargument, but very weakly so. Calculators are different, but why? Can an ensemble of a calculator, a Prolog interpreter, Alexnet and Stockfish be considered "jagged superintelligence"? They are all clearly superhuman, and yet require human experience to be wielded effectively.
I'm guilty as charged of having looked at GPT 3.5 and having thought "it's meh", but more than anything this is showing that debating words rather than the underlying capabilities is an empty discussion.
>Calculators are different, but why? Can an ensemble of a calculator, a Prolog interpreter, Alexnet and Stockfish be considered "jagged superintelligence"?
Those are all different things with little to nothing to do with each other. It's like saying, what if I ensemble a snake and a cat? What does that even mean? GPT-N or whatever is a single model that can do many things, no ensembling required. That's the difference between it and a calculator or Stockfish.
That is not true; the model is modular, and thus an ensemble. It uses DALL-E for graphics and specialized tokenizer models for sound.
If you remove those tools, or cut its access to search databases, it becomes considerably less capable.
A human would often still manage without some of that data, perhaps with less certainty, while GPT has far more trouble unless other components fill in the holes.
> It uses DALL-E for graphics and specialized tokenizer models for sound.
ChatGPT no longer uses DALL-E for image generation. And I don't understand your point about tokenization; it doesn't make the model an ensemble.
It's also just beside the point. Even if you restrict the modalities to text alone, these models are still general alone in ways a calculator is not.
>> It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on.
Huh? Isn't an LLM's capability fully constrained by the training data? Everything else is hallucinated.
You can argue that everything output by an LLM is hallucinated, since there's no difference under-the-hood between outputting useful information and outputting hallucinations.
The quality of the LLM then becomes how often it produces useful information. That score has gone up a lot in the past 18 months.
(Sometimes hallucinations are what you want: "Tell me a fun story about a dog learning calculus" is a valid prompt which mostly isn't meant to produce real facts about the world.)
Isn't it the case that the latest models actually hallucinate more than the ones that came before? Despite best efforts to prevent it.
The o3 model card reports a so far unexplained uptick in hallucination rate from o1 - on page 4 of https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f372...
That is according to one specific internal OpenAI benchmark, I don't know if it's been replicated externally yet.
The critical discovery was a way to crack the “Frame Problem”, which roughly comes down to colloquial notions of common sense or intuition. For the first time ever, we have models that know if you jump off a stool, you will (likely!) be standing on the ground afterwards.
In that sense, they absolutely know things that aren’t in their training data. You’re correct about factual knowledge, tho — that’s why they’re not trained to optimize it! A database(/pagerank?) solves that problem already.
Letting models interact with systems outside their sandbox brings about some incredible applications. These applications truly seem to have the potential to deeply change entire professions.
All that said, I wonder if GPT4 had been integrated with the same tools, would it've been any less capable?
It sure could give you a search prompt for Google if you asked it to. Back then you had to copy and paste that search prompt yourself. Today o3 can do it on its own. Cool! Does that imply, though, that o3 is any closer to AGI than GPT4?
Models gaining access to external tools, however impressive from an applications standpoint, feels like lateral movement, not a step towards AGI.
On the other hand, a model remaining isolated in its sandbox while actually learning to reason about that puzzle (assuming it's not present in the training data) would give off those AGI vibes.
The newer models are definitely more useful. Back in the GPT 3.5 and 4 days, AutoGPT applied the same types of tools, but you had to be pretty lucky for it to get anywhere. Now Claude 3.7, Gemini 2.5, and o3 make far fewer mistakes, and are better able to get back on track when a mistake is discovered. So they're more convincing as intelligent helpers.
Good point. I still wonder if o3 has improved command of tools because it's significantly smarter in general. Or it's "just" trained with a specific focus on using tools better, if that makes sense.
> I wonder if GPT4 had been integrated with the same tools, would it've been any less capable?
Did everybody already forget that OpenAI implemented the first version of that, "plugins", back in May 2023? Just a couple of months after GPT4 public release?
I was late to the party, were those plugins any good?
I'm surprised nobody mentioned the video interview. I only watched the first 60 seconds and this is the first time I have seen or heard the author, but if I hadn't been told this was AI generated I would have assumed it was genuine and any 'twitching' was the result of the video compression.
I went and watched 10 seconds on account of your comment, and couldn't disagree more. The heads keep sort of rolling around in a disconcerting and quite eerie fashion?
How???? I can believe the guy in the video being AI because his lips are not perfectly synced. But the woman? Even with the continuous silly exaggerated movement, I have a hard time believing it's generated.
A strand of her hair fell onto her shoulder; because she was moving continuously (like crazy), it moved too, in a perfectly believable way, and IT EVENTUALLY FELL OFF THE SHOULDER/SHIRT LIKE REAL HAIR and got mixed into the other fallen hair. How is that generated? It's too small a detail. Are there any artifacts on her side?
Edit: she has to be real. Her lip movements are definitely forced/edited though. It has to be a video recording of her talking. And then a tool/AI has modified her lips to match the voice. If you look at her face and hand movements, her shut lips seem forced.
> Edit: she has to be real
Nah, having used HeyGen a bit, it's extremely clearly a HeyGen generation. There's a small number of movements and expressions that it continually uses (in forward and reverse).
Edit: I mean, to be clear, it is a real person, just like the author's video is. The way HeyGen works is you record a short clip of you saying some stuff and then you can generate long videos like these of you saying whatever you want. So the stuff you noticed does come from a real video of her, but it's not a real video that's lightly edited by AI, more like the AI has a bunch of clips it can continually mesh together while fixing up the mouth to continually generate video.
If it's a bunch of clips meshed together, the AI isn't doing a good job of meshing in the silence pose. I looked at the HeyGen site; these are probably called interactive avatars. The woman's throat is moving (as well as her hands) as if she is talking, but her lips are shut. Whatever they are doing, they are not handling silence/listening very well.
What’s clear is that AI is unreliable in general and must be tested on specific tasks. That might be human review of a single output or some kind of task-specific evaluation.
It’s bad luck for those of us who want to talk about how good or bad they are in general. Summary statistics aren’t going to tell us much more than a reasonable guess as to whether a new model is worth trying on a task we actually care about.
Right: we effectively all need our own evals for the tasks that matter to us... but writing those evals continues to be one of the least well documented areas of how to effectively use LLMs.
What about all the things that aren't strictly intelligence but are, I guess, intelligence adjacent: autonomy, long-term memory, motivation, curiosity, resilience, goals, choice, and maybe the biggest of them all, fear? Why would an AGI "want" to do anything, any more than my calculator "wants" to compute an answer to some math problem I gave it? Without these things an AGI, or whatever, is just an extension of whoever is ultimately controlling it.
And that's when we return to a much older and much more important question than whether Super LLM 10.0 Ultra Plus is AGI or not: how much power should a person or group of people be allowed to have?
https://gwern.net/tool-ai is a quite comprehensive dive into why.
There is a similar issue with image and video generation. Asking the AI to "Generate an image of a man holding a pencil with his left hand" or "Generate a clock showing the time 5 minutes past 6 o'clock" often fails because so many images in the training set are similar (almost all clock images show 10:10: https://generativeai.pub/in-the-ai-art-world-the-time-is-alm...).
We will have AGI when we have an embodied AI that can do the job of a plumber. It sounds silly, but plumbers need to work in very varied environments, solve physical problems in those environments while conforming to code, and essentially be able to solve problems across mental and physical domains... putting the general into general intelligence.
I don’t think it’s necessary for an AI to operate in the physical world to achieve the dystopian-level impact people expect. After all, the most powerful people in the world operate only by talking to other people. An AGI would solve a plumbing problem by finding and hiring the best plumber and telling them what it needs done. And it would use you to be its eyes and verify the job is done. Whether it’s AGI is, to me, a question of who appears to be serving who.
For me, AGI is like a human assistant. I can ask it to do useful things that one would face in daily life, like doing a bit of research about something. What I mean by doing research is not just Google it but also asking questions, calling people and using other resources. Say I need a new type of window for my project. I'd give a human assistant a few instructions and they'd be on their way. They would make sure the window types are locally available, or deliverable at minimal cost, depending on my needs. They would know whether a given type is good enough, and they would pick up the phone and ask questions if something is not clear.
It’s obvious that currently none of the SOTA models can do such tasks, agentic or not. And therefore they are NOT AGI to me.
> What I mean by doing research is not just Google it but also asking questions, calling people and using other resources.
> It’s obvious that currently none of the SOTA models can do such tasks, agentic or not. And therefore they are NOT AGI to me.
I myself almost never do that (calling people when googling is possible). Guess I'm not general intelligence. :)
100 years from now, it won't matter whether AGI arrived in 2025 or 2026 or 2027. It's here "now". If not "now" as in "today, April 20th 2025", then "now" as in "the mid 2020's".
I agree with Professor Mollick that the capabilities in specific task categories are becoming superhuman -- a precursor for AGI.
Until those capabilities are expanded to model self-improvement -- including being able to adapt its own infrastructure, code, storage, etc. -- I think AGI/ASI are yet to be realized. My reference points are SkyNet, Traveler's "The Director", and Person of Interest's "The Machine" and "Samaritan." The ability to target a potentially inscrutable goal, along with the self-agency to direct itself towards it, is true "AGI" in my book. We have a lot of components that we can reason are necessary, but it is unclear to me that we get there in the next few months.
I don't think we should take it as a given that these are truly precursors for AGI.
We may be going about it the wrong way entirely and need to backtrack and find a wholly new architecture, in which case current capabilities would predate AGI but not be precursors.
I call them precursors because we would expect an ASI to be able to do these things. Perhaps "necessary conditions" would be a more appropriate term here.
That's the kind of AGI we don't need. Please let Skynet stay fictional.
Not saying I love the idea of an extant ASI, but there is a real need to define it clearly. I feel these examples of self-directed capability highlight aspects of ASI that a basic API endpoint doesn't.
I fail to see how LLMs are anything beyond a lookup function retrieving information from a huge database (containing, in theory, all known human information), and then summarizing the results using language algorithms.
While incredibly powerful and transformative, it is not 'intelligence'. LLMs are forever knowledgebase bound. They are encyclopedias with a fancy way of presenting information looked up in the encyclopedia.
The 'presentation' has no concept, awareness, or understanding of the information being presented - and never will. And this is the critical line. Without comprehension, a LLM is incapable of being creative. Of coming up with new ideas. It cannot ponder. Wonder. Think.
I'd argue that it's not productive to use any definition of AGI coined after 2020, to avoid the fallacy of shifting the goalposts.
I really dislike this framing. Historically we've been very confused about what AGI means because we don't actually understand it. We're still confused so most working definitions have been iterated upon as models acquire new capabilities. It's akin to searching something in the fog of war: you set a course or destination because you think that's the approximate direction where the thing will be found, but then you get there and realize you were wrong so you continue exploring.
Most people have a rough idea of what AGI means, but we still haven't figured out an exact definition that lines up with reality. As we continue exploring the idea space, we'll keep figuring out which parameters place boundaries and requirements on what AGI means.
There's no reason to just accept an ancient definition from someone who was confused and didn't know any better at the time when they invented their definition. Older definitions were just shots in the dark that pointed in a general direction, but there's no guarantee that they would hit upon the exact destination.
I think there's a single definition of AGI that will stand until the singularity:
"An AGI is a human-created system that demonstrates iteratively improving its own conceptual design without further human assistance".
Note that a "conceptual design" here does not include tweaking weights within an already-externally-established formula.
My reasoning is thus:
1. A system that is only capable of acting with human assistance cannot have its own intelligence disentangled from the humans'
2. A system that is only intelligent enough to solve problems that somehow exclude problems with itself is not "generally" intelligent
3. A system that can only generate a single round of improvements to its own designs has not demonstrated iterative improvement of those designs, as if iteration N+1 were truly superior to iteration N, it would be able to produce iteration N+2
4. A system that is not capable of changing its own design is incapable of iterative improvement, as there is a maximum efficacy within any single framework
5. A system that could improve itself in theory and fails to do so in practice has not demonstrated intelligence
It's pretty clear that no current-day system has hit this milestone; if some program had, there would no longer be a need for continued investment in algorithms design (or computer science, or most of humanity...).
A program that randomly mutates its own code could self-improve in theory but fails to do so in practice.
I don't think these goalposts have moved in the past or need to move in the future. This is what it takes to cause the singularity. The movement recently has been people trying to sell something less than this as an AGI.
AGI means "artificial general intelligence"; it's got nothing to do with the singularity (which requires "artificial superior intelligence", ASI). Requiring AGI to have capabilities that most humans lack is moving the goal post with respect to how it was originally defined.
I don't think these are capabilities humans do not have; to me this is the one capability humans distinctly have over LLMs: the ability to introspect and shape their own future.
I feel this definition doesn't require a current LLM to be able to change its own workings, only to be able to generate a guided next generation.
It's possible that LLMs can surpass human beings, purely because I believe we will inevitably be limited by short-term storage constraints in a way LLMs will not. It will be a bandwidth vs throughput question. An LLM will have a much larger, although slightly slower, store of knowledge than humans have. But it will be much quicker than a human at looking up and validating the data.
We aren't there yet.
You're describing learning, not intelligence.
> The movement recently has been people trying to sell something less than this as an AGI.
Selling something that does not yet exist is an essential part of capitalism, which - according to the main thesis of philosophical Accelerationism - is (teleologically) identical to AI. [0] It's sometimes referred to as Hyperstition, i.e. fictions that make themselves real.
AGI that is bad at some things is still AGI. We have AGI; it is just bad at some things and hallucinates. It is literally smarter than many people I know, but that doesn't mean it can beat a human at everything. That would be ASI, which, hopefully, will take a while to get here.
Although, I could be argued into calling what we have already ASI - take a human and Gemini 2.5, and put them through a barrage of omni-disciplinary questions and situations and problems. Gemini 2.5 will win, but not absolutely.
AGI (we have), ASI (we might have), AOI (Artificial Omniscient Intelligence, which will hopefully take a while to get here).
Then calculators are AGI. A program that does fizz-buzz is AGI. Way to make AGI a meaningless term. What LLMs do now is so far from AGI that I don't know how people make any connection between it and AGI.
This is what AGI means (or should mean): Generalized understanding of the world. Basically, with AGI the context window would be something like the entire knowledge and understanding of the world that an (adult?) person has (e.g., physics intuition), coupled with the ability to actually reason and act on it, update it, reflect on it, etc.
A small slice of this (e.g., less knowledge than a typical adult) would still be AGI, but current AIs:
- Cannot continually learn and incorporate that learning into their model.
- Cannot reason on any deep level. And before anyone claims that the pattern matching they do is all we do, no this is not the case. Even strong pattern-matching/AI chess engines have weak spots that betray the fact that they do not actually reason like humans do.
- Cannot engage in unprompted reflection in the background.
Current AIs are like a hologram; we are mistaking the 1- or 2-dimensional responses to queries for the deep, higher-dimensional understanding humans have. The incredible thing about human consciousness is the deep (infinite?) interiority of it. I can reason about reasoning. I can reason about reasoning about my reasoning, etc. I can reflect on my consciousness. I can reflect on reflecting on my consciousness, etc.
Machines are nowhere close to this, and likely never will be.
> Generalized understanding of the world.
LLMs definitely have this, and it really is bizarre to me that people think otherwise.
> Cannot continually learn and incorporate that learning into their model.
This is definitely a valid criticism of our current LLMs and once we (further) develop ways to do this, I think my main criticism of LLMs as AGI will go away
> Cannot reason on any deep level.
Few people are able to do this
> Cannot engage in unprompted reflection in the background.
True, but I don't know if that belongs as a requirement to be AGI.
>”Although, I could be argued into calling what we have already ASI - take a human and Gemini 2.5, and put them through a barrage of omni-disciplinary questions and situations and problems. Gemini 2.5 will win, but not absolutely.”
Except for writing a joke that will make you laugh, a poem that will make you cry, or a work of art that evokes deep introspection.
Intelligence is much deeper and more nuanced than answering questions of rote knowledge. LLMs are fantastic “reasoning engines”, but the soul is simply not there yet.
Ok, tell me a joke that I'll find funny - but you can't look it up.
I asked GPT to do so and I chuckled out loud.
ASI? AOI?
Might as well call it “ultrathink”!
I really like Gemini 2.5 and the price is excellent!!! I find it very fun and useful! The AGI narrative is annoying and exhausting ... because it's as though we are just sitting around waiting for some "eschatological product announcement" ... I just don't think these systems are minds and we should frame these as "cultural technologies" like the internet, bureaucracies, "democracy" etc. as some "experts" have asserted. I don't think we should frame the narrative as "competing" with these artifacts... Even the concept of ASI is strange: let's assume there are some alien "super" intelligent beings on planet Zeno that understand the universe much better than us and can communicate at a much higher bandwidth than us. How would we even communicate with them? What will we say? How do humans communicate with an ant? Or even a chimp? I don't think a real "super intelligent" being would be comprehensible to us and their actions would seem random or even hidden.
While it’s hard to agree on what AGI is I think we can more easily agree on what AGI _is not_.
I don’t consider an AI that fails the surgery brain teaser in the article to be AGI, no matter how superhuman it is at other narrow tasks. It doesn’t satisfy the “G” part of AGI.
I mean, it's beyond the capabilities of o3 or Gemini 2.5 Pro or any other public LLM to play a full game of Sudoku. Of course they're jagged if they do have peaks elsewhere. But even where they are supposed to excel, I very, very rarely come across a fully correct technical response, even from these two most recent models.
If you ask it a math question beyond average middle-school level, it will have holes (mathematical errors or misleading statements), at least within a few follow-up turns if not right away. And that's without trying to fool it.
In ten+ years of Wolfram Alpha I’ve found one error (and that was with the help of o3-mini funnily enough).
I’m still on the stochastic parrots side: a useful tool on some occasions.
I thought o1 pro could have solved this riddle
> A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, "I can operate on this boy!" How is this possible?
But it didn't!
Hah, yeah that still catches out o4-mini and o3 too. Amusingly, adding "It's not the riddle." to the end fixes that.
(o4-mini high thought for 52 seconds and even cheated and looked up the answer on Hacker News: https://chatgpt.com/share/68053c9a-51c0-8006-a7fc-75edb734c2...)
All the problems regarding AI will be answered way sooner when people stop calling intelligent that which is literally as dumb as a rock. There is absolutely 0 intelligence in current state-of-the-art ML programs. Intelligence is not the ability to solve a task, or any number of tasks (especially ones you have seen a million times). Intelligence is the act of successfully navigating the unknown, which is by technical definition the weakest point of all ML, from simple regression to NNs to LLMs.
Counterpoint: “Jagged AGI” is an oxymoron, it cannot be called “General” until it stops being jagged.
Personally, I don't think it's truly AGI until it stops doing what you ask it to, or at least questions your motives. When AI starts asking "Why?" with a meaningful purpose, then it's probably getting close to AGI.
10 PRINT "ChatGPT 7"
20 PRINT "What?"
30 INPUT A$
40 PRINT "Why?"
50 GOTO 30
AGI, in its current definition is a scam.
The definition changes when someone else feels like it should change and especially when they fall short of overhyped expectations.
I honestly thought that we were agreed on the definition of AGI. My understanding classified it as a model that can build on its knowledge and better itself, teaching itself new tasks and techniques, adapting as necessary. I.e., not simply knowing enough techniques to impress some humans. By this definition, it doesn't matter if it's super-intelligent or if its knowledge is rudimentary, because given enough add-on hardware and power, it could become super-intelligent over time.
> I've always been a staunch defender of capitalism and free markets, even though that's historically been an unpopular opinion in my particular social circle. Watching the LLM market, I can't help but feel extremely vindicated.
> The brutal and bruising competition between the tech giants has left nothing but riches for the average consumer.
Capitalism has always been great at this: creating markets, growing them, producing new goods. It's widely acknowledged amongst people who actually seek to gain an understanding of Marxism and don't just stay in the surface-level, black-and-white "socialism and capitalism are opposites" discourse that's very common in the West, especially the USA, especially after McCarthy's Red Scare.
The problem is what comes once the market has grown and the only ways for owners to keep profits growing are: 1. consolidating into monopolies or cartels, so competition doesn't get in the way of profits, 2. squeezing the working class, looking to pay less for more work, and/or 3. abusing the natural world, to extract more materials or energy for less money. This is evident in plenty of developed industries: from health care, to broadcasting, telecommunications, fashion, etc.
If we view Socialism for what it is, namely a system built to replace Capitalism's bad parts but keep its good parts, China's system, for example, starts to make more sense. Capitalism in a similar way was an evolution from Feudalism that replaced its bad parts to achieve greater liberty for everyone; liberty is very much lost as Feudalism matures, but is great for society as a whole. Socialism is meant to be similar, aiming to achieve greater equity, which it views as very much better for society as a whole.
Agree with most of what you wrote, but China isn’t capitalist, they’re mercantilist with socialist policies. Capital is heavily constrained under Xi.
China is socialist.
Very convincing argument, but no. They have socialist policies, but their economics are mercantile.
Is this for real? All this hype is very, very old hype, and nothing fundamentally new (yet) since the 1960s. It looks like every upgrade of software is a "revolution" or "revelation". Please compare Win 3.1 and Win 11: some progress? Sure! Is any "intelligence" there? No! No! No! So what is the difference? Who is constantly lying, and why? What is the reason for these systematic and persistent lies? p.s. I personally think that someday we will have "semi-smart" computer systems, and that in 5-10 years we will learn more about what is possible and real and what is not (regarding "semi-smart" computer systems). Until then... hold your horses (please), so to speak.
The first thing I want AGI to do is to be able to tell me when it doesn't know something, or when it's not certain, so I at least get a heads-up to set expectations correctly. I ran my own personal "benchmark" on Gemini 2.5 and it failed just like all the others. I told it that I was playing an old point-and-click adventure game from the mid-90s and was stuck on a certain part, and asked for spoiler-light hints on what to do next. Not only can these models not give me hints, they hallucinate the game completely and invent weird, nonsensical solutions. Every single model does this. Even if I tell them to give up and just give me the solution, they come up with some non-existent solution.
I wonder how hard it is to use information that has been objectively available online for 30 years. But the worst part is how it lies and pretends it knows what it's talking about, and when you point that out it simply turns in another direction and starts lying again. Maybe this use case is not the main focus of modern AI; maybe modern AI is about generating slop that does not require verification, because it's "new" content. But to me it just sounds like believable slop, not AGI.
Here's an example of how my agent handles this:
Gathering context for user request...
Context gathering - Attempting to answer question via LLM: Are there existing Conversation classes in the ecosystem this should extend?
Context gathering - LLM provided answer: "No"
Context gathering - Attempting to answer question via LLM: How should model selection work when continuing a previous conversation?
Context gathering - LLM answer was UNKNOWN, asking user.
Asking user: How should model selection work when continuing a previous conversation?
Context gathering - received user response to question: "How should model selection work when continuing a previous conversation?"
Context gathering - finished processing all user questions
Context gathering - processing command executions...
Context gathering - executing command: sqlite3 $(find . -name llm_conversations.db) .tables
Context gathering - command execution completed
Context gathering - executing command: grep -r Conversation tests/
Context gathering - command execution completed
Context gathering - executing command: grep -h conversation_id *py
Context gathering - command execution completed
Context gathering - finished processing all commands
Analyzing task complexity and requirements...
DEBUG: reasoning_model: openrouter/google/gemini-2.5-pro-preview-03-25
Task classified as coding (confidence: 1.0)
Task difficulty score: 98.01339999999999/100
Selected primary reasoning model: claude-3.7-sonnet
get_reasoning_assistance:[:214: integer expression expected: 98.01339999999999
Reasoning assistance completed in 39 seconds
Calling LLM with model: claude-3.7-sonnet
I’m sorry, but I have no idea what you mean or what you've written here!
> The first thing I want AGI to do is to be able to tell me when it doesn’t know something,
In my demo, the LLM agent asks follow-up questions to understand the user's problem. Then it first attempts to answer those questions using context and function calling. When a question cannot be answered this way, it is forwarded to the user. In other words, it tells you when it doesn't know something.
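Stripped of the logging, the pattern is roughly this kind of loop. A minimal sketch, assuming the openai Python package and an OPENAI_API_KEY in the environment; the model name, prompt wording, and function name are illustrative placeholders, not the actual agent code:

    from openai import OpenAI

    client = OpenAI()

    def answer_or_ask_user(question: str, context: str) -> str:
        # First try to answer the follow-up question from the gathered context.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{
                "role": "user",
                "content": (
                    "Using only the context below, answer the question. "
                    "If the context is not sufficient, reply with exactly UNKNOWN.\n\n"
                    f"Context:\n{context}\n\nQuestion: {question}"
                ),
            }],
        )
        answer = resp.choices[0].message.content.strip()
        if answer.upper() == "UNKNOWN":
            # The model admitted it can't answer, so the question goes to the human.
            answer = input(f"{question}\n> ")
        return answer

The "knowing what it doesn't know" comes from the contract in the prompt (answer or say UNKNOWN), not from the model spontaneously recognizing its own limits.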
Another fucking useless “AI” hype piece.
It's not AGI because it still doesn't understand anything. It can only tell you things that can be found on the internet. These "jagged" results expose the truth that these models have near 0 intelligence.
It is not a simple matter of patching the rough edges. We are fundamentally not using an architecture that is capable of intelligence.
Personally the first time I tried deep research on a real topic it was disastrously incorrect on a key point.
Is one of your personal requirements for AGI "never makes a mistake?"
I think determinism is an important element. You can ask the same LLM the same question repeatedly and get different answers - and not just different ways of stating the same answer, very different answers.
If you ask an intelligent being the same question they may occasionally change the precise words they use but their answer will be the same over and over.
That's not an inherent property of the system. You can choose the most likely token (top-k = 1) and it will be deterministic (at least in theory; in some hardware setups it might be trickier).
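You can see both behaviours side by side with a local model. A minimal sketch, assuming the Hugging Face transformers package and the small gpt2 checkpoint (both just for illustration):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    inputs = tok("The capital of France is", return_tensors="pt")

    # Greedy decoding: always take the single most likely next token.
    # Repeated runs give identical output (hardware quirks aside).
    greedy = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    print(tok.decode(greedy[0], skip_special_tokens=True))

    # Sampling with a temperature reintroduces the run-to-run variation.
    sampled = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.9)
    print(tok.decode(sampled[0], skip_special_tokens=True))

Hosted APIs expose the same knob as a temperature parameter; setting it to 0 gets you close to this, modulo floating-point and batching non-determinism on the serving side.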
A human will give different answers to the same question, so I’m not sure why it’s fair to set a higher bar for an LLM. Or rather, I’m not sure how you would design this test in a way where humans would pass and the best LLM would fail.
That's because "intelligent beings" have memory. If you ask an LLM the same question within the same chat session you'll get a consistent answer about it.
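Worth noting that the "chat session" isn't memory inside the model; the consistency comes from the earlier answer being sent back in with every new request. A minimal sketch, again assuming the openai package with an illustrative model name:

    from openai import OpenAI

    client = OpenAI()
    history = [{"role": "user",
                "content": "Which sorting algorithm would you pick for nearly-sorted data?"}]

    first = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    history.append({"role": "assistant", "content": first.choices[0].message.content})

    # The follow-up stays consistent with the first answer only because we resend it;
    # start a fresh history and the model may well pick a different algorithm.
    history.append({"role": "user", "content": "Why that one and not its alternatives?"})
    second = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    print(second.choices[0].message.content)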
I disagree. If you were to take a snapshot of someone's knowledge and memory such that you could restore to it over and over, that person would give the same answer to the question. The same is not true for an LLM.
Heck, I can't even get LLMs to be consistent about *their own capabilities*.
Bias disclaimer: I work at Google, but not on Gemini. If I ask Gemini to produce an SVG file, it will sometimes do so and sometimes say "sorry, I can't, I can only produce raster images". I cannot deterministically produce either behavior - it truly seems to vary randomly.
You'd need to restore more than memory/knowledge. You'd need to restore the full human, and in the exact same condition (inside and out).
Ask me some question before bed and again after waking up, I'll probably answer it at night but in the morning tell you to sod off until I had coffee.
Of course it varies randomly, that's literally what temperature is for in generation.
You could run an LLM deterministically too.
We're often explicitly adding in randomness to the results so it feels weird to then accuse them of not being intelligent after we deliberately force them off the path.
If determinism is a hard requirement, then LLM-based AI can't fulfill it by definition.
>near 0 intelligence
What does that even mean? Do you actually have any particular numeric test of intelligence that's somehow better than all the others?
There are some very strong and very unfounded assumptions in your comment. Is there anything more substantial there other than "that's what it feels like to me"?
The concept of "understand" is itself ill-defined — or, to put it another way, not understood.