dmm | WorldCoder (Kevin Ellis group), understanding LLM refusals (Neel Nanda collaboration)

"WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment", arxiv.org/abs/2402.12275

Not a widely known paper (the authors don't promote it), but pretty spectacular (a friend of mine said, "Is it AGI already?").

I think I mostly understand how this works and I made some notes yesterday.

A meta-note here: GPT-4-level models mostly understand what they are doing, but are unreliable; so the question is whether one can organize a process on top of them that reliably produces the needed results. There are plenty of papers trying to push in this direction, but this one is very elegant, and the results are quite good.

******

www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction - very elegant and simple

******

May 9, 2024 update: Since this is access-list-only at the moment (although this post is likely to become public eventually), it's a good place for my notes on my experience of switching to Twitter "X Premium" (in comments).

May 13: let's move this post to being public.

dmm

One really needs to read pages 2-5 to understand WorldCoder.

anhinga_anhinga

testing comment notifications

Switched to "X Premium" on Twitter (trying to increase the efficiency of interactions).

The "verification" is still in progress.

Grok is useful (can talk to an AI in the Twitter context).

Another very useful thing is the ability to see "related posts" to a given tweet. This goes back in time and shows really interesting things.

https://help.twitter.com/en/using-x/x-premium

URLs for "related posts" look like this: https://twitter.com/LouisVArge/status/1788664093622272478/similar

(One could try to see manually whether this would work without paying for at least "X Basic" premium tier.)

EDIT: tested, this does not work from free accounts at the moment ("early access to new features").
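A minimal sketch of the URL pattern above, in Python. The helper name `related_posts_url` is hypothetical, and the pattern is inferred only from the single example URL quoted here, so it may not hold in general (and, as noted, the page itself did not open from free accounts at the time).

```python
from urllib.parse import urlsplit, urlunsplit


def related_posts_url(tweet_url: str) -> str:
    """Build a "related posts" URL by appending '/similar' to a tweet URL.

    Assumption: the pattern is <tweet URL>/similar, as in the one example above.
    """
    parts = urlsplit(tweet_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/similar"):
        path += "/similar"
    # Drop any query string or fragment (e.g. sharing/tracking parameters).
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(related_posts_url(
        "https://twitter.com/LouisVArge/status/1788664093622272478"
    ))
```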

Edited Date: 2024-05-10 01:29 am (UTC)

Verified

Both of today's events should be recorded:

11am Boston time, Kolmogorov-Arnold networks talk: https://twitter.com/HannesStaerk/status/1789293551211426133

1pm Boston time, "Spring updates" from OpenAI (presumably a YouTube livestream on their channel)

Dataflow matrix machines (by Anhinga anhinga)
