dmm: (Default)
Dataflow matrix machines (by Anhinga anhinga) ([personal profile] dmm) wrote2024-04-27 09:35 am

WorldCoder (Kevin Ellis group), understanding LLM refusals (Neel Nanda collaboration)

"WorldCoder, a Model-Based LLM Agent: BuildingWorld Models by Writing Code and Interacting with the Environment", arxiv.org/abs/2402.12275

Not a widely known paper (the authors don't promote it), but pretty spectacular (a friend of mine said, "Is it AGI already?").

I think I mostly understand how this works and I made some notes yesterday.

A meta-note here: GPT-4-level models mostly understand what they are doing, but are unreliable; so the question is, can one organize a process which reliably produces needed results based on that. There are plenty of papers trying to push in this direction, but this one is very elegant, and the results are quite good.

******

www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction - very elegant and simple


******

May 9, 2024 update: Since this is access-list-only at the moment (although this post is likely to become public eventually), it's a good place for my notes on switching to Twitter "X Premium" experience (in comments).

May 13: let's move this post to being public.


anhinga_anhinga: (Default)

[personal profile] anhinga_anhinga 2024-05-09 03:27 am (UTC)(link)
testing comment notifications