LLM emergent behavior written off as rubbish – small models work fine but are measured poorly

[…] As defined in academic studies, “emergent” abilities refer to “abilities that are not present in smaller-scale models, but which are present in large-scale models,” as one such paper puts it. In other words, immaculate injection: increasing the size of a model infuses it with some amazing ability not previously present.


Those emergent abilities in AI models are a load of rubbish, say computer scientists at Stanford.

Flouting Betteridge’s Law of Headlines, Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo answer the question posed by their paper, Are Emergent Abilities of Large Language Models a Mirage?, in the affirmative.


When industry types talk about emergent abilities, they’re referring to capabilities that seemingly come out of nowhere for these models, as if something were being awakened within them as they grow in size. The thinking is that when these LLMs reach a certain scale, the ability to summarize text, translate languages, or perform complex calculations, for example, can emerge unexpectedly.


Stanford’s Schaeffer, Miranda, and Koyejo propose that when researchers are putting models through their paces and see unpredictable responses, it’s really due to poorly chosen methods of measurement rather than a glimmer of actual intelligence.

Most (92 percent) of the unexpected behavior detected, the team observed, was found in tasks evaluated via BIG-Bench, a crowd-sourced set of more than 200 benchmarks for evaluating large language models.

One test within BIG-Bench highlighted by the university trio is Exact String Match. As the name suggests, this checks a model’s output to see if it exactly matches a specific string without giving any weight to nearly right answers. The documentation even warns:

The EXACT_STRING_MATCH metric can lead to apparent sudden breakthroughs because of its inherent all-or-nothing discontinuity. It only gives credit for a model output that exactly matches the target string. Examining other metrics, such as BLEU, BLEURT, or ROUGE, can reveal more gradual progress.

The issue with using such pass-or-fail tests to infer emergent behavior, the researchers say, is that nonlinear output and lack of data in smaller models create the illusion of new skills emerging in larger ones. Simply put, a smaller model may be very nearly right in its answer to a question, but because it is evaluated using the binary Exact String Match, it will be marked wrong, whereas a larger model will hit the target and get full credit.
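To make the near-miss problem concrete, here is a minimal Python sketch that scores two answers against the same target, once with an all-or-nothing check and once with a partial-credit score. The model names, the target, and the answers are hypothetical, and the similarity ratio from Python's difflib is only a stand-in for metrics such as BLEU or ROUGE; it is not what BIG-Bench itself uses.

```python
from difflib import SequenceMatcher


def exact_match(output: str, target: str) -> float:
    """All-or-nothing score: 1.0 only if the strings are identical."""
    return 1.0 if output == target else 0.0


def partial_credit(output: str, target: str) -> float:
    """Similarity in [0, 1] from difflib; a stand-in for BLEU/ROUGE-style metrics."""
    return SequenceMatcher(None, output, target).ratio()


# Hypothetical target and answers, invented for illustration only.
target = "12345"
answers = {
    "small model": "12346",  # one character off
    "large model": "12345",  # exactly right
}

scores = {
    name: (exact_match(answer, target), partial_credit(answer, target))
    for name, answer in answers.items()
}

for name, (exact, partial) in scores.items():
    print(f"{name}: exact match = {exact:.1f}, partial credit = {partial:.2f}")
```

Under the exact-match check the small model gets no credit at all, while the partial-credit score shows it was most of the way there.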

It’s a nuanced situation. Yes, larger models can summarize text and translate languages. Yes, larger models will generally perform better and can do more than smaller ones, but their sudden breakthrough in abilities – an unexpected emergence of capabilities – is an illusion: the smaller models are potentially capable of the same sort of thing but the benchmarks are not in their favor. The tests favor larger models, leading people in the industry to assume the larger models enjoy a leap in capabilities once they get to a certain size.
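A toy calculation shows how a smooth improvement can look like a leap. Assume, purely for illustration, that a model gets each token of an answer right independently with some probability, and that this per-token accuracy rises steadily with model size. The chance of getting a whole answer exactly right is then the per-token accuracy raised to the power of the answer length. The accuracies and the answer length below are made-up numbers, not figures from the study.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token is right, assuming tokens are independent."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for models of increasing size.
per_token_accuracies = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
ANSWER_LENGTH = 10  # hypothetical number of tokens in the target answer

rates = [exact_match_rate(p, ANSWER_LENGTH) for p in per_token_accuracies]

print("per-token accuracy -> exact-match rate")
for p, rate in zip(per_token_accuracies, rates):
    print(f"{p:.2f} -> {rate:.4f}")
```

With these numbers, a per-token accuracy of 0.5 gives an exact-match rate below 0.1 percent, and 0.9 gives roughly 35 percent. The underlying skill improves steadily, but the all-or-nothing score stays near zero for the smaller models and then shoots up.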

In reality, the change in abilities is more gradual as you scale up or down. The upshot for you and me is that applications may not need a huge, super-powerful language model; a smaller one that is cheaper and faster to customize, test, and run may do the trick.


In short, the supposed emergent abilities of LLMs arise from the way the data is being analyzed and not from unforeseen changes to the model as it scales. The researchers emphasize they’re not precluding the possibility of emergent behavior in LLMs; they’re simply stating that previous claims of emergent behavior look like the product of ill-considered metrics.


Source: LLM emergent behavior written off as ‘a mirage’ by study • The Register

Robin Edgar
