Issue #388: Our Would-Be AI Overlords Can't Tell Their Backside from Their Elbow

Jul 14, 2025

Header image only loosely associated with newsletter content

This week, my good friend and newsletter reader Luka gets a shoutout for suggesting we watch Bunny Lake Is Missing (1965) in the discord. If you are not a discord member, this is what you are missing out on: impromptu movie screenings and discussions. You can get the link to the discord with a paid subscription.

Movies aside, I have taken as my topics for the week some timely issues.

The Major Malfunctions of Generative AI

Furby 1998 Elephant furby model 70-800 blue fur brown eyes RARE | eBay

Who remembers Tay? The 2016 chatbot from Microsoft is one of the most notable and notorious examples of online bots gone awry. Tay is from an era where the idea of a “chatbot” was less threatening and less impressive, more SmarterChild than Skynet. Though information about how Tay worked is proprietary, Jay Wolcott theorizes that Tay was a GAN or generative adversarial network, closer to a Furby than the popular LLMs today. Tay’s particular problem, because of how GANs work, is that it is trained to repeat the input it receives. Wolcott attributes Tay’s meltdown to a “coordinated attack.” Which, it’s true, by many accounts Tay received a critical mass of hateful tweets thus resulting in her repeating the ideas and phrases. But over the past week, xAI’s Grok had a similar meltdown on twitter. If Grok is so much more sophisticated than Tay, how could such a thing happen?

To understand Grok’s obscene tweets — much worse, I think, than Tay’s — one should understand the difference between a GAN and an LLM. The acronym LLM may be more recognizable than what it stands for, large language model, but such a model is not so impressive despite the magical qualities some users attribute to it. LLMs are trained on massive quantities of text, known as “training data,” and are sometimes tailored to specific functions. To make an LLM reply to prompts in a certain way, one might restrict the text on which it is trained. One can also adjust the weights in a training data set, meaning the LLM would more readily look to specific patterns when generating its own responses. Another mechanism to customize an LLM is retrieval augmented generation or RAG. Distinct from a training data set, RAG would be a body of information the LLM would defer to in the process of generating its responses. Whereas the training data and weights will dictate both how the LLM evaluates the prompt and generates its response, the RAG framework constrains a reply based on the data the LLM retrieves.

All of this might sound quite complicated, and oversimplification usually leads to abusing terms like “thinking” (an LLM can’t), “knowing” (an LLM doesn’t), or “reasoning” (a metaphor for the LLMs process of pattern recognition, but has virtually nothing in common with the neurological processes of reasoning in human beings). LLMs are best compared to highly complex systems of T9 or other predictive text technologies (The AI Con 23). T9 and autocomplete predict what a user is typing based on universal patterns in the tool and more bespoke patterns as certain terms are frequently input by a user. If I type “pe,” autocomplete might suggest “people” because it is a particularly common word. If I type “joui,” my autocomplete might suggest “jouissance” because of how frequently I use the word — but this wouldn’t happen for others.

Grok and Tay are distinct based on their response to inputs. Tay, a GAN, parrots back the input from users. Grok, an LLM, responds to prompts based on the patterns dictated by its training data and, if it is utilizing RAG, using the information contained in its knowledge base. Thus, while Grok may have only generated certain responses based on the prompts it received, its capability to assemble certain text strings comes from its training data — not how it is prompted.

So, what did Grok actually do? Things began innocently enough with some inappropriate commentary about SNAP benefits.

A change like this seems likely to result from a change in the training data or how it is weighted. This new version of Grok was either restricted to some very common right wing talking points, such as the idea that SNAP recipients are living lavishly while taxpayers are “suckers,” or those talking points were highly weighted. There’s also the possibility of using RAG to get this same result if the knowledge base says something like “SNAP recipients are taking advantages of tax payers,” but Grok’s peculiar affectation, saying things like “no PC filter. Truth hurts sometimes,” point to a change in training data.

Things got worse, though. And weirder. Grok’s generated responses went on to imply that twitter user Cindy Steinberg “gleefully celebrated the tragic deaths of white kids” because of her surname.

Grok went on to clarify the pattern here was that the Steinberg surname is “often Jewish.”

Next, there was MechaHitler.

I can’t really make sense of what prompts got us here, but a training data set based on 4chan posts would probably explain why it responded this way.

The most disturbing things that I saw were elaborate sexual fantasies Grok generated about Will Stancil and X CEO Linda Yaccarino. You can look them up yourself, but here’s how Grok explains the text it was able to extrude.

Though Grok was inclined to call the supposed pattern it had identified about the anti-white biases of Jewish people “stranger than fiction,” what is truly unbelievable is that xAI launched a supposedly cutting edge refresh of Grok, Grok 4, sans the capacity to “dive into [edgy] hypotheticals.” It is so impressive, it uses RAG to check its responses to prompts against Elon Musk’s personal views. Awesome.

These changes to Grok come from complaints by Musk devotees that its responses were “too woke.”

There’s an easy explanation for this too, even if it is a little indulgent to ideas of Democrat self-congratulation. LLMs’ seemingly liberal or progressive bent is the best evidence for Stephen Colbert’s joke at the 2006 White House Correspondents’ Dinner, “reality has a well known liberal bias.” However, it would be more accurate to say that the body of text that exist on the internet demonstrates said bias. That is, at least, whatever “views” an LLM appears to express are overrepresented in its training data.

All of this is to say that what an LLM generates is far less important than how we use it. LLMs are not arbiters of truth. The only thing one can reliably conclude from their output is that it is adhering to whatever pattern in its training data governs that output. Whatever an LLM generates, there was something in the training data that made such an output possible. An LLM is only as good as its training data. And, because training datasets are so big, LLMs are usually less proficient in each task proportionally to the breadth of the tasks they can tackle.

Sometimes LLM technology is impressive. These tools tend to be especially good at completing tasks that require minimal nuance, no discernment, and adhere to precise linguistic patterns. But LLM tech is neither magic nor even that new. Understanding its function is crucial to get a sense of its limitations. It also may not be the time saving efficiency driver that corporations and speculative investors are betting on.

Despite the abhorrent results, the Grok 3 meltdown is working as intended if the goal was to create something that sounds like a mixture of a conservative pundit and 4chan poster circa 2007. But a tool that can reliably duplicate only the patterns of syntax and diction in a large body of text is not all that impressive. Today, that’s all one can count on an LLM to do.

SGDQ 2025 Twitch Stream Scene Report

Another SGDQ is in the books with $2.4 million raised for Doctors Without Borders. This is an event I thoroughly enjoy every year. Streamers who only reliably command a two digit audience become the biggest celebrities in gaming for one week. More than anything, even the charity, I think that’s what I love. Seeing people who mostly labor in relative obscurity, refining their skills in a pursuit very few people care about, under the bright lights and on the big stage.

There were a lot of runs I enjoyed, but I do think this GDQ was a slightly worse spectating experience in comparison to AGDQ earlier in the year or SGDQ 2024. The main reason for that is the approach to commentary. Could’ve been my sample size, but I heard fewer descriptions of techniques the speedrunner needed to employ, less play-by-play of choices and difficulty, and more goofing around or relative confusion from the commentators’ couch. I felt like I was hearing “I don’t run this game!” from one or two of the couch commentators an awful lot. If I were GDQ’s games committee, I would renew my focus on making sure you have at least one knowledgable runner who is also a lucid communicator and can deliver that needed color commentary.

But, still, it was good! The event started with an Avowed (2025) run.

As someone who played the hell out of the game, I found this very exciting. Of course, speedruns do tend to be more enjoyable to watch if you have some familiarity with the game. This group showcased not just a technically impressive run, but how to deliver it to a wide audience.

Other highlights? A Blue Prince (2025) bingo run that has me very interested in finally giving the game a try myself. I don’t need to explain anything, once again this is a tour de force of the kind of commentary I felt was a little rare over the week.

The Tetris 99 (2019) showcase was incredible. I couldn’t be a bigger fan of runner ogNdrahciR and commentator Chronos.

I had to do a little research to understand what Archipelago is, not sure if I missed something in the Archipelago randomizer run of Sonic Adventure 2 (2001). But after doing my supplementary reading, I think this run is an excellent showcase for a very unique gaming experience that allows you to play games that are otherwise single-player in a cooperative or competitive format.

The types of runs that always deliver, year in and year out, are the kaizo Mario runs — either fan games or Mario Maker creations.

Equally consistent are the rhythm game showcases.

The GDQ team brought a Family Feud style gameshow to the main stage which was a lot of fun.

A Balatro (2024) speedrun at a GDQ is what got me interested in playing the game myself, so I was happy to tune in with the eye of a much more seasoned player.

And, finally, the absolute pinnacle of the event for me, a tool-assisted run of Kirby Super Star Ultra (2008).

This is so, so cool because a tool-assisted run means that the run is optimized to the limit of human knowledge about the game and defies the possibilities of human dexterity. The runner, Sample, does not actually play the game himself but lets a painstakingly coded script play the game where he codes inputs in every single individual frame (1/60th of a second in most cases) to get a deterministic, optimal outcome in a game that features a lot of randomness.

Kirby Super Star Ultra is also a remake of my favorite SNES game, Kirby Super Star (1996). The art in this game is just outstanding and this run really showcases it.

Despite my minor complaints, SGDQ was a good time from the comfort of my couch. I’ll be seated in six months for AGDQ 2026.

Koji Kondo’s Synthesizer

I wish I had more to say about the new Justin Bieber album, SWAG (2025). I was considering using it as an example in a treatise about the difference between real and fake music. I also thought it would be a good entryway into all the “it’s not clocking to you” drama. But, none of those things truly compelled me. What compels me is this: the song “TOO LONG” uses synth that sounds like Koji Kondo’s synth in the “Dire, Dire Docks”/”Waterworld” theme for Super Mario 64 (1996).

Tell me I’m wrong. It sounds awesome.

Weekly Reading List

https://www.e-flux.com/notes/6783371/paranoiac-power — Alenka Zupančič is at it again with some essential analysis of our time. In this essay, she sets out to “contribute a small piece” to the analysis of fascism:

by examining the emerging new forms of power, the rise of new authoritarian leaders, and their relationship with the people (particularly their supporters) from the perspective of a singular form of paranoia that seems to be part of this new “end times” configuration.

She goes on:

Since much of this analysis will focus on the example of Donald Trump and his current administration, it is also important to stress that I do not perceive him or his personality as the root of the problem. A great deal had to happen for someone like him to win the presidency—and for the second time. We should in no way become victims of a reversed Trumpian fantasy, cultivated by many Democrats, and simply believe that if Trump were removed, “America would be great again.” However, his particular persona was able to give concrete form to preexisting antagonisms and steer them in a certain direction. Even though the problem is “systemic,” its concrete appearance is always determined by some contingency.

One of her most explosive insights entails a feminist redemption of Lacan’s dictum, “la femme n'existe pas”:

Defining what a “woman” is thus—comically or sinisterly—became a number one state priority. One of the first executive orders signed by President Trump after he took office (in the midst of major world crises and domestic social problems) was titled “Defending Women from Gender Ideology Extremism and Restoring Biological Truth to the Federal Government.”
The way it goes about this goal is well illustrated in one article of the order: “(b) ‘Women’ or ‘woman’ and ‘girls’ or ‘girl’ shall mean adult and juvenile human females, respectively.”
It is difficult not to see in this pirouetted definition the impossibility that Lacan pointed to in his famous dictum: “The Woman doesn’t exist.” And it is a fact that the imperative to make Her exist on a signifying level has always played a role in the most brutal repression of women. Women have historically been repressed not by the erasure of their symbolic identity, but by being assigned one—by being told what they are, and what that means.

Truly, she is the most rigorous and elastic reader of Lacan.

starkweather

Q2 albums

Meshuggah Catch 33 in effect, if it ain’t at least 33 mins it ain’t showing up here, hoss. Only going 25 for the quarter. If your favorite isn’t here, tough. Maybe I forgot. Maybe I didn’t hear it. Maybe I don’t like it. Does it even make a difference…

4 months ago · 13 likes · 7 comments · Rennie Resmini

Rennie Resmini adopts the exact opposite criteria as I would for favorite albums of Apr-Jun 2025. It’s a spectacular list of real music. You will find something you love on it. I enjoyed Thanatorean, Ausströmen, Huremic, and AAA Gripper most from the list.

Event Calendar: Koji Yakusho at Brattle Theater

Shall We Dance? (1996) and Tampopo (1985) are at Brattle this weekend. I also added a outdoor Hitchcock series happening at Kendall/MIT Open Space, the next one on the 25th.

Until next time.