November 7, 2022
AI Content is not the SEO threat they want you to think it is
Despite Google's policies around AI Content, using the tech correctly does not necessarily impact your visibility and search traffic negatively.
Mike King
In 2018 I made a prediction at TechSEOBoost that we were 5 years away from any random script kiddie being able to leverage Natural Language Generation (NLG) to generate perfectly optimized content at scale using open source libraries. At that point, I’d been keeping a close eye on what was happening in the Natural Language Processing (NLP) space with Large Language Models (LLMs) beginning to mature. Conversations that I had when I got offstage after my keynote suggested that we were actually much closer to perfect automated content being deployed by cargo coders than I thought.
The next year I judged and hosted the TechSEOBoost technical SEO competition, and iPullRank provided the grand prize. Ultimately, Tomek Rudzki would take home the prize, but there were two strong entries that demonstrated natural language generation in multiple languages, from the late great Hamlet Batista and Vincent Terrasi of OnCrawl respectively. Personally, I found these to be the most compelling submissions, but the judging was done by committee.
Let’s take a step back though. What I imagine as “perfectly optimized content” is the combination of what content optimization tools (Frase, SurferSEO, Searchmetrics Content Experience, Ryte’s Content Success, WordLift, Inlinks, etc) do with what natural language generation tools (CopyAI, Jasper, etc) are doing. A key benefit of the latter set of tools is in how they look at relationships across the graph rather than just vertically down the SERP.
In practice, a user would submit a keyword with some instructions to a system that would derive entities, term co-occurrence, and questions as inputs from the SERP and then generate relevant and optimized copy with semantic markup and internal linking baked in. With the emergence of Generative Adversarial Networks (GANs) and the ubiquity of text-to-speech, the system could cook up related imagery and video transcripts as well.
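To make that pipeline concrete, here is a rough, hypothetical sketch of how the pieces might fit together. None of these functions exist in any real library; they are placeholders for a SERP analysis step, a generation step, and a markup step.

```python
# Hypothetical sketch of the system described above. Every function is a
# placeholder; a real build would call a SERP API, an entity extractor,
# and a generative language model respectively.

def analyze_serp(keyword: str) -> dict:
    """Derive entities, co-occurring terms, and questions from the SERP (placeholder)."""
    return {
        "entities": ["example entity"],
        "co_occurring_terms": ["example term"],
        "questions": ["What is an example question?"],
    }

def generate_copy(keyword: str, serp_inputs: dict, instructions: str) -> str:
    """Prompt a language model with the SERP-derived inputs (placeholder)."""
    prompt = (
        f"Write an article about '{keyword}'. {instructions}\n"
        f"Cover these entities: {', '.join(serp_inputs['entities'])}\n"
        f"Answer these questions: {', '.join(serp_inputs['questions'])}"
    )
    return f"[generated copy for prompt: {prompt[:60]}...]"

def add_markup_and_links(copy: str) -> str:
    """Wrap the copy in semantic markup and inject internal links (placeholder)."""
    return f'<article itemscope itemtype="https://schema.org/Article">{copy}</article>'

if __name__ == "__main__":
    serp_inputs = analyze_serp("technical seo")
    draft = generate_copy("technical seo", serp_inputs, "Use a conversational tone.")
    print(add_markup_and_links(draft))
```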
If that sounds a bit like science fiction, it’s not. The components exist and innovative companies like HuggingFace and OpenAI are driving us there, but the elements have not been tied together into one system yet. However, there’s still a year left in my prediction and iPullRank has an engineering team, so make of that what you will.
In the meantime, user-friendly tools built from large language models that generate copy virtually indistinguishable from what a human might write are beginning to grab the attention of marketers and would-be spammers. The implications for a search engine that has been caught on the back foot by a social video platform, and that has a growing chorus of users concluding that results limited to Reddit as a source are better than its core search quality, are potentially immense. Whether or not the two are related, we’re seeing the beginnings of a campaign to discourage the use of this type of generated content.
If it's a problem, Google created it (or What is a Large Language Model?)
Make no mistake, all the technologies that I am referencing herein are incredible albeit somewhat problematic for a number of reasons that we’ll discuss shortly. Google’s AI Research team has driven a quantum leap in the NLP/NLU/NLG fields with a few key innovations that have happened in recent years.
While the fledgling concepts behind them stretch back to much earlier Computer Science theory, the Large Language Models, as they are called, were developed based on the concept of “Transformers.” These marvels of computational linguistics examine a colossal collection of documents and, basically, learn the probability of one word succeeding another based on how often words appear in a given sequence. I say “basically,” because word embeddings play a huge role here, and there are also language modeling approaches based on “masking,” which allow the model to use the context of the surrounding words and not just the preceding ones. Transformers were introduced by Google engineers in a 2017 paper called “Attention Is All You Need,” explained in something closer to layman’s terms in a post called “Transformer: A Novel Neural Network Architecture for Language Understanding.” The concepts would be brought to life in a way that SEOs took notice of with Bidirectional Encoder Representations from Transformers (BERT).
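If you want to see the “masking” idea in action, here is a minimal sketch using the open source transformers library and the bert-base-uncased checkpoint; the snippet and model choice are mine, purely for illustration.

```python
# Minimal sketch of masked language modeling with Hugging Face's transformers
# library (assumes `pip install transformers`). BERT uses the context on both
# sides of the [MASK] token to predict the hidden word.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The quick brown [MASK] jumps over the lazy dog."):
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
```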
In June 2018, months prior to the open sourcing of BERT, the OpenAI team published their “Improving Language Understanding with Unsupervised Learning” post and the accompanying “Improving Language Understanding by Generative Pre-Training” paper and code, wherein they introduced the concept of the Generative Pre-trained Transformer, or GPT-1.
How do Generative Language Models Work?
In practice, given a prompt, a large generative language model such as GPT-1 or its successors (GPT-2, GPT-3, T5, etc.) extrapolates the copy to the length a user specifies using the probabilities of what is most likely to be the next token (a word, punctuation mark, or even code) in the sequence. In other words, if I tell the language model to finish the sentence “Mary had a little,” the highest probability for the next word is likely “lamb” due to how often that sequence appears in the texts from the Internet that it has learned from.
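As a rough illustration, and assuming you have the transformers and torch libraries installed, you can inspect those next-token probabilities directly with the small open source GPT-2 checkpoint; this is my own sketch, not how any particular vendor’s tool works.

```python
# Minimal sketch: ask GPT-2 for the probability of the next token after a prompt.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Mary had a little", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Turn the final position's logits into probabilities and show the top candidates.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, 5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>10}  {prob.item():.3f}")
```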
In effect, language models aren’t actually “writing” anything. They are emulating copy that they have encountered, based on the parameters learned in their training. You can play with this live in a number of ways, but Write With Transformer by HuggingFace has a variety of models that illustrate the concept.
In this screenshot, you’re seeing three different directions the model feels confident about taking a sentence that starts with “Jay Z is.” The longer it runs unprompted, the more it begins to meander off topic.
This is the same functionality at play when Gmail or Google Docs attempts to autocomplete your sentence. The email or document is the prompt and Google’s language models are predicting your next word, phrase, or response as you write. Note: I suspect that it will be a bit discombobulating to see a portion of the same paragraph in the screenshot above, so I’m just writing an additional sentence to make this area visually easier to differentiate so readers don’t think it was a mistake.
While large language models will generate content just fine “out-of-the-box,” there is also an opportunity to “fine-tune” their output by feeding them additional content from which they can learn more parameters. This is valuable for brands because you can feed it all of your site’s content and the language model will improve its ability to mimic your brand voice. For anyone that is serious about leveraging this technology at scale, fine-tuning needs to be a consideration.
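As a sketch of what fine-tuning might look like in practice, here is a stripped-down example using the transformers Trainer on GPT-2; brand_copy.txt is a hypothetical file of your existing site copy, and a production setup would involve far more data preparation and evaluation.

```python
# Rough sketch of fine-tuning GPT-2 on your own copy so the model picks up your
# brand voice (assumes transformers and datasets are installed; brand_copy.txt
# is a hypothetical one-passage-per-line text file).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = load_dataset("text", data_files={"train": "brand_copy.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-brand-voice", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```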
The Large Language Model Explosion
As with all types of machine learning models, a series of values that represent the relationship between variables known as “parameters” are learned in the training of a language model. Parameters in language models represent the pre-trained understanding of probabilities of words in a sequence. What has been found is that, generally, the more parameters, or the more word relationships examined using this process, the stronger the language model is at its various tasks. For instance, language generation substantially improved between GPT-2 and GPT-3.
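If “parameters” feels abstract, they are simply the learned weights stored inside the model. A quick sketch with the transformers library shows how to count them for the smallest open source GPT-2 checkpoint.

```python
# Count the learned weights ("parameters") in the small GPT-2 checkpoint.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
total = sum(p.numel() for p in model.parameters())
print(f"gpt2 has {total:,} parameters")  # roughly 124 million
```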
Despite projects like DeepMind’s RETRO being said to outperform GPT-3 with only 7 billion parameters, companies such as Google, Nvidia, Microsoft, Facebook, and the curiously named OpenAI have been in a proverbial arms race to build bigger models. Google themselves have a series of interesting models such as GLaM, LaMDA, and PaLM; they’ve also recently leaped into the lead with their 1.6 trillion parameter model, Switch-C.
All of this technology is incredible, but these innovations certainly come at a cost.
What are Some Problems with Large Language Models?
When I say that these language models are trained from colossal collections of documents, I mean that the engineers building them take huge publicly available data stores from well-known corpora such as the Common Crawl, Wikipedia, and the Internet Archive, as well as WordPress, Blogspot (aka Spam City), the New York Times, eBay, GitHub, CNN, and (yikes) Reddit, among other sources.
Naturally inherent in this is a series of biases, hate speech, and any number of potentially problematic elements that come from training anything on a dataset that does not take the time to filter such things out. Surely, I don’t have to reference the Microsoft chatbot that swiftly went Kanye once it was unleashed on Twitter, right?
Of course, there have been warnings from academics about the potential problems. Wrongfully terminated former Googler and DAIR founder Timnit Gebru, along with Emily Bender, Angelina McMillan-Major, Shmargaret Mitchell, and other researchers who were not allowed to be named, all questioned the ethics behind LLMs in their research paper “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜”
In the paper they highlight a series of problems:
LLMs are not actually “writing” – As readers, we ascribe meaning to the output of large language models because they mimic word usage in the way that we expect, but the entity on the other side does not “mean” anything as it spits out the copy. The team clarifies this with: “Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.” This is something we should keep in mind as we use these tools so as to not prematurely believe these tools are self-aware .
Computational expense – Although the quality of results improves as LLMs grow in size, the gains are incremental while the costs are exponential. The research reveals that for a 0.1 gain in the BLEU score rating of a translation task, computing costs increase by $150,000. They also highlight that this is not just a language modeling problem; rather, it is a problem across machine learning in general, and compute requirements have been outpacing Moore’s Law. In the long term, the expense at the inference stage will also eclipse that of the training stage.
Environmental and social expense – Dr. Gebru et al. note that training a big Transformer model yielded 5,680% more carbon emissions than the average human generates per year, or in their own words, “Training a single BERT base model (without hyperparameter tuning) on GPUs was estimated to require as much energy as a trans-American flight.” The downstream effect is perhaps more consequential in that regions that experience geographic racism, like much of Africa, will see more of the negative environmental impact than the countries whose languages these models are trained for.
Biased training data – In the same way that documented history is a reflection of the perspectives of those who were in power at the time, training language models on datasets from the unfiltered Internet effectively codifies the hegemonic belief systems of the time. As the paper states, “the training data has been shown to have problematic characteristics resulting in models that encode stereotypical and derogatory associations along gender, race, ethnicity, and disability status.” This is both a function of the bias of those who are privileged enough to have access to the Internet and comfortable enough to be involved in its public discourse, as well as a reflection of whose voices and viewpoints get amplified across it. Naturally, marginalized people and their thinking represent a minority of the data in the training set. Perhaps this was at the heart of OpenAI’s original decision that GPT-2 was too dangerous to share.
Static viewpoints – Language and beliefs evolve every day. Large Language Models are typically trained once and used for long periods of time, and are thereby “value-locked” while the world itself changes. As we grow more diverse, inclusive, empathic, and sensitive to the experiences of others, words like “homeless” morph into “unhoused” and their semantic associations change. History is also revised as we grow as a society. This is reflected by updates to Wikipedia and additional news coverage. Large Language Models, being static, do not benefit from these updates.
From my perspective, all of these insights are incredibly valid and well-researched. It’s both enraging and very unfortunate how Google disrespectfully ousted Dr. Gebru and her colleagues in their attempts to ensure that the technology is handled ethically. I applaud her and her continued efforts to address algorithmic bias.
In fact, as with many machine learning technologies, I find myself conflicted by their exciting use cases and the potential harm they might cause. To that end, what the paper did not explicitly consider was a second order impact of marketers using LLMs to lay waste to the web and the circular effect of generated content finding its way back into future training datasets.
What are LLMs Good At in SEO?
Naturally, much of the Large Language Models conversation in the SEO space is about using them to create long-form content at scale. After all, we’re the industry that thought using content spinners built from Markov Chains was a good idea.
The scope of these applications is actually much bigger, since they are capable of document summarization, sentence completion, translation, question answering, coding, and even writing music. The concepts behind this tech have also been applied to generating images. In fact, several of Google’s more recent innovations have come on the back of technologies that underpin LLMs or have been built on top of them. For instance, Google’s Multitask Unified Model (MUM) is built leveraging its T5 language model.
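To illustrate that breadth, here is a minimal sketch with the open source t5-small checkpoint, a scaled-down relative of the T5 family MUM is reported to build on; the same model handles summarization and translation just by changing the task.

```python
# Minimal sketch of task breadth with a single T5 checkpoint (assumes
# transformers is installed). One model, two different tasks.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
translator = pipeline("translation_en_to_fr", model="t5-small")

article = ("Large language models are trained on huge text corpora and can "
           "summarize documents, answer questions, and translate between languages.")

print(summarizer(article, max_length=30, min_length=10)[0]["summary_text"])
print(translator("Large language models are surprisingly versatile.")[0]["translation_text"])
```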
With respect to SEO and Content Marketing use cases of language models, the obvious threat is that people will dump million-page websites of raw generated long-form content onto the web. Further still, they could use a language model to write the code for the front end and incorporate its copy. We could see a proliferation of scripts and services that allow someone to generate such a site in a few clicks.
For those whose pursuits are less nefarious (in the eyes of Google), there are three immediate use cases that come to mind:
Short Form Text Generation – With respect to text generation, language models perform best when fine-tuned and limited to short bursts of text created from very specific prompts. The longer the copy gets, the weaker it gets. Consider using them for generating summaries, meta descriptions, and ad copy (see the sketch after this list).
Content Brief Generation – Through a combination of prompts in the background, some of the existing tools on the market do a fantastic job at generating briefs and informing brainstorms. Consider using language models to spin up a wealth of content briefs on a variety of keyword driven topics.
Image Generation – Despite the fact that DALL-E 2 generates images, it too is built from a generative language model. Marketers have the opportunity to move away from heavy usage of stock photography and instead use imagery that is generated from prompts. Consider them as an opportunity to build well-designed content.
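Here is the minimal short-form sketch referenced above. It uses the freely available GPT-2 checkpoint purely for illustration; in practice you would point this at a larger, fine-tuned model or one of the commercial tools mentioned earlier, and you would still edit the output.

```python
# Sketch: prompt a generative model for a meta description draft
# (assumes transformers is installed; GPT-2 is used only because it is free
# and local, not because it follows instructions well).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = ("Write a meta description under 160 characters for a page about "
          "technical SEO audits:\n")
draft = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.7)
print(draft[0]["generated_text"][len(prompt):].strip())
```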
Most importantly, although there have been efforts to improve , LLMs are generally quite bad at factual accuracy. So, no matter what you do, be prepared to edit before you do anything with your generated content.
The Problem is we all know that low utility content can drive relevance
At this point, many of us have witnessed how so many large websites drive a wealth of visibility and traffic with thousands or millions of near-duplicate content pages that feature unique copy either at the top or bottom of the page.
If you haven’t, it’s a very common tactic for both e-commerce sites and publishers. E-commerce sites create Product Listing Pages (PLPs) based on internal searches or n-grams present across the site and slap some Madlib copy on the bottom of the page. Similarly, publishers create Category or Tag pages that simply list articles with a description of the category at the top.
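For anyone who has not seen it, the “Madlib” pattern is literally a template with slots filled from page-level data. The field names below are hypothetical, but the mechanics are this simple.

```python
# Tiny sketch of templated ("Madlib") PLP copy. All field names are hypothetical.
PLP_TEMPLATE = (
    "Shop our selection of {count} {category} from brands like {top_brands}. "
    "Prices range from {min_price} to {max_price}, with free shipping on "
    "orders over $50."
)

page_data = {
    "count": 128,
    "category": "running shoes",
    "top_brands": "Nike, Brooks, and Hoka",
    "min_price": "$49",
    "max_price": "$210",
}

print(PLP_TEMPLATE.format(**page_data))
```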
Heatmaps and analytics generally indicate that no one reads that content and, if they do, they deeply regret it when they are finished. In recent years, the question has been raised as to whether it actually yields improvement. I won’t waste anyone’s time sharing A/B tests that suggest the practice is not valuable. People who tend to large sites that perform know definitively that implementing this sort of copy works and yields dramatic results. There’s even a 2 billion dollar company down the street from Google in Mountain View whose core product generates pages like this. Best practices be darned.
What we also know is that no human being should be subjected to writing this content. Sure, it can highlight facts that someone might want to know, but these very same data points are often provided in tables on the page. Certainly, the goal of the semantic web is to make those things extractable and surface them, but fundamentally that is not enough to drive visibility in Organic Search.
Remember, FUD is page one in the Search Quality playbook
Google’s search quality efforts are one part technology-driven and one part large-scale social-engineering propaganda campaign. On the technology side, the engineering teams are creating earth-shattering applications that shift the state of the art. They are also open-sourcing components of what they build to leverage the power of the crowd in hopes of scaling innovation. They talk about this concept of “Engineering Economics” in the Google Open Source documentation.
Effectively, they are saying engineering resources at Google are vast, but they are finite. Leveraging the power of the crowd, they are likely to get stronger diversity in ideas and execution by establishing bidirectional contributions between Google engineers and the motivated engineering talent of the planet.
You see an example of this within the progression of the technologies being discussed. Google released Transformers which begat the GPT family and influenced Google’s own increasingly expansive LLMs.
On the social propaganda side, the Search Quality teams make public examples of websites that go against their guidelines. Understanding that the reach of their spokespeople and documentation is limited, they apply the same power of the crowd approach to leverage the SEO community as evangelists and to offset the shortcomings of the tech. We sure did say “how high” when they told us to jump for Core Web Vitals, didn’t we?
Honestly, did people truly ever think Google could definitively determine all the guest posts on the web in order to penalize them? If you did, it was likely because risk-averse readers of SEO publications parroted back these somewhat hollow threats.
I don’t present any of this as a conspiracy theory per se. Rather, I present it as food for thought as we all look to understand how we’re being managed in this symbiotic relationship. After all, imagine what search quality would look like without us giving the scoring functions more ideal inputs.
The Helpful Content and Spam Updates may be the beginnings of an attempt to combat generated content
Recently, we all gripped the arms of our chairs in anticipation of the impact of the Helpful Content Update. Many people in the space speculated that sites using NLG tools to generate their content would be impacted as a result of this update or one of the recent Core or Spam updates.
Well, the Helpful Content update came and went, and the consensus across the space was that the sites that got clobbered were mostly those with little control over the content they serve, like lyrics and grammar sites.
Although Google spokespeople are deferring to their broadly binary (no pun intended) guideline related to “machine created content,” I don’t believe that Google is sitting idly by while an explosion of LLMs and AI writing tools happens. My guess is that the Helpful Content update was the initial rollout of a new classifier that seeks to solve a series of problems related to content utility. Much like Panda before it, I suspect Google is testing and learning and, as Danny Sullivan indicates below, they will continue to refine their efforts.