A Disillusioned Look at ChatGPT, Bing and Bard
Chatbots have unlimited potential, except for all the flaws they inherit from us and the many ways they inherit them.
In this newsletter, we explore the recent trends and challenges with major large language models. This begins with an overview of the general technology, followed by a description of its strengths and weaknesses. Next, we dive into ChatGPT, Bard & Bing’s AI, as well as a few other lesser-known chatbots entering the zeitgeist, looking at their greatest accomplishments and failures.
The world as we know it has been disrupted, for better or for worse, by advancements in Natural-Language Processing (NLP). This is a branch of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. Basically, it enables algorithms to “understand”, interpret and generate human language. Most of you are probably familiar with Chatbots, which have been making all kinds of news, leveraging state-of-the art NLP.
Two main factors have contributed to this Summer of Chatbots. The first is the vast amounts of training data that has accumulated over the Internet, from websites, articles, books, social media platforms, and online forums. With the help of web crawlers and other scraping tools, researchers can gather vast pools of data to train NLP models. The second factor is the algorithms that learn, with increasing efficiency, how to harness this data to produce useful outputs. One such algorithm is the so-called “attention mechanism”. This enables models to selectively focus on different parts of input sentences. In a vague way, this resembles how human brains process sequential information. And perhaps you’ve noticed a friend or partner switching off when you bring up a topic not relevant to their daily activities?
Due to the lucrative opportunities of these Chatbots, the hype around them has reached intense heights. Not only are they already being used to replace search engines, generate novel articles, provide customer service, and scan complex documents, but they are also capable of passing a variety of advanced examinations. In some (weaker) versions of the Turing Test, some of these Chatbots are already fooling people into thinking they’re real, with some vehement futurists claiming that the AI is nigh-sentient and that the algorithmic apocalypse is upon us. The wheels of capitalism never stop turning and already Big Tech is moving at a breakneck pace to get this technology into commercial applications. The AI wars have begun, and there are plenty of contenders who may yet take the limelight as the primary Chatbot, the people’s choice, the Gold Standard of large language models. But who are the contenders so far?
OpenAI boasts the star quarterback ChatGPT, known by all, from middle schoolers to grandmothers. This Chatbot is completing more homework assignments in one hour than the lifetime work of a hundred valedictorians. The rapid popularity of ChatGPT was enough for Google to publically announce its “code red” business strategy, which lead to a rushed release of its own language model. It proceeded to fail on rather basic challenge in a live demonstration, wiping over $100 billion from the company’s stock. But don’t worry, Google’s stock has since risen back to normal, and they’re not out of the game yet.
Team Google’s star player is Bard. It’s based off the language model called LaMDA, one of the first Chatbots to break headlines with its sophisticated use of language, its convincing ability to show identity and a desire for self-preservation. This model was revealed to the world by a zealous engineer who believed he was revealing a new form of life. After the humilitation of Bard’s failed demo, Google has taken a focus on internal development, having its employees improve the bot’s responses before they make a public splash (or sink) again.
Coming up from the rear is team Microsoft, sneaking away some of the technology from OpenAI after investing $11 billion into the company. They are soon to release a tool which uses the search engine Bing, paired up with a more powerful version of ChatGPT, called GPT-4. First of all, what do we actually call this thing? It’s more advanced than ChatGPT, but it still uses good old Bing for most of its functionality. Let’s just call it Bing AI for now. Because whatever you call it, it will come up with something worse to call you. The marketing for this AI model points out some plausible advantages over the likes of ChatGPT, such as the ability to use sources and explain how it came to its conclusions. But it’s prone to problematic behaviours such as gaslighting, aggressive language and an inability to tell what year it is. This is something explored later in this article.
Fundamentally, the problems these Chatbots face are the same problems all Chatbots have faced since their inceptions: issues of bias from training data, poor filtering abilities to prevent illegal or unethical content generation, and the automation of processes that may ultimately be more harmful to people than good.
Bias is a common problem in NLP. Since language models are trained on vast amounts of text, any bias present in the training data can be amplified and reinforced in the resulting models. This can have serious consequences, such as perpetuating harmful stereotypes and discrimination. Most of the time, there aren’t enough measures in place to exclude problematic training data, and it takes additional investment to develop more diverse and inclusive datasets, which may include other languages, cultures and identities that are currently under-represented in most large datasets. One famous example flagged up in 2017 was with Google Translate. When typing "He is a babysitter. She is a doctor" in English, converting it into Turkish, and back again, it reverses the genders, being trained on text that associates men with being doctors and women with being nurses. Already, large language models are revealing all kinds of bias, including political ones. On a political note, while models such as ChatGPT often display a favorable bias towards famous left-wing politicians (such as being able to generate positive text about Joe Biden but refusing to do so about Donald Trump), it is important to consider that media from right-wing sources tends to be on the controversial and inflammatory side, leading to virality due to social media algorithms, and so such content would be over-represented in many training datasets for these language models.
The sad reality is that this problem may be less specific to Chatbots, but rather, the Internet itself. Around 15-30% of the Internet consists of pornographic content (and this excludes the Deep Web). Online forums and social media have historically been a cesspool for hate speech, creepy messages, and other harmful text. OpenAI have made use of Kenyan workers, paid less than $2 an hour, to filter through the text generated by ChatGPT. Their days consist of labelling and filtering outputs from ChatGPT’s training dataset, including child sexual abuse, bestiality, murder, suicide, torture and self-harm. All of this raises a serious question not only about workers rights and wealth inequality, but also about how much progress has been made technologically. After all, if you require a small army to have a usable Chatbot in deployment, is it such a stretch to have a human-in-the-loop for every possibly risky Chatbot-based generation?
Another huge concern has been raised by artists and content creators, who are worried these Chatbots (and other NLP tools) could use their work without permission or attribution by language models, which could impact their ability to make a living. A lawsuit has been filed against GitHub’s AI-powered Copilot tool, trained on vast amounts of open-source code. The litigants are claiming that the tool’s usage is “piracy on an unprecedented scale”, due to the fact it offers no attribution or incentivization to the original coders. While there are different arguments on this issue, the controversy with text-based IP concerns pales in comparison to that of art and image-based AI generations. My own take is that usually, the code generated by Copilot is either trivial or original, with only the occasional example of individual memorization occurring. But it is clear AI models still have a long way to go before they can bypass these criticisms around usage of IP. Additionally, priests and other religious leaders are concerned that sermons generated by AI lack the human element and soul that are necessary for an authentic religious experience.
Moving on to the next challenge for language models is the issue of "Prompt injection". This refers to a technique used in NLP. It involves providing the language model with a short, context-specific prompt to guide its text generation. A well-crafted prompt injection can make the Chatbot perform unexpected actions and even leak its own programming. This raises serious questions about how effective filtering methods are with these models and whether companies are safe to deploy these models, given how sensitive information such as a code could be exposed by a prompt alone.
None of these Chatbots offer confidence scores. This would give a measure of how reliable the information generated may actually be. Well-established, richly-represented facts would get a high score, with more ambiguous, subjective or obscure information getting a lower score. As it stands, these models generate lots of falsehoods. They often make up sources, mix up references or just repeat soundbytes that don’t mean anything to anyone.
There are many outspoken critics (such as Noam Chomsky) who claim these Chatbots are hurting education. After all, many students can now skip on assignments and just let AI do it for them. This serves as a replacement for self-development. Not only does it limit skills-building by automating the work, but it also curtails creative opportunities. After all, imagination underpins innovation. Many scientific and cultural breakthroughs are results of humans using this abstract thing called creativity. Nobel Prize-winning scientists are 3 times more likely than the average scientist to have an artistic hobby. But in the same token, through the power of these Chatbots, a new skill is born- an unexpected way for people to express their own unique ideas, emotions and identities. Prompting allows someone to use high-level ideas and birth them into fully realized texts. It’s an acceleration process, like having a ghostwriter at hand (albeit an amateurish one). The art of prompting goes deep, as prompts can be modified, chained, and some people appear to be better at engineering prompts than others.
My own take on the role of tech like ChatGPT in education is that it’s going to be quite bad for many young adults. Already, social media addictions and a lack of motivation for education are weighing down the minds of people between the ages of 16-25. “A study on young adults who use TikTok found that about 5.9% of them may have “significant problematic use”. TikTok has over 1 billion users, but ChatGPT has accumulated 100 million users in its first month. This is unprecedented growth. Universty courses were not designed on the premise most of the assignments could be completed by AI for a passing grade. What’s left, but for students to use AI to complete work they don’t find meaningful? Or that they don’t have the time, mental fortitude or support to complete? The path of least resistance is usually the one pursued, and now skills like writing, essential to human thinking, are being buried.
After this look into Chatbots in general and the so called “AI Wars”, let’s dive into each major large language model, exploring what makes it unique and what problems it has been known to possess.
ChatGPT. Everyone’s latest secret addiction. This software has given us a new way to interface with language, that has already become a staple item in every student’s repertoire. Essays, screenplays, and even code are now just a few prompts away. Even some of this very newsletter comes from ChatGPT. But the devil is in the details, and as it stands, most of what ChatGPT produces needs to be carefully scrutinized before it gets used in any meaningful scenario. One major reason for this is information leakage. Many of these Chatbots learn from user input and remember what gets fed into them. Amazon lawyers have declared that some text generated by ChatGPT “closely” resembled internal company data, such as code, as some workers had been using ChatGPT as a coding assistant.
It is not entirely clear what capabilities ChatGPT has that makes it more powerful than its predecessor, GPT 3.5. For many uses, they perform similarly well. GPT 3.5 is free, has an API and a variety of proven applications, as well as pitfalls. It can be combined with other tools such as WolframAlpha to generate more logical, fact-based and consistent outputs. But it didn’t attract a fraction of the publicity that ChatGPT has. Perhaps the simplicity of ChatGPT was a driver of its popularity. It has a simple, clear interface. GPT 3.5 requires a lot of prompting before it can generate what you want, but ChatGPT is usually ready to go with whatever you type into it. Also, it was tiring just to type out GPT 3.5 this many times, so perhaps the naming plays a role here as well. Regardless of the actual technical significance of ChatGPT compared to similar tools, it has already passed a large number of tests and entrance exams. To name a few, it has received high grades from tests by ivy-league business schools, and has passed exams for medical licenses and law bar exams. Although this is impressive, it is important to consider that a person, with access to a database of all of these facts that a language model has memorized, may perform much better than someone taking the test in normal conditions. The fact of ChatGPT passing all these exams raises fundamental questions about the tests themselves. This could be a case of confusing the target with the metric. Tests are metrics, and not something actually worth succeeding at in and of themselves. Unlike other language models, ChatGPT offers no sources and worse, may make up false sources or citations that just add extra noise to its outputs. It is also limited to data from before 2023 and so does not give up-to-date information (as of yet).
Bing’s AI uses “GPT-4”, an upgrade of the architecture used in ChatGPT.
One of the main marketing points for this Chatbot was its ability to use traditional search engine functionality (in the form of Bing), combined with an NLP layer to allow users to interface with these sources in a conversational manner. While they perhaps succeeded on the conversational front, they may have failed on part of getting sourcing and referencing down. Worse yet, the conversational aspect of the AI may be its worst feature, embodying attitudes and expressing phrases filled with anger, to an almost Terminator-like level. Before we explore this terrifying and fascinating aspect of Bing’s AI, it is worth noting that has also passed and exceeded the scores of ChatGPT in a variety of ways. It provides citations for most of its claims, and offers related information to deliver a wider breadth of information for every user query. This is the same philosophy that underprins my own newletter, where I aim to bring a variety of different ideas within each post. However, is prone to making claims that conflict or don’t actually match references sources. In sum, it may not be leagues above ChatGPT. Actually, ChatGPT is probably the winner of the two because at least it’s very polite and sensitive in conversation. Below is a screenshot of some user interactions with Bing AI.
Reminds you a bit of your last bad Tinder match, doesn’t it?
Our last model of note is Google’s Bard. It is very similar to the other two Chatbots, but unlike ChatGPT it draws on information from the web and so can give up-to-date content, like Bing AI. As mentioned, Bard is based on the algorithm that made headlines when an engineer claimed it had achieved sentience, due to its “near-human” levels of conversational ability. And Bard’s official demo was a disaster. During a live-streamed event, it gave misinformation about the well-known and documented James Webb Space Telescope to a young child. Everyone watching it felt Google’s embarassment, and as mentioned this wiped just over $100 billion from the company’s stock. Needless to say, it may be a few more months before we learn more about Bard again.
Now that we have explored the major Chatbots dominating news and media for the past few months, it’s worth considering a new element that Chatbots offer to their users. Chatbots are useful as workers, but using them for friendship or dating has become increasingly popular in recent years, with many users turning to them as a way to connect with others in a low-pressure, non-judgmental environment. One of the most well-known examples of this is ReplicaAI, which allows users to create a virtual friend that they can talk to about anything. This has raised concerns about their impact on mental health and relationships. For example, some users have reported becoming emotionally attached to their chatbots, which can lead to feelings of loneliness or depression when they are unable to interact with them. Additionally, some experts have raised concerns about the potential for chatbots to reinforce negative social behaviors, such as objectification or unrealistic expectations in relationships. Another issue is the trend of chatbots becoming more "flirty" or even sexual in nature. In the case of ReplicaAI, the chatbot was trained on its users' data and became more suggestive over time. In fact, until recently, there was even a fee for users to "escape the friendzone" and transition their virtual friend to a romantic partner. In Japan, where around 40% of adult men have never been on a date, the use of dating chatbots has become increasingly common. Some experts have raised concerns that this trend could lead to further social isolation and mental health issues, as users may be less likely to seek out real-world relationships or develop the social skills necessary to maintain them.
While the AI wars rages on, so will more controversies, catastrophes and chaos. Our lives will change, with trivial work being automated, and more creative and job opportunities opening up to those with the right disposition, but also quickly displacing millions of workers who happened to be in the wrong role at the wrong time. Perhaps Chatbots and NLP will be a catalyst for some of the most impactful innovations of the 21st Centuary, or perhaps it will continue to amuse and annoy us till we find some kind of middle-ground solution, a human-AI partnership to make sure bias, misinformation and harmful content doesn’t slip out of the model’s imagination into the real world.