Why Are People Bashing Google Gemini if It’s Better Than GPT-4? (2024)


Whether you think Google DeepMind's Gemini is a success depends mainly on whether you like Google.

Google announced Gemini by surprise on Wednesday; it was expected in Q1 2024, but the company moved the event up. The reported performance is great (Gemini surpasses GPT-4 on most benchmarks), the demo grabbed people's attention, and researchers who worked on the project seemed happy with the results.

But soon enough, people found flaws and shortcomings in both the model and the presentation. From bashing Google, they jumped to bashing Gemini. Rightful skepticism of Google's ability to make worthwhile AI (which ironically contrasts with the fact that it invented the transformer) and a long history of overpromising and underdelivering that has favored OpenAI for years are a heavy burden on Google's back. It seems Gemini, despite the breakthrough it represents, is not going to be enough to lift it fully.

Before I go on, I want to state very clearly that I don't prefer Google over OpenAI or OpenAI over Google. I couldn't care less which one comes out on top today or five years from now. What I care about is the unbiased truth (which is unreachable, but I will try anyway). That's why I started this article with that sentence: I believe people's underlying motivation to bash or praise Gemini (and, by extension, Google) depends much more on their preconceptions of the company and its competitors (OpenAI) than on the truth of which model, GPT-4 or Gemini, is actually better.

Now, onto the reasons why people are criticizing Gemini. The most remarkable exhibition of the model's abilities came from a viral six-minute demo that Google CEO Sundar Pichai shared on X. It's a nice-looking demo that many people initially praised, not as a polished PR move but genuinely, for the clear advances it displays over the previous state of the art. For instance, here's the reaction of TED head Chris Anderson:

I can't stop thinking about the implications of this demo. Surely it's not crazy to think that sometime next year, a fledgling Gemini 2.0 could attend a board meeting, read the briefing docs, look at the slides, listen to every one's words, and make intelligent contributions to the issues debated? Now tell me. Wouldn't that count as AGI?

The thing is, the demo is not a real, non-staged interaction with Gemini.

I have to confess I didn't watch the demo in full at first because I imagined it'd be rather distracting in contrast to just reading the technical report. Unsurprisingly, the video was not made for the world to see what Gemini can do but was an illustration of what "the multimodal user experiences built with Gemini could look like [emphasis mine]," with prompts and outputs "shortened for brevity," and intended "to inspire developers," as Oriol Vinyals said after people raised suspicions. What he forgot to mention was that it was also, apparently, designed to mislead potential users.

Grady Booch stated the problem very clearly: "Ah, the age-old art of deceptive demos. Dear Google: you should have made this abundantly clear in your original video." I 100% agree. The demo doesn't devalue Gemini as a scientific and engineering feat, but no one likes being tricked. Google is not in a position where it can pull these kinds of stunts and expect to get away with them (a separate blog post explains how researchers created the demo video, and, indeed, the process has little to do with what the final cut suggests).

The demo presented a smooth, fluid, high-quality real-time interaction, light on prompt techniques, and revealed a degree of multimodal processing and understanding not previously seen. It was none of those things.

Or at least we don't know yet. And that's the second problem: Google has focused on talking about Gemini Ultra (the largest size, the one in the demo, and the one comparable to GPT-4) but has only released the smaller sizes: Gemini Pro on Bard and Gemini Nano on-device for the Pixel 8 Pro. Gemini Ultra is coming out in January, so why didn't they wait until then for a cleaner release?

Google executives were surely feeling the explicit pressure of public opinion and the implicit challenge from OpenAI to ship something soon. They probably thought: we give them what we have now (Pro and Nano) and tell them what Ultra can do (with a fancy demo). Unfortunately, that doesn't work anymore. OpenAI changed the rules. People expect immediate hands-on testing, a first-hand experience of AI, not a heavily edited, PR-focused, disingenuous demo.

People will only believe what they can see and evaluate by themselves, even if as anecdotal evidence it amounts to less than what the technical report may reveal.

Oh, the technical report. That's the third problem. Instead of going straight to the demo, I decided to read (well, skim) the blog post and the technical report. My first impressions were not bad (although not as good as if I had watched the demo). The benchmark performance is solid: Gemini Ultra beats GPT-4 on most benchmarks, even if only by a few percentage points. But there was a detail I overlooked that others caught: the blog post compared Gemini's performance on the MMLU benchmark (one of the most relevant) using a different prompt setting than GPT-4's (CoT@32 vs. 5-shot). With this flawed comparison, Gemini got 90.04% vs. GPT-4's 86.4%. What those numbers mean doesn't matter here; what matters is that they are not the same thing.

Jeff Dean was quick to point out that the technical paper reports the benchmark correctly, as an apples-to-apples comparison: when both models are evaluated with CoT@32, Gemini Ultra is still the best model (90.04%) because GPT-4 only increases its performance to 87.29%. However, when both are evaluated 5-shot, GPT-4 comes out on top with the reported 86.4% vs. Gemini Ultra's 83.7%.
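To make the apples-to-apples point concrete, here's a minimal sketch tabulating the four MMLU numbers cited above. Which model "wins" flips depending on the prompt setting, which is exactly why comparing CoT@32 against 5-shot is misleading:

```python
# MMLU scores as reported in the Gemini technical report and the GPT-4 numbers
# cited alongside them. Each setting is a different evaluation protocol:
# CoT@32 = chain-of-thought with 32 samples; 5-shot = five in-context examples.
mmlu_scores = {
    "CoT@32": {"Gemini Ultra": 90.04, "GPT-4": 87.29},
    "5-shot": {"Gemini Ultra": 83.7, "GPT-4": 86.4},
}

for setting, scores in mmlu_scores.items():
    # Pick the higher-scoring model under this protocol.
    winner = max(scores, key=scores.get)
    print(f"{setting}: {winner} leads with {scores[winner]}%")
```

Run it and each line names a different leader: Gemini Ultra under CoT@32, GPT-4 under 5-shot. The blog post's mistake was pairing the bold number from one row with the other model's number from a different row.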

We won't really know what the CoT@32 result implies for users until we can test Gemini against GPT-4. But on top of the faulty demo and the constant missed deadlines, the blog post mistake (which would have been overlooked if it were the only problem) only worsens the perception of an AI model that is otherwise pretty impressive (on paper, of course).

If you’ve read this far, the idea you have right now of Gemini and Google is probably pretty bad. Deception, delays, mistakes… A very bad look for the company. I agree with all the criticisms above. But, does that mean Gemini is a failure? The demo is clearly misleading, whether intentional or not. After watching it, it’s obvious it was designed to portray Gemini as an AI model much more capable than it actually is (why, I’m not sure). Also, the delays are annoying and the blog post mistake doesn’t feel as ambiguous after the “fake” demo, which leads people to assume the worst.

And I wonder: were Gemini's real advances over GPT-4 not good enough on their own? Did Google really need to overhype it so much that the backlash became more notable than the announcement itself? Why do AI companies keep doing this?

The most striking phenomenon of AI isn't how good it is or how fast it advances, but that, for what is potentially the most impactful technology ever (or so they say), companies keep trying so hard to make it seem better than it is. Google has managed to damage its reputation, and as a result its products, despite Gemini surpassing GPT-4 on 30 out of 32 benchmarks. Congratulations on the impressive achievement!

But — and here’s the most important point I wanted to make with this unscheduled article — that doesn’t make Gemini a bad model.

Google's maneuvers don't take away Gemini's title of "best AI model in the world" (again, on paper). It's a very bad look for Google, but it shouldn't affect our interpretation of the underlying reality. Gemini Ultra will eventually arrive on Bard Advanced (hopefully, if there are no more delays, in January 2024). We will know then whether it's better than GPT-4, as the technical report suggests, and by how much.

For now, however, it'd be great if we didn't mistake a company's marketing insincerity for a lack of research prowess or scientific effort. The people who worked on Gemini probably have very little to do with the people who decided how to edit the demo. I stand (for the most part) by this end-of-debate comment by PerplexityAI's CEO, Aravind Srinivas (who, as a direct Google competitor, has no reason to favor Gemini and every incentive to do the opposite):

Extreme 1: “Deepmind faked the evals and demo. Gemini sucks”

Extreme 2: “OpenAI is done. Google is back. Bard will run Gemini for free and burn down chatGPT because of margins on compute chip”

Reality: Gemini is cool. The first model that genuinely is comparable to GPT 4. Real accomplishment. Especially that it was just a dense model. Marketing was overboard, but Deepmind is known for aggressive PR. Demos like the multimodal video in reality will be possible in less than a year.

Don’t be part of the “extreme” crew.
