Multimodal AI puts on quite a show, but it’s still in its infancy

By Sage Lazzaro, Contributing writer

    Sage Lazzaro is a technology writer and editor focused on artificial intelligence, data, cloud, digital culture, and technology’s impact on our society and culture.

    Google Assistant and Bard general manager Sissie Hsiao.
    Duy Ho for Fortune

    Hello and welcome to Eye on AI.

    Google this past week made clear it’s not going to let 2023 end without marking its own leap in AI. The tech giant, which has fallen behind OpenAI despite making the crucial research breakthrough that made ChatGPT possible in the first place, finally unveiled Gemini, its long-rumored “largest and most capable” AI model yet. 

    The announcement offers a lot to unpack. Gemini—which comes in three increasingly powerful tiers, Nano, Pro, and Ultra—is already powering Bard and a few features on Pixel 8 Pro smartphones. Tomorrow, Gemini will be made available to Google Cloud customers via the Vertex AI platform, and Google also plans to integrate it into other products such as Search, Chrome, and Ads. Google touted numerous benchmark wins against OpenAI, but because Gemini Ultra, the most powerful tier and the one positioned to compete with GPT-4, won’t actually be available until next year, it’s too early to draw firm conclusions.

    One thing that’s clear no matter how Gemini stacks up against OpenAI’s models, however, is that it provides a window into the next era of LLMs, in which multimodality will be the norm. Google built Gemini to be multimodal from the start, meaning it was trained on and can handle combinations of text, image, video, and code prompts, opening up a host of new use cases and user experiences. Onstage at Fortune’s Brainstorm AI event yesterday (more on that later), Google VP Sissie Hsiao called Gemini’s multimodal capabilities the “most visually stunning” of the model’s advancements, and leaders across the industry are pointing to multimodality as the obvious next step for the technology.

    “I’m not sure people realize how much multimodal AI will become the default, even for regular chatbot applications,” Robert Nishihara, CEO of Anyscale, the company behind the Ray developer framework that’s powered much of the GenAI boom, told Eye on AI. He added that multimodality is going to become “fundamental to the way we interface with these models.” 

    If you’re chatting with your insurance company via an AI chatbot, for example, multimodality would make it possible to incorporate photos and videos of the damage into the conversation. It could also help developers by enabling coding co-pilots to preemptively spot issues in code as it’s being written. During her interview, Hsiao gave the example of how she recently fed photos of a restaurant menu and a wine list into Bard and asked it to suggest the ideal pairing.
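
    To make that concrete, here is a minimal sketch of what a photo-plus-text prompt like Hsiao’s can look like in code, using Google’s generativeai Python SDK as it shipped around the Gemini launch. The model name, file name, and prompt are assumptions for illustration, drawn from the public SDK rather than anything described in the interview.

        # Hedged sketch: one multimodal request combining an image and text.
        # Assumes the google-generativeai SDK (pip install google-generativeai)
        # and the "gemini-pro-vision" model name; swap in your own API key and photo.
        import google.generativeai as genai
        from PIL import Image

        genai.configure(api_key="YOUR_API_KEY")
        model = genai.GenerativeModel("gemini-pro-vision")

        menu_photo = Image.open("wine_menu.jpg")  # hypothetical photo of the menu
        response = model.generate_content(
            [menu_photo, "Which wine on this list pairs best with the roast chicken?"]
        )
        print(response.text)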

    While some multimodal models already exist, these capabilities have typically been stitched onto text-based LLMs after the fact. Large language models only became viable in the last year or so, and multimodal models are even harder from a technical perspective. Combining all of these modalities in a single model from the get-go is ultimately far simpler than piecing them together, Nishihara said, but it has required a fundamental shift at the architecture level. Convolutional neural networks have long been the standard for processing image and video data, but Nishihara credits the recent shift to using transformers for that data as well with kicking off much of the progress in multimodal AI.
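
    For the architecturally curious, here is a toy sketch of the idea Nishihara describes, written in PyTorch (which the article does not specify) with made-up dimensions: rather than routing images through a separate convolutional network, image patches and text tokens are projected into one shared embedding space and processed by the same transformer.

        # Toy sketch, not Gemini: one transformer over both text tokens and image patches.
        import torch
        import torch.nn as nn

        class ToyMultimodalEncoder(nn.Module):
            def __init__(self, vocab_size=32000, d_model=256, patch_size=16):
                super().__init__()
                self.text_embed = nn.Embedding(vocab_size, d_model)
                # Each flattened 16x16 RGB patch (768 values) becomes one "token."
                self.patch_embed = nn.Linear(3 * patch_size * patch_size, d_model)
                layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=4)

            def forward(self, token_ids, image_patches):
                text_tokens = self.text_embed(token_ids)        # (B, T_text, d_model)
                image_tokens = self.patch_embed(image_patches)  # (B, T_img, d_model)
                # One sequence, one model: text and image tokens attend to each other.
                return self.encoder(torch.cat([text_tokens, image_tokens], dim=1))

        tokens = torch.randint(0, 32000, (1, 12))    # a short text prompt
        patches = torch.randn(1, 196, 3 * 16 * 16)   # a 224x224 image as 196 patches
        out = ToyMultimodalEncoder()(tokens, patches)
        print(out.shape)  # torch.Size([1, 208, 256])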

    Still, multimodal AI has several limitations and challenges. One of the most significant is the size of multimodal data such as photos and video, which is orders of magnitude larger than text data. That makes building applications more data-intensive and introduces new infrastructure challenges. It also drives up costs considerably, as running data-intensive workloads on GPUs can be extremely expensive.

    Solutions to these issues will come from the hardware space, according to Nishihara. Pointing to how well Google’s Cloud Tensor Processing Units (TPUs) handle image data, he said we’re going to see growing interest in a wider variety of hardware accelerators.

    “As we work and experiment with more modalities of data, we’re going to see the hardware ecosystem flourish and alleviate some of the resource challenges the industry is experiencing right now,” he said. “That said, we’re still in the early phases and going through growing pains, so I wouldn’t expect that to be visible in the next six months.”

    And with that, here’s the rest of this week’s AI news.

    Sage Lazzaro
    sage.lazzaro@consultant.fortune.com
    sagelazzaro.com

    AI IN THE NEWS

    EU lawmakers reach a deal on the EU AI Act. Over the weekend, European Union lawmakers finally agreed on terms for the EU AI Act, the world’s first piece of comprehensive AI regulation. The act lays out guardrails and stringent transparency requirements for general-purpose AI (GPAI) systems like ChatGPT, particularly for applications it deems high risk. It also bans several applications outright, including untargeted scraping of facial images, emotion recognition in the workplace and educational institutions, biometric categorization systems that use sensitive characteristics, and other AI systems that could be used to manipulate people or exploit their vulnerabilities. Additionally, the act imposes limitations on, but doesn’t ban, the use of biometric identification systems in law enforcement.

    Scale AI releases a foundation model for the autonomous vehicle industry. Based on transformer modules, the company says the model, called AFM-1, is the first generally available zero-shot model built specifically for the autonomous vehicle research community. “Zero-shot” refers to a machine learning system’s ability to complete tasks for which it received no training examples, a capability that has proven vital for autonomous vehicles.
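
    As a rough illustration of the zero-shot idea, here is a short Python snippet using an off-the-shelf Hugging Face text classifier as a stand-in (not Scale’s AFM-1, whose interface isn’t described here): the model assigns labels it was never explicitly trained on, guided only by candidate labels supplied at inference time.

        # Zero-shot classification: the candidate labels are invented at inference time.
        from transformers import pipeline

        classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
        result = classifier(
            "A cyclist is crossing the intersection against the light.",
            candidate_labels=["pedestrian hazard", "road construction", "clear road"],
        )
        print(result["labels"][0], round(result["scores"][0], 2))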

    Meta launches Purple Llama initiative to release tools for safety testing AI models. Named “purple” because the project combines attack (red team) and defense (blue team) evaluation, the initiative is positioned as a two-pronged approach that looks at both the inputs and outputs of LLMs. Its first release is Llama Guard, an openly available foundational model meant to help developers avoid generating potentially risky outputs. All of the tools will be open source, and the project appears to be tied to the recently announced AI Alliance launched by Meta and IBM.

    The U.K. is considering an antitrust investigation into Microsoft and OpenAI's partnership, and the FTC is keeping a close watch. The U.K. Competition and Markets Authority (CMA) said it’s gathering information to determine whether the collaboration between the two firms threatens competition in the country and is accepting public comments until Jan. 3 before deciding whether to open a formal probe. On a similar note, the U.S. Federal Trade Commission (FTC) is also examining the nature of the companies’ partnership, according to Bloomberg, though its inquiry is preliminary and not a formal investigation.

    Sam Altman is named Time’s CEO of the year, and ChatGPT tops Wikipedia’s list of the most-read articles of 2023. “Altman emerged as one of the most powerful and venerated executives in the world, the public face and leading prophet of a technological revolution,” says Time. It’s not every day that a CEO who was just ousted from (and then reinstated at) his own company earns such high praise, but OpenAI’s ChatGPT and its GPT-4 model were undeniably transformative, no matter how much board drama closes out the year. So it’s no surprise the Wikipedia article for ChatGPT was visited more than any other page on the English version of the site in 2023, racking up 49,490,406 page views, according to the Wikimedia Foundation.

    EYE ON AI RESEARCH

    The GenAI rankings. As one of the world’s largest networks propping up much of the global internet, Cloudflare has a unique lens into what goes on online. Today, the company released its 2023 Year in Review, complete with a ranking of the top generative AI services. 

    Per Cloudflare’s network data, OpenAI maintained the top spot throughout the entire year, followed by Character AI, QuillBot, and Hugging Face. Google’s Bard settled in at No. 8 overall but peaked at No. 5 in November after its broader release in Europe and Brazil. Midjourney started off strong at No. 3 in March before sliding to No. 10 in September.

    Cloudflare also notes that OpenAI rose significantly in the overall top Internet Services ranking, peaking at No. 104 in November after its developer conference. You can read the GenAI highlights here, and the full 2023 report here.

    FORTUNE ON AI

    Sam Altman explains how being fired as OpenAI CEO was a ‘blessing in disguise’ - Steve Mollman

    One of the two female OpenAI board members replaced after the Sam Altman incident says a company lawyer tried to pressure her with an ‘intimidation’ tactic - Kylie Robison

    ‘We cannot work with both sides’: A major Emirati AI company has picked a side in the U.S.-China tech war - Paolo Confino

    Just 10% of organizations launched generative AI solutions in 2023, according to an Intel company - Sheryl Estrada

    AI can now turn a rough sketch of a skyscraper into a detailed rendering in a matter of minutes. A leading architect demonstrates how - Fortune Editors

    BRAINFOOD

    Brainstorm AI. Today’s newsletter is coming to you live from Fortune’s Brainstorm AI conference in San Francisco, where we’ve gathered with leading academics, prominent policymakers, and C-suite executives to assess the industry, its current challenges, and new business use cases for AI. 

    Fortune’s Jeremy Kahn kicked things off with a discussion with Sissie Hsiao, Google’s VP and General Manager of Google Assistant and Bard. There were also panels about AI in retail, healthcare, entertainment, fintech, and education with executives from Walmart, Pfizer, Adobe, Wells Fargo, Khan Academy, and more. Other discussions focused on themes like the impacts of AI on the workforce, AI infrastructure, misuse and misinformation, and the responsible development of AI—and that’s still just a snapshot of all the interviews and conversations. 

    For a full rundown, be sure to check your email on Friday for a special edition of Eye on AI recapping Fortune Brainstorm AI 2023.

    This is the online version of Eye on AI, Fortune's weekly newsletter on how AI is shaping the future of business. Sign up for free.