The Generative Edge Week 11
The future is multimodal, Llamas and Alpacas are speaking to you, and what do you sound like when speaking Chinese?
Welcome to this week’s edition of The Generative Edge, your weekly digest of all things generative AI. With so much happening in this rapidly evolving field, we know it can be tough to stay up to date. That's why we're here to help! From cutting-edge research to practical applications, we've got you covered.
So, let's dive in and see what the world of generative AI has been up to this week.
Large Language Models
We seem to be having a “Stable Diffusion (SD)” moment in large language models right now. When Stability AI released SD last year, an extremely fast-moving and vibrant community sprang up around it. We might be seeing the start of the same thing for language models right now!
What happened?
At the end of February, Meta released LLaMA, a set of powerful pretrained large language models that follow the Chinchilla paper's recipe in an effort to make LLMs much more efficient. Access was initially granted only to a limited set of researchers. However, someone leaked the models publicly and even cheekily opened a pull request.
Now, with LLaMA in the open, the community got to work, and soon llama.cpp appeared, an effort to run inference on some of these models on commodity hardware (e.g. MacBooks and Raspberry Pis).
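If you're curious what that looks like in practice, here is a minimal sketch using the community llama-cpp-python bindings (the project itself is a C++ command-line tool; the bindings, model path, and prompt below are just our own placeholder example):

```python
# Minimal local-inference sketch using the llama-cpp-python bindings.
# Assumptions: `pip install llama-cpp-python` and a quantized LLaMA 7B
# model file on disk (the path below is hypothetical).
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

output = llm(
    "Q: Name three animals that are also language models. A:",
    max_tokens=48,
    stop=["Q:", "\n"],
)
print(output["choices"][0]["text"])
```

That's the whole loop: load a quantized model, feed it a prompt, get tokens back, all on a laptop CPU.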
A research team at Stanford created Alpaca: the 7B-parameter LLaMA model fine-tuned on instruction data to turn it from an old GPT-3-style base model into an instruction-following model (think of an assistant you can give instructions to, like ChatGPT). This brings its performance close to the much, much larger InstructGPT (text-davinci-003) model!
The fascinating part? The training data used to turn LLaMA 7B into Alpaca was generated with OpenAI's GPT-3 (text-davinci-003) itself! This means strong, large models can be used to train other models.
Read the blog article here: https://crfm.stanford.edu/2023/03/13/alpaca.html
You can try it yourself: alpaca-ai-custom4.ngrok.io/
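To make that idea concrete, here is a heavily simplified sketch of the data-generation step, assuming the OpenAI Python client and the text-davinci-003 completion model (the Stanford team's actual self-instruct pipeline is more elaborate, and the seed task below is our own invented example):

```python
# Sketch of generating instruction-tuning data with a strong teacher model.
# Assumption: the openai package is installed and OPENAI_API_KEY is set.
import openai

seed_task = "Give three tips for staying healthy."  # hypothetical seed instruction

prompt = (
    "You are generating training data for an instruction-following model.\n"
    f"Here is an example instruction: {seed_task}\n"
    "Write one new, different instruction and a high-quality response.\n"
    "Format:\nInstruction: ...\nResponse: ..."
)

completion = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=256,
    temperature=0.7,
)

# Each generated (instruction, response) pair becomes one fine-tuning example.
print(completion["choices"][0]["text"])
```

Repeat something like this tens of thousands of times, filter the results, and you have an instruction-tuning dataset for a small open model.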
Expect to be able to run your own powerful language models much more easily very soon.
Multimodal models
Multimodal is the next big thing for large language models. Current-generation LLMs are trained on text only and can only understand and output text. Multimodal models can handle a variety of inputs and outputs: sensor data, video, images, audio, and more.
Imagine an AI assistant that can look at any image, video, or website, listen to audio, and give you information based on that input, but that can also generate any kind of output (an image, a video, or even commands for a robotic arm). It brings language models into our multimodal world.
GPT-4
Microsoft accidentally leaked GPT-4's imminent release, which is expected within the next few days.
GPT-4 is rumored to be multimodal and to understand at least images, possibly more.
PaLM-E
PaLM-E was released, a multimodal language model created by fusing a strong vision model with a large language model.
Equipped with it, a robot can now break an instruction down into a series of smaller actions, so you can imagine how multimodal language models will have a definite impact on robotics.
Visual ChatGPT
Microsoft's paper shows what image-focused multimodality can look like, using ControlNet, Stable Diffusion, and the ChatGPT API to create a multimodal interaction flow built around visual foundation models: github.com/microsoft/visual-chatgpt
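As a rough illustration of the idea (not Microsoft's actual implementation), the core of such a flow is letting the chat model decide which visual tool to call. Here is a sketch, assuming the OpenAI Python client and two hypothetical placeholder tools:

```python
# Rough sketch of a ChatGPT-driven tool-dispatch loop, in the spirit of
# Visual ChatGPT. The tool functions are hypothetical stand-ins for
# real Stable Diffusion / ControlNet calls.
import openai

def generate_image(prompt: str) -> str:
    return f"[image generated for: {prompt}]"  # would call Stable Diffusion

def edge_to_image(prompt: str) -> str:
    return f"[ControlNet output for: {prompt}]"  # would call ControlNet

TOOLS = {"generate_image": generate_image, "edge_to_image": edge_to_image}

def route(user_request: str) -> str:
    # Ask the chat model which tool to use; real systems parse this far more robustly.
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer with exactly one tool name: " + ", ".join(TOOLS)},
            {"role": "user", "content": user_request},
        ],
    )
    tool_name = chat["choices"][0]["message"]["content"].strip()
    tool = TOOLS.get(tool_name, generate_image)
    return tool(user_request)

print(route("Draw a llama wearing sunglasses"))
```

The real system chains many such tools and feeds intermediate images back into the conversation, but the dispatch pattern is the same.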
Generative image models
This space continues to accelerate at a breakneck pace. Not only are there strong synergies with the multimodal area; new techniques, models, and architectures are released on a daily basis. Here is just a small selection:
Generative Photoshop flow
A glimpse at what professional, generative workflows in Photoshop can look like:
See NextML's Twitter for more details.
Driving facial animations
An interpolation technique called Thin Plate Splines can be used to drive a generated still image of a face with a recorded video.
This Reddit thread has more information, a link to an iPhone app, and instructions on how to do this yourself.
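For the curious, the core warping idea behind thin plate splines is easy to sketch with SciPy. Here is a toy example (our own illustration, not the app's actual pipeline) that fits a smooth mapping from a few made-up facial keypoints to their positions in a driving frame and applies it to arbitrary pixel coordinates:

```python
# Toy thin plate spline warp: fit a smooth mapping from source keypoints
# (e.g. landmarks on the still image) to driving keypoints (from the video
# frame), then evaluate it at any pixel coordinates. Keypoints are made up.
import numpy as np
from scipy.interpolate import RBFInterpolator

# (x, y) landmarks detected on the still image ...
source_pts = np.array([[30, 40], [70, 40], [50, 60], [40, 80], [60, 80]], float)
# ... and where those landmarks sit in the current video frame.
driving_pts = np.array([[32, 38], [72, 41], [50, 63], [42, 84], [61, 83]], float)

# A thin plate spline is an RBF interpolant with the 'thin_plate_spline' kernel.
tps = RBFInterpolator(source_pts, driving_pts, kernel="thin_plate_spline")

# Warp a grid of pixel coordinates; sampling the image at these positions
# would deform the still face to follow the driving frame.
xs, ys = np.meshgrid(np.arange(0, 100, 10), np.arange(0, 100, 10))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
warped = tps(grid)
print(warped[:3])
```

The published motion model learns the keypoints and handles occlusion on top of this, but the deformation itself is exactly this kind of spline.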
Voice synthesis
We continue to see new and exciting developments in this space. Last week we looked at voice cloning; today we shine a light on a new paper from Microsoft.
Ever wonder what your voice would sound like speaking another language? VALL-E X is going there: https://vallex-demo.github.io/
Here is an example: the first audio file is the native Chinese speaker, the second is the generated, voice-cloned translation.
And that’s it for this week!
Find all of our updates on our Substack at thegenerativeedge.substack.com, get in touch via our office hours if you want to talk professionally about generative AI, and visit our website at contiamo.com.
Have a wonderful week everyone!
Daniel
Generative AI engineer at Contiamo