The Generative Edge Week 21
Using AI to preserve the languages of the world, running private AI chat on your phone, generating truly multimodal content with CoDi, and the numbers every LLM developer should know.
Welcome to week 21 of The Generative Edge. Here is the gist in 4 bullet points:
Meta AI's MMS project supports over 1100 languages, significantly improving speech recognition technology and promoting global language diversity.
Open-source Large Language Models (LLMs) enable personalized AI chat applications on consumer devices like smartphones, addressing privacy and availability concerns.
Microsoft Research's CoDi is a multimodal generative model capable of handling various input-output combinations, including language, image, video, and audio.
Numbers every LLM developer should know about: github.com/ray-project/llm-numbers
Let’s hop in and dig into the details:
MMS - Massively Multilingual Speech
A while back, OpenAI released Whisper, an open-source, state-of-the-art multilingual speech transcription model; at the same time, ElevenLabs is pushing the voice generation frontier with life-like artificial voice cloning.
Last week, Meta AI released a project called MMS (Massively Multilingual Speech), designed to address the limitations of speech recognition technology in terms of language coverage.
The preservation of cultural diversity is certainly a novel and interesting application of modern AI technology.
Supports over 1,100 languages, including rarely spoken, low-resource languages
The model can transcribe as well as generate speech (think Whisper and ElevenLabs).
Among other things, it leverages translated religious texts to achieve diverse language coverage.
Outperforms existing models while covering 10x more languages.
Publicly shared models and code for research community: facebookresearch/…/mms
Aims to preserve global language diversity.
Check out the project page here: https://ai.facebook.com/blog/multilingual-model-speech-recognition
AI chat: local, private and on your phone
As of now, OpenAI's GPT-4 is the best language model available. Nevertheless, it is closed-source, accessible only via API, and cannot be self-hosted.
As a result, legitimate concerns arise regarding privacy, availability, bias, etc. We've mentioned before that a large community effort is underway to create, modify, and improve open source language models.
You can now run some of these in the browser or even on consumer devices like smartphones:
Open-source Large Language Models (LLMs) have proliferated greatly
Exciting opportunities for end users to explore and leverage personalized open LLMs
RedPajama is an example of open-source, high-performing LLMs that can be run at the edge, e.g. on Apple Silicon, AMD/NVIDIA GPUs, WebGPU, and iOS devices
This could lead to a proliferation of personalized fine-tuned models.
Learn more about this project and test the phone apps as well as the browser version here: mlc.ai/blog/2023/05/22/bringing-open-large-language-models-to-consumer-devices
Truly multimodal generation with CoDi
Multimodal models, which can accept and produce a variety of input and output types, have appeared on the scene: from the elusive, yet-to-be-released multimodal capabilities of GPT-4 to TaskMatrix and others. Microsoft Research has now pushed things further and released CoDi, a model that can handle multimodal inputs and generate combined multimodal outputs.
Generates language, image, video, audio from any input combination.
Not limited to specific input types like existing AI systems.
Works with input-output combinations not in training data.
Examples:
Input: an image of a creek and the sound of birds → Output: sounds of water and birds
Input: an image plus the text "Eating on a coffee table" → Output: a video
Delivers results on par with or better than specialized single-format models
See more examples on the project page: https://codi-gen.github.io/
… and what else?
Are you a developer working with LLMs? Check out these Numbers every LLM Developer should know. Also check out Blotterstream, an endless AI-generated music video with live input from the chat.
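To give a flavor of the linked list, here is a minimal sketch of two of its rules of thumb as back-of-the-envelope arithmetic. The specific constants (roughly 1.3 tokens per English word, and about 2 bytes per parameter when serving in fp16) are approximations taken from that page, not exact measurements:

```python
# Back-of-the-envelope helpers for two rules of thumb from
# github.com/ray-project/llm-numbers (approximate values, not benchmarks).

def estimate_tokens(word_count: int, tokens_per_word: float = 1.3) -> int:
    """English prose averages roughly 1.3 GPT-style tokens per word."""
    return round(word_count * tokens_per_word)

def fp16_serving_memory_gb(num_params_billions: float) -> float:
    """Serving a model in fp16 takes ~2 bytes per parameter, i.e. about
    2 GB of GPU memory per billion parameters (before activation and
    KV-cache overhead)."""
    return 2.0 * num_params_billions

print(estimate_tokens(750))        # a ~750-word page is ~975 tokens
print(fp16_serving_memory_gb(7))   # a 7B model needs ~14 GB in fp16
```

Rules of thumb like these are mainly useful for quick sanity checks, e.g. whether a prompt fits a context window or a model fits a given GPU.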
And that’s it for this week!
Find all of our updates on our Substack at thegenerativeedge.substack.com, get in touch via our office hours if you want to talk professionally about Generative AI and visit our website at contiamo.com.
Have a wonderful week everyone!
Daniel
Generative AI engineer at Contiamo