AI Disruption

Using Large Language Models for TTS/ASR/OCR (Development of Large Model Applications 19)

Explore multimodal AI applications like text-to-image, image-to-text, TTS, ASR, and OCR using large language models for enhanced content creation and processing.

Meng Li
Jul 25, 2024

Hello everyone, welcome to the "Development of Large Model Applications" column.

The most common multimodal applications are text-to-image and image-to-text conversion: prompting a model such as Stable Diffusion, Midjourney, or DALL-E to generate an image, or feeding an image to a large language model (LLM) to get a text description.

Both kinds of system are widely used. Text-to-image models are a key component of AI-generated content (AIGC): they make designers more efficient and let even non-experts create images.

In our previous discussion, we covered GPT-4's video interpretation capabilities. With the advent of Sora, people now dream of creating movie-grade special effects.

Today, we'll complete our discussion of multimodal processing by focusing on text-to-speech (TTS), automatic speech recognition (ASR), and optical character recognition (OCR).
