A practical guide for combining text, images, and structured data using the Elysia (YG3) API.
Multimodal AI used to require complex architectures, multiple APIs, or special-purpose models. Today, thanks to the YG3 API (also known as the Elysia API), you can build unified multimodal applications — mixing text + images + tool calling + structured reasoning — through one simple, OpenAI-compatible interface.
This guide breaks down how multimodal workflows work in YG3, what they enable, and how to practically use them to build real business applications. The full notebook and runnable example live on GitHub — this article gives you the conceptual map and production thinking behind it.
Multimodal agents unlock capabilities that single-modality text models cannot achieve:
A real business workflow often includes both text and images.
The YG3 API integrates these modalities natively — no special infrastructure required — so you can build applications that see, reason, and act.
This tutorial walks you through:
You’ll understand the logic behind multimodal prompts and how to design workflows that take advantage of visual context.
YG3 supports multimodal content using the OpenAI chat-completions standard.
A single message can contain multiple content blocks:
The model processes the entire message context holistically — meaning:
The model understands what’s happening visually and textually at the same time, and can take actions accordingly.
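To make the message shape concrete, here is a minimal sketch of a single multimodal request using the standard OpenAI Python SDK pointed at a YG3-compatible endpoint. The base URL, environment variable names, model name, and image path are placeholders, not documented YG3 values; substitute whatever your deployment uses.

```python
# A minimal multimodal request: one user message carrying a text block and an
# image block, sent through the OpenAI-compatible chat-completions endpoint.
# The base_url, api_key env vars, model name, and file path are placeholders.
import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["YG3_BASE_URL"],  # placeholder: your Elysia/YG3 endpoint
    api_key=os.environ["YG3_API_KEY"],    # placeholder: your API key
)

# Encode a local image as a data URL so it can travel inside the message.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="yg3",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this screenshot, and what should I do next?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```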
This unlocks a wide range of use cases.
Here are the types of things the multimodal model can do:
Upload a product image and ask for:
Useful for e-commerce automation or catalog ingestion.
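A hedged sketch of that listing flow, reusing the `client` from the snippet above: prompt for a fixed set of JSON fields and parse the reply. The field names and the bare-JSON assumption are illustrative, not a documented schema.

```python
# Sketch: turn a product photo into structured listing fields.
# Reuses `client` from the previous snippet; field names are illustrative.
import base64
import json

def encode_image(path: str) -> str:
    """Read a local image and return it base64-encoded."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

prompt = (
    "Return only a JSON object with the keys: title, description, category, "
    "and attributes (a list of strings). Base everything on the product photo."
)

response = client.chat.completions.create(
    model="yg3",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encode_image('product.jpg')}"}},
        ],
    }],
)

# Assumes the model returns bare JSON; in production, validate and retry.
listing = json.loads(response.choices[0].message.content)
print(listing["title"], "|", listing["category"])
```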
Upload a frontend error screenshot:
This is incredibly valuable for dev-tools workflows.
Upload a chart or dashboard snapshot:
This is the foundation of multimodal BI assistants.
Upload a:
…and the model can parse it into structured fields and use tool calls to save the result.
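One way to wire that up, sketched under the same assumptions as the earlier snippets (reusing `client` and `encode_image`): declare a hypothetical save_record() tool whose parameters are the structured fields, so the model's tool call arguments are the parsed document.

```python
# Sketch: the model extracts fields from a document image and hands them to a
# hypothetical save_record() tool. Tool name and parameters are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "save_record",  # hypothetical backend function
        "description": "Persist structured fields extracted from a document.",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "date": {"type": "string"},
                "total": {"type": "number"},
            },
            "required": ["vendor", "date", "total"],
        },
    },
}]

response = client.chat.completions.create(
    model="yg3",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the key fields from this document and save them."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image('document.png')}"}},
        ],
    }],
    tools=tools,
)

# Each tool call carries the extracted fields as JSON arguments.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```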
Upload a design mockup:
Great for design → dev acceleration.
This is the breakthrough.
With YG3, the model can:
This is multimodal ReAct — visual reasoning + action execution.
Every workflow becomes smarter when the model can see first, then act.
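Here is a hedged sketch of that loop, continuing with the `client` and `encode_image` helper defined above. The get_order_status() tool, the screenshot path, and the model name are hypothetical stand-ins; the point is the shape of the loop: see the image, call a tool, feed the result back, answer.

```python
# Multimodal ReAct sketch: the model inspects a screenshot, requests a tool
# call, we execute it locally, return the result, and ask for a final answer.
import json

def get_order_status(order_id: str) -> str:
    """Stand-in for a real backend lookup."""
    return json.dumps({"order_id": order_id, "status": "shipped"})

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # hypothetical tool
        "description": "Look up the current status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "This is a screenshot of a customer's message. Work out what they need and resolve it."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{encode_image('customer_message.png')}"}},
    ],
}]

response = client.chat.completions.create(model="yg3", messages=messages, tools=tools)
message = response.choices[0].message

# Keep looping while the model asks to act; stop when it answers in plain text.
while message.tool_calls:
    messages.append(message)
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_order_status(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    response = client.chat.completions.create(model="yg3", messages=messages, tools=tools)
    message = response.choices[0].message

print(message.content)
```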
Here are production-ready examples you can build today using only the YG3 API.
The model sees the image, then calls the right tool for the job: your add_expense() function, or lookup_order() or schedule_support_call(). Perfect for:
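As a sketch, the tools named above could be declared like this and passed as `tools=` to the same loop shown earlier. The parameter schemas are illustrative guesses, not a fixed contract; adapt them to whatever your backend functions actually accept.

```python
# Illustrative schemas for the tools mentioned above; adapt to your backend.
business_tools = [
    {
        "type": "function",
        "function": {
            "name": "add_expense",
            "description": "Record an expense extracted from an uploaded image.",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "currency": {"type": "string"},
                    "description": {"type": "string"},
                },
                "required": ["amount", "currency"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Fetch details for an order the model spotted in an image.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_support_call",
            "description": "Book a follow-up call for a customer issue.",
            "parameters": {
                "type": "object",
                "properties": {"customer_id": {"type": "string"}, "topic": {"type": "string"}},
                "required": ["customer_id"],
            },
        },
    },
]
```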
Upload a product image → model produces everything needed for a listing.
This is where BI becomes intelligent.
Upload a messy whiteboard photo → the agent:
These are productivity superpowers.
Use all your existing SDK code.
One endpoint handles text + images + tools.
Industrial-grade image understanding.
Full ReAct loops with visual triggers.
Optimized inference and low-latency performance.
Designed for real business usage.