# Advanced Techniques: Multimodal Agents

## Documentation Description

In this example, we further introduce how to build and run a multimodal agent that performs intelligent Q&A based on input images or audio.

Example Project Location: multimodal_app

## Multimodal Agent

The multimodal agent consists of:

- multimodal_agent.py
- multimodal_agent.yaml

aU constructs a multimodal demo agent, configuring its prompt, multimodal model, and agent memory.

## Multimodal Models

Qwen multimodal model:

- qwen_multimodal_llm.yaml

We construct the Qwen multimodal model using aU's built-in QWenOpenAIStyleLLM. It uses qwen-vl for visual understanding, qwen-audio for audio understanding, and qwen-omni for omnimodal understanding.

## Usage Instructions

### Visual Understanding

When running the agent, pass the image_urls parameter (a list of image URLs), for example:

```python
from agentuniverse.agent.agent import Agent
from agentuniverse.agent.agent_manager import AgentManager

# Retrieve the registered multimodal agent instance by name.
instance: Agent = AgentManager().get_instance_obj('multimodal_agent')
# Pass one or more image URLs through the image_urls parameter.
output_object = instance.run(input="What scene is depicted in the image?", image_urls=[
    'https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg'])
```
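The call returns an aU OutputObject. Assuming the agent writes its answer under the 'output' key, as aU's demo agents typically do, the answer text can be read with get_data:

```python
# 'output' is assumed to be the key the demo agent stores its answer under;
# adjust the key if your agent configuration uses a different one.
print(output_object.get_data('output'))
```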

agentUniverse reads the image_urls parameter passed to the agent in the agentuniverse.prompt.chat_prompt.ChatPrompt.generate_image_prompt method and assembles it into the standard multimodal prompt format.

The image_urls parameter supports two types of values:

1. If image_urls contains a standard HTTP or HTTPS URL, aU assembles the multimodal prompt format directly from the URL.
2. If image_urls contains a local file path, aU reads the image file, converts it to Base64 encoding, and then assembles the multimodal prompt format (see the sketch after this list).
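As a rough illustration of the assembly described above, the sketch below builds OpenAI-style multimodal content parts: a remote URL is passed through unchanged, while a local path is read and embedded as a Base64 data URI. The build_image_parts helper is our own illustration, not aU's generate_image_prompt implementation, and the exact message structure aU produces may differ.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_parts(question: str, image_urls: list[str]) -> list[dict]:
    """Assemble OpenAI-style multimodal content parts from image_urls."""
    parts = [{"type": "text", "text": question}]
    for url in image_urls:
        if url.startswith(("http://", "https://")):
            # Remote image: the URL is used as-is.
            parts.append({"type": "image_url", "image_url": {"url": url}})
        else:
            # Local image: read the file and embed it as a Base64 data URI.
            mime = mimetypes.guess_type(url)[0] or "image/jpeg"
            encoded = base64.b64encode(Path(url).read_bytes()).decode("utf-8")
            parts.append({"type": "image_url",
                          "image_url": {"url": f"data:{mime};base64,{encoded}"}})
    return parts
```

Content parts in this shape form the content of a single user message sent to an OpenAI-compatible multimodal endpoint.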

### Audio Understanding

When running the agent, pass the audio_url parameter (the URL of the audio file), for example:

```python
from agentuniverse.agent.agent import Agent
from agentuniverse.agent.agent_manager import AgentManager

# Retrieve the registered multimodal agent instance by name.
instance: Agent = AgentManager().get_instance_obj('multimodal_agent')
# Pass the audio file URL through the audio_url parameter.
output_object = instance.run(input='What is the content of the task in the audio?',
                             audio_url='https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/kvkadk/%E6%8E%A8%E8%8D%90%E4%B9%A6.wav')
```

In the agentuniverse.prompt.chat_prompt.ChatPrompt.generate_audio_prompt method, agentUniverse reads the audio_url parameter passed to the agent and assembles it into the standard multimodal prompt format.

The audio_url parameter supports only one type of value: a standard HTTP or HTTPS URL. aU assembles the multimodal prompt format directly from the provided URL; local audio file paths are not supported.
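Since local paths are not accepted here, it can help to validate audio_url on the caller side before invoking the agent. The helper below is a minimal sketch using only the Python standard library; ensure_remote_audio_url is a hypothetical name, not part of aU.

```python
from urllib.parse import urlparse


def ensure_remote_audio_url(audio_url: str) -> str:
    """Raise early if audio_url is not an HTTP(S) URL, as the audio path requires."""
    scheme = urlparse(audio_url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(
            f"audio_url must be an HTTP or HTTPS URL, got scheme {scheme!r}")
    return audio_url


# Hypothetical usage before running the agent:
# instance.run(input='...', audio_url=ensure_remote_audio_url(my_audio_url))
```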