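Building a Multimodal RAG Application with OpenVINO and LlamaIndex

Source: OpenVINO Chinese Community

Author: 杨亦诚 (Yang Yicheng), AI Software Engineer, Intel

Introduction

Retrieval-Augmented Generation (RAG) systems filter the key information out of a knowledge base, which reduces the memory footprint and improves the inference performance of LLM tasks. Thanks to mature tools for text parsing, indexing, and retrieval, building a RAG pipeline for text content is by now relatively well understood. Building one for video content is much harder: because video combines image, audio, and text elements, it requires more, and more complex, data processing. This article shows how to use OpenVINO and LlamaIndex to build a RAG pipeline for video understanding tasks.

A truly multimodal video-understanding RAG has to handle the different modalities in a video, such as speech and visual content. This example presents a multimodal RAG pipeline designed for video analysis. It uses a Whisper model to transcribe the speech in the video into text, a CLIP model to generate multimodal embeddings, and a vision-language model (VLM) to process the retrieved images and text together with the user's request. The figure below shows how the pipeline works in detail.

Figure: How the video-understanding RAG pipeline works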

Source code:

https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/multimodal-rag

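Environment Setup

The example is written as a Jupyter Notebook, so a matching Python environment is required. The base environment can be installed by following the link below; choose the steps that match your operating system.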

https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-getting-started

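Figure: Navigation page for installing the base environment

In addition, the example depends on the OpenVINO integration components for LlamaIndex, which must be installed separately in the environment: the llama-index-embeddings-openvino library, which generates multimodal embeddings for images and text, and the llama-index-multi-modal-llms-openvino library, which performs vision-language inference.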
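As a quick sanity check before running the notebook, the sketch below reports which of the two integration packages are missing from the current environment. The helper name `missing_packages` is an illustration and not part of the original example; the suggested `pip install` line simply uses the package names given above.

```python
from importlib import metadata

# Package names as given in the article
REQUIRED_PACKAGES = [
    "llama-index-embeddings-openvino",
    "llama-index-multi-modal-llms-openvino",
]


def missing_packages(names):
    """Return the subset of distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All integration packages are installed.")
```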

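Model Download and Conversion

With the environment in place, download each model used in the pipeline in turn: the automatic speech recognition (ASR) model, the CLIP multimodal embedding model, and the vision-language model (VLM).

Because numerical precision affects model accuracy, this example downloads an already converted int8 ASR model directly from the OpenVINO organization on Hugging Face.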

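from pathlib import Path

import huggingface_hub as hf_hub

# Pre-converted int8 OpenVINO build of Distil-Whisper
asr_model_id = "OpenVINO/distil-whisper-large-v3-int8-ov"
asr_model_path = asr_model_id.split("/")[-1]

# Download only if the model is not already on disk
if not Path(asr_model_path).exists():
    hf_hub.snapshot_download(asr_model_id, local_dir=asr_model_path)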

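The CLIP and VLM models, by contrast, are obtained by downloading the original models and then converting and quantizing them with the Optimum Intel command-line tool.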

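from pathlib import Path

# optimum_cli is a helper from the notebook repository's cmd_helper module
from cmd_helper import optimum_cli

clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
clip_model_path = clip_model_id.split("/")[-1]

# Convert only if the OpenVINO model is not already on disk
if not Path(clip_model_path).exists():
    optimum_cli(clip_model_id, clip_model_path)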
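For readers working outside the notebook repository, where `cmd_helper` is not available, the conversion step can be approximated by calling the Optimum Intel CLI directly. The following is a minimal sketch, not the notebook's implementation: the function names `build_export_command` and `export_if_missing` are hypothetical, and the choice of int4 weights for the VLM in the printed example is an assumption based on the `vlm_int4_model_path` variable used later in the article.

```python
import shlex
import subprocess
from pathlib import Path


def build_export_command(model_id, output_dir, weight_format=None, trust_remote_code=False):
    """Build an `optimum-cli export openvino` command as an argument list."""
    cmd = ["optimum-cli", "export", "openvino", "--model", model_id]
    if weight_format is not None:
        # e.g. "int8" or "int4" weight compression
        cmd += ["--weight-format", weight_format]
    if trust_remote_code:
        cmd.append("--trust-remote-code")
    cmd.append(str(output_dir))
    return cmd


def export_if_missing(model_id, output_dir, **kwargs):
    """Run the export only when the output directory does not exist yet."""
    if Path(output_dir).exists():
        return None
    cmd = build_export_command(model_id, output_dir, **kwargs)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


# Show the command that would be run for the VLM (nothing is executed here)
print(shlex.join(build_export_command(
    "microsoft/Phi-3.5-vision-instruct",
    "Phi-3.5-vision-instruct-int4",  # hypothetical output directory
    weight_format="int4",
    trust_remote_code=True,
)))
```

Keeping the command builder separate from the function that runs it makes it easy to inspect the exact command before starting a long download and conversion.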

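Video Data Extraction and Processing

Next, third-party tools are used to extract the audio track and image frames from the video file, and the ASR model transcribes the audio into text so that it can be embedded later. This step uses an explainer video on the Gaussian distribution as the example (https://www.youtube.com/watch?v=d_qvLDhkg00). The following snippet initializes the ASR model and transcribes the audio. The transcription is saved locally as a .txt file.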

from optimum.intel import OVModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

# asr_device is assumed to be the device selector defined earlier in the notebook
asr_model = OVModelForSpeechSeq2Seq.from_pretrained(asr_model_path, device=asr_device.value)
asr_processor = AutoProcessor.from_pretrained(asr_model_path)

pipe = pipeline(
    "automatic-speech-recognition",
    model=asr_model,
    tokenizer=asr_processor.tokenizer,
    feature_extractor=asr_processor.feature_extractor,
)

# en_raw_speech is assumed to hold the audio extracted from the video
result = pipe(en_raw_speech, return_timestamps=True)
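The article mentions that the transcription is saved as a .txt file but does not show that step. A minimal sketch of it follows, assuming `result` has the usual shape returned by the Transformers ASR pipeline with timestamps enabled (a dict with a `"text"` string and a `"chunks"` list). The helper `save_transcript`, the file name, and the stand-in result are illustrative and do not come from the original notebook.

```python
import tempfile
from pathlib import Path


def save_transcript(result, output_path):
    """Write the recognised text to a UTF-8 .txt file and return its path."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(result["text"].strip() + "\n", encoding="utf-8")
    return output_path


# Stand-in for the dict returned by pipe(...); the real text comes from the ASR model
example_result = {
    "text": " The Gaussian function appears everywhere.",
    "chunks": [
        {"timestamp": (0.0, 3.2), "text": " The Gaussian function appears everywhere."},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_transcript(example_result, Path(tmp_dir) / "output_text.txt")
    print(saved.read_text(encoding="utf-8"), end="")
```

Saving the transcript into the same folder as the extracted frames lets a single directory reader pick up both text and images when the index is built.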

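Building the Multimodal Vector Index

This is the most critical step in the whole RAG chain: the text and images obtained from the video are converted into vectors and stored in a vector database. The quality of these vectors directly affects recall accuracy in the later retrieval step. First the CLIP model has to be initialized, which the OpenVINO integration for LlamaIndex makes straightforward.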

from llama_index.embeddings.huggingface_openvino import OpenVINOClipEmbedding

# clip_device is assumed to be the device selector defined earlier in the notebook
clip_model = OpenVINOClipEmbedding(model_id_or_path=clip_model_path, device=clip_device.value)

The vector store component provided by LlamaIndex can then be called directly to build the index quickly and to initialize the retriever.

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext, Settings
from llama_index.core.node_parser import SentenceSplitter

# Use the CLIP model as the default embedding model
Settings.embed_model = clip_model

# documents and storage_context are assumed to be created earlier in the notebook
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model=Settings.embed_model,
    transformations=[SentenceSplitter(chunk_size=300, chunk_overlap=30)],
)

# Return the 2 most similar text chunks and the 5 most similar frames per query
retriever_engine = index.as_retriever(similarity_top_k=2, image_similarity_top_k=5)
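To make the `chunk_size=300, chunk_overlap=30` setting more concrete, the sketch below shows plain fixed-size chunking with overlap over a list of tokens. It is a deliberate simplification and not how `SentenceSplitter` is implemented: the real splitter also respects sentence boundaries and counts tokens with a tokenizer. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size=300, chunk_overlap=30):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# A transcript of 600 whitespace-separated words as stand-in tokens
words = [f"w{i}" for i in range(600)]
chunks = chunk_with_overlap(words, chunk_size=300, chunk_overlap=30)
print([len(c) for c in chunks])
```

The overlap means that a sentence falling on a chunk boundary still appears in full in at least one neighboring chunk, which tends to help recall at the cost of some duplicated storage.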

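Multimodal Vector Retrieval

Traditional text RAG recalls the key text from the vector database by text similarity. Multimodal RAG additionally searches the image vectors to return the key frames most relevant to the question, which the VLM then interprets. Here the user's question is embedded, and the retriever returns the text chunks and video frames most similar to it. LlamaIndex provides components that reduce these steps to a few function calls.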

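from llama_index.core import SimpleDirectoryReader

query_str = "tell me more about gaussian function"

# retrieve is assumed to be a helper defined in the notebook that returns
# the retrieved image file paths and the retrieved text chunks
img, txt = retrieve(retriever_engine=retriever_engine, query_str=query_str)
# Load the retrieved key frames as image documents for the VLM
image_documents = SimpleDirectoryReader(input_dir=output_folder, input_files=img).load_data()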

Running the code displays the retrieved text chunks and key frames.

Figure: Key frames and related text chunks returned by retrieval
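Conceptually, the retrieval step ranks stored text and image vectors by their similarity to the query vector and keeps the top few of each kind. The sketch below illustrates that idea with brute-force cosine similarity over tiny hand-written vectors. It is an illustration only: the vectors and names are made up, cosine similarity is assumed as the metric, and a real vector database such as Qdrant uses an approximate nearest-neighbor index rather than an exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k):
    """Return the names of the k items most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy 3-dimensional embeddings standing in for CLIP vectors
text_store = [
    ("chunk_a", [1.0, 0.0, 0.0]),
    ("chunk_b", [0.7, 0.7, 0.0]),
    ("chunk_c", [0.0, 0.0, 1.0]),
]
image_store = [
    ("frame_01.png", [0.9, 0.1, 0.0]),
    ("frame_02.png", [0.0, 1.0, 0.0]),
    ("frame_03.png", [0.6, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]

# Text and images are searched separately, each with its own top-k
print(top_k(query, text_store, k=2))
print(top_k(query, image_store, k=2))
```

Because CLIP embeds text and images into the same vector space, the same query vector can be compared against both stores, which is what allows a text question to retrieve video frames.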

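Answer Generation

The last step of the multimodal RAG pipeline feeds the user's question, together with the retrieved text and images, into the VLM to generate an answer. This example uses Microsoft's Phi-3.5-vision-instruct multimodal model with the multimodal model component from the OpenVINO integration for LlamaIndex to interpret the images and text. Note that retrieval usually returns several key frames, so the chosen vision-language model must support multi-image input. The following code initializes the VLM.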

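from llama_index.multi_modal_llms.openvino import OpenVINOMultiModal

# messages_to_prompt, processor, and vlm_device are assumed to be defined earlier in the notebook
vlm = OpenVINOMultiModal(
    model_id_or_path=vlm_int4_model_path,
    device=vlm_device.value,
    messages_to_prompt=messages_to_prompt,
    trust_remote_code=True,
    # Greedy decoding; stop at the tokenizer's end-of-sequence token
    generate_kwargs={"do_sample": False, "eos_token_id": processor.tokenizer.eos_token_id},
)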

Once the VLM object is initialized, the context and images are passed to it to generate the final answer. The example also includes an interactive Gradio demo for reference.

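# qa_tmpl_str and context_str are assumed to be defined earlier in the notebook
response = vlm.stream_complete(
    prompt=qa_tmpl_str.format(context_str=context_str, query_str=query_str),
    image_documents=image_documents,
)
# Print the answer incrementally as it is generated
for r in response:
    print(r.delta, end="")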

The output is as follows:

“A Gaussian function, also known as a normal distribution, is a type of probability distribution that is symmetric and bell-shaped. It is characterized by its mean and standard deviation, which determine the center and spread of the distribution, respectively. The Gaussian function is widely used in statistics and probability theory due to its unique properties and applications in various fields such as physics, engineering, and finance. The function is defined by the equation e to the negative x squared, where x represents the input variable. The graph of a Gaussian function is a smooth curve that approaches the x-axis as it moves away from the center, creating a bell-like shape. The function is also known for its property of being able to describe the distribution of random variables, making it a fundamental concept in probability theory and statistics.”
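The generation call above formats a prompt template with `context_str` and `query_str`, neither of which is shown in the article. A minimal sketch of how such a prompt could be assembled from the retrieved text chunks follows. The template wording, the constant `QA_TEMPLATE`, and the helper `build_prompt` are assumptions for illustration and are not taken from the original notebook.

```python
# Hypothetical template; the notebook's actual qa_tmpl_str may be worded differently
QA_TEMPLATE = (
    "Answer the query using only the provided images and the context below.\n"
    "---------------------\n"
    "Context: {context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(text_chunks, query_str, template=QA_TEMPLATE):
    """Join the retrieved text chunks into one context and fill the template."""
    context_str = "\n".join(chunk.strip() for chunk in text_chunks if chunk.strip())
    return template.format(context_str=context_str, query_str=query_str)


# Stand-in for the text chunks returned by retrieval
retrieved_text = [
    "The function e to the negative x squared is called a Gaussian.",
    "Its graph is the familiar bell curve.",
]
print(build_prompt(retrieved_text, "tell me more about gaussian function"))
```

Instructing the model to rely only on the supplied context and images keeps the answer grounded in what was actually retrieved from the video.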

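Summary

In video understanding tasks, feeding every video frame to the VLM at once places a heavy burden on VLM performance and resource usage. With multimodal RAG, the key frames are retrieved first, which reduces the amount of input the VLM must process and improves the efficiency and accuracy of the whole system. The OpenVINO integration for LlamaIndex provides a complete solution while running every model in the pipeline smoothly on a local PC.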


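Original title: Developer in Practice | How to Build a Multimodal RAG Application Locally with OpenVINO™

Source: WeChat official account 英特尔物联网 (Intel IoT). Please credit the source when republishing.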
