Upload folder using ModelScope SDK

Cherrytest 2025-04-08 18:06:02 +00:00
parent ca99d524a7
commit 9aa6b1807d
15 changed files with 152140 additions and 44 deletions

2
.gitattributes vendored

@@ -45,3 +45,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text

283
README.md

@@ -1,47 +1,252 @@
---
pipeline_tag: text-classification
tags:
- vidore
- reranker
- qwen2_vl
language:
- multilingual
base_model:
- Qwen/Qwen2-VL-2B-Instruct
inference: false
license: cc-by-nc-4.0
library_name: transformers
---
### Download the model

You can download the model files and weights with the ModelScope SDK or with the `git clone` command below.

**SDK download**
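```bash
# Install ModelScope
pip install modelscope
```

```python
# Download the model with the ModelScope SDK
from modelscope import snapshot_download

model_dir = snapshot_download('jinaai/jina-reranker-m0')
```

**Git download**

```bash
# Download the model repository with git
git clone https://www.modelscope.cn/jinaai/jina-reranker-m0.git
```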
<br><br>
<p align="center">
<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
</p>
<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>
[Blog](https://jina.ai/news/jina-reranker-m0-multilingual-multimodal-document-reranker) | [API](https://jina.ai/reranker) | [AWS](https://aws.amazon.com/marketplace/pp/prodview-ctlpeffe5koac?sr=0-1&ref_=beagle&applicationId=AWSMPContessa) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-reranker-m0) | [Arxiv](coming soon)
# jina-reranker-m0: Multilingual Multimodal Document Reranker
## Intended Usage & Model Info
**jina-reranker-m0** is our new **multilingual multimodal reranker** model for ranking visual documents across multiple languages: it accepts a query alongside a collection of visually rich document images, including pages with text, figures, tables, infographics, and various layouts across multiple domains and over 29 languages.
It outputs a ranked list of documents ordered by their relevance to the input query. Compared to `jina-reranker-v2-base-multilingual`, `jina-reranker-m0` also improves text reranking for multilingual content, long documents, and code searching tasks.
## Architecture
**jina-reranker-m0** is built on a decoder-only vision language model architecture, specifically:
- **Base model**: `Qwen2-VL-2B-Instruct`, utilizing its vision encoder, projection layer, and language model
- **Adaptation**: Fine-tuned the language model with LoRA (Low-Rank Adaptation) techniques
- **Output layer**: Post-trained MLP head to generate ranking scores measuring query-document relevance
- **Training objective**: Optimized with pairwise and listwise ranking losses to produce discriminative relevance scores
This represents a significant architectural shift from our previous cross-encoder models:
| | **jina-reranker-m0** | **jina-reranker-v2** |
|----------------------------------|--------------------------------------|-------------------------------------|
| **Architecture** | Vision Language Model | Cross-Encoder |
| **Base model** | Qwen2-VL-2B | Jina-XLM-RoBERTa |
| **Parameters** | 2.4 B | 278 M |
| **Max context length** | 10,240 tokens (query + document) | 8,192 tokens |
| **Image processing** | 768 × 28 × 28 patches (dynamic resolution) | ❌ |
| **Multilingual support** | 29+ languages | Multiple languages |
| **Tasks supported** | Text2Text, Text2Image,<br>Image2Text, Text2Mixed | Text2Text |
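As described in the architecture notes above, the ranking score comes from a small MLP head applied to the hidden state of a final scoring token (see `modeling.py` in this repository). Below is a minimal, illustrative sketch of just that scoring step; the `hidden_size` value is taken from `config.json`, the sigmoid scale from `modeling.py`, and the random tensor merely stands in for hidden states produced by the backbone:
```python
import torch
from torch import nn

hidden_size = 1536  # "hidden_size" in config.json

# Two-layer MLP head, mirroring the `score` module defined in modeling.py
score_head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, 1),
)

# Stand-in for the last-layer hidden state of the appended scoring token
# (one row per query-document pair; real values come from the Qwen2-VL backbone).
last_token_hidden = torch.randn(2, hidden_size)

logits = score_head(last_token_hidden).squeeze(-1)
scores = torch.sigmoid(logits * 0.68)  # 0.68 is the LOGIT_SCALE used in modeling.py
print(scores)  # relevance scores in [0, 1], one per pair
```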
## Capabilities
- **Multimodal Understanding**: Processes both textual and visual content, including pages with mixed text, figures, tables, and various layouts
- **Long Context Processing**: Handles up to 10K tokens, enabling reranking of lengthy documents
- **Dynamic Image Resolution**: Supports images from 56×56 pixels up to 4K resolution with dynamic patch processing
- **Multilingual Support**: Effectively reranks content across 29+ languages, including bidirectional language pairs
- **Zero-shot Domain Transfer**: Performs well on unseen domains and document types without specific fine-tuning
- **Code Search**: Enhanced capabilities for programming language search and technical document ranking
Compared to `jina-reranker-v2-base-multilingual`, `jina-reranker-m0` significantly improves text reranking for multilingual content, long documents, and code searching tasks, while adding powerful new capabilities for visual document understanding.
# Usage
1. The easiest way to use `jina-reranker-m0` is to call Jina AI's [Reranker API](https://jina.ai/reranker/).
```bash
curl -X POST \
https://api.jina.ai/v1/rerank \
-H "Content-Type: application/json" \
-H "Authorization: Bearer JINA_API_KEY" \
-d '{
"model": "jina-reranker-m0",
"query": "slm markdown",
"documents": [
{
"image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
},
{
"image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
},
{
"image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png"
},
{
"text": "We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements."
},
{
"image": "https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp"
},
{
"text": "数据提取么?为什么不用正则啊,你用正则不就全解决了么?"
},
{
"text": "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold."
},
{
"text": "Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar."
}
],
"return_documents": false
}'
```
You will receive a JSON response with the relevance scores for each document in relation to the query. The response will look like this:
```json
{
"model":"jina-reranker-m0",
"usage": {
"total_tokens":2813
},
"results":[
{
"index":1,
"relevance_score":0.9310624287463884
},
{
"index":4,
"relevance_score":0.8982678574191957
},
{
"index":0,
"relevance_score":0.890233167219021
},
...
]
}
```
The `relevance_score` field indicates the relevance of each document to the query, with higher scores indicating greater relevance.
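If you prefer calling the API from Python instead of curl, the sketch below does the same thing with the `requests` package (an assumption, not a dependency of this model) and sorts the results by `relevance_score`; replace `JINA_API_KEY` with your own key:
```python
import requests

API_URL = "https://api.jina.ai/v1/rerank"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer JINA_API_KEY",  # replace with your key
}
payload = {
    "model": "jina-reranker-m0",
    "query": "slm markdown",
    "documents": [
        {"image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"},
        {"text": "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold."},
    ],
    "return_documents": False,
}

response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
results = response.json()["results"]

# Sort the returned entries by relevance_score, highest first
for item in sorted(results, key=lambda r: r["relevance_score"], reverse=True):
    print(item["index"], item["relevance_score"])
```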
2. You can also use the `transformers` library to interact with the model programmatically.
Before you start, install the `transformers` library:
```bash
pip install "transformers>=4.47.3"
```
And then use the following code snippet to load the model:
```python
from transformers import AutoModel
model = AutoModel.from_pretrained(
'jinaai/jina-reranker-m0',
torch_dtype="auto",
trust_remote_code=True,
)
model.to('cuda') # or 'cpu' if no GPU is available
model.eval()
```
Now you can use the model's `compute_score` function to compute relevance scores for a query and a list of documents. The function takes a list of query-document pairs and returns a list of scores indicating the relevance of each document to the query.
**A. Visual Documents Reranking**
For image documents, you can use the following code snippet:
```python
# Example query and documents
query = "slm markdown"
documents = [
"https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png",
"https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png",
"https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png",
"https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp"
]
# construct sentence pairs
image_pairs = [[query, doc] for doc in documents]
scores = model.compute_score(image_pairs, max_length=2048, doc_type="image")
# [0.8576154708862305, 0.9356858730316162, 0.8496521711349487, 0.8664582967758179]
```
**B. Textual Documents Reranking**
```python
query = "slm markdown"
documents = [
"We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.",
"数据提取么?为什么不用正则啊,你用正则不就全解决了么?",
"During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold.",
"Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar.",
]
# construct sentence pairs
text_pairs = [[query, doc] for doc in documents]
scores = model.compute_score(text_pairs, max_length=1024, doc_type="text")
```
The scores will be a list of floats, where each float represents the relevance score of the corresponding document to the query. Higher scores indicate higher relevance.
For instance, the returned scores in this case would be:
```bash
[0.9127850532531738, 0.8384682536125183, 0.8870794177055359, 0.842738926410675]
```
**C. Image Querying for Textual Documents**
The model also supports querying textual documents with an image query. You can use the following code snippet:
```python
query = "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
documents = [
"We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.",
"数据提取么?为什么不用正则啊,你用正则不就全解决了么?",
"During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold.",
"Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar.",
]
# reverse the order of the query and document so the image fills the document slot
image_pairs = [[doc, query] for doc in documents]
scores = model.compute_score(image_pairs, max_length=2048, doc_type="image")
# [0.9048659801483154, 0.8266222476959229, 0.8326289653778076, 0.9075747132301331]
```
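To turn any of these score lists into a ranked list of documents, you can pair each document with its score and sort in descending order. A small illustrative sketch, reusing the `documents` and `scores` variables from the examples above:
```python
# Pair each document with its score and sort from most to least relevant
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
for rank, (doc, score) in enumerate(ranked, start=1):
    preview = doc if len(doc) <= 60 else doc[:60] + "..."
    print(f"{rank}. {score:.4f}  {preview}")
```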
# Model Performance
We conduct extensive evaluations on the performance of the model across various visual retrieval benchmarks.
![Model performance comparison across benchmarks](https://jina-ai-gmbh.ghost.io/content/images/size/w1600/2025/04/all-benchmarks--6-.png)
As shown in the figure above, the performance of the `jina-reranker-m0` on `ViDoRe`, `MBEIR`, and `Winoground` visual retrieval benchmarks showcases its capabilities across diverse multimodal retrieval tasks spanning multiple domains and languages. Each dot represents performance scores for different types of visual documents. The boxplots illustrate the distribution of these scores, with the highlighted numbers indicating the average (mean) performance.
We also evaluate the performance of the `jina-reranker-m0` across four text-to-text reranking benchmarks. Each benchmark may include multiple datasets, languages, or tasks, represented by individual dots inside the boxplot. The boxplot shows the distribution of these scores, with the highlighted number showing the average (mean) performance. While most benchmarks use NDCG@10 as their performance metric, MKQA uses recall@10 instead, as MKQA's annotation data doesn't support NDCG calculation (the official evaluation uses recall, which determines document relevance through heuristics).
![Model performance comparison across text-to-text benchmarks](https://jina-ai-gmbh.ghost.io/content/images/size/w1600/2025/04/model-perf-boxplot--13-.png)
For complete benchmark results, please refer to the [online results table](https://docs.google.com/spreadsheets/d/1KrCD7l0lhzMkyg3z-gEDmymxe4Eun9Z-C0kU3_cxw7Q/edit?usp=sharing).
# Contact
Join our [Discord community](https://discord.jina.ai/) and chat with other community members about ideas.
# License
`jina-reranker-m0` is listed on AWS & Azure. If you need to use it beyond those platforms or on-premises within your company, note that the model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://jina.ai/contact-sales/).

16
added_tokens.json Normal file

@@ -0,0 +1,16 @@
{
"<|box_end|>": 151649,
"<|box_start|>": 151648,
"<|endoftext|>": 151643,
"<|im_end|>": 151645,
"<|im_start|>": 151644,
"<|image_pad|>": 151655,
"<|object_ref_end|>": 151647,
"<|object_ref_start|>": 151646,
"<|quad_end|>": 151651,
"<|quad_start|>": 151650,
"<|video_pad|>": 151656,
"<|vision_end|>": 151653,
"<|vision_pad|>": 151654,
"<|vision_start|>": 151652
}

3
chat_template.json Normal file

@@ -0,0 +1,3 @@
{
"chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
}

45
config.json Normal file

@@ -0,0 +1,45 @@
{
"_name_or_path": "jinaai/jina-reranker-v3",
"architectures": ["JinaVLForRanking"],
"auto_map": {
"AutoModel": "modeling.JinaVLForRanking"
},
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 1536,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 8960,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2_vl",
"num_attention_heads": 12,
"num_hidden_layers": 28,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-6,
"rope_scaling": {
"mrope_section": [16, 24, 24],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.47.1",
"use_cache": false,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"hidden_size": 1536,
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 151936
}

1
configuration.json Normal file

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "text-classification", "allow_remote": true}

14
generation_config.json Normal file

@@ -0,0 +1,14 @@
{
"attn_implementation": "flash_attention_2",
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"temperature": 0.01,
"top_k": 1,
"top_p": 0.001,
"transformers_version": "4.47.1"
}

151388
merges.txt Normal file

File diff suppressed because it is too large

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e626a7e74b20ab788506406444b5805af6532f1e83b35ff4d0e98215095b6753
size 135

228
modeling.py Normal file

@@ -0,0 +1,228 @@
import torch
from torch import nn
import numpy as np
from typing import Optional, Tuple, List, Union
from transformers import Qwen2VLForConditionalGeneration
import logging
import warnings
from PIL import Image
from transformers.image_utils import load_image

logger = logging.getLogger(__name__)

LOGIT_SCALE = 0.68


def load_images(images, lazy_load: bool = True):
    # Disable PIL DecompositionBomb threshold for reading large images.
    pil_max_px = Image.MAX_IMAGE_PIXELS
    Image.MAX_IMAGE_PIXELS = None

    images_batch = []
    for image in images:
        if isinstance(image, Image.Image):
            images_batch.append(image)
        else:
            pil_image = load_image(image)
            if lazy_load:
                images_batch.append(pil_image)
            else:
                # avoid Too many open files error
                images_batch.append(pil_image.copy())
                pil_image.close()

    Image.MAX_IMAGE_PIXELS = pil_max_px
    return images_batch


def formatting_prompts_func(
    query: str,
    doc: str,
    query_type: str = 'text',
    doc_type: str = 'text',
    prefix_str: str = '',
) -> str:
    """
    Format prompts for different combinations of query and content types.

    Args:
        query: Query text or image path
        doc: Content text or image path
        query_type: Whether query is an image
        doc_type: Whether content is an image
        prefix_str: Optional prefix string to add
    """
    # Format query part
    if query_type == 'image':
        query_part = "**Query**:\n<|vision_start|><|image_pad|><|vision_end|>"
    else:
        query_part = f"**Query**:\n{query}"

    # Format content part
    if doc_type == 'image':
        doc_part = "**Document**:\n<|vision_start|><|image_pad|><|vision_end|>"
    else:
        doc_part = f"**Document**:\n{doc}"

    # Combine parts
    prompt = doc_part + '\n' + query_part

    # Add prefix if provided
    if prefix_str:
        prompt = prefix_str + '\n' + prompt

    return prompt


class JinaVLForRanking(Qwen2VLForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)

        self.padding_side = "left"
        self.num_labels = 1  # config.num_labels

        # hack the lm_head to do nothing, since we only want the hidden states
        self.lm_head = nn.Identity()

        # copy the idea from `Qwen2ForRewardModel` to have a MLP layer to get the final score
        self.score = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.ReLU(),
            nn.Linear(config.hidden_size, self.num_labels),
        )

        # Initialize weights and apply final processing
        self.post_init()

        self.score_token_id = 100

    def forward(self, *args, **kwargs) -> torch.Tensor:
        # Delete output_hidden_states from kwargs
        kwargs.pop("output_hidden_states", None)
        kwargs.pop("use_cache", None)

        assert kwargs.pop("labels", None) is None, "labels should not be passed to forward()"

        outputs = super().forward(
            *args,
            use_cache=False,
            output_hidden_states=True,
            **kwargs,
        )

        # get the hidden states of the last layer
        hidden_states = outputs.hidden_states[-1]

        # IMPORTANT: the padding token must be on the left side
        # get the hidden states of the last token and apply the linear layer
        pooled_logits = self.score(hidden_states[:, -1])

        return pooled_logits.squeeze(-1)

    @torch.no_grad()
    def compute_score(
        self,
        pairs: Union[List[Tuple[str, str]], Tuple[str, str]],
        batch_size: int = 8,
        max_length: int = 10240,
        max_query_length: int = 1024,
        max_doc_length: Optional[int] = None,
        query_type: str = 'text',
        doc_type: str = 'text',
        normalize_scores: bool = True,
        show_progress: bool = False,
    ) -> List[float]:
        if not hasattr(self, "_processor"):
            from transformers import AutoProcessor

            self._processor = AutoProcessor.from_pretrained(
                self.name_or_path, max_pixels=602112, min_pixels=3136, trust_remote_code=True
            )

        assert isinstance(pairs, list)
        if isinstance(pairs[0], str):
            pairs = [pairs]

        max_length = max_length or self.config.max_length

        if max_doc_length is None:
            max_doc_length = max(max_length - max_query_length, max_query_length)

        if max_doc_length < max_query_length:
            warnings.warn(
                f"max_doc_length={max_doc_length} should be greater than max_query_length={max_query_length}"
            )

        assert (
            max_doc_length + max_query_length <= max_length
        ), f"max_doc_length ({max_doc_length}) + max_query_length ({max_query_length}) should be less than max_length ({max_length})"

        max_length = max_length - 1

        all_scores = []

        device = next(self.parameters()).device

        batch_iter = range(0, len(pairs), batch_size)
        if show_progress:
            from tqdm import trange

            batch_iter = trange(0, len(pairs), batch_size, desc="Computing scores")

        for start_index in batch_iter:
            mini_batch = pairs[start_index : start_index + batch_size]

            batch_inputs = []
            for q, d in mini_batch:
                # TEMP FIX: Truncate long documents
                if doc_type == 'text':
                    tokens = self._processor.tokenizer(d, truncation=True, max_length=max_doc_length)
                    if len(tokens['input_ids']) >= max_doc_length:
                        d = self._processor.tokenizer.decode(tokens['input_ids'])
                batch_inputs.append(formatting_prompts_func(q, d, query_type=query_type, doc_type=doc_type))

            batch_images = None
            if doc_type == 'image':
                batch_images = load_images([d for (q, d) in mini_batch])
            elif query_type == 'image':
                batch_images = load_images([q for (q, d) in mini_batch])

            batch = self._processor(
                text=batch_inputs,
                images=batch_images,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=max_length,
            )

            # append the reward token to the input_ids and attention_mask
            batch_size = batch["input_ids"].size(0)
            batch["input_ids"] = torch.cat(
                [
                    batch["input_ids"],
                    torch.full((batch_size, 1), self.score_token_id, device=batch["input_ids"].device),
                ],
                dim=1,
            )
            batch["attention_mask"] = torch.cat(
                [
                    batch["attention_mask"],
                    torch.ones((batch_size, 1), device=batch["attention_mask"].device),
                ],
                dim=1,
            )

            # move the batch to the correct device
            batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}

            scores = self.forward(**batch).view(-1).cpu().float().numpy()

            # normalize scores to [0, 1] with sigmoid with a scale
            scores = 1.0 / (1.0 + np.exp(-scores * LOGIT_SCALE))

            all_scores.extend(scores.tolist())

        if len(all_scores) == 1:
            return all_scores[0]

        return all_scores

11
preprocessor_config.json Normal file

@@ -0,0 +1,11 @@
{
"min_pixels": 3136,
"max_pixels": 12845056,
"patch_size": 14,
"temporal_patch_size": 2,
"merge_size": 2,
"image_mean": [0.48145466, 0.4578275, 0.40821073],
"image_std": [0.26862954, 0.26130258, 0.27577711],
"image_processor_type": "Qwen2VLImageProcessor",
"processor_class": "Qwen2VLProcessor"
}

31
special_tokens_map.json Normal file

@@ -0,0 +1,31 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

BIN
tokenizer.json (Stored with Git LFS) Normal file

Binary file not shown.

145
tokenizer_config.json Normal file

@@ -0,0 +1,145 @@
{
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151646": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151647": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151648": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151649": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151650": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151651": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151652": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151653": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151654": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151655": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151656": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"bos_token": null,
"chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"errors": "replace",
"extra_special_tokens": {},
"model_max_length": 32768,
"pad_token": "<|endoftext|>",
"padding_side": "left",
"processor_class": "Qwen2VLProcessor",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long