2025-10-20 08:56:49 +00:00
---
pipeline_tag: image-text-to-text
language:
- multilingual
tags:
- deepseek
- vision-language
- ocr
- custom_code
license: mit
---
<div align="center">
<img src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/logo.svg?raw=true" width="60%" alt="DeepSeek AI" />
</div>
<hr>
<div align="center">
<a href="https://www.deepseek.com/" target="_blank">
<img alt="Homepage" src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/badge.svg?raw=true" />
</a>
<a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR" target="_blank">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
</a>
</div>
<div align="center">
<a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
<img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
</a>
<a href="https://twitter.com/deepseek_ai" target="_blank">
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
</a>
</div>
<p align="center">
<a href="https://github.com/deepseek-ai/DeepSeek-OCR"><b>🌟 GitHub</b></a> |
<a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"><b>📥 Model Download</b></a> |
<a href="https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf"><b>📄 Paper Link</b></a> |
<a href=""><b>📄 Arxiv Paper Link</b></a> |
</p>
<h2>
<p align="center">
<a href="">DeepSeek-OCR: Contexts Optical Compression</a>
</p>
</h2>
<p align="center">
<img src="assets/fig1.png" style="width: 1000px" align="center">
</p>
<p align="center">
<a href="">Explore the boundaries of visual-text compression.</a>
</p>
## Usage
Run inference with Hugging Face Transformers on NVIDIA GPUs. The requirements below were tested with Python 3.12.9 and CUDA 11.8:
```
torch==2.6.0
transformers==4.46.3
tokenizers==0.20.3
einops
addict
easydict
# Install flash-attn separately, after torch is installed:
pip install flash-attn==2.7.3 --no-build-isolation
```
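Before loading the model, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch using only the Python standard library; `PINNED` and `find_mismatches` are illustrative names introduced here, not part of DeepSeek-OCR, and the unpinned packages (`einops`, `addict`, `easydict`) are not checked.

```python
from importlib import metadata

# Pins copied from the requirements list above.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.46.3",
    "tokenizers": "0.20.3",
    "flash-attn": "2.7.3",
}


def find_mismatches(pinned, lookup=metadata.version):
    """Return {package: (expected, found)} for every pin that is not met.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # A local build tag such as "2.6.0+cu118" is treated as matching "2.6.0".
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The script prints one line per mismatch and nothing when every pin is satisfied.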
```python
from transformers import AutoModel, AutoTokenizer
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)
# Plain OCR without layout grounding:
# prompt = "<image>\nFree OCR. "
# Document-to-markdown conversion with grounding:
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'
# infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):
# Tiny: base_size = 512, image_size = 512, crop_mode = False
# Small: base_size = 640, image_size = 640, crop_mode = False
# Base: base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False
# Gundam: base_size = 1024, image_size = 640, crop_mode = True
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path = output_path, base_size = 1024, image_size = 640, crop_mode=True, save_results = True, test_compress = True)
```
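The five resolution presets listed in the comments above differ only in `base_size`, `image_size`, and `crop_mode`. As a sketch, they can be collected in a lookup table so that a preset is chosen by name. `MODES` and `mode_kwargs` are hypothetical helpers introduced here for illustration; they are not part of the model's API.

```python
# Preset values copied from the comments in the snippet above.
MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}


def mode_kwargs(mode):
    """Hypothetical helper: return a copy of the infer() keyword arguments for a preset."""
    key = mode.lower()
    if key not in MODES:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(MODES)}")
    # Return a copy so callers can adjust values without changing the table.
    return dict(MODES[key])


# Usage, assuming model, tokenizer, prompt, image_file and output_path
# are defined as in the snippet above:
# res = model.infer(tokenizer, prompt=prompt, image_file=image_file,
#                   output_path=output_path, save_results=True,
#                   **mode_kwargs("gundam"))
```

Passing the preset with `**mode_kwargs(...)` keeps the three values consistent with each other, which is easy to get wrong when editing them by hand.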
## vLLM
See the [🌟GitHub](https://github.com/deepseek-ai/DeepSeek-OCR/) repository for guidance on inference acceleration, PDF processing, and more.
## Visualizations
<table>
<tr>
<td><img src="assets/show1.jpg" style="width: 500px"></td>
<td><img src="assets/show2.jpg" style="width: 500px"></td>
</tr>
<tr>
<td><img src="assets/show3.jpg" style="width: 500px"></td>
<td><img src="assets/show4.jpg" style="width: 500px"></td>
</tr>
</table>
## Acknowledgement
We would like to thank [Vary](https://github.com/Ucas-HaoranWei/Vary/), [GOT-OCR2.0](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/), [MinerU](https://github.com/opendatalab/MinerU), [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [OneChart](https://github.com/LingyvKong/OneChart), and [Slow Perception](https://github.com/Ucas-HaoranWei/Slow-Perception) for their valuable models and ideas.
We also appreciate the benchmarks [Fox](https://github.com/ucaslcl/Fox) and [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
## Citation
Coming soon!