PDF Extractor Skill

Extract text and LaTeX formulas from academic PDFs in English and Chinese, outputting structured Markdown with math, tables, and images preserved.
a851445115
Extract text and mathematical formulas from academic PDF papers. Supports both English and Chinese content.

When to Use This Skill

Use this skill when:

  • User needs to extract text and LaTeX formulas from PDF papers
  • User mentions "PDF转文本" (PDF to text), "PDF提取公式" (extract formulas from PDF), or "论文OCR" (paper OCR)
  • User wants to convert academic papers to Markdown format

Tool Selection

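Tool | Best For | Languages | Math Quality
-----------------------------------------
Marker (recommended) | Chinese and English papers, complex formulas | Chinese + English | Excellent
Nougat | English-only papers, arXiv | English only | Excellent

Marker is recommended: it handles mixed Chinese and English text and gives better formula recognition.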
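
The selection rule in the table can be sketched as a small helper. This is only an illustration, not part of the skill's scripts: `contains_cjk` and `choose_tool` are hypothetical names, and checking a single Unicode block is a simplifying assumption. The skill itself recommends Marker by default, so treat the Nougat branch as an optional choice for English-only papers.

```python
def contains_cjk(text):
    """Return True if any character is in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def choose_tool(sample_text):
    """Pick a tool name from a text sample (for example a title or abstract).

    Hypothetical helper: Marker for anything containing Chinese characters,
    Nougat only when the sample looks English-only.
    """
    return "marker" if contains_cjk(sample_text) else "nougat"
```

For example, `choose_tool("基于深度学习的方法")` selects Marker, while `choose_tool("Attention Is All You Need")` selects Nougat.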


Environment Setup

Conda Environment: pdf-extractor

Python Path: D:\anaconda3\envs\pdf-extractor\python.exe

Key Dependencies

  • PyTorch 2.10.0+cu128 (CUDA 12.8)
  • marker-pdf (Surya OCR + Texify)
  • nougat-ocr 0.1.17
  • transformers

Important: Keep This Skill Self-Contained (No Extra Installs)

This skill is expected to run using ONLY the existing pdf-extractor conda environment and the scripts in scripts/.

Rules:

  • Do NOT run pip install ... / conda install ... / download random libraries during extraction.
  • If a dependency is missing (e.g., Nougat crashes due to missing torchvision), do NOT try to fix by installing packages. Switch tools (prefer Marker) or report the environment issue.
  • Slow runtime is normal for Marker (especially with --ark-code-latest). Prefer splitting the PDF rather than changing tools or adding dependencies.
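
One way to follow these rules is to check which modules are importable before starting, without installing anything. The sketch below is an assumption-laden illustration: the import names (`marker` for marker-pdf, `nougat` for nougat-ocr) and the per-tool module lists are guesses based on the dependency list above, and `missing_modules` and `pick_usable_tool` are hypothetical helpers.

```python
import importlib.util

# Assumed import names and per-tool requirements; adjust to the real environment.
REQUIRED = {
    "marker": ["torch", "marker", "transformers"],
    "nougat": ["torch", "torchvision", "nougat", "transformers"],
}


def missing_modules(names):
    """Return the module names that cannot be found, without importing or installing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pick_usable_tool(preferred=("marker", "nougat")):
    """Return the first tool whose modules are all present, or None to report an environment issue."""
    for tool in preferred:
        if not missing_modules(REQUIRED[tool]):
            return tool
    return None
```

If `pick_usable_tool()` returns `None`, the right response under the rules above is to report the missing modules rather than install them.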

Recommended approach for long PDFs:

  • Use --page-range (0-based) to extract per page or small page batches.
  • Merge the resulting markdown files afterward (simple concatenation is fine). Keep the combined file in the same folder as the per-page outputs so image links remain valid.

Example (per-page extraction with LLM mode):

D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out/page_01.md"
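
The per-page approach can be scripted roughly as follows. This is a minimal sketch: `page_command` and `merge_pages` are hypothetical helpers, the `page_NN.md` naming simply mirrors the example above, and only flags that appear in this document are used.

```python
from pathlib import Path

# Paths copied from the commands in this document.
PYTHON_EXE = r"D:\anaconda3\envs\pdf-extractor\python.exe"
MARKER_SCRIPT = r"C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py"


def page_command(pdf_path, page_index, out_dir):
    """Build the argument list for extracting one 0-based page with Marker."""
    out_file = Path(out_dir) / f"page_{page_index + 1:02d}.md"
    return [
        PYTHON_EXE, MARKER_SCRIPT, str(pdf_path),
        "--ark-code-latest",
        "--page-range", str(page_index),
        "-o", str(out_file),
    ]


def merge_pages(out_dir, combined_name="combined.md"):
    """Concatenate page_*.md files in page order into one file in the same folder."""
    out_dir = Path(out_dir)
    # Sort numerically so page_10 comes after page_02.
    pages = sorted(out_dir.glob("page_*.md"), key=lambda p: int(p.stem.split("_")[1]))
    parts = [p.read_text(encoding="utf-8").strip() for p in pages]
    combined = out_dir / combined_name
    combined.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return combined
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. Writing the combined file into the same folder keeps relative image links valid, as noted above.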

Tool 1: Marker (Recommended, Chinese and English Support)

Command Line

# Convert a Chinese paper (Chinese and English are supported by default)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "论文.pdf"

# Specify the output path
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" -o "output.md"

# Force OCR (for scanned PDFs)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "scanned.pdf" --force-ocr

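# Use the Volcano Ark (火山方舟) Coding Plan (OpenAI-compatible) to improve conversion quality (more stable tables, formulas, and cross-page structure)
# Note: this defaults to ark-code-latest; the backend automatically routes to a suitable model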
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest

# Run only the first page as a quick check (0-based page index)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out_first_page.md"

# For a custom setup (not recommended), you can also pass --openai-base-url/--openai-api-key/--openai-model manually

# Specify languages
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --languages Chinese English Japanese

Python API

import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
from pdf2md_marker import convert_pdf, convert_pdf_cli

# Simple usage
output_file = convert_pdf_cli('论文.pdf', 'output.md')

# Full API
markdown_text, metadata = convert_pdf(
    'paper.pdf',
    output_dir='./output',
    force_ocr=False,
    batch_multiplier=2,
    languages=['Chinese', 'English']
)
print(markdown_text)

Marker Options

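Option | Description
---------------------
-o, --output | Output file (.md) or directory
--force-ocr | Force OCR even for text PDFs
--batch-multiplier | Batch size multiplier (default: 2)
--languages | Languages in document (default: Chinese English)
--page-range | Pages to process, 0-based (e.g., "0" for the first page)
--ark-code-latest | Use the Volcano Ark Coding Plan (OpenAI-compatible) to improve conversion quality
--openai-base-url, --openai-api-key, --openai-model | Manual LLM endpoint settings (not recommended)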

Tool 2: Nougat (English-Only Papers)

Command Line

# Convert entire PDF
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf"

# Convert specific pages
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -p 0-5

# Custom output
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -o output.mmd

# Save each page separately
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" --per-page

Python API

import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
from pdf2latex import load_model, process_pdf, save_results

# Load model (uses GPU if available)
model, device = load_model()

# Process PDF
results = process_pdf('paper.pdf', model, device)

# Save as single markdown file
save_results(results, 'output.mmd')

# Or save per page
save_results(results, 'output_pages/', format='pages')

Nougat Options

Option | Description
---------------------
-o, --output | Output file or directory
-p, --pages | Page range (e.g., "0-5" or "1,3,5")
-m, --model | Model tag (default: 0.1.0-base)
--dpi | Render DPI (default: 300)
--cpu | Force CPU mode
--per-page | Save each page separately
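
To make the two `--pages` formats concrete, here is a sketch of how such a spec could be expanded into page indices. It assumes ranges are inclusive at both ends; the script's own parser may behave differently, and `parse_pages` is a hypothetical name used only for illustration.

```python
def parse_pages(spec):
    """Expand a spec such as "0-5" or "1,3,5" into a list of page indices.

    Assumption: "a-b" includes both a and b.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))
        elif part:
            pages.append(int(part))
    return pages
```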

Output Format

Both tools output Markdown with LaTeX math:

  • Text is extracted as regular Markdown
  • Mathematical formulas are in LaTeX format:
      • Inline: $formula$
      • Display: $$formula$$
  • Tables, figures, and references are preserved
  • Marker also extracts images to a separate folder
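
Given the `$...$` and `$$...$$` delimiters described above, formulas can be pulled back out of the generated Markdown with two regular expressions. This is a rough sketch rather than a full Markdown parser: `extract_math` is a hypothetical helper, and it does not handle escaped dollar signs or currency amounts.

```python
import re

DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)  # $$ ... $$, may span lines
INLINE_MATH = re.compile(r"\$([^$\n]+?)\$")             # $ ... $ on a single line


def extract_math(markdown_text):
    """Return (inline_formulas, display_formulas) found in the Markdown text."""
    display = [m.strip() for m in DISPLAY_MATH.findall(markdown_text)]
    # Remove display blocks first so their delimiters are not matched as inline math.
    without_display = DISPLAY_MATH.sub(" ", markdown_text)
    inline = [m.strip() for m in INLINE_MATH.findall(without_display)]
    return inline, display
```

This can serve as a quick check that a converted page still contains the formulas you expect.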

Comparison

Feature | Marker | Nougat
-------------------------
Chinese Support | ✓ Excellent | ✗ Poor
English Support | ✓ Excellent | ✓ Excellent
Math Formulas | ✓ (Texify) | ✓ (Native)
Table Extraction | ✓ | ✓
Image Extraction | ✓ | ✗
Speed (RTX 4060) | ~2 min/page | ~10-15 sec/page
OCR Quality | Excellent | Good
Troubleshooting

Import Errors

Make sure you're using the correct Python:

D:\anaconda3\envs\pdf-extractor\python.exe your_script.py

CUDA Out of Memory

Try CPU mode (Nougat) or reduce batch size (Marker):

# Nougat: use CPU
D:\anaconda3\envs\pdf-extractor\python.exe pdf2latex.py paper.pdf --cpu

# Marker: reduce batch multiplier
D:\anaconda3\envs\pdf-extractor\python.exe pdf2md_marker.py paper.pdf --batch-multiplier 1

Chinese Characters Not Recognized

Use Marker instead of Nougat for Chinese documents.

Slow Processing

  • Marker is slower but more accurate (uses multiple ML models)
  • For faster processing on English-only papers, use Nougat
  • Ensure GPU is being used (check CUDA availability)
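
A quick way to check CUDA availability is to ask PyTorch directly. The sketch below should be run with the pdf-extractor interpreter; `cuda_status` is a hypothetical helper, and it only reports the state without changing the environment.

```python
import importlib.util


def cuda_status():
    """Describe whether PyTorch can see a CUDA device in the current interpreter."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not importable in this interpreter"
    import torch
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch found, but CUDA is not available (running on CPU)"


print(cuda_status())
```

If this reports CPU only, extraction will be much slower than the speeds listed in the comparison table.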

Model Information

Marker Models (downloaded automatically):

  • Surya OCR: Text detection and recognition
  • Texify: Math formula recognition
  • Layout analysis models

Nougat Base Model (1.31 GB):

  • Location: C:\Users\cr\.cache\torch\hub\nougat-0.1.0-base
  • Best for: Standard academic papers, arXiv papers

Example Workflow

import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')

def extract_paper(pdf_path, is_chinese=True):
    """
    Extract text and formulas from academic paper.
    
    Args:
        pdf_path: Path to PDF file
        is_chinese: True for Chinese papers, False for English only
    
    Returns:
        Extracted markdown text
    """
    if is_chinese:
        from pdf2md_marker import convert_pdf
        text, _ = convert_pdf(pdf_path, languages=['Chinese', 'English'])
    else:
        from pdf2latex import load_model, process_pdf
        model, device = load_model()
        results = process_pdf(pdf_path, model, device)
        text = '\n\n'.join([t for _, t in results])
    
    return text

# Usage
text = extract_paper('中文论文.pdf', is_chinese=True)
print(text)

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-12 05:57, marked safe
