Vector Text Fixer

Fixes garbled text in PDF/SVG vector graphics so the files can be given their final edits in AI tools. Detects, replaces, and repairs garbled text in vector graphic files while preserving the original format.
ec-cyber258
Content Creation clawhub v0.1.0 1 version 100000 Key: not required

Overview

Fixes garbled text in PDF/SVG vector graphics to make them editable in AI tools.

Features

  • Garbled Text Detection: Automatically identifies garbled text in PDF/SVG files
  • Smart Repair: Infers original text content based on context
  • Batch Processing: Supports batch processing of multiple files in a folder
  • Format Preservation: Repaired files maintain original vector format and layout
  • AI-assisted Editing: Outputs intermediate format that can be imported into AI editors

Supported Scenarios

1. PDF Garbled Text Repair

  • Box/question mark issues caused by font embedding problems
  • Garbled text caused by encoding conversion errors
  • Abnormal characters generated by missing font substitution
  • Multi-language mixed encoding issues

2. SVG Garbled Text Repair

  • Text entity encoding errors
  • Special character escaping issues
  • Display abnormalities caused by invalid font references
  • XML encoding declaration errors
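
One common form of the entity and escaping issues listed above is double escaping, where `&#20013;` has been stored as `&amp;#20013;`. The sketch below is a minimal illustration using only the standard library; the function name `unescape_repeated` and the round limit are assumptions for this example and are not part of the tool.

```python
import html


def unescape_repeated(text: str, max_rounds: int = 3) -> str:
    """Decode HTML/XML entities until the text stops changing.

    Hypothetical helper: it assumes the text was escaped more than once.
    Text that legitimately contains a literal "&lt;" would be over-decoded,
    so the result should be reviewed before it is written back.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text
```

Calling `unescape_repeated("&amp;#20013;&amp;#25991;")` first yields `&#20013;&#25991;` and then the two characters those references encode. The round limit keeps the loop bounded on unusual input.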

Usage

Command Line

# Fix a single PDF file
python scripts/main.py --input document.pdf --output fixed.pdf

# Fix a single SVG file
python scripts/main.py --input diagram.svg --output fixed.svg

# Batch process folder
python scripts/main.py --batch ./input_folder --output ./output_folder

# Interactive repair (manually specify replacement content)
python scripts/main.py --input doc.pdf --interactive

# Export as editable format (JSON)
python scripts/main.py --input doc.pdf --export-json editable.json

Python API

from scripts.main import VectorTextFixer

# Create fixer instance
fixer = VectorTextFixer()

# Fix PDF
result = fixer.fix_pdf("input.pdf", "output.pdf")

# Fix SVG
result = fixer.fix_svg("input.svg", "output.svg")

# Batch processing
results = fixer.batch_fix("./input_folder", "./output_folder")

# Get text map (for AI editing)
text_map = fixer.extract_text_map("input.pdf")

Input Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| --input | str | Yes* | Input file path (PDF or SVG) |
| --batch | str | No | Batch processing input folder |
| --output | str | Yes* | Output file/folder path |
| --interactive | bool | No | Enable interactive repair mode |
| --export-json | str | No | Export editable JSON format |
| --encoding | str | No | Specify source file encoding (default: auto-detect) |
| --font-substitution | dict | No | Font replacement mapping |
| --repair-level | str | No | Repair level: minimal, standard, aggressive (default: standard) |

*At least one of --input and --batch is required
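
The parameters above could be wired up with `argparse` roughly as follows. This is a sketch and not the actual contents of `scripts/main.py`: the function name `parse_args`, the choice to pass `--font-substitution` as a JSON string, and the use of `None` to mean auto-detect are all assumptions. `--output` is left optional here because the usage examples call `--interactive` and `--export-json` without it.

```python
import argparse
import json

REPAIR_LEVELS = ("minimal", "standard", "aggressive")


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--input", help="Input file path (PDF or SVG)")
    parser.add_argument("--batch", help="Batch processing input folder")
    parser.add_argument("--output", help="Output file/folder path")
    parser.add_argument("--interactive", action="store_true",
                        help="Enable interactive repair mode")
    parser.add_argument("--export-json", help="Export editable JSON format")
    # Assumption: None stands for "auto-detect the encoding".
    parser.add_argument("--encoding", default=None)
    # Assumption: the mapping is passed as a JSON object string.
    parser.add_argument("--font-substitution", type=json.loads, default=None)
    parser.add_argument("--repair-level", choices=REPAIR_LEVELS,
                        default="standard")
    args = parser.parse_args(argv)
    # At least one of --input and --batch is required.
    if not (args.input or args.batch):
        parser.error("at least one of --input and --batch is required")
    return args
```

With this sketch, `parse_args(["--batch", "./in", "--font-substitution", '{"SimSun": "Arial"}'])` returns a namespace whose `font_substitution` is a regular dict.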

Output Format

Repaired PDF/SVG

  • Maintains original vector format
  • Garbled text replaced with readable content
  • Fonts and layout remain unchanged

JSON Export Format

{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {
          "id": "tb_001",
          "bbox": [100, 200, 300, 220],
          "original_text": "�����",
          "detected_encoding": "UTF-8",
          "confidence": 0.3,
          "suggested_fix": "Sample Text"
        }
      ]
    }
  ],
  "fonts_used": ["Arial", "SimSun"],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
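
A consumer of this export might want to list the blocks that deserve a manual look. The sketch below walks the structure shown above; the helper name `blocks_needing_review`, the 0.5 threshold, and the reading of `confidence` as "lower means less trustworthy" are assumptions, since the format does not define the field further.

```python
def blocks_needing_review(export: dict, threshold: float = 0.5) -> list:
    """Return page number, block id, and suggested fix for low-confidence blocks."""
    flagged = []
    for page in export.get("pages", []):
        for block in page.get("text_blocks", []):
            # Assumption: a missing confidence value is treated as 0.0.
            if block.get("confidence", 0.0) < threshold:
                flagged.append({
                    "page": page.get("page_num"),
                    "id": block.get("id"),
                    "suggested_fix": block.get("suggested_fix"),
                })
    return flagged
```

Load the file with `json.load` and pass the resulting dict to the helper. For the sample above, the single block `tb_001` (confidence 0.3) would be flagged.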

Garbled Text Detection Rules

The tool uses the following rules to detect garbled text:

  1. Replacement Character Detection: Identifies U+FFFD (�) and box characters
  2. Control Character Filtering: Excludes non-printing control characters
  3. Encoding Consistency: Detects anomalies caused by mixed encodings
  4. Font Fallback Detection: Identifies substitution characters generated due to missing fonts
  5. Probability Model: Garbled text probability assessment based on character frequency
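
Rules 1 and 2 can be approximated with the standard library alone. The sketch below is not the tool's detector: the names `garbled_score` and `is_garbled`, the set of box characters, and the 0.3 threshold are assumptions, and rules 3 to 5 (encoding consistency, font fallback, frequency model) are not covered.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"
# Assumption: these glyphs stand in for the "box characters" of rule 1.
BOX_CHARS = {"\u25a1", "\u25af", "\u2395"}
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def garbled_score(text: str) -> float:
    """Fraction of characters that look like decoding debris (0.0 to 1.0)."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch == REPLACEMENT_CHAR or ch in BOX_CHARS:
            suspicious += 1
        elif ch not in ALLOWED_CONTROLS and unicodedata.category(ch) in {"Cc", "Co", "Cn"}:
            # Control, private-use, and unassigned code points.
            suspicious += 1
    return suspicious / len(text)


def is_garbled(text: str, threshold: float = 0.3) -> bool:
    return garbled_score(text) >= threshold
```

A ratio is used instead of a plain count so that one stray character in a long paragraph does not flag the whole block.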

Repair Strategies

Minimal

  • Only repairs obvious errors (replacement characters, null bytes)
  • Maintains maximum integrity of original text
  • Suitable for minor garbled text issues

Standard

  • Repairs common encoding issues
  • Smart font replacement
  • Balances repair rate and accuracy

Aggressive

  • Comprehensive text re-encoding
  • Uses OCR-assisted recognition
  • Suitable for severely garbled documents
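
To make the difference between levels concrete, here is a small sketch of how a minimal and a standard pass might differ on a single string. It is an illustration under assumptions, not the tool's code: `fix_mojibake` and `repair` are hypothetical names, the Latin-1/UTF-8 pairing is only one of many possible encoding mix-ups, and the OCR step of the aggressive level is not sketched.

```python
LEVELS = ("minimal", "standard", "aggressive")


def fix_mojibake(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip does not apply to this text, so leave it unchanged.
        return text


def repair(text: str, level: str = "standard") -> str:
    if level not in LEVELS:
        raise ValueError(f"unknown repair level: {level}")
    cleaned = text.replace("\x00", "")  # obvious error: null bytes
    if level == "minimal":
        return cleaned
    # Standard and aggressive also attempt the encoding round trip.
    return fix_mojibake(cleaned)
```

The round trip either succeeds as a whole or leaves the text unchanged, which keeps the sketch from making a partly readable string worse.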

Examples

Fix Single Page PDF

Input:

python scripts/main.py --input report.pdf --output fixed_report.pdf

Output:

✓ Processing: report.pdf
✓ Detected 5 garbled text blocks
✓ Fixed 4 blocks automatically
⚠ 1 block requires manual review
✓ Output saved: fixed_report.pdf
✓ Report saved: fixed_report_repair_log.json

Export Editable JSON

Input:

python scripts/main.py --input diagram.svg --export-json editable.json

Output JSON Structure:

{
  "file_type": "svg",
  "svg_info": {
    "width": 800,
    "height": 600,
    "viewBox": "0 0 800 600"
  },
  "text_elements": [
    {
      "id": "text_1",
      "x": 100,
      "y": 200,
      "font_family": "Arial",
      "font_size": 14,
      "original": "�����",
      "user_editable": "",
      "confidence": 0.25
    }
  ]
}
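
Once `user_editable` has been filled in, the corrections could be written back to the SVG along these lines. This is a sketch using the standard library's `xml.etree.ElementTree`; the function name `apply_edits` and the assumption that the JSON `id` matches the `id` attribute of the SVG `<text>` element are not confirmed by the tool.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def apply_edits(svg_source: str, export: dict) -> str:
    """Replace the content of <text> elements that have a non-empty user edit."""
    ET.register_namespace("", SVG_NS)  # keep the default namespace on output
    root = ET.fromstring(svg_source)
    edits = {
        item["id"]: item["user_editable"]
        for item in export.get("text_elements", [])
        if item.get("user_editable")
    }
    for node in root.iter(f"{{{SVG_NS}}}text"):
        new_text = edits.get(node.get("id"))
        if new_text is not None:
            # Dropping child <tspan> elements also drops their per-span styling.
            for child in list(node):
                node.remove(child)
            node.text = new_text
    return ET.tostring(root, encoding="unicode")
```

Elements whose `user_editable` is still empty are left untouched, so a partially reviewed export can be applied safely.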

Dependencies

pdfplumber>=0.10.0      # PDF parsing
PyMuPDF>=1.23.0         # PDF processing (fitz)
cairosvg>=2.7.0         # SVG conversion
beautifulsoup4>=4.12.0  # SVG parsing
fonttools>=4.40.0       # Font processing
chardet>=5.0.0          # Encoding detection
Pillow>=10.0.0          # Image processing

Limitations

  • Encrypted PDFs require password unlock before processing
  • Severely damaged vector files may not be fully repairable
  • Some rare fonts may not map correctly
  • Scanned PDFs require OCR recognition first

Version Information

  • Version: 1.0.0
  • Last Updated: 2026-02-06
  • Status: Ready for use

Risk Assessment

| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

Security Checklist

  • [ ] No hardcoded credentials or API keys
  • [ ] No unauthorized file system access (../)
  • [ ] Output does not expose sensitive information
  • [ ] Prompt injection protections in place
  • [ ] Input file paths validated (no ../ traversal)
  • [ ] Output directory restricted to workspace
  • [ ] Script execution in sandboxed environment
  • [ ] Error messages sanitized (no stack traces exposed)
  • [ ] Dependencies audited
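
The path-related items in this checklist (no `../` traversal, output restricted to the workspace) could be checked with a helper like the one below. It is a sketch under assumptions: `resolve_inside` is a hypothetical name, the workspace directory is supplied by the caller, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_inside(workspace: Path, user_path: str) -> Path:
    """Resolve user_path against workspace and reject anything that escapes it."""
    root = workspace.resolve()
    # Absolute user paths replace the root here and are rejected below
    # unless they already point inside the workspace.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate
```

Resolving before comparing matters: a check that only searches the raw string for `../` can be bypassed through symlinks or absolute paths.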

Prerequisites

# Python dependencies
pip install -r requirements.txt

Evaluation Criteria

Success Metrics

  • [ ] Successfully executes main functionality
  • [ ] Output meets quality standards
  • [ ] Handles edge cases gracefully
  • [ ] Performance is acceptable

Test Cases

  1. Basic Functionality: Standard input → Expected output
  2. Edge Case: Invalid input → Graceful error handling
  3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

  • Current Stage: Draft
  • Next Review Date: 2026-03-06
  • Known Issues: None
  • Planned Improvements:
    • Performance optimization
    • Additional feature support

Version History

1 version in total

  • v0.1.0 (current)
    2026-03-19 17:58 Safe Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
