A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).
python3 {baseDir}/scripts/extract_text.py input.docx output.txt
Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
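As a rough illustration of the behaviour described above, the sketch below reads paragraphs and tables with python-docx and joins each table row with pipes. It is not the actual extract_text.py: the function names, the " | " separator, and the choice to list paragraphs first and tables afterwards are all assumptions made for this example.

```python
from typing import Iterable, List


def format_table_row(cells: Iterable[str]) -> str:
    """Join cell texts into one pipe-delimited row, e.g. 'Name | Qty'."""
    # Collapse line breaks inside a cell so each table row stays on one line.
    return " | ".join(cell.strip().replace("\n", " ") for cell in cells)


def extract_text(docx_path: str) -> List[str]:
    """Return non-empty paragraph lines, then one line per table row.

    Hypothetical helper: paragraphs come first and tables second, so the
    original interleaving of the document is not reproduced here.
    """
    from docx import Document  # python-docx, imported lazily

    document = Document(docx_path)
    lines = [p.text for p in document.paragraphs if p.text.strip()]
    for table in document.tables:
        for row in table.rows:
            lines.append(format_table_row(cell.text for cell in row.cells))
    return lines
```

A caller could then write the result with "\n".join(extract_text("input.docx")) and split each table line on " | " to recover the cells.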
python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt
Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
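One way to approximate this with olefile is to read the raw WordDocument stream and keep only the byte runs that look like UTF-16LE text. The sketch below is a simplified heuristic, not the script's actual logic: it only recognises characters in the printable ASCII range, the four-character minimum is an arbitrary assumption, and both function names are hypothetical.

```python
import re
from typing import List

# Runs of at least four UTF-16LE code units whose low byte is printable
# ASCII (or CR/LF/TAB) and whose high byte is zero. Assumption: this
# deliberately ignores text outside the Basic Latin range.
_UTF16_RUN = re.compile(rb"(?:[\x20-\x7e\r\n\t]\x00){4,}")


def utf16_text_runs(data: bytes) -> List[str]:
    """Return decoded text runs found in a raw byte buffer."""
    runs = []
    for match in _UTF16_RUN.finditer(data):
        text = match.group().decode("utf-16-le")
        # Word stores paragraph marks as carriage returns.
        runs.append(text.replace("\r", "\n"))
    return runs


def extract_doc_text(doc_path: str) -> str:
    """Read the WordDocument stream of a legacy .doc and return its text runs."""
    import olefile  # imported lazily; only needed for real .doc files

    with olefile.OleFileIO(doc_path) as ole:
        if not ole.exists("WordDocument"):
            raise ValueError("no WordDocument stream found")
        data = ole.openstream("WordDocument").read()
    return "\n".join(utf16_text_runs(data))
```

Scanning for text runs avoids parsing the binary file structures, at the cost of occasionally picking up stray strings or missing text stored in an 8-bit encoding.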
python3 {baseDir}/scripts/extract_images.py input.docx output_dir/
Extracts all embedded images with:
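For readers curious how image extraction can work: a .docx file is a ZIP archive, and embedded images are stored under word/media/. The sketch below uses only the standard library and is an assumption about the general approach, not a copy of extract_images.py; the function name and return value are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_images(docx_path: str, output_dir: str) -> List[str]:
    """Copy every file under word/media/ into output_dir.

    Returns the sorted list of file names written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside the media folder.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target.name)
    return sorted(saved)
```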
python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]
Batch-resizes and compresses images before API processing; the smaller images can cut vision API costs by roughly 50-70%.
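The resizing step can be sketched with Pillow as follows. This is an illustrative version under stated assumptions, not the actual resize_images.py: the function names, the set of file extensions handled, and the use of LANCZOS resampling are all choices made for this example. Only the 1024 default mirrors the --max-width value shown above.

```python
from pathlib import Path
from typing import Tuple


def scaled_size(width: int, height: int, max_width: int = 1024) -> Tuple[int, int]:
    """Return the target size that fits max_width while keeping the aspect ratio."""
    if width <= max_width:
        return width, height  # never upscale
    return max_width, max(1, round(height * max_width / width))


def resize_images(input_dir: str, output_dir: str, max_width: int = 1024) -> int:
    """Resize every PNG/JPEG in input_dir into output_dir; return the file count."""
    from PIL import Image  # Pillow, imported lazily

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(input_dir).iterdir()):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        with Image.open(path) as img:
            new_size = scaled_size(img.width, img.height, max_width)
            result = img.resize(new_size, Image.LANCZOS) if new_size != img.size else img
            result.save(out / path.name, optimize=True)
        count += 1
    return count
```

Halving the width of an image quarters its pixel count, for example 2048x1536 becomes 1024x768, which is where most of the size reduction comes from.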
python-docx: for .docx processing
olefile: for legacy .doc processing
Pillow: for image resizing (optional, only needed for the resize script)
Install:
pip3 install python-docx olefile Pillow
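To confirm the install worked, a small check like the one below can report which packages are still missing. It is a hypothetical helper, not part of the toolkit; it relies on the fact that python-docx is imported as docx and Pillow as PIL.

```python
import importlib.util
from typing import Dict, List

# pip package name -> importable module name
REQUIRED = {"python-docx": "docx", "olefile": "olefile", "Pillow": "PIL"}


def missing_packages(packages: Dict[str, str]) -> List[str]:
    """Return the pip names whose module cannot be found on this interpreter."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All dependencies are installed.")
```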