I wrote a few Python automation scripts for a colleague in finance, and then...

2025-08-29 03:15:00



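One boring afternoon, a colleague from finance came to me and said she was tired of repeating the same tedious clicks every week. She wanted me to build a tool that could watch a folder, extract data from PDFs, enrich that data, and push out reports.

I figured I had nothing better to do, so why not help her out. Who knows, it might even lead to... ahem.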

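1. The problem I set out to solve

Most automation projects fail because they try to solve everything. Instead, pick one recurring pain point with a measurable return on investment. Here is how I approached it:

Pain point:

Clients send invoices every day as scattered PDF files.
I open them by hand, extract the vendor, date, and amount, and enter them into Excel.
This wastes about 20 minutes a day.

Goal: reduce that to zero minutes of human effort.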
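
To make "measurable" concrete, here is a rough back-of-the-envelope calculation. The 20 minutes comes from the pain point above; the 250 working days per year is my own assumption, so plug in your own number.

```python
# Rough ROI estimate. WORKING_DAYS_PER_YEAR is an assumption, not a measured figure.
MINUTES_PER_DAY = 20          # manual effort stated in the pain point
WORKING_DAYS_PER_YEAR = 250   # assumed


def hours_saved_per_year(minutes_per_day: float, working_days: int) -> float:
    """Convert a daily time cost into hours per year."""
    return minutes_per_day * working_days / 60


print(round(hours_saved_per_year(MINUTES_PER_DAY, WORKING_DAYS_PER_YEAR), 1))  # 83.3
```

Under that assumption, the manual work adds up to roughly two full working weeks a year, which is the number to hold the automation against.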

2. Quick MVP: a file watcher plus a PDF extractor

Start small: watch a folder, detect new PDFs, and extract their text. Use watchdog + PyMuPDF (fitz).

# file_watcher.py
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
import fitz  # PyMuPDF

class PDFHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Skip directories and anything that is not a PDF.
        if event.is_directory or not event.src_path.lower().endswith(".pdf"):
            return
        print(f"[+] New PDF: {event.src_path}")
        text = extract_text(event.src_path)
        print(text[:200], "...\n")  # quick preview

def extract_text(path: str) -> str:
    """Return the text layer of every page, joined by newlines."""
    doc = fitz.open(path)
    pages = []
    for page in doc:
        pages.append(page.get_text())
    doc.close()
    return "\n".join(pages)

if __name__ == "__main__":
    observer = Observer()
    handler = PDFHandler()
    # The ./inbox folder must already exist.
    observer.schedule(handler, path="./inbox", recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()

This script alone cut her daily effort to about 5 minutes, mostly spent on review.
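
One thing to watch for: a creation event can fire before a large PDF has finished copying into the folder, and opening a half-written file fails. A minimal sketch of one way around this is to poll the file size until it stops changing. The helper name wait_until_stable and its default values are my own choices for illustration, not part of watchdog.

```python
import os
import time


def wait_until_stable(path: str, checks: int = 3, interval: float = 0.5,
                      timeout: float = 30.0) -> bool:
    """Return True once the file size is unchanged for `checks` consecutive polls.

    Returns False if the file never settles (or never appears) before `timeout`.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file not visible yet, or briefly locked
        if size >= 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            stable = 0  # size changed (or file missing): start counting again
        last_size = size
        time.sleep(interval)
    return False
```

In the handler, the idea would be to call wait_until_stable(event.src_path) first and only run extract_text when it returns True.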

3. Making the extractor more robust: OCR + text fallback

Some PDFs are scanned images. Add pytesseract as a fallback.

pip install pytesseract pillow
# Tesseract itself must also be installed on the system (apt/brew/choco)
from PIL import Image
import pytesseract
import fitz  # PyMuPDF

def extract_text_with_ocr(path: str) -> str:
    doc = fitz.open(path)
    aggregated = []
    for page in doc:
        text = page.get_text()
        if text.strip():
            # The page has a text layer: use it directly.
            aggregated.append(text)
        else:
            # No text layer: render the page to an image and OCR it.
            pix = page.get_pixmap(dpi=200)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            aggregated.append(pytesseract.image_to_string(img))
    doc.close()
    return "\n".join(aggregated)

This hybrid approach (text layer -> OCR) made the tool reliable for about 95% of the invoices I have seen.

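4. Structuring with OOP: a plugin-friendly pipeline

If you want to turn this into a product, modularize the flow. Each stage is a class: loader → parser → enricher → sink. That way you can swap the storage backend (Excel sheet, database, webhook) without rewriting the code.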

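# pipeline.py
from abc import ABC, abstractmethod
from typing import Dict

# extract_text_with_ocr (section 3) and find_amount (section 5) must be imported
# or defined here; find_vendor and save_to_excel_sheet are not shown in this article.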

class Step(ABC):
    @abstractmethod
    def run(self, data: Dict) -> Dict:
        pass

class Loader(Step):
    def __init__(self, path): self.path = path
    def run(self, data):
        data['text'] = extract_text_with_ocr(self.path)
        return data

class Parser(Step):
    def run(self, data):
        # naive example; replace with regex or NLP later
        text = data['text']
        data['vendor'] = find_vendor(text)
        data['amount'] = find_amount(text)
        return data

class Sink(Step):
    def run(self, data):
        save_to_excel_sheet(data)
        return data

class Pipeline:
    def __init__(self, steps):
        self.steps = steps
    def execute(self, initial):
        data = initial
        for step in self.steps:
            data = step.run(data)
        return data

This pattern extends easily: add a ClassifierStep for language detection, a TranslatorStep for non-English documents, and so on.
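
To show what adding a step looks like, here is a minimal sketch of a ClassifierStep. Step and Pipeline are repeated from pipeline.py so the snippet runs on its own. The detection logic is an assumption of mine and deliberately crude (the share of ASCII letters in the text); a real version would call a proper language detector.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Step(ABC):
    """Same interface as in pipeline.py, repeated so this sketch is self-contained."""

    @abstractmethod
    def run(self, data: Dict) -> Dict:
        ...


class ClassifierStep(Step):
    """Hypothetical step: tag the document as 'en' or 'other'.

    The heuristic is a placeholder, not a real language detector.
    """

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold

    def run(self, data: Dict) -> Dict:
        letters = [c for c in data.get("text", "") if c.isalpha()]
        ascii_share = sum(c.isascii() for c in letters) / len(letters) if letters else 0.0
        data["language"] = "en" if ascii_share >= self.threshold else "other"
        return data


class Pipeline:
    def __init__(self, steps: List[Step]):
        self.steps = steps

    def execute(self, initial: Dict) -> Dict:
        data = initial
        for step in self.steps:
            data = step.run(data)
        return data


result = Pipeline([ClassifierStep()]).execute({"text": "Invoice total: 120.00 USD"})
print(result["language"])  # en
```

Because every step reads and returns the same dict, a later step such as a translator could branch on data["language"] without knowing how that value was produced.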

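5. Enrichment and extraction: regular expressions first, machine learning later

Start with deterministic parsing (regular expressions). If the invoices are messy or come in multiple layouts, add a machine learning model (or use layout-parser). An example regex snippet: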

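import re

# Optional currency prefix, then either comma-grouped digits (1,234) or plain
# digits (1234), with optional two-digit decimals.
AMOUNT_RE = re.compile(
    r"(?<!\d)(?:USD|EUR|\$)?\s?((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)\b"
)

def find_amount(text: str) -> float | None:
    # Naive: returns the first number in the text, which may be a date or an
    # invoice number rather than the total.
    m = AMOUNT_RE.search(text.replace("\n", " "))
    if m:
        s = m.group(1).replace(',', '')
        return float(s)
    return None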

For higher reliability, use spaCy with a custom NER model, or layout-parser to detect invoice fields spatially.
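
Before reaching for NER, the same deterministic approach can also cover the invoice date. The sketch below is a hypothetical find_date to sit next to find_amount. The two date layouts it tries are assumptions on my part, so extend the list to match the invoices you actually receive.

```python
import re
from datetime import date, datetime
from typing import Optional

# Candidate layouts are assumptions; each regex is paired with its strptime format.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),  # 2025-08-29
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "%d/%m/%Y"),  # 29/08/2025 (day first, assumed)
]


def find_date(text: str) -> Optional[date]:
    """Try each pattern in order and return the first substring that is a valid date."""
    for pattern, fmt in DATE_PATTERNS:
        for m in pattern.finditer(text):
            try:
                return datetime.strptime(m.group(0), fmt).date()
            except ValueError:
                continue  # looked like a date but was not valid, e.g. 2025-13-40
    return None


print(find_date("Invoice date: 2025-08-29"))  # 2025-08-29
```

Letting strptime validate the candidate keeps the regex simple: the pattern only has to find something date-shaped, and impossible dates are rejected afterwards.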

6. Web automation and scraping: Playwright for downloads and dashboards

When invoices sit behind a web dashboard, use Playwright to download them automatically.

pip install playwright
playwright install
from playwright.sync_api import sync_playwright

def login_and_download(url, user, password, download_path):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.fill('#username', user)
        page.fill('#password', password)
        page.click('#login')
        page.wait_for_selector('a.download')
        with page.expect_download() as download_info:
            page.click('a.download')
        download = download_info.value
        download.save_as(download_path)
        browser.close()

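This lets the service collect source PDFs on its own, which is essential if she wants to run it as a subscription that fetches client documents every morning.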

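7. Packaging the tool: a CLI with `Typer`/`Click`

For distribution, wrap the functionality in a CLI so that non-developer clients can run it locally, or so it can run on a server.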

pip install typer
import typer
from pipeline import Pipeline, Loader, Parser, Sink

app = typer.Typer()

@app.command()
def process(path: str):
    steps = [Loader(path), Parser(), Sink()]
    p = Pipeline(steps)
    p.execute({})
    typer.echo("Processed!")

if __name__ == "__main__":
    app()

Add a setup.py/pyproject.toml and publish to PyPI, or package it as a wheel or a Docker image.

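8. Scaling with workers: Celery + Redis (or FastAPI + background tasks)

If you want several colleagues to use it at the same time, run the processing in a work queue instead of blocking everything.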

pip install celery redis
#tasks.py
from celery import Celery
from pipeline import Pipeline, Loader, Parser, Sink

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def process_file(path):
    steps = [Loader(path), Parser(), Sink()]
    Pipeline(steps).execute({})

The web frontend or API enqueues a job with process_file.delay(path) and returns immediately. A worker handles the job and pushes the result to storage.
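
If the enqueue-and-return idea is new to you, here is the same pattern in miniature using only the standard library. To be clear, this is not Celery: it is an in-process stand-in with one background thread, and start_worker plus the placeholder handler are names I made up for the illustration.

```python
import queue
import threading


def start_worker(handler, jobs: queue.Queue, results: list) -> threading.Thread:
    """Run `handler` on every queued path until a None sentinel arrives."""

    def loop():
        while True:
            path = jobs.get()
            if path is None:  # sentinel: stop the worker
                jobs.task_done()
                break
            try:
                results.append((path, handler(path)))
            except Exception as exc:  # keep the worker alive if one file fails
                results.append((path, exc))
            finally:
                jobs.task_done()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


jobs: queue.Queue = queue.Queue()
results: list = []
worker = start_worker(lambda p: p.upper(), jobs, results)  # placeholder handler

jobs.put("a.pdf")  # returns immediately, much like process_file.delay(path)
jobs.put("b.pdf")
jobs.put(None)
jobs.join()        # block here only so the demo can print the results
print(results)     # [('a.pdf', 'A.PDF'), ('b.pdf', 'B.PDF')]
```

Celery with Redis gives you the same shape across processes and machines, with the queue persisted outside your application.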

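9. Observability and reliability: logs, metrics, retryable steps

Use loguru with structured logging, and export metrics for uptime, queue length, and failure rate (Prometheus).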

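pip install loguru
from loguru import logger
from tasks import process_file  # the task defined in tasks.py above

logger.add("service.log", rotation="10 MB", level="INFO")

try:
    process_file("/tmp/a.pdf")  # calling the task directly runs it synchronously
except Exception:
    logger.exception("Processing failed")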
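
The section title mentions retryable steps, so here is a minimal sketch of one way to get them: a decorator that retries with exponential backoff. The retry decorator and flaky_step are hypothetical names for illustration. If the work runs under Celery, its built-in task retry support is probably the better fit.

```python
import time
from functools import wraps


def retry(times: int = 3, base_delay: float = 0.5, exceptions=(Exception,)):
    """Call the wrapped function up to `times` times, doubling the delay after each failure."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: let the caller log it
                    time.sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


calls = {"n": 0}


@retry(times=3, base_delay=0.01)
def flaky_step():
    """Hypothetical step that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


print(flaky_step(), calls["n"])  # ok 3
```

Because the last failure is re-raised, an outer try/except with logger.exception still records steps that never recover.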

Once the tool was finished, I let my finance colleague try it out, and she gave me a look of pure admiration...

Editor: 武曉燕. Source: 數據STUDIO
