2020.09.30 2021.05.05

OCR フォルダ振り分け禁止文字 + 日本語対応【 Python 】

k.w

前回、Windowsでは禁止文字があることを記載したうえで、 OCR のサンプルを紹介しました。OCR結果に禁止文字が含まれているだけでなく、 Python + OpenCVを使った場合は、日本語のフォルダ・ファイル名に対応していません。今回は対応を反映しました。

OCR結果のフォルダ振り分けについて、以下参照ください。

[https://way2se.ringtrees.com/py_comb-001/]

サンプルコード

/* ライブラリインポート */
import os
import glob
import cv2
import sys
import configparser as cfg
import numpy as np
import pandas as pd
from operator import itemgetter
import pyocr
import pytesseract

user = os.getlogin()

/* Config設定 */
cfg = cfg.ConfigParser()
cfg.read('_config.ini', 'UTF-8')

/* OCR設定 */
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
lang = 'eng', 'jpn'
print(tool.get_name())

/* File読み込みフォルダ指定 */
files = glob.glob(cfg.get(user, 'img_folder') +
                  '/cell_cut/*.jpg', recursive=True)

/* File読み込み（複数可） */
for file in files:
    filename = os.path.basename(file)
    print('File:' + filename)
    img = cv2.imread(file, cv2.IMREAD_GRAYSCALE)
    kernel = np.ones((2, 2), np.uint8)
    img = cv2.erode(img, kernel, iterations=1)
    ret, img = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)

    minlength = int(img.shape[0] * 0.1)
    gap = 0

    lines = []
    lines = cv2.HoughLinesP(img, rho=1, theta=np.pi/360,
                            threshold=1, minLineLength=minlength, maxLineGap=gap)

    line_list = []
    k = 1
    judgelength = int(img.shape[0] * 0.95)
    for line in lines:
        x1, y1, x2, y2 = line[0]
        if abs(x1 - x2) <= k and abs(y1 - y2) >= judgelength:
            blackline = 1
            lineadd_img = cv2.line(img,
                                   (line[0][0], line[0][1]),
                                   (line[0][2], line[0][3]),
                                   (0, 0, 0),
                                   blackline)
            cv2.imwrite(cfg.get(user, 'output_folder') +
                        '/' + filename[:-4] + '_linemod.jpg', lineadd_img)
            line = (x1, y1, x2, y2)
            line_list.append(line)

    line_list.sort(key=itemgetter(0, 1, 2, 3))
    df = pd.DataFrame(line_list)

    line_count = 0
    x1 = line_list[0][0]
    base_dir = os.getcwd()
    for line in line_list:
        judge_x1 = line[0]
        judge_x2 = line[2]
        if abs(judge_x1 - x1) != 0:
            if abs(x1-judge_x1) < 2:
                x1 = judge_x1
            else:
                line_count = line_count + 1
                x2 = judge_x2
                sep_cut = img[:, x1 + 1:x2]
                w_imgh = np.ones((5, sep_cut.shape[1]), np.uint8) * 255
                sep_cut = cv2.vconcat([w_imgh, sep_cut, w_imgh])
                w_imgw = np.ones((sep_cut.shape[0], 5), np.uint8) * 255
                sep_cut = cv2.hconcat([w_imgw, sep_cut, w_imgw])

                ocr = pytesseract.image_to_string(sep_cut, lang='eng+jpn', config='--psm 10')
                ocr = ocr.replace('\n', '')
                ocr = ocr.replace('\x0c', '')
                if ocr != '':
                    print('OCR:' + str(ocr))
                    if ocr == '/' or ocr == ':' or ocr == ';' or ocr == '*' or ocr == '?' or ocr == "'" or ocr == '"' or ocr == '<' or ocr == '>' or ocr == '|' or ocr == '.' or ocr == '¥':
                        if not os.path.exists(cfg.get(user, 'char_output_folder') + cfg.get('ng_folder', ocr)):
                            os.mkdir(cfg.get(user, 'char_output_folder') + cfg.get('ng_folder', ocr))
                        os.chdir(cfg.get(user, 'char_output_folder') + cfg.get('ng_folder', ocr))
                        cv2.imwrite(filename[:-4] + '-' + str(line_count) + '_cut.jpg', sep_cut)
                        os.chdir(base_dir)
                    else:
                        if not os.path.exists(cfg.get(user, 'char_output_folder') + '/' + ocr):
                            os.mkdir(cfg.get(user, 'char_output_folder') + '/' + ocr)
                        os.chdir(cfg.get(user, 'char_output_folder') + '/' + ocr)
                        cv2.imwrite(filename[:-4] + '-' + str(line_count) + '_cut.jpg', sep_cut)
                        os.chdir(base_dir)
                x1 = judge_x1*/

/* コンフィグファイル　 */
/* config.ini */
[ng_folder]
/ = /_slash
* = /_asterisk
? = /_question
' = /_single_quotation
" = /_double_quotation
< = /_less_than_sign > = /_grater_than_sign
| = /_vertical_bar
. = /_dot
¥ = /_yen_sign

前回の紹介からの違いを補足します。

[https://way2se.ringtrees.com/py_comb-001/]

禁止文字に対応できるようにしています。（92行目）

禁止文字に対応するために、禁止文字毎にフォルダ名をコンフィグがファイルで指定しています。

OpenCVでは日本語対応できていないので、カレントディレクトリを変更しながら、カットファイルを保存することを実現しています。（102行目）

結果として、日本語にも問題なく振り分けできています。

ただ、 ” (ダブルクォーテーション）は ‘ (シングルクォーテーション)として認識しているなど、精度にはまだまだ難ありです。

引き続き、対応方法を検討していきたいと思います！

#OCR #OpenCV #Python #出力

OCR フォルダ振り分け禁止文字 + 日本語対応【 Python 】

サンプルコード

一文字カットに挑戦！【 Python OpenCV 】

HoughLinesP の引数感度分析【 Python OpenCV 】

ファイル / フォルダ指定方法【 Python os glob 】

ファイル + フォルダ一覧取得も可能！【 Python 】

画像の文字認識に挑戦！【 Python OCR 】

OCR の時のひと手間余白で精度向上！【 Python OCR】

サンプルコード

一文字 カット に挑戦！【 Python OpenCV 】

HoughLinesP の 引数 感度分析【 Python OpenCV 】

ファイル / フォルダ 指定方法【 Python os glob 】

ファイル + フォルダ 一覧取得 も可能！【 Python 】

画像 の文字認識に挑戦！ 【 Python OCR 】

OCR の時のひと手間 余白 で 精度 向上！【 Python OCR】

一文字カットに挑戦！【 Python OpenCV 】

HoughLinesP の引数感度分析【 Python OpenCV 】

ファイル / フォルダ指定方法【 Python os glob 】

ファイル + フォルダ一覧取得も可能！【 Python 】

画像の文字認識に挑戦！【 Python OCR 】

OCR の時のひと手間余白で精度向上！【 Python OCR】