Intelligently Entity Extraction Using OCR and NER for Business Cards

Authors

  • Thanh Thao Thai Thi, Ho Chi Minh City University of Foreign Languages and Information Technology (HUFLIT)
  • Xuan Thu Tuong Thi

Abstract

This paper introduces a framework for building a custom Named Entity Recognizer (NER) that extracts important entities from scanned documents, focusing on business cards to ensure data privacy. The approach is adaptable to financial documents such as invoices, shipping bills, and bills of lading, but this paper covers business cards only. The project combines two data science technologies: Computer Vision and Natural Language Processing (NLP). The Computer Vision component extracts text from document images using OpenCV, NumPy, and pytesseract. The NLP component handles entity recognition, text cleaning, and parsing using spaCy, pandas, regular expressions, and string manipulation. The method provides a flexible and efficient way to automate entity extraction across different types of documents.
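
The text cleaning and parsing stage described in the abstract can be illustrated with a minimal sketch. It assumes the OCR step has already produced a plain-text string, and it uses only Python's standard `re` module. The patterns, the labels `EMAIL`, `PHONE`, and `WEB`, and the sample card text are hypothetical choices made for illustration; they are not taken from the paper, and the trained spaCy NER model is not reproduced here.

```python
import re

# Hypothetical patterns for illustration only; the paper does not publish its rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s,;]+", re.IGNORECASE)

# Made-up stand-in for the text an OCR engine might return for one card.
SAMPLE_OCR_TEXT = """
  Jane   Doe
Data Engineer

Tel: +1 (555) 010-2030
jane.doe@example.com
www.example.com
"""


def clean_ocr_text(raw: str) -> str:
    """Collapse runs of spaces/tabs, trim each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_entities(raw: str) -> dict:
    """Return pattern-based entities found in OCR text, keyed by label."""
    text = clean_ocr_text(raw)
    return {
        "EMAIL": EMAIL_RE.findall(text),
        # Normalize phone numbers by keeping only digits and a leading '+'.
        "PHONE": [re.sub(r"[^\d+]", "", m) for m in PHONE_RE.findall(text)],
        "WEB": URL_RE.findall(text),
    }


if __name__ == "__main__":
    print(extract_entities(SAMPLE_OCR_TEXT))
```

Rule-based patterns like these only suit entities with a regular surface form. Names, job titles, and organizations have no such form, which is where a trained NER model of the kind the abstract describes would be needed.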

Published

08-07-2025

How to Cite

Thai Thi, T. T., & Tuong Thi, X. T. (2025). Intelligently Entity Extraction Using OCR and NER for Business Cards. Tạp Chí Khoa học HUFLIT, 9(2), 22. Retrieved from https://vjst.net/index.php/hjs/article/view/262

Section

Review Article
