Intelligent Entity Extraction Using OCR and NER for Business Cards
Abstract
This paper introduces a framework for building a custom Named Entity Recognition (NER) model that extracts key entities from scanned documents. The approach is adaptable to financial documents such as invoices, shipping bills, and bills of lading; however, to protect data privacy, in this paper I focus on business cards only. The project combines two data science technologies: Computer Vision and Natural Language Processing (NLP). The Computer Vision component extracts text from document images using tools such as OpenCV, NumPy, and Pytesseract. The NLP component handles text cleaning, parsing, and entity recognition using spaCy, Pandas, regular expressions, and string manipulation. The method is intended as a flexible and efficient way to automate entity extraction across different types of financial documents.
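As a rough illustration of the text-cleaning step mentioned above, the following minimal sketch uses only Python string manipulation to strip OCR noise from tokens before they reach the NER model. The names `KEEP`, `clean_token`, and `clean_tokens` are hypothetical, and the choice of which punctuation to keep is an assumption made for this example, not a rule taken from the framework itself.

```python
import string

# Assumption: most punctuation in OCR output is noise, but these characters
# are kept because they carry meaning in emails, URLs, and phone numbers.
KEEP = "@.-+:/"

# Translation tables that delete whitespace and the remaining punctuation.
WHITESPACE_TABLE = str.maketrans("", "", string.whitespace)
PUNCT_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in KEEP)
)


def clean_token(token: str) -> str:
    """Strip whitespace and noisy punctuation from one raw OCR token.

    Case is left unchanged so capitalization cues remain available
    to a downstream NER model.
    """
    return token.translate(WHITESPACE_TABLE).translate(PUNCT_TABLE)


def clean_tokens(raw_tokens):
    """Clean every token and drop those that become empty."""
    cleaned = (clean_token(t) for t in raw_tokens)
    return [t for t in cleaned if t]


if __name__ == "__main__":
    sample = ["  John ", "Doe,", "|", "john.doe@example.com", "(555)", "123-4567"]
    print(clean_tokens(sample))
```

In a full pipeline, the input tokens would presumably come from the OCR stage and the cleaned tokens would be passed on for entity tagging; both of those stages are outside the scope of this sketch.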