SCID: a scanned Chinese invoice dataset for key information extraction from visually rich document images
Abstract
Objective Visually rich document information extraction aims to extract key text from input document images in structured form in order to solve practical business problems. Financial invoices are one of the most common document types, and enterprise reimbursement processes create heavy demand for handling them. Solving this problem usually requires techniques from several fields, including optical character recognition (OCR) and information extraction. However, few related datasets are publicly available, and each contains relatively few images, which has become an important factor limiting progress in this field. To address this, we collect, annotate, and publicly release SCID (scanned Chinese invoice dataset), a dataset of real scanned Chinese invoices.

Method The dataset contains 40,716 images covering six common types of financial invoices: aircraft itinerary tickets, taxi invoices, general quota invoices, passenger invoices, train tickets, and toll invoices. It is divided into training, validation, and testing sets of 19,999, 10,358, and 10,359 images, respectively. The labeling process consists of pseudo-label generation, manual recheck and cleaning, and manual desensitization, and it yields two kinds of labels: one for the OCR task and one for information extraction. The images exhibit challenges such as print misalignment, blurring, and overlap. For this dataset we propose a baseline scheme based on LayoutLM v2 (layout language model v2) that performs end-to-end inference from the input image to the final result. The overall solution consists of four modules: 1) an OCR module that predicts the content and location of all text instances; 2) a block ordering module that rearranges the text instances into a feasible order, serializing the 2D layout into a 1D sequence; 3) the LayoutLM v2 model, which fuses three modalities of information (text, visual, and layout) and generates a sequence of predicted labels, drawing on knowledge from the pre-trained language model; 4) a post-processing module that converts the model's output into the final structured information. The scheme is intended to reduce the complexity of an invoice-processing system by integrating multiple steps. The CSIG (China Society of Image and Graphics) 2022 Invoice Recognition and Analysis Challenge, held on this dataset, attracted a large number of researchers, who proposed excellent solutions.

Result In the baseline experiments, three settings are evaluated: inference with an OCR engine, a fine-tuned OCR model, and OCR ground truth. The F1 scores are 0.7687, 0.8570, and 0.9857, respectively. These results demonstrate the effectiveness of LayoutLM v2 and also show how challenging OCR is in this scenario. Inference runs at 1.88 frames/s on a Tesla V100 GPU. An accuracy of 90% is reported when only the raw image is used as input. The best solutions fall roughly into two categories. One merges the structuring task directly into text detection (i.e., multi-category detection), so that recognition is only required for text of the corresponding categories. The other applies a general OCR strategy followed by an independent information extraction module. Both point to the potential of integrating these technologies further.

Conclusion SCID exhibits several challenges of real-world OCR application scenarios and can provide important data support for the research, development, and practical deployment of techniques related to visually rich document information extraction. The dataset can be downloaded from https://davar-lab.github.io/dataset/scid.html.
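As an illustration of the block ordering and post-processing modules described in the abstract, here is a minimal Python sketch. It is not the authors' implementation. The box format, the line-grouping tolerance, the BIO label scheme, and the field names `date` and `total` are all assumptions made for this example; the abstract does not specify SCID's actual label set or ordering rule.

```python
from typing import Dict, List, Tuple

# Hypothetical OCR instance format: (text, (x_min, y_min, x_max, y_max)).
Box = Tuple[float, float, float, float]
Instance = Tuple[str, Box]


def order_blocks(instances: List[Instance], line_tol: float = 0.5) -> List[Instance]:
    """Serialize 2D text boxes into a 1D order: top-to-bottom, then left-to-right."""
    # Sort by vertical center so boxes on the same visual line end up adjacent.
    by_row = sorted(instances, key=lambda it: (it[1][1] + it[1][3]) / 2)
    lines: List[List[Instance]] = []
    for inst in by_row:
        _, (_, y0, _, y1) = inst
        center_y = (y0 + y1) / 2
        if lines:
            # Compare against the first box of the current line.
            _, (_, ref_y0, _, ref_y1) = lines[-1][0]
            ref_center = (ref_y0 + ref_y1) / 2
            ref_height = ref_y1 - ref_y0
            if abs(center_y - ref_center) <= line_tol * ref_height:
                lines[-1].append(inst)
                continue
        lines.append([inst])
    ordered: List[Instance] = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda it: it[1][0]))  # left to right
    return ordered


def decode_bio(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group BIO-tagged tokens into {field_name: [values]}."""
    fields: Dict[str, List[str]] = {}
    current_field = None
    current_tokens: List[str] = []

    def flush() -> None:
        nonlocal current_field, current_tokens
        if current_field is not None and current_tokens:
            # Tokens are joined without spaces, which suits Chinese text.
            fields.setdefault(current_field, []).append("".join(current_tokens))
        current_field, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_field, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_field == label[2:]:
            current_tokens.append(token)
        else:  # "O" or an inconsistent I- tag closes the open field
            flush()
    flush()
    return fields


# Toy input: two visual lines, deliberately given out of reading order.
ocr = [
    ("2022-05-01", (300, 12, 420, 32)),
    ("Date", (200, 10, 280, 30)),
    ("Amount", (200, 60, 280, 80)),
    ("35.00", (300, 61, 380, 81)),
]
ordered = order_blocks(ocr)
ordered_texts = [text for text, _ in ordered]

# Hypothetical sequence-labeling output for a tokenized version of the text.
fields = decode_bio(
    ["Date", "2022-05-01", "Amount", "35", ".00"],
    ["O", "B-date", "O", "B-total", "I-total"],
)
print(ordered_texts)
print(fields)
```

In a full pipeline, the ordered text and boxes would be fed to the sequence-labeling model, and `decode_bio` would run on its predicted labels. The simple row-grouping rule shown here is likely to struggle with the print misalignment and overlap that the dataset is said to contain.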
Journal
Journal title: Journal of Image and Graphics
Year: 2023
ISSN: 1006-8961
DOI: https://doi.org/10.11834/jig.220911