A Cross-lingual Sentence Similarity Calculation Method with Multifeature Fusion
Authors
Abstract
Cross-language sentence similarity computation is among the focuses of research in natural language processing (NLP). At present, some researchers have introduced fine-grained word and character features to help models understand sentence meanings, but they do not consider coarse-grained prior knowledge at the sentence level. Even if two cross-lingual sentence pairs have the same meaning, the sentence representations extracted by the baseline approach may carry language-specific biases. Considering the above problems, in this paper we construct a Chinese–Uyghur cross-lingual sentence similarity dataset and propose a method to compute cross-lingual sentence similarity by fusing multiple features. The method is based on the cross-lingual pretraining model XLM-RoBERTa and assists the similarity calculation by introducing multiple coarse-grained sentence features, i.e., sentiment, length, and time features. To eliminate possible language-specific biases in the vectors, we whitened the sentence vectors of the different languages to ensure that they were all represented under a standard orthogonal basis. Because the combination of different vectors has different effects on the final performance of the model, we introduce different vector features for comparison experiments based on the basic feature splicing method. The results show that the absolute value of the difference between the two vectors can reflect the similarity of two sentences well. The F1 value of our method reaches 98.97%, which is 19.81% higher than that of the baseline.
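The two vector operations named in the abstract, whitening and feature splicing with an absolute-difference term, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the random arrays stand in for XLM-RoBERTa sentence vectors, the whitening transform is fitted separately per language (an assumption), the function names `fit_whitening` and `fuse` are hypothetical, and the three coarse-grained feature values are made up for illustration.

```python
import numpy as np

def fit_whitening(X, eps=1e-12):
    """Return (mu, W) such that (X - mu) @ W has approximately identity covariance."""
    mu = X.mean(axis=0, keepdims=True)
    cov = np.cov(X - mu, rowvar=False)
    # SVD of the symmetric covariance matrix: cov = U @ diag(S) @ U.T
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps))
    return mu, W

def fuse(u, v, extra):
    """Splice [u, v, |u - v|, extra] into a single feature vector."""
    return np.concatenate([u, v, np.abs(u - v), extra])

rng = np.random.default_rng(0)
# Stand-ins for sentence vectors of two languages (random data, 8 dimensions).
zh = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))
ug = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8)) + 3.0

# Assumption: one whitening transform per language.
mu_zh, W_zh = fit_whitening(zh)
mu_ug, W_ug = fit_whitening(ug)
zh_w = (zh - mu_zh) @ W_zh
ug_w = (ug - mu_ug) @ W_ug

# Hypothetical sentiment, length, and time feature values for one sentence pair.
extra = np.array([0.0, 0.1, 1.0])
features = fuse(zh_w[0], ug_w[0], extra)
print(features.shape)
```

After whitening, each language's vectors have zero mean and near-identity covariance, so the element-wise difference `|u - v|` compares coordinates on a common scale. The fused vector would then feed a classifier, which is outside this sketch.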
Similar resources
Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC
We are presenting an approach to calculating the semantic similarity of documents written in the same or in different languages. The similarity calculation is achieved by representing the document contents in a language-independent way, using the descriptor terms of the multilingual thesaurus EUROVOC, and by then calculating the distance between these representations. While EUROVOC is a careful...
A resource-light method for cross-lingual semantic textual similarity
Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named ...
Cross-lingual Similarity Discrimination with Translation Characteristics
In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment. This paper proposes a discriminative model trained on a bilingual corpus to divide a set of sentences in the target language into two classes according to their similarities to a given sentence in the source language. Positive outputs of the discriminative model are then ranked according to the similarity proba...
Cross-lingual sentence extraction for information distillation
Information distillation aims to analyze and interpret large volumes of speech and text archives in multiple languages and produce structured information of interest to the user. In this work, we investigate cross-lingual information distillation, where non-English (source language) documents are searched for user queries that are in English (target language). We propose to perform distillation ...
Cross-lingual Sentence Compression for Subtitles
We present an approach for translating subtitles where standard time and space constraints are modeled as part of the generation of translations in a phrase-based statistical machine translation system (PBSMT). We propose and experiment with two promising strategies for jointly translating and compressing subtitles from English into Portuguese. The quality of the automatic translations is measu...
Journal
Journal title: IEEE Access
Year: 2022
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2022.3159692