A Cross-lingual Sentence Similarity Calculation Method with Multifeature Fusion

Abstract

Cross-language sentence similarity computation is among the focuses of research in natural language processing (NLP). At present, some researchers have introduced fine-grained word and character features to help models understand sentence meanings, but they do not consider coarse-grained prior knowledge at the sentence level. Even if the two sentences of a cross-lingual pair have the same meaning, the sentence representations extracted by the baseline approach may carry language-specific biases. Considering the above problems, in this paper we construct a Chinese–Uyghur cross-lingual sentence similarity dataset and propose a method to compute cross-lingual sentence similarity by fusing multiple features. The method is based on the cross-lingual pretraining model XLM-RoBERTa and assists the similarity calculation by introducing multiple coarse-grained sentence features, i.e., sentence sentiment and sentence length. At the same time, to eliminate possible language-specific biases in the vectors, we whitened the sentence vectors of the different languages to ensure that they were all represented under the standard orthogonal basis. Because the way vectors are combined has different effects on the final performance of the model, we introduce different vector features for comparison experiments based on the basic feature splicing method. The results show that the absolute value of the difference between the two sentence vectors can reflect the similarity of the two sentences well. The F1 value of our method reaches 98.97%, which is 19.81% higher than that of the baseline.
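The whitening and vector-combination steps mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fit_whitening`, `whiten`, `fuse_pair`) are hypothetical, random matrices stand in for XLM-RoBERTa sentence embeddings, fitting one whitening transform per language is an assumption, and the layout `[u, v, |u - v|]` is only one plausible reading of the "basic feature splicing" with the absolute-difference feature.

```python
import numpy as np


def fit_whitening(vectors, eps=1e-10):
    """Return (mu, w) so that (x - mu) @ w has roughly identity covariance."""
    mu = vectors.mean(axis=0, keepdims=True)
    cov = np.cov(vectors - mu, rowvar=False)
    # cov is symmetric positive semi-definite, so SVD yields its eigenbasis.
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps))
    return mu, w


def whiten(vectors, mu, w):
    """Center the vectors and rotate/scale them onto an orthonormal basis."""
    return (vectors - mu) @ w


def fuse_pair(u, v, extra=None):
    """Concatenate both sentence vectors with |u - v| (assumed layout).

    `extra` is a hypothetical slot for coarse-grained features such as a
    sentiment score or a length ratio, one row per sentence pair.
    """
    parts = [u, v, np.abs(u - v)]
    if extra is not None:
        parts.append(extra)
    return np.concatenate(parts, axis=-1)


# Stand-in data: 500 sentence pairs with 8-dimensional "embeddings".
rng = np.random.default_rng(0)
mix = np.eye(8) + 0.5 * np.eye(8, k=1)  # introduces correlated dimensions
zh = rng.normal(size=(500, 8)) @ mix + 3.0  # placeholder Chinese vectors
ug = rng.normal(size=(500, 8)) @ mix - 1.0  # placeholder Uyghur vectors

# Assumption: each language gets its own whitening transform.
zh_w = whiten(zh, *fit_whitening(zh))
ug_w = whiten(ug, *fit_whitening(ug))

pair_features = fuse_pair(zh_w, ug_w)  # would be fed to a classifier
```

After whitening, each language's vectors have zero mean and approximately identity covariance, so language-specific offsets and scales no longer dominate the difference `|u - v|`.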

Similar resources

Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

We are presenting an approach to calculating the semantic similarity of documents written in the same or in different languages. The similarity calculation is achieved by representing the document contents in a language-independent way, using the descriptor terms of the multilingual thesaurus EUROVOC, and by then calculating the distance between these representations. While EUROVOC is a careful...

A resource-light method for cross-lingual semantic textual similarity

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named ...

Cross-lingual Similarity Discrimination with Translation Characteristics

In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment. This paper proposes a discriminative model trained on a bilingual corpus to divide a set of sentences in the target language into two classes according to their similarities to a given sentence in the source language. Positive outputs of the discriminative model are then ranked according to the similarity proba...

Cross-lingual sentence extraction for information distillation

Information distillation aims to analyze and interpret large volumes of speech and text archives in multiple languages and produce structured information of interest to the user. In this work, we investigate cross-lingual information distillation, where non-English (source language) documents are searched for user queries that are in English (target language). We propose to perform distillation ...

Cross-lingual Sentence Compression for Subtitles

We present an approach for translating subtitles where standard time and space constraints are modeled as part of the generation of translations in a phrase-based statistical machine translation system (PBSMT). We propose and experiment with two promising strategies for jointly translating and compressing subtitles from English into Portuguese. The quality of the automatic translations is measu...

Journal

Journal title: IEEE Access

Year: 2022

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2022.3159692