Identifying Phrasemes via Interlingual Association Measures - A Data-driven Approach on Dependency-parsed and Word-aligned Parallel Corpora
Abstract
It has long been understood that the semantic content of a combination of two or more words often cannot be derived from the semantics of the individual words, and that the use of one particular word imposes restrictions on the others (Firth 1957; Evert 2004, 15–17). The semantics is then determined either by the ruling word, e.g., in light verb constructions (attention entails pay in pay attention), or by the entirety of all participating words, e.g., in idiomatic expressions or set phrases (so to speak). Many names have been given to this phenomenon, each of which looks at it from a slightly different perspective: collocations, multiword expressions, phrasemes, idioms, or formulaic sequences, to name just a few. The term lexical function (Wanner 1996; Mel’čuk 1998) stresses the aspect of one word being the value returned by a function applied to another word. This lack of flexibility of the determined word, which contributes little, if anything, to the meaning of the composed expression, is what we exploit in the approach described in this paper. In our work, we address the issue of phraseme identification in one language by searching for corresponding syntactic structures in parallel, word-aligned corpora. We exemplify our approach by retrieving and ranking support verb constructions that consist of a verb and its direct object. The support verb’s nature allows for correspondences that can be regarded as translations only in the context of the whole construction. A suitable translation of the support verb construction pay attention into German is Aufmerksamkeit schenken (‘attention’ + ‘give as a present’/‘make a gift’). While attention and Aufmerksamkeit embody the same semantic concept, pay and schenken can hardly be seen as good translations of each other except in this particular case, and only in conjunction with their direct objects.
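The intuition behind the pay attention / Aufmerksamkeit schenken example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the association measure used in the paper: the toy data, the function names, and the score (one minus the share of a verb's alignments that go to the observed target verb) are all hypothetical.

```python
from collections import Counter

# Toy word-aligned verb-object pairs: (src_verb, src_obj, tgt_verb, tgt_obj).
# Invented for illustration only; not data from the paper.
ALIGNED = [
    ("pay", "attention", "schenken", "Aufmerksamkeit"),
    ("pay", "attention", "schenken", "Aufmerksamkeit"),
    ("pay", "bill", "bezahlen", "Rechnung"),
    ("pay", "price", "bezahlen", "Preis"),
    ("pay", "amount", "bezahlen", "Betrag"),
    ("give", "present", "schenken", "Geschenk"),
    ("read", "book", "lesen", "Buch"),
    ("read", "letter", "lesen", "Brief"),
]


def verb_translation_share(pairs):
    """Estimate P(tgt_verb | src_verb) over all aligned pairs."""
    verb_counts = Counter(p[0] for p in pairs)
    link_counts = Counter((p[0], p[2]) for p in pairs)
    return {link: n / verb_counts[link[0]] for link, n in link_counts.items()}


def rank_candidates(pairs):
    """Rank (verb, object) pairs by how atypical the verb's aligned translation is.

    Hypothetical score: 1 - P(tgt_verb | src_verb). A high value means the
    verb is rarely aligned to this target verb outside this construction.
    """
    share = verb_translation_share(pairs)
    scores = {}
    for src_verb, src_obj, tgt_verb, _tgt_obj in pairs:
        score = 1.0 - share[(src_verb, tgt_verb)]
        key = (src_verb, src_obj)
        scores[key] = max(scores.get(key, 0.0), score)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_candidates(ALIGNED)
print(ranking[0])  # the top-ranked candidate construction
```

On this toy data, pay is aligned to bezahlen more often than to schenken, so the pair (pay, attention) receives the highest score. A real system would additionally need dependency parses to extract verb-object pairs and a proper statistical association measure.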
This paper is structured as follows: In Section 2, we give an overview of statistical association measures and their motivation. Section 3 explains the design of our corpus, including the choice of corpus material and the annotation and alignment tasks that are required to allow for complex corpus queries such as the ones we use for
Similar resources
Exploring Properties of Intralingual and Interlingual Association Measures Visually
We present an interactive interface to explore the properties of intralingual and interlingual association measures. In conjunction, they can be employed for phraseme identification in word-aligned parallel corpora. The customizable component we built to visualize individual results is capable of showing part-of-speech tags, syntactic dependency relations and word alignments next to the tokens ...
Identifying Correspondences Between Words: An Approach Based On A Bilingual Syntactic Analysis Of French/English Parallel Corpora
We present a word alignment procedure based on a syntactic dependency analysis of French/English parallel corpora called “alignment by syntactic propagation”. Both corpora are analysed with a deep and robust parser. Starting with an anchor pair consisting of two words which are potential translations of one another within aligned sentences, the alignment link is propagated to the syntactically ...
The Effect of Morphemes on Dependency Parsing of Persian
Data-driven systems can be adapted to different languages and domains easily. Applying this trend to dependency parsing has led to the introduction of data-driven approaches. The existence of appropriate corpora that contain sentences and their associated dependency trees is the only prerequisite for data-driven approaches. Despite obtaining highly accurate results for the dependency parsing task in English langu...
Ontology-Based Word Sense Disambiguation in Parallel Corpora
Lately, there seems to be a growing acceptance of the idea that multilingual lexical ontologies might be the key towards aligning different views on the semantic atomic units to be used in characterizing the general meaning of various and multilingual documents. Comparing performances of word sense disambiguation systems is a difficult evaluation task when different sense inventories are used a...
Improving Bilingual Projections via Sparse Covariance Matrices
Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence, and should be highly sparse. Unfortunately, the presence of noise leads t...
Journal: CoRR
Volume: abs/1709.08196
Publication date: 2017