Abstract The explosion of the Web leads to the production of large amounts of text and inevitably influences its quality. Errors that tend to occur more often can distort results, especially when such texts are used for scientific purposes or in language teaching and learning. Hence, there is a need to examine existing web-based corpora and to clean up the data, which may contain such “noisy” fragments. In our study, we deal wit...