IEEE ACCESS, pp. 74313-74334, 2025 (SCI-Expanded, Scopus)
Text similarity is a crucial area of study that evaluates how alike texts are both semantically and syntactically. As data volumes grow, understanding the similarities and relationships between texts becomes essential, particularly in natural language processing (NLP) tasks such as text generation, summarization, and classification. This study examines the similarities between human-written scientific abstracts, AI-paraphrased abstracts, and AI-generated abstracts. Several methods, including cosine similarity, Word2Vec, FastText, and BERT, were evaluated using mean, median, and standard deviation metrics. Among these, Word2Vec and FastText achieved the highest mean similarity score (0.930), while BERT performed best in the 'Human-Paraphrased' category, with the highest median (0.841) and the lowest standard deviation (0.019), indicating consistent results across datasets. The study also investigates the implications of these similarities for text analysis and ethical standards, comparing techniques for measuring text similarity and analyzing their effectiveness. The findings offer valuable insights into the application areas of text similarity analysis.
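To illustrate the kind of pairwise comparison the abstract describes, the sketch below computes cosine similarity between two short texts. This is a minimal illustration using raw bag-of-words counts, not the paper's pipeline: the study embeds abstracts with models such as Word2Vec, FastText, or BERT before comparing them, and the example sentences here are hypothetical.

```python
# Minimal sketch: cosine similarity over bag-of-words vectors.
# The paper's actual pipeline would replace raw word counts with
# Word2Vec, FastText, or BERT embeddings of each abstract.
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of the bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical human-written vs. AI-paraphrased sentence pair.
human = "text similarity evaluates how similar texts are"
paraphrased = "text similarity measures how alike two texts are"
print(round(cosine_similarity(human, paraphrased), 3))  # → 0.668
```

A score of 1.0 indicates identical word distributions and 0.0 indicates no shared vocabulary; embedding-based methods such as BERT additionally capture semantic overlap between texts that share few surface words.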