| Lithuanian Journal of Statistics | |
| Statistical Analysis of Word Frequency Distribution in Lithuanian Texts of Different Genres | |
| Neringa Bružaitė [1], Tomas Rekašius [1] | |
| [1] Vilnius Gediminas Technical University, Lithuania | |
| Keywords: word frequencies; structural distribution; Zipf’s law; hierarchical clustering; Jaccard distance; Ward method | |
| DOI : 10.15388/LJS.2016.13868 | |
| Source: DOAJ | |
【 Abstract 】
The paper examines Lithuanian texts of different authors and genres. The main points of interest are the number of words, the number of different words, and word frequencies. The structural type distribution and Zipf’s law are applied to describe the frequency distribution of words in a text. The lexical diversity of any text can be characterized by the set of different words used in it, also called its vocabulary. It is shown that the information contained in a reduced vocabulary is enough to divide the texts analyzed in this article into groups by genre and author using a hierarchical clustering method. In this case, distances between clusters are measured using the Jaccard distance, and clusters are aggregated using the Ward method.
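The clustering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `top_n` cutoff used to build a "reduced vocabulary", and the toy word sets are all hypothetical, and the abstract does not say which Ward variant was used. Here the Lance-Williams update for Ward's criterion is assumed to operate on squared Jaccard distances.

```python
import re
from collections import Counter
from itertools import combinations


def reduced_vocabulary(text, top_n=50):
    """Set of the top_n most frequent word forms (top_n is an assumed cutoff)."""
    words = re.findall(r"\w+", text.lower())
    return {w for w, _ in Counter(words).most_common(top_n)}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def ward_clustering(vocabs, n_clusters):
    """Merge texts bottom-up until n_clusters groups remain.

    Assumption: the Lance-Williams form of Ward's update is applied to
    squared Jaccard distances between the vocabularies.
    """
    clusters = {i: [i] for i in range(len(vocabs))}
    d = {
        frozenset((i, j)): jaccard_distance(vocabs[i], vocabs[j]) ** 2
        for i, j in combinations(clusters, 2)
    }
    next_id = len(vocabs)
    while len(clusters) > n_clusters:
        # Pick the pair of current clusters with the smallest dissimilarity.
        pair = min(
            (frozenset(p) for p in combinations(clusters, 2)),
            key=lambda p: d[p],
        )
        i, j = tuple(pair)
        ni, nj = len(clusters[i]), len(clusters[j])
        # Lance-Williams update: distance from every other cluster k
        # to the newly merged cluster.
        for k in clusters:
            if k in pair:
                continue
            nk = len(clusters[k])
            d[frozenset((k, next_id))] = (
                (ni + nk) * d[frozenset((k, i))]
                + (nj + nk) * d[frozenset((k, j))]
                - nk * d[pair]
            ) / (ni + nj + nk)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Toy vocabularies (invented placeholders, not data from the paper).
toy_vocabs = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"x", "y", "z"},
    {"x", "y", "w"},
]
print(ward_clustering(toy_vocabs, n_clusters=2))
```

In practice each entry of `toy_vocabs` would be replaced by `reduced_vocabulary(text)` for one text, and the returned index groups compared against the known genre and author labels.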
【 License 】
Unknown