Abstract:
Kazakh is an agglutinative language belonging to the Turkic language family. In China, Kazakh is written in
Arabic script and is a low-resource language, so many serious challenges remain for natural language
processing research on it. This paper standardizes the encoding and storage scheme for Kazakh corpus
processing and then constructs the Kazakh Language Corpus (KzLC), which lays the foundation for further
research on Kazakh language processing, such as syntactic analysis. To address the word-frequency question in
Kazakh, the paper examines whether Kazakh words follow Zipf's law, a power-law relation between a word's
frequency and its frequency rank. Based on frequency statistics of Kazakh words drawn from Kazakh textbooks,
this research proposes a corpus-based method for analyzing and computing word information statistics, which
reveals linguistic regularities and phenomena in Kazakh word information.
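
As an illustration of the kind of rank-frequency analysis referred to above, the following minimal sketch counts word frequencies, ranks them, and estimates a Zipf exponent by ordinary least squares on the log-log rank-frequency points. It is not the paper's method: the function names `rank_frequency` and `zipf_exponent`, the pre-tokenized input, and the toy tokens are all assumptions made for illustration, and the least-squares fit is only one of several possible estimators.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word.

    `tokens` is assumed to be an already tokenized list of word strings;
    tokenization of Arabic-script Kazakh is outside the scope of this sketch.
    """
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(rank_freq):
    """Estimate s in f(r) ~ C / r**s by least squares on log f versus log r.

    Requires at least two distinct ranks, otherwise the slope is undefined.
    """
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # Zipf's law predicts a negative slope on the log-log plot; return its magnitude.
    return -slope


if __name__ == "__main__":
    # Hypothetical placeholder tokens, not real corpus data: counts follow 60 / rank.
    toy_tokens = ["w1"] * 60 + ["w2"] * 30 + ["w3"] * 20 + ["w4"] * 15 + ["w5"] * 12
    pairs = rank_frequency(toy_tokens)
    print(pairs)
    print(zipf_exponent(pairs))
```

An estimated exponent close to 1 on real corpus counts would be consistent with the classical form of Zipf's law. A fit over all ranks is sensitive to the long tail of rare words, so the estimate should be read as indicative only.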