Zhang, HuaRui is a lecturer in the Institute of Computational Linguistics, School of Computer Science. He obtained his B.Sc. and M.Sc. from Tsinghua University in 1992 and 1995, respectively, and his Ph.D. from Peking University in 2016. His research interests include lexical statistics and quantitative linguistics, especially the statistical laws of Chinese characters and sentences, entropy measures with linguistic constraints, and the square-mean-root evenness of the lexicon.
Dr. Zhang has published several research papers on the square-mean-root evenness of the lexicon, two of them in LREC. He has served as conference chair of the 4th Students' Workshop on Computational Linguistics. He was a member of the Comprehensive Language Knowledge-base Project, led by Prof. Yu Shiwen, which was granted the Science and Technology Progress Award by the Ministry of Education of China. He has written a detailed introduction to the book Word Frequency Distributions by R. H. Baayen, proposing several novel interpretations of basic concepts and a new model for vocabulary growth.
Dr. Zhang has achieved the following academic contributions:
1) Proposal of Square-Mean-Root (SMR) evenness, which is preferable to Root-Mean-Square (RMS) evenness and Shannon-Entropy evenness according to the following three criteria:
a) When a corpus is divided into n divisions of equal size, the evenness of a word should be m/n if the word occurs in only m of the divisions, with the same occurrence count in each;
b) When divisions are combined, the value of evenness should not decrease;
c) When several divisions of unequal size but with the same relative frequency are combined, the value of evenness should remain unchanged.
Evaluation shows that both RMS evenness and Shannon evenness violate all of these criteria, but SMR evenness conforms to all of them. Through the extension of SMR evenness to binary and multiple SMR evenness, a self-consistent system was constructed for evaluating and developing measures of evenness.
2) Discovery of two fundamental statistical laws related to Chinese characters:
a) The frequency and rank of Chinese characters follow a power-exponential relation rather than Zipf’s Law;
b) The number of strokes of Chinese characters follows a square-root normal distribution.
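The three criteria listed under contribution 1) can be checked numerically with a short sketch. The formula used here is an assumption inferred from the name "square-mean-root" and is not taken from Dr. Zhang's papers: evenness is computed as the square of the sum of sqrt(s_i * p_i), where s_i is the share of the corpus held by division i and p_i is the share of the word's occurrences falling in division i. The function name `smr_evenness` and its parameters are hypothetical.

```python
import math


def smr_evenness(counts, sizes=None):
    """Assumed SMR evenness: (sum_i sqrt(s_i * p_i)) ** 2.

    counts: occurrences of the word in each division.
    sizes:  size of each division (defaults to equal sizes).
    This formula is an illustrative assumption, not the published definition.
    """
    if sizes is None:
        sizes = [1] * len(counts)
    if len(counts) != len(sizes):
        raise ValueError("counts and sizes must have the same length")
    total_count = sum(counts)
    total_size = sum(sizes)
    if total_count <= 0 or total_size <= 0:
        raise ValueError("the word must occur at least once in a non-empty corpus")
    root_sum = sum(
        math.sqrt((c / total_count) * (s / total_size))
        for c, s in zip(counts, sizes)
    )
    return root_sum ** 2


# Criterion a): 4 equal divisions, word occurs equally in 2 of them.
criterion_a = smr_evenness([5, 5, 0, 0])

# Criterion b): merging the first two divisions must not lower the value.
before_merge = smr_evenness([8, 2, 0, 0])
after_merge = smr_evenness([10, 0, 0], sizes=[2, 1, 1])

# Criterion c): the first two divisions share one relative frequency (0.1),
# so merging them should leave the value unchanged.
unmerged = smr_evenness([10, 30, 0], sizes=[100, 300, 400])
merged = smr_evenness([40, 0], sizes=[400, 400])
```

Under this assumed formula, `criterion_a` evaluates to m/n = 2/4 up to floating-point rounding, `after_merge` is no smaller than `before_merge`, and `unmerged` equals `merged`. Criterion b) follows from the Cauchy-Schwarz inequality, since sqrt((s_i + s_j)(p_i + p_j)) is never less than sqrt(s_i p_i) + sqrt(s_j p_j).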