3000 Hanzi's Measure Word Project

One of the secrets of 3000 Hanzi is that it's backed by one of the largest Chinese corpora ever created.

This means we have lots of data. That data has a lot of potential, but it wasn't being used effectively. Mostly it was collecting digital dust.

I realized this when I started asking some prominent bloggers what kinds of features they were looking for in an online Chinese dictionary. Even more than features, what they were looking for was great content:

As Olle of Hacking Chinese said, "What's lacking is information about how a word is used, how common it is, etc." It reminded me of the thesis of Albert's post on Laowai Chinese about online Chinese dictionaries: they are "not complete" and "not useful."

I completely agree.

Essentially every online dictionary out there is exactly the same. They have exactly the same content (CC-CEDICT), with the same mistakes, and generally the same features. Adding colored characters or options for bopomofo wouldn't really be enough to make 3000 Hanzi's Chinese Dictionary stand out from the pack. I realized that no one was creating what people wanted: better Chinese content. And then I realized that with my experience as both a programmer and a content creator, I was in a unique position to do what no one else could do: create and present the content that Chinese learners are looking for.

After that realization, the main question was: what kind of content should I create first?

For a while, I had plans to create a series of "Chinese in Usage" blog posts that helped learners understand the usage of different vocabulary, but I'd never had the time to get it started. As I considered starting this project, I realized the scope of it was far too large. Chinese usage can be examined in a number of ways, including frequency, collocations, etc., and there was no quick and easy way to divide the project in a way that would leave me satisfied with the results.

And then I came upon an aspect of Chinese that was both unique and "simple": Chinese measure words (or classifiers). I set up a program to go through the corpus looking for numbers (knowing that measure words would probably be around them); it generated 3.6GB of data. Then I created another program to sort through that data, looking for common (and uncommon) patterns where measure words occur. Each pattern looked for three things: the number, the classifier, and the noun (or action) being counted. I came up with over ten patterns that found those relationships in phrases as simple as 三 个 苹果 or as complicated as 三 个 很 大 的 华盛顿州 苹果.
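As a rough illustration (not the actual program behind the project), one such pattern could be sketched in Python like this. It assumes the phrase has already been isolated and segmented into space-separated words, as in the examples above. The tiny NUMERALS and CLASSIFIERS inventories and the extract_triple name are hypothetical, made up for this sketch. Because Chinese noun phrases are head-final, the sketch simply treats the last word as the thing being counted.

```python
import re

# Hypothetical, deliberately tiny inventories for illustration only.
NUMERALS = "零一二两三四五六七八九十百千万"
CLASSIFIERS = {"个", "只", "本", "条", "张"}

# A number is a run of Chinese numerals or a run of ASCII digits.
NUMBER_RE = re.compile(rf"[{NUMERALS}]+|\d+")


def extract_triple(phrase):
    """Return (number, classifier, noun) for a segmented phrase, or None.

    The phrase must start with a number followed by a known classifier.
    Anything between the classifier and the final word (adjectives, 的,
    modifying nouns) is skipped, and the final word is taken as the noun.
    """
    tokens = phrase.split()
    if len(tokens) < 3:
        return None
    number, classifier = tokens[0], tokens[1]
    if not NUMBER_RE.fullmatch(number) or classifier not in CLASSIFIERS:
        return None
    return number, classifier, tokens[-1]


simple = extract_triple("三 个 苹果")
complicated = extract_triple("三 个 很 大 的 华盛顿州 苹果")
no_match = extract_triple("我 喜欢 苹果")
```

Both example phrases come out as the same number, classifier, and noun, which is the point: the pattern sees through the modifiers. A single rule like this would miss plenty of real sentences, which is why more than one pattern is needed.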

The analysis of 3000 Hanzi's Chinese corpus ended up identifying over 200 measure words and around 12,000 measure word / noun combinations (e.g. 个 + 人 occurred 480,972 times and 个 + 月 occurred 112,315 times). Using this data, I analyzed, classified, and defined the different measure words. From the very beginning I knew these examples would serve as the foundation for 3000 Hanzi's measure word project.
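Counting combinations like these is the easy part once the triples exist. Here is a minimal sketch, assuming the extraction step yields (number, classifier, noun) tuples; the count_pairs name and the four sample tuples are invented for illustration and are not corpus data.

```python
from collections import Counter


def count_pairs(triples):
    """Tally how often each (classifier, noun) pair occurs, ignoring the number."""
    return Counter((classifier, noun) for _number, classifier, noun in triples)


# Toy input, not real corpus data.
sample = [
    ("三", "个", "人"),
    ("两", "个", "人"),
    ("一", "个", "月"),
    ("五", "本", "书"),
]

counts = count_pairs(sample)
top_pair, top_count = counts.most_common(1)[0]
```

Sorting the tally with most_common() gives a frequency-ranked list of nouns for each measure word, which is the kind of list the examples above come from.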

Finally, I started planning a series of blog posts that would be simple enough for beginners to understand but still be deep enough for more advanced learners to appreciate.

That series starts on 3000 Hanzi's Chinese Measure Words page. As I publish more content (about two posts a week), you'll be able to find it on that page (or by following the blog, or 3000 Hanzi on Twitter).

Please let me know what you think.