I've slightly modified the code from Python Crunching Mandarin to get the simple words only. I also realised that the CEDICT dictionary file may contain more than one entry for a given word, so each word now maps to a list of entries rather than a single string. Here is the diff output for the updated code:
8c8
< frequent[word] = None
---
> frequent[word] = []
17c17
< frequent[word] = ' '.join(line[2:])
---
> frequent[word].append(' '.join(line[1:]))
26c26
< output.write(word + ' ' + frequent[word])
---
> output.write(' '.join(frequent[word]))
The updated output file is available for download.
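For anyone who doesn't have the original script handy, here is a minimal sketch of what the change amounts to: each word maps to a list, and every matching dictionary line is appended instead of overwriting the previous one. This is not the full script. The function name `collect_entries`, the sample lines, and the assumption that each CEDICT line looks like `traditional simplified [pinyin] /definition/` (so the lookup key is the second field) are mine, for illustration only.

```python
def collect_entries(dict_lines, words):
    """Gather every dictionary entry for each word of interest.

    Assumes (hypothetically) that each line is laid out as
    'traditional simplified [pinyin] /definition/' and that the
    lookup key is the second whitespace-separated field.
    """
    # Start every word with an empty list so repeated entries accumulate.
    frequent = {word: [] for word in words}
    for raw in dict_lines:
        if raw.startswith('#'):
            continue  # skip comment lines
        line = raw.split()
        if len(line) < 2:
            continue  # skip blank or malformed lines
        word = line[1]
        if word in frequent:
            # Keep the word itself plus pinyin and definition.
            frequent[word].append(' '.join(line[1:]))
    return frequent


# Made-up sample lines: the second word appears twice.
sample = [
    '# a comment line',
    '中國 中国 [Zhong1 guo2] /China/',
    '行 行 [hang2] /row/line/',
    '行 行 [xing2] /to walk/to go/',
]
entries = collect_entries(sample, ['中国', '行'])
for word in entries:
    # One entry per line here for readability; the diff above joins with a space.
    print('\n'.join(entries[word]))
```

With a plain string as the value, the second entry for a word would silently replace the first; the list keeps both.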
