Development
Parsing kanji
denvazh
I promised myself that this post would be dead simple but very informative. So let’s skip the introduction and go straight to the “why?”, followed by the “how?”.
“Why?”
Recently I was working on a project that needed a lot of test data. In Ruby this is easy to do with the Faker gem and/or the sequence feature of the FactoryGirl gem. Everything worked fine until I started feeding the tests real data in Japanese.
Faker supports locale switching and can, to some degree, generate Japanese names, but the functionality is quite limited. To generate names that come not only in kanji but also in katakana, hiragana and romaji (Latin alphabet), I had to implement my own solution.
I don’t want to go into too much detail, because I explained everything quite well here.
What I would like to explain in this post is how I generated this long list of static fake data.
“How?”
While searching the web for existing solutions, I found a library that provides an interface to MeCab, a natural language processing tool (a morphological analyzer) for Japanese.
Because I use a Mac with Homebrew, I was able to install it like this:
$ brew install mecab mecab-ipadic
This grabs a recent version of MeCab and its IPADIC dictionary.
Now we can actually start writing some Ruby code.
$ mkdir parse_kanji && cd parse_kanji
Create a Gemfile
$ bundle init
Open the Gemfile, delete all lines starting with gem, and add this line:
gem "natto"
Install dependencies
$ bundle install
From this point you can create any Ruby script file and work normally.
Let’s actually do something interesting.
Suppose we have a file with a list of kanji words that we have no idea how to read, and we would like to create a CSV file with their readings.
面白
目黒
岡田
無論
外国
漠然
For this task we also add another gem that conveniently converts katakana to romaji. Add the line below to the Gemfile (and don’t forget to run bundle install again):
gem "romaji"
Now we can write our script. Below I will point out a few important points and then link to the full script.
First, we reference the natto interface to MeCab from a global variable. This is convenient for small scripts.
$nm = Natto::MeCab.new
Then, for the actual conversion, we just need to implement two functions:
Conversion of a kanji string to katakana
def to_katakana(s)
  arr = []
  $nm.parse(s) do |n|
    if n.char_type == 2
      # Kanji token: take the reading, the second-to-last feature field
      yomi = n.feature.split(',')[-2]
      arr << yomi
    else
      # Anything else (kana, symbols) is kept as written
      arr << n.surface
    end
  end
  arr.join
end
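To make the node-walking logic easier to follow without MeCab installed, here is a minimal sketch that runs the same idea over hand-written stand-in nodes. Everything in it is an assumption for illustration: FakeNode, reading_from_feature and katakana_from_nodes are hypothetical names, the two feature strings are samples written in IPADIC's nine-field layout rather than captured MeCab output, and the char_type value 6 for hiragana is a guess at the dictionary's numbering (only the check against 2 comes from the code above). The extra length check reflects that words missing from the dictionary typically come back with fewer fields and no reading.

```ruby
# Stand-in for the node objects natto yields; only the three attributes
# used by to_katakana are modelled here (hypothetical helper type).
FakeNode = Struct.new(:surface, :feature, :char_type)

KANJI_CHAR_TYPE = 2 # the value the post's code checks for

# IPADIC features are comma-separated. For known words the reading is the
# second-to-last of nine fields; shorter feature lists have no reading,
# so return nil instead of picking up a stray "*".
def reading_from_feature(feature)
  fields = feature.split(',')
  fields.length >= 9 ? fields[-2] : nil
end

# Same control flow as to_katakana, but over an array of nodes.
def katakana_from_nodes(nodes)
  nodes.map do |n|
    reading = reading_from_feature(n.feature) if n.char_type == KANJI_CHAR_TYPE
    reading || n.surface # fall back to the text as written
  end.join
end

# Illustrative sample nodes, not real MeCab output.
nodes = [
  FakeNode.new('目黒', '名詞,固有名詞,地域,一般,*,*,目黒,メグロ,メグロ', 2),
  FakeNode.new('に', '助詞,格助詞,一般,*,*,*,に,ニ,ニ', 6)
]

puts katakana_from_nodes(nodes)
```

Note that only kanji tokens are replaced by their reading, so input that mixes kanji and kana produces mixed output.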
Conversion to hiragana is more of a convenience method than an implementation of something new. We merely wrap the NKF call that converts a katakana string to hiragana: -h1 performs the conversion and -w makes the output UTF-8. NKF comes from Ruby's nkf library, so the script needs require 'nkf'.
def to_hiragana(s)
  NKF.nkf('-h1 -w', s)
end
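As a quick cross-check of what this conversion does, the same mapping can be sketched in plain Ruby without NKF. This is an alternative for illustration, not what the script uses; katakana_to_hiragana is a hypothetical name. It relies on the fact that katakana ァ to ン (U+30A1 to U+30F3) line up one to one with hiragana ぁ to ん (U+3041 to U+3093).

```ruby
# Map each katakana character in the range ァ..ン to the hiragana
# character at the same position in ぁ..ん. Characters outside the range,
# such as the long vowel mark ー, pass through unchanged.
def katakana_to_hiragana(s)
  s.tr('ァ-ン', 'ぁ-ん')
end

puts katakana_to_hiragana('オモシロ')
puts katakana_to_hiragana('ガイコク')
```

Unlike NKF, this sketch does no encoding detection or conversion, so it assumes the input is already a UTF-8 string.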
The full script can be found here.
Finally, running this script gives us the readings of the kanji above:
面白,オモシロ,おもしろ,omoshiro
目黒,メグロ,めぐろ,meguro
岡田,オカダ,おかだ,okada
無論,ムロン,むろん,muron
外国,ガイコク,がいこく,gaikoku
漠然,バクゼン,ばくぜん,bakuzen
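The glue code that turns a word list into rows of this shape is not shown in the post, so the following is only a sketch of how it might be wired up. build_rows is a hypothetical name, and the three converters are passed in as lambdas so the example runs without MeCab or the romaji gem. The lookup tables are stubs standing in for the real conversions; in the actual script these slots would presumably be filled by to_katakana, to_hiragana and the romaji gem's kana-to-romaji conversion.

```ruby
# Build one "kanji,katakana,hiragana,romaji" line per input word.
# Hiragana and romaji are both derived from the katakana reading.
def build_rows(words, to_katakana:, to_hiragana:, to_romaji:)
  words.map do |word|
    kata = to_katakana.call(word)
    [word, kata, to_hiragana.call(kata), to_romaji.call(kata)].join(',')
  end
end

# Stub data standing in for MeCab and the romaji gem (assumptions).
readings = { '目黒' => 'メグロ', '岡田' => 'オカダ' }
romaji   = { 'メグロ' => 'meguro', 'オカダ' => 'okada' }

rows = build_rows(
  "目黒\n岡田".split, # split handles spaces and newlines alike
  to_katakana: ->(w) { readings.fetch(w) },
  to_hiragana: ->(k) { k.tr('ァ-ン', 'ぁ-ん') },
  to_romaji:   ->(k) { romaji.fetch(k) }
)

puts rows
```

Joining with a plain comma is enough here because none of the fields can contain a comma or a quote; for arbitrary data a proper CSV writer would be the safer choice.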