Korean
A library for Korean morphology
Introduction
Sometimes you need to localize your project for Korean. But common internationalization solutions such as Gettext do not work well with non-Indo-European languages. Those solutions tend to produce awkward Korean sentences, because Korean differs morphologically from Indo-European languages in many ways.
korean is a Python module that provides useful Korean morphological functions for producing natural Korean sentences.
Allomorphic particle
In English, “be” has several allomorphs, so an English localization system must be able to select the correct form, such as “is”, “am”, or “are”. Fortunately, Gettext offers ngettext to produce a natural plural expression. Without it, you would see an awkward sentence like this:
>>> print _('Here is(are) %d apple(s).') % 1
Here is(are) 1 apple(s).
Some Korean particles (postpositions) also have allomorphs, but they need a different selection rule: the choice depends on the phoneme that precedes the particle. Common internationalization solutions don’t support this. Of course, korean does:
>>> from korean import Noun, NumberWord, Loanword
>>> fmt = u'{subj:은} {obj:을} 먹었다.'
>>> fmt2 = u'{subj:은} 레벨 {level:이} 되었다.'
>>> print fmt.format(subj=Noun(u'나'), obj=Noun(u'밥'))
나는 밥을 먹었다.
>>> print fmt.format(subj=Noun(u'학생'), obj=Noun(u'돈까스'))
학생은 돈까스를 먹었다.
>>> print fmt2.format(subj=Noun(u'용사'), level=NumberWord(4))
용사는 레벨 4가 되었다.
>>> print fmt2.format(subj=Noun(u'마왕'), level=NumberWord(98))
마왕은 레벨 98이 되었다.
>>> print fmt2.format(subj=Loanword(u'Leonardo da Vinci', 'ita'),
... level=NumberWord(67))
Leonardo da Vinci는 레벨 67이 되었다.
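The selection rule shown above can be illustrated without the library. The following is only a minimal sketch, not the implementation of korean: SketchNoun, PARTICLES, and has_final_consonant are hypothetical names, and the sketch handles only words that end in a precomposed Hangul syllable (no digits or loanwords). It relies on the layout of the Unicode Hangul syllable block, where each syllable's offset from “가” modulo 28 is zero exactly when the syllable has no final consonant.

```python
# Hypothetical stand-in for illustration; this is not korean.Noun.
PARTICLES = {
    # format spec -> (form after a vowel, form after a consonant)
    "은": ("는", "은"), "는": ("는", "은"),
    "이": ("가", "이"), "가": ("가", "이"),
    "을": ("를", "을"), "를": ("를", "을"),
}


def has_final_consonant(syllable):
    """True if a precomposed Hangul syllable ends in a final consonant."""
    offset = ord(syllable) - 0xAC00  # 0xAC00 is "가"
    if not 0 <= offset < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    return offset % 28 != 0


class SketchNoun(str):
    """A str whose format spec may name a particle to append."""

    def __format__(self, spec):
        if spec in PARTICLES:
            after_vowel, after_consonant = PARTICLES[spec]
            if has_final_consonant(self[-1]):
                return str(self) + after_consonant
            return str(self) + after_vowel
        # Any other spec falls back to ordinary string formatting.
        return super().__format__(spec)


fmt = "{subj:은} {obj:을} 먹었다."
print(fmt.format(subj=SketchNoun("나"), obj=SketchNoun("밥")))
# 나는 밥을 먹었다.
print(fmt.format(subj=SketchNoun("학생"), obj=SketchNoun("돈까스")))
# 학생은 돈까스를 먹었다.
```

Putting the choice in __format__ keeps the template free of any knowledge about the word that will fill the slot, which is what lets a translator write a single message.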
Working with Gettext
It also works with Gettext. Just use the korean.ext.gettext.patch_gettext() function:
msgid ""
msgstr ""
"Locale: ko_KR\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
msgid "I like a {0}."
msgstr "나는 {0:을} 좋아합니다."
msgid "banana"
msgstr "바나나"
msgid "game"
msgstr "게임"
>>> from babel.support import Translations
>>> from korean.ext.gettext import patch_gettext
>>> translations = Translations.load('i18n', 'ko_KR')
>>> translations = patch_gettext(translations)
>>> _ = translations.ugettext
>>> _(u'I like a {0}.').format(_(u'banana'))
나는 바나나를 좋아합니다.
>>> _(u'I like a {0}.').format(_(u'game'))
나는 게임을 좋아합니다.
Proofreading legacy text
If your text has already been written with naive particles such as “을(를)”, use the korean.l10n.proofread() function to get the correct particles:
>>> import korean
>>> korean.l10n.proofread(u'용사은(는) 검을(를) 획득했다.')
용사는 검을 획득했다.
>>> korean.l10n.proofread(u'집(으)로 가자.')
집으로 가자.
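The idea behind proofreading can be sketched with a regular expression. This is an illustrative sketch under stated assumptions, not the code of korean.l10n.proofread: sketch_proofread, NAIVE, and final_index are hypothetical names, only four naive notations are covered, and the character before the particle must be a precomposed Hangul syllable. Each notation maps to three forms because “(으)로” treats a preceding ㄹ like a vowel.

```python
import re

RIEUL = 8  # position of ㄹ among the 28 finals of the Unicode syllable block


def final_index(char):
    """Index of the final consonant of a Hangul syllable (0 means none)."""
    offset = ord(char) - 0xAC00  # 0xAC00 is "가"
    if not 0 <= offset < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul syllable: %r" % char)
    return offset % 28


# naive notation -> (after a vowel, after a consonant, after ㄹ)
NAIVE = {
    "은(는)": ("는", "은", "은"),
    "이(가)": ("가", "이", "이"),
    "을(를)": ("를", "을", "을"),
    "(으)로": ("로", "으로", "로"),
}

# One Hangul syllable followed by one of the naive notations.
PATTERN = re.compile(
    "([가-힣])(%s)" % "|".join(re.escape(key) for key in NAIVE)
)


def sketch_proofread(text):
    """Replaces each naive particle according to the preceding syllable."""
    def fix(match):
        previous, naive = match.group(1), match.group(2)
        after_vowel, after_consonant, after_rieul = NAIVE[naive]
        final = final_index(previous)
        if final == 0:
            particle = after_vowel
        elif final == RIEUL:
            particle = after_rieul
        else:
            particle = after_consonant
        return previous + particle
    return PATTERN.sub(fix, text)


print(sketch_proofread("용사은(는) 검을(를) 획득했다."))  # 용사는 검을 획득했다.
print(sketch_proofread("집(으)로 가자."))                # 집으로 가자.
print(sketch_proofread("서울(으)로 가자."))              # 서울로 가자.
```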
API
korean.morphology
license: BSD, see LICENSE for more details.
class korean.morphology.Morpheme(*forms)
    This class represents a morpheme (형태소) or allomorph (이형태). It can have one or more forms. The first form is the basic allomorph (기본형).
    Parameters: forms – the forms of the allomorph. The first form is the basic allomorph.
    basic()
        The basic form of the allomorph.
    classmethod get(key)
        Returns a pre-defined morpheme object for the given key.
    read()
        Every morpheme class should implement this method. It should render the morpheme as valid Korean text in Hangul.
    classmethod register(key, obj)
        Registers a pre-defined morpheme object under the given key.
class korean.morphology.Particle(after_vowel, after_consonant=None, after_rieul=None)
    A particle (조사) is a postposition in Korean. Some particles have different allomorphs, such as 을/를 and 이/가. The form is chosen by the phoneme that ends the preceding syllable: a vowel, a consonant, or Rieul (ㄹ).
class korean.morphology.Substantive(*forms)
    A class for a Korean substantive, called “체언” in Korean.
class korean.morphology.Noun(*forms)
    A class for a Korean noun, called “명사” in Korean.
    read()
        Reads a noun as Korean. The result will be Hangul.
        >>> Noun('레벨42').read()
        '레벨사십이'
class korean.morphology.NumberWord(number)
    A class for a Korean number word, called “수사” in Korean.
    read()
        Reads the number as Korean.
        >>> NumberWord(1234567890).read()
        '십이억삼천사백오십육만칠천팔백구십'
        >>> NumberWord(0).read()
        '영'
    classmethod read_phases(number)
        Reads the number as Korean, but separates the result at each power of 10,000.
        >>> NumberWord.read_phases(1234567890)
        ('십이억', '삼천사백오십육만', '칠천팔백구십')
        >>> NumberWord.read_phases(0)
        ('영',)
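The grouping that read_phases describes follows from how Korean names numbers: digits are read in groups of four, and each group takes a unit word (만, 억, 조, ...). The sketch below is an assumption-laden illustration, not the library's code. sketch_read_phases and read_group are hypothetical names, a leading 일 is always dropped before 십/백/천 but kept before a group unit (so 10000 reads as “일만” here, which may differ from what the library returns), and negative input is rejected because the divmod loop would otherwise never terminate.

```python
DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
PLACES = ["", "십", "백", "천"]            # places inside one four-digit group
GROUP_UNITS = ["", "만", "억", "조", "경"]  # one unit per power of 10,000


def read_group(n):
    """Reads 0 <= n < 10000, dropping a leading 일 before 십/백/천."""
    parts = []
    for place in range(3, -1, -1):
        digit = n // 10 ** place % 10
        if digit == 0:
            continue
        if digit == 1 and place > 0:
            parts.append(PLACES[place])
        else:
            parts.append(DIGITS[digit] + PLACES[place])
    return "".join(parts)


def sketch_read_phases(number):
    """Returns the reading of a non-negative integer, one item per group."""
    if number < 0:
        raise ValueError("negative numbers are not handled by this sketch")
    if number >= 10000 ** len(GROUP_UNITS):
        raise ValueError("number too large for this sketch")
    if number == 0:
        return ("영",)
    phases = []
    unit = 0
    while number:
        number, group = divmod(number, 10000)
        if group:  # an all-zero group is skipped entirely
            phases.append(read_group(group) + GROUP_UNITS[unit])
        unit += 1
    return tuple(reversed(phases))


print(sketch_read_phases(1234567890))
# ('십이억', '삼천사백오십육만', '칠천팔백구십')
print("".join(sketch_read_phases(1234567890)))
# 십이억삼천사백오십육만칠천팔백구십
```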
korean.l10n
Helpers for localization to Korean.
license: BSD, see LICENSE for more details.
class korean.l10n.Proofreading(token_types)
    A function-like class. Its __call__() replaces naive particles with the correct ones. First, it finds naive particles such as “을(를)” or “(으)로”. Then it checks the character that precedes the particle and substitutes the correct form.
    Parameters: token_types – the specific types to recognize as tokens.
    parse(text)
        Tokenizes the given text into unicode text or Particle tokens.
        Parameters: text – the string that has been written with naive particles.
korean.l10n.proofread = <korean.l10n.Proofreading object>
    The default Proofreading object. It tokenizes unicode and korean.Particle. Use it like a function.
class korean.l10n.Template
    The Template object extends unicode and overrides the format() method. It can format a particle format spec without explicit Noun or NumberWord arguments.
    Basically this example:
    >>> import korean
    >>> korean.l10n.Template('{0:을} 좋아합니다.').format('향수')
    '향수를 좋아합니다.'
    Is equivalent to the following:
    >>> import korean
    >>> '{0:을} 좋아합니다.'.format(korean.Noun('향수'))
    '향수를 좋아합니다.'
korean.ext.gettext
Gettext is an internationalization and localization system commonly used for writing multilingual programs on Unix-like operating systems. This module contains utilities to integrate Korean and the Gettext system. It also works well with Babel.
license: BSD, see LICENSE for more details.
korean.ext.gettext.patch_gettext(translations)
    Patches a Gettext translations object to wrap each result with korean.l10n.Template, so that the result can work with a particle format spec.
    For example, here’s a Gettext catalog for ko_KR:
    msgid "{0} appears."
    msgstr "{0:이} 나타났다."
    msgid "John"
    msgstr "존"
    msgid "Christina"
    msgstr "크리스티나"
    You can use a particle format spec in Gettext messages after the translations object is patched:
    >>> translations = patch_gettext(translations)
    >>> _ = translations.ugettext
    >>> _('{0} appears.').format(_('John'))
    '존이 나타났다.'
    >>> _('{0} appears.').format(_('Christina'))
    '크리스티나가 나타났다.'
    Parameters: translations – the Gettext translations object to be patched, which should refer to the catalog for ko_KR.
korean.ext.jinja2
Jinja2 is one of the most used template engines for Python. This module contains Jinja2 template engine extensions to make korean easy to use.
New in version 0.1.5.
Changed in version 0.1.6: Moved from korean.l10n.jinja2ext to korean.ext.jinja2.
license: BSD, see LICENSE for more details.
class korean.ext.jinja2.ProofreadingExtension(environment)
    A Jinja2 extension which registers the proofread filter and the proofread block:
    <h1>ProofreadingExtension Usage</h1>
    <h2>Single filter</h2>
    {{ (name ~ '은(는) ' ~ obj ~ '을(를) 획득했다.')|proofread }}
    <h2>Filter chaining</h2>
    {{ '%s은(는) %s을(를) 획득했다.'|format(name, obj)|proofread }}
    <h2><code>proofread</code> block</h2>
    {% proofread %}
      {{ name }}은(는) {{ obj }}을(를) 획득했다.
    {% endproofread %}
    <h2>Conditional <code>proofread</code> block</h2>
    {% proofread locale.startswith('ko') %}
      {{ name }}은(는) {{ obj }}을(를) 획득했다.
    {% endproofread %}
    The import name is korean.ext.jinja2.proofread. Just add it to your Jinja2 environment with the following code:
    from jinja2 import Environment
    jinja_env = Environment(extensions=['korean.ext.jinja2.proofread'])
    New in version 0.1.5.
    Changed in version 0.1.6: Added the enabled argument to {% proofread %}.
korean.ext.jinja2.proofread
    alias of ProofreadingExtension
korean.hangul
Processing strings written in Hangul. All of the code here is based on hangul.py by Hye-Shik Chang (2003).
license: BSD, see LICENSE for more details.
korean.hangul.char_offset(char)
    Returns the Hangul character offset from “가”.
korean.hangul.is_hangul(char)
    Checks if the given character is written in Hangul.
korean.hangul.is_vowel(char)
    Checks if the given character is a vowel of Hangul.
korean.hangul.is_consonant(char)
    Checks if the given character is a consonant of Hangul.
korean.hangul.is_initial(char)
    Checks if the given character is an initial consonant of Hangul.
korean.hangul.is_final(char)
    Checks if the given character is a final consonant of Hangul. The final consonants include joined multiple consonants and the empty character.
korean.hangul.get_initial(char)
    Returns the initial consonant of the given character.
korean.hangul.get_vowel(char)
    Returns the vowel of the given character.
korean.hangul.get_final(char)
    Returns the final consonant of the given character.
korean.hangul.split_char(char)
    Splits the given character into a tuple whose first item is the initial consonant, the second the vowel, and the third the final.
korean.hangul.join_char(splitted)
    Joins a tuple in the form (initial, vowel, final) into a Hangul character.
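Splitting and joining a syllable come down to arithmetic on Unicode code points, because the Hangul syllable block is laid out as 19 initials × 21 vowels × 28 finals starting at “가” (U+AC00). The sketch below shows that arithmetic only; it is not the code of korean.hangul. The names sketch_split_char and sketch_join_char are hypothetical, and representing a missing final as an empty string is an assumption of this sketch.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"      # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"    # 21 vowels
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

BASE = 0xAC00  # code point of "가"


def sketch_split_char(char):
    """Splits a precomposed syllable into (initial, vowel, final)."""
    offset = ord(char) - BASE
    if not 0 <= offset < len(INITIALS) * len(VOWELS) * len(FINALS):
        raise ValueError("not a precomposed Hangul syllable: %r" % char)
    initial, rest = divmod(offset, len(VOWELS) * len(FINALS))
    vowel, final = divmod(rest, len(FINALS))
    return (INITIALS[initial], VOWELS[vowel], FINALS[final])


def sketch_join_char(splitted):
    """Joins (initial, vowel, final) back into one syllable."""
    initial, vowel, final = splitted
    offset = (
        (INITIALS.index(initial) * len(VOWELS) + VOWELS.index(vowel))
        * len(FINALS)
        + FINALS.index(final)
    )
    return chr(BASE + offset)


print(sketch_split_char("한"))                     # ('ㅎ', 'ㅏ', 'ㄴ')
print(sketch_join_char(sketch_split_char("글")))   # 글
```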
Installation
Install from PyPI with the easy_install or pip command:
$ easy_install korean
$ pip install korean
or check out the development version:
$ git clone git://github.com/sublee/korean.git
Changelog
Version 0.1.7
- Fixes an infinite loop bug on negative numbers.
- Adds a template tag and a filter for Django. See korean.ext.django.templatetags.korean.
Version 0.1.6
- Moves korean.l10n.jinja2ext to korean.ext.jinja2.
- Renames {% autoproofread %} to {% proofread %}.
- Moves korean.l10n.patch_gettext() to korean.ext.gettext.patch_gettext().
- Adds a condition argument to enable the autoproofread Jinja2 block.
- Fixes PEP8 errors except E301.
Version 0.1.5
Released on Jan 30th 2013.
- Supports Python 3.
- Adds korean.l10n.jinja2ext.ProofreadingExtension for the Jinja2 template engine.
Version 0.1.3
Released on Aug 15th 2012.
- korean.l10n.Proofreading supports a wider variety of naive particle forms.
Version 0.1
First public preview release.
Licensing and Author
This project is licensed under BSD, so feel free to use and modify it as long as you respect the license. See LICENSE for the details.
I’m Heungsub Lee. Any questions or patches are welcome.