An assortment of preprocessing steps and features for natural language processing

Ciao...†

Swooping in...†

The other day I read the book 前処理大全 (a Japanese book on data preprocessing) and felt like writing something of my own, so in this post I'll list preprocessing steps for natural language processing, plus ways of building features while I'm at it, together with Python code. You don't necessarily have to do all of them; pick whatever fits your purpose.

Preprocessing

Removing extra newlines, spaces, and the like

with open(path) as fd:
    for line in fd:
        line = line.rstrip()

Lowercasing alphabetic characters

text = text.lower()

Normalization (half-width/full-width conversion and more)

import neologdn

neologdn.normalize('ﾊﾝｶｸｶﾅ')
# => 'ハンカクカナ'
neologdn.normalize('全角記号！？＠＃')
# => '全角記号!?@#'
neologdn.normalize('全角記号例外「・」')
# => '全角記号例外「・」'
neologdn.normalize('長音短縮ウェーーーーイ')
# => '長音短縮ウェーイ'
neologdn.normalize('チルダ削除~∼∾〜〰~')
# => 'チルダ削除'
neologdn.normalize('いろんなハイフン˗֊‐‑‒–⁃⁻₋−')
# => 'いろんなハイフン-'
neologdn.normalize('　　　ＰＲＭＬ　　副　読　本　　　')
# => 'PRML副読本'
neologdn.normalize(' Natural Language Processing ')
# => 'Natural Language Processing'
neologdn.normalize('かわいいいいいいいいい', repeat=6)
# => 'かわいいいいいい'

GitHub - ikegami-yukino/neologdn: Japanese text normalizer for mecab-neologd

Tokenization

MeCab

import MeCab

mecab_wakati = MeCab.Tagger('-Owakati')
words = mecab_wakati.parse(text).strip().split()

SentencePiece

import sentencepiece as spm

spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=test_model --vocab_size=1000')

sp = spm.SentencePieceProcessor()
sp.Load('test_model.model')  # model_prefix=test_model writes the model to the current directory
sp.Encode('This is a test')

sentencepiece/README.md at master · google/sentencepiece · GitHub

NLTK

from nltk.tokenize import sent_tokenize, word_tokenize

text = "I'm crying. Is this a pen?"
sentences = sent_tokenize(text)
words = list(map(word_tokenize, sentences))

Removing emoji

import emoji

# keep only the words that are NOT emoji
# (emoji.UNICODE_EMOJI exists in emoji < 2.0; newer releases expose emoji.EMOJI_DATA instead)
words = list(filter(lambda x: x not in emoji.UNICODE_EMOJI, words))

Decoding/encoding emoji

This is for processing text from Slack and the like.

>>> import emoji
>>> print(emoji.emojize('Python is :thumbs_up:'))
Python is 👍
>>> print(emoji.emojize('Python is :thumbsup:', use_aliases=True))
Python is 👍
>>> print(emoji.demojize('Python is 👍'))
Python is :thumbs_up:

emoji · PyPI

Expanding contractions

This turns don't into do not, won't into will not, and so on.

import re

shortened = {
    '\'m': ' am',
    '\'re': ' are',
    'don\'t': 'do not',
    'doesn\'t': 'does not',
    'didn\'t': 'did not',
    'won\'t': 'will not',
    'wanna': 'want to',
    'gonna': 'going to',
    'gotta': 'got to',
    'hafta': 'have to',
    'needa': 'need to',
    'outta': 'out of',
    'kinda': 'kind of',
    'sorta': 'sort of',
    'lotta': 'lot of',
    'lemme': 'let me',
    'gimme': 'give me',
    'getcha': 'get you',
    'gotcha': 'got you',
    'letcha': 'let you',
    'betcha': 'bet you',
    'shoulda': 'should have',
    'coulda': 'could have',
    'woulda': 'would have',
    'musta': 'must have',
    'mighta': 'might have',
    'dunno': 'do not know',
}

shortened_re = re.compile('(?:' + '|'.join(map(lambda x: '\\b' + x + '\\b', shortened.keys())) + ')')

sentence = 'I\'m'

sentence = shortened_re.sub(lambda x: shortened[x.group(0)], sentence)

Hiroshi Manabe on Twitter: "@_yukinoi こんな感じでどう? https://t.co/eXj35AOPdl" / Twitter

Removing HTML tags

import nltk

# Note: clean_html was removed in NLTK 3.x and now raises NotImplementedError,
# pointing users to BeautifulSoup instead (see the sketch below).
raw = nltk.clean_html(html)
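
Accordingly, a minimal alternative sketch using BeautifulSoup (assuming the beautifulsoup4 package is installed):

from bs4 import BeautifulSoup

raw = BeautifulSoup(html, 'html.parser').get_text()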

Removing stopwords

Japanese

Filter out stopwords using a list such as the SlothLib stopword list (a download sketch follows the snippet below)

with open('Japanese.txt') as fd:
    stop_words = frozenset(fd.read().splitlines())

words = list(filter(lambda x: x not in stop_words, words))
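
If you don't have Japanese.txt locally, it is usually fetched from the commonly cited SlothLib SVN URL; here is a sketch, with the caveat that the hosting has moved over the years, so treat the URL itself as an assumption:

import urllib.request

SLOTHLIB_URL = ('http://svn.sourceforge.jp/svnroot/slothlib/CSharp/'
                'Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt')
with urllib.request.urlopen(SLOTHLIB_URL) as res:
    stop_words = frozenset(line.decode('utf-8').strip() for line in res if line.strip())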

English

from nltk.corpus import stopwords

stop_words = frozenset(stopwords.words('english'))
words = list(filter(lambda x: x not in stop_words, words))

Extracting only specific parts of speech

import MeCab

CONTENT_WORD_POS = ('名詞', '動詞', '形容詞', '副詞')

tagger = MeCab.Tagger()
words = []
for line in tagger.parse(sentence).splitlines()[:-1]:  # the last line is just 'EOS'
    surface, feature = line.split('\t')
    if feature.startswith(CONTENT_WORD_POS) and ',非自立,' not in feature:  # skip non-independent words
        words.append(surface)

Stemming

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = list(map(stemmer.stem, words))

Lemmatization

English

from nltk.stem.wordnet import WordNetLemmatizer

wnl = WordNetLemmatizer()
words = list(map(wnl.lemmatize, words))

Japanese

import MeCab

tagger = MeCab.Tagger()
lemmas = []
for line in tagger.parse('行った').splitlines()[:-1]:
    surface, feature = line.split('\t')
    if feature.split(',')[6] != '*':  # field 6 of the IPAdic feature string is the base form
        lemmas.append(feature.split(',')[6])  # e.g. '行く' for '行った'

Fixing typos

The following fixes typos in English text

import pytypo

words = list(map(pytypo.correct, words))

Adding <BOS> and <EOS>

Add BOS (Beginning of Sentence) to mark the start of a sentence and EOS (End of Sentence) to mark the end

with open(path) as fd:
    for line in fd:
        words = ['<BOS>'] + line.split() + ['<EOS>']

Turning text into features

Converting words to IDs

from collections import defaultdict

word_to_id = defaultdict(lambda: len(word_to_id))
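
A quick usage sketch (the words here are arbitrary examples): every previously unseen word gets the next consecutive ID, and repeated words reuse theirs.

ids = [word_to_id[word] for word in ['cat', 'say', 'meow', 'cat']]
# => [0, 1, 2, 0]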

n-gram

Libraries such as scikit-learn will do this for you, so you rarely need to write your own n-gram code (a CountVectorizer sketch follows the snippet below)

def to_ngrams(item, max_n):
    return [to_ngram(item, n) for n in range(2, max_n + 1)]

def to_ngram(item, n):
    return [item[i:i+n] for i in range(len(item)-n+1)]
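
For reference, a minimal sketch of letting scikit-learn build word n-grams through CountVectorizer's ngram_range parameter (the two sentences are arbitrary examples):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(['this is a pen', 'this is not a pen'])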

Bag-of-Words

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)

6.2. Feature extraction — scikit-learn 1.1.1 documentation

TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

6.2. Feature extraction — scikit-learn 1.1.1 documentation

Presence or absence of emoji

import emoji

ALL_EMOJI = set(emoji.EMOJI_UNICODE.values())


def has_emoji(text):
    return any(x in text for x in ALL_EMOJI)

Padding

Packing variable-length data into a fixed-length matrix

from keras.preprocessing import sequence
X = sequence.pad_sequences(X, maxlen=MAX_LEN)
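
A tiny illustrative call; by default pad_sequences pads and truncates at the front of each sequence:

sequence.pad_sequences([[1, 2], [1, 2, 3, 4]], maxlen=3)
# => array([[0, 1, 2], [2, 3, 4]], dtype=int32)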

Word embeddings (distributed representations)

word2vec

import os

from gensim.models.word2vec import Word2Vec, PathLineSentences

sentences = PathLineSentences(os.path.join(os.getcwd(), 'corpus'))
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)  # in gensim 4+ the parameter is vector_size, not size
model.wv['computer']

models.word2vec – Word2vec embeddings — gensim

FastText

from gensim.models import FastText

sentences = [['cat', 'say', 'meow'], ['dog', 'say', 'woof']]

model = FastText(sentences)
say_vector = model.wv['say']  # get vector for word
of_vector = model.wv['of']  # get vector for out-of-vocab word

models.fasttext – FastText model — gensim

Document embeddings (distributed representations)

from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

docs = []
docs.append(TaggedDocument(words=['cat', 'say', 'meow'], tags=['cat']))
docs.append(TaggedDocument(words=['dog', 'say', 'woof'], tags=['dog']))

model = Doc2Vec(documents=docs, min_count=1)

print(model.docvecs['cat'])

models.doc2vec – Doc2vec paragraph embeddings — gensim

Maximal substrings

import pykwic

kwic = pykwic.EKwic()
kwic.add_line('うなぎうなうなうなぎなう')
kwic.add_line('うらにはにわにわとりがいる')
kwic.build()

maxsubsts = []
for maxsubst in kwic.maxmal_substring():
    maxsubsts.append(maxsubst[0])

Abstraction by character type

These are features used by TinySegmenter and the like

import re

patterns = (
    (re.compile('[一二三四五六七八九十百千万億兆]'), 'M'),
    (re.compile('[一-龠々〆ヵヶ]'), 'H'),
    (re.compile('[ぁ-ん]'), 'I'),
    (re.compile('[ァ-ヴーｱ-ﾝﾞｰ]'), 'K'),
    (re.compile('[a-zA-Zａ-ｚＡ-Ｚ]'), 'A'),
    (re.compile('[0-9０-９]'), 'N')
)
features = []
for char in sentence:
    for pattern, val in patterns:
        if pattern.match(char):
            features.append(val)
            break
    else:
        features.append('O')  # no pattern matched ("other" character type, as in TinySegmenter)

Word shape (uppercase, title case, digits, etc.)

word = 'Melbourne'
feature = []
feature.append('word.lower=%s' % word.lower())
feature.append('word.isupper=%s' % word.isupper())
feature.append('word.istitle=%s' % word.istitle())
feature.append('word.isdigit=%s' % word.isdigit())

Word suffixes

Sometimes used for named entity recognition or for tasks such as inferring gender from a name

word = 'Melbourne'
feature = []
feature.append('word[-3:]=%s' % word[-3:])
feature.append('word[-2:]=%s' % word[-2:])

Paraphrasing with an ontology

Use an ontology such as WordNet to replace words with synonyms or abstract them to hypernyms (broader concepts)

import MeCab
from nltk.corpus import wordnet as wn

def extract_noun(sentence):
    tagger = MeCab.Tagger()
    nouns = []
    for line in tagger.parse(sentence).splitlines()[:-1]:
        surface, feature = line.split('\t')
        if feature.startswith('名詞'):
            nouns.append(surface)
    return nouns


sentence = '猫が鳴いてる'
nouns = extract_noun(sentence)
sentences = []
for noun in nouns:
    for synset in wn.synsets(noun, lang='jpn', pos=wn.NOUN):
        for synonym in wn.synset(synset.name()).lemma_names('jpn'):
            sentences.append(sentence.replace(noun, synonym))

        for hypernym in synset.hypernyms():
            for hypernym_synset in wn.synset(hypernym.name()).lemma_names('jpn'):
                sentences.append(sentence.replace(noun, hypernym_synset))

NLTK :: Sample usage for wordnet

Paraphrasing with similar words from word embeddings

from gensim.models.word2vec import Word2Vec

model = Word2Vec.load('w2v.model')
word = '雨'
sentence = '雨に唄えば'
sentences = []
for (similar_word, score) in model.wv.most_similar([model.wv[word]], [], 5):
    if word == similar_word:
        continue
    sentences.append(sentence.replace(word, similar_word))

models.word2vec – Word2vec embeddings — gensim

Abstracting documents with topic models or clustering

These aren't NLP tasks per se, but I've sometimes used features like these

Latent Dirichlet Allocation

import joblib

lda = joblib.load('lda.model')
lda.transform(X).argmax(axis=1)
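
If no saved model exists yet, a minimal sketch of fitting one on the Bag-of-Words matrix X from above (n_components=5 and random_state=0 are arbitrary example values):

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=5, random_state=0)
topic_ids = lda.fit_transform(X).argmax(axis=1)  # most probable topic per document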

sklearn.decomposition.LatentDirichletAllocation — scikit-learn 1.1.1 documentation

KMeans

import joblib

kmeans = joblib.load('kmeans.model')
kmeans.predict(X)

sklearn.cluster.KMeans — scikit-learn 1.1.1 documentation

Distance between strings

Path similarity over an ontology (here, WordNet)

Roughly speaking, a measure of how similar two concepts are.

"""実行するにはあらかじめ以下のコードを実行すること
import nltk

nltk.download("wordnet")
nltk.download("omw")
"""
import itertools

from nltk.corpus import wordnet as wn


def synsets_path_similarity(synsets1, synsets2):
    results = []
    for (synset1, synset2) in itertools.product(synsets1, synsets2):
        if synset1 == synset2:
            return 1
        similarity = synset1.path_similarity(synset2)
        if similarity is not None:  # path_similarity returns None when no path exists
            results.append(similarity)
    return max(results, default=0)


synsets_path_similarity(wn.synsets("犬", lang="jpn"), wn.synsets("猫", lang="jpn"))

NLTK :: Sample usage for wordnet

word2vec

from gensim.models.word2vec import Word2Vec

model = Word2Vec.load('w2v.model')
model.wv.similarity(word1, word2)

models.word2vec – Word2vec embeddings — gensim

FastText

from gensim.models import FastText

model = FastText.load('fasttext.model')
model.wv.similarity(word1, word2)

models.fasttext – FastText model — gensim

Levenshtein edit distance, Hamming distance, and the like

import distance

distance.levenshtein('lenvestein', 'levenshtein')
distance.hamming('hamming', 'hamning')

GitHub - doukremt/distance: Levenshtein and Hamming distance computation

String distance based on pronunciation

import fuzzy
soundex = fuzzy.Soundex(4)
soundex('fuzzy')
# => 'F200'
dmeta = fuzzy.DMetaphone()
dmeta('fuzzy')
# => ['FS', None]
fuzzy.nysiis('fuzzy')
# => 'FASY'

# then measure the distance between these codes with, e.g., Levenshtein distance
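
As one possible sketch, compare the phonetic codes of two spellings using the distance package shown earlier:

import distance

distance.levenshtein(soundex('fuzzy'), soundex('fuzzi'))
# => 0 when both spellings map to the same Soundex code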

Distance between documents

Doc2Vec

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load('doc2vec.model')
model.docvecs.similarity(doc1, doc2)
model.docvecs.similarity_unseen_docs(model, doc1, doc2)

models.doc2vec – Doc2vec paragraph embeddings — gensim

Normalized compression distance

import zlib

def ncd(x, y):
    if x == y:
        return 0
    x, y = x.encode(), y.encode()  # zlib.compress works on bytes
    z_x = len(zlib.compress(x))
    z_y = len(zlib.compress(y))
    z_xy = len(zlib.compress(x + y))
    return float(z_xy - min(z_x, z_y)) / max(z_x, z_y)

if __name__ == '__main__':
    query = 'Hello, world!'
    results = ['Hello, Python world!',
               'Goodbye, Python world!',
               'world record']

    for r in results:
        print(r, ncd(query, r))

Tech Tips: Normalized compression distanceとNormalized Google distance

Lexical Density

from __future__ import division
import MeCab

CONTENT_WORD_POS = ('名詞', '動詞', '形容詞', '副詞')


def compute_lexical_density(sentence):
    t = MeCab.Tagger()
    n = t.parseToNode(sentence)

    content_words = 0
    total = 0
    while n:
        if not n.feature.startswith('BOS/EOS'):
            if n.feature.startswith(CONTENT_WORD_POS) and ',非自立,' not in n.feature:
                content_words += 1
            total += 1
        n = n.next
    return content_words / total
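
A usage sketch ('猫が鳴いてる' is just an example sentence; the exact value depends on the MeCab dictionary):

print(compute_lexical_density('猫が鳴いてる'))  # proportion of content words among all tokens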

日本語テキストのLexical density測って遊んでみた - Debug me