Preparing the training data
Spam mail (spam)
Normal mail (ham)
enron1.zip / enron2.zip / enron3.zip
https://github.com/WolffRuoff/Enron-Ham-or-Spam-Filter
Ham mail sample
Subject: unify close schedule the following is the close schedule for this coming month ( year - end . ) please keep in the mind the following key times . . . . unify to sitara bridge back 1 : 45 p . m . thursday , dec 30 th ( all errors must be clear by this time )

Spam mail sample
Subject: dobmeos with hgh my energy level has gone up ! stukm introducing doctor - formulated hgh human growth hormone - also called hgh is referred to in medical science as the master hormone . it is very plentiful
Dataset structure for NaiveBayesClassifier
[ (featureset, label), (featureset, label), (featureset, label), ... ]
☞ featureset = { feature_name : feature_value, feature_name : feature_value, feature_name : feature_value , ... }
☞ feature_name : a word
☞ feature_value : True ( ※ bool, int, or str are all allowed, but since the value makes no difference here it is simply fixed to True )
☞ label ⇒ "ham" or "spam"
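As a concrete illustration, a tiny hand-written dataset in this structure might look like the following (the words here are made up for the example, not taken from the Enron data):

dataset = [
    ({"subject": True, "meeting": True, "schedule": True}, "ham"),
    ({"hgh": True, "hormone": True, "energy": True}, "spam"),
]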
Read the contents of each file into a list [ ]
import os

def read_in(folder):
    # list every file in the folder
    files = os.listdir(folder)
    a_list = []
    for a_file in files:
        # skip hidden files such as .DS_Store
        if not a_file.startswith("."):
            with open(folder + a_file, "r", encoding="utf-8") as f:
                a_list.append(f.read())
    return a_list

spam_list = read_in("./enron1/spam/")
ham_list = read_in("./enron1/ham/")
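As a quick sanity check after loading (a minimal sketch; the actual counts depend on which enron archive was unpacked):

# number of mails read from each folder
print(len(spam_list), len(ham_list))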
Build the [ (email, label), ... ] list (each email text is converted to a featureset in the next step)
import random

# attach the label to every mail text
all_mails = [(email, 'spam') for email in spam_list]
all_mails += [(email, 'ham') for email in ham_list]

# shuffle with a fixed seed so the train/test split below is reproducible
random.seed(42)
random.shuffle(all_mails)
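To confirm the merge and shuffle, one can inspect the label distribution (shuffling changes the order, not the counts):

from collections import Counter
# counts of 'spam' vs. 'ham' in the combined list
print(Counter(label for (_, label) in all_mails))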
Tokenize the email of each pair, then build the dict { word : True, ... } featureset
import nltk
from nltk import word_tokenize

nltk.download('punkt')

def get_features(text):
    featureset = {}
    # lowercase the text so capitalization does not create duplicate features
    word_list = [word for word in word_tokenize(text.lower())]
    for word in word_list:
        featureset[word] = True
    return featureset

all_features = [(get_features(email), label) for (email, label) in all_mails]
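For example, calling get_features on a short illustrative string (not an actual mail from the dataset) produces a dict of lowercased tokens mapped to True:

# word_tokenize also splits off punctuation such as ':'
print(get_features("Subject: meeting schedule"))
# {'subject': True, ':': True, 'meeting': True, 'schedule': True}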
Training with the Naïve Bayes classifier
For an explanation of the Naïve Bayes classifier, see: https://eldercoder.tistory.com/122
Train : test data ratio = 80% : 20%
from nltk import NaiveBayesClassifier, classify

def train(features, proportion):
    # use the first `proportion` of the (already shuffled) data for training
    train_size = int(len(features) * proportion)
    train_set = features[:train_size]
    test_set = features[train_size:]
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier

train_set, test_set, classifier = train(all_features, 0.8)
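After training, a quick check that the split and the labels look right (sketch only; exact sizes depend on the dataset):

# the labels the classifier was trained on
print(classifier.labels())   # e.g. ['spam', 'ham']
# sizes of the 80% / 20% splits
print(len(train_set), len(test_set))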
Evaluating the classifier on the 20% test data
def evaluate(train_set, test_set, classifier):
    # classify.accuracy runs the classifier over a labeled set
    print(f"Accuracy on the training set = {classify.accuracy(classifier, train_set)}")
    print(f"Accuracy of the test set = {classify.accuracy(classifier, test_set)}")
    # the 50 features that most strongly separate ham from spam
    classifier.show_most_informative_features(50)

evaluate(train_set, test_set, classifier)
Accuracy on the training set = 0.9608411892675852
Accuracy of the test set = 0.9420289855072463
Reusing the classifier (save, load)
import pickle

# save to a file
f = open('my_classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()

# load in the classification program
f = open('my_classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
To classify an actual mail, build the dataset structure above from the entire mail body and feed it to the loaded classifier, as sketched below.
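A minimal sketch of that last step, reusing get_features from above; new_mail is a hypothetical incoming message, not from the source:

# classify a new (hypothetical) mail with the loaded classifier
new_mail = "Subject: limited offer ! buy cheap hgh hormone now"
print(classifier.classify(get_features(new_mail)))   # -> 'spam' or 'ham'
# per-label probabilities, if a confidence score is needed
dist = classifier.prob_classify(get_features(new_mail))
print(dist.prob('spam'), dist.prob('ham'))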