[Natural Language Processing] 2-3. Text Representation (1) Bag of Words

Bag of Words

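Definition

The simplest way to represent (numerically encode) text.
It builds a vector from the frequencies of the words a document contains.
It is the most basic and primitive form of text representation.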

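How to build it

Assign a unique index to every word that appears in any document; this vocabulary is the bag of words.
For each document, store the number of times each word occurs (its count) at the position of that word's index.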
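
The two steps above can be sketched in plain Python without any library. This is a minimal illustration, not gensim's implementation; the function names `build_vocabulary` and `to_count_vector` and the toy documents are assumptions made for this example.

```python
from collections import Counter

def build_vocabulary(docs: list[list[str]]) -> dict[str, int]:
    """Step 1: give every distinct word a unique index (first-seen order)."""
    vocab: dict[str, int] = {}
    for doc in docs:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_count_vector(doc: list[str], vocab: dict[str, int]) -> list[int]:
    """Step 2: put each word's count at the position of its index."""
    vector = [0] * len(vocab)
    for word, count in Counter(doc).items():
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] = count
    return vector

# Hypothetical toy documents, already tokenized
docs = [["the", "cat", "sat"], ["the", "cat", "saw", "the", "dog"]]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)    # {'the': 0, 'cat': 1, 'sat': 2, 'saw': 3, 'dog': 4}
print(vectors)  # [[1, 1, 1, 0, 0], [2, 1, 0, 1, 1]]
```

Every vector has the same length as the vocabulary, which is why the dimension grows with the number of distinct words.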

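Pros and cons

Pros: simple and easy to implement, and useful for grasping the topic of a document.
Cons: word order is ignored entirely, and as the vocabulary grows the vector dimension grows with it, so the vectors tend to become sparse.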
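
Both drawbacks are easy to see in a few lines. The sketch below uses two made-up sentences and an assumed vocabulary size of 10,000, chosen only for illustration.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
a = "dog bites man".split()
b = "man bites dog".split()

# As bags (word -> count), the two sentences are indistinguishable
same_bag = Counter(a) == Counter(b)
print(same_bag)  # True

# Sparsity: a short document touches only a few of the vocabulary's dimensions
vocabulary_size = 10_000       # assumed vocabulary size for illustration
nonzero = len(Counter(a))      # 3 distinct words in the document
zero_fraction = 1 - nonzero / vocabulary_size
print(f"{zero_fraction:.4f}")  # 0.9997
```

The two sentences mean opposite things, yet their BoW representations are identical, and almost every entry of the full-length vector would be zero.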

Building a Bag of Words

Installation

Install the gensim library.
gensim is an open-source Python library for natural language processing, designed mainly for efficient topic modeling and word embeddings.

pip install gensim

Practice data

We use gensim's sample data.
The text corpus consists of nine sentences.

gensim - example data

text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

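Preprocessing

The preprocessing rules used here are as follows.

(1) Tokenizing: split each sentence into tokens on white space.
(2) Stop-word removal: remove the stop-words (for, a, of, the, and, to, in) from the sentences.
(3) Text normalization: convert the words in each sentence to lowercase.
(4) Vocabulary: count how often each word occurs and keep only the words that occur at least twice across all sentences.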

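# tokenizer
class WhitespaceTokenizer:
    def tokenize(self, text: str) -> list[str]:
        # Reject non-string input explicitly instead of failing on an unbound variable
        if not isinstance(text, str):
            raise TypeError("text must be a str")
        # split() with no argument splits on any run of whitespace and drops empty tokens
        return text.split()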

# Text Cleaner
class TextCleaner:
    def __init__(self):
        # Set of stop-words to remove
        self.stopwords = set('for a of the and to in'.split(' '))
    def clean_text(self, words:list[str]) -> list[str]:
        # Lowercase each token and drop it if it is a stop-word
        words = [word.lower() for word in words if word.lower() not in self.stopwords]
        return words

# filter by frequency
from collections import defaultdict

class FilterByFrequency:
    def __init__(self):
        # Word -> number of occurrences across the whole corpus
        self.frequency_dict = defaultdict(int)
    def make_filter(self, docs:list[list[str]]) -> None:
        # Count word frequencies over all documents
        for text in docs:
            for token in text:
                self.frequency_dict[token] += 1
    def filter(self, words:list[str], threshold:int=1) -> list[str]:
        # Keep only words that occur more than `threshold` times (default: more than once).
        # .get() avoids inserting unseen words into the defaultdict as a side effect.
        filtered_words = [token for token in words if self.frequency_dict.get(token, 0) > threshold]
        return filtered_words

# (1) Tokenizing: split on white space
tokenizer = WhitespaceTokenizer()
tokenized_docs = [tokenizer.tokenize(doc) for doc in text_corpus]
# (2) Text cleaning: lowercase + stop-word removal
text_cleaner = TextCleaner()
cleaned_docs = [text_cleaner.clean_text(words) for words in tokenized_docs]
# (3) Frequency-based filtering: drop words that occur only once
filter = FilterByFrequency()
filter.make_filter(cleaned_docs)
processed_corpus = [filter.filter(doc, 1) for doc in cleaned_docs]

print(processed_corpus)
# Result
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

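Building the dictionary

Assign a unique index to each word that survived preprocessing.
Use the Dictionary class from gensim.corpora to build a dictionary of index-word pairs.
Once the dictionary is built, the basic preparation for BoW is complete.
corpora is the gensim module for working with corpora and dictionaries.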

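# bow
from gensim import corpora

class BagOfWords:
    def __init__(self):
        # gensim Dictionary (token <-> id mapping); None until create_dictionary() is called
        self.dictionary: corpora.Dictionary | None = None
    def create_dictionary(self, docs:list[list[str]]) -> None:
        self.dictionary = corpora.Dictionary(docs)
    def represent_bow(self, docs:list[list[str]]) -> list[list[tuple[int, int]]]:
        if self.dictionary is None:
            raise ValueError("call create_dictionary() first")
        # Each document becomes a sparse list of (token_id, count) pairs
        bow_corpus = [self.dictionary.doc2bow(text) for text in docs]
        return bow_corpus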

# (4) Create the BoW
bow_model = BagOfWords()
bow_model.create_dictionary(processed_corpus)
bow = bow_model.represent_bow(processed_corpus)

print(bow_model.dictionary.token2id)
# Output
{'computer': 0, 'human': 1, 'interface': 2,
 'response': 3, 'survey': 4, 'system': 5,
 'time': 6, 'user': 7, 'eps': 8, 'trees': 9,
 'graph': 10, 'minors': 11}
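
Because BoW output refers to words only by index, it helps to be able to go back from an index to its word. The sketch below hard-codes the token2id mapping printed above and inverts it in plain Python; the `decode` helper and the sample list of (id, count) pairs are assumptions for illustration, not part of gensim.

```python
# token -> id mapping copied from the output above
token2id = {'computer': 0, 'human': 1, 'interface': 2,
            'response': 3, 'survey': 4, 'system': 5,
            'time': 6, 'user': 7, 'eps': 8, 'trees': 9,
            'graph': 10, 'minors': 11}

# Invert it to look up a word by its index
id2token = {idx: token for token, idx in token2id.items()}

def decode(bow: list[tuple[int, int]]) -> dict[str, int]:
    """Turn (id, count) pairs back into a readable word -> count mapping."""
    return {id2token[idx]: count for idx, count in bow}

# Hypothetical sample of (id, count) pairs
decoded = decode([(1, 1), (5, 2), (8, 1)])
print(decoded)  # {'human': 1, 'system': 2, 'eps': 1}
```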

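Representing the sample sentences as BoW

Using the dictionary we built, let's turn the sample sentences given at the start into BoW vectors.
In the fourth sentence, system appears twice, so its entry is (5, 2).
Words with a frequency of 0 are not shown.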

# (4) Create the BoW
bow_model = BagOfWords()
bow_model.create_dictionary(processed_corpus)
bow = bow_model.represent_bow(processed_corpus)

print("===== corpus(words) =====")
print(processed_corpus)
print("===== BoW =====")
print(bow)

# Output
===== corpus(words) =====
[['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
===== BoW =====
[[(0, 1), (1, 1), (2, 1)],
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
[(2, 1), (5, 1), (7, 1), (8, 1)],
[(1, 1), (5, 2), (8, 1)],
[(3, 1), (6, 1), (7, 1)],
[(9, 1)],
[(9, 1), (10, 1)],
[(9, 1), (10, 1), (11, 1)],
[(4, 1), (10, 1), (11, 1)]]
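
The (id, count) pairs above are a sparse encoding: only non-zero counts are listed. To see the full fixed-length vector, the pairs can be expanded as in the sketch below. The `to_dense` helper is a hypothetical name written for this example; the vocabulary size of 12 comes from the dictionary printed earlier.

```python
def to_dense(bow: list[tuple[int, int]], vocab_size: int) -> list[int]:
    """Expand sparse (id, count) pairs into a fixed-length count vector."""
    dense = [0] * vocab_size
    for idx, count in bow:
        dense[idx] = count
    return dense

# Fourth document from the output above; the dictionary holds 12 words
fourth_doc = [(1, 1), (5, 2), (8, 1)]
dense = to_dense(fourth_doc, vocab_size=12)
print(dense)  # [0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0]
```

gensim also provides `gensim.matutils.sparse2full` for this kind of conversion; the plain-Python version here is only meant to make the layout visible.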

Vectorizing a new sentence with BoW

A new sentence, "Human computer interaction", comes in.
Vectorized with BoW, it looks like this.

new_sentence = "Human computer interaction"
cleaned_words = filter.filter(text_cleaner.clean_text(tokenizer.tokenize(new_sentence)))
new_vec = bow_model.dictionary.doc2bow(cleaned_words)
print(new_vec)

# Output
[(0, 1), (1, 1)]
# word 0 (computer): appears once
# word 1 (human): appears once
# interaction: not in the dictionary, so it is dropped
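
Words that are missing from the dictionary disappear silently, which can hide how much of a new sentence was actually represented. The plain-Python sketch below reports them alongside the BoW. The function `doc2bow_with_missing` is a hypothetical helper written for this example and the mapping is copied from the dictionary step; it is not gensim's own code.

```python
from collections import Counter

# token -> id mapping as printed in the dictionary step
token2id = {'computer': 0, 'human': 1, 'interface': 2,
            'response': 3, 'survey': 4, 'system': 5,
            'time': 6, 'user': 7, 'eps': 8, 'trees': 9,
            'graph': 10, 'minors': 11}

def doc2bow_with_missing(words: list[str], token2id: dict[str, int]):
    """Return (bow, missing): known words as sorted (id, count) pairs,
    unknown words as a word -> count mapping."""
    counts = Counter(words)
    bow = sorted((token2id[w], c) for w, c in counts.items() if w in token2id)
    missing = {w: c for w, c in counts.items() if w not in token2id}
    return bow, missing

bow, missing = doc2bow_with_missing(["human", "computer", "interaction"], token2id)
print(bow)      # [(0, 1), (1, 1)]
print(missing)  # {'interaction': 1}
```

gensim's `doc2bow` accepts a `return_missing` argument for a similar purpose; check the gensim documentation for its exact return format.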

Reference

Korea National Open University - Natural Language Processing course (Prof. 유찬우)
gensim sample data and code
https://wikidocs.net/24557


Jongya
