Chapter 02. Fine-Tuning and Evaluating DistilBERT (Problems 4-7)

조조링 2024. 11. 23. 00:03


Problem 4. The IMDB Dataset

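The IMDB dataset is a sentiment analysis dataset used to classify movie review comments as positive or negative.
- 25,000 training examples (text and label)
- 25,000 test examples (text and label)
Download the 50,000-example dataset, then randomly sample 1,000 examples each for training and testing and store them as lists.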

torchtext is PyTorch's library for natural language preprocessing. It can be used to obtain the data.

!pip install torchtext==0.15.2
!pip install portalocker==2.7.0
!pip install accelerate -U

from torchtext.datasets import IMDB

train_iter = IMDB(split='train')
test_iter = IMDB(split='test')

train_iter  # ShardingFilterIterDataPipe

# Takes about 30 seconds to run
# Set random.seed so that the output is reproducible
import random
random.seed(6)

# Convert train_iter and test_iter to lists
train_lists = list(train_iter)
test_lists = list(test_iter)

# Randomly sample 1,000 examples from each list
train_lists_small = random.sample(train_lists, 1000)
test_lists_small = random.sample(test_lists, 1000)

# Print the element at index 0 (the first element) of each variable
print(train_lists_small[0])
print(test_lists_small[0])

# (2, "I LOVED this movie! I am biased seeing as I am a huge Disney fan, but I really enjoyed myself. The action takes off running in the beginning of the film and just keeps going! This is a bit of a departure for Disney, they don't spend quite as much time on character development (my husband pointed this out)and there are no musical numbers. It is strictly action adventure. I thoroughly enjoyed it and recommend it to anyone who loves Disney, be they young or old.")
# (1, 'This was an abysmal show. In short it was about this kid called Doug who guilt-tripped a lot. Seriously he could feel guilty over killing a fly then feeling guilty over feeling guilty for killing the fly and so forth. The animation was grating and unpleasant and the jokes cheap. <br /><br />It aired here in Sweden as a part of the "Disney time" show and i remember liking it some what but then i turned 13.<br /><br />I never got why some of the characters were green and purple too. What was up with that? <br /><br />Truly a horrible show. Appareantly it spawned a movie which i\'ve never seen but i don\'t have any great expectations for that one either.')
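
The role of the seed in the sampling step can be illustrated with a small sketch that uses only the standard library. The helper name `sample_reproducibly` and the `toy_reviews` data are hypothetical and are not part of the original exercise; the sketch also uses a private `random.Random` instance instead of the global `random.seed` call, which is one possible design choice rather than the author's method.

```python
import random

def sample_reproducibly(items, k, seed):
    """Return k items drawn without replacement, using a private seeded RNG."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(items, k)

# Hypothetical stand-in data shaped like the IMDB tuples: (label, text)
toy_reviews = [(1 + i % 2, f"review {i}") for i in range(100)]

first = sample_reproducibly(toy_reviews, 5, seed=6)
second = sample_reproducibly(toy_reviews, 5, seed=6)

print(first == second)  # the same seed yields the same sample
print(len(set(first)))  # no element is drawn twice
```

A private generator keeps the sample stable even if other code in the notebook also draws random numbers between runs.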

Problem 5. Label Encoding

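The texts and labels obtained in Problem 4 are stored in lists whose elements are (label, text) tuples. Currently the label is 2 for positive and 1 for negative. Change the labels so that positive is 1 and negative is 0. Also store the texts and labels in the separate, smaller lists train_texts, train_labels, test_texts, and test_labels.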

# Create the containers train_texts and train_labels
# They hold the results produced by the loop below
train_texts = []
train_labels = []

# for loop
# Unpack each tuple in train_lists_small into label and text, in order
for label, text in train_lists_small:
  # Map the original IMDB label 2 to 1 and the original label 1 to 0
  train_labels.append(1 if label == 2 else 0)
  train_texts.append(text)

# Create the containers test_texts and test_labels
test_texts = []
test_labels = []

# for loop
for label, text in test_lists_small:
  # Map the original IMDB label 2 to 1 and the original label 1 to 0
  test_labels.append(1 if label == 2 else 0)
  test_texts.append(text)

# Print the element at index 0 (the first element) of each variable
print(train_texts[0])
print(train_labels[0])
print(test_texts[0])
print(test_labels[0])

# I LOVED this movie! I am biased seeing as I am a huge Disney fan, but I really enjoyed myself. The action takes off running in the beginning of the film and just keeps going! This is a bit of a departure for Disney, they don't spend quite as much time on character development (my husband pointed this out)and there are no musical numbers. It is strictly action adventure. I thoroughly enjoyed it and recommend it to anyone who loves Disney, be they young or old.
# 1
# This was an abysmal show. In short it was about this kid called Doug who guilt-tripped a lot. Seriously he could feel guilty over killing a fly then feeling guilty over feeling guilty for killing the fly and so forth. The animation was grating and unpleasant and the jokes cheap. <br /><br />It aired here in Sweden as a part of the "Disney time" show and i remember liking it some what but then i turned 13.<br /><br />I never got why some of the characters were green and purple too. What was up with that? <br /><br />Truly a horrible show. Appareantly it spawned a movie which i've never seen but i don't have any great expectations for that one either.
# 0
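
Because the same loop is written twice, once for the training data and once for the test data, the encoding step could also be wrapped in a single helper. The following is a minimal sketch under that assumption; the function name `split_and_encode`, its `positive_label` parameter, and the `toy_pairs` data are hypothetical and do not appear in the original exercise.

```python
def split_and_encode(pairs, positive_label=2):
    """Split (label, text) pairs into a list of texts and a list of 0/1 labels."""
    texts = [text for _, text in pairs]
    # As in the loops above, any label other than positive_label becomes 0
    labels = [1 if label == positive_label else 0 for label, _ in pairs]
    return texts, labels

# Hypothetical stand-in for train_lists_small
toy_pairs = [(2, "great film"), (1, "dull plot"), (2, "loved it")]
toy_texts, toy_labels = split_and_encode(toy_pairs)

print(toy_texts)
print(toy_labels)
```

With such a helper, the training and test lists would each be produced by one call, for example `train_texts, train_labels = split_and_encode(train_lists_small)`.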

Problem 6. Splitting Training and Validation Data

Of the 1,000 training examples encoded as '0' and '1' in Problem 5, set aside 800 for training and 200 for validation.

# Set random_state so that the train_test_split result is reproducible
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2,
                                                                    random_state=3)
print(len(train_texts))
print(len(train_labels))
print(len(val_texts))
print(len(val_labels))
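
The split above is random and not stratified, so the share of positive and negative labels in the two subsets may differ slightly. One way to check this is to count the labels after splitting. The sketch below uses only the standard library; the helper `label_share` and the toy label lists are hypothetical illustrations, not part of the original exercise.

```python
from collections import Counter

def label_share(labels):
    """Return the fraction of each label value found in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Hypothetical stand-ins for train_labels and val_labels
toy_train_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_val_labels = [1, 1]

print(label_share(toy_train_labels))
print(label_share(toy_val_labels))
```

If the shares turn out to be noticeably uneven, `train_test_split` accepts a `stratify` argument (for example `stratify=train_labels`) that keeps the class proportions the same in both subsets.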

Problem 7. Tokenizing and Encoding

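Use the tokenizer to encode the training data (800 examples), validation data (200 examples), and test data (1,000 examples) extracted in Problems 5 and 6 so that they can be fed to the pretrained distilbert-base-uncased model.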

This exercise uses the distilbert-base-uncased model because it is relatively lightweight: it is a distilled, smaller version of BERT.

# Load the tokenizer for the distilbert-base-uncased model
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Encode the texts by tokenizing them
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

# Show the text at index 0 of train_texts
train_texts[0]

# Print the input_ids of the first five tokens of the text at index 0
print(train_encodings["input_ids"][0][:5])

# Decode the result above and print it
print(tokenizer.decode(train_encodings["input_ids"][0][:5]))

# CAT SOUP has two "Hello Kitty"-type kittens embarking on a bizarre trip through the afterlife, where anything can happen, and does. This mind-tripping Asian short uses no dialog, substituting word balloons instead. There is no way of describing this demented cartoon except to tell you to see it for yourself. And make sure no one under 10 is in the room. Dismemberment and cannibalism and cruelty and savagery and sudden death and callous disregard for others are common themes. Honest. Perhaps the most memorable image is that of an elephant composed of water that the kitties swim through and in, and also ride. But like practically everything else in this film, that silly, picaresque interlude soon comes to a horrible end.

# [101, 4937, 11350, 2038, 2048]
# [CLS] cat soup has two
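
What truncation=True and padding=True do to a batch can be illustrated with a toy sketch that works on token ids that are already available. This is not the tokenizer's actual implementation: the function `pad_and_truncate` and the very small `max_length` are hypothetical and chosen only for illustration. The special-token ids follow the distilbert-base-uncased vocabulary ([CLS] = 101, [SEP] = 102, [PAD] = 0), and the example ids are taken from the output above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in distilbert-base-uncased

def pad_and_truncate(batch, max_length):
    """Toy illustration of truncation and padding; assumes max_length >= 2."""
    # Truncate so that [CLS] + tokens + [SEP] fits within max_length
    wrapped = [[CLS_ID] + ids[: max_length - 2] + [SEP_ID] for ids in batch]
    # Pad every sequence to the length of the longest one in the batch
    longest = max(len(ids) for ids in wrapped)
    input_ids, attention_mask = [], []
    for ids in wrapped:
        pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# "cat soup has two" and "cat soup", using the ids printed above
toy_batch = [[4937, 11350, 2038, 2048], [4937, 11350]]
encoded = pad_and_truncate(toy_batch, max_length=5)

print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The first sequence loses its last token to fit the length limit, while the second is padded with 0 and its attention mask marks the padded position so the model can ignore it.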
