Chapter 01. Pipelines (Sentiment Analysis and Question Answering with DistilBERT)

조조링 2024. 11. 21. 22:19


Problem 2. Sentiment Analysis with the DistilBERT Model

This problem runs sentiment analysis with the DistilBERT model. Determine whether each of the following input sentences is positive or negative.
(1) "I like Olympic games as it's very exciting."
(2) "I'm against to hold Olympic games in Tokyo in terms of preventing the covid19 to be spread."

Problem 2 uses Hugging Face's transformers library to perform sentiment analysis with the DistilBERT model.

Sentiment analysis is an NLP task that identifies the sentiment expressed in a text and classifies it as positive or negative.

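Introducing the DistilBERT Model

DistilBERT is a lightweight version of BERT with a simplified architecture. According to the DistilBERT paper, it is about 40% smaller than BERT and about 60% faster, while retaining about 97% of BERT's language understanding performance.

Knowledge distillation: the knowledge of a large model (BERT) is compressed into a small model (DistilBERT) to improve efficiency.
Simplified structure: BERT's 12 layers are reduced to 6 in DistilBERT, the token type embeddings are removed, and some layers are optimized.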
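
To make the distillation idea concrete, here is a minimal sketch of a soft-target loss, in which the student is trained to match the teacher's temperature-softened output distribution. The logits, the temperature value, and the function names are illustrative assumptions; this is not DistilBERT's actual training code, which the paper describes as combining a loss of this kind with a masked language modeling loss and a cosine embedding loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits over a three-word vocabulary
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # nearly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # prefers a different word

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close < loss_far)
```

The student whose distribution is closer to the teacher's gets the lower loss, which is the signal that pushes the small model toward the large model's behavior.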

Code Walkthrough

1. Import the required library and create the pipeline

from transformers import pipeline
sentiment = pipeline('sentiment-analysis')

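pipeline is a high-level API provided by Hugging Face that lets you set up a variety of NLP tasks quickly.
pipeline('sentiment-analysis') automatically loads a pretrained sentiment analysis model and prepares it for use. Because no model name is given, the library falls back to its default checkpoint for this task, a DistilBERT model fine-tuned on SST-2 (distilbert-base-uncased-finetuned-sst-2-english).

2. Inspect the model structure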

sentiment.model

# DistilBertForSequenceClassification(
#   (distilbert): DistilBertModel(
#     (embeddings): Embeddings(
#       (word_embeddings): Embedding(30522, 768, padding_idx=0)
#       (position_embeddings): Embedding(512, 768)
#       (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#       (dropout): Dropout(p=0.1, inplace=False)
#     )
#     (transformer): Transformer(
#       (layer): ModuleList(
#         (0-5): 6 x TransformerBlock(
#           (attention): DistilBertSdpaAttention(
#             (dropout): Dropout(p=0.1, inplace=False)
#             (q_lin): Linear(in_features=768, out_features=768, bias=True)
#             (k_lin): Linear(in_features=768, out_features=768, bias=True)
#             (v_lin): Linear(in_features=768, out_features=768, bias=True)
#             (out_lin): Linear(in_features=768, out_features=768, bias=True)
#           )
#           (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#           (ffn): FFN(
#             (dropout): Dropout(p=0.1, inplace=False)
#             (lin1): Linear(in_features=768, out_features=3072, bias=True)
#             (lin2): Linear(in_features=3072, out_features=768, bias=True)
#             (activation): GELUActivation()
#           )
#           (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#         )
#       )
#     )
#   )
#   (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
#   (classifier): Linear(in_features=768, out_features=2, bias=True)
#   (dropout): Dropout(p=0.2, inplace=False)
# )

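Embedding layer: converts tokens and their positions into embeddings (768-dimensional vectors)
Six Transformer blocks: encode the contextual information of the input sentence and produce the final sentence representation vector
Classifier layer: classifies the Transformer output (768 dimensions) into two classes (positive/negative)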
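
As a rough check on the structure printed above, the layer shapes can be turned into a parameter count with plain arithmetic. This is only a sketch: the helper names are made up for illustration, and it assumes that the printed layers account for all trainable parameters (a weight matrix plus a bias vector per Linear layer, a scale and a shift per LayerNorm dimension).

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def layer_norm_params(dim):
    """One scale and one shift value per dimension."""
    return 2 * dim

# Sizes read off the printed model structure
dim, ffn_dim, vocab, positions, n_blocks = 768, 3072, 30522, 512, 6

embeddings = vocab * dim + positions * dim + layer_norm_params(dim)
block = (
    4 * linear_params(dim, dim)      # q_lin, k_lin, v_lin, out_lin
    + layer_norm_params(dim)         # sa_layer_norm
    + linear_params(dim, ffn_dim)    # ffn.lin1
    + linear_params(ffn_dim, dim)    # ffn.lin2
    + layer_norm_params(dim)         # output_layer_norm
)
head = linear_params(dim, dim) + linear_params(dim, 2)  # pre_classifier + classifier

total = embeddings + n_blocks * block + head
print(f"embeddings: {embeddings:,}")
print(f"one block:  {block:,}")
print(f"head:       {head:,}")
print(f"total:      {total:,}")
```

Under these assumptions the total comes to 66,955,010, roughly 67 million parameters, with the word embeddings and the six Transformer blocks accounting for almost all of it and the classification head for well under 1%.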

3. Run sentiment analysis

print(sentiment(["I like Olympic games as it's very exciting."]))
# [{'label': 'POSITIVE', 'score': 0.9998}]  (one dict per input sentence; score rounded)

print(sentiment(["I'm against to hold Olympic games in Tokyo in terms of preventing the covid19 to be spread."]))
# [{'label': 'NEGATIVE', 'score': 0.9792}]  (one dict per input sentence; score rounded)

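When a list of strings is passed to the pipeline, the model analyzes each sentence and returns a sentiment label and a probability score for each one.
Sentence analysis
- The first sentence is classified as positive with a probability of 99.98%, and the second as negative with a probability of 97.92%.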
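
The label and score come from the two numbers produced by the classifier layer. A minimal sketch of that last step, assuming a softmax over the two logits followed by picking the larger probability: the logits below are invented for illustration, and the index-to-label mapping is an assumption about how the checkpoint is configured.

```python
import math

def to_label_and_score(logits, id2label):
    """Apply softmax to the logits and report the most likely class."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Assumed mapping and hypothetical logits, ordered as [NEGATIVE, POSITIVE]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = to_label_and_score([-4.2, 4.3], id2label)
print(result["label"], round(result["score"], 4))
```

A large gap between the two logits yields a score close to 1, which is why clearly worded sentences like the two above receive such high probabilities.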

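Problem 3. Question Answering

Load the question-answering pipeline from the transformers library. Then provide the first five paragraphs of the English Wikipedia article on the Tokyo Olympics (https://en.wikipedia.org/wiki/2020_Summer_Olympics) as the context, and ask why the Tokyo Olympics were postponed.

Problem 3 uses the question-answering pipeline of Hugging Face's transformers library to perform question answering.

DistilBERT is also commonly used for extractive QA, where the model locates the span of text within the given context that answers the question.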

Code Walkthrough

1. Set up the pipeline

from transformers import pipeline

qa = pipeline("question-answering")

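Hugging Face's pipeline function is a high-level API that makes it simple to set up a QA task.
By default, the question-answering pipeline uses a pretrained DistilBERT-based QA model (the cased checkpoint distilbert-base-cased-distilled-squad, fine-tuned on SQuAD).

2. Set the context, write the question, and run the model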

# Use triple quotes (""") for a string that spans multiple lines
olympic_wiki_text = """
The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and branded as Tokyo 2020 (東京2020), is an ongoing international multi-sport event being held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July.

Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013.[3] Originally scheduled to take place from 24 July to 9 August 2020, the event was postponed to 2021 in March 2020 as a result of the COVID-19 pandemic, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled).[4] However, the event retains the Tokyo 2020 name for marketing and branding purposes.[5] It is being held largely behind closed doors with no public spectators permitted due to the declaration of a state of emergency.[b] The Summer Paralympics will be held between 24 August and 5 September 2021, 16 days after the completion of the Olympics.[6]

The 2020 Games are the fourth Olympic Games to be held in Japan, following the Tokyo 1964 (Summer), Sapporo 1972 (Winter), and Nagano 1998 (Winter) games.[c] Tokyo is the first city in Asia to hold the Summer Games twice. The 2020 Games are the second of three consecutive Olympics to be held in East Asia, following the 2018 Winter Olympics in Pyeongchang, South Korea, and preceding the 2022 Winter Olympics in Beijing, China.

The 2020 Games introduced new competitions and re-introduced competitions that once were held but were subsequently removed. New ones include 3x3 basketball, freestyle BMX and mixed gender team events in a number of existing sports, as well as the return of madison cycling for men and an introduction of the same event for women. New IOC policies also allow the host organizing committee to add new sports to the Olympic program for just one Games. The disciplines added by the Japanese Olympic Committee are baseball and softball, karate, sport climbing, surfing, and skateboarding, the last four of which make their Olympic debuts.[7]

Bermuda, the Philippines, and Qatar won their first-ever Olympic gold medals.[8][9][10] San Marino, Turkmenistan, and Burkina Faso won their first-ever Olympic medals.[11][12][13]

"""
print(qa(question="What caused Tokyo Olympic postponed?", context=olympic_wiki_text))

# {'score': 0.6619064807891846, 'start': 635, 'end': 652, 'answer': 'COVID-19 pandemic'}

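The qa model predicts an answer based on the question and the context.
Result analysis
- score: the confidence score of the predicted answer (0 to 1)
- start, end: the character indices in the context where the answer begins and ends
- answer: the final answer text extracted from the context; here the model predicts "COVID-19 pandemic".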
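
The start and end values can be used to slice the answer straight out of the context string. The sketch below uses a short hypothetical context instead of the Wikipedia text and builds a result dict by hand in the same shape as the pipeline output, so it runs without loading a model.

```python
# Hypothetical short context and answer, standing in for the Wikipedia text
context = "The event was postponed to 2021 as a result of the COVID-19 pandemic."
answer = "COVID-19 pandemic"

# Build a result dict shaped like the pipeline output (score omitted)
start = context.find(answer)
result = {"start": start, "end": start + len(answer), "answer": answer}

# The end index is exclusive, so a plain slice reproduces the answer
extracted = context[result["start"]:result["end"]]
print(extracted)
```

This matches the output above: 652 - 635 = 17, which is exactly the length of "COVID-19 pandemic", so olympic_wiki_text[635:652] should return the same answer string.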

3. Inspect the model structure

qa.model

# DistilBertForQuestionAnswering(
#   (distilbert): DistilBertModel(
#     (embeddings): Embeddings(
#       (word_embeddings): Embedding(28996, 768, padding_idx=0)
#       (position_embeddings): Embedding(512, 768)
#       (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#       (dropout): Dropout(p=0.1, inplace=False)
#     )
#     (transformer): Transformer(
#       (layer): ModuleList(
#         (0-5): 6 x TransformerBlock(
#           (attention): DistilBertSdpaAttention(
#             (dropout): Dropout(p=0.1, inplace=False)
#             (q_lin): Linear(in_features=768, out_features=768, bias=True)
#             (k_lin): Linear(in_features=768, out_features=768, bias=True)
#             (v_lin): Linear(in_features=768, out_features=768, bias=True)
#             (out_lin): Linear(in_features=768, out_features=768, bias=True)
#           )
#           (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#           (ffn): FFN(
#             (dropout): Dropout(p=0.1, inplace=False)
#             (lin1): Linear(in_features=768, out_features=3072, bias=True)
#             (lin2): Linear(in_features=3072, out_features=768, bias=True)
#             (activation): GELUActivation()
#           )
#           (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#         )
#       )
#     )
#   )
#   (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
#   (dropout): Dropout(p=0.1, inplace=False)
# )

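The embedding and Transformer structure is the same as in Problem 2, except for the vocabulary size (28996 here versus 30522), because the two pipelines load different checkpoints.
QA output layer: takes the output of the Transformer blocks and produces two values per token, a start-position score and an end-position score.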

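Differences Between the sentiment and qa Model Structures

[Common structure]

1. Embeddings:

Word embedding and position embedding layers
Convert input tokens into 768-dimensional vectors

2. Transformer

Six Transformer blocks, each consisting of an attention layer and a feed-forward layer
Learn the contextual relationships among the tokens of a sentence, or between the question and the context

[Differences]

1. Sentiment analysis model (DistilBertForSequenceClassification)

Final output
- pre_classifier: transforms the vector of the [CLS] token produced by DistilBERT (768 -> 768)
- classifier: maps the 768-dimensional input to the final classes (2: positive/negative) (768 -> 2)

2. Question answering model (DistilBertForQuestionAnswering)

Final output
- qa_outputs: predicts a start-position score and an end-position score for every token produced by DistilBERT (768 -> 2)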
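
To show how two scores per token become one answer, here is a simplified sketch of span selection: it picks the pair of start and end tokens with the highest combined score, subject to the start not coming after the end. The tokens, the logits, and the max_answer_len parameter are hypothetical, and the real pipeline does additional work (for example, ignoring question tokens and mapping tokens back to character offsets).

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) token indices maximizing start + end score, with start <= end."""
    best, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hypothetical tokens with one start score and one end score each
tokens = ["postponed", "due", "to", "the", "COVID-19", "pandemic", "."]
start_logits = [0.1, 0.2, 0.0, 0.5, 6.0, 1.0, -2.0]
end_logits = [0.0, 0.1, 0.0, 0.2, 1.5, 7.0, -1.0]

s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))
```

By contrast, the sentiment model reads only the [CLS] position and emits a single pair of class scores for the whole sentence, which is the main structural difference between the two heads.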
