[Review] Attention Is All You Need

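Attention Is All You Need is the paper that introduced the Transformer, the architecture that was later applied to Google Translate. BERT and ChatGPT are built on variants of the Transformer, and so is NLLB-200, regarded at the time of writing as one of the most advanced translation models. That makes this a very important paper for processing sequence data, so this post analyzes it in detail.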

1. Introduction & Background

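Sequence data has traditionally been processed with recurrent models: RNN, LSTM, and GRU. Each has its own weaknesses, but they share one: computation cannot be parallelized across positions, so the massive resources of a GPU cannot be fully used.
The paper therefore proposes the Transformer, a model that can be computed in parallel. The Transformer drops recurrence and convolution entirely and relies solely on attention mechanisms. Because it parallelizes well, it needs far less training time on translation tasks, and its results are better as well.
The recurrent models that dominated translation factor computation along the symbol positions of the input and output sequences. That is, they generate a sequence of hidden states $h_t$ as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. As anyone who has studied recurrent models knows, this inherently sequential nature precludes parallelization within a training example. Recent work has improved efficiency considerably through factorization tricks and conditional computation, but the fundamental constraint remains.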
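
The sequential dependency can be made concrete with a minimal NumPy sketch. The update rule (a plain tanh RNN) and the names `rnn_hidden_states`, `W_h`, and `W_x` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def rnn_hidden_states(xs, W_h, W_x):
    """Compute h_t = tanh(W_h @ h_{t-1} + W_x @ x_t) one position at a time."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        # Step t needs h from step t-1, so the loop cannot run in parallel over t.
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))           # 5 positions, 3 input features (arbitrary sizes)
W_h = rng.normal(size=(4, 4)) * 0.1    # hidden size 4 (arbitrary)
W_x = rng.normal(size=(4, 3)) * 0.1
states = rnn_hidden_states(xs, W_h, W_x)
print(states.shape)
```

Each iteration consumes the hidden state produced by the previous one, which is exactly the constraint self-attention removes.
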
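To overcome this limitation of recurrence (no parallel computation), the paper proposes the Transformer. A translation model trained for as little as 12 hours on eight NVIDIA P100 GPUs outperformed the methods with the highest BLEU scores at the time of publication in 2017.
The paper describes the Transformer as the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.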

For attention in general, see the separate post on that topic; self-attention is explained further below.

2. Model Architecture

The Transformer basically consists of an Encoder and a Decoder. See Figure 1 below.

Figure 1: The Transformer - model architecture.

Left - Encoder, Right - Decoder

2-1. Positional Encoding

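When RNN-family models process natural language, the order of the words is recorded implicitly, because tokens are consumed one at a time. The Transformer has no built-in notion of the order of the sequence, so this information has to be injected separately. That is why, at the bottom of both the Encoder and the Decoder in Figure 1, Positional Encoding is applied after the Embedding.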

The Transformer lacks position information because of its structure and the way it computes. As mentioned above, it does not process data sequentially; it processes all input positions in parallel, at once. Unlike an RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory), it therefore does not naturally encode order.

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types.

$n$ = sequence length, $d$ = representation dimension,
$k$ = kernel size of convolution, $r$ = size of the neighborhood in restricted self-attention

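In concrete terms:

$n$ : the number of words (tokens) in a sentence, in natural language processing
$d$ : the dimension of the representation
$k$ : the kernel size (window width) of the convolution layer
$r$ : the size of the neighborhood that restricted self-attention is limited to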

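The column to focus on is 'Complexity per Layer'. Self-attention costs $O(n^{2} \cdot d)$ per layer and a recurrent layer costs $O(n \cdot d^{2})$, so self-attention is faster whenever the sequence length $n$ is smaller than the representation dimension $d$. If sentences become long enough that $n^{2} \cdot d$ exceeds $n \cdot d^{2}$, self-attention can be restricted to a neighborhood of size $r$ in the input sequence centered on each output position, instead of attending over the whole sentence. (Depending on the task, this restriction may lower accuracy.)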
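
The comparison can be checked with a few lines of arithmetic. This is a sketch that keeps only the leading-order terms from Table 1 and drops all constants; the function name `per_layer_cost` and the example sizes are assumptions for illustration.

```python
def per_layer_cost(n, d, k=3, r=None):
    """Leading-order per-layer operation counts from Table 1 (constants dropped)."""
    costs = {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }
    if r is not None:
        costs["restricted_self_attention"] = r * n * d
    return costs

short = per_layer_cost(n=50, d=512)             # a typical sentence: n < d
long = per_layer_cost(n=4096, d=512, r=128)     # a very long sequence: n > d

print(short["self_attention"] < short["recurrent"])            # True
print(long["self_attention"] < long["recurrent"])              # False
print(long["restricted_self_attention"] < long["recurrent"])   # True
```

With $n = 50$ and $d = 512$ self-attention is cheaper; with $n = 4096$ it is not, and restricting attention to $r = 128$ neighbors brings the cost back below the recurrent layer.
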
For reference, the position values are generated with the two functions below. (A quick look is enough for now.)
$$ \begin{aligned}
PE_{(pos, 2i)} &= \sin\left(pos / 10000^{2i/d_{model}}\right) \\
PE_{(pos, 2i+1)} &= \cos\left(pos / 10000^{2i/d_{model}}\right)
\end{aligned} $$

$pos$ = position, $i$ = dimension index

2-2. Encoder, Decoder

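The encoder is a stack of $N = 6$ identical layers (the part marked Nx in Figure 1 is repeated six times). Each layer has two sub-layers: the first is multi-head self-attention and the second is a simple position-wise fully connected feed-forward network (FFN). A residual connection (the sub-layer's input skips the sub-layer and is added to its output) is used around each of the two sub-layers, followed by Layer Normalization (each position's feature vector is normalized by its own mean and variance over the feature dimension, not over the batch). That is, the output of each sub-layer is $LayerNorm(x + Sublayer(x))$. To make the residual connections straightforward, all sub-layers and the embedding layers produce outputs of dimension $d_{model} = 512$.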
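
The expression $LayerNorm(x + Sublayer(x))$ can be sketched in NumPy as follows. This is a simplified illustration: the learned gain and bias of layer normalization are omitted, and the names `layer_norm` and `add_and_norm` as well as the toy sub-layer are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector over the last (feature) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection around the sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(2, 4, 8))   # (batch, seq_len, d_model)
out = add_and_norm(x, lambda t: 0.5 * t)              # toy stand-in for attention or FFN
print(out.shape)
print(out.mean(axis=-1).round(6))                     # close to 0 for every position
```

The statistics are taken per position over the feature axis, so the result does not depend on the other sequences in the batch.
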

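The decoder is also a stack of $N = 6$ identical layers (again, the Nx part is repeated six times). In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, each sub-layer uses a residual connection followed by layer normalization.
Note that the second sub-layer of each decoder layer in Figure 1 receives an input directly from the encoder. This is the encoder output used as keys and values, not a residual connection; it comes from the last encoder layer, and all N decoder layers receive it.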

The layers are "repeated six times" because the paper stacks the Encoder and the Decoder six layers deep each.
For a custom model, stack any number of layers $x$ instead.

2-3. Attention

The Multi-Head Attention block shown in Figure 1 is, of course, an abstraction. Internally it is organized as in the image below.

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

The Scaled Dot-Product Attention block inside the right-hand diagram is shown in detail in the left-hand diagram.

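An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.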

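Query : the vector of the word whose similarity to others is being checked (source). (Described here as a single word, but several queries can be processed at once.)
Key : the vector of a word that the query is compared against (target)
Value : the content vector paired with each key; the output is the weighted sum of these vectors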

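The point of the mechanism is to find how much weight a particular word should place on each of the other words (how related they are). The linear projections that produce the queries, keys, and values are learned during training, and in encoder-decoder attention the keys and values come from the encoder output.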

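Scaled Dot-Product Attention shows how this works in practice. The Query and Key go through MatMul (matrix multiplication), Scale, an optional Mask, and SoftMax to produce the attention weights, and those weights are then multiplied (MatMul) with the Value to produce the final output. That is everything Scaled Dot-Product Attention does, and as a formula it is:
$$ Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V $$
In addition, Scaled Dot-Product Attention is run $h$ times in parallel, each time on a different learned linear projection of the inputs (this is Multi-Head Attention). The input and output dimensions are kept equal, which makes the layers easy to stack.
As for why the dot products are multiplied by $\frac{1}{\sqrt{d_{k}}}$ before the Softmax: for large $d_{k}$ the dot products grow large in magnitude, which pushes the Softmax into regions where its gradients are extremely small (the paper says it "suspects" this). The scaling factor counteracts that effect.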
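
The formula translates almost line by line into NumPy. The following is a minimal sketch for a single, unbatched example; the function names and the optional boolean `mask` argument (True means "may attend") are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # MatMul + Scale
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # Mask (optional)
    weights = softmax(scores, axis=-1)                   # SoftMax over the keys
    return weights @ V, weights                          # MatMul with the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 6))   # 5 values
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)   # (3, 6) (3, 5)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors.
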

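For Multi-Head Attention, the paper says that instead of performing a single attention function with $d_{model}$-dimensional keys, values, and queries, it found it beneficial to linearly project the queries, keys, and values $h$ times with different learned linear layers, to $d_k$, $d_k$, and $d_v$ dimensions respectively. Attention is then performed in parallel on each projected version, yielding $d_v$-dimensional outputs. These are concatenated and projected once more (linear), giving the final values shown in Figure 2.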
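
A compact NumPy sketch of this project, attend, concatenate, project pipeline is shown below. It assumes $d_k = d_v = d_{model} / h$ and stores the $h$ per-head projections as one fused $d_{model} \times d_{model}$ matrix per role, which is a common implementation choice rather than something the paper prescribes. All names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, h):
    """x_q: (n_q, d_model), x_kv: (n_kv, d_model), W_*: (d_model, d_model)."""
    n_q, d_model = x_q.shape
    d_k = d_model // h

    def split_heads(t):
        # (n, d_model) -> (h, n, d_k): one slice of the projection per head
        return t.reshape(t.shape[0], h, d_k).transpose(1, 0, 2)

    Q = split_heads(x_q @ W_q)
    K = split_heads(x_kv @ W_k)
    V = split_heads(x_kv @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (h, n_q, n_kv)
    heads = softmax(scores) @ V                               # (h, n_q, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n_q, d_model)   # concatenate the heads
    return concat @ W_o                                       # final linear projection

rng = np.random.default_rng(0)
d_model, h = 16, 4
x = rng.normal(size=(5, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, x, W_q, W_k, W_v, W_o, h)
print(out.shape)   # (5, 16)
```

The output has the same shape as the input, which is what allows the layer to sit inside a residual connection.
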

The attention layers used in the model may look as if they all behave identically, but the Transformer uses attention in three different ways.

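encoder-decoder attention : used in the decoder. The queries come from the previous decoder layer, and the keys and values come from the output of the encoder, so every position in the decoder can attend over all positions in the input sequence. → If the encoder holds the source text and the decoder the translated text, which source words does a given translated word put the most weight on? [ query : decoder vectors / key = value : encoder vectors ]
encoder self-attention : all queries, keys, and values come from the same place, the output of the previous encoder layer, and each position can attend to all positions of that layer. → Every word looks at every other word to determine their mutual weights (this is what produces the encoder representations that later serve as keys and values). [ query = key = value ]
decoder self-attention (masked) : each position in the decoder can attend to all decoder positions up to and including itself. The mask is the restriction: a word cannot see the words that come after it, which keeps training consistent with left-to-right generation. → A later word can see the words before it, but not the other way around. [ query = key = value ]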
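
The effect of the mask in decoder self-attention can be seen in a small NumPy sketch. The helper names `causal_mask` and `masked_softmax` are assumptions; the scores are set to zero so that only the mask shapes the result.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: position i may attend to positions 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)          # forbidden positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                                # exp(-inf) == 0
    return e / e.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)
print(weights)
```

The first row puts all of its weight on position 0, while the last row spreads its weight evenly over all four positions: later words see earlier ones, never the reverse.
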

3. Why Self-Attention

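A precise definition of self-attention was skipped above. In short, self-attention means applying attention to the sequence itself. Its purpose is to capture the relationships between the words inside a sentence, and it does so through the query, key, and value described above.
Running several such attention functions (heads) in parallel is what gives multi-head attention its name.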

Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb 'making', completing the phrase 'making...more difficult'. Attentions here shown only for the word 'making'. Different colors represent different heads. Best viewed in color.

The image above is an example of a long-distance dependency,
but the general idea is the same: the sentence is compared with itself to find related words.

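So why use self-attention in the end? The paper gives three reasons.

The total computational complexity per layer
The amount of computation that can be parallelized, measured by the minimum number of sequential operations required
The path length between long-range dependencies in the network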

Let's look again at the table mentioned above (Table 1).

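Item 1, computational complexity, is covered in [2-1 Positional Encoding] above.
Item 2 is about minimizing sequential operations. A recurrent layer is inherently sequential, so it requires $O(n)$ sequential operations ($n$ times more than the other layer types, which need $O(1)$).
Item 3 is easiest to understand by comparison with a single convolution layer. A single convolution layer with kernel width $k<n$ does not connect all pairs of input and output positions. Connecting them requires stacking many layers, which naturally lengthens the path between two positions. The paper also stresses that convolution layers are generally more expensive than recurrent layers by a factor of $k$ (Complexity per Layer).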
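
How quickly the stack grows can be estimated with a short calculation. This is a rough sketch under stated assumptions: stride-1 convolutions whose receptive field spans $1 + L(k-1)$ inputs after $L$ layers, and, for the dilated case, a dilation that grows by a factor of $k$ per layer so the span is $k^{L}$. The function names are hypothetical.

```python
import math

def conv_layers_to_cover(n, k):
    """Undilated stride-1 convolutions: smallest L with 1 + L * (k - 1) >= n."""
    return math.ceil((n - 1) / (k - 1))

def dilated_conv_layers_to_cover(n, k):
    """Dilation multiplied by k at each layer: smallest L with k ** L >= n."""
    layers, span = 0, 1
    while span < n:
        span *= k
        layers += 1
    return layers

n, k = 128, 3
print(conv_layers_to_cover(n, k))           # 64 layers, i.e. O(n / k)
print(dilated_conv_layers_to_cover(n, k))   # 5 layers, i.e. O(log_k(n))
```

A self-attention layer connects every pair of positions in a single layer, so the corresponding path length is constant regardless of $n$.
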

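In short, recurrent layers suffer from sequential processing, convolution suffers from long dependency paths and a higher base cost, and, decisively, recurrence cannot be parallelized. Self-Attention is used because it addresses both the cost and the time problems.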

4. Training

Input Data	English-German dataset of about 4.5 million sentence pairs
H/W Spec	NVIDIA P100 GPU 8EA
Training Time	3.5 days (big model)

The optimizer is Adam, with hyperparameters $ \beta_{1} = 0.9, \beta_{2} = 0.98, \epsilon = 10^{-9} $.
Residual dropout uses $P_{drop} = 0.1$.

5. Conclusion

Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

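The results table has a column labeled BLEU (BiLingual Evaluation Understudy). BLEU is a metric for automatically evaluating machine-translated text: it measures the similarity between the machine translation and a set of high-quality (human) reference translations. (Higher is better.) EN-DE is the score for English → German translation, and EN-FR is the score for English → French translation. The Transformer achieved the best result in both. (Note that newstest2014 is the name of the test set; the models it is compared against were published around 2016-2017.)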

Looking at Training Cost (FLOPs, the number of floating point operations), the Transformer uses comparatively less compute than the other models, as anticipated above.

6. Code Implementation

Feel free to skip this chapter if the implementation is of no particular interest.
The code here does not implement everything from scratch; it borrows from the TensorFlow library.

GPU	RTX3070
Environment	tensorflow/tensorflow Docker image (Docker Hub)
Target	Portuguese to English

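Honestly, I wanted to build a Korean-English translation example myself, but for several reasons this post walks through a Portuguese-English example instead. Korean-English translation data is available from AI Hub, but for lack of time I used the readily available TED data. (Above all, I wanted to avoid the time it would take to build and validate a Korean tokenizer.) The goal of this chapter is only to implement the Transformer's mechanics in code and understand them, so I chose data that is easy to obtain and easy to preprocess.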

6-1 import

Import the required libraries.

Source Code - import

import logging
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow as tf

import tensorflow_text

6-2 Download Dataset

The TED Talks Open Translation Project provides a variety of datasets. Load the Portuguese-English translation dataset from it.

Source Code - Download Dataset

# pt to en = Portuguese to English
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True,
                               as_supervised=True)

train_examples, val_examples = examples['train'], examples['validation']

# Take one batch of three training pairs and print each sentence.
for pt_examples, en_examples in train_examples.batch(3).take(1):
  print('> Examples in Portuguese:')
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))
  print()

  print('> Examples in English:')
  for en in en_examples.numpy():
    print(en.decode('utf-8'))
    
# The printed Portuguese and English sentences are aligned pairs (pt[0] translates to en[0], pt[1] to en[1], ...)

Output

> Examples in Portuguese:
e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .

> Examples in English:
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .

train_examples and val_examples are of type tensorflow.python.data.ops.prefetch_op._PrefetchDataset, and their elements look roughly like this.

Example - train_examples, val_examples

(<tf.Tensor: shape=(), dtype=string, numpy=b'tinham comido peixe com batatas fritas ?'>, <tf.Tensor: shape=(), dtype=string, numpy=b'did they eat fish and chips ?'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'estava sempre preocupado em ser apanhado e enviado de volta .'>, <tf.Tensor: shape=(), dtype=string, numpy=b'i was always worried about being caught and sent back .'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'escolhi um com o tom de pele de uma lagosta com um escald\xc3\xa3o .'>, <tf.Tensor: shape=(), dtype=string, numpy=b'i chose one with the skin color of a lobster when sunburnt .'>)
...
...
...

6-3 Tokenizer

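Load the prebuilt Portuguese and English tokenizers for the TED dataset.

When building a tokenizer for a language yourself, you can keep it simple (tokenize, then strip special characters), or make it elaborate, for example by distinguishing and converting tenses. For a task such as sentiment analysis, tense distinctions like past and future carry little meaning, so the tokenizer can replace forms like 'went' and 'will go' with the single word 'go'.

The converter used here does more than tokenize: it maps words to token IDs and can map token IDs back to the original words, among other conveniences.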

Source Code - Tokenizer

model_name = 'ted_hrlr_translate_pt_en_converter'
tf.keras.utils.get_file(
    f'{model_name}.zip',
    f'https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip',
    cache_dir='.', cache_subdir='', extract=True
)

tokenizers = tf.saved_model.load(model_name)

# Print the text that will be tokenized as a batch.
for en in en_examples.numpy():
  print(en.decode('utf-8')) #print(batch)

encoded = tokenizers.en.tokenize(en_examples)

# tokenize() turns each sentence into a vector of token IDs.
for row in encoded.to_list():
  print(row) # print(encode)

# detokenize() converts token IDs back into text.
round_trip = tokenizers.en.detokenize(encoded)
for line in round_trip.numpy():
  print(line.decode('utf-8')) # print(decode)

# Most important: the lower-level lookup() maps each token ID to its token text [TokenID to TokenText].
tokens = tokenizers.en.lookup(encoded)
print(tokens) # print(tokens)

Output - print(batch)

and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .

Output - print(encode)

[2, 39, 27, 1106, 13, 71, 223, 80, 179, 437, 207, 71, 6063, 74, 71, 2741, 15, 3]
[2, 71, 2813, 446, 72, 71, 751, 14, 92, 1075, 1684, 2100, 80, 166, 415, 14, 948, 72, 2767, 100, 71, 1080, 774, 88, 15, 3]
[2, 10, 383, 11, 149, 388, 1041, 1479, 27, 76, 9, 55, 330, 51, 9, 3095, 77, 71, 1117, 15, 3]

Output - print(decode)

c : success , the change is only coming through the barrel of the gun .
the documentation and the hands - on teaching methodology is also open - source and released as the creative commons .
( video ) didi pickles : it ' s four o ' clock in the morning .

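"Lower-level lookup" means you can see the subword side of the tokenizer. The word "searchability" is split into "search ##ability". Handling words this way lets the model represent a rare word through pieces it already knows, instead of needing a separate vocabulary entry for every word form.

Every sequence starts with [START] and ends with [END].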

Output - print(tokens)

<tf.RaggedTensor [[b'[START]', b'c', b':', b'success', b',', b'the', b'change', b'is',
  b'only', b'coming', b'through', b'the', b'barrel', b'of', b'the', b'gun',
  b'.', b'[END]']                                                          ,
 [b'[START]', b'the', b'document', b'##ation', b'and', b'the', b'hands',
  b'-', b'on', b'teaching', b'method', b'##ology', b'is', b'also', b'open',
  b'-', b'source', b'and', b'released', b'as', b'the', b'creative',
  b'common', b'##s', b'.', b'[END]']                                       ,
 [b'[START]', b'(', b'video', b')', b'did', b'##i', b'pick', b'##les', b':',
  b'it', b"'", b's', b'four', b'o', b"'", b'clock', b'in', b'the',
  b'morning', b'.', b'[END]']                                               ]>
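
To show how a split such as "document ##ation" can come about, here is a toy sketch of greedy longest-match-first subword splitting, the scheme commonly associated with WordPiece tokenizers. It assumes the saved tokenizer behaves this way; the function `wordpiece_tokenize` and the tiny vocabulary are hypothetical and far smaller than the real one.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"search", "##ability", "document", "##ation", "method", "##ology"}
print(wordpiece_tokenize("searchability", toy_vocab))   # ['search', '##ability']
print(wordpiece_tokenize("documentation", toy_vocab))   # ['document', '##ation']
```
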

6-4 Set up the input pipeline with tf.data

Source Code - Pipeline

MAX_TOKENS=128

# Encodes a batch of text (into token IDs, as shown above),
# with two extra steps:
# 1) trim each sequence to at most MAX_TOKENS tokens
# 2) convert the ragged result to a zero-padded dense tensor
def prepare_batch(pt, en):
    pt = tokenizers.pt.tokenize(pt)      # Output is ragged.
    pt = pt[:, :MAX_TOKENS]    # Trim to MAX_TOKENS.
    pt = pt.to_tensor()  # Convert to 0-padded dense Tensor

    en = tokenizers.en.tokenize(en)
    en = en[:, :(MAX_TOKENS+1)]
    en_inputs = en[:, :-1].to_tensor()  # Drop the [END] tokens
    en_labels = en[:, 1:].to_tensor()   # Drop the [START] tokens

    return (pt, en_inputs), en_labels

BUFFER_SIZE = 20000
BATCH_SIZE = 64

# A simple input pipeline that shuffles and batches the data.
# After tokenization, prepare_batch() trims sequences that are too long.
def make_batches(ds):
  return (
      ds
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(prepare_batch, tf.data.AUTOTUNE)
      .prefetch(buffer_size=tf.data.AUTOTUNE))
    
    
# Create the training and validation batch sets.
# The dataset (roughly 50,000 training pairs) is tokenized lazily as batches are drawn.
train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)

# Pull one batch of pt / en examples from the training batches.
for (pt, en), en_labels in train_batches.take(1):
  break

# Check the shapes.
print(pt.shape)
print(en.shape)
print(en_labels.shape)

# Check the contents: en_labels is en shifted left by one token.
print(en[0][:10])
print(en_labels[0][:10])
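
The one-token shift that prepare_batch() applies to the English side can be illustrated with plain Python lists. This is a sketch of the idea only: the helper `shift_targets` is hypothetical, and the IDs 2 and 3 for [START] and [END] are read off the tokenizer output shown earlier.

```python
START, END = 2, 3   # token IDs as they appear in the encoded output above

def shift_targets(token_ids, max_tokens=128):
    """Return (decoder_inputs, labels): the same sequence offset by one token."""
    trimmed = token_ids[:max_tokens + 1]
    decoder_inputs = trimmed[:-1]   # drops the last token ([END] if nothing was trimmed)
    labels = trimmed[1:]            # drops the [START] token
    return decoder_inputs, labels

en_ids = [START, 71, 2813, 446, 15, END]
decoder_inputs, labels = shift_targets(en_ids)
print(decoder_inputs)   # [2, 71, 2813, 446, 15]
print(labels)           # [71, 2813, 446, 15, 3]
```

At every position the label is the token that follows the decoder input, which is what lets the model learn next-token prediction for all positions in one pass.
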

6-5 Positional Encoding

Remember the functions we glanced at earlier?

$$ \begin{aligned}
PE_{(pos, 2i)} &= \sin\left(pos / 10000^{2i/d_{model}}\right) \\
PE_{(pos, 2i+1)} &= \cos\left(pos / 10000^{2i/d_{model}}\right)
\end{aligned} $$

Implement them as follows.

Source Code - Positional Encoding

def positional_encoding(length, depth):
  depth = depth/2

  positions = np.arange(length)[:, np.newaxis]     # (seq, 1)
  depths = np.arange(depth)[np.newaxis, :]/depth   # (1, depth)

  angle_rates = 1 / (10000**depths)         # (1, depth)
  angle_rads = positions * angle_rates      # (pos, depth)

  pos_encoding = np.concatenate(
      [np.sin(angle_rads), np.cos(angle_rads)],
      axis=-1) 

  return tf.cast(pos_encoding, dtype=tf.float32)

class PositionalEmbedding(tf.keras.layers.Layer):
  def __init__(self, vocab_size, d_model):
    super().__init__()
    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True) 
    self.pos_encoding = positional_encoding(length=2048, depth=d_model)

  def compute_mask(self, *args, **kwargs):
    return self.embedding.compute_mask(*args, **kwargs)

  def call(self, x):
    length = tf.shape(x)[1]
    x = self.embedding(x)
    # This factor sets the relative scale of the embedding and positional_encoding.
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x = x + self.pos_encoding[tf.newaxis, :length, :]
    return x

The PositionalEmbedding class above is the embedding layer that adds these positional encoding vectors to the token embeddings.
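
One detail is worth checking: positional_encoding() above concatenates all sine columns followed by all cosine columns, while the formula interleaves them (sine in even columns, cosine in odd columns). The NumPy sketch below suggests the two layouts contain the same values in a different column order. The function names are assumptions for illustration.

```python
import numpy as np

def angles(length, d_model):
    pos = np.arange(length)[:, np.newaxis]        # (length, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]    # (1, d_model / 2)
    return pos / (10000 ** (2 * i / d_model))     # (length, d_model / 2)

def interleaved_encoding(length, d_model):
    """Layout of the formula: even columns hold sin, odd columns hold cos."""
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles(length, d_model))
    pe[:, 1::2] = np.cos(angles(length, d_model))
    return pe

def concatenated_encoding(length, d_model):
    """Layout of positional_encoding(): all sin columns, then all cos columns."""
    a = angles(length, d_model)
    return np.concatenate([np.sin(a), np.cos(a)], axis=-1)

pe = interleaved_encoding(50, 16)
cat = concatenated_encoding(50, 16)
print(np.allclose(pe[:, 0::2], cat[:, :8]), np.allclose(pe[:, 1::2], cat[:, 8:]))
```

Because the following layers are learned, a fixed permutation of the encoding columns should not change what the model can represent.
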

6-6 Base Attention Layer

BaseAttention serves as the parent class for the attention layers implemented below. Its **kwargs are forwarded to the MultiHeadAttention class defined in the Keras library.

Source Code - BaseAttention

class BaseAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()

6-7 Cross Attention Layer

As in Figure 1, the key and value come from the Encoder side through the context parameter, and the query comes from the Decoder side through x.

Source Code - CrossAttentionLayer

class CrossAttention(BaseAttention):
  def call(self, x, context):
    attn_output, attn_scores = self.mha(
        query=x,
        key=context,
        value=context,
        return_attention_scores=True)

    # Cache the attention scores for plotting later.
    self.last_attn_scores = attn_scores

    x = self.add([x, attn_output])
    x = self.layernorm(x)

    return x

6-8 Global Self-Attention Layer

The query, key, and value all come from the x parameter.

Source Code - GlobalSelfAttentionLayer

class GlobalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x

6-8 Causal Self-Attention Layer

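Causal Self-Attention has the same form, with a causal mask added so that each position attends only to itself and earlier positions.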

Source Code - CausalSelfAttentionLayer

class CausalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x,
        use_causal_mask = True)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x

6-9 Feed Forward Network

Source Code - FeedForwardNetwork

class FeedForward(tf.keras.layers.Layer):
  def __init__(self, d_model, dff, dropout_rate=0.1):
    super().__init__()
    self.seq = tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),
      tf.keras.layers.Dense(d_model),
      tf.keras.layers.Dropout(dropout_rate)
    ])
    self.add = tf.keras.layers.Add()
    self.layer_norm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    x = self.add([x, self.seq(x)])
    x = self.layer_norm(x) 
    return x

6-10 Encoder Layer

Combine the GlobalSelfAttention layer and the FeedForward layer built above into an EncoderLayer class.

Source Code - EncoderLayer

class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self,*, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()

    self.self_attention = GlobalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.ffn = FeedForward(d_model, dff)

  def call(self, x):
    x = self.self_attention(x)
    x = self.ffn(x)
    return x

6-11 Encoder

The Encoder is completed by combining PositionalEmbedding with the EncoderLayer built above.

Source Code - Encoder

class Encoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads,
               dff, vocab_size, dropout_rate=0.1):
    super().__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.pos_embedding = PositionalEmbedding(
        vocab_size=vocab_size, d_model=d_model)

    self.enc_layers = [
        EncoderLayer(d_model=d_model,
                     num_heads=num_heads,
                     dff=dff,
                     dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

  def call(self, x):
    # `x` is token-IDs shape: (batch, seq_len)
    x = self.pos_embedding(x)  # Shape `(batch_size, seq_len, d_model)`.

    # Add dropout.
    x = self.dropout(x)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x)

    return x  # Shape `(batch_size, seq_len, d_model)`.

6-12 Decoder Layer

Combine the CausalSelfAttention and CrossAttention layers built above, followed by a FeedForward layer, into a DecoderLayer.

Source Code - DecoderLayer

class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self,
               *,
               d_model,
               num_heads,
               dff,
               dropout_rate=0.1):
    super(DecoderLayer, self).__init__()

    self.causal_self_attention = CausalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.cross_attention = CrossAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.ffn = FeedForward(d_model, dff)

  def call(self, x, context):
    x = self.causal_self_attention(x=x)
    x = self.cross_attention(x=x, context=context)

    # Cache the last attention scores for plotting later
    self.last_attn_scores = self.cross_attention.last_attn_scores

    x = self.ffn(x)  # Shape `(batch_size, seq_len, d_model)`.
    return x

6-13 Decoder

Combine the DecoderLayer just built with the PositionalEmbedding from earlier.

Source Code - Decoder

class Decoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads, dff, vocab_size,
               dropout_rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size,
                                             d_model=d_model)
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dec_layers = [
        DecoderLayer(d_model=d_model, num_heads=num_heads,
                     dff=dff, dropout_rate=dropout_rate)
        for _ in range(num_layers)]

    self.last_attn_scores = None

  def call(self, x, context):
    # `x` is token-IDs shape (batch, target_seq_len)
    x = self.pos_embedding(x)  # (batch_size, target_seq_len, d_model)

    x = self.dropout(x)

    for i in range(self.num_layers):
      x  = self.dec_layers[i](x, context)

    self.last_attn_scores = self.dec_layers[-1].last_attn_scores

    # The shape of x is (batch_size, target_seq_len, d_model).
    return x

6-14 Transformer

As mentioned at the start, the Transformer consists of an Encoder and a Decoder. With the individual pieces in place, the final step is to build the Transformer from the Encoder and Decoder classes.

Source Code - Transformer

class Transformer(tf.keras.Model):
  def __init__(self, *, num_layers, d_model, num_heads, dff,
               input_vocab_size, target_vocab_size, dropout_rate=0.1):
    super().__init__()
    self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=input_vocab_size,
                           dropout_rate=dropout_rate)

    self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=target_vocab_size,
                           dropout_rate=dropout_rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inputs):
    # To use a Keras model with `.fit` you must pass all your inputs in the
    # first argument.
    context, x  = inputs

    context = self.encoder(context)  # (batch_size, context_len, d_model)

    x = self.decoder(x, context)  # (batch_size, target_len, d_model)

    # Final linear layer output.
    logits = self.final_layer(x)  # (batch_size, target_len, target_vocab_size)

    try:
      # Drop the keras mask, so it doesn't scale the losses/metrics.
      # b/250038731
      del logits._keras_mask
    except AttributeError:
      pass

    # Return the final output logits.
    return logits

6-15 Set Hyperparameters

Set a few hyperparameters and create the transformer instance.

Source Code - Set Hyperparameters

# Values recommended by the TensorFlow example.
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1

# Values used in the Transformer paper.
# num_layers = 6 # number of layers in the encoder and in the decoder
# d_model = 512 # size of the inputs and outputs of the encoder and decoder
# dff = 2048 # size of the hidden layer of the feed-forward network inside each layer
# num_heads = 8 # number of attention heads run in parallel
# dropout_rate = 0.1

transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    dropout_rate=dropout_rate)

It would be nice to use the values from the Transformer paper as they are, but in many environments that configuration will likely fail to run for lack of VRAM.

6-16 Set Optimizer, Loss Function

Set the optimizer (Adam), the loss function, and so on. CustomSchedule is the class that sets the learning_rate.

Source Code - Set Optimizer

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super().__init__()

    # Store d_model as a float so it can be used in tf.math.rsqrt below.
    self.d_model = tf.cast(d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
    
    
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
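
To see what CustomSchedule produces, the same formula can be evaluated in plain Python. This is a sketch that mirrors the class above without TensorFlow; the function name `transformer_lr` is hypothetical, and the defaults reuse d_model = 128 and warmup_steps = 4000 from this post.

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

The rate rises linearly during the first warmup_steps steps, peaks at step 4000, and then decays with the inverse square root of the step number, so step 16000 gets half the peak rate.
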

Source Code - Masked Loss and Accuracy

def masked_loss(label, pred):
  mask = label != 0
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
  loss = loss_object(label, pred)

  mask = tf.cast(mask, dtype=loss.dtype)
  loss *= mask

  loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
  return loss


def masked_accuracy(label, pred):
  pred = tf.argmax(pred, axis=2)
  label = tf.cast(label, pred.dtype)
  match = label == pred

  mask = label != 0

  match = match & mask

  match = tf.cast(match, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(match)/tf.reduce_sum(mask)
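
The idea behind masked_loss can be reproduced in NumPy to confirm that padded positions do not affect the result. This is an illustrative re-implementation, not the Keras code path; the name `masked_cross_entropy` and the toy shapes are assumptions, and padding is assumed to use ID 0 as in the code above.

```python
import numpy as np

def masked_cross_entropy(labels, logits, pad_id=0):
    """Mean cross-entropy over non-padding tokens. labels: (B, T), logits: (B, T, V)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_loss = -np.take_along_axis(log_probs, labels[..., np.newaxis], axis=-1)[..., 0]
    mask = (labels != pad_id).astype(log_probs.dtype)
    return (token_loss * mask).sum() / mask.sum()

labels = np.array([[5, 7, 0, 0]])    # two real tokens followed by two padding tokens
logits = np.zeros((1, 4, 10))        # uniform predictions over a vocabulary of 10
loss = masked_cross_entropy(labels, logits)
print(loss)                          # ln(10), about 2.3026

noisy = logits.copy()
noisy[0, 2:, :] = np.random.default_rng(0).normal(size=(2, 10))   # change padded positions only
print(np.isclose(masked_cross_entropy(labels, noisy), loss))       # True
```
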

6-17 Train

Finally, start training.

Source Code - Train

transformer.compile(
    loss=masked_loss,
    optimizer=optimizer,
    metrics=[masked_accuracy])
    
transformer.fit(train_batches,
                epochs=20,
                validation_data=val_batches)

Output - Train

Epoch 1/20
810/810 [==============================] - 181s 194ms/step - loss: 6.5914 - masked_accuracy: 0.1456 - val_loss: 5.0451 - val_masked_accuracy: 0.2516
Epoch 2/20
810/810 [==============================] - 110s 135ms/step - loss: 4.5738 - masked_accuracy: 0.2981 - val_loss: 4.0411 - val_masked_accuracy: 0.3555
Epoch 3/20
810/810 [==============================] - 107s 131ms/step - loss: 3.8290 - masked_accuracy: 0.3791 - val_loss: 3.4058 - val_masked_accuracy: 0.4397
Epoch 4/20
810/810 [==============================] - 105s 129ms/step - loss: 3.2869 - masked_accuracy: 0.4387 - val_loss: 3.0181 - val_masked_accuracy: 0.4835
Epoch 5/20
810/810 [==============================] - 105s 129ms/step - loss: 2.8805 - masked_accuracy: 0.4850 - val_loss: 2.7747 - val_masked_accuracy: 0.5135
Epoch 6/20
810/810 [==============================] - 105s 129ms/step - loss: 2.5655 - masked_accuracy: 0.5232 - val_loss: 2.4830 - val_masked_accuracy: 0.5541
Epoch 7/20
810/810 [==============================] - 103s 127ms/step - loss: 2.2968 - masked_accuracy: 0.5585 - val_loss: 2.3379 - val_masked_accuracy: 0.5755
Epoch 8/20
810/810 [==============================] - 105s 130ms/step - loss: 2.1044 - masked_accuracy: 0.5844 - val_loss: 2.2247 - val_masked_accuracy: 0.5899
Epoch 9/20
810/810 [==============================] - 105s 129ms/step - loss: 1.9574 - masked_accuracy: 0.6055 - val_loss: 2.1576 - val_masked_accuracy: 0.6014
Epoch 10/20
810/810 [==============================] - 105s 129ms/step - loss: 1.8399 - masked_accuracy: 0.6218 - val_loss: 2.1136 - val_masked_accuracy: 0.6078
Epoch 11/20
810/810 [==============================] - 104s 128ms/step - loss: 1.7417 - masked_accuracy: 0.6361 - val_loss: 2.0858 - val_masked_accuracy: 0.6143
Epoch 12/20
810/810 [==============================] - 105s 129ms/step - loss: 1.6591 - masked_accuracy: 0.6482 - val_loss: 2.0575 - val_masked_accuracy: 0.6222
Epoch 13/20
810/810 [==============================] - 104s 128ms/step - loss: 1.5876 - masked_accuracy: 0.6590 - val_loss: 2.0522 - val_masked_accuracy: 0.6210
...
Epoch 19/20
810/810 [==============================] - 103s 127ms/step - loss: 1.2902 - masked_accuracy: 0.7059 - val_loss: 2.0291 - val_masked_accuracy: 0.6315
Epoch 20/20
810/810 [==============================] - 105s 129ms/step - loss: 1.2564 - masked_accuracy: 0.7114 - val_loss: 2.0455 - val_masked_accuracy: 0.6283

6-18 Translate

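Many examples train with fit(), check values such as loss and accuracy, and stop there, leaving out the part that matters most: how to actually use the trained model. So the code below uses our trained model to attempt a translation.

The sentences used in most examples are already 'validated', so instead we take passages from articles on a Portuguese news site, translate them into English with Google Translate, translate them again with our Transformer model, and compare the two. (This is not a proper way to validate, but since I don't know Portuguese well, Google Translate's already excellent output serves as an indirect reference.)

The code takes the transformer model created above and the tokenizers variable used for preprocessing. The sentence (in Portuguese) is fed to the encoder. The decoder output starts from the [START] token, and generation stops at the [END] token.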

Source Code - Translator

class Translator(tf.Module):
  def __init__(self, tokenizers, transformer):
    self.tokenizers = tokenizers
    self.transformer = transformer

  def __call__(self, sentence, max_length=MAX_TOKENS):
    # The input sentence is Portuguese; the tokenizer adds the '[START]' and '[END]' tokens.
    assert isinstance(sentence, tf.Tensor)  # sentence must be an instance of tf.Tensor
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]

    # After tokenization the sentence becomes a tensor of token IDs, roughly like this:
    #[[   2  246   40   40 5413 6571   40  366   40 1170  155 4362  612  121
    #84 6679 4752  190  504   83  818 3502  287   94 6422  342  116 1120
    #83 3915  124 1155 1297   99  307   40 2955 2726   16    3]]
    sentence = self.tokenizers.pt.tokenize(sentence).to_tensor()

    encoder_input = sentence

    # The output language is English, so initialize the output with the '[START]' token.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis]
    end = start_end[1][tf.newaxis]

    # The output_array that accumulates the generated tokens.
    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)

    for i in tf.range(max_length):
      output = tf.transpose(output_array.stack())  # stack() merges the written tensors into one; transpose() swaps its axes to (batch, tokens)
      predictions = self.transformer([encoder_input, output], training=False)

      # Select the predictions for the last token only.
      predictions = predictions[:, -1:, :]  # Shape `(batch_size, 1, vocab_size)`.

      predicted_id = tf.argmax(predictions, axis=-1)

      # Append `predicted_id` to the output, which is fed back to the decoder as its input.
      output_array = output_array.write(i+1, predicted_id[0])

      if predicted_id == end:
        break

    output = tf.transpose(output_array.stack())
    text = self.tokenizers.en.detokenize(output)[0]  # Shape: `()`.
    # Run the model once more on the full output; the result is not used here.
    self.transformer([encoder_input, output[:,:-1]], training=False)

    return text
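
The loop inside Translator is a greedy decoding loop: at each step it asks the model for scores over the vocabulary, picks the argmax, appends it, and stops at [END]. The same idea can be sketched without TensorFlow. This is only an illustrative sketch: `toy_model` is a hypothetical stand-in for the trained transformer, and the ids 2 and 3 for [START] and [END] are assumed from the token ids shown in the tokenizer comment.

```python
from typing import Callable, List, Sequence

# Assumed ids for [START] and [END]; the real ids come from the tokenizer.
START, END = 2, 3


def greedy_decode(
    next_token_scores: Callable[[Sequence[int]], Sequence[float]],
    max_length: int = 16,
) -> List[int]:
    """Append the highest-scoring next token until [END] or max_length steps."""
    output = [START]
    for _ in range(max_length):
        scores = next_token_scores(output)
        # argmax over the vocabulary, like tf.argmax(predictions, axis=-1)
        predicted_id = max(range(len(scores)), key=scores.__getitem__)
        output.append(predicted_id)
        if predicted_id == END:
            break
    return output


def toy_model(prefix: Sequence[int]) -> List[float]:
    """Hypothetical model: prefers (last id + 2) while below 8, then [END]."""
    vocab_size = 10
    scores = [0.0] * vocab_size
    candidate = prefix[-1] + 2
    scores[candidate if candidate < 8 else END] = 1.0
    return scores


decoded = greedy_decode(toy_model)
print(decoded)  # [2, 4, 6, 3]
```

Greedy decoding only ever keeps the single best token per step, which is cheap but can lock in an early mistake; that is one reason a weakly trained model can drift into nonsense partway through a sentence.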

Source Code - Try Translate 1

translator = Translator(tokenizers, transformer)

def print_translation(sentence, tokens, ground_truth):
  print(f'{"Input:":15s}: {sentence}')
  print(f'{"Prediction":15s}: {tokens.numpy().decode("utf-8")}')
  print(f'{"Ground truth":15s}: {ground_truth}')

# Sentence taken from a Portuguese news article; ground_truth is the Google Translate output.
sentence = "Durante a audição Ana Abrunhosa foi questionada pelos deputados da oposição quanto ao plano de redução das portagens mas nada adiantou."
ground_truth = "During the hearing, Ana Abrunhosa was questioned by opposition deputies about the plan to reduce tolls, but nothing helped."

translated_text = translator(tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Output - Translated 1

Input:         : Durante a audição Ana Abrunhosa foi questionada pelos deputados da oposição quanto ao plano de redução das portagens mas nada adiantou.
Prediction     : during the audace was that he was questioned by the oppppist of the opposition of the compliberate complived of the door - uploads , but nothing hated .
Ground truth   : During the hearing, Ana Abrunhosa was questioned by opposition deputies about the plan to reduce tolls, but nothing helped.

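The result is absurdly off. A few words bear some 'resemblance' to the correct ones, but the translation as a whole is nonsense.
This happens because our Transformer was trained on a dataset of only about 50,000 sentences. The sentence above is also not an everyday sentence but one skewed toward a specific domain (politics), so let's look for a more general article.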

Source Code - Try Translate 2

# Sentence about Irina Shayk at Cannes; ground_truth is the Google Translate output.
sentence = "Esta foi a terceira aparição de Irina Shayk este ano em Cannes"
ground_truth = "This was Irina Shayk's third appearance at Cannes this year."

# Translator.__call__ returns only the translated text, so unpack a single value.
translated_text = translator(tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Output - Translated 2

Input:         : Esta foi a terceira aparição de Irina Shayk este ano em Cannes
Prediction     : this was the third toension of going shakenk this year in canninez
Ground truth   : This was Irina Shayk's third appearance at Cannes this year.

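Hmm... lol. The proper noun (Irina Shayk) and the words related to the Cannes festival do not seem to have been learned. A dataset of only 50,000 sentences is clearly far too little for the translator to learn from. Let's look for an even more ordinary sentence.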

Source Code - Try Translate 3

# A short everyday sentence; ground_truth is the Google Translate output.
sentence = "A cena repete-se todos os anos."
ground_truth = "The scene repeats itself every year."

# Translator.__call__ returns only the translated text, so unpack a single value.
translated_text = translator(tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Output - Translated 3

Input:         : A cena repete-se todos os anos.
Prediction     : the scene repeats every year .
Ground truth   : The scene repeats itself every year.

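At last, a passable translation. With a very ordinary phrase, the model translates successfully! A translator built on the Transformer alone should normally produce results that are 'reasonably' readable. But because the model we tested was trained on a very small dataset, it is still incomplete: words it never learned get replaced with completely unrelated ones. With a much larger dataset it would very likely do far better.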

7. BLEU Score

We will not compute the BLEU score of the translator we built. As mentioned repeatedly above, the dataset is so small that it is hard to get a meaningful score out of it. Instead, this chapter briefly covers what the BLEU score is and roughly how it is computed.

The initial idea behind the BLEU score is almost embarrassingly simple. (Simple enough to get you scolded if you brought it to your professor...)

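That idea starts with unigram precision, where A is the candidate (machine) translation and B is the reference translation:

$$ \text{Unigram Precision} = \frac{\text{number of words in sentence A that also appear in sentence B}}{\text{total number of words in sentence A}} $$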

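In other words, unigram precision simply compares sentence A with sentence B and measures how many of A's words also occur in B. Doesn't that seem odd? There is no way translation quality can be judged like this. In particular, a candidate that repeats the same matching word over and over will naturally score high... which makes no sense.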
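
The repeated-word problem is easy to reproduce. The following is a minimal sketch of plain unigram precision using whitespace tokenization; the function name and the example sentences are made up for illustration and are not from the original post.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that appear anywhere in the reference."""
    cand_words = candidate.split()
    ref_words = set(reference.split())
    matches = sum(1 for word in cand_words if word in ref_words)
    return matches / len(cand_words)


reference = "the cat is on the mat"

# A sensible candidate: 5 of its 6 words occur in the reference.
good = unigram_precision("the cat sat on the mat", reference)

# A degenerate candidate: every word is "the", and "the" is in the reference.
degenerate = unigram_precision("the the the the the the", reference)

print(good, degenerate)
```

The degenerate candidate scores a perfect 1.0 while the sensible one scores only 5/6, which is exactly the flaw described above.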

So unigram precision is refined by correcting for duplicates:

$$ \text{Modified Unigram Precision} = \frac{\text{sum of the duplicate-corrected (clipped) counts of each unigram in A}}{\text{total number of unigrams in sentence A}} $$
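
One common way to apply the duplicate correction is clipping: each candidate word is counted at most as many times as it occurs in the reference. Below is a minimal sketch assuming a single reference and whitespace tokenization; the function name and sentences are illustrative only.

```python
from collections import Counter


def modified_unigram_precision(candidate: str, reference: str) -> float:
    """Unigram precision where each word's count is clipped by the reference count."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # A word cannot be credited more often than it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


reference = "the cat is on the mat"

# "the" occurs 6 times in the candidate but only 2 times in the reference.
degenerate = modified_unigram_precision("the the the the the the", reference)
good = modified_unigram_precision("the cat sat on the mat", reference)

print(degenerate, good)
```

With clipping, the all-"the" candidate drops from a perfect score to 2/6, while the sensible candidate keeps its 5/6.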

With that, the 'duplicate problem' seems more or less solved. But isn't it still odd? Surely the accuracy of a translation cannot be judged merely by matching word frequencies.

KO : 나는 오늘 바나나를 먹었다. (I ate a banana today.)
EN : banana today I a ate

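Every word is present, but the word order is wrong. To address this problem, n-grams are introduced. With n-grams, each word is counted together with the word(s) that follow it, so a match requires a run of n consecutive words. (n-grams themselves come in several kinds: unigram, bigram, trigram, and so on.)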
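
The effect of n-grams on the scrambled example can be checked directly. This sketch assumes "I ate a banana today" as the reference translation of the Korean sentence (the post gives only the scrambled output), and the helper names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(words: List[str], n: int) -> List[Tuple[str, ...]]:
    """All runs of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def modified_ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision against a single reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate shorter than n words
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / total


reference = "I ate a banana today"   # assumed reference translation
candidate = "banana today I a ate"   # scrambled output from the example

p1 = modified_ngram_precision(candidate, reference, 1)
p2 = modified_ngram_precision(candidate, reference, 2)
print(p1, p2)  # 1.0 0.25
```

Every word matches, so the unigram precision is still 1.0, but only one of the four candidate bigrams ("banana today") occurs in the reference, so the bigram precision falls to 0.25.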

Well-made libraries already exist for n-gram based BLEU, so it is easy to try out.

import nltk.translate.bleu_score as bleu

candidate = 'Iran has a direct route to send Russia weapons – and Western powers can do little to stop the shipments'
references = ['Iran has a direct route to send arms to Russia, and the Western powers can do little to block the shipments.']

# sentence_bleu expects a list of tokenized references and one tokenized candidate.
tokenized_references = [ref.split() for ref in references]
print('BLEU :', bleu.sentence_bleu(tokenized_references, candidate.split()))
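
To make it clearer what such a library call combines, here is a simplified, dependency-free sketch of sentence-level BLEU: the geometric mean of the clipped 1-gram to 4-gram precisions, multiplied by a brevity penalty that punishes candidates shorter than the reference. It assumes a single reference, uniform weights, and no smoothing, so it is not NLTK's implementation and its numbers need not match NLTK's in every case; `simple_bleu` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def _ngram_counts(words: List[str], n: int) -> Counter:
    """Counts of all runs of n consecutive words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU with uniform n-gram weights."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = _ngram_counts(cand, n)
        ref_counts = _ngram_counts(ref, n)
        total = sum(cand_counts.values())
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c), which equals 1 when the lengths match.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat"
identical = simple_bleu(reference, reference)
truncated = simple_bleu("the cat is on", reference)
scrambled = simple_bleu("banana today I a ate", "I ate a banana today")
print(identical, truncated, scrambled)
```

An identical candidate scores 1.0; the truncated candidate has perfect n-gram precisions but is cut down by the brevity penalty; and the scrambled sentence scores 0.0 because it shares no trigram with its reference.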

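The BLEU score has several drawbacks. For example, a word with the same meaning is still judged wrong if it differs from the word used in the reference. But its overwhelming 'generality', the fact that it can be applied to any language in the world, is the key strength of the BLEU score, and because that strength outweighs the drawbacks it is currently used to evaluate a great many machine translation models.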

License: Creative Commons Attribution, NonCommercial, ShareAlike
