Input and output processing when a prompt is added to the input

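So far I have come across two main ways of adding a prompt to the input.

The first adds the prompt only to the initial input, i.e. at the embedding layer.

The second adds the prompt at every layer.

The first method is used in the paper The Power of Scale for Parameter-Efficient Prompt Tuning, and it is what the second paper discussed below (Prefix-Tuning) calls the "embedding-only ablation".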

The inputs to a Huggingface transformer model are input_ids and attention_mask.

To put a prompt into the input, concat arbitrary placeholder ids to the front of input_ids (they are never used anyway) and concat 1s to the front of attention_mask.
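As a rough illustration of this padding step, here is a minimal sketch in plain PyTorch. The tensor values, `prompt_length`, and the names `dummy_ids` / `prompt_mask` are all made up for the example and are not taken from any particular codebase.

```python
import torch

prompt_length = 4  # hypothetical number of soft-prompt positions

# Toy batch: 2 sequences of 3 token ids each.
input_ids = torch.tensor([[11, 12, 13], [21, 22, 23]])
attention_mask = torch.ones_like(input_ids)

batch_size = input_ids.size(0)

# Placeholder ids: their values do not matter because the custom
# embedding layer is expected to ignore these positions.
dummy_ids = torch.zeros(batch_size, prompt_length, dtype=input_ids.dtype)

# The prompt positions must be attended to, so their mask is 1.
prompt_mask = torch.ones(batch_size, prompt_length, dtype=attention_mask.dtype)

padded_input_ids = torch.cat([dummy_ids, input_ids], dim=1)
padded_attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)

print(padded_input_ids.shape, padded_attention_mask.shape)
```

Both tensors end up with sequence length prompt_length + input_length, which is what the model sees from here on.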

Then replace the model's embedding layer, wte, with a new one using model.set_input_embeddings(s_wte).

Inside its forward() function, the newly defined s_wte ignores input_ids[:, :prompt_length] and returns a newly created embedding matrix as the soft prompt in their place.

The original input, input_ids[:, prompt_length:], is embedded with the original embedding matrix.

The soft prompt and these embeddings are then concatenated and passed on to the self-attention layers.
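The embedding wrapper described above could look roughly like the following. This is only a sketch: the class name `SoftEmbedding`, the initialisation from the first vocabulary rows, and the toy sizes are my own assumptions, and a small `nn.Embedding` stands in for the real model's wte.

```python
import torch
import torch.nn as nn


class SoftEmbedding(nn.Module):
    """Hypothetical wrapper: learned soft prompt + original token embedding."""

    def __init__(self, wte: nn.Embedding, prompt_length: int):
        super().__init__()
        self.wte = wte
        self.prompt_length = prompt_length
        # Trainable soft prompt. Copying the first vocabulary rows is just
        # one possible initialisation (an assumption, not a requirement).
        self.soft_prompt = nn.Parameter(
            wte.weight[:prompt_length].detach().clone()
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Skip the placeholder ids and embed only the real tokens.
        token_embeds = self.wte(input_ids[:, self.prompt_length:])
        # Repeat the same soft prompt for every sequence in the batch.
        prompt = self.soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)


wte = nn.Embedding(100, 16)              # stand-in for the model's wte
s_wte = SoftEmbedding(wte, prompt_length=4)
ids = torch.randint(0, 100, (2, 7))      # 4 placeholder ids + 3 real ids
out = s_wte(ids)
print(out.shape)
```

With a real Huggingface model, a module like this would be handed to model.set_input_embeddings(s_wte); for prompt tuning one would usually also freeze every parameter except soft_prompt.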

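Because the input grew by the prompt, prepend as many -100 values to the label as there are prompt tokens.

Otherwise the label keeps the original input_length while the output is longer by the prompt length, so the shapes do not match.

-100 is the default ignore_index of CrossEntropyLoss in torch, so the values at the front of the output are ignored instead of being compared with the label.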
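To check that the -100 padding really drops the prompt positions from the loss, here is a small sketch with random logits. The sizes are arbitrary, and the label shifting that a causal LM normally applies is left out to keep the example short.

```python
import torch
import torch.nn.functional as F

prompt_length, vocab_size = 4, 50

labels = torch.tensor([[5, 6, 7], [8, 9, 10]])   # (batch=2, input_length=3)

# Prepend -100 for every prompt position.
ignore = torch.full((labels.size(0), prompt_length), -100, dtype=labels.dtype)
padded_labels = torch.cat([ignore, labels], dim=1)   # (2, 7)

# Stand-in for model output: one logit vector per position, prompt included.
logits = torch.randn(2, prompt_length + labels.size(1), vocab_size)

# ignore_index defaults to -100, so the prompt positions are skipped.
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), padded_labels.reshape(-1)
)

# Same loss computed only over the real token positions, for comparison.
loss_real_only = F.cross_entropy(
    logits[:, prompt_length:].reshape(-1, vocab_size), labels.reshape(-1)
)

print(torch.allclose(loss, loss_real_only))
```

The two losses agree because ignored positions contribute neither to the sum nor to the averaging denominator.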

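The second method is used in the paper Prefix-Tuning: Optimizing Continuous Prompts for Generation.

Source: https://github.com/XiangLi1999/PrefixTuning (the code is a bit messy...)

Adding the prompt to the activations of every layer is described in sections 4.1 and 7.2 of the paper.

To implement this, pass the prefix to the huggingface transformers library through the past_key_values argument.

In the tuple format, past_key_values holds, for each of the n_layers layers, a pair of tensors (key and value), each of shape (batch_size, n_head, seq_length, head_dim), where head_dim = embed_dim / n_head and seq_length is the prefix length.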
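A minimal sketch of building such a structure from a trainable prefix is shown below, using plain PyTorch only. The parameter layout and all sizes are assumptions for illustration; the Prefix-Tuning repository reparameterises the prefix through an MLP, which is omitted here. Also note that recent transformers versions use cache objects rather than raw tuples, so treat the tuple layout as the older convention.

```python
import torch
import torch.nn as nn

n_layers, batch_size, n_head, prefix_len, head_dim = 2, 3, 4, 5, 8

# Hypothetical trainable prefix: one key block and one value block per layer.
prefix = nn.Parameter(torch.randn(n_layers, 2, n_head, prefix_len, head_dim))

# Add a batch dimension and broadcast the same prefix to every sequence:
# (n_layers, 2, batch_size, n_head, prefix_len, head_dim)
expanded = prefix.unsqueeze(2).expand(-1, -1, batch_size, -1, -1, -1)

# Repackage as a tuple with one (key, value) pair per layer.
past_key_values = tuple((layer[0], layer[1]) for layer in expanded)

print(len(past_key_values), past_key_values[0][0].shape)
```

When a structure like this is passed to a model, the attention_mask typically also has to cover the prefix positions, i.e. have length prefix_len + input_len.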

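The past_key_values passed in this way are split per layer and, when attention is computed inside each layer, concatenated with the key and value of the input.

The output then has shape (batch_size, input_length, dimension), with the prompt length not included.

The prompt length added to the key and value of the self-attention layer disappears because the Query is passed in without the prefix attached, so the output shape follows the Query length.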

A = Q * K^T -> (input_len, prompt_len + input_len)

A * V -> (input_len, dimension)
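The shape argument can be checked with a few random tensors. This is a sketch of a single multi-head attention step with made-up sizes; the causal mask and the output projection are omitted because they do not change the shapes.

```python
import torch

batch_size, n_head, prompt_len, input_len, head_dim = 2, 4, 5, 3, 16

# Query, key and value computed from the actual input tokens only.
q = torch.randn(batch_size, n_head, input_len, head_dim)
k_input = torch.randn(batch_size, n_head, input_len, head_dim)
v_input = torch.randn(batch_size, n_head, input_len, head_dim)

# Prefix key/value for this layer (what past_key_values would supply).
k_prefix = torch.randn(batch_size, n_head, prompt_len, head_dim)
v_prefix = torch.randn(batch_size, n_head, prompt_len, head_dim)

# The prefix is prepended along the sequence axis of key and value only.
k = torch.cat([k_prefix, k_input], dim=2)
v = torch.cat([v_prefix, v_input], dim=2)

# A = Q K^T: (input_len, prompt_len + input_len) per head.
scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
weights = scores.softmax(dim=-1)

# A V: (input_len, head_dim) per head, so the prefix length is gone.
context = weights @ v

# Merge the heads back into (batch_size, input_len, dimension).
output = context.transpose(1, 2).reshape(batch_size, input_len, n_head * head_dim)

print(scores.shape, context.shape, output.shape)
```

Only the Query decides how many rows the result has, so the prefix changes what each position attends to but not how many positions come out.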
