Efficient Large Language Model Training with LoRA and Hugging Face - 电子发烧友网

In this post, we show how to fine-tune the 11-billion-parameter FLAN-T5 XXL model on a single GPU using Low-Rank Adaptation of Large Language Models (LoRA). Along the way, we use Hugging Face's Transformers, Accelerate, and PEFT libraries.

Quick intro: Parameter-Efficient Fine-Tuning (PEFT)

PEFT is a new open-source library from Hugging Face. It lets you efficiently adapt pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters.
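To make the low-rank idea concrete, here is a minimal sketch of a LoRA-style forward pass on plain Python lists. It is not how the PEFT library is implemented; the toy sizes, the matrices, and the helper names `matvec` and `lora_forward` are assumptions chosen purely for illustration. The frozen weight `W` is left untouched, and only the two small matrices `A` and `B` would be trained.

```python
# Minimal LoRA forward pass on plain Python lists (illustrative sketch only).
# W stays frozen; only A (r x d_in) and B (d_out x r) would receive gradients.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical toy sizes: d_in = d_out = 4, rank r = 1.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
A = [[0.5, 0.0, 0.0, 0.0]]
B_init = [[0.0], [0.0], [0.0], [0.0]]     # B starts at zero, so the adapter is a no-op
B_trained = [[1.0], [0.0], [0.0], [0.0]]  # pretend values after some training
x = [1.0, 2.0, 3.0, 4.0]

y_init = lora_forward(W, A, B_init, x, alpha=2, r=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)

full_params = len(W) * len(W[0])                                  # weights in W
lora_params = len(A) * len(A[0]) + len(B_init) * len(B_init[0])   # weights in A and B
print(y_init, y_trained, full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and training only has to learn the small correction. In this toy case the saving is modest; it grows quickly as the layer width increases while the rank stays small.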

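Note: This tutorial was created and run on a g5.2xlarge AWS EC2 instance, which includes 1 NVIDIA A10G.

1. Set up the development environment

In this example, we use the AWS-provided PyTorch Deep Learning AMI, which already has the correct CUDA drivers and PyTorch installed. On top of that, we still need to install a few Hugging Face libraries, including transformers and datasets. Running the following cell installs all required packages.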

# install Hugging Face libraries
!pip install git+https://github.com/huggingface/peft.git
!pip install "transformers==4.27.1" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" "bitsandbytes==0.37.1" loralib --upgrade --quiet
# install additional dependencies needed for training
!pip install rouge-score tensorboard py7zr

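2. Load and prepare the dataset

Here we use the samsum dataset, which contains about 16k messenger-style conversations with summaries. The conversations were created by linguists fluent in English.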

{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}

We use the load_dataset() method from the Datasets library to load the samsum dataset.

from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("samsum")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

# Train dataset size: 14732
# Test dataset size: 819

To train the model, we need to convert the input text into token IDs using a Transformers tokenizer.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-xxl"

# Load tokenizer of FLAN-T5 XXL
tokenizer = AutoTokenizer.from_pretrained(model_id)

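Before we can start training, we also need to preprocess the data. Abstractive summarization is a text-generation task: we feed text to the model and it outputs a summary. We need to know how long the input and output texts are so that we can batch the data efficiently.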

from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    lambda x: tokenizer(x["dialogue"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"],
)
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# take the 85th percentile of the input lengths for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    lambda x: tokenizer(x["summary"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"],
)
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# take the 90th percentile of the target lengths for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")
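The effect of capping the sequence length at a percentile, rather than at the longest sample, is easier to see on a toy example. The sketch below uses only the standard library; the `percentile` helper is a hand-written stand-in that uses linear interpolation between ranks (which, to my understanding, is also NumPy's default), and the length values are invented for illustration.

```python
import math

def percentile(values, q):
    """q-th percentile using linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical token counts: nine ordinary dialogues and one very long outlier.
toy_lengths = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]

max_len_cap = int(percentile(toy_lengths, 85))
truncated = sum(1 for n in toy_lengths if n > max_len_cap)

print(f"cap: {max_len_cap}, longest sample: {max(toy_lengths)}, truncated samples: {truncated}")
```

A single outlier would otherwise force every sample to be padded to 1000 tokens. With the percentile cap, only the longest few samples are truncated and every batch is much shorter.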

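We preprocess the dataset once before training and save the result to disk. You can run this step on your local machine or on a CPU and upload the result to the Hugging Face Hub.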

def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100
    # when we want to ignore padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")
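The label-masking step is small but easy to get wrong, so here is a self-contained sketch of it without the tokenizer. It assumes the pad token id is 0 (as it is for T5 tokenizers) and uses -100 because that is the index PyTorch's cross-entropy loss ignores by default. The token ids and the helper names are made up for the example.

```python
PAD_ID = 0           # assumption: T5 tokenizers use id 0 for <pad>
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def pad_to_max_length(token_ids, max_length, pad_id=PAD_ID):
    """Truncate to max_length, then right-pad with pad_id."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

def mask_label_padding(label_ids, pad_id=PAD_ID):
    """Replace padding positions in the labels with IGNORE_INDEX."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]

# Hypothetical token ids for a short summary, ending in the end-of-sequence id 1.
toy_summary_ids = [8774, 6, 125, 1]

padded = pad_to_max_length(toy_summary_ids, max_length=6)
labels = mask_label_padding(padded)

print(padded)
print(labels)
```

Only the labels are masked. The input ids keep their ordinary pad tokens, and the attention mask tells the model to ignore them.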

3. Fine-tune T5 with LoRA and bnb int-8

In addition to LoRA, we use bitsandbytes LLM.int8() to quantize the frozen LLM to int8. This lets us reduce the memory needed for FLAN-T5 XXL by roughly 4x.
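As a rough back-of-the-envelope check on that figure, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count is derived from the totals printed by print_trainable_parameters() later in this article; the bytes-per-parameter table and the helper name are assumptions, and activations, optimizer state, and layers kept in higher precision are deliberately ignored.

```python
# Base-model parameters: total reported later in the article minus the LoRA parameters.
BASE_PARAMS = 11_154_206_720 - 18_874_368

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gib(num_params, dtype):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

memory = {dtype: weight_memory_gib(BASE_PARAMS, dtype) for dtype in BYTES_PER_PARAM}

for dtype, gib in memory.items():
    print(f"{dtype:>8}: {gib:6.2f} GiB")

print(f"float32 / int8 ratio: {memory['float32'] / memory['int8']:.1f}x")
```

Under these assumptions the 4x figure is relative to float32 weights; compared with a float16 checkpoint the saving is closer to 2x. Either way, the int8 weights fit within the 24 GB of an A10G with room left for activations and the LoRA parameters.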

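The first step of training is to load the model. We use philschmid/flan-t5-xxl-sharded-fp16, a sharded version of google/flan-t5-xxl. Sharding helps us avoid running out of memory while loading the model.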

from transformers import AutoModelForSeq2SeqLM

# Hugging Face Hub model id
model_id = "philschmid/flan-t5-xxl-sharded-fp16"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

Now we can prepare the model for LoRA int-8 training using peft.

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
# prepare int-8 model for training
model = prepare_model_for_int8_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# trainable params: 18874368 || all params: 11154206720 || trainable%: 0.16921300163961817

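As you can see, we are training only about 0.17% of the model's parameters! This large reduction in memory footprint lets us fine-tune the model without worrying about running out of memory.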
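The trainable-parameter count can be reproduced by hand. The sketch below assumes the FLAN-T5 XXL architecture has a hidden size of 4096, an attention inner dimension of 4096 (64 heads of size 64), and 24 encoder plus 24 decoder layers; these values are not stated in this article and should be checked against the model's config. With r=16 adapters on the q and v projections, the arithmetic lands exactly on the number printed above.

```python
# Assumed architecture values (verify against the model config before relying on them).
D_MODEL = 4096             # hidden size
INNER_DIM = 64 * 64        # num_heads * d_kv, output size of the q and v projections
NUM_ENCODER_LAYERS = 24
NUM_DECODER_LAYERS = 24

RANK = 16                  # r from the LoraConfig in this article
TARGETS_PER_ATTENTION = 2  # "q" and "v"
ALL_PARAMS = 11_154_206_720  # total reported by print_trainable_parameters()

def lora_params_per_module(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

# Each encoder layer has one attention block (self-attention);
# each decoder layer has two (self-attention and cross-attention).
attention_blocks = NUM_ENCODER_LAYERS * 1 + NUM_DECODER_LAYERS * 2
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION

trainable = adapted_modules * lora_params_per_module(D_MODEL, INNER_DIM, RANK)
percent = 100 * trainable / ALL_PARAMS

print(f"adapted modules: {adapted_modules}")
print(f"trainable params: {trainable} ({percent:.4f}% of all params)")
```

Since the count scales linearly with r, doubling the rank would double the trainable parameters while still leaving them a tiny fraction of the full model.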

Next, we need to create a DataCollator that takes care of padding the inputs and labels. We use DataCollatorForSeq2Seq from the Transformers library for this.

from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
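To illustrate what padding to a multiple of 8 means, here is a simplified stand-in for the collator's padding logic. It is not the DataCollatorForSeq2Seq implementation; the `pad_batch` helper and the token ids are hypothetical. Inputs are padded with the pad token id (assumed to be 0), labels with -100.

```python
def pad_batch(sequences, pad_id, multiple=8):
    """Right-pad every sequence to the batch maximum, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple) * multiple  # ceiling division, then scale back up
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]

# Hypothetical token ids for a batch of two samples.
input_batch = pad_batch([[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8, 9]], pad_id=0)
label_batch = pad_batch([[11, 12], [13, 14, 15]], pad_id=-100)

print([len(seq) for seq in input_batch])
print(label_batch)
```

In this tutorial the samples were already padded to a fixed length during preprocessing, so the collator mostly just stacks them. The rounding matters more with dynamic padding, where sequence lengths that are multiples of 8 tend to map better onto GPU tensor cores.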

The last step is to define the training hyperparameters (TrainingArguments).

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir = "lora-flan-t5-xxl"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

Run the code below to start training. Note that for T5, some layers are kept in float32 for stability.

# train model
trainer.train()

Training took about 10 hours and 36 minutes, and 10 hours of training cost about $13.22. By comparison, a full fine-tune of FLAN-T5-XXL for the same 10 hours would need 8x A100 40GB and cost about $322.
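The comparison can be turned into a ratio with a few lines of arithmetic. The sketch below uses only the two cost figures quoted above; the implied hourly rates are simply those totals divided by 10 hours, not official instance prices.

```python
HOURS = 10
LORA_COST_USD = 13.22     # single A10G, figure quoted in this article
FULL_FT_COST_USD = 322.0  # 8x A100 40GB, figure quoted in this article

lora_rate = LORA_COST_USD / HOURS        # implied USD per hour
full_rate = FULL_FT_COST_USD / HOURS     # implied USD per hour
cost_ratio = FULL_FT_COST_USD / LORA_COST_USD
savings_pct = 100 * (1 - LORA_COST_USD / FULL_FT_COST_USD)

print(f"implied hourly rate, LoRA: ${lora_rate:.2f}/h, full fine-tune: ${full_rate:.2f}/h")
print(f"full fine-tuning costs {cost_ratio:.1f}x as much; LoRA saves {savings_pct:.0f}%")
```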

We can save the model for later inference and evaluation. For now we save it to disk, but you could also upload it to the Hugging Face Hub with the model.push_to_hub method.

# Save our LoRA model & tokenizer results
peft_model_id = "results"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# if you want to save the base model, call
# trainer.model.base_model.save_pretrained(peft_model_id)

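The resulting LoRA checkpoint is small: only 84MB, yet it contains everything the model learned from the samsum dataset.

4. Evaluate and run inference with LoRA FLAN-T5

We will use the evaluate library to compute the ROUGE score. We can run inference with the FLAN-T5 XXL model using PEFT and transformers. For FLAN-T5 XXL, we need at least 18GB of GPU memory.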

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc.
peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=True, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"": 0})
model.eval()

print("Peft model loaded")

Let's try summarization on a random sample from the test set.

from datasets import load_dataset
from random import randrange

# Load dataset from the hub and get a sample
dataset = load_dataset("samsum")
sample = dataset["test"][randrange(len(dataset["test"]))]

input_ids = tokenizer(sample["dialogue"], return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=True, top_p=0.9)
print(f"input sentence: {sample['dialogue']}\n{'---' * 20}")

print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")
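The generate() call above samples with top_p=0.9, also known as nucleus sampling. The sketch below shows the core idea on a toy distribution: keep the most probable tokens until their cumulative mass reaches top_p, renormalize, and sample from that set. It is a simplified illustration, not the Transformers implementation, and the probabilities and helper names are invented.

```python
import random

def nucleus(probabilities, top_p):
    """Return the ids of the smallest high-probability set whose mass reaches top_p."""
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    kept, mass = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        mass += probabilities[token_id]
        if mass >= top_p:
            break
    return kept

def renormalize(probabilities, kept):
    """Rescale the kept probabilities so they sum to 1."""
    total = sum(probabilities[i] for i in kept)
    return {i: probabilities[i] / total for i in kept}

# Hypothetical next-token distribution over a 4-token vocabulary.
toy_probs = [0.05, 0.5, 0.15, 0.3]

kept = nucleus(toy_probs, top_p=0.9)
distribution = renormalize(toy_probs, kept)

rng = random.Random(0)  # fixed seed so the example is repeatable
sampled = rng.choices(list(distribution), weights=list(distribution.values()), k=1)[0]

print(f"kept token ids: {kept}, sampled token id: {sampled}")
```

The least likely token (id 0) can never be sampled here, which is how top-p trims the unreliable tail of the distribution while still allowing some variety.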

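Nice! Our model works! Now let's take a closer look and evaluate it on the full test set. To do that, we need a few utility functions that generate summaries and pair them with the corresponding reference summaries. The most commonly used metric for summarization is rouge_score, short for Recall-Oriented Understudy for Gisting Evaluation. Unlike plain accuracy, it compares a generated summary against a set of reference summaries.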

import evaluate
import numpy as np
from datasets import load_from_disk
from tqdm import tqdm

# Metric
metric = evaluate.load("rouge")

def evaluate_peft_model(sample, max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample["labels"] != -100, sample["labels"], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    return prediction, labels

# load test dataset from disk
test_dataset = load_from_disk("data/eval/").with_format("torch")

# run predictions
# this can take ~45 minutes
predictions, references = [], []
for sample in tqdm(test_dataset):
    p, l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)

# compute metric
rouge = metric.compute(predictions=predictions, references=references, use_stemmer=True)

# print results
print(f"rouge1: {rouge['rouge1'] * 100:2f}%")
print(f"rouge2: {rouge['rouge2'] * 100:2f}%")
print(f"rougeL: {rouge['rougeL'] * 100:2f}%")
print(f"rougeLsum: {rouge['rougeLsum'] * 100:2f}%")

# rouge1: 50.386161%
# rouge2: 24.842412%
# rougeL: 41.370130%
# rougeLsum: 41.394230%
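To give a feel for what the rouge1 number measures, here is a simplified unigram-overlap score in pure Python. It is only a sketch: the real metric loaded through evaluate applies its own tokenization and optional stemming, so its values will differ. The `tokenize` and `rouge1` helpers and the example prediction are assumptions for illustration; the reference is the sample summary shown earlier in this article.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1(prediction, reference):
    """Unigram overlap between a prediction and one reference."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "Amanda baked cookies and will bring Jerry some tomorrow."
prediction = "Amanda baked cookies for Jerry"  # hypothetical model output

scores = rouge1(prediction, reference)
print({name: round(value, 3) for name, value in scores.items()})
```

Four of the five predicted words appear in the reference, so precision is high, while recall is lower because the prediction leaves out much of the reference. ROUGE-2 applies the same idea to word pairs, and ROUGE-L uses the longest common subsequence.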

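Our PEFT fine-tuned FLAN-T5-XXL achieved a rouge1 score of 50.38% on the test set. By comparison, a full fine-tune of flan-t5-base achieved a rouge1 score of 47.23, so rouge1 improved by about 3 percentage points.

Remarkably, our LoRA checkpoint is only 84MB, and it performs better than a fully fine-tuned checkpoint of a smaller model.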

Reviewed and edited by: Liu Qing
