PyTorch Tutorial 19.3: Asynchronous Random Search



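As we saw in Section 19.2, evaluating a hyperparameter configuration is expensive, so we might have to wait hours or even days before random search returns a good configuration. In practice, we often have access to a pool of resources, such as multiple GPUs on the same machine or multiple machines with a single GPU each. This raises the question: how do we distribute random search efficiently?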

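In general, we distinguish between synchronous and asynchronous parallel hyperparameter optimization (see Fig. 19.3.1). In the synchronous setting, we wait for all concurrently running trials to finish before we start the next batch. Consider configuration spaces that contain hyperparameters such as the number of filters or the number of layers of a deep neural network. Configurations with more layers or filters naturally take longer to finish, and all other trials in the same batch have to wait at the synchronization points (grey areas in Fig. 19.3.1) before the optimization process can continue.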

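In the asynchronous setting, we schedule a new trial as soon as resources become available. This makes optimal use of our resources, since we avoid any synchronization overhead. For random search, each new hyperparameter configuration is chosen independently of all others, and in particular without using observations from any prior evaluation. This means random search can be parallelized asynchronously in a trivial way. The same is not straightforward for more sophisticated methods that make decisions based on previous observations (see Section 19.5). While we need access to more resources than in the sequential setting, asynchronous random search exhibits a linear speed-up: a given level of performance is reached roughly K times faster if K trials can run in parallel.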
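To make the difference between the two settings concrete, the following is a minimal simulation sketch. It is not part of the original notebook: the trial durations are made-up numbers, and the helper names `synchronous_wallclock` and `asynchronous_wallclock` are hypothetical. It only computes wall-clock time under each scheduling rule; no training takes place.

```python
import heapq


def synchronous_wallclock(durations, n_workers):
    """Wall-clock time when trials run in batches of n_workers and every
    batch waits for its slowest trial (the synchronization point)."""
    total = 0.0
    for start in range(0, len(durations), n_workers):
        total += max(durations[start:start + n_workers])
    return total


def asynchronous_wallclock(durations, n_workers):
    """Wall-clock time when the next trial starts as soon as a worker is free."""
    # Min-heap holding the time at which each worker becomes available.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical trial runtimes (arbitrary units) that include two stragglers.
durations = [1, 4, 1, 1, 4, 1]
sequential = sum(durations)
sync = synchronous_wallclock(durations, n_workers=2)
asyn = asynchronous_wallclock(durations, n_workers=2)
print(sequential, sync, asyn)
```

With these assumed durations and two workers, the sequential run takes 12 units, the synchronous schedule 9, and the asynchronous schedule 7, which is close to the ideal of 6 for a twofold speed-up.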

https://file.elecfans.com/web2/M00/A9/CE/poYBAGR9PW-AdScKAACpELBjcYw139.svg

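Fig. 19.3.1 Distributing the hyperparameter optimization process either synchronously or asynchronously. Compared to the sequential setting, we can reduce the overall wall-clock time while keeping the total compute constant. In the case of stragglers, synchronous scheduling may leave workers idle.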

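In this notebook, we look at asynchronous random search where trials are executed in multiple Python processes on the same machine. Distributed job scheduling and execution are difficult to implement from scratch. We will use Syne Tune (Salinas et al., 2022), which provides a simple interface for asynchronous HPO. Syne Tune is designed to run with different execution back-ends, and the interested reader is invited to study its simple APIs to learn more about distributed HPO.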

import logging
from d2l import torch as d2l

logging.basicConfig(level=logging.INFO)
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import RandomSearch

				 

				INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
  pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
  pip install 'syne-tune[aws]'
or (for everything)
  pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
  pip install 'syne-tune[aws]'
or (for everything)
  pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
  pip install 'syne-tune[raytune]'
or (for everything)
  pip install 'syne-tune[extra]'

			

19.3.1. Objective Function

First, we have to define a new objective function that now returns its performance to Syne Tune via the report callback.

def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    # Dependencies are imported inside the function, as PythonBackend requires
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = trainer.validation_error().cpu().detach().numpy()
        # Report the metric back to Syne Tune after every epoch
        report(epoch=epoch, validation_error=float(validation_error))

					 

Note that the PythonBackend of Syne Tune requires dependencies to be imported inside the function definition.
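The reporting pattern used by the objective function can be tried without a GPU or Syne Tune installed. The sketch below is an illustration only: `ListReporter` is a hypothetical stand-in that collects reports in a list (it is not Syne Tune's `Reporter`), and `toy_objective` replaces training with a synthetic error curve.

```python
class ListReporter:
    """Hypothetical stand-in for a reporting callback: stores every report."""

    def __init__(self):
        self.results = []

    def __call__(self, **metrics):
        self.results.append(dict(metrics))


def toy_objective(learning_rate, max_epochs, report):
    # Imports live inside the function, mirroring the note above.
    import math

    for epoch in range(1, max_epochs + 1):
        # Synthetic validation error that decays with epochs; no real model.
        validation_error = math.exp(-learning_rate * epoch)
        # One report per epoch, using the same keyword names as the notebook.
        report(epoch=epoch, validation_error=float(validation_error))


report = ListReporter()
toy_objective(learning_rate=0.5, max_epochs=3, report=report)
for entry in report.results:
    print(entry)
```

Reporting after every epoch, rather than once at the end, gives the caller one entry per epoch to inspect.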

19.3.2. Asynchronous Scheduler

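First, we define the number of workers that evaluate trials concurrently. We also need to specify how long we want to run random search by setting an upper limit on the total wall-clock time.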

n_workers = 2  # Needs to be <= the number of available GPUs

max_wallclock_time = 12 * 60  # 12 minutes

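Next, we state which metric we want to optimize and whether we want to minimize or maximize it. Note that metric must match the argument name passed to the report callback.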

mode = "min"
metric = "validation_error"
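As a rough illustration of how `mode` and `metric` interact, the sketch below picks the best entry from a list of reported results. The helper `best_result` and the result values are hypothetical and are not part of Syne Tune; the point is only that the metric name must be a key of each report.

```python
def best_result(results, metric, mode):
    """Return the reported result with the best value of `metric`."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    pick = min if mode == "min" else max
    return pick(results, key=lambda result: result[metric])


# Made-up reports; the key matches the keyword used in the report callback.
reported = [
    {"trial_id": 0, "validation_error": 0.31},
    {"trial_id": 1, "validation_error": 0.18},
    {"trial_id": 2, "validation_error": 0.25},
]
best = best_result(reported, metric="validation_error", mode="min")
print(best["trial_id"])
```

If the metric name did not match a reported key, the lookup in this sketch would fail with a `KeyError`.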

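We use the configuration space from our previous example. In Syne Tune, this dictionary can also be used to pass constant attributes to the training script. We use this feature to pass max_epochs. In addition, we specify the first configuration to be evaluated in initial_config.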

config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": 10,
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}
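To see what drawing from such a configuration space amounts to, here is a standard-library sketch. It assumes that a log-uniform domain is sampled uniformly in log space and that the integer domain includes both end points; `sample_loguniform` and `sample_config` are hypothetical helpers and do not reproduce Syne Tune's sampling code.

```python
import math
import random


def sample_loguniform(lower, upper, rng):
    """Sample uniformly in log space, so each decade is equally likely."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


def sample_config(rng):
    return {
        "learning_rate": sample_loguniform(1e-2, 1.0, rng),
        "batch_size": rng.randint(32, 256),  # both end points included
        "max_epochs": 10,  # constant attribute, passed through unchanged
    }


rng = random.Random(0)
samples = [sample_config(rng) for _ in range(1000)]
# Fraction of learning rates below 0.1; roughly one half is expected,
# because [0.01, 0.1] and [0.1, 1] each cover one decade.
fraction_small = sum(s["learning_rate"] < 0.1 for s in samples) / len(samples)
print(fraction_small)
```

Sampling the learning rate on a log scale avoids concentrating almost all draws in the upper decade, which is what a plain uniform distribution on [0.01, 1] would do.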

					 

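Next, we need to specify the back-end for job execution. Here we only consider distribution on a local machine, where parallel jobs are executed as sub-processes. For large-scale HPO, however, we could also run this on a cluster or in a cloud environment, where each trial consumes a full instance.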

trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)
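The log output later in this section shows each trial being launched as a Python sub-process with the configuration passed as command-line flags such as `--learning_rate 0.1`. The sketch below illustrates that general pattern only. The helpers `config_to_cli_args` and `parse_trial_args` are hypothetical, and nothing here is claimed to match Syne Tune's internal implementation.

```python
import argparse


def config_to_cli_args(config):
    """Flatten a configuration dict into a list of command-line flags."""
    args = []
    for name, value in config.items():
        args += [f"--{name}", str(value)]
    return args


def parse_trial_args(argv):
    """What a trial script could do to recover its typed configuration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float)
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--max_epochs", type=int)
    return vars(parser.parse_args(argv))


config = {"learning_rate": 0.1, "batch_size": 128, "max_epochs": 10}
argv = config_to_cli_args(config)
recovered = parse_trial_args(argv)
print(argv)
print(recovered)
```

Passing the configuration as plain strings keeps the trial process independent of the scheduler process, at the cost of having to restore the types on the receiving side.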

					 

We can now create the scheduler for asynchronous random search, which behaves similarly to our BasicScheduler from Section 19.2.

scheduler = RandomSearch(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
)

					 

					INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 4033665588
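The role of `points_to_evaluate` can be sketched with a small stand-in class. `ToyRandomSearch` below is hypothetical and much simpler than Syne Tune's `RandomSearch`. It assumes that user-specified configurations are suggested first, that later configurations are sampled independently, and that constant attributes are attached to every suggestion.

```python
import math
import random


class ToyRandomSearch:
    """Hypothetical random-search suggester, not Syne Tune's RandomSearch."""

    def __init__(self, constants, points_to_evaluate=(), seed=0):
        self._constants = dict(constants)
        self._queue = [dict(point) for point in points_to_evaluate]
        self._rng = random.Random(seed)

    def suggest(self):
        if self._queue:
            # Serve user-specified configurations first, in order.
            config = self._queue.pop(0)
        else:
            # Otherwise sample independently of all earlier trials.
            log_lr = self._rng.uniform(math.log(1e-2), math.log(1.0))
            config = {
                "learning_rate": math.exp(log_lr),
                "batch_size": self._rng.randint(32, 256),
            }
        # Constant attributes are attached to every configuration.
        return {**self._constants, **config}


searcher = ToyRandomSearch(
    constants={"max_epochs": 10},
    points_to_evaluate=[{"learning_rate": 0.1, "batch_size": 128}],
)
first = searcher.suggest()
second = searcher.suggest()
print(first)
print(second)
```

Because `suggest` never looks at earlier results, any number of workers can call it at any time, which is what makes random search easy to run asynchronously.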

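Syne Tune also features a Tuner, which centralizes the main experiment loop and bookkeeping and mediates the interaction between the scheduler and the back-end.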

stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)

tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)
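The kind of loop that a tuner centralizes can be approximated with the standard library. The sketch below is an assumption-laden toy, not Syne Tune's `Tuner`. It uses threads instead of sub-processes, `toy_trial` sleeps instead of training, the function names are hypothetical, and the rule for what happens to running trials when the time budget ends (they are allowed to finish) is a choice made for this example.

```python
import concurrent.futures as cf
import random
import time


def toy_trial(config):
    """Hypothetical trial: sleeps briefly and returns a synthetic error."""
    time.sleep(config["duration"])
    error = abs(config["learning_rate"] - 0.1)
    return {"config": config, "validation_error": error}


def suggest(rng):
    # Each configuration is drawn independently of all earlier results.
    return {
        "learning_rate": rng.uniform(0.01, 1.0),
        "duration": rng.uniform(0.01, 0.05),
    }


def run_async_random_search(n_workers, max_wallclock_time, seed=0):
    rng = random.Random(seed)
    results = []
    start = time.monotonic()
    with cf.ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Fill every worker once at the beginning.
        running = {pool.submit(toy_trial, suggest(rng)) for _ in range(n_workers)}
        while running:
            done, running = cf.wait(running, return_when=cf.FIRST_COMPLETED)
            for future in done:
                results.append(future.result())
                # Refill the freed worker unless the time budget is spent.
                if time.monotonic() - start < max_wallclock_time:
                    running.add(pool.submit(toy_trial, suggest(rng)))
    return results


results = run_async_random_search(n_workers=2, max_wallclock_time=0.3)
best = min(results, key=lambda result: result["validation_error"])
print(len(results), best["validation_error"])
```

The key line is the call to `cf.wait` with `FIRST_COMPLETED`: the loop reacts to each finished trial individually instead of waiting for a whole batch.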

					 

Let us run our distributed HPO experiment. According to our stopping criterion, it will run for about 12 minutes.

tuner.run()

					 

					INFO:syne_tune.tuner:results of trials will be saved on /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691
INFO:root:Detected 8 GPUs
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.31642002803324326 --batch_size 52 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.31642002803324326, 'batch_size': 52, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 0 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.045813161553582046 --batch_size 71 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint