# Horovod Reference Architecture

[![pipeline status](https://git.sztaki.hu/science-cloud/reference-architectures/horovod/badges/master/pipeline.svg)](https://git.sztaki.hu/science-cloud/reference-architectures/horovod/-/commits/master) [![Latest Release](https://git.sztaki.hu/science-cloud/reference-architectures/horovod/-/badges/release.svg)](https://git.sztaki.hu/science-cloud/reference-architectures/horovod/-/releases)

[Horovod](https://horovod.readthedocs.io/en/v0.26.0/) is a distributed deep learning framework for TensorFlow, Keras, PyTorch, and Apache MXNet. It was originally developed by Uber to make distributed deep learning fast and simple, and to significantly reduce the training time of neural networks. With Horovod, the training of neural networks can easily be parallelized across potentially hundreds of GPUs, with minimal modifications to the original code.

Using the Horovod reference architecture, along with Terraform and Ansible, the following infrastructure can be provisioned on the [ELKH Cloud](https://science-cloud.hu/en):

![horovod_ref_arch](docs/pics/Horovod_Ref_Arch.png "Horovod reference architecture")

The Horovod nodes are initialized within Docker containers on the virtual machines. A designated master node hosts a [JupyterLab](https://jupyterlab.readthedocs.io/en/3.5.x/) instance, providing an interface for performing deep learning operations. Training data and code are shared between the nodes via NFS, in the form of the shared '/horovod' directory. The network settings and firewall rules (security groups) of the nodes are also created and configured automatically.

## Features

- Support for distributed deep learning applications
- JupyterLab development environment
- Network-based file sharing between nodes
- Optional Prometheus and Grafana based monitoring

The most important libraries of the deep learning software stack available on each node can be seen below:
|package|version  |
|:-|--|
| horovod | 0.26.1 |
| tensorflow | 2.9.2 |
| keras | 2.9.0 |
| torch | 1.12.1+cu113 |
| tensorboard | 2.9.1 |
| jupyterlab| 3.5.1 |

## Prerequisites

- An SSH key configured on the ELKH Cloud
- Terraform and Ansible are required
  - You can install them by running the commands below, or omit the installation and use the [RefArch Toolset Docker image](https://git.sztaki.hu/science-cloud/reference-architectures/horovod#deployment-using-the-refarch-toolset-docker-image), which includes these tools.

Install Terraform in accordance with the [official guide](https://learn.hashicorp.com/tutorials/terraform/install-cli):

```
sudo apt-get update && sudo apt-get install -y gnupg software-properties-common curl
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform=1.3.6
```
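
You can optionally verify the installation:
```
terraform version
```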

Install Ansible in accordance with the [official guide](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html):

```
sudo apt install python3-pip
sudo python3 -m pip install ansible==6.7.0
```
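
As above, you can optionally verify the installation:
```
ansible --version
```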

## Deployment

1. Download and extract descriptor files:

```
wget https://git.sztaki.hu/science-cloud/reference-architectures/horovod/-/archive/master/horovod-master.tar.gz -O raf-horovod.tar.gz
tar -zxvf raf-horovod.tar.gz
```

2. Enter the directory of OpenStack descriptors:

```
cd horovod-master/terraform_openstack
```

3. Customize OpenStack descriptors:

- An **application credential** and your **authentication URL** are needed to enable Terraform to access your resources.

    - You can create an application credential on the OpenStack web interface under **Identity > Application Credentials**. Please note that application credentials are valid only for the project selected at the time of their creation.
    - You can find your authentication URL under **Project > API Access**, as the endpoint of the entry labeled 'Identity'.
- In the `auth_data.auto.tfvars` file, authentication information must be set according to the following format:
```
auth_data = ({
    credential_id     = "SET_YOUR_CREDENTIAL_ID"
    credential_secret = "SET_YOUR_CREDENTIAL_SECRET"
    auth_url          = "SET_YOUR_AUTH_URL"
})
```
- In the `resources.auto.tfvars` file, properties of the deployment must be set:
  - In the `horovod_master_node` block, the desired properties of the Horovod master node must be set. Please choose an appropriate volume size for your needs, as both training data and code will be stored on the Horovod master.
  - In the `horovod_worker_node` block, the desired properties of the Horovod worker nodes must be set.
  - In the `horovod_shared_volume` block, properties of the shared volume must be set. This volume is dedicated to deep learning data and hosts the contents of the shared '/horovod' directory. Please choose an appropriate size for your needs.
  - In the `horovod_network` block, the name and subnet range of the network to which the Horovod nodes will be connected must be set.
  - In the `monitoring_server` block, the properties of the monitoring server can be set. Setting a floating IP is optional. This block only has to be filled in if monitoring is enabled (see below).
  - In the `monitoring_volume` block, properties of the monitoring volume must be set. This volume provides dedicated storage for Prometheus data. Please choose an appropriate size for your needs.
  - In the `user_config` block, additional customization can be performed using variables. The following options are available (an illustrative sketch of this block is shown after this list):
    - `jupyter_password`: Set the password for accessing JupyterLab.
    - `enable_gpu`: Enables the utilization of GPU resources in the Horovod containers. Requires NVIDIA drivers and the NVIDIA Container Runtime on the virtual machines.
    - `monitoring`: Includes Prometheus and Grafana based monitoring for the provisioned cluster. Initializes a monitoring server and additional containers on the Horovod nodes. If you have not associated a floating IP with the monitoring server, then in order to access the Grafana web interface after deployment, you must perform SSH port forwarding on your local machine with the following command:
       ```
       ssh -i {path to your private key} -L localhost:3000:{monitoring server private IP}:3000 ubuntu@{horovod master public IP}
       ```
       Before issuing the command, substitute the {placeholders} with the required parameters. The path to your private key should be the path to the private part of the SSH key you configured on OpenStack. You can find the private IP of the monitoring server and the public IP of the Horovod master on the OpenStack interface after deployment.
       After executing the command, you can access Grafana by opening 'localhost:3000' in your web browser.

    - `monitoring_password`: Set the password for accessing Grafana. The default username is 'admin'.

    - `nvidia_driver_install`: Performs the installation of the NVIDIA driver and the NVIDIA Container Runtime using NVIDIA's [nvidia_driver](https://galaxy.ansible.com/nvidia/nvidia_driver) and [nvidia_docker](https://galaxy.ansible.com/nvidia/nvidia_docker) Ansible roles. The roles must be installed beforehand with the following command:
       ```
       ansible-galaxy install nvidia.nvidia_driver,v2.2.1 nvidia.nvidia_docker,v1.2.4
       ```
    - `elkh_cloud_dedicated_network`: A dedicated, high-bandwidth network for the Horovod cluster on the ELKH Cloud. This is currently a limited, experimental feature and is not recommended for use.
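
For illustration only, a `user_config` block in `resources.auto.tfvars` might look like the sketch below. The variable names are the ones documented above; the surrounding structure and the placeholder values are assumptions, so always follow the format of the `resources.auto.tfvars` file shipped with the repository:

```
user_config = ({
    jupyter_password             = "SET_YOUR_JUPYTER_PASSWORD"
    enable_gpu                   = false
    monitoring                   = true
    monitoring_password          = "SET_YOUR_GRAFANA_PASSWORD"
    nvidia_driver_install        = false
    elkh_cloud_dedicated_network = false
})
```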

4. Add the private SSH key to the SSH agent:

```
eval $(ssh-agent -s) && echo "$(cat PATH_TO_YOUR_KEY)" | tr -d '\r' | ssh-add -
```
- In place of `PATH_TO_YOUR_KEY`, the path to the private part of the previously configured SSH key must be set.
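
You can check that the key was added successfully by listing the identities currently held by the agent:
```
ssh-add -l
```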

5. Provision the Horovod cluster:

```
terraform init
terraform apply --auto-approve
```
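
If you would like to review the planned changes before applying them, you can optionally run:
```
terraform plan
```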

6. (Optional) Terminate the Horovod cluster:

```
terraform destroy
```

**Note:** Executing `terraform destroy` will terminate the shared volume along with all other resources. If you wish to retain data stored on it, perform the following steps:
- Before destroying, create a snapshot of the shared volume
- Create a new copy of the volume using said snapshot
- Delete the snapshot
- Execute `terraform destroy`
- Before rebuilding, set `reattach_volume_id` in `resources.auto.tfvars` to the ID of the previously created copy
- The Horovod cluster will be recreated with your saved training data on the next apply
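
The snapshot and copy steps above can be performed on the OpenStack web interface. As a sketch, they can also be carried out with the OpenStack CLI, assuming it is installed and authenticated against your project; the volume and snapshot names below are illustrative:
```
# Create a snapshot of the shared volume
openstack volume snapshot create --volume horovod-shared-volume horovod-data-snapshot
# Create a new volume from the snapshot
openstack volume create --snapshot horovod-data-snapshot horovod-data-copy
# Delete the snapshot once the copy is available
openstack volume snapshot delete horovod-data-snapshot
# Look up the ID of the copy, to be used as 'reattach_volume_id'
openstack volume show horovod-data-copy -c id -f value
```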

## Deployment using the RefArch Toolset Docker image
- Using the RefArch Toolset image, you can omit the installation of extra tools such as Terraform and Ansible.

When running the container, you must mount a working directory and the private part of your configured SSH key into the container:
```
docker run -it -v PATH_TO_WORKDIR:/home/refarch -v PATH_TO_PRIVATE_KEY:/root/.ssh/id_rsa:ro git.sztaki.hu:5050/science-cloud/reference-architectures/refarch-toolset bash
```
- In place of 'PATH_TO_WORKDIR', the path to your working directory must be set.
  - This directory will contain the descriptor files.
  - You can perform the download and customization of the descriptor files (steps 1-3 of [Deployment](https://git.sztaki.hu/science-cloud/reference-architectures/horovod#deployment)) in this working directory before running the container, or inside the container after running it.
  - The mounted working directory will be the initial working directory inside the container.
- In place of 'PATH_TO_PRIVATE_KEY', the path to the file containing the private part of your configured SSH key must be set.
  - The private key file is mounted with the read-only option.
  - Adding the private SSH key to the SSH agent (step 4 of [Deployment](https://git.sztaki.hu/science-cloud/reference-architectures/horovod#deployment)) can be omitted when using the container.
- After the initial steps, you can freely perform provisioning and termination (steps 5 and 6 of [Deployment](https://git.sztaki.hu/science-cloud/reference-architectures/horovod#deployment)) within the container.

## Usage

### Accessing the JupyterLab interface ###
- JupyterLab will be accessible at the configured floating IP of the Horovod master, on port 8888.
- The password will be the one configured in `resources.auto.tfvars`, or 'elkhcloud' by default.
- For additional details regarding JupyterLab, please refer to the [official documentation](https://jupyterlab.readthedocs.io/en/3.5.x/).

### Preparing training data and code ###

When performing distributed training, all nodes within the cluster must have access to both the training data and the training script.
- The Horovod reference architecture enables efficient file sharing among the nodes of the cluster using NFS.
- The contents of the '/horovod' directory are accessible from all Horovod nodes.
- Within this directory, you can also find a file named 'cluster.txt'. This file contains a human-readable summary of the nodes in the cluster and their resources (such as CPU, GPU, and memory).
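
For example, assuming you work from a terminal on the Horovod master (such as a terminal opened in JupyterLab) and the shared directory is available at '/horovod', you can stage your script and data so that every node sees them (the file and directory names below are illustrative):
```
cp train.py /horovod/
cp -r my_dataset /horovod/
```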

In order to enable the distribution of training, the training script has to be modified.
- Using Horovod, only a few lines of additional code are required in the training script.
- For the details of implementation, please refer to the [official Horovod documentation](https://horovod.readthedocs.io/en/stable/summary_include.html#usage).
- In the '/examples' directory, example training scripts can be found for a variety of frameworks. These scripts are part of the official Horovod repository and give an overview of the additions necessary to make training code compatible with Horovod.

### Running Horovod ###

The snippet below shows the horovodrun command in a simplified form. The first two arguments specify the number of processes we intend to utilize during training. Generally, one process is allocated per GPU/CPU.
```
horovodrun -np N -H host1:n1,host2:n2 -p 12345 python train.py
```
- With the '-np' flag, we must specify the total number of processes (N).
- With the '-H' flag, we must specify the addresses of individual hosts (host1, host2), and their corresponding number of processes (n1, n2). The total number of processes will be the sum of all individual processes (N=n1+n2).
- The '-p' flag specifies the port to use for communication between the nodes. In the case of the Horovod reference architecture, this value should always be set to '12345', as the Horovod nodes can communicate through this port.
- Lastly, 'train.py' represents the training script we wish to run.

To run Horovod on a single host with one GPU:
```
horovodrun -np 1 -H 192.168.0.1:1 -p 12345 python train.py
```
To run Horovod on four hosts with one GPU each:
```
horovodrun -np 4 -H 192.168.0.1:1,192.168.0.2:1,192.168.0.3:1,192.168.0.4:1 -p 12345 python train.py
```
To simplify usage, the Horovod reference architecture includes an automatically generated hostfile, which can be supplied instead of a long list of IP addresses and process numbers. The '-hostfile' flag replaces the '-H' flag. The hostfile, named 'HOSTLIST', can be found within the shared '/horovod' directory. The '-np' flag, along with the total number of processes, still has to be supplied manually (the total number of processes is the sum of the individual process counts labeled as 'slots' within the 'HOSTLIST' file).
```
horovodrun -np 4 -hostfile HOSTLIST -p 12345 python train.py
```
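
A horovodrun hostfile lists one host per line together with its number of available slots. For the four-host example above, a hostfile would look roughly as follows; the actual 'HOSTLIST' generated for your cluster will contain the addresses of your own nodes:
```
192.168.0.1 slots=1
192.168.0.2 slots=1
192.168.0.3 slots=1
192.168.0.4 slots=1
```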

For additional details regarding running Horovod, please refer to the [official Horovod documentation](https://horovod.readthedocs.io/en/stable/running_include.html).