Federated Training with Flower¶
You will be using MedPerf to run a federated learning experiment for the task of chest X-ray classification using the Flower Framework. The data used for this tutorial is a sample from the NIH Chest-Xray dataset
Throughout the tutorial, you will play three roles:
- The Model owner who defines and manages the federated learning experiment, and also runs the aggregation server.
- Data Owner 1 and Data Owner 2, who are collaborators providing their local data for training. The data never leaves their machines.
The following outlines the steps involved:
- Model Owner: Register the Data Preparation Container
- Model Owner: Submit the Flower container
- Model Owner: Create a Training Experiment
- Model Owner: Submit the Aggregator
- Model Owner: Link the Aggregator to the Training Experiment
- Model Owner: Submit the Training Plan (Configuration)
- Model Owner: Start the Training Event
- Data Owner 1: Register, prepare, and associate their dataset
- Data Owner 1: Get and submit a signing certificate, then start training
- Data Owner 2: Register, prepare, and associate their dataset
- Data Owner 2: Get and submit a signing certificate, then start training
- Model Owner: Get a server certificate and start the aggregator
- Model Owner: Close the Training Event
- Data Owner 1: Stop the Training Node
- Data Owner 2: Stop the Training Node
Running in cloud via Github Codespaces¶
As the easiest way to play with the tutorials you can launch a preinstalled Codespace cloud environment for MedPerf by clicking this link:
After opening the link, proceed to creating the codespace without changing any option. It will take around 7 minutes to get the codespace up and running. Please wait until you see the message Medperf is ready for local usage printed on the terminal.
Once the codespace is ready, you will get three URLs. Each URL corresponds to the MedPerf webUI instance for a user role (Model Owner, data1 owner, and data2 owner). Make sure you use the correct webUI instance for each tutorial section.
Additionally, you will get the address that you will later use in the tutorial when you register the aggregator server.
Now, we can start the tutorial.
1. Model Owner: Register the Data Preparation Container¶
Make sure you use the Model Owner webUI instance.
Make sure you click in the Training button on the top of the web page. This will switch MedPerf mode to Training.
The data preparation container transforms raw data into a format ready for training. In this tutorial, it will transform chest X-ray images and labels into numpy arrays. You can take a look at the implementation in the following folder: medperf_tutorial/data_preparator.
Navigate to the Containers tab at the top, and click on the Register a new container button.
For the Data Preparation container, the registration should include:
-
The path to the container configuration file:
-
The path to the parameters file:
Give this container a memorable name (e.g., chestxray_prep).
2. Model Owner: Submit containers¶
The training container represents the Flower container. It contains the training logic that will run on each collaborator's machine and also the aggregator server logic. You can take a look at the implementation in the following folder: medperf_tutorial/fl_container.
Navigate to the Containers tab at the top, and click on the Register a new container button.
For the training container, the registration should include:
-
The path to the container configuration file:
-
The URL to the hosted additional files tarball file, which includes the initial model weights:
Give this container a memorable name (e.g., train-chestxray).
3. Model Owner: Create a Training Experiment¶
Navigate to the Experiments tab at the top, and click on the Register a new training experiment button.
To register the training experiment, you need the following information:
- A name for the experiment (e.g.,
trainexp). - A short description.
- The name of the data preparator container you registered in step 1 (e.g.,
chestxray_prep). - The name of the training container you registered in step 2 (e.g.,
train-chestxray).
After submitting, the training experiment will be reviewed and approved by the MedPerf administrator before proceeding. For the tutorial, it will be auto approved, so that you can proceed.
4. Model Owner: Submit the Aggregator¶
Navigate to the Aggregators tab at the top, and click on the Register a new aggregator button.
To register the aggregator, you need the following information:
- A name for the aggregator (e.g.,
aggreg). - The network address (hostname or IP) of the machine where the aggregator will run. For the tutorial, this information was given when you opened the codespace.
- The port for federated learning communication (e.g.,
50273). - The admin port for management commands (e.g.,
50274). - The name of the training container registered in step 2 (e.g.,
train-chestxray).
5. Model Owner: Set the Aggregator on the Training Experiment¶
Navigate to the Experiments tab and open your training experiment's detail page. Click the Set Aggregator button and select the aggregator registered in step 4 (e.g., aggreg).
6. Model Owner: Submit the Training Plan¶
On your training experiment's detail page, click the Set Plan button.
The training plan is a configuration file that defines the federated learning hyperparameters (e.g., number of rounds, aggregation strategy). Provide the path to the plan file:
7. Model Owner: Start the Training Event¶
On your training experiment's detail page, click the Start Event button.
To start the training event, you need to provide:
- A name for the event (e.g.,
event1). - The list of collaborator emails who will participate in this event. For this tutorial, use the file located at
medperf_tutorial/cols_list.yaml, which contains the emails of the two data owners used in this tutorial.
8. Data Owner 1: Register, prepare, and associate their dataset¶
Make sure you use the Data Owner 1 webUI instance.
Make sure you click in the Training button on the top of the web page. This will switch MedPerf mode to Training.
Register the dataset¶
Navigate to the Datasets tab at the top, and click on the Register a new dataset button.
To register the dataset, you need the following information:
- A name (e.g.,
col1). - A short description.
- The source location of the data (e.g.,
col1location). - The path to the data records:
medperf_tutorial/sample_raw_data/col1/images. - The path to the labels:
medperf_tutorial/sample_raw_data/col1/labels. - The data preparator container name:
chestxray_prep.
The data consists of chest X-ray images and their labels. You can take a look at the data located in this folder: medperf_tutorial/sample_raw_data.
Note
You will be submitting general information about the data, not the data itself. The data never leaves your machine.
Prepare the dataset¶
On your dataset's detail page, click the Prepare button to run the data preparation container on your local data.
Set the dataset to operational mode¶
After preparation completes, click the Set Operational button to mark your dataset as ready for training. This will also upload dataset statistics to the MedPerf server.
Associate with the training experiment¶
Click the Associate with a training experiment button on your dataset's detail page, and select the training experiment created in step 3 (e.g., trainexp).
9. Data Owner 1: Get and submit a client certificate, then start training¶
Still logged in as Data Owner 1, navigate to the settings page, and scroll down to the signing certificate section.
Get and submit a client certificate¶
Before starting training, you need a client certificate so the aggregator can authenticate you.
Click the Get User Certificate button to generate a certificate on your local machine.
Then click the Submit Certificate button to upload your public certificate to the MedPerf server.
Start training¶
Navigate to your dataset dashboard, scroll down to the training experiments section.
Click the Start Training button next to the training experiment (e.g., trainexp).
The training node will start in the background on your machine.
10. Data Owner 2: Register, prepare, and associate their dataset¶
Make sure you use the Data Owner 2 webUI instance.
Make sure you click in the Training button on the top of the web page. This will switch MedPerf mode to Training.
Follow the same steps as Data Owner 1 (steps 8 and 9) using the second dataset:
- Dataset name:
col2 - Path to data records:
medperf_tutorial/sample_raw_data/col2/images - Path to labels:
medperf_tutorial/sample_raw_data/col2/labels
11. Data Owner 2: Get and submit a client certificate, then start training¶
Follow the same certificate and training steps as Data Owner 1:
- On the settings page, click
Get User Certificateto generate a certificate locally. - On the settings page, click
Submit Certificateto upload it to the MedPerf server. - On the dataset dashboard, Click
Start Trainingnext to the training experiment (trainexp).
The second training client will now also be running a federated node in the background.
12. Model Owner: Get the server certificate and start the aggregator¶
Make sure you use the Model Owner webUI instance.
Get the server certificate¶
Navigate to the Aggregators tab and open your aggregator's detail page. Click the Get Certificate button to generate the server-side TLS certificate used by the aggregator.
Start the aggregator¶
In the Run aggregator section, select the training experiment (e.g., trainexp) and the network interface you want the server to bind to. For the tutorial, use 0.0.0.0 (which will listen on all interfaces). Then, click the Run aggregator button.
The aggregator will start on your machine, and the collaborators' training clients (from steps 9 and 11) will automatically connect to it.
13. Model Owner: Close the Training Event¶
Once the aggregator has finished, navigate to your training experiment's detail page and click the Close Event button to finalize the training event.
14. Data Owner 1: Stop the Training Node¶
You can open the Data Owner 1 webUI instance and click the Stop button to stop the training node.
15. Data Owner 2: Stop the Training Node¶
You can open the Data Owner 2 webUI instance and click the Stop button to stop the training node.
This concludes our tutorial!