Photometric redshifts with DaME

1. The Scientific Problem

Photometric redshifts have become one of the main tools for investigating the spatial distribution of galaxies, since they make it possible to reconstruct the 3-dimensional positions of very large numbers of sources using only their photometric properties. One application of photometric redshifts (zphot) is the production of amazing maps of the Universe like this:

A 3D map of the Universe

The photometric features of an astronomical source (colours in this example, but the same holds for fluxes and magnitudes) correlate with its redshift because, as the spectrum of the source is redshifted, its prominent features, both in the continuum and in the emission lines, shift through the different filters of the photometric system and change their contribution to the observed fluxes.
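
The shifting of spectral features through the filters can be illustrated with a short Python sketch. It follows one feature, the 4000 Angstrom break, and reports which band it falls in at a few redshifts, using the relation observed wavelength = rest wavelength x (1 + z). The band edges below are rounded, non-overlapping approximations of the SDSS ugriz filters chosen only for illustration, and the function names are hypothetical.

```python
# Rounded, non-overlapping approximations of the SDSS filter ranges in Angstrom.
# These edges are an assumption made for illustration, not the real filter curves.
SDSS_BANDS = [
    ("u", 3000.0, 4000.0),
    ("g", 4000.0, 5500.0),
    ("r", 5500.0, 7000.0),
    ("i", 7000.0, 8500.0),
    ("z", 8500.0, 10000.0),
]

BREAK_REST = 4000.0  # rest-frame wavelength of the 4000 Angstrom break


def observed_wavelength(rest_wavelength, redshift):
    """Return the observed wavelength of a feature emitted at rest_wavelength."""
    return rest_wavelength * (1.0 + redshift)


def band_containing(wavelength):
    """Return the name of the band whose range contains wavelength, or None."""
    for name, low, high in SDSS_BANDS:
        if low <= wavelength < high:
            return name
    return None


if __name__ == "__main__":
    for z in (0.0, 0.2, 0.5, 0.8, 1.2):
        lam = observed_wavelength(BREAK_REST, z)
        print(f"z = {z:.1f}: break observed at {lam:.0f} A, band {band_containing(lam)}")
```

As the break moves from one band to the next, the colours built from the neighbouring bands change, and this is the kind of signal an empirical method can learn from.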


One family of methods for photometric redshift estimation is called "empirical", since these methods can be applied only to "mixed surveys", i.e. to datasets where accurate multiband photometric observations for a large number of sources are supplemented by spectroscopic redshifts for a smaller but still significant subsample of the same sources, statistically representative of the parent population. These spectroscopic data are used to constrain the fit of an interpolating function that maps the photometric parameter space onto redshift; the specific methods differ mainly in the way this interpolation is performed.

Neural networks (NN), like other machine learning algorithms, are very efficient at recognizing relations between data (see the web page http://voneural.na.infn.it/mlp.html for more details on NN and on the Multi Layer Perceptron, one of the most common models of NN). In the "training phase" they need a set of "examples" to learn how to reconstruct the relation between the "parameters" and the "target". In the specific case of photometric redshifts, the parameters are the fluxes, magnitudes or colours of the extragalactic sources, while the targets, i.e. independent and reliable estimates of the quantity the NN is trained to evaluate, are the redshifts of the sources measured from their observed spectra.
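
To make the idea of "training on examples" concrete, here is a minimal pure-Python sketch of a perceptron with one hidden layer, trained by online gradient descent on the squared error. It is not the DAME implementation: the network size, the learning rate and the synthetic "colours" and "redshifts" are all assumptions made for illustration.

```python
import math
import random


def init_network(n_in, n_hidden, rng):
    """Create random weights for a one-hidden-layer perceptron with one output."""
    # The last weight of every node is its bias.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def forward(net, x):
    """Return (hidden activations, network output) for one input vector."""
    w_hidden, w_out = net
    hidden = [math.tanh(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
              for w in w_hidden]
    out = w_out[-1] + sum(wo * h for wo, h in zip(w_out, hidden))
    return hidden, out


def train_epoch(net, samples, rate):
    """One pass over the examples, updating the weights after each of them."""
    w_hidden, w_out = net
    for x, target in samples:
        hidden, out = forward(net, x)
        err = out - target
        for j, h in enumerate(hidden):
            # Gradient reaching hidden node j, computed before w_out[j] changes.
            delta = err * w_out[j] * (1.0 - h * h)
            w_out[j] -= rate * err * h
            for i, xi in enumerate(x):
                w_hidden[j][i] -= rate * delta * xi
            w_hidden[j][-1] -= rate * delta
        w_out[-1] -= rate * err


def mse(net, samples):
    """Mean squared error over a list of (parameters, target) pairs."""
    return sum((forward(net, x)[1] - t) ** 2 for x, t in samples) / len(samples)


def make_samples(n, rng):
    """Synthetic stand-in for a training set: two 'colours', one 'redshift'."""
    samples = []
    for _ in range(n):
        c1, c2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        samples.append(([c1, c2], 0.3 + 0.2 * c1 - 0.1 * c1 * c2))
    return samples


rng = random.Random(0)
train = make_samples(200, rng)
net = init_network(n_in=2, n_hidden=5, rng=rng)
mse_before = mse(net, train)
for _ in range(100):
    train_epoch(net, train, rate=0.05)
mse_after = mse(net, train)
print(f"MSE before training: {mse_before:.4f}, after: {mse_after:.4f}")
```

The examples play the role of the spectroscopic subsample: the inputs are the "parameters", the last value of each pair is the "target", and the error on the examples drops as the network learns the relation between them.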

An example of scientific application of the method and algorithms described in this tutorial can be found in the scientific pages of this website.

2. Let's start.

We will consider the general situation where the user (you!) has a file containing a table composed of M columns and R rows (each row represents a different source, while each column contains a different quantity). N of these columns contain the photometric data used as "parameters" for the NN (fluxes, magnitudes or colours), while one of the columns contains the spectroscopic redshifts (the "target" of the training).

If you don't want to deal with the preprocessing, or you don't have a suitable file to create the two datasets you need to complete the experiment, skip the "Dataset preparation" section of this tutorial and just download the reference files we have prepared for you for the training and run cases. These files, named dataset_train.dat and dataset_run.dat, are ASCII files with 5 and 4 columns respectively. In more detail, the "dataset_train.dat" file will be used in the "training phase" of the experiment: the first 4 columns contain the observed colours of the galaxies, while the last contains the spectroscopic redshift, taken from the SDSS DR7 archive. The "dataset_run.dat" file contains instead only the photometric data of a different sample of galaxies (the column containing the spectroscopic redshift is missing) drawn from the same parent sample as "dataset_train.dat", and it will be used in one of the last steps of the experiment to produce a catalogue of photometric redshifts.

Otherwise, if you are brave enough to proceed with your own data, read the following section.

3. Dataset preparation

  1. Download the dataset_parent.fits file, which contains the full catalogue that you are going to split into the "train" and "run" datasets.
  2. Load the file with Topcat: Launch Topcat -> Open a new table -> Select the FITS format from the Format menu and the location of the file by clicking on Filestore browser -> Ok;
  3. Inspect the content of the file and select the columns to be used with Topcat: Display Column Metadata -> unselect all columns by clicking on Make all table columns invisible -> select only the columns containing the magnitudes and the spectroscopic redshift by ticking the corresponding checkboxes in the Visible column (at this point, the table should contain N + 1 visible columns);
  4. Close the Table Columns window -> Table Browser: rearrange the order of the columns so that the last one is the column containing the spectroscopic redshift, by dragging and dropping the redshift column into the right place -> close the Table Browser window;
  5. Open the Row Subsets window -> split the current table into two different samples by clicking on New Subset from first Rows and filling the Row Count field with a number approximately equal to 70% of the total number of rows in your file, then click OK -> in the Row Subsets window you will find a subset called "head_numrows", where 'numrows' is the number you have used, containing the first 'numrows' sources of the original file. Select it and then click on Create new subset complementary to selected subset -> a new subsample of the original file, called "not_head_numrows", the complement of the "head_numrows" sample, will appear -> close the Row Subsets window;
  6. Select the "head_numrows" subsample from the Row Subsets menu in the main window -> save the table in a file with Save Table: select ASCII as format in the Output format menu and choose a location and a name (train.dat is good enough!) for the file by clicking on Filestore Browser;
  7. This time, select the "not_head_numrows" subsample from the Row Subsets menu in the main window -> Display Column Metadata -> hide the last column by unticking the checkbox of the column containing the spectroscopic redshifts (at this point, the table should contain N visible columns) -> save the table in a different file with Save Table: select ASCII as format in the Output format menu and choose a location and a name (run.dat could be a good choice) for the file by clicking on Filestore Browser;
  8. You're done with Topcat (for now) and your input datasets for DAME are ready!
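
For readers who prefer a script, the same split can be sketched in Python. This is only an illustration of the steps above, not part of the tutorial's workflow. It assumes the catalogue has already been exported to a whitespace-separated ASCII table with the spectroscopic redshift in the last column, and that header lines, if any, start with '#'. The function names and the file name parent.dat are hypothetical.

```python
def read_table(path):
    """Read a whitespace-separated ASCII table, skipping blank and '#' lines."""
    rows = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                rows.append(line.split())
    return rows


def split_train_run(rows, train_fraction=0.7):
    """Split rows into a "train" part and a "run" part.

    The first train_fraction of the rows keeps all columns; the remaining
    rows lose their last column, assumed to be the spectroscopic redshift.
    """
    n_train = int(round(train_fraction * len(rows)))
    train = rows[:n_train]
    run = [row[:-1] for row in rows[n_train:]]
    return train, run


def write_table(path, rows):
    """Write rows as a whitespace-separated ASCII table."""
    with open(path, "w") as handle:
        for row in rows:
            handle.write(" ".join(row) + "\n")


if __name__ == "__main__":
    # Hypothetical input file: N photometric columns followed by the redshift.
    rows = read_table("parent.dat")
    train, run = split_train_run(rows)
    write_table("train.dat", train)
    write_table("run.dat", run)
```

Like the Topcat procedure, this takes the first 70% of the rows as they appear in the file. If the catalogue is sorted by some quantity, shuffling the rows before the split would be a reasonable precaution.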

Now you have two different files, whether you have chosen to use the ones we prepared for you or you decided to make them yourself, and it's time for you to get acquainted with DAME! DAME is presented to you as a webpage where you can register, upload your files, choose the algorithm, run the experiments, and much more!

4. Subscription to DaME

  1. Go to the webpage dame.na.infn.it;
  2. If you are a new user click on the link Sign Up!, otherwise jump directly to point 3 of this section of the tutorial;
    1. Create your account by filling in the fields in this webpage: you have to provide your name, choose a username, a valid e-mail address (one you can access, of course…) and a password (twice). Click on Register;
    2. A confirmation e-mail will be sent shortly to the address you provided. Click on the link contained in the e-mail in order to activate your DAME account;
    3. Click on the log in link that appears in the confirmation page;
  3. Provide your Username and Password in the corresponding fields and click on Log In;
  4. You have successfully registered as a DAME user and this step of the tutorial is over!

5. Launch an Experiment

Now it's time to launch your experiment! First of all, the NNs need to learn to recognize the functional relation between the photometric data and the spectroscopic redshifts of the sources contained in the "train" input file, in order to be able to approximate it afterwards and produce good estimates of the photometric redshift for the parameter values contained in the "run" file. In other words, the NNs need to be trained. This is what is described in the following section of the tutorial.

6. Training a new Neural Network

  1. Go to the webpage dame.na.infn.it and log in, if you have not done so yet;
  2. The first page you will see contains two sections: My Experiments, which will list your experiments as you perform them, and My Filestore, where you will see the files you upload and the files produced during the experiments. Upload the files "train.dat" and "run.dat" by selecting them with Browse and then clicking on Click here to upload the file;
  3. Click on the New MLP button on the left of the page -> select Regression from the Science Case menu -> select Full (Train + Test) from the Mode menu -> click on Go!;
  4. In the next page, choose a name and fill in the Experiment name field -> in the Input nodes field enter the number of parameters (photometric information, i.e. magnitudes, colours or fluxes) of the "train" file -> Hidden nodes: it depends on the experiment, but 20 nodes are usually fine for almost all kinds of experiments -> set Output nodes to 1 (the number of output nodes of the NN is always equal to the number of targets, in this case only the redshift) -> set Max Epochs to 500 (but it can vary) -> set Tolerance to 0.001 (but it can vary) -> choose "MSE-INCREMENTAL" from the Training algorithm menu -> select your "train" file in the Training set menu -> tick the checkbox Do validation and select the same "train" file in the Validation set and Test set menus -> click the Go! button;
  5. The next page will show you in real time what is happening to your experiment: the status (Launched, Started, Finished), a summary of the details of the experiment, the list of files produced during the experiment, the log and the graphical output;
  6. Now you can evaluate the results of the experiment: click on the link My Filestore on the left of the page -> click on the Download link on the left of your experiment: a folder containing all the files created during the experiment will be downloaded to your local drive -> open Topcat -> following the same steps described before, load the file ending with ".tes" in csv format (it contains two columns: the spectroscopic redshifts, i.e. the target, and the photometric redshifts for a subsample, the test set, of the original list of sources) -> click on Scatter plot: a scatter plot of the spectroscopic redshifts vs the photometric redshifts for the test set is produced (you may want to estimate the accuracy of the photometric redshifts using Topcat);
  7. If the reconstruction of the photometric redshifts satisfies you, you have completed the training of the NNs. Congratulations!
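
Besides the scatter plot, the accuracy of the photometric redshifts can be summarised with a few numbers. The Python sketch below is one possible way to do it, not a DAME feature. It assumes the ".tes" file is plain comma-separated text with the spectroscopic redshift in the first column and the photometric redshift in the second. The normalised residual (zphot - zspec) / (1 + zspec) and the outlier threshold of 0.15 are common conventions adopted here as assumptions; the function names are hypothetical.

```python
import math


def parse_two_columns(lines):
    """Extract (z_spec, z_phot) lists from comma- or space-separated text lines."""
    z_spec, z_phot = [], []
    for line in lines:
        fields = line.replace(",", " ").split()
        if len(fields) < 2:
            continue
        try:
            zs, zp = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # header or malformed line
        z_spec.append(zs)
        z_phot.append(zp)
    return z_spec, z_phot


def photoz_stats(z_spec, z_phot, outlier_threshold=0.15):
    """Return bias, scatter and outlier fraction of the normalised residuals."""
    residuals = [(zp - zs) / (1.0 + zs) for zs, zp in zip(z_spec, z_phot)]
    n = len(residuals)
    bias = sum(residuals) / n
    sigma = math.sqrt(sum((r - bias) ** 2 for r in residuals) / n)
    outliers = sum(1 for r in residuals if abs(r) > outlier_threshold) / n
    return {"bias": bias, "sigma": sigma, "outlier_fraction": outliers}


if __name__ == "__main__":
    # Made-up values standing in for the content of a ".tes" file.
    example = ["0.10,0.12", "0.25,0.22", "0.40,0.41", "0.30,0.55"]
    z_spec, z_phot = parse_two_columns(example)
    print(photoz_stats(z_spec, z_phot))
```

A small bias and scatter, together with a low outlier fraction, indicate that the trained network reproduces the spectroscopic redshifts of the test set well.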

And now? When you embarked on this experiment, you probably wanted to calculate the photometric redshifts for a sample of galaxies for which you do not have spectroscopic redshifts (in this case, our "run.dat" file), i.e. you wanted to create a catalogue of photometric redshifts. That is what our trained NN is for! For the last step of the process, read the next section of the tutorial.

7. Creating a catalogue of photometric redshifts

  1. Go to the webpage dame.na.infn.it and log in to your account;
  2. Click on New MLP button on the left of the page -> select Regression from the Science Case menu -> select Run from the Mode menu -> click on Go!;
  3. Fill in the Experiment name field with a suitable name ('catalogue_exp') -> in the Network menu, select the file containing the saved NN of your experiment (look for the file ending with "_netTrain.mlp" inside the folder named after your training experiment; this is the saved network) -> select the second file you uploaded (the 'run.dat' file) in the Data set menu -> click Go!;
  4. Your catalogue is being created. When the status of the experiment is 'Finished', you can view the content of each output file on screen just by clicking on its name, delete a file using the Delete button, or download them all by clicking on Download;
  5. Download the whole folder created during the experiment: the photometric redshifts are contained in the file with extension ".run". Now you can enjoy your brand new photometric redshifts!