Website by Joana Lopes
2015
Frequently Asked Questions
Q: What is KeyGenes?
A: KeyGenes is an algorithm that predicts the identity of queried samples and provides an identity score for each of them. It takes the transcriptional profiles of the queried data (test set) and matches them to chosen sets of transcriptional profiles (training set).
The idea is that you select datasets from the tissues/organs of interest and from tissues/organs in general (training set), and that you provide a dataset of cells (test set) differentiated towards one of the tissues/organs included in the training set. KeyGenes then gives each sample in the test set an identity score against the samples included in the training set. It is therefore very important to choose your training set carefully. Moreover, KeyGenes uses the top 500 most variably expressed genes. This top 500 list should also be selected carefully, depending on the training set used.
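As a rough illustration of what "the top 500 most variably expressed genes" could mean in practice, the sketch below ranks genes by their variance across the training samples and keeps the first n. This is only a minimal sketch in base R: the function name `top_variable_genes`, the toy matrix, and the use of plain variance on log2-transformed counts are assumptions made for illustration, not the procedure of Script 1 (1_KeyGenes_top500_1.0), which generates the actual top500 file.

```r
# Hypothetical helper (not part of KeyGenes): rank genes (rows) by their
# variance across samples (columns) and return the n most variable gene IDs.
top_variable_genes <- function(expr, n = 500) {
  stopifnot(is.matrix(expr), !is.null(rownames(expr)))
  log_expr <- log2(expr + 1)            # assumption: log2 scale with a pseudocount
  gene_var <- apply(log_expr, 1, var)   # per-gene variance across samples
  ranked <- names(sort(gene_var, decreasing = TRUE))
  head(ranked, n)
}

# Toy training matrix with invented counts: 8 genes x 5 samples
set.seed(1)
expr <- matrix(rpois(40, lambda = 50), nrow = 8,
               dimnames = list(paste0("ENSG", sprintf("%011d", 1:8)),
                               c("brain_1", "brain_2", "heart_1", "heart_2", "kidney_1")))
# Make one gene clearly tissue-dependent
expr["ENSG00000000003", ] <- c(0, 0, 500, 500, 1000)

top3 <- top_variable_genes(expr, n = 3)
print(top3)
```

With a real training set, n would be 500 and the resulting list would still need to be reviewed against the tissues included, as noted above.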




Q: I do not know R. Can I still use KeyGenes?
A: Yes, a Web App is provided, but at the moment it is limited to the use of “fixed” training sets of human transcriptional data. In the Web App, a test set can be uploaded and one of the provided “fixed” training sets can be selected. At the moment the Web App is also limited to the use of NGS-derived data. Choosing a “flexible” training set is not yet possible using the Web App.

Q: Can I use KeyGenes on a microarray-derived test set?
A: Although KeyGenes was designed for NGS data, it can be applied to a microarray-derived test set by using Script 3. However, microarray-derived data of tissues/organs from Affymetrix and Illumina platforms have been tested only using the training set “fetal”.

Microarray-derived data of differentiated cells have not been tested with other training sets, and therefore the results should be interpreted with care. See 'How to use KeyGenes' section 2.3. for more information.
In addition, we also suggest for that to use

Q: My data is annotated with other than Ensembl Gene IDs. Can I still use the “fixed” training set?
A: When one of the provided training sets or the script for microarray-derived data is used, the genes must be annotated with Ensembl Gene IDs. If a new training set of NGS-derived data is used, alternative annotations are possible. However, both the training set and the test set need to have the same annotation. See 'How to use KeyGenes' section 2.4. for more information.

Q: I would like to use my own training set. Is this possible?
A: Yes, this is possible as described in section 2.4. However, if the training set is assembled from data from different sources, the selection of an adequate top500 is crucial to obtain meaningful predictions and identity scores. We are working on a Web App to allow this as well.

Q: Can I analyse mouse data using KeyGenes?
A: Yes, this is possible. Provided the training set and test set are from the same organism, KeyGenes can analyze the data regardless of the organism. See 'How to use KeyGenes' section 2.4. for more information.
Contacts
Matthias Roost
M.S.Roost@lumc.nl
PhD student: human development: pancreas, pluripotency and epigenetics
Bontius Stichting Graduate Student
Susana Lopes
Associate Professor, LUMC, the Netherlands
Guest Professor, Gent University, Belgium
S.M.Chuva_de_Sousa_Lopes@lumc.nl
Downloads
Packages
Microarray
Training Sets
Scripts
How to use KeyGenes
1. Provided "fixed" training sets
KeyGenes has been developed and tested using NGS-derived human fetal transcriptional datasets as both training and test sets (Roost et al., 2015). The datasets used were expanded with available datasets on human adult tissues/organs (Cnop et al., 2014; Fagerberg et al., 2014; Illumina Body Map 2.0).

It is worth mentioning that NGS-derived data of more human adult organs/tissues are available (Fagerberg et al., 2014; Illumina Body Map 2.0; Epigenomic Roadmap; ENCODE) and could be incorporated, as the datasets to be used as training sets are flexible and expandable.
We provide seven basic “fixed” training sets, which are intended to give users a head start in assigning both identity and developmental stage to their differentiated cells (test set). How to create and use a “flexible” training set is described in section 2.

1.1. Basic training sets
For a first and general assessment and before using a “flexible” training set, the following “fixed” training sets are recommended to determine identity:
1. Fetal
The training set “fetal” contains transcriptional signatures of 21 fetal organs/tissues and the maternal endometrium.
2. Fetal wo
The training set “fetal wo” contains transcriptional signatures of 17 fetal organs, excluding the extraembryonic organs/tissues and the maternal endometrium. Male and female gonads are taken together as gonad.

1.2. Staging training sets
For a basic assignment of a developmental stage to differentiated derivatives of pluripotent stem cells (PSCs), we provide the “fetal wo” training set separated into first and second trimester, as well as an adult training set. By comparing the outcomes using the three training sets, you can determine whether your sample is closer to a particular organ in the 1st trimester, 2nd trimester or adult:
3. Fetal wo 1T
The training set “fetal wo 1T” contains transcriptional signatures of 13 1st trimester fetal organs, excluding the extraembryonic organs/tissues and the maternal endometrium.
4. Fetal wo 2T
The training set “fetal wo 2T” contains transcriptional signatures of 16 2nd trimester fetal organs, excluding the extraembryonic organs/tissues and the maternal endometrium. Male and female gonads are taken together as gonad.
5. Adult
The training set “adult” contains transcriptional signatures of 11 adult organs, excluding the extraembryonic organs/tissues and the cervix (Fagerberg et al., 2014; Illumina Body Map 2.0). Ovaries and testes are taken together as gonads. With the heart samples, no distinction is made between atria and ventricles.

1.3. Other training sets
6. Fetal wo islets
This training set includes the “fetal wo” training set expanded with five adult islet of Langerhans samples (Cnop et al., 2014). KeyGenes will split up the identity scores for pancreas into pancreas and adult islet. This training set has been tested on PSC-derived endocrine cells (Roost et al., 2015).
7. Fetal wo islets 2
This training set includes the “fetal wo” training set expanded with five adult islet of Langerhans samples (Cnop et al., 2014). KeyGenes will split up the identity scores for pancreas into 1T, 2T and adult islet. This training set has been tested on PSC-derived endocrine cells (Roost et al., 2015).

2. R Scripts
The R scripts allow a prediction of NGS- or microarray-derived data based on one of the provided training sets described in section 1 or based on a training set defined by the user.
2.1. Requirements
The KeyGenes scripts are written in R; therefore, R is required.
In order to run KeyGenes properly, the following packages are needed:

- limma
- ggplot2
- gplots
- glmnet (version 1.9-8)

Please note that this particular version of glmnet (version 1.9-8) must be used (Friedman et al., 2010). Compressed files of the four packages can be downloaded from this website.

The provided training sets are based on Ensembl Gene IDs. Therefore, test sets must be provided as a tab-separated text file with genes (Ensembl Gene IDs, no duplicates) as rows and the queried samples as columns.

The queried samples should be labelled as follows: sample_additional information (e.g. stomach_adult patient 1). To get a meaningful prediction of the queried samples, the organ/tissue/cell type of interest must be represented in the training set.

If your text file with the test set contains only one queried sample (column), you have to duplicate this column since KeyGenes needs at least two queried samples.
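The test-set requirements listed in this section (tab-separated file, Ensembl Gene IDs without duplicates as rows, at least two sample columns) can be checked before running a script. The following is a minimal sketch in base R, not part of the KeyGenes scripts: the helper name `read_test_set` is hypothetical, and it assumes that the first column of the file holds the gene IDs and that the first line is a header with the sample labels.

```r
# Hypothetical helper (not part of KeyGenes): read a test set and check its format.
# Assumed layout: tab-separated, header line, first column = gene IDs.
read_test_set <- function(path) {
  tab <- read.delim(path, header = TRUE, check.names = FALSE,
                    stringsAsFactors = FALSE)
  ids <- tab[[1]]
  # Gene IDs must be unique
  if (anyDuplicated(ids) > 0) {
    stop("Duplicated gene IDs: ",
         paste(unique(ids[duplicated(ids)]), collapse = ", "))
  }
  # Human Ensembl Gene IDs start with "ENSG"; only warn, do not stop
  if (!all(grepl("^ENSG[0-9]+$", ids))) {
    warning("Some gene IDs do not look like human Ensembl Gene IDs (ENSG...)")
  }
  expr <- as.matrix(tab[, -1, drop = FALSE])
  rownames(expr) <- ids
  # KeyGenes needs at least two queried samples: duplicate a single column
  if (ncol(expr) == 1) {
    expr <- cbind(expr, expr)
  }
  expr
}

# Toy example with one queried sample and invented counts
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene\tstomach_adult patient 1",
             "ENSG00000000003\t120",
             "ENSG00000000005\t0"), tmp)
test_set <- read_test_set(tmp)
print(dim(test_set))
```

The checked matrix could then be written back with `write.table(..., sep = "\t", quote = FALSE)` and used as the test set file.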
2.2. Using the provided "fixed" training sets for NGS-derived data (Script 2)
KeyGenes is designed to be used on human next-generation sequencing (NGS) data (Script 2, 2_KeyGenes_NGS_1.0). The provided human training sets with their associated files (foldid, top500) described above can be downloaded from this website.
To run the KeyGenes script, follow the steps below:

1. Load the packages (see section “2.1. Requirements”).
## Load Packages ##
library(limma)
library(gplots)
library(ggplot2)
library(glmnet)  # Version: 1.9-8

2. Set the working directory to where your files are located.
## Working Directory ##
working_dir <- "YourWorkingDirectory"

3. Specify your training set. To each training set belongs a foldid and top500 text file.
## Training Set ##
training <- "training_fetal.txt"
foldid <- "foldid_fetal.txt"
top500 <- "top500_fetal.txt"

4. Specify your test set containing raw reads (see section “2.1. Requirements” for information on the format of the test set).
## Test Set ##
test <- "YourTestSet.txt"

5. Label the output files as you wish (see file “What is KeyGenes” for information on the output).
## Output ##
heatmap <- "KeyGenes_Heatmap.pdf"
matrix <- "KeyGenes_Matrix.pdf"
prediction <- "KeyGenes_Prediction.pdf"
classifier <- "KeyGenes_Classifier.pdf"

6. Once all the arguments are loaded (steps 1-5), run the rest of the script in one go.

2.3. Using the provided "fixed" training sets for microarray-derived data (Script 3)
KeyGenes can be used to predict microarray-derived data of human tissue samples using the fetal training set (Roost et al., 2015; Script 3, 3_KeyGenes_MA_1.0). However, only data derived from Affymetrix and Illumina platforms have been tested. Furthermore, the other provided training sets have not been tested on microarray-derived data of differentiated derivatives of human pluripotent stem cells.

For microarray data of differentiated derivatives of pluripotent stem cells, it is recommended to use also

The only difference from Script 2 is that in the section “Training Set”, an additional argument (housekeeper) must be loaded. This argument remains the same if the training set is changed. The file “housekeepers_microarray.txt” can be downloaded from this website.
To run Script 3, follow the steps below:

1. Load the packages (see section “2.1. Requirements”).
## Load Packages ##
library(limma)
library(gplots)
library(ggplot2)
library(glmnet)  # Version: 1.9-8

2. Set the working directory to where your files are located.
## Working Directory ##
working_dir <- "YourWorkingDirectory"

3. Specify your training set. To each training set belongs a foldid and top500 text file. “housekeepers_microarray.txt” can be downloaded.
## Training Set ##
training <- "training_fetal.txt"
foldid <- "foldid_fetal.txt"
top500 <- "top500_fetal.txt"
housekeeper <- "housekeepers_microarray.txt"

4. Specify your test set (see section “2.1. Requirements” for information on the format of the test set).
## Test Set ##
test <- "YourTestSet.txt"

5. Label the output files as you wish (see file “What is KeyGenes” for information on the output).
## Output ##
heatmap <- "KeyGenes_Heatmap.pdf"
matrix <- "KeyGenes_Matrix.pdf"
prediction <- "KeyGenes_Prediction.pdf"
classifier <- "KeyGenes_Classifier.pdf"

6. Once all the arguments are loaded (steps 1-5), run the rest of the script in one go.

2.4. How to use a "flexible" training set (Scripts 1, 2 and 3)
In order to get the most meaningful prediction for a particular experiment, KeyGenes allows you to use other training sets or to modify existing training sets. In order to do so, the foldid and top500 text files of new training sets must be generated in Script 1 (1_KeyGenes_top500_1.0).
There are four important things to consider:
- There need to be at least two samples per “classification”.
- The training set should be based on Ensembl Gene IDs. Therefore, the training set must be in a tab-separated text file with genes (Ensembl Gene IDs, no duplicates) as rows and the samples as columns.
- The samples should be labelled as follows: sample_additional information (e.g. stomach_adult patient 1). Samples of the same classification must have the same “sample” name (e.g. stomach_1 and stomach_2).
- If the training set is assembled from data from different sources, the selection of the top500 is crucial.

Theoretically, it is possible to use gene IDs other than Ensembl, and even other species, for training sets. However, this will only work for NGS-derived data (Script 2, 2_KeyGenes_NGS_1.0).

To run Script 1, follow the steps below:

1. Load the packages (see section “2.1. Requirements”).
## Load Packages ##
library(limma)
library(glmnet)  # Version: 1.9-8

2. Set the working directory to where your files are located.
## Working Directory ##
working_dir <- "YourWorkingDirectory"

3. Specify your training set.
## Data Set ##
dataset <- "YourDataset.txt"

4. Label your output.
## Output ##
top500 <- "top500.txt"
foldid <- "FOLDID.txt"

5. Once all the arguments are loaded (steps 1-4), run the rest of the script in one go.

The two text files generated with Script 1 (1_KeyGenes_top500_1.0) are then loaded into Script 2 (2_KeyGenes_NGS_1.0) or Script 3 (3_KeyGenes_MA_1.0), together with the corresponding training set.

3. Web App
The web app allows a prediction of NGS-derived data based on one of the “fixed” training sets described above in section 1.

3.1. Web App using the provided "fixed" training sets
The provided training sets are based on Ensembl Gene IDs. Therefore, the test set must be provided as a tab-separated text file with genes (Ensembl Gene IDs, no duplicates) as rows and the queried samples as columns.

The queried samples should be labelled as follows: sample_additional information (e.g. stomach_adult patient 1). To get a meaningful prediction of the queried samples, the tissue/cell type of interest must be represented in the training set. For the web app, it is important that the sample names do not contain any other special characters.

If your text file with the test set contains only one queried sample (column), you have to duplicate this column since KeyGenes needs at least two queried samples.

To run the KeyGenes Web App, follow the steps below:

1. Fill in your email address in the first field, so you will get the results in your mailbox.

2. Choose the “fixed” training set of interest from the dropdown menu.

3. Upload the test set containing the raw reads.

4. Press the submit button and check your mailbox for the results.

3.2. Web App using "flexible" training sets
We are currently working on the Web App to allow the use of flexible training sets.

References
Cnop, M., Abdulkarim, B., Bottu, G., Cunha, D.A., Igoillo-Esteve, M., Masini, M., Turatsinze, J.V., Griebel, T., Villate, O., Santin, I., et al. (2014). RNA sequencing identifies dysregulation of the human pancreatic islet transcriptome by the saturated fatty acid palmitate. Diabetes. 63(6): 1978-1993.
Fagerberg, L., Hallstrom, B.M., Oksvold, P., Kampf, C., Djureinovic, D., Odeberg, J., Habuka, M., Tahmasebpoor, S., Danielsson, A., Edlund, K., et al. (2014). Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Molecular & cellular proteomics: MCP. 13(2): 397-406.
Friedman J., Hastie T., and Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software. 33(1): 1-22.
Roost M.S., van Iperen L., Ariyurek Y., Buermans H.P., Arindrarto W., Devalla H.D., Passier R., Mummery C.L., Carlotti F., de Koning E.J.P., van Zwet E.W., Goeman J.J., and Chuva de Sousa Lopes S.M. (2015). KeyGenes, a tool to probe tissue differentiation using a human fetal transcriptional atlas. Stem Cell Reports. 4(6): 1112-1124.
What is KeyGenes?
KeyGenes is an algorithm that predicts the identity of queried samples (test set) and determines their identity scores relative to a provided group of samples (training set). It uses transcriptional profiles of the queried data (test set) and matches them to sets of transcriptional profiles of organs or cell types (training set). KeyGenes uses a 10-fold cross validation on the basis of a LASSO (Least Absolute Shrinkage and Selection Operator) regression available in the R package “glmnet” (Friedman et al., 2010).
Information about the different “fixed” training sets provided as a head start, as well as instructions on how to use either the Web App on “fixed” training sets or the R scripts on “fixed” or “flexible” training sets, can be found on http://www.keygenes.nl/ (“How to use KeyGenes”). The R scripts and the different available “fixed” training sets (with associated files) can be downloaded from http://www.keygenes.nl/.
What do I get from KeyGenes?
The output you will get from KeyGenes consists of four files:

1. A PDF file (KeyGenes_Heatmap.pdf) with a heatmap containing the identity scores (between 0 and 1) of your samples matched to the samples included in the training set.
2. A text file (KeyGenes_Matrix.txt) containing a matrix with the identity scores (between 0 and 1) of the queried samples matched to the samples included in the training set.
3. A text file (KeyGenes_Prediction.txt) with the queried samples and the sample in the training set with the highest identity score.

True Tissue       Predicted Tissue
Sample A_adult    Brain
Sample B_adult    Gonad
Sample C_adult    Gonad
Sample D_adult    Kidney
Sample E_adult    Muscle

4. A text file (KeyGenes_Classifier.txt) containing the list of classifier genes per sample, calculated from the training set and used to determine the identity scores (between 0 and 1) of the queried samples matched to the samples included in the training set.
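The relationship between the identity-score matrix (output 2) and the prediction (output 3) can be sketched in a few lines of base R: for each queried sample, the predicted tissue is the training-set class with the highest identity score. This is only an illustration under stated assumptions. The score values are invented, and the layout (training-set classes as rows, queried samples as columns) is assumed, so the actual KeyGenes_Matrix.txt may be organised differently.

```r
# Invented identity scores (between 0 and 1).
# Assumed layout: rows = training-set classes, columns = queried samples.
scores <- matrix(c(0.91, 0.03, 0.06,
                   0.10, 0.15, 0.75),
                 nrow = 3,
                 dimnames = list(c("brain", "gonad", "kidney"),
                                 c("Sample A_adult", "Sample D_adult")))

# For each queried sample, take the training-set class with the highest score
best <- apply(scores, 2, which.max)
prediction <- data.frame(
  sample = colnames(scores),
  predicted = rownames(scores)[best],
  score = scores[cbind(best, seq_len(ncol(scores)))],
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(prediction)
```

A low maximum score would suggest that none of the classes in the training set describes the queried sample well, which is one reason the training set has to be chosen carefully.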
References:
Friedman J., Hastie T., and Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software. 33(1): 1-22.