Data and databases setup:

Assembling the files required for the database creation:

In order to build the main knowledge repository, BioFlow will go and look for the following data repositories specified in the $BIOFLOWHOME/configs/main_configs.yaml file:

  • OBO 1.2 file of GO terms and relations,

    • dowloaded from: here
    • will download/look for for go.obo file at $DB_HOME$/GO/
  • UNIPROT-SWISSPROT .txt text database file

    • downloaded from here
    • will store/look for the master tab file at $DB_HOME$/Uniprot/uniprot_sprot.dat
    • will load the information specified by the NCBI tax id for the organism currently loaded
  • Reactome.org “Events in the BioPax level 3” file

    • downloaded from here
    • will store/look for the files relevant to the organism at $DB_HOME$/Reactome/
  • HiNT PPI binary interaction files for the organisms of interest

    • dowloaded from here, here and here
    • will store/look for the files SaccharomycesCerevisiaeS288C_binary_hq.txt, HomoSapiens_binary_hq.txt, MusMusculus_binary_hq.txt at the path $DB_HOME$/HiNT/
  • BioGRID ALL_ORGANISMS PPI file in the tab2 format

    • dowloaded from here
    • will store/look for the files at $DB_HOME$/BioGRID/
  • TRRUST literature-curated TF-target interaction files in tsv format

    • downloaded from here and here
    • will store/look for the files at $DB_HOME$/TFs/TRRUST
  • IntAct ComplexPortal tsv files

    • dowloaded from here and here
    • will store/look for the files at $DB_HOME$/ComplexPortal
  • Phosphosite protein kinase-subrstrate tsv files

    • downloaded from here
    • will store/look for the files at $DB_HOME$/PhosphoSite

It is possible to specify the file locations and identifiers manually, and then download and install them. This is to be used when the download locations for the files move.

Similarly, the configs file also controls the organism selection. Three organisms have provided configurations (human, mouse, S. Cerevisiae). Using the same pattern, other organisms can be configured, also the lack of data can be a problem (this is already the case for mouse - we recommend mapping the genes to human if the mouse is used as a model for the organism).

Warning

While BioFlow provides an interface to download the databases programmatically, the databases are subject to Licenses and Terms that it’s up to the end users to respect

Weighting and secondary set:

In addition to the main analysis set, it is possible to supply to BioFlow as well the secondary analysis set. In case it is supplied, the information flow will be calculated between the set of hits and the secondary set. In other terms, only the hypothesis that link the primary set to the secondary set will be evaluated. This is useful in cases where a set of hits from experimental results need to be linked to the set of proteins canonically involved in a phenotype or process of interest.

The secondary set can be supplied to both the knowledge_analysis and interactome_analysis as an optional parameter secondary_source_list. This is the sec_ids return from the map_and_save_gene_ids method. In order to generate mappints of the ids from the files that would be compatible with the auto_analyze methods, there are several possible ways of supplying file paths:

> # No background
> hit_ids, sec_ids, background_ids = map_and_save_gene_ids(hits_file)
> # Background
> hit_ids, sec_ids, background_ids = map_and_save_gene_ids(hits_file, background_file)
> # Single secondary file with background
> hit_ids, sec_ids, background_ids = map_and_save_gene_ids((hits_file, sec_file), background_file)
> # mixed list of files with background
> hit_ids, sec_ids, background_ids = map_and_save_gene_ids([(hits_file, sec_file)
                                                            hits_file_2, ...],
                                                           background_file)

All of them generate valid input for both the knowledge_analysis and interactome_analysis, but if you are using a list, a list of experiment names is strongly recommended.

Warning

Secondary set has to be disjoing from the primary set. If this is not the case, an explicit error will be raised

For the main as well as the secondary set BioFlow supports weighting. It is possible to supply a

Bioflow supports as well the weighting of the main

Reweighting and node edge exclusion: