This article is a translation of Blue Collar Bioinformatics: Automated retrieval of expression data with Python and R. It is a little old, but translating it is also good study for me. Corrections of any typos or mistranslations are welcome.
The translation begins below.
This February I will be in Japan for BioHackathon 2010, which focuses on integrating biological data, so as preparation I have been thinking about use cases.
When wrestling with a dataset, what additional data would help solve the problem? A typical example is gene expression data: you have a list of genes of interest and want to know their expression levels.
In a world where the dream of the Semantic Web has come true, I imagine you would simply query a set of RDF triples describing the experiment.
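Translator's note: purely as an illustration, such a query might look like the following sketch in Python with rdflib; the ex: vocabulary and the file name are invented for this example.
import rdflib

# Illustrative sketch only: the RDF file and the ex: predicates are hypothetical.
graph = rdflib.Graph()
graph.parse("expression_data.rdf")  # hypothetical RDF dump of the experiment

query = """
PREFIX ex: <http://example.org/expression#>
SELECT ?gene ?level WHERE {
    ?sample ex:cellType "proB" ;
            ex:condition "wild type" ;
            ex:measurement ?m .
    ?m ex:gene ?gene ;
       ex:level ?level .
}
"""
for gene, level in graph.query(query):
    print gene, level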
This would give us the expression levels for the cell type of interest (proB) under normal conditions, which we could then narrow down to the genes at hand. Ideally, the expression set would also carry metadata, so that we could tell what a number like 7.65 means in the context of the experiment.
You cannot send such a query through Semantic Web channels yet, but there are plenty of tools that let you solve a similar problem automatically.
NCBI's Gene Expression Omnibus (GEO) hosts expression data online, and you can query its datasets with Biopython's Eutils interface. Once you have identified a dataset of interest, you can retrieve the data with Bioconductor's GEOquery and map it to UCSC RefGene identifiers. All of this can be tied together by running R from Python with rpy2.
Translator's note: nowadays pipeR might be the better choice.
First, we set the top-level goal: here, we want to obtain wild-type expression data from mouse proB cells.
# Imports shared by all the snippets below
import os, csv, json, cPickle, collections
from Bio import Entrez
import rpy2.robjects as robjects

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = "[email protected]"  # your address; required by NCBI Entrez
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,
                            _is_wild_type)

def _is_wild_type(result):
    """Judge from its title whether the sample is wild type."""
    return result.samples[0][0].startswith("WT")
Next, a query is sent to GEO to retrieve the available data for this organism and these cell types. In real work you might replace the private _is_wild_type function with one that works interactively, letting the biologist choose the samples appropriate to the experiment, along the lines of the sketch below.
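As a minimal sketch of what such an interactive chooser might look like (this function is not in the original script; it relies only on the samples attribute used above):
def _choose_interactively(result):
    """Sketch only: show each sample and let the user accept or reject it."""
    for title, accession in result.samples:
        print "  %s (%s)" % (title, accession)
    answer = raw_input("Use this GEO experiment? [y/N] ")
    return answer.strip().lower().startswith("y")
It could then be passed to get_geo_data in place of _is_wild_type.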
Also note that the final result is pickled and saved locally. By querying only once this way, the load on the external server is kept to a minimum.
def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)
                break

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)
    return result
Sending the query to GEO and retrieving the results is handled by Biopython's Entrez interface. After submitting the query, the results are parsed into simple objects holding the experiment description together with, for each matching sample, its title paired with its accession ID.
def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to retrieve GEO entries for a specific organism and cell types.
    """
    Entrez.email = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term

    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results = Entrez.read(handle)
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary = Entrez.read(handle)
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
                    break
        if len(samples) > 0:
            # GEOResult is a small container (description plus samples)
            # defined in the full script linked at the end.
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits
With that, choosing the experiment of interest gets us past the first barrier. Next we need the actual expression values. The goal is a dictionary mapping RefGene IDs to expression levels; in other words, a result like this:
WT ProB, biological rep1
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]),
('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
('NM_177325', [7.3312364040000002, 11.831475960000001]),
('NM_177323', [6.9779868059999997, 6.3926399939999996]),
('NM_177322', [5.0833683379999997])]
The real work of retrieving the expression values and mapping them to gene IDs is done by Bioconductor's GEOquery library. The probe-to-expression table and the higher-level metadata are written to local files: the former as an R-style tab-delimited table, the latter as JSON. Both are saved locally in an easy-to-parse format.
def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression data with Bioconductor and save it as a table.
    """
    robjects.r.assign("gsm.id", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
    robjects.r('''
      library(GEOquery)
      library(rjson)
      gsm <- getGEO(gsm.id)
      write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
      cat(toJSON(Meta(gsm)), file = meta.file)
    ''')
This function gives us a table mapping probe names to expression levels. To make it more useful, each microarray probe also needs to be matched to a biological gene ID. This is again done with the GEOquery library: the platform data is retrieved as an R data frame, and only the columns of interest are written out to a tab-delimited file.
def _write_gpl_map(self, gpl_id, gpl_file):
    """Use R to retrieve GEO platform data and save it in tabular form."""
    robjects.r.assign("gpl.id", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
    robjects.r('''
      library(GEOquery)
      gpl <- getGEO(gpl.id)
      gpl.map <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
      write.table(gpl.map, file = gpl.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
    ''')
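The get_gpl_map method used in the next snippet is not shown in these excerpts. A minimal sketch, assuming the map file name and that multiple RefSeq IDs in the platform table are separated by " /// " (as on typical Affymetrix platforms), might look like this:
def get_gpl_map(self, gpl_id, save_dir):
    """Sketch only: map probe IDs to the RefSeq transcript IDs they measure."""
    gpl_file = os.path.join(save_dir, "%s-map.txt" % gpl_id)  # assumed name
    if not os.path.exists(gpl_file):
        self._write_gpl_map(gpl_id, gpl_file)
    gpl_map = dict()
    with open(gpl_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab')
        reader.next()  # header
        for probe_id, tx_ids in reader:
            # assumption: multiple RefSeq IDs are separated by ' /// '
            gpl_map[probe_id] = [t.strip() for t in tx_ids.split("///")
                                 if t.strip()]
    return gpl_map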
Finally, everything is pulled together: each mapping file is downloaded and parsed, and the correspondences are chained from expression value to probe ID to transcript ID. The resulting dictionary, which associates gene IDs with expression levels, is returned.
def get_gsm_tx_values(self, gsm_id, save_dir):
    """Map transcript IDs to the expression values in a GEO GSM file.
    """
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab')
        reader.next()  # header
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
                tx_to_vals[tx_id].append(float(probe_val))
    return tx_to_vals
The complete script (https://github.com/chapmanb/bcbb/blob/022c50b3132bc3d009ed1d533a3954bd564a9ed3/rest_apis/find_geo_data.py) ties the snippets above together: starting from a query, it ends up with the expression data we were after.
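As a rough sketch of how the pieces could be wired together (assumptions: get_geo_data returns the pickled GEOResult, and get_gsm_tx_values is a method on that object; the full script linked above is the authoritative version):
# Sketch only, continuing main() above with the same variables.
result = get_geo_data(organism, cell_types, email, save_dir, _is_wild_type)
for title, accession in result.samples:
    print title
    # assumption: get_gsm_tx_values is a method on the returned GEOResult
    tx_to_vals = result.get_gsm_tx_values(accession, save_dir)
    print sorted(tx_to_vals.items())[:5]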
Still, compared with the ideal RDF-based approach this is quite a detour. Making question-driven access to data as simple as possible remains a recurring challenge for the people who build bioinformatics tools.
The translation ends here
The script itself is available at https://github.com/chapmanb/bcbb/blob/022c50b3132bc3d009ed1d533a3954bd564a9ed3/rest_apis/find_geo_data.py
The author's bcbb repository contains many other useful bioinformatics programs.