Sunday, July 31, 2022

Predicting cancer prognosis and drug response from the tumor microbiome


Data retrieval and processing

Normalized and batch effect corrected microbial abundance data for 32 TCGA tumor types were downloaded from the online data repository referenced in Poore et al.10 (ftp://ftp.microbio.me/pub/cancer_microbiome_analysis). Specifically, the “Kraken-TCGA-Voom-SNM-Plate-Center-Filtering-Data.csv” microbial abundance data file and the accompanying “Metadata-TCGA-Kraken-17625-Samples.csv” metadata file were used as the starting input for further data processing.

We first filtered the data for primary tumor samples (TCGA “Primary Tumor” or “Additional - New Primary” sample types). Poore et al. generated microbial abundances from all the available WGS and RNA-seq data in legacy TCGA (after some quality filters), which frequently contained replicate WGS and RNA-seq data for the same case and sample type. It was common in legacy TCGA to increase WGS sequencing coverage by performing an additional sequencing run from the same sample, and these secondary runs typically had a much lower number of reads and lower coverage than their corresponding primary sequencing runs. When comparing the normalized and batch effect corrected read counts between these WGS runs, we found that microbial abundance data from lower coverage secondary runs could be significantly different from abundances derived from the larger primary sequencing runs. Therefore, we excluded microbial abundance data that came from secondary runs. In addition, legacy TCGA sometimes contained data for the same samples analyzed using different computational pipeline versions. We excluded replicate microbial abundance data from older TCGA analysis pipeline versions if a replicate from a newer version existed. After the above filters, the Poore et al. data went from 17,625 samples and 10,183 unique cases to 12,111 samples and 9812 unique cases (comprising 1944 WGS samples from 1904 unique cases and 10,167 RNA-seq samples from 9745 unique cases).

TCGA gender, age at diagnosis, and tumor stage demographic and clinical data, as well as primary tumor RNA-seq read count data for the 32 TCGA tumor types included in our study, were obtained from the NCI Genomic Data Commons (GDC Data Release v29.0) using the R Bioconductor package GenomicDataCommons. TCGA GENCODE v22 gene annotations were obtained from the GDC data portal and Ensembl Gene v98 using the R package rtracklayer and R Bioconductor packages AnnotationHub and ensembldb. The downloaded GDC primary tumor cohort with RNA-seq read count data comprised 9735 samples from 9680 unique cases. There were 68 cases at the GDC that had missing age at diagnosis but existing values in the Poore et al. data; we chose not to exclude these cases and used the Poore et al. age at diagnosis values for them. TCGA curated survival phenotypic data45 were obtained from UCSC Xena. Cases that had both missing overall survival (OS) and progression-free interval (PFI) outcome data were excluded from survival modeling.

TCGA curated drug response clinical data were compiled from Ding et al.14 Our drug response models used the following binary classification targets: complete response (CR) and partial response (PR) were labeled as responders and stable disease (SD) and progressive disease (PD) as non-responders. All TCGA samples with drug response phenotypic data were from pre-treatment biopsies. Due to the limited cancer-drug combination cohort sizes in TCGA, we modeled each drug individually, even when a patient received multiple drugs concurrently. If the same drug was given at multiple timepoints to a patient, we only considered their first drug response. We considered cancer-drug combinations that contained a minimum of 18 cases and at least 4 cases per binary response class, except for STAD oxaliplatin, where we allowed a minimum of 14 cases so that the gene expression dataset could be included. In total, we analyzed 30 cancer-drug combinations that had paired microbial abundance and gene expression data meeting the above thresholds. Combined feature microbial abundance and gene expression datasets were created by joining data from each individual dataset that had matching TCGA sample UUIDs. For some TCGA cases, data existed from multiple different aliquots per sample or multiple technical runs per aliquot; in these cases, all combinations were joined at the sample UUID level. Cross-validation sampling probability weights as well as model and scoring sample weights were applied to account and adjust for any imbalance caused by this process. Supplementary Data 1 contains a full accounting of the cohort sizes used in each computational experiment, broken down by cancer, feature type, and machine learning target, per drug treatment or survival outcome target.
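The response labeling and cohort size thresholds described above can be sketched in a few lines of Python. This is only an illustration: the function names and the one-label-per-case input layout are assumptions, not part of the original analysis code.

```python
from collections import Counter

# Response categories as described in the text: CR/PR are responders, SD/PD are not.
RESPONDER = {"CR", "PR"}
NON_RESPONDER = {"SD", "PD"}


def binarize_response(label):
    """Map a clinical response label to 1 (responder) or 0 (non-responder)."""
    if label in RESPONDER:
        return 1
    if label in NON_RESPONDER:
        return 0
    raise ValueError(f"unexpected response label: {label!r}")


def cohort_is_eligible(labels, min_cases=18, min_per_class=4):
    """Check the cohort size thresholds for one cancer-drug combination.

    `labels` holds one first-response label per case (hypothetical layout).
    """
    counts = Counter(binarize_response(label) for label in labels)
    enough_cases = len(labels) >= min_cases
    enough_per_class = min(counts[0], counts[1]) >= min_per_class
    return enough_cases and enough_per_class
```

The STAD oxaliplatin exception would correspond to calling `cohort_is_eligible(labels, min_cases=14)`.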

Statistics

ML modeling

Machine learning (ML) models were built using the scikit-learn46 and scikit-survival libraries47,48,49. Custom extensions to scikit-learn and scikit-survival were developed to add methods and functionalities required by this project. Survival models were built using Coxnet, regularized Cox regression with elastic net penalties11. Coxnet models controlled for gender, age at diagnosis, and tumor stage clinical prognostic covariates by including them as unpenalized features in the model (Coxnet penalty factor = 0). Drug response classification models were built using three different ML methods: (1) a variant of the linear support vector machine recursive feature elimination (SVM-RFE) algorithm15 that we developed with a number of additional features and better performance than the scikit-learn built-in version, (2) logistic regression (LGR) with elastic net16 (L1 + L2) penalties and embedded feature selection, and (3) LGR with an L2 penalty and limma17 (for tumor microbial and combination datasets) or edgeR18,19 (for RNA-seq count datasets) differential abundance/expression feature scoring within a k-best wrapper feature selection method around the learning algorithm. Limma differential abundance analysis was run inside the ML pipeline with default parameters except for fitting an intensity-dependent trend to the prior variances and running a robust empirical Bayes procedure (eBayes function parameters trend = TRUE and robust = TRUE). edgeR differential expression analysis was run inside the ML pipeline with default parameters except for enabling robust estimation of the negative binomial dispersion (calcDispersions function robust = TRUE) and robust estimation of the prior quasi-likelihood (QL) dispersion (glmQLFit function robust = TRUE). Both limma and edgeR methods scored and ranked features by differential abundance/expression p-value.

All three drug response ML methods unconditionally included the same three clinical covariates in the model as in the prognosis models by having them bypass feature selection in the ML pipeline, though in drug response models, clinical covariates were modeled as L2 penalized features. In SVM-RFE, clinical covariate features bypassed recursive feature elimination but were always included at each RFE model fitting step as well as in the final model refitting. To the best of our knowledge, no available comprehensive ML library in Python or R currently provides an elastic net LGR algorithm with the functionality to specify features that can bypass embedded feature selection and be modeled with an L2 penalty (setting the R glmnet penalty factor, for example, does not provide this functionality, as it is not a penalty factor per regularization term but a factor applied to the sum of both L1 and L2 terms). To develop this functionality for our study, our elastic net LGR model pipeline was designed as a two-level LGR: (1) an elastic net LGR with embedded feature selection on only microbial abundance or gene expression features, with clinical covariates bypassing this step, followed by (2) an L2 penalized LGR on the features selected by the elastic net LGR step and the clinical covariates. We recognize that this design is unlikely to produce exactly the same model settings and results as a single-level elastic net LGR algorithm with the functionality we needed, if such an implementation existed. However, we tested every drug response model through an ML pipeline with elastic net LGR and no clinical feature selection bypass and found that model predictive performance, feature coefficients and signs, and feature importance rankings were similar to those of our two-level ML pipeline setup.

Gender was one-hot encoded and tumor stage was ordinal encoded by major stage. In the final cohort included in our prognosis and drug response models, 3363 out of 9708 tumor microbial abundance cases (34.64%) and 3244 out of 9484 gene expression cases (34.21%) had tumor stage “not reported” or BRCA stage “X”. Since missing tumor stage metadata is so prevalent in TCGA, we took the approach of including these cases in our study and modeled missing tumor stage with as neutral an ordinal encoding as possible. Looking at the distribution of reported major tumor stages in our cohort, we determined that encoding missing data as an ordinal between tumor stage II and III was as close to the middle of the distribution of stages in TCGA as we could possibly achieve with ordinal encoding.
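A minimal sketch of the stage encoding idea follows. The numeric codes, the value 2.5 for missing stages, and the expected "Stage IIA"-style string format are illustrative assumptions; the text only states that missing values were placed between stage II and III.

```python
import re

# Ordinal codes for major stages. The 2.5 used for missing values is an
# assumed placement "between stage II and III", not a value from the text.
MAJOR_STAGE_CODE = {"0": 0.0, "I": 1.0, "II": 2.0, "III": 3.0, "IV": 4.0}
MISSING_STAGE_CODE = 2.5


def encode_tumor_stage(stage):
    """Ordinal-encode a stage string such as "Stage IIA" by its major stage."""
    text = (stage or "").strip().lower()
    match = re.fullmatch(r"stage\s+(0|iv|iii|ii|i)[abc]?\d*", text)
    if match is None:
        # Covers "not reported", "stage x", empty values, and any
        # stage string this simple pattern does not recognize.
        return MISSING_STAGE_CODE
    return MAJOR_STAGE_CODE[match.group(1).upper()]
```

Substages collapse onto their major stage, so "Stage IIA" and "Stage IIB" both map to 2.0, while "not reported" and "Stage X" map to the neutral midpoint.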

All prognosis and drug response models included the previously described feature selection as well as normalization and transformation steps integrated into the ML modeling pipeline using an extended version of the scikit-learn Pipeline framework. Each cancer, data type, and survival or drug response target type combination was modeled individually using a nested cross-validation (CV) strategy to perform model selection and evaluation on held-out test data. Training data splits always underwent feature selection, normalization, and transformation through the ML pipeline independently from held-out test or validation data splits before learning. Models built using gene expression read count data included edgeR low count filtering, weighted trimmed mean of M-values (TMM) normalization, and log counts per million (CPM) transformation steps within the ML pipeline. These were developed and integrated into our scikit-learn-based framework via R and rpy2. All models also included standardization of features within the ML pipeline just before learning. During prediction, held-out test or validation data were feature selected, normalized, and transformed through the ML pipeline using the parameters learned from the training data at each pipeline step before model prediction and scoring. Hyperparameter search and optimization of all model pipeline steps was performed in nested fashion within the inner nested CV. All cross-validation iterators kept replicate sample data per case grouped together such that data from a case would reside in either the train or the test split, never both, during each CV iteration.
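The grouping constraint in the last sentence can be illustrated with a small, standard-library-only splitter. It is a simplified sketch: it is not stratified, it is not the authors' custom iterator, and the `case_ids` input layout is an assumption.

```python
import random


def grouped_shuffle_split(case_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that all samples of a case share one side.

    `case_ids[i]` is the case identifier of sample i (hypothetical layout).
    Returns (train_indices, test_indices). Unlike the CV described in the
    text, this sketch does not stratify on the outcome.
    """
    unique_cases = sorted(set(case_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_cases)
    n_test = max(1, round(len(unique_cases) * test_fraction))
    test_cases = set(unique_cases[:n_test])
    train = [i for i, case in enumerate(case_ids) if case not in test_cases]
    test = [i for i, case in enumerate(case_ids) if case in test_cases]
    return train, test
```

Because the split is drawn over cases rather than samples, replicate WGS or RNA-seq samples from one case can never leak across the train/test boundary.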

Survival models used a stratified and randomly shuffled outer CV with 75% train and 25% test split sizes that was repeated 100 times. The CV procedure stratified the splits on event status. Each training set from the outer CV was used to perform hyperparameter tuning and model selection by optimizing Harrell’s concordance index (C-index) over a stratified, randomly shuffled, 4-fold inner CV on the training set repeated 5 times. A number of cancer datasets contained fewer than 4 uncensored cases, which required reducing the number of inner CV folds for these models such that at least one case per fold was uncensored. The data derived from Poore et al. often included more than one sample per case, and an unequal number of samples between cases, therefore requiring either ML model sample weighting or CV random sampling per case, depending on what is supported by the modeling and scoring methods used. The Coxnet implementation in scikit-survival does not currently support sample weighting; therefore, our custom outer CV iterator randomly sampled one replicate sample per case during each iteration, using a sampling procedure with probability weights that balanced the probability that a replicate WGS- or RNA-seq-based sample was selected during each CV iteration. Model selection grid search was performed on the following hyperparameters: elastic net penalty L1 ratios 0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99, and 1, and for each L1 ratio a default alpha path of 100 alphas using an alpha min ratio of 10^-2. Alpha is the constant multiplier of the penalty terms in the Coxnet objective function. Optimal alpha and L1 ratio settings were determined via inner CV and a model with these settings was then refit on the entire outer CV train data split. Model performance was evaluated in both inner and outer CV on each held-out validation or test data split, respectively, by generating model test predicted risk scores and using these scores to directly calculate C-index scores. We also evaluated and compared model predictive performance for each test data split survival time interval by calculating time-dependent cumulative/dynamic AUCs12,13.
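For readers unfamiliar with the C-index, the following is a minimal pure-Python sketch of Harrell's concordance index computed from predicted risk scores. It is a simplification for illustration (pairs with tied observed times are skipped) and is not the scikit-survival implementation used in the study.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: observed follow-up times; events: 1 if the event occurred, 0 if
    censored; risk_scores: higher score means higher predicted risk.
    Simplification: pairs with tied observed times are not counted.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on the case with the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ordered correctly by the risk score, 0.5 corresponds to random ordering, and the `ValueError` branch reflects why folds without any uncensored case cannot be scored.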

Drug response models used a stratified, randomly shuffled, 4-fold outer CV that was repeated 25 times (i.e., 100 model instances). Each training set from the outer CV was used to perform hyperparameter tuning and model selection by optimizing the area under the receiver operating characteristic curve (AUROC) over a stratified, randomly shuffled, 3-fold inner CV repeated 5 times. Case replicate sample weights were provided to SVM-RFE and LGR learning algorithms and to all model selection and evaluation scoring methods. Class weights were provided to SVM-RFE and LGR learning algorithms to adjust for any class imbalance. Model selection grid search was performed on the following hyperparameters: L2 penalized SVM and LGR C regularization parameter from a range of 10^-5 to 10^3, elastic net LGR L1 ratios of 0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99, and 1, elastic net LGR C regularization parameter from a range of 10^-2 to 10^3 (microbial abundance) or from 10^-2 to 10^1 (gene expression and combined data type), and finally RFE, elastic net LGR, and limma and edgeR feature scorer k-best feature selection search range from 1 to 400 top scoring microbial abundance, gene expression, or combined data type features. For microbial abundance models (which started with 1287 features in the Poore et al. data), SVM-RFE eliminated the single worst feature per recursive step. For gene expression models (starting with 60,483 features in GENCODE v22) and combined data type models (starting with 61,770 features), SVM-RFE eliminated 5% of the worst remaining features per recursive step until 1300 features were reached, followed by the single worst feature per recursive step. Optimized hyperparameter settings were determined via inner CV and a model with the optimized settings was then refit on the entire outer CV train data split. Model performance was evaluated in both inner and outer CV on each held-out validation or test data split, respectively, by AUROC, average precision (AVPRE) or area under the precision-recall curve (AUPRC), and balanced accuracy (BCR). AUROC was used to evaluate and select the best model and optimized hyperparameter settings from the grid search.
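The two-phase SVM-RFE elimination schedule can be made concrete by listing the feature counts visited at each recursive step. The sketch below only simulates the counts; the rounding rule (floor, never stepping below the switch point) and the function name are assumptions, since the text does not specify them.

```python
def rfe_step_schedule(n_features, fraction=0.05, switch_at=1300, stop_at=1):
    """Return the feature counts visited by a two-phase elimination schedule.

    While more than `switch_at` features remain, drop `fraction` of them per
    step (never dropping below `switch_at`); afterwards drop one per step.
    The rounding rule (floor, at least one feature) is an assumption.
    """
    counts = [n_features]
    remaining = n_features
    while remaining > stop_at:
        if remaining > switch_at:
            step = max(1, int(remaining * fraction))
            step = min(step, remaining - switch_at)
        else:
            step = 1
        remaining -= step
        counts.append(remaining)
    return counts
```

With 1287 microbial features the schedule is already below the switch point, so every step removes exactly one feature; with 60,483 gene expression features the first step removes 3024 features under the assumed floor rounding.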

Gender, age at diagnosis, and tumor stage clinical covariate-only survival models were built using standard unpenalized Cox regression. Clinical covariate-only drug response models were built using L2 penalized linear SVM or LGR. Models included standardization of features as part of the ML pipeline. Models were trained and tested using the same outer CV iterators and train/test data splits as their corresponding microbial abundance, gene expression, or combination data type models. To test whether a Coxnet, SVM-RFE, or LGR microbial abundance or gene expression model was significantly better than its corresponding Cox, linear SVM, or LGR clinical covariate-only model, respectively, a two-sided Wilcoxon signed-rank test was performed between the 100 pairs of C-index or AUROC scores from the two models. All raw p-values generated from the signed-rank test across survival or drug response analyses from the same data type were adjusted for multiple testing using the Benjamini-Hochberg (BH) procedure to control the false discovery rate (FDR), and a threshold FDR ≤0.01 was used to determine statistical significance. To test whether a combined data type model was significantly better than its corresponding microbial abundance or gene expression model, a two-sided Dunn test was performed between all three groups of data type model scores. Each Dunn test raw p-value was adjusted for multiple testing using the Benjamini-Hochberg (BH) procedure to control the false discovery rate (FDR), and a threshold FDR ≤0.05 was used to determine statistical significance.
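As a reminder of what the BH adjustment does to a set of raw p-values, here is a short standard-library sketch of the standard step-up procedure. It is illustrative only and is not the code used in the study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (FDR) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A model comparison would then be called significant when its adjusted value is at or below the chosen FDR threshold (0.01 for the signed-rank tests, 0.05 for the Dunn tests).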

Permutation tests were performed by shuffling dataset class labels 1000 times and each time running the outer CV procedure on the permuted dataset, where for each CV iteration we fit a model instance and calculated an AUROC score, totaling 100,000 fits and scores for each model. Permutation mean AUROC scores were compared to the true mean AUROC score for the model, and a one-sided empirical p-value was calculated from the fraction of permutation mean scores that were greater than or equal to the true mean score. A p-value ≤0.05 was used to determine statistical significance. The Freedman-Diaconis rule was used in permutation test histogram plots to compute the bin width. Analysis of the effect of the number of selected features on model performance was performed using the hyperparameter grid search and tuning that occurred in the nested inner CV during each model instance fitting, where scores for every combination of hyperparameter setting and inner CV train/validation fold were saved for all model instances and used for plotting.
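The empirical p-value and the histogram bin width described above reduce to two small formulas. The sketch below uses only the standard library; the quartile interpolation method is an assumption, since plotting libraries may compute quartiles slightly differently.

```python
from statistics import quantiles


def empirical_p_value(true_mean_score, permutation_mean_scores):
    """One-sided empirical p-value: the fraction of permutation mean scores
    that are greater than or equal to the true mean score."""
    n_perm = len(permutation_mean_scores)
    if n_perm == 0:
        raise ValueError("need at least one permutation score")
    n_ge = sum(1 for s in permutation_mean_scores if s >= true_mean_score)
    return n_ge / n_perm


def freedman_diaconis_bin_width(values):
    """Freedman-Diaconis bin width: 2 * IQR / n^(1/3).

    Quartiles use linear interpolation ("inclusive" method), an assumption.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(values) ** (1 / 3)
```

With 1000 permutations, the smallest non-zero p-value this definition can return is 0.001.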

Microbial abundance model feature analysis

For each analysis, 100 prognosis or drug response model instances were generated from the outer CV procedure. Each model instance selected a subset of features that performed best during CV, and the model algorithm learned coefficients (or weights) for each feature. To select microbial genera for downstream investigation from the feature results across all these model instances, we proceeded as follows. First, we applied a two-sided Wilcoxon signed-rank test of whether the mean feature coefficient rank generated by the model is shifted away from zero, and thus whether the genus is identifiably positively or negatively associated with survival or drug response. For all Wilcoxon tests, we used the package coin50, which allows exact calculation of p-values. Coefficients were ignored when a genus was assigned a zero coefficient or was absent from a model. Second, within each model, all coefficients, regardless of the results of the Wilcoxon test, were ranked by absolute magnitude. We then kept genera that were among the top 50 features in at least 20% of the models and for which the Holm-adjusted, two-sided Wilcoxon signed-rank test p-value was ≤0.01. A Coxnet feature coefficient equal to zero, or a feature being absent from an SVM-RFE or LGR model, was not strong enough evidence that the genus has no effect; rather, it indicates that a number of features with stronger effect were selected. Thus, we ignored genera with a zero coefficient or absent from a model when computing mean coefficient weight and Wilcoxon statistics on the means.
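The final selection rule (top 50 in at least 20% of models and Holm-adjusted p ≤0.01) can be sketched as follows. The dictionary inputs, the function names, and the assumption that the Holm adjustment is applied across all genera passed in are illustrative; the Wilcoxon p-values themselves are taken as given.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


def select_genera(top50_fraction, raw_p_values, min_fraction=0.20, alpha=0.01):
    """Keep genera in the top 50 features of >= 20% of models with Holm p <= 0.01.

    Both dicts map genus name -> value (hypothetical input layout):
    top50_fraction is the share of model instances in which the genus ranked
    in the top 50; raw_p_values holds the raw signed-rank test p-values.
    """
    names = sorted(raw_p_values)
    adjusted = dict(zip(names, holm_adjust([raw_p_values[n] for n in names])))
    return [
        n for n in names
        if top50_fraction.get(n, 0.0) >= min_fraction and adjusted[n] <= alpha
    ]
```

Both conditions must hold, so a genus with a very small p-value is still dropped if it rarely ranks among the top 50 features.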

For the drug response models, where three ML methods were tested, we noted the features selected by individual models and the median rank the feature attained in the instances in which it appeared, but further filtered the features to account for the consensus between ML models. We kept features selected in any two ML model methods that individually met our criteria for inclusion, ignoring features in ML models that did not meet these criteria. We then computed the Spearman correlation between the median ranks attained by the features.
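For reference, the Spearman correlation between two lists of median ranks is the Pearson correlation of their rank-transformed values. The sketch below is a plain standard-library version with average ranks for ties; it is illustrative and assumes neither input is constant.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_correlation(x, y):
    """Spearman rank correlation (assumes neither input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Here `x` and `y` would hold the median ranks of the same consensus features under two different ML methods.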

For each selected microbial feature, we tested whether it was significant as a univariate feature of survival or drug response. This is a strictly different question from whether the coefficient of a feature has a consistent sign: the sign may be consistent when the feature is used in combination with other features, but the feature may not be individually predictive. For drug response models, we divided individuals into responders and non-responders, and for survival data we divided individuals into those whose survival time was greater or less than the censored median, ignoring those who were lost to follow-up before the median time. For cases that had technical replicates, we randomly selected a single replicate. For each cancer-test type pair, we applied a two-sided Wilcoxon rank-sum test. We applied a Benjamini-Hochberg multiple hypothesis correction for each cancer-test type pair and report the false discovery rate in Supplementary Data 2.

We analyzed the distribution of features, selected by the rules described above, that had positive or negative signs for their mean coefficient. We used a two-sided binomial test to show that selected features had significantly more negative than positive mean coefficients. We used a two-sided Fisher’s exact test to determine whether selected genera belonging to Firmicutes had a statistically significant difference in the breakdown between positive and negative mean coefficients compared with selected features as a whole.
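The sign-balance check is an exact two-sided binomial test of the number of negative mean coefficients against an expected proportion of 0.5. A minimal standard-library sketch follows; it uses one common definition of the two-sided p-value (summing all outcomes no more likely than the observed one), and any counts passed to it here are hypothetical.

```python
from math import comb


def binomial_test_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test p-value.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k under Binomial(n, p).
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # tolerance for floating-point ties
    return min(1.0, sum(prob for prob in pmf if prob <= threshold))
```

For example, `binomial_test_two_sided(8, 10)` asks how surprising 8 negative signs out of 10 selected features would be if positive and negative signs were equally likely.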

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
