Help & Documentation

Complete reference for the Microbiome Analysis Platform — pipeline, statistics, outputs, and scientific reporting.

Survival Analysis · Microbiome & Multi-omics · Scientific Paper Outputs · 5 Statistical Models · Parallel Secondary Analyses · Classification + DOI/PMID

1. Overview

The Microbiome Analysis Platform is a full-stack web application purpose-built for researchers who study the relationship between the gut (or other) microbiome and time-to-event clinical outcomes such as progression-free survival (PFS), overall survival, or other endpoints. It automates a reproducible, configurable multi-step analysis pipeline and produces publication-quality outputs in standard open formats.

Data ingestion

Upload patient metadata and taxonomy abundance files (CSV, Excel). Multiple files per dataset, versioned and traceable.

Automated pipeline

Cohort selection, extreme value handling, attribute and microbial filtering, stratification, clustering, dimension reduction, and feature scaling — all configurable.

Scientific outputs

Coefficient tables (CSV & JSON), forest plots, volcano plots, Kaplan-Meier curves, box plots, correlation heatmaps, and pipeline summaries — ready for papers.

Scientific context

Linking microbiome composition to survival outcomes presents unique methodological challenges: the feature space is high-dimensional (often thousands of taxa), compositional, sparse, and correlated. The platform addresses this by:

  • Microbial discarding: removing taxa with near-zero variance or low prevalence, reducing noise before modelling.
  • Compositional transformation: optional Centered Log-Ratio (CLR) scaling to handle the compositional nature of abundance data.
  • Microbial clustering: grouping correlated taxa into clusters and using cluster representatives, drastically reducing the effective dimensionality while preserving biological signal.
  • Events-per-variable (EPV) awareness: each analysis method documents its EPV requirements so users can assess feasibility.
  • Multiple survival models: Cox PH, AFT, Frailty, Competing Risks, and Bayesian models — selectable per analysis, with parallel method-comparison runs supported.

2. Biological entity classification service

We provide a dedicated service to support manuscript preparation and supplementary materials: for any list of biological entities (e.g. taxon IDs, gene symbols, protein accessions, or organism names), we can generate a detailed classification table with full taxonomic or functional annotations and traceable references to the scientific literature.

2.1 What we provide

Classification table

Each entity is annotated with its official classification (e.g. kingdom, phylum, class, order, family, genus, species), standard identifiers (NCBI Taxonomy ID, where applicable), and a short description. The table is formatted for direct use in papers or as a supplementary file.

Literature references (DOI & PMID)

For each classification or nomenclatural decision, we attach the DOI (Digital Object Identifier) and PMID (PubMed ID) of the primary literature that documents that classification. This supports reproducibility and meets journal requirements for cited taxonomy and nomenclature.

2.2 DOI & PMID references

Every classification or nomenclatural assignment in the table is linked to the primary scientific publication that documents it. We supply both DOI (e.g. 10.1038/...) and PMID (PubMed ID) so you can cite the source in your Methods or supplementary materials and meet journal requirements for traceable taxonomy and nomenclature.

2.3 Use cases

  • Supplementary table of all microbial taxa (e.g. genera or species) reported in your analysis, with full taxonomy and references.
  • Gene or protein lists with functional classification and literature supporting each assignment.
  • Organism or strain lists with standardised names and the publication(s) establishing the current classification.

2.4 How to request

Contact the platform maintainer with your entity list (one identifier per line or as a CSV column) and the type of classification desired (taxonomic, functional, or both). Deliverables include a CSV/Excel table and an optional PDF summary with DOI and PMID links for easy verification.

3. Getting Started

  1. Log in via Google OAuth. Your data is private to your account.
  2. Create a dataset from the Dashboard → New Dataset. Give it a descriptive name.
  3. Upload files in the Files tab: at minimum one patient/metadata file and one taxonomy abundance file.
  4. Create an analysis in the Analysis tab → New Analysis. Configure Data Sources, Pre-Analysis, Analysis method, Post-Analysis, and Output options.
  5. Run the analysis. A draggable progress dialog shows each pipeline step in real time. Secondary analyses (per-stratum, per-method) run in parallel automatically.
  6. Download results from the Reports tab: CSV tables, JSON, and image files (JPG/PNG) are available for direct import into R, Python, or statistical software.
Before running: Make sure your patient file contains a duration column (numeric, > 0) and an event column (0 = censored, 1 = event). These names are configured in the project metadata.
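
The pre-run check above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the platform's own validation code: the function name validate_survival_columns and the patient_id column are hypothetical, while duration_pfs and pfs_status are the example column names used in this documentation. It covers the two-state coding only (0 = censored, 1 = event); competing-risks codes (2, 3, …) would need a wider allowed set.

```python
import csv
import io


def validate_survival_columns(rows, duration_col="duration_pfs", event_col="pfs_status"):
    """Return a list of (row_number, message) problems; an empty list means OK."""
    problems = []
    for i, row in enumerate(rows, start=1):
        try:
            duration = float(row[duration_col])
        except (KeyError, TypeError, ValueError):
            problems.append((i, f"{duration_col} is missing or not numeric"))
        else:
            if not duration > 0:  # written this way so NaN is rejected too
                problems.append((i, f"{duration_col} must be > 0"))
        if str(row.get(event_col, "")).strip() not in {"0", "1"}:
            problems.append((i, f"{event_col} must be 0 or 1"))
    return problems


# Hypothetical three-patient file: P2 has a zero duration, P3 an unexpected event code.
sample = io.StringIO(
    "patient_id,duration_pfs,pfs_status\n"
    "P1,12.5,1\n"
    "P2,0,0\n"
    "P3,8.0,2\n"
)
problems = validate_survival_columns(csv.DictReader(sample))
for row_number, message in problems:
    print(f"row {row_number}: {message}")
```

Running a check like this on the patient file before upload catches the most common causes of a failed first run.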

4. Workflow Diagram

The full pipeline runs in a fixed order. Secondary runs (per population sector and per analysis method) branch off automatically after the cohort is established at microbial grouping.

[Workflow diagram. Main pipeline flow: Data Sources (Patient · Taxonomy) → Cohort definition (Timepoints · Extremes · Attribute groups · Stratos) → Dimensionality (Attr. discarding · Microbial discarding · Grouping) → Clustering, steps 70-90 (Cluster features · Name clusters · Reduce matrix) → Normalization, step A5 (Z-score · CLR · Min-Max) → Analysis (Cox · AFT · Frailty · Competing Risks · Bayesian) → Output Generation (Forest · Volcano · Box · KM · Radar · Heatmap · Summary). Secondary (parallel) branches: population-sector analyses, one run per stratum (e.g. High-Risk vs. Low-Risk), and method comparison. Secondary analyses share the same cohort; each runs clustering→output independently, in parallel.]
Full pipeline architecture. Main analyses flow top-right; secondary (population-sector and method-comparison) analyses are forked in parallel after cohort definition and share the same grouping.

5. Data Sources & File Requirements

5.1 File formats

Each dataset holds an arbitrary number of versioned files. In the Analysis configuration (Data Sources tab) you select:

| Role | Typical content | Rows | Key columns | Accepted formats |
| --- | --- | --- | --- | --- |
| Patient data file | Clinical & survival metadata | One row per patient | duration_pfs, pfs_status, clinical variables, stratification variables | CSV, XLSX |
| Taxonomy / abundance file | Bracken/Kraken read counts or relative abundances per taxon | One row per sample (must be joinable to patient) | Taxon IDs (NCBI or custom) as column names; sample identifier column | CSV, XLSX |

5.2 Survival columns

The two mandatory survival columns are configured in project metadata (metadata/COLUMNS.py, KAPLAN_MEIER key):

  • duration (e.g. duration_pfs): numeric, strictly > 0. Units are arbitrary (months, days) but must be consistent.
  • event (e.g. pfs_status): integer — 0 = right-censored, 1 = event of interest. For competing risks, 2, 3, … denote competing event types.
[Figure: Illustrative patient follow-up timeline (right-censored data). Patients P1 to P4 are followed from t = 0; markers show event (status=1), censored (status=0), and competing event (status=2).]
Four hypothetical patients. Solid circles = primary event (status = 1). Open squares = right-censored (status = 0). Triangle = competing event (status = 2, for competing-risks analyses). Duration is measured from enrolment to each symbol.

6. Pipeline Stages

The pipeline runs steps in numeric order. Each step logs row and column counts; the pipeline summary CSV reports the state at every stage. Parameters are set in the Analysis editor under five tabs: Data Sources, Pre-Analysis, Analysis, Post-Analysis, Output.

6.1 Pre-analysis steps

KM for stratos (output only)
Generates Kaplan-Meier survival curves for each stratification variable before any filtering. Used to visualise baseline group differences.
Timepoints
Selects which Bracken/abundance timepoints to include (e.g. baseline only, or multiple time-points merged). Resolves and joins source files into a single analysis frame.
Extremes
Optional winsorisation or exclusion of extreme-value patients. Configurable percentile thresholds (e.g. lower/upper 5th percentile). The example analysis started with 56 patients; 34 were preserved after extremes handling (lower edge = 17, upper edge = 17).
Attribute groups
Reads column-group metadata (e.g. clinical vs. microbial vs. stratification columns). Ensures downstream steps apply the correct filtering policy to each group. In the example, 69 clinical columns were classified.
Stratos selection
Restricts the cohort to the selected strata within each stratification variable (e.g. keep only "High-Risk" and "Intermediate-Risk" from FISH indicators). Removes rows not belonging to any selected stratum. The result is the analysis cohort for all downstream steps.
Alluvial stratos (output only)
Produces an alluvial (Sankey-style) diagram showing how patients flow through stratification layers.
Attribute discarding
Drops non-microbial columns by configurable policy: constant columns (zero variance), near-constant, high missingness, etc. In the example, 2 constant columns were removed (melphalanmgperm2_1, melphalanmgperm2).
Microbial discarding
Drops microbial (taxonomy) features by policy: constant, near-zero abundance, low prevalence across samples, etc. In the example, 1,710 microbial columns were removed, leaving a tractable feature set for clustering.
Ridgeline abundance (output only)
Visualises abundance distributions of retained microbial features using ridgeline plots.
Missingness table (output only)
CSV table of missing-value counts per column after the discarding steps. Useful for reporting data quality in Methods sections.
Univariate screening (output only)
Univariate Cox (or selected-method) screening of each covariate independently. Results exported to CSV/JSON and visualised as a volcano plot (−log₂ p vs. log hazard ratio).
Fix microbial grouping (branching point for secondary analyses)
Applies microbial grouping policy (e.g. aggregate to genus level, use Bracken multi-sample grouping). This is the last shared step before secondary runs fork off.
Clustering (configurable method)
Clusters remaining microbial features using the chosen method (hierarchical, k-means, DBSCAN, etc.). Produces cluster assignments and representatives. Dramatically reduces the effective dimensionality while preserving biological groupings.
Clustering naming
Assigns human-readable labels to clusters (e.g. by dominant taxon or user-defined names). Labels appear in all downstream plots and tables.
Clustered reduction
Reduces the design matrix to cluster representatives (one feature per cluster) plus retained clinical variables. This is the final feature matrix passed to normalization and the survival model.
Correlation heatmap (output only)
Correlation heatmap of the reduced feature matrix. Saved as PNG. Reveals residual collinearity between retained features.
Feature scaling (Normalization)
Scales the feature matrix before the survival model.
Survival analysis
Runs the selected method on the prepared, scaled matrix. Produces the primary coefficient table: coef, exp(coef), se(coef), 95% CI, z, p, −log₂(p). Also reports top-10 statistically significant covariates.
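
As an illustration of the microbial discarding step described above, the following sketch drops constant and low-prevalence taxa. It is a simplified stand-in under stated assumptions: the function name, the 10% prevalence threshold, and the taxon names are hypothetical, and the platform's actual policies and defaults may differ.

```python
def discard_microbial_features(abundance, min_prevalence=0.10, min_variance=1e-12):
    """abundance maps taxon -> list of per-sample values.

    Returns (kept, dropped), where dropped maps taxon -> reason.
    """
    kept, dropped = {}, {}
    for taxon, values in abundance.items():
        n = len(values)
        prevalence = sum(1 for v in values if v > 0) / n  # share of samples with the taxon
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        if variance <= min_variance:
            dropped[taxon] = "constant"
        elif prevalence < min_prevalence:
            dropped[taxon] = "low prevalence"
        else:
            kept[taxon] = values
    return kept, dropped


# Hypothetical abundance matrix: 3 taxa measured in 12 samples.
abundance = {
    "taxon_a": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7],  # present in 1 of 12 samples
    "taxon_b": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],  # constant
    "taxon_c": [1, 4, 0, 9, 2, 0, 3, 8, 5, 0, 6, 2],  # varied and prevalent
}
kept, dropped = discard_microbial_features(abundance)
print(sorted(kept), dropped)
```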

6.2 Feature scaling

Applied to numeric covariates before fitting the survival model. The scaling method is set in the Analysis → Pre-Analysis tab:

| Method | Formula | Best for |
| --- | --- | --- |
| Z-score (default) | x′ = (x − μ) / σ | Cox PH, Frailty, Competing Risks. Puts all features on the same scale; coefficients are interpretable as per-SD effects. |
| CLR | x′ = log(x / g(x)), where g(x) is the geometric mean | Microbial (compositional) abundance data. Removes the unit-sum constraint and log-scales counts. |
| Min-Max | x′ = (x − min) / (max − min) | When bounded [0, 1] inputs are preferred (e.g. some Bayesian priors). |
| None | (no transformation) | When data are already suitably scaled, or for manual inspection. |
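
The three formulas in the table can be written out directly. This is a sketch under assumptions, not the platform's implementation: the z-score uses the sample standard deviation (n − 1), the CLR adds a pseudocount of 0.5 so that zero counts have a finite logarithm, and constant features are assumed to have been discarded already. Z-score and Min-Max are applied per feature across patients, whereas CLR is applied per sample across taxa.

```python
import math


def z_score(values):
    """x' = (x - mean) / sd for one feature across patients."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def min_max(values):
    """x' = (x - min) / (max - min) for one feature across patients."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def clr(composition, pseudocount=0.5):
    """x' = log(x / g(x)) for one sample across taxa.

    log(x / g(x)) equals log(x) minus the mean of the logs, which avoids
    computing the geometric mean explicitly.
    """
    logs = [math.log(v + pseudocount) for v in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


feature = [2.0, 4.0, 6.0]   # one clinical covariate, three patients (made-up values)
sample = [10, 0, 30, 60]    # read counts for four taxa in one sample (made-up values)

print(z_score(feature))
print(min_max(feature))
print(clr(sample))
```

CLR-transformed values always sum to zero within a sample, which is a quick sanity check on any implementation.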

6.3 Clustering methods

Microbial feature clustering reduces dimensionality while preserving co-abundance structure. Available methods:

Hierarchical (agglomerative)
Agglomerative linkage (ward, complete, average, or single). Distance metrics: Euclidean, Manhattan, cosine, correlation. Number of clusters configurable. Produces a dendrogram; deterministic. Best for moderate-dimensional data with interpretable cluster structure.

K-means
Partitions features into k clusters by minimising within-cluster variance. Requires specifying k; random initialisation (multiple restarts). Fast and scalable but assumes spherical clusters.

DBSCAN
Density-based clustering. No need to pre-specify k; identifies arbitrary-shape clusters and noise points (outliers). Parameters: ε (neighbourhood radius), minPts.

No clustering (one feature per cluster)
Each retained feature is treated as its own cluster. Useful when the feature set after microbial discarding is already small, or when you want to model individual taxa directly.
[Figure: Dimensionality reduction flow, from raw taxa to cluster representatives. Raw taxonomy (~2000+ taxa) → discarding (1710 discarded) → N retained features → clustering, steps 70-90 (Cluster A: k taxa; Cluster B: m taxa; Cluster C: n taxa) → reduction to representatives rep(A) · rep(B) · rep(C) + clinical variables → final design matrix.]
Dimensionality reduction: ~2000+ raw taxa → discarding → retained features → clustering → one representative per cluster entering the survival model.
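
To make the clustering-then-reduction idea concrete, here is a small pure-Python sketch of average-linkage agglomerative clustering on the correlation distance 1 − r, followed by picking one representative per cluster. Several things here are assumptions for illustration: the function names, the toy data, and in particular the rule for choosing a representative (the member with the highest mean correlation to the rest of its cluster), which this documentation does not specify.

```python
import math
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_features(features, n_clusters):
    """Average-linkage agglomerative clustering with distance 1 - r."""
    names = list(features)
    dist = {
        frozenset(pair): 1.0 - pearson(features[pair[0]], features[pair[1]])
        for pair in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(a, b):
        # Mean pairwise distance between members of the two clusters.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters


def representative(cluster, features):
    """Assumed rule: member with the highest mean correlation to the others."""
    if len(cluster) == 1:
        return cluster[0]

    def mean_r(name):
        others = [o for o in cluster if o != name]
        return sum(pearson(features[name], features[o]) for o in others) / len(others)

    return max(cluster, key=mean_r)


# Toy data: taxon_a/taxon_b rise together, taxon_c/taxon_d fall together.
features = {
    "taxon_a": [1, 2, 3, 4, 5],
    "taxon_b": [2, 4, 6, 8, 11],
    "taxon_c": [5, 3, 4, 1, 2],
    "taxon_d": [9, 7, 8, 3, 4],
}
clusters = cluster_features(features, n_clusters=2)
representatives = [representative(c, features) for c in clusters]
print(clusters, representatives)
```

For real feature counts, a library implementation (for example scikit-learn, cited in the Key References) is the practical choice; the loop above is quadratic per merge and is meant only to show the logic.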

7. Survival Analysis Methods

All methods model the relationship between the covariate vector X and a time-to-event outcome (T, δ), where T is the observed time and δ ∈ {0, 1} is the event indicator. Every method produces the same standardised output table (Section 8). Configure the method and its parameters in the Analysis → Analysis tab.

[Figure: Kaplan-Meier survival curves (schematic). x-axis: time t (months), 0 to 18; y-axis: S(t) = P(T > t), 0 to 1.00. Curves: Group A (n=18; better) and Group B (n=16; worse), with censoring ticks and the median marked at S(t) = 0.5.]
Schematic Kaplan-Meier survival curves. Vertical drops at event times; small ticks indicate censored observations. Horizontal dashed line at S(t) = 0.5 marks median survival time. The platform generates KM curves per stratification variable automatically.
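
The curve described in the caption is the Kaplan-Meier (product-limit) estimate. A minimal sketch of the estimator follows, using made-up follow-up data for eight patients; the function names are illustrative and the event column is assumed to use the 0/1 coding from Section 5.2.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk  # vertical drop at an event time
            curve.append((t, survival))
        at_risk -= removed  # censored patients leave the risk set without a drop
    return curve


def median_survival(curve):
    """First time at which S(t) falls to 0.5 or below; None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Made-up data: durations in months, 1 = event, 0 = right-censored.
durations = [2, 3, 3, 5, 8, 9, 12, 14]
events = [1, 1, 0, 1, 0, 1, 1, 0]

km_curve = kaplan_meier(durations, events)
km_median = median_survival(km_curve)
for t, s in km_curve:
    print(f"t = {t:>2}  S(t) = {s:.3f}")
print("median:", km_median)
```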
Cox Proportional Hazards Regression

Model: The semi-parametric Cox model factorises the hazard into a non-parametric baseline hazard h₀(t) (unspecified) and a parametric relative risk term:

$$h(t \mid X) = h_0(t)\,\exp(X\beta) \tag{1}$$
$$\log h(t \mid X) = \log h_0(t) + X\beta$$

Estimation: β is estimated by maximising the partial likelihood:

$$L(\beta) = \prod_{i:\,\delta_i = 1} \frac{\exp(X_i\beta)}{\sum_{j \in R(t_i)} \exp(X_j\beta)} \tag{2}$$

where R(tᵢ) is the risk set at event time tᵢ (all subjects still under observation). No baseline hazard parameters are estimated.
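
To see how the risk sets enter the partial likelihood, the sketch below evaluates its logarithm for a tiny made-up dataset. It assumes no tied event times (so no tie correction is needed) and is meant as a reading aid, not as the fitting routine used by the platform.

```python
import math


def cox_log_partial_likelihood(beta, X, durations, events):
    """Log of the Cox partial likelihood, assuming no tied event times."""

    def linear_predictor(row):
        return sum(b * x for b, x in zip(beta, row))

    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(durations, events)):
        if d_i != 1:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone whose follow-up lasts at least until t_i.
        risk_set = [j for j, t_j in enumerate(durations) if t_j >= t_i]
        log_denominator = math.log(sum(math.exp(linear_predictor(X[j])) for j in risk_set))
        total += linear_predictor(X[i]) - log_denominator
    return total


# Four made-up subjects with one binary covariate.
X = [[1.0], [0.0], [1.0], [0.0]]
durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]

ll_at_zero = cox_log_partial_likelihood([0.0], X, durations, events)
ll_at_one = cox_log_partial_likelihood([1.0], X, durations, events)
print(ll_at_zero, ll_at_one)
```

At β = 0 every subject in a risk set is equally likely to fail, so each event contributes −log(size of risk set); here the risk sets at the three event times have sizes 4, 3 and 1, giving −log(12).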

Regularization: An elastic-net penalty can be added: −log L(β) + λ [α ||β||₁ + (1−α)/2 · ||β||₂²], where α is the L1 ratio and λ the penalizer strength.

Output quantities reported per covariate:

  • coef (β̂): log-hazard ratio.
  • exp(coef): Hazard Ratio (HR) = exp(β̂). HR > 1 → increased hazard; HR < 1 → reduced hazard.
  • se(coef): standard error of β̂ from the inverse observed Fisher information matrix.
  • coef lower/upper 95%: Wald CI = β̂ ± 1.96 · SE.
  • exp(coef) lower/upper 95%: exponentiated CI for HR.
  • z: Wald z-statistic = β̂ / SE.
  • p: two-sided p-value from standard normal: p = 2(1 − Φ(|z|)).
  • −log₂(p): negative binary logarithm of p. Values ≥ 4.32 correspond to p ≤ 0.05.
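
The derived quantities in this list follow from coef and se(coef) alone. The sketch below recomputes them for covariate 1359 from the example table in Section 9.1 (coef = 2.891, SE = 1.012); the function name is illustrative, and small differences from the published row come from rounding of the inputs.

```python
import math


def wald_summary(coef, se, z_crit=1.96):
    """Wald z, two-sided p, 95% CI on both scales, and -log2(p)."""
    z = coef / se
    # 2 * (1 - Phi(|z|)) written via the complementary error function.
    p = math.erfc(abs(z) / math.sqrt(2))
    lower, upper = coef - z_crit * se, coef + z_crit * se
    return {
        "coef": coef,
        "exp(coef)": math.exp(coef),
        "coef lower 95%": lower,
        "coef upper 95%": upper,
        "exp(coef) lower 95%": math.exp(lower),
        "exp(coef) upper 95%": math.exp(upper),
        "z": z,
        "p": p,
        "-log2(p)": -math.log2(p),
    }


summary = wald_summary(coef=2.891, se=1.012)
for name, value in summary.items():
    print(f"{name:>20}: {value:.4g}")
```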

Assumptions & requirements

  • Proportional hazards: HR is constant over time.
  • EPV ≥ 10 events per variable (rule of thumb).
  • Independent observations (use Frailty if clustered).

Configurable parameters

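  • α: significance level (default 0.05). This is not the elastic-net mixing ratio, which is set by l1_ratio.
  • penalizer: overall penalty strength λ (default 0, no regularisation).
  • l1_ratio: share of the penalty assigned to L1; 0 = pure L2, 1 = pure L1 (default 0).
  • max_iter: maximum number of fitting iterations (default 1000).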
  • tolerance: convergence threshold (default 1e-6).
Accelerated Failure Time (AFT) Model

Model: The AFT model is fully parametric. It assumes that covariates act multiplicatively on survival time (equivalently, additively on log-time):

$$\log T = \mu + X\beta + \sigma W \tag{3}$$

where W follows a specified error distribution. The parametric family determines the baseline survival function:

| Distribution | W follows | Baseline S₀(t) |
| --- | --- | --- |
| Weibull (default) | Extreme value (Gumbel) | exp(−(t/λ)^ρ) |
| Exponential | Extreme value (Gumbel), ρ = 1 | exp(−t/λ) |
| Log-normal | Normal | 1 − Φ(log t) |
| Log-logistic | Logistic | 1/(1 + (t/λ)^ρ) |

Interpretation: exp(β) is the time ratio (TR): the factor by which expected survival time is multiplied per unit increase in the covariate. TR > 1 → longer expected survival; TR < 1 → shorter.

Comparison with Cox: AFT is more efficient when the distribution is correctly specified, but sensitive to distributional misspecification. Use information criteria (AIC, BIC) to select the distribution.
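
The time-ratio interpretation can be checked numerically for the Weibull case. In the sketch below, a one-unit covariate increase multiplies the Weibull scale λ by exp(β), and the median survival time is multiplied by the same factor. The parameter values (β = 0.4, λ = 10, ρ = 1.5) are made up for illustration.

```python
import math


def weibull_survival(t, scale, shape):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_median(scale, shape):
    """Time at which S(t) = 0.5."""
    return scale * math.log(2) ** (1 / shape)


beta = 0.4             # AFT coefficient (made-up value)
baseline_scale = 10.0  # lambda, in months (made-up value)
shape = 1.5            # rho (made-up value)

time_ratio = math.exp(beta)
shifted_scale = baseline_scale * time_ratio  # effect of a one-unit covariate increase

median_baseline = weibull_median(baseline_scale, shape)
median_shifted = weibull_median(shifted_scale, shape)
print(f"TR = {time_ratio:.3f}")
print(f"median: {median_baseline:.2f} -> {median_shifted:.2f} months")
```

The same factor applies to every quantile, not only the median: the shifted survival curve at time t · TR equals the baseline curve at time t.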

Frailty Model

Model: An extension of Cox regression that introduces a latent (unobserved) random effect uᵢ, the frailty, to account for within-cluster correlation or unobserved heterogeneity:

$$h(t \mid X_i, u_i) = u_i\, h_0(t)\,\exp(X_i\beta) \tag{4}$$

Subjects with uᵢ > 1 are more "frail" (higher baseline risk); those with uᵢ < 1 are more resistant. The frailty term is integrated out of the likelihood:

$$L(\beta, \theta) = \int L(\beta \mid u)\, p(u; \theta)\, du \tag{5}$$

Frailty distributions:

  • Gamma (default): conjugate to the Poisson process; Laplace transform is analytically tractable.
  • Log-normal: heavier tails; may be more flexible but requires numerical integration.
  • Inverse Gaussian: intermediate tail behaviour.
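
For the gamma case, integrating out the frailty has a closed form: with frailty mean 1 and variance θ, the population-level survival at cumulative hazard H is (1 + θH)^(−1/θ). The sketch below compares that formula with a Monte Carlo average over simulated frailties. The values θ = 0.5 and H = 1.2 are made up, and the code illustrates the integration step only, not the platform's estimation procedure.

```python
import math
import random


def gamma_frailty_marginal_survival(cum_hazard, theta):
    """Population survival after integrating out a gamma frailty (mean 1, variance theta)."""
    if theta == 0:
        return math.exp(-cum_hazard)  # no heterogeneity: ordinary survival
    return (1.0 + theta * cum_hazard) ** (-1.0 / theta)


random.seed(0)
theta, cum_hazard = 0.5, 1.2

# Gamma with shape 1/theta and scale theta has mean 1 and variance theta.
frailties = [random.gammavariate(1.0 / theta, theta) for _ in range(20000)]
monte_carlo = sum(math.exp(-u * cum_hazard) for u in frailties) / len(frailties)
closed_form = gamma_frailty_marginal_survival(cum_hazard, theta)

print(f"closed form: {closed_form:.4f}, Monte Carlo: {monte_carlo:.4f}")
```

As θ approaches 0 the closed form approaches exp(−H), which matches the statement in the Glossary that θ = 0 reduces to Cox.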

Cluster-robust alternative: If no cluster column is specified, the platform falls back to Cox regression with a sandwich (robust) variance estimator, which provides consistent standard errors under heterogeneity without requiring a frailty distribution.

EPV recommendation: ≥ 15 events per variable; more clusters improve frailty variance estimation.

Competing Risks Analysis

Setting: When subjects are at risk of multiple mutually exclusive event types (e.g. progression vs. death without progression), standard methods that treat competing events as censoring are biased. The event column must encode: 0 = censored, 1 = event of interest, 2, 3, … = competing events.

The platform supports two complementary frameworks:

Cause-specific hazard (CSH)

$$h_k(t \mid X) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\ K = k \mid T \ge t)}{\Delta t} = h_{0,k}(t)\,\exp(X\beta_k) \tag{6}$$

Estimated by a separate Cox model for each cause k, treating other causes as censored. Measures the rate of the event among those still at risk; useful for aetiology.
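
In practice, "treating other causes as censored" is a recoding of the event column before each cause-specific fit. A minimal sketch, with a hypothetical helper name and made-up event codes:

```python
def cause_specific_indicator(event_codes, cause):
    """1 where the event is the cause of interest, 0 for censoring or any other cause."""
    return [1 if code == cause else 0 for code in event_codes]


# Made-up event column: 0 = censored, 1 = event of interest, 2 = competing event.
event_codes = [1, 2, 0, 1, 2, 1]

indicator_cause_1 = cause_specific_indicator(event_codes, cause=1)
indicator_cause_2 = cause_specific_indicator(event_codes, cause=2)
print(indicator_cause_1)
print(indicator_cause_2)
```

Each recoded indicator is then paired with the unchanged duration column and passed to an ordinary Cox fit for that cause.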

Sub-distribution hazard — Fine-Gray (SDH)

$$\tilde{h}_1(t \mid X) = \tilde{h}_{0,1}(t)\,\exp(X\gamma) \tag{7}$$
$$\mathrm{CIF}_1(t) = 1 - \exp\bigl(-\tilde{H}_1(t)\bigr) = P(T \le t,\ K = 1)$$

Direct model of the cumulative incidence function (CIF). Subjects who experience a competing event remain in the risk set. Coefficients directly describe effects on the probability of the event of interest.
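
The quantity that Fine-Gray regression targets, the cumulative incidence function, also has a simple non-parametric estimate (the Aalen-Johansen estimator). The sketch below computes it without covariates for made-up data; it is not the Fine-Gray model itself, and the function name is illustrative.

```python
def cumulative_incidence(durations, event_codes, cause=1):
    """Aalen-Johansen estimate: [(t, CIF(t))] at times where `cause` occurs."""
    data = sorted(zip(durations, event_codes))
    at_risk = len(data)
    event_free = 1.0  # probability of no event of any type so far
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_cause = n_any = n_removed = 0
        while i < len(data) and data[i][0] == t:
            code = data[i][1]
            n_cause += int(code == cause)
            n_any += int(code != 0)
            n_removed += 1
            i += 1
        if n_cause:
            # Probability of being event-free just before t, times the
            # cause-specific hazard at t.
            cif += event_free * n_cause / at_risk
            curve.append((t, cif))
        event_free *= 1.0 - n_any / at_risk
        at_risk -= n_removed
    return curve


# Made-up data: 0 = censored, 1 = event of interest, 2 = competing event.
durations = [1, 2, 3, 4, 5, 6]
event_codes = [1, 2, 0, 1, 2, 1]

cif_1 = cumulative_incidence(durations, event_codes, cause=1)
cif_2 = cumulative_incidence(durations, event_codes, cause=2)
print(cif_1)
print(cif_2)
```

In this example the two cumulative incidences end at 11/18 and 7/18, which sum to 1 because the last observed patient has an event. Treating competing events as censoring in a Kaplan-Meier curve would not give this decomposition.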

Interpretation: CSH and SDH coefficients can have different signs. CSH is appropriate for mechanistic interpretation; SDH is appropriate when absolute risk prediction is the goal.
Bayesian Survival Model

Model: Bayesian inference places prior distributions over the regression coefficients β. The posterior is:

$$p(\beta \mid \text{data}) \propto L(\text{data} \mid \beta)\, p(\beta) \tag{8}$$

where L is the Cox partial likelihood and p(β) is the prior (typically independent normals with configurable scale). Sampling uses Markov Chain Monte Carlo (MCMC) via PyMC.

Reported quantities:

  • Posterior mean/median of β (used as coef).
  • Posterior SD (used as se(coef)).
  • 95% credible interval (equal-tailed, i.e. 2.5th–97.5th posterior percentiles; reported as coef lower/upper 95%).
  • MCMC convergence: check that effective sample size ≫ 100 and R̂ ≈ 1.0 for all parameters.

Prior:

$$\beta_j \sim \mathrm{Normal}(0, \sigma_{\mathrm{prior}}) \tag{9}$$

Configurable parameters: n_samples (default 2000, min 100, max 10 000) and prior_scale (default 1.0). Regularisation via prior: a smaller scale yields more shrinkage toward zero.
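
The effect of prior_scale on shrinkage can be illustrated without running MCMC. The sketch below uses a normal approximation: if the data alone give an estimate coef with standard error se, a Normal(0, prior_scale) prior yields a closed-form posterior. This approximation is an assumption made for illustration; the platform samples the posterior with PyMC rather than using this formula. The input values are those of covariate 1359 in the example table of Section 9.1.

```python
import math


def normal_prior_shrinkage(coef, se, prior_scale):
    """Posterior mean and SD of beta under a Normal(0, prior_scale) prior,
    treating the likelihood as Normal(coef, se) (conjugate approximation)."""
    precision = 1.0 / se**2 + 1.0 / prior_scale**2
    posterior_mean = (coef / se**2) / precision
    posterior_sd = math.sqrt(1.0 / precision)
    return posterior_mean, posterior_sd


coef, se = 2.891, 1.012

wide_mean, wide_sd = normal_prior_shrinkage(coef, se, prior_scale=10.0)
default_mean, default_sd = normal_prior_shrinkage(coef, se, prior_scale=1.0)
tight_mean, tight_sd = normal_prior_shrinkage(coef, se, prior_scale=0.5)

for label, mean, sd in [("10.0", wide_mean, wide_sd),
                        ("1.0", default_mean, default_sd),
                        ("0.5", tight_mean, tight_sd)]:
    print(f"prior_scale = {label:>4}: posterior mean {mean:.3f}, SD {sd:.3f}")
```

With a wide prior the posterior mean stays close to the unpenalised estimate; as the scale shrinks, the estimate is pulled toward zero and its posterior SD narrows.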

Advantages: Full uncertainty quantification; robust to small EPV due to prior regularisation; model comparison via WAIC/LOO-CV.

Choosing a method
| Method | When to use | Key output | Min EPV |
| --- | --- | --- | --- |
| Cox PH | Default; PH assumption plausible; moderate EPV | Hazard ratio (HR) | 10 |
| AFT | Distribution known; prefer time ratios | Time ratio (TR) | 10 |
| Frailty | Clustered data; unobserved heterogeneity | HR + frailty variance | 15 |
| Competing risks | Multiple event types (progression + death) | CSH or SDH | 15 |
| Bayesian | Small EPV; uncertainty quantification needed | Posterior mean + credible interval | 3 |

8. Secondary Analyses

After the main analysis finishes, the platform automatically launches secondary runs in parallel. Two types are supported:

8.1 Population-sector analyses

One run per stratum of each stratification variable. These run from clustering onwards, sharing the same cohort (established at microbial grouping) but restricted to a single stratum. Example strata from a real analysis:

  • FISH indicators: Intermediate-Risk (n=13), High-Risk (n=5), Favorable (n=16)
  • Disease characteristics: Low Risk (n=15), Intermediate Risk (n=11), High Risk (n=8)
  • Demographics (age): Middle-aged 51-65 (n=19), Elderly >65 (n=13), Young ≤50 (n=2)
  • Genomic markers: No Markers (n=11), Cyclin D1 t(11;14) (n=10), 1q Gain (n=9), TP53 del17p (n=7), MAF rearranged (n=5), FGFR3/MMSET t(4;14) (n=3)

8.2 Method-comparison analyses

If multiple analysis methods are selected in the Analysis → Analysis tab (under Analysis Methods Comparison), each method runs independently in its own subfolder under analysis_method/. Results include all standard outputs (coefficient table, forest plot, volcano, box plot, radar) for each method, enabling direct comparison.

Secondary result location: Each secondary run writes its outputs to a named subfolder. For example, method comparison results live at analysis_method/accelerated_failure_time/ with files prefixed analysis_method_1_accelerated_failure_time_*. Population-sector results live under population_sector/<stratification>/<stratum>/.

9. Outputs & Scientific Reporting

All outputs are written as open, machine-readable files (CSV, JSON) and high-resolution images (JPG/PNG). They are designed to be directly imported into statistical software (R, Python/pandas, SPSS) or inserted into manuscript figures and supplementary tables.

9.1 Primary result table

File: *_results_main_<method>.csv (CSV) and *_results_main_<method>.json (JSON).

One row per covariate. Columns:

| Column | Type | Description | Statistical meaning |
| --- | --- | --- | --- |
| covariate | string | Covariate name (taxon ID or clinical variable) | Identifies the predictor |
| coef | float | Fitted coefficient β̂ | Log-hazard ratio (Cox/Frailty/CR) or log-time ratio (AFT) or posterior mean (Bayesian) |
| exp(coef) | float | exp(β̂) | Hazard ratio (HR) or time ratio (TR); effect size on the original scale |
| se(coef) | float | Standard error of β̂ | Estimated sampling variability; used to compute Wald CI and z |
| coef lower 95% | float | Lower Wald 95% CI: β̂ − 1.96·SE | Lower confidence / credible interval bound on log scale |
| coef upper 95% | float | Upper Wald 95% CI: β̂ + 1.96·SE | Upper confidence / credible interval bound on log scale |
| exp(coef) lower 95% | float | exp(coef lower 95%) | Lower bound of HR/TR CI |
| exp(coef) upper 95% | float | exp(coef upper 95%) | Upper bound of HR/TR CI |
| cmp to | float | Reference value (usually 0) | Value against which β is compared (always 0 for log-scale) |
| z | float | Wald statistic: z = β̂ / SE | Standard normal test statistic. \|z\| > 1.96 ↔ p < 0.05 |
| p | float | Two-sided p-value: 2·(1 − Φ(\|z\|)) | Probability of observing \|z\| ≥ observed under H₀: β = 0 |
| -log2(p) | float | −log₂(p) | Manhattan/volcano plot scale. Value ≥ 4.32 ↔ p ≤ 0.05; ≥ 10 ↔ p ≤ 0.001 |

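Example rows from a real Cox PH analysis (the four most significant covariates, plus one non-significant covariate for contrast):

| covariate | coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | z | p | -log2(p) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| functional_hr | 27.919 | 1.39×10¹² | 8.417 | 11.42 | 44.42 | 3.316 | 0.00091 | 10.10 |
| 1359 | 2.891 | 18.02 | 1.012 | 0.908 | 4.875 | 2.857 | 0.00427 | 7.87 |
| 227942 | 36.450 | 6.76×10¹⁵ | 13.797 | 9.408 | 63.492 | 2.642 | 0.00825 | 6.92 |
| weight_kg | −10.084 | 4.15×10⁻⁵ | 4.362 | −18.63 | −1.54 | −2.311 | 0.02082 | 5.59 |
| beta2microglobulin | 0.047 | 1.049 | 1.962 | −3.798 | 3.893 | 0.024 | 0.981 | 0.028 |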

In the platform's rendering of this table, rows with p < 0.001 (−log₂(p) > 10) are highlighted green and rows with p < 0.05 yellow. Protective covariates have negative coef (HR < 1).
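
Because the result table is plain CSV, it can be filtered with the standard library alone. The sketch below parses an in-memory copy of the example rows (a subset of the columns, with values taken from the table above) and lists the significant and the protective covariates. In a real workflow the StringIO object would be replaced by the opened *_results_main_<method>.csv file.

```python
import csv
import io

# In-memory stand-in for a result file; column names follow Section 9.1.
results_csv = io.StringIO(
    "covariate,coef,exp(coef),se(coef),z,p,-log2(p)\n"
    "functional_hr,27.919,1.39e12,8.417,3.316,0.00091,10.10\n"
    "1359,2.891,18.02,1.012,2.857,0.00427,7.87\n"
    "227942,36.450,6.76e15,13.797,2.642,0.00825,6.92\n"
    "weight_kg,-10.084,4.15e-5,4.362,-2.311,0.02082,5.59\n"
    "beta2microglobulin,0.047,1.049,1.962,0.024,0.981,0.028\n"
)

rows = list(csv.DictReader(results_csv))
alpha = 0.05

significant = [r["covariate"] for r in rows if float(r["p"]) < alpha]
protective = [
    r["covariate"]
    for r in rows
    if float(r["p"]) < alpha and float(r["coef"]) < 0  # negative coef means HR < 1
]

print("significant:", significant)
print("protective:", protective)
```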

9.2 Additional tabular outputs

Pipeline summary CSV

One row per pipeline step. Columns: step name, patients in, patients out, features in, features out, duration (s), status. File: *_pipeline_summary.csv

Cluster definition table

Maps each retained feature to its cluster label, representative status, and summary statistics. File: *_cluster_definition_table.csv

Missingness table

Per-column missing value counts and percentages. Report directly in the data-quality section of your Methods. File: *_missingness_table.csv

Discordance table

Compares directional concordance across analysis methods or strata. Useful for sensitivity analyses. File: *_discordance_table.csv

Univariate screening CSV

Results of individual (univariate) analysis for each covariate before multivariate modelling. File: *_screening_univariate.csv

Step-level CSVs

Intermediate data frames saved after key steps (e.g. *_10_main.csv, *_90_reduced_clusters.csv) for full reproducibility audit. Files: process/*.csv

9.3 Visualizations

[Figure: Forest plot (schematic), HR with 95% CI. Axis ticks at 0.1, 0.5, 2, 5, 10 with a reference line at HR = 1. Rows: functional_hr, 1359, 227942, weight_kg, beta2microglobulin. Legend: Risk (HR>1), Protective (HR<1), Not sig.; * p<0.05, ** p<0.01.]
Schematic forest plot: each row is one covariate. The box marks exp(β̂) (HR or TR); horizontal lines are 95% CI. CI crossing the red reference line (HR = 1) indicates non-significance. Actual forest plots are produced at publication quality as JPG.
[Figure: Volcano plot (schematic). x-axis: log hazard ratio (β); y-axis: −log₂(p), with a dashed line at p = 0.05. Labelled points: 1359, 227942, functional_hr, weight_kg.]
Schematic volcano plot: x-axis = log hazard ratio (β̂), y-axis = −log₂(p). Points above the dashed line (−log₂(0.05) ≈ 4.32) are significant. Red = risk-associated; green = protective. Actual plots are generated as JPG/JSON.
Kaplan-Meier curves

Generated per stratification variable (demographics, FISH indicators, disease characteristics, genomic markers, laboratory values, etc.). File per stratum: *_KM_<stratum>.jpg + *.json.

Box plots

Distribution of each covariate (min, Q1, median, Q3, max) across patient groups. File: *_box_plot.jpg + *.json.

Radar (spider) plot

Clinical profile of each microbial cluster centroid. Helps characterise clusters clinically. File: *_radar_clinical_cluster.jpg.

Correlation heatmap

Pearson correlation matrix of the reduced feature set (post-clustering). Reveals residual collinearity. File: *_correlation_heatmap.png.

Alluvial (Sankey) plot

Flow of patients across stratification layers. Useful for CONSORT-style participant flow diagrams. File: *_alluvial_stratos.jpg.

Ridgeline abundance

Distribution of microbial abundance across patients for each retained taxon. Good for supplementary data quality figures.

9.4 Using results in scientific papers

Publication-ready by design. All result files follow standard field names so they can be imported directly into R (read.csv), Python (pandas.read_csv), or SPSS — no reformatting needed.

Recommended reporting practice for the Methods section:

  1. State which microbiome data pipeline was used (Bracken/Kraken, classification level, timepoints selected).
  2. Describe pre-processing: attribute and microbial discarding criteria, feature scaling method (e.g. "CLR-transformed abundances, then z-scored"), clustering method and number of clusters.
  3. Report EPV: "N events / P covariates = EPV".
  4. Identify the survival model and software library (e.g. "Cox proportional hazards regression implemented via lifelines v0.27").
  5. State significance threshold (α, e.g. 0.05) and correction for multiple testing if applicable.

Recommended reporting for Results:

  • Primary table: covariate, HR (95% CI), z, p — the CSV is ready to paste into a table editor.
  • Forest plot: insert the JPG directly; caption with the model name and n events.
  • Volcano plot: use as a supplementary figure showing the full covariate landscape; label significant hits.
  • KM curves: one per key stratification; include number at risk and log-rank test p-value in the caption.

Example Methods sentence:

"Microbiome features were first aggregated to genus level and filtered to remove constant-value taxa. The resulting feature matrix was clustered using hierarchical clustering (Ward linkage, Euclidean distance, k = 4 clusters) and reduced to cluster representatives. Abundances were CLR-transformed and z-scored. Multivariate survival analysis was performed using Cox proportional hazards regression (lifelines, penalizer = 0, α = 0.05). A total of N events across P covariates (EPV = N/P) were analysed. Statistical significance was defined as p < 0.05 (two-sided Wald test). Secondary analyses were conducted for each disease-risk stratum and for four additional survival models (AFT, Frailty, Competing Risks, Bayesian)."

10. Interpreting Results

10.1 Hazard ratio (Cox / Frailty / Competing Risks)

  • HR = 1.0: no association with hazard.
  • HR > 1.0: each unit increase in the covariate is associated with higher instantaneous event rate (risk factor). Example: HR = 18.0 for covariate 1359 → 18-fold higher hazard per unit increase.
  • HR < 1.0: associated with lower hazard (protective). Example: HR = 4.15×10⁻⁵ for weight_kg → markedly protective.
  • Wide CI: high uncertainty, often due to small EPV or collinearity. Use penalization (L2/L1) or the Bayesian model.

10.2 The −log₂(p) scale

Volcano plots use −log₂(p) on the y-axis. Key thresholds:

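| p-value | −log₂(p) | Significance |
| --- | --- | --- |
| 0.05 | 4.32 | Standard α = 0.05 |
| 0.01 | 6.64 | 1% |
| 0.001 | 9.97 | 0.1% (Bonferroni at α = 0.05 for 50 tests) |
| 0.0001 | 13.29 | 0.01% (Bonferroni at α = 0.05 for 500 tests) |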
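
The thresholds in this table can be reproduced with two one-line helpers. The Bonferroni function shows the usual α / m rule for m tests; both function names are illustrative.

```python
import math


def neg_log2(p):
    """Position of a p-value on the volcano-plot y-axis."""
    return -math.log2(p)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


for p in (0.05, 0.01, 0.001, 0.0001):
    print(f"p = {p:<7} -log2(p) = {neg_log2(p):.2f}")

# With 50 covariates tested at alpha = 0.05, the per-test threshold is 0.001.
threshold_50 = bonferroni_threshold(0.05, 50)
print(threshold_50, round(neg_log2(threshold_50), 2))
```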

10.3 EPV and model reliability

With very low EPV (< 5), coefficients may be unstable or biased. Apply L2 penalization, reduce the number of covariates, or switch to the Bayesian model which applies implicit regularisation via the prior.
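
EPV is quick to compute before choosing a method. The sketch below counts events of interest, divides by the number of covariates, and compares the result with the minimum EPV values listed under "Choosing a method" in Section 7. The cohort (20 events among 34 patients, 5 covariates) is hypothetical.

```python
# Minimum EPV per method, copied from the "Choosing a method" table.
METHOD_MIN_EPV = {
    "Cox PH": 10,
    "AFT": 10,
    "Frailty": 15,
    "Competing risks": 15,
    "Bayesian": 3,
}


def events_per_variable(event_codes, n_covariates):
    """Number of events of interest (code 1) divided by the number of covariates."""
    n_events = sum(1 for code in event_codes if code == 1)
    return n_events / n_covariates


# Hypothetical cohort: 20 events and 14 censored patients, 5 covariates in the model.
event_codes = [1] * 20 + [0] * 14
epv = events_per_variable(event_codes, n_covariates=5)
feasible = [method for method, minimum in METHOD_MIN_EPV.items() if epv >= minimum]

print(f"EPV = {epv:.1f}")
print("methods meeting their minimum EPV:", feasible)
```

Here the EPV of 4 falls below the rule-of-thumb minimum for every frequentist method, which is the situation in which penalization, fewer covariates, or the Bayesian model is recommended.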

10.4 Proportional hazards check

The Cox model assumes that the HR is constant over time. Violation (time-varying HR) can be detected by plotting Schoenfeld residuals vs. time or using Grambsch-Therneau tests. If PH is violated for key covariates, consider time-varying covariates, stratified Cox, or AFT models.

10.5 Bayesian convergence

Examine the MCMC trace plots and R̂ statistics in the process JSON (*_B0_bayesian_info.json if present). R̂ > 1.05 suggests poor mixing; increase n_samples or adjust the prior scale.
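
For readers who want to see what R̂ measures, the sketch below computes the classic Gelman-Rubin statistic from raw chains: it compares the variance between chain means with the average variance within chains. The simulated chains are made up, and current MCMC libraries typically report a refined (rank-normalised, split-chain) version, so values may differ slightly from those in the process JSON.

```python
import math
import random
import statistics


def gelman_rubin(chains):
    """Classic R-hat for one parameter; chains is a list of equal-length sample lists."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


random.seed(1)

# Four chains sampling the same distribution: well mixed.
mixed = [[random.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]

# Two of four chains stuck around a different value: poorly mixed.
stuck = [[random.gauss(shift, 1.0) for _ in range(500)] for shift in (0.0, 0.0, 2.0, 2.0)]

r_hat_mixed = gelman_rubin(mixed)
r_hat_stuck = gelman_rubin(stuck)
print(f"well mixed: {r_hat_mixed:.3f}, poorly mixed: {r_hat_stuck:.3f}")
```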

11. Glossary

Censoring (right)
Observation where the event had not occurred by end of follow-up, loss to follow-up, or study end. Event indicator = 0. Survival analysis uses all available follow-up time without treating censoring as an event.
Hazard function h(t)
Instantaneous rate of the event at time t given survival to t: h(t) = lim(Δt→0) P(t ≤ T < t+Δt | T ≥ t) / Δt. Related to survival: S(t) = exp(−∫₀ᵗ h(u)du).
Hazard ratio (HR)
Ratio of hazards between two covariate values: exp(β·Δx). Constant over time under the proportional hazards assumption.
Time ratio (TR)
In AFT models: the factor by which median (or mean) survival time is multiplied per unit covariate increase. exp(β) in the AFT parameterisation.
Events per variable (EPV)
Number of observed events divided by number of covariates. Rule of thumb: EPV ≥ 10 for Cox to limit small-sample bias. Use penalization or Bayesian methods with lower EPV.
Proportional hazards (PH)
Assumption that hazard ratios are constant over time. Formally: h(t|X₁)/h(t|X₂) = c for all t. Testable via Schoenfeld residuals or log-log plots.
Stratification
Partition of the cohort into mutually exclusive subgroups by a categorical variable (e.g. disease risk, genomic markers). Population-sector analyses run one model per stratum.
Frailty
Unobserved random multiplicative factor on the hazard, representing unobserved individual or cluster-level heterogeneity. Variance θ ≥ 0; θ = 0 reduces to Cox.
Competing risks
Setting where multiple mutually exclusive events can occur. The event of interest is prevented (or its observation altered) by competing events. Cause-specific Cox and Fine-Gray sub-distribution models are the two main approaches.
Cumulative Incidence Function (CIF)
CIF₁(t) = P(T ≤ t, K = 1): the probability of experiencing the event of interest by time t in the presence of competing risks. Modelled directly by Fine-Gray.
CLR (Centered Log-Ratio)
Log-ratio transformation for compositional data: x′ = log(x / g(x)), where g(x) is the geometric mean of the composition. Removes the unit-sum constraint of relative abundances. CLR is a different transformation from the isometric log-ratio (ILR).
Credible interval (Bayesian)
Posterior interval [a, b] such that P(a ≤ β ≤ b | data) = 0.95. Directly probabilistic interpretation (unlike frequentist CI).
MCMC / R̂
Markov Chain Monte Carlo: simulation-based posterior sampling. R̂ (Gelman-Rubin) is a convergence diagnostic: R̂ ≈ 1 indicates chains have mixed; R̂ > 1.05 suggests convergence issues.
Wald test
Test statistic z = β̂ / SE(β̂), compared to standard normal. Used by default for all frequentist methods in the platform. Alternative: likelihood ratio test (more accurate for small samples).
EPV-adjusted penalization
Adding a ridge (L2) or lasso (L1) penalty to the log-likelihood reduces overfitting when EPV is low. The penalty λ shrinks coefficients toward zero; λ = 0 yields standard (unpenalised) MLE.

12. Key References

  1. Cox DR. Regression Models and Life-Tables. J R Stat Soc Ser B. 1972;34(2):187–220.
  2. Fine JP, Gray RJ. A Proportional Hazards Model for the Subdistribution of a Competing Risk. J Am Stat Assoc. 1999;94(446):496–509.
  3. Vaupel JW, Manton KG, Stallard E. The Impact of Heterogeneity in Individual Frailty on the Dynamics of Mortality. Demography. 1979;16(3):439–454.
  4. Wei LJ. The Accelerated Failure Time Model: A Useful Alternative to the Cox Regression Model in Survival Analysis. Stat Med. 1992;11(14–15):1871–1879.
  5. Gelman A, et al. Bayesian Data Analysis. 3rd ed. CRC Press; 2013.
  6. Davidson-Pilon C. lifelines: survival analysis in Python. JOSS. 2019;4(40):1317. lifelines.readthedocs.io
  7. Aitchison J. The Statistical Analysis of Compositional Data. Chapman & Hall; 1986.
  8. Pedregosa F, et al. Scikit-learn: Machine Learning in Python. JMLR. 2011;12:2825–2830.

This documentation is generated from the platform source code and analysis metadata. Statistical formulas reflect the implementations in step_B0_analysis.py and metadata/ANALYSIS_METHODS.py. For questions or issue reports, contact glevcovich@gmail.com.