LungTCR (https://www.lungtcr.com/) is a comprehensive website for analyzing T-cell receptor (TCR) repertoire data with specialized functions for cancer risk assessment. This repository contains:
- TCR repertoire feature calculation pipeline
- Cancer-associated TCR enrichment scoring
- Machine learning models for lung cancer/malignant pulmonary nodule risk prediction
Feature | Description |
---|---|
TCR Repertoire Analysis | Calculates 20+ diversity and clonality metrics |
Cancer TCR Enrichment | Quantifies tumor-associated TCR signatures |
Risk Prediction Models | Random Forest/GBM models for lung cancer risk assessment |
Visualization | Plotting of key TCR features and model result |
The input file format is VDJtools' table format. Run Convert routine by VDJtools (https://vdjtools-doc.readthedocs.io/en/master/input.html#vdjtools-format) to geneate the format.
Column | Required | Description | Example |
---|---|---|---|
count | Yes | Read counts of TCR clones | 161853 |
freq | Yes | Frequency of TCR clones | 0.009385 |
cdr3_nt | Yes | CDR3 nucleic acid sequence | TGTGCCAGTTCGTCGTCTAGCTCCTACAATGAGCAGTTCTTC |
cdr3_aa | Yes | CDR3 amino acid sequence | CASSSSSSYNEQFF |
v | Yes | V gene segment | TRBV6-4 |
d | Yes | D gene segment | . |
j | Yes | J gene segment | TRBJ2-7 |
VEnd | No | Position of the V gene end | 7 |
Dstart | No | Position of the d gene start | . |
Dend | No | Position of the D gene end | . |
Jstart | No | Position of the J gene strat | 18 |
sample_id | No | Sample identifier | Patient01_PBMC |
Example file (tabular format):
count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
161853 0.009385105218213133 TGTGCCAGTTCGTCGTCTAGCTCCTACAATGAGCAGTTCTTC CASSSSSSYNEQFF TRBV6-4 . TRBJ2-1 7 -1 -1 18
128851 0.007471472215355789 TGTGCCAGCTCACCATAGGACAGTGCTTCTCTGGAAACACCATATATTTT CASSP*DS_FSGNTIYF TRBV18 TRBD1 TRBJ1-3 14 17 22 28
107730 0.006246763329429179 TGTGCCAGCAGTTACGGTCTAAGAGATACGCAGTATTTT CASSYGLRDTQYF TRBV6-5 . TRBJ2-3 14 -1 -1 23
...
python TCRfeatureCal.py -m /extdata/metadata.tsv -o output_features/
python diversity_vdjtools_wrapper.py -m /extdata/metadata.txt -o outout_diversity/ -x 10000000
Rscript LungTCR_ModelPrediction.R -i /extdata/test.csv -o output_model/ -m /model/models_list.rds -f /model/SelectedFeatures.csv
Output includes:
- Diversity indices (Shannon, Simpson, etc.)
- Clonality metrics
- V/J gene usage profiles
- CDR3 length distributions
- CDR3 amino acid compositions
- TCR clones frequency distributions
- TCR convergence index
- Lung cancer enrichment score
- Lung cancer prediction result
The code files here are linked to the work "Large-Scale TCR Repertoire Profiling Unveils Tumor-Specific Signals for Diagnosing Indeterminate Pulmonary Nodules" by Chen et al .