T Cell Receptor Modeling Based on VDJdb¶

Author: Jiarui Li
Email: jli78@tulane.edu
Created: 2024/09/29
Updated: 2024/09/29
GitHub Page Link: https://www.jiarui-li.com/CMPS6790-Milestone1/

docflow is a private package maintained on Tulane's GitLab and our own Python package index, which is only accessible from campus. It can be used to generate Markdown/HTML documents from Python code. The full documentation for this package can be found at the links below:
GitLab Link: https://git.tulane.edu/jrestool/docflow
Python Package Index: https://jellyroll.cs.tulane.edu/pypi/simple/docflow/
If you are interested in this package, feel free to open issues on the GitLab page or contact me by email (jli78@tulane.edu).

We use this package to manage the documentation and experiment reports, and to automatically generate the GitHub page.
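As a taste of the API before installing it, here is a minimal sketch built only from the docflow calls that appear later in this notebook (docflow is private, so treat this as illustrative rather than authoritative):

import docflow as doc

# Build a tiny document from the same building blocks used throughout this notebook
demo = doc.Document(
    doc.Title('Hello docflow', level=1),
    doc.Text('One paragraph of body text.'),
)
demo.save('demo.md', format='markdown')  # Render the document to a markdown file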

In [1]:
# Install docflow from our Python package index server
!pip install docflow --index-url https://jellyroll.cs.tulane.edu/pypi/simple/
import docflow as doc
Looking in indexes: https://jellyroll.cs.tulane.edu/pypi/simple/
Requirement already satisfied: docflow in c:\users\11056\anaconda3\envs\datascience\lib\site-packages (0.0.3)
Requirement already satisfied: pandas in c:\users\11056\anaconda3\envs\datascience\lib\site-packages (from docflow) (2.0.3)
Requirement already satisfied: matplotlib in c:\users\11056\appdata\roaming\python\python38\site-packages (from docflow) (3.7.2)
Requirement already satisfied: markdown in c:\users\11056\anaconda3\envs\datascience\lib\site-packages (from docflow) (3.7)
Requirement already satisfied: tabulate in c:\users\11056\anaconda3\envs\datascience\lib\site-packages (from docflow) (0.9.0)
Requirement already satisfied: importlib-metadata>=4.4 in c:\users\11056\anaconda3\envs\datascience\lib\site-packages (from markdown->docflow) (8.5.0)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (1.1.0)
Requirement already satisfied: cycler>=0.10 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (4.42.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (1.4.4)
Requirement already satisfied: numpy>=1.20 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (1.24.4)
Requirement already satisfied: packaging>=20.0 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (23.1)
Requirement already satisfied: pillow>=6.2.0 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (10.0.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (2.8.2)
Requirement already satisfied: importlib-resources>=3.2.0 in c:\users\11056\appdata\roaming\python\python38\site-packages (from matplotlib->docflow) (6.0.1)
Requirement already satisfied: pytz>=2020.1 in c:\users\11056\anaconda3\envs\datascience\lib\site-packages (from pandas->docflow) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in c:\users\11056\anaconda3\envs\datascience\lib\site-packages (from pandas->docflow) (2024.2)
Requirement already satisfied: zipp>=3.20 in c:\users\11056\anaconda3\envs\datascience\lib\site-packages (from importlib-metadata>=4.4->markdown->docflow) (3.20.2)
Requirement already satisfied: six>=1.5 in c:\users\11056\appdata\roaming\python\python38\site-packages (from python-dateutil>=2.7->matplotlib->docflow) (1.16.0)

Project Introduction¶

In [2]:
# Create a dictionary to store all experiment records
lab_doc = {}
# Setup project information as variables
info_project_title = 'T Cell Receptor Modeling Based on VDJdb'
info_author        = 'Jiarui Li'
info_contact       = 'jli78@tulane.edu'
info_page_link     = 'https://www.jiarui-li.com/CMPS6790-Milestone1/'
info_course        = 'CMPS 6790 Data Science'
info_course_link   = 'https://nmattei.github.io/cmps6790/projects/FinalTutorial/FinalTutorial/'
info_intro         = '''
This is the coursework for Milestone 01 of CMPS 6790 Data Science.  
For this project, I will use the `VDJdb` dataset and the task of `T Cell Receptor Modeling`
to practice ETL (Extraction, Transform, and Load) skills.  
The final work will be uploaded to my GitHub page.
'''
# Generate document for project introduction part and show it.
lab_doc['introduction'] = doc.Document(
    doc.Title('Introduction', level=2),
    doc.Sequence({
        'Author':           info_author,
        'Email':            info_contact,
        'GitHub Page Link': info_page_link,
        'Course':           info_course,
        'Course Link':      info_course_link
    }),
    doc.Text(info_intro),
)
lab_doc['introduction']
Out[2]:

Introduction¶

  • Author: Jiarui Li
  • Email: jli78@tulane.edu
  • GitHub Page Link: https://www.jiarui-li.com/CMPS6790-Milestone1/
  • Course: CMPS 6790 Data Science
  • Course Link: https://nmattei.github.io/cmps6790/projects/FinalTutorial/FinalTutorial/

This is the coursework for Milestone 01 of CMPS 6790 Data Science.
For this project, I will use the VDJdb dataset and the task of T Cell Receptor Modeling to practice ETL (Extraction, Transform, and Load) skills.
The final work will be uploaded to my GitHub page.

Workflow And Collaboration Plan¶

In [3]:
# Setup workflow information as variables
info_workflow      = '''
This project is completed independently by Jiarui Li,
so this section mainly describes the workflow.  
1. Code and version management
    The code is managed in a git repository to track
    versions and sync all updates.  
2. Code organization
    For submission convenience, all code lives in a single
    Jupyter notebook and no extra files are required.  
    The package requirements are attached to this notebook.
3. Document organization
    The document is managed by docflow, which lets us show
    everything in the Jupyter notebook and also generate a
    markdown file for the GitHub page. The code used to
    generate the page is below.
4. Development procedure
    Development follows a rapid application development
    cycle: first, decide the problem and target; then plan
    the TODO list for the current cycle; finally, write the
    code, test it, and identify new issues and requirements.
    
'''

# Function used to generate the final github page.
def generate_github_page(doc_dict:dict, title:str='', save_path:str='page.md') -> None:
    '''
    Function used to generate the final github page.
    
    Parameters
    ----------
    doc_dict  (dict[doc.Document]) : The sections of this page.
    title     (str)                : The title of this page.
    save_path (str)                : The path of the generated markdown file
    '''
    _doc_title    = doc.Title(title, level=1)
    # Extract sections as list
    _doc_sections = [_doc for _doc in doc_dict.values()]
    # Prepend title to the section list for catalog generation
    _doc_sections = [_doc_title] + _doc_sections
    _final_doc = doc.Document(
        _doc_title,
        doc.DateTimeStamp('%d-%m-%Y %H:%M:%S'),    # Add page generation timestamp
        doc.Text('\n\n'),
        doc.Catalog(doc.Document(*_doc_sections)), # Add catalog for entire document
        doc.Text('\n\n'),
        *_doc_sections[1:]                         # Add all sections
    )
    _final_doc.save(save_path, format='markdown')
    
# Generate document for workflow and collaboration plan part and show it.
lab_doc['workflow'] = doc.Document(
    doc.Title('Workflow And Collaboration Plan', level=2),
    doc.Text(info_workflow),
    doc.Code(generate_github_page) # Let docflow fetch the code and attach to the document
)
lab_doc['workflow']
Out[3]:

Workflow And Collaboration Plan¶

This project is completed independently by Jiarui Li, so this section mainly describes the workflow.

  1. Code and version management: The code is managed in a git repository to track versions and sync all updates.
  2. Code organization: For submission convenience, all code lives in a single Jupyter notebook and no extra files are required.
    The package requirements are attached to this notebook.
  3. Document organization: The document is managed by docflow, which lets us show everything in the Jupyter notebook and also generate a markdown file for the GitHub page. The code used to generate the page is below.
  4. Development procedure: Development follows a rapid application development cycle: first, decide the problem and target; then plan the TODO list for the current cycle; finally, write the code, test it, and identify new issues and requirements.
def generate_github_page(doc_dict:dict, title:str='', save_path:str='page.md') -> None:
    '''
    Function used to generate the final github page.

    Parameters
    ----------
    doc_dict  (dict[doc.Document]) : The sections of this page.
    title     (str)                : The title of this page.
    save_path (str)                : The path of the generated markdown file
    '''
    _doc_title    = doc.Title(title, level=1)
    # Extract sections as list
    _doc_sections = [_doc for _doc in doc_dict.values()]
    # Prepend title to the section list for catalog generation
    _doc_sections = [_doc_title] + _doc_sections
    _final_doc = doc.Document(
        _doc_title,
        doc.DateTimeStamp('%d-%m-%Y %H:%M:%S'),    # Add page generation timestamp
        doc.Text('\n\n'),
        doc.Catalog(doc.Document(*_doc_sections)), # Add catalog for entire document
        doc.Text('\n\n'),
        *_doc_sections[1:]                         # Add all sections
    )
    _final_doc.save(save_path, format='markdown')

Package Preparation¶

In [4]:
# Import all required packages
import pandas            as pd
import numpy             as np
import matplotlib.pyplot as plt
import json
import os
import requests
import matplotlib
In [5]:
# The packages we want to document
package_doc = [pd, np, matplotlib, requests, json]

# functions used to document packages
def doc_pack(pack:object, max_description:int=200):
    '''
    Document a single package using its embedded metadata.
    
    Parameters
    ----------
    pack            (module) : The package to document.
    max_description (int)    : The max length of the short description.
    '''
    return {
        'Name':        pack.__name__,        # Package name
        'Version':     pack.__version__,     # Package version
        'Install':     f'`pip install {pack.__name__}=={pack.__version__}`',  # Install command
        'Description': pack.__doc__.replace('\n', ' ')[:max_description] + '...', # Short description
    }
    
def doc_packs(packs:list):
    '''
    Document a list of packages and generate a table.
    
    Parameters
    ----------
    packs   (list[module]) : The packages to document.
    '''
    _table = pd.DataFrame([doc_pack(pack) for pack in packs])
    return doc.Table(_table)

info_package = '''
The environment and packages used for this project are documented below:

'''
# Generate document for package part and show it.
lab_doc['environment'] = doc.Document(
    doc.Title('Environment And Packages', level=2),
    doc.Text(info_package),
    doc_packs(package_doc),  # Document selected packages
)
lab_doc['environment']
Out[5]:

Environment And Packages¶

The environment and packages used for this project are documented below:

Name Version Install Description
0 pandas 2.0.3 pip install pandas==2.0.3 pandas - a powerful data analysis and manipulation library for Python ===================================================================== pandas is a Python package providing fast, flexible, a...
1 numpy 1.24.4 pip install numpy==1.24.4 NumPy ===== Provides 1. An array object of arbitrary homogeneous items 2. Fast mathematical operations over arrays 3. Linear Algebra, Fourier Transforms, Random Number Generation How to use t...
2 matplotlib 3.7.2 pip install matplotlib==3.7.2 An object-oriented plotting library. A procedural interface is provided by the companion pyplot module, which may be imported directly, e.g.:: import matplotlib.pyplot as plt or using ipython:...
3 requests 2.32.3 pip install requests==2.32.3 Requests HTTP Library ~~~~~ Requests is an HTTP library, written in Python, for human beings. Basic GET usage: >>> import requests >>> r = requests.get('https://www.python.org...
4 json 2.0.9 pip install json==2.0.9 JSON (JavaScript Object Notation) http://json.org is a subset of JavaScript syntax (ECMA-262 3rd edition) used as a lightweight data interchange format. :mod:json exposes an API familiar to users...
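If one wanted to turn this table into a requirements file, a hypothetical convenience sketch follows (not part of the milestone; note that json ships with CPython, so its pip install command above is only illustrative and it is skipped here):

# Hypothetical: emit a requirements.txt from the documented packages,
# skipping stdlib modules such as json that cannot be pip-installed
stdlib = {'json'}
with open('requirements.txt', 'w') as f:
    for pack in package_doc:
        if pack.__name__ not in stdlib:
            f.write(f'{pack.__name__}=={pack.__version__}\n')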

Dataset¶

In [6]:
info_dataset_name   = 'VDJdb'
info_dataset_link   = 'https://vdjdb.cdr3.net/'
info_dataset_intro  = '''
VDJdb is a curated database of T-cell receptor (TCR) sequences
with known antigen specificities. The primary goal of VDJdb is
to facilitate access to existing information on T-cell receptor
antigen specificities.
'''
info_dataset_reason = '''
T cells are a significant part of the immune system, protecting
the body from cancer, viruses, and other sources of antigens.
Identifying the rules of how T cells are activated can therefore
help us understand how to develop vaccines against diseases. The
T cell receptor (TCR) is the key factor in how T cells work.  
Therefore, we choose this dataset and try to model TCRs.
'''
info_dataset_questions = [
    'Which TCR fragments are strongly related to which epitope (processed antigen) fragments?',
    'Is there a relationship between different TCRs? (For example, does the presence of one fragment lead to the appearance of another?)'
]

# Generate document for dataset part and show it.
lab_doc['dataset'] = doc.Document(
    doc.Title('Dataset', level=2),
    doc.Title(info_dataset_name, level=3),
    doc.Text(f'Dataset Link: {info_dataset_link}'),
    doc.Title('Dataset Introduction', level=4),
    doc.Text(info_dataset_intro),
    doc.Title('Reason for Choosing This Dataset', level=4),
    doc.Text(info_dataset_reason),
    doc.Title('Questions for This Dataset', level=4),
    doc.Sequence(info_dataset_questions)
)
lab_doc['dataset']
Out[6]:

Dataset¶

VDJdb¶

Dataset Link: https://vdjdb.cdr3.net/

Dataset Introduction¶

VDJdb is a curated database of T-cell receptor (TCR) sequences with known antigen specificities. The primary goal of VDJdb is to facilitate access to existing information on T-cell receptor antigen specificities.

Reason for Choosing This Dataset¶

T cells are a significant part of the immune system, protecting the body from cancer, viruses, and other sources of antigens. Identifying the rules of how T cells are activated can therefore help us understand how to develop vaccines against diseases. The T cell receptor (TCR) is the key factor in how T cells work.
Therefore, we choose this dataset and try to model TCRs.

Questions for This Dataset¶

  • Which TCR fragments are strongly related to which epitope (processed antigen) fragments?
  • Is there a relationship between different TCRs? (For example, does the presence of one fragment lead to the appearance of another?)
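As a rough preview of how the first question above might be explored once the data is cleaned, here is a hypothetical sketch (not part of the milestone pipeline; it assumes the `df` dataframe with the `v.segm` and `antigen.epitope` columns built later in this notebook):

# Hypothetical sketch: cross-tabulate V segments against epitopes to
# surface strongly associated pairs
import pandas as pd

pair_counts = pd.crosstab(df['v.segm'], df['antigen.epitope'])
# The largest cells point at V-segment/epitope pairs worth a closer look
print(pair_counts.stack().sort_values(ascending=False).head(10))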

(ETL) Extraction, Transform, and Load¶

Extraction¶

In [7]:
# Define the functions to fetch data from VDJdb database
def get_vdjdb_header(url:str='https://vdjdb.cdr3.net/api/database/'):
    _meta_path  = url.rstrip('/') + '/meta'     # Build the request URL (os.path.join is for filesystem paths, not URLs)
    _response   = requests.get(_meta_path)      # GET: request the data
    if _response.status_code != 200: raise ConnectionError('Fetch header failed.') # Check response status
    _content    = json.loads(_response.content) # Decode json response
    _header_raw = _content.get('metadata')      # Fetch meta data
    if _header_raw is None: raise ValueError('Empty meta data')    # Check empty
    _header_raw = _header_raw.get('columns')    # Fetch columns data
    if _header_raw is None: raise ValueError('Empty columns data') # Check empty
    _header     = [item.get('name') for item in _header_raw]       # Get the headers
    return _header

def get_vdjdb_data(url:str='https://vdjdb.cdr3.net/api/database/'):
    _data_path = url.rstrip('/') + '/search'     # Build the request URL
    _response  = requests.post(_data_path,       # POST: request the data
                               headers={'Content-Type': 'application/json'},     # Request server to response in JSON format
                               data='{ "filters" : [] }')                        # Empty filter for all data
    if _response.status_code != 200: raise ConnectionError('Fetch data failed.') # Check response status
    _data = json.loads(_response.content) # Decode json response
    _data = _data.get('rows')             # Fetch main data
    if _data is None: raise ValueError('Empty data') # Check empty
    _data = [item['entries'] for item in _data]      # Flatten the data
    return _data
    
def get_vdjdb(url:str='https://vdjdb.cdr3.net/api/database/', local_cache:str='vdjdb.raw.csv'):
    '''
    Get the VDJdb data from the URL. The request method follows the VDJdb documentation: https://vdjdb-web.readthedocs.io/en/latest/api.html
    
    Parameters
    ----------
    url (str)         : The URL for VDJdb request.
    local_cache (str) : The local cached data file
    '''
    if os.path.exists(local_cache): # Check whether a cached file exists
        print(f'Use local cache: {local_cache}')
        return pd.read_csv(local_cache, index_col=0)
    else:
        print(f'Fetch VDJdb from {url}')
        print('Get VDJdb header')
        _header = get_vdjdb_header(url) # Get header for the data
        print('Get VDJdb data')
        _data   = get_vdjdb_data(url)   # Get the data
        # Convert data to pandas dataframe
        _dataframe = pd.DataFrame(_data, columns=_header)
        print(f'{len(_dataframe)} records fetched')
        _dataframe.to_csv(local_cache)  # Cache locally for the next load
        print(f'Cached to {local_cache}')
        return _dataframe

Fetch the database¶

In [8]:
url   = 'https://vdjdb.cdr3.net/api/database/'
vdjdb = get_vdjdb(url = url)  # Fetch the database
Fetch VDJdb from https://vdjdb.cdr3.net/api/database/
Get VDJdb header
Get VDJdb data
102990 records fetched
Cached to vdjdb.raw.csv
In [9]:
info_extract_intro = '''
Following the VDJdb documentation (https://vdjdb-web.readthedocs.io/en/latest/api.html),
we used the `requests` package to fetch the entire
VDJdb database and then converted it to a dataframe.
'''
doc_extract = doc.Document(
    doc.Title('Extract', level=3),
    doc.Text(info_extract_intro),
    doc.Sequence({
        'Data URL': url,
        'Number of Records': len(vdjdb),
        'Number of Columns': len(vdjdb.columns)
    }),
)
doc_extract
Out[9]:

Extract¶

Following the VDJdb documentation (https://vdjdb-web.readthedocs.io/en/latest/api.html), we used the requests package to fetch the entire VDJdb database and then converted it to a dataframe.

  • Data URL: https://vdjdb.cdr3.net/api/database/
  • Number of Records: 102990
  • Number of Columns: 16

Transform¶

In [10]:
info_transform_column = '''
Keep the columns we are interested in and drop the others.
'''
columns = ['gene', 'cdr3', 'v.segm', 'j.segm', 'antigen.epitope', 'species', 'vdjdb.score']
# Comment for each column
columns_comment = {
    'gene':            'A TCR is composed of two chains; this denotes which chain the record describes.',
    'cdr3':            'The TCR sequence.',
    'v.segm':          'The V region of the TCR sequence.',
    'j.segm':          'The J region of the TCR sequence.',
    'antigen.epitope': 'The epitope this TCR can bind to.',
    'species':         'Species for this record.',
    'vdjdb.score':     'Experimental confidence score (can be treated as data quality).'
}
# Filter the target columns
df_column_cleaned = vdjdb[columns]

doc_transform_column = doc.Document(
    doc.Title('Filter Columns', level=4),
    doc.Text(info_transform_column),
    doc.Sequence(columns_comment),
    doc.Title('Example', level=5),
    doc.Text('\n'),
    doc.Table(df_column_cleaned[10:15]),
    doc.Text('\n'),
)
doc_transform_column
Out[10]:

Filter Columns¶

Keep the columns we are interested in and drop the others.

  • gene: A TCR is composed of two chains; this denotes which chain the record describes.
  • cdr3: The TCR sequence.
  • v.segm: The V region of the TCR sequence.
  • j.segm: The J region of the TCR sequence.
  • antigen.epitope: The epitope this TCR can bind to.
  • species: Species for this record.
  • vdjdb.score: Experimental confidence score (can be treated as data quality).

Example¶

gene cdr3 v.segm j.segm antigen.epitope species vdjdb.score
10 TRA CAVRDGGTGFQKLVF TRAV3*01 TRAJ8*01 HPVGEADYFEY HomoSapiens 1
11 TRB CASRQDRDYQETQYF TRBV5-1*01 TRBJ2-5*01 HPVGEADYFEY HomoSapiens 1
12 TRA CAARGIGSGTYKYIF TRAV13-1*01 TRAJ40*01 HPVGEADYFEY HomoSapiens 1
13 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
14 TRA CAARGIGSGTYKYIF TRAV13-1*01 TRAJ40*01 HPVGEADYFEY HomoSapiens 1
In [11]:
info_transform_trb = '''
Typical research focuses on TRB, which largely determines
the behaviour of the TCR.
Therefore, we only keep TRB records.
'''
# Filter TRB (.copy() avoids a SettingWithCopyWarning on the later score conversion)
df_trb = df_column_cleaned[df_column_cleaned.gene == 'TRB'].copy()

doc_transform_trb = doc.Document(
    doc.Title('Filter TRB', level=4),
    doc.Text(info_transform_trb),
    doc.Title('Example', level=5),
    doc.Table(df_trb[:5]),
)
doc_transform_trb
Out[11]:

Filter TRB¶

Typical research focuses on TRB, which largely determines the behaviour of the TCR. Therefore, we only keep TRB records.

Example¶
gene cdr3 v.segm j.segm antigen.epitope species vdjdb.score
1 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
3 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
5 TRB CASSAPTGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
7 TRB CASSARTGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
8 TRB CASSPRRYNEQFF TRBV9*01 TRBJ2-1*01 HPVGEADYFEY HomoSapiens 1
In [12]:
info_transform_nona = '''
Drop all records with N/A values in the v.segm or j.segm columns.
'''
df_trb = df_trb[df_trb['v.segm'].notna() & df_trb['j.segm'].notna()]

doc_transform_nona = doc.Document(
    doc.Title('Clean N/A Data', level=4),
    doc.Text(info_transform_nona),
    doc.Title('Example', level=5),
    doc.Table(df_trb[:5]),
)
doc_transform_nona
Out[12]:

Clean N/A Data¶

Drop all records with N/A values in the v.segm or j.segm columns.

Example¶
gene cdr3 v.segm j.segm antigen.epitope species vdjdb.score
1 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
3 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
5 TRB CASSAPTGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
7 TRB CASSARTGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
8 TRB CASSPRRYNEQFF TRBV9*01 TRBJ2-1*01 HPVGEADYFEY HomoSapiens 1
In [13]:
info_transform_quality = '''
vdjdb.score reflects the confidence of each record.
To make sure our analysis relies on high-quality data,
we only keep records with vdjdb.score >= 1.
'''

df_trb['vdjdb.score'] = df_trb['vdjdb.score'].astype(np.int32) # Convert to a numeric type
df_trb_quality = df_trb[df_trb['vdjdb.score'] >= 1]            # Keep records with score >= 1

doc_transform_quality = doc.Document(
    doc.Title('Filter High Quality Data', level=4),
    doc.Text(info_transform_quality),
    doc.Title('Example', level=5),
    doc.Table(df_trb_quality[:5]),
)
doc_transform_quality
Out[13]:

Filter High Quality Data¶

vdjdb.score reflects the confidence of each record. To make sure our analysis relies on high-quality data, we only keep records with vdjdb.score >= 1.

Example¶
gene cdr3 v.segm j.segm antigen.epitope species vdjdb.score
1 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
3 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
5 TRB CASSAPTGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
7 TRB CASSARTGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
8 TRB CASSPRRYNEQFF TRBV9*01 TRBJ2-1*01 HPVGEADYFEY HomoSapiens 1
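To see how much data the cutoff discards, a quick hypothetical check on the same dataframes (not part of the documented pipeline):

# Hypothetical check: distribution of confidence scores before filtering
print(df_trb['vdjdb.score'].value_counts().sort_index())
print(f'{len(df_trb_quality)} of {len(df_trb)} records kept at score >= 1')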
In [14]:
# Document gathering for Transform
doc_transform = doc.Document(
    doc.Title('Transform', level=3),
    doc_transform_column,
    doc_transform_trb,
    doc_transform_nona,
    doc_transform_quality,
)

Load¶

In [15]:
info_load = '''
We currently manage the data as CSV files,
so we store the preprocessed data back
to a CSV file.  
Whenever we need it, we can read it back
with `pandas`.
'''
# Store the data to a CSV file
df_trb_quality.to_csv('./vdjdb.trb.csv')
# Load it back
df = pd.read_csv('./vdjdb.trb.csv')

doc_load = doc.Document(
    doc.Title('Load', level=3),
    doc.Text(info_load),
    doc.Text('\n**Read file example**\n'),
    doc.Table(df[:5]),
)
doc_load
Out[15]:

Load¶

We currently manage the data as CSV files, so we store the preprocessed data back to a CSV file.
Whenever we need it, we can read it back with pandas.

Read file example

Unnamed: 0 gene cdr3 v.segm j.segm antigen.epitope species vdjdb.score
0 1 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
1 3 TRB CASSARSGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
2 5 TRB CASSAPTGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
3 7 TRB CASSARTGELFF TRBV9*01 TRBJ2-2*01 HPVGEADYFEY HomoSapiens 1
4 8 TRB CASSPRRYNEQFF TRBV9*01 TRBJ2-1*01 HPVGEADYFEY HomoSapiens 1
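One small wrinkle in the round trip above: saving with the default index and then reading without index_col produces the extra Unnamed: 0 column visible in the table. A minimal fix, using only pandas itself:

# Restore the saved index instead of letting it become an 'Unnamed: 0' column
df = pd.read_csv('./vdjdb.trb.csv', index_col=0)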
In [16]:
# Document gathering for ETL
lab_doc['ETL'] = doc.Document(
    doc.Title('(ETL) Extraction, Transform, and Load', level=2),
    doc_extract,
    doc_transform,
    doc_load
)

Analysis¶

Stat Species¶

In [17]:
info_stat_species = '''
Because TCR patterns may differ across species, we want to
know how many species are present and how the records are
distributed among them.
'''
species_stat = df['species'].value_counts() # Counts all species
fig = plt.figure()
species_stat.plot.pie(subplots=True)        # Plot pie chart

doc_analysis_species = doc.Document(
    doc.Title('Stat Species', level=3),
    doc.Text(info_stat_species),
    doc.Table(pd.DataFrame(species_stat)),
    doc.Text('\n**Visualization**\n'),
    doc.EmbeddedImage(fig)                  # Directly embed the image into the document as base64
)
doc_analysis_species
Out[17]:

Stat Species¶

Because TCR patterns may differ across species, we want to know how many species are present and how the records are distributed among them.

species count
HomoSapiens 8260
MusMusculus 1198
MacacaMulatta 659

Visualization
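Since HomoSapiens dominates the records, a natural follow-up (a hypothetical sketch, not part of this milestone) is to restrict downstream analysis to a single species:

# Hypothetical follow-up: keep only human records so that TCR patterns
# from different species are not mixed in later analyses
df_human = df[df['species'] == 'HomoSapiens']
print(f'{len(df_human)} human records kept')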

Analyze V and J Region Relationship¶

In [18]:
info_stat_v_j = '''
V and J regions largely determine the function of the TCR.
Therefore, it is interesting to analyze their joint
distribution.
'''
# Count records per (V, J) pair
v_j_counts = df.groupby(['v.segm', 'j.segm']).size()

# Split the MultiIndex into x/y coordinates for matplotlib
x = [item[0] for item in v_j_counts.index.values]
y = [item[1] for item in v_j_counts.index.values]
# Get the size of scatters
s = v_j_counts.values

# Normalize the counts to [0, 1]
s_norm = (s - min(s))/(max(s) - min(s))
# Bin the normalized counts to 0-10 and map the bins to colors:
# 8-10 red, 3-7 orange, 1-2 green, 0 grey
c = list(map(lambda x : {10:'red',   9:'red',    8:'red',
                         7:'orange', 6:'orange', 5:'orange',
                         4:'orange', 3:'orange', 2:'green',
                         1:'green'}.get(x, 'grey'),
             np.asarray(s_norm*10, dtype=np.int32)))

fig = plt.figure(figsize=(12,5))
ax  = fig.subplots()
ax.scatter(x, y, s, c=c, alpha=0.75)
ax.grid(linestyle=':')
ax.tick_params(axis='x', labelrotation=90)
ax.set_xlabel('V Region')
ax.set_ylabel('J Region')

doc_analysis_v_j = doc.Document(
    doc.Title('Analyze V/J Region Relationship', level=3),
    doc.Text(info_stat_v_j),
    doc.Text('\n**Visualization**\n'),
    doc.EmbeddedImage(fig)                  # Directly embed the image into the document as base64
)
doc_analysis_v_j
Out[18]:

Analyze V/J Region Relationship¶

V and J regions largely determine the function of the TCR. Therefore, it is interesting to analyze their joint distribution.

Visualization

In [19]:
# Integrate all analysis documents
lab_doc['analysis'] = doc.Document(
    doc.Title('Analysis', level=2),
    doc_analysis_species,
    doc_analysis_v_j
)
In [20]:
# Generate the GitHub page with experiment data and documents
generate_github_page(lab_doc, title=info_project_title, save_path='./page.md')