Notebook Tutorial

This page goes through key features and functions supported in the FLUTE tool in an elaborative and interactive manner using notebook.

[1]:

# Set up package and function imports
import sys,os
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), os.pardir, 'src')))
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import time
import run_FLUTE

# These three dependencies are for visualization only
import matplotlib.pyplot as plt
import networkx as nx
import altair as alt

[2]:

# Fill in username and password, as well as the configurations of flute database
db_name = "flute"
db_host = "localhost"
db_user = "root"
db_password = "12345678"

Using FLUTE to filter interactions

[3]:

input_file = 'input/example.xlsx'

[4]:

df = pd.read_excel(input_file)
df = df.fillna('').astype(str)
interaction_df = df[['Regulated Name', 'Regulated ID', 'Regulated Type', 'Regulator Name', 'Regulator ID', 'Regulator Type', 'Paper IDs']]

Let’s take a look at all the information we would use from input interactions

[5]:

interaction_df

[5]:

	Regulated Name	Regulated ID	Regulated Type	Regulator Name	Regulator ID	Regulator Type	Paper IDs
0	CD4	P01730	Protein	Anti-CD4	UAZ616E74692D434434	Other	PMC7749301
1	CD4	P01730	Protein	anti-CD8 mAbs	UAZ616E74692D434438206D416273	Other	PMC7749301
2	CD8	CD8	Other	rapamycin	5284616	Chemical	PMC7749301
3	acidification	GO:0045851	Biological Process	CD8	CD8	Other	PMC7749301
4	acidification	GO:0045851	Biological Process	transduction	GO:0009293	Biological Process	PMC7749301
...	...	...	...	...	...	...	...
14383	T-cell activation	GO:0042110	Biological Process	CD8	CD8	Other	PMC7214244
14384	calcium	5460341	Chemical	cyclophilin	Cyclophilin	Other	PMC7214244
14385	cytokine production	GO:0001816	Biological Process	CD8	CD8	Other	PMC7214244
14386	transduction	GO:0009293	Biological Process	CAR	P36575	Protein	PMC7214244
14387	FOXP3	Q9BZS1	Other	TGFB	D016212	bioprocess	PMC2275380

14388 rows × 7 columns

Here are regulated and regulator element type breakdown and their top 5 categories

[6]:

topk_series = interaction_df["Regulated Type"].value_counts().iloc[:5]
topk_df = pd.DataFrame({"Regulated Type":topk_series.index, 'count':topk_series.values})

upper = alt.Chart(topk_df).mark_bar(color="#bae6fd").encode(
    x = alt.X("count").scale(domain=[0, 7500]),
    y = alt.Y("Regulated Type", sort="-x")
).properties(
    width=500,
    height=200,
    title= "Regulated Element Type Top 5 Categories"
)

topk_series = interaction_df["Regulator Type"].value_counts().iloc[:5]
topk_df = pd.DataFrame({"Regulator Type":topk_series.index, 'count':topk_series.values})

lower = alt.Chart(topk_df).mark_bar(color="#fdd1ba").encode(
    x = alt.X("count").scale(domain=[0, 7500]),
    y = alt.Y("Regulator Type", sort="-x")
).properties(
    width=500,
    height=200,
    title= "Regulator Element Type Top 5 Categories"
)

alt.vconcat(upper, lower)

[6]:

[7]:

# Make a utility dataframe that include all species
id_name1 = df[['Regulated ID', 'Regulated Name']].rename(
            columns={'Regulated ID': 'ID', 'Regulated Name': 'Name'})
id_name2 = df[['Regulator ID', 'Regulator Name']].rename(
            columns={'Regulator ID': 'ID', 'Regulator Name': 'Name'})
id_name_df = pd.concat([id_name1, id_name2], ignore_index=True)
id_name_df['ID'] = id_name_df['ID'].astype(str).str.slice(0, 50)
id_name_df['Name'] = id_name_df['Name'].astype(str).str.slice(0, 50)
id_name_df = id_name_df.drop_duplicates(subset='ID')

[8]:

# Ground the utility dataframe to find out stringID information for each species
query = run_FLUTE.Query(db_user, db_password, db_host, db_name)
id_name_df = query.ground_string_id(id_name_df)

This DataFrame contains all the studies species in the input file, listed by their IDs, names, and stringIDs

[9]:

id_name_df

[9]:

	ID	Name	stringID
0	P01730	CD4	9606.ENSP00000011653
2	CD8	CD8	NaN
3	GO:0045851	acidification	NaN
5	GO:0016049	cell growth	NaN
6	GO:0009293	transduction	NaN
...	...	...	...
28755	UAZ42434D41784344332062734162	BCMAxCD3 bsAb	NaN
28760	UAZ4D4E442DCE94572064756F434152	MND-ΔW duoCAR	NaN
28766	UAZ64756F434152.t	duoCAR	NaN
28772	Cyclophilin	cyclophilin	NaN
28775	D016212	TGFB	NaN

4537 rows × 3 columns

The percentage of species with a valid stringID information is:

[10]:

len(id_name_df[id_name_df["stringID"].notna()]) / len(id_name_df)

[10]:

0.21996914260524575

We are interested in interactions that has protein participated either in regulated or regulator element. These interactions are distributed:

[11]:

both_proteins = (
    (interaction_df['Regulated Type'] == 'Protein') &
    (interaction_df['Regulator Type'] == 'Protein')
)
regulated_protein_only = (
    (interaction_df['Regulated Type'] == 'Protein') &
    (interaction_df['Regulator Type'] != 'Protein')
)
regulator_protein_only = (
    (interaction_df['Regulated Type'] != 'Protein') &
    (interaction_df['Regulator Type'] == 'Protein')
)
neither_protein = (
    (interaction_df['Regulated Type'] != 'Protein') &
    (interaction_df['Regulator Type'] != 'Protein')
)

results = {
    'Both proteins': both_proteins.sum(),
    'Only Regulated is protein': regulated_protein_only.sum(),
    'Only Regulator is protein': regulator_protein_only.sum(),
    'Neither is protein': neither_protein.sum()
}

# Create a dataframe from the results
results_df = pd.DataFrame(list(results.items()), columns=['Category', 'Count'])
alt.Chart(results_df).mark_arc().encode(
    theta="Count",
    color="Category"
)

[11]:

[12]:

# Return interactions that involves a protein:ppis/pcis/pbpis
pt_only_ints = run_FLUTE.filter_protein_ints(interaction_df)
pt_only_ints

[12]:

	Regulated Name	Regulated ID	Regulator Name	Regulator ID
0	cd4	P01730	anti-cd4	UAZ616E74692D434434
1	cd4	P01730	anti-cd8 mabs	UAZ616E74692D434438206D416273
2	cd4	P01730	cell growth	GO:0016049
3	ifn	Interferon	type	Q13326
4	cell viability	D002470	type	Q13326
...	...	...	...	...
9930	env	P03386	gfp	IPR011584
9931	aes	Q08117	amg	Q99217
9932	bcma	Q02223	april	O75888
9933	dna modification	GO:0006304	car	P36575
9934	transduction	GO:0009293	car	P36575

9935 rows × 4 columns

[13]:

# Fill out CIDm information
pt_only_ints = run_FLUTE.get_chem_id(pt_only_ints)
pt_only_ints

[13]:

	Regulated Name	Regulated ID	Regulator Name	Regulator ID	Regulated CIDm	Regulator CIDm
0	cd4	P01730	anti-cd4	UAZ616E74692D434434	NaN	NaN
1	cd4	P01730	anti-cd8 mabs	UAZ616E74692D434438206D416273	NaN	NaN
2	cd4	P01730	cell growth	GO:0016049	NaN	NaN
3	ifn	Interferon	type	Q13326	NaN	NaN
4	cell viability	D002470	type	Q13326	NaN	NaN
...	...	...	...	...	...	...
9930	env	P03386	gfp	IPR011584	NaN	NaN
9931	aes	Q08117	amg	Q99217	NaN	NaN
9932	bcma	Q02223	april	O75888	NaN	NaN
9933	dna modification	GO:0006304	car	P36575	NaN	NaN
9934	transduction	GO:0009293	car	P36575	NaN	NaN

9935 rows × 6 columns

[14]:

# Fill out GoID information
pt_only_ints = run_FLUTE.get_go_id(pt_only_ints)
pt_only_ints

[14]:

	Regulated Name	Regulated ID	Regulator Name	Regulator ID	Regulated CIDm	Regulator CIDm	Regulated GoID	Regulator GoID
0	cd4	P01730	anti-cd4	UAZ616E74692D434434	NaN	NaN	NaN	NaN
1	cd4	P01730	anti-cd8 mabs	UAZ616E74692D434438206D416273	NaN	NaN	NaN	NaN
2	cd4	P01730	cell growth	GO:0016049	NaN	NaN	NaN	GO:0016049
3	ifn	Interferon	type	Q13326	NaN	NaN	NaN	NaN
4	cell viability	D002470	type	Q13326	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
9930	env	P03386	gfp	IPR011584	NaN	NaN	NaN	NaN
9931	aes	Q08117	amg	Q99217	NaN	NaN	NaN	NaN
9932	bcma	Q02223	april	O75888	NaN	NaN	NaN	NaN
9933	dna modification	GO:0006304	car	P36575	NaN	NaN	GO:0006304	NaN
9934	transduction	GO:0009293	car	P36575	NaN	NaN	GO:0009293	NaN

9935 rows × 8 columns

[15]:

# Fill out stringID information
pt_only_ints = run_FLUTE.get_string_id(pt_only_ints, id_name_df)
pt_only_ints

[15]:

	Regulated Name	Regulated ID	Regulator Name	Regulator ID	Regulated CIDm	Regulator CIDm	Regulated GoID	Regulator GoID	Regulated stringID	Regulator stringID
0	cd4	P01730	anti-cd4	UAZ616E74692D434434	NaN	NaN	NaN	NaN	9606.ENSP00000011653	NaN
1	cd4	P01730	anti-cd8 mabs	UAZ616E74692D434438206D416273	NaN	NaN	NaN	NaN	9606.ENSP00000011653	NaN
2	cd4	P01730	cell growth	GO:0016049	NaN	NaN	NaN	GO:0016049	9606.ENSP00000011653	NaN
3	ifn	Interferon	type	Q13326	NaN	NaN	NaN	NaN	NaN	9606.ENSP00000218867
4	cell viability	D002470	type	Q13326	NaN	NaN	NaN	NaN	NaN	9606.ENSP00000218867
...	...	...	...	...	...	...	...	...	...	...
9930	env	P03386	gfp	IPR011584	NaN	NaN	NaN	NaN	NaN	NaN
9931	aes	Q08117	amg	Q99217	NaN	NaN	NaN	NaN	9606.ENSP00000221561	9606.ENSP00000370088
9932	bcma	Q02223	april	O75888	NaN	NaN	NaN	NaN	9606.ENSP00000053243	9606.ENSP00000343505
9933	dna modification	GO:0006304	car	P36575	NaN	NaN	GO:0006304	NaN	NaN	9606.ENSP00000311538
9934	transduction	GO:0009293	car	P36575	NaN	NaN	GO:0009293	NaN	NaN	9606.ENSP00000311538

9935 rows × 10 columns

[16]:

# Fill out UID information
pt_only_ints = run_FLUTE.get_uid(pt_only_ints)
pt_only_ints

[16]:

	Regulated Name	Regulated ID	Regulator Name	Regulator ID	Regulated CIDm	Regulator CIDm	Regulated GoID	Regulator GoID	Regulated stringID	Regulator stringID	Regulated UID	Regulator UID
0	cd4	P01730	anti-cd4	UAZ616E74692D434434	NaN	NaN	NaN	NaN	9606.ENSP00000011653	NaN	P01730	NaN
1	cd4	P01730	anti-cd8 mabs	UAZ616E74692D434438206D416273	NaN	NaN	NaN	NaN	9606.ENSP00000011653	NaN	P01730	NaN
2	cd4	P01730	cell growth	GO:0016049	NaN	NaN	NaN	GO:0016049	9606.ENSP00000011653	NaN	P01730	NaN
3	ifn	Interferon	type	Q13326	NaN	NaN	NaN	NaN	NaN	9606.ENSP00000218867	NaN	Q13326
4	cell viability	D002470	type	Q13326	NaN	NaN	NaN	NaN	NaN	9606.ENSP00000218867	NaN	Q13326
...	...	...	...	...	...	...	...	...	...	...	...	...
9930	env	P03386	gfp	IPR011584	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9931	aes	Q08117	amg	Q99217	NaN	NaN	NaN	NaN	9606.ENSP00000221561	9606.ENSP00000370088	Q08117	Q99217
9932	bcma	Q02223	april	O75888	NaN	NaN	NaN	NaN	9606.ENSP00000053243	9606.ENSP00000343505	Q02223	O75888
9933	dna modification	GO:0006304	car	P36575	NaN	NaN	GO:0006304	NaN	NaN	9606.ENSP00000311538	NaN	P36575
9934	transduction	GO:0009293	car	P36575	NaN	NaN	GO:0009293	NaN	NaN	9606.ENSP00000311538	NaN	P36575

9935 rows × 12 columns

Now input interactions have been narrowed down to contain protein-involved interactions, and each interaction has been populated with information including name/ID/UID/CIDm/GoID/stringID for both regulated and regulator element.

But not all fields are filled out:

[17]:

# Compute the percentage of null values for each column
null_percentages = pt_only_ints.isnull().mean() * 100
null_percentages_df = null_percentages.reset_index()
null_percentages_df.columns = ['column', 'percentage']

bars = alt.Chart(null_percentages_df).mark_bar(color='#d3effe').encode(
    x=alt.X('column:N', title=''),
    y=alt.Y('percentage:Q', title='Percentage of Null Values'),
    tooltip=['column', 'percentage']
)

# Text labels for the top of the bars
text = bars.mark_text(
    align='center',
    baseline='bottom',
    dy=-5  # Adjust the position of the text
).encode(
    text=alt.Text('percentage:Q', format='.1f')  # Format the text to one decimal place
)

layered_chart = (bars + text).properties(
    width=500,
    height=300,
    title='Percentage of Null Values in pt_only_ints Columns'
).configure_axis(
    labelAngle=315  # Rotate x-axis labels by 45 degrees
)

layered_chart.show()

[18]:

score_tuple = (0, 0, 0)
pt_scored_ints = query.filter_pt_ints_by_scoring(pt_only_ints, score_tuple)

Using FLUTE tool, you can further filter interactions using the customized score tuple. Here are the scores of these interactions

[19]:

pt_scored_ints

[19]:

	Element 1 ID	Element 2 ID	STRING escore	STRING tscore	STRING dscore
0	216239	P08069	180	885	0
1	271	P22001	342	0	0
2	271	P28907	800	163	0
3	784	P04040	690	999	900
4	947	P35548	0	180	0
...	...	...	...	...	...
976	Q9Y5U5	P43489	0	807	0
977	Q9Y5U5	Q07011	0	745	0
978	Q9Y6Q6	P10747	0	152	0
979	UAZ4B693637	Q9NZQ7	0	391	0
980	VAV	P10747	379	786	900

981 rows × 5 columns

[20]:

score_visual = pt_scored_ints[pt_scored_ints['STRING escore'].str.isnumeric() & pt_scored_ints['STRING tscore'].str.isnumeric() & pt_scored_ints['STRING dscore'].str.isnumeric()]
score_visual[['STRING escore','STRING tscore','STRING dscore']] = score_visual[['STRING escore','STRING tscore','STRING dscore']].apply(pd.to_numeric, errors='coerce', axis=1)

score_visual

[20]:

	Element 1 ID	Element 2 ID	STRING escore	STRING tscore	STRING dscore
0	216239	P08069	180	885	0
1	271	P22001	342	0	0
2	271	P28907	800	163	0
3	784	P04040	690	999	900
4	947	P35548	0	180	0
...	...	...	...	...	...
976	Q9Y5U5	P43489	0	807	0
977	Q9Y5U5	Q07011	0	745	0
978	Q9Y6Q6	P10747	0	152	0
979	UAZ4B693637	Q9NZQ7	0	391	0
980	VAV	P10747	379	786	900

763 rows × 5 columns

[21]:

binned_series = score_visual["STRING tscore"].value_counts(bins=20, sort=False)

chart_data_binned = pd.DataFrame({
    "leftbin": binned_series.index.left,
    "rightbin": binned_series.index.right,
    "count": binned_series.values
})

chart_data_binned.loc[0, "leftbin"] = score_visual["STRING tscore"].min()

quant_chart = alt.Chart(chart_data_binned).mark_bar(color="#fca5a5").encode(
    x = alt.X("leftbin", bin="binned", title="STRING tscore (binned)"),
    x2 = "rightbin",
    y = alt.Y("count")
).properties(
    width=500,
    height=200,
    title= 'STRING tscore distribution of filtered & scored interactions'
)

quant_chart

[21]:

Feel free to query the score specifying Element 1, 2 IDs and ground them back to common names

[22]:

pt_scored_ints[(pt_scored_ints['Element 1 ID']=='P01730')&(pt_scored_ints['Element 2 ID']=='P10747')]

[22]:

	Element 1 ID	Element 2 ID	STRING escore	STRING tscore	STRING dscore
212	P01730	P10747	336	590	900

[23]:

id_name_df[(id_name_df['ID']=='P01730') | (id_name_df['ID']=='P10747') ]

[23]:

	ID	Name	stringID
0	P01730	CD4	9606.ENSP00000011653
113	P10747	CD28	9606.ENSP00000324890

We finally obtain the filtration result as the output_df

[24]:

# Map the scored interactions with original input interactions
# And merge the DataFrames based on the sets and drop the helper columns
pt_scored_ints['set_12'] = pt_scored_ints.apply(lambda row: frozenset([row['Element 1 ID'].lower(), row['Element 2 ID'].lower()]), axis=1)
df['set_dr'] = df.apply(lambda row: frozenset([row['Regulated ID'].lower(), row['Regulator ID'].lower()]), axis=1)
output_df = df.merge(pt_scored_ints, left_on='set_dr', right_on='set_12')
output_df = output_df.drop(columns=pt_scored_ints.columns.tolist() + ['set_dr', ])
output_df = output_df.drop_duplicates()
output_df

[24]:

	Regulator Name	Regulator Type	Regulator Subtype	Regulator HGNC Symbol	Regulator Database	Regulator ID	Regulator Compartment	Regulator Compartment ID	Regulated Name	Regulated Type	...	Mechanism	Site	Cell Line	Cell Type	Tissue Type	Organism	Score	Source	Statements	Paper IDs
0	4-1BB	Protein				Q07011			GITR	Protein	...	NONE			cl:CL:0000084					The Cytotoxic cluster was more like Glycolytic...	PMC9939256
1	IL2RA	Protein				P01589			GITR	Protein	...	NONE			cl:CL:0000084					The Cytotoxic cluster was more like Glycolytic...	PMC9939256
2	OX40	Protein				P43489			GITR	Protein	...	NONE			cl:CL:0000084					The Cytotoxic cluster was more like Glycolytic...	PMC9939256
3	CD28	Protein				P10747			IL-2	Protein	...	NONE			cl:CL:0000084 ++++ mesh:D018414					In contrast , BAFF-R , CD28 , and TACI showed ...	PMC9939256
5	TACI	Protein				O14836			IL-2	Protein	...	NONE			cl:CL:0000084 ++++ mesh:D018414					In contrast , BAFF-R , CD28 , and TACI showed ...	PMC9939256
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1581	PD-L1	Protein				Q9NZQ7			TIM-3	Protein	...	NONE			mesh:D018414 ++++ cl:CL:0000084 ++++ cl:CL:000...					Furthermore , scFv PD-L1 antibody decreased th...	PMC9961031
1582	PD-L1	Protein				Q9NZQ7			immune response	Biological Process	...	NONE			cl:CL:0000084					The PD-1 / PD-L1 pathway downregulates anti-tu...	PMC9961031
1584	CAR	Protein				P36575			signaling pathway	Biological Process	...	NONE			cl:CL:0000235 ++++ cl:CL:0000084 ++++ cl:CL:00...					A large body of research has revealed that the...	PMC9961031
1585	PD-L1	Protein				Q9NZQ7			IL-2	Protein	...	Secretion			cl:CL:0000084 ++++ mesh:D018414 ++++ cl:CL:000...					After prolonged cancer cell antigen exposure ,...	PMC9961031
1586	APRIL	Protein				O75888			BCMA	Protein	...	NONE			cl:CL:0000092 ++++ cl:CL:0000784 ++++ cl:CL:00...		uberon:UBERON:0002371			Similar results are observed after APRIL induc...	PMC7214244

1120 rows × 28 columns

The percentage of filtered interactions compared to original input:

[25]:

len(output_df) / len(df)

[25]:

0.0778426466499861

You may also visualize some key differences between before and after filtration, e.g., percentage of positive and negative interactions

[26]:

topk_series = df["Sign"].value_counts().iloc[:2]
total_count = topk_series.sum()
topk_df = pd.DataFrame({"Sign": topk_series.index, 'percentage': (topk_series / total_count) * 100})

# Create the bar chart
upper = alt.Chart(topk_df).mark_bar(color="#bae6fd").encode(
    x=alt.X("percentage:Q", title="Positive/Negative Percentage Before Filtration"),
    y=alt.Y("Sign:N", sort="-x", title="Sign")
).properties(
    width=500,
    height=100
)

topk_series = output_df["Sign"].value_counts().iloc[:2]
total_count = topk_series.sum()
topk_df = pd.DataFrame({"Sign": topk_series.index, 'percentage': (topk_series / total_count) * 100})

# Create the bar chart
lower = alt.Chart(topk_df).mark_bar(color="#fdd1ba").encode(
    x=alt.X("percentage:Q", title="Positive/Negative Percentage After Filtration"),
    y=alt.Y("Sign:N", sort="-x", title="Sign")
).properties(
    width=500,
    height=100
)

alt.vconcat(upper, lower)

[26]:

These above functions in this section can be summarized into one query.filtered_input_ints(), with input_file, score_tuple as input parameters and output_path (just the path root) as output parameter. Two files are generated, with the time duration to finish this function - list of reading interactions that pass filtration - the filtration scores for those filtered interactions

[27]:

output_path = 'output/example'
score_tuple = (0,0,0)
query.filtered_input_ints(input_file, score_tuple, output_path)

File filtered: input/example.xlsx Time: 42.371906042099 seconds

Using FLUTE to analyze a paper set

Unzip the large OA file and put it to input/ directory. Load it to year_df for usage. It contains information of published year, PMCID, PMID of million research papers

[28]:

!unzip ../supplementary/oa_file_list.txt.zip -d "input/"
year_df = run_FLUTE.extract_year("input/oa_file_list.txt")
!rm "input/oa_file_list.txt"
year_df

Archive:  ../supplementary/oa_file_list.txt.zip
  inflating: input/oa_file_list.txt

[28]:

	Year	PMCID	PMID
0	2001	PMC13900	PMID:11250746
1	2001	PMC13901	PMID:11250747
2	2001	PMC13902	PMID:11250748
3	2000	PMC13911	PMID:11056684
4	2000	PMC13912	PMID:11400682
...	...	...	...
2811867	2012	PMC7108457	PMID:21978613
2811868	2007	PMC7108459	PMID:17977063
2811870	2020	PMC7108696	PMID:32160537
2811871	2019	PMC7108792	PMID:31640839
2811872	2015	PMC7108955	PMID:26319972

2627054 rows × 3 columns

[29]:

topk_series = year_df["Year"].value_counts().iloc[:10]
topk_df = pd.DataFrame({"Year":topk_series.index, 'count':topk_series.values})

cat_chart = alt.Chart(topk_df).mark_bar(color="#bae6fd").encode(
    x = alt.X("count"),
    y = alt.Y("Year", sort="-x")
).properties(
    width=500,
    height=200,
    title = 'Published Year Distribution of Research Papers in Corpus'
)

cat_chart

[29]:

FLUTE offers the function of showing interactions within the same paper set as the input interaction file. Its inclusion of column “Paper IDs” allows such

[30]:

query = run_FLUTE.Query(db_user, db_password, db_host, db_name)
ints_same_pp = query.get_same_papers_ints(input_file, year_df)
ints_same_pp

[30]:

array([['9541.XP_005555920.1', '9541.XP_005593605.1', 'activation',
        'PMID018283119'],
       ['9541.XP_005593605.1', '9541.XP_005555920.1', 'activation',
        'PMID018283119'],
       ['10090.ENSMUSP00000029275', '10090.ENSMUSP00000111405',
        'activation', 'PMID018283119'],
       ['10090.ENSMUSP00000111405', '10090.ENSMUSP00000029275',
        'activation', 'PMID018283119']], dtype='<U50')

Our original interaction input was extracted very recently (or from papers without year information), thus there are just a handful of interactions that overlap with them

But we can still view these interactions in the same paper set in a network graph, thanks to networkx package

[31]:

G = nx.from_edgelist(ints_same_pp[:,:2])
nx.draw(G,pos=nx.circular_layout(G),with_labels=True)
plt.show()

Using FLUTE to query an individual protein

FLUTE also supports the extraction of list of papers that research on certain protein

[32]:

query_pro = "P00533,P03386"

[33]:

# get related papers
query = run_FLUTE.Query(db_user, db_password, db_host, db_name)
fq_list = query.get_related_papers(year_df, query_pro)

np.array(fq_list)

[33]:

array(['PMC1240052', 'PMC1540706', 'PMC1681463', 'PMC1702556',
       'PMC2132490', 'PMC2156182', 'PMC2225448', 'PMC2360392',
       'PMC2575782', 'PMC2742444', 'PMC2756567', 'PMC2824488',
       'PMC3088706', 'PMC3203921', 'PMC3234252', 'PMC3315809',
       'PMC3326441', 'PMC3398014', 'PMC3527276', 'PMC3569983',
       'PMC3730230', 'PMC3767783', 'PMC3952845', 'PMC3965010',
       'PMC4039121', 'PMC4039310', 'PMC4152746', 'PMC4213030',
       'PMC4226707', 'PMC4302072', 'PMC4390223', 'PMC4440518',
       'PMC4615268', 'PMC4724821', 'PMC4741629', 'PMC4791069',
       'PMC4826614', 'PMC5036527', 'PMC5117851', 'PMC5303889',
       'PMC5342422', 'PMC5432325', 'PMC5449203', 'PMC5461031',
       'PMC5531611', 'PMC5641085', 'PMC5648601', 'PMC5715073',
       'PMC5817799', 'PMC5935102', 'PMC5977112', 'PMC5992104',
       'PMC6053247', 'PMC6365915'], dtype='<U10')

Use the function discussed above, interactions inside these papers can be extracted, these interactions co-occur with the inquired protein in the same literature and are of great research interest

[34]:

pd.DataFrame(fq_list, columns=['Paper IDs']).to_excel(output_path + '_query_' + query_pro + '.xlsx', index=False)
related_pp = query.get_same_papers_ints(output_path + '_query_' + query_pro + '.xlsx', year_df)

print("Number of interactions inside related_pp: ", len(related_pp))
related_pp[:15]

Number of interactions inside related_pp:  78

[34]:

array([['7227.FBpp0084623', '7227.FBpp0084626', 'binding',
        'PMID028515276'],
       ['7227.FBpp0084626', '7227.FBpp0084623', 'binding',
        'PMID028515276'],
       ['7227.FBpp0084626', '7227.FBpp0305095', 'binding',
        'PMID028515276'],
       ['7227.FBpp0305095', '7227.FBpp0084626', 'binding',
        'PMID028515276'],
       ['8364.ENSXETP00000021098', '8364.ENSXETP00000061006', 'binding',
        'PMID026344197'],
       ['8364.ENSXETP00000061006', '8364.ENSXETP00000021098', 'binding',
        'PMID026344197'],
       ['9606.ENSP00000175756', '9606.ENSP00000269571', 'ptmod',
        'PMID025081058'],
       ['9606.ENSP00000219070', '9606.ENSP00000301178', 'expression',
        'PMID027775700'],
       ['9606.ENSP00000261739', '9606.ENSP00000275493', 'binding',
        'PMID022298428'],
       ['9606.ENSP00000269571', '9606.ENSP00000175756', 'ptmod',
        'PMID025081058'],
       ['9606.ENSP00000275493', '9606.ENSP00000261739', 'binding',
        'PMID022298428'],
       ['9606.ENSP00000275493', '9606.ENSP00000301178', 'ptmod',
        'PMID027775700'],
       ['9606.ENSP00000275493', '9606.ENSP00000355396', 'binding',
        'PMID025353163'],
       ['9606.ENSP00000275493', '9606.ENSP00000368667', 'binding',
        'PMID028152297'],
       ['9606.ENSP00000275493', '9606.ENSP00000460823', 'binding',
        'PMID019798056']], dtype='<U50')

[35]:

# View these interactions in a networkx graph
G = nx.from_edgelist(related_pp[:,:2])
nx.draw(G,pos=nx.random_layout(G),with_labels=False)
plt.show()

Using FLUTE to find recent interactions

[36]:

# Map the input interaction file with its paper published year
df = pd.read_excel(input_file)
df = df.merge(year_df, left_on='Paper IDs', right_on='PMCID')
df['Year']=df['Year'].astype('int64')
df

[36]:

	Regulator Name	Regulator Type	Regulator Subtype	Regulator HGNC Symbol	Regulator Database	Regulator ID	Regulator Compartment	Regulator Compartment ID	Regulated Name	Regulated Type	...	Cell Type	Tissue Type	Organism	Score	Source	Statements	Paper IDs	Year	PMCID	PMID
0	CAR	Protein	NaN	NaN	NaN	P36575	NaN	NaN	tumor	Biological Process	...	cl:CL:0000084	NaN	uberon:UBERON:0002015	NaN	NaN	Anti-CAIX CAR T cells secreting anti-PD-L1 ant...	PMC5085160	2016	PMC5085160	PMID:27145284
1	CAR	Protein	NaN	NaN	NaN	P36575	NaN	NaN	ADCC	Biological Process	...	cl:CL:0000084 ++++ cl:CL:0001063	NaN	NaN	NaN	NaN	Moreover , anti-CAIX CAR T cells secreting the...	PMC5085160	2016	PMC5085160	PMID:27145284
2	IgG1 isoform	Other	NaN	NaN	NaN	UAZ496747312069736F666F726D	NaN	NaN	ADCC	Biological Process	...	cl:CL:0000084 ++++ cl:CL:0000623	NaN	NaN	NaN	NaN	For the anti-CAIX CAR T cells secreting anti-P...	PMC5085160	2016	PMC5085160	PMID:27145284
3	CAR	Protein	NaN	NaN	NaN	P36575	NaN	NaN	CAIX	Protein	...	cl:CL:0000084 ++++ cl:CL:0000623	NaN	NaN	NaN	NaN	The anti-CAIX CAR T cells only produced IL-2 a...	PMC5085160	2016	PMC5085160	PMID:27145284
4	CAIX	Protein	NaN	NaN	NaN	Q16790	NaN	NaN	CAR	Protein	...	cl:CL:0000084 ++++ cl:CL:0000623	NaN	NaN	NaN	NaN	The anti-CAIX CAR T cells only produced IL-2 a...	PMC5085160	2016	PMC5085160	PMID:27145284
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3112	CD4	Protein	NaN	NaN	NaN	P01730	NaN	NaN	IL7	Protein	...	cl:CL:0000084	NaN	NaN	NaN	NaN	The percentage of CD4 CARGD2.28.OX40 ζ T cells...	PMC5980417	2018	PMC5980417	PMID:29872565
3113	ζ	Other	NaN	NaN	NaN	UAZCEB6	NaN	NaN	IL7	Protein	...	cl:CL:0000084	NaN	NaN	NaN	NaN	The percentage of CD4 CARGD2.28.OX40 ζ T cells...	PMC5980417	2018	PMC5980417	PMID:29872565
3114	CAR.GD2	Other	NaN	NaN	NaN	UAZ4341522E474432	NaN	NaN	proliferation	Biological Process	...	cl:CL:0000084 ++++ cl:CL:0000034 ++++ cl:CL:00...	NaN	NaN	NaN	NaN	These cytokines are known to enhance survival ...	PMC5980417	2018	PMC5980417	PMID:29872565
3115	ζ	Other	NaN	NaN	NaN	UAZCEB6	NaN	NaN	IL2	Protein	...	cl:CL:0000084 ++++ cl:CL:0000034 ++++ cl:CL:00...	NaN	NaN	NaN	NaN	After exposure to GD2 tumor cells , a higher q...	PMC5980417	2018	PMC5980417	PMID:29872565
3116	TGFB	bioprocess	NaN	NaN	NaN	D016212	NaN	NaN	FOXP3	Other	...	NaN	NaN	NaN	0.86	INDRA	AKT * impairs de novo Foxp3 induction by TGF-b...	PMC2275380	2008	PMC2275380	PMID:18283119

3117 rows × 31 columns

[37]:

binned_series = df["Year"].value_counts(bins=12, sort=False)

chart_data_binned = pd.DataFrame({
    "leftbin": binned_series.index.left,
    "rightbin": binned_series.index.right,
    "count": binned_series.values
})

chart_data_binned.loc[0, "leftbin"] = df["Year"].min()

quant_chart = alt.Chart(chart_data_binned).mark_bar(color="#fca5a5").encode(
    x = alt.X("leftbin", bin="binned", title="Year (binned)"),
    x2 = "rightbin",
    y = alt.Y("count")
).properties(
    width=500,
    height=200,
    title = 'Published Year Distribution of All Input Interactions (Unknown Removed)'

)

quant_chart

[37]:

[38]:

# Define a number of years
x = 5  # recent 5 years
df = df[df['Year'] >= time.localtime().tm_year - x].drop(columns=year_df.columns)

This DataFrame only contains recent (less than 5 years) interactions published at a known year. The number shrinks a lot compared to original input interaction file

This is useful if users want to exempt these interactions from filtering

[39]:

df

[39]:

	Regulator Name	Regulator Type	Regulator Subtype	Regulator HGNC Symbol	Regulator Database	Regulator ID	Regulator Compartment	Regulator Compartment ID	Regulated Name	Regulated Type	...	Mechanism	Site	Cell Line	Cell Type	Tissue Type	Organism	Score	Source	Statements	Paper IDs
13	UniCAR	Other	NaN	NaN	NaN	UAZ556E69434152	NaN	NaN	4-1BB	Protein	...	NONE	NaN	NaN	cl:CL:0000815 ++++ cl:CL:0001063	NaN	NaN	NaN	NaN	As shown in Figure 4 ( a ) ( middle panel ) , ...	PMC6685520
14	TCR-	Other	NaN	NaN	NaN	TCR	NaN	NaN	proliferation	Biological Process	...	NONE	NaN	NaN	cl:CL:0000815	NaN	NaN	NaN	NaN	To further support the aforementioned findings...	PMC6685520
15	luciferase	Protein	NaN	NaN	NaN	Q01158	NaN	NaN	tumor	Biological Process	...	NONE	NaN	NaN	cl:CL:0000815 ++++ cl:CL:0001063	NaN	NaN	NaN	NaN	In mice transplanted with UniCAR endowed Tconv...	PMC6685520
16	UniCAR	Other	NaN	NaN	NaN	UAZ556E69434152	Other	sl-0487	CD4 CD25 CD127	Other	...	NONE	NaN	NaN	cl:CL:0000084 ++++ cl:CL:0000815	NaN	NaN	NaN	NaN	To investigate responsiveness of UniCAR armed ...	PMC6685520
17	CD3	Other	NaN	NaN	NaN	CD3	NaN	NaN	T cell activation	Biological Process	...	NONE	NaN	NaN	cl:CL:0000084 ++++ cl:CL:0000236	NaN	NaN	NaN	NaN	Originally , first generation CARs were design...	PMC6685520
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3085	mAb	Other	NaN	NaN	NaN	UAZ6D4162	Other	sl-0431	SHP-2	Protein	...	Phosphorylation	NaN	NaN	cl:CL:0000235 ++++ cl:CL:0000236 ++++ cl:CL:00...	NaN	NaN	NaN	NaN	Functionally , mAb targeting PD-L1 was able to...	PMC6558778
3086	PD-1	Protein	NaN	NaN	NaN	P18621.t	NaN	NaN	PD-L1	Protein	...	Transcription	NaN	cellosaurus:CVCL_1905	mesh:D015496 ++++ cl:CL:0000623 ++++ mesh:D018...	NaN	NaN	NaN	NaN	Furthermore , lenalidomide , an immunomodulato...	PMC6558778
3087	PD-L1	Protein	NaN	NaN	NaN	Q9NZQ7	Other	sl-0431	SHP-2	Protein	...	Phosphorylation	NaN	NaN	cl:CL:0000235 ++++ cl:CL:0000236 ++++ cl:CL:00...	NaN	NaN	NaN	NaN	Functionally , mAb targeting PD-L1 was able to...	PMC6558778
3088	mAb	Other	NaN	NaN	NaN	UAZ6D4162	Other	sl-0431	SHP-2	Protein	...	Phosphorylation	NaN	NaN	cl:CL:0000235 ++++ cl:CL:0000236 ++++ cl:CL:00...	NaN	NaN	NaN	NaN	Functionally , mAb targeting PD-L1 was able to...	PMC6558778
3089	CD4	Protein	NaN	NaN	NaN	P01730	NaN	NaN	transmembrane protein	Protein	...	Transcription	NaN	NaN	cl:CL:0000911 ++++ cl:CL:0000623 ++++ cl:CL:00...	NaN	NaN	NaN	NaN	Lymphocyte activation gene-3 ( LAG-3 ) is a tr...	PMC6558778

1367 rows × 28 columns

Using FLUTE to find duplicate interactions

[40]:

df = pd.read_excel(input_file)

columns_to_check = ['Regulated ID', 'Regulator ID', 'Paper IDs']
duplicates = df[df.duplicated(subset=columns_to_check, keep=False)]
duplicate_counts = duplicates.groupby(columns_to_check).size().reset_index(name='Occurrence')

This DataFrame contains the duplicated interactions (based on Regulated ID, Regulator ID, Paper IDs) of the original input file and their occurrences.

This is helpful when user wants to only keep unique interactions

[41]:

duplicate_counts.sort_values(by='Occurrence', ascending=False)

[41]:

	Regulated ID	Regulator ID	Paper IDs	Occurrence
462	Q9NZQ7	Q9NZQ7.t	PMC10098269	8
278	P18621	P31947	PMC8614004	5
319	P29597	P18031	PMC8904293	5
209	P01375	P05231.s	PMC8018404	4
458	Q9NZQ7	P31947	PMC8614004	4
...	...	...	...	...
175	Interferon	P36575	PMC6174845	2
174	Interferon	P01730	PMC3654581	2
173	Interferon	CD8	PMC3654581	2
172	IPR011584	P36575:[SubstitutionMutant]	PMC9853244	2
532	p38	CHEBI:90705	PMC10098269	2

533 rows × 4 columns

From this table, we might have insights into what species are elaboratively studied

[42]:

topk_series = duplicate_counts["Regulated ID"].value_counts().iloc[:5]
total_count = len(duplicate_counts)
topk_df = pd.DataFrame({"Regulated ID": topk_series.index, 'percentage': (topk_series / total_count) * 100})

# Create the bar chart
upper = alt.Chart(topk_df).mark_bar(color="#bae6fd").encode(
    x=alt.X("percentage", title="Percentage of Top 5 Repeated Regulated IDs"),
    y=alt.Y("Regulated ID:N", sort="-x", title="")
).properties(
    width=500,
    height=200
)

topk_series = duplicate_counts["Regulator ID"].value_counts().iloc[:5]
total_count = len(duplicate_counts)
topk_df = pd.DataFrame({"Regulator ID": topk_series.index, 'percentage': (topk_series / total_count) * 100})

# Create the bar chart
lower = alt.Chart(topk_df).mark_bar(color="#fdd1ba").encode(
    x=alt.X("percentage", title="Percentage of Top 5 Repeated Regulator IDs").scale(domain=[0, 12]),
    y=alt.Y("Regulator ID:N", sort="-x", title="")
).properties(
    width=500,
    height=200
)

alt.vconcat(upper, lower)

[42]: