First analysis- 30mer pLDDT #3
base: main
Overall looks good. Can you add a plot to results/figures using the graph method you created for the analysis in this notebook? It would be cool to show John, and to look back on later to compare with other methods.
# %%
def plot_epitope_non_epitope_stats_9mer(
Hmm, it's not intuitive from looking at the plot what the minimum represents here. I think the legend should be more descriptive: for each 30mer, this takes the 9mer with the lowest single-residue pLDDT, takes that 9mer's per-amino-acid minimum pLDDT, and then averages across 30mers. There's probably a better way to word this, but as it stands the plot is a bit misleading. The mean legend could be more descriptive, too.
Also, is this identical to the 30mer mean min? I think the calculation works out the same
Yep, from looking at the values, I think these are equivalent. I'd recommend removing min min from this plot.
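A quick sketch of why the two quantities coincide, assuming "min min" means the lowest per-residue pLDDT found in any 9mer window of a 30mer. The `window_mins` helper and the pLDDT values below are hypothetical and only illustrate the argument: every residue falls inside at least one window, so the minimum over window minima has to equal the minimum of the whole 30mer.

```python
def window_mins(plddt, k=9):
    """Return the minimum pLDDT of every length-k window in a per-residue list."""
    return [min(plddt[i:i + k]) for i in range(len(plddt) - k + 1)]


# Hypothetical per-residue pLDDT values for one 30mer (illustrative only).
plddt_30mer = [float(50 + (7 * i) % 41) for i in range(30)]

# "min min": lowest per-residue pLDDT across all 9mer windows of the 30mer.
min_of_window_mins = min(window_mins(plddt_30mer))

# Plain minimum over the whole 30mer.
min_of_30mer = min(plddt_30mer)

print(min_of_window_mins == min_of_30mer)
```

Since this holds for each 30mer, averaging either quantity across 30mers gives the same mean, which would explain the overlapping values in the plot.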
print("max:" + str(max_pLDDT))
min_pLDDT = dataset.select(pl.col(colname)).min().item()
print("min:" + str(min_pLDDT))
mean_pLDDT = dataset.select(pl.col(colname)).to_series()
Remove mean_pLDDT, looks like it isn't doing anything.
all_statistics,
"data/hv/peptide/inference",
)
Add your mass and helix/beta-sheet feature extraction methods here.
@ljwoods2 just pushed the code. I wasn't able to do everything I wanted, but I got a good start.
…nd min of the atomic weights of the 9-mers
@spencer2234 can you try using the max BepiPred score per 30mer as a feature instead of the mean? I think that's potentially a fairer way to compare.
@ljwoods2 check out the hv_class and in_class folders in notebooks.
@@ -0,0 +1,99 @@
import polars as pl
Write a brief docstring at the top of each of these feature extraction scripts describing which features it extracts. It's not clear from the name alone what each script is meant to do.
y_hat_RSA_fp = st.normalized_pLDDT_30mer(all_statistics_in_class_fp, "mean_rsa_slice")
y_true_RSA = all_statistics_in_class_fp.select(pl.col("epitope"))
in_class_norm_rsa_mean_30mer_ROC = st.plot_auc_roc_curve(
y_true_RSA, y_hat_RSA_fp, "in_class Normalized mean RSA values for 30mer fp ROC"
For your poster, change the titles of these figures so that they don't say "in_class" as the dataset name. I would say "IN1 30mer classification set" as the name or something similar, and then you can define that in the text of the poster.
Same goes for other figures
in_class_norm_rsa_mean_30mer_ROC = st.plot_auc_roc_curve(
y_true_RSA, y_hat_RSA_fp, "in_class Normalized mean RSA values for 30mer fp ROC"
)
in_class_norm_rsa_mean_30mer_ROC.savefig(
The ROC curve is flipped; fix this so AUC > 0.5.
Same goes for the other flipped ROC curves.
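A minimal sketch of why a curve ends up flipped and one way to correct it, assuming the feature is oriented so that lower values indicate epitopes. `roc_auc` here is a hypothetical pure-Python helper, not the project's `st.plot_auc_roc_curve` (whose internals aren't shown in this diff), and the labels and scores are made up. It computes AUC as the probability that a random positive outscores a random negative, so negating the scores turns an AUC of a into 1 - a.

```python
def roc_auc(y_true, scores):
    """AUC as P(positive score > negative score), counting ties as 0.5.

    Returns None when only one class is present, since AUC is undefined then.
    """
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores where LOW scores mark the positive class.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.3, 0.6, 0.5, 0.8, 0.9]

auc = roc_auc(y_true, scores)                       # below 0.5: "flipped"
flipped = roc_auc(y_true, [-s for s in scores])     # negate to reorient

print(auc, flipped)
```

The direction should be chosen once per feature, based on what the feature means, and then applied to every dataset. Flipping per dataset just to get AUC above 0.5 would overstate performance.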
# %%
y_hat_RSA_fp = st.normalized_pLDDT_30mer(all_statistics_in_class_fp, "mean_rsa_slice")
There are quite a few steps leading up to this, so add a markdown cell above this cell describing what this plot shows. The scoring method isn't immediately obvious: it scores every instance of every 30mer across the focal proteins it appeared in (which allows duplicate 30mers), with each 30mer annotated with a true/false value extracted from the PepSeq assay.
)

# %%
fp_aggrigate_30mer = all_statistics_hv_class_fp.group_by("peptide").agg(
This plot and the one above it both use the column "mean_pLDDT_slice", but this one refers to the metric as "geometric mean pLDDT". Is it using the geometric mean or not? Rename whichever is incorrect.
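For reference, a small sketch of how the two means differ on the same data, using only the Python standard library. The pLDDT values are hypothetical. Which of the two the "mean_pLDDT_slice" column actually holds isn't visible from this diff.

```python
from statistics import fmean, geometric_mean

# Hypothetical per-residue pLDDT values for one slice (all positive).
plddt_slice = [90.0, 85.0, 40.0, 70.0]

arithmetic = fmean(plddt_slice)            # sum of values / count
geometric = geometric_mean(plddt_slice)    # n-th root of the product

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"geometric mean:  {geometric:.2f}")
```

The geometric mean is never larger than the arithmetic mean and is pulled down more by a single low value, so the label should match whichever calculation produced the column.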
)

# %%
mean_auc = fp_aggrigate_9mer.select("AUC").mean()
AUC is always None here, so something is wrong.
Same with the equivalent cell in in_data.py
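One possible cause, not confirmed from this diff: if the epitope label is constant within each group, a per-group AUC is undefined, and every group would come back empty. The sketch below is a diagnostic for that case only. It counts how many groups contain both classes. The peptide names and rows are hypothetical, and plain Python stands in for the notebook's dataframe.

```python
from collections import defaultdict

# Hypothetical rows of (peptide, epitope label); real data would come from
# the notebook's dataframe.
rows = [
    ("PEP_A", 1), ("PEP_A", 1),
    ("PEP_B", 0), ("PEP_B", 0),
    ("PEP_C", 1),
]

# Collect the distinct labels seen for each peptide.
labels_by_peptide = defaultdict(set)
for peptide, label in rows:
    labels_by_peptide[peptide].add(label)

# A group can only yield an AUC if it contains both a positive and a negative.
scorable = [p for p, labels in labels_by_peptide.items() if len(labels) == 2]

print(f"{len(scorable)} of {len(labels_by_peptide)} groups contain both classes")
```

If the count is zero on the real data, the AUC would need to be computed across groups (one score per peptide) rather than within each group.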
Fixes #7
Fixes #8
Fixes #9
Fixes #15
Fixes #16