Although a ROC AUC of 0.73 is not a bad value, in most cases it is not enough to predict well on binary classification problems that are this imbalanced. By predicting most of the samples as the majority class, the model will get “good” results on paper, but in practice it will fail to predict the ‘1’ class well.
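To make the point concrete, here is a minimal sketch in plain Python. The 95/5 class split and the always-predict-0 "model" are assumptions made up for illustration, not numbers from the article:

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of samples predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on class 1: share of actual positives that were found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, recall on class 1={recall:.2f}")
# prints: accuracy=0.95, recall on class 1=0.00
```

The accuracy looks fine, yet not a single ‘1’ sample is detected, which is why I would look at recall or precision on the minority class as well.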
I am dealing with a similar imbalanced project and facing this issue. Do you know how to apply the other suggested methods for dealing with imbalanced datasets in PySpark (SMOTE, oversampling, undersampling…)?
Thank you and best regards.