Author : Uday Krishna, Digital Data Analyst
1) Pre-Processing: Preparation of data for modeling
You are the best judge of what needs to be done, but here are some considerations:
What is my expected output, and what are the nature and size of the data I have at my disposal? Is it a binary label (0 or 1), a cluster assignment, a probability? This dictates the choice of algorithm and method (supervised or unsupervised), and thereby the data preparation for that particular algorithm.
Which is the dependent (target) variable and which are the independent variables (features)?
Are the variables categorical or continuous? Can I transform a variable or represent it in a different way (encoding it, for example) to make it better suited to my needs?
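The encoding idea above can be sketched with a minimal one-hot encoder, using only the standard library; the color values are illustrative, not from any particular dataset:

```python
def one_hot_encode(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))              # fix a stable category order
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                         # flip on the matching category
        encoded.append(row)
    return categories, encoded

cats, rows = one_hot_encode(["red", "green", "red", "blue"])
# cats == ["blue", "green", "red"]
# rows[0] == [0, 0, 1]   (the first "red")
```

In practice a library routine (e.g. pandas' get_dummies or scikit-learn's OneHotEncoder) would replace this, but the transformation is the same.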
Split the data into a training set and a testing set.
Consider class imbalance: a stratified split keeps each class represented proportionally in both sets.
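A stratified train/test split that accounts for class imbalance can be sketched as follows; this is a hand-rolled illustration (scikit-learn's train_test_split with stratify=y is the usual tool), and the 25% test ratio is just an example:

```python
import random

def stratified_split(X, y, test_ratio=0.25, seed=42):
    """Split while preserving class proportions, so minority classes appear in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        n_test = max(1, int(len(idxs) * test_ratio))   # at least one test example per class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])

# Toy imbalanced data: 9 negatives, 3 positives.
X = list(range(12))
y = [0] * 9 + [1] * 3
X_train, y_train, X_test, y_test = stratified_split(X, y)
```

Without stratification, a plain random split on imbalanced data can leave the test set with no minority-class examples at all.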
Data Cleaning - Removal of NaNs, duplicates, erroneous entries and the like.
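The NaN and duplicate removal steps can be sketched with the standard library alone (with pandas, dropna and drop_duplicates do the same job); the sample rows are illustrative:

```python
import math

def clean_rows(rows):
    """Drop rows containing NaN values and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(isinstance(v, float) and math.isnan(v) for v in row):
            continue                      # remove rows with NaNs
        key = tuple(row)
        if key in seen:
            continue                      # remove exact duplicates
        seen.add(key)
        cleaned.append(row)
    return cleaned

raw = [(1, 2.0), (1, 2.0), (3, float("nan")), (4, 5.0)]
# clean_rows(raw) == [(1, 2.0), (4, 5.0)]
```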
Take into consideration - stemming, lemmatization, stop-word removal, similarity measures (cosine similarity, edit distance, etc.), L1 & L2 regularization, and normalization, among other techniques.
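Two of the measures named above, cosine similarity and min-max normalization, are compact enough to sketch directly; the input vectors are arbitrary examples:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def min_max_normalize(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

cosine_similarity([1, 0, 1], [1, 1, 0])   # 0.5: vectors share one of two components
min_max_normalize([2, 4, 6])              # [0.0, 0.5, 1.0]
```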
Feature selection / dimensionality reduction - PCA (Principal Component Analysis), SVD, or other statistical/mathematical measures.
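PCA and SVD are closely related: projecting centered data onto the top right singular vectors gives the principal components. A minimal sketch with NumPy (assuming NumPy is available; the toy matrix is illustrative):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)                       # PCA requires centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T               # rows of Vt are principal axes

X = np.array([[2.0, 0.0], [0.0, 2.0], [4.0, 4.0], [2.0, 2.0]])
Z = pca(X, 1)   # reduce 2 features to 1 component; Z has shape (4, 1)
```

scikit-learn's PCA class wraps essentially this computation with extras such as explained-variance ratios.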
Consult academic papers (Google Scholar is a good resource), research reports, past experiments, and solid human domain knowledge to exclude false indicators.
2) Modeling: Choice of algorithm, training and testing
The choice of algorithm is important. What's my desired outcome, and how do I want to interpret it? What's the nature/size/class of the data at my disposal? The analysis done in the pre-processing stage is also relevant in modeling. These two questions should help you narrow down the search.
Do look at academic papers (Google Scholar), business reports, or other examples from colleagues, online sources, Kaggle, etc., and find similar work done before. Pay careful attention to the nature of their data, their output, the kind of data cleaning they did, the size of their data, their choice of algorithm, their evaluation criteria, their results, and their interpretation of those results. If there is a close match, or an analogous match (same type of data but a different source, for example), then it might make sense to replicate the algorithm and the entire methodology, or to adapt the algorithm to your methodology.
Take into consideration - overfitting and the different types of cross-validation.
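Cross-validation guards against overfitting to a single train/test split by rotating which portion of the data is held out. A minimal k-fold index generator, using only the standard library (scikit-learn's KFold is the usual tool; shuffling before splitting is omitted here for brevity):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    # Distribute samples as evenly as possible across the k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]               # this fold is held out
        train = indices[:start] + indices[start + size:]  # everything else trains
        yield train, val
        start += size

for train_idx, val_idx in k_fold_indices(10, 5):
    pass  # fit on train_idx, score on val_idx, then average the k scores
```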
Training and Testing.
3) Evaluation
The choice of evaluation metric is as important as the score achieved on that metric. Each algorithm/method typically has some evaluation metrics closely associated with it, mostly through convention. There are different schools of thought, and you will develop your own; there are also other evaluation metrics beyond these, some popular simply because they are widely used.
The question is: what are we trying to evaluate, and what are we trying to interpret? Which metric will help me evaluate it with the least bias?
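For classification, precision, recall, and F1 illustrate why the metric choice matters: accuracy alone can look good on imbalanced data while the minority class is mispredicted. A sketch from first principles (scikit-learn's metrics module provides these ready-made):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
# p == r == f1 == 2/3 on this toy example
```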
Record and measure results. Iterate with different algorithm choices and/or different pre-processing tweaks and choices.
From an ML standpoint, it's not just important to score high on evaluation metrics but also to ensure there are no biases in the choice of evaluation metric, the formation of the dataset, the class examples, overfitting, etc. Only then can the system scale and provide consistent results. Eventually, with more data, more experimentation, and more learning, the system will score high, as long as the experimentation/modeling framework is set up the right way.
The training dataset needs to be a good representation of the testing dataset that will eventually be used.
Recording the choices and assumptions will help in the longer run.
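One lightweight way to record choices and assumptions is an append-only log of experiment runs; this is a sketch, and the file name, field names, and values are all illustrative:

```python
import json

def log_experiment(path, run):
    """Append one experiment record (choices, assumptions, scores) as a JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")

run = {
    "algorithm": "logistic_regression",                    # illustrative values
    "preprocessing": ["dedup", "min-max normalization"],
    "assumptions": ["classes are stratified in the split"],
    "f1": 0.67,
}
log_experiment("experiment_log.jsonl", run)   # one line per run; easy to diff and grep later
```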
Like all experiments, one only learns from implementation, from recording and measuring results, and from feedback-based iteration.