## 1) Pre-Processing: Preparation of data for modeling

You are the best judge of what needs to be done, but here are some considerations:

- What is my expected output, and what is the nature and size of the data at my disposal? Is the output binary (0 or 1), a cluster assignment, or a probability? This dictates the choice of algorithm and method (supervised or unsupervised), and thereby the data preparation that algorithm requires.
- Which is the dependent variable (the target) and which are the independent variables (the features)?
- Are the variables categorical or continuous? Can I transform a variable or represent it differently (by encoding it, for example) to better suit my needs?
- How will I split the data into a training set and a testing set?
- Is there class imbalance to account for?
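
Two of the considerations above, encoding a categorical variable and splitting data without distorting an imbalanced class ratio, can be sketched in plain Python. This is a minimal illustration, not a recommended implementation: the helper names `one_hot` and `stratified_split`, the 90/10 toy data, and the 20% test fraction are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def one_hot(values):
    """Encode categorical values as 0/1 vectors, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return encoded, categories


def stratified_split(rows, labels, test_fraction=0.25, seed=0):
    """Split into train/test sets, sampling each class separately so that
    both sets keep roughly the same class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test example for any class with 2+ members.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(rows[i], labels[i]) for i in sorted(train_idx)]
    test = [(rows[i], labels[i]) for i in sorted(test_idx)]
    return train, test


# Hypothetical imbalanced binary data: 90 negatives, 10 positives.
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10
train, test = stratified_split(rows, labels, test_fraction=0.2)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print("train:", dict(train_counts), "test:", dict(test_counts))

encoded, categories = one_hot(["red", "green", "red"])
print(categories, encoded)
```

A purely random split of data this skewed could leave the test set with very few positives, or none, which is why the split is done per class here.
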

**Data Cleaning**

- Removal of NaNs, duplicates, erroneous entries and the like.
- Take into consideration: **Stemming**, **Lemmatization**, stop-word removal, similarity measures (cosine similarity, edit distance, etc.), L1 and L2 regularization, and normalization, among other similar things.

**Feature selection**

- PCA (Principal Component Analysis), SVD or other statistical/mathematical measures.
- Academic papers (Google Scholar is a good resource), research reports, past experiments, and solid human domain knowledge to exclude false indicators.
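
A few of the cleaning steps named in this section (dropping NaNs and duplicates, stop-word removal, cosine similarity) can be sketched with the standard library alone. The function names, the tiny stop-word list, and the sample records are assumptions for illustration only. Stemming and lemmatization are left out because they normally rely on a dedicated NLP library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; real projects use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def clean_records(records):
    """Drop records containing None/NaN and exact duplicates (first one wins)."""
    seen = set()
    cleaned = []
    for record in records:
        has_missing = any(
            value is None or (isinstance(value, float) and math.isnan(value))
            for value in record
        )
        if has_missing or record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [token for token in re.findall(r"[a-z]+", text.lower())
            if token not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical records as tuples: (measurement, label).
records = [(1.0, "cat"), (1.0, "cat"), (float("nan"), "dog"), (2.0, None), (3.0, "bird")]
cleaned = clean_records(records)
print(cleaned)

tokens = tokenize("The cat is in the hat")
print(tokens)

same = cosine_similarity(tokenize("The cat sat"), tokenize("A cat sat"))
different = cosine_similarity(tokenize("The cat sat"), tokenize("Dogs bark loudly"))
print(same, different)
```

Once stop words are removed, the two "cat sat" sentences have identical token counts, so their similarity is at the maximum, while sentences with no shared tokens score zero.
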
## 2) |