Weekly schedule:
- Week 1 (Aug 22nd-)
Topics:
* Class overview & Intro to data mining (ppt)
* Introduction to Kaggle and Python (notes)
* Data (types of data, and data preprocessing) (ppt; pandas)
* EDA & visualization (matplotlib tutorials)
More readings (Data Mining and health): Covid or flu?; genome/gene sequencing; ‘Major errors’ alleged in landmark study that used microbes to identify cancers
Notebooks: NumPy & Pandas demo; Data exploration (EDA); Data preprocessing; EDA case study
===================================
- Week 2 (Aug 29th-)
Topics:
* Where/how to get the data? (notes)
* Vectors, matrices, eigenvalues, and eigenvectors (notes)
* Curse of dimensionality and dimensionality reduction (ppt)
Notebooks: SVD explained; PCA explained; SVD & PCA relationship; Standardization & PCA
For bored: NMF; MDS; t-SNE (non-linear dimensionality reduction); the art of using t-SNE
===================================
- Week 3 (Sep 5th-)
No class on Sep 5 (Labor Day)
Topics:
Reading: Chapter 2, (Massive) Chapter 3
Notebooks: Data preprocessing (review); Standardization & PCA (review); Comparison of different scaling approaches
===================================
Module 2 -- search & link analysis
- Week 4 (Sep 12th-)
Topics:
* KNN: Learning from neighbors (notes; KNN in sklearn; algorithms for near neighbor searches)
* Finding similar items: Minhash & LSH (ppt)
Notebooks: Fraud Detection using KNN; Work with texts
===================================
- Week 5 (Sep 19th-)
Topics:
* Finding similar items: Minhash & LSH (Cont.) (ppt; a practice)
* Link analysis (PageRank): transition matrix and power iteration (ppt); Markov models (ppt)
Notebook: Facebook Network Analysis using NetworkX
For bored: Twitter's "circle of trust" algorithm for WTF
===================================
Module 3 -- association rule mining
- Week 6 (Sep 26th-)
Topics:
* Link analysis (PageRank): transition matrix and power iteration (ppt); Markov models (ppt) (Cont.)
* Association analysis: Frequent itemset generation & rule generation (Apriori algorithm) (ppt) (an example)
More readings (HPV vaccination): paper abstract; statement of retraction; online post 1; post 2
===================================
- Week 7 (Oct 3rd-)
Topics:
* Association analysis: support counting (hash tree, and DHP); evaluation of association patterns (Simpson's paradox)
* Association analysis: advanced concepts (non-binary data; concept hierarchy; sequential patterns; subgraph patterns) (ppt)
* Genome-wide association studies (a paper)
Readings: DHP/SIT algorithms for association rules mining; Benefits and limitations of genome-wide association studies
===================================
Module 4 -- clustering analysis
- Week 8 (Oct 10th-)
Topics:
* Cluster analysis (ppt)
-- Clustering algorithms: k-means and variants, hierarchical clustering, density-based clustering
-- Clustering evaluation: supervised vs unsupervised; heatmap, correlation, SSE, Silhouette coefficient, entropy, purity
Notebooks: some toy datasets; notebook (comparison of basic clustering algorithms) (scikit-learn); comparison of hierarchical clustering; silhouette analysis on sklearn
Others: tree of coronavirus; clustering & data processing (recap)
===================================
Module 5 -- classification (basic)
- Week 9 (Oct 17th-)
Topics:
* Classification: basic concept & techniques (decision tree; ppt); KNN (ppt)
* Naive Bayes classifier (ppt)
* Model selection/evaluation (ppt); overfitting/regularization (ppt)
Reading: Chapters 3 & 4
For bored: XGBoost
===================================
- Week 10 (Oct 24th-)
Midterm exam
===================================
Module 6 -- classification (cont.)
- Week 11 (Oct 31st-)
Topics:
Notebook: Decision tree & pruning (breast cancer dataset 1; 2); SVM (GridSearchCV)
Reading: Chapters 3 & 4
For bored: XGBoost
===================================
- Week 12 (Nov 6th-)
Topics:
* Linear regression & logistic regression (ppt)
* ANN (ppt-1; ppt-2); Deep Learning (e.g., CNN, ppt)
* Ensemble Methods (Bagging & Boosting) (ppt)
Reading: pytorch vs tensorflow (google trends)
Notebook: Linear regression; Fraud Detection (evaluation & logistic regression); Neural network models in scikit-learn
Notebook (more): Voting classifier; A tutorial on Tensorflow: clothing classification using the Fashion MNIST dataset or text classification
For bored: Opportunities and obstacles for deep learning in biology and medicine; Attention is all you need; Transformers at Huggingface; Facebook Research's FAISS
===================================
Module 7 -- advertising on the web, recommendation system
- Week 13 (Nov 13th-)
Topics:
* Advertising on the Web (ppt; Alphabet Q3 FY23 income)
* Recommendation systems (ppt)
Reading 2: How Google AdWords auctions work; How Data Mining Can Help Advertisers Hit Their Targets
For fun: Touching the Void; Spotify active users; "Surprise me" recommendation
===================================
- Week 14 (Nov 20th-)
Thanksgiving break; no class
===================================
Module 8 -- everything else
- Week 15 (Nov 27th-)
Topics:
* Text Mining & NLP (notes)
* Anomaly Detection (ppt)
* Avoid False Discoveries (ppt)
* Human factors, security and ethical issues in DM
===================================
- Week 16 (Dec 4th-)
5-minute mini-presentations on projects (in class on both Tuesday and Thursday)