Midterm exam

The midterm exam will be held in person, at the regular class time on Wednesday. It is a closed-book exam, but a cheat sheet is allowed (one sheet of A4 paper, one side, handwritten).
For a complete list of study material, please refer to all posted notes/slides, homework assignments and quizzes.

Checklist

  • Data types, data preprocessing (missing data, noise, outliers, binarization/discretization, attribute transformation, standardization/normalization).
  • Probability & statistics basics, contingency tables and the chi-squared test.
  • Vector & matrix operations; dimensionality reduction; PCA and SVD.
  • Similarity/distance metrics (Jaccard similarity, Simple Matching Score, Minkowski distance, Euclidean distance, City block distance/Hamming distance, cosine similarity, correlation coefficient, entropy).
  • Finding similar items (shingling/tokenizer, SimHash, MinHash, LSH).
  • Link analysis (PageRank algorithm, transition matrix, Markov chain, power iteration, teleport).
  • Mining association rules (Apriori algorithm, finding frequent itemsets, from itemsets to association rules, Simpson's paradox).
  • Cluster analysis (clustering algorithms & evaluations): k-means and variants (k-means++, bisecting k-means), hierarchical clustering, density-based clustering (DBSCAN); measures of cluster validity.
  • Classification: basic algorithms (KNN, decision tree, Naive Bayes), model selection/evaluation, overfitting/regularization.