Obviously Awesome

Data Mining: a Programmer’s Guide

This post is in progress. It will be fleshed out as time permits.

  • Data Mining: a Programmer’s Guide (Zacharski)

    • Intro

    • Recommendation Systems

      Intro; finding similar items; Manhattan distance; Euclidean distance; Minkowski distance; Pearson correlation coefficient; cosine similarity; k-nearest neighbors in Python; Book-Crossing dataset

    • Item-based Filtering

      Explicit & implicit ratings; user-based filters; item-based filters; adjusted cosine similarity; Slope One algorithm; Python code; MovieLens dataset

    • Classification

      Pandora-like systems; selecting appropriate attributes; example; data normalization; modified standard score; Python code; sports example; acquiring attribute data

    • Classification, Pt. 2

      Training sets & test data; 10-fold cross validation; adding data vs algorithm tweaks; kNN; Python code

    • Naive Bayes & Probability Density Functions

      Lazy & eager learning; probability refresher; conditional probability; Bayes’ theorem; Python code; Congress Voting dataset; Gaussian distribution; Python code

    • Naive Bayes & unstructured text

      Positive & negative texts; classifier training; stop words; newsgroup classifier; Python code; sentiment analysis

    • Clustering

      Intro; hierarchical; single/complete/average linkages; dog breed clusters; breakfast cereal clusters; k-means; k-means++; Enron email dataset
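
As a companion to the distance measures listed under Recommendation Systems, here is a minimal sketch of Minkowski distance, of which Manhattan (r = 1) and Euclidean (r = 2) are special cases. The function name, the dictionary-of-ratings layout, and the sample ratings are assumptions made for illustration; this is not code from the book.

```python
def minkowski(ratings_a, ratings_b, r):
    """Minkowski distance between two users, using only items both have rated.

    ratings_a, ratings_b: dicts mapping item name -> numeric rating (assumed layout).
    r: order of the distance (1 = Manhattan, 2 = Euclidean).
    Returns None when the users share no rated items.
    """
    shared = [item for item in ratings_a if item in ratings_b]
    if not shared:
        return None
    total = sum(abs(ratings_a[item] - ratings_b[item]) ** r for item in shared)
    return total ** (1 / r)


# Hypothetical ratings, invented for this example.
alice = {"Item A": 4.0, "Item B": 1.0, "Item C": 5.0}
bob = {"Item A": 1.0, "Item B": 5.0, "Item D": 2.0}

manhattan = minkowski(alice, bob, 1)  # |4 - 1| + |1 - 5|
euclidean = minkowski(alice, bob, 2)  # sqrt(3**2 + 4**2)
print(manhattan, euclidean)
```

Only the items both users rated contribute to the sum, so a user with many ratings is not penalized for items the other user never saw. Returning None for users with no overlap is one possible convention; a caller could equally treat such pairs as infinitely distant.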