Data Science Notes

  • The human desire for efficiency led to the development of computers and programming languages.
  • Computers can take over tasks that humans can do, and can also perform tasks that humans cannot, so people can work less and spend more time on other things.
  • Machines can work 24/7, unlike humans, who can only work for a few hours a day.
  • “From Data Mining to Knowledge Discovery in Databases” (1996 paper)
  • Data mining is the application of specific algorithms for extracting patterns from databases
  • Data mining + Computer Science = Data Science
  • Business Intelligence: for example, infer from a customer’s purchase data that she is pregnant, then market to her the things that you sell to new parents
  • Paper title: “What’s Even Creepier Than Target Guessing That You Are Pregnant”
  • In the movie “Moneyball”, a low-budget baseball team picked undervalued players and won 20 consecutive games using data science

Data Science Project Steps

  1. Formulate Question: define the problem that you want to solve and make sure you ask the right questions. A clear, well-formulated question determines the research and also affects the kind of data that you will gather
  2. Gather data:
    • Source of the data: where does the data come from?
    • Description of the data set: understand the full context in which the data was collected
    • Number of data points: how big is the dataset?
    • Number of features: for each data point, how many aspects (characteristics) were measured?
    • Description of the features:
  3. Clean data: real-world data is often messy, so it needs to be cleaned
  4. Explore and Visualize: a graph or chart is often much more helpful than a table of data; a picture is worth a thousand words. Oftentimes, exploring, visualizing, and cleaning the data happen more or less at the same time
    • Distribution
    • Outliers
  5. Model: Train Algorithm
    • Split training data and test data: shuffle data before splitting
  6. Deploy and Evaluate
    • Checklist for evaluating a regression model:
      1. R-squared
      2. p-values of the coefficients
      3. VIF: variance inflation factor
      4. BIC: the Bayesian information criterion
  • Dummy variables: capture a binary (yes/no) feature as 0 or 1
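
Steps 5 and 6 above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not a prescribed workflow: the toy data is synthetic, the function names (shuffle_split, fit_line, r_squared, bic) are hypothetical, the model is a one-feature least-squares line, and the BIC uses the Gaussian-error form n*ln(RSS/n) + k*ln(n), which drops an additive constant. p-values and VIF are left out because they are usually taken from a statistics library.

```python
import math
import random


def shuffle_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rss(points, intercept, slope):
    """Residual sum of squares of the line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)


def r_squared(points, intercept, slope):
    """Share of the variance in y that the line explains."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1.0 - rss(points, intercept, slope) / ss_tot


def bic(points, intercept, slope, n_params):
    """BIC up to an additive constant; lower is better."""
    n = len(points)
    return n * math.log(rss(points, intercept, slope) / n) + n_params * math.log(n)


# Synthetic toy data (an assumption for illustration): y = 3 + 2x + noise.
noise = random.Random(42)
data = [(x, 3.0 + 2.0 * x + noise.gauss(0, 1)) for x in range(50)]

train, test = shuffle_split(data, test_fraction=0.2, seed=1)
intercept, slope = fit_line(train)

# Evaluate on the held-out test data, not on the training data.
test_r2 = r_squared(test, intercept, slope)

# Compare against a simpler model that always predicts the mean of y.
mean_y = sum(y for _, y in train) / len(train)
bic_line = bic(train, intercept, slope, n_params=2)
bic_mean = bic(train, mean_y, 0.0, n_params=1)
```

Here test_r2 close to 1 would suggest the line generalizes to unseen points, and bic_line being lower than bic_mean would suggest the extra slope parameter is worth its complexity penalty.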