Feature Engineering for Machine Learning at QConSP

Machine learning fits mathematical models to data to derive insights or make predictions. Engineering the features that sit between data and models is a crucial step in the machine learning pipeline, because the right features can ease the difficulty of modeling and enable higher-quality results. In this talk, we will dive deeper into the mechanisms behind popular feature engineering techniques and walk through where these techniques are most useful. You will be able to better identify which methods to use based on your data and the problem you are working to solve.

  1. 1. Feature Engineering for Machine Learning. Amanda Casari, Principal Product Manager + Data Scientist, Concur Labs @ SAP Concur. @amcasari
  2. 2. here to there via random walk product + data @ SAP Concur control systems engineering + robotics + legos officer in US Navy operations research analyst wandering dirtbag + conservation volunteer EE + applied math + complex systems underwater robotics consultant extraordinaire stay at home mom co-author NASA Datanaut @amcasari
  3. 3. data science is not magic… @amcasari
  4. 4. …but it is a process (sometimes painful) @amcasari @MROGATI
  5. 5. it is easy to get turned around… [diagram: idea, research, exploration, hypotheses, model, outcomes, feedback] @amcasari
  6. 6. …and it is easy to get mixed up xkcd #1838 @amcasari
  7. 7. …so let’s focus on getting from data to models feature engineering goes here! @amcasari
  8. 8. when we say… DATA SCIENCE § … the interdisciplinary intersection of methods, processes, algorithms and problem solving techniques to extract knowledge from data [1] MACHINE LEARNING [ML] § … fitting mathematical models to data in order to derive insights or make predictions [2] FEATURE § … a numeric representation of an aspect of raw data [2] FEATURE ENGINEERING § … the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model [2] hint: our community is well represented in Wikipedia @amcasari
  9. 9. [n.b. ethics] DATA § … is an abstract representation of reality, not reality itself. Data is a part of the system of record, but not the actual system itself. SOCIAL CONSTRUCT § … “jointly constructed understandings of the world that form the basis for shared assumptions about reality” [1] BIAS § … results from unfair sampling of a population, or from an estimation process that does not give accurate results on average [2] ACCOUNTABILITY § … you are answerable for your decisions and obligated to be able to explain the resulting consequences [3] hint: much more about this w/ @kjam at 14:30 @amcasari
  10. 10. how to choose? 1/ FRAME YOUR PROBLEM § Can you frame your problem in a way that machine learning could be useful? e.g. prediction 2/ UNDERSTAND YOUR DATA § What data will be most helpful for building a better understanding of this problem? 3/ FRAME YOUR FEATURE GOALS § What are you optimizing for? § Iteration speed § Model performance 4/ TEST, ITERATE, TEST AGAIN § Check your choices for robustness § Validate but realize this will still change @amcasari
  11. 11. vector space: scalar = single numeric feature; vector = ordered list of scalars. Example: 1/ two-dimensional vector, v = [1, -1] @amcasari
  12. 12. feature space In data, abstract vectors take on actual meaning Examples: • 1/ a vector can represent a person’s preference for songs • Song = feature • +1: Thumbs-up • -1: Thumbs-down • 2/ song represents ind. preferences in a group @amcasari
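
The two readings of the song example on slide 12 can be sketched in a few lines of Python. This is only an illustration: the people, song names, and ratings are hypothetical, and `song_vector` is a helper invented for this sketch.

```python
# Hypothetical thumbs-up (+1) / thumbs-down (-1) ratings; all names are made up.
songs = ["song_a", "song_b", "song_c"]
ratings = {
    "alice": [1, -1, 1],
    "bob": [-1, -1, 1],
}

# 1/ A person is a vector whose features are songs (one coordinate per song).
alice_vector = ratings["alice"]

# 2/ A song is a vector whose features are the individual people in the group.
def song_vector(song):
    idx = songs.index(song)
    # Sort people by name so the coordinate order is stable.
    return [ratings[person][idx] for person in sorted(ratings)]

print(alice_vector)           # [1, -1, 1]
print(song_vector("song_b"))  # [-1, -1]
```

The same table of numbers supports both views; which one counts as the "feature" depends on whether the model is about people or about songs.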
  13. 13. Counts: Fancy Tricks with Simple Numbers
  14. 14. counts: binarization @amcasari
  15. 15. counts: binning @amcasari
  16. 16. counts: fixed width binning @amcasari
  17. 17. @amcasari counts: adaptive binning
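
Slides 14 to 17 name three count transforms without showing code in this transcript. A minimal standard-library sketch follows, assuming a hypothetical list of listen counts. The threshold, the bin width, and the use of quartiles for the adaptive bins are illustrative choices, not values from the talk.

```python
import statistics

# Hypothetical listen counts; real count data is often heavily skewed like this.
counts = [0, 1, 3, 7, 12, 45, 230, 1800]

def binarize(values, threshold=1):
    """Binarization: 1 if the count reaches the threshold, else 0."""
    return [1 if v >= threshold else 0 for v in values]

def fixed_width_bin(values, width=10):
    """Fixed-width binning: every bin spans the same range of values."""
    return [v // width for v in values]

def quantile_bin(values, n_bins=4):
    """Adaptive binning: cut points are quantiles of the data itself,
    so each bin receives a similar number of observations."""
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    # The bin index is the number of cut points the value exceeds.
    return [sum(v > c for c in cuts) for v in values]

print(binarize(counts))
print(fixed_width_bin(counts))
print(quantile_bin(counts))
```

On skewed counts, fixed-width bins leave most observations in the first bin and many bins empty, while quantile bins adapt to where the data actually sits.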
  18. 18. @amcasari $\log_a(a^x) = x$, where a is a positive constant and x can be any positive number; $a^0 = 1$, so $\log_a(1) = 0$. tl;dr the log function compresses the range of large numbers and expands the range of small numbers counts: log transform binning
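
The compress-and-expand behaviour of the log can be checked numerically. The sketch below assumes base 10 and adds 1 before taking the log so that a count of 0 maps to 0. That offset is a common convention assumed here, not something stated on the slide.

```python
import math

def log_transform(values):
    """log10(v + 1); the +1 offset is an assumption so that zero counts are defined."""
    return [math.log10(v + 1) for v in values]

# Hypothetical counts spanning several orders of magnitude.
counts = [0, 9, 99, 999]
transformed = log_transform(counts)
print(transformed)  # roughly [0, 1, 2, 3]

# The raw gap between 99 and 999 is 100x the gap between 0 and 9,
# but after the transform both gaps are about the same size.
raw_gaps = (counts[1] - counts[0], counts[3] - counts[2])
log_gaps = (transformed[1] - transformed[0], transformed[3] - transformed[2])
print(raw_gaps, log_gaps)
```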
  19. 19. @amcasari What does scaling do for features?
  20. 20. normalization: feature scaling @amcasari
  21. 21. @amcasari normalization: feature scaling
  22. 22. @amcasari normalization: feature scaling
  23. 23. @amcasari proper scaling preserves underlying shape
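
The transcript does not record which scalers slides 20 to 22 showed, so the following sketch assumes three common ones: min-max scaling, standardization, and L2 normalization. Function names and sample values are hypothetical.

```python
import math
import statistics

def min_max_scale(values):
    """Rescale to [0, 1]. Assumes the feature is not constant (max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation.
    Assumes the feature is not constant (standard deviation > 0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def l2_normalize(values):
    """Divide by the Euclidean norm so the feature vector has length 1.
    Assumes at least one value is non-zero."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

feature = [2.0, 4.0, 6.0, 8.0]  # hypothetical raw feature values
for scaler in (min_max_scale, standardize, l2_normalize):
    print(scaler.__name__, scaler(feature))
```

All three apply only a shift and a positive rescaling to every value, so the ordering and relative spacing of the points are unchanged, which is the sense in which scaling preserves the underlying shape.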
  24. 24. Text: Flatten, Filter, Chunk
  25. 25. why text? @amcasari hedonometer.org
  26. 26. flatten: bag-of-words (BoW) @amcasari
  27. 27. filter: frequency-based filtering (stopwords) @amcasari These NLP libraries have both English + Portuguese corpora, models, etc. 1/ spaCy 2/ NLTK 3/ OpenNLP
  28. 28. chunk: parts of speech matter @amcasari Pop Chart Lab, npr.org
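
The flatten and filter steps from slides 26 and 27 can be sketched with the standard library alone. The tokenizer regex and the tiny stopword set are assumptions for illustration; in practice a curated list from spaCy or NLTK would replace them. Chunking by part of speech needs a trained tagger from one of those libraries and is not sketched here.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; not a real library's list.
STOPWORDS = {"a", "and", "is", "it", "of", "the"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Flatten: discard word order and keep only per-word counts."""
    return Counter(tokenize(text))

def filter_stopwords(bow, stopwords=STOPWORDS):
    """Filter: drop very common words that carry little signal."""
    return Counter({word: n for word, n in bow.items() if word not in stopwords})

doc = "It is a puppy and it is extremely cute"  # hypothetical document
bow = bag_of_words(doc)
print(bow)
print(filter_stopwords(bow))
```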
  29. 29. @amcasari thank you @RainyData code repo | buy the book here!
