Using Embeddings to Understand the Variance and Evolution of Data Science... - Maryam Jahanshahi

In this talk I will discuss exponential family embeddings, which are methods that extend the idea behind word embeddings to other data types. I will describe how we used dynamic embeddings to understand how data science skill-sets have transformed over the last 3 years using our large corpus of jobs. The key takeaway is that these models can enrich analysis of specialized datasets.
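
As a rough illustration of the idea in this abstract (not the speaker's code), an exponential family embedding scores a data point against its context with an inner product and then maps that score to the mean of a distribution suited to the data type. The sketch below assumes the canonical links (sigmoid for Bernoulli, exponential for Poisson, identity for Gaussian). The function names and toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_sum(context_vectors, weights=None):
    """Weighted sum of the context vectors (all weights default to 1)."""
    if weights is None:
        weights = [1.0] * len(context_vectors)
    dim = len(context_vectors[0])
    return [
        sum(w * vec[k] for w, vec in zip(weights, context_vectors))
        for k in range(dim)
    ]


def conditional_mean(rho, ctx, family):
    """Mean of the data point given its context, for an assumed canonical link."""
    eta = dot(rho, ctx)  # natural parameter: embedding . summed context
    if family == "bernoulli":  # binary data, e.g. word present or absent
        return 1.0 / (1.0 + math.exp(-eta))
    if family == "poisson":  # count data, e.g. items in a basket
        return math.exp(eta)
    if family == "gaussian":  # continuous data
        return eta
    raise ValueError(f"unknown family: {family}")


# Toy, made-up vectors: one embedding vector and two context vectors.
rho_python = [0.5, -0.2]
alpha_context = [[1.0, 0.0], [0.0, 1.0]]

ctx = context_sum(alpha_context)
means = {
    family: conditional_mean(rho_python, ctx, family)
    for family in ("bernoulli", "poisson", "gaussian")
}
print(means)
```

Only the final mapping changes between families; the embedding and context vectors play the same role in each case, which is what lets the word embedding idea carry over to other data types.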

Published in: Technology

  1. 1. Using Embeddings to Understand the Variance and Evolution of Data Science Skill Sets. Maryam Jahanshahi, Ph.D., Research Scientist, TapRecruit. http://bit.ly/pydatanyc-emb
  2. 2. Maryam Jahanshahi, Ph.D., Research Scientist, TapRecruit. http://bit.ly/pydatanyc-emb Using Embeddings to Understand the Variance and Evolution of Data Science Skill Sets
  3. 3. TapRecruit uses NLP to understand and organize natural language career content: Smart Editor for Job Descriptions; Active Pipeline Health Monitoring; Multifaceted Salary Estimation
  4. 4. Language matters in job descriptions. Same title, different job: Finance Manager (Kraft Foods) vs. Finance Manager (Kraft Foods). Same title; Junior (3 Years), No Managerial Experience; Senior (6-8 Years), Division Level Controller, Strategic Finance Role, MBA / CPA. Labels: Required Experience, Required Responsibility, Preferred Skill, Required Education. Different title, same job: Performance Marketing Manager (PocketGems) vs. Senior Analyst, Customer Strategy (The Gap). Mid-Level, Quantitative Focus, Expertise iBanking, Data Analysis Tools (SQL), Consulting Experience Preferred, MBA Preferred; Mid-Level, Quantitative Focus, Expertise iBanking, Relational Database Experience, External Consulting Experience Preferred, BA degree in business, finance, MBA Preferred. Labels: Required Experience, Required Skills, Preferred Experience, Required and Preferred Education.
  5. 5. How have data science skills changed over time?
  6. 6. How have data science skills changed over time? Manual Feature Extraction (example terms: MBA, PhD, SQL, Tableau, PowerBI, Python). Dynamic Topic Modeling (adapted from Blei and Lafferty, ICML 2006). Top topic words by year: 1880: force, energy, motion, differ, light; 1920: atom, theory, electron, energy, measure; 1960: radiat, energy, electron, measure, ray; 2000: state, energy, electron, magnet, field. Labels: Matter, Electron, Quantum.
  7. 7. Word embeddings capture semantic similarities. Example sentences (each marked as word plus surrounding context): "Proficiency programming in Python, Java or C++."; "Experience in Python, Java or other object-oriented programming languages"; "Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python)". Plot labels: French, German, Japanese, Esperanto, Language; Python, Programming, C++, Java, Object-orientated.
  8. 8. Embeddings capture entity relationships (adapted from the Stanford NLP GloVe project). Woman :: Queen as Man :: ? (plot labels: Man, Woman, Queen, King). Hierarchies: Verizon and McAdam, Vodafone and Colao, Viacom and Dauman, Exxon and Tillerson, Wal-Mart and McMillon. Comparatives and Superlatives: Slow, Slower, Slowest; Short, Shorter, Shortest; Strong, Stronger, Strongest.
  9. 9. Pretrained embeddings facilitate fast prototyping. Pipeline stages: Corpus Generation, Corpus Processing, Language Model Generation, Language Model Tuning, Final Application. Pretrained models by corpus: Twitter (GloVe): 27 B tokens, 1.2 M vocabulary, 25-200 d vectors; Common Crawl (GloVe): 42-840 B tokens, 1.9-2.2 M vocabulary, 300 d vectors; GoogleNews (word2vec): 100 B tokens, 3 M vocabulary, 300 d vectors; Wikipedia (GloVe): 6 B tokens, 400 k vocabulary, 50-300 d vectors.
  10. 10. Problems with pretrained embedding models. Casing: abbreviations vs words, e.g. IT vs it. Out of Vocabulary Words: domain-specific words and acronyms. Polysemy: words with multiple meanings, e.g. drive (a car) vs drive (results), or Chef (the job) vs Chef (the language). Multi-word Expressions: phrases that have new meanings, e.g. Front-end vs front + end.
  11. 11. Tools for developing custom language models. Corpus Processing: tokenization, POS tagging, sentence segmentation, dependency parsing (CoreNLP, SyntaxNet). Language Modeling: different word embedding models (GloVe, word2vec, fastText).
  12. 12. Windows capture semantic similarity vs relatedness. Small Window Size: captures semantic similarity, substitutes and word-level differences. Large Window Size: captures semantic relatedness, alternatives and domain-level differences. Plot labels (first plot): Python, Java, Programming, C++, Language, French, German, Japanese, Esperanto, SPSS, Statistical modeling, Object-orientated, Software. Plot labels (second plot): Python, Programming, C++, Java, Language, French, German, Japanese, Esperanto, Object-orientated.
  13. 13. Custom embeddings identified equal opportunity and perks language
  14. 14. Custom embeddings identified ‘soft’ skills and language around experience
  15. 15. I’ve got 300 dimensions… but time ain’t one
  16. 16. Two flavors of dynamic embeddings. Trained Together (2015, 2016, 2017, 2018): Bamler and Mandt, arXiv: 1702.08359; Rudolph and Blei, arXiv: 1703.08052. Independently Trained (2015, 2016, 2017, 2018): Kim, Chiu, Hanaki, Hegde and Petrov, arXiv: 1405.3515; Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv: 1411.3315.
  17. 17. The benefits of dynamic Bernoulli embeddings. Dynamic embeddings (Bamler and Mandt, arXiv: 1702.08359; Rudolph and Blei, arXiv: 1703.08052). Data efficient: treats each time slice as a sequential latent variable, enabling time slices with sparse data. Does not require stitching/alignment: treating time slice as a variable ensures embeddings are connected across slices. Static embeddings (Kim, Chiu, Hanaki, Hegde and Petrov, arXiv: 1405.3515; Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv: 1411.3315). Data hungry: needs sufficient data in each time slice for a quality embedding. Requires stitching: each time slice is trained independently, therefore dimensions are not comparable across slices.

  18. 18. Demand for MBAs and PhDs falling. [Charts: All Jobs and Data Science Jobs, 2016 to 2018, vertical axis 0 to 1; series: MBA, PhD]
  19. 19. Data Science skills showing significant shifts. [Charts, 2016 to 2018: Tableau and PowerBI (vertical axis 0 to 3); Python vs Perl (vertical axis 0 to 1); Hadoop vs Spark (vertical axis 0 to 1)]
  20. 20. regression :: Generalized Linear Models as word2vec :: Exponential Family Embeddings
  21. 21. Members of the Exponential Family of Embeddings. Binary Data: Bernoulli Embedding (context, datapoint, context; example: Proficiency programming Python Java C++). Count or Ordinal Data: Poisson Embedding (context, datapoint, context; example: Mini Bagels, Cream cheese, Milk, Coffee, Orange Juice). Continuous Data: Gaussian Embedding (context, datapoint, context; example: JFK-CDG, LGA-DCA, JFK-DFW, LAX-JFK, LAX-LGA).
  22. 22. Members of the Exponential Family of Embeddings. Binary Data: Bernoulli Embedding (context, datapoint, context; example: Proficiency programming Python Java C++). Count or Ordinal Data: Poisson Embedding (context, datapoint, context; example: 10 Guidelines for A/B Testing / Emily Robinson; The Value of Null Results / Angel D’az; Words in Space / Rebecca Bilbro; … / James Powell; Why I Use Julia / Katharine Hyatt).
  23. 23. Poisson embeddings capture item similarities from shopper behavior. Maruchan chicken ramen: Maruchan creamy chicken ramen, Maruchan oriental flavor ramen, Maruchan roast chicken ramen. Yoplait strawberry yogurt: Yoplait apricot mango yogurt, Yoplait strawberry orange smoothie, Yoplait strawberry banana yogurt. High Inner Product Combinations: Old Dutch potato chips & Budweiser Lager beer; Lays potato chips & DiGiorno frozen pizza. Low Inner Product Combinations: General Mills cinnamon toast & Tide Plus detergent; Beef Swanson Broth soup & Campbell Soup cans. Adapted from Rudolph, Ruiz, Mandt and Blei, arXiv: 1608.00778.
  24. 24. How have data science skills changed over time? - Flavors of static word embeddings: The Corpus Issue - Considerations for developing custom embedding models - Flavors of dynamic models: Dynamic Bernoulli embeddings - Other members of the Exponential Family of Embeddings
  25. 25. Thank you PyData NYC! Maryam Jahanshahi Ph.D. Research Scientist TapRecruit http://bit.ly/pydatanyc-emb
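
The window-size point from slide 12 can be made concrete with a small sketch that pairs each word with its neighbours inside a symmetric window. This is only an illustration under simple assumptions (lowercased text, whitespace tokenization, a hypothetical helper named `context_pairs`), not the pipeline used in the talk.

```python
def context_pairs(tokens, window):
    """Return (word, context_word) pairs for each token and every
    neighbour at most `window` positions away on either side."""
    pairs = []
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs


# One of the example sentences from slide 7, lowercased with punctuation dropped.
sentence = "experience in python java or other object-oriented programming languages"
tokens = sentence.split()

small = context_pairs(tokens, window=1)
large = context_pairs(tokens, window=5)

# Context words seen for "python" under each window size.
python_small = sorted(c for w, c in small if w == "python")
python_large = sorted(c for w, c in large if w == "python")
print(python_small)
print(python_large)
```

With the small window, "python" only sees its immediate neighbours, which it shares with substitutes such as "java". With the large window it also picks up topical words such as "programming", which is the kind of signal associated with relatedness rather than similarity.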
