SlideShare a Scribd company logo
1 of 23
Download to read offline
Featurizing log data
before XGBoost
Xavier Conort
Thursday, August 20, 2015 @
● XuetangX, a Chinese MOOC learning platform initiated
by Tsinghua University,
● launched online on Oct 10th, 2013.
● more than 100 Chinese courses and over 260
international courses
● high dropout rate
The competition host
● challenge: predict whether a user will drop a course
within next 10 days based on his or her prior activities.
● data:
○ enrollment_train (120K rows) / enrollment_test (80K rows):
■ Columns: enrollment_id, username, course_id
○ log_train / log_test
■ Columns: enrollment_id, time, source, event, object
○ object
■ Columns: course_id, module_id, category, children, start
○ truth_train
■ Columns: enrollment_id, dropped_out
Problem to solve
Log data 5890
objects
Team
Chief Product Officer Chief Data Scientist
Data Scientist Data Scientist
(O. Zhang)
How we worked as a Team
● worked separately on feature engineering. 90% of
our time was spent here.
● delegated Modeling part to DataRobot to:
○ find best algorithm (with XGboost as a winner!)
○ model text features
○ tune hyperparameters
○ experiment different feature sets and blend 8 XGBoost
using different sets
○ communicate results
Feature engineering techniques used
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text on which we ran
○ SVD on 3grams
○ DataRobot text mining solution
● 20 first components of SVD on user x object
NB: removed duplicated log info and used training + test
sets to build most features
How to build efficient features in R
Key course features
● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval
Key enrollment count features
● log counts
● unique log counts
● ratio between unique log counts over log counts
● unique log counts by event (nagivate, access,
problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5 days, 10
days and 30 days before)
● sequence number of enrollment in that course
Key enrollment time stats
● log time stats (min, mean, max)
● gap between first and last log of enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last logs
● difference between mean log time and mid point
between first and last log
● log interval stats (mean, 90, 99 and 100 quantiles)
Enrollment entropy features
enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before last
logs
● object (when event == problem)
● chapter ids
Example of entropy feature
- log(weekday_log_count / enrollment_log_count) *
weekday_log_count / enrollment_log_count
Sum => weekday_entropy[enrollment_id==1]
1.589988
Enrollment sequence features
● for each enrollment_id, built sequences of
○ weekdays
○ objects
■ all objects / 'problem' and 'video' objects only
○ events
● treated sequences as 4 text variables. Ran for each
○ svd on 3 grams => first 10 components
○ DataRobot stacked predictions from logistic regr.
& Nystroem SVM on (tuned) n-grams
Extract of enrollment object sequences
1/2-grams from Object sequences
DataRobot on Object 1-2 grams
Key user count features and time
stats
● enrollment count
● binary indicator whether user signed up for each of
the 38 courses
● unique log count
● mean log time interval
● sequence number of enrollment for that user
User entropy features
user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
User sequence features
● for each user, built sequences of
○ weekdays
○ chapter_ids
○ events
● treated them as 3 text variables. Ran
○ SVD on 3 grams => first 10 components
○ DataRobot stacked predictions from logistic regr.
+ Nystroem SVM on (tuned) n-grams
How we got to the TOP3
● entropy features mentioned before
● exploited info in
○ log count in the 5 / 10 / 20 days after end of course
○ log counts by event, sign_up counts and day entropy in the next
10 days after end of course
○ time to sign up for new course
○ time until the next log for same user
added ~0.001 to AUC (vs
less powerful features)
added ~0.002 to AUC
XGBoost
Thank you!

More Related Content

What's hot

07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reductionMarco Quartulli
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions odsc
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 
Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Kush Kulshrestha
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangVivian S. Zhang
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoostJoonyoung Yi
 
Introduction to XPath
Introduction to XPathIntroduction to XPath
Introduction to XPathtorp42
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixAndrew Ferlitsch
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmPalin analytics
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization Sourabh Sahu
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringSri Ambati
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsKush Kulshrestha
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 

What's hot (20)

07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
Introduction to XPath
Introduction to XPathIntroduction to XPath
Introduction to XPath
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
 
Bagging.pptx
Bagging.pptxBagging.pptx
Bagging.pptx
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Generalized Reinforcement Learning
Generalized Reinforcement LearningGeneralized Reinforcement Learning
Generalized Reinforcement Learning
 
Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning Algorithms
 
ID3 ALGORITHM
ID3 ALGORITHMID3 ALGORITHM
ID3 ALGORITHM
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 

Viewers also liked

How hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHow hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHackerEarth
 
6 rules of enterprise innovation
6 rules of enterprise innovation6 rules of enterprise innovation
6 rules of enterprise innovationHackerEarth
 
Open Innovation - A Case Study
Open Innovation - A Case StudyOpen Innovation - A Case Study
Open Innovation - A Case StudyHackerEarth
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
Smart Switchboard: An home automation system
Smart Switchboard: An home automation systemSmart Switchboard: An home automation system
Smart Switchboard: An home automation systemHackerEarth
 
USC LIGHT Ministry Introduction
USC LIGHT Ministry IntroductionUSC LIGHT Ministry Introduction
USC LIGHT Ministry IntroductionJeong-Yoon Lee
 
Menstrual Health Reader - mEo
Menstrual Health Reader - mEoMenstrual Health Reader - mEo
Menstrual Health Reader - mEoHackerEarth
 
Druva Casestudy - HackerEarth
Druva Casestudy - HackerEarthDruva Casestudy - HackerEarth
Druva Casestudy - HackerEarthHackerEarth
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?HackerEarth
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command lineSharat Chikkerur
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at ScaleDomino Data Lab
 
Marriage - LIGHT Ministry
Marriage - LIGHT MinistryMarriage - LIGHT Ministry
Marriage - LIGHT MinistryJeong-Yoon Lee
 
Tda presentation
Tda presentationTda presentation
Tda presentationHJ van Veen
 
Vowpal Wabbit
Vowpal WabbitVowpal Wabbit
Vowpal Wabbitodsc
 
HackerEarth Sourcing Solution
HackerEarth Sourcing SolutionHackerEarth Sourcing Solution
HackerEarth Sourcing SolutionHackerEarth
 
DataRobot R Package
DataRobot R PackageDataRobot R Package
DataRobot R PackageDataRobot
 

Viewers also liked (20)

How hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHow hackathons can drive top line revenue growth
How hackathons can drive top line revenue growth
 
Work - LIGHT Ministry
Work - LIGHT MinistryWork - LIGHT Ministry
Work - LIGHT Ministry
 
6 rules of enterprise innovation
6 rules of enterprise innovation6 rules of enterprise innovation
6 rules of enterprise innovation
 
Open Innovation - A Case Study
Open Innovation - A Case StudyOpen Innovation - A Case Study
Open Innovation - A Case Study
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Kill the wabbit
Kill the wabbitKill the wabbit
Kill the wabbit
 
No-Bullshit Data Science
No-Bullshit Data ScienceNo-Bullshit Data Science
No-Bullshit Data Science
 
Smart Switchboard: An home automation system
Smart Switchboard: An home automation systemSmart Switchboard: An home automation system
Smart Switchboard: An home automation system
 
USC LIGHT Ministry Introduction
USC LIGHT Ministry IntroductionUSC LIGHT Ministry Introduction
USC LIGHT Ministry Introduction
 
Menstrual Health Reader - mEo
Menstrual Health Reader - mEoMenstrual Health Reader - mEo
Menstrual Health Reader - mEo
 
Druva Casestudy - HackerEarth
Druva Casestudy - HackerEarthDruva Casestudy - HackerEarth
Druva Casestudy - HackerEarth
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command line
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at Scale
 
Marriage - LIGHT Ministry
Marriage - LIGHT MinistryMarriage - LIGHT Ministry
Marriage - LIGHT Ministry
 
Tda presentation
Tda presentationTda presentation
Tda presentation
 
Vowpal Wabbit
Vowpal WabbitVowpal Wabbit
Vowpal Wabbit
 
HackerEarth Sourcing Solution
HackerEarth Sourcing SolutionHackerEarth Sourcing Solution
HackerEarth Sourcing Solution
 
DataRobot R Package
DataRobot R PackageDataRobot R Package
DataRobot R Package
 

Similar to Featurizing log data before XGBoost

Job Queues Overview
Job Queues OverviewJob Queues Overview
Job Queues Overviewjoeyrobert
 
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Lviv Startup Club
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalJoachim Draeger
 
Agile Testing Analytics
Agile Testing AnalyticsAgile Testing Analytics
Agile Testing AnalyticsQASymphony
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django applicationbangaloredjangousergroup
 
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...GeeksLab Odessa
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data ModelingMatthew Dennis
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Omid Vahdaty
 
Active Learning on Question Answering with Dialogues
 Active Learning on Question Answering with Dialogues Active Learning on Question Answering with Dialogues
Active Learning on Question Answering with DialoguesJinho Choi
 
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk DevelopmentTSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk DevelopmentHarry McLaren
 
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle ContestDA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle ContestBerker Kozan
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Object oriented programming with java pdf
Object oriented programming with java pdfObject oriented programming with java pdf
Object oriented programming with java pdfsonivipin2342
 
Technical Debt Management
Technical Debt ManagementTechnical Debt Management
Technical Debt ManagementMark Niebergall
 
Agile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful InformationAgile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful InformationTechWell
 
Pytest - testing tips and useful plugins
Pytest - testing tips and useful pluginsPytest - testing tips and useful plugins
Pytest - testing tips and useful pluginsAndreu Vallbona Plazas
 
B.sc CSIT 2nd semester C++ unit-1
B.sc CSIT  2nd semester C++ unit-1B.sc CSIT  2nd semester C++ unit-1
B.sc CSIT 2nd semester C++ unit-1Tekendra Nath Yogi
 
Interop 2015: Hardly Enough Theory, Barley Enough Code
Interop 2015: Hardly Enough Theory, Barley Enough CodeInterop 2015: Hardly Enough Theory, Barley Enough Code
Interop 2015: Hardly Enough Theory, Barley Enough CodeJeremy Schulman
 

Similar to Featurizing log data before XGBoost (20)

Job Queues Overview
Job Queues OverviewJob Queues Overview
Job Queues Overview
 
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
Agile Testing Analytics
Agile Testing AnalyticsAgile Testing Analytics
Agile Testing Analytics
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
 
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
 
Application Metrics
Application MetricsApplication Metrics
Application Metrics
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data Modeling
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
Active Learning on Question Answering with Dialogues
 Active Learning on Question Answering with Dialogues Active Learning on Question Answering with Dialogues
Active Learning on Question Answering with Dialogues
 
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk DevelopmentTSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle ContestDA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Object oriented programming with java pdf
Object oriented programming with java pdfObject oriented programming with java pdf
Object oriented programming with java pdf
 
Technical Debt Management
Technical Debt ManagementTechnical Debt Management
Technical Debt Management
 
Agile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful InformationAgile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful Information
 
Pytest - testing tips and useful plugins
Pytest - testing tips and useful pluginsPytest - testing tips and useful plugins
Pytest - testing tips and useful plugins
 
B.sc CSIT 2nd semester C++ unit-1
B.sc CSIT  2nd semester C++ unit-1B.sc CSIT  2nd semester C++ unit-1
B.sc CSIT 2nd semester C++ unit-1
 
Interop 2015: Hardly Enough Theory, Barley Enough Code
Interop 2015: Hardly Enough Theory, Barley Enough CodeInterop 2015: Hardly Enough Theory, Barley Enough Code
Interop 2015: Hardly Enough Theory, Barley Enough Code
 

Recently uploaded

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Recently uploaded (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Featurizing log data before XGBoost

  • 1. Featurizing log data before XGBoost Xavier Conort Thursday, August 20, 2015 @
  • 2. ● XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University, ● launched online on Oct 10th, 2013. ● more than 100 Chinese courses and over 260 international courses ● high dropout rate The competition host
  • 3. ● challenge: predict whether a user will drop a course within next 10 days based on his or her prior activities. ● data: ○ enrollment_train (120K rows) / enrollment_test (80K rows): ■ Columns: enrollment_id, username, course_id ○ log_train / log_test ■ Columns: enrollment_id, time, source, event, object ○ object ■ Columns: course_id, module_id, category, children, start ○ truth_train ■ Columns: enrollment_id, dropped_out Problem to solve
  • 5. Team Chief Product Officer Chief Data Scientist Data Scientist Data Scientist (O. Zhang)
  • 6. How we worked as a Team ● worked separately on feature engineering. 90% of our time was spent here. ● delegated Modeling part to DataRobot to: ○ find best algorithm (with XGboost as a winner!) ○ model text features ○ tune hyperparameters ○ experiment different feature sets and blend 8 XGBoost using different sets ○ communicate results
  • 7. Feature engineering techniques used ● counts ● time statistics (min, mean, max, diff) ● entropy ● sequences treated as text on which we ran ○ SVD on 3grams ○ DataRobot text mining solution ● 20 first components of SVD on user x object NB: removed duplicated log info and used training + test sets to build most features
  • 8. How to build efficient features in R
  • 9. Key course features ● course_id ● first log time ● enrollment counts ● unique log counts ● mean time interval
  • 10. Key enrollment count features ● log counts ● unique log counts ● ratio between unique log counts over log counts ● unique log counts by event (nagivate, access, problem, video, page_close, discussion, wiki) ● unique log counts before end of course (5 days, 10 days and 30 days before) ● sequence number of enrollment in that course
  • 11. Key enrollment time stats ● log time stats (min, mean, max) ● gap between first and last log of enrollment ● gap between enrollment first log and course first log ● gap between enrollment last log and course last logs ● difference between mean log time and mid point between first and last log ● log interval stats (mean, 90, 99 and 100 quantiles)
  • 12. Enrollment entropy features enrollment entropy over ● days ● weekdays ● fraction (4) of weekdays ● hours of the day ● hours of the day for the last 1/3/7 days before last logs ● object (when event == problem) ● chapter ids
  • 13. Example of entropy feature - log(weekday_log_count / enrollment_log_count) * weekday_log_count / enrollment_log_count Sum => weekday_entropy[enrollment_id==1] 1.589988
  • 14. Enrollment sequence features ● for each enrollment_id, built sequences of ○ weekdays ○ objects ■ all objects / 'problem' and 'video' objects only ○ events ● treated sequences as 4 text variables. Ran for each ○ svd on 3 grams => first 10 components ○ DataRobot stacked predictions from logistic regr. & Nystroem SVM on (tuned) n-grams
  • 15. Extract of enrollment object sequences
  • 17. DataRobot on Object 1-2 grams
  • 18. Key user count features and time stats ● enrollment count ● binary indicator whether user signed up for each of the 38 courses ● unique log count ● mean log time interval ● sequence number of enrollment for that user
  • 19. User entropy features user entropy over ● days ● weekdays ● fraction (4) of weekdays ● hours of the day
  • 20. User sequence features ● for each user, built sequences of ○ weekdays ○ chapter_ids ○ events ● treated them as 3 text variables. Ran ○ SVD on 3 grams => first 10 components ○ DataRobot stacked predictions from logistic regr. + Nystroem SVM on (tuned) n-grams
  • 21. How we got to the TOP3 ● entropy features mentioned before ● exploited info in ○ log count in the 5 / 10 / 20 days after end of course ○ log counts by event, sign_up counts and day entropy in the next 10 days after end of course ○ time to sign up for new course ○ time until the next log for same user added ~0.001 to AUC (vs less powerful features) added ~0.002 to AUC