Big Data:
the weakest link
Vivek Nair, Tim Menzies
{vivekaxl,tim.menzies}@gmail.com
HPCC Eng. Summit - Sept 29, 2015
Where is the weakest link?
Premise of Big Data
Analysis is a “systems” task?
• Better conclusions = same algorithms + more data + more CPU
• If so, then …
– No role for human error
– All insight is auto-generated from CPUs
Analysis is a “human” task?
• Current results on “software analytics”
– A human-intensive process
Q: Is Big Data a “Systems” or “Human”-task?
A: Yes
Code used in my last paper (1100 LOC of Python calling scikit-learn)
Use a higher-level language?
• ECL solves this problem?
• But if you can write it quick,
– you can write it wrong, quick.
Is this really a problem?
• Q: What would we expect to see if…
– Top experts, publishing in top journals
– Many of the same data sets
– 8 years of trying
• A:
– Perhaps some upward progress
– Perhaps a little less variance
So, what do we see?
• Software analytics
– Defect prediction
– Many of the same learners
– Many of the same data sets
• 42 papers, top journals
• 23 author groups
• 2002 to 2010
• Y-axis measures mean performance
Martin Shepperd, David Bowes, and Tracy Hall. "Researcher Bias: The Use of Machine Learning in Software Defect Prediction." IEEE Transactions on Software Engineering, 40(6), June 2014.
http://fivethirtyeight.com/features/science-isnt-broken/
A little theory
• James D. Herbsleb, CMU
• Socio-Technical Coordination
• A predictor for higher defects:
– Groups of programmers work on similar functions,
– but then do not share that expertise
Q: How to find expertise groups within the HPCC community?
A: Using data mining
Static features and commit history can act as a cue for expertise
● Our motivation
o “relation between embodiment and language acquisition by locating the ‘minimal set of necessary features’ that enable language of any kind to be learned” - The Philosophy of Expertise
Software analytics results: learn predictors for expertise
● “...counts of the cumulative number of different developers changing a file over its lifetime can help to improve defect predictions…” [1]
● “Quantify person's experience with a part of code using change history of the code” [2]
● “RevFinder, a file location-based code-reviewer recommendation approach” [3]
● “30% of its code entities has more than 0.3 of similarity with at least one developer vocabulary” [4]
[1] Ostrand, Thomas J., Elaine J. Weyuker, and Robert M. Bell. "Programmer-based fault prediction." Proceedings of the 6th International Conference on Predictive Models in Software Engineering. ACM, 2010.
[2] Mockus, Audris, and James D. Herbsleb. "Expertise browser: a quantitative approach to identifying expertise." Proceedings of the 24th International Conference on Software Engineering. ACM, 2002.
[3] Thongtanunam, Patanamon, et al. "Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review." Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015.
[4] Santos, Katyusco de F., Dalton DS Guerrero, and Jorge CA de Figueiredo. "Using Developers Contributions on Software Vocabularies to Identify Experts." Information Technology-New Generations (ITNG), 2015 12th International Conference on. IEEE, 2015.
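The first two measures quoted above (distinct developers per file, and a person's change history on a piece of code) can be sketched as below. This is a minimal illustration, not the method of the cited papers: the commit records, author names, file paths, and both function names are hypothetical, and a real study would read them from a version-control log.

```python
from collections import Counter, defaultdict

# Hypothetical commit records: (author, files touched in that commit).
COMMITS = [
    ("alice", ["core/join.ecl", "core/sort.ecl"]),
    ("bob", ["core/join.ecl"]),
    ("alice", ["core/join.ecl"]),
    ("carol", ["docs/intro.md", "core/join.ecl"]),
]


def developers_per_file(commits):
    """Cumulative number of distinct developers who changed each file."""
    authors = defaultdict(set)
    for author, files in commits:
        for path in files:
            authors[path].add(author)
    return {path: len(names) for path, names in authors.items()}


def changes_by_developer(commits, path):
    """Number of commits each developer made to one file (an experience proxy)."""
    counts = Counter()
    for author, files in commits:
        if path in files:
            counts[author] += 1
    return counts


per_file = developers_per_file(COMMITS)
experience = changes_by_developer(COMMITS, "core/join.ecl")
```

Under these assumptions, a file touched by many different developers would be flagged as a candidate for closer inspection, and the per-developer counts suggest who knows that file best.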
Q: And what data mining suite will we use to mine data about programmers?
• A: Need you ask?
Source Code
But what are we clustering?
Developer products
• Lightweight parsing of source code
• Developer profiles, accessed via social media sites
Languages Used
Skill Set (self promotion)
Data processing
1. GitHub repos (for code) ➔ Social media (for years of work)
2. Static code analysis: frequency counts of AST features (e.g. counts of loops, returns, variable comparisons, map, etc.)
3. Bayes classifier
Early career vs. later career
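Step 2 above (frequency counts of AST features) might look like the following sketch. It assumes Python source and uses Python's standard `ast` module as a stand-in for whatever parser the study used; the feature names in `FEATURES`, the `map_calls` key, and the `SAMPLE` snippet are illustrative choices, since the slides give only examples and not the full feature list.

```python
import ast
from collections import Counter

# Hypothetical feature set, based on the examples named on the slide.
FEATURES = {
    "loops": (ast.For, ast.While),
    "returns": (ast.Return,),
    "comparisons": (ast.Compare,),
}


def ast_feature_counts(source: str) -> Counter:
    """Count selected AST node types in one Python source string."""
    counts = Counter({name: 0 for name in FEATURES})
    for node in ast.walk(ast.parse(source)):
        for name, node_types in FEATURES.items():
            if isinstance(node, node_types):
                counts[name] += 1
        # Treat a direct call to the name `map` as the "map" feature.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "map"):
            counts["map_calls"] += 1
    return counts


SAMPLE = '''
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s

doubled = list(map(lambda v: v * 2, [1, 2, 3]))
'''

counts = ast_feature_counts(SAMPLE)
```

Each source file then becomes one row of counts, which is the kind of input a classifier such as Naive Bayes can consume.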
Classification
- Features: nodes of the AST
- Algorithms used: SimpleCart, Random Forest, Naive Bayes, etc.
- Can distinguish expert from novice programmers
• precision = 78% early career
• precision = 74% later career
* Using Weka
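To make the classification step and the precision figures concrete, here is a small pure-Python sketch of a multinomial Naive Bayes classifier over feature counts, plus a precision calculation. It is not the Weka setup used in the talk: the function names, the add-one smoothing choice, and all training and test rows are made up for illustration, so the resulting numbers say nothing about the reported 78% and 74%.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Fit a multinomial Naive Bayes model on feature-count dictionaries."""
    class_totals = Counter(labels)
    feature_totals = defaultdict(Counter)
    vocabulary = set()
    for row, label in zip(rows, labels):
        feature_totals[label].update(row)
        vocabulary.update(row)
    return class_totals, feature_totals, sorted(vocabulary)


def predict(model, row):
    """Return the class with the highest log-posterior for one row."""
    class_totals, feature_totals, vocabulary = model
    n = sum(class_totals.values())
    best_label, best_score = None, -math.inf
    for label, count in class_totals.items():
        total = sum(feature_totals[label].values())
        score = math.log(count / n)
        for feature in vocabulary:
            # Add-one smoothing avoids log(0) for features unseen in a class.
            p = (feature_totals[label][feature] + 1) / (total + len(vocabulary))
            score += row.get(feature, 0) * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision(actual, predicted, target):
    """Precision = true positives / all rows predicted as `target`."""
    hits = [a for a, p in zip(actual, predicted) if p == target]
    return sum(a == target for a in hits) / len(hits) if hits else 0.0


# Made-up AST feature counts, one dictionary per developer.
train_rows = [
    {"loops": 5, "returns": 1, "map_calls": 0},
    {"loops": 4, "returns": 2, "map_calls": 0},
    {"loops": 1, "returns": 3, "map_calls": 4},
    {"loops": 0, "returns": 4, "map_calls": 5},
]
train_labels = ["early", "early", "later", "later"]
test_rows = [
    {"loops": 6, "returns": 1, "map_calls": 0},
    {"loops": 1, "returns": 2, "map_calls": 3},
]
test_labels = ["early", "later"]

model = train_naive_bayes(train_rows, train_labels)
predicted = [predict(model, row) for row in test_rows]
early_precision = precision(test_labels, predicted, "early")
```

Precision is reported per class here, which mirrors how the slide lists separate figures for early-career and later-career developers.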
Current status
The good news
• Can auto-find groups of better programmers
• Can do that for very large data sets
– The ECL advantages
The other news
• Seeking larger data sets
• Talking to HackerRank
• Looking at ways to instrument the HPCC forums
– Matchmaker tools
– Affinity groups
Where is the weakest link?
We can make that link stronger
Acknowledgements:
Thanks to funding from LexisNexis

Analyzing Big Data's Weakest Link (hint: it might be you)