4. (Big)Data in a nutshell
• Business Intelligence / Research Evolved
• Significant change in Decision Making
• Enables new Products & Features
• Enables new Business Models
5. Data Scientist
• Has a Business / Research oriented
perspective
• Knowledge of statistics & software
engineering (AI, infrastructure)
• Ability to explore questions and formulate
hypotheses to be tested
6. Data Science Project
• Focused on particular business goals
• Based on a set of important questions
• Result: answers that support business
decisions
8. Start w/ “Big” Questions
... answer them with (Big)Data
How can we understand & improve the conversion rate?
How can we increase customer satisfaction?
How can we find important mentions in social media?
9. Identify Data Sources
OR add more probes / sensors as needed
Google Analytics, Web server logs, Mixpanel, Custom
application metrics, Mouse tracking, Facebook metrics etc.
10. Extract Data
... to a medium that allows you to run arbitrary queries
Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
11. Extract
• Database dump tool, replicas or backups
• External web services
• Apache Sqoop (SQL-to-Hadoop)
• Implement pipelines / real-time streams
• Write custom tools as needed
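As an illustration of the "write custom tools" option, here is a minimal sketch of a table dump to CSV using only the Python standard library. The `orders` table, its columns, and the in-memory SQLite database are hypothetical stand-ins for a real replica or backup; a production extract would more likely use the database's own dump tool or Apache Sqoop.

```python
import csv
import io
import sqlite3


def dump_table(conn, table, out):
    """Write every row of `table` to `out` as CSV and return the row count."""
    # Table names cannot be bound as SQL parameters, so validate them first.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    count = 0
    for row in cursor:  # stream rows instead of loading them all at once
        writer.writerow(row)
        count += 1
    return count


# Hypothetical in-memory database standing in for a production replica.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

buffer = io.StringIO()
rows_written = dump_table(conn, "orders", buffer)
```

Reading from a replica or backup rather than the primary database keeps the extract from competing with production traffic.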
13. Curate - Your Way
• Use or develop tools / scripts
• On large volumes there are no obvious choices
• Custom ways of filtering & aggregating large
streams (e.g. twitter, sensors)
• Reuse existing software components for
data curation / validation
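One possible shape for custom filtering and aggregation over a stream is a chain of Python generators that validates each record and aggregates in a single pass. The record fields `user` and `value` and the sample events are assumptions made for illustration; they do not reflect any real Twitter or sensor API.

```python
from collections import Counter


def valid_records(stream):
    """Yield only records with a non-empty user and a numeric value."""
    for record in stream:
        if record.get("user") and isinstance(record.get("value"), (int, float)):
            yield record


def count_by_user(stream):
    """Aggregate a stream into per-user record counts in a single pass."""
    counts = Counter()
    for record in valid_records(stream):
        counts[record["user"]] += 1
    return counts


# Hypothetical events; a real stream would come from a socket, queue, or log tail.
events = [
    {"user": "ann", "value": 3},
    {"user": "", "value": 1},          # dropped: empty user
    {"user": "bob", "value": "n/a"},   # dropped: non-numeric value
    {"user": "ann", "value": 7.5},
]
counts = count_by_user(iter(events))
```

Because each stage consumes one record at a time, memory use stays constant regardless of how long the stream runs.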
14. DataWrangler
Interactive System for Data
cleaning and transformation
http://vis.stanford.edu/wrangler/
15. Open Refine
Formerly Google Refine
https://github.com/OpenRefine/OpenRefine
17. Why Sample?
• Interactive exploration to create and check
assumptions, to create algorithms
• Be careful with “Statistical Significance”
• Sample Smart: By time, By location etc.
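"Sampling smart" by time or location can be read as stratified sampling: draw a fixed number of records from each group so that no hour or region dominates the sample. The sketch below is one way to do that under this interpretation; the function name `sample_per_group`, the `hour` field, and the page-view records are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_group(records, key, k, seed=0):
    """Return up to `k` randomly chosen records from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for name in sorted(groups):  # sorted for a deterministic group order
        members = groups[name]
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical page views: 10 per hour for hours 0, 1, and 2.
views = [{"hour": h, "page": f"/p{i}"} for h in range(3) for i in range(10)]
sample = sample_per_group(views, key=lambda r: r["hour"], k=3)
```

Swapping the `key` function for one that returns a country or city gives a sample by location. The per-group counts in such a sample no longer reflect the original proportions, which is one reason to treat significance claims drawn from it with care.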