This document proposes using machine learning techniques to analyze logs and surface the most relevant ones. It discusses using both unsupervised and supervised learning. Unsupervised techniques like clustering could analyze large amounts of unlabeled data to group similar logs. Supervised learning would involve acquiring labels to train classifiers on what is relevant versus irrelevant. The proposed solution involves normalizing logs, acquiring labels, training models, and then classifying and enhancing new logs. It suggests this could be done at scale using tools like Spark.
2. The Problem - Overlogging
• Millions of logs per week
• Important logs get lost in the clutter
• Need to surface the relevant logs, deemphasize irrelevant logs
3. Proposed Solution
• A Machine Learning approach
• Can sift through large amounts of data
• Can evolve and react to changes in data
• Requires large amounts of data to be effective
5. Unsupervised Machine Learning
• No labels are needed, just lots of data
• Useful when reducing a large number of data points to a smaller set of clusters
6. Unsupervised Machine Learning
"GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.Confi
"GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.
"GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
"GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
"GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253
"GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore¶m1=1.
"GET /twiki/bin/view/Main/PeterThoeny HTTP/1.1" 200 4924
"GET /twiki/bin/edit/Main/Header_checks?topicparent=Main.Configuratio
"GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1" 401 12851
"GET /twiki/bin/view/TWiki/WebTopicEditTemplate HTTP/1.1" 200 3732
"GET /app_dev.php/ HTTP/1.1" 200 6715 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
"GET /bundles/framework/css/body.css HTTP/1.1" 200 6657 "http://my.log-
sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.231
"GET /bundles/framework/css/structure.css HTTP/1.1" 200 1191 "http://my.log-
sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.
"GET /bundles/acmedemo/css/demo.css HTTP/1.1" 200 2204 "http://my.log-
sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311
"GET /bundles/acmedemo/images/welcome-quick-tour.gif HTTP/1.1" 200 4770
"http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3)
AppleWebKit/537.36 (KHTML, like Gecko)
"GET /bundles/acmedemo/images/welcome-demo.gif HTTP/1.1" 200 4053 "http://my.log-
sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3)
AppleWebKit/537.36 (KHTML, like Gecko) Chrom
Nov 20 17:27:55 HANNIBAL MyProgram[13163]: Program started by User 1000
Nov 21 17:27:53 HANNIBAL MyProgram[13163]: Program terminated by User 1000
Nov 21 17:27:58 JANE MyProgram[13163]: Program started by User 555
Nov 23 18:27:53 ARILOU MyProgram[13163]: Program stopped by User 777
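As a rough illustration of how unlabeled lines like the samples above might be grouped, here is a minimal sketch of greedy clustering by token overlap (Jaccard similarity). The function names, the tokenizer, and the 0.5 threshold are assumptions made for this example and are not part of the original proposal; a real deployment would more likely run a library clustering algorithm over vectorized logs.

```python
import re

def tokens(line):
    """Split a log line into a set of lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", line.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster(lines, threshold=0.5):
    """Greedy single-pass clustering.

    Each line joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative_tokens, member_lines)
    for line in lines:
        t = tokens(line)
        for rep, members in clusters:
            if jaccard(t, rep) >= threshold:
                members.append(line)
                break
        else:
            clusters.append((t, [line]))
    return [members for _, members in clusters]

logs = [
    '"GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352',
    '"GET /twiki/bin/view/Main/PeterThoeny HTTP/1.1" 200 4924',
    "Nov 20 17:27:55 HANNIBAL MyProgram[13163]: Program started by User 1000",
    "Nov 21 17:27:58 JANE MyProgram[13163]: Program started by User 555",
]
groups = cluster(logs)
for group in groups:
    print(len(group), "line(s), e.g.", group[0])
```

With these four lines the web-server requests and the syslog entries fall into two separate groups, without any labels being supplied.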
7. Supervised Machine Learning
• Learning from labeled examples
• Requires a well defined question:
• Is this email spam?
• Is this object a car?
• Is this log interesting?
• Deployed successfully in many domains; the most notable classifiers are neural networks (NN), support vector machines (SVM), and Bayesian classifiers
8. Supervised Machine Learning - SVM
• Data elements are arranged in vectors
• Each vector index is assigned a weight in the training phase
• A score is computed by summing up the relevant weights
connection: 0.1
error: 0.5
success: -0.9
failure: 0.3
“Connection failure”: 0.1 + 0.3 = 0.4
“Connection success”: 0.1 - 0.9 = -0.8
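The scoring step on this slide can be sketched in a few lines. This is a simplified illustration that assumes a bag-of-words representation and ignores the bias term a trained linear SVM would normally include; the `score` function and the `WEIGHTS` name are hypothetical, while the weight values are the ones shown on the slide.

```python
# Weights taken from the slide's example; a real model learns them in training.
WEIGHTS = {"connection": 0.1, "error": 0.5, "success": -0.9, "failure": 0.3}

def score(message, weights=WEIGHTS):
    """Sum the weights of the known words that appear in the message."""
    words = set(message.lower().split())
    return sum(w for word, w in weights.items() if word in words)

print(round(score("Connection failure"), 2))  # 0.4
print(round(score("Connection success"), 2))  # -0.8
```

A positive score here would mark the log as interesting and a negative one as uninteresting, which is an assumption about the sign convention rather than something the slide states.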
9. Log Relevancy
• An ill-posed problem
• Relevancy is user specific
• People tend to search for known issues
• There are also unknown unknowns
• Labels are potentially very tedious to acquire
10. Proposed Solution - Labels
• Acquiring labels:
• Implicit/explicit user behavior
• Inter-user similarities
• Public knowledge bases
11. Machine Learning in Practice
• Data is textual, numerical and alphanumerical
• Classifiers that have shown good results:
• Random Forests, which resemble flow-chart decision making
• Linear SVM
• Both classifiers are easy to interpret in the feature space
12. Machine Learning in Practice
connected: -0.157199772246
to provider: -0.15319903564
connected successfully: -0.15319903564
unable: 0.671539714688
topic: 0.678756599452
error: 0.788508324168
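Weights like these are easy to read because each one attaches to a word or a two-word phrase. The sketch below assumes the model uses unigram and bigram features and that a higher total means "more relevant"; both are assumptions, as are the `ngrams` and `relevance` helpers and the two sample messages. Only the weight values come from the slide.

```python
import re

# Feature weights as listed on the slide.
WEIGHTS = {
    "connected": -0.157199772246,
    "to provider": -0.15319903564,
    "connected successfully": -0.15319903564,
    "unable": 0.671539714688,
    "topic": 0.678756599452,
    "error": 0.788508324168,
}

def ngrams(message):
    """Return the unigrams and bigrams of a message, lowercased."""
    words = re.findall(r"[a-z]+", message.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def relevance(message):
    """Sum the weights of every known unigram or bigram in the message."""
    return sum(WEIGHTS.get(gram, 0.0) for gram in ngrams(message))

# Hypothetical messages, ranked from most to least relevant.
messages = ["Connected successfully to provider", "Error: unable to reach topic"]
ranked = sorted(messages, key=relevance, reverse=True)
print(ranked[0])
```

Under these assumptions the error message ranks first, since all of its matching features carry positive weights while the success message only matches negative ones.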
13. Machine Learning in Practice - Modules
• Log normalization
• Label acquisition
• Model training
• Log classification and enhancement
14. Log Normalization
• Lowercase, stem, and remove stop words
• Identify common fields (timestamp, severity, etc.)
• Identify variable, function, and class names
• Identify known reserved words
• Cluster logs that share the same prototype
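The normalization steps above can be sketched roughly as follows. This is an illustration under several assumptions: the input is in the syslog-like format of the sample lines shown earlier, the stop-word list is a tiny placeholder, numbers are replaced by a `<num>` token, and stemming is left out. The `SYSLOG` pattern, `normalize`, and `group_by_prototype` are hypothetical names.

```python
import re
from collections import defaultdict

STOP_WORDS = {"by", "the", "a", "an", "to", "of"}  # tiny illustrative list

# Assumed layout: "Mon DD HH:MM:SS host program[pid]: message"
SYSLOG = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<program>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<message>.*)$"
)

def normalize(line):
    """Extract common fields and reduce the message to a prototype."""
    match = SYSLOG.match(line)
    fields = match.groupdict() if match else {"message": line}
    words = re.findall(r"[a-z]+|\d+", fields["message"].lower())
    # Drop stop words and mask numeric values so variants share one prototype.
    proto = ["<num>" if w.isdigit() else w for w in words if w not in STOP_WORDS]
    fields["prototype"] = " ".join(proto)
    return fields

def group_by_prototype(lines):
    """Cluster raw lines that normalize to the same prototype."""
    groups = defaultdict(list)
    for line in lines:
        groups[normalize(line)["prototype"]].append(line)
    return dict(groups)

logs = [
    "Nov 20 17:27:55 HANNIBAL MyProgram[13163]: Program started by User 1000",
    "Nov 21 17:27:53 HANNIBAL MyProgram[13163]: Program terminated by User 1000",
    "Nov 21 17:27:58 JANE MyProgram[13163]: Program started by User 555",
    "Nov 23 18:27:53 ARILOU MyProgram[13163]: Program stopped by User 777",
]
groups = group_by_prototype(logs)
for prototype, members in groups.items():
    print(prototype, "->", len(members))
```

The two "Program started" lines collapse into one prototype even though their hosts, timestamps, and user IDs differ.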
15. Labeler
• Different sources for labels
• Community question answering (CQA) sites
• Explicit user interaction
• Implicit user interaction
• Heuristics
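One way to combine several label sources is to take the first one that has an opinion, in order of trust. The sketch below is purely illustrative: the priority order (explicit, then implicit, then heuristic), the keyword list, and the function names are assumptions, and labels mined from CQA sites are not modeled.

```python
def heuristic_label(message):
    """Hypothetical keyword heuristic: True = relevant, None = no opinion."""
    keywords = ("error", "unable", "failure")  # illustrative list only
    lowered = message.lower()
    return True if any(k in lowered for k in keywords) else None

def merge_labels(explicit=None, implicit=None, heuristic=None):
    """Return the first available label, preferring more trusted sources."""
    for label in (explicit, implicit, heuristic):
        if label is not None:
            return label
    return None  # unlabeled; leave the log out of the training set

# An explicit "not relevant" from the user overrides the heuristic.
label_a = merge_labels(explicit=False, heuristic=heuristic_label("Connection error"))
# With no user signal, the heuristic decides.
label_b = merge_labels(heuristic=heuristic_label("Unable to connect"))
# No source has an opinion, so the log stays unlabeled.
label_c = merge_labels(heuristic=heuristic_label("Program started by User 1000"))
print(label_a, label_b, label_c)
```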
16. Log Enhancer
• Use knowledge about log events to add prior data
• Suggest solutions to known problems
• Tag relevant logs for display to the user
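A minimal sketch of the enhancement step, assuming a classifier has already produced a relevance score for each log. The `KNOWN_ISSUES` table, its single entry, the `EnhancedLog` type, and the zero threshold are all hypothetical placeholders for whatever knowledge base and cut-off a real system would use.

```python
from dataclasses import dataclass, field

# Hypothetical knowledge base: message pattern -> suggested fix.
KNOWN_ISSUES = {
    "connection failure": "Check that the remote service is reachable.",
}

@dataclass
class EnhancedLog:
    raw: str
    relevant: bool
    suggestions: list = field(default_factory=list)

def enhance(raw, score, threshold=0.0):
    """Tag a log as relevant and attach suggestions for known problems."""
    lowered = raw.lower()
    suggestions = [fix for pattern, fix in KNOWN_ISSUES.items() if pattern in lowered]
    return EnhancedLog(raw=raw, relevant=score > threshold, suggestions=suggestions)

flagged = enhance("Connection failure", score=0.4)
ignored = enhance("Connection success", score=-0.8)
print(flagged)
print(ignored)
```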