First steps at parsing and analyzing
web server log files at scale
Elias Dabbas
@eliasdabbas
Raw log file
Harvard Dataverse, ecommerce site (zanbil.ir)


3.3 GB, ~1.3M lines
Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
Parse and convert to DataFrame/Table
• Loading and parsing the whole file into memory probably won’t work (or scale)

• Log files are usually not big, they’re huge

• Parse the file sequentially in chunks of lines, save each chunk to a more efficient format (Parquet), then combine the chunks
Log File Analysis
• Reading the data gets even faster after saving the DataFrame to a single optimized file, which is also more convenient to store
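A minimal sketch of the combine step, assuming the chunks sit in one directory and are named file_<n>.parquet, as the logs_to_df function in this deck writes them. The helper names chunk_number and combine_chunks are hypothetical and not from the talk. Sorting by the numeric suffix matters because a plain alphabetical sort would put file_1000000 before file_250000.

```python
from pathlib import Path

import pandas as pd


def chunk_number(path):
    """Return the number in a chunk name: 'file_250000.parquet' -> 250000."""
    return int(Path(path).stem.split('_')[-1])


def combine_chunks(output_dir, combined_file):
    """Concatenate every chunk file in output_dir into a single parquet file."""
    # Sort numerically so rows keep the order they had in the log
    chunk_files = sorted(Path(output_dir).glob('file_*.parquet'), key=chunk_number)
    frames = [pd.read_parquet(chunk) for chunk in chunk_files]
    df = pd.concat(frames, ignore_index=True)
    df.to_parquet(combined_file)
    return df
```

After this runs once, later sessions can load the whole dataset with a single pd.read_parquet call on the combined file.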
Log File Analysis
• Convert to more efficient data types

• Faster writing and reading time
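One possible way to do the type conversion, sketched under assumptions: the column names (datetime, method, status, size) are taken from the named groups of the regex in this deck, the function name optimize_dtypes is hypothetical, and the specific type choices are illustrative, not necessarily the ones used in the talk.

```python
import pandas as pd


def optimize_dtypes(df):
    """Return a copy of a parsed-log DataFrame with leaner column types."""
    out = df.copy()
    # Timestamps look like '22/Jan/2019:03:56:14 +0330'; store them as UTC datetimes
    out['datetime'] = pd.to_datetime(
        out['datetime'], format='%d/%b/%Y:%H:%M:%S %z', utc=True)
    # A handful of distinct values repeated across many rows: categories fit well
    for col in ['method', 'status']:
        out[col] = out[col].astype('category')
    # '-' means no size was logged; coerce it to missing, then use a nullable integer
    out['size'] = pd.to_numeric(out['size'], errors='coerce').astype('Int64')
    return out
```

Comparing df.memory_usage(deep=True) before and after the conversion shows how much each column shrinks.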
Log File Analysis
• Magic provided by:

• Pandas

• Apache Arrow Project

• Apache Parquet Project
Model Name: MacBook Pro
Model Identifier: MacBookPro16,4
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
logs_to_df function
Assumes common (or combined) log format
Can be extended to other formats
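import re
import pandas as pd

def logs_to_df(logfile, output_dir, errors_file):
    # combined_regex and columns are module-level names; columns must list
    # the regex's capture groups in order
    with open(logfile) as source_file:
        linenumber = 0
        parsed_lines = []
        for line in source_file:
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                # Lines that do not match are written to the errors file and skipped
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            # Every 250,000 parsed lines, write the chunk to its own parquet file
            if linenumber % 250_000 == 0:
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        else:
            # End of file: write the remaining lines, if any, so an empty frame
            # never overwrites the last full chunk
            if parsed_lines:
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()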
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)
Regular Expressions Cookbook
by Jan Goyvaerts, Steven Levithan
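To see what the pattern captures, here is a small sketch that applies it to one made-up line in combined log format. The sample line, IP address, and user agent are invented for illustration and do not come from the dataset. The pattern is repeated as a raw string with its backslash escapes written out so the block runs on its own. Deriving the column names from the regex's named groups is one possible way to build the columns list that logs_to_df expects.

```python
import re

combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)

# Invented example line (not from the dataset)
sample = (
    '203.0.113.7 - - [22/Jan/2019:03:56:14 +0330] '
    '"GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 '
    '"https://www.example.com/" "Mozilla/5.0 (compatible; ExampleBot/1.0)"'
)

pattern = re.compile(combined_regex)

# Group names in the order they appear in the pattern
columns = sorted(pattern.groupindex, key=pattern.groupindex.get)

# One dict per line: group name -> captured text
fields = pattern.match(sample).groupdict()
for name in columns:
    print(f'{name:>10}: {fields[name]}')
```

A line that does not follow the format yields no match, which is what sends it to the errors file in logs_to_df.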
Thank you
