[Figure: system architecture. The Scraping Engine (Event Trigger, Property Reader, Proxy Claimer, Page Element Reader, Page Loader, Scraper, Data Processor) works with Packaging, Delivery, Engine Monitor, Notification, Engine Properties Configuration, Page Element Configuration, Proxy Collector, Notification Template, Scraping Case, and DB.]
Web Scraping Solution
- ScrapeXpress Classic Solution Series -
Author: Andy Yang
Auditor:
Created Date: 13/10/2015
Last Updated: 13/10/2015
Version: 1.3
Dependency Document: BusinessRequirementOfWebScrapingForxxxx_v1.1.docx
1. Overview
2. Business Modules
2.1. Core Module – Scraping Engine
Scraping Engine (also called Engine): one Engine per website.
Execute a scraping task: scrape pre-defined data fields from a web page and convert the result data into a universal data object that other modules can use easily. The Engine invokes the other relevant modules to finish the whole scraping task.
Components:
• Engine Event Trigger
  Create an event and fire it to the Engine Monitor.
• Engine Properties Reader
  Read the engine properties defined by the Engine Properties Configuration.
• Proxy Claimer
  Claim a proxy IP from the Proxy Pool maintained by the Proxy Collector.
• Page Elements Configuration Reader
  Read and analyse the configuration of page elements defined by the Page Element Configuration.
• Page Content Loader
  Accept the URL request, retrieve the web page content, and transform it into a stream of HTML source code.
• Data Scraper
  Parse the HTML content and extract each data field from the stream of HTML source code by invoking a third-party API, then save the data into the universal data object.
• Data Processor
  Save the data into the database or forward it to the Packaging Module.
  Send a notification to the specified user through the Notification Module.
Output
• Invoke the Packaging Module to package the result data into a file of the specified format.
• Invoke the Delivery Module to put the packaged file into the target folder.
• Fire exception events to the Engine Monitor, which handles them.
• Invoke the Notification Module to notify the specific user (Finance DEPT) of the status of the scraping task by email.
Input
• Read the Engine Properties to control how the engine runs.
• Read the Web Page Structure Configuration to scrape the specified data from the web page content correctly.
• Dynamically claim a proxy IP and use it to access the target URL.
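The flow through the components above can be sketched as a small pipeline. This is only an illustrative sketch: the class, interface, and method names and the "SCRAPE_FAILED" event name are assumptions made for this example, and the universal data object is modelled as a plain Map.

```java
import java.util.Map;

/** Hypothetical sketch of one engine run: load page, scrape fields, process result. */
final class ScrapingEngineSketch {

    /** Page Content Loader: URL in, HTML source out. */
    interface PageContentLoader { String load(String url) throws Exception; }

    /** Data Scraper: HTML source in, universal data object (here a Map) out. */
    interface DataScraper { Map<String, String> scrape(String html); }

    /** Data Processor: saves or forwards the universal data object. */
    interface DataProcessor { void process(Map<String, String> record); }

    /** Engine Event Trigger: reports an exception event to the Engine Monitor. */
    interface EventTrigger { void fire(String eventName, Exception cause); }

    private final PageContentLoader loader;
    private final DataScraper scraper;
    private final DataProcessor processor;
    private final EventTrigger trigger;

    ScrapingEngineSketch(PageContentLoader loader, DataScraper scraper,
                         DataProcessor processor, EventTrigger trigger) {
        this.loader = loader;
        this.scraper = scraper;
        this.processor = processor;
        this.trigger = trigger;
    }

    /** Runs one scraping task; returns false if an exception event was fired. */
    boolean run(String url) {
        try {
            String html = loader.load(url);
            Map<String, String> record = scraper.scrape(html);
            processor.process(record);
            return true;
        } catch (Exception e) {
            // Assumed event name; the real event catalogue is not defined here.
            trigger.fire("SCRAPE_FAILED", e);
            return false;
        }
    }
}
```

Keeping each component behind its own interface lets a stage 1 implementation (for example, direct page access) be swapped for a later one (access via proxy) without changing the engine itself.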
2.2. Packaging Module
Accept the result data from the Scraping Engine and convert it into a formatted file, such as Excel or CSV.
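As a rough illustration of the CSV case, the conversion could look like the sketch below. The class and method names are hypothetical, and the quoting follows common CSV practice (RFC 4180 style) rather than anything specified in this document; Excel output would need a separate library and is not shown.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: turns result rows into CSV text. */
final class CsvPackager {

    /** Quotes a field when it contains a comma, a quote, or a line break. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Builds the CSV text: one header line followed by one line per row. */
    static String toCsv(List<String> header, List<List<String>> rows) {
        StringBuilder out = new StringBuilder(line(header));
        for (List<String> row : rows) {
            out.append(line(row));
        }
        return out.toString();
    }

    private static String line(List<String> fields) {
        return fields.stream().map(CsvPackager::escape)
                .collect(Collectors.joining(",")) + "\r\n";
    }
}
```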
2.3. Delivery Module
Accept the delivery command from the Scraping Engine and put the packaged file into the specified folder.
Tip: This module can be extended to deliver the data file in other ways, such as by email.
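One way to leave room for that extension is a small delivery interface with a folder-based implementation. The sketch below is an assumption about how this might be structured, not the actual design; DeliveryChannel and FolderDelivery are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical extension point: folder delivery today, email or others later. */
interface DeliveryChannel {
    void deliver(Path packagedFile) throws IOException;
}

/** Moves the packaged file into a target folder, creating the folder if needed. */
final class FolderDelivery implements DeliveryChannel {
    private final Path targetFolder;

    FolderDelivery(Path targetFolder) {
        this.targetFolder = targetFolder;
    }

    @Override
    public void deliver(Path packagedFile) throws IOException {
        Files.createDirectories(targetFolder);
        Path target = targetFolder.resolve(packagedFile.getFileName());
        // Overwrites a file of the same name left over from an earlier run.
        Files.move(packagedFile, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```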
2.4. Engine Monitor
Accept the exception events fired by the Scraping Engine, generate the message content according to the message template, and invoke the Notification Module to send the message to ITD.
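The hand-off from event to notification might be sketched as follows. All names here are illustrative assumptions; the message builder stands in for the template step, and the Notifier interface stands in for the Notification Module.

```java
import java.util.function.Function;

/** Hypothetical sketch: receives engine events and forwards a message to ITD. */
final class EngineMonitorSketch {

    /** Minimal event payload assumed for this example. */
    static final class EngineEvent {
        final String engineName;
        final String type;
        final String detail;

        EngineEvent(String engineName, String type, String detail) {
            this.engineName = engineName;
            this.type = type;
            this.detail = detail;
        }
    }

    /** Stand-in for the Notification Module. */
    interface Notifier { void send(String recipient, String message); }

    private final Function<EngineEvent, String> messageBuilder;
    private final Notifier notifier;

    EngineMonitorSketch(Function<EngineEvent, String> messageBuilder, Notifier notifier) {
        this.messageBuilder = messageBuilder;
        this.notifier = notifier;
    }

    /** Builds the message for the event and sends it to the ITD recipient. */
    void onEvent(EngineEvent event) {
        notifier.send("ITD", messageBuilder.apply(event));
    }
}
```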
2.5. Notification Module
Accept a message object and send the message to the specified user by email, according to the pre-defined method.
3. Supporting Modules:
3.1. Engine Properties Configurator
Define the properties of the Scraping Engine that are used when the engine executes a scraping task.
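If the properties are eventually kept in a standard Java properties file, reading them could look like this sketch. The property keys (engine.targetUrl, engine.timeoutMillis, engine.useProxy) and their defaults are invented for illustration; the actual property set is not defined in this document.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

/** Hypothetical sketch: typed view over engine properties. */
final class EnginePropertiesSketch {
    final String targetUrl;
    final int timeoutMillis;
    final boolean useProxy;

    private EnginePropertiesSketch(String targetUrl, int timeoutMillis, boolean useProxy) {
        this.targetUrl = targetUrl;
        this.timeoutMillis = timeoutMillis;
        this.useProxy = useProxy;
    }

    /** Reads the properties from any character source, such as a config file reader. */
    static EnginePropertiesSketch read(Reader source) throws IOException {
        Properties p = new Properties();
        p.load(source);
        String url = p.getProperty("engine.targetUrl");
        if (url == null) {
            throw new IllegalArgumentException("engine.targetUrl is required");
        }
        // Optional keys fall back to assumed defaults.
        int timeout = Integer.parseInt(p.getProperty("engine.timeoutMillis", "30000"));
        boolean proxy = Boolean.parseBoolean(p.getProperty("engine.useProxy", "false"));
        return new EnginePropertiesSketch(url, timeout, proxy);
    }
}
```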
3.2. Page Element Configurator
Define each data element that you want to scrape from the web page.
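A page element definition can be pictured as a field name plus a rule for locating its value. The sketch below uses a regular expression as that rule purely to stay self-contained; this is a stand-in, since the document's Data Scraper relies on an unnamed third-party API, and all class and field names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: page element definitions applied to HTML source. */
final class PageElementSketch {

    /** One configured element: a field name and a pattern whose group 1 is the value. */
    static final class PageElement {
        final String fieldName;
        final Pattern pattern;

        PageElement(String fieldName, String regex) {
            this.fieldName = fieldName;
            this.pattern = Pattern.compile(regex, Pattern.DOTALL);
        }
    }

    /** Extracts every configured field; a field that is not found maps to null. */
    static Map<String, String> scrape(String html, List<PageElement> elements) {
        Map<String, String> record = new LinkedHashMap<>();
        for (PageElement element : elements) {
            Matcher m = element.pattern.matcher(html);
            record.put(element.fieldName, m.find() ? m.group(1).trim() : null);
        }
        return record;
    }
}
```

Regular expressions are fragile against real-world HTML, so a proper HTML parser would normally replace the pattern field while the configuration shape stays the same.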
3.3. Proxy Collector
Collect free proxy server IPs from online websites, validate them, and submit the available proxy IPs to the Proxy Pool.
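The relationship between the Proxy Collector (which submits) and the Proxy Claimer (which claims) can be sketched with a simple thread-safe queue. The names below are assumptions, and the validation step is passed in as a predicate because the real connectivity check is not specified here.

```java
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

/** Hypothetical sketch: a pool that only accepts proxies passing validation. */
final class ProxyPoolSketch {
    private final Queue<String> available = new ConcurrentLinkedQueue<>();
    private final Predicate<String> validator;

    ProxyPoolSketch(Predicate<String> validator) {
        this.validator = validator;
    }

    /** Collector side: returns true only if the proxy passed validation and was added. */
    boolean submit(String hostAndPort) {
        if (!validator.test(hostAndPort)) {
            return false;
        }
        return available.add(hostAndPort);
    }

    /** Claimer side: removes and returns the oldest proxy, or empty if none is left. */
    Optional<String> claim() {
        return Optional.ofNullable(available.poll());
    }
}
```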
3.4. Notification Template Management
Create and maintain the template of notification message, so that we can adjust the content
and format depending on the business scenario.
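A minimal sketch of template rendering, assuming a ${name} placeholder syntax that this document does not itself define. The class name and the rule that unknown placeholders are left untouched are choices made for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: fills ${name} placeholders in a message template. */
final class NotificationTemplateSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replaces each known placeholder; unknown placeholders stay as written. */
    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), m.group(0));
            // quoteReplacement keeps '$' and '\' in values from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the template is ordinary text, it can live in a file or table outside the code and be changed without a rebuild.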
4. Scraping Case System
4.1. Scraping Case Builder
• Define a Case.
• Put the Page Element Configuration into Java code.
• Initialise the data, such as the search conditions and the running schedule.
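Defining a case in Java code could take the form of a small builder, as in the sketch below. The ScrapingCase class, its fields, and the "manual" default schedule are assumptions for illustration only.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: an immutable scraping case created through a builder. */
final class ScrapingCase {
    final String name;
    final Map<String, String> searchConditions;
    final String schedule;

    private ScrapingCase(Builder b) {
        this.name = b.name;
        // Defensive copy so later builder changes cannot alter this case.
        this.searchConditions =
                Collections.unmodifiableMap(new LinkedHashMap<>(b.searchConditions));
        this.schedule = b.schedule;
    }

    static Builder named(String name) {
        return new Builder(name);
    }

    static final class Builder {
        private final String name;
        private final Map<String, String> searchConditions = new LinkedHashMap<>();
        private String schedule = "manual"; // assumed default

        private Builder(String name) {
            this.name = name;
        }

        Builder searchCondition(String key, String value) {
            searchConditions.put(key, value);
            return this;
        }

        Builder schedule(String schedule) {
            this.schedule = schedule;
            return this;
        }

        ScrapingCase build() {
            return new ScrapingCase(this);
        }
    }
}
```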
4.2. Scraping Case Controller
• Start a case, or stop a running case.
• Log the status of the scraping process.
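The start, stop, and logging behaviour can be sketched as a small state holder. This example only tracks status and records transitions in memory; actually executing the case, and where the log is written, are left out, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: tracks whether a case is running and logs each change. */
final class ScrapingCaseControllerSketch {
    enum Status { CREATED, RUNNING, STOPPED }

    private final String caseName;
    private final List<String> log = new ArrayList<>();
    private Status status = Status.CREATED;

    ScrapingCaseControllerSketch(String caseName) {
        this.caseName = caseName;
    }

    /** Starts the case; returns false if it is already running. */
    synchronized boolean start() {
        if (status == Status.RUNNING) {
            return false;
        }
        status = Status.RUNNING;
        log.add(caseName + " started");
        return true;
    }

    /** Stops the case; returns false if it is not running. */
    synchronized boolean stop() {
        if (status != Status.RUNNING) {
            return false;
        }
        status = Status.STOPPED;
        log.add(caseName + " stopped");
        return true;
    }

    synchronized Status status() {
        return status;
    }

    /** Returns a read-only snapshot of the status log. */
    synchronized List<String> log() {
        return Collections.unmodifiableList(new ArrayList<>(log));
    }
}
```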
5. Implementation Strategy
We divide the project into 3 stages:
1. Stage 1: Basic Functions
2. Stage 2: Support & Advanced Functions
3. Stage 3: High Level Functions
Stage 1: Basic Functions
Scope
Develop the essential modules and functions so that we can scrape data from the 3 websites mentioned in the requirement document. Some supporting modules and high-level functions are deferred to stage 2 or stage 3.
Module / Functions & Comment

Scraping Engine
• Engine Properties Reader: Only write the properties into Java code instead of reading them from a config file. Reading properties from a config file will be developed in stage 2.
• Page Elements Configuration Reader: Only write the Page Element Configuration into Java code instead of reading it from a configuration file. Reading the Page Element Configuration from a config file will be developed in stage 3.
• Page Content Loader: Only access the target website directly instead of via a proxy server. Access via a proxy server will be in stage 2 or stage 3.
• Proxy Claimer: Only define the interface instead of claiming a proxy IP from the Proxy Pool. Claiming a proxy IP from the Proxy Pool will be in stage 2.
• Engine Event Trigger: Define the essential events to be fired. Depending on the requirements, we will add new events in stage 2 and stage 3.
• Data Scraper: Invoke the third-party API to scrape data from the web page according to the Page Element Configuration, instead of developing the whole data scraping algorithm. We will rewrite the whole algorithm in stage 3.
• Data Processor: Directly save data into the database and package data into a formatted file.

Engine Monitor
• Able to handle these events fired by stage 2.

Packaging Module
• Package data into an Excel file and put it into the specified folder.

Notification Module
• Able to send essential notifications to ITD and Finance DEPT.

Scraping Case Builder
• Create scraping cases for the 3 websites and initialise the search conditions.
• Put the Page Element Configuration into Java code.

Scraping Case Controller
• Provide start and stop functions to run a scraping case and scrape data.
• No logging functions.
Workload Assessment of Stage 1
Jobs / Work load (work day)

Preparation
1. Validate feasibility of technology
2. Prepare development environment and tools
3. Design and confirm data structure / definitions

Coding & Unit Testing
4. Program Scraping Engine
5. Program Engine Monitor
6. Program Packaging Module
6. Program Notification Module

Data Preparation
7. Collect search conditions from 2 websites: Divvy and Parkhound
7. Create and initialise Scraping Case
8. Put the Page Element Configuration into Java code (for 3 websites)

Testing and Deployment
9. Build testing environment and test
10. Build production environment and deploy system (only run as a standalone application)

Maintenance and Document
11. On-site maintenance and bug fixing
12. Write usage instructions (not a technical document)
Stage 2: Support and Advanced Functions
Module / Functions & Comment

Scraping Engine
• Read properties from a config file.
• Access the target website via a proxy server.
• Claim a proxy IP from the Proxy Pool.

Engine Monitor
• Update depending on real requirements.

Notification Module
• Update depending on real requirements.

Scraping Case Controller
• Update depending on real requirements.
• Log the status of the scraping process.

Engine Properties Configurator
• Maintain the engine properties in a config file or table.

Proxy Collector
• Manually write the proxy list into the Proxy Pool.
Stage 3: High Level Functions
Module / Functions & Comment

Scraping Engine
• Read the Page Element Configuration from a config file or table.
• Rewrite the whole data scraping algorithm based on the Page Element Configuration.

Page Element Configurator

Proxy Collector
• Automatically collect proxy IPs from internet websites and validate the connectivity of the proxy IPs.

Notification Template Management
• Define the message template outside the system instead of writing the message content in Java code.

More Related Content

What's hot

Apache course contents
Apache course contentsApache course contents
Apache course contentsdarshangosh
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08Niit Care
 
Architecture In Share Point2010
Architecture In Share Point2010Architecture In Share Point2010
Architecture In Share Point2010Alexander Meijers
 
Share point review qustions
Share point review qustionsShare point review qustions
Share point review qustionsthan sare
 
Oracle Weblogic Server 11g: System Administration I
Oracle Weblogic Server 11g: System Administration IOracle Weblogic Server 11g: System Administration I
Oracle Weblogic Server 11g: System Administration ISachin Kumar
 
Linux VMWare image with Informatica , Oracle and Rundeck scheduler
Linux VMWare image with Informatica , Oracle and Rundeck schedulerLinux VMWare image with Informatica , Oracle and Rundeck scheduler
Linux VMWare image with Informatica , Oracle and Rundeck schedulerpcherukumalla
 
Websphere interview Questions
Websphere interview QuestionsWebsphere interview Questions
Websphere interview Questionsgummadi1
 
Spring review_for Semester II of Year 4
Spring review_for Semester II of Year 4Spring review_for Semester II of Year 4
Spring review_for Semester II of Year 4than sare
 
State management
State managementState management
State managementIblesoft
 
Sharepoint Performance - part 2
Sharepoint Performance - part 2Sharepoint Performance - part 2
Sharepoint Performance - part 2Regroove
 
Weblogic 11g admin basic with screencast
Weblogic 11g admin basic with screencastWeblogic 11g admin basic with screencast
Weblogic 11g admin basic with screencastRajiv Gupta
 
Personalization in webcenter portal
Personalization in webcenter portalPersonalization in webcenter portal
Personalization in webcenter portalVinay Kumar
 
introduction and configuration of IIS (in addition with printer)
introduction and configuration of IIS (in addition with printer)introduction and configuration of IIS (in addition with printer)
introduction and configuration of IIS (in addition with printer)Assay Khan
 

What's hot (20)

WebLogic FAQs
WebLogic FAQsWebLogic FAQs
WebLogic FAQs
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
 
Apache course contents
Apache course contentsApache course contents
Apache course contents
 
Unit5 servlets
Unit5 servletsUnit5 servlets
Unit5 servlets
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
 
Architecture In Share Point2010
Architecture In Share Point2010Architecture In Share Point2010
Architecture In Share Point2010
 
Share point review qustions
Share point review qustionsShare point review qustions
Share point review qustions
 
Oracle Weblogic Server 11g: System Administration I
Oracle Weblogic Server 11g: System Administration IOracle Weblogic Server 11g: System Administration I
Oracle Weblogic Server 11g: System Administration I
 
Web servers
Web serversWeb servers
Web servers
 
SUG Bangalore - Kick Off Session
SUG Bangalore - Kick Off SessionSUG Bangalore - Kick Off Session
SUG Bangalore - Kick Off Session
 
Linux VMWare image with Informatica , Oracle and Rundeck scheduler
Linux VMWare image with Informatica , Oracle and Rundeck schedulerLinux VMWare image with Informatica , Oracle and Rundeck scheduler
Linux VMWare image with Informatica , Oracle and Rundeck scheduler
 
Websphere interview Questions
Websphere interview QuestionsWebsphere interview Questions
Websphere interview Questions
 
Spring review_for Semester II of Year 4
Spring review_for Semester II of Year 4Spring review_for Semester II of Year 4
Spring review_for Semester II of Year 4
 
State management
State managementState management
State management
 
WebLogic for DBAs
WebLogic for DBAsWebLogic for DBAs
WebLogic for DBAs
 
Sharepoint Performance - part 2
Sharepoint Performance - part 2Sharepoint Performance - part 2
Sharepoint Performance - part 2
 
Weblogic 11g admin basic with screencast
Weblogic 11g admin basic with screencastWeblogic 11g admin basic with screencast
Weblogic 11g admin basic with screencast
 
Personalization in webcenter portal
Personalization in webcenter portalPersonalization in webcenter portal
Personalization in webcenter portal
 
Understanding iis part2
Understanding iis part2Understanding iis part2
Understanding iis part2
 
introduction and configuration of IIS (in addition with printer)
introduction and configuration of IIS (in addition with printer)introduction and configuration of IIS (in addition with printer)
introduction and configuration of IIS (in addition with printer)
 

Viewers also liked

Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개Shougo Kim
 
공동주택 지역기반 커뮤니티 SNS 사업계획서
공동주택 지역기반 커뮤니티 SNS 사업계획서공동주택 지역기반 커뮤니티 SNS 사업계획서
공동주택 지역기반 커뮤니티 SNS 사업계획서Seongwon Eun
 
민영미디어렙 도입논의 토론회_091104
민영미디어렙 도입논의 토론회_091104민영미디어렙 도입논의 토론회_091104
민영미디어렙 도입논의 토론회_091104Borah Kang
 
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스Dongjae Lee
 
2015 smr리포트 4차_150428_f
2015 smr리포트 4차_150428_f2015 smr리포트 4차_150428_f
2015 smr리포트 4차_150428_fJay Park
 
모바일/온라인 게임의 매출시뮬레이션
모바일/온라인 게임의 매출시뮬레이션모바일/온라인 게임의 매출시뮬레이션
모바일/온라인 게임의 매출시뮬레이션Sunnyrider
 
2015년 글로벌 광고시장 트렌드
2015년 글로벌 광고시장 트렌드2015년 글로벌 광고시장 트렌드
2015년 글로벌 광고시장 트렌드MezzoMedia
 
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"Buzzvil
 
2016년 미디어 전망(f) 201512
2016년 미디어 전망(f) 2015122016년 미디어 전망(f) 201512
2016년 미디어 전망(f) 201512Nasmedia
 
MezzoMedia - Media & Market Report [12월 호]
MezzoMedia - Media & Market Report [12월 호]MezzoMedia - Media & Market Report [12월 호]
MezzoMedia - Media & Market Report [12월 호]MezzoMedia
 
2017전망보고서 미디어이슈 1215
2017전망보고서 미디어이슈 12152017전망보고서 미디어이슈 1215
2017전망보고서 미디어이슈 1215Nasmedia
 
[메조미디어] 2017년 미디어트렌드리포트
[메조미디어] 2017년 미디어트렌드리포트[메조미디어] 2017년 미디어트렌드리포트
[메조미디어] 2017년 미디어트렌드리포트MezzoMedia
 

Viewers also liked (12)

Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
 
공동주택 지역기반 커뮤니티 SNS 사업계획서
공동주택 지역기반 커뮤니티 SNS 사업계획서공동주택 지역기반 커뮤니티 SNS 사업계획서
공동주택 지역기반 커뮤니티 SNS 사업계획서
 
민영미디어렙 도입논의 토론회_091104
민영미디어렙 도입논의 토론회_091104민영미디어렙 도입논의 토론회_091104
민영미디어렙 도입논의 토론회_091104
 
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
 
2015 smr리포트 4차_150428_f
2015 smr리포트 4차_150428_f2015 smr리포트 4차_150428_f
2015 smr리포트 4차_150428_f
 
모바일/온라인 게임의 매출시뮬레이션
모바일/온라인 게임의 매출시뮬레이션모바일/온라인 게임의 매출시뮬레이션
모바일/온라인 게임의 매출시뮬레이션
 
2015년 글로벌 광고시장 트렌드
2015년 글로벌 광고시장 트렌드2015년 글로벌 광고시장 트렌드
2015년 글로벌 광고시장 트렌드
 
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
 
2016년 미디어 전망(f) 201512
2016년 미디어 전망(f) 2015122016년 미디어 전망(f) 201512
2016년 미디어 전망(f) 201512
 
MezzoMedia - Media & Market Report [12월 호]
MezzoMedia - Media & Market Report [12월 호]MezzoMedia - Media & Market Report [12월 호]
MezzoMedia - Media & Market Report [12월 호]
 
2017전망보고서 미디어이슈 1215
2017전망보고서 미디어이슈 12152017전망보고서 미디어이슈 1215
2017전망보고서 미디어이슈 1215
 
[메조미디어] 2017년 미디어트렌드리포트
[메조미디어] 2017년 미디어트렌드리포트[메조미디어] 2017년 미디어트렌드리포트
[메조미디어] 2017년 미디어트렌드리포트
 

Similar to ScrapeXpress-Standalone-solution

Parallelminds.asp.net with sp
Parallelminds.asp.net with spParallelminds.asp.net with sp
Parallelminds.asp.net with spparallelminder
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08Vivek chan
 
Meteor Meet-up San Diego December 2014
Meteor Meet-up San Diego December 2014Meteor Meet-up San Diego December 2014
Meteor Meet-up San Diego December 2014Lou Sacco
 
oVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
oVirt UI Plugin Infrastructure and the oVirt-Foreman pluginoVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
oVirt UI Plugin Infrastructure and the oVirt-Foreman pluginOved Ourfali
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08Mani Chaubey
 
UI5con 2017 - UI5 Components - More Performance...
UI5con 2017 - UI5 Components - More Performance...UI5con 2017 - UI5 Components - More Performance...
UI5con 2017 - UI5 Components - More Performance...Peter Muessig
 
05 asp.net session07
05 asp.net session0705 asp.net session07
05 asp.net session07Vivek chan
 
Power of ONE Automation through Web Services
Power of ONE Automation through Web ServicesPower of ONE Automation through Web Services
Power of ONE Automation through Web ServicesCA | Automic Software
 
Parallelminds.web partdemo1
Parallelminds.web partdemo1Parallelminds.web partdemo1
Parallelminds.web partdemo1parallelminder
 
Web components - An Introduction
Web components - An IntroductionWeb components - An Introduction
Web components - An Introductioncherukumilli2
 

Similar to ScrapeXpress-Standalone-solution (20)

Parallelminds.asp.net with sp
Parallelminds.asp.net with spParallelminds.asp.net with sp
Parallelminds.asp.net with sp
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
 
Meteor Meet-up San Diego December 2014
Meteor Meet-up San Diego December 2014Meteor Meet-up San Diego December 2014
Meteor Meet-up San Diego December 2014
 
DEVICE CHANNELS
DEVICE CHANNELSDEVICE CHANNELS
DEVICE CHANNELS
 
Java EE Services
Java EE ServicesJava EE Services
Java EE Services
 
oVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
oVirt UI Plugin Infrastructure and the oVirt-Foreman pluginoVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
oVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
 
Synopsis
SynopsisSynopsis
Synopsis
 
Asp.net control
Asp.net controlAsp.net control
Asp.net control
 
KMS (1)
KMS (1)KMS (1)
KMS (1)
 
ASP.NET Lecture 5
ASP.NET Lecture 5ASP.NET Lecture 5
ASP.NET Lecture 5
 
UI5con 2017 - UI5 Components - More Performance...
UI5con 2017 - UI5 Components - More Performance...UI5con 2017 - UI5 Components - More Performance...
UI5con 2017 - UI5 Components - More Performance...
 
ASP.NET Lecture 2
ASP.NET Lecture 2ASP.NET Lecture 2
ASP.NET Lecture 2
 
Server side rendering review
Server side rendering reviewServer side rendering review
Server side rendering review
 
2310 b 15
2310 b 152310 b 15
2310 b 15
 
2310 b 15
2310 b 152310 b 15
2310 b 15
 
05 asp.net session07
05 asp.net session0705 asp.net session07
05 asp.net session07
 
Power of ONE Automation through Web Services
Power of ONE Automation through Web ServicesPower of ONE Automation through Web Services
Power of ONE Automation through Web Services
 
Parallelminds.web partdemo1
Parallelminds.web partdemo1Parallelminds.web partdemo1
Parallelminds.web partdemo1
 
Web components - An Introduction
Web components - An IntroductionWeb components - An Introduction
Web components - An Introduction
 

More from Andy Yang

Jxt job posting
Jxt job postingJxt job posting
Jxt job postingAndy Yang
 
Sae job application export
Sae job application exportSae job application export
Sae job application exportAndy Yang
 
Integration solution with daxtra resume indexing
Integration solution with daxtra resume indexingIntegration solution with daxtra resume indexing
Integration solution with daxtra resume indexingAndy Yang
 
Ctc people product development and release process
Ctc people product development and release processCtc people product development and release process
Ctc people product development and release processAndy Yang
 
One push architecture plugin work with hub & board core
One push architecture   plugin work with hub & board coreOne push architecture   plugin work with hub & board core
One push architecture plugin work with hub & board coreAndy Yang
 
One push architecture plugin and container
One push architecture   plugin and containerOne push architecture   plugin and container
One push architecture plugin and containerAndy Yang
 
One push architecture total architecture
One push architecture   total architectureOne push architecture   total architecture
One push architecture total architectureAndy Yang
 
Onepush platformtotalsolution
Onepush platformtotalsolutionOnepush platformtotalsolution
Onepush platformtotalsolutionAndy Yang
 
eDM system model
eDM system modeleDM system model
eDM system modelAndy Yang
 
eDM infrastructure
eDM infrastructureeDM infrastructure
eDM infrastructureAndy Yang
 

More from Andy Yang (10)

Jxt job posting
Jxt job postingJxt job posting
Jxt job posting
 
Sae job application export
Sae job application exportSae job application export
Sae job application export
 
Integration solution with daxtra resume indexing
Integration solution with daxtra resume indexingIntegration solution with daxtra resume indexing
Integration solution with daxtra resume indexing
 
Ctc people product development and release process
Ctc people product development and release processCtc people product development and release process
Ctc people product development and release process
 
One push architecture plugin work with hub & board core
One push architecture   plugin work with hub & board coreOne push architecture   plugin work with hub & board core
One push architecture plugin work with hub & board core
 
One push architecture plugin and container
One push architecture   plugin and containerOne push architecture   plugin and container
One push architecture plugin and container
 
One push architecture total architecture
One push architecture   total architectureOne push architecture   total architecture
One push architecture total architecture
 
Onepush platformtotalsolution
Onepush platformtotalsolutionOnepush platformtotalsolution
Onepush platformtotalsolution
 
eDM system model
eDM system modeleDM system model
eDM system model
 
eDM infrastructure
eDM infrastructureeDM infrastructure
eDM infrastructure
 

Recently uploaded

Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 

Recently uploaded (20)

Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 

ScrapeXpress-Standalone-solution

  • 1. Scraping Engine Packaging DeliveryEngine Monitor Notification Engine Properties Configuration Page Element Configuration Proxy Collector Notification Template Event Trigger Property Reader Proxy Claimer Page Element Reader Page Loader Data Processor ScraperScraping Case DB Web Scraping Solution - ScrapeXpress Classic Solution Series - Author: Andy Yang Auditor: Created Date: 13/10/2015 Last Updated: 13/10/2015 Version: 1.3 Dependency Document: BusinessRequirementOfWebScrapingForxxxx_v1.1.docx 1. overview
  • 2. 2. Business Modules

    2.1. Core Module – Scraping Engine

    Scraping Engine (also called Engine) - one Engine per website

    Executes a scraping task: scrapes pre-defined data fields from a web page and converts the result data into a universal data object that other modules can use easily. The module invokes the other relevant modules to finish the whole scraping task.

    Components:
    - Engine Event Trigger: creates events and fires them to the Engine Monitor.
    - Engine Properties Reader: reads the engine properties defined by the Engine Properties Configuration.
    - Proxy Claimer: claims a proxy IP from the Proxy Pool maintained by the Proxy Collector.
    - Page Elements Configuration Reader: reads and analyses the configuration of page elements defined by the Page Element Configuration.
    - Page Content Loader: accepts the URL request, retrieves the web page content and transforms it into a stream of HTML source code.
    - Data Scraper: parses the HTML content and extracts each data field from the stream of HTML source code by invoking a third-party API, then saves the data into the universal data object.
    - Data Processor: saves data into the database or forwards it to the Packaging Module; sends notifications to the specified user through the Notification Module.

    Output
    - Invoke the Packaging Module to package the result data into a file of the specified format.
    - Invoke the Delivery Module to put the packaged file into the target folder.
    - Fire exception events to the Engine Monitor, which handles them.
    - Invoke the Notification Module to notify the specific user (Finance DEPT) of the status of the scraping task by email.

  • 3. Input
    - Read the Engine Properties to control how the engine runs.
    - Read the Web Page Structure Configuration to scrape the specified data from the web page content correctly.
    - Dynamically claim a proxy IP and use it to access the target URL.

    2.2. Packaging Module
    Accepts the result data from the Scraping Engine and converts it into a formatted file, such as Excel or CSV.

    2.3. Delivery Module
    Accepts the delivery command from the Scraping Engine and puts the packaged file into the specified folder.
    Tip: this module can be extended to deliver the data file in other ways, such as by email.

    2.4. Engine Monitor
    Accepts the exception events fired by the Scraping Engine, generates the message content from the message template and invokes the Notification Module to send the message to ITD.

    2.5. Notification Module
    Accepts a message object and sends the message to the specified user by email, according to the pre-defined method.

    3. Supporting Modules

    3.1. Engine Properties Configurator
    Defines the properties that the Scraping Engine uses when it executes a scraping task.

    3.2. Page Element Configurator
    Defines each data element that you want to scrape from the web page.

    3.3. Proxy Collector
    Collects free proxy server IPs from online websites, validates them and submits the available proxy IPs to the Proxy Pool.
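The hand-off between the Proxy Collector (which validates and submits proxies) and the Proxy Claimer (which takes one for a scraping run) can be sketched as a small in-memory pool. This is only an illustration under assumptions: the class names `ProxyPool` and `ProxyPoolDemo`, the `host:port` string format and the format-only validator are hypothetical and do not come from the design document, which does not say how the pool is stored or how connectivity is checked.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

/** Hypothetical in-memory Proxy Pool shared by the Proxy Collector and the Proxy Claimer. */
class ProxyPool {
    private final ConcurrentLinkedQueue<String> available = new ConcurrentLinkedQueue<>();
    private final Predicate<String> validator;

    ProxyPool(Predicate<String> validator) {
        this.validator = validator;
    }

    /** Collector side: accept a candidate only if it is new and passes validation. */
    boolean submit(String hostAndPort) {
        if (hostAndPort == null || available.contains(hostAndPort) || !validator.test(hostAndPort)) {
            return false;
        }
        return available.add(hostAndPort);
    }

    /** Claimer side: take the oldest available proxy, or empty when the pool is exhausted. */
    Optional<String> claim() {
        return Optional.ofNullable(available.poll());
    }
}

class ProxyPoolDemo {
    /** Stand-in check of the "a.b.c.d:port" shape only; a real validator would test connectivity. */
    static boolean looksLikeHostPort(String s) {
        return s.matches("\\d{1,3}(\\.\\d{1,3}){3}:\\d{1,5}");
    }

    public static void main(String[] args) {
        ProxyPool pool = new ProxyPool(ProxyPoolDemo::looksLikeHostPort);
        pool.submit("10.0.0.1:8080");   // accepted
        pool.submit("not-a-proxy");     // rejected by the validator
        System.out.println(pool.claim().orElse("no proxy available"));
    }
}
```

Returning an `Optional` from `claim()` leaves the fallback decision (for example, accessing the site directly, as planned for stage 1) to the engine rather than to the pool.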
  • 4. 3.4. Notification Template Management
    Creates and maintains the templates of notification messages, so that the content and format can be adjusted to the business scenario.

    4. Scraping Case System

    4.1. Scraping Case Builder
    - Define a case.
    - Put the Page Element Configuration into Java code.
    - Initialise the data, such as search conditions, running schedule...

    4.2. Scraping Case Controller
    - Start a case, or stop a running case.
    - Log the status of the scraping process.

    5. Implementation Strategy
    We separate the whole project into 3 stages:
    1. Stage 1: Basic Functions
    2. Stage 2: Support & Advanced Functions
    3. Stage 3: High Level Functions

    Stage 1: Basic Functions

    Scope
    Develop the essential modules and functions so that we can scrape data from the 3 websites mentioned in the requirement document. Some supporting modules and high level functions are deferred to stage 2 or stage 3.

    Module / Functions & Comment

    Scraping Engine:
    - Engine Properties Reader: only write the properties into Java code instead of reading them from a config file. Reading properties from a config file will be developed in stage 2.
    - Page Elements Configuration Reader: only write the Page Element Configuration into Java code instead of reading it from a configuration file.
  • 5. Reading the Page Element Configuration from a config file will be developed in stage 3.
    - Page Content Loader: only accesses the target website directly instead of via a proxy server. Access via a proxy server will come in stage 2 or stage 3.
    - Proxy Claimer: only defines the interface instead of claiming a proxy IP from the Proxy Pool. Claiming a proxy IP from the Proxy Pool will come in stage 2.
    - Engine Event Trigger: defines the essential events to be fired. New events will be added in stage 2 and stage 3 according to the requirements.
    - Data Scraper: invokes a third-party API to scrape data from the web page according to the Page Element Configuration, instead of developing a whole data scraping algorithm. We will rewrite the whole algorithm in stage 3.
    - Data Processor: directly saves data into the database and packages data into a formatted file.

    Engine Monitor:
    - Able to [handle] these events fired by stage 2

    Packaging Module:
    - Packages data into an Excel file and puts it into the specified folder.

    Notification Module:
    - Able to send essential notifications to ITD and Finance DEPT.

    Scraping Case Builder:
    - Creates scraping cases for the 3 websites and initialises the search conditions.
    - Puts the Page Element Configuration into Java code.

    Scraping Case Controller:
    - Provides start and stop functions to run a scraping case and scrape data.
    - No log functions.

    Workload Assessment of Stage 1

    Jobs / Work load (work day)

    Preparation
    1  Validate feasibility of technology
    2  Prepare development environment and tools
    3  Design and confirm data structure / definitions
  • 6. Coding & Unit Testing
    4  Program Scraping Engine
    5  Program Engine Monitor
    6  Program Packaging Module
    6  Program Notification Module

    Data Preparation
    7  Collect search conditions from 2 websites: Divvy and Parkhound
    7  Create and initialise Scraping Case
    8  Put the Page Element Configuration into Java code (for 3 websites)

    Testing and Deployment
    9  Build testing environment and test
    10 Build production environment and deploy system (only run as standalone application)

    Maintenance and Document
    11 On-site maintenance and bug fixing
    12 Write usage instructions (not a technical document)

    Stage 2: Support and Advanced Functions

    Module / Functions & Comment

    Scraping Engine:
    - Read properties from a config file.
    - Access the target website via a proxy server.
    - Claim a proxy IP from the Proxy Pool.

    Engine Monitor:
    - Update depending on real requirements.

    Notification Module:
    - Update depending on real requirements.

    Scraping Case Controller:
    - Update depending on real requirements.

    Engine Properties Configurator:
    - Maintain the engine properties in a config file or table.

    Proxy Collector:
    - Manually write the proxy list into the Proxy Pool.

    Scraping Case Controller:
    - Log the status of the scraping process.
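Stage 2 moves the engine properties out of Java code and into a config file. A minimal sketch of an Engine Properties Reader built on `java.util.Properties` might look like the following. The property keys (`engine.targetUrl`, `engine.useProxy`, `engine.timeoutSeconds`), their defaults and the class names are assumptions made for illustration; the design document does not list the actual properties.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical Engine Properties Reader: loads engine settings from a .properties source. */
class EnginePropertiesReader {
    private final Properties props = new Properties();

    EnginePropertiesReader(Reader source) {
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException("cannot read engine properties", e);
        }
    }

    /** Required setting (assumed key): fail early if the target URL is missing. */
    String targetUrl() {
        String url = props.getProperty("engine.targetUrl");
        if (url == null || url.isEmpty()) {
            throw new IllegalStateException("engine.targetUrl is required");
        }
        return url;
    }

    /** Optional setting (assumed key): direct access unless the file enables the proxy. */
    boolean useProxy() {
        return Boolean.parseBoolean(props.getProperty("engine.useProxy", "false"));
    }

    /** Optional setting (assumed key) with an assumed default of 30 seconds. */
    int timeoutSeconds() {
        return Integer.parseInt(props.getProperty("engine.timeoutSeconds", "30").trim());
    }
}

class EnginePropertiesDemo {
    public static void main(String[] args) {
        // In production the Reader would wrap the config file; a string keeps the demo self-contained.
        String config = "engine.targetUrl=https://example.com/search\n"
                      + "engine.useProxy=true\n";
        EnginePropertiesReader reader = new EnginePropertiesReader(new StringReader(config));
        System.out.println(reader.targetUrl() + " proxy=" + reader.useProxy()
                + " timeout=" + reader.timeoutSeconds());
    }
}
```

Because the reader takes any `Reader`, the stage 1 approach (values written in Java code) and the stage 2 approach (values in a file) can sit behind the same class, which would keep the rest of the engine unchanged between stages.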
  • 7. Stage 3: High Level Functions

    Module / Functions & Comment

    Scraping Engine:
    - Read the Page Element Configuration from a config file or table.
    - Rewrite the whole data scraping algorithm, driven by the Page Element Configuration.

    Page Element Configurator

    Proxy Collector:
    - Automatically collect proxy IPs from internet websites and validate their connectivity.

    Notification Template Management:
    - Define the message templates outside the system instead of writing message content in Java code.
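Keeping message templates outside the Java code implies some placeholder substitution at send time. The sketch below shows one possible approach, assuming a `${name}` placeholder syntax. The syntax, the class names and the field names `caseName` and `status` are hypothetical; the design document does not define a template format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical notification template with ${name} placeholders (syntax is an assumption). */
class NotificationTemplate {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");
    private final String text;

    NotificationTemplate(String text) {
        this.text = text;
    }

    /** Replace each known placeholder; unknown ones stay visible so gaps are easy to spot. */
    String render(Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}

class NotificationTemplateDemo {
    public static void main(String[] args) {
        // The template text would be loaded from a file or table maintained outside the code.
        NotificationTemplate template =
                new NotificationTemplate("Scraping case ${caseName} finished with status ${status}.");
        Map<String, String> values = new HashMap<>();
        values.put("caseName", "Divvy");
        values.put("status", "SUCCESS");
        System.out.println(template.render(values));
    }
}
```

With this shape, the Engine Monitor or Notification Module would only supply the values map, and changing the wording of a message would not require a code change.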