https://docs.microsoft.com/ja-jp/azure/architecture/checklist/availability
https://docs.microsoft.com/ja-jp/azure/architecture/
• Availability (this session)
Eliminate single points of failure
All components, services, resources, and compute instances should be deployed as multiple
instances to prevent a single point of failure from affecting availability. This includes
authentication mechanisms. Design the application to be configurable to use multiple instances,
and to automatically detect failures and redirect requests to non-failed instances where the
platform does not do this automatically.
Separate workloads with different service levels
If a service is composed of critical and less-critical workloads, manage them differently and specify
the service features and number of instances to meet their availability requirements.
Understand and minimize dependencies
Minimize the number of different services used where possible, and ensure you understand all of
the feature and service dependencies that exist in the system. This includes the nature of these
dependencies, and the impact of failure or reduced performance in each one on the overall
application. Microsoft guarantees at least 99.9 percent availability for most services, but this
means that every additional service an application relies on potentially reduces the overall
availability SLA of your system by 0.1 percent.
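The arithmetic behind that claim can be sketched as follows. This is a rough illustration only: it assumes every dependency is required for a request to succeed and that failures are independent, which real systems may not satisfy. The function name `composite_availability` is hypothetical.

```python
def composite_availability(slas):
    """Multiply per-service availabilities (fractions between 0 and 1).

    Assumes all services are required and fail independently.
    """
    result = 1.0
    for sla in slas:
        result *= sla
    return result


# Hypothetical application that depends on three services at 99.9 percent each.
three_services = composite_availability([0.999, 0.999, 0.999])
print(f"{three_services:.4%}")  # roughly 99.7 percent
```

Under these assumptions, each added 99.9 percent dependency removes roughly 0.1 percentage points from the composite figure.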
Make tasks and messages idempotent (safely repeatable)
Design tasks and messages to be idempotent so that duplicated requests will not cause problems. For example, a service can act as a consumer
that handles messages sent as requests by other parts of the system that act as producers. If the
consumer fails after processing the message, but before acknowledging that it has been
processed, a producer might submit a repeat request which could be handled by another instance
of the consumer. For this reason, consumers and the operations they carry out should be
idempotent so that repeating a previously executed operation does not render the results invalid.
This may mean detecting duplicated messages, or ensuring consistency by using an optimistic
approach to handling conflicts.
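A minimal sketch of duplicate detection in a consumer, assuming each message carries a unique identifier. The class name, the `message_id` field, and the in-memory set are illustrative assumptions; a real consumer would need a durable record of processed identifiers shared by all consumer instances.

```python
class IdempotentConsumer:
    """Applies each message at most once by remembering processed message IDs."""

    def __init__(self):
        # Assumption: an in-memory set stands in for a durable, shared store.
        self.processed_ids = set()
        self.balance = 0

    def handle(self, message_id, amount):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.processed_ids:
            return False  # repeat delivery: safe to acknowledge and ignore
        self.balance += amount
        self.processed_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
consumer.handle("msg-1", 100)
consumer.handle("msg-1", 100)  # redelivered after a lost acknowledgement
print(consumer.balance)  # the amount is applied only once
```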
Use a message broker to raise the availability of critical transactions
Many scenarios for initiating tasks or accessing remote services use messaging to pass
instructions between the application and the target service. For best performance, the application
should be able to send the message and then return to process more requests, without needing
to wait for a reply. To guarantee delivery of messages, the messaging system should provide high
availability. Azure Service Bus message queues implement at-least-once semantics. This means that
each message posted to a queue will not be lost, although duplicate copies may be delivered
under certain circumstances. If message processing is idempotent (see the previous item),
repeated delivery should not be a problem.
Plan for graceful degradation
Design applications to degrade gracefully when reaching resource limits, and take appropriate action to minimize the impact for the user. In
some cases, the load on the application may exceed the capacity of one or more parts, causing
reduced availability and failed connections. Scaling can help to alleviate this, but it may reach a
limit imposed by other factors, such as resource availability or cost. Design the application so that,
in this situation, it can automatically degrade gracefully. For example, in an ecommerce system, if
the order-processing subsystem is under strain (or has even failed completely), it can be
temporarily disabled while allowing other functionality (such as browsing the product catalog) to
continue. It might be appropriate to postpone requests to a failing subsystem, for example still
enabling customers to submit orders but saving them for later processing, when the orders
subsystem is available again.
Handle sudden bursts of events
Most applications need to handle varying workloads over time, such as peaks first thing in the
morning in a business application or when a new product is released in an ecommerce site. Autoscaling
can help to handle the load, but it may take some time for additional instances to come
online and handle requests. Prevent sudden and unexpected bursts of activity from overwhelming
the application: design it to queue requests to the services it uses and degrade gracefully when
queues are near to full capacity. Ensure that there is sufficient performance and capacity available
under non-burst conditions to drain the queues and handle outstanding requests. For more
information, see the Queue-Based Load Leveling Pattern.
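The queueing idea above can be sketched with a bounded in-process queue. This is an illustration under stated assumptions, not the Queue-Based Load Leveling pattern as documented: `LoadLeveler`, the 80 percent shedding threshold, and the `essential` flag are all hypothetical, and a real system would more likely use a durable message queue.

```python
import queue


class LoadLeveler:
    """Buffers requests in a bounded queue and sheds non-essential work when nearly full."""

    def __init__(self, capacity, shed_threshold=0.8):
        self.requests = queue.Queue(maxsize=capacity)
        # Queue depth at which non-essential requests start being rejected.
        self.shed_at = int(capacity * shed_threshold)

    def submit(self, request, essential=True):
        """Return "queued" or "rejected" without blocking the caller."""
        if not essential and self.requests.qsize() >= self.shed_at:
            return "rejected"  # degrade gracefully: drop optional work first
        try:
            self.requests.put_nowait(request)
            return "queued"
        except queue.Full:
            return "rejected"

    def drain(self, handler, max_items):
        """Process up to max_items queued requests; return how many were handled."""
        handled = 0
        while handled < max_items:
            try:
                item = self.requests.get_nowait()
            except queue.Empty:
                break
            handler(item)
            handled += 1
        return handled
```

A worker would call `drain` at a steady rate, so a burst lengthens the queue instead of overwhelming the downstream service.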
Deploy each service to multiple instances
Microsoft makes availability guarantees for services that you create and deploy, but these
guarantees are only valid if you deploy at least two instances of each role in the service. This
enables one role to be unavailable while the other remains active. This is especially important if
you need to deploy updates to a live system without interrupting clients' activities; instances can
be taken down and upgraded individually while the others continue online.
Host the application in multiple datacenters
Although extremely unlikely, it is possible for an entire datacenter to go offline through an event
such as a natural disaster or Internet failure. Vital business applications should be hosted in more
than one datacenter to provide maximum availability. This can also reduce latency for local users,
and provide additional opportunities for flexibility when updating applications.
Automate and test deployment and maintenance tasks
Distributed applications consist of multiple parts that must work together. Deployment should
therefore be automated, using tested and proven mechanisms such as scripts and deployment
applications. These can update and validate configuration, and automate the deployment process.
Automated techniques should also be used to perform updates of all or parts of applications. It is
vital to test all of these processes fully to ensure that errors do not cause additional downtime. All
deployment tools must have suitable security restrictions to protect the deployed application;
define and enforce deployment policies carefully and minimize the need for human intervention.
Provide a staging environment and a mechanism to swap it with production
Use the platform's staging and production features where these are available. For example, using Azure Cloud Services staging and production
environments allows applications to be switched from one to another instantly through a virtual IP
address swap (VIP Swap). However, if you prefer to stage on-premises, or deploy different versions
of the application concurrently and gradually migrate users, you may not be able to use a VIP
Swap operation.
Understand which configuration changes require a restart, and handle them
Apply configuration changes without recycling the instance when possible. In many cases, the configuration settings for an Azure application or
service can be changed without requiring the role to be restarted. Roles expose events that can be
handled to detect configuration changes and apply them to components within the application.
However, some changes to the core platform settings do require a role to be restarted. When
building components and services, maximize availability and minimize downtime by designing
them to accept changes to configuration settings without requiring the application as a whole to
be restarted.
Use upgrade domains to update without downtime
Azure compute units such as web and worker roles are allocated to upgrade domains. Upgrade
domains group role instances together so that, when a rolling update takes place, each role in the
upgrade domain is stopped, updated, and restarted in turn. This minimizes the impact on
application availability. You can specify how many upgrade domains should be created for a
service when the service is deployed.
(It matters, so we will keep repeating it) Use availability sets
Placing two or more virtual machines in the same availability set guarantees that these virtual
machines will not be deployed to the same fault domain. To maximize availability, you should
create multiple instances of each critical virtual machine used by your system and place these
instances in the same availability set. If you are running multiple virtual machines that serve
different purposes, create a separate availability set for each purpose, and add the instances of each
virtual machine to its own availability set. For example, if you have created separate virtual machines to act
as a web server and a reporting server, create an availability set for the web server and another
availability set for the reporting server. Add instances of the web server virtual machine to the
web server availability set, and add instances of the reporting server virtual machine to the
reporting server availability set.
Replicate data to a remote region
Data in Azure Storage is automatically replicated within a datacenter. For even higher availability,
use read-access geo-redundant storage (RA-GRS), which replicates your data to a secondary
region and provides read-only access to the data in the secondary location. The data is durable
even in the case of a complete regional outage or a disaster.
Replicate databases to a remote region
Azure SQL Database and Cosmos DB both support geo-replication, which enables you to
configure secondary database replicas in other regions. Secondary databases are available for
querying and for failover in the case of a data center outage or the inability to connect to the
primary database. For more information, see Failover groups and active geo-replication (SQL
Database) and How to distribute data globally with Azure Cosmos DB?.
(Where it applies) Go with optimistic concurrency and eventual consistency
Use optimistic concurrency and eventual consistency where possible. Transactions that block access to resources through locking (pessimistic
concurrency) can cause poor performance and considerably reduce availability. These problems
can become especially acute in distributed systems. In many cases, careful design and techniques
such as partitioning can minimize the chances of conflicting updates occurring. Where data is
replicated, or is read from a separately updated store, the data will only be eventually consistent.
But the advantages usually far outweigh the impact on availability of using transactions to ensure
immediate consistency.
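One common way to implement optimistic concurrency is a version check at write time: no lock is held while the caller works, and a write is rejected if the record changed since it was read. The sketch below assumes that approach; `VersionedStore` and `VersionConflict` are hypothetical names, and the dictionary stands in for a real data store.

```python
class VersionConflict(Exception):
    """Raised when a record changed between the caller's read and write."""


class VersionedStore:
    """In-memory stand-in for a store that supports version-checked writes."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key reads as version 0."""
        return self._rows.get(key, (0, None))

    def write(self, key, expected_version, value):
        """Write only if the stored version still matches what the caller read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._rows[key] = (current_version + 1, value)
        return current_version + 1
```

On a `VersionConflict`, the caller would typically re-read the record, reapply its change, and try again.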
Are you backing up with restore in mind?
Back up data with restoration in mind, and ensure the process meets the Recovery Point Objective (RPO). Regularly and automatically back up data
that is not preserved elsewhere, and verify you can reliably restore both the data and the
application itself should a failure occur. Data replication is not a backup feature because errors
and inconsistencies introduced through failure, error, or malicious operations will be replicated
across all stores. The backup process must be secure to protect the data in transit and in storage.
Databases or parts of a data store can usually be recovered to a previous point in time by using
transaction logs. Microsoft Azure provides a backup facility for data stored in Azure SQL Database.
The data is exported to a backup package on Azure blob storage, and can be downloaded to a
secure on-premises location for storage.
For Redis, the Standard tier or higher is recommended
When using Azure Redis Cache, choose the Standard tier or higher to maintain a secondary copy of the
contents.
Set timeouts strategically
Services and resources may become unavailable, causing requests to fail. Ensure that the timeouts
you apply are appropriate for each service or resource as well as the client that is accessing them.
(In some cases, it may be appropriate to allow a longer timeout for a particular instance of a client,
depending on the context and other actions that the client is performing.) Very short timeouts
may cause excessive retry operations for services and resources that have considerable latency.
Very long timeouts can cause blocking if a large number of requests are queued, waiting for a
service or resource to respond.
Retry strategically, too
Design a retry strategy for access to all services and resources where they do not inherently
support automatic connection retry. Use a strategy that includes an increasing delay between
retries as the number of failures increases, to prevent overloading of the resource and to allow it
to gracefully recover and handle queued requests. Continual retries with very short delays are
likely to exacerbate the problem.
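A minimal sketch of a retry loop with an increasing delay, assuming exponential growth (doubling) capped at a maximum. The function name, parameter names, and default values are illustrative assumptions. The injected `sleep` parameter exists only so the delays can be observed in tests.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call operation(), waiting longer after each failure before trying again.

    Assumption: every exception is treated as transient. Real code should
    retry only on errors known to be transient for the service in question.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Delay doubles on each failure: base, 2x base, 4x base, ... up to the cap.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

Many client libraries already implement a retry policy, so a helper like this is only needed where the service or SDK does not retry on its own.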
Knowing when to give up matters too
Stop sending requests when remote services are unavailable. There may be situations in which transient or other faults,
ranging in severity from a partial loss of connectivity to the complete failure of a service, take
much longer than expected to return to normal. Additionally, if a service is very busy, failure in
one part of the system may lead to cascading failures, and result in many operations becoming
blocked while holding onto critical system resources such as memory, threads, and database
connections. Instead of continually retrying an operation that is unlikely to succeed, the
application should quickly accept that the operation has failed, and gracefully handle this failure.
You can use the circuit breaker pattern to reject requests for specific operations for defined
periods. For more information, see Circuit Breaker Pattern.
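A simplified circuit breaker might look like the sketch below. It assumes a consecutive-failure threshold and a fixed reset timeout, after which a single trial call is allowed. The class names, thresholds, and injected `clock` are hypothetical, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up threads and connections.
                raise CircuitOpenError("circuit open; request rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` and can fall back to reduced functionality.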
If one fails, route to another
Fall back to other instances or components to mitigate the impact of a specific service being offline or unavailable. Design applications to take
advantage of multiple instances without affecting operation and existing connections where
possible. Use multiple instances and distribute requests between them, and detect and avoid
sending requests to failed instances, in order to maximize availability.
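As a rough sketch of distributing requests while avoiding failed instances, the pool below rotates through callables that stand in for service instances. `InstancePool` is a hypothetical name. The sketch never re-admits an instance once marked unhealthy, which a real implementation would handle with periodic health probes, and platform load balancers usually do this work for you.

```python
import itertools


class InstancePool:
    """Round-robin over instances, skipping any that have failed."""

    def __init__(self, instances):
        self.instances = list(instances)  # each instance is a callable taking a request
        self.unhealthy = set()            # indexes of instances that raised an error
        self._cycle = itertools.cycle(range(len(self.instances)))

    def send(self, request):
        # Try each instance at most once per request.
        for _ in range(len(self.instances)):
            index = next(self._cycle)
            if index in self.unhealthy:
                continue
            try:
                return self.instances[index](request)
            except Exception:
                self.unhealthy.add(index)  # stop routing to this instance
        raise RuntimeError("no healthy instances available")
```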
If one fails, go elsewhere (advanced)
Fall back to a different service or workflow where possible. For example, if writing to SQL Database fails, temporarily store data in blob
storage. Provide a facility to replay the writes in blob storage to SQL Database when the service
becomes available. In some cases, a failed operation may have an alternative action that allows
the application to continue to work even when a component or service fails. If possible, detect
failures and redirect requests to other services that can offer a suitable alternative functionality, or
to backup or reduced-functionality instances that can maintain core operations while the primary
service is offline.
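The store-and-replay example can be sketched as follows. `FallbackWriter` and `primary_write` are hypothetical names, and a plain list stands in for the durable secondary store (blob storage in the example above). The sketch assumes that replayed writes are idempotent.

```python
class FallbackWriter:
    """Writes to a primary store, deferring records to a secondary store on failure."""

    def __init__(self, primary_write):
        self.primary_write = primary_write  # callable that writes one record
        self.pending = []                   # stand-in for a durable secondary store

    def write(self, record):
        """Return "primary" if written directly, "deferred" if saved for replay."""
        try:
            self.primary_write(record)
            return "primary"
        except Exception:
            self.pending.append(record)
            return "deferred"

    def replay(self):
        """Replay deferred records in order; stop at the first failure."""
        replayed = 0
        while self.pending:
            # If this raises, the record stays in pending for the next replay.
            self.primary_write(self.pending[0])
            self.pending.pop(0)
            replayed += 1
        return replayed
```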
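Document how to handle the failures that are likely to occur
Provide instrumentation for likely failures and failure events to report the situation to operations staff. For failures that are likely but have not yet occurred,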
provide sufficient data to enable operations staff to determine the cause, mitigate the situation,
and ensure that the system remains available. For failures that have already occurred, the
application should return an appropriate error message to the user but attempt to continue
running, albeit with reduced functionality. In all cases, the monitoring system should capture
comprehensive details to enable operations staff to effect a quick recovery, and if necessary, for
designers and developers to modify the system to prevent the situation from arising again.
Notice before it goes down
The health and performance of an application can degrade over time, without being noticeable
until it fails. Implement probes or check functions that are executed regularly from outside the
application. These checks can be as simple as measuring response time for the application as a
whole, for individual parts of the application, for individual services that the application uses, or
for individual components. Check functions can execute processes to ensure they produce valid
results, measure latency and check availability, and extract information from the system.
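A check function of this kind might be sketched as below. `run_probe`, the result dictionary keys, and the latency budget are illustrative assumptions. The `check` callable stands in for whatever request or calculation the probe performs against the application.

```python
import time


def run_probe(check, max_latency_seconds, clock=time.perf_counter):
    """Run check(), timing it, and report whether the result was valid and fast enough."""
    started = clock()
    try:
        valid = bool(check())  # check() should return a truthy value on a valid result
        error = None
    except Exception as exc:
        valid = False
        error = repr(exc)
    latency = clock() - started
    return {
        "healthy": valid and latency <= max_latency_seconds,
        "latency_seconds": latency,
        "error": error,
    }
```

An external scheduler would call `run_probe` at regular intervals and feed the results to the monitoring system.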
Will it really fail over when it counts?
Regularly test failover and fallback systems to ensure they are available and operate as expected. Changes to systems and operations may
affect failover and fallback functions, but the impact may not be detected until the main system
fails or becomes overloaded. Test them before they are required to compensate for a live problem at
runtime.
Everything rests on trust in the monitoring system
Automated failover and fallback systems, and manual visualization of system health and
performance by using dashboards, all depend on monitoring and instrumentation functioning
correctly. If these elements fail, miss critical information, or report inaccurate data, an operator
might not realize that the system is unhealthy or failing.
Losing an entire long-running workflow hurts badly
Track the progress of long-running workflows and retry on failure. Long-running workflows are often composed of multiple steps. Ensure that
each step is independent and can be retried to minimize the chance that the entire workflow will
need to be rolled back, or that multiple compensating transactions need to be executed. Monitor
and manage the progress of long-running workflows by implementing a pattern such
as the Scheduler Agent Supervisor pattern.
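The sketch below shows only the step-level idea: record which steps have completed so that a restart resumes from the failed step instead of rerunning the whole workflow. It is a simplified illustration, not the Scheduler Agent Supervisor pattern itself. `run_workflow` and the `completed` set are hypothetical, and a real system would persist progress durably and assume each step is idempotent.

```python
def run_workflow(steps, completed, max_attempts=3):
    """Run named steps in order, skipping completed ones and retrying failures.

    steps: list of (name, action) pairs, where action is a no-argument callable.
    completed: set of step names already finished; updated in place.
    """
    for name, action in steps:
        if name in completed:
            continue  # finished in an earlier run; do not repeat it
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                completed.add(name)  # checkpoint progress after each step
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # leave completed intact so a later run can resume
    return completed
```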
Mechanisms and drills for wide-area disasters
Create an accepted, fully-tested plan for recovery from any type of failure that may affect system
availability. Choose a multi-site disaster recovery architecture for any mission-critical applications.
Identify a specific owner of the disaster recovery plan, including automation and testing. Ensure
the plan is well-documented, and automate the process as much as possible. Establish a backup
strategy for all reference and transactional data, and test the restoration of these backups
regularly. Train operations staff to execute the plan, and perform regular disaster simulations to
validate and improve the plan.
© 2017 Microsoft Corporation. All rights reserved.
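The information in this document (including attachments and linked content) is current as of the date of creation and is subject to change without notice.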

Azure Design Review Checklist: Availability Edition

  • 7. Eliminate single points of failure. All components, services, resources, and compute instances should be deployed as multiple instances to prevent a single point of failure from affecting availability. This includes authentication mechanisms. Design the application to be configurable to use multiple instances, and to automatically detect failures and redirect requests to non-failed instances where the platform does not do this automatically.
  • 8. Separate workloads with different service levels. If a service is composed of critical and less-critical workloads, manage them differently and specify the service features and number of instances to meet their availability requirements.
  • 9. Understand and minimize dependencies. Minimize the number of different services used where possible, and ensure you understand all of the feature and service dependencies that exist in the system. This includes the nature of these dependencies, and the impact of failure or reduced performance in each one on the overall application. Microsoft guarantees at least 99.9 percent availability for most services, but this means that every additional service an application relies on potentially reduces the overall availability SLA of your system by roughly 0.1 percent.
  • 10. Make tasks and messages idempotent (safely repeatable) so that duplicated requests will not cause problems. For example, a service can act as a consumer that handles messages sent as requests by other parts of the system that act as producers. If the consumer fails after processing the message, but before acknowledging that it has been processed, a producer might submit a repeat request which could be handled by another instance of the consumer. For this reason, consumers and the operations they carry out should be idempotent so that repeating a previously executed operation does not render the results invalid. This may mean detecting duplicated messages, or ensuring consistency by using an optimistic approach to handling conflicts.
  • 11. Use a message broker to raise the availability of critical transactions. Many scenarios for initiating tasks or accessing remote services use messaging to pass instructions between the application and the target service. For best performance, the application should be able to send the message and then return to process more requests, without needing to wait for a reply. To guarantee delivery of messages, the messaging system should provide high availability. Azure Service Bus message queues implement at-least-once semantics. This means that each message posted to a queue will not be lost, although duplicate copies may be delivered under certain circumstances. If message processing is idempotent (see the previous item), repeated delivery should not be a problem.
  • 12. Plan for graceful degradation when reaching resource limits, and take appropriate action to minimize the impact for the user. In some cases, the load on the application may exceed the capacity of one or more parts, causing reduced availability and failed connections. Scaling can help to alleviate this, but it may reach a limit imposed by other factors, such as resource availability or cost. Design the application so that, in this situation, it can automatically degrade gracefully. For example, in an ecommerce system, if the order-processing subsystem is under strain (or has even failed completely), it can be temporarily disabled while allowing other functionality (such as browsing the product catalog) to continue. It might be appropriate to postpone requests to a failing subsystem, for example still enabling customers to submit orders but saving them for later processing, when the orders subsystem is available again.
  • 13. Handle sudden bursts of events. Most applications need to handle varying workloads over time, such as peaks first thing in the morning in a business application or when a new product is released in an ecommerce site. Autoscaling can help to handle the load, but it may take some time for additional instances to come online and handle requests. Prevent sudden and unexpected bursts of activity from overwhelming the application: design it to queue requests to the services it uses and degrade gracefully when queues are near to full capacity. Ensure that there is sufficient performance and capacity available under non-burst conditions to drain the queues and handle outstanding requests. For more information, see the Queue-Based Load Leveling Pattern.
  • 15. Deploy each service to multiple instances. Microsoft makes availability guarantees for services that you create and deploy, but these guarantees are only valid if you deploy at least two instances of each role in the service. This enables one role instance to be unavailable while the other remains active. This is especially important if you need to deploy updates to a live system without interrupting clients' activities; instances can be taken down and upgraded individually while the others continue online.
  • 16. Host the application in multiple datacenters. Although extremely unlikely, it is possible for an entire datacenter to go offline through an event such as a natural disaster or Internet failure. Vital business applications should be hosted in more than one datacenter to provide maximum availability. This can also reduce latency for local users, and provide additional opportunities for flexibility when updating applications.
  • 17. Make deployment and maintenance tasks automatable and testable. Distributed applications consist of multiple parts that must work together. Deployment should therefore be automated, using tested and proven mechanisms such as scripts and deployment applications. These can update and validate configuration, and automate the deployment process. Automated techniques should also be used to perform updates of all or parts of applications. It is vital to test all of these processes fully to ensure that errors do not cause additional downtime. All deployment tools must have suitable security restrictions to protect the deployed application; define and enforce deployment policies carefully and minimize the need for human intervention.
  • 18. Set up a staging environment and a mechanism for swapping it with production, where the platform provides these. For example, using Azure Cloud Services staging and production environments allows applications to be switched from one to another instantly through a virtual IP address swap (VIP Swap). However, if you prefer to stage on-premises, or deploy different versions of the application concurrently and gradually migrate users, you may not be able to use a VIP Swap operation.
  • 19. Understand which configuration changes require a restart, and apply changes without recycling the instance when possible. In many cases, the configuration settings for an Azure application or service can be changed without requiring the role to be restarted. Roles expose events that can be handled to detect configuration changes and apply them to components within the application. However, some changes to the core platform settings do require a role to be restarted. When building components and services, maximize availability and minimize downtime by designing them to accept changes to configuration settings without requiring the application as a whole to be restarted.
  • 20. Keep upgrade domains in mind and update with no downtime. Azure compute units such as web and worker roles are allocated to upgrade domains. Upgrade domains group role instances together so that, when a rolling update takes place, each role in the upgrade domain is stopped, updated, and restarted in turn. This minimizes the impact on application availability. You can specify how many upgrade domains should be created for a service when the service is deployed.
  • 21. (This is important, so it bears repeating.) Use availability sets. Placing two or more virtual machines in the same availability set guarantees that these virtual machines will not be deployed to the same fault domain. To maximize availability, you should create multiple instances of each critical virtual machine used by your system and place these instances in the same availability set. If you are running multiple virtual machines that serve different purposes, create an availability set for each virtual machine role, and add the instances of that role to its availability set. For example, if you have created separate virtual machines to act as a web server and a reporting server, create an availability set for the web server and another availability set for the reporting server. Add instances of the web server virtual machine to the web server availability set, and add instances of the reporting server virtual machine to the reporting server availability set.
  • 23. Replicate data to a remote location. Data in Azure Storage is automatically replicated within a datacenter. For even higher availability, use read-access geo-redundant storage (RA-GRS), which replicates your data to a secondary region and provides read-only access to the data in the secondary location. The data is durable even in the case of a complete regional outage or a disaster.
  • 24. Replicate databases to a remote location. Azure SQL Database and Cosmos DB both support geo-replication, which enables you to configure secondary database replicas in other regions. Secondary databases are available for querying and for failover in the case of a datacenter outage or the inability to connect to the primary database. For more information, see Failover groups and active geo-replication (SQL Database) and How to distribute data globally with Azure Cosmos DB?.
  • 25. Use optimistic concurrency control and eventual consistency where possible. Transactions that block access to resources through locking (pessimistic concurrency) can cause poor performance and considerably reduce availability. These problems can become especially acute in distributed systems. In many cases, careful design and techniques such as partitioning can minimize the chances of conflicting updates occurring. Where data is replicated, or is read from a separately updated store, the data will only be eventually consistent. But the advantages usually far outweigh the impact on availability of using transactions to ensure immediate consistency.
  • 26. Back up with restoration in mind, and ensure the backup meets the Recovery Point Objective (RPO). Regularly and automatically back up data that is not preserved elsewhere, and verify you can reliably restore both the data and the application itself should a failure occur. Data replication is not a backup feature because errors and inconsistencies introduced through failure, error, or malicious operations will be replicated across all stores. The backup process must be secure to protect the data in transit and in storage. Databases or parts of a data store can usually be recovered to a previous point in time by using transaction logs. Microsoft Azure provides a backup facility for data stored in Azure SQL Database. The data is exported to a backup package on Azure blob storage, and can be downloaded to a secure on-premises location for storage.
  • 27. For Redis, the Standard tier or higher is recommended. When using Azure Redis Cache, choose the Standard option to maintain a secondary copy of the contents.
  • 29. Set timeouts strategically. Services and resources may become unavailable, causing requests to fail. Ensure that the timeouts you apply are appropriate for each service or resource as well as the client that is accessing them. (In some cases, it may be appropriate to allow a longer timeout for a particular instance of a client, depending on the context and other actions that the client is performing.) Very short timeouts may cause excessive retry operations for services and resources that have considerable latency. Very long timeouts can cause blocking if a large number of requests are queued, waiting for a service or resource to respond.
  • 30. Retry strategically, too. Design a retry strategy for access to all services and resources where they do not inherently support automatic connection retry. Use a strategy that includes an increasing delay between retries as the number of failures increases, to prevent overloading of the resource and to allow it to gracefully recover and handle queued requests. Continual retries with very short delays are likely to exacerbate the problem.
  • 31. Knowing when to give up also matters when remote services are unavailable. There may be situations in which transient or other faults, ranging in severity from a partial loss of connectivity to the complete failure of a service, take much longer than expected to return to normal. Additionally, if a service is very busy, failure in one part of the system may lead to cascading failures, and result in many operations becoming blocked while holding onto critical system resources such as memory, threads, and database connections. Instead of continually retrying an operation that is unlikely to succeed, the application should quickly accept that the operation has failed, and gracefully handle this failure. You can use the circuit breaker pattern to reject requests for specific operations for defined periods. For more information, see Circuit Breaker Pattern.
  • 32. If one instance fails, connect to another to mitigate the impact of a specific service being offline or unavailable. Design applications to take advantage of multiple instances without affecting operation and existing connections where possible. Use multiple instances and distribute requests between them, and detect and avoid sending requests to failed instances, in order to maximize availability.
  • 33. If one service fails, fall back to another (advanced version) where possible. For example, if writing to SQL Database fails, temporarily store data in blob storage. Provide a facility to replay the writes in blob storage to SQL Database when the service becomes available. In some cases, a failed operation may have an alternative action that allows the application to continue to work even when a component or service fails. If possible, detect failures and redirect requests to other services that can offer suitable alternative functionality, or to backup or reduced-functionality instances that can maintain core operations while the primary service is offline.
  • 35. Document how to handle failures that are likely to occur, and report the situation to operations staff. For failures that are likely but have not yet occurred, provide sufficient data to enable operations staff to determine the cause, mitigate the situation, and ensure that the system remains available. For failures that have already occurred, the application should return an appropriate error message to the user but attempt to continue running, albeit with reduced functionality. In all cases, the monitoring system should capture comprehensive details to enable operations staff to effect a quick recovery, and if necessary, for designers and developers to modify the system to prevent the situation from arising again.
  • 36. Notice before it goes down. The health and performance of an application can degrade over time, without being noticeable until it fails. Implement probes or check functions that are executed regularly from outside the application. These checks can be as simple as measuring response time for the application as a whole, for individual parts of the application, for individual services that the application uses, or for individual components. Check functions can execute processes to ensure they produce valid results, measure latency and check availability, and extract information from the system.
  • 37. Will it really fail over when it counts? Test all failover and fallback systems to ensure they are available and operate as expected. Changes to systems and operations may affect failover and fallback functions, but the impact may not be detected until the main system fails or becomes overloaded. Test them before they are required to compensate for a live problem at runtime.
  • 38. Everything rests on a trustworthy monitoring system. Automated failover and fallback systems, and manual visualization of system health and performance by using dashboards, all depend on monitoring and instrumentation functioning correctly. If these elements fail, miss critical information, or report inaccurate data, an operator might not realize that the system is unhealthy or failing.
  • 39. Losing an entire long-running workflow hurts, so track its progress and retry on failure. Long-running workflows are often composed of multiple steps. Ensure that each step is independent and can be retried to minimize the chance that the entire workflow will need to be rolled back, or that multiple compensating transactions need to be executed. Monitor and manage the progress of long-running workflows by implementing a pattern such as the Scheduler Agent Supervisor Pattern.
  • 40. Mechanisms and drills for wide-area disasters. Create an accepted, fully tested plan for recovery from any type of failure that may affect system availability. Choose a multi-site disaster recovery architecture for any mission-critical applications. Identify a specific owner of the disaster recovery plan, including automation and testing. Ensure the plan is well documented, and automate the process as much as possible. Establish a backup strategy for all reference and transactional data, and test the restoration of these backups regularly. Train operations staff to execute the plan, and perform regular disaster simulations to validate and improve the plan.
  • 41. © 2017 Microsoft Corporation. All rights reserved. The information herein (including attachments and linked content) is current as of the date of creation and is subject to change without notice.
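
The SLA arithmetic on slide 9 can be checked with a short calculation. This is a minimal sketch that assumes every service in the chain must be up for the application to work and that failures are independent; the function names are illustrative and not part of any Azure SDK.

```python
from math import prod

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def composite_availability(slas):
    """Availability of a serial chain of services.

    Assumption: the application needs every service, and failures are
    independent, so the individual availabilities multiply.
    """
    return prod(slas)


def expected_downtime_minutes(availability):
    """Expected downtime over a 30-day month for a given availability."""
    return (1.0 - availability) * MINUTES_PER_30_DAY_MONTH


if __name__ == "__main__":
    for count in (1, 2, 3, 5):
        a = composite_availability([0.999] * count)
        print(f"{count} services at 99.9%: {a:.4%} available, "
              f"about {expected_downtime_minutes(a):.0f} min/month down")
```

Each added 99.9 percent dependency multiplies in another factor of 0.999, which is where the rule of thumb of roughly 0.1 percent per service comes from.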
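
Slide 10 mentions detecting duplicated messages as one way to make a consumer idempotent. The sketch below illustrates that idea under simplifying assumptions: each message carries a unique ID chosen by the producer, and the set of processed IDs lives in memory. The class and method names are hypothetical. With several consumer instances, the set of seen IDs would need to live in a durable store shared by all of them.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by a producer-assigned ID."""

    def __init__(self):
        self.processed_ids = set()  # assumption: in-memory only, for illustration
        self.balance = 0

    def handle(self, message_id, amount):
        """Apply the message unless it was already seen.

        Returns True if the message was applied, False if it was a duplicate.
        """
        if message_id in self.processed_ids:
            return False  # redelivery after a lost acknowledgement: ignore
        self.balance += amount
        self.processed_ids.add(message_id)
        return True


if __name__ == "__main__":
    consumer = IdempotentConsumer()
    consumer.handle("order-1", 10)
    consumer.handle("order-1", 10)  # duplicate delivery has no further effect
    print(consumer.balance)
```

With an at-least-once broker such as the one described on slide 11, this kind of check is what makes repeated delivery harmless.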
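
The increasing-delay retry strategy on slide 30 could look roughly like the following. This is a sketch, not a prescribed implementation: `TransientError` is a hypothetical marker for faults worth retrying, the default delays are arbitrary, and the small random jitter is an addition that the slide does not mention.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (timeouts, throttling)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation(), doubling the wait after each transient failure.

    The last failure is re-raised once max_attempts is reached. The sleep
    function is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter (an assumption) spreads out clients that fail together.
            sleep(delay + random.uniform(0, delay / 10))
```

Only errors known to be transient are retried here; retrying a permanent failure, such as a validation error, would just add load.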
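
Slide 31 refers to the circuit breaker pattern for rejecting requests during defined periods. A minimal single-threaded sketch of the idea follows. The class name, threshold, and timeout values are assumptions chosen for illustration, and a production version would also need thread safety and per-operation state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock            # injectable for testing
        self.failures = 0
        self.opened_at = None         # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; request rejected")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of holding threads and connections while waiting on a service that is unlikely to respond.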
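
The fallback on slide 33 (store writes elsewhere when SQL Database is unavailable, then replay them) can be sketched without any Azure dependency. In this illustration the primary store is any callable that raises on failure and the fallback is an in-memory list standing in for blob storage; both are assumptions, and the names are hypothetical.

```python
class FallbackWriter:
    """Writes to a primary store, parking records locally when it fails."""

    def __init__(self, primary_write):
        self.primary_write = primary_write  # callable; raises when unavailable
        self.pending = []                   # stand-in for blob storage

    def write(self, record):
        """Try the primary store; on failure keep the record for replay."""
        try:
            self.primary_write(record)
            return "primary"
        except Exception:
            self.pending.append(record)
            return "fallback"

    def replay(self):
        """Push parked records to the primary store, oldest first.

        Stops by raising at the first record that still fails, leaving it
        and everything after it in the pending list.
        """
        replayed = 0
        while self.pending:
            self.primary_write(self.pending[0])
            self.pending.pop(0)
            replayed += 1
        return replayed
```

Replayed records can arrive after newer writes that went straight to the primary store, so this approach fits best when writes are idempotent and order-insensitive, as slide 10 recommends.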