Data recovery consistency with check db
2. Data Recovery & Consistency with CHECKDB with SQL Server
Vinod Kumar, Technology Evangelist - Microsoft
@vinodk_sql
www.ExtremeExperts.com
http://blogs.sqlxml.org/vinodkumar
3. Why Is This Session Important?
- Corruption does happen, mostly caused by the I/O subsystem
- People don’t realize they have corruption until it is too late
- People don’t know what to do when they do have corruption, leading to:
  - More data loss and downtime than necessary
  - Monetary and even job losses
4. What Can Happen to an Unprepared DBA Confronted by Corruption?
5. Session Takeaways
From this session you will learn:
- The significance of CHECKDB
- Guidance and options after corruption
- How to get the database back online
- How to distinguish repair vs. restore
DON’T TRY this on your production environments
6. Agenda
- Discovering corruption
- Interpreting CHECKDB output
- Choosing between restore and repair
- Recovering from a ‘last resort’
With demos of common scenarios
7. I/O Errors
- Three types:
  - 823 (a hard I/O error)
  - 824 (a soft I/O error)
  - 825 (a read-retry error)
- Nice error messages in 2005+:
  Msg 824, Level 24, State 2, Line 1
  SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'c:roken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
- Logged in msdb..suspect_pages
  - Input into single-page restore operations
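The 824 message above carries the location of the damaged page, which is what a single-page restore needs. As a rough sketch (not part of the original session), the following Python pulls those fields out of the message text. The pattern assumes the exact wording quoted on the slide, and the function name `parse_824` is hypothetical.

```python
import re

# Assumption: the message uses the wording quoted on the slide.
MSG_824 = re.compile(
    r"incorrect checksum \(expected: (?P<expected>0x[0-9a-fA-F]+); "
    r"actual: (?P<actual>0x[0-9a-fA-F]+)\)\. "
    r"It occurred during a read of page \((?P<file_id>\d+):(?P<page_id>\d+)\) "
    r"in database ID (?P<database_id>\d+)"
)


def parse_824(message: str):
    """Return the damaged page's location from an 824 message, or None."""
    # Collapse line breaks and repeated spaces so wrapped log text still matches.
    normalized = " ".join(message.split())
    match = MSG_824.search(normalized)
    if match is None:
        return None
    return {
        "database_id": int(match["database_id"]),
        "file_id": int(match["file_id"]),
        "page_id": int(match["page_id"]),
        "expected": match["expected"],
        "actual": match["actual"],
    }


# Sample text copied as it appears on the slide.
sample = (
    "SQL Server detected a logical consistency-based I/O error: incorrect checksum "
    "(expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page "
    "(1:143) in database ID 8 at offset 0x0000000011e000 in file 'c:roken.mdf'."
)
info = parse_824(sample)
print(info)
```

The file and page numbers returned here correspond to the (1:143) page identifier in the message.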
8. Page Protection Options
- SQL Server allows pages to be ‘protected’ on disk from corruptions
- Allows fast detection of corruptions
- Set using ALTER DATABASE <database> SET PAGE_VERIFY <option>
- Three options:
  - NONE
  - TORN_PAGE_DETECTION
  - CHECKSUM
9. DBCC CHECKDB
- The only way to read all allocated pages in the database
- Use to force page checksums to be checked
- Choose between full checks and WITH PHYSICAL_ONLY
- Many algorithms to minimize runtime and run ONLINE since SQL Server 2000
- Blog post series: http://www.sqlskills.com/blogs/paul/category/CHECKDB-From-Every-Angle.aspx
10. First Hints That Something Is Wrong…
- Application/user connections get broken
- Users report 823 or 824 errors
  - ‘Hard’ and ‘Soft’ IO errors
- Backup jobs start failing
  - Error 3043: backup detected checksum errors
- Agent alerts start firing
  - Should have alerts on all errors with severity >= 19
  - Should have an alert on error 825
    - Informational (!) message that there are transient IO problems
- Maintenance jobs start failing
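The alerting rule on this slide is small enough to state as a predicate. The sketch below is only an illustration of the rule, not of how SQL Server Agent alerts are configured; the function name `should_alert` is hypothetical. Error 825 needs its own clause because it is logged as an informational message and would slip under a severity-only threshold.

```python
def should_alert(error_number: int, severity: int) -> bool:
    """Apply the slide's rule: alert on severity >= 19, and always on error 825."""
    return severity >= 19 or error_number == 825


# The 824 message shown earlier in the deck was raised at Level 24.
print(should_alert(824, 24))
# 825 is informational, so only the explicit clause catches it.
print(should_alert(825, 10))
```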
11. As Soon As Corruption Is Suspected…
- No need to panic!
- Determine the extent of the corruption
  - Run DBCC CHECKDB
  - Look in the SQL Server error log
  - Check maintenance job history
- Check what backups are available
- Wait for CHECKDB to finish before doing anything else
  - You may not NEED to do anything intrusive/destructive
12. How To Run DBCC CHECKDB
- By default, CHECKDB will:
  - Only return the first 200 errors
  - Return lots of info that’s distracting in a corruption situation
- Use the following command with only these options:
  DBCC CHECKDB (<<yourdb>>) WITH ALL_ERRORMSGS, NO_INFOMSGS
- If it’s taking longer than usual, that usually means it has found some corruption
  - Check the error log for message 5268 (SQL Server 2005 SP2 onwards) to see if it’s rescanning some data
- Most importantly, wait for it to complete!
13. Interpreting CHECKDB Output (1)
- So, CHECKDB completes and you have a bunch of cryptic error messages. Now what?
- There are over 150 errors that CHECKDB can output, some with over 200 states
- Figuring out what one error means isn’t too bad
  - MSDN has most of them published for reference
- There are some tips and tricks you can use…
14. Interpreting CHECKDB Output (2)
- Did CHECKDB fail?
  - If it stops before completing successfully, something bad has happened that is preventing CHECKDB from running
  - This means there is no choice but to restore from a backup as CHECKDB cannot be forced to run (and hence repair)
- Examples of fatal (to CHECKDB) errors
  - 7984 to 7988: corruption in critical system tables
  - 8967: invalid states within CHECKDB itself
  - 8930: corrupt metadata in the database such that CHECKDB could not run
- See ‘Understanding DBCC Error Messages’ in the BOL for DBCC CHECKDB for more details
16. Interpreting CHECKDB Output (3)
- Are the corruptions only in non-clustered indexes?
  - If recommended repair level is REPAIR_REBUILD, then YES!
  - Otherwise, check all the index IDs in the errors: if they’re all greater than 1, then YES!
- If YES, you *could* just rebuild the corrupt indexes
  - Depends on the error, and the size of the index
- But, what caused the corruption?
  - If you just rebuild the indexes, the corruption will probably happen again (especially if caused by the IO subsystem)
  - Make sure you do root-cause analysis and take preventative measures
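The index ID check is mechanical, so it can be sketched in a few lines of Python. Index ID 0 is a heap and 1 is a clustered index, which is why anything greater than 1 is a non-clustered index. The sample strings below are invented, abbreviated stand-ins for CHECKDB output, and the function name is hypothetical; the only assumption about the real output is that each error mentions its index as "index ID N".

```python
import re

# Assumption: each CHECKDB error names its index as "index ID <number>".
INDEX_ID = re.compile(r"index ID (\d+)", re.IGNORECASE)


def only_nonclustered_damaged(checkdb_output: str) -> bool:
    """True if at least one index ID is reported and every one is greater than 1."""
    index_ids = [int(value) for value in INDEX_ID.findall(checkdb_output)]
    # No index IDs at all means we cannot conclude anything, so answer False.
    return bool(index_ids) and all(index_id > 1 for index_id in index_ids)


# Invented, abbreviated sample lines for illustration only.
nonclustered_only = (
    "Object ID 100, index ID 2: page (1:200) could not be processed.\n"
    "Object ID 100, index ID 3: page (1:310) could not be processed.\n"
)
includes_clustered = (
    "Object ID 100, index ID 1: page (1:143) could not be processed.\n"
    "Object ID 100, index ID 2: page (1:200) could not be processed.\n"
)

print(only_nonclustered_damaged(nonclustered_only))
print(only_nonclustered_damaged(includes_clustered))
```

A True result only says that rebuilding the indexes is an option; the root-cause analysis still has to happen.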
18. Interpreting CHECKDB Output (4)
- Was there an un-repairable error found?
  - 8909, 8938, 8939 (page header corruption) errors where the type is ‘PFS’
  - 8970 error: invalid data for the column type
  - 8992 error: CHECKCATALOG (metadata mismatch) error
  - Plus a few more obscure ones
    - E.g. an 8904 error (extent is allocated to two objects). This is usually repairable except in the case where the extent is marked as mixed and dedicated, and has pages allocated to multiple objects. The repair is too complicated and/or destructive so is not attempted.
- None of these can be automatically repaired
- But if you don’t have a backup without these corruptions, you may be able to fix the 8970 and 8992 errors…
21. Recovering Using Backups
- Best way to avoid data loss
- Not necessarily the best way to avoid downtime
  - Depends what kind of backups are available
  - Although backup compression in SQL Server 2008 helps…
- Plethora of options available
  - Full database backup is a good starting point
  - Series of transaction log backups as well is much better
  - Beyond the scope of this session…
- Remember:
  - Backups have to exist to be useful
  - Backups have to be valid to avoid data loss
22. Choosing Between Restore and Repair (1)
- Multiple decision points that could short-circuit the decision process
- Do you still have a database?
  - No: you must restore from a backup
- Do you have working backups?
  - No: you must use repair, or restore a damaged backup with CONTINUE_AFTER_ERROR, or extract data to a new database
- Is the log damaged?
  - Yes: you must restore, or run emergency mode repair, or extract to a new database
23. Choosing Between Restore and Repair (2)
- Did CHECKDB fail?
  - Yes: you must restore or extract
- Is it just non-clustered indexes that are damaged?
  - Yes: maybe rebuild them manually
- Are there any un-repairable errors?
  - Yes: you must restore or extract
- If you’re still able to make a repair/restore choice:
  - Consider your down-time and data-loss Service Level Agreements
  - Use whichever option you can which allows you to limit down-time and data-loss while still staying within the SLAs
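The decision points on these two slides can be read as an ordered checklist where the first matching answer short-circuits the rest. The Python below is a minimal sketch of that reading only. The class, field, and function names are hypothetical, and it deliberately does not try to combine conditions (for example, no working backups together with a failed CHECKDB), which a real incident would require.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    have_database: bool
    have_working_backups: bool
    log_damaged: bool
    checkdb_failed: bool
    only_nonclustered_damaged: bool
    unrepairable_errors: bool


def recovery_options(s: Situation) -> list:
    """Walk the slides' decision points in order; the first match wins."""
    if not s.have_database:
        return ["restore from a backup"]
    if not s.have_working_backups:
        return [
            "repair",
            "restore a damaged backup with CONTINUE_AFTER_ERROR",
            "extract data to a new database",
        ]
    if s.log_damaged:
        return ["restore", "emergency mode repair", "extract to a new database"]
    if s.checkdb_failed:
        return ["restore", "extract"]
    if s.only_nonclustered_damaged:
        return ["rebuild the indexes manually"]
    if s.unrepairable_errors:
        return ["restore", "extract"]
    # Nothing short-circuited: pick by down-time and data-loss SLAs.
    return ["restore", "repair"]


# Example: database and backups exist, but CHECKDB could not complete.
example = Situation(
    have_database=True,
    have_working_backups=True,
    log_damaged=False,
    checkdb_failed=True,
    only_nonclustered_damaged=False,
    unrepairable_errors=False,
)
print(recovery_options(example))
```

When the function falls through to the last line, the choice between restore and repair is the SLA trade-off described on the slide, not something the code can decide.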
24. Repair vs. Restore Manually repairing a single page corruption with and without backups demo
25. Beware of REPAIR_ALLOW_DATA_LOSS
- Repair fixes structural inconsistencies by de-allocating whatever is damaged
  - (Not REPAIR_REBUILD, but indexes should be fixed manually)
  - This is the fastest and most provably correct way
- Repair doesn’t take into account:
  - Foreign-key constraints
  - Inherent business logic and data relationships
  - Replication (see BOL for DBCC CHECKDB)
- Before running repair, protect yourself
  - Take a backup and quiesce replication topologies involved
- After running repair, check the data
  - Consider running DBCC CHECKCONSTRAINTS
  - Fix up any replication topologies involved
26. What If the Log Is Damaged?
- Without a backup, two realistic choices:
  - Use EMERGENCY mode to access the data in the corrupt state
    - E.g. to extract to another database
    - ALTER DATABASE mydb SET EMERGENCY;
  - Use EMERGENCY mode repair
    - New feature of SQL Server 2005
    - Rebuilds the log and runs REPAIR_ALLOW_DATA_LOSS as an atomic operation
    - Database must be in EMERGENCY *and* SINGLE_USER
- This is the 3rd worst state to be in
27. Things That People Often Try *First*
- Restart SQL Server
  - Just wastes time and delays getting back online
- Immediately jump to a last resort and cause data loss without working through options
  - Running repair
  - Rebuilding the transaction log
- Detach a suspect database
  - It will fail to attach again, and now the situation is even worse!
  - This is the 2nd worst state to be in
  - However, there’s a trick you can use…
28. Repairing a Suspect Database How to hack a detached suspect database back into the system and repair it demo
29. What If You Don't Have a Database At All *OR* Any Kind of Backup to Restore From?
- Total data loss: *this* is the worst state to be in
- You might have no choice apart from manual re-entry, or URLC
  - URLC: Update Resume, Leave City
30. Summary: Pulling It All Together
- Know the signs of corruption
- When corruption occurs, be methodical:
  - Figure out the extent of the corruption
  - Figure out your options to limit downtime, data loss, or both
  - If you’re going to run repair, take a backup first
  - Fix the corruption
  - Finish with root-cause analysis
- Test all of this before you have to do it for real
- Good luck!
31. Resources (Paul's Blog)
- Example corrupt databases to play with
  http://www.sqlskills.com/blogs/paul/post/Example-20002005-corrupt-databases-and-some-more-info-on-backup-restore-page-checksums-and-IO-errors.aspx
- Everything you ever wanted to know about CHECKDB
  http://www.sqlskills.com/blogs/paul/category/CHECKDB-From-Every-Angle.aspx
- Tips and tricks for interpreting CHECKDB output
  http://www.sqlskills.com/blogs/paul/post/CHECKDB-From-Every-Angle-Tips-and-tricks-for-interpreting-CHECKDB-output.aspx
- Log rebuilding and repair
  http://www.sqlskills.com/blogs/paul/post/Corruption-Last-resorts-that-people-try-first.aspx
- Page checksums and SQLIOSim
  http://www.sqlskills.com/blogs/paul/post/How-to-tell-if-the-IO-subsystem-is-causing-corruptions.aspx
- EMERGENCY mode repair
  http://www.sqlskills.com/blogs/paul/post/CHECKDB-From-Every-Angle-EMERGENCY-mode-repair-the-very-very-last-resort.aspx
32. Thank you
આભાર ধন্যবাদ நன்றி धन्यवाद ಧನ್ಯವಾದಗಳು ధన్యవాదాలు ଧନ୍ୟବାଦ ਧੰਨਵਾਦ നിങ്ങള്ക്ക് നന്ദി