SQL Server allow to create additional filegroups to spread storage of data and indexes to different disks to improve I/O performance. The issue is that when creating new data storage objects we are required to explicitly point to the new filegroup. Is there any option to use the new filegroup by default without explicitly specifying the filegroup when creating new tables and indexes?
Let me explain some terminology:
Database – is made up of multiple files for database objects and transaction logs
File – a physical file that is used to store database objects or transaction logs
Filegroup – this is a container that is made up of one or more files to store database objects
PRIMARY filegroup – this is the default filegroup that is created when you create a new database.
Default filegroup – when creating new tables and indexes this specifies which filegroup is the default if a filegroup is not specified
Based on explanation as above when creating a new database, the primary file group of a database is the default filegroupand the primary data file is created in it. When a table or index is created without specifying a filegroup, all pages will be allocated from the default filegroup. Note: only one filegroup at a time can be the default filegroup.
By default, the primary filegroup is the default filegroup, so all new objects are placed in this filegroup. However we can change the default filegroup to any other filegroup at any time. When you have multiple filegroups, changing the default filegroup from PRIMARY to another one is a good idea, because it eliminates the need to specify the filegroup when objects are created and makes sure that new objects are created in the correct filegroup.
Example
Let’s analyze the problem for which we are required to change the default filegroup from primary to something else.
The following script creates a database and then adds additional filegroups. Although additional filegroups can be created when creating a database, I added the filegroups later for simplicity and illustration. So in this case, the primary filegroup will be the default filegroup.
-- Script 1: Create DB and add additional file group
-- If DB pre exists then drop it
IF exists (SELECT * FROM sys.databases WHERE NAME = 'MSSQLTip')
USE MASTER
DROP DATABASE MSSQLTip
GO
-- Create new DB
CREATE DATABASE MSSQLTip
GO
-- Add file groups to DB
ALTER DATABASE MSSQLTip ADD FILEGROUP FG1
ALTER DATABASE MSSQLTip ADD FILEGROUP FG2
GO
-- Verify file groups in DB
USE MSSQLTip
GO
SELECT groupName AS FileGroupName FROM sysfilegroups
GO
At this point we have three file groups primary, FG1 and FG2 in our new database. Any new table or index will be created in the primary filegroup which is the default unless we specify the filegroup to use. If we want to create a data storage object in FG1 or FG2 then the only way is to specify the filegroup when creating the storage object.
Consider the following scenario where we want to create a table without specifying the filegroup. It is essential to provide explicit instructions for creating the data storage objects in other than the default filegroup. The only way to ensure the creation of new data storage objects on FG1 or FG2 without any additional consideration is to set the required filegroup as shown below.
-- Script 2: Create tables in file groups
-- tbl1 would be created on primary file group
CREATE Table tbl1 (ID int identity(1,1))
GO
-- tbl2 would be created on FG1
CREATE Table tbl2 (ID int identity(1,1), fname varchar(20))
ON FG1
GO
-- Verify file group of tbl1
sp_help tbl1
GO
-- Verify file group of tbl2
sp_help tbl2
GO
Now that we created the table on the new filegroup we can try to insert data as follows:
INSERT INTO tbl2 (fname) values ('Atif')
GO
Since we did not specify a file for this filegroup we get this error message:
Msg 622, Level 16, State 3, Line 1 The filegroup “FG1″ has no files assigned to it. Tables, indexes, text columns, ntext columns, and image columns cannot be populated on this filegroup until a file is added.
Another issue we have since we do not have a file associated with this filegroup is if we try to change FG1 to be the default filegroup as shown below:
ALTER DATABASE MSSQLTip MODIFY FILEGROUP FG1 DEFAULT
GO
We will get this error message, because a file group cannot be set as the default unless it has at least one data file.
Msg 5050, Level 16, State 1, Line 1 Cannot change the properties of empty filegroup ‘FG1′. The filegroup must contain at least one file.
Creating the Data File
A data file may be placed in a single filegroup and cannot be shared among filegroups. Files may be placed in the filegroup at DB creation time or at a later time. In the following script we create secondary data files for each filegroup (FG1 and FG2). (Note: if you do this on your server you will need specify the correct path for the file creation.)
-- Script 3: Add data files to file groups
-- Add data file to FG1
ALTER DATABASE MSSQLTip
ADD FILE (NAME = MSSQLTip_1,FILENAME = 'F:\DBs\2K5\MSSQLTip_1.ndf')
TO FILEGROUP FG1
GO
-- Add data file to FG2
ALTER DATABASE MSSQLTip
ADD FILE (NAME = MSSQLTip_2,FILENAME = 'F:\DBs\2K5\MSSQLTip_2.ndf')
TO FILEGROUP FG2
GO
-- Verify files in file groups
USE MSSQLTip
GO
sp_helpfile
GO
Setting New Default Filegroup
After creating the new files for filegroups FG1 and FG2, we can change the default filegroup and verify the new default by running the code below.
--Script 4: Set FG1 as default file group
-- Set FG1 as default file group
ALTER DATABASE MSSQLTip MODIFY FILEGROUP FG1 DEFAULT
GO
-- Create a table without specifying file group
Create table table3 (ID TINYINT)
GO
--Verify the file group of table3 is FG1
sp_help table3
GO
-- insert some data to make sure no errors
insert into table3 values (10)
GO
Using sp_help shows that table3 is created in FG1 even without specifying FG1 when creating the table.
Now let’s set FG2 as the default filegroup using SSMS. Right click on the MSSQLTip database and click onProperties. Go to the filegroups page and set FG2 as the default as shown below.
Checking the Default Filegroup
In addition to using SSMS to find what filegroup is the default, we can also check the default filegroup by querying sysfilegroups as shown below.
--Script 5: verify default filegroup
USE MSSQLTip
GO
SELECT groupname AS DefaultFileGroup FROM sysfilegroups
WHERE convert(bit, (status & 0x10)) = 1
GO
Summary
Setting the default filegroup does not stop you from using other filegroups it just means that you will need to specify the correct filegroup when creating a table or index otherwise it will be created on whatever filegroup is set as the default.
Database administrators experience regular slowdowns on their SQL Server databases.
The I/O subsystem is a key factor when it comes to SQL Server performance since database pages are constantly moved in and out of the buffer pool. Also the transaction logs and tempDB generate significant I/O traffic. Therefore you have to ensure that your I/O subsystem performs as expected, otherwise you will be a victim of increased response times and frequent time-outs. In this article the author will describe some ways to identify I/O related bottlenecks using the built-in.
Performance Monitor
Database Administrators can use Performance Monitoring tools to see the load on their I/O subsystem. Below are the performance counters to setup to check disk performance.
The PhysicalDisk Object: Avg. Disk Queue Length counter shows you the average number of read and write requests that were queued on the selected physical disk. The higher the number the more disk operations are waiting. It requires attention if this value frequently exceeds a value of 2 during peak usage of SQL Server. If you have multiple drives you should take this number and divide by the number of drives in the array to see if the number is above 2. For example, you have 4 drives and a disk queue length of 10, this would be 10/4 = 2.5, so this would be the value you want to use not 10.
Avg. Disk Sec/Read and Avg. Disk Sec/Write shows the average time of a data reads and writes from/to the disk. It is good up to 10 ms, but it is still acceptable if less than 20 ms. Any higher value needs further investigation.
Physical Disk: %Disk Time is the ratio of elapsed time when the disk drive was busy with read or write requests. The rule of thumb for this value is that it should be below 50 percent.
The counters Disk Reads/Sec and Disk Writes/Sec show you the rate of read/write operations on the disk. It should be less than 85 percent of the disk capacity since the disk access time increases exponentially beyond this value.
You can determine the disk capacity by gradually increasing the load on the system. One way to do this is to use SQLIO. You should look for the point where the throughput is constant, but the latency increases.
You can use the counters for RAID configurations with the following calculations: Raid 0: I/O per disk = (reads + writes) / number of disks Raid 1: I/O per disk = [reads + (writes*2)] / 2 Raid 5: I/O per disk = [reads + (writes*4)] / number of disks Raid 10: I/O per disk = [reads + (writes*2)] / number of disks
Here is an example of your I/O per disk for RAID 1, if we get these values from the counters:
Disk Reads/sec = 90 Disk Writes/sec = 75 The formula for I/O on a RAID-1 array is [reads + (writes*2)] / 2 or [90 + (75*2)] / 2 = 120 I/Os per disk
Dynamic Management Views
There are some useful Dynamic Management Views (DMVs) to check I/O bottlenecks.
An I/O latch wait occurs when a page is accessed for reading or writing but the page is not available in the buffer pool. It causes waits on PAGEIOLATCH_EX or PAGEIOLATCH_SH, depending upon the type of request. These wait types can indicate an I/O bottleneck. You can query the sys.dm_os_wait_stats DMV to find latch wait statistics. You can identify I/O problems if you save query outputs of waiting_task_counts and wait_time_ms values from a normal working state of your SQL Server and compare these values when performance is degraded.
select *
from sys.dm_os_wait_stats
where wait_type like 'PAGEIOLATCH%'
order by wait_type asc
Pending I/O requests can be found by querying the following DMVs and can be used to identify which disk is responsible for the bottleneck.
select database_id,
file_id,
io_stall,
io_pending_ms_ticks,
scheduler_address
from sys.dm_io_virtual_file_stats(NULL, NULL) iovfs,
sys.dm_io_pending_io_requests as iopior
where iovfs.file_handle = iopior.io_handle
Disk Fragmentation
It is recommended to check the disk fragmentation and the configuration of your disks used by the SQL Server instance.
Fragmentation of files on NTFS can cause significant reductions in performance. Disks should scheduled to defragment regularly. In some cases a SAN can actually perform worse with defragmentation enabled thus SANs need to be treated on a case-by-case basis.
Fragmentation of indexes can also cause high I/O utilization on NTFS, but this does not have the same affect on SANs which perform better on random I/Os.
Disk Configuration / Best Practices
DBA/ SysAdmin should have log files on a physical disk that is separate from the data files for better performance. The I/O profile for a heavily used database data files (including tempDB) are random. The I/O profile for all database log files are sequential except when a transaction needs to be rolled back.
The internal disks should only be used for database log files, because they excel at sequential I/O, but perform poorly on random I/Os.
The database data and log files should each be placed on their own dedicated disk packs. To ensure optimal performance, I recommend that the database log file be placed on two internal disks configured as RAID 1. The database data file should reside on a SAN system which is only accessed by SQL Server and has controlled querying and reporting. Ad hoc access should be disallowed.
Write caching should be enabled where possible and you should make sure the cache is protected from power failures and other possible failures.
To limit possible I/O bottlenecks for your OLTP system you should not mix OLAP and OLTP environments. Also, make sure your code is tuned and create indexes where necessary to avoid unnecessary I/Os.
Next Steps
Collect and compare performance counters
Analyze DMV information
Run SQL Server Profiler to find high Read and Write queries that can be tuned
EXEC sp_dboption N’mydb’, N’offline’, N’true’
OR
ALTER DATABASE [mydb] SET OFFLINE WITH
ROLLBACK AFTER 30 SECONDS
OR
ALTER DATABASE [mydb] SET OFFLINE WITH
ROLLBACK IMMEDIATE
Using the alter database statement (SQL Server 2k and beyond) is the preferred method.
The rollback after statement will force currently executing statements to rollback after N seconds.
The default is to wait for all currently running transactions to complete and for the sessions to be terminated. Use the rollback immediate clause to rollback transactions immediately.
Unable to render ReportViewer on ASP.NET Web pages while running on IIS7.
You have no problem viewing your reports when running on debug mode with your Visual Studio 2005.
You are able to view your reports on Report Manager but not able to view them on IIS7.
You encounter JavaScript error when loading your report page with ReportViewer. Image buttons such as calendar appear as red ‘X’.
Cause:
When the ReportViewer control is added to Web Form (.aspx), the
Reserved.ReportViewerWebControl.axd httpHandler is added to System.Web section of the Web.Config file. In IIS7, it should be added under System.Webserver section.
IIS7 Handler Mappings does not contain Reserved.ReportViewerWebControl.axd httpHandler, and therefore unable to render the ReportViewer elements needed by the JavaSript.
Resolution:
Open Internet Information Services (IIS) Manager and select your Web application.
Under IIS area, double-click on Handler Mappings icon.
At the Action pane on your right, click on Add Managed Handler.
At the Add Managed Handler dialog, enter the following: Request path: Reserved.ReportViewerWebControl.axd Type: Microsoft.Reporting.WebForms.HttpHandler Name: Reserved-ReportViewerWebControl-axd
Click OK.
Reserved-ReportViewerWebControl-axd handler is now added to your Handler Mappings list. Notice that the following line has also been added to your Web.config file under the system.webserver’s handler section:
Symptoms:
I have my development database on a SQL Server 2005 server. I enabled it for full-text catalogs and created a catalog with no issues whatsoever. When I go to create the index, however, I run through the wizard (using SQL Server Management Studio) until the last step. Upon finishing the wizard, I get the following error:
Error:
Failed to retrieve data for this request (Microsoft.SqlServer.SmoEnum)
Additional Information:
A exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
EXECUTE permission denied on object 'sp_help_category', database 'msdb', schema 'dbo'.
SELECT permission denied on object 'sysjobs_view', database 'msdb', schema 'dbo'. (Microsoft SQL Server, Error: 229)
I am the DBOwner. The DBA created the index for me with no problems. I can see the index. I can pull up the property pages and see “General” and “Columns” views fine, but when I click on “Schedules”, I receive the same error as above.
Solution:
You do not have permissions to access sysjobs_view in database msdb and also executesp_help_category in msdb.
User must have CREATE FULLTEXT CATALOG permission on this database, or be a member of thedb_owner, or db_ddladmin fixed database roles.
If you tried changing collation to from English to Turkish or other using the script as below but it will fail due to dependencies issues, You have to script them, drop them, change db collation and recreate them from script.
The Collation change script:
Code:
ALTER DATABASE {your database} SET SINGLE_USER WITH ROLLBACK IMMEDIATE
ALTER DATABASE {your database} COLLATE Turkish_CI_AS
ALTER DATABASE {your database} SET MULTI_USER
where you need changing {your database} with your database name i.e. mydb
Keep in mind that changing the colaltion of the database will NOT change the collation of the columns that are already in the database, only the default collation of any new columns you will create.
The error you may face while changing collation:
Error Code:
Msg 5075, Level 16, State 1, Line 2
The object '{ObjectName}' is dependent on database collation. The database collation cannot be changed if a schema-bound object depends on it. Remove the dependencies on the database collation and then retry the operation.
Msg 5075, Level 16, State 1, Line 2
The object '{objectName{' is dependent on database collation. The database collation cannot be changed if a schema-bound object depends on it. Remove the dependencies on the database collation and then retry the operation.
Msg 5075, Level 16, State 1, Line 2
The object '{ObjectName}' is dependent on database collation. The database collation cannot be changed if a schema-bound object depends on it. Remove the dependencies on the database collation and then retry the operation.
Msg 5072, Level 16, State 1, Line 2
ALTER DATABASE failed. The default collation of database 'myDB' cannot be set to Turkish_CI_AS.
To recover database from suspect mode in SQL Server 2005 / 2008 use this script in MS SQL Server management Studio:
EXEC sp_resetstatus ‘yourDBname’;
ALTER DATABASE yourDBname SET EMERGENCY
DBCC checkdb(‘yourDBname’)
ALTER DATABASE yourDBname SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DBCC CheckDB (‘yourDBname’, REPAIR_ALLOW_DATA_LOSS)
ALTER DATABASE yourDBname SET MULTI_USER
Note: change ‘yourDBname’ with database name you want to recover from suspect mode.
It is great to be able to build small utilities on top of an excellent RDBMS.
Online Schema Change for mySQL
Thank you MySQL. This is a small but complex utility to perform online schema change for MySQL. We call it OSC and the source is here (Code is also available at bottom).
Some ALTER TABLE statements take too long form the perspective of some MySQL users. The fast index create feature for the InnoDB plugin in MySQL 5.1 makes this less of an issue but this can still take minutes to hours for a large table and for some MySQL deployments that is too long.
A workaround is to perform the change on a slave first and then promote the slave to be the new master. But this requires a slave located near the master. MySQL 5.0 added support for triggers and some replication systems have been built using triggers to capture row changes. Why not use triggers for this? The openarkkit toolkit did just that with oak-online-alter-table. We have published our version of an online schema change utility (OnlineSchemaChange.php aka OSC).
The remainder of this document is copied from the internal documents that were written for this project. Note that this project was done by Vamsi and he did an amazing job with it. In addition to writing the tool, writing the docs and doing a lot of testing he also found and fixed or avoided a few bugs in MySQL to make sure OSC would be reliable.
Overview
If the row format of database allows addition of a new column (possibly positioned at the end of existing columns with some default value) without modifying every row of the table, addition of a column could simply be just a metadata change which can be done very fast. In such databases, an exclusive lock is needed only for a very short time for the metadata change. Our understanding is that InnoDB row format does not allow this, and changing row format would be a significant project. Hence we do not consider this approach. Also, note that this approach would not work for operations like defragmentation.
OSC algorithms typically have several phases:
copy – where they make a copy of the table
build – where they work on the copy until the copy is ready with the new schema
replay – where they propagate the changes that happened on the original table to the copy table. This assumes that there is a mechanism for capturing changes.
cut-over – where they switch the tables ie rename the copy table as original. There is typically small amount of downtime while switching the tables. A small amount of replay is also typically needed during the cut-over.
Note that the above operations can be done within the storage engine itself, or using an external (PHP) script. We followed the latter approach as it can be implemented much faster. An advantage of doing within storage engine is some optimizations can be done that are not available while doing these operations externally.
Copy Phase
When the binlog is enabled Innodb gets read locks on table S during a statement such as “insert into T select * from S”. To avoid this and to reduce the load on MySQL we select the data into an outfile and load from the outfile.
Rather than generating one big outfile, we do multiple scans of the table, where each scan covers about 500000 rows (batchsize is a parameter to the OnlineSchemaChange and default value is set to 500000). The first scan scans from start of the table using LIMIT 500000. Subsequent scans start at the posistion where the previous scan left off. For example, for a 2-column PK, if the 1st select reads till [x, y] the where clause of the 2nd select will look like ((col1 = x and col2 > y) OR (col1 > x)). We patched InnoDB to not get the read locks and expect to get rid of these outfiles in the future. However, we will continue to do multiple scans of table with each scan going after different rows than previous scans.
For efficiency, in innodb _plugin 5.1 we drop all non-clustered indexes before loading data into copytable, and recreate them after load. As a future optimization there may be some cases where it is useful to drop and recreate C index as well. We do not drop NC indexes in which 1st column is AUTO_INCREMENT column. Also, in innodb 5.0 we do not drop non-clustered indexes as recreating them is very slow.
Capturing changes
Some of the approaches for capturing changes for replay are as follows:
Use statement level binary log. Unfortunately, this approach is problematic when statements that refer to other tables are replayed. If the other tables are read during replay phase, they may return different data than what they returned during the original execution of the statement. If those statements update the other tables, those updates need to be skipped during replay.
Use row level binary log. This approach would work assuming we filter out updates to other tables during replay. However many MySQL deployments don’t use row based replication (RBR) yes. Also, we need the original SQL in the binlog even when RBR is used and that feature has yet to appear in an official release.
Use triggers to capture changes. This approach has extra overhead as changes to the table are recorded in a change capture table. Also, we need to get a short duration exclusive lock for creating triggers, as mysql does not allow creating triggers while holding READ lock. If we don’t get any lock while creating triggers, we risk losing changes done by transactions that are active at the time of selecting data into outfile, if those changes were done prior to creating triggers. The trigger approach has the advantage of less effort, and less risk of breaking stuff so we adopt decided to use it for OSC.
The change capture table is referred to as the deltas table. It has has all columns as original table plus two additional columns: an integer autoincrement column to track order of changes AND an integer column to track dml type (insert, update or delete).
An insert trigger is created on the original table to capture all column values of row being inserted in deltas.
A delete trigger is created on original table to capture only the PK columns of row being deleted in deltas.
An update trigger is created on the original table so that if the update does not change the PK columns then it captures new values of all columns in deltas. If the update changes the PK columns, then the update trigger captures it as a delete followed by an insert. A possible optimization in the future is to log only changed columns.
Replay phase
It is useful to do most of the replay work without blocking changes to the original table. Mutliple replay passes are used and only the final replay is done under WRITE lock on the table. Because there are multiple passes we need to avoid replaying the same change multiple times. The following approaches are available to do this:
Delete the records from deltas as they are replayed. When a record is deleted, the entire record is put in transaction log (possibly containing large columns) and this might be too much load.
Have a column ‘done’ in deltas and set it for the records as they are replayed. Updates generate less transaction log than delete, but if the ‘done’ column is not indexed, we will be scanning deltas on each pass anyway.
save IDs of the replayed records in a temporary table so that OSC does not write to deltas.
We choose to save IDs in a temporary table.
Another consideration is how to propagate changes from deltas table to the copytable. There are at least two approaches:
select the columns from deltas table into PHP code and pass them back to mysql through update or insert or delete commands. This could move large column values back and forth between PHP and mysql.
Only fetch the ID column in deltas to PHP code, and then construct the insert, update or delete statements such that column values are directly copied from deltas to copytable.
We use the to only feetch the ID columns.
There are three phases for replaying changes: after all outfiles are loaded, after indexes are recreated and during the swap phase.
Cut-over phase
Mysql offers two ways of renaming a table foo to bar.
Rename table ‘foo’ to ‘bar’. Multiple tables can be renamed atomically using rename command, which makes it attractive for swapping two tables. Unfortunately, this command cannot be executed while holding table locks, or inside a larger transaction (i.e rename has to be transaction all by itself). So we are unable to use this.
Alter table ‘foo’ rename ‘bar’. Alter table causes an implicit commit, but it can be last statement in a multi-statement transaction, and it can be executed while holding table locks. So we use this, but two tables cannot be swapped atomically using alter table command. We need to use two separate statements.
Our cut-over phase looks like
lock tables (original table, new table, change capture table) in exclusive mode
replay any additional changes that happened after last replay
alter original table by renaming it as some old table
alter copytable by renaming it as original table.
Since alter table causes an implicit commit in innodb, innodb locks get released after the first alter table. So any transaction that sneaks in after the first alter table and before the second alter table gets a ‘table not found’ error. The second alter table is expected to be very fast though because copytable is not visible to other transactions and so there is no need to wait.
Error handling
There are two basic cases of errors:
Sql command issued by OSC fails due to some error, but mysql server is still up
Mysql server crashes during OSC
Here are the various entities created by OSC:
triggers
new non-temporary tables (copy table, deltas table, backup table to which the original table is renamed)
temp tables
outfiles
As we create an entity, we use a variable to track its cleanup. For example, when we create deltas, we set a variable $this->cleanup_deltastable to indicate that deltas needs to be cleaned up. This is not necessary for temp tables as they are automatically nuked when the script ends. A cleanup() method does the cleanup based on these cleanup variables. The cleanup() method is used during both successful termination of the script as well as failure termination. However if mysql server crashes, cleanup steps would also fail. The plan to handle mysql failures is to have a mode ‘force_cleanup’ for the OSC script, which would cleanup all the triggers, non-temporary tables, and outfiles that would have been created by OSC. One caution while using the force_cleanup mode is if the names of triggers/outfiles/tables that OSC would have created coincide with an existing trigger/outfile/table that has nothing to do with OSC, that entity may get dropped. The chances of coincidence are very slim though as we use prefixes like __osc_ for entities that OSC creates. This issue does not arise during regular cleanup (i.e non-forced) because cleanup is done based on cleanup variables in that case. Note that normally the failures during OSC don’t have to be acted on urgently, as the existence of stray tables/outfiles/triggers is not a critical problem. However, an exception is if failure happens after the original table is renamed to a backup table but before copy table is renamed as original table. In that case there should be two tables – backup table and copytable with identical contents except for the schema change. Applications would get ‘table not found’ errors until the issue is fixed. During force_cleanup, if it detects that both backup table and copytable exist, it renames backup table to original table.
Replication
OSC is is not really making any changes on its own, but only propagating the changes done by other transactions (which are replicated). So we set sql_log_bin = 0 for OSC. For schema changes like adding a column, this puts a requirement that the schema change must be done on slaves first.
Assumptions that are validated in the code
The original table must have PK. Otherwise an error is returned.
No foreign keys should exist. Otherwise an error is returned.
No AFTER_{INSERT/UPDATE/DELETE} triggers must exist. Otherwise create trigger would fail and error is returned.
If PK is being altered, post alter table should support efficient lookups on old PK columns. Otherwise an error is returned. The reason for this assumption is that PHP code may have queries/inserts/updates/deletes based on old PK columns and they need to be effiicient. Another reason is during replay, the ‘where’ clauses generated have old PK columns and so replay phase would be very slow.
If two OSCs are executed on same table concurrently, only the first one to create copytable would succeed and the other one would return an error.
OSC creates triggers/tables/outfiles with prefix __osc_. If coincidentally objects with those names already exist, an error is returned as object creation would fail.
Since we only tested OSC on 5.1.47 and 5.0.84, if it is not one of those two versions, it returns error.
Assumptions that are NOT validated in the code
Schema changes are done on slave before master. (If master has more columns than slave, replication may break. )
If OSC is done concurrently with alter table on the same table, race condition could cause “lost schema changes”. For example if column foo is being added using OSC and column bar is being added using alter table directly, it is possible that one of the column additions is lost.
Schema changes are backward compatible, such as addition of a column. Column name changes or dropping a column would cause error on the 1st load.
When OSC is run with OSC_FLAGS_FORCE_CLEANUP, it drops triggers/tables/outfiles with prefix __osc_. So if coincidentally objects with those names exist that have nothing to do with OSC, they would get dropped.
Steps in detail (listed in the order of execution)
Initialization
create_copy_table
alter_copy_table
create_deltas_table
create_triggers
start snapshot xact
select_table_into_outfile
drop NC indexes
load copy table
Replay Changes
recreate NC indexes
Replay Changes
Swap tables
Cleanup
They are described in more detail below.
Slight difference in the sequence of steps in 5.0 and 5.1
Note that (unfortunately) we need to use slightly different sequences for 5.0 and 5.1 – and that is not good. This must be done to compensate for different behavior in those versions.
This order works in 5.1 but not 5.0 (I am only showing the relevant part of the sequence):
Lock table in WRITE mode
Create insert, update, delete triggers
Unlock tables.
Start snapshot transaction
Scan deltas table and track these deltas in ‘changes to exclude from replay’
Scan original table into multiple outfiles
End snapshot xact
Load data from all outfiles to copytable
Replay changes that have not been excluded in step 5.
Since the scan done in step 6 should already see the changes captured in step 5, we exclude them from replay.
The above order does not work for 5.0 because creating trigger after locking table hangs in 5.0. See bug 46780.
This order works in 5.0 but not in 5.1
Same as above except that 1 and 2 are reversed i.e create triggers before locking.
Note that the table lock is for ensuring that transactions that changed the table before triggers were created are all committed. Any changes done after snapshot transaction began in step 4 should be captured by triggers. So even if we get table lock after creating triggers, the purpose of waiting for all prior transactions would still be achieved. So it should work in theory.
However, this sequence does not work in 5.1 in my automated unit tests as it causes the scan in step 5 to exclude some changes from replay that are not captured in scan in step 6. (For example, if a concurrent xact updates row R, the snapshot xact step 5 is seeing the row in deltas table inserted by the update, but step 6 is seeing old image of row instead of new image).
For transactional tables, failure of a statement should cause rollback of all changes performed by the statement. Failure of a trigger causes the statement to fail, so trigger failure also causes rollback.
So that means trigger is executed as part of same transaction as the DML that activated the trigger, right? We don’t know why the snapshot xact in OSC is seeing the affect of trigger but not the affect of original DML and are not sure if this is a bug.
Code Vocabulary/Glossary
$this->tablename is name of original table (i.e table being altered)
$this->dbname is name of database
$this->newtablename is name of copy table or new table
$this->deltastable is name of [deltas] table
$this->renametable is name to which the original table is renamed to before discarding
$this->columns, $this->pkcolumns, $this->nonpkcolumns are comma separated lists of all columns, just pk columns and just non PK columns respectively of the original table
$this->newcolumns and $this->oldcolumns are comma separated lists of columns of the original table prefixed by ‘NEW.’ and ‘OLD.’ respectively. Similarly we also have $this->oldpkcolumns and $this->newpkcolumns.
IDCOLNAME and DMLCOLNAME are names of ID column and DML TYPE column in [deltas] table
TEMP_TABLE_IDS_TO_EXCLUDE refers to temp table used for IDs to exclude. Its actual name is ‘__osc_temp_ids_to_exclude’.
TEMP_TABLE_IDS_TO_INCLUDE refers to temp table used for IDs to include. Its actual name is ‘__osc_temp_ids_to_include’.
$this->insert_trigger, $this->delete_trigger, and $this->update_trigger refer to trigger names.
$this->get_pkmatch_condition($tableA, $tableB) generates condition of the form tableA.pkcolumn1=tableB.pkcolumn1 AND tableA.pkcolumn2=tableB.pkcolumn2 … where pkcolumn1, pkcolumn2 etc are PK columns of original table. tableA and tableB would be table references in the FROM clause.
Initialization
Here we turn off bin log using ‘SET sql_log_bin = 0′.
We do validations like checking for no foreign keys, checking that PK exists, and innodb version.
We also retrieve all column information of the table being altered, so that we don’t have to read from information schema multiple times. (QUESTION: what happens if columns get changed by another alter table running in parallel? For now we assume that OPs is aware of alter table commands being run and won’t run two in parallel.)
This query retrieves column names.
$query = “select column_name, column_key from “.
“information_schema.columns “.
“where table_name =’%s’ and table_schema=’%s’”;
$query = sprintf($query, $this->tablename,
$this->dbname);
if column_key is not ‘PRI’, we infer that it is NOT part of primary key.
// for PK columns we need them to be in correct order as well.
$query = “select * from information_schema.statistics “.
“where table_name = ‘%s’ and TABLE_SCHEMA = ‘%s’ “.
copy table is named as concatenate( ‘__osc_new_’, originaltablename) truncated to 64 characters (maxlen). This is done by ‘create table <copytable> LIKE <originaltable>’.
alter_copy_table
DDL command to alter original table is given as input. We replace original table name by copy table name by doing:
$altercopy = preg_replace(‘/alter\s+table\s+/i’,
‘ALTER TABLE ‘, $this->altercmd,
-1, $count);
$altercopy = preg_replace(‘/ALTER\s+TABLE\s+
‘.$this->tablename.’/',
‘ALTER TABLE ‘.
$this->newtablename,
$altercopy, -1, $count);
The command is then run to alter copytable in the same way as we want original table to look like after doing alter. If we have < 1 or > 1 matches in either of preg_replace mentioned above, exception is raised.
Now we also retrieve index info using the following query so that we can drop and recreate NC indexes. (QUESTION : what happens if a concurrent alter table adds or drops index while this is running? For now we assume that operations is aware of alter table commands being run and won’t run two in parallel.
$query = “select * from information_schema.statistics “.
“where table_name = ‘%s’ and “.
“TABLE_SCHEMA = ‘%s’ “.
“order by INDEX_NAME, SEQ_IN_INDEX”;
$query = sprintf($query, $this->newtablename,
$this->dbname);
The following columns in select list are used:
NON_UNIQUE column: gives info on whether the index is non-unique.
COLUMN_NAME gives the name of the column that is in the index.
SUB_PART column indicates if index is on on part of column. (For example if an index is created on a varchar(1000) column, Innodb only creates index on first 767 chars. SUB_PART column gives this value.)
INDEX_NAME gives the name of index. if name is ‘PRIMARY’ it is inferred to be primary index.
We also check if old PK (available in $this->pkcolumnarry) is a prefix of atleast one index after the alter table. Note that if old PK is (a, b) and after alter table there is an index on (b, a), that is OK as it supports efficient lookups if values of both a and b are provided. This check is done because replay would be very inefficient if lookup based on old PK columns is inefficient after the alter table.
create_deltas_table
creates change capture table. It is named as concatenate(‘__osc_deltas_’, originaltablename) truncated to 64 characters (maxlen). created using:
$createtable = ‘create table %s’. ‘(%s INT AUTO_INCREMENT, ‘.
As mentioned before, in 5.1 we lock table and create triggers and then unlock table, but in 5.0, we create the triggers and then lock the table and unlock it.
Insert trigger is created as:
$trigger = ‘create trigger %s AFTER INSERT ON %s’.
Here we ‘start transaction with consistent snapshot’. At this point the deltas table may already have some changes done by transactions that have committed before out snapshot began. Since such changes are already reflected in our snapshot, we don’t want to replay those changes again during replay phase. So we also create a temp table named __osc_temp_ids_to_exclude to save the IDs of records that already exist in deltas table.
Since innodb gets read locks during “insert into T1 select * from T2″ state ments, we select out into outfile and load from that. Outfile is created in ‘secure-file-priv’ folder with name concatenate(‘__osc_ex_’, $this->tablename).
$selectinto = “select %s, %s from %s “.
“order by %s into outfile ‘%s’ “;
$selectinto = sprintf($selectinto, IDCOLNAME,
DMLCOLNAME, $this->deltastable,
IDCOLNAME, $outfile);
// read from outfile above into the temp table
$loadsql = sprintf(“LOAD DATA INFILE ‘%s’ INTO’.
‘TABLE %s(%s, %s)”,
$outfile, $temptable,
IDCOLNAME, DMLCOLNAME);
select_table_into_outfile
If an outfile folder is passed in, we use that. Otherwise, if @@secure_file_priv is non-NULL, we use it as outfile folder. Otherwise we use @@datadir/dbname as outfile folder. We assume @@datadir is non-NULL.
Outfile is named as concatenate(‘__osc_tbl_’, originaltablename’); Since we use multiple outfiles, they are suffixed .1,.2,.3 etc.
We also commit snapshot xact here.
drop NC indexes
In 5.1 we iterate over the index info gathered in previous step and drop all indexes whose name is NOT ‘PRIMARY’. We also don’t drop indexes in which first column is AUTO_INCREMENT column. We use this command to drop index:
$drop = sprintf(‘drop index %s on %s’, $this->indexname, $this->newtablename);
Indexes are not dropped in 5.0 as mentioned before.
load copy table
We use this command to load each outfile:
$loadsql = sprintf(“LOAD DATA INFILE ‘%s’ INTO”.
“TABLE %s(%s)”,
$this->outfile_table,
$this->newtablename,
$this->columns);
recreate NC indexes
We iterate over the index info gathered in ‘alter_copy_table’ step and recreate all indexes whose name is NOT ‘PRIMARY’.
We use one alter table command to create all NC indexes.
If the ‘SUB_PART’ column value in information_schema.statistics is not-null we use it while building columnlist. For example, if SUB_PART value for column ‘comment’ is 767, we use ‘comment(767)’ in the columnlist passed to create index command.
Replay changes
As mentioned before replay changes could be done multiple times. We maintain a temp table called TEMP_TABLE_IDS_TO_EXCLUDE to track those IDs that have been processed already. The set of IDs to process is obtained by taking the IDs from deltas table and excluding those that are in TEMP_TABLE_IDS_TO_EXCLUDE and is saved in TEMP_TABLE_IDS_TO_INCLUDE.
// Select from deltastable that are not in
// TEMP_TABLE_IDS_TO_EXCLUDE.
// Use left outer join rather than ‘in’ subquery for better perf.
We process about 500 rows in a transaction (except for the final replay which happens while holding WRITE lock on table, which is done without starting any new transaction).
Here is the query to retrieve IDs and dml type from TEMP_TABLE_IDS_TO_INCLUDE.
$query = sprintf(‘select %s, %s from %s order by %s’,
IDCOLNAME, DMLCOLNAME,
TEMP_TABLE_IDS_TO_INCLUDE,
IDCOLNAME);
DMLCOLNAME column tells if it is insert, delete or update.
Here is how insert is replayed:
$insert = sprintf(‘insert into %s(%s) select %s’.
‘from %s where %s.%s = %d’,
$this->newtablename,
$this->columns,
$this->columns,
$this->deltastable,
$this->deltastable,
IDCOLNAME,
$row[IDCOLNAME]);
Here is how delete is replayed:
$delete = sprintf(‘delete %s from %s, %s ‘.
‘where %s.%s = %d AND %s’,
$newtable, $newtable,
$deltas, $deltas,
IDCOLNAME,
$row[IDCOLNAME],
$this->get_pkmatch_condition($newtable,
$deltas));
Here is how update is replayed:
$update = sprintf(‘update %s, %s SET %s where ‘.
‘%s.%s = %d AND %s ‘,
$newtable, $deltas,
$assignment, $deltas,
IDCOLNAME,
$row[IDCOLNAME],
$this->get_pkmatch_condition($newtable,
$deltas));
Swap tables
Here are the steps as mentioned in cut-over phase:
TURN AUTOCOMMIT OFF: ‘set session autocommit=0′ // without this lock tables is not getting innodb lock
COMMIT // alter tables would have already caused implicit commits in innodb
unlock tables
TURN AUTOCOMMIT ON: ‘set session autocommit=1′
Cleanup
ROLLBACK in case we are in the middle of a xact
Turn on autocommit in case we turned it off
if trigger cleanup variables are set, drop triggers and unset trigger cleanup variables
if outfile cleanup variables are set, delete the outfiles and unset outfile cleanup variables
if cleanup variable is set for both newtable and renamedtable, then it means failure happened between the two alter tables. In this case just rename renamedtable as original table, and unset cleanup variable for renamedtable.
if cleanup variable is set for newtable, renamedtable or deltas table, drop the corresponding tables, and unset corresponding cleanup variable
In the force cleanup mode we will pretend as though all cleanup variables are set, and use ‘drop if exists’.