I have written a bit in the past about using INFORMATION_SCHEMA tables, and I use them extensively in stored procedures for performing regular maintenance operations. There are a great many clever and useful things you can do with them. However, recently I have been noticing some performance issues using these tables that puts a damper on their usefulness.
For example, let’s look at the query I use in some of my stored procedures to manage my partitioning logic. I have multiple reporting tables that are partitioned the same way, and I am always adding new tables and removing old tables depending on current requirements, so I need to have a single abstract procedure to deal with all of them. I open the procedure with a cursor that gets the table names for every table that is partitioned on a particular column. INFORMATION_SCHEMA tables make this possible:
SELECT a.table_name FROM information_schema.tables a, information_schema.columns b WHERE a.table_name = b.table_name AND a.table_name like '%reporting' AND b.column_name = 'datestring' AND EXISTS (SELECT 1 FROM information_schema.partitions c WHERE a.table_name = c.table_name AND partition_name IS NOT NULL ); ...... 13 rows in set (6 min 13.38 sec)
It turns out that although there have already been some efforts at information_schema optimization, the partitions table was not included. Indeed, if I take the EXISTS clause out of the picture, you see things are coming back much quicker (you can also see that I have a couple of reporting tables that are not partitioned):
15 rows in set (0.09 sec)
Another thing I noticed was that calling both tables and columns was redundant because table_name exists on columns. Indeed, if I factor this out, it reduces my query time by half:
SELECT distinct a.table_name FROM information_schema.columns a WHERE a.table_name like '%reporting' AND a.column_name = 'datestring' AND EXISTS (SELECT 1 FROM information_schema.partitions b WHERE a.table_name = b.table_name AND partition_name IS NOT NULL ); ..... 13 rows in set (3 min 30.05 sec)
Now, reducing the amount of time by half is great, but it’s not good enough here. There is also a table_name column on partitions. Let’s see what happens when we turn the query on it’s head:
SELECT distinct a.table_name FROM information_schema.partitions a WHERE a.table_name like '%reporting' AND a.partition_name IS NOT NULL AND EXISTS (SELECT 1 FROM information_schema.columns b WHERE a.table_name = b.table_name AND column_name = 'datestring' ); ..... 13 rows in set (28.23 sec)
Now given that this is a maintenance task this timing may be appropriate. The job is already set to run in the wee hours where load is lightest and the operation is unlikely to affect anything. Furthermore, no one is sitting on the other end of a browser or client impatiently waiting for the results.
Do I have a couple of morals or learnings from this? Not anything a good DBA shouldn’t already know:
1.) there is more than one way to write a query
2.) every query needs tuning, even if it’s embedded deep in your maintenance routines.