A Practical Guide to Optimizing Join Table Queries in MySQL

Testing Tech Hub 2023-11-08 12:46

This guide will provide valuable insights for students and professionals seeking to enhance their SQL optimization skills and achieve better results in real-world scenarios.

Joining tables is a core operation in MySQL, yet many students are wary of it. They wonder how much a join costs, how to index the tables involved, and whether it is better to run several queries in stages or to join the tables in one statement. To address these concerns, I will walk through a practical example of optimizing a join query.

-- Person - Group many-to-one relationship table
CREATE TABLE `Person_Group` (
  `person_id` int(11) unsigned NOT NULL COMMENT 'User ID',
  `group_id` int(11) unsigned NOT NULL COMMENT 'Group ID',
  `extend` varchar(1000) DEFAULT '[]' COMMENT 'Additional permissions',
  PRIMARY KEY (`person_id`),
  KEY `person_group` (`person_id`,`group_id`),
  KEY `group` (`group_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Group - Privilege many-to-many relationship table
CREATE TABLE `Group_Privilege` (
  `group_id` int(11) unsigned NOT NULL COMMENT 'Group ID',
  `privilege_id` int(11) unsigned NOT NULL COMMENT 'Permission',
  PRIMARY KEY (`group_id`,`privilege_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Statement to be optimized
SELECT Person_Group.person_id FROM Person_Group JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008;

The statement retrieves the IDs of all users who hold privilege 20008. Simple as it is, it takes about 30 milliseconds to execute, which points to a performance problem worth fixing. To understand how MySQL executes this join, let's run EXPLAIN on it.

EXPLAIN SELECT Person_Group.person_id FROM Person_Group JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Person_Group
         type: ALL
possible_keys: group
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 988054
        Extra:
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: Group_Privilege
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: test.Person_Group.group_id,const
         rows: 1
        Extra: Using index
2 rows in set (0.00 sec)

The EXPLAIN output shows that the main cause of the slow query is the full scan of Person_Group (type: ALL). Person_Group is a large table with nearly one million rows, while Group_Privilege is much smaller, with just over 300 rows. This is a typical large table-small table join, which is very common in everyday SQL. When MySQL scans the large table in full, performance suffers badly. The usual remedy is to add an index so that the full scan of the large table is avoided. Here, however, the WHERE clause filters on privilege_id in Group_Privilege. Adding an index on privilege_id would only reduce the rows read from the small table, not the scan of Person_Group. At this point the optimization seemed to have hit a dead end.
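The table sizes quoted above can be confirmed with a few counts. This is a minimal sketch against the two tables defined earlier; the third query, which counts the groups holding the privilege, is an illustrative addition and not part of the original analysis.

```sql
-- Size of the large table (the article reports nearly one million rows)
SELECT COUNT(*) AS person_group_rows FROM Person_Group;

-- Size of the small table (the article reports just over 300 rows)
SELECT COUNT(*) AS group_privilege_rows FROM Group_Privilege;

-- Number of groups that hold privilege 20008
SELECT COUNT(*) AS groups_with_privilege
FROM Group_Privilege
WHERE privilege_id = 20008;
```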

Experienced colleagues suggested swapping the order of the large and small tables in the join to see how the results changed. Because group_id is NOT NULL in both tables, and assuming every group that holds the privilege has at least one member, a RIGHT JOIN returns the same rows as the inner join here; a group with no members would add one row with a NULL person_id. It is also commonly held that driving a join from the small table is faster than driving it from the large one. With this in mind, I modified the query as follows.

SELECT Person_Group.person_id FROM Person_Group RIGHT JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008;
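Whether the RIGHT JOIN really returns the same rows as the inner join depends on the data. One way to check, sketched here under the assumption that the schema is exactly as defined above, is to look for groups that hold the privilege but have no members. The aliases gp and pg are introduced only for this example.

```sql
-- Groups with privilege 20008 that have no row in Person_Group.
-- If this returns no rows, the RIGHT JOIN and the inner join produce
-- the same result; each row returned here would appear as an extra
-- NULL person_id in the RIGHT JOIN output.
SELECT gp.group_id
FROM Group_Privilege AS gp
LEFT JOIN Person_Group AS pg
  ON pg.group_id = gp.group_id
WHERE gp.privilege_id = 20008
  AND pg.person_id IS NULL;
```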

Let's use EXPLAIN to see how this query is executed.

EXPLAIN SELECT Person_Group.person_id FROM Person_Group RIGHT JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Group_Privilege
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 338
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: Person_Group
         type: ref
possible_keys: group
          key: group
      key_len: 4
          ref: test.Group_Privilege.group_id
         rows: 247293
        Extra: Using index
2 rows in set (0.00 sec)

Although the small table scanned first has only 338 rows, the estimate for the large table accessed afterward is 247,293 rows. Multiplying the two gives 338 × 247,293 = 83,585,034 row combinations, far more than the 988,054 of the original plan. On paper, changing the order has made performance even worse. Still, let's execute both queries and compare.
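Timings like the ones that follow can be collected with session profiling. This sketch assumes the figures came from SHOW PROFILES, which the Query_ID, Duration, and Query columns suggest but the article does not state. Profiling is deprecated in newer MySQL versions in favor of the Performance Schema, so treat this as one option rather than the recommended one.

```sql
-- Enable profiling for the current session
SET profiling = 1;

-- Run the two statements to be compared
SELECT Person_Group.person_id
FROM Person_Group
JOIN Group_Privilege ON Person_Group.group_id = Group_Privilege.group_id
WHERE Group_Privilege.privilege_id = 20008;

SELECT Person_Group.person_id
FROM Person_Group
RIGHT JOIN Group_Privilege ON Person_Group.group_id = Group_Privilege.group_id
WHERE Group_Privilege.privilege_id = 20008;

-- List the recorded statements with their durations in seconds
SHOW PROFILES;
```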

*************************** 13. row ***************************
Query_ID: 1
Duration: 0.02994200
   Query: SELECT Person_Group.person_id FROM  Person_Group  JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008
*************************** 14. row ***************************
Query_ID: 2
Duration: 0.00039700
   Query: SELECT Person_Group.person_id FROM Person_Group RIGHT JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008

Looking at the results, I was astonished: the small table-large table join took only about 0.4 milliseconds, roughly 75 times faster than the original query's 29.9 milliseconds. This seemed counterintuitive, almost unscientific! I composed myself and looked for an explanation in the book "High Performance MySQL." According to the book, MySQL executes every join as a nested-loop join: it iterates over the rows of the first table, and for each row it looks up the matching rows in the next table, until all matching rows have been found. For our original inner join, MySQL would proceed as follows.

outer_iter = iterator over Person_Group
outer_row = outer_iter.next
while outer_row      # This loop runs 988,054 times
    # Directly accesses the composite primary key (group_id, privilege_id)
    inner_iter = iterator over Group_Privilege where group_id=outer_row.group_id and privilege_id=20008
    inner_row = inner_iter.next
    while inner_row
        output[outer_row.person_id]
        inner_row = inner_iter.next
    end
    outer_row = outer_iter.next
end

Once the order is changed to a right join, the pseudo-code for MySQL execution becomes:

outer_iter = iterator over Group_Privilege where privilege_id=20008
outer_row = outer_iter.next
while outer_row      # Since there is no privilege_id index, all 338 rows are scanned to find the matches
    # Directly hits the Person_Group index on group_id
    inner_iter = iterator over Person_Group where group_id=outer_row.group_id
    inner_row = inner_iter.next
    if inner_row
        while inner_row
            output[inner_row.person_id]
            inner_row = inner_iter.next
        end
    else
        output[NULL]
    end
    outer_row = outer_iter.next
end

Walking through the pseudo-code reveals the key difference. The original query first traverses all 988,054 rows of Person_Group, and for each of them performs a lookup in Group_Privilege. Each lookup hits the primary key directly, but the loop still runs nearly a million times. With the order changed, MySQL only needs to scan the 338 rows of Group_Privilege first, and then uses the group_id index on Person_Group for the inner lookups. The inner lookups are fast, and the full traversal of Person_Group is avoided. This is why driving the join from the small table is much faster here than driving it from the large table. The key to optimizing multi-table queries is using indexes correctly. For the same table structure, swapping the outer and inner tables may look insignificant, but it changes which indexes the query uses, and that makes a substantial performance difference.
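A single step of the inner loop in the reordered join can be examined on its own. The sketch below writes that step as a standalone query; the value 1 is a hypothetical group_id standing in for whatever the outer loop produces.

```sql
-- One inner-loop lookup, isolated from the join.
-- group_id = 1 is a hypothetical value supplied by the outer loop.
EXPLAIN SELECT person_id
FROM Person_Group
WHERE group_id = 1\G
```

With the `group` index from the table definition in place, this would be expected to show type: ref and key: group, as in the join plan. InnoDB secondary indexes also store the primary key columns, which is why person_id can be read from the index alone (Extra: Using index).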

Another question arises: why does EXPLAIN report 247,293 rows for the inner lookup in the small table-large table join? Shouldn't an index lookup touch far fewer rows? I have not found out exactly how the value 247,293 is calculated, but the rows column is an estimate, not a measured count. MySQL knows that the inner lookup will use the index on group_id, but because EXPLAIN does not execute the statement, it does not know the specific group_id values that the outer traversal will produce. It therefore cannot determine how many rows each index lookup will hit, so the row count shown for the inner table may be inaccurate.
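To see how far the estimate is from reality, the index statistics and the actual group sizes can be inspected directly. This is a sketch only; the article does not say how the estimate was derived, and the alias members and the LIMIT of 10 are arbitrary choices for illustration.

```sql
-- Refresh the index statistics that the optimizer bases its estimates on
ANALYZE TABLE Person_Group;

-- The Cardinality column is the estimated number of distinct values per index
SHOW INDEX FROM Person_Group;

-- Actual number of rows per group, for comparison with the EXPLAIN estimate
SELECT group_id, COUNT(*) AS members
FROM Person_Group
GROUP BY group_id
ORDER BY members DESC
LIMIT 10;
```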

Since simply switching to a RIGHT JOIN improves performance this much, let's keep optimizing. The RIGHT JOIN query has no index on privilege_id for the outer table, so it scans every row of Group_Privilege. The table is small, but this is still worth fixing. After adding an index on privilege_id and running EXPLAIN again, the outer table needs to read only 3 rows.
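The article does not show the statement used to create the index. A plausible form is sketched below; the index name `privilege` is taken from the key column of the EXPLAIN output that follows, and the single-column definition is an assumption consistent with the reported key_len of 4.

```sql
-- Add a secondary index on privilege_id so that the outer table
-- can be read by index lookup instead of a full index scan
ALTER TABLE `Group_Privilege` ADD INDEX `privilege` (`privilege_id`);
```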

EXPLAIN SELECT Person_Group.person_id FROM Person_Group RIGHT JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Group_Privilege
         type: ref
possible_keys: privilege
          key: privilege
      key_len: 4
          ref: const
         rows: 3
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: Person_Group
         type: ref
possible_keys: group
          key: group
      key_len: 4
          ref: test.Group_Privilege.group_id
         rows: 247011
        Extra: Using index
2 rows in set (0.00 sec)

The statement now looks nearly perfectly optimized. But when I casually ran EXPLAIN on the original inner join, I was shocked to see that its plan had changed as well!

EXPLAIN SELECT Person_Group.person_id FROM Person_Group JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Group_Privilege
         type: ref
possible_keys: PRIMARY,privilege
          key: privilege
      key_len: 4
          ref: const
         rows: 3
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: Person_Group
         type: ref
possible_keys: group
          key: group
      key_len: 4
          ref: test.Group_Privilege.group_id
         rows: 247011
        Extra: Using index
2 rows in set (0.00 sec)

Before the index was added, MySQL executed the INNER JOIN with Person_Group as the outer table and Group_Privilege as the inner table. After the index was added, it switched to Group_Privilege as the outer table and Person_Group as the inner table, so the INNER JOIN and the RIGHT JOIN now execute identically! I can't help but marvel at MySQL's mysterious optimizer. In fact, for an INNER JOIN, reading the left table first or the right table first does not affect the result, so MySQL weighs various factors to choose the best order. MySQL usually chooses well, but not always, as the example in this article shows. SQL optimization is therefore a matter of accumulating knowledge and experience: only through continuous practice, analysis, and tuning in real-world scenarios can we achieve the best results.
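When the optimizer picks a poor order for an inner join, the order can also be fixed explicitly instead of rewriting the query as a RIGHT JOIN. The sketch below uses MySQL's STRAIGHT_JOIN, which the article itself does not use; it keeps inner-join semantics but makes MySQL read the left table first.

```sql
-- STRAIGHT_JOIN forces Group_Privilege (the small table) to be read first,
-- with Person_Group looked up for each matching row
SELECT Person_Group.person_id
FROM Group_Privilege
STRAIGHT_JOIN Person_Group
  ON Person_Group.group_id = Group_Privilege.group_id
WHERE Group_Privilege.privilege_id = 20008;
```

Unlike the RIGHT JOIN variant, this never emits a NULL person_id for a group without members. A forced order should be rechecked with EXPLAIN whenever the data or the indexes change.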

To briefly summarize, although MySQL has built-in optimizations for joining tables, they're not always reliable. It's recommended to identify poorly performing SQL queries in your work, optimize and adjust them based on actual data, determine the join order, and establish appropriate indexes to improve query speed.

