Joining tables is one of the most common operations in MySQL, and many developers have concerns about it: how much a join costs, how to index the joined tables, and whether it is better to run several queries in stages or a single join. To address these concerns, I will walk through a practical example of optimizing a join query.
The purpose of the SQL statement is to retrieve all user IDs with the 20008 permission. Although the SQL is simple, it takes 29 milliseconds to execute, which suggests a performance issue that requires optimization. To better understand how this join SQL is executed, let's use the EXPLAIN command.
According to the EXPLAIN output, the main cause of the slow join is the full scan of the Person_Group table. Person_Group is a large table with nearly one million rows, while Group_Privilege is a much smaller table with just over 300 rows. This is a typical large table-small table join, which is quite common in everyday SQL. If MySQL scans the whole large table during the query, performance suffers significantly. The usual way to optimize such SQL is to add an index so that the large table no longer needs a full scan. However, the WHERE clause in this statement filters on the privilege_id of Group_Privilege. Adding an index on privilege_id would only reduce the rows scanned in the small table Group_Privilege, not the rows scanned in Person_Group. At this point the optimization seemed to have hit a dead end.
Experienced colleagues suggested swapping the order of the large table and the small table in the join to see how the results would be affected. Since neither Person_Group nor Group_Privilege contains NULL group_id values, the result set is the same regardless of how the tables are joined. In addition, a common rule of thumb holds that a small table-large table join is faster than a large table-small table join. With this in mind, I decided to give it a try and modified the query accordingly.
Let's run EXPLAIN again to see how the modified query is executed.
Even though the small table scanned first has only 338 rows, EXPLAIN reports 247,293 rows for the large table scanned afterward. If we multiply the two, the total is 338 × 247,293 = 83,585,034, which is far larger than the previous 988,054. It appears that changing the order has made the performance even worse. However, let's just execute the query and see how it performs.
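The estimate can be checked with a few lines of Python. This is only the arithmetic on the row counts quoted from EXPLAIN (using the 247,293-row figure that the product 83,585,034 implies), not a measurement of the real query; the variable names are made up for illustration.

```python
# Row counts quoted from the EXPLAIN outputs discussed in the text.
outer_rows_small_table = 338     # Group_Privilege rows scanned first
inner_rows_estimate = 247_293    # EXPLAIN's estimate for Person_Group
original_rows = 988_054          # rows scanned with the original join order

# Naive worst case: every outer row triggers a scan of the estimated inner rows.
worst_case = outer_rows_small_table * inner_rows_estimate
ratio = worst_case / original_rows

print(worst_case)        # 83585034
print(round(ratio, 1))   # 84.6
```

On paper, the swapped order looks roughly 85 times more expensive, which is why the measured result that follows is so surprising.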
Reviewing the results, I was astonished: the small table-large table join took only 0.39 milliseconds, over 70 times faster than the original 29 milliseconds. This outcome seemed counterintuitive and almost unscientific! I quickly composed myself and looked for an explanation in the book "High Performance MySQL." According to the text, MySQL executes every join as a nested-loop join: it iterates over the rows of the first table, and for each row it looks up the matching rows in the next table, until all matching rows have been found. In the case of our original inner join query, MySQL would process it as follows.
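A minimal sketch of that nested-loop order, written in Python rather than taken from MySQL itself. The toy rows, the `user_id` column name, and the function name are assumptions for illustration; only the table names and the 20008 permission come from the text. The outer loop visits every Person_Group row, and each iteration makes one indexed probe into Group_Privilege.

```python
from collections import namedtuple

# Hypothetical row shape; the real Person_Group columns are not shown in the text.
PersonGroup = namedtuple("PersonGroup", ["user_id", "group_id"])

# Toy stand-in for the large table (about one million rows in the real case).
person_group = [PersonGroup(user_id=u, group_id=u % 5) for u in range(20)]

# Toy stand-in for an index on Group_Privilege: a set of (group_id, privilege_id).
group_privilege = {(1, 20008), (3, 20008), (2, 30001)}


def join_large_table_first(target_privilege):
    """Outer loop over the large table, one indexed probe per outer row."""
    outer_rows_visited = 0
    user_ids = []
    for row in person_group:  # full scan of the large table
        outer_rows_visited += 1
        # Indexed probe: hits (or misses) in a single lookup.
        if (row.group_id, target_privilege) in group_privilege:
            user_ids.append(row.user_id)
    return user_ids, outer_rows_visited


users, visited = join_large_table_first(20008)
print(users)    # [1, 3, 6, 8, 11, 13, 16, 18]
print(visited)  # 20
```

Every row of the large table is visited no matter how few of them match, so the cost grows with the size of Person_Group.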
Once the order is changed to a right join, the pseudo-code for MySQL execution becomes:
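A minimal Python sketch of the swapped order, under the same caveats: the toy data, tuple layouts, and function name are assumptions, and a dictionary stands in for the index on Person_Group.group_id. The outer loop now scans the small table, and the inner step fetches only the Person_Group rows for the matching group.

```python
from collections import defaultdict

# Toy stand-ins for the real tables (layouts are hypothetical).
person_group = [(u, u % 5) for u in range(20)]            # (user_id, group_id)
group_privilege = [(1, 20008), (3, 20008), (2, 30001)]    # (group_id, privilege_id)

# Dictionary playing the role of the index on Person_Group.group_id.
group_id_index = defaultdict(list)
for user_id, group_id in person_group:
    group_id_index[group_id].append(user_id)


def join_small_table_first(target_privilege):
    """Outer loop over the small table, indexed lookup into the large table."""
    outer_rows_visited = 0
    inner_rows_visited = 0
    user_ids = []
    for group_id, privilege_id in group_privilege:  # full scan of the small table
        outer_rows_visited += 1
        if privilege_id != target_privilege:
            continue
        # Index lookup: touches only the rows that belong to this group.
        for user_id in group_id_index.get(group_id, []):
            inner_rows_visited += 1
            user_ids.append(user_id)
    return sorted(user_ids), outer_rows_visited, inner_rows_visited


users, outer_visited, inner_visited = join_small_table_first(20008)
print(users)          # [1, 3, 6, 8, 11, 13, 16, 18]
print(outer_visited)  # 3
print(inner_visited)  # 8
```

The large table is never scanned in full: the work is the size of the small table plus the number of matching rows.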
Walking through the pseudo-code reveals the key issue. The initial query first traversed all of the more than 900,000 rows of the Person_Group table, and for each of them the inner lookup on Group_Privilege hit directly in one attempt. The outer loop therefore ran more than 900,000 times. After changing the order, MySQL only needs to traverse the 338 rows of Group_Privilege first, and then uses the index on group_id in Person_Group for the inner lookup. This makes the inner lookup much faster and removes the need to traverse the entire Person_Group table. This is why, in this case, joining the small table to the large table is much faster than joining the large table to the small table. The key to optimizing multi-table queries is using indexes correctly. For the same table structure, swapping the outer and inner tables may not seem significant, but it changes which indexes the query uses, resulting in a substantial performance difference.
Another question arises: why does the inner lookup of the small table-large table join show 247,293 rows in the EXPLAIN output? Shouldn't there be fewer rows if an index is used? The truth is, it's unclear how this 247,293 value is calculated. MySQL knows that the index on group_id will be used in the inner lookup, but because EXPLAIN does not execute the SQL, it does not know the specific group_id values that the inner lookup will receive (they are produced by the outer traversal). So MySQL cannot determine how many rows the index will hit, and thus cannot accurately judge how many rows will be traversed. In other words, the row count that EXPLAIN reports for the inner table is only an estimate and may be inaccurate.
Since simply switching to a RIGHT JOIN improved performance so much, let's continue to optimize. The RIGHT JOIN query has no index on privilege_id for the outer table, so it traverses all Group_Privilege rows. Although the number of rows is not large, this is still worth optimizing. After adding an index on privilege_id and running EXPLAIN, we find that the outer table only needs to traverse 3 rows.
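The effect of the privilege_id index on the outer table can be illustrated with a small Python sketch. The data is synthetic and was chosen so that 338 rows contain exactly 3 matches, mirroring the EXPLAIN figures in the text; the dictionary is only a stand-in for a MySQL index, and all names are hypothetical.

```python
from collections import defaultdict

# Synthetic Group_Privilege rows: (group_id, privilege_id), 338 rows in total.
# Groups 100, 200 and 300 hold permission 20008; the rest hold other permissions.
group_privilege = [
    (g, 20008 if g % 100 == 0 else 30000 + g) for g in range(1, 339)
]


def scan_without_index(target_privilege):
    """Without an index, every row must be checked against the filter."""
    visited = 0
    matches = []
    for group_id, privilege_id in group_privilege:
        visited += 1
        if privilege_id == target_privilege:
            matches.append(group_id)
    return matches, visited


# Dictionary standing in for an index on Group_Privilege.privilege_id.
privilege_index = defaultdict(list)
for group_id, privilege_id in group_privilege:
    privilege_index[privilege_id].append(group_id)


def scan_with_index(target_privilege):
    """With the index, only the matching rows are touched."""
    matches = privilege_index.get(target_privilege, [])
    return matches, len(matches)


print(scan_without_index(20008))  # ([100, 200, 300], 338)
print(scan_with_index(20008))     # ([100, 200, 300], 3)
```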
The optimization of this statement seemed nearly perfect, but when I casually ran EXPLAIN on the original INNER JOIN SQL, I was shocked to see that everything had changed!
For the INNER JOIN SQL statement, MySQL initially used Person_Group as the outer table and Group_Privilege as the inner table. After the index was added, it switched to Group_Privilege as the outer table and Person_Group as the inner table. With the index in place, the execution plans of the INNER JOIN and the RIGHT JOIN become identical! I can't help but marvel at MySQL's mysterious optimization mechanism. In fact, for an INNER JOIN, reading the left table first or the right table first doesn't affect the query results, so MySQL weighs various factors to choose the join order it considers optimal. Although MySQL usually optimizes well, it is not always correct, as the example in this article shows. SQL optimization is therefore a process of accumulating knowledge and experience. Only through continuous practice, analysis, and optimization in real-world scenarios can we achieve the best results.
To briefly summarize, although MySQL has built-in optimizations for joining tables, they're not always reliable. It's recommended to identify poorly performing SQL queries in your work, optimize and adjust them based on actual data, determine the join order, and establish appropriate indexes to improve query speed.