How to optimize SQL/Python select queries against a db2 database?

Ask Time：2021-08-06T22:11:13 Author：Dylan Moore

I am connected via Python to a Db2 database on a server and am querying an enormous table (roughly 200 million records, 50 columns). This table is for analytics (OLAP might be the correct term) rather than transactions. I would like to optimize my SQL/Python code for faster query execution.

Without a deep understanding of SQL queries, my suspicion is that SELECT statements begin at the first record of the table and continue until the query is satisfied. FETCH FIRST 10 ROWS ONLY executes in <1 second. However, adding WHERE date_col > 20210701 would mean scanning a couple hundred million records before the first 10 matching records are found, and that query takes more than a few minutes to execute. Performance is similar via the cursor object.
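
To make the suspicion above concrete, here is a minimal sketch of how one could check whether a filter forces a full scan. It uses Python's built-in sqlite3 as a stand-in so it runs without a server; SQLite is not Db2, the index name idx_date is hypothetical, and whether the real bigtable has an index on date_col is unknown. Db2 has its own explain tooling, so this only illustrates the general idea, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the plan steps SQLite reports for the given statement."""
    # Each EXPLAIN QUERY PLAN row is a 4-tuple; the last field is the description
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Stand-in table (assumption: only two of the real table's ~50 columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (col1 INTEGER, date_col INTEGER)")
conn.executemany(
    "INSERT INTO bigtable VALUES (?, ?)",
    [(i, 20210101 + i % 28) for i in range(1000)],  # no row passes the filter
)

# SQLite spells FETCH FIRST 10 ROWS ONLY as LIMIT 10
sql = "SELECT col1 FROM bigtable WHERE date_col > 20210701 LIMIT 10"

# Without an index the engine has to walk the table row by row
plan_without_index = query_plan(conn, sql)

# With an index on the filtered column it can jump to the matching range
conn.execute("CREATE INDEX idx_date ON bigtable (date_col)")
plan_with_index = query_plan(conn, sql)

print(plan_without_index)
print(plan_with_index)
conn.close()
```

In this stand-in, the first plan reports a scan of bigtable and the second reports a search using idx_date. If the Db2 table has no usable index on date_col, that would be consistent with the slow date filter, but it is only one possible explanation.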

Alternatively, I am connected to the same table through Microsoft Access. The same date query in Access executes in <0.5 seconds, even faster than my fastest SELECT statement. Surely Access is doing something behind the scenes that I don't know about.

So Access proves the concept that these sql queries can be executed quickly. I am left with the question: How can I optimize my sql/python code to match the performance of Microsoft Access? Thanks all.

import ibm_db_dbi as db
import pandas as pd 

cnxn = db.connect(dsn=     '********', 
                  user=    '********', 
                  password='********', 
                  host=    '********', 
                  database='********')   

cols = "{0}col1, {0}col2, {0}col3, {0}col4".format('database.')

# Executes in <1 second
fast_sql = '''SELECT {} FROM bigtable
              FETCH FIRST 10 ROWS ONLY'''.format(cols)

# Executes in ~5 seconds 
slower_sql = '''SELECT {} FROM bigtable
                WHERE col1 = 1234
                FETCH FIRST 10 ROWS ONLY'''.format(cols)

# Giving up after ~3 minutes
slowest_sql = '''SELECT {} FROM bigtable 
                       WHERE date_col > 20210701
                       FETCH FIRST 10 ROWS ONLY'''.format(cols) 

df = pd.read_sql_query(slowest_sql, cnxn)

cnxn.close()
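
For comparing the three statements above on equal terms, a small timing helper can be useful. The sketch below is an assumption-laden illustration, not part of the original code: time_query is a hypothetical helper that works on any DB-API connection, it passes the filter value as a `?` parameter marker (assuming the driver uses that style), and it demonstrates itself against an in-memory sqlite3 table so it runs without a Db2 server.

```python
import sqlite3
import time

def time_query(cnxn, sql, params=(), max_rows=10):
    """Execute sql on a DB-API connection; return (rows, seconds elapsed)."""
    cur = cnxn.cursor()
    start = time.perf_counter()
    cur.execute(sql, params)
    rows = cur.fetchmany(max_rows)  # fetch only what is needed
    elapsed = time.perf_counter() - start
    cur.close()
    return rows, elapsed

# Stand-in connection and data (hypothetical values, not the real table)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE bigtable (col1 INTEGER, date_col INTEGER)")
demo.executemany(
    "INSERT INTO bigtable VALUES (?, ?)",
    [(1234, 20210702), (5678, 20210630)],
)

rows, seconds = time_query(
    demo, "SELECT col1 FROM bigtable WHERE date_col > ?", (20210701,)
)
print(rows, f"{seconds:.4f}s")
demo.close()
```

Against Db2, the same helper would be called with the cnxn object from db.connect and a statement that keeps FETCH FIRST 10 ROWS ONLY, which separates the database's execution time from the overhead of building a DataFrame.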

Author: Dylan Moore. Reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/68683025/how-to-optimize-sql-python-select-queries-against-a-db2-database
