Let Python make full use of Sqlite3

  • 2020-06-15 09:36:42
  • OfStack

I've been using sqlite3 a lot recently on projects that involve a lot of data processing. My initial attempt involved no database at all, and all the data would be kept in memory, including dictionary lookups, iterations, and conditions queries. This is fine, but there is only so much you can put into memory, and regenerating or loading data from disk into memory is a tedious and time-consuming process.

I decided to try sqlite3. Because you only need to open a connection to the database, this increases the amount of data that can be processed and reduces the application's load time to zero. In addition, I can replace many of the Python logical statements with SQL queries.

I'd like to share a few thoughts and findings about this experience.

TL;DR

Use a lot of operations (aka executemany). You don't need to use the cursor (most of the time). The cursor can be iterated over. Use the context manager. Use the compile directive (when it makes sense). Delay index creation. Use placeholders to insert the python value.

1. Use lots of operations

If you need to insert many rows in the database at once, you really shouldn't use execute. The sqlite3 module provides a way for bulk inserts: executemany.

Instead of doing this:


for row in iter_data():
connection.execute('INSERT INTO my_table VALUES (?)', row)

You can take advantage of the fact that executemany accepts a tuple generator as an argument:


connection.executemany(
 'INSERT INTO my_table VALUE (?)',
  iter_data()
)

This is not only cleaner, but also more efficient. In fact, sqlite3 implements execute behind the scenes using executemany, but the latter inserts one row instead of many.

I wrote a small benchmark to insert 1 million rows into an empty table (database in memory) :
executemany: 1.6 seconds
execute: 2.7 seconds

You don't need a cursor

The first thing I often confused was cursor management. The online examples and documentation are usually as follows:


connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
# Do something with cursor

But in most cases, you don't need a cursor at all, you can just use the connection object (mentioned at the end of this article). Operations like execute and executemany can be called directly on the connection. Here's an example to prove it:


import sqlite3
connection = sqlite3(':memory:')
# Create a table
connection.execute('CREATE TABLE events(ts, msg)')
# Insert values
connection.executemany(
 'INSERT INTO events VALUES (?,?)',
  [
    (1, 'foo'),
    (2, 'bar'),
    (3, 'baz')
  ]
)
# Print inserted rows
for row in connnection.execute('SELECT * FROM events'):
  print(row)

3. The cursor (Cursor) can be used for iteration

You may often see examples of using fetchone or fetchall to process SELECT query results. But I've found that the most natural way to process these results is to iterate directly over the cursor:


for row in connection.execute('SELECT * FROM events'):
 print(row)

This way, once you get enough results, you can terminate the query without wasting resources. Of course, if you know in advance how many results you want, you can use the LIMIT SQL statement, but the Python generator is very convenient and allows you to separate data generation from data consumption.

4. Use Context Managers (Context manager)

Even in the middle of the SQL transaction, nasty things happen. To avoid manually handling rollbacks or commits, you can simply use the connection object as the context manager. In the following example, we create a table and mistakenly insert duplicate values:


import sqlite3
connection = sqlite3.connect(':memory:')
with connection:
  connection.execute(
 'CREATE TABLE events(ts, msg, PRIMARY KEY(ts, msg))')
try:
  with connection:
    connection.executemany('INSERT INTO events VALUES (?, ?)', [
      (1, 'foo'),
      (2, 'bar'),
      (3, 'baz'),
      (1, 'foo'),
    ])
except (sqlite3.OperationalError, sqlite3.IntegrityError) as e:
  print('Could not complete operation:', e)
# No row was inserted because transaction failed
for row in connection.execute('SELECT * FROM events'):
  print(row)
connection.close()

5. Use Pragmas

... When it really works

There are several pragma programs in your program that you can use to adjust the behavior of sqlite3. In particular, one of the ones that improves performance is synchronous:


connection.execute('PRAGMA synchronous = OFF')

You should know it can be dangerous. If the application crashes unexpectedly in the middle of a transaction, the database may be in a different state. So be careful! But if you want to insert a lot of rows faster, then this might be a choice.

6. Delay index creation

Suppose you need to create several indexes on the database, and you need to create indexes while inserting many rows. Delaying index creation until all rows have been inserted can lead to substantial performance improvements.

7. Use placeholders to insert Python values

It is convenient to include values in the query using the Python string operation. But doing so is extremely unsafe, and sqlite3 gives you a better way to do this:


# Do not do this!
my_timestamp = 1
c.execute("SELECT * FROM events WHERE ts = '%s'" % my_timestamp)
# Do this instead
my_timestamp = (1,)
c.execute('SELECT * FROM events WHERE ts = ?', my_timestamp)

In addition, string interpolation using Python % s (or formatted string constants) is not always feasible for executemany. So there's no real point in trying!

Keep in mind that these tips may or may not benefit you, depending on the particular use case. You should always try and decide for yourself whether it's worth it.

remusao, oschina


Related articles: