The issues I faced the most when pushing data into databases

As financial analysts, we're generally skilled at handling and processing data, but many of us struggle when it comes to pushing that data into a database.

Common challenges that arise include avoiding duplicate entries, as well as efficiently removing data that's already present based on new inputs. These are crucial questions: How do we ensure data remains unique? How do we delete outdated or irrelevant rows before inserting fresh data?

In this article, I’ll walk you through a simple yet effective approach that has worked for me: reading data from an Excel file, cleaning it up, checking for duplicates in an SQLite database, and inserting the most up-to-date data. By the end, you'll have a clear method for handling data insertion while maintaining database integrity and keeping your records up-to-date.

In this scenario, we have an Excel file containing the data that we want to insert into the "SalesTable" in an existing SQLite database. However, before inserting the new records into the "SalesTable" in the database, we need to clean up any existing data that might conflict—specifically, we need to ensure that records for the same company and year-period are not duplicated.

Step 1: Reading the Excel File into a Pandas DataFrame

import pandas as pd

# Read the Excel file into a pandas dataframe
df = pd.read_excel(excel_file, sheet_name="Sheet1")

We begin by loading the data from an Excel file into a Pandas DataFrame using the read_excel() function. This allows us to work with the data easily within Python. The sheet_name="Sheet1" argument specifies the sheet from which to load the data.

This is the first step in getting the data into Python so that we can manipulate and clean it before pushing it into the database.

Step 2: Create a 'MonthDate' column to simplify data filtering

# Convert the 'Year Period' column to datetime format
df['MonthDate'] = pd.to_datetime(df['Year Period'].astype(str), format='%Y%m')

In this step, we parse the 'Year Period' column (which holds year-month values like 202303) into proper datetime objects, stored in a new 'MonthDate' column. The format='%Y%m' argument tells Pandas how to interpret each value: a four-digit year followed by a two-digit month (e.g., 202303 for March 2023).

This ensures that the date is standardized and easy to work with in the next steps, especially if we need to filter or sort the data later on.
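Here is a quick sketch of what this conversion produces, using a toy DataFrame in place of the Excel data:

```python
import pandas as pd

# Toy data mimicking the 'Year Period' column (e.g. 202303 = March 2023)
df = pd.DataFrame({"Year Period": [202301, 202302, 202303]})

# Parse the year-month integers into proper datetime values
df["MonthDate"] = pd.to_datetime(df["Year Period"].astype(str), format="%Y%m")

print(df["MonthDate"].dt.strftime("%Y-%m-%d").tolist())
# ['2023-01-01', '2023-02-01', '2023-03-01']
```

Each period is parsed to the first day of its month, which makes the column easy to filter and sort with ordinary datetime comparisons.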

Step 3: Connect to the SQLite Database

import sqlite3

# Connect to the SQLite database & create cursor object
conn = sqlite3.connect(db_dir)
cur = conn.cursor()

Next, we establish a connection to the SQLite database using sqlite3.connect(). The db_dir variable contains the path to your SQLite database file. Once connected, we create a cursor object with conn.cursor(); the cursor is what we use to execute SQL statements, whether reading data, deleting rows, or inserting new records.

Step 4: Prepare Parameters for Deletion

# Define the table name
table_name = "SalesTable"

import numpy as np

# Define conditions to remove rows from the DB
YearPeriod = df['Year Period'].astype(str).unique()
companyList = df['Company'].astype(str).unique()

# Add all conditions into the parameter list
params = np.append(YearPeriod, companyList)

# Build placeholders for the IN clause
placeholder_1 = ', '.join('?' for _ in YearPeriod)
placeholder_2 = ', '.join('?' for _ in companyList)

In this step, we create the different parameters required to construct the correct delete statement.

First, we define the name of the table in the database that we’ll be working with, in this case, SalesTable.

Then, we extract the unique values from the Year Period and Company columns in the DataFrame, ensuring that we are working with distinct values. The astype(str) ensures that the columns are treated as strings (useful if there are any numeric values). We then combine these values into a single parameter list (params); note that the order matters — the Year Period values come first, followed by the Company values, matching the order of the IN clauses in the query we build next. These unique values will serve as conditions for removing data from the database, ensuring we only delete rows that match these specific criteria.

Subsequently, we create placeholders for the SQL query to handle dynamic conditions. The placeholders (?) will be replaced by actual values during execution. We build these placeholders dynamically based on the number of unique values in YearPeriod and companyList.
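To make the placeholder construction concrete, here is the same logic run against a toy DataFrame (the column names match the article; the values are invented):

```python
import numpy as np
import pandas as pd

# Toy data: two periods, two companies (with a duplicate row)
df = pd.DataFrame({
    "Year Period": [202301, 202301, 202302],
    "Company": ["Acme", "Acme", "Globex"],
})

# Unique values become the deletion conditions
YearPeriod = df["Year Period"].astype(str).unique()
companyList = df["Company"].astype(str).unique()
params = np.append(YearPeriod, companyList)

# One '?' per value, joined into a comma-separated list
placeholder_1 = ', '.join('?' for _ in YearPeriod)
placeholder_2 = ', '.join('?' for _ in companyList)

print(placeholder_1)   # ?, ?
print(list(params))    # ['202301', '202302', 'Acme', 'Globex']
```

The duplicate row collapses away via unique(), so each condition value appears exactly once in params.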

Step 5: Construct DELETE statement

# Construct SQL Statement to Delete rows fulfilling the conditions
delete_query = f"DELETE FROM {table_name} WHERE [Year Period] IN ({placeholder_1}) AND [Company] IN ({placeholder_2})"

This line of code constructs the SQL DELETE query that will delete rows from the database where both the Year Period and Company match the values we’ve provided. The IN clause allows us to specify a list of values to match against.

The output would be something like the following:

DELETE FROM SalesTable WHERE [Year Period] IN (?, ?, ?) AND [Company] IN (?, ?, ?, ?, ?)

The number of "?" will depend on the number of unique values in YearPeriod and companyList.

This query will remove any existing data that matches the conditions specified, preventing duplicate entries in the database.

Step 6: Check if the table exists

# Check if the table exists
query = f"SELECT name FROM sqlite_master WHERE type='table' AND name='{table_name}'"
cur.execute(query)
result = cur.fetchone()

Before executing any operations, we check if the table exists in the database. We do this by querying the sqlite_master table, which contains metadata about the database, including table names. This ensures we only try to execute the DELETE statement if the table actually exists. Running queries on non-existent tables would result in errors.
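A self-contained sketch of this check, run against a throwaway in-memory database; note that it binds table_name as a parameter (?) rather than interpolating it into the string, which is slightly safer when the name comes from elsewhere:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory DB for illustration
cur = conn.cursor()
cur.execute("CREATE TABLE SalesTable ([Year Period] TEXT, [Company] TEXT)")

table_name = "SalesTable"
# sqlite_master holds one row of metadata per table, index, view, etc.
cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
    (table_name,),
)
result = cur.fetchone()
print(result)  # ('SalesTable',) — or None if the table is missing
```

No commit is needed here: SELECT statements read data without opening a write transaction.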

Step 7: Execute the DELETE query if the table exists

# Execute DELETE statement if the table exists
if result is not None:
    cur.execute(delete_query, params)
    conn.commit()

If the table exists (checked in the previous step), we execute the DELETE query. The params list (which contains the unique values for Year Period and Company) is passed to the query to replace the placeholders (?) with actual values.

This step removes the data we no longer need, preventing any duplicates when we insert new data.
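As a sanity check, cursor.rowcount reports how many rows a DELETE removed. A sketch against a throwaway in-memory table (the values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE SalesTable ([Year Period] TEXT, [Company] TEXT)")
cur.executemany(
    "INSERT INTO SalesTable VALUES (?, ?)",
    [("202301", "Acme"), ("202301", "Globex"), ("202212", "Acme")],
)

# Delete only rows matching BOTH conditions
cur.execute(
    "DELETE FROM SalesTable WHERE [Year Period] IN (?) AND [Company] IN (?)",
    ("202301", "Acme"),
)
conn.commit()
deleted = cur.rowcount
print(deleted)  # 1 — only ('202301', 'Acme') matched both IN clauses
conn.close()
```

This also illustrates that the two IN clauses are combined with AND: a row must match on period and company to be removed.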

Step 8: Insert the Data into the Database

# Insert dataframe/table into SQL Database
df.to_sql(table_name, conn, if_exists='append', index=False)
cur.close()
conn.close()

Finally, we insert the cleaned DataFrame (df) into the database. We use the to_sql() method from Pandas, specifying the table name, connection, and the if_exists='append' argument. This ensures that if the table already contains data, the new rows will be appended instead of replacing existing data.

After the data insertion, we close both the cursor and the connection to release resources. Note that closing a connection does not commit pending changes on its own, so make sure any changes are committed (as we did after the DELETE; to_sql handles its own commit) before closing. Closing connections promptly is good practice to prevent memory leaks or potential locking issues.

By following the above steps, you can efficiently insert and update data in an SQLite database while preventing duplicates. The key to this process is properly managing database connections, using conditional deletions, and ensuring that only the most up-to-date data is inserted into the database.
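To tie the steps together, here is a minimal end-to-end sketch. It substitutes a hand-built DataFrame for the Excel read and an in-memory database for db_dir, and seeds the table with one stale row so the delete-then-insert behaviour is visible; everything else follows the article's steps:

```python
import sqlite3

import numpy as np
import pandas as pd

# Step 1 stand-in for pd.read_excel(excel_file, sheet_name="Sheet1")
df = pd.DataFrame({
    "Year Period": [202301, 202302],
    "Company": ["Acme", "Acme"],
    "Sales": [100, 120],
})
# Step 2: standardized datetime column
df["MonthDate"] = pd.to_datetime(df["Year Period"].astype(str), format="%Y%m")

# Step 3: connect (use the real db_dir path in practice)
table_name = "SalesTable"
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Seed the table with one stale row that the new data should replace
df.head(1).to_sql(table_name, conn, if_exists="append", index=False)

# Steps 4-5: deletion parameters, placeholders, and the DELETE statement
YearPeriod = df["Year Period"].astype(str).unique()
companyList = df["Company"].astype(str).unique()
params = np.append(YearPeriod, companyList)
placeholder_1 = ', '.join('?' for _ in YearPeriod)
placeholder_2 = ', '.join('?' for _ in companyList)
delete_query = (
    f"DELETE FROM {table_name} WHERE [Year Period] IN ({placeholder_1}) "
    f"AND [Company] IN ({placeholder_2})"
)

# Steps 6-7: delete conflicting rows only if the table exists
cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
    (table_name,),
)
if cur.fetchone() is not None:
    cur.execute(delete_query, params)
    conn.commit()

# Step 8: insert the fresh data and clean up
df.to_sql(table_name, conn, if_exists="append", index=False)
row_count = cur.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
cur.close()
conn.close()
print(row_count)  # 2 — the stale seed row was deleted, both fresh rows inserted
```

The final count is two rows, not three: the seeded row matched the Year Period and Company conditions and was deleted before the fresh data was appended, which is exactly the duplicate-prevention behaviour the article describes.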