What’s the role of SQL in machine learning?
In this article, we’re going to answer that question. We’ll see where and how SQL is useful for collecting data, cleansing it, validating it, and then how it helps your models use the data for training. Let’s get started.
SQL in Machine Learning
Why Use SQL in Machine Learning?
Machine learning doesn’t happen without data. You need it to train your algorithms, and the more data, the better. So, you need a place to store large quantities of information, and a way to get it into your algorithms so they can analyze and learn from it. That’s where SQL comes in. We don’t need to define what SQL is or review its capabilities, but it is worth looking at where it comes from and how we use it. At its core, SQL is how we query data.
It’s been around since the early 1970s and has weathered many attempts to replace it with something better. Vendors have extended, improved, and modified SQL over the years, and it remains one of the most common languages for manipulating and filtering large data sets. SQL was created for database servers, and that’s where you can still find it. But SQL has spread far and wide beyond relational database servers.
You can use it to query Google Sheets. You’ll even find SQL-like grammars in so-called NoSQL databases like Couchbase and time-series stores like InfluxDB. There are good reasons why SQL is so pervasive. A committee formed by the American National Standards Institute (ANSI) published a standard in 1986. It’s been reviewed and revised a few times since then, but that standard made it possible for so many languages and platforms to adopt interfaces to SQL data stores. So, the connection between ML and SQL is data. ML development requires large amounts of data, and SQL simplifies organizing, retrieving, and filtering that data.
Is Learning SQL Necessary?
It’s safe to say that if part of your job is accessing data, you’ll eventually need to learn SQL. It’s a valuable tool for structuring data and training models. One of the early steps in any machine learning project is data acquisition. Depending on the project, this may mean building a new data set or finding an existing one. Many vendors use relational models for their data sets, so the easiest way to query this data is with SQL. You may even need to write some SQL scripts to restructure the data for your model. If you need to assemble a new data set, you’ll need to structure it, too.
As we discussed above, even if you’re not using a SQL database, there’s a good chance that you’ll use a query language that is “SQL like.” Later, when you’re ready to test and train your model, you need to access that data. You need to get it into your model. This will require something that can query, filter, and shape the data for your ML code. You can write your own code to do this, or you can use tools that already exist, and focus on your model instead. SQL is available from most programming languages, including Python and R, and excels at filtering and formatting data for consumption.
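For instance, Python ships with SQLite support in its standard library, so you can run a SQL query without installing anything. Here’s a minimal sketch (the table and column names are just placeholders, not from any real data set):

```python
import sqlite3

# Create an in-memory database and a small table to query.
conn = sqlite3.connect(":memory:")
conn.execute("create table measurements (name TEXT, value REAL)")
conn.executemany(
    "insert into measurements values (?, ?)",
    [("a", 0.2), ("b", 1.4), ("c", 0.2)],
)

# SQL does the filtering; Python just consumes the matching rows.
rows = conn.execute(
    "select name from measurements where value = 0.2"
).fetchall()
print(rows)  # [('a',), ('c',)]
```

The same pattern works against a full database server; only the connection line changes.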
That said, SQL isn’t the only option for querying and manipulating data. Some NoSQL databases, like MongoDB and Redis, have their own query languages. These languages may be better suited, or simply good enough, for your project. Or, you may want to avoid writing SQL altogether and access a relational database through an object-relational mapper (ORM), like the one built into Django, that does the work for you.
The Role of SQL in Machine Learning
Next, let’s look at the typical stages of a machine learning pipeline and see where SQL fits in.
Data Collection and Cleaning
We’ve already spent a lot of time talking about how SQL fits into data acquisition. Many datasets are already in SQL databases or SQL-friendly formats like SQLite files. For example, the popular Kaggle site has nearly 85,000 datasets in CSV and another 200 in SQLite or BigQuery formats. Once you load a dataset into SQL, cleaning it is easy. For example, let’s consider the popular Iris dataset in the UCI Repository. Its records are comma-separated values with four measurements and a class label, like this:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor

With column headings of sepal_length, sepal_width, petal_length, petal_width, and class, they map into a single SQL table. If we wanted to remove the data points that have a petal_width of 0.2, we only need a single line of SQL code:
delete from iris where petal_width = 0.2
The SQL database does the work of visiting each record and deleting the right ones for us.
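To see this in action without a full database server, here’s a sketch using Python’s built-in sqlite3 module and a toy iris table with just the two columns the query needs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table iris (petal_width REAL, class TEXT)")
conn.executemany(
    "insert into iris values (?, ?)",
    [(0.2, "Iris-setosa"), (1.4, "Iris-versicolor"), (0.2, "Iris-setosa")],
)

# One statement removes every matching record.
cursor = conn.execute("delete from iris where petal_width = 0.2")
print(cursor.rowcount)  # 2 rows deleted

remaining = conn.execute("select count(*) from iris").fetchone()[0]
print(remaining)  # 1 row left
```

The database engine scans and deletes; our code never loops over individual records.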
SQL is useful for data validation, too. In the previous step, we saw how a query can find unwanted data and remove it. A similar query can also find records that match the requirements for your model. You can also use constraints and let the database validate your data for you. For example, suppose we’ve found an additional source of iris data but noticed that it sometimes lacks petal width values, which would break our model. We can create a SQL table that won’t accept those records:
create table iris_dg_tmp (
    "index" INTEGER,
    sepal_length REAL,
    sepal_width REAL,
    petal_length REAL,
    petal_width REAL not null,
    class TEXT
);
The not null constraint means that we can upload new data files and let the database do the validation for us.
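Here’s a sketch of that validation in action with sqlite3: a complete record inserts cleanly, while a record with a missing petal_width is rejected by the database before it can reach our model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'create table iris_dg_tmp ( "index" INTEGER, sepal_length REAL, '
    'sepal_width REAL, petal_length REAL, petal_width REAL not null, class TEXT )'
)

# A complete record inserts cleanly.
conn.execute("insert into iris_dg_tmp values (0, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa')")

# A record missing petal_width violates the not null constraint,
# so the database raises an error instead of accepting bad data.
try:
    conn.execute("insert into iris_dg_tmp values (1, 4.9, 3.0, 1.4, NULL, 'Iris-setosa')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True: only the valid record made it into the table
```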
Training, Validating, and Optimizing the Model
With your data collected, cleaned, and validated, it’s time to run the model. We already discussed how SQL is integral to model code above: you have many options for accessing SQL data directly in your model code. We’ll look at how easy it is next.
What Database Is Used in Machine Learning?
This is a common question, and it’s hard to answer. Not because there aren’t any databases used in ML, but because you can use nearly any SQL database for machine learning.
Accessing SQL With Pandas
As we discussed above, the primary role of SQL in machine learning is helping you get to your data, and there are several ways to do that from Python. One is built right into Pandas, the Python data analysis and manipulation tool. Pandas has its own input/output library, and one of the channels it supports is SQL.
Let’s load the Iris data set into SQLite with Pandas, and then query it the same way. First, after you download the data set, you’ll need to add the column headings to the first line of the iris.data file as shown above. Then, you can run this script:
import pandas as pd
from sqlite3 import connect

if __name__ == "__main__":
    conn = connect("iris.db")
    data = pd.read_csv("iris.data")
    data.to_sql("iris", conn)
    df = pd.read_sql("select petal_width FROM iris where petal_width = 0.2", conn)
    print(df)
Here is the output:
(trimmed for length)
In just a few lines of code, we:
- Connected to SQLite. We asked for iris.db, which didn’t exist. So, SQLite created it.
- Read the contents of the Iris dataset into a Pandas DataFrame.
- Inserted the DataFrame into SQLite. We now have a SQL table with the contents of the data set!
- Queried for data points with a petal_width of 0.2.
If you try to run this again, it will fail: to_sql raises an error because the iris table already exists (unless you pass it if_exists="replace"). But if you comment out the read_csv and to_sql lines, you can play with different queries. If our model wants petal_widths that are greater than 0.2 and we don’t want to drop the other rows from the database, we can simply adjust the query:
conn = connect("iris.db")
# data = pd.read_csv("iris.data")
# data.to_sql("iris", conn)
df = pd.read_sql("select * FROM iris where petal_width > 0.2", conn)
print(df)
This example uses SQLite, but it would work with any SQL database with SQLAlchemy support. So, you can see that accessing SQL data in your models is very simple and also very powerful.
Using SQL in Machine Learning
We’ve looked at how SQL fits into machine learning and saw how it’s useful at every stage of an ML pipeline. Many datasets are relational, and SQL makes it easy to load, clean, validate, and retrieve them. Whether you’re standing up a new learning pipeline or working on ETL for an existing system, SQL is a useful tool and a key part of machine learning. So, get started with SQL data in your ML models today!
This post was written by Eric Goebelbecker. Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).