Using Reddit data with SQL Server and Python

  1. You’ve installed SQL Server with Python
  2. You’ve then installed pip
  3. Then you used pip to install PRAW
  4. You’ve also installed Pandas using pip
  5. You’ve created your Reddit API
  6. And you’ve got a working connection to Reddit

Now let’s actually gather this data and turn it into something useful inside SQL Server.

We’re going to build on our previous steps and create a Stored Procedure that we can simply execute from wherever we want and it will start populating data.

In previous steps we’ve only taken data from one subreddit but that’s a bit boring. Let’s make a list of subreddits that can be used by our SP.

We’re creating new table called py_SubredditList and inserted a list of our choosing. The list above are fairly good subreddits to get large blocks of text from. Feel free to change this list above to your favourite subreddits.

We’re going to grab one subreddit at a time and use it

Let’s start working on our Python code. First thing we’ll need is somewhere to push all of our data. Python uses things called dictionaries so we’ll make one called topics_dict;

Let’s dump our data into here

Dictionaries aren’t easy for us to interpret so let’s create a data frame using Pandas;

This data frame is what we’re going to return from our python block;

We’re going to output this data set into a temporary table;

Then we’re going to execute the python script and output the data into that temp table

And we’re going to increment our subreddit Hits by one so we’re not hitting the same subreddit all the time.

We’re going to create a permanent table to hold this data for later use

Finally we’ll put the data from our temp table into the permanent table

Putting together all of these elements you’ll come out with something like this;

FOR EVERYBODY WHO JUST WANTS THE FINISHED SP, IT’S HERE!

Installing this SP isn’t going to do anything until you execute it. Go on, live life on the edge and try it. Snoo agrees you should.

Then check out the py_RedditData table and you will have something like this;

Included you’ll be able to see your data lengths so you can filter out if you want. The DataType field is in there for you to experiment. We’re only pulling title and body from these submissions but you can also pull fields such as url (the post URL), score (int) and created (datetime). Check out the PRAW documentation for all available fields.

You can call this SP however you like. My preference is to call it once per minute from an agent job and leave it overnight so the Reddit API call limit doesn’t get hit but you can do it however you want.

That’s it. You’re done. Go make a cuppa.

Connect to Reddit from SQL Server

  1. You’ve installed SQL Server with Python
  2. You’ve then installed pip
  3. Then you used pip to install PRAW
  4. You’ve also installed Pandas using pip
  5. You’ve created your Reddit API

Fantastic, we’re about ready to connect to Reddit from within SQL Server!

Let’s crank open SSMS and start building our query. In a new query window we’ll start building our query and test it along the way.

To execute our script we’re going to use dynamic SQL to make it easier to read. Let’s get started setting down some foundations for our script.

The first line is the ‘shebang’ line and just allows the system to find Python correctly, we won’t be changing this at all. We’re also loading PRAW and Pandas (that we installed previously) so we can use them later.

Feel free to run this script as it is, you won’t get any results at this point but you should also not see any error messages

We’re going to build our connection string to Reddit next.

We’re creating a variable (redditConnect) and declaring the values we need. Use your own connection details we made when creating our Reddit API.

We then need to decide which Subreddit we’re going to gather data from. I’m going to use AskReddit as it’s one of the largest text-only subreddits and will be perfect for what we’re trying to do here.

We also need to decide what order we’re going to put the data into. We can sort by things like new, top, controversial etc. For this let’s grab ‘new’

subredditOrder = subreddit.new()

We can then test our connection is working by printing the first row of our data

Our whole block of code is going to look something like this

Go ahead and run this script (with your own info in there) and you should see a print out of the latest submission to your chosen Subreddit. Something like this.

If this doesn’t work for you then you may have to allow this connection through your firewall. In my example I disabled Windows Firewall but your method will be different depending on what you use for a firewall.

Congratulations. Next step is to actually do something with this data.

.

Create API Access to Reddit

  1. You’ve installed SQL Server with Python
  2. You’ve then installed pip
  3. Then you used pip to install PRAW
  4. You then installed Pandas with pip

The last thing we need to do before connecting to Reddit is to create our API access. It’s a simple process, let’s get it done.

First of all, you need to create a Reddit account. Go to the following link to create your user. Note down your username and password, we’ll need those later.

https://www.reddit.com/register

Once you have a reddit username set up make sure you’re logged in and go to the apps page;

https://www.reddit.com/prefs/apps

We’re going to create our app, click the nice big button.

Choose a name for your application then select the radio button to create a script. For the redirect URL you’ll want to enter http://localhost:8080


Once you’ve done these three things, click ‘create app’. You’ll be taken to the screen with some useful information you’ll need.

From this screen you’ll need to note your ‘personal use script; as well as the ‘secret’.

At this point, you should have 5 pieces of information;

  1. Reddit Username
  2. Reddit Password
  3. App Name
  4. Personal Use Script (14 characters)
  5. Secret (27 characters)

Once you’ve got all of these we can use these to connect to Reddit from SQL Server (with Python).

Let’s go see how we can do this.

Installing Pandas with SQL Server and Python

  1. You’ve installed SQL Server with Python
  2. You’ve then installed pip
  3. Then you used pip to install PRAW

We’re going to use the Pandas plugin to format our data to make it easier to read. If you’ve already used pip to install PRAW then this is going to be very easy. Let’s get started.

First of all we need to find our Python scripts folder. My default on SQL 2019 is here.

C:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019PYTHON\PYTHON_SERVICES\Scripts

Once here, open the command line to this location and run

This is going to do the heavy lifting for you. You’ll just sit there and watch it install.

Once the script has finished you’ll see a screen like this.

To double check this has worked you can run the following command within SQL Server

If you see an error then you’ll have to try and reinstall Pandas. If not, congratulations, you’ve done it!

Install PRAW with SQL Server and Python

So you’ve installed SQL Server with Python and you’ve also installed pip onto your SQL/Python Instance? Great! Let’s go ahead and get the PRAW package installed.

PRAW stands for Python Reddit API Wrapper and it can be used for creating bots as well as scraping data directly from Reddit. We’ll be using pip to install as it’s the recommended method of installing PRAW.

First of all, find your Python Services Scripts folder. On a standard installation of SQL Server 2019 it’s in this location

C:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019PYTHON\PYTHON_SERVICES\Scripts

Open the command prompt at this file location and run

It’s going to go ahead and download/install PRAW for you

To check PRAW is installed go ahead and run the following command

If you do not have PRAW installed you’ll get an error message similar to this. Notice the line ImportError: No module named ‘praw’.

Whereas with a successful install you won’t see the error message

Congrats, you’ve installed PRAW. Well done!

Installing pip on SQL Server with Python

One of the easiest ways of getting packages into your Python install is to install pip.

First thing you’ll want to do is to make sure that you already have SQL Server installed with Python (check my post about that here). Once you’ve done that we’re ready to begin.

Let’s go ahead and download the pip file. Go here https://bootstrap.pypa.io/get-pip.py and save the page to your local machine. I’m going to save it to my Python folder.

Because I’ve used SQL Server 2019 (named SQL2019PYTHON) the location is below. You will want to find your version of this folder. Go ahead and make sure your get-pip.py file is in here.

C:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019PYTHON\PYTHON_SERVICES\

To test you’ve got the right folder, open the command prompt for this location and type ‘python’. You should see what version of python is installed on your machine.

Once you’re sure you’ve got the right folder you want to go ahead and run

This is going to go ahead and install pip on your server.

To confirm your pip has installed correctly you’ll want to go into the Scripts sub folder

C:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019PYTHON\PYTHON_SERVICES\Scripts

And in the command line run the following command

You should see the specific version you have running.

And that’s it. You’ve got pip installed, you can now use this to get the packages you want.

Installing Python with SQL Server

First things first, you’ll want to choose your version of SQL Server. Python is available on 2017 and greater. For this demo I’ll be using SQL Server 2019 Developer Edition (CTP 2.2).

With 2019 CTP2.2 they’ve increased the requirement of your OS too, in my example I had a spare VM with Windows Server 2012 laying around but I needed to update this to Server 2016. Check the relevant documentation for the version you’re using.

Once you’ve got your install file ready, go ahead and get started with the installation.

When selecting your installation type, you want to choose a custom install

Give it somewhere to save your media

After selecting Install it’s going to go ahead and download your initial installation files. This should only take a couple of minutes based upon your download speeds.

Once downloaded feel free to read through the documentation.

Finished? Great. Now we can go ahead and get SQL Installed. We’re going to go to the installation tab and choose a fresh install.

Because this instance is not going to be doing any sort of production workload we’re going to go ahead and choose Developer Edition. This edition is free and comes with all enterprise features but is only for development work (check your licence terms if needed)

Once you’ve read the T&C’s you can decide to allow SQL to use Microsoft Update to check for patches. This is totally your call.

The installation is going to check any inconsistencies against it’s internal rule list. You’ll have to fix any failures here before moving on. I’ve just got a warning about Windows Firewall but I’m going to leave that for now

The next step we’re going to choose here is to install the database engine but also choose to install Machine Learning Services with the Python option.

Note: you can install Machine Learning Server as a standalone feature. We do not want to do this.

I’ve left the instance directories as standard. Feel free to change these if you would prefer them elsewhere.

Your next choice will be leaving as the default instance or giving your instance a name. I’m going name mine and call it SQL2019Python.

Next, choose what service accounts you want to boot at startup. I like agent starting so I’ll set that to automatic.

Notice the little check box underneath that talks about Instant File Initialisation. You want this ticked. If you’re not sure what it does then just trust me, you’ll thank me later.

Next, you’ll choose your Authentication modes and give your SA account it’s password. I’d also suggest clicking the “Add Current User” button just in case.

Next, read and consent to the Python Install.

You’ll see a summary of your installation options. Have a read and make sure you’re happy with everything. Once you are then go ahead and click Install.

Download is going to depend on your download speeds again so go make a cuppa while you wait, you deserve it.

Eventually you’ll be greeted by a lovely screen that tells you that everything is installed.

The first way to check that your instance is up and running is to go to your Services app and check that you have your SQL instance running.

Congratulations, SQL and Python are installed!

To double check that Python is working you’ll want to open SQL Server Management Studio (SSMS) (if you don’t have it check this post out) and connect to your instance.

Once connected, run the following script;

By default, this option is not enabled. To allow it you’ll need to run the following command.

Now run the first script and you’ll see that it’s now enabled.

Now to confirm Python is working go ahead and run a very simple Python script;

You’ll see the output in the messages tab

Congratulations, you have SQL Server installed with Python and it works!

Installing SQL Server Management Studio (SSMS)

SSMS is the way that most people query their SQL Server instances so let’s go ahead and show how to install it.

First of all, head over to Microsoft’s download page for the latest version of SSMS and go ahead and download the latest version.

Once the file is downloaded go ahead and put it on the machine that you’ll be installing SSMS (this doesn’t have to be your SQL Server, any machine that has network connectivity to that machine is totally fine).

Open the file and you will be met with the splash screen.

You should obviously read the licence terms. Once you’ve done this go ahead and start the installation.

It’s going to do a lot of work without your input, just sit and wait through the process.

Once it’s done it’s thing you’ll need a quick reboot

Once you’re restarted you’re good, it’s all installed and ready to use.

Query Store Missing Intervals

This was an interesting one to narrow down.

When messing about (developing) things that are based on the query store I came across an unusual phenomenon. We were missing query store intervals. It’s not that data had been deleted but rather than it was never created in the first place. Take a look at this;

 

I’d set the query store to 5 minute intervals for testing purposes and left it. Notice that there is no data between 15:20 and 15:25. It’s not that the data has been deleted, we know that because the runtime_stats_interval_id is still creating increments correctly.

Turns out, that if there is no activity on your user database for any given interval then the Query Store will just decide that it doesn’t need to capture anything at all for that interval.

This makes sense from a Query Store perspective, but, just don’t rely on those intervals being there if you’re joining this data to other performance metrics like I was…

Presenting at SQL Saturday Manchester

So, some excellent and surprising news, I’ve just been chosen to speak at SQL Saturday 730 in Manchester this July. Now, it wasn’t entirely a surprise as I designed and submitted three presentations (very deliberately) to their website before the deadline. I suppose you could say that it’s at least partially my own fault.

It was an interesting thing to find out about as a relatively new speaker in the community. First of all, there were 164 sessions submitted for a total of 30 available slots which doesn’t put the odds in the favour of an unknown name in the community. This does say a lot for the selection process for SQL Saturdays that they take the merits of the presentation rather than the reputation of the speaker. Great news for people who are considering embarking on a speaking career in the SQL Server community.

This presentation is one I’ve given before so I’m pretty comfortable with it, I’ve tweaked it since I gave it last but there’s a good foundation there. I’ve moved it over to the SQL Saturday powerpoint template and made sure that my demos are up to date. That’s pretty much the hard work done.

I’ll post updates before the day if I have any, otherwise I’ll make sure to give an update after it’s done and my heart rate has returned to normal.

Now to do the usual things, check floor plans for the venue and try to work out which room you’re going to be presenting in, how many people it seats and how empty it’ll be on the day…