- You’ve installed SQL Server with Python
- You’ve then installed pip
- Then you used pip to install PRAW
- You’ve also installed Pandas using pip
- You’ve created your Reddit API
- And you’ve got a working connection to Reddit
Now let’s actually gather this data and turn it into something useful inside SQL Server.
We’re going to build on our previous steps and create a Stored Procedure that we can simply execute from wherever we want and it will start populating data.
In previous steps we’ve only taken data from one subreddit but that’s a bit boring. Let’s make a list of subreddits that can be used by our SP.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'py_SubredditList') BEGIN CREATE TABLE py_SubredditList (Subreddit varchar(50), Hits bigint) END /* We're pulling data from these subreddits, add your own if you want to */ IF OBJECT_ID('tempdb..#SubredditList') IS NOT NULL DROP TABLE #SubredditList CREATE TABLE #SubredditList (Subreddit varchar(50), Hits int) INSERT INTO #SubredditList (Subreddit, Hits) VALUES ('nosleep',0) ,('LegalAdvice',0) ,('WritingPrompts',0) ,('TalesFromTechSupport',0) ,('TalesFromRetail',0) ,('TalesFromYourServer',0) INSERT INTO py_SubredditList (Subreddit, Hits) SELECT t.Subreddit, t.Hits FROM #SubredditList t WHERE NOT EXISTS (SELECT * FROM py_SubredditList sl WHERE t.Subreddit = t.Subreddit) |
We’re creating new table called py_SubredditList and inserted a list of our choosing. The list above are fairly good subreddits to get large blocks of text from. Feel free to change this list above to your favourite subreddits.
We’re going to grab one subreddit at a time and use it
1 2 3 4 |
DECLARE @SubReddit varchar(50) SET @SubReddit = (SELECT TOP 1 sr.Subreddit FROM py_SubredditList sr WHERE sr.Hits = (SELECT MIN(Hits) FROM py_SubredditList)) |
Let’s start working on our Python code. First thing we’ll need is somewhere to push all of our data. Python uses things called dictionaries so we’ll make one called topics_dict;
1 2 3 4 |
topics_dict = { "title":[], "body":[] } |
Let’s dump our data into here
1 2 3 |
for submission in top_subreddit: topics_dict["title"].append(submission.title) topics_dict["body"].append(submission.selftext) |
Dictionaries aren’t easy for us to interpret so let’s create a data frame using Pandas;
1 |
topics_data = pd.DataFrame(topics_dict) |
This data frame is what we’re going to return from our python block;
1 |
OutputDataSet = topics_data |
We’re going to output this data set into a temporary table;
1 2 3 4 5 6 7 8 |
IF OBJECT_ID('tempdb..#RedditData') IS NOT NULL DROP TABLE #RedditData BEGIN CREATE TABLE #RedditData ( [title] varchar(max) ,[body] varchar(max) ) END |
Then we’re going to execute the python script and output the data into that temp table
1 2 |
INSERT INTO #RedditData EXEC sp_execute_external_script @language = N'Python', @script = @sql |
And we’re going to increment our subreddit Hits by one so we’re not hitting the same subreddit all the time.
1 |
UPDATE py_SubredditList SET Hits = Hits + 1 WHERE Subreddit = @SubReddit |
We’re going to create a permanent table to hold this data for later use
1 2 3 4 5 6 7 8 9 10 11 |
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'py_RedditData') BEGIN CREATE TABLE py_RedditData ( [SampleData] varchar(max) ,[Description] varchar(20) ,[DataType] varchar(20) ,[DataLen] int ,[Subreddit] varchar(20) ) END |
Finally we’ll put the data from our temp table into the permanent table
1 2 3 4 |
INSERT INTO py_RedditData ([SampleData], [Description], [DataType], [DataLen], [Subreddit]) SELECT TRIM(title), 'title', 'varchar', LEN(title), @SubReddit FROM #RedditData WHERE title IS NOT NULL UNION ALL SELECT TRIM(body), 'body', 'varchar', LEN(body), @SubReddit FROM #RedditData WHERE body IS NOT NULL |
Putting together all of these elements you’ll come out with something like this;
FOR EVERYBODY WHO JUST WANTS THE FINISHED SP, IT’S HERE!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |
CREATE OR ALTER PROCEDURE [dbo].[sp_PythonTest] AS /* Making a table to store our subreddit list in */ IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'py_SubredditList') BEGIN CREATE TABLE py_SubredditList (Subreddit varchar(50), Hits bigint) END /* We're pulling data from these subreddits, add your own if you want to */ IF OBJECT_ID('tempdb..#SubredditList') IS NOT NULL DROP TABLE #SubredditList CREATE TABLE #SubredditList (Subreddit varchar(50), Hits int) INSERT INTO #SubredditList (Subreddit, Hits) VALUES ('nosleep',0) ,('LegalAdvice',0) ,('WritingPrompts',0) ,('TalesFromTechSupport',0) ,('TalesFromRetail',0) ,('TalesFromYourServer',0) INSERT INTO py_SubredditList (Subreddit, Hits) SELECT t.Subreddit, t.Hits FROM #SubredditList t WHERE NOT EXISTS (SELECT * FROM py_SubredditList sl WHERE t.Subreddit = sl.Subreddit) /* Find the least used subreddit */ DECLARE @SubReddit varchar(50) SET @SubReddit = (SELECT TOP 1 sr.Subreddit FROM py_SubredditList sr WHERE sr.Hits = (SELECT MIN(Hits) FROM py_SubredditList)) /* Here's our block of Python, declaring it dynamically */ DECLARE @sql nvarchar(max) = N'#! usr/bin/env python3 import praw import pandas as pd import datetime as dt redditConnect = praw.Reddit(client_id=''PERSONAL_USE_SCRIPT_14_CHARS'', \ client_secret=''SECRET_KEY_27_CHARS '', \ user_agent=''REDDIT_APP_NAME'', \ username=''REDDIT_USER_NAME'', \ password=''REDDIT_PASSWORD'') subreddit = redditConnect.subreddit(''' + @SubReddit + ''') top_subreddit = subreddit.new(limit=1000) topics_dict = { "title":[], "body":[] } for submission in top_subreddit: topics_dict["title"].append(submission.title) topics_dict["body"].append(submission.selftext) topics_data = pd.DataFrame(topics_dict) OutputDataSet = topics_data ' IF OBJECT_ID('tempdb..#RedditData') IS NOT NULL DROP TABLE #RedditData BEGIN CREATE TABLE #RedditData ( [title] varchar(max) ,[body] varchar(max) ) END INSERT INTO #RedditData EXEC sp_execute_external_script @language = N'Python', @script = @sql /* Increment our subreddit list */ UPDATE py_SubredditList SET Hits = Hits + 1 WHERE Subreddit = @SubReddit IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'py_RedditData') BEGIN CREATE TABLE py_RedditData ( [SampleData] varchar(max) ,[Description] varchar(20) ,[DataType] varchar(20) ,[DataLen] int ,[Subreddit] varchar(20) ) END INSERT INTO py_RedditData ([SampleData], [Description], [DataType], [DataLen], [Subreddit]) SELECT TRIM(title), 'title', 'varchar', LEN(title), @SubReddit FROM #RedditData WHERE title IS NOT NULL UNION ALL SELECT TRIM(body), 'body', 'varchar', LEN(body), @SubReddit FROM #RedditData WHERE body IS NOT NULL GO |
Installing this SP isn’t going to do anything until you execute it. Go on, live life on the edge and try it. Snoo agrees you should.

Then check out the py_RedditData table and you will have something like this;

Included you’ll be able to see your data lengths so you can filter out if you want. The DataType field is in there for you to experiment. We’re only pulling title and body from these submissions but you can also pull fields such as url (the post URL), score (int) and created (datetime). Check out the PRAW documentation for all available fields.
You can call this SP however you like. My preference is to call it once per minute from an agent job and leave it overnight so the Reddit API call limit doesn’t get hit but you can do it however you want.
That’s it. You’re done. Go make a cuppa.