https://opensource.com/article/17/12/python-and-postgresql
Well-designed PostgreSQL queries can replace many lines of Python code, enabling faster, more efficient development.
Python is a programming language that lets you work quickly and integrate systems more effectively, and PostgreSQL is the world's most advanced open source database. The two work very well together. This article describes how to make the most of PostgreSQL and its interactive terminal, psql, when solving a simple problem. As tempting as it is to throw Python code at a problem, it's not always the best choice.
SQL comes with quite a lot of processing power, and integrating SQL into your workflow often means writing fewer lines of code. As Edsger Dijkstra said, lines of code are lines spent:
The practice is pervaded by the reassuring illusion that programs are just devices like any others, the only difference admitted being that their manufacture might require a new type of craftsmen, viz. programmers. From there it is only a small step to measuring "programmer productivity" in terms of "number of lines of code produced per month." This is a very costly measuring unit because it encourages the writing of insipid code, but today I am less interested in how foolish a unit it is from even a pure business point of view. My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.
— Dijkstra, "On the cruelty of really teaching computing science"

By using SQL, you write less code, so you can write applications in less time.
A simple use case
To examine how PostgreSQL and Python work well together, we'll use the New York Stock Exchange (NYSE) "Daily NYSE Group Volume in NYSE Listed" dataset. To download the data, go to the Facts & Figures Interactive Viewer, click on Market Activity, then click on Daily NYSE Group Volume in NYSE Listed. Then click on the "Excel" symbol at the top (the file is actually tab-separated values) and save it to your computer as factbook.csv, open it and remove the headings, then load it into a PostgreSQL table.

Loading the dataset
Here's what the data looks like. It includes comma-separated thousands and dollar signs, so we can't readily process the figures as numbers:

2010   1/4/2010   1,425,504,460   4,628,115   $38,495,460,645
2010   1/5/2010   1,754,011,750   5,394,016   $43,932,043,406
2010   1/6/2010   1,655,507,953   5,494,460   $43,816,749,660
2010   1/7/2010   1,797,810,789   5,674,297   $44,104,237,184

To change this, we can create an ad-hoc table definition, and once the data is loaded, transform it into proper SQL data types, thanks to alter table commands:

begin;

create table factbook
 (
   year    int,
   date    date,
   shares  text,
   trades  text,
   dollars text
 );

\copy factbook from 'factbook.csv' with delimiter E'\t' null ''

alter table factbook
   alter shares
    type bigint
   using replace(shares, ',', '')::bigint,

   alter trades
    type bigint
   using replace(trades, ',', '')::bigint,

   alter dollars
    type bigint
   using substring(replace(dollars, ',', '') from 2)::numeric;

commit;

We use PostgreSQL's copy functionality to stream the data from the CSV file into our table. The \copy variant is a psql-specific command that initiates client/server streaming of the data, reading a local file and sending its contents through any established PostgreSQL connection.
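If you ever need to do the same cleanup on the application side, the transformation performed by the alter table ... using expressions above can be sketched in a few lines of Python. This is only an illustration; the helper names parse_count and parse_dollars are mine, not from the dataset or the article:

```python
# Sketch of the same cleanup in Python: str.replace(",", "") mirrors the
# SQL replace(col, ',', ''), and dropping the first character mirrors
# substring(... from 2), which strips the leading '$'.
def parse_count(text):
    """'1,425,504,460' -> 1425504460"""
    return int(text.replace(",", ""))

def parse_dollars(text):
    """'$38,495,460,645' -> 38495460645"""
    return int(text.replace(",", "")[1:])

row = ("2010", "1/4/2010", "1,425,504,460", "4,628,115", "$38,495,460,645")
year, day, shares, trades, dollars = row
print(parse_count(shares), parse_count(trades), parse_dollars(dollars))
# -> 1425504460 4628115 38495460645
```

Doing the conversion once at load time, as the SQL above does, keeps every later query free of string munging.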
Application code and SQL
There is a lot of data in this file, so let's use the data from February 2017 for this example. The following query lists all entries in the month of February 2017:

\set start '2017-02-01'

  select date,
         to_char(shares, '99G999G999G999') as shares,
         to_char(trades, '99G999G999') as trades,
         to_char(dollars, 'L99G999G999G999') as dollars
    from factbook
   where date >= date :'start'
     and date  < date :'start' + interval '1 month'
order by date;

We use the psql application to run this query, and psql supports the use of variables. The \set command assigns the value '2017-02-01' to the variable start, and then we can reuse the variable with the expression :'start'.

Writing date :'start' is equivalent to date '2017-02-01'; this is called a decorated literal expression in PostgreSQL. It allows us to set the data type of the literal value so that the PostgreSQL query parser won't have to guess or infer it from the context.

This SQL query also uses the interval data type to compute the end of the month, which is, of course, the last day of February in our example. Adding an interval value of 1 month to the first day of the month gives us the first day of the next month, so we use the "less than" (<) strict operator to exclude this day from our result set.

The to_char() function (documented in the PostgreSQL docs section on Data Type Formatting Functions) converts a number to its text representation with detailed control over the conversion. The format is composed of template patterns. We use the following patterns:

- 9: value with the specified number of digits
- L: currency symbol (uses locale)
- G: group separator (uses locale)
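As a point of comparison, Python's format mini-language offers a rough equivalent of the G pattern through the ',' grouping option. Unlike to_char(), it is not locale-aware unless you also reach for the locale module, and the currency symbol has to be added by hand; the figures below are taken from the February data:

```python
shares = 1161001502
dollars = 44660060305

# ',' is Python's grouping option, similar to the G template pattern;
# the '$' must be written manually, while to_char's L pattern would
# pick the currency symbol up from the locale.
print(f"{shares:,}")     # 1,161,001,502
print(f"$ {dollars:,}")  # $ 44,660,060,305
```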
Here's the result of our query:

       date │          shares │      trades │          dollars
════════════╪═════════════════╪═════════════╪══════════════════
 2017-02-01 │   1,161,001,502 │   5,217,859 │ $ 44,660,060,305
 2017-02-02 │   1,128,144,760 │   4,586,343 │ $ 43,276,102,903
 2017-02-03 │   1,084,735,476 │   4,396,485 │ $ 42,801,562,275
 2017-02-06 │     954,533,086 │   3,817,270 │ $ 37,300,908,120
 2017-02-07 │   1,037,660,897 │   4,220,252 │ $ 39,754,062,721
 2017-02-08 │   1,100,076,176 │   4,410,966 │ $ 40,491,648,732
 2017-02-09 │   1,081,638,761 │   4,462,009 │ $ 40,169,585,511
 2017-02-10 │   1,021,379,481 │   4,028,745 │ $ 38,347,515,768
 2017-02-13 │   1,020,482,007 │   3,963,509 │ $ 38,745,317,913
 2017-02-14 │   1,041,009,698 │   4,299,974 │ $ 40,737,106,101
 2017-02-15 │   1,120,119,333 │   4,424,251 │ $ 43,802,653,477
 2017-02-16 │   1,091,339,672 │   4,461,548 │ $ 41,956,691,405
 2017-02-17 │   1,160,693,221 │   4,132,233 │ $ 48,862,504,551
 2017-02-21 │   1,103,777,644 │   4,323,282 │ $ 44,416,927,777
 2017-02-22 │   1,064,236,648 │   4,169,982 │ $ 41,137,731,714
 2017-02-23 │   1,192,772,644 │   4,839,887 │ $ 44,254,446,593
 2017-02-24 │   1,187,320,171 │   4,656,770 │ $ 45,229,398,830
 2017-02-27 │   1,132,693,382 │   4,243,911 │ $ 43,613,734,358
 2017-02-28 │   1,455,597,403 │   4,789,769 │ $ 57,874,495,227
(19 rows)

The dataset has data for only 19 days in February 2017 (the days the NYSE was open). What if we want to display an entry for each calendar day and fill in the missing dates with either matching data or a zero figure?
Here's a typical Python implementation of that:
In this implementation, we use the above SQL query to fetch our result set and store it in a dictionary. The dict's key is the day of the month, so we can then loop over a calendar's list of days, retrieve matching data when we have it, and install a default result set (e.g., zeroes) when we don't have any data.

#! /usr/bin/env python3
import sys
import psycopg2
import psycopg2.extras
from calendar import Calendar

CONNSTRING = "dbname=yesql application_name=factbook"


def fetch_month_data(year, month):
    "Fetch a month of data from the database"
    date = "%d-%02d-01" % (year, month)
    sql = """
  select date, shares, trades, dollars
    from factbook
   where date >= date %s
     and date  < date %s + interval '1 month'
order by date;
"""
    pgconn = psycopg2.connect(CONNSTRING)
    curs = pgconn.cursor()
    curs.execute(sql, (date, date))

    res = {}
    for (date, shares, trades, dollars) in curs.fetchall():
        res[date] = (shares, trades, dollars)
    return res


def list_book_for_month(year, month):
    """List all days for given month, and for each
    day list fact book entry.
    """
    data = fetch_month_data(year, month)
    cal = Calendar()
    print("%12s | %12s | %12s | %12s" %
          ("day", "shares", "trades", "dollars"))
    print("%12s-+-%12s-+-%12s-+-%12s" %
          ("-" * 12, "-" * 12, "-" * 12, "-" * 12))

    for day in cal.itermonthdates(year, month):
        if day.month != month:
            continue
        if day in data:
            shares, trades, dollars = data[day]
        else:
            shares, trades, dollars = 0, 0, 0
        print("%12s | %12s | %12s | %12s" %
              (day, shares, trades, dollars))


if __name__ == '__main__':
    year = int(sys.argv[1])
    month = int(sys.argv[2])
    list_book_for_month(year, month)
Below is the output when running the program. As you can see, we opted for an output similar to the psql output, which makes it easier to compare the effort needed to reach the same result.
$ ./factbook-month.py 2017 2
         day |       shares |       trades |      dollars
-------------+--------------+--------------+-------------
  2017-02-01 |   1161001502 |      5217859 |  44660060305
  2017-02-02 |   1128144760 |      4586343 |  43276102903
  2017-02-03 |   1084735476 |      4396485 |  42801562275
  2017-02-04 |            0 |            0 |            0
  2017-02-05 |            0 |            0 |            0
  2017-02-06 |    954533086 |      3817270 |  37300908120
  2017-02-07 |   1037660897 |      4220252 |  39754062721
  2017-02-08 |   1100076176 |      4410966 |  40491648732
  2017-02-09 |   1081638761 |      4462009 |  40169585511
  2017-02-10 |   1021379481 |      4028745 |  38347515768
  2017-02-11 |            0 |            0 |            0
  2017-02-12 |            0 |            0 |            0
  2017-02-13 |   1020482007 |      3963509 |  38745317913
  2017-02-14 |   1041009698 |      4299974 |  40737106101
  2017-02-15 |   1120119333 |      4424251 |  43802653477
  2017-02-16 |   1091339672 |      4461548 |  41956691405
  2017-02-17 |   1160693221 |      4132233 |  48862504551
  2017-02-18 |            0 |            0 |            0
  2017-02-19 |            0 |            0 |            0
  2017-02-20 |            0 |            0 |            0
  2017-02-21 |   1103777644 |      4323282 |  44416927777
  2017-02-22 |   1064236648 |      4169982 |  41137731714
  2017-02-23 |   1192772644 |      4839887 |  44254446593
  2017-02-24 |   1187320171 |      4656770 |  45229398830
  2017-02-25 |            0 |            0 |            0
  2017-02-26 |            0 |            0 |            0
  2017-02-27 |   1132693382 |      4243911 |  43613734358
  2017-02-28 |   1455597403 |      4789769 |  57874495227
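One detail of the Python version worth calling out: Calendar.itermonthdates() yields complete weeks, so it also returns a few days belonging to the neighboring months. That's why the loop skips any day whose month field doesn't match. A quick demonstration for February 2017:

```python
from calendar import Calendar

cal = Calendar()  # weeks start on Monday by default
days = list(cal.itermonthdates(2017, 2))

# The iteration starts on the Monday on or before February 1 and ends
# on the Sunday on or after February 28, so January and March leak in:
print(days[0], days[-1])  # 2017-01-30 2017-03-05

# Filtering on the month field keeps only the 28 days of February.
february = [d for d in days if d.month == 2]
print(len(february))      # 28
```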
PostgreSQL advanced functions
The same thing can be accomplished with a single SQL query, without any application code "spent" on solving the problem:

  select cast(calendar.entry as date) as date,
         coalesce(shares, 0) as shares,
         coalesce(trades, 0) as trades,
         to_char(
             coalesce(dollars, 0),
             'L99G999G999G999'
         ) as dollars
    from /*
          * Generate the target month's calendar then LEFT JOIN
          * each day against the factbook dataset, so as to have
          * every day in the result set, whether or not we have a
          * book entry for the day.
          */
         generate_series(date :'start',
                         date :'start' + interval '1 month'
                                       - interval '1 day',
                         interval '1 day'
         )
         as calendar(entry)
         left join factbook
                on factbook.date = calendar.entry
order by date;

In this query, we use several basic SQL and PostgreSQL techniques that might be new to you:
- SQL accepts comments written either in the -- comment style, running from the opening to the end of the line, or C-style /* comment */ blocks. As with any programming language, comments are best used to note intentions, which otherwise might be tricky to reverse engineer from the code alone.
- generate_series() is a PostgreSQL set-returning function, for which the documentation reads: "Generate a series of values, from start to stop with a step size of step." As PostgreSQL knows its calendar, it's easy to generate all days from any given month with the first day of the month as a single parameter in the query. generate_series() is inclusive, much like the BETWEEN operator, so we exclude the first day of the next month with the expression - interval '1 day'.
- The cast(calendar.entry as date) expression transforms the generated calendar.entry, which is the result of the generate_series() function call, into the date data type. We need to use cast because generate_series() returns a set of timestamp entries, which is not relevant to us in this exercise.
- The left join between our generated calendar table and the factbook table keeps every calendar row and associates a factbook row with it only when the date columns of both tables have the same value. When the calendar.date is not found in factbook, the factbook columns (year, date, shares, trades, and dollars) are filled in with NULL values instead.
- coalesce() returns the first of its arguments that is not null. So the expression coalesce(shares, 0) as shares is either how many shares we found in the factbook table for this calendar.date row, or 0 when we found no entry for that calendar.date and the left join kept our result-set row with the factbook columns filled in with NULL values.
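For readers more at home in Python, the generate_series()/left join/coalesce combination can be sketched as follows: an inclusive date-range generator plays the part of generate_series(), and dict.get() with a default plays the part of the left join plus coalesce. The two sample rows are just an illustration, not the full dataset:

```python
from datetime import date, timedelta

def generate_series(start, stop, step=timedelta(days=1)):
    """Inclusive on both ends, like PostgreSQL's generate_series()."""
    current = start
    while current <= stop:
        yield current
        current += step

# Stand-in for the factbook table, keyed by date (only two rows shown).
factbook = {
    date(2017, 2, 1): (1161001502, 5217859, 44660060305),
    date(2017, 2, 2): (1128144760, 4586343, 43276102903),
}

start = date(2017, 2, 1)
stop = date(2017, 3, 1) - timedelta(days=1)  # the "- interval '1 day'" trick

for day in generate_series(start, stop):
    # dict.get() with a default is the left join plus coalesce(..., 0)
    shares, trades, dollars = factbook.get(day, (0, 0, 0))
    print(day, shares, trades, dollars)
```

Of course, the point of the article is that the database can do all of this in a single pass, without any such loop in the application.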
Here's the result:

       date │     shares │  trades │          dollars
════════════╪════════════╪═════════╪══════════════════
 2017-02-01 │ 1161001502 │ 5217859 │ $ 44,660,060,305
 2017-02-02 │ 1128144760 │ 4586343 │ $ 43,276,102,903
 2017-02-03 │ 1084735476 │ 4396485 │ $ 42,801,562,275
 2017-02-04 │          0 │       0 │ $ 0
 2017-02-05 │          0 │       0 │ $ 0
 2017-02-06 │  954533086 │ 3817270 │ $ 37,300,908,120
 2017-02-07 │ 1037660897 │ 4220252 │ $ 39,754,062,721
 2017-02-08 │ 1100076176 │ 4410966 │ $ 40,491,648,732
 2017-02-09 │ 1081638761 │ 4462009 │ $ 40,169,585,511
 2017-02-10 │ 1021379481 │ 4028745 │ $ 38,347,515,768
 2017-02-11 │          0 │       0 │ $ 0
 2017-02-12 │          0 │       0 │ $ 0
 2017-02-13 │ 1020482007 │ 3963509 │ $ 38,745,317,913
 2017-02-14 │ 1041009698 │ 4299974 │ $ 40,737,106,101
 2017-02-15 │ 1120119333 │ 4424251 │ $ 43,802,653,477
 2017-02-16 │ 1091339672 │ 4461548 │ $ 41,956,691,405
 2017-02-17 │ 1160693221 │ 4132233 │ $ 48,862,504,551
 2017-02-18 │          0 │       0 │ $ 0
 2017-02-19 │          0 │       0 │ $ 0
 2017-02-20 │          0 │       0 │ $ 0
 2017-02-21 │ 1103777644 │ 4323282 │ $ 44,416,927,777
 2017-02-22 │ 1064236648 │ 4169982 │ $ 41,137,731,714
 2017-02-23 │ 1192772644 │ 4839887 │ $ 44,254,446,593
 2017-02-24 │ 1187320171 │ 4656770 │ $ 45,229,398,830
 2017-02-25 │          0 │       0 │ $ 0
 2017-02-26 │          0 │       0 │ $ 0
 2017-02-27 │ 1132693382 │ 4243911 │ $ 43,613,734,358
 2017-02-28 │ 1455597403 │ 4789769 │ $ 57,874,495,227
(28 rows)

Note that we replaced 60 lines of Python code with a simple SQL query. Down the road, that means less code to maintain and a more efficient implementation, too. Here, the Python is doing a Hash Join and a Nested Loop, while PostgreSQL picks a Merge Left Join over two ordered relations.

Computing weekly changes
Imagine the analytics department now wants us to provide the weekly difference for each day. This means we need to add a column with the change, calculated as a percentage, of the dollars column between each date and the same day of the previous week.

I'm using the "week-over-week percentage difference" example because it's both a classic analytics need (although maybe mostly in marketing circles) and because (in my experience) a developer's first reaction is rarely to write a SQL query to do all the math.

Also, the calendar isn't very helpful in computing weeks, but for PostgreSQL, this task is as easy as spelling the word week:

with computed_data as
(
  select cast(date as date)   as date,
         to_char(date, 'Dy')  as day,
         coalesce(dollars, 0) as dollars,
         lag(dollars, 1)
           over(
               partition by extract('isodow' from date)
                   order by date
           )
           as last_week_dollars
    from /*
          * Generate the month calendar, plus a week before
          * so that we have values to compare dollars against
          * even for the first week of the month.
          */
         generate_series(date :'start' - interval '1 week',
                         date :'start' + interval '1 month'
                                       - interval '1 day',
                         interval '1 day'
         )
         as calendar(date)
         left join factbook using(date)
)
  select date, day,
         to_char(
             coalesce(dollars, 0),
             'L99G999G999G999'
         ) as dollars,
         case when dollars is not null
               and dollars <> 0
              then round(  100.0
                         * (dollars - last_week_dollars)
                         / dollars
                       , 2)
          end
         as "WoW %"
    from computed_data
   where date >= date :'start'
order by date;

To implement this case in SQL, we need window functions, which appeared in the SQL standard in 1992 but are often skipped in SQL classes. Window functions are among the last things executed in a SQL statement, well after join operations and where clauses. So, if we want to see a full week before the first of February, we need to extend our calendar selection a week into the past and then, once again, restrict the data we issue to the caller.

That's why we use a common table expression, the with part of the query, to fetch the extended data set we need, including the last_week_dollars computed column.

The expression extract('isodow' from date) is a standard SQL feature that computes the day of the week following ISO rules. Used as a partition by frame clause, it allows a row to be a peer to any other row having the same isodow. The lag() window function can then refer to the previous peer's dollars value when ordered by date; that's the number we want to compare to the current dollars value.

The computed_data result set is then used in the main part of the query as a relation we get data from, and the computation is easier this time, as we simply apply a classic difference-percentage formula to the dollars and last_week_dollars columns.

Here's the result from running this query:
       date │ day │          dollars │  WoW %
════════════╪═════╪══════════════════╪════════
 2017-02-01 │ Wed │ $ 44,660,060,305 │  -2.21
 2017-02-02 │ Thu │ $ 43,276,102,903 │   1.71
 2017-02-03 │ Fri │ $ 42,801,562,275 │  10.86
 2017-02-04 │ Sat │ $ 0              │      ¤
 2017-02-05 │ Sun │ $ 0              │      ¤
 2017-02-06 │ Mon │ $ 37,300,908,120 │  -9.64
 2017-02-07 │ Tue │ $ 39,754,062,721 │ -37.41
 2017-02-08 │ Wed │ $ 40,491,648,732 │ -10.29
 2017-02-09 │ Thu │ $ 40,169,585,511 │  -7.73
 2017-02-10 │ Fri │ $ 38,347,515,768 │ -11.61
 2017-02-11 │ Sat │ $ 0              │      ¤
 2017-02-12 │ Sun │ $ 0              │      ¤
 2017-02-13 │ Mon │ $ 38,745,317,913 │   3.73
 2017-02-14 │ Tue │ $ 40,737,106,101 │   2.41
 2017-02-15 │ Wed │ $ 43,802,653,477 │   7.56
 2017-02-16 │ Thu │ $ 41,956,691,405 │   4.26
 2017-02-17 │ Fri │ $ 48,862,504,551 │  21.52
 2017-02-18 │ Sat │ $ 0              │      ¤
 2017-02-19 │ Sun │ $ 0              │      ¤
 2017-02-20 │ Mon │ $ 0              │      ¤
 2017-02-21 │ Tue │ $ 44,416,927,777 │   8.28
 2017-02-22 │ Wed │ $ 41,137,731,714 │  -6.48
 2017-02-23 │ Thu │ $ 44,254,446,593 │   5.19
 2017-02-24 │ Fri │ $ 45,229,398,830 │  -8.03
 2017-02-25 │ Sat │ $ 0              │      ¤
 2017-02-26 │ Sun │ $ 0              │      ¤
 2017-02-27 │ Mon │ $ 43,613,734,358 │      ¤
 2017-02-28 │ Tue │ $ 57,874,495,227 │  23.25
(28 rows)

Have fun writing code, and as SQL is code, have fun writing SQL!
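As a footnote for Python readers, the lag() over (partition by extract('isodow' from date) order by date) trick can be spelled out in plain Python: remember the previously seen dollars figure per ISO weekday, then apply the same difference formula. This is a sketch on two hand-picked rows, not a replacement for the query:

```python
from datetime import date

def week_over_week(rows):
    """rows: (date, dollars) pairs in date order.
    Returns (date, dollars, wow_percent_or_None) tuples,
    with None standing in for SQL's NULL."""
    last_seen = {}  # isodow -> previous dollars for that weekday (the lag)
    result = []
    for d, dollars in rows:
        isodow = d.isoweekday()       # extract('isodow' from date)
        prev = last_seen.get(isodow)  # lag(dollars, 1) within the partition
        if dollars and prev is not None:
            wow = round(100.0 * (dollars - prev) / dollars, 2)
        else:
            wow = None
        result.append((d, dollars, wow))
        last_seen[isodow] = dollars
    return result

rows = [(date(2017, 2, 1), 44660060305),   # a Wednesday
        (date(2017, 2, 8), 40491648732)]   # the next Wednesday
print(week_over_week(rows)[-1][2])         # -10.29, as in the table above
```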
This article is based on an excerpt from Dimitri Fontaine's book Mastering PostgreSQL in Application Development, which explains how to replace thousands of lines of code with simple queries. The book goes into greater detail on these topics and provides many other examples so you can master PostgreSQL and issue the SQL queries that fetch exactly the result set you need.