Thursday, January 18, 2018

How to use PostgreSQL to streamline Python code

https://opensource.com/article/17/12/python-and-postgresql

Well-designed PostgreSQL queries can replace many lines of Python code, enabling faster, more efficient development.

How to use PostgreSQL to streamline Python code
Image by : 
opensource.com
Python is a programming language that lets you work quickly and integrate systems more effectively, and PostgreSQL is the world's most advanced open source database. Those two work very well together. This article describes how to make the most of PostgreSQL (psql) when solving a simple problem. As tempting as it is to throw Python code at a problem, it's not always the best choice.
SQL comes with quite a lot of processing power, and integrating SQL into your workflow often means writing fewer lines of code. As Edsger Dijkstra said, lines of code are lines spent:
The practice is pervaded by the reassuring illusion that programs are just devices like any others, the only difference admitted being that their manufacture might require a new type of craftsmen, viz. programmers. From there it is only a small step to measuring "programmer productivity" in terms of "number of lines of code produced per month." This is a very costly measuring unit because it encourages the writing of insipid code, but today I am less interested in how foolish a unit it is from even a pure business point of view. My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.
       — Dijkstra, "On the cruelty of really teaching computing science."
By using SQL, you write less code, so you can write applications in less time.

A simple use case

To examine how PostgreSQL and Python work well together, we'll use the New York Stock Exchange (NYSE) "Daily NYSE Group Volume in NYSE Listed" dataset. To download the data, go to the Facts & Figures Interactive Viewer, click on Market Activity, then click on Daily NYSE Group Volume in NYSE Listed. Then click on the "Excel" symbol at the top (which is actually a CSV file that uses Tab as a separator) to save the "factbook.xls" file to your computer, open it and remove the headings, then load it into a PostgreSQL table.

Loading the data set

Here's what the data looks like. It includes comma-separated thousands and dollar signs, so we can't readily process the figures as numbers.
2010    1/4/2010        1,425,504,460   4,628,115       $38,495,460,645 2010    1/5/2010        1,754,011,750   5,394,016       $43,932,043,406 2010    1/6/2010        1,655,507,953   5,494,460       $43,816,749,660 2010    1/7/2010        1,797,810,789   5,674,297       $44,104,237,184
To change this, we can create an ad-hoc table definition, and once the data is loaded, it's transformed into a proper SQL data type, thanks to alter table commands.
BEGIN; CREATE TABLE factbook  (    YEAR    INT,    DATE    DATE,    shares  text,    trades  text,    dollars text  ); \copy factbook FROM 'factbook.csv' WITH delimiter E'\t' NULL '' ALTER TABLE factbook    ALTER shares     TYPE BIGINT    USING REPLACE(shares, ',', '')::BIGINT,    ALTER trades     TYPE BIGINT    USING REPLACE(trades, ',', '')::BIGINT,        ALTER dollars     TYPE BIGINT    USING SUBSTRING(REPLACE(dollars, ',', '') FROM 2)::NUMERIC; commit;
We can use PostgreSQL's copy functionality to stream the data from the CSV file into our table. The \copy variant is a psql-specific command and initiates client/server streaming of the data, reading a local file, and sending its contents through any established PostgreSQL connection.

Application code and SQL

There is a lot of data in this file, so let's use the data from February 2017 for this example. The following query lists all entries in the month of February 2017:
\SET START '2017-02-01'   SELECT DATE,          to_char(shares, '99G999G999G999') AS shares,          to_char(trades, '99G999G999') AS trades,          to_char(dollars, 'L99G999G999G999') AS dollars     FROM factbook    WHERE DATE >= DATE :'start'      AND DATE  < DATE :'start' + INTERVAL '1 month' ORDER BY DATE;
We use the psql application to run this query, and psql supports the use of variables. The \set command sets the '2017-02-01' value to the variable start, and then we can reuse the variable with the expression :'start'.
Writing date :'start' is equivalent to date '2017-02-01'—this is called a decorated literal expression in PostgreSQL. This allows us to set the data type of the literal value so that the PostgreSQL query parser won't have to guess or infer it from the context.
This SQL query also uses the interval data type to compute the end of the month, which is, of course, the last day of February in our example. Adding an interval value of 1 month to the first day of the month gives us the first day of the next month, so we use the "less than" (<) strict operator to exclude this day from our result set.
The to_char() function (documented in the PostgreSQL docs section on Data Type Formatting Functions) will convert a number to its text representation with detailed control over the conversion. The format is composed of template patterns. We'll use the following patterns:
  • Value with the specified number of digits
  • L: currency symbol (uses locale)
  • G: group separator (uses locale)
Other template patterns for numeric formatting are available—see the PostgreSQL docs for reference.
Here's the result of our query:
    date    │     shares      │   trades    │     dollars       ════════════╪═════════════════╪═════════════╪══════════════════  2017-02-01 │   1,161,001,502 │   5,217,859 │ $ 44,660,060,305  2017-02-02 │   1,128,144,760 │   4,586,343 │ $ 43,276,102,903  2017-02-03 │   1,084,735,476 │   4,396,485 │ $ 42,801,562,275  2017-02-06 │     954,533,086 │   3,817,270 │ $ 37,300,908,120  2017-02-07 │   1,037,660,897 │   4,220,252 │ $ 39,754,062,721  2017-02-08 │   1,100,076,176 │   4,410,966 │ $ 40,491,648,732  2017-02-09 │   1,081,638,761 │   4,462,009 │ $ 40,169,585,511  2017-02-10 │   1,021,379,481 │   4,028,745 │ $ 38,347,515,768  2017-02-13 │   1,020,482,007 │   3,963,509 │ $ 38,745,317,913  2017-02-14 │   1,041,009,698 │   4,299,974 │ $ 40,737,106,101  2017-02-15 │   1,120,119,333 │   4,424,251 │ $ 43,802,653,477  2017-02-16 │   1,091,339,672 │   4,461,548 │ $ 41,956,691,405  2017-02-17 │   1,160,693,221 │   4,132,233 │ $ 48,862,504,551  2017-02-21 │   1,103,777,644 │   4,323,282 │ $ 44,416,927,777  2017-02-22 │   1,064,236,648 │   4,169,982 │ $ 41,137,731,714  2017-02-23 │   1,192,772,644 │   4,839,887 │ $ 44,254,446,593  2017-02-24 │   1,187,320,171 │   4,656,770 │ $ 45,229,398,830  2017-02-27 │   1,132,693,382 │   4,243,911 │ $ 43,613,734,358  2017-02-28 │   1,455,597,403 │   4,789,769 │ $ 57,874,495,227 (19 rows)
The dataset has data for only 19 days in February 2017 (the days the NYSE was open). What if we want to display an entry for each calendar day and fill in the missing dates with either matching data or a zero figure?
Here's a typical Python implementation of that:
#! /usr/bin/env python3 import sys import psycopg2 import psycopg2.extras from calendar import Calendar CONNSTRING = "dbname=yesql application_name=factbook" def fetch_month_data(year, month):     "Fetch a month of data from the database"     date = "%d-%02d-01" % (year, month)     sql = """   select date, shares, trades, dollars     from factbook    where date >= date %s      and date  < date %s + interval '1 month' order by date; """     pgconn = psycopg2.connect(CONNSTRING)     curs = pgconn.cursor()     curs.execute(sql, (date, date))     res = {}     for (date, shares, trades, dollars) in curs.fetchall():         res[date] = (shares, trades, dollars)     return res def list_book_for_month(year, month):     """List all days for given month, and for each     day list fact book entry.     """     data = fetch_month_data(year, month)     cal = Calendar()     print("%12s | %12s | %12s | %12s" %           ("day", "shares", "trades", "dollars"))     print("%12s-+-%12s-+-%12s-+-%12s" %           ("-" * 12, "-" * 12, "-" * 12, "-" * 12))     for day in cal.itermonthdates(year, month):         if day.month != month:             continue         if day in data:             shares, trades, dollars = data[day]         else:             shares, trades, dollars = 0, 0, 0         print("%12s | %12s | %12s | %12s" %               (day, shares, trades, dollars)) if __name__ == '__main__':     year = int(sys.argv[1])     month = int(sys.argv[2])     list_book_for_month(year, month)
In this implementation, we use the above SQL query to fetch our result set and, moreover, to store it in a dictionary. The dict's key is the day of the month, so we can then loop over a calendar's list of days, retrieve matching data when we have it, and install a default result set (e.g., zeroes) when we don't have any data.
Below is the output when running the program. As you can see, we opted for an output similar to the psql output, which makes it easier to compare the effort needed to reach the same result.
$ ./factbook-month.py 2017 2          day |       shares |       trades |      dollars -------------+--------------+--------------+-------------   2017-02-01 |   1161001502 |      5217859 |  44660060305   2017-02-02 |   1128144760 |      4586343 |  43276102903   2017-02-03 |   1084735476 |      4396485 |  42801562275   2017-02-04 |            0 |            0 |            0   2017-02-05 |            0 |            0 |            0   2017-02-06 |    954533086 |      3817270 |  37300908120   2017-02-07 |   1037660897 |      4220252 |  39754062721   2017-02-08 |   1100076176 |      4410966 |  40491648732   2017-02-09 |   1081638761 |      4462009 |  40169585511   2017-02-10 |   1021379481 |      4028745 |  38347515768   2017-02-11 |            0 |            0 |            0   2017-02-12 |            0 |            0 |            0   2017-02-13 |   1020482007 |      3963509 |  38745317913   2017-02-14 |   1041009698 |      4299974 |  40737106101   2017-02-15 |   1120119333 |      4424251 |  43802653477   2017-02-16 |   1091339672 |      4461548 |  41956691405   2017-02-17 |   1160693221 |      4132233 |  48862504551   2017-02-18 |            0 |            0 |            0   2017-02-19 |            0 |            0 |            0   2017-02-20 |            0 |            0 |            0   2017-02-21 |   1103777644 |      4323282 |  44416927777   2017-02-22 |   1064236648 |      4169982 |  41137731714   2017-02-23 |   1192772644 |      4839887 |  44254446593   2017-02-24 |   1187320171 |      4656770 |  45229398830   2017-02-25 |            0 |            0 |            0   2017-02-26 |            0 |            0 |            0   2017-02-27 |   1132693382 |      4243911 |  43613734358   2017-02-28 |   1455597403 |      4789769 |  57874495227

PostgreSQL advanced functions

The same thing can be accomplished with a single SQL query, without any application code "spent" on solving the problem:
  SELECT CAST(calendar.entry AS DATE) AS DATE,          COALESCE(shares, 0) AS shares,          COALESCE(trades, 0) AS trades,          to_char(              COALESCE(dollars, 0),              'L99G999G999G999'          ) AS dollars     FROM /*           * Generate the target month's calendar then LEFT JOIN           * each day against the factbook dataset, so as to have           * every day in the result set, whether or not we have a           * book entry for the day.           */          generate_series(DATE :'start',                          DATE :'start' + INTERVAL '1 month'                                        - INTERVAL '1 day',                          INTERVAL '1 day'          )          AS calendar(entry)          LEFT JOIN factbook                 ON factbook.DATE = calendar.entry ORDER BY DATE;
In this query, we use several basic SQL and PostgreSQL techniques that might be new to you:
  • SQL accepts comments written either in the -- comment style, running from the opening to the end of the line, or C-style with a /* comment */ style. As with any programming language, comments are best used to note intentions, which otherwise might be tricky to reverse engineer from the code alone.
  • generate_series() is a PostgreSQL set returning function, for which the documentation reads: "Generate a series of values, from start to stop with a step size of step." As PostgreSQL knows its calendar, it's easy to generate all days from any given month with the first day of the month as a single parameter in the query.
  • generate_series() is inclusive, much like the BETWEEN operator, so we exclude the first day of the next month with the expression - interval '1 day'.
  • The cast(calendar.entry as date) expression transforms the generated calendar.entry, which is the result of the generate_series() function call into the date data type. We need to use cast because the generate_series() function returns a set of timestamp entries, which is not relevant to us in this exercise.
  • The left join between our generated calendar table and the factbook table will keep every calendar row and associate a factbook row with it only when the date columns of both the tables have the same value. When the calendar.date is not found in factbook, the factbook columns (year, date, shares, trades, and dollars) are filled in with NULL values instead.
  • Coalesce returns the first of its arguments that is not null. So the expression coalesce(shares, 0) as shares is either how many shares we found in the factbook table for this calendar.date row, or 0 when we found no entry for the calendar.date. In addition, the left join kept our result set row and filled in the factbook columns with NULL values.
Finally, here's the result of this query:
    date    │   shares   │ trades  │     dollars       ════════════╪════════════╪═════════╪══════════════════  2017-02-01 │ 1161001502 │ 5217859 │ $ 44,660,060,305  2017-02-02 │ 1128144760 │ 4586343 │ $ 43,276,102,903  2017-02-03 │ 1084735476 │ 4396485 │ $ 42,801,562,275  2017-02-04 │          0 │       0 │ $              0  2017-02-05 │          0 │       0 │ $              0  2017-02-06 │  954533086 │ 3817270 │ $ 37,300,908,120  2017-02-07 │ 1037660897 │ 4220252 │ $ 39,754,062,721  2017-02-08 │ 1100076176 │ 4410966 │ $ 40,491,648,732  2017-02-09 │ 1081638761 │ 4462009 │ $ 40,169,585,511  2017-02-10 │ 1021379481 │ 4028745 │ $ 38,347,515,768  2017-02-11 │          0 │       0 │ $              0  2017-02-12 │          0 │       0 │ $              0  2017-02-13 │ 1020482007 │ 3963509 │ $ 38,745,317,913  2017-02-14 │ 1041009698 │ 4299974 │ $ 40,737,106,101  2017-02-15 │ 1120119333 │ 4424251 │ $ 43,802,653,477  2017-02-16 │ 1091339672 │ 4461548 │ $ 41,956,691,405  2017-02-17 │ 1160693221 │ 4132233 │ $ 48,862,504,551  2017-02-18 │          0 │       0 │ $              0  2017-02-19 │          0 │       0 │ $              0  2017-02-20 │          0 │       0 │ $              0  2017-02-21 │ 1103777644 │ 4323282 │ $ 44,416,927,777  2017-02-22 │ 1064236648 │ 4169982 │ $ 41,137,731,714  2017-02-23 │ 1192772644 │ 4839887 │ $ 44,254,446,593  2017-02-24 │ 1187320171 │ 4656770 │ $ 45,229,398,830  2017-02-25 │          0 │       0 │ $              0  2017-02-26 │          0 │       0 │ $              0  2017-02-27 │ 1132693382 │ 4243911 │ $ 43,613,734,358  2017-02-28 │ 1455597403 │ 4789769 │ $ 57,874,495,227 (28 rows)
Note that we replaced 60 lines of Python code with a simple SQL query. Down the road that means less code to maintain and a more efficient implementation, too. Here, the Python is doing a Hash Join Nested Loop while PostgreSQL picks a Merge Left Join over two ordered relations.

Computing weekly changes

Imagine the analytics department now wants us to provide the weekly difference for each day. This means we need to add a column with the change calculated as a percentage of the dollars column between each date and the same day of the previous week.
I'm using the "week-over-week percentage difference" example because it's both a classic analytics need (although maybe mostly in marketing circles), and because (in my experience) a developer's first reaction is rarely to write a SQL query to do all the math.
Also, the calendar isn't very helpful in computing weeks, but for PostgreSQL, this task is as easy as spelling the word week:
WITH computed_data AS (   SELECT CAST(DATE AS DATE)   AS DATE,          to_char(DATE, 'Dy')  AS DAY,          COALESCE(dollars, 0) AS dollars,          lag(dollars, 1)            OVER(              partition BY EXTRACT('isodow' FROM DATE)                  ORDER BY DATE            )          AS last_week_dollars     FROM /*           * Generate the month calendar, plus a week before           * so that we have values to compare dollars against           * even for the first week of the month.           */          generate_series(DATE :'start' - INTERVAL '1 week',                          DATE :'start' + INTERVAL '1 month'                                        - INTERVAL '1 day',                          INTERVAL '1 day'          )          AS calendar(DATE)          LEFT JOIN factbook USING(DATE) )   SELECT DATE, DAY,          to_char(              COALESCE(dollars, 0),              'L99G999G999G999'          ) AS dollars,          CASE WHEN dollars IS NOT NULL                AND dollars <> 0               THEN round(  100.0                          * (dollars - last_week_dollars)                          / dollars                        , 2)           END          AS "WoW %"     FROM computed_data    WHERE DATE >= DATE :'start' ORDER BY DATE;
To implement this case in SQL, we need window functions that appeared in the SQL standard in 1992, but are often skipped in SQL classes. The last things executed in a SQL statement are windows functions, well after join operations and where clauses. So, if we want to see a full week before the first of February, we need to extend our calendar selection a week into the past and then, once again, restrict the data we issue to the caller.
That's why we use a common table expression—the WITH part of the query—to fetch the extended data set we need, including the last_week_dollars computed column.
The expression extract('isodow' from date) is a standard SQL feature that allows computing the day of the week following ISO rules. Used as a partition by frame clause, it allows a row to be a peer to any other row having the same isodow. The lag() window function can then refer to the previous peer dollars value when ordered by date; that's the number that we want to compare to the current dollars value.
The computed_data result set is then used in the main part of the query as a relation we get data from, and the computation is easier this time, as we simply apply a classic difference percentage formula to the dollars and the last_week_dollars columns.
Here's the result from running this query:
    date    │ day │     dollars      │ WoW %   ════════════╪═════╪══════════════════╪════════  2017-02-01 │ Wed │ $ 44,660,060,305 │  -2.21  2017-02-02 │ Thu │ $ 43,276,102,903 │   1.71  2017-02-03 │ Fri │ $ 42,801,562,275 │  10.86  2017-02-04 │ Sat │ $              0 │      ¤  2017-02-05 │ Sun │ $              0 │      ¤  2017-02-06 │ Mon │ $ 37,300,908,120 │  -9.64  2017-02-07 │ Tue │ $ 39,754,062,721 │ -37.41  2017-02-08 │ Wed │ $ 40,491,648,732 │ -10.29  2017-02-09 │ Thu │ $ 40,169,585,511 │  -7.73  2017-02-10 │ Fri │ $ 38,347,515,768 │ -11.61  2017-02-11 │ Sat │ $              0 │      ¤  2017-02-12 │ Sun │ $              0 │      ¤  2017-02-13 │ Mon │ $ 38,745,317,913 │   3.73  2017-02-14 │ Tue │ $ 40,737,106,101 │   2.41  2017-02-15 │ Wed │ $ 43,802,653,477 │   7.56  2017-02-16 │ Thu │ $ 41,956,691,405 │   4.26  2017-02-17 │ Fri │ $ 48,862,504,551 │  21.52  2017-02-18 │ Sat │ $              0 │      ¤  2017-02-19 │ Sun │ $              0 │      ¤  2017-02-20 │ Mon │ $              0 │      ¤  2017-02-21 │ Tue │ $ 44,416,927,777 │   8.28  2017-02-22 │ Wed │ $ 41,137,731,714 │  -6.48  2017-02-23 │ Thu │ $ 44,254,446,593 │   5.19  2017-02-24 │ Fri │ $ 45,229,398,830 │  -8.03  2017-02-25 │ Sat │ $              0 │      ¤  2017-02-26 │ Sun │ $              0 │      ¤  2017-02-27 │ Mon │ $ 43,613,734,358 │      ¤  2017-02-28 │ Tue │ $ 57,874,495,227 │  23.25 (28 rows)
Have fun writing code, and as SQL is code, have fun writing SQL!
This article is based on an excerpt from Dimitri Fontaine's book Mastering PostgreSQL in Application Development, which explains how to replace thousands of lines of code with simple queries. The book goes into greater detail on these topics and provides many other examples so you can master PostgreSQL and issue the SQL queries that fetch exactly the result set you need.

No comments:

Post a Comment