Comments On Starting To Do Economic/Financial Analysis In Python
This article is partly a response to a question I received, which I would paraphrase as: how does one start doing economic analysis in Python? I just want to outline one way to do it, leveraging the work that I have pushed out into open source repositories on GitHub. For someone interested in stock-flow consistent models, I have some introductory comments in my book An Introduction to SFC Models Using Python. This article is aimed at people interested in time series analysis.
(I have a Patreon that is associated with my open source projects. My experience with the SFC models book is that I do not want to try to format a programming book again. The Patreon is there to allow anyone to support those projects. My main project now is the agent-based model project. I am putting this article here rather than there on the basis that Substack has better post editing facilities.)
What Would I Need to Learn?
There are two separate steps to getting to be able to do basic time series analysis in Python. (Or be “a Python-powered Data Scientist” in LinkedIn-speak.)
Basics of the Python language.
Working with a time series library, with Pandas appearing to be the most popular choice.
To transmogrify yourself into a data scientist, you should spend most of your time on the second step. However, there is a Python learning curve that you need to get out of the way first.
Where do my libraries fit in? I have built an open source package (“economics platform” — https://github.com/brianr747/platform) which (hopefully) cleans up the underlying data management, allowing the user to get straight into data sciencing.
Python Basics
I started programming in the late 1970s on friends’ “microcomputers,” and I am not really sure what the easiest way to start programming is nowadays.
One book that is interesting is “Learn Python 3 the Hard Way” by Zed A. Shaw (Amazon affiliate link). If you want to be a programmer, this book largely simulates the way I learned: typing in programs, and fixing errors by repeated re-runs. However, this might be overkill for someone interested in data science. I believe that you want to work on things closer to your target workflow, and only go back to the basics of Python when required.
I think you need to get a handle on at least three areas.
Basics of Python program structure. There are plenty of tutorials available.
Know how an Integrated Development Environment (IDE) works. I use PyCharm (https://www.jetbrains.com/pycharm/), which comes in a free and a professional version. The free version was generally adequate for my work; I upgraded to the professional version for a few quality-of-life improvements.
The way in which Python imports libraries is easy to use for the standard packages that you download, but makes very little sense for your own modules. At some point, you need to know how imports work, although PyCharm can handle a lot of that in the background. (A small sketch of the issue follows below.)
One advantage of PyCharm is that there is an educational version: https://www.jetbrains.com/pycharm-edu/. It is a cut-down version of PyCharm with a built-in Python course. I have not spent much time with it, but it might ease the pain of dealing with the first two points above.
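As a small illustration of the import issue, here is a hypothetical layout (the file and directory names below are made up for this example):

# Hypothetical project layout:
# my_project/
#     analysis.py
#     my_utils.py   (your own helper functions)
#
# Inside analysis.py, importing your own module looks just like importing a
# standard package, provided that my_project is on the Python search path.
# PyCharm normally marks the project root for you; outside the IDE, you can
# add it by hand:
import sys
sys.path.append('/path/to/my_project')  # only needed if Python cannot find it
import my_utils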
Time Series Analysis
A fairly typical mode of operation in financial/economic research is to write small stand-alone functions that read data, do some basic calculations, and plot the result. Using my “platform” package, this can be done with very simple code.
from econ_platform.start import fetch, quick_plot
# U.S. 10-year Treasury yield, from the FRED database ("F@" prefix).
ust10 = fetch('F@DGS10')
# 10-year AAA euro government yield, from DB.nomics ("D@" prefix).
euro_AAA_10 = fetch('D@Eurostat/irt_h_euryld_d/D.PAR.Y10.EA')
# Convert the spread to basis points and plot it.
quick_plot(100*(ust10-euro_AAA_10), title='U.S. 10Y Spread Over Euro Govvie')
The above code snippet (the “second fetch” example in my platform examples directory) does the following:
The first line is the overhead of importing the library functions (fetch, quick_plot).
Use fetch() to get the 10-year Treasury yield from the FRED database (St. Louis Fed), using their ticker (“DGS10”).
Use fetch() to get the 10-year AAA-rated Euro bond yield from DB.nomics.
Plot the result.
The resulting figure can then be seen within PyCharm.
The spread (in basis points) was calculated using the overloaded “-” operator on Pandas time series objects. You can see various gaps in the time series, which presumably correspond to holidays in one market or the other.
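For readers who have not seen Pandas arithmetic before, here is a minimal stand-alone illustration (with made-up numbers) of how the subtraction and the gaps work:

import pandas as pd

# Two yield series, with the euro series missing one date (a "holiday").
us = pd.Series([4.0, 4.25, 4.5],
               index=pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04']))
euro = pd.Series([3.0, 3.25],
                 index=pd.to_datetime(['2024-01-02', '2024-01-04']))
# Pandas aligns the two series on their date indexes before subtracting.
spread_bps = 100 * (us - euro)
print(spread_bps)
# 2024-01-02    100.0
# 2024-01-03      NaN    <- the gap: no euro observation on that date
# 2024-01-04    125.0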
The script that runs this looks like this in PyCharm. Yes, it is probably scary looking to someone not familiar with IDEs.
The reason why you want to use PyCharm (or an equivalent) is that it makes it easier to work with Pandas objects. If we run in debug mode, we see the following:
If we click on the “View As Series” option, we can then view the data within the Pandas time series object as a nice tabular view.
This is a lot easier to work with than trying to snoop through the data using commands. With practice, it is just as easy to visualise data this way as in a spreadsheet, without the problems associated with spreadsheets (where errors are not easily spotted).
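For comparison, command-based snooping on the series from the earlier snippet would look something like this:

# Inspecting the series by typing commands instead of using the debugger view:
ust10.head(10)      # the first ten observations
ust10.describe()    # summary statistics (count, mean, min, max, ...)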
Stack Exchange/Overflow Puts the Science into Data Science
If you want to do something fancier than calculating a spread between two time series, you want to use the functionality built into Pandas whenever possible. I was not too happy with the Pandas programming interface or documentation, but the way around that is to use the power of the internet. Most of the basic manipulations that you want to do have probably already been asked about on the Stack Exchange/Stack Overflow websites. I “learned” Pandas this way: I relied on canned code snippets from those sites, and then worked my way through the documentation once I was forced to do so.
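As an example of the sort of manipulation that a quick search turns up, converting a daily series to monthly averages is a one-liner once you know that the resample() method exists:

import pandas as pd

# Ninety days of made-up daily data.
daily = pd.Series(range(90),
                  index=pd.date_range('2024-01-01', periods=90, freq='D'))
# 'M' = calendar month end ('ME' in recent Pandas versions).
monthly = daily.resample('M').mean()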
Why My Platform?
Using code snippets found on those websites, you could re-write the above script to be self-contained: it would use the standard Python libraries that my platform uses under the hood to download the data. There are two reasons why you do not want to go that route.
You end up repeating the same code that fetches data everywhere.
You will bombard the data providers every single time you run the function. My platform code fetches the data from a local database, and only refreshes a time series from its original source once per day (it plays a “ding” sound to let you know when that happens). This way, you can run a chart pack without it turning into a homemade denial-of-service attack.
You could try sticking closer to the original data fetching libraries, but life is a lot simpler if you force everything into a common database format, which can then be fetched with a single command.
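To make the once-per-day refresh idea concrete, here is a minimal sketch of the caching logic. This is not the platform’s actual code; the cache directory name and the remote_fetch() downloader function are stand-ins for illustration.

import os
import time

import pandas as pd

CACHE_DIR = 'series_cache'  # hypothetical local cache directory

def cached_fetch(ticker, remote_fetch):
    # Return the cached copy if it was refreshed within the last day;
    # otherwise, download the series again and overwrite the cached file.
    os.makedirs(CACHE_DIR, exist_ok=True)
    fname = os.path.join(CACHE_DIR, ticker.replace('/', '_') + '.csv')
    one_day = 24 * 60 * 60
    if os.path.exists(fname) and time.time() - os.path.getmtime(fname) < one_day:
        return pd.read_csv(fname, index_col=0, parse_dates=True).iloc[:, 0]
    series = remote_fetch(ticker)  # user-supplied function that hits the source
    series.to_csv(fname)
    return series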