Introduction to Notebooks

Jupyter notebooks allow one to perform a great deal of data analysis and statistical validation. We'll demonstrate a few simple techniques here.

Code Cells vs. Text Cells

As you can see, each cell can be either code or text. To switch a cell's type, use the dropdown menu at the top of the notebook, which shows the current type (for example 'Code' or 'Markdown').

Executing a Command

A code cell will be evaluated when you press play, or when you press the shortcut, shift-enter. Evaluating a cell evaluates each line of code in sequence, and prints the results of the last line below the cell.
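A minimal cell along these lines illustrates the behavior; the expression itself is arbitrary. Run in a notebook, it displays 4 below the cell.

```python
# The cell holds a single expression, so its value is displayed below the cell.
2 + 2
```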

Sometimes there is no result to be printed, as is the case with assignment.
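For example, a cell like the following (the variable name and value are purely illustrative) displays nothing when run, because an assignment does not produce a value.

```python
# An assignment has no result, so nothing is displayed below the cell.
X = 2
```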

Remember that only the result from the last line is printed.
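As a small illustration with arbitrary values, the cell below evaluates both lines, but a notebook displays only 6, the result of the last one.

```python
# Both lines are evaluated, but only the last line's result is displayed.
2 + 2
3 + 3
```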

However, you can print whichever lines you want using the print function.
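A sketch of the same cell with an explicit print call on the first line (values again arbitrary): in a notebook, both 4 and 6 now appear.

```python
# print() writes the first result explicitly; the last line is still displayed as usual.
print(2 + 2)
3 + 3
```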

Knowing When a Cell is Running

While a cell is running, [*] will display on the left. When a cell has yet to be executed, [ ] will display. Once it has been run, a number will display, for example [5], indicating the order in which it was run during the execution of the notebook. Try running this cell and watch the indicator change.
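Any cell that takes a moment to finish will do for this. The loop below is just one illustrative choice, sized so the indicator stays visible for about a second on a typical machine.

```python
# Do enough work that the running indicator is visible for a moment.
c = 0
for i in range(10000000):
    c = c + i
c
```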

Importing Libraries

The vast majority of the time, you'll want to use functions from pre-built libraries. Here I import numpy and pandas, the two most common and useful libraries in quant finance. I recommend copying this import statement to every new notebook.

Notice that you can rename libraries to whatever you want after importing. The as statement allows this. Here we use np and pd as aliases for numpy and pandas. This is a very common aliasing and will be found in most code snippets around the web. The point behind this is to allow you to type fewer characters when you are frequently accessing these libraries.
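The import statement described above would look like this, using the conventional np and pd aliases.

```python
# Import the libraries under their conventional short aliases.
import numpy as np
import pandas as pd
```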

Tab Autocomplete

Pressing tab will give you a list of the notebook's best guesses for what you might want to type next. This is incredibly valuable and will save you a lot of time. If there is only one possible option, it will be filled in for you. Try pressing tab frequently; it will seldom fill in anything you don't want, because a list is shown whenever there is ambiguity. This is a great way to see what functions are available in a library.

Try placing your cursor after the . and pressing tab.

Getting Documentation Help

Placing a question mark after a function and executing that line of code will give you the documentation Python has for that function. It's often best to do this in a new cell, as you avoid re-executing other code and running into bugs.
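The question-mark syntax only works inside a notebook or IPython session, so it is shown here as a comment. Python's built-in help function gives equivalent documentation anywhere. The choice of np.random.normal is only an example.

```python
import numpy as np

# In a notebook cell you would type:  np.random.normal?
# The built-in help() function prints the same documentation in any Python session.
help(np.random.normal)
```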


We'll sample some random data using a function from numpy.
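One way to do this is with np.random.normal. The mean, standard deviation, and sample count below are illustrative choices.

```python
import numpy as np

# Draw 100 samples from a normal distribution with mean 0 and standard deviation 1.
X = np.random.normal(0, 1, 100)
X
```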


We can use the plotting library we imported as follows.
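A minimal sketch, assuming the plotting library is matplotlib (imported here as plt so the cell stands on its own) and using freshly sampled random data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample some data, then plot it against its position in the array.
X = np.random.normal(0, 1, 100)
plt.plot(X)
```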

Squelching Line Output

You might have noticed the annoying line of the form [<matplotlib.lines.Line2D at 0x7f72fdbc1710>] before the plots. This appears because the .plot function actually returns a value, which the notebook displays. When we don't want to display that output, we can suppress it by ending the line with a semicolon, as follows.
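The same illustrative plot, assuming matplotlib and random data, with a trailing semicolon so the notebook shows only the figure:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.normal(0, 1, 100)

# The trailing semicolon stops the notebook from displaying plot's return value.
plt.plot(X);
```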

Adding Axis Labels

No self-respecting quant leaves a graph without labeled axes. Here are some commands to help with that.
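A short sketch using matplotlib's xlabel, ylabel, and legend functions. The two random series and the label text are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.normal(0, 1, 100)
X2 = np.random.normal(0, 1, 100)

plt.plot(X)
plt.plot(X2)
plt.xlabel('Time')     # label for the horizontal axis
plt.ylabel('Returns')  # label for the vertical axis
plt.legend(['X', 'X2']);
```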

Generating Statistics

Let's use numpy to take some simple statistics.
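For instance, the mean and standard deviation of a random sample (the sample itself is illustrative):

```python
import numpy as np

X = np.random.normal(0, 1, 100)

# Sample mean and standard deviation; both vary from run to run.
print(np.mean(X))
print(np.std(X))
```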

Getting Real Pricing Data

Randomly sampled data can be great for testing ideas, but let's get some real data. In QuantRocket, all securities are referenced by sid (short for "security ID") rather than by symbol since symbols can change. So, first, we'll use the get_securities function to look up the sid for MSFT.

(Notice the use of vendors='usstock' in the get_securities function call. This limits the query to securities from the US Stock dataset. This filter isn't necessary if you've only collected US Stock data, but is a best practice when looking up securities by symbol in case you've also collected data from other global exchanges where the same ticker symbols are re-used.)

This returns a pandas dataframe, where sids are stored in the dataframe's index.

Then we use get_prices to query our data bundle. Although the bundle contains minute data, here we use the data_frequency parameter to request the data at daily frequency:

Our data is now a dataframe. You can see the datetime index and the columns with different pricing data.

Because this is a pandas dataframe, we can index into it to get just the closing price for MSFT like this.
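The sketch below uses a small made-up dataframe rather than a real query, so it runs without any data collected. It assumes the prices are laid out with a (Field, Date) row index and one column per sid. The sid string and the prices are placeholders, not real values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the queried prices (assumed layout, made-up values).
MSFT = "SID_MSFT"  # placeholder, not a real sid
dates = pd.date_range("2016-01-04", periods=5, freq="B")
index = pd.MultiIndex.from_product([["Close", "Open"], dates], names=["Field", "Date"])
data = pd.DataFrame({MSFT: np.linspace(50.0, 59.0, 10)}, index=index)

# Select the Close rows, then the MSFT column, leaving a date-indexed Series.
X = data.loc["Close"][MSFT]
X
```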

Because there is now also date information in our data, we provide two series to .plot. X.index gives us the datetime index, and X.values gives us the pricing values. These are used as the X and Y coordinates to make a graph.
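A sketch of this call, assuming matplotlib and using a synthetic date-indexed series in place of the real closing prices:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a price series with a datetime index.
dates = pd.date_range("2016-01-04", periods=250, freq="B")
X = pd.Series(50 + np.cumsum(np.random.normal(0, 0.5, 250)), index=dates)

# Dates go on the horizontal axis and prices on the vertical axis.
plt.plot(X.index, X.values)
plt.ylabel('Price')
plt.legend(['MSFT']);
```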

We can get statistics again on real data.

Getting Returns from Prices

We can use the pct_change function to get returns. Notice how we drop the first element after doing this, as it will be NaN (nothing -> something results in a NaN percent change).
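A minimal illustration on a few made-up prices: pct_change leaves NaN in the first position, which is then sliced off.

```python
import pandas as pd

# Made-up prices for illustration.
X = pd.Series([100.0, 102.0, 101.0, 105.0])

# Percent change between consecutive prices; drop the leading NaN.
R = X.pct_change().iloc[1:]
R
```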

We can plot the returns distribution as a histogram.
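A sketch using matplotlib's hist function, with randomly generated placeholder returns standing in for the real ones. The bin count is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder returns; in the notebook these would come from pct_change.
R = pd.Series(np.random.normal(0.001, 0.015, 250))

plt.hist(R, bins=20)
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.legend(['MSFT Returns']);
```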

Get statistics again.

Now let's go backwards and generate data out of a normal distribution using the statistics we estimated from Microsoft's returns. We'll see that we have good reason to suspect Microsoft's returns may not be normal, as the resulting normal distribution looks far different.
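One way to sketch this step: estimate the mean and standard deviation of the returns, then draw samples from a normal distribution with those parameters. The returns here are placeholder fat-tailed random data, and the sample size of 10000 is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder returns drawn from a fat-tailed distribution, not real data.
R = np.random.standard_t(3, 250) * 0.01

# Sample from a normal distribution with the same mean and standard deviation.
simulated = np.random.normal(np.mean(R), np.std(R), 10000)

plt.hist(simulated, bins=20)
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.legend(['Normally Distributed Returns']);
```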

Generating a Moving Average

pandas has some nice tools to allow us to generate rolling statistics. Here's an example. Notice that there's no moving average until 60 days of data are available, as the statistic needs a full 60-day window.
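A sketch using the rolling method with a 60-day window, applied to a synthetic price series in place of the real one. The leading entries of the result are NaN until a full window is available.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a daily price series.
dates = pd.date_range("2016-01-04", periods=120, freq="B")
X = pd.Series(50 + np.cumsum(np.random.normal(0, 0.5, 120)), index=dates)

# 60-day moving average; entries without a full window behind them are NaN.
MAVG = X.rolling(window=60).mean()
MAVG.tail()
```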

