Data Analysis with Python and Pandas

Video Lectures

Lecture 1
Introduction to Pandas
Pandas is a module for Python, which is the programming language we'll be using. The Pandas module is a high-performance, highly efficient, high-level data analysis library.

At its core, using Pandas is very much like operating a headless version of a spreadsheet program such as Excel. Most of the datasets you work with will be what are called dataframes. You may be familiar with this term already, since it is used in other languages too; if not, a dataframe is most often just like a spreadsheet: columns and rows, that's all there is to it! From here, we can utilize Pandas to perform operations on our datasets at lightning speed.
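
As a minimal sketch of that idea (the column names and values here are made up purely for illustration):

import pandas as pd

# A dataframe is just labeled rows and columns, like a small spreadsheet.
df = pd.DataFrame({'Price': [230000, 245000, 238000],
                   'Sqft': [1400, 1600, 1500]})
print(df)         # three rows, two columns
print(df.head())  # head() previews the first rows of larger frames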

Sample code: http://pythonprogramming.net/data-analysis-python-pandas-tut...

Pip install tutorial: http://pythonprogramming.net/using-pip-install-for-python-mo...

Matplotlib series starts here: http://pythonprogramming.net/matplotlib-intro-tutorial/
Lecture 2
Pandas Basics
In this Data Analysis with Python and Pandas tutorial, we're going to cover some of the Pandas basics. Data can take multiple forms prior to being loaded into a Pandas dataframe, but generally it needs to be a dataset that can conform to rows and columns.
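
For example, a plain Python dictionary of lists already fits that shape, so Pandas can take it straight to a dataframe. A quick sketch with toy numbers:

import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 53, 34, 45, 64, 34],
             'Bounce_Rate': [65, 72, 62, 64, 54, 66]}

df = pd.DataFrame(web_stats)       # rows and columns straight from a dict
df.set_index('Day', inplace=True)  # make Day the row labels
print(df['Visitors'].head())       # reference a single column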

Text-version and sample code for this tutorial: http://pythonprogramming.net/basics-data-analysis-python-pan...

Python dictionaries tutorial: http://pythonprogramming.net/dictionaries-tutorial-python-3/

Lecture 3
IO Basics
Welcome to Part 3 of Data Analysis with Pandas and Python. In this tutorial, we will begin discussing IO, or input/output, with Pandas, starting with a realistic use case. To get ample practice, a very useful website is Quandl. Quandl contains a plethora of free and paid data sources. What makes it great is that the data is generally normalized, it's all in one place, and the extraction method is the same for every dataset. If you are using Python and you access the Quandl data via their simple module, the data is automatically returned as a dataframe. For the purposes of this tutorial, we're going to just manually download a CSV file instead, for learning purposes, since not every data source you find is going to have a nice and neat module for extracting the datasets.

Let's say we're interested in maybe purchasing or selling a home in Houston, Texas. The zipcode there is 77006. We could go to the local housing listings and see what the current prices are, but this doesn't really give us any historical information, so let's just try to get some data on this. Let's query for "home value index 77006." Sure enough, we can see an index here. There's top, middle, and lower tier, three bedroom, and so on. Let's say, sure, we've got a three-bedroom house. Let's check that out. Turns out Quandl already provides graphs, but let's grab the dataset anyway, make our own graph, and maybe do some other analysis. Go to download, and choose CSV. Pandas is capable of IO with CSV, Excel, HDF, SQL, JSON, msgpack, HTML, gbq, Stata, clipboard, and pickle data, and the list continues to grow. Check out the IO Tools documentation for the current list. Take that CSV and move it into the local directory (the directory that you are currently working in / where this .py script is).
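
Reading that CSV in, and writing one back out, is then only a few lines. A sketch, assuming the download was saved under a name like ZILL-Z77006_3B.csv (use whatever filename Quandl actually gave you):

import pandas as pd
import matplotlib.pyplot as plt

# The filename here is an assumption; match it to your download.
df = pd.read_csv('ZILL-Z77006_3B.csv', parse_dates=True, index_col='Date')
print(df.head())

df.to_csv('newcsv.csv')  # and IO in the other direction
df = pd.read_csv('newcsv.csv', parse_dates=True, index_col=0)

df.plot()
plt.show()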

sample code and text-based write up for this tutorial: http://pythonprogramming.net/input-output-data-analysis-pyth...

Lecture 4
Building dataset
In this part of the Data Analysis with Python and Pandas tutorial series, we're going to expand things a bit. Let's consider that we're multi-billionaires, or multi-millionaires, but it's more fun to be billionaires, and we're trying to diversify our portfolio as much as possible. We want to have all types of asset classes, so we've got stocks, bonds, maybe a money market account, and now we're looking to get into real estate to round things out. You've all seen the commercials, right? You buy a CD for $60, attend some $500 seminar, and you're set to start making your six-figure-at-a-time investments into property, right?

Okay, maybe not, but we definitely want to do some research and have some sort of strategy for buying real estate. So, what governs the prices of homes, and do we need to do the research to find this out? Generally, no, you don't really need to do that digging; the factors are known. Home prices are governed by the economy, interest rates, and demographics. These are the three major influences on real estate value in general. Now, of course, if you're buying land, various other things matter: how level it is, whether we're going to need to do some work to the land before we can actually lay a foundation, how the drainage is, and so on. If there is a house, then we have even more factors, like the roof, windows, heating/AC, floors, foundation, and so on. We can begin to consider these factors later, but first we'll start at the macro level. You will see how quickly our datasets inflate here as it is; it'll blow up fast.

So, our first step is to just collect the data. Quandl still represents a great place to start, but this time let's automate the data grabbing. We're going to pull housing data for the 50 states first, and then we aim to gather other data as well. We definitely don't want to be pulling this data manually. First, if you do not already have an account, you need to get one. This will give you an API key and unlimited API requests to the free data, which is awesome.

Once you create an account, go to your account / me, whatever they are calling it at the time, and then find the section marked API key. That's your key, which you will need. Next, we want to grab the Quandl module. We really don't need the module to make requests at all, but it's a very small module, and the slight ease it gives us is worth it, so we might as well. Open up your terminal/cmd.exe and do pip install quandl (again, remember to specify the full path to pip if pip is not recognized).

Next, we're ready to rumble. Open up a new editor.
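
A sketch of the grabbing itself; the API key is a placeholder, and "FMAC/HPI_TX" is an assumption about how Quandl codes a state's housing price index, so check the dataset page for the exact code:

import quandl

api_key = 'YOUR_API_KEY'  # paste your own key here

# One state's housing price index, returned as a dataframe:
df = quandl.get('FMAC/HPI_TX', authtoken=api_key)
print(df.head())

# The same call in a loop over state abbreviations automates all 50:
for abbv in ['AK', 'AL', 'AZ']:  # ...and so on for the rest
    print(quandl.get('FMAC/HPI_' + abbv, authtoken=api_key).head(1))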

Lecture 5
Concatenating and Appending dataframes
Welcome to Part 5 of our Data Analysis with Python and Pandas tutorial series. In this tutorial, we're going to be covering how to combine dataframes in a variety of ways.

In our case with real estate investing, we're hoping to take the 50 dataframes with housing data and combine them all into one dataframe. We do this for multiple reasons. First, it is easier and just makes sense to combine these, but it will also result in less memory being used. Every dataframe has a date column and a value column. The date column is repeated across all the dataframes, but really they should all just share the one, effectively nearly halving our total column count.

When combining dataframes, you might have quite a few goals in mind. For example, you may want to "append" to them, adding to the end, basically adding more rows. Or maybe you want to add more columns, like in our case. There are four major ways of combining dataframes, which we'll begin covering now: concatenation, joining, merging, and appending. We'll begin with concatenation.
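
A minimal sketch of concatenation with toy frames (made-up values), stacking rows because the second frame's index picks up where the first left off:

import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85], 'Int_rate': [2, 3, 2, 2]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({'HPI': [80, 85, 88, 85], 'Int_rate': [2, 3, 2, 2]},
                   index=[2005, 2006, 2007, 2008])

concat = pd.concat([df1, df2])  # glue the frames together, row-wise here
print(concat)

# Note: DataFrame.append used to handle the row-wise case on its own; in
# current Pandas it has been removed, and pd.concat covers appending too.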

Sample code and text-based version of this tutorial: http://pythonprogramming.net/concatenate-append-data-analysi...

Lecture 6
Joining and Merging Dataframes
Welcome to Part 6 of the Data Analysis with Python and Pandas tutorial series. In this part, we're going to talk about joining and merging dataframes, as another method of combining dataframes. In the previous tutorial, we covered concatenation and appending.
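
A quick sketch of the difference, with toy frames: merge combines on a shared column, while join lines frames up on their indexes:

import pandas as pd

df1 = pd.DataFrame({'Year': [2001, 2002, 2003, 2004], 'Int_rate': [2, 3, 2, 2]})
df3 = pd.DataFrame({'Year': [2001, 2002, 2003, 2004], 'Unemployment': [7, 8, 9, 6]})

merged = pd.merge(df1, df3, on='Year')  # merge on the shared Year column
print(merged)

joined = df1.set_index('Year').join(df3.set_index('Year'))  # join on the index
print(joined)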

Joining/merging tutorial text and sample code: http://pythonprogramming.net/join-merge-data-analysis-python...

Lecture 7
Pickling
Welcome to Part 7 of our Data Analysis with Python and Pandas tutorial series. In the last couple of tutorials, we learned how to combine datasets. In this tutorial, we're going to resume under the premise that we're aspiring real estate moguls. We're looking to protect our wealth by diversifying it, and one component of this is real estate.
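
Pickling is just serializing a Python object to disk so we don't have to rebuild it every run. A sketch showing both the standard-library route and Pandas' own shortcut (the dataframe here is a toy stand-in for our combined frame):

import pickle
import pandas as pd

df = pd.DataFrame({'HPI': [80, 85, 88]})  # stand-in for the real data

# Standard-library pickle:
with open('fiddy_states.pickle', 'wb') as f:
    pickle.dump(df, f)
with open('fiddy_states.pickle', 'rb') as f:
    df2 = pickle.load(f)

# Pandas' built-in equivalent:
df.to_pickle('fiddy_states2.pickle')
df3 = pd.read_pickle('fiddy_states2.pickle')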

Tutorial text and sample code: http://pythonprogramming.net/pickle-data-analysis-python-pan...

Lecture 8
Percent Change and Correlation Tables
Welcome to Part 8 of our Data Analysis with Python and Pandas tutorial series. In this part, we're going to do some of our first manipulations on the data.
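
A sketch of both operations, assuming the pickled 50-state HPI frame from the previous parts (the pickle filename is hypothetical; use whatever you saved it as):

import pandas as pd

df = pd.read_pickle('fiddy_states.pickle')  # hypothetical name from earlier

df_pct = df.pct_change()  # month-over-month percent change, per column
corr = df.corr()          # pairwise correlation table across all states
print(corr.describe())    # summarize how correlated the states are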

Tutorial sample code and text: http://pythonprogramming.net/percent-change-correlation-data...

Lecture 9
Resampling
Welcome to another Data Analysis with Python and Pandas tutorial. In this tutorial, we're going to be talking about smoothing out data by removing noise. There are two main methods to do this. The most popular method is what is called resampling, though it might take many other names. This is where we have some data that is sampled at a certain rate. For us, we have the Housing Price Index sampled at a one-month rate; we could sample the HPI every week, every day, every minute, or more often, but we could also resample to every year, every 10 years, and so on.

Another environment where resampling almost always occurs is with stock prices. Raw stock price data is intra-second. What winds up happening, though, is that free stock price data is usually resampled to minute data at the lowest. You can buy access to live data, however. On a long-term scale, the data will usually be sampled daily, or even every 3-5 days. This is often done to keep the size of the data being transferred low. For example, over the course of, say, one year, intra-second data usually runs to multiple gigabytes, and transferring all of that at once is unreasonable; people would be waiting minutes or hours for pages to load.

Using our current data, which is sampled once a month, how might we resample it instead to once every 6 months, or every 2 years? Try to think about how you might personally write a function to perform that task; it's fairly challenging, but it can be done. That said, it's a fairly computationally expensive job, but Pandas has our backs and does it very fast.
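
In Pandas it is one chained call. A sketch, again assuming the pickled monthly HPI frame with a datetime index ('A' is the annual resampling rule; newer Pandas versions spell it 'YE'):

import pandas as pd

df = pd.read_pickle('fiddy_states.pickle')  # hypothetical monthly HPI data

tx_annual = df['TX'].resample('A').mean()  # monthly -> annual means
print(tx_annual.head())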

Sample code and text tutorial for this video: http://pythonprogramming.net/resample-data-analysis-python-p...

Lecture 10
Handling Missing Data
Welcome to Part 10 of our Data Analysis with Python and Pandas tutorial. In this part, we're going to be talking about missing or not available data. We have a few options when considering the existence of missing data.

Ignore it - Just leave it there
Delete it - Remove the affected cases from the data entirely. This means forfeiting the entire row of data.
Fill forward or backwards - This means taking the prior or following value and just filling it in.
Replace it with something static - For example, replacing all NaN data with -9999.
Each of these options has its own merits for a variety of reasons. Ignoring it requires no more work on our end. You may choose to ignore missing data for legal reasons, or maybe to retain the utmost integrity of the data. Missing data might also be very important data. For example, maybe part of your analysis is investigating signal drops from a server. In this case, maybe the missing data is super important to keep in the set.

Next, we have deleting it. You have another two choices at this point: you can either delete rows if they contain any amount of NaN data, or you can delete a row only if it is completely NaN data. Usually, a row that is full of NaN data comes from a calculation you performed on the dataset; no data is really missing, it's simply not available given your formula. In most cases, you would at least want to drop all rows that are completely NaN, and in many cases you would like to drop any rows that have any NaN data at all.
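
All four options from the list above are one-liners. A sketch on a toy frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'HPI': [80, np.nan, 88, np.nan]})

print(df.dropna())           # delete rows containing any NaN
print(df.dropna(how='all'))  # delete only rows that are entirely NaN
print(df.ffill())            # fill forward from the prior value
print(df.bfill())            # fill backward from the following value
print(df.fillna(-9999))      # replace NaN with something static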

Tutorial sample code and text: http://pythonprogramming.net/nan-na-missing-data-analysis-py...

Lecture 11
Rolling statistics
Welcome to another data analysis with Python and Pandas tutorial series, where we become real estate moguls. In this tutorial, we're going to be covering the application of various rolling statistics to our data in our dataframes.

One of the more popular rolling statistics is the moving average. This takes a moving window of time and calculates the average, or mean, of that time period as the current value. In our case, we have monthly data, so a 10-month moving average would be the current value plus the previous 9 months of data, averaged, giving us a 10-month moving average of our monthly data. Doing this in Pandas is incredibly fast. Pandas comes with a few pre-made rolling statistical functions, but also has one called rolling_apply. This allows us to write our own function that accepts window data and applies any reasonable bit of logic we want. This means that even if Pandas doesn't officially have a function to handle what you want, you're covered: you can write exactly what you need. Let's start with a basic moving average, or a rolling mean as Pandas calls it. You can check out all of the moving/rolling statistics in Pandas' documentation.
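
A sketch of this, again assuming the monthly HPI frame. Note that the names used in the video come from older Pandas; rolling_mean and rolling_apply have since moved onto the .rolling() accessor, which is what's shown here:

import pandas as pd

df = pd.read_pickle('fiddy_states.pickle')  # hypothetical monthly HPI data

df['TX_10MA'] = df['TX'].rolling(window=10).mean()  # 10-month moving average
df['TX_10STD'] = df['TX'].rolling(window=10).std()  # rolling standard deviation

# The rolling_apply idea: any custom logic over the moving window.
df['TX_range'] = df['TX'].rolling(window=10).apply(lambda w: w.max() - w.min())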

Text tutorial and sample code: http://pythonprogramming.net/rolling-statistics-data-analysi...

Lecture 12
Applying Comparison Operators to DataFrame
Welcome to Part 12 of the Data Analysis with Python and Pandas tutorial series. In this tutorial, we're going to talk briefly about the handling of erroneous/outlier data. Just because data is an outlier does not mean it is erroneous. A lot of the time, an outlier data point can nullify a hypothesis, so the urge to just get rid of it can be high, but that isn't what we're talking about here.

What would an erroneous outlier be? An example I like to use is measuring fluctuations in something like, say, a bridge. As bridges carry weight, they can move a bit. In storms, they can wiggle about; there is some natural movement. As time goes on and supports weaken, the bridge might move a bit too much, and eventually need to be reinforced. Maybe we have a system in place that constantly measures fluctuations in the bridge's height, and occasionally the sensor reports a physically impossible reading.
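
A sketch of catching such a reading with comparison operators, using made-up bridge measurements with one obviously bogus value:

import pandas as pd

df = pd.DataFrame({'meters': [10.26, 10.31, 10.27, 10.22, 6212.42, 10.28, 10.25]})

meters_std = df.describe()['meters']['std']      # overall standard deviation
df['STD'] = df['meters'].rolling(window=2).std()  # local, row-to-row deviation

# The comparison operator does the filtering. Note this also drops the first
# row (its rolling window is incomplete) and the row right after the spike.
df = df[df['STD'] < meters_std]
print(df)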

Text based tutorial and sample code: http://pythonprogramming.net/comparison-operators-data-analy...

Lecture 13
Joining 30 year mortgage rate
Welcome to Part 13 of our Data Analysis with Python and Pandas series, using real estate investing as an example. At this point, we've learned quite a bit about what Pandas has to offer, and we'll come up with a bit of a challenge! As we've covered so far, we can make relatively low-risk investments based on divergence between highly correlated state pairs and probably do just fine. We'll cover testing this strategy later on, but, for now, let's look into acquiring the other necessary data that drives housing values: interest rates. Now, there are many different types of mortgage rates, both in how interest is accrued and in the time frame of the loan. Opinions vary over the years, and depending on the current market situation, on whether you want a 10-year, 15-year, or 30-year mortgage. Then you have to consider whether you want an adjustable rate, or maybe along the way you decide you want to refinance your home.

At the end of the day, all of this data is finite, but ultimately it will likely be a bit too noisy. For now, let's just keep it simple and look into the 30-year conventional mortgage rate. This data should be quite negatively correlated with the House Price Index (HPI). Before even bothering with the code, I would automatically expect that the correlation won't be as strong as the higher-than-90% correlations we were getting between state HPIs: weaker than -0.9, but probably still stronger than -0.5, so somewhere between the two. The interest rate is of course important, but the correlations to the overall HPI were so strong because those were very similar statistics. The interest rate is related, but not as directly as other HPI values or the US HPI.
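
A sketch of grabbing the rate and checking that correlation; the dataset code "FMAC/MORTG" and the pickle name are assumptions, and you may need to resample so the monthly timestamps of the two frames actually line up:

import quandl
import pandas as pd

api_key = 'YOUR_API_KEY'

m30 = quandl.get('FMAC/MORTG', authtoken=api_key)  # code is an assumption
m30.columns = ['M30']

hpi = pd.read_pickle('fiddy_states.pickle')  # hypothetical HPI frame

combined = hpi.join(m30)       # may require resampling to align the months
print(combined.corr()['M30'])  # expect moderately negative values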

Sample code and text-based tutorial: http://pythonprogramming.net/joining-mortgage-rate-data-anal...

Lecture 14
Adding other economic indicators
Hello everyone, and welcome to Part 14 of our Data Analysis with Python and Pandas for real estate investing tutorial series. We've come quite a long way here, and the next, and final, macro step that we want to take involves looking into economic indicators to see their impact on housing prices, or the HPI.

Text based version of this tutorial and sample code: http://pythonprogramming.net/economic-factors-data-analysis-...

There are two major economic indicators that come to mind right out of the gate: the S&P 500 index (stock market) and GDP (Gross Domestic Product). I suspect the S&P 500 is more correlated with the HPI than GDP is, but GDP is usually a better overall economic indicator, so I may be wrong. Another macro indicator that I suspect might have value here is the unemployment rate. If you're unemployed, you're probably not getting that mortgage. We'll see, though. We've been through the process of adding more data points, so I do not see much point in dragging you all through it again. There will be one new thing to note, however: in the HPI_Benchmark() function, we're changing the "United States" column to be US_HPI. This makes a bit more sense now that we're bringing in other values.
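
The rename itself is a one-liner with df.rename; a toy sketch (the values are made up, only the column swap matters):

import pandas as pd

benchmark = pd.DataFrame({'United States': [100.0, 101.2, 102.5]})
benchmark.rename(columns={'United States': 'US_HPI'}, inplace=True)
print(benchmark.columns)  # now just US_HPI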

For GDP, I couldn't find one that encompassed the full time frame. I am sure you can find a dataset, somewhere, with this data, maybe even on Quandl. Sometimes you have to do some digging. I also had trouble finding a nice long-term monthly unemployment rate. I did find an unemployment level, but we really want more of a percentage/rate, otherwise we need to divide the unemployment level by the population. We could do that if we decide unemployment rate is worth having, but we'll work with what we get first.

Lecture 15
Rolling Apply and Mapping Functions
In this data analysis with Python and Pandas tutorial, we cover function mapping and rolling_apply with Pandas.

The idea of function mapping and rolling apply is to allow you to fully customize Pandas to do whatever you need. If there isn't a pre-built method or function for you to run against your dataframe to do analysis or manipulation, you can use function mapping, creating your own function entirely.

Sample code and text-based version of this tutorial: http://pythonprogramming.net/rolling-apply-mapping-functions...

If you need to do something similar to this, but in a rolling fashion with a moving window, then you can do this with rolling_apply. Both will be covered here.
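
A sketch of both, with a made-up column and a custom spread function:

import numpy as np
import pandas as pd

df = pd.DataFrame({'HPI': np.random.normal(100, 5, 24)})

def spread(values):
    return values.max() - values.min()

print(spread(df['HPI']))  # our own function, run over the whole column

# The same logic in a rolling fashion, over a moving 6-row window:
df['rolling_spread'] = df['HPI'].rolling(window=6).apply(spread)
print(df.tail())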

Lecture 16
Scikit Learn Incorporation
In this Data Analysis with Pandas and Python tutorial series, we're going to show how quickly we can take the dataset in our Pandas dataframe and convert it to, for example, a NumPy array, which can then be fed into a variety of other data analysis Python modules. The example that we're going to use here is scikit-learn, or sklearn. In order to do this, you will need to install it:

pip install scikit-learn
From here, we're almost done already. For machine learning to take place, at least in the supervised form, we need only a couple of things. First, we need "features." In our case, features are things like the current HPI, maybe the GDP, and so on. Then you have "labels." Labels are assigned to feature "sets," where a feature set is the collection of the GDP, HPI, and so on for any given label. Our label, in this case, is either a 1 or a 0, where 1 means the HPI increased afterward, and 0 means it did not.
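
A sketch of the whole round trip with toy numbers; the feature values and the choice of a linear support vector classifier are just for illustration:

import numpy as np
from sklearn import svm

# Rows of [HPI, GDP]-style features, with 1/0 labels for "HPI rose later".
X = np.array([[100, 2.1], [102, 2.3], [101, 2.0], [104, 2.6]])
y = np.array([1, 0, 0, 1])

clf = svm.SVC(kernel='linear')
clf.fit(X, y)                     # supervised learning: features -> labels
print(clf.predict([[103, 2.4]]))  # classify a brand new feature set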

Sample code and text-based tutorial: http://pythonprogramming.net/scikit-learn-sklearn-machine-le...
