Using Python & Tableau to Analyze Indego’s Bike Trips & Locations in Philadelphia

Ghafar Shah
9 min read · Jan 4, 2021

Using Tableau/Tableau Prep and Python (Beautiful Soup) to analyze Indego’s bike trips and station addresses in Philly.

Click here to view the dashboard on Tableau Public.

Background.

A few weeks ago, I was online searching for a new bicycle, as my old one, unfortunately, does not run anymore. All of a sudden, I remembered Philly’s Indego bike-sharing service and Googled it instantly. I know it sounds a bit strange to hop from one topic to another, but I had a viz idea going through my head when I thought of Indego. Looking through Indego’s website, I spotted a map on a webpage with the locations of all of Indego’s bike stations in Philadelphia. At that point, I thought it would be really awesome to do some analysis on it. Since I have been trying to brush up on Python, web scraping in particular, I believed this personal project would help broaden my programming skills a bit more while also being fun to viz!

Studying the website HTML structure.

Just recently, I have been web scraping sites at work using Beautiful Soup. Something I have learned from that coding experience, and from reading a ton of blogs, is that understanding a webpage’s HTML structure is extremely important for web scraping. No two sites are built exactly the same, so the HTML/CSS you have to navigate will almost certainly differ from one website to the next.

Let us jump right into it. Below is a screenshot of the map on Indego’s webpage that I planned to scrape. My initial goal was to extract the bike station addresses from the map icons. To do so, I needed to look at the HTML content of this page, specifically the Indego map, and then dig deep into the nested HTML elements to extract the data.

To start diving into the HTML, I right clicked on the webpage, and selected “Inspect”. This opened up a pane on the right-side of my screen where I could see the HTML structure under the “Elements” tab.

Webpage: https://www.rideindego.com/suggest-a-station/

At this point, the most challenging part for me personally was finding the list of tags that held the map addresses. That meant digging through a bunch of “div” tags until I found the data I needed for my visualization. I will not go into too much detail here, but after navigating from the top-hierarchy div classes “gform_body” and “ginput_container ginput_container_gfgeo_map”, and then digging even deeper into the nested div tags within them, “gfgeo-map-wrapper-7_1” and “gfgeo-map-7_1”, I finally found the addresses I was looking for. But I noticed those bike station addresses were nested inside “div” tags with inline “style” attributes!

I did not realize I would need to worry about CSS, which made my coding experience even more challenging since I have not brushed up on it. The screenshot below shows the HTML/CSS hierarchy of the webpage, and the red circle highlights the address of one bike station.

HTML/CSS content of webpage: https://www.rideindego.com/suggest-a-station/

Bingo! I found the addresses, but they are all nested inside “div” tags with a “style” attribute, which means I needed to write code that walks through all the nested “divs” and extracts each bike station address from the “title” attribute.

Let us take a closer look.

Below is a screenshot of the addresses inside the “title” attribute of these “div” tags.

Close-up look at bike station addresses inside “title” attribute.

Web scraping using Python’s Beautiful Soup library.

Now that I had found the data I was looking for on Indego’s webpage, I needed to write some Python code using the Beautiful Soup library, and then use the Pandas library to organize the data into a table format. In all honesty, I was scared of this part, and it literally took me three weeks to get my code to work.

Here is a screenshot of my code for the web scraping part:

Beautiful Soup web scraping code in the PyCharm IDE.

I will walk through this chunk of code from top to bottom. I do not plan to cover every detail about Python but I hope my explanation below will be clear enough to help you understand what I did in this code. I also included some links to resources at the end of this blog that I found incredibly helpful to learn Python and understand Beautiful Soup.

First, I assigned the address of the Indego webpage I planned to extract data from, as a string, to the variable “url”. This is the page Selenium will load so that Beautiful Soup can later read its HTML/CSS content. You can call this variable anything you want.

Getting the Selenium driver to work in Chrome.

Following the “url” variable, I then used the Selenium driver, which opens up a web browser when you run the code. Think of Selenium as a browser-automation and testing tool. Because the map icons are drawn by JavaScript after the page loads, a plain HTTP request would not see them; Selenium drives a real Chrome browser that renders the page first. It also reassures me that, when I run my code, Chrome opens the correct Indego webpage and I am scraping a working site. I had to download the Chrome driver for this to work, which is why you see a path pointing to “…Downloads/chromedriver” in the “driver” variable.

Finally, I passed the “url” variable (webpage address) into “driver.get(url)” and waited 20 seconds using “time.sleep(20)” for the Chrome browser to load before extracting any addresses from the map icons. By waiting 20 seconds, I know that all of the map icons I plan to grab addresses from will show up on the Indego webpage.
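Pieced together, that setup looks roughly like this. This is a sketch, not my exact script: the chromedriver path below is a placeholder you would point at wherever your own driver lives.

```python
# Minimal sketch of the Selenium setup described above.
import time
from selenium import webdriver

url = 'https://www.rideindego.com/suggest-a-station/'

# Launch Chrome through the downloaded driver so the page's JavaScript runs.
# (Placeholder path; newer Selenium versions pass this via a Service object.)
driver = webdriver.Chrome('/Users/yourname/Downloads/chromedriver')

# Load the Indego page and give the map icons time to render.
driver.get(url)
time.sleep(20)
```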

We also see the two variables “html” and “soup”. The variable “html” holds all the HTML content from the Chrome browser’s page. I passed this “html” variable into “BeautifulSoup(html, features='html.parser')”. I love this part because now we are actually using Beautiful Soup to read all the HTML data!

“html” and “soup” variables
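Those two lines look something like this, assuming the “driver” from the previous sketch:

```python
# Pull the rendered HTML out of the browser and hand it to Beautiful Soup.
# ("driver" comes from the Selenium snippet above.)
from bs4 import BeautifulSoup

html = driver.page_source
soup = BeautifulSoup(html, features='html.parser')
```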

Now I have my new function, “getindegoStationAddress”, below. This function pretty much loops through the nested “divs”. Using two “for” loops, I looped through the nested “divs” inside “gfgeo-map-7_1” using Beautiful Soup’s “findAll” method.

Remember when we created the “soup” variable above? That has our HTML/CSS content. The “findAll” in “soup.findAll” is a Beautiful Soup method that finds every matching tag inside the HTML, in this case all the nested “divs”.

The second “for” loop goes through the “divs” carrying a “style” attribute to finally grab the “title” attribute information. I then append the “title” attribute text, which holds the bike station address, to an empty list I created above called “indegoStations”.

“getindegoStationAddress” function
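My full script is linked at the end of this blog, but the looping logic looks roughly like this. Treat the class name and attributes as assumptions about the page’s markup, which could change if Indego updates the site:

```python
# Sketch of the nested loops described above ("soup" comes from the
# previous snippet). The div class reflects what the page inspection showed.
indegoStations = []  # will hold one address string per bike station

def getindegoStationAddress():
    # First loop: the map container(s) with the class found during inspection.
    for map_div in soup.findAll('div', {'class': 'gfgeo-map-7_1'}):
        # Second loop: the nested divs carrying an inline "style" attribute.
        for inner_div in map_div.findAll('div', attrs={'style': True}):
            # The station address lives in the "title" attribute.
            address = inner_div.get('title')
            if address:
                indegoStations.append(address)
```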

After looping through all the “div” tags and appending the addresses from the “title” attribute to the empty list “indegoStations”, I created a dictionary and passed it into a Pandas DataFrame: “df = pd.DataFrame(dictionary)”. Now we can return “df” from the function, which gives us a table of all our addresses, and even export the results to an Excel file.

Export results into Excel file and save it in the downloads folder.
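Those final steps look something like this; shown outside the function for brevity, and the output path is a placeholder:

```python
# Wrap the scraped addresses in a DataFrame and export them to Excel.
# In the original code these lines sit inside the function above.
import pandas as pd

dictionary = {'Title': indegoStations}
df = pd.DataFrame(dictionary)
df.to_excel('/Users/yourname/Downloads/indego_stations.xlsx')
```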

Here is what the results look like in Excel. I needed to remove column “A” and row two, since those are an extra index column and row generated by the Python export. In total, there are 142 active bike stations in Philly. I still needed to plot this data, so I added an additional column called “City” and concatenated the “Title” column (the street address) with the “City” column in Excel to get a complete address.

Excel output with Indego bike addresses from Python script.
Added the “City” column.
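I did this step directly in Excel, but for completeness, the same concatenation in pandas would look something like this (the “Full Address” column name is just illustrative):

```python
# Hypothetical pandas version of the Excel concatenation step.
df['City'] = 'Philadelphia, PA'
df['Full Address'] = df['Title'] + ', ' + df['City']
```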

Great, I have all the bike station addresses, but I still needed to add two more columns, “latitude” and “longitude”, to plot the data on a map in Tableau. After some research, I found Google’s Geocoding API really powerful for this project. Geocoding is basically taking addresses and deriving latitude/longitude coordinates from them. You can also do the reverse, but we already have the complete addresses from the web scraping above. So, after following some YouTube tutorials and reading up on blogs, I finally found code I could re-use with slight modifications to get what I wanted.

You can check out the code that I re-used to convert addresses into coordinates here. The only difference is that I used Google’s Geocoding API instead of OpenCageGeocode, which meant generating an API key on my Google account.

After pulling this final piece of code together, I got my lat/long coordinates! Not only did I get my coordinates, but I also used Google’s Geocoding API to return Google’s own version of each address. This was just to make sure the coordinates were actually correct and “apples-to-apples” with the bike station addresses I scraped from the webpage.
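The gist of the geocoding call looks like this. It is a sketch against Google’s Geocoding REST endpoint, with a placeholder API key and a made-up example address:

```python
# Sketch of the geocoding step. API_KEY and the sample address are
# placeholders; "formatted_address" is Google's own version of the address,
# kept for the apples-to-apples check described above.
import requests

API_KEY = 'YOUR_GOOGLE_API_KEY'

def geocode(full_address):
    resp = requests.get(
        'https://maps.googleapis.com/maps/api/geocode/json',
        params={'address': full_address, 'key': API_KEY},
    )
    result = resp.json()['results'][0]
    location = result['geometry']['location']
    return location['lat'], location['lng'], result['formatted_address']

lat, lng, google_address = geocode('1001 Example St, Philadelphia, PA')
```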

In the screenshot below, you will notice that I prefixed the column headers that came from Google’s API response with “google-”, since that data comes from Google’s Geocoding API.

Final Excel output with lat/long coordinates using Google’s Geocoding API.

Finally, I plotted the 142 bike stations on a map in Tableau!

Map visual in Tableau showing Indego bike stations (Created by: Ghafar Shah)

Need more data! Using Tableau Prep to compile bike trips data from Indego’s website.

After putting this map visual together, I ran the small viz by one of my mentors, Josh Tapley (Tableau Public | Twitter), and he absolutely loved it. One thing Josh recommended was to see if I could find more data and turn this into a long-form viz. I started exploring Indego’s website and saw that there were Excel files with bike trip data going back to 2015 (click here), when Indego first launched. I downloaded all of these files and plugged them into Tableau Prep.

Bike trip data: https://www.rideindego.com/about/data/

There was only one issue, though. Some of these Excel files were off by one column compared to the others: a column called “bike_type”. It seems Indego did not actually start tracking the bike type (standard vs. electric) until 2018. So, for the files without it, I added the missing column with an “N/A” value. I also used the “MAKEDATE()” function in Tableau Prep to create a new column called “Date”, flagging the records in each quarterly file as Q1 (03/30), Q2 (06/30), Q3 (09/30), or Q4 (12/31) of the corresponding year. This helped me create the time-series line chart in Tableau showing bike trips and average trip duration.

Here is what my Tableau Prep flow looks like (zoomed out to 25%). Since there were so many Excel files, I used three unions and only one cleaning step. The cleaning step removed all the columns except “trip_id”, “bike_type”, “Date”, “trip_duration (in minutes)”, and “passholder_type”.

Tableau Prep flow to combine bike trips data.
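For readers who prefer code to Prep flows, here is a sketch of the same logic in pandas. This is not my actual flow, and the file-naming pattern and column names are assumptions based on the cleaned fields above:

```python
# Hypothetical pandas version of the Tableau Prep flow: union the quarterly
# files, fill the missing "bike_type" column, stamp a quarter-end "Date",
# and keep only the five columns the viz needs.
import glob
import pandas as pd

QUARTER_END = {'Q1': '03/30', 'Q2': '06/30', 'Q3': '09/30', 'Q4': '12/31'}
KEEP = ['trip_id', 'bike_type', 'Date',
        'trip_duration (in minutes)', 'passholder_type']

frames = []
for path in glob.glob('indego-trips-*.xlsx'):   # e.g. indego-trips-2017-Q2.xlsx
    year, quarter = path.removesuffix('.xlsx').split('-')[-2:]
    df = pd.read_excel(path)
    if 'bike_type' not in df.columns:           # files before 2018 lack it
        df['bike_type'] = 'N/A'
    df['Date'] = pd.to_datetime(f'{QUARTER_END[quarter]}/{year}')
    frames.append(df[KEEP])

trips = pd.concat(frames, ignore_index=True)
```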

At last, I had all my data! I then started thinking through my design, using Figma to build out my UI layout. After two days of mock-ups and numerous iterations, I finally created a visualization I really love.

Data Visualization:

Created by: Ghafar Shah

That’s it! In all honesty, this was probably one of the most challenging analyses I have done, but the experience was so worth it. I learned a ton about Python and web scraping, and I also practiced my design and storytelling skills. I cannot wait to do more of these exercises using Python and Tableau. More importantly, I really hope you found this blog journey helpful.

Thank you all so much for reading!

Link to my viz on Tableau Public:
https://public.tableau.com/profile/ghafar.shah2168#!/vizhome/PhillyIndegoBikeSharingAnalysis/PhillysIndegoBikeSharing

Link to my Python (Beautiful Soup) code on GitHub: https://github.com/ghafarshahanalytics/IndegoBikeStations

Python & Beautiful Soup Resources:

Python: https://www.w3schools.com/python/

Beautiful Soup: https://www.dataquest.io/blog/web-scraping-beautifulsoup/

Beautiful Soup: https://programminghistorian.org/en/lessons/intro-to-beautiful-soup

Beautiful Soup: https://medium.com/analytics-vidhya/detailed-tutorials-for-beginners-web-scrap-movie-database-from-multiple-pages-with-beautiful-soup-5836828d23

Beautiful Soup: https://towardsdatascience.com/scraping-tv-show-epsiode-imdb-ratings-using-python-beautifulsoup-7a9e09c4fbe5
