Friday, June 16, 2017

Seismic Events for Q1 2017

Plot it here, there and everywhere...


My friend showed me Phivolcs' data set on seismic activities and asked if I could interpret the data. I reviewed it and figured it would be great to add to my portfolio. At the same time, my Coursera assignment's deadline was almost up. We were tasked to pull data from the web and provide interpretation and data visualization. Sort of putting everything we learned together... being an independent data scientist. I decided to use the Phivolcs data for the assignment. Two birds with one stone... heh!

I started to work on the 2014 data since it showed all the data on one page. I used BeautifulSoup to scrape it, and I almost went insane after a week of working on the HTML file. I could not get the format that I needed.

So... To make things easier, I just used the 2017 data since the HTML tables were well formatted. I extracted a 3-month data sample of seismic activities from Phivolcs and converted the data to dataframes using Pandas. I used Matplotlib to plot the earthquake magnitudes and leveraged the Basemap toolkit to draw the map and plot the magnitudes in their respective longitude and latitude coordinates. Ha!


The complete Python code may be found here. If you're interested in the source file, I have extracted the raw logs and saved them as a .csv here.

Here is a breakdown of my source code:


Import the essential libraries needed. I am running Python 3.5 on Windows 8.1. As far as I can remember, I updated matplotlib to get the basemap toolkit; I ran the command pip install --upgrade matplotlib under the administrator account to initiate the task. You may refer here for the documentation on basemap.
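Since the embedded code box may not load, here is a minimal sketch of the imports (assuming pandas and the basemap toolkit are installed):

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap  # installed separately from matplotlib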


The Q1 data was extracted from Phivolcs' site. The monthly data were encoded on separate pages and were extracted separately. Pandas' read_html function returns a list of dataframes; after inspection, the tables we needed sat at the 3rd index and were extracted accordingly. The 3 dataframes were then concatenated and named df, and the columns were renamed for proper labeling. Saving the file is only optional; I used to_csv to export it.
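A rough sketch of this step. The URLs are placeholders for the actual Phivolcs monthly pages, and the column names are my guess at the usual Phivolcs table layout:

import pandas as pd

# read_html returns a list of dataframes; the table we need sits at index 3
jan = pd.read_html('http://www.phivolcs.dost.gov.ph/<january-2017-page>')[3]
feb = pd.read_html('http://www.phivolcs.dost.gov.ph/<february-2017-page>')[3]
mar = pd.read_html('http://www.phivolcs.dost.gov.ph/<march-2017-page>')[3]

df = pd.concat([jan, feb, mar], ignore_index=True)
df.columns = ['date_time', 'latitude', 'longitude', 'depth', 'magnitude', 'location']
df.to_csv('phivolcs_q1_2017.csv', index=False)  # optional export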


Finally! Time to draw! Basemap is pretty easy to follow. (Not!) I used the Transverse Mercator projection to get a realistic map. I also used a colorbar to aid in the data representation.
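A sketch of the Basemap part. The bounding-box coordinates for the Philippines are my own approximation, and the styling is illustrative:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 10))

# Transverse Mercator projection centered roughly on the Philippines
m = Basemap(projection='tmerc', lon_0=122.0, lat_0=12.0,
            llcrnrlon=116.0, llcrnrlat=4.0,
            urcrnrlon=128.0, urcrnrlat=21.0, resolution='i')
m.drawcoastlines()
m.drawmapboundary(fill_color='lightblue')
m.fillcontinents(color='beige', lake_color='lightblue')

# convert lon/lat to map coordinates, then color each event by magnitude
x, y = m(df['longitude'].values, df['latitude'].values)
sc = m.scatter(x, y, c=df['magnitude'], cmap='YlOrRd', s=15, zorder=5)
plt.colorbar(sc, label='Magnitude')
plt.title('Phivolcs Seismic Events, Q1 2017')
plt.show()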



This tutorial helped me a lot in the visualization using Basemap. I hope it can help you too.



And....the output!


Thursday, May 4, 2017

Plotting Temperatures

I recently completed the 2nd assignment in the Coursera course. I had so much fun that I had to post it on my blog! Ha!

We were tasked to analyze an 11-year data set containing the maximum and minimum temperatures for every day from 2005 to 2015. The 2005 to 2014 data was sorted, and the 2015 data was set aside. The maximum and minimum temperatures per day were plotted. February 29 data was removed to keep things clean. Finally, the 2015 data was processed and temperature outliers were highlighted.

The following libraries were used.
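The code box may not render, but the imports were likely along these lines:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt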


The data set comes from a subset of the National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). GHCN-Daily is comprised of daily climate records from thousands of land surface stations across the globe.
  • The original dataframe has 165085 rows and 4 columns and was assigned to the variable df.
  • The 'Data_Value' column is in tenths of degrees C.


Additional columns were added to df for further processing. (A sketch follows the list.)
  • February 29 entries from leap years were also removed at this stage.
  • 'temp_in_C' was used to convert 'Data_Value' to Celsius. Spyder was able to process the /10 operation, but Jupyter kept on crashing. One of the mentors advised that the division operation consumes a lot of memory, thus causing the Jupyter notebook to crash. I opted to use * 0.1 instead.
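A sketch of this step, assuming GHCN-style columns ('Date' and 'Data_Value'); the exact helper columns I used may differ:

df['Date'] = pd.to_datetime(df['Date'])
df['year'] = df['Date'].dt.year
df['month-day'] = df['Date'].dt.strftime('%m-%d')

# drop February 29 entries from leap years
df = df[df['month-day'] != '02-29']

# multiply instead of divide -- same result, but it kept Jupyter alive
df['temp_in_C'] = df['Data_Value'] * 0.1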


Created more dataframes! More fun! At this stage, the maximum and minimum temperatures per day were identified. For example, all temps for November 24 from 2005 to 2014 were collected, and the highest (or lowest) temperature was selected and placed in the new dataframe. (A sketch follows the list.)
  • I felt comfortable using pivot_table, but the groupby function may be used as well.
  • 2015 was excluded from the dataframe. We needed to identify which 2015 temperatures exceeded the 2005 to 2014 records.
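A sketch using pivot_table, under the same column-name assumptions as above:

history = df[df['year'] < 2015]  # 2005 to 2014 only

record_high = history.pivot_table(values='temp_in_C', index='month-day', aggfunc='max')
record_low = history.pivot_table(values='temp_in_C', index='month-day', aggfunc='min')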


Plotting! I'm no matplotlib expert (yet), but thanks to Stack Overflow, I survived. Yey! I did try to minimize the noise to put more emphasis on the data requirement.
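A rough sketch of the plot: a shaded record band for 2005 to 2014 with the 2015 record-breakers scattered on top. It continues from the frames above; the styling details are guesses, not my exact code:

import numpy as np
import matplotlib.pyplot as plt

days = np.arange(len(record_high))

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(days, record_high.values, color='tomato', lw=0.8, label='record high 2005-2014')
ax.plot(days, record_low.values, color='steelblue', lw=0.8, label='record low 2005-2014')
ax.fill_between(days, record_low.values.flatten(), record_high.values.flatten(),
                color='gray', alpha=0.2)

# 2015 readings that broke the ten-year records
y2015 = df[df['year'] == 2015]
hi_2015 = y2015.pivot_table(values='temp_in_C', index='month-day', aggfunc='max')
lo_2015 = y2015.pivot_table(values='temp_in_C', index='month-day', aggfunc='min')
hi_break = hi_2015['temp_in_C'] > record_high['temp_in_C']
lo_break = lo_2015['temp_in_C'] < record_low['temp_in_C']
ax.scatter(days[hi_break.values], hi_2015['temp_in_C'][hi_break], c='red', s=12,
           label='2015 record high')
ax.scatter(days[lo_break.values], lo_2015['temp_in_C'][lo_break], c='blue', s=12,
           label='2015 record low')

ax.set_ylabel('Temperature (deg C)')
ax.legend(frameon=False)
plt.show()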


Finally, here is the output.





The complete code may be found here.


Friday, April 7, 2017

Let's Bot In!





I was trying out the steps for building a Slack bot and I would like to share my notes with you guys.

Here are some takeaways from this post:
1. How to set the base Python version in your virtual environment.
2. How to switch between the created virtual environments.
3. How to export environment variables in Windows.

Let's start!

I am running a Windows 8.1 machine with both Python 2.7 and 3.5 installed in their respective folders.
c:\Python27
c:\Python35

Part of the process is to create a virtual environment to have an independent library structure and avoid conflicts. I created a virtual environment via PowerShell and used Python 3.5 as the base. Here's a guide on how to create a virtualenv on Windows with PowerShell.

To run PowerShell in Windows, go to the command prompt and run: C:\>powershell
 
By default, my machine is running Python 2.7. To set the base Python version of the virtual environment to 3.5, perform the steps below. Go to your virtual environment directory:
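Something like this (the Envs path is just a typical location, not necessarily mine):

PS C:\> cd C:\Users\<you>\Envs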

 Set the base Python version:
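With plain virtualenv, pointing at the 3.5 interpreter looks like this ("starterbot" is a placeholder name):

PS C:\Users\<you>\Envs> virtualenv --python=c:\Python35\python.exe starterbot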

To work on or switch to other virtual environments, use the command:
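If virtualenvwrapper-powershell is installed, workon does the switching; with plain virtualenv, dot-source the activate script instead (the environment name is a placeholder):

PS C:\> workon starterbot
PS C:\> .\starterbot\Scripts\activate.ps1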

One of the challenges I had was how to export the secret tokens as environment variables. The line below did not work for me.
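It was the Unix-style export, something along these lines (the variable name here is illustrative; use whatever your tutorial calls it):

$ export SLACK_BOT_TOKEN='xoxb-your-token-here'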

The syntax is different for Windows. After searching the interwebs, this article helped me out. I replaced the line with the script below and was able to extract the bot ID:
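In PowerShell, the equivalent is setting an $env: variable (again, the variable name is illustrative):

PS C:\> $env:SLACK_BOT_TOKEN = 'xoxb-your-token-here'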

That's about it! I just followed the steps and was able to deploy a Slack bot! Developing the bot's brains is a different story, but hey! I got it working... hohohoho..

Friday, March 31, 2017

Code used for the Grab Challenge 2017

Yey! We did not win!

Let me share our scripts for the submission on: The DataSeer Grab Challenge 2017





First, let us import the needed libraries.
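In case the embedded box does not render, the imports were probably along these lines:

import pandas as pd
import matplotlib.pyplot as plt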


Then process the data set using pandas. First, read_csv to open the file. We used the column "created_at_local" to derive the "date", "time" and "day of week" columns using the datetime library.
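A sketch of that step. The file name is a placeholder, and where the post used the datetime library, this version leans on pandas' .dt accessors; the derived column names match the dataframe sample below:

df1 = pd.read_csv('grab_2013.csv', parse_dates=['created_at_local'])

df1['date'] = df1['created_at_local'].dt.date
df1['time'] = df1['created_at_local'].dt.hour          # hour of day, 0-23
df1['day of week'] = df1['created_at_local'].dt.strftime('%A')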

Here is a sample of the extracted dataframe. The dataframe has 265073 rows and 13 columns.


The "city" and "city_only" dataframes were created. We'll keep "city" for later (see Fig 3).

Findings on Pickup Distance:
Fig 1: This is a summary of 2013 trips showing the average pickup distance (in km) per city. The data shows that trips tended to get cancelled when the average pickup distance was around 1.8 km or more, and completed below the 1.8 km mark.


Fig 1: ax1
Findings on AAR:
Fig 2: This graph shows the Actual Allocation Rate (AAR) per city for 2013. AAR is below 50% for all cities.



Fig 2: ax2
Recommendation on AAR:
Fig 3: If sources are considered for the AAR data, it can be seen that VNU for Metro Manila has the highest AAR, which is above 50%. We have utilized the "city" dataframe here.
*It might be a good idea to promote VNU in other cities.

Fig 3: ax3
Findings on the Relationship of Trip State with Time:
Fig 4: This graph shows the trip counts per status per hour in Cebu. Cebu had a high unallocated rate from 5pm to 7pm.


A new dataframe called "time" is created. This df uses time (in hours) instead of dates.
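Roughly like this (a sketch; "time" here indexes on the hour-of-day column derived earlier, and df1 would be filtered per city first):

time = df1.pivot_table(index='time', columns='state', values='source', aggfunc='count')
# one row per hour (0-23), one column per trip state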



Fig 4: ax4


Fig 5: This graph shows the trip counts per status per hour in Davao. Davao had a high unallocated rate at around 5pm, which started to go down by 8pm.


Fig 5: ax5
Fig 6: This graph shows the trip counts per status per hour in Metro Manila. Unallocated trips spiked at the early commute hour of 5am and dropped at around 10am; a second spike rose from 3pm and went down by 9pm, with the highest peak at around 6pm.


Fig 6: ax6


Fig 7: Cebu had high travel counts on Monday, Friday and Saturday. The daily dataframe shows the trip status and fare per day, sorted per city, along with the daily pickup distance and corresponding trip status.

Fig 7: ax7


Fig 8: Davao's travel count peaked every Thursday.


Fig 8: ax8


Fig 9: Metro Manila's peak is Friday.

Fig 9: ax9



We hope you have picked up a thing or two... Your comments are very welcome!
...pardon the HTML boxes...lol

Tuesday, March 21, 2017

We have declared war on Pie Graphs!

I signed up at meetup.com to check on upcoming IT conferences. Initially, I was interested in topics such as DevOps, AWS and Cloud Operations. I scanned through the topics and... well, who am I kidding? I won't be able to catch up with the folks there. I am way behind in Operations Technology! Ha!

I did mention before that I started a "journey" (ugh) in Python which led to studying data science. So... I searched for topics related to both and found a couple... OK, just one. DataSeer hosted an event last January 19, 2017 entitled "The Art of Data Story Telling", and this is where we declared war on Pie Graphs! Argh!

During the first few minutes of the talk, I got my big takeaway... it says, "Your data presentation should communicate the Big Take Away clearly." Ha!


Edward Tufte's concepts were also discussed. They mainly revolved around a minimalist approach to data presentation. Always mind the data-ink ratio. Avoid bad visualization; resist 3D! Additional ink only distracts. Above all else, show the data.




The minimalist approach really made sense... until I came across the work of Nigel Holmes: "Nigel Holmes, whose work regularly incorporates strong visual imagery into the fabric of the chart" (I got this somewhere and I guess it would be safe to quote it... lol). OK... I'm sleepy now; you can read more about Nigel Holmes here.

Sunday, February 26, 2017

The DataSeer Grab Challenge 2017


The Dataset has been generously provided by Grab Philippines:
  • The dimensions of the dataset are (265073 rows, 10 columns).
  • The period covered is 2013 only. The only available taxi type is GrabTaxi (GrabCar wasn't offered yet here back then, or any other taxi types for that matter).
  • Cities covered are Metro Manila, Cebu, and Davao.
  • Source is the channel whereby the booking was made. ADR is for Android smartphones, IOS for iPhones, VNU is for partner venues (like events or hotels, for example), T47 and COM are manually created from computers (for those who called in back then or were manually assigned by callouts), WIN is for Windows smartphones, and BBA for Blackberry.
  • There are two general states for a booking: allocated and unallocated. Allocated means the booking was paired with a driver. Allocated is broken down into cancelled bookings and completed rides.
  • One of our measures of success is the allocation rate, or the rate of successful matching we can do. Another is the actual allocation rate, or the rate of completed matches that actually happened. The formulas are (a quick sketch follows the list):
    • AR = (Allocated) / (Allocated + Unallocated)
    • AAR = (Completed) / (Allocated + Unallocated)
  • Fares are in PHP. As this is GrabTaxi, the formula for the fare is just meter fare + booking fee. In 2013, the booking fee (BF) was only PHP 40.
  • pick_up_distance is the distance of the driver from the passenger. This is measured via road distance and not by straight line. The optimal distance back in 2013, due to the large mismatch between supply and demand, was 3 kilometers. Anything outside that is considered an outlier and/or a very bad match (i.e., it increases the chance that the booking will be cancelled due to ETA, etc.).
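Since the formulas are simple counts, here is a quick sketch of computing them from the state column of the dataframe (assuming unallocated bookings are labeled 'UNALLOCATED'; the sample below only shows CANCELLED and COMPLETED):

counts = df1['state'].value_counts()
completed = counts.get('COMPLETED', 0)
allocated = completed + counts.get('CANCELLED', 0)
total = float(allocated + counts.get('UNALLOCATED', 0))

AR = allocated / total
AAR = completed / total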


DataFrame Sample:
In[1]:df1.head()
Out[1]:
  source    created_at_local  pick_up_latitude  pick_up_longitude  \
0    ADR 2013-09-22 23:46:18         14.604348         120.998654  
1    T47 2013-11-04 03:51:59         14.590099         121.082645  
2    T47 2013-11-21 05:21:24         14.582707         121.061458  
3    ADR 2013-09-16 20:53:34         14.585812         121.060171  
4    IOS 2013-09-10 23:49:16         14.552010         121.051260  

   drop_off_latitude  drop_off_longitude          city     fare  \
0          14.537370          120.994423  Metro Manila  281.875  
1          14.508611          121.019444  Metro Manila  413.125  
2          14.537752          121.001379  Metro Manila  277.500  
3          14.575915          121.085487  Metro Manila  220.625  
4          14.630210          120.995920  Metro Manila  378.125  

   pick_up_distance      state        date  time day of week 
0          0.389894  CANCELLED  2013-09-22    23      Sunday 
1          2.209770  COMPLETED  2013-11-04     3      Monday 
2          2.702910  COMPLETED  2013-11-21     5    Thursday 
3          0.321403  CANCELLED  2013-09-16    20      Monday 
4          0.667067  COMPLETED  2013-09-10    23     Tuesday


Fig. 1

Findings on Pickup Distance:
Fig 1: This is a summary of 2013 trips showing the average pickup distance (in km) per city. The data shows that trips tended to get cancelled when the average pickup distance was around 1.8 km or more, and completed below the 1.8 km mark.

Fig. 2

Findings on AAR:
Fig 2: This graph shows the Actual Allocation Rate (AAR) per city for 2013. AAR is below 50% for all cities.

Fig. 3

Recommendation on AAR:
Fig 3: If sources are considered for the AAR data, it can be seen that VNU for Metro Manila has the highest AAR, which is above 50%.
*It might be a good idea to promote VNU in other cities.

Fig. 4

Findings on the Relationship of Trip State with Time:
Fig 4: This graph shows the trip counts per status per hour in Cebu. Cebu had a high unallocated rate from 5pm to 7pm.

Fig. 5
Fig 5: This graph shows the trip counts per status per hour in Davao. Davao had a high unallocated rate at around 5pm, which started to go down by 8pm.

Fig. 6
Fig 6: This graph shows the trip counts per status per hour in Metro Manila. Unallocated trips spiked at the early commute hour of 5am and dropped at around 10am; a second spike rose from 3pm and went down by 9pm, with the highest peak at around 6pm.

Recommendations to address unallocated trips:
*Most of the spikes happen during rush hour. It might be a good idea to have carpool promos or cars with more than 4-passenger capacity.
*Point-to-point travel might be considered as well.

Findings on the Trip Status per Day:
Fig. 7

Figs 7 to 9 show the trip counts per status per day for the different cities.

Fig 7: Cebu had high travel counts on Monday, Friday and Saturday.

Fig 8: Davao's travel count peaked every Thursday.

Fig 9: Metro Manila's peak is Friday.


Recommendation:
*As recommended above, pooling promos and large-capacity vehicles should be maximized on the peak days of the week.

Fig. 8

Fig. 9

The noob team is composed of:
Edmil Sta. Maria
Anthony Lazam

Saturday, February 18, 2017

Pandas Pivot Table...Weeeee!

Thank you Chris Moffitt!!!! You cleared up a lot of my questions on pandas pivot tables 😃 This article made my day!

Go visit Practical Business Python for more info!