deng's mighty adventures in python: Code used for the Grab Challenge 2017

Yey! We did not win!

Let me share the scripts from our submission to the DataSeer Grab Challenge 2017.

First, let us import the needed libraries.

Then we process the data set using pandas. First, read_csv opens the file. We then used the "created_at_local" column to derive the "date", "time" and "day of week" columns using the datetime library.
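Here is a minimal sketch of this step, in case it helps. The file name and exact CSV layout are not shown in this post, so the snippet reads two rows (copied from the sample output in this post) from an in-memory string instead. It also uses the pandas .dt accessor rather than the datetime library, which is a swapped-in shortcut and not necessarily how our script did it.

```python
import io

import pandas as pd

# Two rows copied from the sample output in this post, trimmed to three columns.
# The in-memory CSV is a stand-in: the real file name and layout are not given.
csv_text = """source,created_at_local,state
ADR,2013-09-22 23:46:18,CANCELLED
T47,2013-11-04 03:51:59,COMPLETED
"""

df1 = pd.read_csv(io.StringIO(csv_text), parse_dates=["created_at_local"])

# Derive the three helper columns from the timestamp.
df1["date"] = df1["created_at_local"].dt.date
df1["time"] = df1["created_at_local"].dt.hour  # hour of day, 0 to 23
df1["day of week"] = df1["created_at_local"].dt.day_name()

print(df1[["created_at_local", "date", "time", "day of week"]])
```

Note that "time" holds only the hour of day, which is what the hourly graphs later in this post are built on.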

Here is a sample of the extracted dataframe. The dataframe has 265073 rows and 13 columns.

In [3]: df1.head()

Out[3]: 
  source    created_at_local  pick_up_latitude  pick_up_longitude  \
0    ADR 2013-09-22 23:46:18         14.604348         120.998654   
1    T47 2013-11-04 03:51:59         14.590099         121.082645   
2    T47 2013-11-21 05:21:24         14.582707         121.061458   
3    ADR 2013-09-16 20:53:34         14.585812         121.060171   
4    IOS 2013-09-10 23:49:16         14.552010         121.051260

   drop_off_latitude  drop_off_longitude          city     fare  \
0          14.537370          120.994423  Metro Manila  281.875   
1          14.508611          121.019444  Metro Manila  413.125   
2          14.537752          121.001379  Metro Manila  277.500   
3          14.575915          121.085487  Metro Manila  220.625   
4          14.630210          120.995920  Metro Manila  378.125

   pick_up_distance      state        date  time day of week  
0          0.389894  CANCELLED  2013-09-22    23      Sunday  
1          2.209770  COMPLETED  2013-11-04     3      Monday  
2          2.702910  COMPLETED  2013-11-21     5    Thursday  
3          0.321403  CANCELLED  2013-09-16    20      Monday  
4          0.667067  COMPLETED  2013-09-10    23     Tuesday

Next, we built two pivot tables: "city" (grouped by city and source) and "city_only" (grouped by city alone). We'll keep "city" for later (see Fig 3).

# Count trips and average the pickup distance per city, source and trip state.
city = pd.pivot_table(df1, index=['city', 'source'], columns=['state'], values=['fare', 'pick_up_distance'], aggfunc={'fare': 'count', 'pick_up_distance': 'mean'}, fill_value=0)
# Alloc Rate: cancelled plus completed trips over all trips (the trip counts live in the 'fare' columns).
city['Alloc Rate'] = (city[('fare', 'CANCELLED')] + city[('fare', 'COMPLETED')]) / (city[('fare', 'CANCELLED')] + city[('fare', 'COMPLETED')] + city[('fare', 'UNALLOCATED')])
# Actual Alloc Rate (AAR): completed trips over all trips.
city['Actual Alloc Rate'] = city[('fare', 'COMPLETED')] / (city[('fare', 'CANCELLED')] + city[('fare', 'COMPLETED')] + city[('fare', 'UNALLOCATED')])
city.rename(columns={'fare': 'Count of Trip Status', 'pick_up_distance': 'Pickup point ave distance (Km)'}, inplace=True)

# Same pivot and rates, grouped by city only.
city_only = pd.pivot_table(df1, index=['city'], columns=['state'], values=['fare', 'pick_up_distance'], aggfunc={'fare': 'count', 'pick_up_distance': 'mean'}, fill_value=0)
city_only['Alloc Rate'] = (city_only[('fare', 'CANCELLED')] + city_only[('fare', 'COMPLETED')]) / (city_only[('fare', 'CANCELLED')] + city_only[('fare', 'COMPLETED')] + city_only[('fare', 'UNALLOCATED')])
city_only['Actual Alloc Rate'] = city_only[('fare', 'COMPLETED')] / (city_only[('fare', 'CANCELLED')] + city_only[('fare', 'COMPLETED')] + city_only[('fare', 'UNALLOCATED')])
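To make the two rates concrete, here is a small worked example on made-up bookings. It assumes Alloc Rate means the share of bookings that got a driver (cancelled plus completed) and Actual Alloc Rate means the share that were completed. It uses pd.crosstab instead of pivot_table to get the counts, purely to keep the sketch short. The toy data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy bookings, not taken from the challenge data.
toy = pd.DataFrame({
    "city": ["Cebu"] * 4 + ["Davao"] * 2,
    "state": ["COMPLETED", "CANCELLED", "UNALLOCATED", "UNALLOCATED",
              "COMPLETED", "UNALLOCATED"],
})

# Count bookings per city and state; reindex so a state with no rows
# still gets a column of zeros instead of raising a KeyError later.
states = ["CANCELLED", "COMPLETED", "UNALLOCATED"]
counts = pd.crosstab(toy["city"], toy["state"]).reindex(columns=states, fill_value=0)

total = counts.sum(axis=1)
rates = pd.DataFrame({
    "Alloc Rate": (counts["CANCELLED"] + counts["COMPLETED"]) / total,
    "Actual Alloc Rate": counts["COMPLETED"] / total,
})
print(rates)
```

For Cebu in the toy data, 2 of 4 bookings got a driver and 1 of 4 was completed, so the two rates come out as 0.5 and 0.25.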

Findings on Pickup Distance:
Fig 1: This is a summary of 2013 trips showing the average pickup distance (in km) per city. The data suggests that trips tend to be cancelled when the average pickup distance is 1.8 km or more, and completed when it is below the 1.8 km mark.

Fig 1: ax1

Findings on AAR:
Fig 2: This graph shows the Actual Allocation Rate (AAR) per city for 2013. AAR is below 50% for all cities.

Fig 2: ax2

Recommendation on AAR:
Fig 3: When the AAR data is broken down by source, VNU in Metro Manila has the highest AAR, at above 50%. We used the "city" dataframe here. It might be a good idea to promote VNU in other cities.

Fig 3: ax3

Findings on the Relationship of Trip State with Time:
Fig 4: This graph shows the number of trips per status for each hour in Cebu. Cebu had a high unallocated rate from 5pm to 7pm.

For the hourly graphs, we created a new dataframe called "time". This dataframe uses the time (hour of day) instead of dates.
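An hourly table like this could be built roughly as follows. This is only a sketch on made-up bookings: the groupby/unstack approach, the reindex to all 24 hours and the plotting calls are assumptions about one reasonable way to do it, not a copy of our script. The toy data is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy bookings; the hours and states are made up for illustration.
toy = pd.DataFrame({
    "city": ["Cebu"] * 6,
    "time": [17, 17, 17, 18, 18, 9],
    "state": ["UNALLOCATED", "UNALLOCATED", "COMPLETED",
              "UNALLOCATED", "CANCELLED", "COMPLETED"],
})

cebu = toy[toy["city"] == "Cebu"]
# Rows: hour of day, columns: trip state, cells: number of bookings.
time = cebu.groupby(["time", "state"]).size().unstack("state", fill_value=0)
time = time.reindex(range(24), fill_value=0)  # keep every hour, even empty ones

# One line per trip state across the 24 hours.
fig, ax4 = plt.subplots()
time.plot(ax=ax4)
ax4.set_xlabel("Hour of day")
ax4.set_ylabel("Number of bookings")
```

Filtering on a different city name gives the matching table for the other cities.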

Fig 4: ax4

Fig 5: This graph shows the number of trips per status for each hour in Davao. Davao had a high unallocated rate at around 5pm, which started to go down by 8pm.

Fig 5: ax5

Fig 6: This graph shows the number of trips per status for each hour in Metro Manila. Unallocated trips in Metro Manila first spiked during the early commute, starting at 5am and dropping at around 10am. A second spike started to rise at 3pm and went down by 9pm. The highest peak is at around 6pm.

Fig 6: ax6

Fig 7: Cebu had a high travel count on Mondays, Fridays and Saturdays. The daily dataframe shows the trip status and fare per day, sorted per city. It also shows the daily pickup distance and the corresponding trip status.
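For the day-of-week view, a per-city table can be sketched like this. The toy bookings are made up, and filtering one city at a time and reindexing to calendar order are my assumptions for the sketch, not necessarily what the original daily dataframe did.

```python
import pandas as pd

# Hypothetical toy bookings, for illustration only.
toy = pd.DataFrame({
    "city": ["Cebu", "Cebu", "Cebu", "Davao"],
    "day of week": ["Monday", "Friday", "Monday", "Thursday"],
    "state": ["COMPLETED", "COMPLETED", "CANCELLED", "COMPLETED"],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

cebu = toy[toy["city"] == "Cebu"]
# Rows: day of week, columns: trip state, cells: number of bookings.
daily = cebu.groupby(["day of week", "state"]).size().unstack("state", fill_value=0)
# Day names sort alphabetically by default, so put them in calendar order.
daily = daily.reindex(day_order, fill_value=0)

# Total travel count per day and the busiest day in the toy data.
travel_count = daily.sum(axis=1)
busiest_day = travel_count.idxmax()
print(daily)
```

Without the reindex, the days would come out in alphabetical order (Friday, Monday, ...), which makes the bar chart harder to read.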

Fig 7: ax7

Fig 8: Davao's travel count peaked every Thursday.

Fig 8: ax8

Fig 9: Metro Manila's travel count peaked on Fridays.

Fig 9: ax9

We hope you have picked up a thing or two. Your comments are very welcome!
Pardon the html boxes, lol.

Friday, March 31, 2017
