Tuesday, May 8, 2018

Get intersection between 2 columns using Pandas


I came across a task wherein I had to work on ticket logs extracted as an excel file. Each ticket field was assigned a column. Ticket number, service name, summary and work log are some of the column names. The specific task is to check if engineer names appeared on the work log column. I was able to do this in excel by creating an array with the engineer names and created a new column with the formula: 



It did the trick. It shows true if the name appeared else, it would yield false. 

However...I wanted to do more. I wanted to output which names appeared in the work log instead of just true or false. I got crazy in googling the correct formula, so I tried to use the Pandas library to find the intersection.

Here is a sample file that I used. I called it 'Book1.xlsx'. Column A containts the Worklog with random strings.



The names in this case would be "dog" and "cat". The task would be to check if "dog" and "cat" appeared in the "Worklog" column.

To do this, import pandas and initiate the names. In this case, I used  "uniqueNames" for "dog" and "cat". The converted the string to a list.

The file is opened and assigned to the variable df.



Processing the columns:




Here is the output of df. This is the same as the output of the excel formula.





Now....to do more.



And here is the output for for the final dataframe (finalDf)




These are the required information:
  1. Worklog contains the original strings.
  2. Intersection Y/N performed the excel formula mentioned earlier.
  3. intersect showed which texts appeared in the Worklog




Finally, the output may be saved to excel by using. I hope this helps the interwebs!