"Do some good with what you learn and what you know," they said. So I am trying to do just that. When the latest Panama Papers data was released, I was in the midst of learning about the vast field of data science, and a thought crossed my mind: "Why not do something with the data?" And so began this small project. It seemed a daunting task for my noob mind, but then again, what's the point of learning if you cannot experiment and go out of your comfort zone? At first look at the data set I thought, "This is an experiment way out of my league." But then again, everything is to someone who is just starting out, so I said, "Onwards, and to god knows where."
So what did I do?
Sunday morning breakfast -> Download the data set -> Load it up in R -> Check all the data and find the relevant stuff -> Make a copy of the relevant data and ditch the rest -> Google about geocoding -> R-bloggers to the rescue -> Try to fit the code out of the box -> Of course it did not work -> Lots of head bashing for hours -> Lunch break -> Read through the code -> More head bashing -> Google for more geocoding alternatives -> Intuition says the R-bloggers code will work -> Back at the code again -> Hair pulling -> Cursing -> Voila! Something worked -> 10% there -> Turns out there were no column headers in the intermediary file used by the code -> Duh! Fix it! Re-wire the code and put in some customization which fits the data -> RUN! -> Runs perfectly till the 708th row and then BOOM! Error!! -> Look at the data and scratch head -> Do some mumbo jumbo (not exactly, but some much-needed cleaning of the data) -> Lots of iterations to get the needed output, fixing errors and customizing the code -> There was some green tea time in between, forgot where -> After a lot of wishing, willing and cursing I make a markdown to knit an HTML -> KNIT! -> 'Over the querying limit' -> Need I say more?
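The geocoding loop above can be sketched roughly like this. This is a minimal sketch, not the exact R-bloggers script I used; it assumes the `ggmap` package and a data frame `addresses` with a character column `address` (both names are my own for illustration):

```r
# Minimal geocoding sketch, assuming a data frame `addresses`
# with a character column `address` (hypothetical names).
library(ggmap)

results <- data.frame(address = addresses$address,
                      lon = NA_real_, lat = NA_real_,
                      stringsAsFactors = FALSE)

for (i in seq_len(nrow(results))) {
  # geocode() returns a data frame with lon/lat columns; wrap it so
  # a bad address yields NA instead of stopping the whole loop.
  coords <- tryCatch(geocode(results$address[i]),
                     error = function(e) data.frame(lon = NA, lat = NA))
  results$lon[i] <- coords$lon[1]
  results$lat[i] <- coords$lat[1]
  # The free Google endpoint enforces a query quota, which is where the
  # 'Over the querying limit' error comes from, so pause between calls.
  Sys.sleep(0.2)
}

write.csv(results, "geocoded_addresses.csv", row.names = FALSE)
```

Caching the intermediary CSV is also what saves you from re-querying everything when the quota runs out mid-run.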
The next day it worked, and here we are. So this is a summary of all the Indian addresses present in the latest release of the Panama Papers.
The source: Panama Papers
Total addresses: 151054
Redacted: 15
Missing: 4
Indian addresses: 828
Correctly geocoded: 627
Wrongly geocoded: 38 (plus 163 unsearchable due to misspelled or incomplete addresses)
The visualization is here, along with some notes on how the code came together. The page contains two maps.
The first one is a heatmap which clearly shows the concentration of addresses across the country. The results are as expected, with the majority centered around the metros.
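A density heatmap like that can be produced in a few lines with `ggmap` and `ggplot2`. This is a sketch under my own assumptions, not the code behind the actual page: it assumes a data frame `geocoded` with numeric `lon` and `lat` columns (a hypothetical name):

```r
# Sketch of a density heatmap over India, assuming a data frame
# `geocoded` with numeric `lon` and `lat` columns (hypothetical).
library(ggmap)
library(ggplot2)

# Fetch a base map tile centered on India.
india <- get_map(location = "India", zoom = 5)

# Overlay a 2D kernel density estimate of the geocoded points.
ggmap(india) +
  stat_density2d(data = geocoded,
                 aes(x = lon, y = lat,
                     fill = after_stat(level),
                     alpha = after_stat(level)),
                 geom = "polygon") +
  scale_fill_gradient(low = "yellow", high = "red") +
  guides(alpha = "none")
```

The density contours make the metro clusters pop out immediately, which is exactly the pattern described above.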
The second one provides a more in-depth view: it can be used to zoom into particular areas of interest, with the full addresses shown alongside the markers. This map shows 657 Indian addresses, which is more than the 627 R geocoded correctly. That is because I processed the addresses file externally through batchgeo.com to get a different visualization, and for some addresses beyond the correctly coded 627 the site snapped the marker to the nearest matching point or to the city center. There are still 14 addresses pinned in the wrong area; click on them to see the underlying address, which gives a picture of why that happened.
The final geocoded file is also included in the project folder on GitHub.
Thanks to R-bloggers for the geocoding script.
View Indian Addresses in a full screen map
To do :
Correct the 38 wrongly geocoded addresses.
Create a better connected visualization which shows which address was connected to which bank and where.