About

Experienced professional completing a Master of Science in Health Data Science at Harvard University, with 3 years of prior experience working with data. Extensive background as a data scientist across public health, pharmaceuticals, real estate, communications, and more, spanning corporate, social, and research settings.

Machine Learning
Data Wrangling
Business Intelligence
Tool Development, Automation and Engineering

Experience

  • Harvard University Oct 2018 - Present

    Research Assistant

    I am currently working with Dr. Xihong Lin's research group, which focuses on statistical genetics and genomics. We have taken on the sizable task of creating vector representations for known genetic variants and making them available to the general public through a software portal.

    My specific task is to build the tool from scratch (front-end and back-end) and to engineer the pipeline that queries the requested variants based on user input and returns their vector representations in tabular format and/or as data visualizations, with options to download in multiple formats.

    Tools and frameworks used: R (with focus on Shiny), Google Cloud Platform

  • Mount Sinai Health System June 2018 - Aug 2018

    Student Intern, Data Science

    During the summer of 2018, I worked with the Data Science team of the Arnhold Institute for Global Health at the Mount Sinai Health System in New York. The institute is a non-profit organization within the broader Mount Sinai ecosystem that focuses on global health and epidemiological advancement in the developing world.

    My specific project, of which I was the sole owner, dealt with an innovative way of utilizing the data at hand. Mosquitoes were a health hazard in the area, and abandoned tires were identified as a major contributor, since water collects easily in them and turns them into breeding sites for mosquitoes. To tackle this, we used a dataset of aerial images (both RGB and infrared) of multiple cities in Guatemala, taken using drones; the preliminary dataset consisted of around 7,000 images. In the time available, I designed a preliminary deep learning solution based on Convolutional Neural Networks that took these images as input, along with bounding box annotations for the abandoned tires, and predicted the locations of abandoned tires in new drone images. The implementation was based on the YOLO v3 algorithm, chosen for its efficiency at detecting small objects, since each tire occupies only a small portion of an image.

    One of the major challenges we faced was deciding whether the bounding boxes should be of fixed size and, if so, what that size should be. This was particularly important since tires are small relative to the image, and the size differences between tire types are insignificant at that scale. To choose a bounding box size that would cover all tires, we identified one of the largest tires available for commercial vehicles in Guatemala and looked up its dimensions. To know how this translated into pixels, however, we needed the drone's flight altitude. So we identified freight trucks visible in the images, looked up their real-world dimensions, and combined those with their dimensions in the image and the camera's focal length to reverse-engineer the drone altitude and derive an ideal bounding box size, as sketched below.
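    As a rough illustration of that derivation, the similar-triangles arithmetic of the pinhole camera model is sketched below. All numeric values (focal length, pixel pitch, truck and tire dimensions) are placeholders, not the project's actual figures.

```python
# Sketch of the altitude reverse-engineering described above, using the
# pinhole camera model. All numeric values are illustrative placeholders.

FOCAL_LENGTH_MM = 8.8    # assumed drone camera focal length
PIXEL_PITCH_MM = 0.0024  # assumed sensor pixel size (mm per pixel)

TRUCK_LENGTH_M = 12.0    # real-world length of an identified freight truck
TRUCK_LENGTH_PX = 210    # the same truck's length measured in the image

# Similar triangles: real_size / altitude = size_on_sensor / focal_length,
# where size_on_sensor = pixels * pixel_pitch.
size_on_sensor_m = TRUCK_LENGTH_PX * PIXEL_PITCH_MM / 1000
altitude_m = TRUCK_LENGTH_M * (FOCAL_LENGTH_MM / 1000) / size_on_sensor_m

# Project the largest commercial tire back into pixels at that altitude
# to get one fixed bounding box size that covers every tire.
TIRE_DIAMETER_M = 1.2    # assumed diameter of the largest commercial tire
tire_px = (TIRE_DIAMETER_M * (FOCAL_LENGTH_MM / 1000)
           / (altitude_m * PIXEL_PITCH_MM / 1000))

print(f"Estimated drone altitude: {altitude_m:.1f} m")
print(f"Fixed bounding box size:  {tire_px:.0f} px per side")
```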

    Tools and frameworks used: Python (with focus on Keras, pandas, NumPy)

  • Safecity Aug 2015 - Aug 2017

    Head Data Scientist

    Safecity is a non-profit based out of Mumbai, India, with partners in Nepal, Kenya, Cameroon, Nigeria, Malaysia, the United States, and elsewhere, that crowd-maps data on gender-based harassment and uses innovative data analytics to combat it. I set up Safecity's formal data team in mid-2015 and led it for 2 years as it grew. Under my leadership, the team gained structure, grew manifold, and took up many successful new initiatives.

    One of the first and most influential initiatives I took up was creating a fresh dashboard from the data collected by the organization. Initially built in advanced MS Excel, the dashboard was very extensive, cutting across harassment types, temporal variables such as day, week, month, and year, and geographical locations down to zip code. Its highlight was that it was completely dynamic, updating itself fully and regularly as new data came in. The dashboard was regularly sent to police forces and administrators at an international level to inform beat patrol timings and influence policy-level changes, and it is still being updated and used today.

    Another major initiative I led was the analysis of the organization's social media feeds. We mined tweets and posts from the organization's Twitter and Facebook feeds and extracted trends and patterns: what our followers were talking about, how much impact our posts made, what helped the cause and what did not, and more. The analysis also extended to basic NLP on the text of the tweets and posts to mine more hidden insights, as sketched below. The Twitter analysis in particular was widely appreciated and was used by the CEO at many international events and conferences to illustrate the power of social media in combating gender-based harassment.
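    A minimal sketch of that kind of trend extraction follows. The export file and column names ("created_at", "text") are assumptions about the data layout, not the organization's actual schema.

```python
# Sketch: mine monthly hashtag trends from an exported tweet archive.
# The file name and column names are assumptions about the data layout.
import re

import pandas as pd

tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])

# Pull hashtags out of each tweet as a cheap proxy for topics.
tweets["hashtags"] = tweets["text"].str.lower().apply(
    lambda text: re.findall(r"#\w+", text)
)

# Count hashtag usage per month to see what followers were talking about.
monthly = (
    tweets.explode("hashtags")
          .dropna(subset=["hashtags"])
          .groupby([pd.Grouper(key="created_at", freq="MS"), "hashtags"])
          .size()
)

# Top 5 hashtags in each month.
print(monthly.groupby(level=0, group_keys=False).nlargest(5))
```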

    Further, over the course of 2 years, I mentored close to 30 entry-level data analyst volunteers and trained them to mentor many more. In addition, I took the lead on the organization's collaborations with several major universities. Namely,

    • We partnered with Stanford University's CS 50: 'Using Tech for Good' course, where a group of students worked with us over a semester to design a data-driven mobile interface. I was the liaison for the project and the lead from the organization.
    • We worked with Dr. Susan Sorenson and Lauren Gurfein at the Ortner Center for Family Violence at the University of Pennsylvania to map street harassment in India. I took the lead on the project and developed dynamic dashboards to help analyze the survey data.
    • We collaborated with Dr. Suzanne Lea at the University of Maryland to understand trends in transportation harassment in India. Our research culminated in a paper in the Springer journal Crime Prevention and Community Safety.

    Tools and frameworks used: R, Python, MS Excel, Tableau

  • Global Eagle Mar 2017 - May 2017

    Remote Data Scientist

    Global Eagle is an international networking firm that focuses on in-flight and in-ship networks and entertainment systems. I worked with Global Eagle in the spring of 2017 on a project aimed at increasing the efficiency of bandwidth allocation.

    The organization was exploring research to understand usage trends across different kinds of apps and websites on its networks, so that it could optimize bandwidth allocation and prevent wastage. Using real-time usage data as input and cloud infrastructure, I created an extensive interactive dashboard in Tableau that enabled users within the organization to track usage patterns and make bandwidth allocation decisions in real time. The initiative was the first step towards planned organizational savings of $7 million in the short term.

    Tools and frameworks used: Tableau, Amazon Web Services (AWS)

  • Omaxe Aug 2016 - Jan 2017

    Senior Data Scientist

    Omaxe is one of the leading real estate development firms in India, employing over 1,000 people, with many times that number of contractors. I kick-started Omaxe's data science initiatives, directly assisting the Chief Operating Officer of the medium-sized firm.

    The COO wanted to optimize the daily purchase of construction materials based on demand and on each item's prices up to the previous day. We implemented basic time series models on past data to forecast these, refining the forecasts as each day's data came in. However, my biggest contribution to this role and to the organization came from an initiative I took up beyond my assigned duties. An extensive dashboard was being created manually every day from that day's incoming data and passed on to the COO so that she could follow trends and make decisions. I found this process monotonous, inefficient, and repetitive, and decided to improve it through automation. The process involved the following steps:

    • The daily data came in as an attachment via an automatically generated email
    • A dashboard with a fixed format had to be created based on this data and additional charts had to be generated
    • The files had to be compressed into a zip
    • The zip had to be emailed to the COO with metadata about the dashboard in the email body

    Utilizing a combination of R, Python, VBA, and MS Excel, I created a unified framework that identified that day's email in my inbox, downloaded the attachment, created the dashboard and charts from it, compressed and attached the files, and emailed them to the COO with supplementary information. The framework ran on top of a task scheduler so that the entire process occurred seamlessly on its own every morning.
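    A simplified, Python-only sketch of that pipeline follows. The mail hosts, credentials, addresses, and subject line are placeholders, and build_dashboard() is a hypothetical stand-in for the Excel/VBA/R dashboard step, which is not reproduced here.

```python
# Sketch of the scheduled daily pipeline described above.
# Hosts, credentials, and file names are placeholders.
import email
import imaplib
import smtplib
import zipfile
from email.message import EmailMessage
from pathlib import Path

IMAP_HOST, SMTP_HOST = "imap.example.com", "smtp.example.com"
USER, PASSWORD = "me@example.com", "app-password"
COO_ADDRESS = "coo@example.com"

def fetch_daily_attachment() -> Path:
    """Save the attachment from today's auto-generated report email."""
    imap = imaplib.IMAP4_SSL(IMAP_HOST)
    imap.login(USER, PASSWORD)
    imap.select("INBOX")
    _, ids = imap.search(None, '(UNSEEN SUBJECT "Daily Report")')
    _, data = imap.fetch(ids[0].split()[-1], "(RFC822)")
    msg = email.message_from_bytes(data[0][1])
    for part in msg.walk():
        if part.get_filename():
            out = Path(part.get_filename())
            out.write_bytes(part.get_payload(decode=True))
            return out
    raise RuntimeError("no attachment found in today's report email")

def build_dashboard(raw: Path) -> list[Path]:
    """Hypothetical stand-in for the dashboard/chart step (Excel/VBA/R)."""
    return [raw]  # real pipeline: dashboard workbook + generated charts

def main() -> None:
    files = build_dashboard(fetch_daily_attachment())
    archive = Path("dashboard.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        for f in files:
            zf.write(f)
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = "Daily dashboard", USER, COO_ADDRESS
    msg.set_content("Attached: today's dashboard and charts.")
    msg.add_attachment(archive.read_bytes(), maintype="application",
                       subtype="zip", filename=archive.name)
    with smtplib.SMTP_SSL(SMTP_HOST) as smtp:
        smtp.login(USER, PASSWORD)
        smtp.send_message(msg)

if __name__ == "__main__":
    main()  # run each morning via a task scheduler (Task Scheduler / cron)
```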

    The initiative was well appreciated: it cut the daily time involvement from 4-5 hours with 100% human involvement to under 2 minutes with zero human involvement. The GitHub repo of the project can be found here.

    Tools and frameworks used: R, Python, VBA, MS Excel

  • ZS Associates Jul 2015 - Jul 2016

    Business Analyst

    ZS Associates is a global sales and marketing consultancy based out of Evanston, IL, that primarily focuses on broad analytics for pharmaceutical corporations. As a Business Analyst at ZS, I was involved in creating long-term strategy for an upcoming business unit of a leading US-based biotechnology and pharmaceutical corporation. For this, we dynamically extracted insights from highly descriptive text to create broad themes that were updated over time.

    I also worked in predictive analytics, creating and optimizing systems to predict medical insurance premiums from a wide array of variables. R was the tool of choice, used to implement generalized non-linear regression models. My model produced better results than 48 other similar efforts.

    In addition, I worked on time series prediction of pharmaceutical product sales, where an ARIMA variant implemented in R emerged as the best-performing model. As with the previous project, it outperformed 50 other similar efforts.

    The largest-scale project I worked on here was predicting the presence of the Pap smear test for cervical cancer in the medical histories of female patients, based on diverse data such as physician visits, diagnoses, and prescriptions. The data was spread across 11 tables, which we hosted on a Teradata server. Using domain knowledge, we engineered the features into a total of 684 predictors for 1.5 million female patients; both R and Python were used to model the data, hosted on an AWS server. On the well-balanced dataset, our gradient boosted trees model achieved almost 91% accuracy.
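    A stripped-down sketch of that modeling step follows. The file and column names are placeholders for the engineered feature set, and scikit-learn stands in for whichever gradient boosting implementation was actually used.

```python
# Sketch of the gradient-boosted-trees classifier described above.
# Data file, column names, and hyperparameters are placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("engineered_features.csv")  # placeholder path
X = data.drop(columns=["had_pap_smear"])       # the 684 engineered predictors
y = data["had_pap_smear"]                      # binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = GradientBoostingClassifier(n_estimators=300, max_depth=3,
                                   learning_rate=0.1)
model.fit(X_train, y_train)

# On a well-balanced dataset, plain accuracy is a reasonable headline metric.
print(f"accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```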

    Tools and frameworks used: R, Python, MS Excel, Amazon Web Services (AWS), SQL

  • GSG Telco (now Sakon) Jan 2016 - Feb 2016

    Independent Consultant

    GSG is a telecommunications services firm based in Massachusetts with a Global Delivery Center in Pune, India. I worked with one of the firm's co-founders on a market segmentation project.

    The firm wanted to create specialized recommendations for its customers based on their usage patterns across the telecommunication services offered to them. I built an unsupervised machine learning solution over the organization's customer base using data on their service usage. The solution was based on a bootstrapped K-Means algorithm and went on to help the organization plan and offer targeted plans to customers based on their segment.
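    One common reading of a bootstrapped K-Means approach is sketched below: fit K-Means on bootstrap resamples to check that the segments are stable before committing to a final segmentation. The usage features and cluster count here are illustrative assumptions, not the firm's actual data.

```python
# Sketch of bootstrapped K-Means for customer segmentation:
# resample with replacement, refit, and check centroid stability.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
usage = rng.gamma(2.0, 1.5, size=(500, 6))  # stand-in service-usage features

K, N_BOOT = 4, 50
centroid_sets = []
for _ in range(N_BOOT):
    sample = usage[rng.integers(0, len(usage), len(usage))]  # bootstrap resample
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(sample)
    # Crude alignment: sort centroids feature-wise so resamples are comparable.
    centroid_sets.append(np.sort(km.cluster_centers_, axis=0))

# Low spread across resamples suggests the segments are stable.
spread = np.std(np.stack(centroid_sets), axis=0).mean()
print(f"mean centroid std across {N_BOOT} resamples: {spread:.3f}")

# Final segmentation over the full customer base.
segments = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(usage)
```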

    Tools and frameworks used: R

  • Mu Sigma July 2014 - June 2015

    Intern, Analytical Product Research

    Mu Sigma is one of the world's leading pure-play analytics providers. I was a full-time intern in the Analytical Research wing of Mu Sigma's Product Research group during the final year of my college curriculum.

    The Analytical Research team worked behind the scenes on muRx, Mu Sigma's flagship decision sciences workbench, an end-to-end analytical solution with both SAS and R backends. Although I started as an intern, I was soon given sole ownership of two modules under development, for which I created requirements, drew wireframes, built prototype implementations, and worked with engineers to ship them on the product. On the SAS backend, I created a hybrid clustering solution characterized by high computational speed without a large compromise on accuracy. On the R backend, I created an end-to-end distance matching module offering multiple distance types and different kinds of data input. The latter was accompanied by 60 pages of documentation dedicated to the module alone, covering theoretical background, the implementation algorithm, a usage manual, and more.
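    The core idea of such a distance matching module can be sketched as below. The metrics and data here are illustrative, and Python/SciPy stand in for the module's actual R implementation.

```python
# Sketch of distance matching: pair each record with its nearest
# counterpart under a user-selected distance type.
import numpy as np
from scipy.spatial.distance import cdist

def distance_match(queries, candidates, metric="euclidean"):
    """Return, for each query row, the index of and distance to the
    closest candidate row.

    `metric` can be any cdist-supported distance, e.g. "euclidean",
    "cityblock", or "cosine".
    """
    dists = cdist(queries, candidates, metric=metric)
    return dists.argmin(axis=1), dists.min(axis=1)

rng = np.random.default_rng(0)
treated, controls = rng.normal(size=(5, 3)), rng.normal(size=(20, 3))

idx, d = distance_match(treated, controls, metric="cityblock")
print(list(zip(idx, np.round(d, 2))))
```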

    I also began my independent research into data transformation here, building on a thread I came across during my work. It culminated in a research paper that I presented at the 2016 IEEE International Conference on Data Science and Engineering.

    Tools and frameworks used: R, SAS, MS Excel

Education

Harvard University, Boston, MA

Master of Science - Health Data Science

2017 - 2019


BITS Pilani, India

MSc. (Hons.) Mathematics + Bachelor of Engineering (Hons.) Electrical and Electronics Engineering

2010 - 2015

Loyola School, Thiruvananthapuram, India

ISC, ICSE, Science

1997 - 2010

Contact