🔗 GitHub

Description

The goal is to predict the fare amount (inclusive of tolls) for a taxi ride in New York City, given the pickup and dropoff locations. The simplest approach is a basic estimate based only on the distance between the two points. However, this yields an RMSE (root mean square error) of $5 to $8, depending on the model used. Improving on that requires additional features, such as the nature of the pickup and dropoff locations, the possible routes, trip duration, and traffic. The main challenge is to minimize the RMSE with better machine learning models. Two ways to do so are improving the dataset through careful data preprocessing and tuning the models' hyperparameters.
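The distance-only baseline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the haversine (great-circle) distance as the distance measure and a simple least-squares line as the model, and the five rides are made-up values rather than real data.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up rides: (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon, fare_usd)
rides = [
    (40.7580, -73.9855, 40.7484, -73.9857, 6.5),
    (40.7128, -74.0060, 40.7812, -73.9665, 24.0),
    (40.6413, -73.7781, 40.7580, -73.9855, 57.3),
    (40.7308, -73.9973, 40.7061, -74.0087, 11.0),
    (40.7794, -73.9632, 40.7527, -73.9772, 12.5),
]
distances = [haversine_km(*ride[:4]) for ride in rides]
fares = [ride[4] for ride in rides]

intercept, slope = fit_line(distances, fares)
predictions = [intercept + slope * d for d in distances]
print(f"fare ~ {intercept:.2f} + {slope:.2f} * km, "
      f"RMSE = {rmse(fares, predictions):.2f}")
```

The RMSE printed here is measured on the same rides used for fitting, so it says nothing about real-world accuracy; a held-out test set would be needed for that.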

Experience

This project draws on data science and machine learning. My task was to develop machine learning models that accurately predict fare amounts for taxi rides in New York City, given the pickup and dropoff locations. While working on it, I learned a great deal about my fields of interest: how to find good data sources; how to use different tools to extract data and combine data from various sources; data cleaning; exploratory data analysis; how to present information from the data effectively to a general audience; training and testing machine learning models; hyperparameter tuning to improve predictive accuracy; and how to implement several different machine learning models while understanding how they work. It also made me familiar with important tools used in these fields, including PyTorch, TensorFlow, Keras, Colab, scikit-learn, and Tableau.

Challenges Faced

  • The available datasets were small. I overcame this by combining them into a larger dataset based on a set of parameters, which enabled the models to perform better.
  • The combined dataset had several problems of its own: many rows had null values in various attributes, some columns lacked uniform formatting, and so on. I resolved these issues one by one using data preprocessing techniques.
  • Many attributes (columns) in the dataset were categorical. I had to apply data encoding techniques to convert them into numerical values so that they could be used for training the models.
  • At first, the predictions made by the trained models were not accurate at all. I addressed this through several measures, such as making many changes to the dataset and tuning the models' hyperparameters, which improved the accuracy of the models substantially.
  • At first, I was also unsure how to plot figures such as bar graphs, pie charts, line graphs, and box plots. To overcome this, I went through several online courses, blogs, and YouTube videos, and learned about several tools and libraries for plotting accurate, good-looking figures from the dataset.
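The cleaning and encoding steps mentioned in the list above might look roughly like the following sketch. It uses only the Python standard library, and everything specific in it is an assumption made for illustration: the column names, the two datetime formats, the sample rows, and the choice of one-hot encoding as the encoding technique.

```python
from datetime import datetime

# Hypothetical rows; column names and values are made up for illustration.
raw_rows = [
    {"pickup_datetime": "2015-03-01 08:15:00", "payment_type": "card", "fare_amount": 12.5},
    {"pickup_datetime": "01/03/2015 18:40", "payment_type": "cash", "fare_amount": 7.0},
    {"pickup_datetime": "2015-03-02 23:05:00", "payment_type": None, "fare_amount": 9.5},
    {"pickup_datetime": "2015-03-03 07:55:00", "payment_type": "card", "fare_amount": None},
]

# Assumed formats for the inconsistently formatted datetime column.
DATETIME_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M")


def parse_datetime(text):
    """Try each known format in turn; return None if none matches."""
    for fmt in DATETIME_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def clean(rows):
    """Drop rows with missing values and normalise the datetime column."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue
        parsed = parse_datetime(row["pickup_datetime"])
        if parsed is None:
            continue
        cleaned.append({**row, "pickup_datetime": parsed})
    return cleaned


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


features = one_hot(clean(raw_rows), "payment_type")
for row in features:
    print(row)
```

Dropping rows with nulls is the simplest option; imputing missing values is an alternative when too many rows would otherwise be lost.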

Lessons Learnt

  • How to efficiently extract and combine data from different sources.
  • How to use different data preprocessing techniques to make the available dataset suitable for training machine learning models.
  • How to communicate efficiently with supervisors (professors, in this case) about any aspect of the project.
  • How to improve the accuracy of the models using different hyperparameter tuning techniques.
  • How to plot suitable graphs of the dataset to check its correctness, observe useful patterns, and present valuable information to users.
  • How to interact effectively online with people working in the same domain as my project. This is very useful for getting problems resolved and for learning better or more efficient techniques, and it also helps expand knowledge of the field.
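As an illustration of the hyperparameter tuning mentioned in the list above, here is a minimal grid-search sketch. The write-up does not say which models or tuning methods were used, so the k-nearest-neighbours regressor, the candidate values of k, and the synthetic distance/fare data are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the fares of the k training rides closest in distance to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(fare for _, fare in nearest) / len(nearest)


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def grid_search_k(train, valid, candidates):
    """Return (best_k, scores), where scores maps each k to validation RMSE."""
    scores = {}
    for k in candidates:
        preds = [knn_predict(train, x, k) for x, _ in valid]
        scores[k] = rmse([y for _, y in valid], preds)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic (distance_km, fare_usd) pairs: a linear trend plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(200):
    km = rng.uniform(0.5, 20.0)
    data.append((km, 2.5 + 1.8 * km + rng.gauss(0, 2.0)))
train, valid = data[:150], data[150:]

best_k, scores = grid_search_k(train, valid, candidates=[1, 3, 5, 10, 25, 50])
for k, score in scores.items():
    print(f"k={k:>2}  validation RMSE={score:.2f}")
print("best k:", best_k)
```

Scoring each candidate on a held-out validation split, rather than on the training data, is what keeps the search from simply rewarding the most flexible setting. Cross-validation would give a less noisy estimate at a higher computational cost.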