Portfolio

My portfolio includes three data science projects on different topics focusing on tabular data, computer vision and package development. To see more of my work, visit my GitHub profile or download my CV.


My portfolio features the following projects:

Click "read more" to see project summaries and follow GitHub links for the code and documentation. Scroll down to see other ML and DL projects.


Profit-driven demand forecasting with gradient boosting

Notebook

Highlights

  • developed a two-stage demand forecasting pipeline with LightGBM models
  • performed thorough cleaning, aggregation and feature engineering on transactional data
  • implemented custom loss functions aimed at maximizing the retailer's profit

Summary

Forecasting demand is an important managerial task that helps to optimize inventory planning. Optimized stock levels can reduce a retailer's costs and increase customer satisfaction through faster delivery. This project uses historical purchase data to predict future demand for different products.

The project pipeline includes several crucial steps:

  • thorough data preparation, cleaning and feature engineering
  • aggregation of transactional data into the daily format
  • implementation of custom profit-driven loss functions
  • two-stage demand forecasting with LightGBM models
  • hyper-parameter tuning with Bayesian algorithms
  • stacking ensemble to further maximize the performance
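
To illustrate the idea of a profit-driven loss from the list above, here is a minimal sketch of an asymmetric squared error together with its gradient and Hessian, the two quantities a gradient boosting library typically requests from a custom objective. The cost values `UNDER_COST` and `OVER_COST` and both function names are hypothetical and chosen only for illustration; this is not the loss used in the project.

```python
from typing import List, Tuple

# Hypothetical per-unit costs, chosen only for illustration.
UNDER_COST = 3.0  # assumed lost margin per unit of unmet demand
OVER_COST = 1.0   # assumed holding cost per unit of excess stock


def asymmetric_squared_loss(y_true: float, y_pred: float) -> float:
    """Squared error that penalizes under-forecasts more than over-forecasts."""
    residual = y_pred - y_true
    weight = UNDER_COST if residual < 0 else OVER_COST
    return weight * residual ** 2


def grad_hess(y_true: List[float], y_pred: List[float]) -> Tuple[List[float], List[float]]:
    """First and second derivatives of the loss with respect to each prediction."""
    grad, hess = [], []
    for target, pred in zip(y_true, y_pred):
        residual = pred - target
        weight = UNDER_COST if residual < 0 else OVER_COST
        grad.append(2.0 * weight * residual)
        hess.append(2.0 * weight)
    return grad, hess


# True demand is 10 units; one forecast is 2 units short, one is 2 units over.
grad, hess = grad_hess([10.0, 10.0], [8.0, 12.0])
print(grad)  # [-12.0, 4.0]
print(hess)  # [6.0, 2.0]
```

With these assumed costs, a shortfall of two units is penalized three times as heavily as a surplus of two units, which pushes a model trained on this loss toward slightly higher forecasts.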

The code and documentation are available on GitHub. A detailed walkthrough is provided in this blog post.


Image-to-text translation of chemical structures with DL

Notebook

Highlights

  • built CNN-LSTM encoder-decoder models to translate images into chemical formulas
  • developed a comprehensive PyTorch GPU/TPU image captioning pipeline
  • finished in the top-5% of the Kaggle competition leaderboard with a silver medal

Summary

Organic chemists frequently draw molecular work using structural graph notations. As a result, decades of scanned publications and medical documents contain drawings not annotated with chemical formulas. Time-consuming manual work of experts is required to reliably convert such images into machine-readable formulas. Automated recognition of optical chemical structures could speed up research and development in the field.

The goal of this project is to develop a deep learning-based algorithm for chemical image captioning. In other words, the project aims to translate unlabeled chemical images into text formula strings. To do that, I work with a large dataset of more than 4 million chemical images provided by Bristol-Myers Squibb.

My solution is an ensemble of CNN-LSTM encoder-decoder models implemented in PyTorch. The solution reaches a test score of 1.31 Levenshtein distance and places in the top-5% of the competition leaderboard. The code is documented and published on GitHub.
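
For readers unfamiliar with the metric, the sketch below computes the Levenshtein (edit) distance between a predicted and a target string and averages it over a set of predictions. The function names and the short example strings are made up for illustration; they are not taken from the competition data or from the project code.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(predictions: List[str], targets: List[str]) -> float:
    """Average edit distance over paired predictions and targets."""
    distances = [levenshtein(p, t) for p, t in zip(predictions, targets)]
    return sum(distances) / len(distances)


print(levenshtein("kitten", "sitting"))                  # 3
print(mean_levenshtein(["CCO", "CCN"], ["CCO", "CCC"]))  # 0.5
```

A lower score is better: a mean distance of 1.31 means that, on average, between one and two character edits separate a predicted string from the correct one.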


fairness: Package for computing fair ML metrics

Notebook

Highlights

  • developing and actively maintaining an R package for fair machine learning
  • the package offers calculation, visualization and comparison of algorithmic fairness metrics
  • the package is published on CRAN and has more than 11k total downloads

Summary

How can the fairness of a machine learning model be measured? To date, a number of algorithmic fairness metrics have been proposed. Demographic parity, proportional parity and equalized odds are among the most commonly used metrics to evaluate group fairness in binary classification problems.

Together with Tibor V. Varga, I developed the fairness R package for fair machine learning. The package offers tools to calculate, visualize and compare commonly used metrics of algorithmic fairness across sensitive groups. Since publishing the package on CRAN in 2019, I have been actively maintaining it and extending its functionality. A comprehensive overview of fairness is provided in this blog post.
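
To give a feel for what group fairness metrics compare, the following sketch computes per-group rates on toy data: the share of positive predictions (the quantity behind parity-style metrics) and the true and false positive rates (the quantities usually compared under equalized odds). Exact metric definitions differ between sources, and this Python snippet is only an illustration with a hypothetical function name and invented data; it is not the API of the fairness R package.

```python
from typing import Dict, List


def group_rates(y_true: List[int], y_pred: List[int], group: List[str]) -> Dict[str, Dict[str, float]]:
    """Positive prediction rate, TPR and FPR for each sensitive group."""
    rates = {}
    for g in sorted(set(group)):
        idx = [i for i, value in enumerate(group) if value == g]
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        tn = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 0)
        rates[g] = {
            "positive_rate": (tp + fp) / len(idx),
            "tpr": tp / (tp + fn) if tp + fn else float("nan"),
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),
        }
    return rates


# Invented toy data: four individuals in each of two groups.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

for name, values in group_rates(y_true, y_pred, group).items():
    print(name, values)
# a {'positive_rate': 0.75, 'tpr': 1.0, 'fpr': 0.5}
# b {'positive_rate': 0.25, 'tpr': 0.5, 'fpr': 0.0}
```

In this toy example the classifier predicts the positive class three times as often for group a as for group b, and the error rates differ as well, so it would fail both a parity-style check and an equalized odds check.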


Other projects

Want to see more? Check out some of my further ML projects grouped by the application areas below. You can also visit my GitHub profile or read my recent blog posts, competition solutions and academic publications.

Cassava Leaf Disease Classification

Notebook

  • built CNNs and Vision Transformers in PyTorch to classify plant diseases
  • constructed a stacking ensemble with multiple computer vision models
  • finished in the top-1% of the Kaggle competition with a gold medal


Catheter and tube position detection with deep learning

Notebook

  • built deep learning models to detect catheter and tube position on X-ray images
  • developed a comprehensive PyTorch GPU/TPU computer vision pipeline
  • finished in the top-5% of the Kaggle competition leaderboard with a silver medal


Detecting blindness on retina photos

Notebook

  • developed CNN models to identify disease types from retina photos
  • wrote a detailed report covering problem statement, EDA and modeling
  • submitted as a capstone project within the Udacity ML Engineer program

Fair ML in credit scoring

Notebook

  • benchmarked eight fair ML algorithms on seven credit scoring data sets
  • investigated profit-fairness trade-off to quantify the cost of fairness
  • wrote a paper accepted to the European Journal of Operational Research


Google Analytics customer revenue prediction

Notebook

  • worked with two-year transactional data from a Google merchandise store
  • developed LightGBM models to predict future revenues generated by customers
  • finished in the top-2% of the Kaggle competition leaderboard with a silver medal
