My portfolio includes three data science projects covering tabular data, computer vision and package development. To see more of my work, visit my GitHub page or download my CV.
My portfolio features the following projects:
Profit-driven demand forecasting with gradient boosting
- developed a two-stage demand forecasting pipeline with LightGBM models
- performed thorough cleaning, aggregation and feature engineering on transactional data
- implemented custom loss functions aimed at maximizing the retailer's profit
Forecasting demand is an important managerial task that helps to optimize inventory planning. Optimized stock levels can reduce the retailer's costs and increase customer satisfaction through faster delivery. This project uses historical purchase data to predict future demand for different products.
The project pipeline includes several crucial steps:
- thorough data preparation, cleaning and feature engineering
- aggregation of transactional data into a daily format
- implementation of custom profit-driven loss functions
- two-stage demand forecasting with LightGBM models
- hyper-parameter tuning with Bayesian algorithms
- stacking ensemble to further improve performance
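To illustrate the idea behind a profit-driven loss, here is a minimal sketch of an asymmetric squared error. It is not the loss used in the project: the function name and the cost values `under_cost` and `over_cost` are hypothetical, and the sketch simply assumes that under-forecasting (lost sales) is more expensive than over-forecasting (excess stock).

```python
import numpy as np


def make_asymmetric_objective(under_cost, over_cost):
    """Build a custom objective returning gradient and hessian.

    under_cost and over_cost are hypothetical per-unit penalties for
    predicting too little and too much demand, respectively.
    """

    def objective(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        residual = y_pred - y_true
        # Negative residual means the forecast is below actual demand.
        weight = np.where(residual < 0, under_cost, over_cost)
        # Loss per sample: weight * residual ** 2
        grad = 2.0 * weight * residual
        hess = 2.0 * weight
        return grad, hess

    return objective


# Illustrative values only: under-forecasting is penalized three times as much.
objective = make_asymmetric_objective(under_cost=3.0, over_cost=1.0)
grad, hess = objective([10.0, 10.0], [8.0, 12.0])
print(grad, hess)
```

The factory returns a two-argument callable of the form `(y_true, y_pred) -> (grad, hess)`, which is the shape LightGBM's scikit-learn estimators accept as a custom `objective`. The costs are bound in a closure rather than passed as extra parameters so that the callable keeps exactly two arguments.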
A detailed walkthrough is provided in this blog post.
Image-to-text translation of chemical structures with deep learning
- built CNN-LSTM encoder-decoder models to translate images into chemical formulas
- developed a comprehensive PyTorch GPU/TPU image captioning pipeline
- finished in the top 5% of the Kaggle competition leaderboard with a silver medal
Organic chemists frequently draw molecular work using structural graph notations. As a result, decades of scanned publications and medical documents contain drawings not annotated with chemical formulas. Time-consuming manual work of experts is required to reliably convert such images into machine-readable formulas. Automated recognition of optical chemical structures could speed up research and development in the field.
The goal of this project is to develop a deep learning algorithm for chemical image captioning. In other words, the project aims to translate unlabeled chemical images into text formula strings. To do that, I work with a large dataset of more than 4 million chemical images provided by Bristol-Myers Squibb.
My solution is an ensemble of CNN-LSTM encoder-decoder models implemented in PyTorch. The solution reaches a test score of 1.31 Levenshtein distance and places in the top 5% of the competition leaderboard. The code is documented and published on GitHub.
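For readers unfamiliar with the metric, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming version and is not taken from the project code. The function names and example strings are made up for illustration, and averaging over prediction-target pairs is an assumption about how a score such as 1.31 could be aggregated.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def mean_levenshtein(predictions, targets):
    """Average edit distance over paired predictions and targets."""
    distances = [levenshtein(p, t) for p, t in zip(predictions, targets)]
    return sum(distances) / len(distances)


# Hypothetical formula strings, used only to exercise the functions.
predicted = ["C2H6O", "C2H5O"]
actual = ["C2H6O", "C2H6O"]
score = mean_levenshtein(predicted, actual)
print(score)
```

A lower score is better: a value of 0 means every predicted string matches its target exactly.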
fairness: Package for computing fair ML metrics
- developing and actively maintaining an R package for fair machine learning
- the package offers calculation, visualization and comparison of algorithmic fairness metrics
- the package is published on CRAN and has more than 11k total downloads
How can the fairness of a machine learning model be measured? To date, a number of algorithmic fairness metrics have been proposed. Demographic parity, proportional parity and equalized odds are among the most commonly used metrics to evaluate group fairness in binary classification problems.
Together with Tibor V. Varga, I developed the fairness R package for fair machine learning. The package offers tools to calculate, visualize and compare commonly used metrics of algorithmic fairness across sensitive groups. Since publishing the package on CRAN in 2019, I have been actively maintaining it and extending its functionality. A comprehensive overview of fairness is provided in this blog post.
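As a rough illustration of what group fairness metrics compare, the sketch below computes per-group quantities for a binary classifier: the share of positive predictions, the true positive rate and the false positive rate. It is written in Python for illustration and does not reproduce the API of the fairness R package. The function name, labels and group values are hypothetical, and exact metric definitions differ between sources.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group positive prediction rate, TPR and FPR for 0/1 labels."""
    counts = defaultdict(
        lambda: {"n": 0, "pred_pos": 0, "pos": 0, "neg": 0, "tp": 0, "fp": 0}
    )
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred
        else:
            c["neg"] += 1
            c["fp"] += pred

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            # Share of positive predictions: compared across groups by parity-style metrics.
            "positive_rate": c["pred_pos"] / c["n"],
            # Error-rate components: compared across groups by equalized-odds-style metrics.
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
    return rates


# Toy data for two hypothetical sensitive groups "a" and "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = group_rates(y_true, y_pred, groups)
print(rates)
```

Large gaps between groups in any of these quantities suggest that the model treats the groups differently. Which gap matters most depends on the fairness definition chosen for the application.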