Portfolio

My portfolio includes three data science projects on different topics focusing on tabular data, computer vision and natural language processing. To see more of my work, visit my GitHub profile or download my CV.


My portfolio features the following projects:

Click "read more" to see the project summaries. Follow the GitHub links for the code and documentation. Scroll down to see more ML and DL projects in different application areas.


Text Readability Prediction with Transformers

Notebook

Highlights

  • developed a comprehensive PyTorch pipeline for text classification
  • implemented eight transformers including BERT, RoBERTa and others
  • built an interactive web app for custom text reading complexity estimation

Tags: natural language processing, deep learning, web app

Summary

Estimating text reading complexity is a crucial task for school teachers. Offering students text passages at the right level of challenge is important for the fast development of reading skills. Existing tools for estimating text complexity rely on weak proxies and heuristics, which results in suboptimal accuracy. This project uses deep learning to predict the readability scores of text passages.

My solution implements eight transformer models, including BERT, RoBERTa and others, in PyTorch. The models feature a custom regression head that uses a concatenated output of multiple hidden layers. The modeling pipeline implements text augmentations such as sentence order shuffling, backtranslation and target noise injection. The solution places in the top-9% of the Kaggle competition leaderboard.
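Two of the augmentations named above can be illustrated with a minimal sketch in plain Python. The function names, the naive regex-based sentence splitter and the noise level are assumptions made for this illustration, not the project's actual implementation.

```python
import random
import re


def shuffle_sentences(text, rng):
    """Return the passage with its sentences in a random order."""
    # Naive split on whitespace that follows ., ! or ? (an assumption;
    # a real pipeline may use a proper sentence tokenizer).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rng.shuffle(sentences)
    return " ".join(sentences)


def add_target_noise(target, rng, std=0.1):
    """Perturb a readability score with zero-mean Gaussian noise."""
    # std=0.1 is an illustrative default, not a tuned value.
    return target + rng.gauss(0.0, std)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
passage = "The cat sat. It was warm! Did it sleep?"
augmented = shuffle_sentences(passage, rng)
noisy_score = add_target_noise(-1.2, rng, std=0.1)
```

Both functions leave the content of a training example intact and only vary its presentation or label slightly, which is the usual motivation for applying such augmentations during training.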

The project also includes an interactive web app built in Python. The app lets users estimate the reading complexity of a custom text using two of the trained transformer models. The code and documentation are available on GitHub.


Profit-Driven Demand Forecasting with Gradient Boosting

Notebook

Highlights

  • developed a two-stage demand forecasting pipeline with LightGBM models
  • performed a thorough cleaning, aggregation and feature engineering on transactional data
  • implemented custom loss functions aimed at maximizing the retailer's profit

Tags: tabular data, e-commerce, profit maximization

Summary

Forecasting demand is an important managerial task that helps to optimize inventory planning. Optimized stock levels can reduce the retailer's costs and increase customer satisfaction through faster delivery. This project uses historical purchase data to predict future demand for different products.

The project pipeline includes several crucial steps:

  • thorough data preparation, cleaning and feature engineering
  • aggregation of transactional data into the daily format
  • implementation of custom profit-driven loss functions
  • two-stage demand forecasting with LightGBM models
  • hyper-parameter tuning with Bayesian algorithms
  • stacking ensemble to further maximize the performance
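To give a feel for what a custom loss function in the list above can look like, here is a minimal sketch of an asymmetric squared error. The loss form, the function name and the `over_weight` value are assumptions chosen for illustration; the source does not state the exact profit function used in the project. Gradient boosting libraries generally need the first and second derivatives of the loss with respect to the prediction, so the function returns both.

```python
def asymmetric_squared_error(y_true, y_pred, over_weight=2.0):
    """Gradient and Hessian of an asymmetric squared error.

    Over-forecasts (y_pred > y_true) are weighted by `over_weight`,
    under-forecasts by 1. Both weights are illustrative assumptions.
    """
    grad, hess = [], []
    for actual, predicted in zip(y_true, y_pred):
        residual = predicted - actual
        weight = over_weight if residual > 0 else 1.0
        grad.append(2.0 * weight * residual)  # d/d_pred of weight * residual^2
        hess.append(2.0 * weight)             # second derivative
    return grad, hess


# Over-forecasting by 2 units is penalized more than under-forecasting by 2.
grad, hess = asymmetric_squared_error([10.0, 10.0], [12.0, 8.0])
```

The asymmetry is the point of the design: when overstocking and understocking cost the retailer different amounts, the loss can push forecasts toward the cheaper kind of error. To plug a function of this shape into a boosting library, the lists would typically be replaced by NumPy arrays.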

The code and documentation are available on GitHub. A detailed walkthrough is provided in this blog post.


Image-to-Text Translation of Chemical Structures with Deep Learning

Notebook

Highlights

  • built a CNN-LSTM encoder-decoder architecture to translate images into chemical formulas
  • developed a comprehensive PyTorch GPU/TPU image captioning pipeline
  • finished in the top-5% of the Kaggle competition leaderboard with a silver medal

Tags: computer vision, natural language processing, deep learning

Summary

Organic chemists frequently draw molecules using structural graph notations. As a result, decades of scanned publications and medical documents contain drawings that are not annotated with chemical formulas. Converting such images into machine-readable formulas reliably requires time-consuming manual work by experts. Automated recognition of optical chemical structures could speed up research and development in the field.

The goal of this project is to develop a deep learning algorithm for chemical image captioning. In other words, the project aims to translate unlabeled chemical images into text formula strings. To do that, I work with a large dataset of more than 4 million chemical images provided by Bristol-Myers Squibb.

My solution is an ensemble of CNN-LSTM encoder-decoder models implemented in PyTorch. The solution reaches a test score of 1.31 (Levenshtein distance) and places in the top-5% of the competition leaderboard. The code is documented and published on GitHub.
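For readers unfamiliar with the metric, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain-Python illustration; the helper names and the toy strings are hypothetical, and averaging the distance over examples is an assumption about how the score is aggregated.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(predictions, targets):
    """Average edit distance over paired predicted and true strings."""
    pairs = list(zip(predictions, targets))
    return sum(levenshtein(p, t) for p, t in pairs) / len(pairs)


# Toy strings, not real competition data.
score = mean_levenshtein(["C2H6O", "CH4"], ["C2H5O", "CH4"])
```

Under that assumption, a score of 1.31 would mean that a predicted formula string differs from the true one by roughly 1.3 character edits on average. Lower is better, and 0 means every string matches exactly.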


Other projects

Want to see more? Check out more of my ML projects below, grouped by application area. You can also visit my GitHub profile or read my recent blog posts, competition solutions and academic publications.

Cassava Leaf Disease Classification

Notebook

  • built CNNs and Vision Transformers in PyTorch to classify plant diseases
  • constructed a stacking ensemble with multiple computer vision models
  • finished in the top-1% of the Kaggle competition with a gold medal


Catheter and Tube Position Detection on Chest X-Rays

Notebook

  • built deep learning models to detect catheter and tube position on X-ray images
  • developed a comprehensive PyTorch GPU/TPU computer vision pipeline
  • finished in the top-5% of the Kaggle competition leaderboard with a silver medal


Detecting Blindness on Retina Photos

Notebook

  • developed CNN models to identify disease types from retina photos
  • wrote a detailed report covering problem statement, EDA and modeling
  • submitted as a capstone project within the Udacity ML Engineer program

Fair ML in Credit Scoring

Notebook

  • benchmarked eight fair ML algorithms on seven credit scoring data sets
  • investigated profit-fairness trade-off to quantify the cost of fairness
  • published a paper with the results in the European Journal of Operational Research


Google Analytics Customer Revenue Prediction

Notebook

  • worked with two-year transactional data from a Google merchandise store
  • developed LightGBM models to predict future revenues generated by customers
  • finished in the top-2% of the Kaggle competition leaderboard with a silver medal

fairness: Package for Computing Fair ML Metrics

Notebook

  • developing and actively maintaining an R package for fair machine learning
  • the package offers calculation, visualization and comparison of algorithmic fairness metrics
  • the package is published on CRAN and has more than 11k total downloads


dptools: Package for Data Processing and Feature Engineering

Notebook

  • Python package with helper functions to simplify common data processing tasks
  • functions cover feature engineering, data aggregation, working with missings and more
  • the source code and documentation are available on GitHub and PyPI