Portfolio

My portfolio includes three ML projects on different topics focusing on computer vision, NLP and tabular data. To see more of my work, visit my GitHub page, download my CV or check out the about page.

My portfolio includes three ML projects on different topics focusing on computer vision, NLP and tabular data. To see more of my work, visit my GitHub page, download my CV or check out the about page.


My portfolio features the following projects:

Click "read more" to see project summaries. Follow GitHub links for code and documentation. Scroll down to see more ML projects grouped by application domainas and links to further pages with my work.


Text Readability Prediction with Transformers

Notebook

Highlights

  • developed a comprehensive PyTorch / HuggingFace text classification pipeline
  • build multiple transformers including BERT and RoBERTa with custom pooling layers
  • implemented an interactive web app for custom text reading complexity estimation

Tags: natural language processing, deep learning, web app

Summary

Estimating text reading complexity is a crucial task for school teachers. Offering students text passages at the right level of challenge is important for facilitating a fast development of reading skills. The existing tools to estimate text complexity rely on weak proxies and heuristics, which results in a suboptimal accuracy. In this project, I use deep learning to predict the readability scores of text passages.

My solution implements eight transformer models, including BERT, RoBERTa and others in PyTorch. The models feature a custom regression head that uses a concatenated output of multiple hidden layers. The modeling pipeline includes text augmentations such as sentence order shuffle, backtranslation and injecting target noise. The solution places in the top-9% of the Kaggle competition leaderboard.

The project also includes an interactive web app built in Python. The app allows to estimate reading complexity of a custom text using two of the trained transformer models. The code and documentation are available on GitHub.

Notebook


Image-to-Text Translation of Molecules with Deep Learning

Notebook

Highlights

  • built a CNN-LSTM encoder-decoder architecture to translate images into chemical formulas
  • developed a comprehensive PyTorch GPU/TPU image captioning pipeline
  • finished in the top-5% of the Kaggle competition leaderboard with a silver medal

Tags: computer vision, natural language processing, deep learning

Summary

Organic chemists frequently draw molecular work using structural graph notations. As a result, decades of scanned publications and medical documents contain drawings not annotated with chemical formulas. Time-consuming manual work of experts is required to reliably convert such images into machine-readable formulas. Automated recognition of optical chemical structures could speed up research and development in the field.

The goal of this project is to develop a deep learning based algorithm for chemical image captioning. In other words, the project aims at translating unlabeled chemical images into the text formula strings. To do that, I work with a large dataset of more than 4 million chemical images provided by Bristol-Myers Squibb.

My solution is an ensemble of CNN-LSTM Encoder-Decoder models implemented in PyTorch.The solution reaches the test score of 1.31 Levenstein Distance and places in the top-5% of the competition leaderboard. The code is documented and published on GitHub.


Fair Machine Learning in Credit Scoring

Notebook

Highlights

  • benchmarked eight fair ML algorithms on seven credit scoring data sets
  • investigated profit-fairness trade-off to quantify the cost of fairness
  • published a paper with the results at the European Journal of Operational Research

Tags: tabular data, fairness, profit maximization

Summary

The rise of algorithmic decision-making has spawned much research on fair ML. In this project, I focus on fairness of credit scoring decisions and make three contributions:

  • revisiting statistical fairness criteria and examining their adequacy for credit scoring
  • cataloging algorithmic options for incorporating fairness goals in the ML model development pipeline
  • empirically comparing different fairness algorithms in a profit-oriented credit scoring context using real-world data

The code and documentation are available on GitHub. A detailed walkthrough and key results are published in this paper.

The study reveals that multiple fairness criteria can be approximately satisfied at once and recommends separation as a proper criterion for measuring scorecard fairness. It also finds fair in-processors to deliver a good profit-fairness balance and shows that algorithmic discrimination can be reduced to a reasonable level at a relatively low cost.

Notebook

Further projects

Want to see more? Check out my further ML projects grouped by application areas below. You can also visit my GitHub page, check my recent blog posts, watch public tech talks and read academic publications.

Pet Popularity Prediction

Notebook

  • built a PyTorch pipeline for predicting pet cuteness from image and tabular data
  • reached top-4% in the Kaggle competition using Transformers and CNNs
  • implemented an interactive web app for estimating cuteness of custom pet photos


Cassava Leaf Disease Classification

Notebook

  • built CNNs and Vision Transformers in PyTorch to classify plant diseases
  • constructed a stacking ensemble with multiple computer vision models
  • finished in the top-1% of the Kaggle competition with a gold medal


Catheter and Tube Position Detection on Chest X-Rays

Notebook

  • built deep learning models to detect catheter and tube position on X-ray images
  • developed a comprehensive PyTorch GPU/TPU computer vision pipeline
  • finished in the top-5% of the Kaggle competition leaderboard with silver medal


Detecting Blindness on Retina Photos

Notebook

  • developed CNN models to identify disease types from retina photos
  • written a detailed report covering problem statement, EDA and modeling
  • submitted as a capstone project within the Udacity ML Engineer program

Profit-Driven Demand Forecasting with Gradient Boosting

Notebook

  • developed a two-stage demand forecasting pipeline with LightGBM models
  • performed a thorough cleaning, aggregation and feature engineering on transactional data
  • implemented custom loss functions aimed at maximizing the retailer's profit


Google Analytics Customer Revenue Prediction

Notebook

  • worked with two-year transactional data from a Google merchandise store
  • developed LightGBM models to predict future revenues generated by customers
  • finished in the top-2% of the Kaggle competition leaderboard with silver medal

fairness: Package for Computing Fair ML Metrics

Notebook

  • developing and actively maintaining an R package for fair machine learning
  • the package offers calculation, visualization and comparison of algorithmic fairness metrics
  • the package is published on CRAN and has more than 16k total downloads


dptools: Package for Data Processing and Feature Engineering

Notebook

  • Python package with helper functions to simplify common data processing tasks
  • functions cover feature engineering, data aggregation, working with missings and more
  • the source code and documentation are available on GitHub and PyPi