Portfolio

My portfolio includes three ML projects on different topics focusing on computer vision, NLP and tabular data. To see more of my work, visit my GitHub page, download my CV or check out the about page.

My portfolio features the following projects:

📖 Text reading complexity prediction with transformers
🧬 Image-to-text translation of chemical structures with deep learning
📈 Fair machine learning in credit scoring applications

Click "read more" to see project summaries. Follow GitHub links for code and documentation. Scroll down to see more ML projects grouped by application domainas and links to further pages with my work.

Text Readability Prediction with Transformers

Highlights

developed a comprehensive PyTorch / HuggingFace text classification pipeline
build multiple transformers including BERT and RoBERTa with custom pooling layers
implemented an interactive web app for custom text reading complexity estimation

Tags: natural language processing, deep learning, web app

Summary

Estimating text reading complexity is a crucial task for school teachers. Offering students text passages at the right level of challenge is important for facilitating a fast development of reading skills. The existing tools to estimate text complexity rely on weak proxies and heuristics, which results in a suboptimal accuracy. In this project, I use deep learning to predict the readability scores of text passages.

My solution implements eight transformer models, including BERT, RoBERTa and others in PyTorch. The models feature a custom regression head that uses a concatenated output of multiple hidden layers. The modeling pipeline includes text augmentations such as sentence order shuffle, backtranslation and injecting target noise. The solution places in the top-9% of the Kaggle competition leaderboard.

The project also includes an interactive web app built in Python. The app allows to estimate reading complexity of a custom text using two of the trained transformer models. The code and documentation are available on GitHub.

Image-to-Text Translation of Molecules with Deep Learning

Highlights

built a CNN-LSTM encoder-decoder architecture to translate images into chemical formulas
developed a comprehensive PyTorch GPU/TPU image captioning pipeline
finished in the top-5% of the Kaggle competition leaderboard with a silver medal

Tags: computer vision, natural language processing, deep learning

Summary

Organic chemists frequently draw molecular work using structural graph notations. As a result, decades of scanned publications and medical documents contain drawings not annotated with chemical formulas. Time-consuming manual work of experts is required to reliably convert such images into machine-readable formulas. Automated recognition of optical chemical structures could speed up research and development in the field.

The goal of this project is to develop a deep learning based algorithm for chemical image captioning. In other words, the project aims at translating unlabeled chemical images into the text formula strings. To do that, I work with a large dataset of more than 4 million chemical images provided by Bristol-Myers Squibb.

My solution is an ensemble of CNN-LSTM Encoder-Decoder models implemented in PyTorch.The solution reaches the test score of 1.31 Levenstein Distance and places in the top-5% of the competition leaderboard. The code is documented and published on GitHub.

Fair Machine Learning in Credit Scoring

Highlights

benchmarked eight fair ML algorithms on seven credit scoring data sets
investigated profit-fairness trade-off to quantify the cost of fairness
published a paper with the results at the European Journal of Operational Research

Tags: tabular data, fairness, profit maximization

Summary

The rise of algorithmic decision-making has spawned much research on fair ML. In this project, I focus on fairness of credit scoring decisions and make three contributions:

revisiting statistical fairness criteria and examining their adequacy for credit scoring
cataloging algorithmic options for incorporating fairness goals in the ML model development pipeline
empirically comparing different fairness algorithms in a profit-oriented credit scoring context using real-world data

The code and documentation are available on GitHub. A detailed walkthrough and key results are published in this paper.

The study reveals that multiple fairness criteria can be approximately satisfied at once and recommends separation as a proper criterion for measuring scorecard fairness. It also finds fair in-processors to deliver a good profit-fairness balance and shows that algorithmic discrimination can be reduced to a reasonable level at a relatively low cost.

Further projects

Want to see more? Check out my further ML projects grouped by application areas below. You can also visit my GitHub page, check my recent blog posts, watch public tech talks and read academic publications.

Pet Popularity Prediction

built a PyTorch pipeline for predicting pet cuteness from image and tabular data
reached top-4% in the Kaggle competition using Transformers and CNNs
implemented an interactive web app for estimating cuteness of custom pet photos

Cassava Leaf Disease Classification

built CNNs and Vision Transformers in PyTorch to classify plant diseases
constructed a stacking ensemble with multiple computer vision models
finished in the top-1% of the Kaggle competition with a gold medal

Catheter and Tube Position Detection on Chest X-Rays

built deep learning models to detect catheter and tube position on X-ray images
developed a comprehensive PyTorch GPU/TPU computer vision pipeline
finished in the top-5% of the Kaggle competition leaderboard with silver medal

Detecting Blindness on Retina Photos

developed CNN models to identify disease types from retina photos
written a detailed report covering problem statement, EDA and modeling
submitted as a capstone project within the Udacity ML Engineer program

Profit-Driven Demand Forecasting with Gradient Boosting

developed a two-stage demand forecasting pipeline with LightGBM models
performed a thorough cleaning, aggregation and feature engineering on transactional data
implemented custom loss functions aimed at maximizing the retailer's profit

Google Analytics Customer Revenue Prediction

worked with two-year transactional data from a Google merchandise store
developed LightGBM models to predict future revenues generated by customers
finished in the top-2% of the Kaggle competition leaderboard with silver medal

`fairness`: Package for Computing Fair ML Metrics

developing and actively maintaining an R package for fair machine learning
the package offers calculation, visualization and comparison of algorithmic fairness metrics
the package is published on CRAN and has more than 16k total downloads

`dptools`: Package for Data Processing and Feature Engineering

Python package with helper functions to simplify common data processing tasks
functions cover feature engineering, data aggregation, working with missings and more
the source code and documentation are available on GitHub and PyPi

Portfolio

Text Readability Prediction with Transformers

Highlights

Summary

Image-to-Text Translation of Molecules with Deep Learning

Highlights

Summary

Fair Machine Learning in Credit Scoring

Highlights

Summary

Further projects

Pet Popularity Prediction

Cassava Leaf Disease Classification

Catheter and Tube Position Detection on Chest X-Rays

Detecting Blindness on Retina Photos

Profit-Driven Demand Forecasting with Gradient Boosting

Google Analytics Customer Revenue Prediction

fairness: Package for Computing Fair ML Metrics

dptools: Package for Data Processing and Feature Engineering

`fairness`: Package for Computing Fair ML Metrics

`dptools`: Package for Data Processing and Feature Engineering