My portfolio includes three data science projects on different topics focusing on tabular data, computer vision and natural language processing. To see more of my work, visit my GitHub profile or download my CV.
My portfolio features the following projects:
- 📖 Text reading complexity prediction with transformers
- 📈 Profit-driven demand forecasting with gradient boosting
- 🧬 Image-to-text translation of chemical structures with deep learning
Click "read more" to see the project summaries. Follow the GitHub links for the code and documentation. Scroll down to see more ML and DL projects in different application areas.
Text Readability Prediction with Transformers
- developed a comprehensive PyTorch pipeline for text readability prediction
- implemented eight transformers including BERT, RoBERTa and others
- built an interactive web app for custom text reading complexity estimation
Tags: natural language processing, deep learning, web app
Estimating text reading complexity is a crucial task for school teachers. Offering students text passages at the right level of challenge helps them develop reading skills quickly. Existing tools for estimating text complexity rely on weak proxies and heuristics, which results in suboptimal accuracy. This project uses deep learning to predict the readability scores of text passages.
My solution implements eight transformer models, including BERT, RoBERTa and others in PyTorch. The models feature a custom regression head that uses a concatenated output of multiple hidden layers. The modeling pipeline implements text augmentations such as sentence order shuffle, backtranslation and injecting target noise. The solution places in the top-9% of the Kaggle competition leaderboard.
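Two of the augmentations mentioned above, sentence order shuffle and target noise injection, can be illustrated with a minimal sketch. It is an illustration under assumptions, not the project's actual code: the function names, the naive regex sentence splitter and the Gaussian noise scale are all hypothetical.

```python
import random
import re


def shuffle_sentences(text, rng):
    """Return the passage with its sentences in a random order.

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    A real pipeline would likely use a proper sentence tokenizer.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rng.shuffle(sentences)
    return " ".join(sentences)


def add_target_noise(target, rng, scale=0.1):
    """Add zero-mean Gaussian noise to a readability score (hypothetical scale)."""
    return target + rng.gauss(0.0, scale)


rng = random.Random(42)  # fixed seed so the augmentation is reproducible
passage = "The cat sat. It was warm. Then it slept."
augmented_text = shuffle_sentences(passage, rng)
augmented_target = add_target_noise(-0.5, rng, scale=0.1)
```

Both functions take an explicit random generator, so the same seed reproduces the same augmented training sample.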
The project also includes an interactive web app built in Python. The app lets users estimate the reading complexity of a custom text using two of the trained transformer models. The code and documentation are available on GitHub.
Profit-Driven Demand Forecasting with Gradient Boosting
- developed a two-stage demand forecasting pipeline with LightGBM models
- performed thorough cleaning, aggregation and feature engineering on transactional data
- implemented custom loss functions aimed at maximizing the retailer's profit
Tags: tabular data, e-commerce, profit maximization
Forecasting demand is an important managerial task that helps to optimize inventory planning. Optimized stock levels can reduce the retailer's costs and increase customer satisfaction through faster delivery. This project uses historical purchase data to predict future demand for different products.
The project pipeline includes several crucial steps:
- thorough data preparation, cleaning and feature engineering
- aggregation of transactional data into the daily format
- implementation of custom profit-driven loss functions
- two-stage demand forecasting with LightGBM models
- hyper-parameter tuning with Bayesian algorithms
- stacking ensemble to further maximize the performance
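To illustrate what a profit-driven loss might look like, here is a minimal sketch of an asymmetric squared error that penalizes under-forecasting (lost sales) more heavily than over-forecasting (excess stock). The exact loss used in the project is not stated here, so the functional form, the function name and the weights are assumptions. The function returns per-sample gradients and Hessians, which is the pair of quantities that gradient boosting libraries typically expect from a custom objective.

```python
def asymmetric_squared_grad_hess(y_true, y_pred, under_weight=3.0, over_weight=1.0):
    """Gradient and Hessian of an asymmetric squared error (illustrative only).

    loss_i = w_i * (pred_i - true_i) ** 2
    w_i    = under_weight if pred_i < true_i (demand under-forecast)
             over_weight  otherwise          (demand over-forecast)

    The weights are hypothetical; in practice they could be derived from
    the product margin and the cost of holding unsold stock.
    """
    grad, hess = [], []
    for true, pred in zip(y_true, y_pred):
        weight = under_weight if pred < true else over_weight
        grad.append(2.0 * weight * (pred - true))  # first derivative w.r.t. pred
        hess.append(2.0 * weight)                  # second derivative w.r.t. pred
    return grad, hess


# Same absolute error of 2 units, but the under-forecast is penalized more.
grad, hess = asymmetric_squared_grad_hess([10, 10], [8, 12])
```

With a library such as LightGBM, the lists would be returned as NumPy arrays from a custom objective function.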
Image-to-Text Translation of Chemical Structures with Deep Learning
- built a CNN-LSTM encoder-decoder architecture to translate images into chemical formulas
- developed a comprehensive PyTorch GPU/TPU image captioning pipeline
- finished in the top-5% of the Kaggle competition leaderboard with a silver medal
Tags: computer vision, natural language processing, deep learning
Organic chemists frequently draw molecular work using structural graph notations. As a result, decades of scanned publications and medical documents contain drawings not annotated with chemical formulas. Time-consuming manual work of experts is required to reliably convert such images into machine-readable formulas. Automated recognition of optical chemical structures could speed up research and development in the field.
The goal of this project is to develop a deep learning algorithm for chemical image captioning. In other words, the project aims to translate unlabeled chemical images into text formula strings. To do that, I work with a large dataset of more than 4 million chemical images provided by Bristol-Myers Squibb.
My solution is an ensemble of CNN-LSTM encoder-decoder models implemented in PyTorch. The solution reaches a test score of 1.31 Levenshtein distance and places in the top-5% of the competition leaderboard. The code is documented and published on GitHub.
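For readers unfamiliar with the metric, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming implementation with a helper that averages the distance over prediction-reference pairs. Treating the reported score as such an average is an assumption, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(predictions, references):
    """Average edit distance over paired predicted and reference strings."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    total = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)
```

A score of 1.31 would then mean that a predicted string differs from its reference by a little more than one character edit on average.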
Want to see more? Check out some of my further ML projects grouped by the application areas below. You can also visit my GitHub profile or read my recent blog posts, competition solutions and academic publications.
Cassava Leaf Disease Classification
- built CNNs and Vision Transformers in PyTorch to classify plant diseases
- constructed a stacking ensemble with multiple computer vision models
- finished in the top-1% of the Kaggle competition with a gold medal
Catheter and Tube Position Detection on Chest X-Rays
- built deep learning models to detect catheter and tube position on X-ray images
- developed a comprehensive PyTorch GPU/TPU computer vision pipeline
- finished in the top-5% of the Kaggle competition leaderboard with a silver medal
Detecting Blindness on Retina Photos
- developed CNN models to identify disease types from retina photos
- wrote a detailed report covering problem statement, EDA and modeling
- submitted as a capstone project within the Udacity ML Engineer program
Fair ML in Credit Scoring
- benchmarked eight fair ML algorithms on seven credit scoring data sets
- investigated profit-fairness trade-off to quantify the cost of fairness
- published a paper with the results in the European Journal of Operational Research
Google Analytics Customer Revenue Prediction
- worked with two-year transactional data from a Google merchandise store
- developed LightGBM models to predict future revenues generated by customers
- finished in the top-2% of the Kaggle competition leaderboard with a silver medal
fairness: Package for Computing Fair ML Metrics
- developing and actively maintaining an R package for fair machine learning
- the package offers calculation, visualization and comparison of algorithmic fairness metrics
- the package is published on CRAN and has more than 11k total downloads
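As an example of an algorithmic fairness metric, the sketch below computes the statistical parity difference: the gap in positive prediction rates between two groups. It is written in Python for illustration and does not reproduce the API of the R package; the function names and the toy data are hypothetical.

```python
def positive_rate(predictions, groups, group):
    """Share of positive (1) predictions among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    if not selected:
        raise ValueError(f"no observations for group {group!r}")
    return sum(selected) / len(selected)


def statistical_parity_difference(predictions, groups, privileged, unprivileged):
    """Positive rate of the unprivileged group minus that of the privileged group.

    A value of 0 means both groups receive positive predictions equally often.
    """
    return (positive_rate(predictions, groups, unprivileged)
            - positive_rate(predictions, groups, privileged))


# Toy data: binary predictions and a group label per observation.
predictions = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = statistical_parity_difference(predictions, groups,
                                    privileged="a", unprivileged="b")
```

Here group "a" receives a positive prediction 75% of the time and group "b" 25% of the time, so the gap is negative.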