Dataiku
September 16, 2022
Data Science, How-To, Code, Linux
Dataiku is a data science platform. It aims to provide a single system for handling all data science tasks, such as:
- Cleaning datasets
- Transforming data
- Training models
- Testing models
- Deploying models
- Monitoring model drift
The closest software development analogy for a data science platform is an IDE (IntelliJ, PyCharm, VS Code).
Intro video for Dataiku.
Dataiku Advantages
- Visual Representation : A graph called the Flow represents the interactions between datasets and code in a project.
- Code Reusability : A library holds shared code, with the aim of increasing reuse and avoiding duplication.
- Cloud Agnostic : Dataiku can be installed on the major cloud providers or on any Linux server. Dataiku projects can be exported and moved to any other Dataiku instance.
Linux Installation
- Installed Ubuntu 18.04 LTS in VirtualBox 6.1
- Installed libraries
    sudo apt update
    sudo apt install -y python3.7 libsqlite3-dev build-essential
- Followed the steps for Linux installation given in the Dataiku docs
    wget https://cdn.downloads.dataiku.com/public/dss/11.0.2/dataiku-dss-11.0.2.tar.gz
    tar xzf dataiku-dss-11.0.2.tar.gz
    mkdir dataiku
    dataiku-dss-11.0.2/installer.sh -d dataiku -p 11000
    ./dataiku/bin/dss start
    ./dataiku/bin/dss status
Quick Start Course
I am following this starter course, which is aimed at data scientists.
Concepts
- Dataset -
- Dataset Partition -
- Recipe - code for transforming datasets
- Notebook - Jupyter Notebook
- Flow - Graph of datasets, recipes, notebooks
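To make the Flow concept above more concrete, here is a minimal sketch in plain Python that models a flow as a dependency graph. It does not use the Dataiku API. The recipe and dataset names are hypothetical, and the sketch assumes each recipe reads some input datasets and writes exactly one output dataset. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: every recipe reads input datasets and writes one output dataset.
recipes = {
    "clean_orders": {"inputs": ["orders_raw"], "output": "orders_clean"},
    "join_customers": {"inputs": ["orders_clean", "customers"], "output": "orders_enriched"},
    "train_model": {"inputs": ["orders_enriched"], "output": "churn_scores"},
}


def build_order(recipes):
    """Return recipe names ordered so each input is built before it is read."""
    # Map each dataset to the recipe that produces it.
    producer = {spec["output"]: name for name, spec in recipes.items()}
    sorter = TopologicalSorter()
    for name, spec in recipes.items():
        # A recipe depends on the recipes that produce its inputs.
        upstream = [producer[d] for d in spec["inputs"] if d in producer]
        sorter.add(name, *upstream)
    return list(sorter.static_order())


def source_datasets(recipes):
    """Return datasets that no recipe produces, i.e. the entry points of the flow."""
    produced = {spec["output"] for spec in recipes.values()}
    consumed = {d for spec in recipes.values() for d in spec["inputs"]}
    return sorted(consumed - produced)


if __name__ == "__main__":
    print(build_order(recipes))
    print(source_datasets(recipes))
```

Thinking of the flow as a directed acyclic graph explains why a platform can work out which upstream recipes to run before a given dataset is rebuilt. How Dataiku implements this internally is not covered here.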