Dataiku

September 16, 2022

Data Science, How-To, Code, Linux

Dataiku is a data science platform. It aims to provide a single system that handles every stage of data science work, such as:

  • Cleaning datasets
  • Transforming data
  • Training models
  • Testing models
  • Deploying models
  • Monitoring model drift

The software development analogy for a data science platform is an IDE (IntelliJ, PyCharm, VS Code).

Intro video for Dataiku.

Dataiku Advantages

  • Visual Representation : A graph called the Flow represents the interactions between datasets & code in a project.
  • Code Reusability : A library for writing shared code, with the aim of increasing reuse and avoiding duplication.
  • Cloud Agnostic : Dataiku can be installed on the major cloud providers or on any Linux server. Dataiku projects can be exported and moved to any other Dataiku instance.

Linux Installation

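  • Installed Ubuntu 18.04 LTS in VirtualBox 6.1
  • Installed the required libraries, then downloaded, installed and started Dataiku DSS 11.0.2
# Install Python 3.7 and the build dependencies
sudo apt update
sudo apt install -y python3.7 libsqlite3-dev build-essential
# Download and unpack the DSS 11.0.2 installer
wget https://cdn.downloads.dataiku.com/public/dss/11.0.2/dataiku-dss-11.0.2.tar.gz
tar xzf dataiku-dss-11.0.2.tar.gz
# Create an empty data directory and install DSS into it, listening on port 11000
mkdir dataiku
dataiku-dss-11.0.2/installer.sh -d dataiku -p 11000
# Start DSS and check that its processes are running
./dataiku/bin/dss start
./dataiku/bin/dss status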
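
After `dss start`, the web interface can take a little while to come up. The snippet below is a minimal sketch of how one might wait for it before opening a browser. The helper names `dss_url` and `wait_for_dss` are hypothetical (they are not part of Dataiku), the only value taken from the install above is port 11000, and it assumes `curl` is installed (`sudo apt install -y curl` if it is not).

```shell
# Hypothetical helpers; only the port (11000) comes from the installer command.
DSS_PORT="${DSS_PORT:-11000}"

dss_url() {
  # Build the base URL for a DSS instance from a host and a port.
  printf 'http://%s:%s/\n' "${1:-localhost}" "${2:-$DSS_PORT}"
}

wait_for_dss() {
  # Poll the URL until the HTTP server answers or the attempts run out.
  # Any HTTP response (including a redirect to the login page) counts as "up".
  url="$1"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl --silent --output /dev/null --max-time 5 "$url"; then
      echo "DSS is answering at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "No answer from $url after $attempts attempts" >&2
  return 1
}

# Usage, after ./dataiku/bin/dss start:
#   wait_for_dss "$(dss_url localhost 11000)"
```

The check only confirms that something answers HTTP on that port. `./dataiku/bin/dss status` remains the way to see whether the individual DSS processes are running.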

Quick Start Course

I am following this starter course, which is aimed at data scientists.

Concepts

  • Dataset -
  • Dataset Partition -
  • Recipe - code for transforming datasets
  • Notebook - Jupyter Notebook
  • Flow - Graph of datasets, recipes, notebooks
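
To make the idea of a Flow as a graph more concrete, here is a small sketch that treats a flow as a list of "upstream downstream" pairs and asks `tsort` (from coreutils) for an order in which every item comes after the items it depends on. The dataset and recipe names are invented for illustration. This is not how Dataiku stores or builds a Flow; it only shows the dependency-graph idea.

```shell
# Hypothetical flow: each line is "upstream downstream".
# Datasets and recipes alternate along the chain.
flow_edges() {
  cat <<'EOF'
raw_orders prepare_orders
prepare_orders orders_clean
orders_clean train_model
train_model churn_model
EOF
}

# tsort prints the nodes so that every upstream item
# appears before the items that depend on it.
build_order() {
  flow_edges | tsort
}

build_order
```

Because this example is a single chain, there is only one valid order: it starts at the raw dataset and ends at the trained model. A flow with branches would have several valid orders, and `tsort` would print one of them.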

Other Data Science Platforms