Kuzushiji-Kanji (KKanji) dataset
Load the KKanji dataset quickly. KKanji contains 140,426 images of 3,832 Kanji characters. Stream KKanji while training your models in TensorFlow and PyTorch.
Visualization of the Kuzushiji Kanji Dataset on the Activeloop Platform

What is Kuzushiji Kanji (KKanji) Dataset?

The Kuzushiji Kanji (KKanji) dataset contains 140,426 images of Kanji characters (Kuzushiji is a cursive Japanese writing style). It is a large and highly imbalanced dataset of 64x64 grayscale images. The number of examples per class ranges from 1,766 down to a single example.
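
To make the imbalance concrete, here is a minimal sketch that summarizes a label distribution using only the Python standard library. The helper name `imbalance_summary` and the toy label list are illustrative assumptions and are not part of Hub or the KKanji release; with the real dataset you would pass the full list of integer labels instead.

```python
from collections import Counter


def imbalance_summary(labels):
    """Summarize how unevenly examples are spread across classes.

    `labels` is any iterable of integer class ids (hypothetical input;
    for KKanji these would be the values of the label field).
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    largest = max(counts.values())
    smallest = min(counts.values())
    return {
        "num_classes": len(counts),
        "largest_class": largest,
        "smallest_class": smallest,
        # Ratio between the most and the least frequent class.
        "imbalance_ratio": largest / smallest,
        # Number of classes that appear exactly once.
        "singleton_classes": sum(1 for c in counts.values() if c == 1),
    }


# Toy labels standing in for real data: class 0 dominates, class 2 is a singleton.
toy_labels = [0, 0, 0, 0, 1, 1, 2]
summary = imbalance_summary(toy_labels)
print(summary)
```

For the class counts quoted above (1,766 examples versus 1), the same ratio would be 1,766.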

Download Kuzushiji-Kanji (KKanji) Dataset in Python

Instead of downloading the KKanji dataset, you can load it in Python with just one line of code using our open-source package Hub.

Load Kuzushiji-Kanji (KKanji) Dataset in Python

import hub
ds = hub.load("hub://activeloop/kuzushiji-kanji")

Kuzushiji-Kanji (KKanji) Dataset Structure

Kuzushiji-Kanji (KKanji) Data Fields

  • image: tensor containing the 64x64 image.
  • label: an integer between 0 and 3831 representing the Kanji Character.
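
The two fields above imply simple invariants that can be checked before training. The sketch below is one hedged way to do that in plain Python. The helper `check_sample` and the constants are hypothetical names introduced here, and tolerating a trailing channel dimension of size 1 is an assumption of this sketch, not something the dataset card states.

```python
IMAGE_SHAPE = (64, 64)  # height and width from the field description
NUM_CLASSES = 3832      # labels run from 0 to 3831 inclusive


def check_sample(image_shape, label):
    """Return True if a sample matches the documented field layout.

    `image_shape` is a tuple such as `array.shape`. A trailing channel
    dimension of size 1, e.g. (64, 64, 1), is accepted as well
    (assumption of this sketch).
    """
    shape = tuple(image_shape)
    if len(shape) == 3 and shape[-1] == 1:
        shape = shape[:2]
    return shape == IMAGE_SHAPE and 0 <= int(label) < NUM_CLASSES


print(check_sample((64, 64), 0))        # valid shape, smallest label
print(check_sample((64, 64, 1), 3831))  # channel dimension, largest label
print(check_sample((28, 28), 5))        # wrong image size
print(check_sample((64, 64), 3832))     # label out of range
```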

How to use Kuzushiji-Kanji (KKanji) Dataset with PyTorch and TensorFlow in Python

Train a model on Kuzushiji-Kanji (KKanji) dataset with PyTorch in Python

Let's use Hub's built-in PyTorch one-line dataloader to connect the data to the compute:
dataloader = ds.pytorch(num_workers=0, batch_size=4, shuffle=False)

Train a model on Kuzushiji-Kanji (KKanji) dataset with TensorFlow in Python

dataloader = ds.tensorflow()

Kuzushiji-Kanji (KKanji) Dataset Creation

Data Collection Information
Kuzushiji-Kanji is one of the three Kuzushiji-MNIST datasets (Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji) created by the National Institute of Japanese Literature (NIJL) and curated by the Center for Open Data in the Humanities (CODH). A bounding box was created for each character during the transcription process, but literature scholars did not think the boxes were worth sharing. From a machine learning perspective, CODH suggested making a separate dataset of the bounding boxes on each page, because it could serve as the basis for many machine learning challenges and for work towards automated transcription. This resulted in the full release of the Kuzushiji dataset in November 2016. The dataset contains 3,999 character types and 403,242 characters.

Additional Information about Kuzushiji-Kanji (KKanji) Dataset

Kuzushiji-Kanji (KKanji) Dataset Description

  • Homepage: http://codh.rois.ac.jp/kmnist/index.html.en
  • Repository: https://github.com/rois-codh/kmnist
  • Paper: Deep Learning for Classical Japanese Literature. Tarin Clanuwat et al. arXiv:1812.01718
  • Point of Contact: http://codh.rois.ac.jp/feedback/

Kuzushiji-Kanji (KKanji) Dataset Curators

Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto and David Ha

Kuzushiji-Kanji (KKanji) Dataset Licensing Information

CC BY-SA 4.0 License

Kuzushiji-Kanji (KKanji) Dataset Citation Information

@online{clanuwat2018deep,
  author = {Tarin Clanuwat and Mikel Bober-Irizar and Asanobu Kitamoto and Alex Lamb and Kazuaki Yamamoto and David Ha},
  title = {Deep Learning for Classical Japanese Literature},
  date = {2018-12-03},
  year = {2018},
  eprintclass = {cs.CV},
  eprinttype = {arXiv},
  eprint = {cs.CV/1812.01718},
}

Kuzushiji-Kanji (KKanji) Dataset FAQs

What is the Kuzushiji-Kanji (KKanji) dataset for Python?

The Kuzushiji-Kanji dataset is a machine learning dataset of Kanji characters. It consists of 140,426 square 64×64 pixel grayscale images of handwritten Kanji characters, with labels between 0 and 3831.

What is the Kuzushiji-Kanji (KKanji) dataset used for?

Kuzushiji-Kanji is a popular dataset of the Kanji characters used in the Japanese language.

How to download the Kuzushiji-Kanji (KKanji) dataset in Python?

You can load the Kuzushiji-Kanji dataset quickly with one line of code using the open-source package Activeloop Hub in Python. See detailed instructions on how to load the Kuzushiji-Kanji dataset in Python.

How can I use Kuzushiji-Kanji (KKanji) dataset in PyTorch or TensorFlow?

You can stream the Kuzushiji-Kanji dataset while training a model in PyTorch or TensorFlow with one line of code using the open-source package Activeloop Hub in Python. See detailed instructions on how to train a model on the Kuzushiji-Kanji dataset with PyTorch in Python or train a model on the Kuzushiji-Kanji dataset with TensorFlow in Python.

Should I work with Kuzushiji-Kanji (KKanji) dataset in CSV?

No. CSV is not optimized for working with image data, especially in machine learning workflows. Instead of downloading a Kuzushiji-Kanji CSV, you can easily load, version-control, query, and manipulate Kuzushiji-Kanji for machine learning purposes using Activeloop Hub.

How do I create an Image Dataset like Kuzushiji-Kanji (KKanji) dataset?

With Activeloop Hub, creating image datasets like the Kuzushiji-Kanji character dataset is simple. Simple datasets like Kuzushiji-Kanji can be created automatically by allowing Hub to parse the legacy files into Hub dataset format. More complex datasets can be created manually.

Kuzushiji-Kanji vs Kuzushiji-MNIST. What is the difference between Kuzushiji-Kanji and Kuzushiji-MNIST?

Kuzushiji-Kanji and Kuzushiji-MNIST are two separate datasets. Kuzushiji-MNIST is meant to be a drop-in replacement for the MNIST dataset. Like MNIST, it contains 70,000 grayscale images of 28x28 pixels. MNIST has 10 classes, whereas the Japanese language has 48 Hiragana characters and one Hiragana iteration mark. Hence, one Hiragana character was chosen to represent each of the 10 rows of Hiragana.

Kuzushiji-Kanji, on the other hand, is a large and highly imbalanced dataset whose sole purpose is to provide detailed coverage of Kanji characters. Its strong class imbalance reflects how often each character appears in the real books from which the data was sourced, and it was kept that way to represent the real data distribution.
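
Because the imbalance is kept on purpose, any rebalancing is left to the training code. One common heuristic, shown here only as a sketch and not as something the dataset authors prescribe, is to weight each class by the inverse of its frequency so that rare characters contribute more to the loss. The helper name `inverse_frequency_weights` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class id to total / (num_classes * class_count).

    Rare classes receive weights above 1 and frequent classes below 1.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


# Toy labels standing in for real data: class 0 is twice as frequent as class 1.
toy_labels = [0, 0, 0, 0, 1, 1]
weights = inverse_frequency_weights(toy_labels)
print(weights)  # {0: 0.75, 1: 1.5}
```

The resulting dictionary can be converted into whatever per-class weight format your training framework expects.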

What is the size of each image in the Kuzushiji-Kanji (KKanji) dataset?

Kuzushiji-Kanji dataset image size is constant across all images of the dataset. Each Kuzushiji-Kanji dataset image is a fixed-size 64×64 pixel square image.
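
A fixed image size makes it easy to estimate input dimensions and raw memory needs up front. The arithmetic below is a rough sketch that assumes one byte per pixel (8-bit grayscale), which the page does not state explicitly; the constant names are illustrative.

```python
NUM_IMAGES = 140_426
HEIGHT, WIDTH = 64, 64
BYTES_PER_PIXEL = 1  # assumption: 8-bit grayscale, one byte per pixel

# Length of the input vector if each image is flattened.
pixels_per_image = HEIGHT * WIDTH

# Uncompressed size of all pixel data under the assumption above.
raw_bytes = NUM_IMAGES * pixels_per_image * BYTES_PER_PIXEL

print(pixels_per_image)  # 4096
print(raw_bytes)         # 575184896
```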

Hub community member Uday Uppal has contributed to this dataset. You rock, Uday! :)