EndoNuke is a dataset dedicated to training models for nuclei detection in endometrium samples. It consists of more than 1600 image tiles with a physical size of 100μm x 100μm, with the nuclei locations annotated as keypoints. Each nucleus is also assigned one of three class labels: stroma, epithelium or other. The dataset annotation process is described in our paper (currently under review). The supplementary code, as well as the mask-generating scripts, is available in the supplementary repository.
The dataset itself is placed in the directory data/dataset. It is organized in the following manner:

- data/dataset/images contains all the dataset images (tiles). There are no subdirectories here. Each image has a unique numeric filename, which can be treated as its image_id.
- data/dataset/labels contains the annotations. It has two subdirectories: bulk and agreement. The bulk directory contains most of the annotations; its subdirectories (ptg1, stud3, ...) correspond to different annotators and contain the annotations themselves: txt files with the same numeric names as the corresponding image files. The agreement directory contains the annotations from the agreement studies. It has three subdirectories: prior, hidden and posterior, which correspond to the different studies (the details are in the paper). Each of these subdirectories in turn has subdirectories corresponding to the different experts (ptg1, stud3, ...), which contain the annotations: txt files with the same numeric names as the corresponding image files.
- data/dataset/images_context contains all the context images. There are no subdirectories here. These images are 9 times larger than the labeled images. The filenames of the context images are the same as for the labeled images. There are no annotations for these images.
- data/dataset/metadata contains the image metadata in json format. There are no subdirectories here. The filenames of the metadata files are the same as for the labeled images.
- data/dataset/file_lists contains files with lists of the relative filepaths for the bulk of the dataset and for the agreement studies. It has the same structure as the data/dataset/labels directory. These lists are needed to initialize a PointsDataset() instance (from here). Each directory on the lowest level has two files: images.txt with the filepaths to the images and labels.txt with the filepaths to the annotations. The path on the first line of images.txt corresponds to the path on the first line of labels.txt, the second to the second, and so on, as in the sketch below.
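For illustration, a minimal sketch of pairing the two lists in Python (the bulk/ptg1 directory is used as an example; whether the listed paths are relative to the repository root or to data/dataset should be checked against the actual files):

# Sketch: pair the file lists line by line.
# Assumes the file_lists subdirectory bulk/ptg1 (it mirrors data/dataset/labels).
from pathlib import Path

list_dir = Path("data/dataset/file_lists/bulk/ptg1")

with open(list_dir / "images.txt") as f_img, open(list_dir / "labels.txt") as f_lbl:
    image_paths = [line.strip() for line in f_img if line.strip()]
    label_paths = [line.strip() for line in f_lbl if line.strip()]

# Line i of images.txt corresponds to line i of labels.txt.
pairs = list(zip(image_paths, label_paths))
print(pairs[0])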
The annotations are plain txt files. The filename of an annotation file indicates to which image it corresponds. The file consists of multiple lines, and each line corresponds to a single keypoint. The line format is x_coordinate y_coordinate class_label, with the values separated by spaces. The coordinates are given in pixels with zero indexing. The class labels are:

0: stroma
1: epithelium
2: other

An example of an annotation file is given below:
11 3 0
5 26 2
20 29 0
29 56 1
...
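For illustration, a minimal parsing sketch (the parse_annotation helper is ours, not part of the supplementary code; it assumes whitespace-separated integers, as in the example above):

# Sketch: parse one annotation file into (x, y, class_label) tuples.
# Assumes whitespace-separated integers, one keypoint per line.
def parse_annotation(path):
    keypoints = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip empty or malformed lines
            x, y, label = map(int, parts)
            keypoints.append((x, y, label))
    return keypoints

# Usage, e.g. for the annotation of image 1 by annotator ptg1:
# keypoints = parse_annotation("data/dataset/labels/bulk/ptg1/1.txt")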
Master ymls are placed in the directory data/master_ymls and help to organize access to the dataset. They contain relative paths to the files with the lists of paths to images and labels (from data/dataset/file_lists). There are several master ymls for different purposes:
- everything.yaml has the paths to all the dataset images and annotations. If these file lists are used, the tiles from the agreement studies will be accessed multiple times.
- unique.yaml has all the paths to images and annotations from the bulk of the dataset, plus the paths to the images and annotations of the agreement studies from the ptg1 expert. If these file lists are used, every tile is accessed a single time.
- bulk.yaml has all the paths to images and annotations from the bulk of the dataset.
- agreement.yaml has all the paths to images and annotations for the agreement studies. If these file lists are used, the tiles from the agreement studies will be accessed multiple times.
- hidden_agreement.yaml, preliminary_agreement.yaml and posterior_agreement.yaml have the paths to images and annotations for the individual agreement studies.

As mentioned in the paper, some tiles were manually filtered to ensure the presence of unordinary cases in the dataset. However, this process affects the feature distribution, which can lead to bias in model quality estimation. A researcher who wants to address this issue can separate the filtered and non-filtered data using the image filenames (which are the same as the image ids).
Images with ids less than or equal to 1600 were randomly sampled from the slides and then manually filtered to form the annotation tasks. Images with ids greater than 1600 were filtered only with the background detection script before the tasks were formed. A sketch of such a split is given below.
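A minimal sketch of separating the two subsets, assuming the image filenames are plain integers as described above:

# Sketch: split tiles into manually filtered (id <= 1600) and
# script-filtered-only (id > 1600) subsets by numeric filename.
from pathlib import Path

image_dir = Path("data/dataset/images")
manually_filtered, script_filtered = [], []
for path in image_dir.iterdir():
    if not path.is_file():
        continue
    image_id = int(path.stem)  # filenames are numeric image ids
    (manually_filtered if image_id <= 1600 else script_filtered).append(path)

print(len(manually_filtered), "manually filtered tiles")
print(len(script_filtered), "script-filtered tiles")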
You can download the dataset here.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Anton Naumov
Andrey Ivanov
Egor Ushakov
Evgeny Karpulevich
Tatiana Khovanskaya
Alexandra Konyukova
Konstantin Midiber
Nikita Rybak
Maria Ponomareva
Alesya Lesko
Ekaterina Volkova
Polina Vishnyakova
Sergey Nora
Liudmila Mikhaleva
Timur Fatkhudinov