Overview
A goal of the Allen Institute for Cell Science is to produce data that are accessible to a variety of users: queryable, easy to download, easy to load into your development environment, and easy to separate into subsets. While such systems are mature for scientific code (e.g. Git and GitHub) and emerging for development environments (e.g. Docker and DockerHub), they are in their infancy for data sets.
We have made substantial inroads, using the Quilt Data Platform, into releasing our image and metadata sets in a versioned, reproducible, and open fashion. Ultimately this provides a wget-free mechanism, either from the command line or within Jupyter notebooks, to download and access images of individual cells, cell colonies, and associated metadata.
Of significant interest to external users is the association of versionable metadata with image data. This lets an image’s context travel with the image, and lets derived calculations that provide summary statistics about their associated images be updated over time.
Here we’ll explore the mechanisms for downloading and accessing our data through this system, but a systematic list of package contents is available on our Quilt data package pages.
Sections below mix code and visualizations using the common Jupyter Notebook data science toolkit.
Jupyter.org provides no-installation-needed introductions to this powerful ecosystem.
This project provides a data sharing mechanism that:
- allows easy downloads
- allows versioning of datasets
- provides automatic de-duplication of files between datasets
- works well within the standard Python data science Jupyter stack
Choosing Quilt
Using Quilt, we can do all of the above. Let’s explore the capabilities we’ve introduced by looking at our Tight Junction Protein, ZO1 (AICS-23) cell line. Further documentation of what this dataset contains and how it is structured is available on its front page, but this example will function as a quickstart guide.
Dive in to the Automating Access to Cell Images and Other Data Jupyter notebook
First we need to get Jupyter set up.
With Jupyter running, you can paste in the code below and work through to the bottom of the notebook, or use our published GitHub version:
A quick guide to accessing our data
Now we move into the technical tutorial. First, let’s download our tight junction protein, ZO1 cell line package.
Input code for install
Output from install
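As a sketch of that download step, assuming the Quilt 2 command-line client (installable with pip; the `aics/aics_23` package name comes from the text above):

```shell
# Install the Quilt client, then pull (or update) the ZO1 (AICS-23) package.
pip install quilt
quilt install aics/aics_23
```

The same step should also work from within Python via `import quilt; quilt.install("aics/aics_23")`.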
That’s it?
Yep, that is all there is to getting our most up-to-date data for AICS-23: simple, fast, easy. We support two main package divisions: each of our cell lines has its own Quilt data package (aics/aics_11, aics/aics_16, etc.), and aics/cell_lines contains all our released data. For toy explorations, aics/random_sample has a small set of randomly sampled images drawn from all of our cell lines.
That looks suspiciously fast.
In this case, my system was downloading an update to bring my local set to the latest version. The benefit is that we only send you data that has changed or been added to a dataset. Small updates are painless, and data shared between packages isn’t duplicated.
What do the packages contain?
Great question! Let’s take a look inside the package and find out. This is the interactive version of the documentation that exists on the package homepage.
Dataset input
Dataset contents
These are all of the objects in the base level of the package. We have set things up in a directory style: anything with a / after it is a directory and anything else is a file. This package has directories for cell_segs (cell segmentations), fovs (field of view images), lines (cell line information), nuclei_segs (nuclei segmentations), plates (plate information), structure_segs (protein structure segmentations), and wells (well images and information).
It also has README and SCHEMA documents, which describe the package contents. README is the exact same document you see on the Quilt documentation page for the package, while SCHEMA is a JSON file that documents where metadata lives.
How can I open the SCHEMA to actually view that information if it is a document?
Let’s do that.
Opening files
Schema contents
So there are multiple basic info files?
These packages contain a lot of data. It was necessary to devise a method to document the entirety of the contents, but just as an example, let’s view a smaller portion of the schema. For example, perhaps we are only interested in what is in a unique line’s metadata.
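As an illustration of that kind of targeted look, here is a sketch that parses a schema-like JSON fragment and pulls out a single section. The section names mirror the text, but the contents are hypothetical:

```python
import json

# A hypothetical fragment shaped like the SCHEMA document described above.
schema_text = """
{
  "unique_line_info": {
    "edits":   {"description": "edits made to this cell line"},
    "line":    {"description": "production name for the cell line"},
    "line_id": {"description": "index of the cell line"}
  },
  "unique_plate_info": {"plate_id": {"description": "index of the plate"}}
}
"""
schema = json.loads(schema_text)

# Only interested in a unique line's metadata? Grab just that section.
line_schema = schema["unique_line_info"]
for attribute, details in line_schema.items():
    print(attribute, "->", details["description"])
```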
Schema exploration
Ahhh, much better.
What does all this mean?
Notice that there are attributes for all the other types of metadata files: plates, wells, fovs, cell_segs, nuclei_segs, and structure_segs. All packages are set up this way, and all metadata info files share this behavior. The information contained in those attributes is the corresponding data of that type that relates to the current metadata information.
As an example:
- We are currently looking at cell line 23; if we open up cell line 23’s metadata file, we will find a list of fov images that contain cell line 23.
- A code example of this comes next!
- The other attributes are specific to the metadata you are looking at. In this case, that is edits, line, and line_id.
Will schema always be in this format?
Generally, this schema format won’t change unless new data or features become available and we want to add more data to a dataset. But you can always double-check by parsing the SCHEMA document or reading the documentation page for the package, as the SCHEMA auto-populates there as well.
What is in the attributes for each base key?
Let’s take a quick look at one of the schema-defined attributes for a unique_line_info.
Cell line 'line' attribute
Why does it say AICS-10?
This is the schema definition document: an example of what the metadata may look like for a given metadata JSON file.
What do these attributes mean?
All the attributes here are meant to help you understand what the information is and where it lives in the package. If we look at the edits attribute for line, we can see that it has description, example_value, origin, and value_types attributes. These will always exist for every attribute in a unique_TYPE_info and explain various details about what the data will look like in the actual metadata.
description is a human-written description of what this value encodes. There may be one example_value or multiple example_values that show what the metadata may look like when retrieved. origin shows the path of how we retrieve this metadata; if baseline is used, the attribute is encoded as a baseline standard and is generated by our package builder. Lastly, value_types is a list of the possible types of values you can expect this attribute to be.
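A sketch of what one such attribute entry might look like; the field names come from the text above, but the values here are entirely hypothetical:

```python
# Hypothetical entry for the `edits` attribute of a unique_line_info document.
edits_attribute = {
    "description": "List of gene-editing operations applied to this cell line",
    "example_value": [{"edit": "GFP tag added to TJP1"}],
    "origin": "baseline",     # "baseline" => generated by the package builder
    "value_types": ["list"],
}

# Every attribute entry carries these same four fields.
for field in ("description", "example_value", "origin", "value_types"):
    print(field, ":", edits_attribute[field])
```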
How about that code example for this package, to get a better understanding of everything?
Let’s load the metadata info document for the AICS-23 cell line (the tight junction line we have been using) and parse the information.
Node navigation
Why didn’t that work?
In order to properly version metadata separately from the images themselves, we had to route the metadata and image data uniquely. So apart from the main README and SCHEMA documents, all files are accessible by using a load() call.
Loading files
I can tell it is a file path, but it has no file extension, what do I do?
Great question. All images we release are in OME-TIFF format, and all metadata info documents we release are in JSON format. So use your favorite Python TIFF reader, and the standard json module, to open them!
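A sketch of both patterns. The extensionless path is simulated with a temporary file; in a real package, the path would come from a node’s load() call, and the image side would use a TIFF reader such as tifffile (an assumption, shown only as a comment):

```python
import json
import os
import tempfile

# Simulate an extensionless file path like the ones load() returns.
tmp_dir = tempfile.mkdtemp()
info_path = os.path.join(tmp_dir, "a1b2c3")      # no file extension
with open(info_path, "w") as f:
    json.dump({"line": "AICS-23"}, f)

# Metadata info documents are JSON: the standard reader handles them fine.
with open(info_path) as f:
    info = json.load(f)
print(info["line"])

# Images are OME-TIFF, so a TIFF reader works the same way, e.g.:
# from tifffile import imread
# data = imread(image_path)
```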
Reading metadata
Why are there keys for the other data types?
All metadata files will have a list of files related to the object you are currently viewing. If you are looking at the metadata for AICS-23, like we are now, the metadata info file will have lists for plates, wells, fovs, cell_segs, nuclei_segs, and structure_segs that then each contain a list of nodes that have AICS-23 data in them. Let’s take a smaller look at just the plates with AICS-23 data.
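A sketch of that structure, using a hypothetical metadata dict shaped as described (the node names are made up):

```python
# Hypothetical AICS-23 metadata info document: each related-file key maps to
# a list of package nodes (represented here as node-name strings).
line_info = {
    "line": "AICS-23",
    "plates": ["plate_3500000878", "plate_3500000926"],
    "wells": ["well_A1", "well_B2"],
    "fovs": ["00dde0c979ef45949337220524ccfec5"],
}

# Take a smaller look at just the plates with AICS-23 data.
for plate in line_info["plates"]:
    print(plate)
```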
Associated plates
What does this mean?
This is a list of the plates that were used to culture and image cells from the tight junction (AICS-23) cell line. You can use these to navigate to information about a specific plate.
Associated navigation
Oh! So, any attributes that correspond to another file type will work like this?
Yes. All metadata JSON files will have these lists of related files that you can use to navigate around quickly.
As an example, say you were looking at the segmentation of a specific nucleus in a specific cell. You could open up that nucleus segmentation’s info metadata document and find its plates attribute, in order to navigate back to that plate’s metadata for more information about the plate, including cells neighboring the one you were looking at.
What about the other attributes in the line metadata?
Let’s get back to that now.
Unique line attributes
Unique line metadata
These are the other metadata attributes available for cell lines. Just as the SCHEMA document showed, edits is a list of dict objects containing information about which edits were made to the cell line, the line attribute is a string with the production name for the cell line, and the line_id attribute holds the line index.
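Those three attributes, sketched with hypothetical values that follow the shapes the SCHEMA describes:

```python
# Hypothetical line metadata for AICS-23.
line_info = {
    "edits": [{"edit": "GFP tag", "target": "TJP1"}],  # list of dicts
    "line": "AICS-23",                                 # production name
    "line_id": 23,                                     # line index
}

assert isinstance(line_info["edits"], list)
print(line_info["line"], line_info["line_id"])
```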
Okay, so that’s the metadata, what about the images?
On it!
Loading images
It has an image and an info directory?
Yes, the info will always be a metadata JSON file about that current node. In this case, it will be a metadata file about field of view image 00dde0c979ef45949337220524ccfec5.
Again, both of these are directories because we can version the image and its info better when they are their own subdirectories of the object, so we’ll do the same .load() step as we did with the info before:
Reading images
That is an odd shape…
All of our image data is actually five dimensional (!) and always in the order TIME - Z - CHANNEL - Y - X. In short, these are 3D images with time labeling, and structures (cell membrane, nucleus, other subcellular structures) are put into their own channels. Thinking in standard 2D XY space, this corresponds to a 2D plane that shows the XY data at a given z height, at a given time, for a specific channel.
You said five dimensional, but the above image is four dimensional…
Correct, any image that is four dimensional is a single time point, so you can think of it as (1, 65, 7, 624, 924); the Python object that handles the data manages that single time point for you.
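The TIME - Z - CHANNEL - Y - X ordering can be sketched with a small NumPy array (toy sizes, not the real 65 x 7 x 624 x 924 data):

```python
import numpy as np

# Toy TZCYX stack: 1 time point, 4 z-slices, 3 channels, 8x8 pixels.
img = np.zeros((1, 4, 3, 8, 8))

# A single 2D XY plane: fix time, z-height, and channel.
plane = img[0, 2, 1]          # t=0, z=2, channel=1
print(plane.shape)            # (8, 8)

# A single-time-point image can be treated as 4D (Z, C, Y, X):
single = img[0]
print(single.shape)           # (4, 3, 8, 8)
```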
How can I display the image?
Sorry, on it!
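A sketch of one way to do it: max-project the toy stack along z, then plot each channel in its own subplot (a random array stands in for the real field of view, and matplotlib’s non-interactive Agg backend lets it run headless):

```python
import matplotlib
matplotlib.use("Agg")          # headless backend for non-interactive use
import matplotlib.pyplot as plt
import numpy as np

# Toy ZCYX image: 4 z-slices, 3 channels, 16x16 pixels.
img = np.random.rand(4, 3, 16, 16)

# Max-project along z, then show one subplot per channel.
projection = img.max(axis=0)                # shape: (channels, y, x)
fig, axes = plt.subplots(1, projection.shape[0], figsize=(9, 3))
for c, ax in enumerate(axes):
    ax.imshow(projection[c], cmap="gray")
    ax.set_title("channel %d" % c)
    ax.axis("off")
fig.savefig("channels.png")
```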
Plotting channels
Look at those beautiful cells!
How can I label the channels with what they actually are?
This is where the metadata file comes in very handy!
Metadata usage
Why do some of the channel names have an _2 attached to them?
Sorry about that. It’s an artifact of the imaging process. But we can remove those channels if you want!
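A sketch of dropping the duplicated channels by name; the channel names here are hypothetical stand-ins for ones read from a metadata info document:

```python
import numpy as np

# Hypothetical channel names pulled from a metadata info document.
channel_names = ["membrane", "structure", "dna", "membrane_2", "dna_2"]
img = np.random.rand(4, len(channel_names), 16, 16)    # toy ZCYX stack

# Keep only channels whose names do not end in "_2".
keep = [i for i, name in enumerate(channel_names) if not name.endswith("_2")]
filtered = img[:, keep]
kept_names = [channel_names[i] for i in keep]
print(kept_names)          # ['membrane', 'structure', 'dna']
print(filtered.shape)      # (4, 3, 16, 16)
```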
Channel filtering
Ahhhh, much better!
Metadata working in combination with image data, what a joy!
Can we get the segmentations for this field of view?
You bet. We’ll collect the three segmentations from the loaded metadata. Remember, these are lists of related files: even though each holds only one object here, they are still lists. So let’s grab the first object from each:
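That step, sketched with a hypothetical field-of-view metadata dict (node names made up):

```python
# Hypothetical fov metadata: each segmentation key holds a *list* of related
# nodes, even when only one segmentation exists for this field of view.
fov_info = {
    "cell_segs": ["seg_cell_0a1b"],
    "nuclei_segs": ["seg_nuc_0a1b"],
    "structure_segs": ["seg_struct_0a1b"],
}

# Grab the first (and here, only) object from each list.
cell_seg = fov_info["cell_segs"][0]
nuc_seg = fov_info["nuclei_segs"][0]
struct_seg = fov_info["structure_segs"][0]
print(cell_seg, nuc_seg, struct_seg)
```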
Associated segmentations
Segmentation nodes
Load all the images into memory:
Plotting segmentations
What about combining the original field of view image channels together?
Yep, we can do that – but it gets a little tricky. We need to write a custom RGB mapping function…
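One way to write such a mapping, as a sketch: pick three channels, normalize each independently to [0, 1], and stack them into the last axis as R, G, B. The function name and channel choices are illustrative, not the exact mapping used in the notebook:

```python
import numpy as np

def channels_to_rgb(plane, r, g, b):
    """Map three channels of a (C, Y, X) plane onto an RGB image."""
    rgb = np.stack([plane[r], plane[g], plane[b]], axis=-1).astype(float)
    # Normalize each color independently so dim channels stay visible.
    for c in range(3):
        lo, hi = rgb[..., c].min(), rgb[..., c].max()
        if hi > lo:
            rgb[..., c] = (rgb[..., c] - lo) / (hi - lo)
    return rgb

# Toy (C, Y, X) plane standing in for a projection of the field of view.
plane = np.random.rand(5, 16, 16)
rgb = channels_to_rgb(plane, r=0, g=1, b=2)
print(rgb.shape)          # (16, 16, 3)
```

The result can then be handed straight to imshow, which interprets a trailing axis of length 3 as RGB.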
RGB plotting
Wrapping up
Using Quilt as the basis for our analysis stack, we can easily version, deploy, and download our datasets straight into Python and Jupyter notebooks (with other language support coming soon)!
We believe this will enable easier and faster development for anyone who wants to use our data.