Programmatic Access

Important

Before using any programmatic access to the data, you first need to set up your CAVEclient token.

CAVEclient

Most programmatic access to the CAVE services occurs through CAVEclient, a Python client to access various types of data from the online services.

Full documentation is available on the CAVEclient documentation site.

To initialize a CAVEclient, we give it a datastack, a name that identifies a particular combination of imagery, segmentation, and annotation database. For the MICrONs public data, we use the datastack name minnie65_public.

from caveclient import CAVEclient

# Name of the MICrONs public datastack
datastack_name = 'minnie65_public'
client = CAVEclient(datastack_name)

# Show the description of the datastack
client.info.get_datastack_info()['description']
'This is the publicly released version of the minnie65 volume and segmentation. '

CAVEclient Basics

The most frequent use of the CAVEclient is to query the database for annotations like synapses. All database functions are under the client.materialize property. To see what tables are available, use the get_tables function:

client.materialize.get_tables()
['synapses_pni_2',
 'baylor_gnn_cell_type_fine_model_v2',
 'nucleus_alternative_points',
 'connectivity_groups_v507',
 'proofreading_status_public_release',
 'allen_column_mtypes_v1',
 'allen_v1_column_types_slanted_ref',
 'aibs_column_nonneuronal_ref',
 'nucleus_ref_neuron_svm',
 'aibs_soma_nuc_exc_mtype_preds_v117',
 'baylor_log_reg_cell_type_coarse_v1',
 'apl_functional_coreg_forward_v5',
 'nucleus_detection_v0',
 'aibs_soma_nuc_metamodel_preds_v117',
 'coregistration_manual_v3']
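
Because get_tables returns a plain Python list of names, it can be narrowed with ordinary string matching. The sketch below is only an illustration: find_tables is a hypothetical helper, and a few names copied from the listing above stand in for a live call.

```python
def find_tables(table_names, keyword):
    """Return the table names that contain `keyword` (case-insensitive), sorted."""
    keyword = keyword.lower()
    return sorted(name for name in table_names if keyword in name.lower())


# A few names copied from the listing above. With a live client you would
# pass client.materialize.get_tables() instead of this hand-written list.
example_tables = [
    "synapses_pni_2",
    "baylor_gnn_cell_type_fine_model_v2",
    "nucleus_detection_v0",
    "baylor_log_reg_cell_type_coarse_v1",
    "aibs_soma_nuc_metamodel_preds_v117",
]

cell_type_tables = find_tables(example_tables, "cell_type")
print(cell_type_tables)
```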

For each table, you can see the metadata describing that table. For example, let’s look at the nucleus_detection_v0 table:

client.materialize.get_table_metadata('nucleus_detection_v0')
{'schema': 'nucleus_detection',
 'aligned_volume': 'minnie65_phase3',
 'valid': True,
 'table_name': 'nucleus_detection_v0__minnie3_v1',
 'id': 14621,
 'created': '2020-11-02T18:56:35.530100',
 'schema_type': 'nucleus_detection',
 'user_id': '121',
 'description': 'A table of nuclei detections from a nucleus detection model developed by Shang Mu, Leila Elabbady, Gayathri Mahalingam and Forrest Collman. Pt is the centroid of the nucleus detection. id corresponds to the flat_segmentation_source segmentID. Only included nucleus detections of volume>25 um^3, below which detections are false positives, though some false positives above that threshold remain. ',
 'notice_text': None,
 'reference_table': None,
 'flat_segmentation_source': 'precomputed://https://bossdb-open-data.s3.amazonaws.com/iarpa_microns/minnie/minnie65/nuclei',
 'write_permission': 'PRIVATE',
 'read_permission': 'PUBLIC',
 'last_modified': '2022-10-25T19:24:28.559914',
 'segmentation_source': '',
 'pcg_table_name': 'minnie3_v1',
 'last_updated': '2023-08-21T01:00:00.651639',
 'annotation_table': 'nucleus_detection_v0',
 'voxel_resolution': [4.0, 4.0, 40.0]}

The result is a dictionary. Two fields are particularly important: description, which offers a text description of the contents of the table, and voxel_resolution, which gives the size in nanometers of one voxel along each axis (x, y, z) for the coordinates stored in the table.
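
As a minimal sketch of what voxel_resolution means in practice, a point stored in voxel units can be converted to nanometers by multiplying each coordinate by the matching resolution value. The helper name voxels_to_nm and the example point are illustrative assumptions, not part of CAVEclient, and the sketch assumes the position is expressed at the table's stored resolution.

```python
def voxels_to_nm(position, voxel_resolution):
    """Scale a voxel-space [x, y, z] point to nanometers."""
    if len(position) != 3 or len(voxel_resolution) != 3:
        raise ValueError("expected three coordinates and three resolution values")
    return [coord * res for coord, res in zip(position, voxel_resolution)]


# Subset of the metadata dictionary shown above.
metadata = {"voxel_resolution": [4.0, 4.0, 40.0]}

# An arbitrary example point in voxel units (made up for illustration).
point_voxels = [100, 200, 10]

point_nm = voxels_to_nm(point_voxels, metadata["voxel_resolution"])
print(point_nm)
```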

Querying Tables

To get the contents of a table, use the query_table function. This returns the whole table without any filtering, up to a maximum of 200,000 rows. The table is returned as a Pandas DataFrame, so you can immediately use standard Pandas functions on it.

# Download the full nucleus detection table as a DataFrame
nuc_df = client.materialize.query_table('nucleus_detection_v0')
nuc_df.head()
id created superceded_id valid volume pt_supervoxel_id pt_root_id pt_position bb_start_position bb_end_position
0 730537 2020-09-28 22:40:41.780734+00:00 NaN t 32.307937 0 0 [381312, 273984, 19993] [nan, nan, nan] [nan, nan, nan]
1 373879 2020-09-28 22:40:41.781788+00:00 NaN t 229.045043 96218056992431305 864691136090135607 [228816, 239776, 19593] [nan, nan, nan] [nan, nan, nan]
2 601340 2020-09-28 22:40:41.782714+00:00 NaN t 426.138010 0 0 [340000, 279152, 20946] [nan, nan, nan] [nan, nan, nan]
3 201858 2020-09-28 22:40:41.783784+00:00 NaN t 93.753836 84955554103121097 864691135373893678 [146848, 213600, 26267] [nan, nan, nan] [nan, nan, nan]
4 600774 2020-09-28 22:40:41.785273+00:00 NaN t 135.189791 0 0 [339120, 276112, 19442] [nan, nan, nan] [nan, nan, nan]

Important

While most tables are small enough to be returned in full, the synapse table has hundreds of millions of rows and is too large to download this way.
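
One way to guard against accidentally requesting a very large table is to check its row count before querying. The sketch below assumes that client.materialize.get_annotation_count is available in your installed CAVEclient version; query_if_small and its max_rows default are hypothetical names chosen for illustration.

```python
def query_if_small(client, table_name, max_rows=200_000):
    """Query a whole table only if its row count does not exceed `max_rows`.

    Assumes `client.materialize.get_annotation_count(table_name)` returns the
    number of rows in the table (check your CAVEclient version).
    """
    n_rows = client.materialize.get_annotation_count(table_name)
    if n_rows > max_rows:
        raise ValueError(
            f"{table_name} has {n_rows} rows, more than the allowed {max_rows}; "
            "use a filtered query instead"
        )
    return client.materialize.query_table(table_name)
```

With an initialized client, a call such as query_if_small(client, 'nucleus_detection_v0') would return the DataFrame, while a table larger than max_rows would raise an error before any download starts.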

Tables have a collection of columns, some of which specify points in space (columns ending in _position), some a root id (ending in _root_id), and others that contain other information about the object at that point. Before describing some of the most important tables in the database, it’s useful to know about a few advanced options that apply when querying any table.

  • desired_resolution : This parameter allows you to convert the columns specifying spatial points to different resolutions. Many tables are stored at a resolution of 4x4x40 nm/voxel, for example, but you can convert to nanometers by setting desired_resolution=[1,1,1].

  • split_positions : This parameter allows you to split the columns specifying spatial points into separate columns for each dimension. The new column names will be the original column name with _x, _y, and _z appended.

  • select_columns : This parameter allows you to get only a subset of columns from the table. Once you know exactly what you want, this can save you some cleanup.

  • limit : This parameter allows you to limit the number of rows returned. If you are just testing out a query or trying to inspect the kind of data within a table, you can set this to a small number to make sure it works before downloading the whole table. Note that this will show a warning so that you don’t accidentally limit your query when you don’t mean to.

For example, using all of these together:

# Combine all four options: split positions, convert to nm, select columns, limit rows
nuc_df = client.materialize.query_table('nucleus_detection_v0', split_positions=True, desired_resolution=[1,1,1], select_columns=['pt_position', 'pt_root_id'], limit=10)
nuc_df
201 - "Limited query to 10 rows
   pt_position_x  pt_position_y  pt_position_z          pt_root_id
0       241856.0       374464.0       838720.0                   0
1       227200.0       389120.0       797160.0                   0
2       230144.0       422336.0       795320.0                   0
3       239488.0       386432.0       794120.0                   0
4       239744.0       423488.0       803120.0  864691136050815731
5       245888.0       384512.0       800120.0                   0
6       249792.0       391680.0       807080.0                   0
7       243328.0       403008.0       794280.0                   0
8       247872.0       386816.0       805320.0                   0
9       260352.0       416640.0       802360.0  864691135013273238
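
To make the split_positions naming convention concrete, the following sketch reproduces it locally on a single record, without contacting the server. The function split_position_column is a hypothetical illustration, not a CAVEclient function; the example values are taken from the first row of the nucleus_detection_v0 output shown earlier.

```python
def split_position_column(record, column):
    """Return a copy of `record` with `column` replaced by _x, _y, _z fields."""
    out = {key: value for key, value in record.items() if key != column}
    for axis, value in zip("xyz", record[column]):
        out[f"{column}_{axis}"] = value
    return out


# Values from the first row of the nucleus_detection_v0 output shown earlier.
row = {"pt_root_id": 0, "pt_position": [381312, 273984, 19993]}

split_row = split_position_column(row, "pt_position")
print(split_row)
```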

Filtering Queries

Filtering a table so that only certain rows are returned is a very common operation. While the query_table function has filtering options (see the documentation for more details), a more unified filter interface is available through a “table manager” interface. Rather than passing a table name to the query_table function, you use client.materialize.tables, which has a subproperty for each table in the database that can be used to filter that table. The general usage pattern is:

client.materialize.tables.{table_name}({filter options}).query({format and timestamp options})

where {table_name} is the name of the table you want to filter, {filter options} is a collection of arguments for filtering the query, and {format and timestamp options} are those parameters controlling the format and timestamp of the query.

For example, let’s look at the table aibs_soma_nuc_metamodel_preds_v117, which has cell type predictions across the dataset. We can get the whole table as a DataFrame:

cell_type_df = client.materialize.tables.aibs_soma_nuc_metamodel_preds_v117().query()
cell_type_df.head()
id created valid volume pt_supervoxel_id pt_root_id id_ref created_ref valid_ref target_id classification_system cell_type pt_position bb_start_position bb_end_position
0 498173 2020-09-28 22:43:20.177696+00:00 t 308.176159 103884538719281829 864691135373830344 553 2022-07-26 23:54:55.895294+00:00 t 498173 aibs_neuronal 6P-IT [284688, 211936, 15566] [nan, nan, nan] [nan, nan, nan]
1 487329 2020-09-28 22:41:27.945151+00:00 t 295.937638 105279407463397326 864691135975935434 4509 2022-07-27 00:00:10.165062+00:00 t 487329 aibs_neuronal MC [294544, 118624, 21745] [nan, nan, nan] [nan, nan, nan]
2 106662 2020-09-28 22:42:56.452281+00:00 t 230.148178 79524515478544304 864691136084076652 4693 2022-07-27 00:00:10.313814+00:00 t 106662 aibs_neuronal 23P [107056, 119248, 19414] [nan, nan, nan] [nan, nan, nan]
3 271350 2020-09-28 22:41:38.906480+00:00 t 305.328128 87351114324194368 864691135777995965 5061 2022-07-27 00:00:10.592207+00:00 t 271350 aibs_neuronal 6P-CT [163920, 235968, 20875] [nan, nan, nan] [nan, nan, nan]
4 456040 2020-09-28 22:42:07.860678+00:00 t 257.463910 101129507251445952 864691136084057196 8652 2022-07-27 00:01:29.589487+00:00 t 456040 aibs_neuronal MC [264544, 132528, 23988] [nan, nan, nan] [nan, nan, nan]

and we can add similar formatting options as in the last section to the query function:

cell_type_df = client.materialize.tables.aibs_soma_nuc_metamodel_preds_v117().query(split_positions=True, desired_resolution=[1,1,1], select_columns=['pt_position', 'pt_root_id', 'cell_type'], limit=10)
cell_type_df
   pt_position_x  pt_position_y  pt_position_z          pt_root_id  cell_type
0       257600.0       487936.0       802760.0  864691135724233643        23P
1       260992.0       493568.0       801560.0  864691136436395166        23P
2       256256.0       466432.0       831040.0  864691135462260637        NGC
3       255744.0       480640.0       833200.0  864691136723556861        23P
4       262144.0       505856.0       824880.0  864691135776658528        23P
5       257536.0       521728.0       804440.0  864691135941166708        23P
6       251136.0       546048.0       821320.0  864691135479369926        23P
7       324096.0       417920.0       658880.0  864691135937358133  astrocyte
8       324032.0       432960.0       679800.0  864691135207734905        NGC
9       309568.0       421120.0       706000.0  864691135758479438        NGC

However, now we can also filter the table to get only cells that are predicted to have cell type "BC" (for “basket cell”).

my_cell_type = "BC"
client.materialize.tables.aibs_soma_nuc_metamodel_preds_v117(cell_type=my_cell_type).query()
id created valid volume pt_supervoxel_id pt_root_id id_ref created_ref valid_ref target_id classification_system cell_type pt_position bb_start_position bb_end_position
0 193846 2020-09-28 22:40:41.897904+00:00 t 306.148966 82838443188669165 864691135684976823 39997 2022-07-27 00:05:37.339316+00:00 t 193846 aibs_neuronal BC [131568, 168496, 16452] [nan, nan, nan] [nan, nan, nan]
1 615735 2020-09-28 22:40:41.957345+00:00 t 314.539540 112181247505371364 864691136311774525 15248 2022-07-27 00:02:08.492581+00:00 t 615735 aibs_neuronal BC [344880, 161104, 17084] [nan, nan, nan] [nan, nan, nan]
2 613047 2020-09-28 22:40:41.982376+00:00 t 242.159780 113234168401651200 864691136065413528 30769 2022-07-27 00:04:44.154032+00:00 t 613047 aibs_neuronal BC [352688, 141616, 25312] [nan, nan, nan] [nan, nan, nan]
3 369908 2020-09-28 22:40:41.814964+00:00 t 332.862751 96002690286851358 864691136522768017 41346 2022-07-27 00:05:49.017547+00:00 t 369908 aibs_neuronal BC [227104, 207840, 20841] [nan, nan, nan] [nan, nan, nan]
4 402885 2020-09-28 22:40:41.994716+00:00 t 279.232348 97621720621533350 864691135645529583 61160 2022-07-27 00:07:51.974931+00:00 t 402885 aibs_neuronal BC [238848, 211712, 16471] [nan, nan, nan] [nan, nan, nan]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3079 376258 2020-09-28 22:45:25.357703+00:00 t 507.659387 95022544591277972 864691136578023700 56954 2022-07-27 00:07:26.643176+00:00 t 376258 aibs_neuronal BC [220080, 245168, 22235] [nan, nan, nan] [nan, nan, nan]
3080 434713 2020-09-28 22:45:25.293700+00:00 t 496.550052 100367614546691106 864691134989270650 37178 2022-07-27 00:05:21.976305+00:00 t 434713 aibs_neuronal BC [258672, 223072, 24681] [nan, nan, nan] [nan, nan, nan]
3081 676137 2020-09-28 22:45:25.371421+00:00 t 510.804296 116061493832778551 864691136119431960 74764 2022-07-27 00:09:16.780389+00:00 t 676137 aibs_neuronal BC [372976, 235696, 25121] [nan, nan, nan] [nan, nan, nan]
3082 591219 2020-09-28 22:45:25.526753+00:00 t 567.517839 110216764830845707 864691135279126177 20112 2022-07-27 00:03:41.743075+00:00 t 591219 aibs_neuronal BC [330320, 204752, 25060] [nan, nan, nan] [nan, nan, nan]
3083 438586 2020-09-28 22:45:25.430745+00:00 t 529.501389 99807894274485381 864691136897160046 60636 2022-07-27 00:07:51.544023+00:00 t 438586 aibs_neuronal BC [254912, 247440, 23680] [nan, nan, nan] [nan, nan, nan]

3084 rows × 15 columns

or maybe we just want the cell types for a particular collection of root ids:

my_root_ids = [864691135771677771, 864691135560505569, 864691136723556861]
client.materialize.tables.aibs_soma_nuc_metamodel_preds_v117(pt_root_id=my_root_ids).query()
id created valid volume pt_supervoxel_id pt_root_id id_ref created_ref valid_ref target_id classification_system cell_type pt_position bb_start_position bb_end_position
0 19116 2020-09-28 22:41:51.767906+00:00 t 301.426115 74737997899501359 864691135771677771 69892 2022-07-27 00:08:42.855344+00:00 t 19116 aibs_neuronal 23P [72576, 108656, 20291] [nan, nan, nan] [nan, nan, nan]
1 21783 2020-09-28 22:41:59.966574+00:00 t 263.637074 75795590176519004 864691135560505569 1778 2022-07-26 23:54:56.804122+00:00 t 21783 aibs_neuronal 23P [80128, 124000, 16563] [nan, nan, nan] [nan, nan, nan]
2 4074 2020-09-28 22:42:41.341179+00:00 t 313.678234 73543309863605007 864691136723556861 52713 2022-07-27 00:07:02.006105+00:00 t 4074 aibs_neuronal 23P [63936, 120160, 20830] [nan, nan, nan] [nan, nan, nan]

You can get a list of all parameters that can be used for querying with the standard IPython/Jupyter docstring functionality, e.g. by appending ? to client.materialize.tables.aibs_soma_nuc_metamodel_preds_v117.

Note

Use of this functionality will show a brief warning that the interface is experimental. This is because the interface is still being developed and may change in the near future in response to user feedback.

Querying Synapses

Synapses are stored like any other table in the database, in this case synapses_pni_2. However, at more than 337 million rows this table is much larger than any other, and it works best when queried in a different way. The synapse_query function provides a more convenient way to query the synapse table. In particular, the pre_ids and post_ids arguments let you specify which root id (or collection of root ids) you want to query, with pre_ids giving the presynaptic neurons and post_ids the postsynaptic neurons. Using both pre_ids and post_ids in one call is effectively a logical AND, returning only those synapses from neurons in the list of pre_ids that target neurons in the list of post_ids. Let’s look at one particular example.

my_root_id = 864691135808473885
syn_df = client.materialize.synapse_query(pre_ids=my_root_id)
print(f"Total number of output synapses for {my_root_id}: {len(syn_df)}")
syn_df.head()
Total number of output synapses for 864691135808473885: 1498
id created superceded_id valid size pre_pt_supervoxel_id pre_pt_root_id post_pt_supervoxel_id post_pt_root_id pre_pt_position post_pt_position ctr_pt_position
0 158405512 2020-11-04 06:48:59.403833+00:00 NaN t 420 89385416926790697 864691135808473885 89385416926797494 864691135546540484 [179076, 188248, 20233] [179156, 188220, 20239] [179140, 188230, 20239]
1 185549462 2020-11-04 06:49:10.903020+00:00 NaN t 4832 91356016507479890 864691135808473885 91356016507470163 864691135884799088 [193168, 190452, 19262] [193142, 190404, 19257] [193180, 190432, 19254]
2 138110803 2020-11-04 06:49:46.758528+00:00 NaN t 3176 87263084540201919 864691135808473885 87263084540199587 864691135195078186 [163440, 104292, 19808] [163498, 104348, 19806] [163460, 104356, 19804]
3 155339535 2020-11-04 09:53:22.361558+00:00 NaN t 5624 88540717319827050 864691135808473885 88540717319834759 864691136039974142 [173050, 186398, 21570] [173026, 186518, 21573] [173100, 186472, 21569]
4 148262628 2020-11-04 06:53:27.294021+00:00 NaN t 3536 88189766885093187 864691135808473885 88189835604584343 864691135250533976 [170154, 193170, 21123] [170046, 193240, 21123] [170118, 193220, 21128]

Note that synapse queries always return the list of every synapse between the neurons in the query, even if there are multiple synapses between the same pair of neurons.
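
The logical AND behavior of combining pre_ids and post_ids can be illustrated locally on a handful of toy records. This is only a sketch of the semantics described above, not how the server implements the query; filter_synapses is a hypothetical helper and the small ids are made up. With a live client, the equivalent request would pass both arguments to client.materialize.synapse_query.

```python
def filter_synapses(synapses, pre_ids=None, post_ids=None):
    """Keep synapses matching pre_ids AND post_ids; None means no constraint."""
    pre = None if pre_ids is None else set(pre_ids)
    post = None if post_ids is None else set(post_ids)
    return [
        syn
        for syn in synapses
        if (pre is None or syn["pre_pt_root_id"] in pre)
        and (post is None or syn["post_pt_root_id"] in post)
    ]


# Toy records with made-up ids; real root ids are 18-digit integers.
toy_synapses = [
    {"id": 1, "pre_pt_root_id": 10, "post_pt_root_id": 20},
    {"id": 2, "pre_pt_root_id": 10, "post_pt_root_id": 20},
    {"id": 3, "pre_pt_root_id": 10, "post_pt_root_id": 30},
    {"id": 4, "pre_pt_root_id": 40, "post_pt_root_id": 20},
]

# Only synapses from neuron 10 onto neuron 20 are kept. Both are returned,
# because every synapse between a pair is a separate row.
both = filter_synapses(toy_synapses, pre_ids=[10], post_ids=[20])
print([syn["id"] for syn in both])
```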

A common pattern to generate a list of connections between unique pairs of neurons is to group by the root ids of the presynaptic and postsynaptic neurons and then count the number of synapses between them. For example, to get the number of synapses from this neuron onto every other neuron, ordered from most to fewest synapses:

syn_df.groupby(
  ['pre_pt_root_id', 'post_pt_root_id']
).count()[['id']].rename(
  columns={'id': 'syn_count'}
).sort_values(
  by='syn_count',
  ascending=False,
)
# Note that the 'id' part here is just a way to quickly extract one column.
# This could be any of the remaining column names, but `id` is often convenient because it is common to all tables.
                                       syn_count
pre_pt_root_id     post_pt_root_id
864691135808473885 864691135946651940         20
                   864691135808597021         16
                   864691136578647572         15
                   864691136066504856         13
                   864691135841325283         11
...                                          ...
                   864691135545375172          1
                   864691135545381160          1
                   864691135546540484          1
                   864691135547406532          1
                   864691137197468481          1

1038 rows × 1 columns