Visual Behavior Neuropixels Dataset#
The main entry point to the VBN dataset is the VisualBehaviorNeuropixelsProjectCache
class. This class is responsible for downloading any requested data or metadata as needed and storing it in well known locations. For detailed info about how to access this data, check out this tutorial
We begin by importing the VisualBehaviorNeuropixelsProjectCache
class.
import pandas as pd
import os
from allensdk.brain_observatory.behavior.behavior_project_cache.\
behavior_neuropixels_project_cache \
import VisualBehaviorNeuropixelsProjectCache
/opt/envs/allensdk/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
Now we can specify our cache directory and set up the cache.
# this path should point to the location of the dataset on your platform
cache_dir = '/root/capsule/data/'
cache = VisualBehaviorNeuropixelsProjectCache.from_local_cache(
cache_dir=cache_dir, use_static_cache=True)
We can use the VisualBehaviorNeuropixelsProjectCache
to explore the parameters of this dataset. Let’s start by examining the cache metadata tables.
VBN Metadata#
The data manifest comprises 5 tables:
ecephys_sessions_table
(64 kB)behavior_sessions_table
(531 kB)units_table
(130 MB)probes_table
(127 kB)channels_table
(28 MB)
The ecephys_sessions_table
contains metadata for every Neuropixels recording session in the dataset. We attempted to insert 6 probes for each experiment, but occasionally individual insertions failed. The probe_count
column tells you how many probes were inserted for a given session. The structure_acronyms
column indicates which brain areas were targeted. For the majority of mice, there are two recording sessions. These were run on consecutive days with two different image sets, G
and H
. The experience_level
column tells you whether the image set used for a particular recording was the same as the training image set (Familiar
), or different from the training image set (Novel
).
The behavior_sessions_table
contains metadata for each behavior session. Some behavior sessions have Neuropixels data associated with them, while others took place during training in the behavior facility. The different training stages that mice progressed through are described by the session_type
.
The units_table
contains metadata for every unit in the release. Each unit can be linked to the corresponding recording session, probe and channel by the ecephys_session_id
, ecephys_probe_id
and ecephys_channel_id
columns. This table also contains a number of helpful quality metrics, which can be used to filter out contaminated units before analysis. For more guidance on how to use these metrics, check out this tutorial.
The probes_table
contains metadata for each probe insertion.
The channels_table
contains metadata for each channel recorded during an ephys session. This table provides useful info about where a particular channel is located in the Allen Common Coordinate Framework as well as it’s relative position on the probe.
Now let’s look at each of these tables in more detail to get a better sense of the dataset.
Ecephys sessions table#
First, let’s load the ecephys_sessions_table and print the columns:
ecephys_sessions_table = cache.get_ecephys_session_table()
ecephys_sessions_table.columns
Index(['behavior_session_id', 'date_of_acquisition', 'equipment_name',
'session_type', 'mouse_id', 'genotype', 'sex', 'project_code',
'age_in_days', 'unit_count', 'probe_count', 'channel_count',
'structure_acronyms', 'image_set', 'prior_exposures_to_image_set',
'session_number', 'experience_level', 'prior_exposures_to_omissions',
'file_id', 'abnormal_histology', 'abnormal_activity'],
dtype='object')
This table gives us lots of useful metadata about each recording session, including the genotype, sex and age of the mouse that was run, what brain areas were recorded and some important info about the stimulus.
To demystify a few of these columns, let’s briefly review the experimental design. Each mouse was trained with one of two image sets (G
or H
). For the majority of mice, we recorded two sessions: one with the trained ‘familiar’ image set and one with a ‘novel’ image set. Note that two of the eight images were shared across these two image sets as diagrammed below for an example mouse. For this mouse, image set G
(images on blue and purple backgrounds) was used in training and was therefore ‘familiar’, while image set H
(the two holdover images on purple background plus six novel images on red background) was ‘novel’.
Each recording session can be defined by a few parameters, including the image_set
used (G or H), the experience_level
of the mouse (indicating whether the mouse had seen the image set in previous training sessions) and the session_number
(indicating whether it was the first or second recording day for the mouse). In bottom bubble of the above diagram, you can see the three different training/recording trajectories mice in this dataset took:
Train on G; see G on first recording day; see H on second recording day
Train on G; see H on first recording day; see G on second recording day
Train on H; see H on first recording day; see G on second recording day
The numbers in the recording session cells indicate how many of each session type exist in this dataset. The first number is what the SDK returns by default. The second number (in parentheses) is what the SDK returns without filtering for abnormalities (see below as well as the Data Access tutorial).
Here is a brief description of each column:
- abnormal_activity
nan or list List of experiment time stamps when possible epileptic activity was noted.
- abnormal_histology
nan or list List of brain areas where possible damage was noted in post-hoc imaging.
- age_in_days
int age of mouse in days
- behavior_session_id
int unique identifier for a behavior session
- channel_count
float total number of channels on all probes used for experiment
- date_of_acquisition
date time object date and time of data acquisition, yyyy-mm-dd hh:mm:ss.
- ecephys_session_id
int unique identifier for an ecephys recording session
- equipment_name
string identifier for equipment data was collected on
- experience_level
string ‘Familiar’: image set mouse was trained on, ‘Novel’: not the image set the mouse was trained on.
- file_id
int lookup id to retrieve NWB file from S3 or the local cache.
- genotype
string full genotype of transgenic mouse
- image_set
string image set shown for a particular behavior session or ephys session
- mouse_id
int unique identifier for a mouse
- prior_exposures_to_image_set
float 64 number of prior sessions (during training or ophys) where the mouse was exposed to the image set shown in the current session. Starts at 0 for first exposure
- prior_exposures_to_omissions
int 64 number of sessions where the mouse was exposed to omissions. Starts at 0 for first exposure. Omissions do not occur during training
- prior_exposures_to_session_type
int 64 Number of previous sessions (during training or ophys) where the mouse was exposed to the current session type. Starts at 0 for first exposure
- project_code
string ‘NeuropixelVisualBehavior’–Project this session belongs to
- session_number
float 64 [1, 2] Indicates whether this session was the first or second recording day for the mouse. Takes values ‘1’ or ‘2’.
- session_type
string Visual stimulus type displayed during behavior session
- sex
string [‘M’, ‘F’] Sex of the mouse
- structure_acronyms
string List of CCF structures recorded during this experiment
- unit_count
float number of units recorded during this
Behavior Sessions Table#
In this dataset, mice are trained on a visual change detection task. This task involves a continuous stream of stimuli, and mice learn to lick in response to a change in the stimulus identity to earn a water reward. There are different stages of training in this task, described below. The metadata for each behavior session in the dataset can be found in the behavior_sessions_table
and can be used to build a training history for each mouse. Importantly, this table lists all of the behavior sessions for each mouse from the beginning of its training.
Let’s load the table and take a look:
behavior_sessions = cache.get_behavior_session_table()
print(f"Total number of behavior sessions: {len(behavior_sessions)}")
behavior_sessions.head()
Total number of behavior sessions: 3424
equipment_name | genotype | mouse_id | sex | age_in_days | session_number | prior_exposures_to_session_type | prior_exposures_to_image_set | prior_exposures_to_omissions | ecephys_session_id | date_of_acquisition | session_type | image_set | file_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
behavior_session_id | ||||||||||||||
1040882886 | NP.1 | Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 506940 | M | 253 | 18 | 0 | 6 | 0 | 1040871931 | 2020-08-04 16:30:32.835000+00:00 | HABITUATION_5_images_G_handoff_ready_5uL_reward | G | 1022 |
1042886405 | NP.0 | Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 506940 | M | 262 | 24 | 6 | 12 | 0 | 1042863451 | 2020-08-13 16:45:10.869000+00:00 | HABITUATION_5_images_G_handoff_ready_5uL_reward | G | 1023 |
1042328825 | NP.0 | Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 506940 | M | 260 | 22 | 4 | 10 | 0 | 1042309765 | 2020-08-11 13:38:10.476000+00:00 | HABITUATION_5_images_G_handoff_ready_5uL_reward | G | 1024 |
1041062566 | NP.1 | Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 506940 | M | 254 | 19 | 1 | 7 | 0 | 1041050914 | 2020-08-05 09:27:22.482000+00:00 | HABITUATION_5_images_G_handoff_ready_5uL_reward | G | 1025 |
1042578564 | NP.0 | Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 506940 | M | 261 | 23 | 5 | 11 | 0 | 1042566236 | 2020-08-12 13:13:12.237000+00:00 | HABITUATION_5_images_G_handoff_ready_5uL_reward | G | 1026 |
You can see that there are many more sessions here than in the ecephys_sessions_table. But the columns are defined in the same way.
To see how to use this table to look at the training trajectory for one mouse, check out this tutorial
Units Table#
The units metadata table contains important quality and waveform metrics for every unit recorded in this dataset. For more info about how to use these quality metrics to filter units for analysis, check out this tutorial
Let’s grab this table and take a look at the columns:
units = cache.get_unit_table()
print(f'This dataset contains {len(units)} total units')
units.head()
This dataset contains 319013 total units
ecephys_channel_id | ecephys_probe_id | ecephys_session_id | amplitude_cutoff | anterior_posterior_ccf_coordinate | dorsal_ventral_ccf_coordinate | left_right_ccf_coordinate | cumulative_drift | d_prime | structure_acronym | ... | valid_data | amplitude | waveform_duration | waveform_halfwidth | PT_ratio | recovery_slope | repolarization_slope | spread | velocity_above | velocity_below | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
unit_id | |||||||||||||||||||||
1157005856 | 1157001834 | 1046469925 | 1046166369 | 0.500000 | 8453.0 | 3353.0 | 6719.0 | 140.32 | 6.088133 | MB | ... | True | 143.066332 | 0.151089 | 0.096147 | 0.310791 | -0.113863 | 0.480656 | 20.0 | -0.457845 | NaN |
1157005853 | 1157001834 | 1046469925 | 1046166369 | 0.323927 | 8453.0 | 3353.0 | 6719.0 | 239.76 | 4.635583 | MB | ... | True | 90.709418 | 0.357119 | 0.192295 | 0.531490 | -0.075261 | 0.366371 | 30.0 | 2.060302 | -2.060302 |
1157005720 | 1157001786 | 1046469925 | 1046166369 | 0.044133 | 8575.0 | 3842.0 | 6590.0 | 263.32 | 5.691955 | MRN | ... | True | 90.433103 | 0.521943 | 0.178559 | 0.612217 | -0.012119 | 0.269843 | 80.0 | 0.000000 | 0.863364 |
1157006074 | 1157001929 | 1046469925 | 1046166369 | 0.000583 | 8212.0 | 2477.0 | 6992.0 | 154.64 | 6.049284 | NOT | ... | True | 287.492107 | 0.343384 | 0.192295 | 0.470194 | -0.178335 | 1.129324 | 40.0 | 1.373534 | 0.000000 |
1157006072 | 1157001929 | 1046469925 | 1046166369 | 0.500000 | 8212.0 | 2477.0 | 6992.0 | 242.58 | 4.745499 | NOT | ... | True | 157.897057 | 0.329648 | 0.164824 | 0.488276 | -0.105005 | 0.660135 | 70.0 | 0.412060 | 0.343384 |
5 rows × 35 columns
For more information about many of the metrics included in this table and how to use them to guide your analysis, see our quality metrics tutorial. For now, here’s a brief description of each column:
General Metadata
ecephys_channel_id
: unique ID of channel on which unit’s peak waveform occurred
ecephys_probe_id
: unique ID for probe on which unit was recorded
ecephys_session_id
: unique ID for session during which unit was recorded
anterior_posterior_ccf_coordinate
: CCF coord in the AP axis
dorsal_ventral_ccf_coordinate
: CCF coord in the DV axis
left_right_ccf_coordinate
: CCF coord in the left/right axis
structure_acronym
: CCF acronym for area to which unit was assigned
structure_id
: CCF structure ID for the area to which unit was assigned
probe_horizontal_position
: Horizontal (perpendicular to shank) probe position of each unit’s peak channel in microns
probe_vertical_position
: Vertical (along shank) probe position of each unit’s peak channel in microns
Waveform metrics: Look here for more detail on these metrics and the code that computes them. For the below descriptions the ‘1D waveform’ is defined as the waveform on the peak channel. The ‘2D waveform’ is the waveform across channels centered on the peak channel.
amplitude
: Peak to trough amplitude for mean 1D waveform in microvolts
waveform_duration
: Time from trough to peak for 1D waveform in milliseconds
waveform_halfwidth
: Width of 1D waveform at half-amplitude in milliseconds
PT_ratio
: Ratio of the max (peak) to the min (trough) amplitudes for 1D waveform
recovery_slope
: Slope of recovery of 1D waveform to baseline after repolarization (coming down from peak)
repolarization_slope
: Slope of repolarization of 1D waveform to baseline after trough
spread
: Range of channels for which the spike amplitude was above 12% of the peak channel amplitude
velocity_above
: Slope of spike propagation velocity traveling in dorsal direction from soma (note to avoid infinite values, this is actually the inverse of velocity: ms/mm)
velocity_below
: Slope of spike propagation velocity traveling in ventral direction from soma (note to avoid infinite values, this is actually the inverse of velocity: ms/mm)
snr
: signal-to-noise ratio for 1D waveform
Quality metrics: Look here for more detail on these metrics and the code that computes them.
amplitude_cutoff
: estimate of miss rate based on amplitude histogram (ie fraction of spikes estimated to have been below detection threshold)
cumulative_drift
: cumulative change in spike depth along probe throughout the recording
d_prime
: classification accuracy based on LDA
firing_rate
: Mean firing rate over entire recording
isi_violations
: Ratio of refractory violation rate to total spike rate
isolation_distance
: Distance to nearest cluster in Mahalanobis space
l_ratio
: The Mahalanobis distance and chi-squared inverse cdf are used to find the probability of cluster membership for each spike.
max_drift
: Maximum change in unit depth across recording
nn_hit_rate
: Fraction of nearest neighbors in PCA space for spikes in unit cluster that are also in unit cluster
nn_miss_rate
: Fraction of nearest neighbors for spikes outside unit cluster than are in unit cluster
presence_ratio
: Fraction of time during session for which a unit was spiking
silhouette_score
: Standard metric for cluster overlap, computed in PCA space
quality
: Label assigned based on waveform shape as described here. Either ‘good’ for physiological waveforms or ‘noise’ for artifactual waveforms.
Probes Table#
The probes table contains useful info about every probe insertion in the dataset.
probes = cache.get_probe_table()
probes.head()
ecephys_session_id | name | sampling_rate | lfp_sampling_rate | phase | has_lfp_data | unit_count | channel_count | structure_acronyms | file_id | |
---|---|---|---|---|---|---|---|---|---|---|
ecephys_probe_id | ||||||||||
1044506933 | 1044385384 | probeB | 30000.178402 | 2500.014867 | 1.0 | True | 701 | 384 | ['CA1', 'DG', 'LP', 'POL', 'PoT', 'VISpm', 'ro... | 0 |
1044506934 | 1044385384 | probeC | 30000.049852 | 2500.004154 | 1.0 | True | 307 | 384 | ['MB', 'MRN', 'POST', 'SCig', 'VISp', 'root'] | 1 |
1044506935 | 1044385384 | probeD | 30000.029115 | 2500.002426 | 1.0 | True | 521 | 384 | ['CA1', 'CA3', 'DG', 'LGv', 'MB', 'TH', 'VISl'... | 2 |
1044506936 | 1044385384 | probeE | 30000.075851 | 2500.006321 | 1.0 | True | 282 | 384 | ['CA1', 'DG', 'MB', 'MGd', 'MGm', 'MRN', 'SGN'... | 3 |
1044506937 | 1044385384 | probeF | 29999.959578 | 2499.996631 | 1.0 | True | 368 | 384 | ['CA1', 'DG', 'LP', 'MRN', 'POL', 'PoT', 'SGN'... | 4 |
Here’s a brief description of every column:
- ecephys_probe_id
int Unique identifier for probe insertion
- ecephys_session_id
int Unique identifier for ecephys session for this insertion
- name
string Name of probe indicating which of the six probe positions this probe filled (A-F).
- sampling_rate
float Sampling rate for AP band data
- has_lfp_data
bool Flag indicating whether LFP data was collected on this probe
- lfp_sampling_rate
float Sampling rate for LFP data
- phase
float Neuropixels generation for this probe (all probes in this dataset are Neuropixels 1.0)
- unit_count
int Number of units recorded for this insertion
- channel_count
int Number of channels available for recording on this insertion
- structure_acronyms
string List of areas recorded for this probe insertion
Channels Table#
Metadata for every probe channel recorded in this dataset. This table can be merged with the units table to get CCF area assignments for each unit.
channels = cache.get_channel_table()
channels.head()
ecephys_probe_id | ecephys_session_id | probe_channel_number | probe_vertical_position | probe_horizontal_position | anterior_posterior_ccf_coordinate | dorsal_ventral_ccf_coordinate | left_right_ccf_coordinate | structure_acronym | unit_count | valid_data | |
---|---|---|---|---|---|---|---|---|---|---|---|
ecephys_channel_id | |||||||||||
1049365509 | 1048089911 | 1047969464 | 0 | 20.0 | 43.0 | 8445.0 | 4013.0 | 6753.0 | MRN | 0 | True |
1049365511 | 1048089911 | 1047969464 | 1 | 20.0 | 11.0 | 8443.0 | 4005.0 | 6755.0 | MRN | 5 | True |
1049365512 | 1048089911 | 1047969464 | 2 | 40.0 | 59.0 | 8441.0 | 3997.0 | 6757.0 | MRN | 0 | True |
1049365513 | 1048089911 | 1047969464 | 3 | 40.0 | 27.0 | 8439.0 | 3989.0 | 6759.0 | MRN | 5 | True |
1049365514 | 1048089911 | 1047969464 | 4 | 60.0 | 43.0 | 8438.0 | 3981.0 | 6761.0 | MRN | 7 | True |
Here are the columns:
- ecephys_channel_id
Unique identifier for each channel. This is the id used in the units
ecephys_channel_id
column.- ecephys_probe_id
int Unique identifier for probe insertion
- ecephys_session_id
int Unique identifier for ecephys session for this insertion
- anterior_posterior_ccf_coordinate
CCF coord in the AP axis
- dorsal_ventral_ccf_coordinate
CCF coord in the DV axis
- left_right_ccf_coordinate
CCF coord in the left/right axis
- structure_acronym
CCF acronym for area to which channel was assigned
- probe_horizontal_position
Horizontal (perpendicular to shank) probe position of each unit’s peak channel in microns
- probe_vertical_position
Vertical (along shank) probe position of each unit’s peak channel in microns
- probe_channel_number
Index of channel position on probe (0-383 with 0 at tip of probe)
- unit_count
int Number of units assigned to this channel