How to construct a new dataset from an h5ad file?

#25
by beannn - opened

Thanks for your fantastic work! I want to try Geneformer on the Zheng68k dataset for cell classification. How can I process the h5ad file to meet the input requirements? Could you provide a simple tutorial, since h5ad is quite a popular format for this task? Thank you!

Thank you for your interest in Geneformer. Please see this closed issue: https://huggingface.co/ctheodoris/Geneformer/discussions/4

ctheodoris changed discussion status to closed

Thank you for your prompt reply. I tried the data conversion, but it seems that the "organ_major" field is required. Can data without this field be used? (Zheng68k does not have it.) From the code, it also seems that the subsequent classification cannot run without this field.
BTW, I can't find any instructions about the required data format or fields. Thank you!

beannn changed discussion status to open

Thank you for your question. The organ_major field is not required. (There are no specific attributes required.) You can supply a dictionary of custom attributes to be included, but this is optional. All of the examples in this repository are just examples. I am guessing you are following the example literally and including organ_major as an attribute you would like to include in the tokenized dataset even though this is not an attribute in your dataset. If you remove this from the custom attribute dictionary then it should resolve the issue.
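As a rough illustration of the point above, the sketch below keeps only those custom attributes that actually exist in the dataset before they are handed to the tokenizer. The helper `build_custom_attr_dict` and the attribute names other than `organ_major` are hypothetical, and the assumed dictionary layout (attribute name in the input file mapped to column name in the tokenized dataset) should be checked against the tokenizer's own documentation.

```python
def build_custom_attr_dict(requested, available):
    """Keep only the requested attributes that exist in the dataset.

    requested: dict mapping input attribute name -> output column name
               (assumed layout; verify against the tokenizer documentation)
    available: set of attribute names actually present in the dataset
    """
    kept = {src: dst for src, dst in requested.items() if src in available}
    dropped = sorted(set(requested) - set(kept))
    return kept, dropped


# Hypothetical attributes present in a dataset such as Zheng68k
available_attrs = {"cell_type", "n_counts"}

# Copied from an example, including an attribute this dataset lacks
requested_attrs = {"cell_type": "cell_type", "organ_major": "organ_major"}

custom_attrs, dropped = build_custom_attr_dict(requested_attrs, available_attrs)
print("attributes to pass on:", custom_attrs)
print("attributes removed:", dropped)
```

Passing an empty dictionary, or none at all, is consistent with the statement that no specific attributes are required.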

Similarly, the classification example notebooks are just examples of downstream applications. In the manuscript we demonstrate a diverse panel of downstream tasks, of which we include just a couple here as concrete starting points for others to design their own fine-tuning applications. You can fine-tune the model to classify any gene or cell state classes based on how you design the learning objective. You will need to modify the example notebooks/scripts accordingly for your specific application.

We have now added a detailed explanation to the tokenizer example script to make the input data format clearer for users. There is also documentation within each of the modules, which you are always welcome to check should you need more information.

ctheodoris changed discussion status to closed

Thank you for your feedback. From the code, I found that the data must have an "ensembl_id" attribute. How can we construct this if it is not included in other datasets?

[self.genelist_dict.get(i, False) for i in data.ra["ensembl_id"]]
beannn changed discussion status to open

Yes, no cell attributes are required, but the genes should be labeled with the Ensembl ID because this is how the token dictionary converts the genes to the appropriate token. While gene names can vary, Ensembl IDs should be unique. Ensembl Biomart is one way you can convert gene names (or other identifiers) to Ensembl IDs. http://useast.ensembl.org/info/data/biomart/index.html
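A minimal sketch of the conversion step, assuming a gene-name-to-Ensembl-ID table has already been exported (for example from BioMart) and loaded into a dictionary. The helper `map_symbols_to_ensembl` is hypothetical, the three-gene lookup table is only illustrative, and `NOT_A_GENE` is a made-up symbol used to show the unmapped case.

```python
def map_symbols_to_ensembl(symbols, symbol_to_ensembl):
    """Return one Ensembl ID (or None) per gene symbol, plus the unmapped symbols."""
    ensembl_ids = []
    unmapped = []
    for symbol in symbols:
        ensembl_id = symbol_to_ensembl.get(symbol)
        ensembl_ids.append(ensembl_id)
        if ensembl_id is None:
            unmapped.append(symbol)
    return ensembl_ids, unmapped


# Illustrative lookup table; in practice this would come from a full export
symbol_to_ensembl = {
    "TP53": "ENSG00000141510",
    "GAPDH": "ENSG00000111640",
    "ACTB": "ENSG00000075624",
}

gene_symbols = ["GAPDH", "NOT_A_GENE", "TP53"]
ensembl_ids, unmapped = map_symbols_to_ensembl(gene_symbols, symbol_to_ensembl)
print(ensembl_ids)
print("unmapped:", unmapped)
```

The resulting list keeps the original gene order, so it can be stored as the `ensembl_id` gene attribute. Genes without an ID cannot be converted to a token, so how to handle them (dropping them is one option) is a decision for the user.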

ctheodoris changed discussion status to closed

It seems it also needs n_counts. What is the meaning of this field? Is it the sum of expression for each cell? Thanks!

subview_norm_array = (
    subview[:, :]
    / subview.ca.n_counts
    * 10_000
    / norm_factor_vector[:, None]
)
beannn changed discussion status to open

Thank you for your question. Yes, it is the total read counts for each cell (described in the documentation in the lines above the code you pasted). We have now added this point to the documentation in the example notebook as well to help clarify.
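For readers who want to see the arithmetic, here is a small NumPy sketch of computing `n_counts` and applying the same normalization as the pasted snippet. It assumes a raw count matrix with genes as rows and cells as columns, which is the orientation the broadcasting in that snippet suggests. The toy counts and the values in `norm_factor_vector` are made up for illustration.

```python
import numpy as np

# Hypothetical raw counts: 3 genes (rows) x 3 cells (columns)
counts = np.array(
    [
        [1.0, 0.0, 4.0],
        [3.0, 2.0, 0.0],
        [6.0, 8.0, 4.0],
    ]
)

# n_counts: total read counts per cell, i.e. the sum over all genes
n_counts = counts.sum(axis=0)

# Hypothetical per-gene normalization factors (one value per gene)
norm_factor_vector = np.array([1.0, 2.0, 4.0])

# Same arithmetic as the pasted snippet: scale each cell to 10,000 total
# counts, then divide each gene by its normalization factor
norm_array = counts / n_counts * 10_000 / norm_factor_vector[:, None]

print(n_counts)
print(norm_array)
```

If the matrix is stored with cells as rows instead, as is common for h5ad files, the per-cell total would be the sum along the other axis. The totals should be computed from raw counts, not from values that have already been normalized.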

ctheodoris changed discussion status to closed
