Environment Setup for Presidio in Fabric

The spaCy English models can be downloaded from the spaCy Models Documentation (English page).

1. Requirements

Fabric workspace with sufficient permissions to create and manage custom environments.
Lakehouse access for uploading large models or data files.

2. Configure Spark Pool

Make sure to create (or select) a valid Spark pool that you can attach to your Fabric environment.

Spark pool configuration

3. Create a New Environment

In your Fabric workspace, go to Settings and select New Environment.
Provide a name (e.g., presidio-env) and choose the appropriate Python version.
Configure any required settings (e.g., pinned versions, advanced options).

4. Add Dependencies

Under Public Library, add the essential libraries:
presidio-analyzer
presidio-anonymizer
spacy

Creating a custom environment

For smaller spaCy models (such as en_core_web_md, which is under 300MB), you can include the model directly in this environment.
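To decide where a given model wheel belongs, its size can be checked before uploading. The following is a minimal sketch, not part of the original sample: the helper name `fits_in_environment` is hypothetical, and reading the 300MB figure as 300 * 1024 * 1024 bytes is an assumption.

```python
import os

# Assumption: the 300MB threshold mentioned in this guide, read as MiB.
ENV_SIZE_LIMIT_BYTES = 300 * 1024 * 1024


def fits_in_environment(wheel_path, limit_bytes=ENV_SIZE_LIMIT_BYTES):
    """Return True if the file at wheel_path is smaller than limit_bytes."""
    return os.path.getsize(wheel_path) < limit_bytes
```

If the function returns False for your wheel, the Lakehouse upload route is likely the better fit.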

Adding dependencies in Fabric

5. Upload a Large spaCy Model

If you want to use en_core_web_lg (which typically exceeds 300MB):
Upload the .whl file to your Lakehouse (or any location accessible by Spark).
Install it from within the notebook rather than through this environment.
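After uploading, it can help to confirm from a notebook cell that the wheel is visible at the expected location. This is a rough sketch under assumptions: `find_model_wheels` is a hypothetical helper, and the folder you pass in depends on where you uploaded the file.

```python
from pathlib import Path


def find_model_wheels(models_dir, model_name="en_core_web_lg"):
    """Return a sorted list of wheel files for model_name found in models_dir."""
    return sorted(Path(models_dir).glob(f"{model_name}-*.whl"))


# Example call (the directory is an assumption; adjust to your upload location):
# find_model_wheels("/lakehouse/default/Files/presidio/models")
```

An empty list suggests the upload went to a different folder or the Lakehouse is not attached to the notebook.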

Upload large model to the lakehouse

6. Compute

Configure your compute, making sure to select the Spark pool you configured in step 2.

Custom environment summary

7. Review & Save

Confirm your chosen libraries appear under the Custom Library or Public Library tabs.
Click Save to finalize your environment setup.

8. Run the Sample Notebook

Open the presidio_and_spark.ipynb notebook.
When opening your notebook, ensure you pick the custom environment you created.
Confirm you have selected the valid Spark pool you configured earlier.
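A quick way to check that the notebook actually picked up the custom environment is to query the installed package versions in the first cell. The sketch below uses only the standard library; the helper name `installed_versions` is illustrative and not part of the sample notebook.

```python
from importlib import metadata

# The libraries added to the environment in step 4.
REQUIRED = ("presidio-analyzer", "presidio-anonymizer", "spacy")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any entry reports as not installed, the notebook is probably attached to a different environment than the one you created.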

Configure env to the notebook

Within the notebook, if you're using a large spaCy model, install it using:

```python
# Update the path below to match your model's location.
%pip install /lakehouse/default/Files/presidio/models/en_core_web_lg-3.8.0-py3-none-any.whl
```
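Once the model is installed, Presidio needs to be told which spaCy model to load. The following is a minimal sketch rather than the sample notebook's own code: the helper names `build_nlp_config` and `create_analyzer` are hypothetical, and it assumes the English model en_core_web_lg. The Presidio imports are done lazily inside the function so the configuration helper can be used on its own.

```python
def build_nlp_config(model_name="en_core_web_lg", lang_code="en"):
    """Build an NLP engine configuration dict for a single spaCy model."""
    return {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": lang_code, "model_name": model_name}],
    }


def create_analyzer(model_name="en_core_web_lg", lang_code="en"):
    """Create a Presidio AnalyzerEngine backed by the given spaCy model."""
    # Imported here so build_nlp_config works even without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(
        nlp_configuration=build_nlp_config(model_name, lang_code)
    )
    return AnalyzerEngine(
        nlp_engine=provider.create_engine(),
        supported_languages=[lang_code],
    )


# Example usage (requires the model to be installed in the session):
# analyzer = create_analyzer()
# results = analyzer.analyze(text="My name is Jane Doe", language="en")
```

If you included a smaller model in the environment instead, pass its name (for example en_core_web_md) as `model_name`.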