Build Your Own Connection

Overview

Delta Sharing is an open protocol for secure real-time data sharing, allowing organizations to share data across different computing platforms. This guide will walk you through the process of connecting to and accessing data through Delta Sharing.

Resources

Delta Sharing Connector Options

  • Python Connector

  • Apache Spark Connector

  • Set up an Interactive Shell

  • Set up a Standalone Project

Python Connector

The Delta Sharing Python Connector is a Python library that implements the Delta Sharing Protocol to read tables from a Delta Sharing server. You can load shared tables as a pandas DataFrame, or as an Apache Spark DataFrame if running in PySpark with the Apache Spark Connector installed.

System Requirements

  • Python 3.8+ for delta-sharing version 1.1+

  • Python 3.6+ for older versions

  • If running Linux, glibc version >= 2.31

  • For automatic installation of the delta-kernel-rust-sharing-wrapper package, see the next section for details.
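
These requirements can be checked programmatically before installing. A minimal sketch (illustrative only, not part of the delta-sharing package):

```python
import sys
import platform

def meets_requirements():
    """Return True if this interpreter meets the delta-sharing 1.1+ requirements."""
    # delta-sharing 1.1+ requires Python 3.8+
    if sys.version_info < (3, 8):
        return False
    if platform.system() == "Linux":
        # On Linux, glibc >= 2.31 is required.
        libc_name, libc_version = platform.libc_ver()
        if libc_name == "glibc" and libc_version:
            major, minor = (libc_version.split(".") + ["0"])[:2]
            if (int(major), int(minor)) < (2, 31):
                return False
    return True
```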

Installation Process

pip3 install delta-sharing

  • If you are using Databricks Runtime, you can follow Databricks Libraries doc to install the library on your clusters.

  • If installation fails due to an issue downloading delta-kernel-rust-sharing-wrapper, try the following:

    • Check that your python3 version is >= 3.8

    • Upgrade pip3 to the latest version

Accessing Shared Data

The connector accesses shared tables based on profile files, which are JSON files containing a user's credentials to access a Delta Sharing server. We have several ways to get started:

Before You Begin

  • Download a profile file from your data provider.
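
For context, a profile file is a small JSON document containing the sharing server endpoint and the credentials to reach it. A representative example (all values here are placeholders; your data provider supplies the real ones):

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>",
  "expirationTime": "2025-12-31T00:00:00.0Z"
}
```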

Accessing Shared Data Options

After you save the profile file, you can use it in the connector to access shared tables.

import delta_sharing

# Point to the profile file. It can be a file on the local file system or a file on remote storage.
profile_file = "<profile-file-path>"

# Create a SharingClient.
client = delta_sharing.SharingClient(profile_file)

# List all shared tables.
client.list_all_tables()

# Create a url to access a shared table.
# A table path is the profile file path following with `#` and the fully qualified
# name of a table (`<share-name>.<schema-name>.<table-name>`).
table_url = profile_file + "#<share-name>.<schema-name>.<table-name>"

# Fetch 10 rows from a table and convert it to a pandas DataFrame. This can be used
# to read sample data from a table that cannot fit in memory.
delta_sharing.load_as_pandas(table_url, limit=10)

# Load a table as a pandas DataFrame. This can be used to process tables that can fit in memory.
delta_sharing.load_as_pandas(table_url)

# Load a table as a pandas DataFrame explicitly using the Delta format.
delta_sharing.load_as_pandas(table_url, use_delta_format=True)

# If the code is running with PySpark, you can use `load_as_spark` to load the table
# as a Spark DataFrame.
delta_sharing.load_as_spark(table_url)

# If the table supports history sharing (tableConfig.cdfEnabled=true in the OSS
# Delta Sharing Server), the connector can query table changes.
# Load table changes from version 0 to version 5, as a pandas DataFrame.
delta_sharing.load_table_changes_as_pandas(table_url, starting_version=0, ending_version=5)

# If the code is running with PySpark, you can load table changes as a Spark DataFrame.
delta_sharing.load_table_changes_as_spark(table_url, starting_version=0, ending_version=5)
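
The table URL convention described above can be illustrated with a small helper (hypothetical, not part of the delta-sharing API):

```python
def split_table_url(table_url):
    """Split '<profile-file-path>#<share>.<schema>.<table>' into its four parts."""
    profile_file, fqn = table_url.split("#", 1)
    share, schema, table = fqn.split(".")
    return profile_file, share, schema, table

# Example with a hypothetical profile path and table name:
parts = split_table_url("/tmp/config.share#my_share.my_schema.my_table")
# parts == ("/tmp/config.share", "my_share", "my_schema", "my_table")
```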

Apache Spark Connector

The Apache Spark Connector implements the Delta Sharing Protocol to read shared tables from a Delta Sharing Server. It can be used in SQL, Python, Java, Scala and R.

System Requirements

Accessing Shared Data

The connector loads user credentials from profile files.

Configuring Apache Spark

You can set up Apache Spark to load the Delta Sharing connector in the following two ways:

  • Run interactively: Start the Spark shell (Scala or Python) with the Delta Sharing connector and run the code snippets interactively in the shell.

  • Run as a project: Set up a Maven or SBT project (Scala or Java) with the Delta Sharing connector, copy the code snippets into a source file, and run the project.

If you are using Databricks Runtime, you can skip this section and follow Databricks Libraries doc to install the connector on your clusters.

Set up an interactive shell

To use the Delta Sharing connector interactively within Spark's Scala or Python shell, launch the shells as follows.

PySpark Shell

pyspark --packages io.delta:delta-sharing-spark_2.12:3.1.0

Scala Shell

bin/spark-shell --packages io.delta:delta-sharing-spark_2.12:3.1.0
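
Once a shell is running with the connector package, shared tables can be read through the `deltaSharing` data source. A minimal sketch (the helper name is illustrative; it assumes an active SparkSession):

```python
def read_shared_table(spark, table_url):
    """Read a shared table as a Spark DataFrame via the Delta Sharing connector.

    table_url follows the '<profile-file-path>#<share>.<schema>.<table>' convention.
    """
    # 'deltaSharing' is the data source format name registered by the connector package.
    return spark.read.format("deltaSharing").load(table_url)
```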

Set up a standalone project

If you want to build a Java/Scala project using the Delta Sharing connector from the Maven Central Repository, you can use the following Maven coordinates.

Maven

You can include the Delta Sharing connector in your Maven project by adding it as a dependency to your POM file. The connector is compiled with Scala 2.12.

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-sharing-spark_2.12</artifactId>
  <version>3.1.0</version>
</dependency>
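
If you are using SBT instead, the equivalent dependency line (matching the Maven coordinates above; `%%` appends the Scala 2.12 suffix automatically) is typically:

```scala
libraryDependencies += "io.delta" %% "delta-sharing-spark" % "3.1.0"
```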