Pinecone

Overview

This page guides you through the process of setting up the Pinecone destination connector.

There are three parts to this:

  • Processing - Split individual records into chunks so they fit the embedding model's context window, and decide which fields to use as context and which are supplementary metadata.
  • Embedding - Convert the text into a vector representation using a pre-trained model.
  • Indexing - Store the vectors in a Pinecone index for similarity search.
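The three stages above can be sketched in a few lines of Python. This is a minimal illustration with made-up helper names, not the connector's actual code (which uses LangChain for chunking and the provider SDKs for embedding):

```python
def process(record: dict, text_fields: list[str], chunk_size: int) -> list[str]:
    """Concatenate the chosen text fields, then split into fixed-size chunks."""
    text = " ".join(str(record[f]) for f in text_fields)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Stand-in embedder: real providers return 1024- or 1536-dimension vectors."""
    return [[float(len(c))] for c in chunks]

def index(vectors: list[list[float]], chunks: list[str]) -> list[dict]:
    """Pair each vector with its source chunk, ready to upsert into an index."""
    return [{"values": v, "metadata": {"text": c}} for v, c in zip(vectors, chunks)]

record = {"title": "Pinecone destination", "body": "Splits, embeds, and indexes records."}
chunks = process(record, ["title", "body"], chunk_size=30)
entries = index(embed(chunks), chunks)
```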

Prerequisites

To use the Pinecone destination, you need:

  • A Pinecone account with a pre-created index. The index dimensions must match your chosen embedding method.
  • An account with API access for your chosen embedding provider (OpenAI, Cohere, Azure OpenAI, or an OpenAI-compatible service). Not required if using the Fake embeddings option for testing.

You need the following information to configure the destination:

  • Embedding service API Key - The API key for your embedding provider account.
  • Pinecone API Key - The API key for your Pinecone project. You can find this in the Pinecone console.
  • Pinecone Environment - The environment for your Pinecone project (for example, us-east-1-aws).
  • Pinecone Index name - The name of the Pinecone index to load data into.
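Put together, a destination configuration covering these fields might look like the following sketch. The property names shown here are indicative only; consult the config fields reference for the exact schema:

```json
{
  "embedding": {
    "mode": "openai",
    "openai_key": "YOUR_OPENAI_API_KEY"
  },
  "indexing": {
    "pinecone_key": "YOUR_PINECONE_API_KEY",
    "pinecone_environment": "us-east-1-aws",
    "index": "my-index"
  },
  "processing": {
    "chunk_size": 1000,
    "text_fields": ["title", "body"],
    "metadata_fields": ["author"]
  }
}
```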

Supported sync modes

| Sync mode | Supported? |
| :-------- | :--------- |
| Full Refresh - Overwrite | Yes |
| Full Refresh - Append | Yes |
| Full Refresh - Overwrite + Deduped | No |
| Incremental Sync - Append | Yes |
| Incremental Sync - Append + Deduped | Yes |

Data type mapping

All fields specified as metadata fields will be stored in the metadata object of the document and can be used for filtering. The following data types are allowed for metadata fields:

  • String
  • Number (integer or floating point; converted to a 64-bit floating-point number)
  • Booleans (true, false)
  • List of String

All other fields are ignored.
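The allowed-type rule can be sketched as a small filter (an illustration, not the connector's implementation): integers are widened to floats, lists are kept only if every element is a string, and everything else is dropped.

```python
def filter_metadata(metadata: dict) -> dict:
    """Keep only metadata values with Pinecone-compatible types."""
    result = {}
    for key, value in metadata.items():
        if isinstance(value, (bool, str)):        # check bool before int/float:
            result[key] = value                   # bool is a subclass of int
        elif isinstance(value, (int, float)):
            result[key] = float(value)            # numbers become 64-bit floats
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            result[key] = value
        # all other types (dicts, mixed lists, None, ...) are ignored
    return result

clean = filter_metadata({
    "title": "doc", "views": 7, "published": True,
    "tags": ["a", "b"], "nested": {"x": 1}, "mixed": ["a", 1],
})
# → {'title': 'doc', 'views': 7.0, 'published': True, 'tags': ['a', 'b']}
```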

Configuration

Processing

Each record is split into text fields and metadata fields as configured in the "Processing" section. All text fields are concatenated into a single string and then split into chunks of the configured length. If specified, the metadata fields are stored as-is along with the embedded text chunks. The chunking process uses the LangChain Python library.

Metadata fields can only be used for filtering, not for retrieval, and must be of type string, number, boolean, or list of strings. All other values are ignored. Pinecone limits total metadata to 40 KB per record. The connector reserves approximately 10 KB for internal fields, leaving about 30 KB for user-defined metadata per entry.

When specifying text fields, you can access nested fields in the record by using dot notation, e.g. user.name will access the name field in the user object. It's also possible to use wildcards to access all fields in an object, e.g. users.*.name will access the name field in all entries of the users array.
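A rough sketch of how such paths could be resolved (a hypothetical helper for illustration, not the connector's code):

```python
def resolve(record: dict, path: str) -> list:
    """Resolve a dot-notation path against a record; '*' fans out over lists/dicts."""
    values = [record]
    for part in path.split("."):
        next_values = []
        for value in values:
            if part == "*":
                items = value if isinstance(value, list) else value.values()
                next_values.extend(items)
            elif isinstance(value, dict) and part in value:
                next_values.append(value[part])
        values = next_values
    return values

record = {"user": {"name": "Ada"}, "users": [{"name": "Ada"}, {"name": "Grace"}]}
resolve(record, "user.name")     # → ['Ada']
resolve(record, "users.*.name")  # → ['Ada', 'Grace']
```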

The chunk length is measured in tokens produced by the tiktoken library. The maximum is 8191 tokens, which is the maximum length supported by the text-embedding-ada-002 model.
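As a simplified illustration of length-bounded chunking (splitting on whitespace words here, whereas the connector counts real tiktoken tokens):

```python
MAX_TOKENS = 8191  # upper bound supported by text-embedding-ada-002

def chunk_by_tokens(text: str, chunk_length: int) -> list[str]:
    """Split text into chunks of at most chunk_length tokens.
    Whitespace words stand in for tiktoken tokens in this sketch."""
    assert chunk_length <= MAX_TOKENS
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_length])
            for i in range(0, len(tokens), chunk_length)]

chunk_by_tokens("one two three four five", 2)
# → ['one two', 'three four', 'five']
```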

The stream name gets added as a metadata field _ab_stream to each document. If available, the primary key of the record is used to identify the document to avoid duplications when updated versions of records are indexed. It is added as the _ab_record_id metadata field.
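Putting the two internal fields together, the metadata stored alongside a chunk might look like this (an illustrative structure under the assumptions above, not the exact wire format):

```python
def build_metadata(stream: str, record_id, user_metadata: dict) -> dict:
    """Attach the connector's internal fields to the user-defined metadata."""
    metadata = dict(user_metadata)
    metadata["_ab_stream"] = stream        # always present
    if record_id is not None:              # only when the stream has a primary key
        metadata["_ab_record_id"] = str(record_id)
    return metadata

entry = build_metadata("users", 42, {"country": "UK"})
# → {'country': 'UK', '_ab_stream': 'users', '_ab_record_id': '42'}
```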

Embedding

The connector can use one of the following embedding methods:

  1. OpenAI - Uses the OpenAI API to produce embeddings using the text-embedding-ada-002 model with 1536 dimensions. This integration is constrained by the OpenAI rate limits.

  2. Cohere - Uses the Cohere API to produce embeddings using the embed-english-light-v2.0 model with 1024 dimensions.

  3. Azure OpenAI - Uses an Azure-hosted OpenAI deployment. Requires your Azure endpoint URL and API key.

  4. OpenAI-compatible - Uses any API that implements the OpenAI embeddings interface. Configure a custom base URL to point to your preferred provider.

For testing purposes, you can use the Fake embeddings integration, which generates random embeddings with 1536 dimensions. This is suitable for testing a data pipeline without incurring embedding costs.
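The Fake option can be imitated in a few lines; this sketch just draws random values of the right dimensionality:

```python
import random

DIMENSIONS = 1536  # matches the Fake embedding method

def fake_embedding(dimensions: int = DIMENSIONS) -> list[float]:
    """Random vector: useful for exercising a pipeline, useless for search."""
    return [random.random() for _ in range(dimensions)]

vector = fake_embedding()
len(vector)  # → 1536
```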

Indexing

Before running the destination, use the Pinecone console or API to create an index. The index dimensions must match your embedding method:

| Embedding method | Dimensions |
| :--------------- | :--------- |
| OpenAI (text-embedding-ada-002) | 1536 |
| Cohere (embed-english-light-v2.0) | 1024 |
| Azure OpenAI | Depends on the deployed model |
| OpenAI-compatible | Depends on the model |
| Fake | 1536 |

All streams are indexed into the same index. The _ab_stream metadata field distinguishes between streams. The connector supports both serverless and pod-based indexes.
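Before the first sync, it is worth verifying that the index dimension matches the embedding method. A sketch of that check, mirroring the dimension table above (the mode names are illustrative):

```python
EXPECTED_DIMENSIONS = {
    "openai": 1536,  # text-embedding-ada-002
    "cohere": 1024,  # embed-english-light-v2.0
    "fake": 1536,
    # azure_openai / openai_compatible depend on the deployed model
}

def check_index(embedding_mode: str, index_dimension: int) -> None:
    """Raise if the pre-created index cannot hold the chosen embeddings."""
    expected = EXPECTED_DIMENSIONS.get(embedding_mode)
    if expected is not None and expected != index_dimension:
        raise ValueError(
            f"Index dimension {index_dimension} does not match "
            f"{embedding_mode} embeddings ({expected} dimensions)"
        )

check_index("openai", 1536)  # passes silently
```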

Namespace support

This destination supports namespaces.

Reference

Config fields reference

| Type | Property name |
| :--- | :------------ |
| object | embedding |
| object | indexing |
| object | processing |
| boolean | omit_raw_text |

Changelog

| Version | Date | Pull Request | Subject |
| :------ | :--- | :----------- | :------ |
| 0.1.48 | 2026-03-17 | 75170 | Add logging to check and write operations and fix __init__ bug |
| 0.1.47 | 2026-03-17 | 75136 | Emit TRACE error instead of LOG on write failure for proper error surfacing |
| 0.1.46 | 2025-10-21 | 68334 | Update dependencies |
| 0.1.45 | 2025-10-14 | 61096 | Update dependencies |
| 0.1.44 | 2025-05-17 | 57171 | Update dependencies |
| 0.1.43 | 2025-03-29 | 56630 | Update dependencies |
| 0.1.42 | 2025-03-22 | 56150 | Update dependencies |
| 0.1.41 | 2025-03-08 | 55400 | Update dependencies |
| 0.1.40 | 2025-03-01 | 54861 | Update dependencies |
| 0.1.39 | 2025-02-22 | 54255 | Update dependencies |
| 0.1.38 | 2025-02-15 | 53879 | Update dependencies |
| 0.1.37 | 2025-02-08 | 53434 | Update dependencies |
| 0.1.36 | 2025-02-01 | 52908 | Update dependencies |
| 0.1.35 | 2025-01-25 | 51762 | Update dependencies |
| 0.1.34 | 2025-01-11 | 51245 | Update dependencies |
| 0.1.33 | 2025-01-04 | 50904 | Update dependencies |
| 0.1.32 | 2024-12-28 | 50480 | Update dependencies |
| 0.1.31 | 2024-12-21 | 50203 | Update dependencies |
| 0.1.30 | 2024-12-14 | 49303 | Update dependencies |
| 0.1.29 | 2024-11-25 | 48654 | Update dependencies |
| 0.1.28 | 2024-11-05 | 48323 | Update dependencies |
| 0.1.27 | 2024-10-29 | 47106 | Update dependencies |
| 0.1.26 | 2024-10-12 | 46782 | Update dependencies |
| 0.1.25 | 2024-10-05 | 46474 | Update dependencies |
| 0.1.24 | 2024-09-28 | 46127 | Update dependencies |
| 0.1.23 | 2024-09-21 | 45791 | Update dependencies |
| 0.1.22 | 2024-09-14 | 45490 | Update dependencies |
| 0.1.21 | 2024-09-07 | 45247 | Update dependencies |
| 0.1.20 | 2024-08-31 | 45063 | Update dependencies |
| 0.1.19 | 2024-08-24 | 44669 | Update dependencies |
| 0.1.18 | 2024-08-17 | 44302 | Update dependencies |
| 0.1.17 | 2024-08-12 | 43932 | Update dependencies |
| 0.1.16 | 2024-08-10 | 43701 | Update dependencies |
| 0.1.15 | 2024-08-03 | 43134 | Update dependencies |
| 0.1.14 | 2024-07-27 | 42594 | Update dependencies |
| 0.1.13 | 2024-07-20 | 42243 | Update dependencies |
| 0.1.12 | 2024-07-13 | 41901 | Update dependencies |
| 0.1.11 | 2024-07-10 | 41598 | Update dependencies |
| 0.1.10 | 2024-07-09 | 41194 | Update dependencies |
| 0.1.9 | 2024-07-07 | 40753 | Fix a regression with AirbyteLogger |
| 0.1.8 | 2024-07-06 | 40780 | Update dependencies |
| 0.1.7 | 2024-06-29 | 40627 | Update dependencies |
| 0.1.6 | 2024-06-27 | 40215 | Replaced deprecated AirbyteLogger with logging.Logger |
| 0.1.5 | 2024-06-25 | 40430 | Update dependencies |
| 0.1.4 | 2024-06-22 | 40150 | Update dependencies |
| 0.1.3 | 2024-06-06 | 39148 | [autopull] Upgrade base image to v1.2.2 |
| 0.1.2 | 2023-05-17 | 38336 | Fix for regression: Custom namespaces not created automatically |
| 0.1.1 | 2023-05-14 | 38151 | Add airbyte source tag for attribution |
| 0.1.0 | 2023-05-06 | #37756 | Add support for Pinecone Serverless |
| 0.0.24 | 2023-04-15 | #37333 | Update CDK & pytest version to fix security vulnerabilities. |
| 0.0.23 | 2023-03-22 | #35911 | Bump versions to latest, resolves test failures. |
| 0.0.22 | 2023-12-11 | #33303 | Fix bug with embedding special tokens |
| 0.0.21 | 2023-12-01 | #32697 | Allow omitting raw text |
| 0.0.20 | 2023-11-13 | #32357 | Improve spec schema |
| 0.0.19 | 2023-10-20 | #31329 | Improve error messages |
| 0.0.18 | 2023-10-20 | #31329 | Add support for namespaces and fix index cleaning when namespace is defined |
| 0.0.17 | 2023-10-19 | #31599 | Base image migration: remove Dockerfile and use the python-connector-base image |
| 0.0.16 | 2023-10-15 | #31329 | Add OpenAI-compatible embedder option |
| 0.0.15 | 2023-10-04 | #31075 | Fix OpenAI embedder batch size |
| 0.0.14 | 2023-09-29 | #30820 | Update CDK |
| 0.0.13 | 2023-09-26 | #30649 | Allow more text splitting options |
| 0.0.12 | 2023-09-25 | #30649 | Fix bug with stale documents left on starter pods |
| 0.0.11 | 2023-09-22 | #30649 | Set visible certified flag |
| 0.0.10 | 2023-09-20 | #30514 | Fix bug with failing embedding step on large records |
| 0.0.9 | 2023-09-18 | #30510 | Fix bug with overwrite mode on starter pods |
| 0.0.8 | 2023-09-14 | #30296 | Add Azure embedder |
| 0.0.7 | 2023-09-13 | #30382 | Promote to certified/beta |
| 0.0.6 | 2023-09-09 | #30193 | Improve documentation |
| 0.0.5 | 2023-09-07 | #30133 | Refactor internal structure of connector |
| 0.0.4 | 2023-09-05 | 30086 | Switch to GRPC client for improved performance. |
| 0.0.3 | 2023-09-01 | #30079 | Fix bug with potential data loss on append+dedup syncing. 🚨 Streams using append+dedup mode need to be reset after upgrade. |
| 0.0.2 | 2023-08-31 | 29946 | Improve test coverage |
| 0.0.1 | 2023-08-29 | #29539 | Pinecone connector with some embedders |