Pinecone

Overview

This page guides you through the process of setting up the Pinecone destination connector.

There are three parts to this:

  • Processing - Split individual records into chunks so they fit the embedding model's context window, and decide which fields to use as context and which are supplementary metadata.
  • Embedding - Convert the text into a vector representation using a pre-trained model.
  • Indexing - Store the vectors in a Pinecone index for similarity search.
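The three stages above can be sketched in a few lines of Python. This is a minimal illustration with made-up helper names, not the connector's actual code (which uses LangChain for chunking and the provider SDKs for embedding):

```python
def process(record: dict, text_fields: list[str], chunk_size: int) -> list[str]:
    """Concatenate the chosen text fields, then split into fixed-size chunks."""
    text = " ".join(str(record[f]) for f in text_fields)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Stand-in embedder: real providers return 1024- or 1536-dimension vectors."""
    return [[float(len(c))] for c in chunks]

def index(vectors: list[list[float]], chunks: list[str]) -> list[dict]:
    """Pair each vector with its source chunk, ready to upsert into an index."""
    return [{"values": v, "metadata": {"text": c}} for v, c in zip(vectors, chunks)]

record = {"title": "Pinecone destination", "body": "Splits, embeds, and indexes records."}
chunks = process(record, ["title", "body"], chunk_size=30)
entries = index(embed(chunks), chunks)
```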

Prerequisites

To use the Pinecone destination, you need:

  • A Pinecone account with a pre-created index. The index dimensions must match your chosen embedding method.
  • An account with API access for your chosen embedding provider (OpenAI, Cohere, Azure OpenAI, or an OpenAI-compatible service). Not required if using the Fake embeddings option for testing.

You need the following information to configure the destination:

  • Embedding service API Key - The API key for your embedding provider account.
  • Pinecone API Key - The API key for your Pinecone project. You can find this in the Pinecone console.
  • Pinecone Environment - The environment for your Pinecone project (for example, us-east-1-aws).
  • Pinecone Index name - The name of the Pinecone index to load data into.
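Put together, a destination configuration covering these fields might look like the following sketch. The property names shown here are indicative only; consult the config fields reference for the exact schema:

```json
{
  "embedding": {
    "mode": "openai",
    "openai_key": "YOUR_OPENAI_API_KEY"
  },
  "indexing": {
    "pinecone_key": "YOUR_PINECONE_API_KEY",
    "pinecone_environment": "us-east-1-aws",
    "index": "my-index"
  },
  "processing": {
    "chunk_size": 1000,
    "text_fields": ["title", "body"],
    "metadata_fields": ["author"]
  }
}
```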

Supported sync modes

| Sync mode | Supported? |
| :-------- | :--------- |
| Full Refresh - Overwrite | Yes |
| Full Refresh - Append | Yes |
| Full Refresh - Overwrite + Deduped | No |
| Incremental Sync - Append | Yes |
| Incremental Sync - Append + Deduped | Yes |

Data type mapping

All fields specified as metadata fields will be stored in the metadata object of the document and can be used for filtering. The following data types are allowed for metadata fields:

  • String
  • Number (integer or floating point; converted to a 64-bit floating-point number)
  • Booleans (true, false)
  • List of String

All other fields are ignored.
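The allowed-type rule can be sketched as a small filter (an illustration, not the connector's implementation): integers are widened to floats, lists are kept only if every element is a string, and everything else is dropped.

```python
def filter_metadata(metadata: dict) -> dict:
    """Keep only metadata values with Pinecone-compatible types."""
    result = {}
    for key, value in metadata.items():
        if isinstance(value, (bool, str)):        # check bool before int/float:
            result[key] = value                   # bool is a subclass of int
        elif isinstance(value, (int, float)):
            result[key] = float(value)            # numbers become 64-bit floats
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            result[key] = value
        # all other types (dicts, mixed lists, None, ...) are ignored
    return result

clean = filter_metadata({
    "title": "doc", "views": 7, "published": True,
    "tags": ["a", "b"], "nested": {"x": 1}, "mixed": ["a", 1],
})
# → {'title': 'doc', 'views': 7.0, 'published': True, 'tags': ['a', 'b']}
```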

Configuration

Processing

Each record is split into text fields and metadata fields as configured in the "Processing" section. All text fields are concatenated into a single string and then split into chunks of the configured length. If specified, the metadata fields are stored as-is along with the embedded text chunks. The chunking process uses the LangChain Python library.

Metadata fields can only be used for filtering, not for retrieval, and must be of type string, number, boolean, or list of strings. All other values are ignored. Pinecone limits total metadata to 40 KB per record. The connector reserves approximately 10 KB for internal fields, leaving about 30 KB for user-defined metadata per entry.

When specifying text fields, you can access nested fields in the record by using dot notation, e.g. user.name will access the name field in the user object. It's also possible to use wildcards to access all fields in an object, e.g. users.*.name will access the name field in all entries of the users array.
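A rough sketch of how such paths could be resolved (a hypothetical helper for illustration, not the connector's code):

```python
def resolve(record: dict, path: str) -> list:
    """Resolve a dot-notation path against a record; '*' fans out over lists/dicts."""
    values = [record]
    for part in path.split("."):
        next_values = []
        for value in values:
            if part == "*":
                items = value if isinstance(value, list) else value.values()
                next_values.extend(items)
            elif isinstance(value, dict) and part in value:
                next_values.append(value[part])
        values = next_values
    return values

record = {"user": {"name": "Ada"}, "users": [{"name": "Ada"}, {"name": "Grace"}]}
resolve(record, "user.name")     # → ['Ada']
resolve(record, "users.*.name")  # → ['Ada', 'Grace']
```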

The chunk length is measured in tokens produced by the tiktoken library. The maximum is 8191 tokens, which is the maximum length supported by the text-embedding-ada-002 model.
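As a simplified illustration of length-bounded chunking (splitting on whitespace words here, whereas the connector counts real tiktoken tokens):

```python
MAX_TOKENS = 8191  # upper bound supported by text-embedding-ada-002

def chunk_by_tokens(text: str, chunk_length: int) -> list[str]:
    """Split text into chunks of at most chunk_length tokens.
    Whitespace words stand in for tiktoken tokens in this sketch."""
    assert chunk_length <= MAX_TOKENS
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_length])
            for i in range(0, len(tokens), chunk_length)]

chunk_by_tokens("one two three four five", 2)
# → ['one two', 'three four', 'five']
```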

The stream name gets added as a metadata field _ab_stream to each document. If available, the primary key of the record is used to identify the document to avoid duplications when updated versions of records are indexed. It is added as the _ab_record_id metadata field.
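Putting the two internal fields together, the metadata stored alongside a chunk might look like this (an illustrative structure under the assumptions above, not the exact wire format):

```python
def build_metadata(stream: str, record_id, user_metadata: dict) -> dict:
    """Attach the connector's internal fields to the user-defined metadata."""
    metadata = dict(user_metadata)
    metadata["_ab_stream"] = stream        # always present
    if record_id is not None:              # only when the stream has a primary key
        metadata["_ab_record_id"] = str(record_id)
    return metadata

entry = build_metadata("users", 42, {"country": "UK"})
# → {'country': 'UK', '_ab_stream': 'users', '_ab_record_id': '42'}
```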

Embedding

The connector can use one of the following embedding methods:

  1. OpenAI - Uses the OpenAI API to produce embeddings using the text-embedding-ada-002 model with 1536 dimensions. This integration is constrained by the OpenAI rate limits.

  2. Cohere - Uses the Cohere API to produce embeddings using the embed-english-light-v2.0 model with 1024 dimensions.

  3. Azure OpenAI - Uses an Azure-hosted OpenAI deployment. Requires your Azure endpoint URL and API key.

  4. OpenAI-compatible - Uses any API that implements the OpenAI embeddings interface. Configure a custom base URL to point to your preferred provider.

For testing purposes, you can use the Fake embeddings integration, which generates random embeddings with 1536 dimensions. This is suitable for testing a data pipeline without incurring embedding costs.
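The Fake option can be imitated in a few lines; this sketch just draws random values of the right dimensionality:

```python
import random

DIMENSIONS = 1536  # matches the Fake embedding method

def fake_embedding(dimensions: int = DIMENSIONS) -> list[float]:
    """Random vector: useful for exercising a pipeline, useless for search."""
    return [random.random() for _ in range(dimensions)]

vector = fake_embedding()
len(vector)  # → 1536
```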

Indexing

Before running the destination, use the Pinecone console or API to create an index. The index dimensions must match your embedding method:

| Embedding method | Dimensions |
| :--------------- | :--------- |
| OpenAI (text-embedding-ada-002) | 1536 |
| Cohere (embed-english-light-v2.0) | 1024 |
| Azure OpenAI | Depends on the deployed model |
| OpenAI-compatible | Depends on the model |
| Fake | 1536 |

All streams are indexed into the same index. The _ab_stream metadata field distinguishes between streams. The connector supports both serverless and pod-based indexes.
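Before the first sync, it is worth verifying that the index dimension matches the embedding method. A sketch of that check, mirroring the dimension table above (the mode names are illustrative):

```python
EXPECTED_DIMENSIONS = {
    "openai": 1536,  # text-embedding-ada-002
    "cohere": 1024,  # embed-english-light-v2.0
    "fake": 1536,
    # azure_openai / openai_compatible depend on the deployed model
}

def check_index(embedding_mode: str, index_dimension: int) -> None:
    """Raise if the pre-created index cannot hold the chosen embeddings."""
    expected = EXPECTED_DIMENSIONS.get(embedding_mode)
    if expected is not None and expected != index_dimension:
        raise ValueError(
            f"Index dimension {index_dimension} does not match "
            f"{embedding_mode} embeddings ({expected} dimensions)"
        )

check_index("openai", 1536)  # passes silently
```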

Namespace support

This destination supports namespaces.

Reference

Config fields reference

| Type | Property name |
| :--- | :------------ |
| object | embedding |
| object | indexing |
| object | processing |
| boolean | omit_raw_text |

Changelog

| Version | Date | Pull Request | Subject |
| :------ | :--- | :----------- | :------ |
| 0.1.48 | 2026-03-17 | 75170 | Add logging to check and write operations and fix __init__ bug |
| 0.1.47 | 2026-03-17 | 75136 | Emit TRACE error instead of LOG on write failure for proper error surfacing |
| 0.1.46 | 2025-10-21 | 68334 | Update dependencies |
| 0.1.45 | 2025-10-14 | 61096 | Update dependencies |
| 0.1.44 | 2025-05-17 | 57171 | Update dependencies |
| 0.1.43 | 2025-03-29 | 56630 | Update dependencies |
| 0.1.42 | 2025-03-22 | 56150 | Update dependencies |
| 0.1.41 | 2025-03-08 | 55400 | Update dependencies |
| 0.1.40 | 2025-03-01 | 54861 | Update dependencies |
| 0.1.39 | 2025-02-22 | 54255 | Update dependencies |
| 0.1.38 | 2025-02-15 | 53879 | Update dependencies |
| 0.1.37 | 2025-02-08 | 53434 | Update dependencies |
| 0.1.36 | 2025-02-01 | 52908 | Update dependencies |
| 0.1.35 | 2025-01-25 | 51762 | Update dependencies |
| 0.1.34 | 2025-01-11 | 51245 | Update dependencies |
| 0.1.33 | 2025-01-04 | 50904 | Update dependencies |
| 0.1.32 | 2024-12-28 | 50480 | Update dependencies |
| 0.1.31 | 2024-12-21 | 50203 | Update dependencies |
| 0.1.30 | 2024-12-14 | 49303 | Update dependencies |
| 0.1.29 | 2024-11-25 | 48654 | Update dependencies |
| 0.1.28 | 2024-11-05 | 48323 | Update dependencies |
| 0.1.27 | 2024-10-29 | 47106 | Update dependencies |
| 0.1.26 | 2024-10-12 | 46782 | Update dependencies |
| 0.1.25 | 2024-10-05 | 46474 | Update dependencies |
| 0.1.24 | 2024-09-28 | 46127 | Update dependencies |
| 0.1.23 | 2024-09-21 | 45791 | Update dependencies |
| 0.1.22 | 2024-09-14 | 45490 | Update dependencies |
| 0.1.21 | 2024-09-07 | 45247 | Update dependencies |
| 0.1.20 | 2024-08-31 | 45063 | Update dependencies |
| 0.1.19 | 2024-08-24 | 44669 | Update dependencies |
| 0.1.18 | 2024-08-17 | 44302 | Update dependencies |
| 0.1.17 | 2024-08-12 | 43932 | Update dependencies |
| 0.1.16 | 2024-08-10 | 43701 | Update dependencies |
| 0.1.15 | 2024-08-03 | 43134 | Update dependencies |
| 0.1.14 | 2024-07-27 | 42594 | Update dependencies |
| 0.1.13 | 2024-07-20 | 42243 | Update dependencies |
| 0.1.12 | 2024-07-13 | 41901 | Update dependencies |
| 0.1.11 | 2024-07-10 | 41598 | Update dependencies |
| 0.1.10 | 2024-07-09 | 41194 | Update dependencies |
| 0.1.9 | 2024-07-07 | 40753 | Fix a regression with AirbyteLogger |
| 0.1.8 | 2024-07-06 | 40780 | Update dependencies |
| 0.1.7 | 2024-06-29 | 40627 | Update dependencies |
| 0.1.6 | 2024-06-27 | 40215 | Replaced deprecated AirbyteLogger with logging.Logger |
| 0.1.5 | 2024-06-25 | 40430 | Update dependencies |
| 0.1.4 | 2024-06-22 | 40150 | Update dependencies |
| 0.1.3 | 2024-06-06 | 39148 | [autopull] Upgrade base image to v1.2.2 |
| 0.1.2 | 2023-05-17 | 38336 | Fix for regression: Custom namespaces not created automatically |
| 0.1.1 | 2023-05-14 | 38151 | Add airbyte source tag for attribution |
| 0.1.0 | 2023-05-06 | #37756 | Add support for Pinecone Serverless |
| 0.0.24 | 2023-04-15 | #37333 | Update CDK & pytest version to fix security vulnerabilities. |
| 0.0.23 | 2023-03-22 | #35911 | Bump versions to latest, resolves test failures. |
| 0.0.22 | 2023-12-11 | #33303 | Fix bug with embedding special tokens |
| 0.0.21 | 2023-12-01 | #32697 | Allow omitting raw text |
| 0.0.20 | 2023-11-13 | #32357 | Improve spec schema |
| 0.0.19 | 2023-10-20 | #31329 | Improve error messages |
| 0.0.18 | 2023-10-20 | #31329 | Add support for namespaces and fix index cleaning when namespace is defined |
| 0.0.17 | 2023-10-19 | #31599 | Base image migration: remove Dockerfile and use the python-connector-base image |
| 0.0.16 | 2023-10-15 | #31329 | Add OpenAI-compatible embedder option |
| 0.0.15 | 2023-10-04 | #31075 | Fix OpenAI embedder batch size |
| 0.0.14 | 2023-09-29 | #30820 | Update CDK |
| 0.0.13 | 2023-09-26 | #30649 | Allow more text splitting options |
| 0.0.12 | 2023-09-25 | #30649 | Fix bug with stale documents left on starter pods |
| 0.0.11 | 2023-09-22 | #30649 | Set visible certified flag |
| 0.0.10 | 2023-09-20 | #30514 | Fix bug with failing embedding step on large records |
| 0.0.9 | 2023-09-18 | #30510 | Fix bug with overwrite mode on starter pods |
| 0.0.8 | 2023-09-14 | #30296 | Add Azure embedder |
| 0.0.7 | 2023-09-13 | #30382 | Promote to certified/beta |
| 0.0.6 | 2023-09-09 | #30193 | Improve documentation |
| 0.0.5 | 2023-09-07 | #30133 | Refactor internal structure of connector |
| 0.0.4 | 2023-09-05 | 30086 | Switch to GRPC client for improved performance. |
| 0.0.3 | 2023-09-01 | #30079 | Fix bug with potential data loss on append+dedup syncing. 🚨 Streams using append+dedup mode need to be reset after upgrade. |
| 0.0.2 | 2023-08-31 | 29946 | Improve test coverage |
| 0.0.1 | 2023-08-29 | #29539 | Pinecone connector with some embedders |