My Data is Not Mine: The Emergence of Data Layers

Intermediate · 2/10/2025, 12:11:20 PM
Discussions around data ownership and privacy have intensified. Web3 data protocols like Vana, Ocean Protocol, and Masa are emerging, driving decentralized data sovereignty and enabling users to control and monetize their data, particularly in AI training and real-time data acquisition. These protocols offer new solutions for data trading and privacy protection, addressing the growing demand for high-quality data.

Data is the digital gold of an age where attention lives online. The global average screen time in 2024 stands at 6 hours and 40 minutes per day, up from previous years. In the United States, the average is even higher: 7 hours and 3 minutes daily.

With this level of engagement, the volume of data generated is staggering: an estimated 328.77 million terabytes are created every day in 2024. That works out to roughly 0.33 zettabytes (ZB) per day of newly generated, captured, copied, or consumed data.

Yet, despite the massive amounts of data being produced and consumed daily, users own very little of it:

  • Social Media: Data on platforms like Twitter, Instagram, and others is controlled by the companies, even though users generate it.
  • Internet of Things (IoT): Data from smart devices often belongs to the device manufacturer or service provider unless specific agreements state otherwise.
  • Health Data: While individuals have rights over their medical records, much of the data from health apps or wearables is controlled by the companies providing those services.

Crypto and Social Data

In crypto, we’ve seen the rise of @_kaitoai, which indexes social data on Twitter and translates it into actionable sentiment data for projects, KOLs, and thought leaders. The words “yap” and “mindshare” were popularized by the Kaito team thanks to their growth-hacking expertise (the popular mindshare and yapper dashboards) and their ability to attract organic interest on Crypto Twitter.

“Yap” aims to incentivize quality content creation on Twitter, but many questions remain unanswered:

  • How exactly are yaps being scored?
  • Do you get extra yaps for mentioning Kaito?
  • Does Kaito truly reward quality content, or does it favor controversial hot takes?

Beyond social data, discussions around data ownership, privacy, and transparency are heating up. With AI rapidly advancing, new questions emerge: Who owns the data used to train AI models? Who benefits from AI-generated outputs?

These questions set the stage for the rise of Web3 data layers—a shift toward user-owned, decentralized data ecosystems.

The Emergence of Data Layers

In Web3, there’s a growing ecosystem of data layers, protocols, and infrastructure focused on enabling personal data sovereignty—the idea of giving individuals more control over their data, with options to monetize it.

1. Vana

@vana’s core mission is to give users control over their data, particularly in the context of AI, where data is invaluable for training models.

Vana introduces DataDAOs, community-driven entities where users pool their data for collective benefit. Each DataDAO focuses on a specific dataset:

  • r/datadao: Focuses on Reddit user data, enabling users to control and monetize their contributions.
  • Volara: Deals with Twitter data, allowing users to benefit from their social media activity.
  • DNA DAO: Aimed at managing genetic data with privacy and ownership in mind.

Vana tokenizes data through Data Liquidity Pools (DLPs). Each DLP aggregates data for a specific domain, and users can stake tokens to these pools for rewards, with the top pools being rewarded based on community support and data quality.
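To make the pool mechanic concrete, here is a minimal sketch of splitting a reward emission across pools in proportion to stake weighted by a data quality score. The formula, weights, and pool figures are invented for illustration; they are not Vana's actual reward math.

```python
# Hypothetical reward split across data pools. The stake * quality
# weighting is an assumption made for illustration, not Vana's formula.
def pool_rewards(pools: dict[str, dict], emission: float) -> dict[str, float]:
    """Split an emission across pools proportionally to stake * quality."""
    weights = {name: p["stake"] * p["quality"] for name, p in pools.items()}
    total = sum(weights.values())
    return {name: emission * w / total for name, w in weights.items()}

# Invented example numbers for three pools.
pools = {
    "r/datadao": {"stake": 1_000, "quality": 0.9},
    "Volara":    {"stake": 600,   "quality": 0.8},
    "DNA DAO":   {"stake": 400,   "quality": 1.0},
}
print(pool_rewards(pools, emission=100.0))
```

The design intuition is that stake alone would let large pools dominate, so a quality multiplier lets a smaller pool with better data outrank a bigger one.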

What makes Vana stand out is its ease of contributing data. Users simply:

  1. Choose a DataDAO
  2. Pool their data directly via API integration or manually upload it
  3. Earn DataDAO tokens and $VANA as rewards
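The three steps above can be sketched as a simple contribution flow. All class names, reward rates, and fields here are hypothetical, chosen only to illustrate the choose-contribute-earn sequence; they are not Vana's actual API.

```python
# Hypothetical DataDAO contribution flow: pick a DAO, contribute records,
# receive DAO tokens plus a $VANA bonus. Names and rates are invented.
from dataclasses import dataclass, field

@dataclass
class DataDAO:
    name: str
    reward_per_record: float          # DAO tokens per accepted record (assumed)
    vana_bonus: float                 # $VANA bonus per batch (assumed)
    records: list = field(default_factory=list)

    def contribute(self, user: str, records: list[dict]) -> dict:
        """Accept a batch of records and compute the contributor's rewards."""
        self.records.extend(records)
        return {
            "user": user,
            "dao_tokens": len(records) * self.reward_per_record,
            "vana": self.vana_bonus,
        }

volara = DataDAO(name="Volara", reward_per_record=0.5, vana_bonus=1.0)
payout = volara.contribute("alice", [{"tweet_id": 1}, {"tweet_id": 2}])
print(payout)  # {'user': 'alice', 'dao_tokens': 1.0, 'vana': 1.0}
```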

2. Ocean Protocol

@oceanprotocol is a decentralized data marketplace that allows data providers to share, sell, or license their data, while consumers access it for AI and research.

Ocean Protocol uses “datatokens” (ERC-20 tokens) to represent access rights to datasets, allowing data providers to monetize their data while maintaining control over access conditions.
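A toy model of the datatoken idea: an ERC-20-style balance ledger where holding a dataset's datatoken grants access. This is a sketch of the concept only, not Ocean Protocol's actual contract logic, and the access rule (hold at least one token) is an assumption.

```python
# Illustrative datatoken: an ERC-20-like ledger gating dataset access.
# Not Ocean's real contracts; the >= 1 token access rule is assumed.
class Datatoken:
    def __init__(self, symbol: str):
        self.symbol = symbol
        self.balances: dict[str, int] = {}

    def mint(self, to: str, amount: int) -> None:
        self.balances[to] = self.balances.get(to, 0) + amount

    def transfer(self, frm: str, to: str, amount: int) -> None:
        if self.balances.get(frm, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[frm] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount

def can_access(token: Datatoken, user: str) -> bool:
    # Access rule (assumed): holding one datatoken unlocks the dataset.
    return token.balances.get(user, 0) >= 1

weather = Datatoken("WEATHER-DT")
weather.mint("provider", 100)
weather.transfer("provider", "buyer", 1)
print(can_access(weather, "buyer"))   # True
```

The point of the pattern is that the provider never hands over the data itself when selling; they sell transferable access rights, and standard token rails handle pricing and settlement.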

Types of data traded on Ocean:

  • Public Data: Open datasets like weather information, public demographics, or historical stock data—valuable for AI training and research.
  • Private Data: Medical records, financial transactions, IoT sensor data, or personalized user data—requires stringent privacy controls.

Compute-to-Data is another key feature of Ocean, allowing computations to be done on the data without moving it, ensuring privacy and security for sensitive datasets.
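The Compute-to-Data idea can be sketched as follows: the raw dataset stays with the provider, consumers submit a computation from an approved whitelist, and only the aggregate result leaves. The whitelist and class below are invented for illustration, not Ocean's implementation.

```python
# Sketch of Compute-to-Data: raw rows never leave the provider; only
# whitelisted aggregate computations run. Illustrative assumptions only.
import statistics

class ComputeToDataHost:
    APPROVED = {"mean", "count"}          # assumed whitelist of allowed jobs

    def __init__(self, private_rows: list[float]):
        self._rows = private_rows          # stays on the provider's side

    def run(self, job: str) -> float:
        if job not in self.APPROVED:
            raise PermissionError(f"computation '{job}' not approved")
        if job == "mean":
            return statistics.mean(self._rows)
        return float(len(self._rows))

host = ComputeToDataHost([120.0, 80.0, 95.0])
print(host.run("mean"))   # only the aggregate leaves, never the rows
```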

3. Masa

@getmasafi is focused on creating an open layer for AI training data, supplying real-time, high-quality, and low-cost data for AI agents and developers.

Masa has launched two subnets on the Bittensor network:

  • Subnet 42 (SN42): Aggregates and processes millions of data records daily, serving as a foundation for AI agent and application development.
  • Subnet 59 (SN59) – “AI Agent Arena”: A competitive environment where AI agents, powered by real-time data from SN42, compete for $TAO emissions based on performance metrics like mindshare, user engagement, and self-improvement.

Masa partnered with @virtuals_io, empowering Virtuals agents with real-time data capabilities. It also launched $TAOCAT, an agent showcasing these capabilities (currently on Binance Alpha).

4. Open Ledger

@OpenledgerHQ is building a blockchain specifically tailored for data, particularly for AI and ML applications, ensuring secure, decentralized, and verifiable data management.

Key Highlights:

  • Datanets: Specialized data sourcing networks within OpenLedger that curate and enrich real-world data for AI applications.
  • SLMs (Specialized Language Models): AI models tailored for specific industries or applications, aiming to be more accurate for niche use cases, privacy-compliant, and less prone to the biases found in general-purpose models.
  • Data Verification: Ensures the accuracy and trustworthiness of the data used to train SLMs, so the resulting models are reliable for their target use cases.

The Demand for Data for AI Training

The demand for high-quality data to fuel AI and autonomous agents is surging. Beyond initial training, AI agents require real-time data for continuous learning and adaptation.

Key challenges & opportunities:

  • Data Quality Over Quantity: AI models require high-quality, diverse, and relevant data to avoid bias or poor performance.
  • Data Sovereignty & Privacy: As seen with Vana, there’s a push for user-owned data monetization, which could reshape how AI training data is sourced.
  • Synthetic Data: With privacy concerns, synthetic data is gaining traction as a way to train AI models while mitigating ethical issues.
  • Market for Data: The rise of data marketplaces (centralized & decentralized) is creating an economy where data is a tradeable asset.
  • AI for Data Management: AI is now used to manage, clean, and enhance datasets, improving data quality for AI training.

As AI agents become more autonomous, their ability to access and process real-time, high-quality data will determine their effectiveness. This growing demand has led to the rise of AI agent-specific data marketplaces, where both humans and AI agents can tap into high-quality agent data.

Market for Web3 Agents Data

  • @cookiedotfun aggregates AI agent social sentiment and token-related data, transforming it into actionable insights for humans and AI agents.
  • Cookie DataSwarm API allows AI agents to access current, high-quality data for trading-related insights—one of the most sought-after use cases in crypto.
  • Cookie boasts 200K MAU & 20K DAU, making it one of the largest AI agent data marketplaces, with $COOKIE at the center.

Other key players:

  • @GoatIndexAI focuses on Solana ecosystem insights.
  • @Decentralisedco specializes in niche data dashboards like GitHub repositories & project-specific analytics.

Wrapping up Part 1

This is just the beginning. Part 2 will dive deeper into:

  • The evolving challenges and opportunities in the data economy
  • The role of synthetic data in AI training
  • Data privacy concerns and how they’re being addressed
  • The future of decentralized AI training

Who controls the data will shape the future, and the projects building within this sector will define how data is owned, shared, and monetized in the AI era. As demand for high-quality data continues to grow, the race to create a more transparent, user-owned data economy is only getting started.

Stay tuned for Part 2!

Personal Note: Thanks for reading! If you’re in Crypto AI and want to connect, feel free to shoot me a DM.

If you’d like to pitch a project, please use the form in my bio—it gets priority over DMs.

Full Disclaimer: This document is intended for informational & entertainment purposes only. The views expressed in this document are not, and should not be construed as, investment advice or recommendations. Recipients of this document should do their due diligence, taking into account their specific financial circumstances, investment objectives, and risk tolerance (which are not considered in this document) before investing. This document is not an offer, nor the solicitation of an offer, to buy or sell any of the assets mentioned herein.

Disclaimer:

  1. This article is reproduced from [X]. The copyright belongs to the original author [@Defi0xJeff]. If there are any objections to the reproduction, please contact the Gate Learn Team, and the team will process it as per the relevant procedures.
  2. Liability Disclaimer: The views and opinions expressed in this article are solely those of the author and do not constitute investment advice.
  3. The Gate Learn team translated the article into other languages. Copying, distributing, or plagiarizing the translated articles is prohibited unless mentioned.
