Growing the Cassandra Community

As the Apache Cassandra® community grows, it is important that we have a place to call home – a place where end users and contributors alike can find the resources they need to be successful with Cassandra. Whether success is enhancing how you use Cassandra to power your business, or contributing the next awesome feature to the project, each of us needs access to information and community support to achieve our goals with Cassandra.  End-users should always have the opportunity to learn new ways of using Cassandra, and contributors should have no barriers to getting involved with the project in the areas where they can offer innovation. With that in mind, we have made easy access to guiding resources a top priority for our community.

We’re excited to announce that today we have relaunched a revitalized Planet Cassandra community hub. Planet Cassandra is your central place to keep informed about the Cassandra project and find out how you can use and contribute to it. The platform aggregates content from a variety of online sources and brings documentation and learning resources into one place. The goal is to make it easier for new users to start using Cassandra, existing users to learn more, and contributors to have a greater impact.

We also want to make it easier for all users to connect and collaborate with each other. So, along with relaunching Planet Cassandra, we will also be exploring a few ways that we can do this effectively:

Update & Consolidate Documentation

We can all speak from experience that documentation, whether it’s technical and end-user docs or community and process docs, has a habit of becoming outdated with surprising regularity. To work toward the goal of providing always-current information to the community, public documentation for joining the Cassandra community will be reviewed and updated where necessary, with regularly scheduled reviews as we move forward. The documentation will also be updated to consolidate end-user and contributor information in the appropriate locations and ensure a single source of truth rather than disparate pages in different locations.

Audit & Clarify Community Platforms

We want to meet the community where they are – whether it be on Slack, Discord, Mailing Lists, or somewhere else. We will be taking a look at how the community is using each of these platforms, and with what frequency. From there, we will provide public guidance on how to use each platform. As needed, we will bolster support of new channels and move away from channels no longer serving us, to ensure we are able to easily connect with each other.

Host Regular Onboarding Meetings

Along with the clarification of the community platforms, new meetings will be added to the Cassandra community calendar. The purpose of these meetings will be to welcome and onboard new members and help them find their way as they get started. There will be separate meetings for end users and contributors, and they will be open to anyone who wishes to join, including experienced community members looking to renew their engagement.

These are just a few ideas – we want to hear from all of you! Please comment below with what you need to be successful with Cassandra.

Practitioner’s guide for using Cassandra as a real-time feature store

By Alan Ho

The following document describes best practices for using Apache Cassandra® / DataStax Astra DB as a real-time feature store. It covers primarily the performance and cost aspects of selecting a database for storing machine learning features. Real-time AI requires a database that supports both high-throughput, low-latency queries to serve features and high write throughput to update them. Under real-world conditions, Cassandra can serve features for real-time inference with a tp99 < 23 ms, and it is used by large companies such as Uber and Netflix for their feature stores.

This guide does not discuss the data science aspects of real-time machine learning, or the lifecycle management aspects of features in a feature store. The best practices we’ll cover are based on technical conversations with practitioners at large technology firms such as Google, Facebook, Uber, AirBnB, and Netflix on how they deliver real-time AI experiences to their customers on their cloud-native infrastructures. Although we’ll specifically focus on how to implement real-time feature storage with Cassandra, the architecture guidelines really apply to any database technology, including Redis, MongoDB, and Postgres. 

What is real-time AI?

Real-time AI makes inferences or trains models based on recent events. Traditionally, training models and making inferences (predictions) from them have been done in batch – typically overnight or periodically throughout the day. Today, modern machine learning systems perform inference on the most recent data in order to provide the most accurate prediction possible. A small set of companies like TikTok and Google have pushed the real-time paradigm further by including on-the-fly training of models as new data comes in.

Because of these changes in inference, and changes that will likely happen to model training, persistence of feature data – data that is used to train and perform inferences for a ML model – needs to also adapt. When you’re done reading this guide, you’ll have a clearer picture of how Cassandra and DataStax Astra DB, a managed service built on Cassandra, meets real-time AI needs, and how they can be used in conjunction with other database technologies for model inference and training.

What’s a feature store?

Life cycle of a feature store courtesy of the Feast blog

A feature store is a data system specific to machine learning (ML) that:

  • Runs data pipelines that transform raw data into feature values
  • Stores and manages the feature data itself, and
  • Serves feature data consistently for training and inference purposes

Main components of a feature store courtesy of the Feast blog

Real-time AI places specific demands on a feature store that Cassandra is uniquely qualified to fulfill, specifically when it comes to the storage and serving of features for model serving and model training.

Best Practices

Implement low latency queries for feature serving

For real-time inference, features need to be returned to applications with low latency at scale. Typical models involve ~200 features spread across ~10 entities. Real-time inference requires time to be budgeted for collecting features, lightweight data transformations, and performing the inference itself. According to the following survey (also confirmed by our conversations with practitioners), feature stores need to return features to an application performing inference in under 50 ms.

Typically, models require “inner joins” across multiple logical entities – combining row values from multiple tables that share a common key; this presents a significant challenge to low-latency feature serving. Take the case of Uber Eats, which predicts the time to deliver a meal. Order information needs to be joined with restaurant information, which is further joined with traffic information for the region around the restaurant. In this case, two inner joins are necessary (see the illustration below).

To achieve an inner join in Cassandra, one can either denormalize the data upon insertion or make two sequential queries to Cassandra and perform the join on the client side (see the sketch below). Although it’s possible to perform all inner joins upon inserting data into the database through denormalization, having a 1:1 ratio between model and table is impractical because it means maintaining an inordinate number of denormalized tables. Best practice is to allow for one to two sequential queries for inner joins, combined with denormalization.
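
Here is a minimal sketch of the “query twice and join on the client” pattern using the DataStax Python driver. The table and column names (orders, restaurants, the traffic score folded into the restaurant row) are hypothetical stand-ins for the Uber Eats example, not a real schema.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # or an Astra cloud config
session = cluster.connect("feature_store")

def fetch_delivery_features(order_id):
    # Query 1: order-level features, which carry the key of the next entity.
    order = session.execute(
        "SELECT restaurant_id, basket_size, prep_time FROM orders WHERE order_id = %s",
        (order_id,),
    ).one()

    # Query 2: restaurant-level features. The regional traffic score is assumed to be
    # denormalized into this row, so a third query (a second join) is avoided.
    restaurant = session.execute(
        "SELECT avg_prep_time, region_traffic_score FROM restaurants WHERE restaurant_id = %s",
        (order.restaurant_id,),
    ).one()

    # The "inner join" happens client-side: merge both rows into one feature vector.
    return {
        "basket_size": order.basket_size,
        "prep_time": order.prep_time,
        "avg_prep_time": restaurant.avg_prep_time,
        "region_traffic_score": restaurant.region_traffic_score,
    }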

Here is a summary of the performance metrics that can be used to estimate requirements for real-time ML pipelines:

  • Testing Conditions:
    • # features = 200
    • # of tables (entities) = 3
    • # inner join = 2 
    • Query TPS : 5000 queries / second
    • Write TPS : 500 records / second
    • Cluster Size : 3 nodes on AstraDB*
  • Latency Performance summary (uncertainties here are standard deviations):
    • tp95 = 13.2(+/-0.6) ms
    • tp99 = 23.0(+/-3.5) ms
    • tp99.9 = 63(+/- 5) ms
    • Effect of compaction:
      • tp95 = negligible
      • tp99, tp999 = negligible, captured by the sigmas quoted above
    • Effect of Change Data Capture (CDC):
      • tp50, tp95 ~ 3-5 ms
      • tp99 ~ 3 ms
      • tp999 ~ negligible

* The following tests were done on DataStax’s Astra DB’s free tier, which is a serverless environment for Cassandra. Users should expect similar latency performance when deployed on 3 nodes using the following recommended settings.

The most significant impact on latency is the number of inner joins. If only one table is queried instead of three, the tp99 drops by 58%; for two tables, it is 29% less. The tp95 drops by 56% and 21% respectively. Because Cassandra is horizontally scalable, querying for more features does not significantly increase the average latency, either.

Lastly, if the latency requirements cannot be met out of box, Cassandra has two additional features: the ability to support denormalized data (and thus reduce inner joins) due to high write-throughput capabilities, and the ability to selectively replicate data to in-memory caches (e.g. Redis) through Change Data Capture. You can find more tips to reduce latency here.

Implement fault tolerant and low latency writes for feature transformations

A key component of real-time AI is the ability to use the most recent data for doing a model inference, so it is important that new data is available for inference as soon as possible. At the same time, for enterprise use cases, it is important that the writes are durable because data loss can cause significant production challenges.

Suggested deployment architecture to enable low-latency feature transformation for inference

*Object store (e.g. S3 or HIVE) can be replaced with other types of batch oriented systems such as data warehouses.

There is a trade off between low latency durable writes and low latency feature serving. For example, it is possible to only store the data in a non-durable location (e.g. Redis), but production failures can make it difficult to recover the most up-to-date features because it would require a large recomputation from raw events.

A common architecture suggests writing features to an offline store (e.g. Hive / S3), and replication of the features to an online store (e.g. in-memory cache). Even though this provides durability and low latency for feature serving, it comes at the cost of introducing latency for feature writes, which invariably causes poorer prediction performance.

Databricks Reference Architecture for Real-time AI

Cassandra provides a good trade-off between low-latency feature serving and low-latency “durable” feature writes. Data written to Cassandra is typically replicated a minimum of three times, and it supports multi-region replication. The latency from writing to availability to read is typically sub-millisecond. As a result, by persisting features directly to the online store (Cassandra) and bypassing the offline store, the application has faster access to recent data to make more accurate predictions. At the same time, CDC from the online store to the offline store allows for batch training or data exploration with existing tools.
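
As a rough illustration of writing features directly to the online store, here is a minimal sketch using the DataStax Python driver; the features_by_entity table and its columns are assumptions for the example, and the CDC fan-out to the offline store is not shown.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feature_store")

# Prepared statement: low-latency, durable writes (replicated per the keyspace RF).
insert_feature = session.prepare(
    "INSERT INTO features_by_entity (entity_id, feature_name, value, updated_at) "
    "VALUES (?, ?, ?, toTimestamp(now()))"
)

def write_features(entity_id, features):
    # One lightweight write per feature; recent values become readable almost immediately.
    for name, value in features.items():
        session.execute(insert_feature, (entity_id, name, value))

write_features("user:42", {"orders_last_hour": 3.0, "avg_basket_value": 27.5})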

Implement low-latency writes for prediction caching and performance monitoring

In addition to storing feature transformations, there is also a need to store predictions and other tracking data for performance monitoring.

There are several use cases for storing predictions:

  1. Prediction store – In this scenario, a database is used to cache predictions made by either a batch system or a streaming system. The streaming architecture is particularly useful when the time it takes to perform an inference is beyond what is acceptable in a request-response system.
  2. Prediction performance monitoring – It is often necessary to monitor the prediction output of a real-time inference and compare it to the final outcome. This means having a database to log both the prediction and the final result.

Cassandra is a suitable store for both use cases because of its high-write throughput capabilities. 
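
A minimal sketch of both patterns with the Python driver, assuming hypothetical prediction_cache and prediction_log tables:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ml_monitoring")

# 1. Prediction store: cache the latest prediction per entity for fast request-response reads.
session.execute(
    "INSERT INTO prediction_cache (entity_id, model_version, prediction, predicted_at) "
    "VALUES (%s, %s, %s, %s)",
    ("user:42", "v7", 0.83, datetime.now(timezone.utc)),
)

# 2. Performance monitoring: log the prediction alongside the eventual outcome for comparison.
session.execute(
    "INSERT INTO prediction_log (entity_id, predicted_at, prediction, actual) "
    "VALUES (%s, %s, %s, %s)",
    ("user:42", datetime.now(timezone.utc), 0.83, 1.0),
)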

Plan for elastic read and write workloads

The level of query and write transactions per second usually depends on the number of users simultaneously using the system. As a result, workloads may change based on the time of day or time of year. Having the ability to quickly scale up and scale down the cluster to support increased workloads is important. Cassandra and Astra DB have features that enable dynamic cluster scaling.

The second aspect that can affect write workloads is a change in the feature transformation logic. With a large spike in write workloads, Cassandra automatically prioritizes maintaining low-latency queries and write TPS over data consistency, which is typically acceptable for performing real-time inference.

Implement low-latency, multi-region support

As real-time AI becomes ubiquitous across all apps, it’s important to make sure that feature data is available as close as possible to where inference occurs. This means having the feature store in the same region as the application doing inference. Replicating the feature store across regions helps ensure that features are available wherever inference happens. Furthermore, replicating just the features rather than the raw data used to generate them significantly cuts down on cloud egress fees.

Astra DB supports multi-region replication out of the box, with a replication latency in the milliseconds. Our recommendation is to stream all the raw event data to a single region, perform the feature generation, and store and replicate the features to all other regions.

Although one can theoretically gain some latency advantage by generating features in each region, event data often needs to be joined with raw event data from other regions; from a correctness and efficiency standpoint, it is easier to ship all events to one region for processing in most use cases. On the other hand, if the model is used in a regional context and most events are associated with region-specific entities, it makes sense to treat features as region-specific. Any events that do need to be replicated across regions can be placed in keyspaces with global replication strategies (see the sketch below), but ideally this should be a small subset of events. At a certain point, replicating event tables globally becomes less efficient than simply shipping all events to a single region for feature computation.
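
For illustration, here is how region-local and globally replicated keyspaces might be declared; the datacenter names (us_east, eu_west) and replication factors are placeholders for your own topology, not a recommendation.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Features computed, stored, and served within a single region.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS features_us_east
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3}
""")

# The small subset of events/features that genuinely needs to be available everywhere.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS features_global
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 3}
""")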

Plan for cost-effective and low-latency multi-cloud support

Multi-cloud support increases the resilience of applications and allows customers to negotiate lower prices. Single-cloud online stores such as DynamoDB not only result in increased latency for retrieving features and significant data egress costs, but also create lock-in to a single cloud vendor.

Open source databases that support replication across clouds provide the best balance of performance and cost. To minimize egress costs, events and feature generation should be consolidated into one cloud, and feature data should then be replicated to open source databases running in the other clouds.

Plan for both batch and real-time training of production models

Suggested deployment architecture to enable low-latency feature transformation for inference

Batch processing infrastructure for building models is used for two purposes: building and testing new models, and building models for production. It has therefore typically been sufficient to store feature data in slower object stores for training. However, newer training paradigms update models in real time or near real time; this is known as “online learning” (e.g. TikTok’s Monolith). The access pattern for real-time training sits somewhere between inference and traditional batch training: the throughput requirements are higher than for inference (because training does not usually access a single row at a time), but not as high as for batch processing involving full table scans.

Cassandra can support throughput in the hundreds of thousands of transactions per second (with an appropriate data model), which is enough for most real-time training use cases. If the user instead wants to feed real-time training from an object store, Cassandra can replicate the data there through CDC to object storage. For batch training, CDC should likewise replicate data to object storage. It’s worth noting that machine learning frameworks like TensorFlow and PyTorch are particularly optimized for parallel training of ML models from object storage.

For a more detailed explanation of “online learning,” see Chip Huyen’s explanation on Continual Learning, or this technical paper from Gomes et al.

Support for Kappa architecture

Kappa architecture is gradually replacing Lambda architecture because of the cost of maintaining two pipelines and the data quality issues caused by online/offline skew. Although many articles discuss the advantages of moving from separate batch and real-time computation layers to a single real-time layer, they rarely describe how to architect the serving layer.

Using Kappa architecture for generating features brings up some new considerations:

  1. Features are updated en masse, which can result in a significant number of writes to the database. It’s important to ensure that query latency does not suffer during these large updates.
  2. The serving layer still needs to support different types of queries, including low-latency queries for inference and high-TPS queries for batch training of models.

Cassandra supports Kappa architecture in the following ways:

  • Cassandra is designed for writes; an increased influx of writes does not significantly reduce query latency. Cassandra opts to process writes with eventual consistency instead of strong consistency, which is typically acceptable for making predictions.
  • Using CDC, data can be replicated to object storage for training and in-memory storage for inference. CDC has little impact on the latency of queries to Cassandra.

Support for Lambda architecture

Most companies have a Lambda architecture, with a batch layer pipeline that’s separate from the real-time pipeline. There are several categories of features in this scenario:

  1. Features that are only computed in real time, and replicated to the batch feature store for training
  2. Features that are only computed in batch, and are replicated to the real-time feature store 
  3. Features that are computed in real time first, then recomputed in batch. Discrepancies are then updated in both the real-time feature store and the object store.

In this scenario, however, DataStax recommends the architecture as described in this illustration:

The reasons are the following:

  1. Cassandra is designed to take batch uploads of data with little impact on read latency
  2. By having a single system of record, data management becomes significantly easier than if the data is split between the feature store and object store. This is especially important for features that are first computed in real time, then recomputed in batch.
  3. When exporting data from Cassandra via CDC to the object feature store, the data export can be optimized for batch training (a common pattern used at companies like Facebook), which significantly cuts down on training infrastructure costs.

If it is not possible to update existing pipelines, or there are specific reasons that the features need to be in the object store first, our recommendation is to go with a two-way CDC path between the Cassandra feature store and the object store, as illustrated below.

Ensure compatibility with existing ML software ecosystem

To use Cassandra as a feature store, it should be integrated with two portions of the ecosystem: machine learning libraries that perform inference and training, and data processing libraries that perform feature transformation.

The two most popular frameworks for machine learning are TensorFlow and PyTorch. Cassandra has Python drivers that enable easy retrieval of features from the database, and multiple features can be fetched in parallel (see this example code and the sketch below). The two most popular frameworks for performing feature transformation are Flink and Spark Structured Streaming, and Cassandra connectors are available for both. Practitioners can use tutorials for Flink and Spark Structured Streaming with Cassandra.
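
As a rough sketch of parallel feature retrieval with the Python driver’s asynchronous API (the table and key names below are hypothetical):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feature_store")

queries = {
    "user":       ("SELECT * FROM user_features WHERE user_id = %s",             ("user:42",)),
    "restaurant": ("SELECT * FROM restaurant_features WHERE restaurant_id = %s", ("rest:7",)),
    "region":     ("SELECT * FROM region_features WHERE region_id = %s",         ("sfo",)),
}

# Issue every query without blocking, then collect the results as they complete.
futures = {name: session.execute_async(cql, params) for name, (cql, params) in queries.items()}
feature_rows = {name: future.result().one() for name, future in futures.items()}

# feature_rows can now be flattened into the input tensor for TensorFlow or PyTorch.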

Open source feature stores such as Feast also have a connector and tutorial for Cassandra.

Understand query patterns and throughput to determine costs

Various models of real-time inference courtesy of Swirl.ai

The number of read queries to Cassandra as a feature store depends on the number of incoming inference requests. If the feature data is split across multiple tables that can be queried in parallel, this gives an estimate of the fan-out between real-time inferences and database queries. For example, 200 features across 10 entities in 10 separate tables gives roughly a 1:10 ratio between real-time inferences and queries to Cassandra.

Calculating the number of inferences being performed depends on the inference traffic pattern. For example, in the case of “streaming inference,” an inference is performed whenever a relevant feature changes, so the total number of inferences depends on how often the feature data changes. When inference is performed in a “request-reply” setting, it is only performed when a user requests it.

Understand batch and realtime write patterns to determine costs

Write throughput is primarily driven by how frequently the features change. If denormalization is used, this also multiplies the number of feature rows written. Other write-throughput considerations include caching inferences for batch or streaming inference scenarios. The sketch below combines these read and write factors into a rough estimate.
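
A back-of-the-envelope sketch, using the 1:10 inference-to-query ratio from the example above; every input value is an assumption to replace with your own traffic numbers.

# Rough sizing estimate for a Cassandra feature store.
inference_tps = 500            # inferences per second (request-reply or streaming)
tables_per_inference = 10      # entities/tables read per inference (the 1:10 fan-out above)
feature_update_tps = 50        # upstream feature changes per second
denormalization_factor = 3     # extra rows written per change due to denormalized tables
cache_predictions = True       # whether predictions are written back for caching

read_qps = inference_tps * tables_per_inference
write_qps = feature_update_tps * denormalization_factor + (inference_tps if cache_predictions else 0)

print(f"Estimated read QPS:  {read_qps}")   # 5000 with the numbers above
print(f"Estimated write QPS: {write_qps}")  # 650 with the numbers above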

Conclusion

When designing a real-time ML pipeline, special attention needs to be paid to the performance and scalability of the feature store. These requirements are particularly well satisfied by NoSQL databases such as Cassandra. Stand up your own feature store with Cassandra or Astra DB and try out Feast.dev with the Cassandra connector.

New Cassandra Catalyst Program Launched Today

Today the Apache Cassandra community launched the first-ever Cassandra Catalyst program, an effort that aims to recognize individuals who invest in the growth of the Apache Cassandra community by enthusiastically sharing their expertise, encouraging participation, and creating a welcoming environment.

Catalysts are trustworthy, expert contributors with a passion for connecting and empowering others with Cassandra knowledge. They must be able to demonstrate strong knowledge of Cassandra through production deployments, educational material, conference talks, or other contributions.

Read the blog, and visit Cassandra Catalyst Program to learn more and nominate someone or apply.

6 Reasons Why You Can’t Miss This Year’s Cassandra Summit 

We’re less than 2 weeks away from Cassandra Summit, the annual event that unites industry experts, developers, and enthusiasts to delve into the world of Apache Cassandra. Taking place in San Jose, California December 12-13, this year’s event will also introduce AI.dev, a nexus for developers delving into the realm of open source generative AI and machine learning. 

If you’re not sure whether to attend or need to convince your boss, here are some key reasons Cassandra Summit shouldn’t be missed: 

#1 Learning Opportunities

The Cassandra Summit is a goldmine of knowledge for both beginners and seasoned professionals. With a diverse range of sessions, workshops, and keynotes, attendees have the chance to deepen their understanding of Apache Cassandra. Whether you’re looking to enhance your development skills, explore advanced topics, or gain insights into real-world use cases, the Cassandra Summit schedule is packed with content for you. 

#2 Networking with Peers and Experts

The Cassandra Summit provides a forum for engaging in discussions with peers and learning from Apache Cassandra experts. There is no better way to share knowledge and spur innovation than face-to-face at a conference like Cassandra Summit.

#3 Stay Ahead of Industry Trends 

Technology is ever-evolving, and staying ahead of the curve is crucial for professionals in the field. The Cassandra Summit provides a platform to discover the latest trends, innovations, and best practices in the world of Apache Cassandra. Whether it’s exploring new features, understanding emerging use cases, or learning about integrations with other technologies, the Cassandra Summit offers a comprehensive view of the current state and future direction of Cassandra.

#4 Jam Packed Schedule 

The schedule for Cassandra Summit is filled with nearly 60 sessions hosted by speakers from companies like Amazon Web Services, Apple, Hugging Face, LlamaIndex, Microsoft, Netflix and many others. Here’s a snapshot of some of the great speakers Cassandra Summit offers: 

#5 Community Engagement and Support 

Being part of a community is integral to the success of any open-source project, and Apache Cassandra is no exception. The Cassandra Summit fosters community engagement, allowing attendees to connect with others who share a passion for Cassandra. This sense of community support is invaluable—whether you’re troubleshooting a challenge, seeking advice, or simply looking to share your experiences, the connections made at Cassandra Summit can become long-lasting.

#6 NEW AI.dev Conference

The new AI.dev: Open Source GenAI & ML Summit 2023 conference will be co-located with this year’s Cassandra Summit. This means that Cassandra Summit will welcome an expanded audience that includes developers who are delving into the realm of open source generative AI and machine learning. Cassandra Summit and AI.dev will run simultaneously, and attendees will have access to both events with a single registration. So whether you’ve already registered or are planning to register, you’ll gain access to both events for one price.

Cassandra Summit is where the community can connect to share best practices and use cases, celebrate makers and users, forge critical relationships, and learn about advancements in Apache Cassandra. It’s more than a conference; it’s an immersive experience that offers a wealth of knowledge, networking opportunities, and hands-on learning. 

Don’t miss the chance to be a part of this year’s event; register now!

Cassandra Summit Preview: Guerilla Tactics for Building Scalable E-Commerce Services with Apache Cassandra®, Apache Pulsar®, and Vector Search

Aaron Ploetz is a developer advocate at DataStax. He’s been a professional software developer since 1997 and has several years of experience working on and leading DevOps teams for startups and Fortune 50 enterprises. He is a three-time Cassandra MVP, and has worked as an author on the books “Seven NoSQL Databases in a Week” and “Mastering Apache Cassandra 3.x.” 

The importance of a good e-commerce website is illustrated by the fact that worldwide digital sales in 2021 eclipsed five trillion dollars (USD). Most consumers will leave a web page or a mobile app if it takes longer than a few seconds to load. Businesses that want to compete need a high-performing e-commerce website.

At Cassandra Summit this December, Aaron will host a talk titled, “Guerilla Tactics for Building Scalable E-Commerce Services with Apache Cassandra®, Apache Pulsar®, and Vector Search.” During his session, Aaron will cover how to architect high-performing data models and services, helping you to build an e-commerce site with high throughput and low latency. The datastore backend will be built on Apache Cassandra®, allowing you to leverage Cassandra’s well-known features like high-availability and data center awareness.

Modern E-commerce websites consist of many smaller subsystems, as shown in figure 1. These subsystems are really just smaller pieces of a puzzle that, when put together correctly, are capable of working together to provide a positive customer experience.

Figure 1 – The many subsystems which go into building a large-scale E-commerce website are smaller pieces of a larger whole.

However, not all of these subsystems are created equally. Some of them are “nice-to-haves,” which can help maintain customer loyalty or drive additional sales. But some are critical to the function of the site overall, and are absolutely essential for all E-commerce sites to have. These critical systems are Product, User Profile, Shopping Cart, and Order Processing. Aaron will walk through each of these systems during his talk: 

Product 

Every e-commerce site needs a good product system. Aaron will talk about how to model that in the database, with the aim of making it easy for customers to find what they are looking for. The product system will be further broken down into three smaller services: Category Navigation, Product Data, and Pricing.

User Profile

All websites need systems for user management and sign-in. Aaron will also talk through how to implement a seamless login process using Google single sign-on, helping to ensure a customer experience that is both convenient and secure.

Shopping Cart

Every E-commerce site has a shopping cart, but you want yours to be high-performing and easily expandable in the future. During his talk, Aaron will present ways to model the cart system to accommodate both of these requirements. This way the shopping cart runs well in a distributed environment, while also being highly-available and geographically aware.

Order Processing

Processing an order can be a complex task, especially at the enterprise level. Aaron will cover how to move orders between different business units using Apache Pulsar®.

In addition to covering how to build out and model these systems, we will also cover one of the “nice-to-have” subsystems, and demonstrate a simple way to build a recommendation system.

Product Recommendations

Want to drive additional sales by recommending valid products to your customers? Legacy recommendation systems can be large, complex, and cumbersome. But we can show you how to quickly build real-time recommendations using Vector Search. We will also cover ways to generate vector embeddings.

Conclusions

In the digital age, having a solid, functional, and flexible e-commerce system is paramount to success. This session will show you how to architect these systems to take advantage of Apache Cassandra®’s features in the distributed database world. You will be well on your way to an e-commerce website that is high-performing, highly available, and ultimately highly profitable.

Aaron Ploetz will discuss this topic in more detail during one of his sessions at Cassandra Summit. A shortened, workshop version of this talk can also be found here. More details about Aaron’s Cassandra Summit sessions can be found here.

More about Cassandra Summit

Cassandra Summit 2023 Gains Ai.dev as Co-located Event; NEW AI + Cassandra Track

We are excited to announce that the new AI.dev: Open Source GenAI & ML Summit 2023 conference will be co-located with Cassandra Summit this year! This means that Cassandra Summit will welcome an expanded audience that includes developers who are delving into the realm of open source generative AI and machine learning. 

And with the addition of AI.dev, a NEW AI + Cassandra track will be featured at the event. The Call for Proposals is open until 9:00 AM PDT on Monday, October 23.

Here’s what you need to know: 

WHEN + WHERE IS THIS HAPPENING?: Cassandra Summit + AI.dev will take place December 12-13, 2023 at the McEnery Convention Center in San Jose, California.

WHO SHOULD ATTEND?: data practitioners, developers, engineers and enthusiasts + developers who are interested in open source generative AI and machine learning. 

WHAT ARE THE CFP DETAILS? The CFP for the new AI + Cassandra track is now open. This track will include lightning talks, conference sessions, panel sessions and technical workshops that delve into distributed AI using Cassandra and case studies that cover AI-powered applications using Apache Cassandra. Submit a talk today! 

HOW DO I REGISTER? Cassandra Summit and AI.dev will run simultaneously, and attendees will have access to both events with a single registration. So whether you’ve already registered or are planning to register, you’ll gain access to both events for one price. To learn more or to register, visit https://events.linuxfoundation.org/cassandra-summit/register/

Cassandra Summit is where the community can connect to share best practices and use cases, celebrate makers and users, forge critical relationships, and learn about advancements in Apache Cassandra. With the addition of AI.dev, we are excited to expand the community’s flagship event and include talks that showcase how AI and Cassandra synergize, unlocking new possibilities and enhancing data-driven solutions. 

We hope to see you soon!

Showcasing the Power of Apache Cassandra: Four Must-Attend Sessions at Community Over Code

This weekend the Apache Software Foundation’s flagship conference, Community Over Code, will kick off. The four-day, in-person event will bring the ASF and the broader open source community together in Halifax, Nova Scotia from October 7-10. We’re excited that the Apache Cassandra community will be among the projects represented, with four talks included in the event schedule.

These talks highlight some of the key features of the forthcoming 5.0 release and underscore how a powerful tool like Cassandra can be used in IoT workloads: 

Adding Vector Search to Apache Cassandra

Speaker: Jonathan Ellis

Saturday, October 7, 2023, 12:10 ADT

Vector search is a hot topic in the world of databases, and Jonathan Ellis, the founder of DataStax and former Apache Cassandra project chair, will shed light on its implementation in Cassandra. This session will explore the fundamentals of k-Nearest Neighbors (kNN) and Approximate Nearest Neighbors (ANN) vector search, introducing the Hierarchical, Navigable Small-World (HNSW) algorithm for vector indexing. You’ll gain insights into the challenges and solutions involved in adapting HNSW to Cassandra, including concurrent updates and queries. Witness the execution of supported queries with the HNSW index and other storage-attached index (SAI) predicates, and learn valuable lessons to enhance performance. As an application developer, understanding Cassandra’s vector search capabilities is crucial in today’s data-driven landscape.

Unified Compaction Strategy in Cassandra (CEP-26)

Speaker: Branimir Lambov

Sunday, October 8, 2023, 12:10 ADT

Cassandra 5.0 is set to revolutionize compaction strategies with CEP-26, offering a unified solution to address existing strategy deficiencies, improve performance, and facilitate easy reconfiguration. In this session, Branimir Lambov, a long-term Cassandra committer, will dive deep into the key features of this strategy and the rationale behind them. Learn how this strategy covers leveled, tiered, and hybrid compaction schemes, employs a flexible SSTable sharding scheme, and selects and prioritizes SSTable sets for compaction based on overlap. Whether you’re dealing with large-scale data or time-series data, CEP-26 has you covered. Discover real-world examples of its impressive performance improvements and the possibilities it unlocks.

IoT Overkill: Running a Cassandra and Kafka cluster on Open Source Hardware

Speaker: Kassian Wren

Sunday, October 8, 2023, 14:20 ADT

Open source hardware meets open source software in this session by Kassian Wren, an Open Source Technology Evangelist. Dive into the world of open source clusters, featuring a unique five-node configuration with Raspberry Pi and Orange Pi nodes. Witness the orchestration of a Docker swarm that runs Cassandra and Kafka services, distributed across worker nodes. This session isn’t just about showcasing the cluster but also delves into automation, setup, and maintenance. If you’re passionate about IoT projects and the intersection of hardware and software, this session promises to be a fascinating journey into the possibilities of open source technology.

Performance Measurement and Tuning of Cassandra 5.0 Transactions on Cloud Infrastructure

Speakers: German Eichberger and Pallavi Iyengar

Tuesday, October 10, 2023, 14:20 ADT

Cassandra 5.0 introduces transaction support based on ACCORD, necessitating new benchmarks for distributed transactional databases. German Eichberger and Pallavi Iyengar from Microsoft’s Azure Managed Instances for Apache Cassandra team will explore this topic in detail. Learn about benchmark scenarios inspired by YCSB+T’s Closed Economy Workload and delve into the challenges of cloud environments. Understand the impact of network topologies, including one-region and multi-region clusters, and discover performance-enhancing techniques like SSD-based write-through cache. Gain insights into tuning Cassandra 5.0 for optimal performance in different scenarios and compare it with previous Cassandra versions.

To see the full Community Over Code schedule, visit https://communityovercode.org/schedule.

To learn more about The ASF’s Community Over Code and register to attend, visit https://communityovercode.org

Cassandra Summit Preview: The Art and Science of Cassandra Performance Tuning

Database tuning can be intimidating if you don’t know where to start, and it’s even harder when it’s a distributed database like Cassandra. Using Cassandra’s out of the box configuration works well for laptops but falls far short of its total potential in production. Cassandra’s distributed nature presents a unique challenge when it comes to understanding its performance bottlenecks and intricacies. 

Apache Cassandra committer Jon Haddad is hosting a talk on Cassandra performance tuning at Cassandra Summit which takes place Dec. 12-13 in San Jose, California. Drawing from a decade of hands-on experience tuning some of the world’s biggest clusters in streaming media, banking, and gaming, Jon will share some of the most important lessons he has learned to help Apache Cassandra users minimize their query latency. The talk will include real-world anecdotes, battle-tested strategies, and time-honed insights including:

  • How to harness the power of observability and proactively identify and address potential issues before they escalate; 
  • How to use Linux tooling to gain insights and effectively uncover hidden performance bottlenecks to optimize your Cassandra clusters;
  • What simple yet powerful profiling techniques enable users to better understand a Cassandra cluster’s behavior; and
  • How visualizations like flame graphs can help users quickly uncover both internal and system level bottlenecks. 

To see Jon talk a bit about Cassandra performance tuning, you can watch this recording of a recent Apache Cassandra Town Hall.

More about Cassandra Summit

Loading Streaming Data into Cassandra Using Spark Structured Streaming

When creating real-time data platforms, data streaming is a low-latency, high-throughput method of moving data. Where batch processing methods necessarily introduce delays in order to gather a batch’s worth of data, stream processing methods act on stream events as they occur, with as little delay as possible. In this blog and the associated repo, we will discuss how streaming data can be loaded into Cassandra, with Spark Structured Streaming as an intermediary. Cassandra is designed for high-volume interactions and is thus a great fit for streaming workflows. For simplicity and speed, we are using DataStax’s AstraDB in this demo.

Introduction 

Streaming data is normally incompatible with standard SQL and NoSQL databases, since it can consist of differently structured data with messages differentiated only by timestamp. With advances in database technologies and continuous development, many databases have evolved to better accommodate streaming data use cases. Additionally, there are specialized databases, such as time-series databases and stream processing systems, that are designed explicitly for handling streaming data with high efficiency and low latency.

AstraDB, however, is neither of those things. In order to overcome these hurdles, we need to make sure that the specific data stream we are working with has a rigid schema and that the timestamps associated with adding the individual messages to the stream are either all stored or all thrown out.

The Basics

We will create our example stream from the Alpha Vantage stock API, ensuring that all stream messages will have the same format. Inside those messages is a timestamp corresponding to when the data was generated, which we will use to organize the data so that we can ignore the timestamp associated with adding the message to the stream.

An API by itself is not a stream, so we turn it into one. We push this stream to Astra Streaming, Astra’s managed version of Apache Pulsar, but this alone does not mirror our data into Astra DB. So we set up a Spark cluster with the Spark Pulsar connector and the Spark Cassandra connector and write a program that ingests the stream, transforms the data to match the AstraDB table schema, and loads it into the database. A condensed sketch of that program appears below.
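
The sketch below condenses the idea behind pyspark_stream_read.py: read the Pulsar topic, reshape the JSON payload to match the stock_data table, and stream it into the database. The topic name, JSON field names, and connector options are assumptions that depend on your tenant, schema, and connector versions (Astra Streaming also requires authentication options not shown here), so treat it as illustrative rather than the repo’s exact code.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (StructType, StringType, TimestampType,
                               DecimalType, IntegerType)

spark = SparkSession.builder.appName("pulsar-to-cassandra").getOrCreate()

# Schema mirroring the spark_streaming.stock_data table created later in this post.
schema = (StructType()
          .add("symbol", StringType())
          .add("time", TimestampType())
          .add("open", DecimalType(18, 4))
          .add("high", DecimalType(18, 4))
          .add("low", DecimalType(18, 4))
          .add("close", DecimalType(18, 4))
          .add("volume", IntegerType()))

# Read the raw Pulsar topic as a stream.
raw = (spark.readStream.format("pulsar")
       .option("service.url", "pulsar+ssl://<broker-service-url>")
       .option("topics", "persistent://<tenant>/<namespace>/<topic>")
       .load())

# Each Pulsar message value is a JSON blob; parse it into typed columns.
stock = (raw.select(from_json(col("value").cast("string"), schema).alias("row"))
         .select("row.*"))

# Continuously write the parsed rows into the Cassandra/Astra table.
query = (stock.writeStream
         .format("org.apache.spark.sql.cassandra")
         .option("keyspace", "spark_streaming")
         .option("table", "stock_data")
         .option("checkpointLocation", "/tmp/checkpoints/stock_data")
         .start())
query.awaitTermination()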

The Alpha Vantage API

The Alpha Vantage API is a REST API that provides historical data on stock values and trades over time intervals. We can query this API for the recent history of a specific stock symbol.
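
For example, a single intraday request might look like the sketch below; the query parameters follow Alpha Vantage’s public TIME_SERIES_INTRADAY documentation, and the API key is the one you sign up for later in this post.

import requests

API_KEY = "<your-alphavantage-api-key>"

resp = requests.get(
    "https://www.alphavantage.co/query",
    params={
        "function": "TIME_SERIES_INTRADAY",  # recent intraday history
        "symbol": "IBM",
        "interval": "5min",
        "apikey": API_KEY,
    },
    timeout=30,
)
resp.raise_for_status()

# The 5-minute bars live under a key named after the interval.
series = resp.json().get("Time Series (5min)", {})
print(f"Fetched {len(series)} five-minute bars for IBM")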

Astra Streaming

Astra Streaming is Astra’s managed Apache Pulsar instance. It can be used to read or write data streams, and can be connected to in a variety of ways. In this demo, we use the Pulsar Python client and the Spark Pulsar connector; a minimal producer sketch is shown below.
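
Here is a minimal sketch of producing messages to an Astra Streaming topic with the Pulsar Python client. The broker URL, token, and topic name are placeholders you collect from the Astra Streaming UI later in this post.

import json
import pulsar

client = pulsar.Client(
    "pulsar+ssl://<broker-service-url>",
    authentication=pulsar.AuthenticationToken("<astra-streaming-token>"),
)
producer = client.create_producer("persistent://<tenant>/<namespace>/<topic>")

# Each message is one JSON-encoded 5-minute stock bar.
producer.send(json.dumps({"symbol": "IBM", "time": "2023-11-01 12:00:00",
                          "open": 145.1, "high": 145.9, "low": 144.8,
                          "close": 145.2, "volume": 120000}).encode("utf-8"))
client.close()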

Apache Spark

Spark is a distributed data processing tool. It is best known for its batch processing capabilities, but Spark Streaming is one of its core components. Spark Streaming allows Spark users to process stream data using the same functions and methods that are used to process batch data. It does this by turning an incoming stream into a series of micro-batches and processing those with latency low enough to be compatible with data streams.


Astra DB

Astra is a managed database based on Apache Cassandra. It retains Cassandra’s horizontal scalability and distributed nature and comes with a number of other advantages. Today, we will use Astra as the final destination for our data.

Here’s a quick look at what will be happening in this demo.

Prerequisites

For this tutorial you will need:

  • An Astra account from DataStax, or enough familiarity with Cassandra to use an alternative Cassandra database. Sign up for a free tier Astra account here
  • A tenant in Astra Streaming, or access to an Apache Pulsar cluster
  • An environment in which to run Python code. Make sure that you can install new pip modules in this environment. We recommend Gitpod or your local machine.
  • A Spark cluster. This demo contains the files and instructions necessary to create a single-worker cluster.
  • An Alpha Vantage API key. You can apply for one here: https://www.alphavantage.co/

For an effortless setup, we have provided a Gitpod quickstart option – though you’ll still need to fill in your own credentials before it will run seamlessly. Simply click on the “Open in Gitpod” button found in our GitHub repository to get started, or go to this link to open the repo in Gitpod. When creating the workspace for this project, it is advantageous to select the large machine class to ensure there is enough memory on the machine to run the Spark cluster.

Before you can proceed further, you will need to set up your Astra database. After creating a free account, you will need to create a database within that account and a keyspace within that database. You will also need to create a tenant and topic within Astra Streaming. All of this can be done purely through the Astra UI.

Creating your AstraDB

Setting up your Astra DB

There is a ton of great documentation for how to create an Astra Database which is included in the Knowledge Resources at the bottom of this STACK page.

In brief, go to astra.datastax.com, create or sign in to your account, and create a new DB – they are free for most users. For this demo, we use a database called stock_data and a keyspace named spark_streaming. To create that database, select the “Databases” tab (shown on the left menu), then click on the “Create Database” button (shown on the right) and fill out the needed information for your database:

Once you’ve created your database, you’ll need to generate the token or Secure Connect Bundle used to connect to your database from the Connect tab. Choose the permissions that make the most sense for your use case. For this demo, there’s nothing wrong with choosing Database Administrator, but you can also go as simple as a Read/Write Service account to get the functionality you need.

Never share your token or bundle with anyone. It contains several pieces of data about your database and can be used to access it.

Reminder: for this demo, the assumed name of the keyspace is spark_streaming.

Establishing the Schema

Once the keyspace has been created, we need to create the table that we will be using, named stock_data. Open the CQL Console for the database and enter these lines to create the table:

CREATE TABLE spark_streaming.stock_data (
    symbol text,
    time timestamp,
    open decimal,
    high decimal,
    low decimal,
    close decimal,
    volume int,
    PRIMARY KEY ((symbol), time)
);

Next, we need to create the resources we need to connect to the Astra database. Hit the Connect button on the UI and download the Secure Connect Bundle (SCB). Then hit the “Create a Token” button to create a Database Administrator token and download the text that it returns. 

Load the SCB into the environment and put the path to it in the second line of auth.py, between the single quotes. Put the generated ID (Client_ID) for the Database Admin token on the third line, and the generated secret (Client_Secret) for the token on the fourth line.

Creating your Astra Streaming Topic

From the homepage of Datastax Astra, select streaming in the sidebar on the left.

Select the create tenant button and fill out the form to create a tenant.

Open your tenant, scroll down within the Connect screen, and copy out the Broker Service URL and Web Service URL. Then go to Settings on the top bar and scroll to Token Management. Copy your token and paste it into auth.py on line 8. You can then go back to Namespace and Topic and create a namespace and a topic within it. Then go to the Try Me tab and consume from your created topic. You will see the stream data here once our program reads it from Alpha Vantage.

Connecting to the Alpha Vantage API

Go to alphavantage.co and click to get a free API key.

Sign up for an API key. When it arrives, paste your key into auth.py, on line 6. With all the credentials collected, auth.py should look roughly like the sketch below.
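
This hypothetical layout of auth.py is only an illustration of where the values from the previous steps go; the variable names are placeholders, not the repo’s actual names, while the line positions match the line numbers referenced above.

# auth.py (hypothetical layout)
SECURE_CONNECT_BUNDLE_PATH = ''   # line 2: path to your Secure Connect Bundle (SCB)
ASTRA_CLIENT_ID = ''              # line 3: generated Client_ID for the Database Admin token
ASTRA_CLIENT_SECRET = ''          # line 4: generated Client_Secret for the token
                                  # line 5: (blank)
ALPHA_VANTAGE_API_KEY = ''        # line 6: Alpha Vantage API key
                                  # line 7: (blank)
ASTRA_STREAMING_TOKEN = ''        # line 8: Astra Streaming token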

Starting your Spark Cluster

In your environment, run the script file ./spark-3.4.1-bin-hadoop3/sbin/start-master.sh to create the Spark Master. Open up port 8080 and copy the Spark Master address. Use that address as the argument and run the script ./spark-3.4.1-bin-hadoop3/sbin/start-worker.sh to create the Spark worker.

Running the Code

Once all of the setup is complete, you can run create_raw_stream.py to get data from the Alpha Vantage API and send it to Astra Streaming as a stream. You should be able to see the results in the Try Me tab within the Astra UI. Then, to process and upload that stream to AstraDB, run the following Spark submit command.

./spark-3.4.1-bin-hadoop3/bin/spark-submit \
--master <Spark Master Address> \
--packages io.streamnative.connectors:pulsar-spark-connector_2.12:3.1.1.3,com.datastax.spark:spark-cassandra-connector_2.12:3.4.0 \
--conf spark.files=<bundle filepath> \
--conf spark.cassandra.connection.config.cloud.path=<bundle file name> \
--conf spark.cassandra.auth.username=<astra client id> \
--conf spark.cassandra.auth.password=<astra client secret> \
--conf spark.dse.continuousPagingEnabled=false \
<path to pyspark_stream_read.py>

This command submits the file pyspark_stream_read.py to the Spark cluster, while also telling Spark to download the Spark Cassandra connector and the Pulsar Spark connector from Maven. We then pass the connection information for Astra, including the secure connect bundle, client ID, and client secret.

After those have been run, we should be able to see the results in AstraDB. Open up your Database’s CQL console and type:

SELECT * FROM spark_streaming.stock_data LIMIT 10;

Conclusion

In this demo, you learned how to use Pulsar/Astra Streaming, Spark, SparkSQL, and AstraDB to turn an API into a stream, deliver that stream through Astra Streaming, and then transform and write it to AstraDB.

While this demo focuses on the Alpha Vantage API, you can apply this stack to a wide variety of data in an extract, transform, and load workflow. How you use the data once you have it is up to your imagination!

Getting help

You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. Enhance your enterprise’s data ingest by incorporating streaming data with Pulsar, Spark, and Cassandra. Happy Coding!

Resources

https://github.com/chimpler/blog-spark-streaming-log-aggregation/tree/master
https://github.com/anjijava16/Spark_Structured_Streaming
https://github.com/jleetutorial/python-spark-streaming
https://github.com/adaltas/spark-streaming-pyspark
https://pulsar.apache.org/docs/3.0.x/adaptors-spark/
https://docs.datastax.com/en/streaming/astra-streaming/getting-started/index.html#your-first-streaming-tenant
https://github.com/tspannhw/FLiPS-SparkOnPulsar/
https://www.datastax.com/products/astra-streaming

Streaming Stock Data with Apache Flink and Cassandra

Introduction

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. This tutorial will show you step-by-step how to use Astra as a sink for results computed by Flink. 

This code is intended as a fairly simple demonstration of how to enable an Apache Flink job to interact with DataStax Astra. Any Cassandra database will work for this demo; for the purposes of this tutorial, we use DataStax Astra to preview these features.

Basic Activity Diagram

The diagram below shows pyFlink and AstraDB’s batch processing capabilities and how this integration unlocks a whole new realm of possibilities for data-intensive applications.

Prerequisites

  • You should have Gradle and Java installed on your system.
  • Python v3.7 or later
  • AlphaVantage API Key

Setting up Flink

1: Open a terminal and clone the flink-astra-stock-price GitHub repository using the command:

git clone git@github.com:Anant/flink-astra-stock-price.git

2: Change directory into the `flink-astra-stock-price` folder and install the dependencies listed in the requirements.txt file.

cd flink-astra-stock-price

3: To install the required packages for this plugin, run the following command:

pip install -r requirements.txt

4: Open the AlphaVantage API. Select *Get Free API Key*.

5: Fill out the form to create your key and click *GET FREE API KEY*.

6: Receive the credentials through email and add them to the my_local_secrets.py file.

Setting up your Astra DB

  1. Create an Astra Database. To match the demo as perfectly as possible, name the database “flink” and the keyspace, `example`.
  2. Add the keyspace name from your newly created database to `my_local_secrets.py`.
  3. Select *Generate token*. Download the token to connect to your database.
    1. Add the `client_id`, `client_secret`, and `client_token` to the corresponding fields in the `my_local_secrets.py` file.
    2. Click *Close*. 
  4. Click *Get Bundle* to download a *Secure Connect Bundle* (SCB). Upload that bundle to your coding environment. Make sure to reference the location where you uploaded the document in the `my_local_secrets` file.

NOTE:

Never share your token or bundle with anyone. It is a bundle of several pieces of data about your database and can be used to access it. 

  1. Move the SCB to app/src/main/resources in your GitHub directory (You do not have to unzip the file).
  2. Create a properties file titled app.properties, and place it in app/src/main/resources/.
  3. Add properties specifying your Astra client ID, Astra secret, and SCB file name. These should map to the “astra.clientid”, “astra.secret”, and “astra.scb” properties respectively. Your app.properties file should look something like this:
astra.clientid=Bwy...
astra.secret=E4dfE...
astra.scb=secure-connect-test.zip

Open the my_local_secrets.py file and fill in the following details.

client_id="<your-client_id>"
client_secret="<your-client_secret>"
token="<your-token>"
db_keyspace="<your-keyspace>"
secure_bundle_path="<path-to-bundle>/secure-connect-<YOUR_DB_NAME>.zip"
astra_id="<astra-id>"
astra_region="<astra-region>"
api_key="<your-alphavantage-api-key>"

Install Jupyterlab

JupyterLab is a more advanced version of Jupyter notebooks. It offers many features in addition to the traditional notebooks.

1: Install Jupyter with pip on your machine. Run the command below to do so.

a: If `pip` does not work, try `pip3 install jupyterlab`.

pip install jupyterlab

2: Type `jupyter-lab` in the terminal to open a window in your browser listing your working directory’s content. From the cloned directory, locate the notebook to follow the steps.

If it doesn’t start automatically, you can navigate to the JupyterLab server by clicking the URLs in your terminal:

3: After running the `jupyter` command from your working directory, you can see the project tree in your browser and navigate to the files. Make sure you configure the secrets by following the next step before starting any coding.

4: Open the `local_secrets.py` file and fill in the details provided/extracted from the Astra Portal after setting up the database details.

5: Navigate to the notebook.

Running the Jupyter Notebook

JupyterLab has more advanced functionality than traditional notebooks. As with most Jupyter notebooks, you can run each block of the notebook. If you’ve correctly added all the fields to the my_local_secrets.py file, the notebook runs correctly.

Configure your PyFlink datastream in the notebook so that the data it pulls from the API is structured correctly.

The stock symbol for which this demo is pulling the data is “IBM”. Change the third block [3] to pull the data for a different stock symbol.

Our demo stack uses the Python API, but there are a variety of other language APIs that the developer community supports.

Defining API Queries

The AlphaVantage API can be used to pull different types of data. For this demo, we are using the popular Intraday (Time_Series_Intraday) API.

The query is currently set to pull the maximum number of free requests from the API. Each intraday response includes, for each 5-minute period:

  • the opening value at the beginning of the period,
  • the maximum value during the period,
  • the lowest value during the period,
  • the closing value at the end of the period, and
  • the volume of trades for that stock during the period.

Next, PyFlink structures the data that it gets from AlphaVantage into a data frame, filters out any records with volume greater than 100,000, and uses the resulting data frame to upload the data to your Cassandra database – in our case, Astra. A sketch of this step is shown below.
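
A minimal PyFlink sketch of that filtering step, assuming each stream element is already a dict with the Alpha Vantage fields; the literal sample data stands in for the API response.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# In the notebook these elements come from the AlphaVantage response; a small literal
# collection stands in for them here.
bars = env.from_collection([
    {"date": "2023-11-01 12:00:00", "open": 145.1, "high": 145.9,
     "low": 144.8, "close": 145.2, "volume": 120000},
    {"date": "2023-11-01 12:05:00", "open": 145.2, "high": 145.4,
     "low": 144.9, "close": 145.0, "volume": 80000},
])

# Drop any bar whose volume exceeds 100,000, mirroring the filter described above.
filtered = bars.filter(lambda bar: bar["volume"] <= 100000)

filtered.print()
env.execute("filter-stock-bars")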

Writing to Astra

The notebook includes a schema creation block (block [13] of the `pyFlink_Astra_batch.ipynb` file) that can be run directly from the Notebook. You can also run the schema creation script in the CQL Console of your Astra Portal. Below is an example of the schema.

CREATE TABLE IF NOT EXISTS market_stock_data (
    date text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    PRIMARY KEY (date)
);

1: Use the AstraDB REST API in the `pyFlink_Astra_batch.ipynb` file to insert the data into Astra.

The notebook defines a function, `send_to_rest_api`, which takes a single argument named “data”.

def send_to_rest_api(data):

2: Create a URL for the DataStax Astra REST API endpoint using formatted strings. The endpoint URL points at the specific keyspace and table specified in your `my_local_secrets.py` file.

The function then iterates through each row in the data object and, for each row, makes a POST request to the previously constructed URL. The data payload for each POST request is a JSON string describing one row of stock market data (date, open, high, low, close, and volume).
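Putting those two steps together, a hedged sketch of the function might look like the following; it assumes the Stargate REST v2 “add row” endpoint and reuses the values from my_local_secrets.py (the notebook’s own implementation and error handling may differ):

import json
import requests

from my_local_secrets import astra_id, astra_region, db_keyspace, token

# REST v2 endpoint for inserting rows into the market_stock_data table.
url = (
    f"https://{astra_id}-{astra_region}.apps.astra.datastax.com"
    f"/api/rest/v2/keyspaces/{db_keyspace}/market_stock_data"
)
headers = {"X-Cassandra-Token": token, "Content-Type": "application/json"}

def send_to_rest_api(data):
    # Iterate through each row in the data object and POST it to the endpoint above.
    for row in data:
        payload = json.dumps({
            "date": row["date"],
            "open": row["open"],
            "high": row["high"],
            "low": row["low"],
            "close": row["close"],
            "volume": row["volume"],
        })
        try:
            response = requests.post(url, headers=headers, data=payload, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"Insert failed for {row.get('date')}: {exc}")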

Datastream and Map

Create a DataStream and map over it. After defining the `send_to_rest_api` function, the script interacts with a DataStream named `ds`.

It maps the `send_to_rest_api` function onto the DataStream using a `lambda` function, so that for each item in `ds`, `send_to_rest_api` is called and sends that item’s data to the REST API.

ds.map(lambda x: send_to_rest_api(x))

In the notebook, there is some error handling associated with the send_to_rest_api function.

Congrats! You should now have the AlphaVantage data in your Astra DB. 

Test and Validate

After completing all prerequisites along with the section above, run the sample app and validate the connection between Flink and Astra.

1: In your `flink-astra` cloned GitHub directory, run `./gradlew run`

2: Verify that the application runs and exits normally. If successful, the following message appears:

BUILD SUCCESSFUL in 31s
3 actionable tasks: 2 executed, 1 up-to-date

3: Navigate back to the Astra UI to use the CQL Console. You can run this sample query to confirm that the defined data from the sample app has been loaded properly:

token@cqlsh:example> select * from wordcount ;
word | count
--------+-------
dogs | 1
lazier | 1
least | 1
foxes | 1
jumped | 1
at | 1
are | 1
just | 1
quick | 1
than | 1
fox | 1
our | 1
dog | 2
or | 1
over | 1
brown | 1
lazy | 1
the | 2
(18 rows)
token@cqlsh:example> 

Conclusion

We were able to use PyFlink’s DataStream API and map functions to batch-write data into AstraDB. The DataStream API is a set of tools for processing stream data, including individual record transformation and time-windowing functionality. We have shown that we can use stream-processing basics to transform and filter individual records depending on how we want the data to appear in Astra. This acts as a proof of concept for processing streaming data – such as IoT data coming from physical devices – using a combination of Flink and Cassandra.

Getting help

You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. Enhance your ELT for batch and stream processing with Flink. Happy coding!

Resources

https://www.alphavantage.co/
https://www.alphavantage.co/documentation/
https://nightlies.apache.org/flink/flink-docs-stable/
https://flink.apache.org/documentation/
https://pyflink.readthedocs.io/en/main/index.html
https://docs.datastax.com/en/astra/docs/
https://docs.datastax.com/en/astra-serverless/docs/manage/db/manage-create.html

Training a Handwritten Digits classifier in Pytorch with Cassandra

Handwritten digit recognition is one of the classic tasks undertaken by students when learning the basics of neural networks and computer vision. The basic idea is to take a number of labeled images of handwritten digits and use those to train a neural network that is able to classify new unlabeled images. Using this repo, we’ll show how to use data stored in a large-scale database as our training data (the demo uses the managed Cassandra service AstraDB as a great quick-start option for this). We also explain how to use that same database as a basic model registry, which can enable model serving as well as future retraining.

Introduction

MNIST is a set of datasets that share a particular format useful for educating students about neural networks while presenting them with diverse problems. The MNIST datasets for this demo are a collection of 28 by 28 pixel grayscale images as data and classifications 0-9 as potential labels. This demo works with the original MNIST handwritten digits dataset as well as the MNIST fashion dataset. 

Using both of these datasets helps calibrate models and tests whether they are affected by the domain of the classification. If a neural net is good at classifying digits but bad at classifying clothing and accessories, even though the two datasets share the same structure, that is evidence that something about the training or structure of the network encodes knowledge about digits or handwriting, or is better suited to simple shapes than to complex ones.

PyTorch is a Python library that provides the data types and methods needed for working with neural networks. We also make use of torchvision, a related library specifically meant for computer vision tasks. PyTorch works with data typed as tensors and can define different types of layers that can be combined for deep learning, gaining advantages that single-layer-type networks cannot. PyTorch provides utilities that help us define, train, test, and predict with our models.

Astra is a managed database based on Apache Cassandra that retains Cassandra’s horizontal scalability and distributed nature. Every feature we use in today’s tutorial will be part of Cassandra 5.0, which is releasing soon; to preview these new features, we use DataStax Astra. Today, you can use Astra as the primary data store and as a primitive model registry.

Prerequisites

For this tutorial you will need:

  • An Astra account from DataStax, or enough familiarity with Cassandra to use an alternative Cassandra database. Sign up for a Free Tier Astra account here.
  • An environment in which to run Python code. Make sure that you can install new pip modules in this environment. We recommend Gitpod or your local machine.
  • A Jupyter Notebook server, or access to the VS Code Jupyter plugin, is also necessary.

For an effortless setup, we have provided a Gitpod quickstart option – though you’ll still need to fill in your own credentials before it will run seamlessly. Simply click on the “Open in Gitpod” button found in our GitHub repository to get started, or go to this link to open the repo in Gitpod. When creating the workspace for this project, it is advantageous to select the large class of machine in order to access more RAM and make the data loader run faster. The rest of this article assumes the reader is using Gitpod to follow along unless stated otherwise.

Environment Setup

Assuming you opened the repo in Gitpod, the first thing that happens is that the required Python modules are installed. If you are not using Gitpod, you will need to clone the repo into your environment, change into the working directory, and install the prerequisites:

cd pytorch-astra-demo && pip3 install -r requirements.txt

Before you can proceed further you will need to set up your Astra database. After creating a free account you will need to create a database within that account and create a Keyspace within that database. All of this can be done purely using the Astra UI. 

Setting up your Astra DB

There is a ton of great documentation on how to create an Astra database, linked in the Resources section at the bottom of this blog.

In brief, go to astra.datastax.com, create/sign in to your account, create a new DB– they are free for most users– then create a keyspace and run the schema creation script (located at setup/create_schema.cql in the CassioML repo) in the CLI for your database.

For this demo, we use a database called cassio_db and a keyspace named mnist_digits. To create that database, select the “Databases” tab (shown in the left menu), then click on the “Create Database” button (shown on the right) and fill out the needed information for your database as shown below. The “Vector Database” option comes with vector search enabled, but for this example it does not matter which option you choose.

Once you’ve created your database, you’ll need to generate the Token or Secure Connect bundle to connect to your database with the connect tab. Choose the permissions that make the most sense for your use case. For this demo, there’s nothing wrong with choosing Database Administrator, but you can also go as simple as a Read/Write Service account to get the functionality you need.

Never share your token or bundle with anyone. It is a bundle of several pieces of data about your database, and can be used to access it. 

Reminder: for this demo, the assumed name of the keyspace is mnist_digits.

Establishing the Schema

Once the keyspace has been created, we need to create the Tables that we will be using.  Open the CQL Console for the database by clicking on the “CQL Console” tab in the database view.

We can create the raw_train, raw_test, models_train, and models_test tables, as well as a raw_predict table holding data with no labels attached, using the commands below.

CREATE TABLE mnist_digits.raw_train (id int PRIMARY KEY, label int, pixels list<int>);
CREATE TABLE mnist_digits.raw_test (id int PRIMARY KEY, label int, pixels list<int>);
CREATE TABLE mnist_digits.models_train (id uuid PRIMARY KEY, network blob, optimizer blob, upload_date timestamp, epoch int, batch_percent text, loss float);
CREATE TABLE mnist_digits.models_test (id uuid PRIMARY KEY, network blob, optimizer blob, upload_date timestamp, loss float, accuracy float);
CREATE TABLE mnist_digits.raw_predict (id int PRIMARY KEY, label int, pixels list<int>);

It is useful to define storage attached indexes (SAIs) for the models_test and models_train tables so that once the training is completed we can easily identify the best models. This will allow us to search for the models with the minimum loss and maximum accuracy.

CREATE CUSTOM INDEX loss_train_sai_idx on mnist_digits.models_train (loss) using 'StorageAttachedIndex';
CREATE CUSTOM INDEX loss_test_sai_idx on mnist_digits.models_test (loss) using 'StorageAttachedIndex';
CREATE CUSTOM INDEX accuracy_test_sai_idx on mnist_digits.models_test (accuracy) using 'StorageAttachedIndex';
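As a hedged illustration of how these indexes can be queried later from Python, the sketch below uses range predicates on the indexed columns instead of scanning the whole table; the thresholds and connection values are placeholders, and the SCB and token come from the next step:

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Placeholder connection values; use the SCB and token generated in the next step.
cluster = Cluster(
    cloud={"secure_connect_bundle": "/path/to/secure-connect-bundle.zip"},
    auth_provider=PlainTextAuthProvider("<Client_ID>", "<Client_Secret>"),
)
session = cluster.connect()

# Ask for models above an accuracy floor and below a loss ceiling.
rows = session.execute(
    "SELECT id, loss, accuracy FROM mnist_digits.models_test "
    "WHERE accuracy >= 0.95 AND loss <= 0.2 LIMIT 5"
)
for row in rows:
    print(row.id, row.loss, row.accuracy)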

Next, we need to create the resources necessary to connect to the newly created Astra database. Hit the Connect button on the UI and download the Secure Connect Bundle (SCB). Then hit the “Create a Token” button to create a Database Administrator token and download the text that it returns. 

Load the SCB into the environment and put the path to it on the first line of the auth.py file, between the single quotes. Put the generated id (Client_ID) for the Database Admin token on the second line, and the generated secret (Client_Secret) for the token on the third line.
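The exact variable names live in the repo’s auth.py; as a purely hypothetical illustration, the three lines might look something like this once filled in:

# auth.py (hypothetical names and values; only the order of the three lines matters here)
secure_connect_bundle_path = '/workspace/pytorch-astra-demo/secure-connect-cassio-db.zip'  # line 1: path to the SCB
client_id = '<Client_ID from the generated token>'                                         # line 2: generated id
client_secret = '<Client_Secret from the generated token>'                                 # line 3: generated secret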

Then run the data loader, load_raw_data.py, using the line below. Modify the train_split variable in load_raw_data.py if you want something other than an 80/20 train/test split.

python3 load_raw_data.py

The data loader will populate the train, test, and predict tables that we created earlier. It may take an hour or more to complete because each data sample contains close to 800 pixel values (this is why we asked you to select the larger machine class for your Gitpod). Once it is complete, make sure the data was created by running these commands in the CQL Console of the Astra UI.

SELECT id, label from mnist_digits.raw_train limit 5;
SELECT id, label from mnist_digits.raw_test limit 5;
SELECT id, label from mnist_digits.raw_predict limit 5;

After that you should be able to step through the model_training_full_sequence.ipynb notebook without issue, following the comments to train and store models.

Running the Notebook

An IPython notebook is a collection of cells containing either Markdown or Python code. Each code cell can be run individually, though they share an environment so variables and imports are carried over between cells. 

If you are running the notebook in Gitpod with the VS Code editor, a pop-up will ask you to select a kernel on your first run. Select “Install Python Environments”. Then click on “Select Kernel” in the top right corner, or run a cell again. Select “Python Environments”, and then “Python 3.11.1”.

When first opening the notebook, all cells should have some blank space between them and the next cell. If this is not the case, click Clear All Outputs at the top of the screen. To run an individual cell, click on that cell and press Shift+Enter or click the Run button. You can also use the Run All button at the top of the notebook which will run the cells sequentially. 

A successfully run cell should have a green check mark on it afterwards. A still-running cell will have a loading symbol somewhere in it. Each code cell has comments that describe what that particular cell does. 

The notebook first walks the user through importing the necessary libraries and then creating a custom PyTorch data loader that connects to Astra. After importing the rest of the required modules for model definition and training, we create the data loaders for our training and testing data sets.
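As an illustration of what such a loader can look like, here is a sketch of a Dataset backed by one of the raw_* tables; the class name, eager loading, and reshaping are our assumptions (the notebook may instead keep the flat 1 by 784 vector and reshape later), and `session` is a connected cassandra-driver session:

import torch
from torch.utils.data import Dataset, DataLoader

class AstraMNISTDataset(Dataset):
    """Sketch of a Dataset that reads one of the raw_* tables into memory."""

    def __init__(self, session, table):
        rows = session.execute(f"SELECT id, label, pixels FROM mnist_digits.{table}")
        self.samples = [
            (torch.tensor(r.pixels, dtype=torch.float32).reshape(1, 28, 28) / 255.0, r.label)
            for r in rows
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# Usage: wrap it in a DataLoader just like a torchvision dataset.
# train_loader = DataLoader(AstraMNISTDataset(session, "raw_train"), batch_size=64, shuffle=True)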

Adding New DataSets

This example uses the MNIST handwritten digits dataset. This dataset consists of a set of 28 by 28 pixel grayscale images depicting digits from 0-9, meant to be classified into those 10 categories. This repo is easily modified to work with other datasets in this format. The most compatible will be other MNIST datasets, which promise the same 28 by 28 image size, the same grayscale pixel values, and the same 10 categories. In fact, the fashion MNIST dataset at https://www.kaggle.com/datasets/zalando-research/fashionmnist can be substituted almost exactly for the train and test csv files included in the repo. If you switch the filenames in load_raw_data.py, the rest of the repo can be used as normal.

Once we have done this and set some constants that will be used in model training, we test our data loaders by loading a single batch of examples, extracting a single data point, and examining its shape (the dimensionality of the tensor that represents the hand-drawn number). Here we should see the 28 by 28 pixel nature of the images reflected in the shape of the data; we reshape the data object (it was originally 1 by 784) so that the model can process it correctly. If you see an error instead, it is an indication that the data loader is unable to load the data properly – check your Astra database for missing or empty tables.

Changing the details of how we train the model

Before we train the model, we define a number of constants that change how that training takes place. The first is n_epochs, which defines how many training epochs we put the model through.

During each epoch, we feed in a number of training examples before stopping, at which point we test and save the model. 

  • `batch_size_train` tells us how many of our training examples get fed to the model during each epoch. 
  • `batch_size_test` defines how many examples are used for testing the model after training. 
  • `learning_rate` defines a property of the optimizer, changing the backpropagation step so that it makes bigger or smaller changes to the model weights. 
  • `momentum` determines how much the changes to the model weights carry over between backpropagation steps.

Because backpropagation uses calculus to determine model weight changes, the magnitude of those changes can be affected by the gradient slope of the previous backpropagation step.
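To make these knobs concrete, here is a small sketch with representative values; the notebook defines its own constants, and the stand-in model plus the assumption of an SGD optimizer are ours:

import torch.nn as nn
import torch.optim as optim

# Representative values only; the notebook sets its own constants.
n_epochs = 3            # number of full passes over the training data
batch_size_train = 64   # training examples fed to the model per step
batch_size_test = 1000  # examples used when evaluating after each epoch
learning_rate = 0.01    # scales the size of each weight update
momentum = 0.5          # how much of the previous update carries into the next one

network = nn.Linear(28 * 28, 10)  # stand-in model; the real Net class is described in the next section

# learning_rate and momentum take effect where the optimizer is constructed.
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)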

After that is complete, we define a class for our neural net that holds its component layers and defines how those layers are connected to each other. We then define train and test functions that perform an optimization step on our model, store the resulting model back into the Astra database, and then test the resulting model on the test data set.

Changing the structure of the model

When we create the Net class, we define the structure of our model. In this example repo we set up two convolutional layers, a dropout layer, and two linear layers. The convolutional layers take a number of 2D planes as input, perform a 2D convolution, and output a different number of planes. Because our flat grayscale image fits into a single plane, our first Conv2d layer has a single input channel. Conv2d layers are specialized for image processing.

To use a traditional RGB image as input, we would increase the number of input channels to 3, one for each color. The dropout layer randomly zeroes out some channels. The linear layers apply a linear transformation to incoming data. Because our classification task has 10 categories, the final linear layer has 10 outputs. These return a value between 0 and 1 for each category, roughly corresponding to a probability or confidence score, and we take the highest one as the prediction. The layers are applied in this order: first convolutional layer, second convolutional layer, dropout layer, first linear layer, second linear layer. This order can be changed by modifying the order in which they are used in the Net class’s forward method.
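For reference, a network with this layer arrangement, modeled on the classic PyTorch MNIST example linked in the Resources below, might look like the following sketch; the repo’s exact layer sizes and forward pass may differ:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # first convolutional layer: one grayscale input plane
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # second convolutional layer
        self.conv2_drop = nn.Dropout2d()               # dropout layer: randomly zeroes whole channels
        self.fc1 = nn.Linear(320, 50)                  # first linear layer
        self.fc2 = nn.Linear(50, 10)                   # second linear layer: one output per digit class

    def forward(self, x):
        # The order of operations here is what connects the layers to each other.
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)  # log-probabilities; exponentiating gives the 0-to-1 scores described above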

Once all of these are complete, you should see the accuracy of your model increasing in the space under your final cell like this.

Using the Updated Model on New Data 

In order to use this model on new data, the first step is to pull the row for the particular model you want out of Astra. Then use pkl.load on the network state object that was saved to turn it back into a dictionary object. Next, create the Net class and network object the same way we did in the notebook, and call network.load_state_dict, passing it the state dictionary that was just loaded.

Now we have a network object with the same weights as the one we stored in Astra. We can then load new data – whether from the test loader we created in the notebook, from the data we placed in the raw_predict table, or from somewhere else. Once we have the data and the model, we can call network(data) to run the new data through the model and look through the results for the predictions.
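Putting those steps together, a helper along these lines would run new images through a stored model; it reuses the Net class from the sketch above, and `model_row` and `batch` are hypothetical names for the Astra row and the new image tensor:

import pickle as pkl
import torch

def predict_with_stored_model(model_row, batch):
    # model_row: a row from mnist_digits.models_test; its `network` column is the pickled state dict.
    # batch: a tensor of new images shaped (N, 1, 28, 28), e.g. built from the raw_predict table.
    state_dict = pkl.loads(model_row.network)
    network = Net()                      # the same Net class used during training
    network.load_state_dict(state_dict)
    network.eval()
    with torch.no_grad():
        output = network(batch)
        return output.argmax(dim=1)      # highest-scoring class for each image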

Conclusion

Congratulations, you have successfully trained a machine learning model to recognize hand-written numbers! If you were to use the same model to identify the number associated with a new set of hand-written numbers, your model would recognize those numbers better than an untrained model.

Getting help

You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. Pytorch is a widely used library in the Machine Learning ecosystem. Now that you’ve gotten started with it, you can use it in a ton of ways; let us know how it helps you and your enterprise. Happy coding!

Resources

https://pytorch.org/tutorials/
https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
https://github.com/ptrblck/pytorch_misc/blob/master/pytorch_redis.py
https://nextjournal.com/gkoehler/pytorch-mnist
https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
https://pytorch.org/vision/stable/transforms.html
https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html
https://pytorch.org/vision/stable/generated/torchvision.transforms.Normalize.html