Growing the Cassandra Community

As the Apache Cassandra® community grows, it is important that we have a place to call home – a place where end users and contributors alike can find the resources they need to be successful with Cassandra. Whether success is enhancing how you use Cassandra to power your business, or contributing the next awesome feature to the project, each of us needs access to information and community support to achieve our goals with Cassandra.  End-users should always have the opportunity to learn new ways of using Cassandra, and contributors should have no barriers to getting involved with the project in the areas where they can offer innovation. With that in mind, we have made easy access to guiding resources a top priority for our community.

We’re excited to announce that today we have relaunched a revitalized Planet Cassandra community hub. Planet Cassandra is your central place to keep informed about the Cassandra project and find out how you can use and contribute to it. The platform aggregates content from a variety of online sources and brings documentation and learning resources into one place. The goal is to make it easier for new users to start using Cassandra, existing users to learn more, and contributors to have a greater impact.

We also want to make it easier for all users to connect and collaborate with each other. So, along with relaunching Planet Cassandra, we will also be exploring a few ways that we can do this effectively:

Update & Consolidate Documentation

We can all speak from experience that documentation, whether it’s technical and end-user docs or community and process docs, has a habit of becoming outdated with surprising regularity. To work toward the goal of providing always-current information to the community, public documentation for joining the Cassandra community will be reviewed and updated where necessary, with regularly scheduled reviews as we move forward. The documentation will also be updated to consolidate end-user and contributor information in the appropriate locations and ensure a single source of truth rather than disparate pages in different locations.

Audit & Clarify Community Platforms

We want to meet the community where they are – whether it be on Slack, Discord, Mailing Lists, or somewhere else. We will be taking a look at how the community is using each of these platforms, and with what frequency. From there, we will provide public guidance on how to use each platform. As needed, we will bolster support of new channels and move away from channels no longer serving us, to ensure we are able to easily connect with each other.

Host Regular Onboarding Meetings

Along with the clarification of the community platforms, new meetings will be added to the Cassandra community calendar. The purpose of these meetings will be to welcome and onboard new members and help them find their way as they get started. There will be separate meetings for end users and contributors, and they will be open to anyone who wishes to join, including experienced community members looking to renew their engagement.

These are just a few ideas – we want to hear from all of you! Please comment below with what you need to be successful with Cassandra.

Practitioner’s guide for using Cassandra as a real-time feature store

By Alan Ho

The following document describes best practices for using Apache Cassandra® / DataStax Astra DB as a real-time feature store. It covers primarily the performance and cost aspects of selecting a database for storing machine learning features. Real-time AI requires a database that supports both high-throughput, low-latency queries to serve features and high write throughput to update them. Under real-world conditions, Cassandra can serve features for real-time inference with a tp99 < 23 ms. Cassandra is used by large companies such as Uber and Netflix for their feature stores.

This guide does not discuss the data science aspects of real-time machine learning, or the lifecycle management of features in a feature store. The best practices we’ll cover are based on technical conversations with practitioners at large technology firms such as Google, Facebook, Uber, Airbnb, and Netflix about how they deliver real-time AI experiences to their customers on their cloud-native infrastructures. Although we’ll focus specifically on how to implement real-time feature storage with Cassandra, the architecture guidelines apply to any database technology, including Redis, MongoDB, and Postgres.

What is real-time AI?

Real-time AI performs inference or trains models based on recent events. Traditionally, training models and making inferences (predictions) from them have been done in batch – typically overnight or periodically throughout the day. Today, modern machine learning systems perform inference on the most recent data in order to provide the most accurate prediction possible. A small set of companies, like TikTok and Google, have pushed the real-time paradigm further by including on-the-fly training of models as new data comes in.

Because of these changes in inference, and changes that will likely happen to model training, the persistence of feature data – the data used to train and perform inferences for an ML model – needs to adapt as well. When you’re done reading this guide, you’ll have a clearer picture of how Cassandra and DataStax Astra DB, a managed service built on Cassandra, meet real-time AI needs, and how they can be used in conjunction with other database technologies for model inference and training.

What’s a feature store?

Life cycle of a feature store courtesy of the Feast blog

A feature store is a data system specific to machine learning (ML) that:

  • Runs data pipelines that transform raw data into feature values
  • Stores and manages the feature data itself, and
  • Serves feature data consistently for training and inference purposes

Main components of a feature store courtesy of the Feast blog

Real-time AI places specific demands on a feature store that Cassandra is uniquely qualified to fulfill, specifically when it comes to storing and serving features for model serving and model training.

Best Practices

Implement low-latency queries for feature serving

For real-time inference, features need to be returned to applications with low latency at scale. Typical models involve ~200 features spread across ~10 entities. Real-time inference requires budgeting time for collecting features, performing lightweight data transformations, and executing the inference itself. According to the following survey (also confirmed by our conversations with practitioners), feature stores need to return features to an application performing inference in under 50 ms.

Typically, models require “inner joins” across multiple logical entities – combining row values from multiple tables that share a common key; this presents a significant challenge to low-latency feature serving. Take the case of Uber Eats, which predicts the time to deliver a meal. Order information needs to be joined with restaurant information, which is further joined with traffic information in the region of the restaurant. In this case, two inner joins are necessary (see the illustration below).

To achieve an inner join in Cassandra, one can either denormalize the data upon insertion or make two sequential queries to Cassandra and perform the join on the client side, as in the sketch below. Although it’s possible to perform all inner joins upon inserting data into the database through denormalization, maintaining a 1:1 ratio between models and tables is impractical because it means maintaining an inordinate number of denormalized tables. Best practices suggest allowing for one to two sequential queries for inner joins, combined with denormalization.
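Here is a minimal sketch of the two-query, client-side join pattern using the DataStax Python driver. The table and column names (order_features, restaurant_features, and a map-typed features column) are hypothetical stand-ins for your own data model:

```python
# Sketch: two sequential queries + client-side "inner join".
# Table/column names are hypothetical; adapt to your schema.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feature_store")

def fetch_features(order_id: str) -> dict:
    # Query 1: order-level features, including the key needed for the join.
    order = session.execute(
        "SELECT restaurant_id, features FROM order_features WHERE order_id = %s",
        (order_id,),
    ).one()
    # Query 2: restaurant-level features, keyed by the value from query 1.
    restaurant = session.execute(
        "SELECT features FROM restaurant_features WHERE restaurant_id = %s",
        (order.restaurant_id,),
    ).one()
    # The "join" is just a client-side merge of the two feature maps.
    return {**order.features, **restaurant.features}
```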

Here is a summary of the performance metrics that can be used to estimate requirements for real-time ML pipelines:

  • Testing Conditions:
    • # features = 200
    • # of tables (entities) = 3
    • # inner joins = 2
    • Query TPS : 5000 queries / second
    • Write TPS : 500 records / second
    • Cluster Size : 3 nodes on AstraDB*
  • Latency Performance summary (uncertainties here are standard deviations):
    • tp95 = 13.2(+/-0.6) ms
    • tp99 = 23.0(+/-3.5) ms
    • tp99.9 = 63(+/- 5) ms
    • Effect of compaction:
      • tp95 = negligible
      • tp99, tp99.9 = negligible, captured by the sigmas quoted above
    • Effect of Change Data Capture (CDC):
      • tp50, tp95 ~ 3-5 ms
      • tp99 ~ 3 ms
      • tp99.9 ~ negligible

* These tests were done on DataStax Astra DB’s free tier, which is a serverless environment for Cassandra. Users should expect similar latency performance when deploying on 3 nodes using the following recommended settings.

The most significant impact on latency is the number of inner joins. If only one table is queried instead of three, the tp99 drops by 58%; for two tables, it is 29% less. The tp95 drops by 56% and 21% respectively. Because Cassandra is horizontally scalable, querying for more features does not significantly increase the average latency, either.

Lastly, if the latency requirements cannot be met out of the box, Cassandra has two additional capabilities: support for denormalized data (which reduces inner joins), thanks to its high write throughput, and the ability to selectively replicate data to in-memory caches (e.g. Redis) through Change Data Capture. You can find more tips to reduce latency here.

Implement fault-tolerant, low-latency writes for feature transformations

A key component of real-time AI is the ability to use the most recent data for doing a model inference, so it is important that new data is available for inference as soon as possible. At the same time, for enterprise use cases, it is important that the writes are durable because data loss can cause significant production challenges.

Suggested deployment architecture to enable low-latency feature transformation for inference

* The object store (e.g. S3 or Hive) can be replaced with other types of batch-oriented systems such as data warehouses.

There is a trade-off between low-latency durable writes and low-latency feature serving. For example, it is possible to store the data only in a non-durable location (e.g. Redis), but production failures can make it difficult to recover the most up-to-date features because doing so would require a large recomputation from raw events.

A common architecture writes features to an offline store (e.g. Hive / S3) and replicates them to an online store (e.g. an in-memory cache). Even though this provides durability and low latency for feature serving, it comes at the cost of introducing latency on feature writes, which invariably causes poorer prediction performance.

Databricks Reference Architecture for Real-time AI

Cassandra provides a good trade-off between low-latency feature serving and low-latency, durable feature writes. Data written to Cassandra is typically replicated a minimum of three times, and it supports multi-region replication. The latency from write to read availability is typically sub-millisecond. As a result, by persisting features directly to the online store (Cassandra) and bypassing the offline store, the application has faster access to recent data and can make more accurate predictions. At the same time, CDC from the online store to the offline store allows for batch training or data exploration with existing tools.
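As a rough illustration, here is what persisting a freshly computed feature directly to the online store might look like with the Python driver; the features_by_entity table and its map column are hypothetical:

```python
# Sketch: durable, low-latency feature write straight to the online store.
# Table name and schema are hypothetical.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("feature_store")

# LOCAL_QUORUM acknowledges on a majority of local replicas: durable,
# yet still low latency within a region.
write = SimpleStatement(
    "UPDATE features_by_entity SET features = features + %s WHERE entity_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(write, ({"avg_delivery_time_min": 27.5}, "restaurant_42"))
```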

Implement low-latency writes for prediction caching and performance monitoring

In addition to storing feature transformations, there is also the need to store predictions and other tracking data for performance monitoring.

There are several use cases for storing predictions:

  1. Prediction store – In this scenario, a database is used to cache predictions made by either a batch system or a streaming system. The streaming architecture is particularly useful when the time it takes to perform an inference exceeds what is acceptable in a request-response system.
  2. Prediction performance monitoring – It is often necessary to monitor the prediction output of real-time inference and compare it to the final result. This means having a database to log both the predictions and the final results.

Cassandra is a suitable store for both use cases because of its high-write throughput capabilities. 
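For the monitoring case, a hypothetical table layout might pair each logged prediction with its eventual outcome; the schema below is illustrative only:

```python
# Sketch: a hypothetical prediction log keyed for time-ordered reads per
# model/entity; `actual` is backfilled once the final result is known.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feature_store")
session.execute("""
    CREATE TABLE IF NOT EXISTS prediction_log (
        model_id     text,
        entity_id    text,
        predicted_at timestamp,
        prediction   double,
        actual       double,
        PRIMARY KEY ((model_id, entity_id), predicted_at)
    ) WITH CLUSTERING ORDER BY (predicted_at DESC)
""")
```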

Plan for elastic read and write workloads

The level of query and write transactions per second usually depends on the number of users simultaneously using the system. As a result, workloads may change based on the time of day or time of year. Having the ability to quickly scale up and scale down the cluster to support increased workloads is important. Cassandra and Astra DB have features that enable dynamic cluster scaling.

The second aspect that can affect write workloads is changes to the feature transformation logic, which can require rewriting features en masse. With a large spike in write workloads, Cassandra automatically prioritizes maintaining low-latency queries and write throughput over strong consistency, which is typically acceptable for performing real-time inference.
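On the read path, consistency can be relaxed per statement to keep inference latency low during such spikes; this sketch assumes the hypothetical features_by_entity table from the earlier example:

```python
# Sketch: relax read consistency for the inference path.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("feature_store")

fast_read = SimpleStatement(
    "SELECT features FROM features_by_entity WHERE entity_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_ONE,  # one replica ack: lowest latency
)
row = session.execute(fast_read, ("restaurant_42",)).one()
```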

Implement low-latency, multi-region support

As real-time AI becomes ubiquitous across all apps, it’s important to make sure that feature data is available as close as possible to where inference occurs. This means having the feature store in the same region as the application doing the inference. Replicating the feature store across regions helps ensure that features are available locally wherever inference happens. Furthermore, replicating just the features, rather than the raw data used to generate them, significantly cuts down on cloud egress fees.

Astra DB supports multi-region replication out of the box, with a replication latency in the milliseconds. Our recommendation is to stream all the raw event data to a single region, perform the feature generation, and store and replicate the features to all other regions.

Although one can theoretically achieve some latency advantage by generating features in each region, event data often needs to be joined with raw event data from other regions; from a correctness and efficiency standpoint, it is easier to ship all events to one region for processing in most use cases. On the other hand, if the model is used mostly in a regional context and most events are associated with region-specific entities, then it makes sense to treat features as region-specific. Any events that do need to be replicated across regions can be placed in keyspaces with global replication strategies (as sketched below), but ideally this should be a small subset of events. At a certain point, replicating event tables globally becomes less efficient than simply shipping all events to a single region for feature computation.
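In CQL terms, region-specific and globally replicated feature sets can live in separate keyspaces; the keyspace and datacenter names below are illustrative:

```python
# Sketch: region-scoped vs. globally replicated keyspaces.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Features consumed only in one region stay in a region-scoped keyspace.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS features_us_east
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3}
""")
# The small subset of cross-region data goes in a globally replicated keyspace.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS features_global
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'us_east': 3, 'eu_west': 3, 'ap_south': 3}
""")
```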

Plan for cost-effective and low-latency multi-cloud support

Multi-cloud support increases the resilience of applications and allows customers to negotiate lower prices. Single-cloud online stores such as DynamoDB not only result in increased latency for retrieving features and significant data egress costs, but also create lock-in to a single cloud vendor.

Open source databases that support replication across clouds provide the best balance of performance and cost. To minimize egress costs, events and feature generation should be consolidated into one cloud, and the resulting feature data should be replicated to open source databases running in the other clouds.

Plan for both batch and real-time training of production models

Suggested deployment architecture to enable low-latency feature transformation for inference

Batch processing infrastructure for building models serves two use cases: building and testing new models, and building models for production. For these, it has typically been sufficient to store feature data in slower object stores for training. However, newer model training paradigms include updating models in real time or near real time as data arrives; this is known as “online learning” (e.g. TikTok’s Monolith). The access pattern for real-time training sits somewhere between inference and traditional batch training: the data throughput requirements are higher than for inference (because it is not just single-row lookups), but not as high as for batch processing involving full table scans.

Cassandra can support write and read rates in the hundreds of thousands of transactions per second (with an appropriate data model), which provides enough throughput for most real-time training use cases. If the user prefers to drive real-time training from an object store instead, Cassandra can feed it through CDC to object storage. For batch training, CDC should likewise replicate data to object storage; it’s worth noting that machine learning frameworks like TensorFlow and PyTorch are particularly optimized for parallel training of ML models from object storage.
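In open source Cassandra, CDC is enabled per table (and must also be enabled in cassandra.yaml); a downstream consumer then ships the change log to object storage. The table name below is the hypothetical one used earlier:

```python
# Sketch: turn on CDC for a feature table so changes can be replicated
# to object storage for batch or near-real-time training.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feature_store")
session.execute("ALTER TABLE features_by_entity WITH cdc = true")
```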

For a more detailed explanation of “online learning,” see Chip Huyen’s explanation of continual learning, or this technical paper from Gomes et al.

Support for Kappa architecture

Kappa architecture is gradually replacing Lambda architecture due to its lower cost and the data quality issues caused by online/offline skew. Although many articles discuss the advantages of moving from separate batch and real-time computation layers to a single real-time layer, they rarely describe how to architect the serving layer.

Using Kappa architecture for generating features brings up some new considerations:

  1. Features are updated en masse, which can result in a significant number of writes to the database. It’s important to ensure that query latency does not suffer during these large updates.
  2. The serving layer still needs to support different types of queries, including low-latency queries for inference and high-TPS queries for batch training of models.

Cassandra supports Kappa architecture in the following ways:

  • Cassandra is designed for writes; an increased influx of writes does not significantly reduce the latency of queries. Cassandra opts for processing the writes with eventual consistency instead of strong consistency, which is typically acceptable for making predictions.
  • Using CDC, data can be replicated to object storage for training and in-memory storage for inference. CDC has little impact on the latency of queries to Cassandra.

Support for Lambda architecture

Most companies have a Lambda architecture, with a batch layer pipeline that’s separate from the real-time pipeline. There are several categories of features in this scenario:

  1. Features that are computed only in real time and replicated to the batch feature store for training
  2. Features that are computed only in batch and replicated to the real-time feature store
  3. Features that are computed in real time first and then recomputed in batch, with discrepancies then corrected in both the real-time store and the object store

In this scenario, however, DataStax recommends the architecture as described in this illustration:

The reasons are the following:

  1. Cassandra is designed to take batch uploads of data with little impact on read latency
  2. By having a single system of record, data management becomes significantly easier than if the data is split between the feature store and the object store. This is especially important for features that are first computed in real time and then recomputed in batch.
  3. When exporting data from Cassandra via CDC to the object feature store, the data export can be optimized for batch training (a common pattern used at companies like Facebook), which significantly cuts down on training infrastructure costs.

If it is not possible to update existing pipelines, or there are specific reasons the features need to land in the object store first, our recommendation is to set up a two-way CDC path between the Cassandra feature store and the object store, as illustrated below.

Ensure compatibility with existing ML software ecosystem

To use Cassandra as a feature store, it should be integrated with two portions of the ecosystem: machine learning libraries that perform inference and training, and data processing libraries that perform feature transformation.

The two most popular frameworks for machine learning are TensorFlow and PyTorch. Cassandra has Python drivers that enable easy retrieval of features from the database; moreover, multiple features can be fetched in parallel with asynchronous queries (see this example code, and the sketch below). The two most popular frameworks for performing feature transformation are Flink and Spark Structured Streaming; Cassandra connectors are available for both. Practitioners can use tutorials for Flink and Spark Structured Streaming with Cassandra.
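A minimal sketch of the parallel-fetch pattern with the Python driver’s asynchronous API follows; the three tables mirror the hypothetical Uber Eats example and are illustrative:

```python
# Sketch: issue all feature queries up front, then collect the results,
# so total latency tracks the slowest query rather than the sum of all.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feature_store")

queries = [
    ("SELECT features FROM order_features WHERE order_id = %s", ("order_1001",)),
    ("SELECT features FROM restaurant_features WHERE restaurant_id = %s", ("restaurant_42",)),
    ("SELECT features FROM region_traffic WHERE region_id = %s", ("sj_south",)),
]
futures = [session.execute_async(cql, params) for cql, params in queries]
feature_rows = [f.result().one() for f in futures]
```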

Open source feature stores such as Feast also provide a Cassandra connector and tutorial.

Understand query patterns and throughput to determine costs

Various models of real-time inference courtesy of Swirl.ai

The number of read queries to Cassandra as a feature store depends on the number of incoming inference requests. Assuming the feature data is split across multiple tables (or can be loaded in parallel), the number of tables gives an estimate of the fanout between real-time inferences and database queries. For example, 200 features across 10 entities stored in 10 separate tables yields roughly a 1:10 ratio between real-time inferences and queries to Cassandra.

Calculating the number of inferences being performed depends on the inference traffic pattern. For example, in the case of “streaming inference,” an inference is performed whenever a relevant feature changes, so the total number of inferences depends on how often the feature data changes. When inference is performed in a “request-reply” setting, it is only performed when a user requests it.
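A back-of-the-envelope sizing sketch, using the assumed numbers from this section (request-reply traffic and 10 feature tables per inference):

```python
# Sketch: estimate Cassandra read QPS from inference traffic.
inference_rps = 500        # assumed request-reply predictions per second
tables_per_inference = 10  # 200 features across 10 entities in 10 tables
read_qps = inference_rps * tables_per_inference
print(f"Estimated Cassandra read QPS: {read_qps}")  # -> 5000
```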

Understand batch and real-time write patterns to determine costs

The write throughput is primarily dominated by how frequently the features change. If denormalization occurs, this too could impact the number of features that are written. Other write throughput considerations include caching inferences for either batch or streaming inference scenarios.

Conclusion

When designing a real-time ML pipeline, special attention needs to be paid to the performance and scalability of the feature store. These requirements are particularly well satisfied by NoSQL databases such as Cassandra. Stand up your own feature store with Cassandra or Astra DB, and try out Feast.dev with the Cassandra connector.

Version [1.2.0] – 2024-01-01

UAT

  • Turned off datastax filters 
  • Fixed the Video 

Delivery in progress

  • Update change log 
  • Fixed the Video 

Sprint 25 B
1/1/2024 - 1/5/2024

UAT

  • Turned off datastax filters 
  • Fixed the Video 

Delivery in progress

  • Test changelog plugins 
  • Create Changelog file in GitHub Source control

 

Sprint 26a
1/15/2024 - 1/19/2024

UAT


Delivery in progress

  • Setup Changelog in source control root folder 

 

Sprint 27a 1/22/2024 - 1/27/2024

New Cassandra Catalyst Program Launched Today

Today the Apache Cassandra community launched the first-ever Cassandra Catalyst program, an effort that aims to recognize individuals who invest in the growth of the Apache Cassandra community by enthusiastically sharing their expertise, encouraging participation, and creating a welcoming environment.

Catalysts are trustworthy, expert contributors with a passion for connecting and empowering others with Cassandra knowledge. They must be able to demonstrate strong Cassandra knowledge through production deployments, educational material, conference talks, or other contributions.

Read the blog, and visit Cassandra Catalyst Program to learn more and nominate someone or apply.

6 Reasons Why You Can’t Miss This Year’s Cassandra Summit

We’re less than two weeks away from Cassandra Summit, the annual event that unites industry experts, developers, and enthusiasts to delve into the world of Apache Cassandra. Taking place in San Jose, California, December 12-13, this year’s event will also introduce AI.dev, a nexus for developers delving into the realm of open source generative AI and machine learning.

If you’re not sure whether to attend or need to convince your boss, here are some key reasons Cassandra Summit shouldn’t be missed: 

#1 Learning Opportunities

The Cassandra Summit is a goldmine of knowledge for both beginners and seasoned professionals. With a diverse range of sessions, workshops, and keynotes, attendees have the chance to deepen their understanding of Apache Cassandra. Whether you’re looking to enhance your development skills, explore advanced topics, or gain insights into real-world use cases, the Cassandra Summit schedule is packed with content for you. 

#2 Networking with Peers and Experts

The Cassandra Summit provides a forum for engaging in discussions with peers and learning from Apache Cassandra experts. There is no better way to share knowledge and spur innovation than face-to-face at a conference like Cassandra Summit.

#3 Stay Ahead of Industry Trends 

Technology is ever-evolving, and staying ahead of the curve is crucial for professionals in the field. The Cassandra Summit provides a platform to discover the latest trends, innovations, and best practices in the world of Apache Cassandra. Whether it’s exploring new features, understanding emerging use cases, or learning about integrations with other technologies, the Cassandra Summit offers a comprehensive view of the current state and future direction of Cassandra.

#4 Jam-Packed Schedule

The schedule for Cassandra Summit is filled with nearly 60 sessions hosted by speakers from companies like Amazon Web Services, Apple, Hugging Face, LlamaIndex, Microsoft, Netflix and many others. Here’s a snapshot of some of the great speakers Cassandra Summit offers: 

#5 Community Engagement and Support 

Being part of a community is integral to the success of any open-source project, and Apache Cassandra is no exception. The Cassandra Summit fosters community engagement, allowing attendees to connect with others who share a passion for Cassandra. This sense of community support is invaluable—whether you’re troubleshooting a challenge, seeking advice, or simply looking to share your experiences, the connections made at Cassandra Summit can become long-lasting.

#6 NEW AI.dev Conference

The new AI.dev: Open Source GenAI & ML Summit 2023 conference will be co-located with this year’s Cassandra Summit. This means Cassandra Summit will welcome an expanded audience that includes developers delving into the realm of open source generative AI and machine learning. Cassandra Summit and AI.dev will run simultaneously, and attendees will have access to both events with a single registration. So whether you’ve already registered or are planning to register, you’ll gain access to both events for one price.

Cassandra Summit is where the community can connect to share best practices and use cases, celebrate makers and users, forge critical relationships, and learn about advancements in Apache Cassandra. It’s more than a conference; it’s an immersive experience that offers a wealth of knowledge, networking opportunities, and hands-on learning. 

Don’t miss the chance to be a part of this year’s event; register now!

Version [1.1.5] – 2023-12-11

UAT

  • Provide Documentation updated with Sync to Eric on Team Channel

Delivery in progress

  • Add Video to Videos in the normal Channel
  • Once done delete the below item
  • Push Catalyst Program to Dev and Production 
    • https://prodcassandra.wpengine.com/new-cassandra-catalyst-program-launched-today/

Sprint 24 A 12/11/2023 - 12/15/2023

UAT

  • Provide Documentation updated with Sync to Eric on Team Channel

Delivery in progress

  • Fix Video 
  • Remove Astra and DataStax Buttons

Sprint 24 B 12/18/2023 - 12/22/2023

Cassandra Summit Preview: Guerilla Tactics for Building Scalable E-Commerce Services with Apache Cassandra®, Apache Pulsar®, and Vector Search

Aaron Ploetz is a developer advocate at DataStax. He’s been a professional software developer since 1997 and has several years of experience working on and leading DevOps teams for startups and Fortune 50 enterprises. He is a three-time Cassandra MVP and an author of the books “Seven NoSQL Databases in a Week” and “Mastering Apache Cassandra 3.x.”

The importance of a good e-commerce website is illustrated by the fact that worldwide digital sales in 2021 eclipsed five trillion dollars (USD). Most consumers will leave a web page or mobile app if it takes longer than a few seconds to load. Businesses that want to compete need a high-performing e-commerce website.

At Cassandra Summit this December, Aaron will host a talk titled, “Guerilla Tactics for Building Scalable E-Commerce Services with Apache Cassandra®, Apache Pulsar®, and Vector Search.” During his session, Aaron will cover how to architect high-performing data models and services, helping you to build an e-commerce site with high throughput and low latency. The datastore backend will be built on Apache Cassandra®, allowing you to leverage Cassandra’s well-known features like high-availability and data center awareness.

Modern e-commerce websites consist of many smaller subsystems, as shown in figure 1. These subsystems are smaller pieces of a puzzle that, when put together correctly, work together to provide a positive customer experience.

Figure 1 – The many subsystems which go into building a large-scale E-commerce website are smaller pieces of a larger whole.

However, not all of these subsystems are created equal. Some of them are “nice-to-haves” that can help maintain customer loyalty or drive additional sales. But some are critical to the function of the site overall and are absolutely essential for all e-commerce sites: Product, User Profile, Shopping Cart, and Order Processing. Aaron will walk through each of these systems during his talk:

Product 

Every e-commerce site needs a good product system. Aaron will talk about how to model one in the database, with the aim of making it easy for customers to find what they’re looking for. The product system will be further broken down into three smaller services: Category Navigation, Product Data, and Pricing.

User Profile

All websites need systems for user management and sign-in. Aaron will also talk through how to implement a seamless login process using Google single sign-on, helping to ensure a customer experience that is both convenient and secure.

Shopping Cart

Every e-commerce site has a shopping cart, but you want yours to be high-performing and easily expandable in the future. During his talk, Aaron will present ways to model the cart system to accommodate both of these requirements, so that the shopping cart runs well in a distributed environment while also being highly available and geographically aware.

Order Processing

Processing an order can be a complex task, especially at the enterprise level. Aaron will cover how to move orders between different business units using Apache Pulsar®.

In addition to covering how to build out and model these systems, we will also cover one of the “nice-to-have” subsystems, and demonstrate a simple way to build a recommendation system.

Product Recommendations

Want to drive additional sales by recommending relevant products to your customers? Legacy recommendation systems can be large, complex, and cumbersome. We will show you how to quickly build real-time recommendations using Vector Search, and we will also cover ways to generate vector embeddings.

Conclusions

In the digital age, a solid, functional, and flexible e-commerce system is paramount to success. This session will show you how to architect these systems to take advantage of Apache Cassandra®’s strengths as a distributed database. You will be well on your way to an e-commerce website that is high-performing, highly available, and ultimately highly profitable.

Aaron Ploetz will discuss this topic in more detail during one of his sessions at Cassandra Summit. A shortened, workshop version of this talk can also be found here. More details about Aaron’s Cassandra Summit sessions can be found here.

More about Cassandra Summit

Cassandra Summit 2023 Gains AI.dev as Co-located Event; NEW AI + Cassandra Track

We are excited to announce that the new AI.dev: Open Source GenAI & ML Summit 2023 conference will be co-located with Cassandra Summit this year! This means that Cassandra Summit will welcome an expanded audience that includes developers who are delving into the realm of open source generative AI and machine learning. 

And with the addition of AI.dev, a NEW AI + Cassandra track will be featured at the event. The Call for Proposals is open until 9:00 AM PDT on Monday, October 23.

Here’s what you need to know: 

WHEN + WHERE IS THIS HAPPENING?: Cassandra Summit + AI.dev will take place December 12-13, 2023 at the McEnery Convention Center in San Jose, California.

WHO SHOULD ATTEND?: Data practitioners, developers, engineers, and enthusiasts, plus developers interested in open source generative AI and machine learning.

WHAT ARE THE CFP DETAILS? The CFP for the new AI + Cassandra track is now open. This track will include lightning talks, conference sessions, panel sessions and technical workshops that delve into distributed AI using Cassandra and case studies that cover AI-powered applications using Apache Cassandra. Submit a talk today! 

HOW DO I REGISTER? Cassandra Summit and AI.dev will run simultaneously, and attendees will have access to both events with a single registration. So whether you’ve already registered or are planning to register, you’ll gain access to both events for one price. To learn more or to register, visit https://events.linuxfoundation.org/cassandra-summit/register/

Cassandra Summit is where the community can connect to share best practices and use cases, celebrate makers and users, forge critical relationships, and learn about advancements in Apache Cassandra. With the addition of AI.dev, we are excited to expand the community’s flagship event and include talks that showcase how AI and Cassandra synergize, unlocking new possibilities and enhancing data-driven solutions. 

We hope to see you soon!

Showcasing the Power of Apache Cassandra: Four Must-Attend Sessions at Community Over Code

This weekend the Apache Software Foundation’s flagship conference, Community Over Code, will kick off. The four-day, in-person event will bring the ASF and the broader open source community together in Halifax, Nova Scotia from October 7-10. We’re excited that the Apache Cassandra community will be among the projects represented, with four talks included in the event schedule.

These talks highlight some of the key features of the forthcoming 5.0 release and underscore how a powerful tool like Cassandra can be used in IoT workloads: 

Adding Vector Search to Apache Cassandra

Speaker: Jonathan Ellis

Saturday, October 7, 2023, 12:10 ADT

Vector search is a hot topic in the world of databases, and Jonathan Ellis, the founder of DataStax and former Apache Cassandra project chair, will shed light on its implementation in Cassandra. This session will explore the fundamentals of k-Nearest Neighbors (kNN) and Approximate Nearest Neighbors (ANN) vector search, introducing the Hierarchical, Navigable Small-World (HNSW) algorithm for vector indexing. You’ll gain insights into the challenges and solutions involved in adapting HNSW to Cassandra, including concurrent updates and queries. Witness the execution of supported queries with the HNSW index and other storage-attached index (SAI) predicates, and learn valuable lessons to enhance performance. As an application developer, understanding Cassandra’s vector search capabilities is crucial in today’s data-driven landscape.

Unified Compaction Strategy in Cassandra (CEP-26)

Speaker: Branimir Lambov

Sunday, October 8, 2023, 12:10 ADT

Cassandra 5.0 is set to revolutionize compaction strategies with CEP-26, offering a unified solution to address existing strategy deficiencies, improve performance, and facilitate easy reconfiguration. In this session, Branimir Lambov, a long-term Cassandra committer, will dive deep into the key features of this strategy and the rationale behind them. Learn how this strategy covers leveled, tiered, and hybrid compaction schemes, employs a flexible SSTable sharding scheme, and selects and prioritizes SSTable sets for compaction based on overlap. Whether you’re dealing with large-scale data or time-series data, CEP-26 has you covered. Discover real-world examples of its impressive performance improvements and the possibilities it unlocks.

IoT Overkill: Running a Cassandra and Kafka cluster on Open Source Hardware

Speaker: Kassian Wren

Sunday, October 8, 2023, 14:20 ADT

Open source hardware meets open source software in this session by Kassian Wren, an Open Source Technology Evangelist. Dive into the world of open source clusters, featuring a unique five-node configuration with Raspberry Pi and Orange Pi nodes. Witness the orchestration of a Docker swarm that runs Cassandra and Kafka services, distributed across worker nodes. This session isn’t just about showcasing the cluster but also delves into automation, setup, and maintenance. If you’re passionate about IoT projects and the intersection of hardware and software, this session promises to be a fascinating journey into the possibilities of open source technology.

Performance Measurement and Tuning of Cassandra 5.0 Transactions on Cloud Infrastructure

Speakers: German Eichberger and Pallavi Iyengar

Tuesday, October 10, 2023, 14:20 ADT

Cassandra 5.0 introduces transaction support based on ACCORD, necessitating new benchmarks for distributed transactional databases. German Eichberger and Pallavi Iyengar from Microsoft’s Azure Managed Instances for Apache Cassandra team will explore this topic in detail. Learn about benchmark scenarios inspired by YCSB+T’s Closed Economy Workload and delve into the challenges of cloud environments. Understand the impact of network topologies, including one-region and multi-region clusters, and discover performance-enhancing techniques like SSD-based write-through cache. Gain insights into tuning Cassandra 5.0 for optimal performance in different scenarios and compare it with previous Cassandra versions.

To see the full Community Over Code schedule, visit https://communityovercode.org/schedule.

To learn more about The ASF’s Community Over Code and register to attend, visit https://communityovercode.org

Version [1.1.3] – 2023-10-16

UAT

  • All new blogs have been deployed to prod
  • Blogs created by DS team members need to go to prod
  • Operations Manual
    • Update more sections 

Sprint 22a 10/16/2023-10/20/2023

UAT

  • All new blogs have been deployed to prod
  • Operations Manual
    • Update more sections 
  • Cassandra.link Ad placement for t-shirt
  • Allowing DS user to add editors 

Sprint 22b 10/23/2023 - 10/27/2023

UAT

  • All new blogs have been deployed to prod
  • Cassandra.link Ad placement for t-shirt
  • Allowing DS user to add editors SOP
  • Change Thumbnails for posts
  • Contribution Page Change

Sprint 23a 10/30/2023 - 11/3/2023