Growing the Cassandra Community

As the Apache Cassandra® community grows, it is important that we have a place to call home – a place where end users and contributors alike can find the resources they need to be successful with Cassandra. Whether success is enhancing how you use Cassandra to power your business, or contributing the next awesome feature to the project, each of us needs access to information and community support to achieve our goals with Cassandra.  End-users should always have the opportunity to learn new ways of using Cassandra, and contributors should have no barriers to getting involved with the project in the areas where they can offer innovation. With that in mind, we have made easy access to guiding resources a top priority for our community.

We’re excited to announce that today we have relaunched a revitalized Planet Cassandra community hub. Planet Cassandra is your central place to keep informed about the Cassandra project and find out how you can use and contribute to it. The platform aggregates content from a variety of online sources and brings documentation and learning resources into one place. The goal is to make it easier for new users to start using Cassandra, existing users to learn more, and contributors to have a greater impact.

We also want to make it easier for all users to connect and collaborate with each other. So, along with relaunching Planet Cassandra, we will also be exploring a few ways that we can do this effectively:

Update & Consolidate Documentation

We can all speak from experience that documentation, whether it’s technical and end-user docs or community and process docs, has a habit of becoming outdated with surprising regularity. To work toward the goal of providing always-current information to the community, public documentation for joining the Cassandra community will be reviewed and updated where necessary, with regularly scheduled reviews as we move forward. The documentation will also be updated to consolidate end-user and contributor information in the appropriate locations and ensure a single source of truth rather than disparate pages in different locations.

Audit & Clarify Community Platforms

We want to meet the community where they are – whether it be on Slack, Discord, Mailing Lists, or somewhere else. We will be taking a look at how the community is using each of these platforms, and with what frequency. From there, we will provide public guidance on how to use each platform. As needed, we will bolster support of new channels and move away from channels no longer serving us, to ensure we are able to easily connect with each other.

Host Regular Onboarding Meetings

Along with the clarification of the community platforms, new meetings will be added to the Cassandra community calendar. The purpose of these meetings will be to welcome and onboard new members and help them find their way as they get started. There will be separate meetings for end users and contributors, and they will be open to anyone who wishes to join, including experienced community members looking to renew their engagement.

These are just a few ideas – we want to hear from all of you! Please comment below with what you need to be successful with Cassandra.

Practitioner’s guide for using Cassandra as a real-time feature store

By Alan Ho

The following document describes best practices for using Apache Cassandra® / DataStax Astra DB as a real-time feature store. It covers primarily the performance and cost aspects of selecting a database for storing machine learning features. Real-time AI requires a database that supports both high-throughput, low-latency queries to serve features and high write throughput to update them. Under real-world conditions, Cassandra can serve features for real-time inference with a tp99 < 23 ms, and it is used by large companies such as Uber and Netflix for their feature stores.

This guide does not discuss the data science aspects of real-time machine learning, or the lifecycle management aspects of features in a feature store. The best practices we’ll cover are based on technical conversations with practitioners at large technology firms such as Google, Facebook, Uber, AirBnB, and Netflix on how they deliver real-time AI experiences to their customers on their cloud-native infrastructures. Although we’ll specifically focus on how to implement real-time feature storage with Cassandra, the architecture guidelines really apply to any database technology, including Redis, MongoDB, and Postgres. 

What is real-time AI?

Real-time AI makes inferences or trains models based on recent events. Traditionally, training models and making inferences (predictions) from them have been done in batch – typically overnight or periodically throughout the day. Today, modern machine learning systems perform inference on the most recent data in order to provide the most accurate prediction possible. A small set of companies like TikTok and Google have pushed the real-time paradigm further by including on-the-fly training of models as new data comes in.

Because of these changes in inference, and changes that will likely happen to model training, persistence of feature data – data that is used to train and perform inferences for a ML model – needs to also adapt. When you’re done reading this guide, you’ll have a clearer picture of how Cassandra and DataStax Astra DB, a managed service built on Cassandra, meets real-time AI needs, and how they can be used in conjunction with other database technologies for model inference and training.

What’s a feature store?

Life cycle of a feature store courtesy of the Feast blog

A feature store is a data system specific to machine learning (ML) that:

  • Runs data pipelines that transform raw data into feature values
  • Stores and manages the feature data itself, and
  • Serves feature data consistently for training and inference purposes

Main components of a feature store courtesy of the Feast blog

Real-time AI places specific demands on a feature store that Cassandra is uniquely qualified to fulfill, specifically when it comes to the storage and serving of features for model serving and model training.

Best Practices

Implement low latency queries for feature serving

For real-time inference, features need to be returned to applications with low latency at scale. Typical models involve ~200 features spread across ~10 entities. Real-time inference requires time to be budgeted for collecting features, lightweight data transformations, and performing the inference itself. According to the following survey (also confirmed by our conversations with practitioners), feature stores need to return features to an application performing inference in under 50 ms.

Typically, models require “inner joins” across multiple logical entities – combining row values from multiple tables that share a common key; this presents a significant challenge to low-latency feature serving. Take the case of Uber Eats, which predicts the time to deliver a meal. Order information needs to be joined with restaurant information, which is further joined with traffic information for the region around the restaurant. In this case, two inner joins are necessary (see the illustration below).

To achieve an inner join in Cassandra, one can either denormalize the data upon insertion or make two sequential queries to Cassandra and perform the join on the client side (see the sketch below). Although it’s possible to perform all inner joins upon inserting data into the database through denormalization, having a 1:1 ratio between model and table is impractical because it means maintaining an inordinate number of denormalized tables. Best practice is to allow for one to two sequential queries for inner joins, combined with denormalization.
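
Here is a minimal sketch of the “query twice and join on the client” pattern using the DataStax Python driver. The table and column names (orders, restaurants, the traffic score folded into the restaurant row) are hypothetical stand-ins for the Uber Eats example, not a real schema.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # or an Astra cloud config
session = cluster.connect("feature_store")

def fetch_delivery_features(order_id):
    # Query 1: order-level features, which carry the key of the next entity.
    order = session.execute(
        "SELECT restaurant_id, basket_size, prep_time FROM orders WHERE order_id = %s",
        (order_id,),
    ).one()

    # Query 2: restaurant-level features. The regional traffic score is assumed to be
    # denormalized into this row, so a third query (a second join) is avoided.
    restaurant = session.execute(
        "SELECT avg_prep_time, region_traffic_score FROM restaurants WHERE restaurant_id = %s",
        (order.restaurant_id,),
    ).one()

    # The "inner join" happens client-side: merge both rows into one feature vector.
    return {
        "basket_size": order.basket_size,
        "prep_time": order.prep_time,
        "avg_prep_time": restaurant.avg_prep_time,
        "region_traffic_score": restaurant.region_traffic_score,
    }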

Here is a summary of the performance metrics that can be used to estimate requirements for real-time ML pipelines:

  • Testing Conditions:
    • # features = 200
    • # of tables (entities) = 3
    • # inner join = 2 
    • Query TPS : 5000 queries / second
    • Write TPS : 500 records / second
    • Cluster Size : 3 nodes on AstraDB*
  • Latency Performance summary (uncertainties here are standard deviations):
    • tp95 = 13.2(+/-0.6) ms
    • tp99 = 23.0(+/-3.5) ms
    • tp99.9 = 63(+/- 5) ms
    • Effect of compaction:
      • tp95 = negligible
      • tp99, tp999 = negligible, captured by the sigmas quoted above
    • Effect of Change Data Capture (CDC):
      • tp50, tp95 ~ 3-5 ms
      • tp99 ~ 3 ms
      • tp999 ~ negligible

* The following tests were done on DataStax’s Astra DB’s free tier, which is a serverless environment for Cassandra. Users should expect similar latency performance when deployed on 3 nodes using the following recommended settings.

The most significant impact on latency is the number of inner joins. If only one table is queried instead of three, the tp99 drops by 58%; for two tables, it is 29% less. The tp95 drops by 56% and 21% respectively. Because Cassandra is horizontally scalable, querying for more features does not significantly increase the average latency, either.

Lastly, if the latency requirements cannot be met out of box, Cassandra has two additional features: the ability to support denormalized data (and thus reduce inner joins) due to high write-throughput capabilities, and the ability to selectively replicate data to in-memory caches (e.g. Redis) through Change Data Capture. You can find more tips to reduce latency here.

Implement fault tolerant and low latency writes for feature transformations

A key component of real-time AI is the ability to use the most recent data for doing a model inference, so it is important that new data is available for inference as soon as possible. At the same time, for enterprise use cases, it is important that the writes are durable because data loss can cause significant production challenges.

Suggested deployment architecture to enable low-latency feature transformation for inference

*Object store (e.g. S3 or HIVE) can be replaced with other types of batch oriented systems such as data warehouses.

There is a trade off between low latency durable writes and low latency feature serving. For example, it is possible to only store the data in a non-durable location (e.g. Redis), but production failures can make it difficult to recover the most up-to-date features because it would require a large recomputation from raw events.

A common architecture suggests writing features to an offline store (e.g. Hive / S3), and replication of the features to an online store (e.g. in-memory cache). Even though this provides durability and low latency for feature serving, it comes at the cost of introducing latency for feature writes, which invariably causes poorer prediction performance.

Databricks Reference Architecture for Real-time AI

Cassandra provides a good trade-off between low-latency feature serving and low-latency “durable” feature writes. Data written to Cassandra is typically replicated a minimum of three times, and it supports multi-region replication. The latency from writing to availability to read is typically sub-millisecond. As a result, by persisting features directly to the online store (Cassandra) and bypassing the offline store, the application has faster access to recent data to make more accurate predictions. At the same time, CDC from the online store to the offline store allows for batch training or data exploration with existing tools.
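
As a rough illustration of writing features directly to the online store, here is a minimal sketch using the DataStax Python driver; the features_by_entity table and its columns are assumptions for the example, and the CDC fan-out to the offline store is not shown.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feature_store")

# Prepared statement: low-latency, durable writes (replicated per the keyspace RF).
insert_feature = session.prepare(
    "INSERT INTO features_by_entity (entity_id, feature_name, value, updated_at) "
    "VALUES (?, ?, ?, toTimestamp(now()))"
)

def write_features(entity_id, features):
    # One lightweight write per feature; recent values become readable almost immediately.
    for name, value in features.items():
        session.execute(insert_feature, (entity_id, name, value))

write_features("user:42", {"orders_last_hour": 3.0, "avg_basket_value": 27.5})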

Implement low-latency writes for prediction caching and performance monitoring

In addition to storing feature transformations, there is also a need to store predictions and other tracking data for performance monitoring.

There are several use cases for storing predictions:

  1. Prediction store – In this scenario, a database is used to cache predictions made by either a batch system or a streaming system. The streaming architecture is particularly useful when the time it takes to perform an inference is beyond what is acceptable in a request-response system.
  2. Prediction performance monitoring – It is often necessary to monitor the prediction output of a real-time inference and compare it to the final outcome. This means having a database to log both the prediction and the final result.

Cassandra is a suitable store for both use cases because of its high-write throughput capabilities. 
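
A minimal sketch of both patterns with the Python driver, assuming hypothetical prediction_cache and prediction_log tables:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ml_monitoring")

# 1. Prediction store: cache the latest prediction per entity for fast request-response reads.
session.execute(
    "INSERT INTO prediction_cache (entity_id, model_version, prediction, predicted_at) "
    "VALUES (%s, %s, %s, %s)",
    ("user:42", "v7", 0.83, datetime.now(timezone.utc)),
)

# 2. Performance monitoring: log the prediction alongside the eventual outcome for comparison.
session.execute(
    "INSERT INTO prediction_log (entity_id, predicted_at, prediction, actual) "
    "VALUES (%s, %s, %s, %s)",
    ("user:42", datetime.now(timezone.utc), 0.83, 1.0),
)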

Plan for elastic read and write workloads

The level of query and write transactions per second usually depends on the number of users simultaneously using the system. As a result, workloads may change based on the time of day or time of year. Having the ability to quickly scale up and scale down the cluster to support increased workloads is important. Cassandra and Astra DB have features that enable dynamic cluster scaling.

The second aspect that can affect write workloads is a change in the feature transformation logic. With a large spike in write workloads, Cassandra automatically prioritizes maintaining low-latency queries and write TPS over data consistency, which is typically acceptable for performing real-time inference.

Implement low-latency, multi-region support

As real-time AI becomes ubiquitous across all apps, it’s important to make sure that feature data is available as close as possible to where inference occurs. This means having the feature store in the same region as the application doing inference. Replicating the feature store across regions helps ensure that features are available wherever inference happens. Furthermore, replicating just the features rather than the raw data used to generate them significantly cuts down on cloud egress fees.

Astra DB supports multi-region replication out of the box, with a replication latency in the milliseconds. Our recommendation is to stream all the raw event data to a single region, perform the feature generation, and store and replicate the features to all other regions.

Although one can theoretically gain some latency advantage by generating features in each region, event data often needs to be joined with raw event data from other regions; from a correctness and efficiency standpoint, it is easier to ship all events to one region for processing in most use cases. On the other hand, if the model is used in a regional context and most events are associated with region-specific entities, it makes sense to treat features as region-specific. Any events that do need to be replicated across regions can be placed in keyspaces with global replication strategies (see the sketch below), but ideally this should be a small subset of events. At a certain point, replicating event tables globally becomes less efficient than simply shipping all events to a single region for feature computation.
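
For illustration, here is how region-local and globally replicated keyspaces might be declared; the datacenter names (us_east, eu_west) and replication factors are placeholders for your own topology, not a recommendation.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Features computed, stored, and served within a single region.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS features_us_east
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3}
""")

# The small subset of events/features that genuinely needs to be available everywhere.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS features_global
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 3}
""")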

Plan for cost-effective and low-latency multi-cloud support

Multi-cloud support increases the resilience of applications and allows customers to negotiate lower prices. Single-cloud online stores such as DynamoDB not only result in increased latency for retrieving features and significant data egress costs, but also create lock-in to a single cloud vendor.

Open source databases that support replication across clouds provide the best balance of performance and cost. To minimize egress costs, events and feature generation should be consolidated into one cloud, and feature data should then be replicated to open source databases running in the other clouds.

Plan for both batch and real-time training of production models

Suggested deployment architecture to enable low-latency feature transformation for inference

Batch processing infrastructure for building models is used for two purposes: building and testing new models, and building models for production. It has therefore typically been sufficient to store feature data in slower object stores for training. However, newer training paradigms update models in real time or near real time; this is known as “online learning” (e.g. TikTok’s Monolith). The access pattern for real-time training sits somewhere between inference and traditional batch training: the throughput requirements are higher than for inference (because training does not usually access a single row at a time), but not as high as for batch processing involving full table scans.

Cassandra can support throughput in the hundreds of thousands of transactions per second (with an appropriate data model), which is enough for most real-time training use cases. If the user instead wants to feed real-time training from an object store, Cassandra can replicate the data there through CDC to object storage. For batch training, CDC should likewise replicate data to object storage. It’s worth noting that machine learning frameworks like TensorFlow and PyTorch are particularly optimized for parallel training of ML models from object storage.

For a more detailed explanation of “online learning,” see Chip Huyen’s explanation on Continual Learning, or this technical paper from Gomes et al.

Support for Kappa architecture

Kappa architecture is gradually replacing Lambda architecture because of the cost of maintaining two pipelines and the data quality issues caused by online/offline skew. Although many articles discuss the advantages of moving from separate batch and real-time computation layers to a single real-time layer, they rarely describe how to architect the serving layer.

Using Kappa architecture for generating features brings up some new considerations:

  1. Features are updated en masse, which can result in a significant number of writes to the database. It’s important to ensure that query latency does not suffer during these large updates.
  2. The serving layer still needs to support different types of queries, including low-latency queries for inference and high-TPS queries for batch training of models.

Cassandra supports Kappa architecture in the following ways:

  • Cassandra is designed for writes; an increased influx of writes does not significantly reduce query latency. Cassandra opts to process writes with eventual consistency instead of strong consistency, which is typically acceptable for making predictions.
  • Using CDC, data can be replicated to object storage for training and in-memory storage for inference. CDC has little impact on the latency of queries to Cassandra.

Support for Lambda architecture

Most companies have a Lambda architecture, with a batch layer pipeline that’s separate from the real-time pipeline. There are several categories of features in this scenario:

  1. Features that are only computed in real time, and replicated to the batch feature store for training
  2. Features that are only computed in batch, and are replicated to the real-time feature store 
  3. Features that are computed in real time first, then recomputed in batch. Discrepancies are then updated in both the real-time feature store and the object store.

In this scenario, however, DataStax recommends the architecture as described in this illustration:

The reasons are the following:

  1. Cassandra is designed to take batch uploads of data with little impact on read latency
  2. By having a single system of record, data management becomes significantly easier than if the data is split between the feature store and object store. This is especially important for features that are first computed in real time, then recomputed in batch.
  3. When exporting data from Cassandra via CDC to the object feature store, the data export can be optimized for batch training (a common pattern used at companies like Facebook), which significantly cuts down on training infrastructure costs.

If it is not possible to update existing pipelines, or there are specific reasons that the features need to be in the object store first, our recommendation is to go with a two-way CDC path between the Cassandra feature store and the object store, as illustrated below.

Ensure compatibility with existing ML software ecosystem

To use Cassandra as a feature store, it should be integrated with two portions of the ecosystem: machine learning libraries that perform inference and training, and data processing libraries that perform feature transformation.

The two most popular frameworks for machine learning are TensorFlow and PyTorch. Cassandra has Python drivers that enable easy retrieval of features from the database, and multiple features can be fetched in parallel (see this example code and the sketch below). The two most popular frameworks for performing feature transformation are Flink and Spark Structured Streaming, and Cassandra connectors are available for both. Practitioners can use tutorials for Flink and Spark Structured Streaming with Cassandra.
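
As a rough sketch of parallel feature retrieval with the Python driver’s asynchronous API (the table and key names below are hypothetical):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feature_store")

queries = {
    "user":       ("SELECT * FROM user_features WHERE user_id = %s",             ("user:42",)),
    "restaurant": ("SELECT * FROM restaurant_features WHERE restaurant_id = %s", ("rest:7",)),
    "region":     ("SELECT * FROM region_features WHERE region_id = %s",         ("sfo",)),
}

# Issue every query without blocking, then collect the results as they complete.
futures = {name: session.execute_async(cql, params) for name, (cql, params) in queries.items()}
feature_rows = {name: future.result().one() for name, future in futures.items()}

# feature_rows can now be flattened into the input tensor for TensorFlow or PyTorch.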

Open source feature stores such as Feast also have a connector and tutorial for Cassandra.

Understand query patterns and throughput to determine costs

Various models of real-time inference courtesy of Swirl.ai

The number of read queries to Cassandra as a feature store depends on the number of incoming inference requests. If the feature data is split across multiple tables that can be queried in parallel, this gives an estimate of the fan-out between real-time inferences and database queries. For example, 200 features across 10 entities in 10 separate tables gives roughly a 1:10 ratio between real-time inferences and queries to Cassandra.

Calculating the number of inferences being performed depends on the inference traffic pattern. For example, in the case of “streaming inference,” an inference is performed whenever a relevant feature changes, so the total number of inferences depends on how often the feature data changes. When inference is performed in a “request-reply” setting, it is only performed when a user requests it.

Understand batch and realtime write patterns to determine costs

Write throughput is primarily driven by how frequently the features change. If denormalization is used, this also multiplies the number of feature rows written. Other write-throughput considerations include caching inferences for batch or streaming inference scenarios. The sketch below combines these read and write factors into a rough estimate.
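
A back-of-the-envelope sketch, using the 1:10 inference-to-query ratio from the example above; every input value is an assumption to replace with your own traffic numbers.

# Rough sizing estimate for a Cassandra feature store.
inference_tps = 500            # inferences per second (request-reply or streaming)
tables_per_inference = 10      # entities/tables read per inference (the 1:10 fan-out above)
feature_update_tps = 50        # upstream feature changes per second
denormalization_factor = 3     # extra rows written per change due to denormalized tables
cache_predictions = True       # whether predictions are written back for caching

read_qps = inference_tps * tables_per_inference
write_qps = feature_update_tps * denormalization_factor + (inference_tps if cache_predictions else 0)

print(f"Estimated read QPS:  {read_qps}")   # 5000 with the numbers above
print(f"Estimated write QPS: {write_qps}")  # 650 with the numbers above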

Conclusion

When designing a real-time ML pipeline, special attention needs to be paid to the performance and scalability of the feature store. These requirements are particularly well satisfied by NoSQL databases such as Cassandra. Stand up your own feature store with Cassandra or Astra DB and try out Feast.dev with the Cassandra connector.

New Cassandra Catalyst Program Launched Today

Today the Apache Cassandra community launched the first-ever Cassandra Catalyst program, an effort that aims to recognize individuals who invest in the growth of the Apache Cassandra community by enthusiastically sharing their expertise, encouraging participation, and creating a welcoming environment.

Catalysts are trustworthy, expert contributors with a passion for connecting and empowering others with Cassandra knowledge. They must be able to demonstrate strong knowledge of Cassandra through production deployments, educational material, conference talks, or other contributions.

Read the blog, and visit Cassandra Catalyst Program to learn more and nominate someone or apply.

6 Reasons Why You Can’t Miss This Year’s Cassandra Summit 

We’re less than 2 weeks away from Cassandra Summit, the annual event that unites industry experts, developers, and enthusiasts to delve into the world of Apache Cassandra. Taking place in San Jose, California December 12-13, this year’s event will also introduce AI.dev, a nexus for developers delving into the realm of open source generative AI and machine learning. 

If you’re not sure whether to attend or need to convince your boss, here are some key reasons Cassandra Summit shouldn’t be missed: 

#1 Learning Opportunities

The Cassandra Summit is a goldmine of knowledge for both beginners and seasoned professionals. With a diverse range of sessions, workshops, and keynotes, attendees have the chance to deepen their understanding of Apache Cassandra. Whether you’re looking to enhance your development skills, explore advanced topics, or gain insights into real-world use cases, the Cassandra Summit schedule is packed with content for you. 

#2 Networking with Peers and Experts

The Cassandra Summit provides a forum for engaging in discussions with peers and learning from Apache Cassandra experts. There is no better way to share knowledge and spur innovation than face-to-face at a conference like Cassandra Summit.

#3 Stay Ahead of Industry Trends 

Technology is ever-evolving, and staying ahead of the curve is crucial for professionals in the field. The Cassandra Summit provides a platform to discover the latest trends, innovations, and best practices in the world of Apache Cassandra. Whether it’s exploring new features, understanding emerging use cases, or learning about integrations with other technologies, the Cassandra Summit offers a comprehensive view of the current state and future direction of Cassandra.

#4 Jam Packed Schedule 

The schedule for Cassandra Summit is filled with nearly 60 sessions hosted by speakers from companies like Amazon Web Services, Apple, Hugging Face, LlamaIndex, Microsoft, Netflix and many others. Here’s a snapshot of some of the great speakers Cassandra Summit offers: 

#5 Community Engagement and Support 

Being part of a community is integral to the success of any open-source project, and Apache Cassandra is no exception. The Cassandra Summit fosters community engagement, allowing attendees to connect with others who share a passion for Cassandra. This sense of community support is invaluable—whether you’re troubleshooting a challenge, seeking advice, or simply looking to share your experiences, the connections made at Cassandra Summit can become long-lasting.

#6 NEW AI.dev Conference

The new AI.dev: Open Source GenAI & ML Summit 2023 conference will be co-located with this year’s Cassandra Summit. This means that Cassandra Summit will welcome an expanded audience that includes developers who are delving into the realm of open source generative AI and machine learning. Cassandra Summit and AI.dev will run simultaneously, and attendees will have access to both events with a single registration. So whether you’ve already registered or are planning to register, you’ll gain access to both events for one price.

Cassandra Summit is where the community can connect to share best practices and use cases, celebrate makers and users, forge critical relationships, and learn about advancements in Apache Cassandra. It’s more than a conference; it’s an immersive experience that offers a wealth of knowledge, networking opportunities, and hands-on learning. 

Don’t miss the chance to be a part of this year’s event; register now!

Cassandra Summit Preview: Guerilla Tactics for Building Scalable E-Commerce Services with Apache Cassandra®, Apache Pulsar®, and Vector Search

Aaron Ploetz is a developer advocate at DataStax. He’s been a professional software developer since 1997 and has several years of experience working on and leading DevOps teams for startups and Fortune 50 enterprises. He is a three-time Cassandra MVP, and has worked as an author on the books “Seven NoSQL Databases in a Week” and “Mastering Apache Cassandra 3.x.” 

The importance of a good e-commerce website is illustrated by the fact that worldwide digital sales in 2021 eclipsed five trillion dollars (USD). Most consumers will leave a web page or a mobile app if it takes longer than a few seconds to load. Businesses that want to compete need a high-performing e-commerce website.

At Cassandra Summit this December, Aaron will host a talk titled, “Guerilla Tactics for Building Scalable E-Commerce Services with Apache Cassandra®, Apache Pulsar®, and Vector Search.” During his session, Aaron will cover how to architect high-performing data models and services, helping you to build an e-commerce site with high throughput and low latency. The datastore backend will be built on Apache Cassandra®, allowing you to leverage Cassandra’s well-known features like high-availability and data center awareness.

Modern E-commerce websites consist of many smaller subsystems, as shown in figure 1. These subsystems are really just smaller pieces of a puzzle that, when put together correctly, are capable of working together to provide a positive customer experience.

Figure 1 – The many subsystems which go into building a large-scale E-commerce website are smaller pieces of a larger whole.

However, not all of these subsystems are created equally. Some of them are “nice-to-haves,” which can help maintain customer loyalty or drive additional sales. But some are critical to the function of the site overall, and are absolutely essential for all E-commerce sites to have. These critical systems are Product, User Profile, Shopping Cart, and Order Processing. Aaron will walk through each of these systems during his talk: 

Product 

Every e-commerce site needs a good product system. Aaron will talk about how to model that in the database, with the aim of making it easy for customers to find what they are looking for. The product system will be further broken down into three smaller services: Category Navigation, Product Data, and Pricing.

User Profile

All websites need systems for user management and sign-in. Aaron will also talk through how to implement a seamless login process using Google single sign-on, helping to ensure a customer experience that is both convenient and secure.

Shopping Cart

Every E-commerce site has a shopping cart, but you want yours to be high-performing and easily expandable in the future. During his talk, Aaron will present ways to model the cart system to accommodate both of these requirements. This way the shopping cart runs well in a distributed environment, while also being highly-available and geographically aware.

Order Processing

Processing an order can be a complex task, especially at the enterprise level. Aaron will cover how to move orders between different business units using Apache Pulsar®.

In addition to covering how to build out and model these systems, we will also cover one of the “nice-to-have” subsystems, and demonstrate a simple way to build a recommendation system.

Product Recommendations

Want to drive additional sales by recommending valid products to your customers? Legacy recommendation systems can be large, complex, and cumbersome. But we can show you how to quickly build real-time recommendations using Vector Search. We will also cover ways to generate vector embeddings.

Conclusions

In the digital age, having a solid, functional, and flexible e-commerce system is paramount to success. This session will show you how to architect these systems to take advantage of Apache Cassandra®’s features in the distributed database world. You will be well on your way to an e-commerce website that is high-performing, highly available, and ultimately highly profitable.

Aaron Ploetz will discuss this topic in more detail during one of his sessions at Cassandra Summit. A shortened, workshop version of this talk can also be found here. More details about Aaron’s Cassandra Summit sessions can be found here.

More about Cassandra Summit

Cassandra Summit 2023 Gains Ai.dev as Co-located Event; NEW AI + Cassandra Track

We are excited to announce that the new AI.dev: Open Source GenAI & ML Summit 2023 conference will be co-located with Cassandra Summit this year! This means that Cassandra Summit will welcome an expanded audience that includes developers who are delving into the realm of open source generative AI and machine learning. 

And with the addition of AI.dev, a NEW AI + Cassandra track will be featured at the event. The Call for Proposals is open until 9:00 AM PDT on Monday, October 23.

Here’s what you need to know: 

WHEN + WHERE IS THIS HAPPENING?: Cassandra Summit + AI.dev will take place December 12-13, 2023 at the McEnery Convention Center in San Jose, California.

WHO SHOULD ATTEND?: data practitioners, developers, engineers and enthusiasts + developers who are interested in open source generative AI and machine learning. 

WHAT ARE THE CFP DETAILS? The CFP for the new AI + Cassandra track is now open. This track will include lightning talks, conference sessions, panel sessions and technical workshops that delve into distributed AI using Cassandra and case studies that cover AI-powered applications using Apache Cassandra. Submit a talk today! 

HOW DO I REGISTER? Cassandra Summit and AI.dev will run simultaneously, and attendees will have access to both events with a single registration. So whether you’ve already registered or are planning to register, you’ll gain access to both events for one price. To learn more or to register, visit https://events.linuxfoundation.org/cassandra-summit/register/

Cassandra Summit is where the community can connect to share best practices and use cases, celebrate makers and users, forge critical relationships, and learn about advancements in Apache Cassandra. With the addition of AI.dev, we are excited to expand the community’s flagship event and include talks that showcase how AI and Cassandra synergize, unlocking new possibilities and enhancing data-driven solutions. 

We hope to see you soon!

Showcasing the Power of Apache Cassandra: Four Must-Attend Sessions at Community Over Code

This weekend the Apache Software Foundation’s flagship conference, Community Over Code, will kick off. The four-day, in-person event will bring the ASF and the broader open source community together in Halifax, Nova Scotia from October 7-10. We’re excited that the Apache Cassandra community will be among the projects represented, with four talks included in the event schedule.

These talks highlight some of the key features of the forthcoming 5.0 release and underscore how a powerful tool like Cassandra can be used in IoT workloads: 

Adding Vector Search to Apache Cassandra

Speaker: Jonathan Ellis

Saturday, October 7, 2023, 12:10 ADT

Vector search is a hot topic in the world of databases, and Jonathan Ellis, the founder of DataStax and former Apache Cassandra project chair, will shed light on its implementation in Cassandra. This session will explore the fundamentals of k-Nearest Neighbors (kNN) and Approximate Nearest Neighbors (ANN) vector search, introducing the Hierarchical, Navigable Small-World (HNSW) algorithm for vector indexing. You’ll gain insights into the challenges and solutions involved in adapting HNSW to Cassandra, including concurrent updates and queries. Witness the execution of supported queries with the HNSW index and other storage-attached index (SAI) predicates, and learn valuable lessons to enhance performance. As an application developer, understanding Cassandra’s vector search capabilities is crucial in today’s data-driven landscape.

Unified Compaction Strategy in Cassandra (CEP-26)

Speaker: Branimir Lambov

Sunday, October 8, 2023, 12:10 ADT

Cassandra 5.0 is set to revolutionize compaction strategies with CEP-26, offering a unified solution to address existing strategy deficiencies, improve performance, and facilitate easy reconfiguration. In this session, Branimir Lambov, a long-term Cassandra committer, will dive deep into the key features of this strategy and the rationale behind them. Learn how this strategy covers leveled, tiered, and hybrid compaction schemes, employs a flexible SSTable sharding scheme, and selects and prioritizes SSTable sets for compaction based on overlap. Whether you’re dealing with large-scale data or time-series data, CEP-26 has you covered. Discover real-world examples of its impressive performance improvements and the possibilities it unlocks.

IoT Overkill: Running a Cassandra and Kafka cluster on Open Source Hardware

Speaker: Kassian Wren

Sunday, October 8, 2023, 14:20 ADT

Open source hardware meets open source software in this session by Kassian Wren, an Open Source Technology Evangelist. Dive into the world of open source clusters, featuring a unique five-node configuration with Raspberry Pi and Orange Pi nodes. Witness the orchestration of a Docker swarm that runs Cassandra and Kafka services, distributed across worker nodes. This session isn’t just about showcasing the cluster but also delves into automation, setup, and maintenance. If you’re passionate about IoT projects and the intersection of hardware and software, this session promises to be a fascinating journey into the possibilities of open source technology.

Performance Measurement and Tuning of Cassandra 5.0 Transactions on Cloud Infrastructure

Speakers: German Eichberger and Pallavi Iyengar

Tuesday, October 10, 2023, 14:20 ADT

Cassandra 5.0 introduces transaction support based on ACCORD, necessitating new benchmarks for distributed transactional databases. German Eichberger and Pallavi Iyengar from Microsoft’s Azure Managed Instances for Apache Cassandra team will explore this topic in detail. Learn about benchmark scenarios inspired by YCSB+T’s Closed Economy Workload and delve into the challenges of cloud environments. Understand the impact of network topologies, including one-region and multi-region clusters, and discover performance-enhancing techniques like SSD-based write-through cache. Gain insights into tuning Cassandra 5.0 for optimal performance in different scenarios and compare it with previous Cassandra versions.

To see the full Community Over Code schedule, visit https://communityovercode.org/schedule.

To learn more about The ASF’s Community Over Code and register to attend, visit https://communityovercode.org

Cassandra Summit Preview: The Art and Science of Cassandra Performance Tuning

Database tuning can be intimidating if you don’t know where to start, and it’s even harder when it’s a distributed database like Cassandra. Using Cassandra’s out of the box configuration works well for laptops but falls far short of its total potential in production. Cassandra’s distributed nature presents a unique challenge when it comes to understanding its performance bottlenecks and intricacies. 

Apache Cassandra committer Jon Haddad is hosting a talk on Cassandra performance tuning at Cassandra Summit which takes place Dec. 12-13 in San Jose, California. Drawing from a decade of hands-on experience tuning some of the world’s biggest clusters in streaming media, banking, and gaming, Jon will share some of the most important lessons he has learned to help Apache Cassandra users minimize their query latency. The talk will include real-world anecdotes, battle-tested strategies, and time-honed insights including:

  • How to harness the power of observability and proactively identify and address potential issues before they escalate; 
  • How to use Linux tooling to gain insights and effectively uncover hidden performance bottlenecks to optimize your Cassandra clusters;
  • What simple yet powerful profiling techniques enable users to better understand a Cassandra cluster’s behavior; and
  • How visualizations like flame graphs can help users quickly uncover both internal and system level bottlenecks. 

To see Jon talk a bit about Cassandra performance tuning, you can watch this recording of a recent Apache Cassandra Town Hall.

More about Cassandra Summit

Loading Streaming Data into Cassandra Using Spark Structured Streaming

When creating real-time data platforms, data streaming is a low-latency, high-throughput method of moving data. Where batch processing methods necessarily introduce delays in order to gather a batch’s worth of data, stream processing methods act on stream events as they occur, with as little delay as possible. In this blog and the associated repo, we will discuss how streaming data can be loaded into Cassandra, with Spark Structured Streaming as an intermediary. Cassandra is designed for high-volume interactions and is thus a great fit for streaming workflows. For simplicity and speed, we are using DataStax’s AstraDB in this demo.

Introduction 

Streaming data is normally incompatible with standard SQL and NoSQL databases, since it can consist of differently structured data with messages differentiated only by timestamp. With advances in database technologies and continuous development, many databases have evolved to better accommodate streaming data use cases. Additionally, there are specialized databases, such as time-series databases and stream processing systems, that are designed explicitly for handling streaming data with high efficiency and low latency.

AstraDB, however, is neither of those things. In order to overcome these hurdles, we need to make sure that the specific data stream we are working with has a rigid schema and that the timestamps associated with adding the individual messages to the stream are either all stored or all thrown out.

The Basics

We will create our example stream from the Alpha Vantage stock API, ensuring that all stream messages will have the same format. Inside those messages is a timestamp corresponding to when the data was generated, which we will use to organize the data so that we can ignore the timestamp associated with adding the message to the stream.

An API by itself is not a stream, so we turn it into one. We push this stream to Astra Streaming, Astra’s managed version of Apache Pulsar, but this alone does not mirror our data into Astra DB. So we set up a Spark cluster with the Spark Pulsar connector and the Spark Cassandra connector and write a program that ingests the stream, transforms the data to match the AstraDB table schema, and loads it into the database. A condensed sketch of that program appears below.
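
The sketch below condenses the idea behind pyspark_stream_read.py: read the Pulsar topic, reshape the JSON payload to match the stock_data table, and stream it into the database. The topic name, JSON field names, and connector options are assumptions that depend on your tenant, schema, and connector versions (Astra Streaming also requires authentication options not shown here), so treat it as illustrative rather than the repo’s exact code.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (StructType, StringType, TimestampType,
                               DecimalType, IntegerType)

spark = SparkSession.builder.appName("pulsar-to-cassandra").getOrCreate()

# Schema mirroring the spark_streaming.stock_data table created later in this post.
schema = (StructType()
          .add("symbol", StringType())
          .add("time", TimestampType())
          .add("open", DecimalType(18, 4))
          .add("high", DecimalType(18, 4))
          .add("low", DecimalType(18, 4))
          .add("close", DecimalType(18, 4))
          .add("volume", IntegerType()))

# Read the raw Pulsar topic as a stream.
raw = (spark.readStream.format("pulsar")
       .option("service.url", "pulsar+ssl://<broker-service-url>")
       .option("topics", "persistent://<tenant>/<namespace>/<topic>")
       .load())

# Each Pulsar message value is a JSON blob; parse it into typed columns.
stock = (raw.select(from_json(col("value").cast("string"), schema).alias("row"))
         .select("row.*"))

# Continuously write the parsed rows into the Cassandra/Astra table.
query = (stock.writeStream
         .format("org.apache.spark.sql.cassandra")
         .option("keyspace", "spark_streaming")
         .option("table", "stock_data")
         .option("checkpointLocation", "/tmp/checkpoints/stock_data")
         .start())
query.awaitTermination()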

The Alpha Vantage API

The Alpha Vantage API is a REST API that provides historical data on stock values and trades over time intervals. We can query this API for the recent history of a specific stock symbol.
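
For example, a single intraday request might look like the sketch below; the query parameters follow Alpha Vantage’s public TIME_SERIES_INTRADAY documentation, and the API key is the one you sign up for later in this post.

import requests

API_KEY = "<your-alphavantage-api-key>"

resp = requests.get(
    "https://www.alphavantage.co/query",
    params={
        "function": "TIME_SERIES_INTRADAY",  # recent intraday history
        "symbol": "IBM",
        "interval": "5min",
        "apikey": API_KEY,
    },
    timeout=30,
)
resp.raise_for_status()

# The 5-minute bars live under a key named after the interval.
series = resp.json().get("Time Series (5min)", {})
print(f"Fetched {len(series)} five-minute bars for IBM")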

Astra Streaming

Astra Streaming is Astra’s managed Apache Pulsar instance. It can be used to read or write data streams, and can be connected to in a variety of ways. In this demo, we use the Pulsar Python client and the Spark Pulsar connector; a minimal producer sketch is shown below.
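
Here is a minimal sketch of producing messages to an Astra Streaming topic with the Pulsar Python client. The broker URL, token, and topic name are placeholders you collect from the Astra Streaming UI later in this post.

import json
import pulsar

client = pulsar.Client(
    "pulsar+ssl://<broker-service-url>",
    authentication=pulsar.AuthenticationToken("<astra-streaming-token>"),
)
producer = client.create_producer("persistent://<tenant>/<namespace>/<topic>")

# Each message is one JSON-encoded 5-minute stock bar.
producer.send(json.dumps({"symbol": "IBM", "time": "2023-11-01 12:00:00",
                          "open": 145.1, "high": 145.9, "low": 144.8,
                          "close": 145.2, "volume": 120000}).encode("utf-8"))
client.close()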

Apache Spark

Spark is a distributed data processing tool. It is best known for its batch processing capabilities, but Spark Streaming is one of its core components. Spark Streaming allows Spark users to process stream data using the same functions and methods that are used to process batch data. It does this by turning an incoming stream into a series of micro-batches and processing those with latency low enough to be compatible with data streams.


Astra DB

Astra is a managed database based on Apache Cassandra. It retains Cassandra’s horizontal scalability and distributed nature and comes with a number of other advantages. Today, we will use Astra as the final destination for our data.

Here’s a quick look at what will be happening in this demo.

Prerequisites

For this tutorial you will need:

  • An Astra account from DataStax, or enough familiarity with Cassandra to use an alternative Cassandra database. Sign up for a free tier Astra account here
  • A tenant in Astra Streaming, or access to an Apache Pulsar cluster
  • An environment in which to run Python code. Make sure that you can install new pip modules in this environment. We recommend Gitpod or your local machine.
  • A Spark cluster. This demo contains the files and instructions necessary to create a single-worker cluster.
  • An Alpha Vantage API key. You can apply for one here: https://www.alphavantage.co/

For an effortless setup, we have provided a Gitpod quickstart option – though you’ll still need to fill in your own credentials before it will run seamlessly. Simply click on the “Open in Gitpod” button found in our GitHub repository to get started, or go to this link to open the repo in Gitpod. When creating the workspace for this project, it is advantageous to select the large machine class to ensure there is enough memory on the machine to run the Spark cluster.

Before you can proceed further, you will need to set up your Astra database. After creating a free account, you will need to create a database within that account and a keyspace within that database. You will also need to create a tenant and topic within Astra Streaming. All of this can be done purely through the Astra UI.

Creating your AstraDB

Setting up your Astra DB

There is a ton of great documentation for how to create an Astra Database which is included in the Knowledge Resources at the bottom of this STACK page.

In brief, go to astra.datastax.com, create or sign in to your account, and create a new DB – they are free for most users. For this demo, we use a database called stock_data and a keyspace named spark_streaming. To create that database, select the “Databases” tab (shown on the left menu), then click on the “Create Database” button (shown on the right) and fill out the needed information for your database:

Once you’ve created your database, you’ll need to generate the token or Secure Connect Bundle used to connect to your database from the Connect tab. Choose the permissions that make the most sense for your use case. For this demo, there’s nothing wrong with choosing Database Administrator, but you can also go as simple as a Read/Write Service account to get the functionality you need.

Never share your token or bundle with anyone. It contains several pieces of data about your database and can be used to access it.

Reminder: for this demo, the assumed name of the keyspace is spark_streaming.

Establishing the Schema

Once the keyspace has been created, we need to create the table that we will be using, named stock_data. Open the CQL Console for the database and enter these lines to create the table:

CREATE TABLE spark_streaming.stock_data (
    symbol text,
    time timestamp,
    open decimal,
    high decimal,
    low decimal,
    close decimal,
    volume int,
    PRIMARY KEY ((symbol), time)
);

Next, we need to create the resources we need to connect to the Astra database. Hit the Connect button on the UI and download the Secure Connect Bundle (SCB). Then hit the “Create a Token” button to create a Database Administrator token and download the text that it returns. 

Load the SCB into the environment and put the path to it in the second line of auth.py, between the single quotes. Put the generated ID (Client_ID) for the Database Admin token on the third line, and the generated secret (Client_Secret) for the token on the fourth line.

Creating your Astra Streaming Topic

From the homepage of Datastax Astra, select streaming in the sidebar on the left.

Select the create tenant button and fill out the form to create a tenant.

Open your tenant, scroll down within the Connect screen, and copy out the Broker Service URL and Web Service URL. Then go to Settings on the top bar and scroll to Token Management. Copy your token and paste it into auth.py on line 8. You can then go back to Namespace and Topic and create a namespace and a topic within it. Then go to the Try Me tab and consume from your created topic. You will see the stream data here once our program reads it from Alpha Vantage.

Connecting to the Alpha Vantage API

Go to alphavantage.co and click to get a free API key.

Sign up for an API key. When it arrives, paste your key into auth.py, on line 6. With all the credentials collected, auth.py should look roughly like the sketch below.
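
This hypothetical layout of auth.py is only an illustration of where the values from the previous steps go; the variable names are placeholders, not the repo’s actual names, while the line positions match the line numbers referenced above.

# auth.py (hypothetical layout)
SECURE_CONNECT_BUNDLE_PATH = ''   # line 2: path to your Secure Connect Bundle (SCB)
ASTRA_CLIENT_ID = ''              # line 3: generated Client_ID for the Database Admin token
ASTRA_CLIENT_SECRET = ''          # line 4: generated Client_Secret for the token
                                  # line 5: (blank)
ALPHA_VANTAGE_API_KEY = ''        # line 6: Alpha Vantage API key
                                  # line 7: (blank)
ASTRA_STREAMING_TOKEN = ''        # line 8: Astra Streaming token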

Starting your Spark Cluster

In your environment, run the script file ./spark-3.4.1-bin-hadoop3/sbin/start-master.sh to create the Spark Master. Open up port 8080 and copy the Spark Master address. Use that address as the argument and run the script ./spark-3.4.1-bin-hadoop3/sbin/start-worker.sh to create the Spark worker.

Running the Code

Once all of the setup is complete, you can run create_raw_stream.py to get data from the Alpha Vantage API and send it to Astra Streaming as a stream. You should be able to see the results in the Try Me tab within the Astra UI. Then, to process and upload that stream to AstraDB, run the following Spark submit command.

./spark-3.4.1-bin-hadoop3/bin/spark-submit \
--master <Spark Master Address> \
--packages io.streamnative.connectors:pulsar-spark-connector_2.12:3.1.1.3,com.datastax.spark:spark-cassandra-connector_2.12:3.4.0 \
--conf spark.files=<bundle filepath> \
--conf spark.cassandra.connection.config.cloud.path=<bundle file name> \
--conf spark.cassandra.auth.username=<astra client id> \
--conf spark.cassandra.auth.password=<astra client secret> \
--conf spark.dse.continuousPagingEnabled=false \
<path to pyspark_stream_read.py>

This command submits the file pyspark_stream_read.py to the Spark cluster, while also telling Spark to download the Spark Cassandra connector and the Pulsar Spark connector from Maven. We then pass the connection information for Astra, including the secure connect bundle, client ID, and client secret.

After those have been run, we should be able to see the results in AstraDB. Open up your Database’s CQL console and type:

SELECT * FROM spark_streaming.stock_data LIMIT 10;

Conclusion

In this demo, you learned how to use Pulsar/Astra Streaming, Spark, SparkSQL, and AstraDB to turn an API into a stream, deliver that stream through Astra Streaming, and then transform and write it to AstraDB.

While this demo focuses on the Alpha Vantage API, you can apply this stack to a wide variety of data in an extract, transform, and load workflow. How you use the data once you have it is up to your imagination!

Getting help

You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. Enhance your enterprise’s data ingest by incorporating streaming data with Pulsar, Spark, and Cassandra. Happy Coding!

Resources

https://github.com/chimpler/blog-spark-streaming-log-aggregation/tree/master
https://github.com/anjijava16/Spark_Structured_Streaming
https://github.com/jleetutorial/python-spark-streaming
https://github.com/adaltas/spark-streaming-pyspark
https://pulsar.apache.org/docs/3.0.x/adaptors-spark/
https://docs.datastax.com/en/streaming/astra-streaming/getting-started/index.html#your-first-streaming-tenant
https://github.com/tspannhw/FLiPS-SparkOnPulsar/
https://www.datastax.com/products/astra-streaming

Streaming Stock Data with Apache Flink and Cassandra

Introduction

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. This tutorial will show you step-by-step how to use Astra as a sink for results computed by Flink. 

This code is intended as a fairly simple demonstration of how to enable an Apache Flink job to interact with DataStax Astra. Any Cassandra database will work for this demo; for the purposes of this tutorial, we use DataStax Astra to preview these features.

Basic Activity Diagram

The diagram below shows pyFlink and AstraDB’s batch processing capabilities and how this integration unlocks a whole new realm of possibilities for data-intensive applications.

Prerequisites

  • You should have Gradle and Java installed on your system.
  • Python v3.7 or later
  • AlphaVantage API Key

Setting up Flink

1: Open a terminal and clone the flink-astra-stock-price GitHub repository using the command:

git clone git@github.com:Anant/flink-astra-stock-price.git

2: Change directory into the `flink-astra-stock-price` folder and install the dependencies listed in the requirements.txt file.

cd flink-astra-stock-price

3: To install the required packages for this plugin, run the following command:

pip install -r requirements.txt

4: Open the AlphaVantage API. Select *Get Free API Key*.

5: Fill out the form to create your key and click *GET FREE API KEY*.

6: Receive the credentials through email and add them to the my_local_secrets.py file.

Setting up your Astra DB

  1. Create an Astra Database. To match the demo as perfectly as possible, name the database “flink” and the keyspace, `example`.
  2. Add the keyspace name from your newly created database to `my_local_secrets.py`.
  3. Select *Generate token*. Download the token to connect to your database.
    1. Add the `client_id`, `client_secret`, and `client_token` to the corresponding fields in the `my_local_secrets.py` file.
    2. Click *Close*. 
  4. Click *Get Bundle* to download a *Secure Connect Bundle* (SCB). Upload that bundle to your coding environment. Make sure to reference the location where you uploaded the document in the `my_local_secrets` file.

NOTE:

Never share your token or bundle with anyone. It is a bundle of several pieces of data about your database and can be used to access it. 

  1. Move the SCB to app/src/main/resources in your GitHub directory (You do not have to unzip the file).
  2. Create a properties file titled app.properties, and place it in app/src/main/resources/.
  3. Add properties specifying your Astra client ID, Astra secret, and SCB file name. These should map to the “astra.clientid”, “astra.secret”, and “astra.scb” properties respectively. Your app.properties file should look something like this:
astra.clientid=Bwy...
astra.secret=E4dfE...
astra.scb=secure-connect-test.zip

Open the my_local_secrets.py file and fill in the following details.

client_id="<your-client_id>"
client_secret="<your-client_secret>"
token="<your-token>"
db_keyspace="<your-keyspace>"
secure_bundle_path="<path-to-bundle>/secure-connect-<YOUR_DB_NAME>.zip"
astra_id="<astra-id>"
astra_region="<astra-region>"
api_key="<your-alphavantage-api-key>"

Install Jupyterlab

JupyterLab is a more advanced version of Jupyter notebooks. It offers many features in addition to the traditional notebooks.

1: Install Jupyter with pip on your machine. Run the command below to do so.

a: If `pip` does not work, try `pip3 install jupyterlab`.

pip install jupyterlab

2: Type `jupyter-lab` in the terminal to open a window in your browser listing your working directory’s content. From the cloned directory, locate the notebook to follow the steps.

If it doesn’t start automatically, you can navigate to the JupyterLab server by clicking the URLs in your terminal:

3: After running the `jupyter` command from your working directory, you can see the project tree in your browser and navigate to the files. Make sure you configure the secrets by following the next step before starting any coding.

4: Open the `local_secrets.py` file and fill in the details provided/extracted from the Astra Portal after setting up the database details.

5: Navigate to the notebook.

Running the Jupyter Notebook

JupyterLab has more advanced functionality than traditional notebooks. As with most Jupyter notebooks, you can run each block of the notebook. If you’ve correctly added all the fields to the my_local_secrets.py file, the notebook runs correctly.

Configure your PyFlink datastream in the notebook so that the data it pulls from the API is structured correctly.

The stock symbol for which this demo is pulling the data is “IBM”. Change the third block [3] to pull the data for a different stock symbol.

Our demo stack uses the Python API, but there are a variety of other language APIs that the developer community supports.

Defining API Queries

The AlphaVantage API can be used to pull different types of data. For this demo, we are using the popular Intraday (Time_Series_Intraday) API.

The query is currently set to pull the maximum number of free requests from the API. Each intraday response includes, for each 5-minute period:

  • the opening value at the beginning of the period,
  • the maximum value during the period,
  • the lowest value during the period,
  • the closing value at the end of the period, and
  • the volume of trades for that stock during the period.

Next, PyFlink structures the data that it gets from AlphaVantage into a data frame, filters out any records with volume greater than 100,000, and uses the resulting data frame to upload the data to your Cassandra database – in our case, Astra. A sketch of this step is shown below.
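
A minimal PyFlink sketch of that filtering step, assuming each stream element is already a dict with the Alpha Vantage fields; the literal sample data stands in for the API response.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# In the notebook these elements come from the AlphaVantage response; a small literal
# collection stands in for them here.
bars = env.from_collection([
    {"date": "2023-11-01 12:00:00", "open": 145.1, "high": 145.9,
     "low": 144.8, "close": 145.2, "volume": 120000},
    {"date": "2023-11-01 12:05:00", "open": 145.2, "high": 145.4,
     "low": 144.9, "close": 145.0, "volume": 80000},
])

# Drop any bar whose volume exceeds 100,000, mirroring the filter described above.
filtered = bars.filter(lambda bar: bar["volume"] <= 100000)

filtered.print()
env.execute("filter-stock-bars")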

Writing to Astra

The notebook includes a schema creation block (block [13] of the `pyFlink_Astra_batch.ipynb` file) that can be run directly from the Notebook. You can also run the schema creation script in the CQL Console of your Astra Portal. Below is an example of the schema.

CREATE TABLE IF NOT EXISTS market_stock_data (
    date text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    PRIMARY KEY (date)
);

1: Use the AstraDB REST API in the `pyFlink_Astra_batch.ipynb` file to insert the data into Astra.

The notebook defines a function, `send_to_rest_api`, which takes a single argument named “data”.

def send_to_rest_api(data):

2: Create a URL for the DataStax Astra REST API endpoint using formatted strings. The endpoint URL points at the specific keyspace and table specified in your `my_local_secrets.py` file.

The function then iterates through each row in the data object and, for each row, makes a POST request to the previously constructed URL. The data payload for each POST request is a JSON string describing one row of stock market data (date, open, high, low, close, and volume).
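Putting those two steps together, a hedged sketch of the function might look like the following; it assumes the Stargate REST v2 “add row” endpoint and reuses the values from my_local_secrets.py (the notebook’s own implementation and error handling may differ):

import json
import requests

from my_local_secrets import astra_id, astra_region, db_keyspace, token

# REST v2 endpoint for inserting rows into the market_stock_data table.
url = (
    f"https://{astra_id}-{astra_region}.apps.astra.datastax.com"
    f"/api/rest/v2/keyspaces/{db_keyspace}/market_stock_data"
)
headers = {"X-Cassandra-Token": token, "Content-Type": "application/json"}

def send_to_rest_api(data):
    # Iterate through each row in the data object and POST it to the endpoint above.
    for row in data:
        payload = json.dumps({
            "date": row["date"],
            "open": row["open"],
            "high": row["high"],
            "low": row["low"],
            "close": row["close"],
            "volume": row["volume"],
        })
        try:
            response = requests.post(url, headers=headers, data=payload, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"Insert failed for {row.get('date')}: {exc}")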

Datastream and Map

Create a DataStream and map over it. After defining the `send_to_rest_api` function, the script interacts with a DataStream named `ds`.

It maps the `send_to_rest_api` function onto the DataStream using a `lambda` function, so that for each item in `ds`, `send_to_rest_api` is called and sends that item’s data to the REST API.

ds.map(lambda x: send_to_rest_api(x))

In the notebook, there is some error handling associated with the send_to_rest_api function.

Congrats! You should now have the AlphaVantage data in your Astra DB. 

Test and Validate

After completing all prerequisites along with the section above, run the sample app and validate the connection between Flink and Astra.

1: In your `flink-astra` cloned GitHub directory, run `./gradlew run`

2: Verify that the application runs and exits normally. If successful, the following message appears:

BUILD SUCCESSFUL in 31s
3 actionable tasks: 2 executed, 1 up-to-date

3: Navigate back to the Astra UI to use the CQL Console. You can run this sample query to confirm that the defined data from the sample app has been loaded properly:

token@cqlsh:example> select * from wordcount ;
word | count
--------+-------
dogs | 1
lazier | 1
least | 1
foxes | 1
jumped | 1
at | 1
are | 1
just | 1
quick | 1
than | 1
fox | 1
our | 1
dog | 2
or | 1
over | 1
brown | 1
lazy | 1
the | 2
(18 rows)
token@cqlsh:example> 

Conclusion

We were able to use PyFlink’s DataStream API and map functions to batch-write data into AstraDB. The DataStream API is a set of tools for processing stream data, including individual record transformation and time-windowing functionality. We have shown that we can use stream-processing basics to transform and filter individual records depending on how we want the data to appear in Astra. This acts as a proof of concept for processing streaming data – such as IoT data coming from physical devices – using a combination of Flink and Cassandra.

Getting help

You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. Enhance your ELT for batch and stream processing with Flink. Happy coding!

Resources

https://www.alphavantage.co/
https://www.alphavantage.co/documentation/
https://nightlies.apache.org/flink/flink-docs-stable/
https://flink.apache.org/documentation/
https://pyflink.readthedocs.io/en/main/index.html
https://docs.datastax.com/en/astra/docs/
https://docs.datastax.com/en/astra-serverless/docs/manage/db/manage-create.html

Training a Handwritten Digits classifier in Pytorch with Cassandra

Handwritten digit recognition is one of the classic tasks undertaken by students when learning the basics of neural networks and computer vision. The basic idea is to take a number of labeled images of handwritten digits and use those to train a neural network that is able to classify new unlabeled images. Using this repo, we’ll show how to use data stored in a large-scale database as our training data (the demo uses the managed Cassandra service AstraDB as a great quick-start option for this). We also explain how to use that same database as a basic model registry, which can enable model serving as well as future retraining.

Introduction

MNIST is a set of datasets that share a particular format useful for educating students about neural networks while presenting them with diverse problems. The MNIST datasets for this demo are a collection of 28 by 28 pixel grayscale images as data and classifications 0-9 as potential labels. This demo works with the original MNIST handwritten digits dataset as well as the MNIST fashion dataset. 

Using both of these datasets helps calibrate models and tests whether they are affected by the domain of the classification. If a neural net is good at classifying digits but bad at classifying clothing and accessories, even though the two datasets share the same structure, that is evidence that something about the training or structure of the network encodes knowledge about digits or handwriting, or is better suited to simple shapes than to complex ones.

PyTorch is a Python library that provides the data types and methods needed for working with neural networks. We also make use of torchvision, a related library specifically meant for computer vision tasks. PyTorch works with data typed as tensors and can define different types of layers that can be combined for deep learning, gaining advantages that single-layer-type networks cannot. PyTorch provides utilities that help us define, train, test, and predict with our models.

Astra is a managed database based on Apache Cassandra that retains Cassandra’s horizontal scalability and distributed nature. Every feature we use in today’s tutorial will be part of Cassandra 5.0, which is releasing soon; to preview these new features, we use DataStax Astra. Today, you can use Astra as the primary data store and as a primitive model registry.

Prerequisites

For this tutorial you will need:

  • An Astra account from DataStax, or enough familiarity with Cassandra to use an alternative Cassandra database. Sign up for a Free Tier Astra account here.
  • An environment in which to run Python code. Make sure that you can install new pip modules in this environment. We recommend Gitpod or your local machine.
  • A Jupyter Notebook server, or access to the VS Code Jupyter plugin, is also necessary.

For an effortless setup, we have provided a Gitpod quickstart option – though you’ll still need to fill in your own credentials before it will run seamlessly. Simply click on the “Open in Gitpod” button found in our GitHub repository to get started, or go to this link to open the repo in Gitpod. When creating the workspace for this project, it is advantageous to select the large class of machine in order to access more RAM and make the data loader run faster. The rest of this article assumes the reader is using Gitpod to follow along unless stated otherwise.

Environment Setup

Assuming you opened the repo in Gitpod, the first thing that happens is that the required Python modules are installed. If you are not using Gitpod, you will need to clone the repo into your environment, change into the working directory, and install the prerequisites:

cd pytorch-astra-demo && pip3 install -r requirements.txt

Before you can proceed further you will need to set up your Astra database. After creating a free account you will need to create a database within that account and create a Keyspace within that database. All of this can be done purely using the Astra UI. 

Setting up your Astra DB

There is a ton of great documentation on how to create an Astra database, linked in the Resources section at the bottom of this blog.

In brief, go to astra.datastax.com, create/sign in to your account, create a new DB– they are free for most users– then create a keyspace and run the schema creation script (located at setup/create_schema.cql in the CassioML repo) in the CLI for your database.

For this demo, we use a database called cassio_db and a keyspace named mnist_digits. To create that database, select the “Databases” tab (shown in the left menu), then click on the “Create Database” button (shown on the right) and fill out the needed information for your database as shown below. The “Vector Database” option comes with vector search enabled, but for this example it does not matter which option you choose.

Once you’ve created your database, you’ll need to generate the Token or Secure Connect bundle to connect to your database with the connect tab. Choose the permissions that make the most sense for your use case. For this demo, there’s nothing wrong with choosing Database Administrator, but you can also go as simple as a Read/Write Service account to get the functionality you need.

Never share your token or bundle with anyone. It is a bundle of several pieces of data about your database, and can be used to access it. 

Reminder: for this demo, the assumed name of the keyspace is mnist_digits.

Establishing the Schema

Once the keyspace has been created, we need to create the Tables that we will be using.  Open the CQL Console for the database by clicking on the “CQL Console” tab in the database view.

We can create the raw_train, raw_test, models_train, and models_test tables, as well as a raw_predict table holding data with no labels attached, using the commands below.

CREATE TABLE mnist_digits.raw_train (id int PRIMARY KEY, label int, pixels list<int>);
CREATE TABLE mnist_digits.raw_test (id int PRIMARY KEY, label int, pixels list<int>);
CREATE TABLE mnist_digits.models_train (id uuid PRIMARY KEY, network blob, optimizer blob, upload_date timestamp, epoch int, batch_percent text, loss float);
CREATE TABLE mnist_digits.models_test (id uuid PRIMARY KEY, network blob, optimizer blob, upload_date timestamp, loss float, accuracy float);
CREATE TABLE mnist_digits.raw_predict (id int PRIMARY KEY, label int, pixels list<int>);

It is useful to define storage attached indexes (SAIs) for the models_test and models_train tables so that once the training is completed we can easily identify the best models. This will allow us to search for the models with the minimum loss and maximum accuracy.

CREATE CUSTOM INDEX loss_train_sai_idx on mnist_digits.models_train (loss) using 'StorageAttachedIndex';
CREATE CUSTOM INDEX loss_test_sai_idx on mnist_digits.models_test (loss) using 'StorageAttachedIndex';
CREATE CUSTOM INDEX accuracy_test_sai_idx on mnist_digits.models_test (accuracy) using 'StorageAttachedIndex';
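As a hedged illustration of how these indexes can be queried later from Python, the sketch below uses range predicates on the indexed columns instead of scanning the whole table; the thresholds and connection values are placeholders, and the SCB and token come from the next step:

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Placeholder connection values; use the SCB and token generated in the next step.
cluster = Cluster(
    cloud={"secure_connect_bundle": "/path/to/secure-connect-bundle.zip"},
    auth_provider=PlainTextAuthProvider("<Client_ID>", "<Client_Secret>"),
)
session = cluster.connect()

# Ask for models above an accuracy floor and below a loss ceiling.
rows = session.execute(
    "SELECT id, loss, accuracy FROM mnist_digits.models_test "
    "WHERE accuracy >= 0.95 AND loss <= 0.2 LIMIT 5"
)
for row in rows:
    print(row.id, row.loss, row.accuracy)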

Next, we need to create the resources necessary to connect to the newly created Astra database. Hit the Connect button on the UI and download the Secure Connect Bundle (SCB). Then hit the “Create a Token” button to create a Database Administrator token and download the text that it returns. 

Load the SCB into the environment and put the path to it on the first line of the auth.py file, between the single quotes. Put the generated id (Client_ID) for the Database Admin token on the second line, and the generated secret (Client_Secret) for the token on the third line.
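The exact variable names live in the repo’s auth.py; as a purely hypothetical illustration, the three lines might look something like this once filled in:

# auth.py (hypothetical names and values; only the order of the three lines matters here)
secure_connect_bundle_path = '/workspace/pytorch-astra-demo/secure-connect-cassio-db.zip'  # line 1: path to the SCB
client_id = '<Client_ID from the generated token>'                                         # line 2: generated id
client_secret = '<Client_Secret from the generated token>'                                 # line 3: generated secret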

Then run the data loader, load_raw_data.py, using the line below. Modify the train_split variable in load_raw_data.py if you want something other than an 80/20 train/test split.

python3 load_raw_data.py

The data loader will populate the train, test, and predict tables that we created earlier. It may take an hour or more to complete because each data sample contains close to 800 pixel values (this is why we asked you to select the larger machine class for your Gitpod). Once it is complete, make sure the data was created by running these commands in the CQL Console of the Astra UI.

SELECT id, label from mnist_digits.raw_train limit 5;
SELECT id, label from mnist_digits.raw_test limit 5;
SELECT id, label from mnist_digits.raw_predict limit 5;

After that you should be able to step through the model_training_full_sequence.ipynb notebook without issue, following the comments to train and store models.

Running the Notebook

An IPython notebook is a collection of cells containing either Markdown or Python code. Each code cell can be run individually, though they share an environment so variables and imports are carried over between cells. 

If you are running the notebook in Gitpod with the VS Code editor, a pop-up will ask you to select a kernel on your first run. Select “Install Python Environments”. Then click on “Select Kernel” in the top right corner, or run a cell again. Select “Python Environments”, and then “Python 3.11.1”.

When first opening the notebook, all cells should have some blank space between them and the next cell. If this is not the case, click Clear All Outputs at the top of the screen. To run an individual cell, click on that cell and press Shift+Enter or click the Run button. You can also use the Run All button at the top of the notebook which will run the cells sequentially. 

A successfully run cell should have a green check mark on it afterwards. A still-running cell will have a loading symbol somewhere in it. Each code cell has comments that describe what that particular cell does. 

The notebook first walks the user through importing the necessary libraries and then creating a custom PyTorch data loader that connects to Astra. After importing the rest of the required modules for model definition and training, we create the data loaders for our training and testing data sets.
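As an illustration of what such a loader can look like, here is a sketch of a Dataset backed by one of the raw_* tables; the class name, eager loading, and reshaping are our assumptions (the notebook may instead keep the flat 1 by 784 vector and reshape later), and `session` is a connected cassandra-driver session:

import torch
from torch.utils.data import Dataset, DataLoader

class AstraMNISTDataset(Dataset):
    """Sketch of a Dataset that reads one of the raw_* tables into memory."""

    def __init__(self, session, table):
        rows = session.execute(f"SELECT id, label, pixels FROM mnist_digits.{table}")
        self.samples = [
            (torch.tensor(r.pixels, dtype=torch.float32).reshape(1, 28, 28) / 255.0, r.label)
            for r in rows
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# Usage: wrap it in a DataLoader just like a torchvision dataset.
# train_loader = DataLoader(AstraMNISTDataset(session, "raw_train"), batch_size=64, shuffle=True)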

Adding New DataSets

This example uses the MNIST handwritten digits dataset. This dataset consists of a set of 28 by 28 pixel grayscale images depicting digits from 0-9, meant to be classified into those 10 categories. This repo is easily modified to work with other datasets in this format. The most compatible will be other MNIST datasets, which promise the same 28 by 28 image size, the same grayscale pixel values, and the same 10 categories. In fact, the fashion MNIST dataset at https://www.kaggle.com/datasets/zalando-research/fashionmnist can be substituted almost exactly for the train and test csv files included in the repo. If you switch the filenames in load_raw_data.py, the rest of the repo can be used as normal.

Once we have done this and set some constants that will be used in model training, we test our data loaders by loading a single batch of examples, extracting a single data point, and examining its shape (the dimensionality of the tensor that represents the hand-drawn number). Here we should see the 28 by 28 pixel nature of the images reflected in the shape of the data; we reshape the data object (it was originally 1 by 784) so that the model can process it correctly. If you see an error instead, it is an indication that the data loader is unable to load the data properly – check your Astra database for missing or empty tables.

Changing the details of how we train the model

Before we train the model, we define a number of constants that change how that training takes place. The first is n_epochs, which defines how many training epochs we put the model through.

During each epoch, we feed in a number of training examples before stopping, at which point we test and save the model. 

  • `batch_size_train` tells us how many of our training examples get fed to the model during each epoch. 
  • `batch_size_test` defines how many examples are used for testing the model after training. 
  • `learning_rate` defines a property of the optimizer, changing the backpropagation step so that it makes bigger or smaller changes to the model weights. 
  • `momentum` determines how much the changes to the model weights carry over between backpropagation steps.

Because backpropagation uses calculus to determine model weight changes, the magnitude of those changes can be affected by the gradient slope of the previous backpropagation step.
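To make these knobs concrete, here is a small sketch with representative values; the notebook defines its own constants, and the stand-in model plus the assumption of an SGD optimizer are ours:

import torch.nn as nn
import torch.optim as optim

# Representative values only; the notebook sets its own constants.
n_epochs = 3            # number of full passes over the training data
batch_size_train = 64   # training examples fed to the model per step
batch_size_test = 1000  # examples used when evaluating after each epoch
learning_rate = 0.01    # scales the size of each weight update
momentum = 0.5          # how much of the previous update carries into the next one

network = nn.Linear(28 * 28, 10)  # stand-in model; the real Net class is described in the next section

# learning_rate and momentum take effect where the optimizer is constructed.
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)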

After that is complete, we define a class for our neural net that holds its component layers and defines how those layers are connected to each other. We then define train and test functions that perform an optimization step on our model, store the resulting model back into the Astra database, and then test the resulting model on the test data set.

Changing the structure of the model

When we create the Net class, we define the structure of our model. In this example repo we set up two convolutional layers, a dropout layer, and two linear layers. The convolutional layers take a number of 2D planes as input, perform a 2D convolution, and output a different number of planes. Because our flat grayscale image fits into a single plane, our first Conv2d layer has a single input channel. Conv2d layers are specialized for image processing.

To use a traditional RGB image as input, we would increase the number of input channels to 3, one for each color. The dropout layer randomly zeroes out some channels. The linear layers apply a linear transformation to incoming data. Because our classification task has 10 categories, the final linear layer has 10 outputs. These return a value between 0 and 1 for each category, roughly corresponding to a probability or confidence score, and we take the highest one as the prediction. The layers are applied in this order: first convolutional layer, second convolutional layer, dropout layer, first linear layer, second linear layer. This order can be changed by modifying the order in which they are used in the Net class’s forward method.
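For reference, a network with this layer arrangement, modeled on the classic PyTorch MNIST example linked in the Resources below, might look like the following sketch; the repo’s exact layer sizes and forward pass may differ:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # first convolutional layer: one grayscale input plane
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # second convolutional layer
        self.conv2_drop = nn.Dropout2d()               # dropout layer: randomly zeroes whole channels
        self.fc1 = nn.Linear(320, 50)                  # first linear layer
        self.fc2 = nn.Linear(50, 10)                   # second linear layer: one output per digit class

    def forward(self, x):
        # The order of operations here is what connects the layers to each other.
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)  # log-probabilities; exponentiating gives the 0-to-1 scores described above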

Once all of these are complete, you should see the accuracy of your model increasing in the space under your final cell like this.

Using the Updated Model on New Data 

In order to use this model on new data, the first step is to pull the row for the particular model you want out of Astra. Then use pkl.load on the network state object that was saved to turn it back into a dictionary object. Next, create the Net class and network object the same way we did in the notebook, and call network.load_state_dict, passing it the state dictionary that was just loaded.

Now we have a network object with the same weights as the one we stored in Astra. We can then load new data – whether from the test loader we created in the notebook, from the data we placed in the raw_predict table, or from somewhere else. Once we have the data and the model, we can call network(data) to run the new data through the model and look through the results for the predictions.
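Putting those steps together, a helper along these lines would run new images through a stored model; it reuses the Net class from the sketch above, and `model_row` and `batch` are hypothetical names for the Astra row and the new image tensor:

import pickle as pkl
import torch

def predict_with_stored_model(model_row, batch):
    # model_row: a row from mnist_digits.models_test; its `network` column is the pickled state dict.
    # batch: a tensor of new images shaped (N, 1, 28, 28), e.g. built from the raw_predict table.
    state_dict = pkl.loads(model_row.network)
    network = Net()                      # the same Net class used during training
    network.load_state_dict(state_dict)
    network.eval()
    with torch.no_grad():
        output = network(batch)
        return output.argmax(dim=1)      # highest-scoring class for each image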

Conclusion

Congratulations, you have successfully trained a machine learning model to recognize hand-written numbers! If you were to use the same model to identify the number associated with a new set of hand-written numbers, your model would recognize those numbers better than an untrained model.

Getting help

You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. Pytorch is a widely used library in the Machine Learning ecosystem. Now that you’ve gotten started with it, you can use it in a ton of ways; let us know how it helps you and your enterprise. Happy coding!

Resources

https://pytorch.org/tutorials/
https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
https://github.com/ptrblck/pytorch_misc/blob/master/pytorch_redis.py
https://nextjournal.com/gkoehler/pytorch-mnist
https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
https://pytorch.org/vision/stable/transforms.html
https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html
https://pytorch.org/vision/stable/generated/torchvision.transforms.Normalize.html