Skip to content

Conversation

@yy214123
Copy link

@yy214123 yy214123 commented Oct 29, 2025

Extend the existing architecture to cache the last fetched PC instruction, improving instruction fetch hit rate by approximately 2%.

based on performance metrics introduced in #99 (MMU Cache Statistics).

=== MMU Cache Statistics ===

Hart 0:
   Fetch: 443724828 hits,    7671580 misses (98.30% hit rate)
   Load:   65003296 hits,   34592372 misses (2-way) (65.27% hit rate)
   Store:  59967516 hits,   11483892 misses (83.93% hit rate)

Also includes clang-format fixes for several expressions.


Summary by cubic

Implemented a direct-mapped instruction fetch cache with a small victim cache to reduce conflict misses and speed up fetches. Improves instruction fetch hit rate by ~2% (per MMU cache stats from #99) and reduces fetch translations by ~10%.

  • New Features

    • Direct-mapped I-cache (256 blocks × 256 B).
    • 16-block victim cache with swap-on-hit to mitigate conflict misses.
    • I-cache invalidation on MMU invalidate.
    • 2-entry direct-mapped fetch page cache with parity hash to avoid thrashing.
  • Refactors

    • Documented I-cache macros/masks and renamed IC/VC prefixes to ICACHE/VCACHE.
    • Fixed tag calculation, victim-cache fill, and hit/miss accounting in mmu_fetch.

Written for commit 1ff3c37. Summary will update automatically on new commits.

@yy214123 yy214123 marked this pull request as draft October 29, 2025 18:56
cubic-dev-ai[bot]

This comment was marked as outdated.

@jserv

This comment was marked as outdated.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch 2 times, most recently from 98114a7 to 0e4f67b Compare October 30, 2025 05:32
@jserv
Copy link
Collaborator

jserv commented Oct 30, 2025

Consider 2-Way Page Cache: Given that the current page-level cache already achieves 98.30% hit rate, consider simply upgrading it instead:

  // Current: single-entry page cache
  mmu_fetch_cache_t cache_fetch;

  // Proposed: 2-way set-associative page cache
  mmu_fetch_cache_t cache_fetch[2];  // Only +16 bytes overhead

  Use parity hash (like load/store caches):
  uint32_t idx = __builtin_parity(vpn) & 0x1;
  if (unlikely(vpn != vm->cache_fetch[idx].n_pages)) {
      // ... fill
  }

@yy214123 yy214123 force-pushed the direct-mapped-cache branch 2 times, most recently from bb9e6cb to 74e3b99 Compare October 30, 2025 19:24
@yy214123 yy214123 closed this Oct 31, 2025
@yy214123

This comment was marked as outdated.

@yy214123 yy214123 reopened this Oct 31, 2025
jserv

This comment was marked as resolved.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from 686cede to 5478710 Compare November 1, 2025 06:40
@yy214123 yy214123 requested a review from jserv November 2, 2025 09:08
Copy link
Collaborator

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unify the naming scheme.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from fe2d95e to f657fb2 Compare November 4, 2025 16:55
Copy link
Collaborator

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebase the latest 'master' branch and resolve build errors.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from aab465c to db3a37f Compare November 4, 2025 18:20
@sysprog21 sysprog21 deleted a comment from yy214123 Nov 4, 2025
Copy link
Collaborator

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Squash commits and refine commit message.

Copy link
Collaborator

@visitorckw visitorckw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This series appears to contain several "fix-up," "refactor," or "build-fix" commits that correct or adjust a preceding patch.

To maintain a clean history and ensure the project is bisectable, each patch in a series should be complete and correct on its own.

@visitorckw
Copy link
Collaborator

As a friendly reminder regarding project communication:

Please ensure that when you quote-reply to others' comments, you do not translate the quoted text into any language other than English.

This is an open-source project, and it's important that we keep all discussions in English. This ensures that the conversation remains accessible to everyone in the community, including current and future participants who may not be familiar with other languages.

icache_block_t tmp = *blk;
*blk = *vblk;
*vblk = tmp;
blk->tag = tag;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks suspicious to me.

When you move the evicted I-cache block (tmp) back into the victim cache, you are setting the vblk->tag to tmp.tag, which is the 16-bit I-cache tag.

Won't this corrupts the victim cache entry? The VC search logic requires a 24-bit tag ([ICache Tag | ICache Index]) to function. Because you're only storing the 16-bit tag, this VCache entry will never be hit again.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this corrupts the victim cache entry? The VC search logic requires a 24-bit tag ([ICache Tag | ICache Index]) to function. Because you're only storing the 16-bit tag, this VCache entry will never be hit again.

Thank you for pointing that out. I’ve added the following expressions to ensure correctness :

+   vblk->tag = (tmp.tag << ICACHE_INDEX_BITS) | idx;

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from db3a37f to c6485fc Compare November 6, 2025 09:56
@yy214123
Copy link
Author

yy214123 commented Nov 6, 2025

...

To maintain a clean history and ensure the project is bisectable, each patch in a series should be complete and correct on its own.

I've finished cleaning up the commits and split them into three functionally independent patches.

Please ensure that when you quote-reply to others' comments, you do not translate the quoted text into any language other than English.

Thank you for pointing that out. I've corrected the mistake and will be more careful in the future.

@yy214123 yy214123 requested review from jserv and visitorckw November 6, 2025 10:07
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from c6485fc to f270dca Compare November 6, 2025 18:16
riscv.h Outdated
* - index bits = log2(ICACHE_BLOCKS)
*
* For power-of-two values, log2(x) equals the number of trailing zero bits in
* x. Therefore, we use __builtin_ctz(x) (count trailing zeros) to compute these
Copy link
Collaborator

@jserv jserv Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not worthy to invoke __builtin_ctz for constants.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason it was designed this way is because I initially needed to experiment with various cache size combinations.
Do you mean it's better to just directly fill in the log2 value?

-   #define ICACHE_OFFSET_BITS (__builtin_ctz((ICACHE_BLOCKS_SIZE)))
-   #define ICACHE_INDEX_BITS (__builtin_ctz((ICACHE_BLOCKS)))
+   #define ICACHE_OFFSET_BITS  8
+   #define ICACHE_INDEX_BITS 8

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't clz be used here instead of ctz?

Additionally, I don't think the logic for calculating ilog2 and its relationship with clz belongs in riscv.h. IMHO, this isn't RISC-V specific and should be placed in a more general helper file.

Furthermore, if we were to introduce such a helper, we shouldn't rely solely on the gnu extensions __builtin_clz()/ctz(). We would also need to provide a fallback implementation for compilers that don't support them to avoid compatibility issues.

However, since this is just a pre-computable value in this context, I don't believe it's necessary to introduce such a helper at all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't clz be used here instead of ctz?

I only considered the case where the value is a power of two. For non–power-of-two values, clz should indeed be used.

We would also need to provide a fallback implementation for compilers that don't support them to avoid compatibility issues.

Thanks for pointing that out. After reviewing the code, I noticed that __builtin_parity already accounts for this situation.
I’ll be more careful when using similar compiler extensions in the future.

#if !defined(__GNUC__) && !defined(__clang__)
/* Portable parity implementation for non-GCC/Clang compilers */
static inline unsigned int __builtin_parity(unsigned int x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}
#endif

Additionally, I don't think the logic for calculating ilog2 and its relationship with clz belongs in riscv.h. IMHO, this isn't RISC-V specific and should be placed in a more general helper file.

I’ve already defined it as a constant.

On a related note, I’d like to ask for your advice — I currently place the structures for the i-cache and victim cache, along with their related #defines, inside riscv.h.

Would you suggest separating these parts into a new header file instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than discussing which header they should go into, I’m more interested in knowing how likely it is that these will be shared with other C files in the future. If the probability is low, should they just be placed directly in riscv.c instead of a header?

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from f270dca to 5e9504b Compare November 8, 2025 15:46
Extend the existing architecture to include a direct-mapped
instruction cache that stores recently fetched instructions.

Add related constants and macros for cache size and address fields.
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from 5e9504b to e2527ab Compare November 8, 2025 18:30
Replace the previous 1-entry direct-mapped design with a 2-entry
direct-mapped cache using hash-based indexing (same parity hash as
cache_load). This allows two hot virtual pages to coexist without
thrashing.

Measurement shows that the number of virtual-to-physical translations
during instruction fetch (mmu_translate() calls) decreased by ~10%.
Introduce a small victim cache to reduce conflict misses
in the direct-mapped instruction cache. On an I-cache miss,
probe the victim cache; on hit, swap the victim block with
the current I-cache block and return the data.

Measurement shows that the number of virtual-to-physical translations
during instruction fetch (mmu_translate() calls) decreased by ~8%.
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from e2527ab to 1ff3c37 Compare November 8, 2025 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants