Sunday, February 19, 2012

CCS'03: Randomized instruction set emulation to disrupt binary code injection attacks

This paper is very similar to the previous one, except that it stores the original program on disk and randomizes it while loading it into memory.

Its key is also very long and may be as long as the program.

CCS'03: Countering Code-Injection Attacks With Instruction-Set Randomization

This paper proposes an approach to counter code-injection attacks. It stores a key for de-randomizing the randomized program. When the OS schedules the program to run, it loads this key into a write-only register in the CPU, and the CPU de-randomizes the incoming instruction stream before actually processing it.
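
A minimal sketch of the idea, assuming the randomization is a byte-wise XOR with a short repeating key (the note above does not say which encoding is used, so this is only one possibility). The names `xor_with_key` and `KEY` and the byte values are made up for illustration.

```python
from itertools import cycle


def xor_with_key(code: bytes, key: bytes) -> bytes:
    """XOR every byte of `code` with the repeating key. XOR is its own inverse."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(code, cycle(key)))


# Hypothetical per-process key; in the scheme above it would live in a
# write-only CPU register that user code cannot read back.
KEY = bytes([0x5A, 0xC3, 0x19, 0x7E])

program = bytes([0x90, 0x90, 0x31, 0xC0, 0xC3])  # legitimate instruction bytes
stored = xor_with_key(program, KEY)              # randomized image in memory
fetched = xor_with_key(stored, KEY)              # what the CPU decodes

injected = bytes([0xCC, 0xCC, 0xCC, 0xCC])       # attacker bytes, never randomized
garbled = xor_with_key(injected, KEY)            # what the CPU decodes instead

print(fetched == program)   # legitimate code survives the round trip
print(garbled != injected)  # injected code is scrambled before execution
```

Legitimate code passes through the key twice and comes out unchanged, while injected bytes pass through it only once, so the CPU sees scrambled instructions.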

Saturday, February 18, 2012

ASPLOS'10: Speculative Parallelization Using Software Multi-threaded Transactions

This paper proposes a software transactional memory system that can maintain atomicity across multiple threads.

ASPLOS'10: COMPASS: A Programmable Data Prefetcher Using Idle GPU Shaders

This paper proposes using the GPU as a programmable unit that runs a GPU program to help prefetch data for the CPU.

ASPLOS'10: Dynamically Replicated Memory: Building Reliable Systems from Nanoscale Resistive Memories

This paper proposes adding another level of page table that maps physical addresses to real addresses in PCM. This mapping can map one physical page to two PCM pages that have stuck-at faults at different locations. In this way, faulty PCM pages can still be used instead of being discarded.
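
The pairing rule can be sketched as follows. This is a toy model under my own assumptions: faults are tracked per byte offset, a stuck cell simply ignores writes, and the names `compatible` and `ReplicatedPage` are invented for illustration, not taken from the paper.

```python
def compatible(faults_a, faults_b):
    """Two faulty pages can back one logical page if no offset is bad in both."""
    return set(faults_a).isdisjoint(faults_b)


class ReplicatedPage:
    """One logical page backed by two faulty PCM pages (illustrative only)."""

    def __init__(self, size, faults_primary, faults_backup):
        if not compatible(faults_primary, faults_backup):
            raise ValueError("pages share a faulty location")
        self.faults_primary = set(faults_primary)
        self.faults_backup = set(faults_backup)
        self.primary = bytearray(size)
        self.backup = bytearray(size)

    def write(self, offset, value):
        # A stuck-at cell keeps its old value, so the write is dropped there.
        if offset not in self.faults_primary:
            self.primary[offset] = value
        if offset not in self.faults_backup:
            self.backup[offset] = value

    def read(self, offset):
        # Fall back to the replica when the primary cell is known to be faulty.
        if offset in self.faults_primary:
            return self.backup[offset]
        return self.primary[offset]


page = ReplicatedPage(8, faults_primary={1, 3}, faults_backup={2, 5})
for i in range(8):
    page.write(i, i + 10)
print([page.read(i) for i in range(8)])
```

Because the two fault sets are disjoint, every offset has at least one good copy, so the pair behaves like a single fault-free page.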

ASPLOS'10: Inter-Core Cooperative TLB Prefetchers for Chip Multiprocessors

This paper proposes a mechanism that exploits the relationship between the TLBs of different cores in a CMP.

ASPLOS'10: Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Systems


This paper proposes a hardware mechanism and instructions to detect the rare cases that STM and GC repeatedly check for on certain address patterns.

Sunday, February 12, 2012

ASPLOS'10: Flexible Architectural Support for Fine-Grain Scheduling

Fine-grain scheduling, which schedules small threads of only a few thousand instructions, requires exchanging a large amount of information across multiple levels of cache and memory. This costs about 100 cycles per scheduling operation, which is unacceptable for such small threads. On the other hand, a purely hardware scheduler is not flexible enough to support multiple scheduling algorithms. So this paper proposes a flexible message-exchange mechanism that a software scheduler can use to avoid crossing multiple levels of cache while preserving flexibility.

ISCA'10: Necromancer: Enhancing System Throughput by Animating Dead Cores

In manufacturing test, any core that fails is disabled, even though such cores are mostly correct and have only minor defects. This paper uses them to run ahead and feed a simpler core with useful hints, such as branch targets and cache prefetches, which leads to better performance on the simpler core.

ISCA'10: Elastic Cooperative Caching: An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors

This paper proposes an interesting cache design in which every cache is divided into a private part and a shared part. The private part serves only the attached core, while the shared part can be allocated to other cores, effectively giving those cores a much larger private cache.

ISCA'10: Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance

This paper proposes to divide the SIMD lanes of a warp into two separate sets when the lanes diverge because of differences in branch outcome or memory latency. These sets can then be executed as interleaved threads.
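
The split itself is just mask arithmetic. A small sketch for the branch case, assuming one bit per SIMD lane; the function name `subdivide` and the 8-lane warp are illustrative choices of mine.

```python
def subdivide(active_mask: int, taken_mask: int) -> tuple[int, int]:
    """Split the active lanes of a warp into (taken, not_taken) lane sets."""
    taken = active_mask & taken_mask
    not_taken = active_mask & ~taken_mask
    return taken, not_taken


# 8-lane warp, all lanes active; lanes 0, 2, 3 and 5 take the branch.
active = 0b1111_1111
taken, not_taken = subdivide(active, 0b0010_1101)
print(f"{taken:08b} {not_taken:08b}")
```

The two masks are disjoint and together cover every active lane, so each set can be scheduled independently, and the same split applies when the condition is "cache hit" instead of "branch taken".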

ISCA'10: A Case for FAME: FPGA Architecture Model Execution

This paper surveys several types of FPGA-based emulation systems.

Direct FAME maps the RTL directly onto the FPGA; this is the approach we currently use.
Decoupled FAME uses several FPGA cycles to simulate a single cycle of a complex structure, such as a multi-ported register file.
Multithreaded FAME uses a single pipeline with multiple copies of state to simulate many instances of a design, such as multiple cores.

ISCA'10: Translation Caching: Skip, Don’t Walk (the Page Table)


Every TLB miss may require several DRAM accesses, one for each level of the page table.

MMU caches store those page table entries, much as a data cache stores data.

There are many variants. A unified cache stores entries from all levels in one structure, while a split design stores the entries of each level in a separate cache. Another distinction is the index: a page table cache is indexed by the physical address of the entry, so it only helps once the location of the entry is known, whereas a translation cache stores translation results and is indexed by the virtual address.
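
To make the virtual-address-indexed variant concrete, here is a toy model, assuming a 4-level radix page table with 9 index bits per level and 4 KiB pages. The page table is a nested dict, the translation cache is keyed by the leading virtual-address indices, and all names (`indices`, `map_page`, `Walker`) are my own, not the paper's.

```python
LEVELS = 4        # assumed number of page table levels
BITS = 9          # assumed index bits per level
PAGE_SHIFT = 12   # 4 KiB pages


def indices(vaddr):
    """Split a virtual address into one table index per level, top level first."""
    vpn = vaddr >> PAGE_SHIFT
    mask = (1 << BITS) - 1
    return [(vpn >> (BITS * (LEVELS - 1 - i))) & mask for i in range(LEVELS)]


def map_page(root, vaddr, frame):
    """Install a mapping in the nested-dict page table."""
    idx = indices(vaddr)
    table = root
    for i in idx[:-1]:
        table = table.setdefault(i, {})
    table[idx[-1]] = frame


class Walker:
    def __init__(self, root):
        self.root = root
        self.tcache = {}        # virtual index prefix -> table it leads to
        self.mem_accesses = 0   # simulated page table reads

    def translate(self, vaddr):
        idx = indices(vaddr)
        table, start = self.root, 0
        # Look for the longest cached prefix so the upper levels can be skipped.
        for n in range(LEVELS - 1, 0, -1):
            hit = self.tcache.get(tuple(idx[:n]))
            if hit is not None:
                table, start = hit, n
                break
        for level in range(start, LEVELS):
            self.mem_accesses += 1
            table = table[idx[level]]
            if level < LEVELS - 1:
                self.tcache[tuple(idx[:level + 1])] = table
        # After the last level, `table` holds the physical frame number.
        return (table << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))


root = {}
map_page(root, 0x7F0012345000, 42)
map_page(root, 0x7F0012346000, 43)
walker = Walker(root)
walker.translate(0x7F0012345678)   # cold walk: reads all 4 levels
walker.translate(0x7F0012346010)   # neighbouring page: only the leaf is read
print(walker.mem_accesses)
```

The second lookup shares its three upper indices with the first, so the cached prefix lets it skip straight to the last level.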

Wednesday, January 25, 2012

ISCA'10: Sentry: Light-Weight Auxiliary Memory Access Control

This paper proposes a finer-grained memory protection mechanism that works at cache-line granularity. It resides on the L1 cache miss path, so it neither slows down the processor pipeline nor consumes power every cycle.


Tuesday, January 24, 2012

ISCA'11: The Role of Optics in Future High Radix Switch Design

This paper paints a bleak future for electrical switches and proposes using optical switches instead.

We should reread it in detail when we need optical SerDes.

ISCA'11: Dark Silicon and the End of Multicore Scaling

This paper presents a bleak outlook for the multicore approach: it argues that multicore scaling will end within 9 years because of the power and utilization walls.

ISCA'11: SpecTLB: A Mechanism for Speculative Address Translation

This paper proposes to walk the page table in parallel while predicting the address translation result by interpolation, so that the translation latency can be hidden.

ISCA'11: A Case for Globally Shared-Medium On-Chip Interconnect

This paper presents a transmission-line link design implemented in standard CMOS, running at 26.4 Gb/s. It is very impressive, and we may need to refer to it later.

However, the differential wires look similar to what I have seen before in SerDes reference clocks. Can I use them?


Wednesday, January 18, 2012

ISCA'11: Releasing Efficient Beta Cores to Market Early

This paper is very interesting: it runs a simple, slow but correct core together with a complex, fast but buggy core.

The two cores check each other; if their results do not match, the simple core is invoked.
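
The compare-and-fall-back step can be sketched in a few lines. This models only the checking logic, not how the hardware keeps the simple core fast enough to keep up; both "cores" are plain functions and the bug is a made-up example.

```python
def run_pair(fast_core, simple_core, inputs):
    """Run both cores on every input; trust the simple core on a mismatch."""
    results, fallbacks = [], 0
    for x in inputs:
        value = fast_core(x)
        reference = simple_core(x)
        if value != reference:
            fallbacks += 1
            value = reference   # the simple core is assumed to be correct
        results.append(value)
    return results, fallbacks


def simple_add_one(x):
    return x + 1


def buggy_add_one(x):
    # Hypothetical design bug: wrong result for one specific input.
    return x + 2 if x == 3 else x + 1


print(run_pair(buggy_add_one, simple_add_one, range(5)))
```

As long as mismatches are rare, the pair produces correct results while mostly running at the pace set by the fast core.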

ISCA'11: FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes

Blocked-execution processors continuously run atomic blocks of instructions, also called chunks. Larger chunks may lead to frequent contention and lost performance.

This paper proposes an automatic algorithm to remove the contention.

Tuesday, January 17, 2012

ISCA'11: OUTRIDER: Efficient Memory Latency Tolerance with Decoupled Strands

This paper uses the compiler to separate the instruction stream into several strands: some access memory, while others consume the loaded data. This tolerates long memory latency without the large hardware overhead of out-of-order (OOO) execution.

ISCA'11: Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks

This paper proposes to dynamically detect memory blocks that are accessed by only one core and to exclude them from coherence tracking.
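
One simple way to picture the detection, assuming it is done per page and that a page counts as private until a second core touches it. The granularity, the class name `PrivacyTracker`, and the policy of never reverting a shared page to private are assumptions of this sketch, not details from the paper.

```python
class PrivacyTracker:
    """Tracks which pages have been touched by more than one core."""

    def __init__(self):
        self.owner = {}      # page -> id of the first core that accessed it
        self.shared = set()  # pages that need coherence tracking

    def access(self, core, page):
        """Record an access; return True if the page needs coherence."""
        if page in self.shared:
            return True
        first = self.owner.setdefault(page, core)
        if first != core:
            # A second core showed up, so the page stops being private.
            self.shared.add(page)
            return True
        return False


tracker = PrivacyTracker()
print(tracker.access(0, 7))  # first touch by core 0: private
print(tracker.access(1, 7))  # core 1 touches the same page: now shared
print(tracker.access(1, 8))  # a different page, touched only by core 1
```

Pages that stay private never need a directory entry, which leaves more directory capacity for the blocks that are actually shared.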

ISCA'11: FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template

This paper proposes to generate superscalar processors from a template, composing pipeline stages of different widths and depths.

ISCA'11: CRIB: Consolidated Rename, Issue, and Bypass

Conventional high-performance processors use complex logic structures for register renaming, instruction scheduling, and similar tasks, only to make efficient use of heavily pipelined ALUs and memory ports. This leads to high dynamic power consumption.

This paper proposes a processor built from many simple computation components, called CRIBs, in which computation happens in place instead of being scheduled by complex logic.