2011年12月25日星期日

MICRO'11: Parallel Application Memory Scheduling

This paper proposes to use run time system and compiler to find out the most contended lock and then the set of threads that may hold this lock, such that these threads may be prioritized in accessing memory.

For the barrier mechanism, a similar approach is also proposed.


2011年12月24日星期六

FMCAD'11: Effective Word-Level Interpolation for Software Verification

This paper proposes a novel approach to generate interpolant for BV using a layered structure, which includes 4 layers: the bit blasting , EUF, Equality Substitution
and linear arithmetic.



FMCAD'11: IC3: Where Monolithic and Incremental Meet

The authors combines their incremental approach with old monolithic approach that overapproximates the reachability on longer and longer state sequences.


2011年12月23日星期五

ASPLOS'10: Conservation Cores: Reducing the Energy of Mature Computations

The dark silicon problem is attacked by this paper again. It states that with the silicon process advancement, the number of devices will double every generation, but the power budget remain unchanged, while the voltage and threshold voltage remain unchanged.

So the percentage of devices that can work at full speed will reduced from a generation to the next. So this paper proposes to synthesize specialized cores to run those hot code segments.


HPCA'10: Aérgia: Exploiting Packet Latency Slack in On-Chip Networks

This paper proposes to use slack, the number of cycle that a package can be delayed without affecting the NOC performance, to arbitrate which package in contention should be forwarded.

And this paper also proposes approaches to approximate the slack by correlate it with i) the number of precedence L2 cache miss package ii) whether the package is itself a L2 miss and iii) the hop of this package.




2011年12月22日星期四

HPCA'03 : Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors

It mentions a very interesting technique, when the instruction window is blocked by a long run instruction, such as a memory access, the processor will check point current state and turn to a runahead mode, in which the processor run all following instructions but not commit their result.

When the long run instruction return back, the processor will return to the check pointed state and rerun all following instructions.

This technique can make all the data referred by following instructions to be fetched into cache before they are needed.

2011年12月14日星期三

OSDI'10: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

Current synchronous system call can evict large amount of cache and TLB. So frequently calling system calls will damage the performance.

This paper proposes to use an agent to hold all system call request, and halt the calling process or thread. When a certain number of calls are waiting or all process/threads are halting, then switch to the kernel mode to process all requests.


2011年12月2日星期五

MICRO'11: Q S C ORES: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores

This paper analysizes the program benchmarks to find out hot fragment that are executed frequently, and then find out their common flow graph. And the use these commoness to synthesize new specialized core and integrated them with a general purpose processor.


MICRO'11: Manager-Client Pairing: A Framework for Implementing Coherence Hierarchies

This paper proposes a hierarchical framework for cache coherent protocol, such that it can be designed in a multi-layer way.

Thus, this approach can avoid the design of a global single layer protocol that is neither salable nor power efficient.

MICRO'11:PACMan: Prefetch-Aware Cache Management for High Performance Caching

This paper proposes a novel cache that knows the existence of prefetch, and deal it in a different way than normal data request.


2011年12月1日星期四

MICRO'11:Idempotent Processor Architecture

This paper proposes to use compiler to construct Idempotent region, a list of instructions that can reexecute again without obtaining different results.

And it also proposes a hardware structure that can execute such code. In this way, it can handle mispredication and exception cheaply.

2011年11月30日星期三

MICRO'11: Architectural Support for Secure Virtualization under a Vulnerable Hypervisor

To prevent the hypervisor from reading the guest VM's memory, this paper proposes hardware mechanism to hide the VM's status and memory owner mapping table from the hypervisor.

The hypervisor can still access these VM's private information be calling this novel hardware.


MICRO'11: Towards the Ideal On-chip Fabric for 1-to-Many and Many-to-1 Communication

This paper proposes a NOC that support 1-M broadcast/multicasr and M-1 aggregation.

For 1-M, it can balance the usage of X and Y links.It also use matrix-like crossbar with tri-state driver to easy the layout and reduce power. It also reduce the traditional 3 stages pipeline to 1 cycle by sending the advance request before sending flit, such that the next router can perform arbitration before the flit actually arrive.

To aggregate the ACK messages, every ACK message arrive first will become the master at that router, and then each arriving ACK will be merged with the master by adding its ACK number to master's. After a predefined time, no matter whether the master had aggregate all ACKs, it will be send to upper stream router.


2011年11月27日星期日

CAV'11: Linear Completeness Thresholds for Bounded Model Checking

This paper find a set of LTL formulas that contains only F operator, is cliquey. That is, such formulas's corresponding Buchi automata's each SCC have bidirection edge between every two edges.

So when produced with a kripke structure M with recurence diameter rd, the product machine's diameter is a linear function of rd.

This paper also give an algorithm to computer an even tighter approximation of the product machine's diameter.


2011年11月25日星期五

CAV'11: Interactive Synthesis of Code Snippets

This paper improves PLDI'05 paper "Jungloid mining: helping to navigate the API jungle" by supporting parametric type that can lead to multiple ground type.


CAV'11: A Specialization Calculus for Pruning Disjunctive Predicates to Support Verification

The manipulation of seperation logics typically requires the unfolding of disjunctive predicates which may lead to expensive proof search.

This paper proposes to rule out invalid instances of predicates before unfolding, by propagating hidden properties to prune infeasible disjuncts.




2011年11月24日星期四

CAV'10: Smoothing a Program Soundly and Robustly

Based on previous work in "Smooth interpretation" PLDI'10, this paper propose three correctness requirements that make sure smoothing approach can be correctly applied.

1 soundness: general smoothing problem is undecidable, so an approximation is needed, which computes an k-dimension interval <{l_1,u_i},...,{l_k,u_k}> for each input. The soundness requirement is that the output ideal smooth function should be in this interval.
2 robustness: the resulting smooth function must have derivations of all order.
3 β-robustness: partial derivative is bounded by a function linear in x and β.A large number of numerical search routines work by sampling the input space under the expectation that the derivative around each sample point is well defined.



PLDI'10: Smooth interpretation

In digital signal processing, there is Gaussian smoothing technology that can remove the noise of a signal by computing the convolution of the signal and Gaussian function.

This paper proposes to do the same on a simple imperative program that contains while loop, if statement, assign statement and Boolean/floating expression. This simple syntax seems somewhat like fortran.

This paper first perform symbolic simulation on this program to obtain its denotational semantics. And then take the convolution of this semantics and N(x,β), a joint density function of k independent normal variables, each with mean 0 and standard deviation β.

The resulting convolution will always be smooth, and the higher β, the smoother the convolution.

Thus, this smoothed program can be analysized by numerical optimization techniques. This paper uses technology in finding optimal parameter value of PID controller in physical-cyber system.

CAV'11: Quantitative Synthesis for Concurrent Programs

This paper construct the product machine of the environment, the scheduler, the performance model and the partial program. The partial program contains some unspecified predicates that need to be inferred by the algorithm.

And then it is reduced to 2-player game with probabilistic transitions.

The most interesting thing to me is the probabilistic transition that computes a distribution on the next state set.


2011年11月23日星期三

CAV'11: Resolution Proofs and Skolem Functions in QBF Evaluation and Applications

This paper first mentions some concepts that are not familiar to me:
1 Universal reduction : all universal qualified variables at the inner most layer can be removed
2 Q-resolution : first apply normal resolution to existential variables and then apply universal reduction, repeat this process again and again
3 Any QBF can be converted to skolem normal form that contains only two level of quantifiers, first existential and then universal. In this conversion, every existential variable can be replaced with a function F_x, that refers only to inner layer of universal variables. These functions are then existential quantified before those original universal variables.

This paper then describe an algorithm similar to craig interpolant, to transform the resolution proof of the QBF into the skolem functions.


2011年11月22日星期二

MICRO'10: Register Cache System not for Latency Reduction Purpose

This paper shows some astonished facts about the huge register files in wide superscalar processors:
1 The reg file is almost the same size and latency of the L1 cache
2 the reg file requires 2 or even 3 cycles to access
3 deeper pipeline for reg file increase the penalty of the speculation and also result in result shortage that leads to pipeline stall
4 deeper pipeline for reg file also leads to deep bypass network with large number of mux and long wire.

reg cache can store the most used reg file content, and leads to reduced cycle time and cycle number and power.


MICRO'10: Combating Aging with the Colt Duty Cycle Equalizer

In modern CMOS process, the uneven usage of devices will make some heavily used devices approaching their failue very fast.
For example, adders are often used to add a small constant to a variable, so some devices in the higher bits of this adder are always in a constant state.
At the same time, data cache often contains more 0s than 1s.
So this paper proposes to use these devices in a more even way.
For data path components, it proposes to flip it into complementary mode instantly, in which all data are represented in 1's complementary.
For data cache, it proposes to instantly flip it into complementary mode, in which all data are stored in their 1's complementary.
For cache that suffers from uneven usage, it use LFSR to hash the index pointer, to prevent some address from heavily used, while other are not used.
For data components that suffer from uneven usage of left and right operand, it propose to swap these operand instnatly.

2011年11月21日星期一

MICRO'10:Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

This paper shows that for current processors, the clock and control consume 24% power, data supply, such as data cache, consume 28% power, and instruction supply consume 42% power, the real computation only consume 6%. So over 90% power is overhead.

So it seems that combining general purpose cpu, fpga, asic and gpu in a chip is a good solution. But this paper does not give a clear conclusion yet, only bunch of figure that show trend.

MICRO'10:Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling

This paper shows that the actions taken by the cores, such as TLB miss, branch mispredication, clock gating, will cause supply voltage swing. But actual swing is often smaller than the swing margin of the designer. So the processors are normally run in a too conservative mode, which leads to too low voltage, and therefore too slow speed.

So many papers propose to run the processors with smaller margin, and use some mechanism to recover when errors occur.

This paper proposes to measure the relationship between the voltage swing and the job run on the processor, and extrapolante this relation to wider range. It also proposes a software scheduler that help smooth the voltage swing by balancing the load on cores.

2011年11月19日星期六

MICRO'10 : Probabilistic Distance-based Arbitration: Providing Equality of Service for Many-core CMPs

In a NOC that connect multiple PE, if the router is round-robin, then a PE that is far from its communication destination can be treated very unfair in getting bandwidth, while a PE nearer can get more bandwidth.
This paper proposes use the traveling distance of a package to control the arbitrator in a router, such that fairness can be guaranteed.

MICRO'10:Fractal Coherence: Scalably Verifiable Cache Coherence

This paper proposes a cache coherence protocol whose shape is fractal, which is a special case of induction.
For example, if the whole system is a tree, then for a tree of height n, this paper's approach expand one of its leaf to a small tree of height 2, then this new tree is of height n+1.
Then along the border of this newly expanded small tree, this paper tries to prove that both side can observe the other side just the SAME as the old tree.
Here, SAME doesn't just mean one to one state corresponding, instead it means one side in the new tree can have a temporal mapping of its state sequence to a single state in the old tree.

2011年11月18日星期五

MICRO'10: Virtual Snooping: Filtering Snoops in Virtualized Multi-cores

In a multi-core system running multiple Virtual machines, each VM runs on a relatively fixed set of cores, so this paper proposes a mechanism that perform snoopy based cache coherent within VM boundary.

For those rare cases that cache coherent may cross VM boundary, for example, communication between VMs and hypervisor, VM migration to new core, this mechanism provides some techniques to deal with them.


MICRO'10: Adaptive Flow Control for Robust Performance and Energy

This paper proposes an NOC, whose individual router can switch between backpressured and backpressureless mode. The advantage of the former mode is its high throughput, while the latter don't use buffer in low through put application, which leads to low power overhead. The backless mode just drop contention flit or route it ramdomlly.

When two adjacent routers are in different mode, the backpressureless router will change back to backpressure mode, such that it can communicate correctly with its neighbor.

2011年11月17日星期四

MICRO'10: ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment

Some new architecture support execution of an atomic block of instruction(chunk), and commit their result at the end of the trunk.
When multiple processors commit at the same time, it is desirable that it is scalable.
This paper's approach:
(i) need no centralized structure,
(ii) communicate only with the relevant directories, and
(iii) allow the concurrent commit of chunks that use the same directory, as long as their addresses do not overlap.

2011年11月4日星期五

MICRO10 : Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

Based on the observation that multiple flit in a package normally traverse through the same input-output pair in a router, so it is not necessary to arbitrate for every flit. So this paper propose to remember the last arbitrate result and use it for the next flit.

MICRO10: Erasing Core Boundaries for Robust and Configurable Performance

Dynamically merging computation components and pipeline states to form a processor with wider data-path or a group of simple processors with narrow data-path.

This approach can also remove defective component to increase reliability.


MICRO10: Parichute: Generalized Turbocode-Based Error Correction for Near-Threshold Caches

To reduce power consumption, the cache is driven by very low voltage that is near the threshold. But this may significantly increase the possibility of error. so this paper proposes to detect and correct these error with turbo code.

MICRO10:Minimal Multi-Threading: Finding and Removing Redundant Instructions in Multi-Threaded Processors

Detecting instruction flow similarity between multiple thread, and only split them when they expose different computation.

2011年10月26日星期三

Rippling: a heuristic for guiding inductive proofs

It assumes that, which is actually true in most induction, the conclusion and hypothesis have almost the same structure(also called skeleton).
So rippling just convert each inferring rules of this reasoning system into a new rewriting rule, which transforms formulas with the same skeleton.
Such transformation is very similar to the rippling wave on the water surface, which is the source of the name rippling.

2011年10月23日星期日

SOSP11: PTask: Operating System Abstractions To Manage GPUs as Compute Devices

This paper provide a dataflow programming model by designing a Ptask API, which describe the dataflow between varias cmputing tasks.

These tasks can be schedule on to GPU as well as CPUs. In this way, non-neccesarry copy between devices, drivers and user codes are avoid, and the user does not need to care about the data movement explictly.

SOSP11:Scalable Consistency in Scatter

There are so many papers that discuss the key-value storage, their consistency and scalability .

But I can't go into this topic because I dont have access to such environment.

2011年9月27日星期二

MICRO10: Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric

This paper use a FPGA to monitor the instruction stream of the main processor, which can enforce the security and array bound check requirements

MICRO10: Memory Latency Reduction via Thread Throttling


This paper propose a gather computie scatter programing style that seperate the program into computing and memory thread, and schedule both of them to make sure that the number of memory thread would not over run the limited memory bandwidth.

MICRO10: The ZCache: Decoupling Ways and Associativity

This paper seperate the concept of way and association.It can has small number of physical way, but large number of associations. It can access the physical way within one cycle if it hits, otherwise,it will search on the side path with multiply cycle to access more associations.

2011年8月28日星期日

POPL02:Lazy Abstraction

Today I read this paper again, and find that I have not capture its major ideas before.

Actually, It first unfold the control flow of the program to construct a tree that can reach the error position. This graph's each vertex is labeled with its own set of predicates and the reachable states formula expressed in these predicates. Its edge is labeled with the basic block of the program.

Then it backtrack and compute weakest precondition at each edge, until it find a node whose intersection of reachable state and precondition formula is empty. This will discover a new predicate by analyzing the unsat proof of this intersection .

Third, it constructs new abstract model from this pivot state to reach more errors again.

For each node reached, if it is also reached before, and the old node's label formula cover the new reached node's formula, then this node does not need to be explored any more.


2011年8月19日星期五

CAV11:HAMPI: A String Solver for Testing, Analysis and Vulnerability Detection

What is important is the solver on strings.

CAV11:Interpolation-Based Software Verification with Wolverine

This paper is not interesting, but the tool is open source, that is the point.

2011年8月8日星期一

TACAS11:Sound and Complete Monitoring of Sequential Consistency for Relaxed Memory Models

This paper propose some very simple memory models for TSO and PSO, and then, I think it is the most important idea, find out whether there is some behavior on these models that can't occur in normal SC model.


2011年8月7日星期日

FMCAD11:Effective Word-Level Interpolation for Software Verification

It proposes a layered approach to generate word level interpolant for BV.
The traditional approach include two DPLL solver, one for the Boolean skeleton on the formula, the other for the conjunction of BV clauses. When a satisfying assignment is generated by the Bool solver, it is passed to the BV solver to verify it consistency.

This paper's approach inserts some layer between this two solver. These layers are over-approximation of the BV solver. They check for easier cases that can generate interpolants. For example, EUF.

FMCAD07:Lifting Propositional Interpolants to the Word-Level

This paper lift the resolution tree of the SAT instance that generated from an EUF formula.

In the analysis, it not only need the resolution tree, but also the corresponding equality predicates that introduced in encoding EUF equality into CNF, and also the clause structure of this encoding.

So I can not use it to infer word level structure for a purely bit level CNF generated from complementary synthesis.


VMCAI11:SAT-Based Model Checking Without Unrolling

This approach contains two nested loop,

1. in the major loop, it iteratively constructs a list of formulas F0....Fk, in which Fk means the over-approximation of the states set reachable in k steps.

2. in the minor loop, it iteratively refines the F0...Fk, such that they form an inductive path that leads to the property P to be verified.

DAC11:Re-Synthesis for Cost-Efficient Circuit-Level Timing Speculation

This paper reduce the number of paths that need "Timing Speculation" by retiming to improve timing character.

And then it use ILP to share pading buffers between different paths.

MICRO03:Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation

Dynamic voltage scaling is widely used to reduce the power consumption, but it need to guarentee that all critical paths can run correctly.This may seem too conservative becasue too long path only account for a very small percent of all paths.

So this paper propose to use a shadow register to sample the critical path a little bit later than the real register, and then compare them. If the result is not the same, then a circuit is used to roll back the correct state.

One major overhead of "Timing Speculation" is the need to pad the short paths, to prevent them from making the comparator to think that a too long path has cause a timing error.

DAC11:Abstraction-Based Performance Analysis of NoCs

It infers the traffic pattern for each channel in a NOC by analyzing the simulation trace generated by typical benchmarks.

This traffic pattern include the burst transition number, transition rate and bound.

And then it generate a regulator for each channel that represent this pattern and insert them into the correspoding channel, and constraint that they all hold in verification.

In this way, the traffic pattern that we interest in will be forced to hold.

This approach can also be extended to multiple channels, for example, to constrain that we dont allow five source to flood the sink at the same time, be we allow one of them to send.

Inferring Assertion for Complementary Synthesis" is accepted by ICCAD'1

My paper "Inferring Assertion for Complementary Synthesis" has just been accepted by ICCAD'11.


TACAS11: Canonized Rewriting and Ground AC Completion Modulo Shostak Theories

Very interesting paper that combine AC(ssociativity and commutativity) operators with Shostak approach.

Generally, there are two approaches for theory combinations, nelson-oppen and Shostak.

The former one just pass equalities between T-solvers, so it is relativly easy to combine AC with it.

But for Shostak, it need to computer canonical form, which is not possible for AC.

TACAS11: Transition Invariants and Transition Predicate Abstraction for Program Termination

Very interesting.
Normal invaraints and abstraction are for states, while this paper apply them to transitions.
But I think the disjunctive well foundness is a very well known result.
I dont understand why this paper published on TACAS 11 again, these topic are well discussed before 2006.

CAV11: Verification of Certifying Computations

Instead of directly verifying a program, this paper tries to verify the checker of a particular program.

This checker is actually an over-approximation of the behavior of the original program.

CAV11: Existential Quantification as Incremental SAT

It use ALLSAT approach to perform existential quatification. And use increamental SAT to reduce the run time overhead.

But I think the major innovation is to use sorting network to find out the shortest prime clause. This is bettern than my approach in VMCAI05: Minimizing Counterexample with Unit Core Extraction and Incremental SAT, which reduce the size of a block clause by iteratively remove those literal not in the support set of a proof.

CAV11: Temporal property verification as a program analysis task

It is a very interesting paper. Given a transition system and a temporal logc property, generate a program that search for the proof or counterexample of this property. and then use existing program analysis tools to determine that whether this program terminate with some answer.

All the methods that boost the speed, such as abstraction refinement, will be automatically performed by the program analysis tools.

2011年3月30日星期三

TACAS'11::Optimal Base Encodings for Pseudo-Boolean Constraints

This paper talk about a new method to translate the PSB problem into SAT.

Of course there are some direct solving methods, such as that of PBS solver proposed by Fadi Aloul .

But it seems that translating into SAT is also very interesting. it refer to some early papers:

1. A Translation of Pseudo-Boolean Constraints to SAT:
This paper just iterate though all coefficient, and case split on they to consider the two cases of them, and substract the correspond value from the right handside of (dis)equation.
In this way, an SAT instance is encoded.
2. Translating Pseudo-Boolean Constraints into SAT

it propose three methods
a. BDD based: similar to above method
b. adder network: it express the coefficient with a set of 2-exp, and add them together with adders.but because there are XOR gates in these adders, it is Not arc-consistent.
c. sorting network : for small coefficients n, it use n length bit vector to express it.Assume again that the sum of all coe is N, then a sorter of size N is instanced.For variables x and its coe n, x is connected to n inputs of the sorter. For right hand side x, the 0-th to x-th output of sorter is asserted to be 1.
To deal with larger coe, a set of base must be founds, such that each sorter can be used to express larger number that 1.
To minimize the size of generated SAT instance, a set of optimal base must be found, this is the motivation of this TACAS11 paper.
This TACAS11 paper propose three cost function to evaluate the total cost that need to be minimized:
a. summing the digits used to express coes: only bases with only prime numbers are need to be considered.
b. also taking into acount of the carry bits between sorters: non-prime based must be consider
c. consider the number of comparators in sorters.
With these cost function, it proposes three searching methods, they approximate the cost function and perform branch and bound searching to find a minimized cost base.