

#arXiv

29 posts · 22 participants · 1 post today

arXiv is moving to GCP

I saw the news that arXiv is moving to GCP at "arXiv moving from Cornell servers to Google Cloud (arxiv.org)"; it comes from their careers page, "Careers at arXiv - arXiv info": We are already underway on the arXiv CE ("Cloud Edition") project. This is a project to re-home all arXiv services from VMs at Cornell to a cloud provider (Google Cloud). Judging by the comments on Hacker News, though, this seems to be affected by the Trump administration's funding policies toward universities, and these job openings…

blog.gslin.org/archives/2025/0

Gea-Suan Lin's BLOG · arXiv is moving to GCP

Pushing the Limits of LLM Quantization via the Linearity Theorem

arxiv.org/abs/2411.17525

arXiv.org · Pushing the Limits of Large Language Model Quantization via the Linearity Theorem

Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, which outperforms all prior data-free approaches such as the extremely popular NF4 quantized format, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels which match a given compression constraint in the medium-bitwidth regime, obtained by reduction to dynamic programming. On the practical side, we demonstrate improved accuracy-compression trade-offs on Llama-3.1 and 3.2-family models, as well as on Qwen-family models. Further, we show that our method can be efficiently supported in terms of GPU kernels at various batch sizes, advancing both data-free and non-uniform quantization for LLMs.
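The abstract's second application, choosing per-layer bitwidths under a compression constraint by dynamic programming, can be illustrated with a small knapsack-style sketch. This is not the paper's algorithm: it only assumes that the total penalty is a sum of per-layer penalties (which is what a linear relationship between layer error and perplexity would suggest), and the function name `allocate_bits`, the layer sizes, and the penalty table are all hypothetical.

```python
def allocate_bits(sizes, costs, budget):
    """Pick one bitwidth per layer to minimise the summed penalty.

    sizes[i]  : parameter count of layer i (hypothetical units)
    costs[i]  : dict mapping bitwidth -> estimated penalty for layer i
    budget    : maximum total number of bits across all layers
    Returns (total_penalty, [bitwidth for each layer]).
    """
    # best maps "bits used so far" -> (lowest penalty, choices that reach it)
    best = {0: (0, [])}
    for size, options in zip(sizes, costs):
        nxt = {}
        for used, (penalty, picks) in best.items():
            for bits, layer_penalty in options.items():
                total = used + size * bits
                if total > budget:
                    continue  # this choice would break the constraint
                cand = (penalty + layer_penalty, picks + [bits])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("budget is too small for any allocation")
    return min(best.values(), key=lambda entry: entry[0])


# Made-up example: the first layer is the largest and the most sensitive.
sizes = [4, 2, 2]
costs = [
    {2: 200, 3: 80, 4: 20},
    {2: 50, 3: 20, 4: 5},
    {2: 30, 3: 10, 4: 2},
]
# Budget equal to an average of 3 bits per parameter.
result = allocate_bits(sizes, costs, budget=3 * sum(sizes))
print(result)
```

With these made-up numbers the non-uniform allocation (4 bits for the sensitive layer, 2 bits for the others) beats uniform 3-bit quantization at the same total size, which is the kind of trade-off such a search is meant to find.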

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

arxiv.org/abs/2502.15840

arXiv.org · Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.

All-in-Memory Stochastic Computing Using ReRAM

arxiv.org/abs/2504.08340

arXiv.org · All-in-Memory Stochastic Computing using ReRAM

As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC operations are inherently fault-tolerant, their accuracy largely depends on the length and quality of the stochastic bit-streams (SBS). These bit-streams are typically generated by CMOS-based stochastic bit-stream generators that consume over 80% of the SC system's power and area. Current SC solutions focus on optimizing the logic gates but often neglect the high cost of moving the bit-streams between memory and processor. This work leverages the physics of emerging ReRAM devices to implement the entire SC flow in place: (1) generating low-cost true random numbers and SBSs, (2) conducting SC operations, and (3) converting SBSs back to binary. Considering the low reliability of ReRAM cells, we demonstrate how SC's robustness to errors copes with ReRAM's variability. Our evaluation shows significant improvements in throughput (1.39x, 2.16x) and energy consumption (1.15x, 2.8x) over state-of-the-art (CMOS- and ReRAM-based) solutions, respectively, with an average image quality drop of 5% across multiple SBS lengths and image processing tasks.
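The idea that AND and majority gates can stand in for multiplication and addition is easy to check in software. The sketch below is only an illustration of the general stochastic-computing principle, not of the paper's ReRAM design: it assumes unipolar encoding (a value in [0, 1] is the fraction of 1s in the stream), uses Python's pseudo-random generator in place of true random numbers from hardware, and all function names are made up for this example.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a value p in [0, 1] as a stream whose bits are 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def to_value(stream):
    """Convert a bit-stream back to a number: the fraction of 1s."""
    return sum(stream) / len(stream)


def sc_multiply(a, b):
    """Bitwise AND of two independent streams approximates the product a * b."""
    return [x & y for x, y in zip(a, b)]


def sc_scaled_add(a, b, half):
    """Majority of a, b and an independent p=0.5 stream approximates (a + b) / 2."""
    return [1 if x + y + h >= 2 else 0 for x, y, h in zip(a, b, half)]


rng = random.Random(0)      # fixed seed so the run is repeatable
n = 100_000                 # longer streams give better accuracy
a = to_bitstream(0.6, n, rng)
b = to_bitstream(0.5, n, rng)
half = to_bitstream(0.5, n, rng)

product = to_value(sc_multiply(a, b))            # should be close to 0.30
scaled_sum = to_value(sc_scaled_add(a, b, half))  # should be close to 0.55
print(product, scaled_sum)
```

Shortening `n` makes the estimates noticeably noisier, which matches the abstract's point that accuracy depends on the length and quality of the bit-streams.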