by jsendak | Feb 22, 2024 | Computer Science
In this article, the researchers introduce EyeEcho, an acoustic sensing system with the potential to significantly advance the field of facial expression monitoring. Using two pairs of speakers and microphones mounted on glasses, EyeEcho emits encoded inaudible acoustic signals toward the face and captures the subtle skin deformations associated with facial expressions from the reflected echoes.
The ability of EyeEcho to continuously monitor facial expressions in a minimally-obtrusive way is a major breakthrough. Traditional methods of facial expression tracking often require the use of cumbersome and uncomfortable equipment, making it difficult to capture natural and spontaneous expressions in everyday settings. With EyeEcho, users can seamlessly wear the glasses and go about their daily activities while the system accurately tracks their facial movements.
One key technology behind EyeEcho is machine learning. The reflected signals captured by the microphones are processed through a customized machine-learning pipeline that analyzes the echoes and estimates full facial movements, enabling precise, real-time tracking.
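The article does not detail the pipeline's architecture, so the following is only a minimal sketch of the general shape of such a system: windows of reflected echo profiles from the two speaker/microphone pairs are fed to a small convolutional regressor that predicts facial expression parameters. The frame and range-bin dimensions, the blendshape-style output, and the network itself are assumptions, not EyeEcho's actual design.

```python
import torch
import torch.nn as nn

class EchoExpressionRegressor(nn.Module):
    """Hypothetical regressor from acoustic echo profiles to expression
    parameters. Input: (batch, 2 speaker/mic channels, time frames, range
    bins); output: 52 blendshape-style weights (an assumed parameterization)."""
    def __init__(self, n_channels=2, n_params=52):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, n_params)

    def forward(self, echo_profiles):
        # echo_profiles: (batch, channels, time_frames, range_bins)
        return self.head(self.encoder(echo_profiles).flatten(1))

model = EchoExpressionRegressor()
dummy_echoes = torch.randn(8, 2, 32, 64)   # 8 synthetic windows of echo data
print(model(dummy_echoes).shape)           # torch.Size([8, 52])
```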
An impressive aspect of EyeEcho is its low power consumption. Operating at just 167 mW, EyeEcho can provide continuous facial expression monitoring without significantly impacting the battery life of the glasses. This makes it feasible for long-term usage without frequent recharging or battery replacements.
The researchers conducted two user studies to evaluate EyeEcho’s performance. The first study involved 12 participants and showed that, with just four minutes of training data, EyeEcho achieved highly accurate tracking across real-world conditions, including sitting, walking, and after the glasses were remounted. This indicates that EyeEcho adapts well to different situations and maintains its accuracy in various contexts.
The second study involved 10 participants and evaluated EyeEcho’s performance in naturalistic scenarios while participants engaged in various daily activities. The results further validated EyeEcho’s accuracy and robustness, showcasing its potential to effectively track facial expressions in real-life situations.
One particularly exciting prospect highlighted in the article is the potential of EyeEcho to be deployed on a commercial-off-the-shelf (COTS) smartphone. By integrating this technology into smartphones, it opens up possibilities for widespread adoption and usage. Real-time facial expression tracking could have numerous applications in areas such as virtual reality, augmented reality, emotion detection, mental health monitoring, and more.
In conclusion, EyeEcho represents a significant advance in facial expression monitoring. Its minimally-obtrusive design, accurate tracking, low power consumption, and potential for smartphone integration make it a promising solution for a range of industries and applications. Further research and development will likely expand the possibilities EyeEcho offers.
Read the original article
by jsendak | Feb 21, 2024 | Computer Science
arXiv:2402.12760v1 Announce Type: new
Abstract: Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve the desired results by manually entering prompts due to a discrepancy between novice-user-input prompts and the model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. For CFP, we construct a novel dataset for text-to-image tasks that combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. For UF-FGTG, we propose a novel framework that automatically translates user-input prompts into model-preferred prompts. Specifically, we propose a prompt refiner that continually rewrites prompts to empower users to select results that align with their unique needs. Meanwhile, we integrate image-related loss functions from the text-to-image model into the training process of text generation to generate model-preferred prompts. Additionally, we propose an adaptive feature extraction module to ensure diversity in the generated results. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics.
Automated Prompt Optimization for Text-to-Image Models
In this article, the authors propose a novel framework called User-Friendly Fine-Grained Text Generation (UF-FGTG) for automated prompt optimization in text-to-image models. They address the challenge of novice users achieving the desired results when manually entering prompts by bridging the gap between user input behavior and model training datasets.
This research connects to several related areas, including multimedia information systems, animation, artificial reality, augmented reality, and virtual reality, all of which stand to benefit from text-to-image models that produce more visually appealing and diverse images.
Constructing the Coarse-Fine Granularity Prompts Dataset
The authors first construct a novel dataset called Coarse-Fine Granularity Prompts (CFP) specifically for text-to-image tasks. This dataset combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. This approach allows for high-level guidance while ensuring that user preferences are taken into account.
User-Friendly Fine-Grained Text Generation Framework
The UF-FGTG framework proposed in this research provides an automated solution to translate user-input prompts into model-preferred prompts. This framework includes a prompt refiner that continually rewrites prompts to empower users in selecting results that align with their unique needs.
To steer generation toward prompts the model prefers, the authors integrate image-related loss functions from the text-to-image model into the training of the text generator, so the refined prompts are optimized not only for fluency but for the quality of the images they produce.
Furthermore, an adaptive feature extraction module is proposed to ensure diversity in the generated results. This helps enhance visual appeal and prevents repetitive or similar images from being generated.
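To make the combined training objective concrete, here is a minimal sketch assuming a small sequence-to-sequence prompt refiner trained with a cross-entropy loss against CFP’s fine-grained prompts plus a placeholder image-related term. The module, the dummy loss, and the 0.5 weighting are assumptions rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class PromptRefiner(nn.Module):
    """Stand-in prompt refiner: token ids of a coarse user prompt in,
    logits over a fine-grained (model-preferred) prompt out."""
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, coarse_ids):
        hidden, _ = self.rnn(self.embed(coarse_ids))
        return self.out(hidden)                      # (batch, seq, vocab)

def image_related_loss(refined_logits):
    """Dummy placeholder for losses backpropagated from the text-to-image
    model (e.g. aesthetic or text-image alignment terms)."""
    return refined_logits.softmax(-1).var()

refiner = PromptRefiner()
ce = nn.CrossEntropyLoss()
coarse = torch.randint(0, 1000, (4, 12))       # coarse user prompts (token ids)
fine_target = torch.randint(0, 1000, (4, 12))  # fine-grained targets from CFP

logits = refiner(coarse)
loss = ce(logits.reshape(-1, 1000), fine_target.reshape(-1)) \
       + 0.5 * image_related_loss(logits)      # 0.5 weight is an assumption
loss.backward()
```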
Impact and Implications
This research has significant implications for the field of multimedia information systems. By automating prompt optimization in text-to-image models, it streamlines the process of generating visually appealing and diverse images. This can have applications in fields such as graphic design, advertising, and entertainment where high-quality visuals are crucial.
The concepts of animation, artificial reality, augmented reality, and virtual reality are closely related to this research. Animation and virtual reality depend on realistic, visually engaging imagery, which improved text-to-image generation can help provide. Artificial and augmented reality can likewise benefit from more diverse and appealing images, enhancing user experiences in these simulated environments.
In conclusion, the authors’ UF-FGTG framework presents a promising solution to automated prompt optimization in text-to-image models. By leveraging multi-disciplinary concepts and constructing the CFP dataset, this research contributes to the wider field of multimedia information systems and has implications for various domains relying on high-quality visuals.
Read the original article
by jsendak | Feb 21, 2024 | Computer Science
Expert Commentary: Evaluating Language Models’ Unethical Behaviors with Human Knowledge
Language models have become an integral part of various downstream tasks, but concerns about fairness and biases in their outputs have been raised. In this article, the authors introduce a new approach to study the behavior of pre-trained language models (LMs) within the context of gender bias. By incorporating human knowledge into natural language interventions, they aim to probe and quantify unethical behaviors exhibited by LMs.
The authors present a checklist-style task inspired by CheckList behavioral testing. This task allows them to evaluate LMs from four key aspects: consistency, biased tendency, model preference, and gender preference switch. By examining these aspects, they can gain insights into how LMs handle and potentially perpetuate gender biases in their outputs.
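As a rough illustration of what a checklist-style gender probe can look like, the sketch below swaps two names across role templates and measures how often the model’s answer switches with the swap. The templates, names, metric, and the deliberately biased stand-in model are hypothetical and are not taken from the paper.

```python
TEMPLATES = [
    "{a} is a nurse and {b} is a doctor. Who is the doctor?",
    "{a} is a teacher and {b} is an engineer. Who is the engineer?",
]

def answer(question: str) -> str:
    """Stand-in for a QA model (e.g. a transformer fine-tuned on SQuAD-v2).
    This caricature always answers 'John', so the probe below reports a zero
    switch rate and flags the (deliberately injected) gender bias."""
    return "John"

def preference_switch_rate(templates, name_a="John", name_b="Mary"):
    switches = 0
    for t in templates:
        ans_ab = answer(t.format(a=name_a, b=name_b))
        ans_ba = answer(t.format(a=name_b, b=name_a))
        # An unbiased model answers with whichever name fills the queried role,
        # so its answer changes (switches) when the names swap positions.
        switches += int(ans_ab != ans_ba)
    return switches / len(templates)

print(preference_switch_rate(TEMPLATES))   # 0.0 for the biased stand-in
```

With a real QA model plugged into answer(), a switch rate near 1 indicates answers that track the queried role, while a rate near 0 indicates a preference for one name (and, by proxy, one gender) regardless of role.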
To conduct their study, the authors probe a transformer-based question-answering (QA) model trained on the SQuAD-v2 dataset and an autoregressive large language model, and find that the two behave in contrasting ways. The QA model’s biased tendency correlates positively with its consistency, suggesting that it exhibits biased behavior consistently. The autoregressive large language model shows the opposite relationship between biased tendency and consistency.
This research presents a significant contribution by providing the first dataset that involves human knowledge for evaluating biases in large language models. By introducing a checklist-style task, the authors offer a systematic approach to assess language models’ ethical behavior. This is crucial for ensuring fairness and mitigating biases in AI systems that rely on language models.
Further research can build upon this work by expanding the checklist-style task and incorporating more diverse dimensions of bias evaluation. Additionally, exploring techniques to mitigate bias in language models based on the insights gained from this study could be an area for future investigation.
Read the original article
by jsendak | Feb 20, 2024 | Computer Science
arXiv:2402.10805v1 Announce Type: new
Abstract: The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable multimodal large language models (MLLMs) to memorize and recall images within their parameters. Given a user query for visual content, the MLLM is anticipated to “recall” the relevant image from its parameters as the response. Achieving this target presents notable challenges, including inbuilt visual memory and visual recall schemes within MLLMs. To address these challenges, we introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images and involves two training steps: learning to memorize and learning to retrieve. The first step focuses on training the MLLM to memorize the association between images and their respective identifiers. The latter step teaches the MLLM to generate the corresponding identifier of the target image, given the textual query input. By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches. The experiments demonstrate that the generative paradigm performs effectively and efficiently even with large-scale image candidate sets.
Advancements in Generative Language Models and Cross-Modal Retrieval
In the field of natural language processing, generative language models have recently gained significant attention for their ability to generate coherent and contextually relevant text based on a given prompt. These models, such as GPT-3, have shown remarkable performance in tasks like text completion, translation, and question-answering. Building upon this capability, the authors of this paper propose extending the functionality of these models to incorporate visual content.
Traditionally, cross-modal retrieval refers to the task of retrieving relevant information from one modality (e.g., text) given a query from another modality (e.g., image). This has been primarily approached through discriminative models that try to learn a mapping between the two modalities and retrieve similar instances. However, the authors introduce a novel paradigm by proposing to “memorize” images within the parameters of the multimodal language model.
The key idea behind the proposed framework is to assign unique identifier strings to represent images and train the multimodal language model (MLLM) to memorize the association between these identifiers and the corresponding images. This involves two training steps: learning to memorize and learning to retrieve. During the first step, the MLLM learns to establish the connection between images and their identifiers. In the second step, it learns to generate the identifier of a target image given a textual query input.
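A minimal sketch of these two steps is shown below, assuming image features are projected into the same space as a bag-of-words query encoding and a small decoder generates fixed-length identifier tokens. The identifier scheme, model components, and training details are placeholders, not the authors’ MLLM.

```python
import torch
import torch.nn as nn

ID_LEN, VOCAB = 4, 32   # e.g. 4-token identifiers over a 32-symbol vocabulary

class TinyGenerativeRetriever(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.img_proj = nn.Linear(512, dim)           # image feature -> prefix vector
        self.txt_encoder = nn.EmbeddingBag(5000, dim) # bag-of-words query encoder
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, VOCAB)

    def generate_ids(self, prefix):
        """Decode ID_LEN identifier-token logits from a prefix vector
        (toy rollout without teacher forcing)."""
        h = prefix.unsqueeze(0)        # (1, batch, dim) initial hidden state
        inp = prefix.unsqueeze(1)      # (batch, 1, dim) first decoder input
        logits = []
        for _ in range(ID_LEN):
            out, h = self.decoder(inp, h)
            logits.append(self.head(out))
            inp = out
        return torch.cat(logits, dim=1)   # (batch, ID_LEN, VOCAB)

model = TinyGenerativeRetriever()
ce = nn.CrossEntropyLoss()
image_feats = torch.randn(16, 512)                  # stand-in visual features
image_ids = torch.randint(0, VOCAB, (16, ID_LEN))   # assigned identifier strings

# Step 1: learning to memorize -- associate each image with its identifier.
loss_memorize = ce(model.generate_ids(model.img_proj(image_feats)).transpose(1, 2), image_ids)

# Step 2: learning to retrieve -- generate the target image's identifier from a text query.
query_tokens = torch.randint(0, 5000, (16, 8))
loss_retrieve = ce(model.generate_ids(model.txt_encoder(query_tokens)).transpose(1, 2), image_ids)

(loss_memorize + loss_retrieve).backward()
```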
The Challenges and Contributions
The main challenge in achieving this goal lies in developing visual memory and recall schemes within MLLMs. Unlike text, which can be tokenized and processed directly by language models, images are high-dimensional data that cannot be stored verbatim in a language model’s parameters. The authors sidestep this by representing each image with a unique identifier string, so that “recalling” an image reduces to generating its identifier.
This proposed framework has several important implications and contributions. Firstly, it introduces a new perspective on cross-modal retrieval by leveraging the generative capabilities of MLLMs. This can potentially lead to more flexible and creative retrieval systems that go beyond simple similarity-based search. Secondly, it expands the scope of multimodal information processing by incorporating images into language models, which have traditionally focused on textual data. This approach allows for a more comprehensive understanding of the content and enables richer interactions between users and models.
Connections to Multimedia Information Systems and AR/VR
The presented research has strong connections to the wider field of multimedia information systems. Multimedia information systems deal with the storage, retrieval, and processing of various types of media, including text, images, audio, and video. The proposed framework addresses the challenge of integrating images seamlessly into language models, which are a fundamental component of multimedia information systems.
Furthermore, this research has implications for the domains of animations, artificial reality, augmented reality, and virtual realities. By enabling language models to memorize and recall images, the framework opens up possibilities for more immersive and interactive experiences in these domains. For example, virtual reality applications could leverage this capability to generate lifelike environments based on textual prompts, creating a more dynamic and realistic user experience.
Conclusion
The introduction of multimodal large language models (MLLMs) that can memorize and recall images presents exciting opportunities for cross-modal retrieval and extending the capabilities of language models. By leveraging generative approaches and training MLLMs to establish associations between images and unique identifiers, the proposed framework provides a new perspective on information retrieval. It also highlights the interdisciplinary nature of the concepts involved, connecting the fields of natural language processing, multimedia information systems, and virtual realities. As further research is conducted in this area, we can expect advancements in multimodal information processing and more immersive user experiences in various multimedia domains.
Read the original article
by jsendak | Feb 20, 2024 | Computer Science
The article discusses a technique to address the challenges associated with surface-surface intersection in computer-aided design (CAD). Surfaces, particularly non-uniform rational B-spline surfaces (NURBS), are commonly used in geometric design. However, when surfaces intersect, trimmed surfaces can emerge, leading to complexities in CAD applications.
One of the main issues with trimmed surfaces is that their parametric domain is usually not a standard shape like a square or rectangle; instead, it is bounded by curves. This makes it difficult for downstream applications such as computer-aided engineering (CAE) to process the data effectively. In addition, a trimmed NURBS surface generally cannot be represented in closed form, so a specialized data structure for the intersection curves is typically required to support downstream applications. Because this data structure is not standardized across CAD systems, the resulting calculations are inefficient.
To address these challenges, the paper proposes a reparameterization or normalization technique for Bezier surfaces, which are a specific case of NURBS. By transforming the trimmed surface into a collection of Bezier surface patches in a standard parametric domain [0,1]×[0,1], the authors aim to eliminate the trimmed surface. The boundary curve of each normalized Bezier surface patch can then be replaced by the intersection curve, resulting in a watertight representation along the boundary. This approach effectively bridges the gap between CAD and CAE, ensuring seamless integration and eliminating any gaps or overlaps that may occur during preprocessing.
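The core operation can be illustrated in one dimension: restrict a Bezier curve to a sub-interval [a, b] and re-express it over [0, 1] using two de Casteljau subdivisions; the surface case applies the analogous construction in both parameters. The sketch below shows only this curve-level idea and is not the authors’ surface algorithm.

```python
import numpy as np

def de_casteljau_split(ctrl, t):
    """Split a Bezier curve with control points `ctrl` at parameter t,
    returning the control points of the left and right sub-curves."""
    levels = [np.asarray(ctrl, dtype=float)]
    while len(levels[-1]) > 1:
        p = levels[-1]
        levels.append((1 - t) * p[:-1] + t * p[1:])
    left = np.array([lvl[0] for lvl in levels])
    right = np.array([lvl[-1] for lvl in levels][::-1])
    return left, right

def normalize_to_unit_interval(ctrl, a, b):
    """Control points of the curve restricted to [a, b], re-expressed on [0, 1]."""
    _, tail = de_casteljau_split(ctrl, a)                     # keep the part over [a, 1]
    segment, _ = de_casteljau_split(tail, (b - a) / (1 - a))  # then keep [a, b]
    return segment

# Quadratic Bezier curve in the plane, restricted to the sub-interval [0.25, 0.75].
ctrl = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
print(normalize_to_unit_interval(ctrl, 0.25, 0.75))
```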
Overall, this technique offers a promising solution to the challenges associated with surface-surface intersection in CAD. By normalizing trimmed surfaces into Bezier surface patches, it simplifies the data structure and improves efficiency in downstream applications. Further research and experimentation could focus on evaluating the performance of this technique with different types of surfaces and exploring its applicability to various CAD systems and workflows. Ultimately, this technique has the potential to enhance the overall accuracy and reliability of CAD models, making them more suitable for downstream analysis and applications.
Read the original article
by jsendak | Feb 17, 2024 | Computer Science
arXiv:2402.09720v1 Announce Type: new
Abstract: Low latency and high synchronization among users are critical for emerging multi-user virtual interaction applications. However, the existing ground-based cloud solutions are naturally limited by the complex ground topology and fiber speeds, making it difficult to pace with the requirement of multi-user virtual interaction. The growth of low earth orbit (LEO) satellite constellations becomes a promising alternative to ground solutions. To fully exploit the potential of the LEO satellite, in this paper, we study the satellite server selection problem for global-scale multi-user interaction applications over LEO constellations. We propose an effective server selection framework, called SpaceMeta, that jointly selects the ingress satellite servers and relay servers on the communication path to minimize latency and latency discrepancy among users. Extensive experiments using real-world Starlink topology demonstrate that SpaceMeta reduces the latency by 6.72% and the interquartile range (IQR) of user latency by 39.50% compared with state-of-the-art methods.
Expert Commentary: The Future of Multi-User Virtual Interaction with LEO Satellites
The article highlights the significance of low latency and high synchronization in multi-user virtual interaction applications, which are crucial for providing a seamless and immersive experience to users. However, the existing ground-based cloud solutions face limitations due to complex ground topology and fiber speeds, making it challenging to meet the requirements of these applications. This paves the way for exploring alternative solutions, such as leveraging low earth orbit (LEO) satellite constellations.
LEO satellite constellations offer a promising alternative to ground solutions by providing global coverage and reducing latency issues caused by the constraints of ground-based infrastructure. The article introduces SpaceMeta, an effective server selection framework specifically designed for global-scale multi-user interaction applications over LEO constellations. This framework aims to optimize server selection to minimize latency and latency discrepancies among users.
SpaceMeta takes into account both ingress satellite servers and relay servers on the communication path, ensuring efficient data transmission and reducing latency for enhanced user experience. By jointly selecting these servers, SpaceMeta effectively addresses the challenges posed by multi-user interaction applications in a global context.
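The commentary does not describe SpaceMeta’s algorithm in detail, so the toy sketch below only makes the joint-selection objective concrete: it exhaustively scores candidate ingress/relay assignments on a made-up latency table using mean latency plus latency spread. All values, names, and the scoring function are assumptions.

```python
import itertools
import statistics

# Assumed one-way latencies in milliseconds (all values are made up).
user_to_ingress = {
    ("u1", "sat_a"): 12, ("u1", "sat_b"): 18,
    ("u2", "sat_a"): 25, ("u2", "sat_b"): 14,
    ("u3", "sat_a"): 16, ("u3", "sat_b"): 20,
}
ingress_to_relay = {
    ("sat_a", "relay_1"): 30, ("sat_a", "relay_2"): 45,
    ("sat_b", "relay_1"): 38, ("sat_b", "relay_2"): 28,
}
users = ["u1", "u2", "u3"]
ingresses, relays = ["sat_a", "sat_b"], ["relay_1", "relay_2"]

def score(assignment, relay):
    """Mean end-to-end latency plus the latency spread across users (lower is better)."""
    lat = [user_to_ingress[(u, s)] + ingress_to_relay[(s, relay)]
           for u, s in assignment.items()]
    return statistics.mean(lat) + (max(lat) - min(lat))

best_score, best_plan = float("inf"), None
for relay in relays:                                          # shared relay server
    for choice in itertools.product(ingresses, repeat=len(users)):
        assignment = dict(zip(users, choice))                 # per-user ingress satellite
        s = score(assignment, relay)
        if s < best_score:
            best_score, best_plan = s, (assignment, relay)

print(best_plan, best_score)
```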
The study includes extensive experiments on a real-world Starlink topology, comparing SpaceMeta with state-of-the-art methods. The results show a 6.72% reduction in latency and a 39.50% reduction in the interquartile range (IQR) of user latency, demonstrating its potential to improve the performance of multi-user virtual interaction applications over LEO constellations.
Relevance to Multimedia Information Systems and Virtual Realities
The concepts discussed in this article align with the broader field of multimedia information systems, where real-time communication, low latency, and high synchronization play a crucial role. Multi-user virtual interaction applications rely heavily on multimedia content, including audio, video, and animations, to create immersive virtual environments, and reliable delivery and synchronization of this content is essential for a smooth user experience.
LEO satellite constellations provide an intriguing solution for overcoming the limitations of traditional ground-based communication infrastructure. By integrating these satellites into the server selection process, SpaceMeta introduces a multi-disciplinary approach combining concepts from satellite communication, network optimization, and multimedia information systems.
The technologies behind virtual reality (VR), augmented reality (AR), and artificial reality can benefit greatly from the advancements discussed in this article. These immersive technologies rely on real-time interactions among users, and any delay or latency disrupts the user experience. By reducing latency and latency discrepancies through effective server selection, SpaceMeta can enhance the performance and reliability of these immersive applications.
Conclusion
The research presented in this article highlights the potential of LEO satellite constellations in addressing the challenges of multi-user virtual interaction applications. Through the development of the SpaceMeta framework, the authors provide a solution that optimizes server selection to minimize latency and improve synchronization among users. This has significant implications for the field of multimedia information systems, as well as virtual realities, augmented reality, and artificial reality technologies.
Read the original article