Welcome to your weekly AI Newsletter! Read and listen on AITechCircle:
This newsletter has become an essential resource for me and many others in the AI community, delivering practical, actionable insights you can apply immediately in your work or business.
Before diving into this week’s updates, do me a quick favor and share these insights with a friend or colleague who could benefit from them!
Today at a Glance:
Understanding the Limitations of Mathematical Reasoning in LLMs
Generative AI Use Cases in the Health Care Industry
AI Weekly news and updates covering newly released LLMs
Courses and events to attend
Can Large Language Models (LLMs) truly reason?
This week, I reviewed the groundbreaking research in the paper GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models from Apple. The authors critically examine how well current large language models (LLMs) tackle mathematical reasoning tasks, exposing significant weaknesses in their logical problem-solving capabilities.
The research paper evaluates several state-of-the-art large language models (LLMs), both open and closed, across various experiments.
Some of the models mentioned in the research include: GPT-4o-mini and GPT-4o, Llama3-8b-instruct, Phi-3-medium-128k-instruct, Phi-3.5-mini-instruct, Gemma2-9b-it, Mistral-7b, o1-mini and o1-preview.
These models were tested on the newly developed GSM-Symbolic and GSM-NoOp benchmarks to explore their mathematical reasoning capabilities.
Key Takeaways:
Fragility in Reasoning: The study finds that even slight alterations in mathematical questions—such as changing numerical values—cause LLM performance to drop significantly. This shows that models often rely on pattern recognition rather than logical reasoning.
GSM-Symbolic Benchmark: To better assess LLMs’ reasoning skills, the researchers developed GSM-Symbolic, a new benchmark that tests models on variations of math problems. These variations help reveal the fragility of LLMs, especially as question complexity increases.
Performance Decline with Clauses: The models showed a consistent drop in performance when additional clauses were added to questions, even if these clauses were irrelevant to solving the problem. This highlights the limitations of LLMs in handling more complex problem structures.
GSM-NoOp Dataset: The paper introduces the GSM-NoOp dataset, which adds irrelevant information to mathematical problems. Most models failed to ignore these distractions, illustrating their struggles with genuine logical reasoning.
Call for Better Evaluation: The paper emphasizes that current evaluation methods for LLMs need improvement, especially for reasoning-based tasks. It suggests moving beyond simple accuracy metrics and focusing on more comprehensive assessments, such as the GSM-Symbolic approach.
This research reminds us how much work remains in developing LLMs that can perform robust, logical reasoning, especially in tasks that go beyond mere pattern matching.
By understanding these limitations, the AI community can push towards developing more reliable models capable of genuine reasoning, a crucial step for advancing AI’s problem-solving potential in real-world scenarios.
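To make the benchmark idea concrete, here is a minimal sketch of how templated variants of a grade-school math question could be generated and scored. It is an illustration only, not the paper's code: the template, the names, the distractor clause, and the toy "model" are all assumptions made up for this example. The toy model simply adds every number it sees, which mimics pattern matching without understanding.

```python
import random
import re

# Hypothetical template in the spirit of GSM-Symbolic: the name and the
# numbers are placeholders, and the answer is computed from sampled values.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{noop}How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Aisha"]  # illustrative values only
# Distractor in the spirit of GSM-NoOp: it has no effect on the answer.
NOOP_CLAUSE = "5 of the apples are slightly smaller than average. "


def make_variant(rng, add_noop=False):
    """Return (question, answer) for one sampled variant of the template."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    question = TEMPLATE.format(
        name=name, a=a, b=b, noop=NOOP_CLAUSE if add_noop else ""
    )
    return question, a + b


def naive_model(question):
    """Toy stand-in for a model: adds up every integer in the text."""
    return sum(int(n) for n in re.findall(r"\d+", question))


def accuracy(model_fn, variants):
    """Fraction of variants where model_fn(question) equals the true answer."""
    correct = sum(1 for q, ans in variants if model_fn(q) == ans)
    return correct / len(variants)


rng = random.Random(0)
plain = [make_variant(rng) for _ in range(5)]
noop = [make_variant(rng, add_noop=True) for _ in range(5)]

print("plain accuracy:", accuracy(naive_model, plain))
print("with distractor:", accuracy(naive_model, noop))
```

The toy model answers every plain variant correctly but fails every variant with the irrelevant clause, because it cannot tell which numbers matter. Swapping naive_model for a call to a real LLM would turn this into a small robustness check of your own.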
Weekly News & Updates...
Last week's AI breakthroughs marked another leap forward in the tech revolution.
Liquid Foundation Models (LFMs): 1B, 3B, and 40B models. LFM-3B surpasses older 7B and 13B models on multiple performance benchmarks. LFM-40B delivers performance on par with larger models while using only 12B activated parameters. link
NVLM 1.0 from Nvidia, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
Aria: the first open-source, multimodal native MoE, with best-in-class performance across multimodal, language, and coding tasks. link
MLE-bench, a new benchmark to measure how well AI agents perform in machine learning engineering. The benchmark consists of 75 machine learning engineering-related competitions sourced from Kaggle. link
Pyramid Flow SD3: an open-source text-to-video model under the MIT license. It is a 2B Diffusion Transformer (DiT) that can generate 10-second videos at 768p and 24 fps. link
The Cloud: the backbone of the AI revolution
What’s the ROI? Getting the Most Out of LLM Inference is a good read from Nvidia. link
LLM inferencing with Arm-based OCI Ampere A1 Compute in OCI Data Science AI Quick Actions. link
Gen AI Use Case of the Week:
Generative AI use cases in the health care industry: several use cases for healthcare providers aiming to increase operational efficiency, reduce administrative burden, and improve patient satisfaction. The impact is significant across revenue, user experience, and operations, as these use cases address key pain points in healthcare.
OpenAI's Agentic AI cookbook covers Orchestrating Agents: Routines and Handoffs and Swarm, an educational framework exploring ergonomic, lightweight multi-agent orchestration.
Potential of AI
The most common question about 3Blue1Brown is how he animates his videos. He has made a video that gives a peek behind the scenes; you can look here. He has made awe-inspiring videos, starting with What is a Neural Network and moving on to many machine learning topics.
Things to Know...
The U.S. Federal Trade Commission (FTC) has announced a crackdown on deceptive AI claims and schemes. With Operation AI Comply, the agency announced five law enforcement actions against operations that use AI hype or sell AI technology that can be used in deceptive and unfair ways. Link to read in-depth.
The Opportunity...
Podcast:
This week's Open Tech Talks episode 146 is “Mastering Communication in the AI Era with Expert Tips from TJ Walker.” TJ Walker has a commanding digital presence, with over 2 million students on Udemy across more than 200 courses. He is the author of six books, including the USA Today #1 Bestseller “Secret to Foolproof Presentations” and “Media Training A to Z.”
LongWriter: An open-source project built to generate outputs exceeding 10,000 words using long-context LLMs, with models fine-tuned for extended text generation and evaluated through custom benchmarks to ensure quality and length.
Dify is an open-source LLM app development platform. Its intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, letting you quickly go from prototype to production.
Data Sets...
CROP PHENOLOGY: The dataset contains the ground-based observations of crop growth stages for Canada's prairie provinces (Manitoba, Saskatchewan, and Alberta) from 2019 to 2020.
WINTER WHEAT SEGMENTATION USING AI: In this research, a newly modified UNet (Fast-UNet) was implemented to segment winter wheat from time series Sentinel-2 images for 2021 and 2023. These images were converted to NDVI and utilized to identify wheat fields by tracking the wheat phenology from sowing to harvesting.
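For readers new to NDVI, here is a small sketch of the standard formula, NDVI = (NIR - Red) / (NIR + Red). With Sentinel-2 imagery, NIR is commonly taken from band 8 and red from band 4. The reflectance values below are made up for illustration, and this is not the dataset authors' pipeline; it only shows how one pixel's NDVI time series can hint at the crop's growth peak.

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); returns 0.0 when both bands are 0."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def ndvi_series(nir_series, red_series):
    """Compute NDVI for each acquisition date of a single pixel."""
    return [ndvi(n, r) for n, r in zip(nir_series, red_series)]


# Made-up reflectance values for one pixel across a growing season
# (sowing, green-up, peak growth, ripening).
nir = [0.20, 0.35, 0.50, 0.30]
red = [0.15, 0.10, 0.05, 0.20]

series = ndvi_series(nir, red)
# Index of the acquisition with the highest NDVI (densest green vegetation).
peak_index = max(range(len(series)), key=series.__getitem__)

print([round(v, 3) for v in series])
print("peak at acquisition", peak_index)
```

NDVI rises as the crop greens up and falls as it ripens, which is why tracking it over time helps separate wheat fields from other land cover.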
Other Technology News
Want to stay updated on the latest information in the field of Information Technology? Here's what you should know:
AMD launches AI chip to rival Nvidia’s Blackwell, as reported by CNBC
Musk unveils Robotaxi, unsupervised full self-driving future: 'That's what we want', story covered by FoxBusiness
Thank you, as always, for taking the time to read.
I’d love to hear your thoughts. Hit reply and let me know what you find most valuable this week! Your feedback means a lot.
Until next week,
Kashif Manzoor
The opinions expressed here are solely my conjecture based on experience, practice, and observation. They do not represent the thoughts, intentions, plans, or strategies of my current or previous employers or their clients/customers. The objective of this newsletter is to share and learn with the community.