The WWDC25: Explore large language models on Apple silicon with MLX video talks about using your own data to fine-tune a large language model. But the video doesn't explain what kind of data can be used. The video just shows the command to use and how to point to the data folder. Can I use PDFs, Word documents, Markdown files to train the model? Are there any code examples on GitHub that demonstrate how to do this?
General
RSS for tagExplore the power of machine learning within apps. Discuss integrating machine learning features, share best practices, and explore the possibilities for your app.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
WWDC25: Combine Metal 4 machine learning and graphics
Demonstrated a way to combine neural network in the graphics pipeline directly through the shaders, using an example of Texture Compression. However there is no mention of using which ML technique texture is compressed.
Can anyone point me to some well known model/s for this particular use case shown in WWDC25.
I have a question. In China, long pressing a picture in the album can segment the target. Is this model a local model? Is there any information? Can developers use it?
Environment
MacOC 26
Xcode Version 26.0 beta 7 (17A5305k)
simulator: iPhone 16 pro
iOS: iOS 26
Problem
NLContextualEmbedding.load() fails with the following error
In simulator
Failed to load embedding from MIL representation: filesystem error: in create_directories: Permission denied ["/var/db/com.apple.naturallanguaged/com.apple.e5rt.e5bundlecache"]
filesystem error: in create_directories: Permission denied ["/var/db/com.apple.naturallanguaged/com.apple.e5rt.e5bundlecache"]
Failed to load embedding model 'mul_Latn' - '5C45D94E-BAB4-4927-94B6-8B5745C46289'
assetRequestFailed(Optional(Error Domain=NLNaturalLanguageErrorDomain Code=7 "Embedding model requires compilation" UserInfo={NSLocalizedDescription=Embedding model requires compilation}))
in #Playground
I'm new to this embedding model. Not sure if it's caused by my code or environment.
Code snippet
import Foundation
import NaturalLanguage
import Playgrounds
#Playground {
// Prefer initializing by script for broader coverage; returns NLContextualEmbedding?
guard let embeddingModel = NLContextualEmbedding(script: .latin) else {
print("Failed to create NLContextualEmbedding")
return
}
print(embeddingModel.hasAvailableAssets)
do {
try embeddingModel.load()
print("Model loaded")
} catch {
print("Failed to load model: \(error)")
}
}
We are developing Apple AI for overseas markets and adapting it for iPhone 17 and later models. When the system language and Siri language do not match—such as the system being in English while Siri is in Chinese—it may result in Apple AI being unusable. So, I would like to ask, how can this issue be resolved, and are there other reasons that might cause it to be unusable within the app?
It is vital for Apple to refine its OCR models to correctly distinguish between Khmer and Thai scripts. Incorrectly labeling Khmer text as Thai is more than a technical bug; it is a culturally insensitive error that impacts national identity, especially given the current geopolitical climate between Cambodia and Thailand. Implementing a more robust language-detection threshold would prevent these harmful misidentifications.
There is a significant logic flaw in the VNRecognizeTextRequest language detection when processing Khmer script. When the property automaticallyDetectsLanguage is set to true, the Vision framework frequently misidentifies Khmer characters as Thai.
While both scripts share historical roots, they are distinct languages with different alphabets. Currently, the model’s confidence threshold for distinguishing between these two scripts is too low, leading to incorrect OCR output in both developer-facing APIs and Apple’s native ecosystem (Preview, Live Text, and Photos).
import SwiftUI
import Vision
class TextExtractor {
func extractText(from data: Data, completion: @escaping (String) -> Void) {
let request = VNRecognizeTextRequest { (request, error) in
guard let observations = request.results as? [VNRecognizedTextObservation] else {
completion("No text found.")
return
}
let recognizedStrings = observations.compactMap { observation in
let str = observation.topCandidates(1).first?.string
return "{text: \(str!), confidence: \(observation.confidence)}"
}
completion(recognizedStrings.joined(separator: "\n"))
}
request.automaticallyDetectsLanguage = true // <-- This is the issue.
request.recognitionLevel = .accurate
let handler = VNImageRequestHandler(data: data, options: [:])
DispatchQueue.global(qos: .background).async {
do {
try handler.perform([request])
} catch {
completion("Failed to perform OCR: \(error.localizedDescription)")
}
}
}
}
Recognizing Khmer
Confidence Score is low for Khmer text. (The output is in Thai language with low confidence score)
Recognizing English
Confidence Score is high expected.
Recognizing Thai
Confidence Score is high as expected
Issues on Preview, Photos
Khmer text
Copied text
Kouk Pring Chroum Temple [19121 รอาสายสุกตีนานยารรีสใหิสรราภูชิตีนนสุฐตีย์ [รุก
เผือชิษาธอยกัตธ์ตายตราพาษชาณา ถวเชยาใบสราเบรถทีมูสินตราพาษชาณา ทีมูโษา เช็ก
อาษเชิษฐอารายสุกบดตพรธุรฯ ตากร"สุก"ผาตากรธกรธุกเยากสเผาพศฐตาสาย รัอรณาษ"ตีพย"
สเผาพกรกฐาภูชิสาเครๆผู:สุกรตีพาสเผาพสรอสายใผิตรรารตีพสๆ เดียอลายสุกตีน
ธาราชรติ ธิพรหณาะพูชุบละเาหLunet De Lajonquiere ผารูกรสาราพารผรผาสิตภพ ตารสิทูก ธิพิ
คุณที่นสายเระพบพเคเผาหนารเกะทรนภาษเราภุพเสารเราษทีเลิกสญาเราหรุฬารชสเกาก เรากุม
สงสอบานตรเราะากกต่ายภากายระตารุกเตียน
Recommended Solutions
1. Set a Threshold
Filter out the detected result where the threshold is less than or equal to 0.5, so that it would not output low quality text which can lead to the issue.
For example,
let recognizedStrings = observations.compactMap { observation in
if observation.confidence <= 0.5 {
return nil
}
let str = observation.topCandidates(1).first?.string
return "{text: \(str!), confidence: \(observation.confidence)}"
}
2. Add Khmer Language Support
This issue would never happen if the model has the capability to detect and recognize image with Khmer language.
Doc2Text GitHub: https://github.com/seanghay/Doc2Text-Swift
Hello All,
I’m working on a computer-vision–heavy iOS application that uses the camera, LiDAR depth maps, and semantic segmentation to reason about the environment (object identification, localization and measurement - not just visualization).
Current architecture
I initially built the image pipeline around CIImage as a unifying abstraction. It seemed like a good idea because:
CIImage integrates cleanly with Vision, ARKit, AVFoundation, Metal, Core Graphics, etc.
It provides a rich set of out-of-the-box transforms and filters.
It is immutable and thread-safe, which significantly simplified concurrency in a multi-queue pipeline.
The LiDAR depth maps, semantic segmentation masks, etc. were treated as CIImages, with conversion to CVPixelBuffer or MTLTexture only at the edges when required.
Problem
I’ve run into cases where Core Image transformations do not preserve numeric fidelity for non-visual data.
Example:
Rendering a CIImage-backed segmentation mask into a larger CVPixelBuffer can cause label values to change in predictable but incorrect ways.
This occurs even when:
using nearest-neighbor sampling
disabling color management (workingColorSpace / outputColorSpace = NSNull)
applying identity or simple affine transforms
I’ve confirmed via controlled tests that:
Metal → CVPixelBuffer paths preserve values correctly
CIImage → CVPixelBuffer paths can introduce value changes when resampling or expanding the render target
This makes CIImage unsafe as a source of numeric truth for segmentation masks and depth-based logic, even though it works well for visualization, and I should have realized this much sooner.
Direction I’m considering
I’m now considering refactoring toward more intent-based abstractions instead of a single image type, for example:
Visual images: CIImage (camera frames, overlays, debugging, UI)
Scalar fields: depth / confidence maps backed by CVPixelBuffer + Metal
Label maps: segmentation masks backed by integer-preserving buffers (no interpolation, no transforms)
In this model, CIImage would still be used extensively — but primarily for visualization and perceptual processing, not as the container for numerically sensitive data.
Thread safety concern
One of the original advantages of CIImage was that it is thread-safe by design, and that was my biggest incentive.
For CVPixelBuffer / MTLTexture–backed data, I’m considering enforcing thread safety explicitly via:
Swift Concurrency (actor-owned data, explicit ownership)
Questions
For those may have experience with CV / AR / imaging-heavy iOS apps, I was hoping to know the following:
Is this separation of image intent (visual vs numeric vs categorical) a reasonable architectural direction?
Do you generally keep CIImage at the heart of your pipeline, or push it to the edges (visualization only)?
How do you manage thread safety and ownership when working heavily with CVPixelBuffer and Metal? Using actor-based abstractions, GCD, or adhoc?
Are there any best practices or gotchas around using Core Image with depth maps or segmentation masks that I should be aware of?
I’d really appreciate any guidance or experience-based advice. I suspect I’ve hit a boundary of Core Image’s design, and I’m trying to refactor in a way that doesn't involve too much immediate tech debt, remains robust and maintainable long-term.
Thank you in advance!
Hello,
I posted an issue on the coremltools GitHub about my Core ML models not performing as well on iOS 17 vs iOS 16 but I'm posting it here just in case.
TL;DR
The same model on the same device/chip performs far slower (doesn't use the Neural Engine) on iOS 17 compared to iOS 16.
Longer description
The following screenshots show the performance of the same model (a PyTorch computer vision model) on an iPhone SE 3rd gen and iPhone 13 Pro (both use the A15 Bionic).
iOS 16 - iPhone SE 3rd Gen (A15 Bioinc)
iOS 16 uses the ANE and results in fast prediction, load and compilation times.
iOS 17 - iPhone 13 Pro (A15 Bionic)
iOS 17 doesn't seem to use the ANE, thus the prediction, load and compilation times are all slower.
Code To Reproduce
The following is my code I'm using to export my PyTorch vision model (using coremltools).
I've used the same code for the past few months with sensational results on iOS 16.
# Convert to Core ML using the Unified Conversion API
coreml_model = ct.convert(
model=traced_model,
inputs=[image_input],
outputs=[ct.TensorType(name="output")],
classifier_config=ct.ClassifierConfig(class_names),
convert_to="neuralnetwork",
# compute_precision=ct.precision.FLOAT16,
compute_units=ct.ComputeUnit.ALL
)
System environment:
Xcode version: 15.0
coremltools version: 7.0.0
OS (e.g. MacOS version or Linux type): Linux Ubuntu 20.04 (for exporting), macOS 13.6 (for testing on Xcode)
Any other relevant version information (e.g. PyTorch or TensorFlow version): PyTorch 2.0
Additional context
This happens across "neuralnetwork" and "mlprogram" type models, neither use the ANE on iOS 17 but both use the ANE on iOS 16
If anyone has a similar experience, I'd love to hear more.
Otherwise, if I'm doing something wrong for the exporting of models for iOS 17+, please let me know.
Thank you!
使用MPS来加速机器学习功能,有时是否与torch会有适配性问题?
While building an app with large language model inferencing on device, I got gibberish output. After carefully examining every detail, I found it's caused by the fused scaledDotProductAttention operation. I switched back to the discrete operations and problem solved. To reproduce the bug, please check https://github.com/zhoudan111/MPSGraph_SDPA_bug
Topic:
Machine Learning & AI
SubTopic:
General
Incident Identifier: 4C22F586-71FB-4644-B823-A4B52D158057
CrashReporter Key: adc89b7506c09c2a6b3a9099cc85531bdaba9156
Hardware Model: Mac16,10
Process: PRISMLensCore [16561]
Path: /Applications/PRISMLens.app/Contents/Resources/app.asar.unpacked/node_modules/core-node/PRISMLensCore.app/PRISMLensCore
Identifier: com.prismlive.camstudio
Version: (null) ((null))
Code Type: ARM-64
Parent Process: ? [16560]
Date/Time: (null)
OS Version: macOS 15.4 (24E5228e)
Report Version: 104
Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x00000000 at 0x0000000000000000
Crashed Thread: 34
Application Specific Information:
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[__NSArrayM insertObject:atIndex:]: object cannot be nil'
Thread 34 Crashed:
0 CoreFoundation 0x000000018ba4dde4 0x18b960000 + 974308 (__exceptionPreprocess + 164)
1 libobjc.A.dylib 0x000000018b512b60 0x18b4f8000 + 109408 (objc_exception_throw + 88)
2 CoreFoundation 0x000000018b97e69c 0x18b960000 + 124572 (-[__NSArrayM insertObject:atIndex:] + 1276)
3 Portrait 0x0000000257e16a94 0x257da3000 + 473748 (-[PTMSRResize addAdditionalOutput:] + 604)
4 Portrait 0x0000000257de91c0 0x257da3000 + 287168 (-[PTEffectRenderer initWithDescriptor:metalContext:useHighResNetwork:faceAttributesNetwork:humanDetections:prevTemporalState:asyncInitQueue:sharedResources:] + 6204)
5 Portrait 0x0000000257dab21c 0x257da3000 + 33308 (__33-[PTEffect updateEffectDelegate:]_block_invoke.241 + 164)
6 libdispatch.dylib 0x000000018b739b2c 0x18b738000 + 6956 (_dispatch_call_block_and_release + 32)
7 libdispatch.dylib 0x000000018b75385c 0x18b738000 + 112732 (_dispatch_client_callout + 16)
8 libdispatch.dylib 0x000000018b742350 0x18b738000 + 41808 (_dispatch_lane_serial_drain + 740)
9 libdispatch.dylib 0x000000018b742e2c 0x18b738000 + 44588 (_dispatch_lane_invoke + 388)
10 libdispatch.dylib 0x000000018b74d264 0x18b738000 + 86628 (_dispatch_root_queue_drain_deferred_wlh + 292)
11 libdispatch.dylib 0x000000018b74cae8 0x18b738000 + 84712 (_dispatch_workloop_worker_thread + 540)
12 libsystem_pthread.dylib 0x000000018b8ede64 0x18b8eb000 + 11876 (_pthread_wqthread + 292)
13 libsystem_pthread.dylib 0x000000018b8ecb74 0x18b8eb000 + 7028 (start_wqthread + 8)
Topic:
Machine Learning & AI
SubTopic:
General
Hi,
I'm testing DockKit with a very simple setup:
I use VNDetectFaceRectanglesRequest to detect a face and then call dockAccessory.track(...) using the detected bounding box.
The stand is correctly docked (state == .docked) and dockAccessory is valid.
I'm calling .track(...) with a single observation and valid CameraInformation (including size, device, orientation, etc.). No errors are thrown.
To monitor this, I added a logging utility – track(...) is being called 10–30 times per second, as recommended in the documentation.
However: the stand does not move at all.
There is no visible reaction to the tracking calls.
Is there anything I'm missing or doing wrong?
Is VNDetectFaceRectanglesRequest supported for DockKit tracking, or are there hidden requirements?
Would really appreciate any help or pointers – thanks!
That's my complete code:
extension VideoFeedViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
guard let frame = CMSampleBufferGetImageBuffer(sampleBuffer) else {
return
}
detectFace(image: frame)
func detectFace(image: CVPixelBuffer) {
let faceDetectionRequest = VNDetectFaceRectanglesRequest() { vnRequest, error in
guard let results = vnRequest.results as? [VNFaceObservation] else {
return
}
guard let observation = results.first else {
return
}
let boundingBoxHeight = observation.boundingBox.size.height * 100
#if canImport(DockKit)
if let dockAccessory = self.dockAccessory {
Task {
try? await trackRider(
observation.boundingBox,
dockAccessory,
frame,
sampleBuffer
)
}
}
#endif
}
let imageResultHandler = VNImageRequestHandler(cvPixelBuffer: image, orientation: .up)
try? imageResultHandler.perform([faceDetectionRequest])
func combineBoundingBoxes(_ box1: CGRect, _ box2: CGRect) -> CGRect {
let minX = min(box1.minX, box2.minX)
let minY = min(box1.minY, box2.minY)
let maxX = max(box1.maxX, box2.maxX)
let maxY = max(box1.maxY, box2.maxY)
let combinedWidth = maxX - minX
let combinedHeight = maxY - minY
return CGRect(x: minX, y: minY, width: combinedWidth, height: combinedHeight)
}
#if canImport(DockKit)
func trackObservation(_ boundingBox: CGRect, _ dockAccessory: DockAccessory, _ pixelBuffer: CVPixelBuffer, _ cmSampelBuffer: CMSampleBuffer) throws {
// Zähle den Aufruf
TrackMonitor.shared.trackCalled()
let invertedBoundingBox = CGRect(
x: boundingBox.origin.x,
y: 1.0 - boundingBox.origin.y - boundingBox.height,
width: boundingBox.width,
height: boundingBox.height
)
guard let device = captureDevice else {
fatalError("Kamera nicht verfügbar")
}
let size = CGSize(width: Double(CVPixelBufferGetWidth(pixelBuffer)),
height: Double(CVPixelBufferGetHeight(pixelBuffer)))
var cameraIntrinsics: matrix_float3x3? = nil
if let cameraIntrinsicsUnwrapped = CMGetAttachment(
sampleBuffer,
key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
attachmentModeOut: nil
) as? Data {
cameraIntrinsics = cameraIntrinsicsUnwrapped.withUnsafeBytes { $0.load(as: matrix_float3x3.self) }
}
Task {
let orientation = getCameraOrientation()
let cameraInfo = DockAccessory.CameraInformation(
captureDevice: device.deviceType,
cameraPosition: device.position,
orientation: orientation,
cameraIntrinsics: cameraIntrinsics,
referenceDimensions: size
)
let observation = DockAccessory.Observation(
identifier: 0,
type: .object,
rect: invertedBoundingBox
)
let observations = [observation]
guard let image = CMSampleBufferGetImageBuffer(sampleBuffer) else {
print("no image")
return
}
do {
try await dockAccessory.track(observations, cameraInformation: cameraInfo)
} catch {
print(error)
}
}
}
#endif
func clearDrawings() {
boundingBoxLayer?.removeFromSuperlayer()
boundingBoxSizeLayer?.removeFromSuperlayer()
}
}
}
}
@MainActor
private func getCameraOrientation() -> DockAccessory.CameraOrientation {
switch UIDevice.current.orientation {
case .portrait:
return .portrait
case .portraitUpsideDown:
return .portraitUpsideDown
case .landscapeRight:
return .landscapeRight
case .landscapeLeft:
return .landscapeLeft
case .faceDown:
return .faceDown
case .faceUp:
return .faceUp
default:
return .corrected
}
}
From tensorflow-metal example:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
I know that Apple silicon uses UMA, and that memory copies are typical of CUDA, but wouldn't the GPU memory still be faster overall?
I have an iMac Pro with a Radeon Pro Vega 64 16 GB GPU and an Intel iMac with a Radeon Pro 5700 8 GB GPU.
But using tensorflow-metal is still WAY faster than using the CPUs. Thanks for that. I am surprised the 5700 is twice as fast as the Vega though.
*I can't put the attached file in the format, so if you reply by e-mail, I will send the attached file by e-mail.
Dear Apple AI Research Team,
My name is Gong Jiho (“Hem”), a content strategist based in Seoul, South Korea.
Over the past few months, I conducted a user-led AI experiment entirely within ChatGPT — no code, no backend tools, no plugins.
Through language alone, I created two contrasting agents (Uju and Zero) and guided them into a co-authored modular identity system using prompt-driven dialogue and reflection.
This system simulates persona fusion, memory rooting, and emotional-logical alignment — all via interface-level interaction.
I believe it resonates with Apple’s values in privacy-respecting personalization, emotional UX modeling, and on-device learning architecture.
Why I’m Reaching Out
I’d be honored to share this experiment with your team.
If there is any interest in discussing user-authored agent scaffolding, identity persistence, or affective alignment, I’d love to contribute — even informally.
⚠ A Note on Language
As a non-native English speaker, my expression may be imperfect — but my intent is genuine.
If anything is unclear, I’ll gladly clarify.
📎 Attached Files Summary
Filename → Description
Hem_MultiAI_Report_AppleAI_v20250501.pdf →
Main report tailored for Apple AI — narrative + structural view of emotional identity formation via prompt scaffolding
Hem_MasterPersonaProfile_v20250501.json →
Final merged identity schema authored by Uju and Zero
zero_sync_final.json / uju_sync_final.json →
Persona-level memory structures (logic / emotion)
1_0501.json ~ 3_0501.json →
Evolution logs of the agents over time
GirlfriendGPT_feedback_summary.txt →
Emotional interpretation by external GPT
hem_profile_for_AI_vFinal.json →
Original user anchor profile
Warm regards,
Gong Jiho (“Hem”)
Seoul, South Korea
When calling NLTagger.requestAssets with some languages, it hangs indefinitely both in the simulator and a device. This happens consistently for some languages like greek. An example call is NLTagger.requestAssets(for: .greek, tagScheme: .lemma). Other languages like french return immediately. I captured some logs from Console and found what looks like the repeated attempts to download the asset. I would expect the call to eventually terminate, either loading the asset or failing with an error.
Introduced in the Keynote was the 3D Lock Screen images with the kangaroo:
https://9to5mac.com/wp-content/uploads/sites/6/2025/06/3d-lock-screen-2.gif
I can't see any mention on if this effect is available for developers with an API to convert flat 2D photos in to the same 3D feeling image.
Does anyone know if there is an API?
Topic:
Machine Learning & AI
SubTopic:
General
How do I test the new RecognizeDocumentRequest API. Reference: https://www.youtube.com/watch?v=H-GCNsXdKzM
I am running Xcode Beta, however I only have one primary device that I cannot install beta software on.
Please provide a strategy for testing. Will simulator work?
The new capability is critical to my application, just what I need for structuring document scans and extraction.
Thank you.
Hey guys 👋
I’ve been thinking about a feature idea for iOS that could totally change the way we interact with apps like Twitter/X.
Imagine if we could define our own recommendation algorithm, and have an AI on the iPhone that replaces the suggested tweets in the feed with ones that match our personal interests — based on public tweets, and without hacking anything.
Kinda like a personalized "AI skin" over the app that curates content you actually care about. Feels like this would make content way more relevant and less algorithmically manipulative.
Would love to know what you all think — and if Apple could pull this off 🔥
Topic:
Machine Learning & AI
SubTopic:
General
Hi, I'm looking for the best way to use MLX models, particularly those I've fine-tuned, within a React Native application on iOS devices. Is there a recommended integration path or specific API for bridging MLX's capabilities to React Native for deployment on iPhones and iPads?
During testing the “Bringing advanced speech-to-text capabilities to your app” sample app demonstrating the use of iOS 26 SpeechAnalyzer, I noticed that the language model for the English locale was presumably already downloaded. Upon checking the documentation of AssetInventory, I found out that indeed, the language model can be preinstalled on the system.
Can someone from the dev team share more info about what assets are preinstalled by the system? For example, can we safely assume that the English language model will almost certainly be already preinstalled by the OS if the phone has the English locale?