Image Signal Processing (ISP) Pipeline and 3A Algorithms


I. The Purpose: Handling Imperfections and Matching Human Vision

The Image Signal Processor (ISP) is a crucial component in digital imaging systems, acting as the bridge between the raw data captured by an image sensor and the final, viewable image. Its primary purpose is to perform a series of complex processing steps to convert this raw, often imperfect, sensor data into a high-quality digital image suitable for display or storage. Key functions of the ISP include demosaicing (reconstructing a full-color image from the sensor’s color filter array), noise reduction, color correction, white balance adjustment, auto exposure control, and image enhancement tasks like sharpening and contrast adjustment. Essentially, the ISP takes the initial, unprocessed output from the sensor, which can suffer from issues like noise, incorrect colors, and improper brightness, and refines it into a visually appealing and accurate representation of the scene.

In essence, the ISP and 3A algorithms work in concert to process the raw sensor data in a way that compensates for the limitations and characteristics of the electronic sensor, aiming to produce a final image that more closely matches the rich, adaptive, and consistent visual experience of human eyesight.

A brief overview of the ISP’s function

The input to the ISP is typically a single-channel image in which each pixel records light through one color filter, arranged in a specific pattern known as a Color Filter Array (CFA), most commonly the Bayer pattern. The output is usually a standard color image format, such as sRGB.
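
To make the CFA layout concrete, here is a minimal NumPy sketch (the function name and array sizes are illustrative, not taken from any particular ISP) that marks which pixels of an RGGB Bayer mosaic sample red, green, and blue; note that green is sampled twice as densely as the other two channels.

```python
import numpy as np

def bayer_rggb_masks(height, width):
    """Boolean masks marking which pixels sample R, G, or B in an RGGB Bayer CFA."""
    r = np.zeros((height, width), dtype=bool)
    g = np.zeros((height, width), dtype=bool)
    b = np.zeros((height, width), dtype=bool)
    r[0::2, 0::2] = True          # R on even rows, even columns
    g[0::2, 1::2] = True          # G on even rows, odd columns
    g[1::2, 0::2] = True          # G on odd rows, even columns
    b[1::2, 1::2] = True          # B on odd rows, odd columns
    return r, g, b

# A raw frame is a single-channel mosaic: each pixel holds only one color sample.
raw = np.random.randint(0, 1024, size=(8, 8)).astype(np.float32)   # 10-bit-like values
r_mask, g_mask, b_mask = bayer_rggb_masks(*raw.shape)
print("R samples:", r_mask.sum(), "G samples:", g_mask.sum(), "B samples:", b_mask.sum())
# 16 R, 32 G, 16 B for an 8x8 tile: green is sampled twice as densely as red or blue.
```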

The conversion is not a single operation but a cascade of processing stages, collectively referred to as the ISP pipeline. Each stage in this pipeline performs a specific task to progressively enhance the image quality and correct for imperfections inherent in the raw sensor data and the image capture process. Key transformations within a typical ISP pipeline include, but are not limited to:

  • Preprocessing: Black level correction, defective pixel correction, lens shading correction.
  • CFA Processing: Demosaicing (or debayering) to reconstruct a full-color image from the CFA data.
  • Noise Reduction: Filtering to remove various types of noise.
  • Color and Tone Adjustments: White balance correction, color correction (e.g., using a color correction matrix), gamma correction, and tone mapping.
  • Image Enhancement / Photo-finishing: Sharpening, contrast adjustment, stylistic enhancement, etc.
  • Output Formatting: Color space conversion and image compression.

The above diagram is from Prof. Michael S. Brown’s MUST-READ tutorial on ISPs at ICCV 2019.

II. The Role and Importance of 3A Algorithms

The term 3A refers to three crucial sets of automated control algorithms: Auto Exposure (AE), Auto White Balance (AWB), and Auto Focus (AF). These algorithms are typically implemented within or in close coordination with the ISP and are designed to emulate the adaptive capabilities of the human visual system, automatically optimizing key image capture parameters.

The overarching goal of 3A algorithms is to ensure that the images captured by a camera are:

  • Clear and Sharp (AF): The subject of interest is accurately focused.
  • Well-Exposed (AE): The image brightness is appropriate, neither too dark (underexposed) nor too bright (overexposed), preserving detail in both shadows and highlights where possible.
  • Color-Accurate (AWB): Colors are rendered naturally, free from unrealistic casts caused by different lighting conditions, ensuring that white objects appear white.

The importance of 3A algorithms cannot be overstated. They are fundamental to the user experience in consumer digital cameras and smartphones, allowing even novice users to capture high-quality images across a wide spectrum of scenes and lighting conditions with minimal manual intervention. For professional photographers, while manual controls are often preferred, reliable 3A functions can provide excellent starting points or handle rapidly changing situations. Beyond consumer photography, in fields such as robotics, autonomous vehicles, medical imaging, and surveillance, robust and accurate 3A performance is critical for reliable visual perception, data acquisition, and subsequent decision-making processes. The Android camera Hardware Abstraction Layer (HAL), for instance, defines specific state machines and control mechanisms for 3A modes, underscoring their systemic importance in mobile imaging platforms.

III. Core Image Signal Processing Stages

Typical Processing Blocks

The Image Signal Processor (ISP) is a sophisticated pipeline of interconnected processing modules, realized either in dedicated hardware, software, or a combination thereof. Its fundamental role is to take the raw, often imperfect, data from an image sensor and transform it into a high-quality digital image suitable for human viewing or for interpretation by machine vision algorithms. This involves correcting various sensor and lens artifacts, reconstructing color information, reducing noise, and enhancing visual appeal.

A typical ISP pipeline, while varying in specific implementation details between manufacturers and applications, generally includes the following categories of processing blocks, often in a sequence similar to this (a simplified code sketch of the sequence follows the list):

  1. Preprocessing Stages: These operations prepare the raw data for subsequent, more complex processing.
    • Analog-to-Digital Conversion (ADC): While often considered part of the sensor unit, the ADC converts the analog voltage from each photosite into a digital value, marking the entry point into the digital domain.
    • Black Level Correction (or Subtraction): Image sensors exhibit a “dark current,” producing a small signal even in complete darkness. Black level correction subtracts this baseline offset from the pixel values to ensure true blacks.
    • Defective Pixel Correction (DPC): Manufacturing imperfections can result in sensor pixels that are stuck (always bright), dead (always dark), or noisy. DPC algorithms identify and interpolate values for these defective pixels using information from their neighbors.
    • Lens Shading Correction (LSC): Lenses, especially wide-angle ones, can cause non-uniform brightness (vignetting) and color shifts across the image field, typically with corners being darker or having a color cast. LSC applies a spatially varying gain to compensate for these effects.
  2. CFA-Related Processing (Primarily for Bayer sensors):
    • Demosaicing (Debayering): As each pixel in a CFA-based sensor captures only one color, demosaicing algorithms reconstruct a full-color (typically RGB) image by interpolating the missing two color values at each pixel location. This is a critical and often computationally intensive step, detailed further below.
  3. Noise Management:
    • Denoising and Filtering: Various types of noise (shot noise, read noise, thermal noise) are inherent in image capture, especially in low-light conditions or at high ISO settings. Denoising algorithms aim to reduce this noise while preserving image detail. This is also discussed in more detail later.
  4. Color and Tone Processing:
    • White Balance (AWB): Corrects for the color cast introduced by the scene’s illuminant, ensuring that white objects appear white and other colors are rendered naturally.
    • Color Correction Matrix (CCM): Transforms the camera’s native, sensor-specific RGB values into a standard color space (e.g., sRGB, CIE XYZ) for consistent and accurate color reproduction.
    • Gamma Correction: Applies a non-linear transformation to the pixel intensities to account for the non-linear response of display devices or to match human perceptual characteristics of brightness.
    • Tone Mapping / Dynamic Range Compression (DRC): If the scene has a high dynamic range (HDR) – a large difference between the darkest and brightest areas – tone mapping compresses this range to fit within the capabilities of a standard display, while attempting to preserve local contrast and detail.
  5. Image Enhancement:
    • Sharpening / Edge Enhancement: Increases the acutance (perceived sharpness) of edges in the image, often by amplifying high-frequency components.
    • Contrast Adjustment: Modifies the global or local contrast of the image to enhance visual impact.
  6. Output Formatting:
    • Color Space Conversion: May convert the image from an internal processing color space (e.g., linear RGB) to an output color space (e.g., sRGB for display, YUV for video compression).
    • Image Compression: Often, the final processed image is compressed (e.g., into JPEG or HEIC format) to reduce file size for storage or transmission.
    • Resizing/Scaling: Adjusts the image dimensions if required.
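
The following is a deliberately naive, hedged sketch of how several of these blocks might chain together in software, assuming a 10-bit RGGB Bayer input. Every function is a toy stand-in (nearest-neighbor demosaicing, fixed example white-balance gains and CCM, a plain power-law gamma), and several stages (defective pixel correction, lens shading, denoising, tone mapping) are omitted entirely; a real ISP uses far more sophisticated, sensor-specific implementations.

```python
import numpy as np

def black_level_correction(raw, black_level=64.0, white_level=1023.0):
    """Subtract the sensor's dark offset and normalize to [0, 1]."""
    return np.clip((raw - black_level) / (white_level - black_level), 0.0, 1.0)

def demosaic_nearest(raw):
    """Toy RGGB demosaic: copy each 2x2 quad's samples to all four of its pixels."""
    h, w = raw.shape
    rgb = np.zeros((h, w, 3), dtype=raw.dtype)
    r = raw[0::2, 0::2]
    g = 0.5 * (raw[0::2, 1::2] + raw[1::2, 0::2])
    b = raw[1::2, 1::2]
    for c, plane in enumerate((r, g, b)):
        rgb[:, :, c] = np.repeat(np.repeat(plane, 2, axis=0), 2, axis=1)[:h, :w]
    return rgb

def white_balance(rgb, gains=(2.0, 1.0, 1.6)):
    """Apply per-channel gains that neutralize the estimated illuminant (example values)."""
    return np.clip(rgb * np.asarray(gains), 0.0, 1.0)

def color_correction(rgb, ccm=None):
    """Map camera RGB toward a standard space with a 3x3 matrix (placeholder values)."""
    if ccm is None:
        ccm = np.array([[ 1.6, -0.4, -0.2],
                        [-0.3,  1.5, -0.2],
                        [-0.1, -0.6,  1.7]])   # rows sum to 1 so grays stay gray
    return np.clip(rgb @ ccm.T, 0.0, 1.0)

def gamma_encode(rgb, gamma=2.2):
    """Simple power-law gamma instead of the exact sRGB encoding curve."""
    return rgb ** (1.0 / gamma)

raw = np.random.randint(64, 1024, size=(64, 64)).astype(np.float32)   # fake Bayer frame
out = gamma_encode(color_correction(white_balance(demosaic_nearest(black_level_correction(raw)))))
print(out.shape, float(out.min()), float(out.max()))                  # (64, 64, 3), values in [0, 1]
```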

The following is an example illustration of the above stages, presented in a seminal paper on software ISPs, with a visualization of the intermediate results after each transformation.

The exact sequence and specific algorithms employed within these blocks can be proprietary and vary significantly. Some ISPs are highly configurable, allowing tuning for different sensors or application requirements. The stages within an ISP are not independent silos; they are highly interconnected. The output quality of one stage directly influences the input and potential effectiveness of all subsequent stages. For example, inadequate black level correction can skew all downstream calculations. Residual noise that isn’t effectively removed by the denoising stage can lead to more pronounced artifacts during demosaicing, as interpolation algorithms might misinterpret noise as detail. Similarly, if demosaicing introduces false colors or blurs edges, the accuracy of white balance and color correction will be compromised, and sharpening algorithms might amplify these unwanted artifacts. This deep interdependence means that optimizing an ISP requires a holistic view.

Demosaicing (Debayering)

Demosaicing, also commonly referred to as debayering or color reconstruction, is the digital image processing algorithm used to reconstruct a full-color image from the spatially undersampled color samples output by an image sensor overlaid with a CFA, such as the Bayer filter. Demosaicing is a fundamental and computationally critical stage in any ISP pipeline handling raw data from single-sensor CFA cameras. The quality of the demosaicing algorithm profoundly impacts the final image’s color fidelity, perceived resolution, sharpness, and the presence or absence of visual artifacts. A poor demosaicing algorithm can severely degrade image quality, regardless of how well other ISP stages perform. For Bayer sensors, this stage is not merely processing existing information but is effectively “creating” two-thirds of the color data at each pixel site. The accuracy of this “creation” is therefore paramount for all subsequent color-dependent operations in the ISP.
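
As a concrete baseline, bilinear demosaicing estimates each missing color sample as the average of its nearest neighbors of that color. The sketch below (NumPy/SciPy, assuming an RGGB layout; purely illustrative, since modern ISPs use edge-aware or learned interpolation) expresses this idea with three small convolution kernels.

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear_rggb(raw):
    """Baseline bilinear demosaicing for an RGGB Bayer mosaic (not edge-aware)."""
    h, w = raw.shape
    r_mask = np.zeros((h, w)); r_mask[0::2, 0::2] = 1
    b_mask = np.zeros((h, w)); b_mask[1::2, 1::2] = 1
    g_mask = 1 - r_mask - b_mask

    # Kernels that spread the known samples onto their missing neighbors.
    k_g  = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0   # green: 4 direct neighbors
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0   # red/blue: neighbors + diagonals

    g = convolve(raw * g_mask, k_g, mode='mirror')
    r = convolve(raw * r_mask, k_rb, mode='mirror')
    b = convolve(raw * b_mask, k_rb, mode='mirror')
    return np.stack([r, g, b], axis=-1)

raw = np.random.rand(64, 64).astype(np.float32)     # stand-in for a normalized Bayer frame
rgb = demosaic_bilinear_rggb(raw)
print(rgb.shape)                                     # (64, 64, 3)
```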

Denoising and Filtering

Image denoising is the process aimed at removing or attenuating noise – random fluctuations in pixel values that are not part of the true scene information – from a digital image. Noise can arise from various sources, including the inherent quantum nature of light (shot noise), thermal agitation in the sensor (thermal noise), electronics in the signal path (read noise), and the quantization process during analog-to-digital conversion. It is often more pronounced in images taken in low-light conditions, which necessitate higher sensor gain or longer exposure times. The fundamental challenge in image denoising lies in effectively distinguishing unwanted noise from meaningful image details such as edges, textures, and fine structures, as all can manifest as high-frequency components. Overly aggressive denoising can lead to a loss of these details, resulting in a blurred or overly smooth, artificial appearance. Conversely, insufficient denoising leaves distracting noise patterns. Common filters for denoising include the Gaussian filter, the bilateral filter, and the guided filter.
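
The detail-versus-smoothing trade-off is easy to demonstrate by comparing a plain Gaussian blur with an edge-preserving bilateral filter. Below is a naive, illustrative NumPy implementation for grayscale images (slow and unoptimized by design; production ISPs use heavily optimized spatial and temporal filters).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def bilateral_filter(img, radius=3, sigma_space=2.0, sigma_range=0.1):
    """Naive bilateral filter: weights combine spatial distance and intensity difference,
    so strong edges (large intensity gaps) receive little smoothing."""
    pad = np.pad(img, radius, mode='reflect')
    out = np.zeros_like(img)
    norm = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = pad[radius + dy: radius + dy + img.shape[0],
                          radius + dx: radius + dx + img.shape[1]]
            w_space = np.exp(-(dx * dx + dy * dy) / (2 * sigma_space ** 2))
            w_range = np.exp(-((shifted - img) ** 2) / (2 * sigma_range ** 2))
            w = w_space * w_range
            out += w * shifted
            norm += w
    return out / norm

# A synthetic noisy edge: Gaussian blur softens the step, the bilateral filter preserves it.
img = np.zeros((64, 64))
img[:, 32:] = 1.0
noisy = img + 0.05 * np.random.randn(*img.shape)
smoothed = gaussian_filter(noisy, 2.0)
print("gaussian edge contrast :", smoothed[32, 33] - smoothed[32, 30])
print("bilateral edge contrast:", bilateral_filter(noisy)[32, 33] - bilateral_filter(noisy)[32, 30])
```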

Color Correction and Management

Color correction is the process within the ISP that transforms the camera’s native (sensor-specific) RGB color values into a standard, device-independent color space, such as sRGB, Adobe RGB, or the CIE XYZ color space. This transformation is essential for ensuring that colors are rendered accurately and consistently when viewed on different display devices or reproduced in print. The primary goal of color correction is to achieve color fidelity, meaning the colors in the final processed image should appear as close as possible to the true colors of the original scene as perceived by a human observer under a standard illuminant. The raw RGB values generated by an image sensor are inherently device-dependent. This dependency arises from the unique Spectral Sensitivity Functions (SSFs) of the sensor’s color channels (R, G, B) and the characteristics of its Color Filter Array (CFA).

The most common method for color correction involves applying a 3x3 linear transformation matrix, known as the Color Correction Matrix (CCM), to the RGB pixel values from the sensor (after white balance). If \(RGB_{cam}\) represents the camera’s RGB vector for a pixel and \(RGB_{std}\) the corresponding vector in a standard color space, then \(RGB_{std} = M_{CCM} \cdot RGB_{cam}\). More complex, non-linear methods, such as polynomial regression or the use of 3D Look-Up Tables (LUTs), can also be employed to achieve higher color accuracy, especially if the camera’s response has significant non-linearities that cannot be adequately modeled by a simple 3x3 matrix. The core aim of color correction, particularly the application of a CCM, is to bridge the significant gap between how a specific image sensor “perceives” color—dictated by its unique SSFs—and how humans perceive color or how standard display devices are designed to render color. It is a fundamental step in transforming device-dependent sensor data into a universally interpretable and consistent color representation. This process is pivotal for achieving predictable and faithful color reproduction, which is essential across a vast range of applications, from casual photo sharing, where pleasing colors are expected, to professional photography, print, and video workflows, where color accuracy is paramount. The often-discussed “color preference” (style/taste) of a particular camera/smartphone brand is heavily influenced by the quality and characteristics of its color correction pipeline and the subsequent aesthetic choices made in color rendering and tone mapping, i.e., photo-finishing.
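
A minimal sketch of applying a CCM to white-balanced linear RGB is shown below; the matrix values are made-up placeholders (real CCMs are calibrated per sensor and per illuminant, typically against a color chart), and the rows are chosen to sum to 1 so that neutral pixels stay neutral.

```python
import numpy as np

def apply_ccm(rgb_linear, ccm):
    """Apply a 3x3 color correction matrix to white-balanced, linear camera RGB.
    rgb_linear: (..., 3) array in [0, 1]; rows of the CCM typically sum to 1
    so that gray pixels remain gray after the transform."""
    out = rgb_linear @ np.asarray(ccm).T
    return np.clip(out, 0.0, 1.0)

# Placeholder matrix for illustration only; a real CCM comes from per-sensor calibration.
ccm_example = [[ 1.70, -0.55, -0.15],
               [-0.25,  1.45, -0.20],
               [ 0.05, -0.60,  1.55]]

pixels = np.array([[0.5, 0.5, 0.5],     # neutral gray: unchanged because rows sum to 1
                   [0.6, 0.3, 0.2]])    # a muted color becomes more saturated ("de-crosstalked")
print(apply_ccm(pixels, ccm_example))
```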

IV. Auto White Balance (AWB)

Auto White Balance (AWB) is a critical 3A function within the ISP, responsible for ensuring that the colors in a digital image are rendered naturally, irrespective of the color temperature of the light source illuminating the scene. It aims to mimic the human visual system’s ability to perceive colors consistently under varying illumination.

Fundamental Concepts

Color Constancy refers to the remarkable ability of the human visual system (HVS) to perceive the intrinsic color of an object as relatively stable, even when the spectral characteristics of the illuminant (the light source) change significantly. For instance, a white sheet of paper is perceived as white whether viewed outdoors under bluish daylight, indoors under yellowish incandescent light, or under greenish fluorescent light, despite the fact that the actual spectrum of light reflected from the paper to the eye differs dramatically in these conditions. The primary objective of AWB algorithms in digital cameras is to computationally emulate this perceptual phenomenon. AWB strives to analyze the image data, estimate the properties of the scene illuminant, and then apply a correction to the image’s color data. This correction aims to remove the color cast introduced by the illuminant, rendering objects as if they were viewed under a neutral, or “white,” light source. The famous “grey strawberries” illusion is a striking demonstration of this effect.

Chromatic adaptation is the physiological and perceptual process by which the HVS adjusts its sensitivity to different colors of illumination, thereby contributing significantly to achieving color constancy. When exposed to a colored illuminant for a period, our visual system “adapts” by reducing its sensitivity to that color, making the illuminant appear less chromatic and helping object colors remain stable. AWB algorithms often incorporate mathematical models of chromatic adaptation to transform the image colors as captured under the estimated scene illuminant to how they would appear under a predefined reference illuminant (typically a standard daylight illuminant like CIE D65 or D50). This process involves first estimating the scene illuminant’s chromaticity and then applying a chromatic adaptation transform (CAT) to the image data. For example, an algorithm might adjust camera output based on the color of a detected face under an unknown illuminant to match what that face color would be under a standard illuminant, which is a form of chromatic adaptation. Chromatic adaptation is one of the key underlying mechanisms that enables the HVS to achieve color constancy.

Neutral Point / White Point refers to a color that, after appropriate white balancing, should appear achromatic – that is, devoid of any hue, such as pure white, a shade of gray, or black. A core objective of AWB algorithms is to identify the chromaticity of the scene illuminant and then transform the image’s color balance such that objects in the scene that are presumed to be neutral (e.g., a white wall, a gray card) are rendered as neutral in the final image. Essentially, the AWB algorithm attempts to map the color of the estimated illuminant itself to a “pure white” or a defined neutral gray in the output color space.

Illuminant is the source of light that illuminates the scene being photographed. Its defining characteristic is its Spectral Power Distribution (SPD), which describes the amount of energy the light source emits at each wavelength across the visible spectrum. The SPD determines the “color” of the light. The color of an object as captured by a camera (or perceived by the eye) is a result of the interaction between the object’s surface reflectance properties (which wavelengths it absorbs and reflects) and the SPD of the illuminant. If the illuminant changes (e.g., from sunlight to incandescent light), the spectral composition of the light reaching the camera sensor from the same object will also change, leading to different raw color signals. Common Illuminants with distinct SPDs and color temperatures (measured in Kelvin, K) include:

  • Daylight: Varies significantly from warm (low K, e.g., sunrise/sunset) to neutral (medium K, e.g., noon sunlight) to cool (high K, e.g., overcast sky, open shade). Standard daylight illuminants like D65 (average daylight, ~6500K) and D50 (horizon light, ~5000K) are often used as references.
  • Incandescent/Tungsten Light: Emits warm, yellowish light (typically around 2700-3200K).
  • Fluorescent Light: Can have a wide range of SPDs and color temperatures, often with strong spectral peaks, making them challenging for AWB.
  • LED Light: Also highly variable SPDs depending on the LED technology, and can be designed to mimic other light sources or produce specific colors.
  • Flash: Typically designed to be close to daylight color temperature (~5500-6000K).

Assumptions and Algorithms

von Kries hypothesis, proposed by Johannes von Kries in 1902, is a foundational and widely referenced model for explaining and computing chromatic adaptation. It is based on the hypothesis that the human visual system’s adaptation to changes in illumination color occurs through independent sensitivity adjustments (gain controls) in the three types of cone photoreceptors in the retina: Long-wavelength sensitive (L), Medium-wavelength sensitive (M), and Short-wavelength sensitive (S) cones. The model posits that to achieve color constancy (i.e., for an object to appear the same color under different illuminants), the responses of the L, M, and S cones are independently scaled. These scaling factors (or gains) are determined by the relationship between the cone responses to a reference white surface under the current (source) illuminant and a target (reference) illuminant. The idea is that the adapted responses to a white object should be the same regardless of the illuminant. The von Kries transformation typically involves the following steps:

  1. Transformation to Cone Space: The camera’s RGB values (which are device-dependent) are first transformed into an LMS-like cone response space using a 3x3 linear matrix, \(M_{RGB \to LMS}\). This matrix aims to approximate the spectral sensitivities of the human cones: \([L, M, S]^T = M_{RGB \to LMS} \cdot [R, G, B]^T\).
  2. Gain Calculation: The cone responses to a reference white surface under the source illuminant (\(L_{src}, M_{src}, S_{src}\)) and under the target illuminant (\(L_{tgt}, M_{tgt}, S_{tgt}\)) are determined. The von Kries adaptation coefficients (gains) are then computed as the ratios \(k_L = \frac{L_{tgt}}{L_{src}},\ k_M = \frac{M_{tgt}}{M_{src}},\ k_S = \frac{S_{tgt}}{S_{src}}\), so that the white point observed under the source illuminant maps exactly onto the white point under the target illuminant.
  3. Application of Gains: These gains are applied independently to the LMS values of each pixel in the image: \(L_{adapted} = k_L \cdot L_{image},\ M_{adapted} = k_M \cdot M_{image},\ S_{adapted} = k_S \cdot S_{image}\).
  4. Transformation to Output Space: The adapted LMS values are then transformed back to an output RGB color space using another 3x3 matrix, \(M_{LMS \to RGB}\).

The von Kries model provides a simple yet powerful conceptual framework for chromatic adaptation. It forms the theoretical basis for many modern Chromatic Adaptation Transforms (CATs) used in color science, color management, and AWB algorithms, such as CAT02 and CAT16. These often incorporate refined transformation matrices to LMS-like spaces, non-linearities, or different ways of calculating the adaptation gains to better predict experimental data on corresponding colors (how colors appear to match under different illuminants). The term “wrong von Kries” is sometimes used to describe applying the same diagonal scaling directly in an RGB color space that is not well aligned with the LMS cone fundamentals; in practice this can still be quite effective despite being theoretically incorrect.
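
The four steps above can be collapsed into a few lines of NumPy. The sketch below is generic and hedged: passing the identity matrix for the RGB-to-LMS transform yields exactly the “wrong von Kries” variant, i.e., plain per-channel white-balance gains, while a calibrated cone-space matrix (not quoted here to avoid unverified numbers) would give a proper von Kries CAT. The illuminant white values are made-up examples.

```python
import numpy as np

def von_kries_adapt(rgb, white_src, white_tgt, m_rgb2lms=None):
    """Von Kries style chromatic adaptation.
    rgb:       (..., 3) linear camera RGB
    white_src: RGB of a reference white under the scene (source) illuminant
    white_tgt: RGB of the same white under the target/reference illuminant
    m_rgb2lms: 3x3 matrix to an LMS-like space; identity gives the
               'wrong von Kries' variant, i.e. plain per-channel WB gains."""
    m = np.eye(3) if m_rgb2lms is None else np.asarray(m_rgb2lms)
    m_inv = np.linalg.inv(m)
    lms_src = m @ np.asarray(white_src, dtype=float)
    lms_tgt = m @ np.asarray(white_tgt, dtype=float)
    gains = lms_tgt / lms_src                          # diagonal adaptation coefficients
    lms = rgb @ m.T                                    # step 1: to cone-like space
    lms_adapted = lms * gains                          # steps 2-3: scale each channel
    return np.clip(lms_adapted @ m_inv.T, 0.0, 1.0)    # step 4: back to RGB

# 'Wrong von Kries' example: neutralize a warm (tungsten-like) cast directly in RGB.
white_under_tungsten = [0.90, 0.60, 0.35]   # made-up white patch measurement
white_under_d65      = [0.80, 0.80, 0.80]   # the same patch under the target illuminant
pixel = np.array([[0.45, 0.30, 0.18]])
print(von_kries_adapt(pixel, white_under_tungsten, white_under_d65))
```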

Auto White Balancing (AWB) algorithms aim to automatically estimate the chromaticity of the dominant illuminant in a scene and then apply a color correction to the image data to neutralize any color cast caused by that illuminant, making white objects appear white and other colors appear natural. Most AWB algorithms can be conceptually divided into two main stages:

  1. Illuminant Estimation: This stage analyzes the image data (and sometimes metadata) to determine the color (chromaticity) of the light source(s) illuminating the scene.
  2. Color Correction (Chromatic Adaptation): Once the illuminant is estimated, a transformation (often a von Kries type scaling or a more advanced CAT) is applied to the image’s R, G, and B values to shift the colors as if the scene were lit by a reference neutral illuminant.

Common statistical approaches for illuminant estimation:

  • Gray World Algorithm: This classic algorithm is based on the assumption that, on average, the reflectances of all surfaces in a typical, diverse scene will average out to a neutral gray. Any deviation of the image’s mean R, G, B values from neutral is therefore attributed to the illuminant and divided out with per-channel gains (a minimal sketch of Gray World and White Patch follows this list).
  • White Patch Algorithm (Max-RGB or Perfect Reflector): This approach assumes that the brightest pixel (or a small patch of the brightest pixels) in the image corresponds to a perfectly white surface or a specular highlight that directly reflects the color of the illuminant.
  • Gray-Edge / Edge-Based Algorithms: These methods operate on the assumption that the average color of differences (gradients) across edges in an image tends towards the illuminant color, or that edges themselves provide robust cues.
  • Gamut Mapping / Color by Correlation / Gamut Constraint Algorithms: These algorithms work with the distribution of colors (the gamut) present in the image. The underlying idea is that the scene illuminant restricts the range of colors that can be observed from a set of surfaces. By comparing the gamut of colors in the captured image to a set of canonical gamuts (pre-calculated or learned for various known illuminants), the algorithm can infer the most likely scene illuminant. This often involves finding the illuminant under which the image’s color distribution “fits best” within the expected boundaries.
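
As noted above, the two simplest estimators translate directly into a few lines of NumPy. The sketch below (gains normalized to the green channel, synthetic test scene) is a hedged illustration, not a production AWB.

```python
import numpy as np

def gray_world_gains(rgb):
    """Gray World: assume the scene averages to gray, so the per-channel means
    reveal the illuminant; gains rescale each channel to the green mean."""
    means = rgb.reshape(-1, 3).mean(axis=0)
    return means[1] / means

def white_patch_gains(rgb, percentile=99.0):
    """White Patch (Max-RGB): assume the brightest values come from a white or
    specular surface; a high percentile replaces the raw maximum for robustness."""
    bright = np.percentile(rgb.reshape(-1, 3), percentile, axis=0)
    return bright[1] / bright

def apply_awb(rgb, gains):
    return np.clip(rgb * gains, 0.0, 1.0)

# Synthetic scene with a warm cast: both estimators should push the cast toward neutral.
scene = np.random.rand(128, 128, 3) * np.array([1.0, 0.8, 0.55])
print("gray world gains :", gray_world_gains(scene))
print("white patch gains:", white_patch_gains(scene))
balanced = apply_awb(scene, gray_world_gains(scene))
print("balanced channel means:", balanced.reshape(-1, 3).mean(axis=0))
```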

V. Auto Focus (AF)

Auto Focus (AF) is the 3A system component responsible for automatically adjusting the camera’s lens to ensure that the desired subject in the scene appears sharp and clear in the captured image or video. Modern AF systems employ a variety of technologies and techniques to achieve speed, accuracy, and reliability across diverse shooting conditions.

Active AF and Passive AF

Active AF systems operate by emitting a beam of energy – typically infrared light or ultrasonic sound waves – towards the scene. The system then measures the reflected energy or the time it takes for the emitted energy to travel to the subject and return to the camera. Based on this measurement, the camera calculates the distance to the subject and adjusts the lens position accordingly to achieve focus. Examples of Active AF include:

  • Infrared (IR) AF: This method uses an infrared LED to project a beam of IR light onto the subject and estimates the subject distance from the reflected signal (typically by triangulation).
  • Ultrasonic AF: This system emits ultrasonic sound waves and measures the time taken for the echo to return from the subject, much like sonar. This was famously used in Polaroid instant cameras.

The primary advantage of active AF is its ability to function effectively in very low light conditions or even in complete darkness, where passive systems (which rely on analyzing scene light) often struggle or fail. Active AF systems have a limited effective range (e.g., infrared systems are often effective up to around 6 meters or 20 feet). Their performance can be compromised if the subject is non-reflective or highly absorbent to the emitted energy (e.g., black fur might absorb IR light). The emitted beam can also be blocked by intervening objects (like cage bars or a fence) or reflect off surfaces like glass, preventing focus on the intended subject beyond it. The emitted signal itself might be undesirable or distracting in certain environments. While less common as the primary AF mechanism in modern high-performance digital cameras, the principle of active ranging remains relevant in AF-assist lamps and, more significantly, in the dedicated depth-sensing technologies like Time-of-Flight (ToF) sensors that are increasingly used to augment passive AF systems.

Passive AF systems do not emit their own energy source for ranging. Instead, they analyze the characteristics of the light from the scene that passes through the camera lens and falls onto the image sensor (or a dedicated AF sensor) to determine the state of focus. The two main passive AF techniques are Contrast Detection Auto Focus (CDAF) and Phase Detection Auto Focus (PDAF). Passive AF systems can focus on subjects at virtually any distance, as they are not limited by the range of an emitted signal. They are generally more versatile for a wider array of subjects and shooting conditions, provided there is sufficient ambient light and some level of contrast or detail in the subject. Their performance typically degrades in very low light levels or when attempting to focus on subjects with very low contrast (e.g., a plain, uniformly lit wall or a clear blue sky) because they rely on analyzing image features. Passive AF is the dominant autofocus technology in today’s digital SLRs (DSLRs), mirrorless cameras, and advanced smartphone cameras, with on-sensor PDAF variants being particularly prevalent due to their excellent balance of speed, accuracy, and tracking performance.

Common AF Techniques

Contrast Detection Auto Focus (CDAF) operates on the fundamental optical principle that an image is sharpest (i.e., in focus) when the contrast between adjacent pixels or regions is maximized. An out-of-focus image appears blurry because fine details and edges (which represent high-frequency information) are softened, reducing local contrast. The camera’s image processor analyzes the image data directly from the main image sensor (or a portion of it corresponding to the selected AF point). It instructs the lens to make a series of small incremental movements through its focusing range – a process often described as “hunting” or “scanning”. At each lens position, the processor measures the contrast in the designated AF region. This contrast measurement can be based on various metrics, such as the intensity difference between adjacent pixels, the amplitude of high-frequency components derived from a spatial filter (e.g., a Laplacian or Sobel filter), or other sharpness metrics. The lens continues to move, and the system tracks the contrast values. The position at which the measured contrast reaches its peak is considered the point of optimal focus. The lens will typically move slightly past this peak and then return to it to confirm.

CDAF has high potential accuracy due to direct image sensor feedback and is relatively simpler to implement in terms of sensor hardware compared to dedicated PDAF sensor modules, as it can leverage the main imaging sensor.
However, CDAF is generally slower than PDAF. The iterative hunting process – moving the lens back and forth to find the contrast peak – takes time. It has no inherent knowledge of which direction to move the lens initially or how far to move it, so it must search. This makes it less suitable for tracking fast-moving subjects. Performance also degrades significantly in low light and with low-contrast subjects, as there may not be enough contrast variation for the system to reliably detect a peak.
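
A common contrast metric is the variance of the Laplacian over the AF region; the sketch below pairs it with a naive exhaustive search over lens positions. The `capture_at` callback and the defocus simulation are hypothetical stand-ins for driving a real lens and sensor.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def focus_measure(gray_roi):
    """Contrast/sharpness metric: variance of the Laplacian over the AF region."""
    return float(np.var(laplace(gray_roi.astype(np.float64))))

def contrast_detect_af(capture_at, positions):
    """Naive CDAF 'hunting': step through candidate lens positions and keep the one
    whose captured frame maximizes the contrast metric.
    capture_at(pos) is a hypothetical callback returning a grayscale AF region."""
    scores = [(focus_measure(capture_at(p)), p) for p in positions]
    best_score, best_pos = max(scores)
    return best_pos, best_score

def simulate_capture(pos, true_focus=0.42):
    """Stand-in for the camera: frames get blurrier the farther the lens is from focus."""
    rng = np.random.default_rng(0)                 # same synthetic subject every call
    sharp = rng.random((64, 64))
    blur_sigma = 0.5 + 8.0 * abs(pos - true_focus)
    return gaussian_filter(sharp, blur_sigma)

best_pos, best_score = contrast_detect_af(simulate_capture, np.linspace(0.0, 1.0, 21))
print("best lens position:", round(best_pos, 2))   # lands on the grid point nearest 0.42
```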

Phase Detection Auto Focus (PDAF) determines focus by analyzing and comparing two (or more) separate images of the subject that are formed by light rays passing through different parts of the camera lens aperture – typically from opposite sides of the lens. In a DSLR camera, when the mirror is down (for viewfinder use), a portion of the light coming through the lens is reflected by a secondary mirror down to a dedicated PDAF sensor module, usually located in the base of the camera body. This module contains an array of paired microlenses and small line sensors (often CCDs). Each pair of microlenses splits the incoming light from a specific region of the lens into two distinct beams, which then fall onto the corresponding pair of line sensors. If the subject is perfectly in focus, these two light beams (and the patterns they form on the line sensors) will converge and align precisely. If the subject is out of focus, the two beams will be misaligned – they will be “out of phase.” The amount of this misalignment (the phase difference) and the direction of the misalignment (e.g., whether one beam is shifted left or right relative to the other) directly indicate not only that the image is out of focus, but critically, how far the lens is from the correct focus position and in which direction (towards infinity or towards minimum focus distance) the lens needs to be moved to achieve focus. The AF system can then drive the lens directly to the calculated focus position.

Since mirrorless cameras and smartphones lack the mirror box and dedicated PDAF sensor of a DSLR, PDAF functionality is implemented directly on the main imaging sensor (On-Sensor PDAF). This is achieved by modifying a certain percentage of the sensor’s pixels to act as phase detection sites. There are various ways to do this:
  • Masked Pixels: Some pixels are partially masked so that they only receive light from one side (e.g., the left or right half) of the lens aperture. By comparing the signals from a “left-looking” masked pixel and a nearby “right-looking” masked pixel, a phase difference can be determined.
  • Split Photodiodes (Dual Pixel AF is an advanced example): Each PDAF pixel (or, in the case of Dual Pixel AF, nearly every pixel) is designed with two separate photodiodes. Each photodiode effectively sees the scene through a slightly different portion of the lens pupil. The signals from these two photodiodes are compared to detect phase differences.

The key advantage of PDAF is its speed. Because it can determine both the direction and magnitude of defocus from a single measurement (without iterative hunting), it allows the lens to be driven quickly and decisively to the correct focus point. This makes PDAF particularly effective for capturing action and tracking moving subjects. For traditional DSLR PDAF, the dedicated PDAF sensor must be perfectly aligned with the main image sensor. Any misalignment can lead to systematic front-focus or back-focus errors (where the lens consistently focuses slightly in front of or behind the intended subject), which may require lens/camera calibration (micro-adjustment). The number and coverage of AF points are typically limited to a central area of the frame. The system adds cost and complexity. For On-Sensor PDAF, when some sensor pixels are dedicated (or partially dedicated) to AF, their light-gathering capability for imaging might be reduced, potentially requiring interpolation from neighboring pixels to fill in image data (though this is often managed to be visually negligible in modern sensors). On-sensor PDAF can be more susceptible to noise in very low light conditions compared to dedicated PDAF sensors with larger photosites. Certain subject patterns (e.g., fine horizontal lines if the PDAF pixels are primarily sensitive to vertical phase differences) can sometimes pose challenges for on-sensor PDAF systems, though cross-type PDAF pixels (sensitive in two orientations) help mitigate this.
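
The phase difference itself can be estimated by correlating the “left-looking” and “right-looking” signals: the shift that best aligns them gives the direction and (after a lens-specific scaling, omitted here) the magnitude of the required focus move. A 1-D toy version with synthetic signals is sketched below.

```python
import numpy as np

def estimate_phase_shift(left, right, max_shift=16):
    """Search the integer shift that best aligns the right-aperture signal with the
    left-aperture signal; its sign gives the defocus direction, its magnitude the amount."""
    left = (left - left.mean()) / (left.std() + 1e-9)
    right = (right - right.mean()) / (right.std() + 1e-9)
    best_shift, best_corr = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        corr = float(np.dot(left, np.roll(right, s)))
        if corr > best_corr:
            best_corr, best_shift = corr, s
    return best_shift

# Synthetic 1-D scene seen through the two half-apertures: defocus appears as displacement.
x = np.linspace(0, 8 * np.pi, 256)
scene = np.sin(x) + 0.3 * np.sin(3.1 * x)
displacement = 5                                   # grows with the amount of defocus
left_signal = scene
right_signal = np.roll(scene, displacement) + 0.02 * np.random.randn(256)
# The estimator reports -5: shift the right signal back by 5 samples to align with the left.
print("estimated phase shift:", estimate_phase_shift(left_signal, right_signal))
```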

VI. Auto Exposure (AE) / Metering

Auto Exposure (AE) is the third pillar of the 3A system, working to ensure that images are captured with an appropriate level of brightness, avoiding overexposure (loss of detail in highlights) or underexposure (loss of detail in shadows).

This balance of light is controlled by three fundamental pillars: aperture, shutter speed, and ISO. Collectively known as the exposure triangle, these three elements work in concert to determine the brightness of an image. The core process underlying AE is metering.

Metering in photography is the process by which a camera’s internal light meter measures the intensity of light in a scene to determine the optimal exposure settings. These settings primarily include the shutter speed (duration the sensor is exposed to light), lens aperture (the size of the opening allowing light to pass through the lens), and ISO sensitivity (the sensor’s amplification of the light signal). The AE system uses the information gathered from metering to automatically adjust one or more of these parameters to achieve what it deems a “correct” exposure. Most in-camera light meters are designed to measure reflected light – that is, the light that bounces off the subject(s) and other elements within the scene and then enters the camera lens to strike the sensor (or a dedicated metering sensor). This is distinct from incident light meters (often handheld), which measure the light falling onto the subject.

The 18% Gray Assumption (Middle Gray), also known as “middle gray” or “Zone V” in the Zone System, is a crucial concept for understanding how most camera meters work. Camera meters are typically calibrated to interpret the scene they are measuring as if, on average, it reflects 18% of the incident light. The exposure settings are then calculated to render this average scene brightness as a medium gray tone in the final image. This standard helps in achieving consistent exposure across a variety of typical scenes. However, this assumption is also the primary reason why automatic exposure can sometimes be “fooled.”

  • If a scene is predominantly very bright (e.g., a snowy landscape, a subject against a white background), the meter, trying to make this bright scene average out to 18% gray, will tend to underexpose the image, making the snow or white background appear grayish.
  • Conversely, if a scene is predominantly very dark (e.g., a black cat on a dark rug, a subject against a black background), the meter, again trying to force this dark average to 18% gray, will tend to overexpose the image, making the black subject or background appear too light or washed out.

Experienced photographers often use exposure compensation to manually override the camera’s metered exposure in such situations to achieve the desired result. The histogram display on digital cameras is an invaluable tool for objectively assessing exposure, showing the distribution of tones from black to white and helping to identify potential underexposure (graph bunched to the left, “clipped shadows”) or overexposure (graph bunched to the right, “blown highlights”).

The metering process effectively measures scene luminance (brightness). Exposure is measured in units of lux-seconds, and can be related to an Exposure Value (EV), which combines shutter speed and aperture into a single number representing a given level of exposure. A change of 1 EV corresponds to a doubling or halving of the amount of light reaching the sensor (a one-stop change).
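
Putting the two ideas together, a toy AE step can meter the mean linear luminance, express its deviation from the 18% target in EV stops, and scale the exposure time by the corresponding power of two. The scene model and the 0.18 target below are illustrative assumptions, not a real metering algorithm.

```python
import numpy as np

def exposure_value(aperture_f, shutter_s):
    """Exposure Value: EV = log2(N^2 / t), combining aperture (f-number N) and shutter time t."""
    return float(np.log2(aperture_f ** 2 / shutter_s))

def ae_adjustment_stops(linear_image, target=0.18):
    """EV stops needed for the mean linear luminance to reach the mid-gray target."""
    mean_lum = float(np.mean(linear_image))
    return float(np.log2(target / max(mean_lum, 1e-6)))

# f/2.0 at 1/100 s -> EV = log2(4 * 100) ~ 8.6
print("EV:", round(exposure_value(2.0, 1 / 100), 2))

# A mostly bright (snowy) scene meters well above 18% gray, so this naive AE wants to
# reduce exposure -- exactly the classic failure mode that renders snow as gray.
snowy_scene = np.clip(np.random.normal(0.65, 0.10, (100, 100)), 0.0, 1.0)
delta_ev = ae_adjustment_stops(snowy_scene)
new_shutter = (1 / 100) * (2.0 ** delta_ev)        # one stop = a factor of two in light
print("suggested change: %.2f EV, new shutter: 1/%.0f s" % (delta_ev, 1 / new_shutter))
```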

Automatic ISP Tuning. The intricate nature of ISP tuning, which involves adjusting numerous interdependent parameters across various processing blocks, has traditionally been a labor-intensive and time-consuming task performed by highly skilled engineers. This complexity, where the ISP can often resemble a “black box,” has spurred significant research into AI-driven 3A/ISP tuning. Such approaches aim to automatically optimize ISP parameters, or even the pipeline structure itself, for specific imaging tasks (e.g., enhancing object detection performance for a machine vision system) or for improved perceptual image quality.

AI-ISP and ISP for Machine Vision. Another trend is the application of data-driven approaches to certain ISP stages, e.g., denoising, tone mapping, and photo-finishing. Beyond better matching human vision, the AI accelerators on smartphones, or even on the sensor itself (e.g., Sony’s IMX500), enable the deployment of advanced recognition models, giving the device intelligent sensing capabilities through its camera. This trend towards intelligent and adaptive ISPs promises to further enhance image quality and democratize access to advanced imaging capabilities.
