{"title":"Marrying WebRTC and DASH for interactive streaming","authors":"Julia Kenyon, T. Stockhammer, A. Begen, Ofer Shem Tov, Louay Bassbouss, D. Silhavy","doi":"10.1145/3510450.3517296","DOIUrl":"https://doi.org/10.1145/3510450.3517296","url":null,"abstract":"WebRTC is a set of W3C and IETF standards that allows the delivery of real-time content to users, with an end-to-end latency of under half a second. Support for WebRTC is built into all modern browsers across desktop and mobile devices, and it allows for streaming of video, audio and data. While the original focus of WebRTC has been on videoconferencing, it is increasingly being used today for real-time streaming of premium content because its ultra-low latency features enable several new user experiences, especially those that involve user interactivity, that are not easy to deliver or even possible with the traditional broadcast or streaming delivery protocols. Because of this increasing usage for premium content, the integration of WebRTC with the de facto adaptive streaming protocols such as MPEG's Dynamic Adaptive Streaming over HTTP (DASH) is essential. This paper gives information about the DASH Industry Forum's exploration activity on this very subject.","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133301253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A standards-based framework for real-time media in immersive scenes","authors":"Imed Bouazizi, T. Stockhammer","doi":"10.1145/3510450.3517288","DOIUrl":"https://doi.org/10.1145/3510450.3517288","url":null,"abstract":"Immersive media experiences are anticipated to become the norm in entertainment and communication in the near future, enabled by advances in computer graphics, capture and display systems, and networking technology. Immersive experiences are based on a rich 3D scene that enables immersion, fusion with the real world, and rich interactivity. However, 3D scenes are large, rich and complex - and hence stored and processed not only on devices, but on cloud systems. MPEG is currently working on specifying a set of functionalities that address different aspects of immersive media, including formats, access and delivery, and compression of these emerging media types. The scene description standard as defined in part 14 of the MPEG immersive standard family [1] provides the entry point and glue to such immersive experiences. The key design principle of the architecture behind it, was to separate media access from rendering. The scene description standard achieves this by defining a separate Media Access Function (MAF) and the API to access it. The MPEG-I scene description reference architecture is depicted in 1. The MAF receives instructions from the presentation engine on the media referenced in the scene. It uses this information to establish the proper media pipelines to fetch the media and pass it in the desired format to the presentation engine for rendering. The request for media also includes information about the current viewer's position as well as the scene camera position and intrinsic parameters. This enables the MAF to implement a wide range of optimization techniques, such as the adaptation of the retrieved media to the network conditions based on the viewer's position and orientation with regards to the object to be fetched. These adaptations may include partial retrieval, access at different levels of detail, and adjustment of quality. In this paper, we describe the architecture for immersive media and the functionality performed by the MAF to optimize the streaming of immersive media. We discuss the different adaptation options based on a selected set of MPEG formats for 3D content (i.e. video textures, dynamic meshes, and point clouds). We describe possible designs of such adaptation algorithms for real-time media delivery in the example of immersive conferencing.","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126479024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How innovations in ASIC architectures and novel design approaches power tomorrow's video networks","authors":"Avinash Ramachandran","doi":"10.1145/3510450.3517318","DOIUrl":"https://doi.org/10.1145/3510450.3517318","url":null,"abstract":"This paper describes a hardware AV1 encoder pipeline design and evaluates its performance relative to Advanced Video Coding with x264.","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130126793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel temporal masking framework for perceptually optimized video coding","authors":"D. Grois, A. Giladi, Praveen Kumar Karadugattu, Niranjan Balasubramanian","doi":"10.1145/3510450.3517306","DOIUrl":"https://doi.org/10.1145/3510450.3517306","url":null,"abstract":"The development of the 1st edition of HEVC by the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Pictures Expert Group (MPEG) was officially finalized in January 2013 [11], thereby achieving a significant bitrate reduction of roughly 50% for substantially the same visual quality when compared to its predecessor [6, 7, 14]. The development process of HEVC was driven by the most recent scientific and technological achievements in the video coding field. In turn, video applications continue to gain a lot of traction and to have an enormous demand [9]. A very significant increase in the bandwidth requirements is expected by 2023, particularly due to the increase in the resolution supported by devices. It is expected that 66% of the connected flat-panel TV sets will have the support for the Ultra-High Definition (UltraHD) resolution compared to only 33% in 2018 (note that \"UltraHD\" in this paper refers to the 3840×2160 resolution, also known as 4K or 2160p). The typical bitrate for a 60fps 4K HDR10 video is between 15 to 24 Mbps [9], which is is nearly four times more than twice the typical High-Definition (HD) video bitrate. In addition, the overall IP video traffic [8] is expected to grow to 82% of the overall Internet traffic by 2022, and about 21% of this traffic is expected to be UltraHD. As a result, there is a continuous strong need to further decrease video transmission bitrate, especially for the UltraHD content, substantially without reducing the perceptual visual quality. One of the promising approaches for increasing video coding gain is applying \"visual masking\", which is based on a very interesting phenomenon observed in the human visual system (HVS) [1, 16]. According to this phenomenon, two or more stimuli are presented sequentially to a viewer, with one stimulus acts as a target which has to be detected and described, while other stimuli are used to mask the visibility of that target [1]. With this regard, a good amount of research has been carried out in the video compression field, such as [2] for example, which exploits the above-mentioned phenomenon by providing a psycho-visual algorithm that has been implemented in the x264 encoder [4]. In turn, more advanced studies of [2] are further presented and discussed in [2] and [3]. In addition, in the most recent work, such as [15], it is proposed to mask temporal activities that are unnoticeable by human visual system by using a masking coefficient. Further, [12] presents a video just noticeable difference (JND) scheme by employing compound spatial and structure-based temporal masking, further measuring a JND threshold for each transform coefficient of a color video. Also, [17] proposes an improved transform-based JND estimation model considering multiple masking effects. However, all surveyed existing visual masking approaches, the most interesting of which are indicated above, lead to relatively low bitrate savings. 
As ","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116127633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sustainable OTT video distribution powered by 5G-multicast/unicast delivery and versatile video coding","authors":"T. Biatek, Eduard François, C. Thienot, W. Hamidouche","doi":"10.1145/3510450.3517273","DOIUrl":"https://doi.org/10.1145/3510450.3517273","url":null,"abstract":"Video over the internet has drastically grown these past years, currently representing more than 80% of the internet bandwidth [3]. The massive usage of unicast delivery leads to network congestion that can result in poor quality of experience for the viewer, high delivery cost for operators and increased energy consumption. The current methods for adaptive video streaming rely more on maximizing the video quality for a given bandwidth rather than minimizing the end-to-end (from video server to end-user display) energy consumption for a given level of quality. This paper aims at leveraging recently standardized delivery and coding technologies to maintain the video quality while monitoring and reducing video energy footprint and delivery costs. The Versatile Video Coding (VVC) [2] recently issued by ISO/IEC and ITU-T is used to further reduce video services bandwidth over previous video coding solutions, in particular its predecessor HEVC [10]. Compared to HEVC, VVC enables around 50% of bandwidth saving, at an equivalent video quality [12]. This performance is achieved by extending existing HEVC coding tools and by introducing new ones. VVC also brings new high-level features to better address new use-cases and applications (e.g variable resolution, scalability, 360° and screen-content coding). This paper uses ATEME Titan encoder while the decoding is performed by a real time software decoder OpenVVC. This latter is a cross-platform library that provides consumers with real time decoding capability under different OS including MAC OS, Windows, Linux and Android on Intel x86 and ARM platforms. A dynamic video format selection is proposed to limit the distribution bandwidth and energy cost based on quality/complexity tradeoffs sent to the player. The carriage of such metadata is achieved by using the Green-MPEG [6] and DASH standard. A content mapping and video artefacts masking is implemented to counterbalance video degradation with post-processing. These are based on SEI messages such as CTI and Grain Synthesis recently standardized in the VSEI specification [5]. An energy reporting solution is proposed in order for the end-viewer to be informed of the energy impact and enabling manual adjustment the energy/quality tradeoff. The Common Media Application Format (CMAF) is used to deliver the video segments over the network [5]. The carriage of video, audio and green metadata is based on the ISO/IEC Base Media File Format [7] (ISOBMFF). To save network traffic, the video segments can be either sent in unicast for low-audience services or in multicast for highly popular ones. The DASH/FLUTE stack is implemented as multicast protocol. The OTT services are delivered in the context of 3GPP Release 17 [1] and DVB-MABR Phase 2 network [4], including delivery over managed networks (IPTV) and mobile (4G-Lte and 5G). The estimated energy saving for the proposed delivery infrastructure is the following. 
First, the bitrate saving brought by improved compressi","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"73 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116377063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
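As an illustration of the player-side "quality/complexity tradeoff" selection described above, the sketch below picks a representation using bandwidth plus hypothetical quality and decoding-complexity hints. The metadata shape and the weighting formula are assumptions made for this sketch, not the Green-MPEG or DASH syntax used by the authors.

```typescript
// Hypothetical player-side selection using quality/decoding-complexity hints.
// The metadata fields and the weighting are assumptions, not Green-MPEG/DASH syntax.
interface Representation {
  id: string;
  bandwidthBps: number;     // from the MPD
  quality: number;          // 0..100 perceptual quality hint
  decodeComplexity: number; // relative decoding energy hint (1.0 = baseline)
}

function selectRepresentation(
  reps: Representation[],        // assumed non-empty
  availableBps: number,          // current throughput estimate
  energyWeight: number           // 0 = quality only, 1 = energy only (user-adjustable)
): Representation {
  const feasible = reps.filter(r => r.bandwidthBps <= availableBps);
  const pool = feasible.length ? feasible : reps; // fall back to all if none fit
  // Score trades perceptual quality against extra decoding energy.
  const score = (r: Representation) =>
    (1 - energyWeight) * r.quality - energyWeight * 100 * (r.decodeComplexity - 1);
  return pool.reduce((best, r) => (score(r) > score(best) ? r : best));
}
```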
{"title":"Behind the scene: delivering 4K olympics games","authors":"Derik Yarnell, Grant McGilvray","doi":"10.1145/3510450.3517282","DOIUrl":"https://doi.org/10.1145/3510450.3517282","url":null,"abstract":"The Tokyo Olympic Games offered NBCUniversal and Comcast an opportunity to make a substantial leap forward in live, localized UHD broadcasts. The result was a radically different workflow, delivering Olympics content from 50+ local NBC affiliate stations in stunning 4K HDR with next generation audio. The solution shined again, a mere 6 months later, when Beijing hosted the 2022 Winter Olympic games. Built on a new software-defined IP workflow, NBC was able to dynamically switch UHD simulcast content into locally up-converted affiliate signals. The result was nationwide UHD programming live from venue, while at the same time protecting local news and advertising inventory from each of the NBC affiliate stations. When combined with the Comcast X1 platform, and delivered over the Comcast network, Comcast X1 UHD viewers were able to seamlessly transition into the UHD version of their local NBC station with Dolby Vision and Dolby Atmos.","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122242392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Take the red pill for H3 and see how deep the rabbit hole goes","authors":"Minh Nguyen, C. Timmerer, S. Pham, D. Silhavy, A. Begen","doi":"10.1145/3510450.3517302","DOIUrl":"https://doi.org/10.1145/3510450.3517302","url":null,"abstract":"With the introduction of HTTP/3 (H3) and QUIC at its core, there is an expectation of significant improvements in Web-based secure object delivery. As HTTP is a central protocol to the current adaptive streaming methods in all major streaming services, an important question is what H3 will bring to the table for such services. To answer this question, we present the new features of H3 and QUIC, and compare them to those of H/1.1/2 and TCP. We also share the latest research findings in this domain.","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129097432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Standards based end-to-end metadata transport for live production workflows","authors":"Kent Terry","doi":"10.1145/3510450.3517303","DOIUrl":"https://doi.org/10.1145/3510450.3517303","url":null,"abstract":"One of the factors that has driven the rise to prominence of OTT services that deliver content directly to consumers via IP distribution is the increase in the audio and visual quality of content that they provide. The ability to deliver immersive and personalized audio enabled by next generation audio (NGA) codecs, and 4K/8K high dynamic range video, is one reason consumers recognize these services as delivering the highest quality content. A common requirement to fully enable these advanced and video capabilities is the use of rich, dynamic, time accurate metadata. This type of metadata is also key to enabling new emerging technology, such as VR, and future, not yet defined, technologies that will continue to drive content innovation. While file based workflows for scripted and non-live content have added capabilities to utilize rich audio and video metadata in the production and distribution process, support for this type of metadata in live production and distribution has lagged, partly due to the prevalence of legacy audio and video technology that has limited metadata capabilities. The move to IP transport based methods for live content production provides the opportunity to remove these limitations. Work is in progress to define new standards for metadata transport that not only meet the requirements for current use cases but is flexible and extendable for future applications. Work to define metadata transport standards for SMPTE ST 2110 systems, as well as audio metadata standards for AES67 systems is described. Interoperation with legacy systems, and with file based formats and workflows is also considered, and emerging standards in this area are discussed. How these emerging standards fit into a larger vision of \"microphone to speaker\" audio metadata and \"camera to display\" video metadata is also described. Particular focus will be given on enabling rich audio metadata in the latest NGA audio codecs such as AC-4.","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"123 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120882176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microservices for multimedia: video encoding","authors":"Frank San Miguel, Naveen Mareddy, Anush K. Moorthy, Xiaomei Liu","doi":"10.1145/3510450.3517298","DOIUrl":"https://doi.org/10.1145/3510450.3517298","url":null,"abstract":"Netflix has been one of the pioneers that has driven the industry adoption of a new paradigm of system architecture referred to as \"microservices\". Microservices, or more accurately, microservice architecture refers to an architecture where applications are modeled as a collection of services which are: highly maintainable and independently testable, loosely coupled, independently deployable and organized around business capabilities. Typically, each microservice is owned by a small team of developers that is responsible for its development, testing and deployment, i.e., its end-to-end lifecycle. Traditional microservices such as those used outside of multimedia processing at Netflix typically consist of an API with stateless business logic which is autoscaled based on request load. These APIs provide strong contracts and separate the application data and binary dependencies from systems. As useful as traditional microservices are, several peculiarities of multimedia applications render such stateless services non ideal for media processing. Specifically, media processing (which includes video/audio processing, media encoding, timed-text processing, computer vision analysis etc.) relies on data that is embedded in files where the files themselves are contracts as opposed to fully visible data models that are common in non-media applications. At Netflix, media processing is resource intensive and bursty in nature. It is also highly parallelizable and re-triable, and so, even though work is generally a continuous stream with deadlines and priorities, the system can balance resources by evicting jobs as needed which can be retried at a later time. In this talk, we will summarize Cosmos, a project that we've developed in order to enable workflow-driven media processing using a microservice architecture. Cosmos is a computing platform that combines the best aspects of microservices with asynchronous workflows and serverless functions. It is designed specifically for resource intensive algorithms which are coordinated via complex hierarchical workflows. Cosmos supports both high throughput and low-latency workloads. The Cosmos platform offers: observability through built in logging, tracing, monitoring, alerting and error classification; modularity (both compile-time and run-time) through an opinionated framework for structuring a service; productivity through tooling such as code generators, containers, and command line interfaces; and delivery through a managed continuous-delivery pipelines. The Cosmos platform allows media developers to build and run domain-specific, scale-agnostic components which are built atop three scale-aware subsystems that handle distributing the work. 
Each component can thus be independently developed, tested and deployed with clear abstraction from the underlying platform thereby providing a logical separation between the application and platform so that the details of distributed computing are hidden from media developers ","PeriodicalId":122386,"journal":{"name":"Proceedings of the 1st Mile-High Video Conference","volume":"900 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116814852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
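The "scale-agnostic component on scale-aware subsystems" idea can be pictured as a handler that only declares its inputs and outputs while the platform handles queuing, scaling and retries. The interfaces below are a hypothetical sketch of that shape, not the actual Cosmos API.

```typescript
// Hypothetical shape of a scale-agnostic media component; not the actual Cosmos API.
interface EncodeJob {
  sourceUrl: string;      // input mezzanine location (the file itself is the contract)
  codec: "avc" | "hevc" | "av1";
  startFrame: number;     // chunk boundaries chosen by the platform's workflow layer
  endFrame: number;
}

interface EncodeResult {
  outputUrl: string;
  bitrateKbps: number;
}

// The component implements only domain logic; distribution, eviction and retries
// are the platform's concern, so the handler must be safe to re-run at any time.
type MediaHandler<J, R> = (job: J) => Promise<R>;

const encodeChunk: MediaHandler<EncodeJob, EncodeResult> = async (job) => {
  // ... invoke the actual encoder for frames [startFrame, endFrame]; omitted here ...
  return { outputUrl: `${job.sourceUrl}.${job.codec}.out`, bitrateKbps: 0 };
};
```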